In this notebook, we will create a Document Q&A system for the publicly available starter [rule book](https://www.chaosium.com/content/FreePDFs/CoC/CHA23131%20Call%20of%20Cthulhu%207th%20Edition%20Quick-Start%20Rules.pdf?srsltid=AfmBOooV65qRFBnJ-pYV7s86zOziehQukLg41ZcY5zmsB6gP2Jt-PCS1) for the Call of Cthulhu RPG. The goal is to answer basic questions about the game, targeting new players.

First, let's install ChromaDB and the Gemini API Python SDK. The install step may print some dependency errors.

In [1]:
!pip uninstall -qqy jupyterlab kfp  # Remove unused conflicting packages
!pip install -qU "google-genai==1.7.0" "chromadb==0.6.3"


In [2]:
from google import genai
from google.genai import types

from IPython.display import Markdown

genai.__version__

'1.7.0'

**Set up your API key**

To run the following cell, your API key must be stored in a Kaggle secret named `GOOGLE_API_KEY`.

In [3]:
from kaggle_secrets import UserSecretsClient

GOOGLE_API_KEY = UserSecretsClient().get_secret("GOOGLE_API_KEY")

Let's explore available models that support text embeddings.

In [4]:
client = genai.Client(api_key=GOOGLE_API_KEY)

for m in client.models.list():
    if "embedContent" in m.supported_actions:
        print(m.name)

models/embedding-001
models/text-embedding-004
models/gemini-embedding-exp-03-07
models/gemini-embedding-exp


We will use `text-embedding-004`, the most recent generally available embedding model in this list.

Let's add a function that downloads the PDF and reads its text content into a string using the PyPDF2 library.

In [5]:
pip install PyPDF2

Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl.metadata (6.8 kB)
Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
Installing collected packages: PyPDF2
Successfully installed PyPDF2-3.0.1
Note: you may need to restart the kernel to use updated packages.


In [6]:
import PyPDF2
import requests
from io import BytesIO

def read_pdf_content_to_string(pdf_url):
    """
    Reads the text content from a PDF file hosted at a URL and returns it as a string.

    Args:
        pdf_url (str): The URL of the PDF file.

    Returns:
        str: The text content of the PDF, or an empty string if an error occurs.
    """

    text = ""
    try:
        response = requests.get(pdf_url)
        response.raise_for_status()

        with BytesIO(response.content) as pdf_file:
            reader = PyPDF2.PdfReader(pdf_file)
            for page in reader.pages:
                text += page.extract_text() or ""

    except requests.exceptions.RequestException as e:
        print(f"Error fetching PDF: {e}")
    except PyPDF2.errors.PdfReadError as e:
        print(f"Error reading PDF: {e}")
    except Exception as e:
        print(f"An unexpected error occurred: {e}")

    return text

Let's read the document and print the first 50 characters to get a glimpse of the content.

In [7]:
pdf_url = "https://www.chaosium.com/content/FreePDFs/CoC/CHA23131%20Call%20of%20Cthulhu%207th%20Edition%20Quick-Start%20Rules.pdf?srsltid=AfmBOooV65qRFBnJ-pYV7s86zOziehQukLg41ZcY5zmsB6gP2Jt-PCS1"
pdf_content = read_pdf_content_to_string(pdf_url)

if pdf_content:
    # Report the total length, then preview the start of the extracted text.
    print(len(pdf_content))
    print("--- First 50 Characters of PDF Content ---")
    print(pdf_content[:50])
    print("--- End of First 50 Characters ---")
else:
    print("Could not retrieve or read PDF content.")

113302
--- First 50 Characters of PDF Content ---
S a n dy Petersen, Mike Maso n , 
P
a
u
l
 
F
r
i

--- End of First 50 Characters ---
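The preview above shows that the extracted text can contain stray line breaks. As a minimal sketch, assuming we only want to collapse runs of whitespace before chunking, a helper like the hypothetical `normalize_whitespace` below could be applied to `pdf_content`. It is not part of the original pipeline, and it does not rejoin words whose letters were extracted one per line.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse every run of whitespace (spaces, tabs, newlines) into one space."""
    return re.sub(r"\s+", " ", text).strip()


# Letters that were extracted one per line end up separated by single spaces.
sample = "P\na\nu\nl\n \nF\nr\ni\n"
print(normalize_whitespace(sample))  # prints: P a u l F r i
```

Whether this helps retrieval quality for this particular PDF would need to be checked by comparing answers with and without the cleanup.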


Let's create the embedding database with ChromaDB. We are implementing a retrieval system, so the `task_type` for generating the document embeddings is `retrieval_document`. Later, we will use `retrieval_query` for the query embeddings. 

In [8]:
from chromadb import Documents, EmbeddingFunction, Embeddings
from google.api_core import retry

from google.genai import types


# Define a helper to retry when per-minute quota is reached.
is_retriable = lambda e: (isinstance(e, genai.errors.APIError) and e.code in {429, 503})


class GeminiEmbeddingFunction(EmbeddingFunction):
    # Specify whether to generate embeddings for documents, or queries
    document_mode = True

    @retry.Retry(predicate=is_retriable)
    def __call__(self, input: Documents) -> Embeddings:
        if self.document_mode:
            embedding_task = "retrieval_document"
        else:
            embedding_task = "retrieval_query"

        response = client.models.embed_content(
            model="models/text-embedding-004",
            contents=input,
            config=types.EmbedContentConfig(
                task_type=embedding_task,
            ),
        )
        return [e.values for e in response.embeddings]
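To illustrate what a predicate-based retry does, here is a minimal plain-Python sketch. It is not the `google.api_core.retry` implementation: `FakeAPIError`, `call_with_retry`, and the backoff parameters are hypothetical names and values chosen for illustration only.

```python
import time


class FakeAPIError(Exception):
    """Hypothetical stand-in for an API error that carries an HTTP status code."""

    def __init__(self, code: int):
        super().__init__(f"status {code}")
        self.code = code


def is_retriable(e: Exception) -> bool:
    # Retry only on quota (429) and service-unavailable (503) responses.
    return isinstance(e, FakeAPIError) and e.code in {429, 503}


def call_with_retry(fn, predicate, max_attempts=5, initial_delay=1.0, sleep=time.sleep):
    """Call fn(), retrying with doubling delays while predicate(error) is true."""
    delay = initial_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as e:
            # Give up on the last attempt or on errors the predicate rejects.
            if attempt == max_attempts or not predicate(e):
                raise
            sleep(delay)
            delay *= 2
```

The `sleep` argument is injectable so the behaviour can be checked without actually waiting.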

Now let's create a Chroma database client that uses `GeminiEmbeddingFunction` and populate the database with the Call of Cthulhu rule book text we extracted above. For better retrieval results, we will split the text into 20 equally sized chunks and store each one as a separate document. We could also explore splitting it by pages or sections.

In [9]:
import chromadb

DB_NAME = "call-of-cthulhu"

embed_fn = GeminiEmbeddingFunction()
embed_fn.document_mode = True

chroma_client = chromadb.Client()
db = chroma_client.get_or_create_collection(name=DB_NAME, embedding_function=embed_fn)

n_chunks = 20
chunk = len(pdf_content) / n_chunks
for i in range(n_chunks):
    start = int(i*chunk)
    finish = int((i+1)*chunk)
    db.add(documents=[pdf_content[start:finish]], ids=[str(i)])
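Equal-sized chunks can cut a sentence or rule in half at a chunk boundary. One common variation, sketched here under the assumption that character-based splitting is kept, is to let neighbouring chunks overlap. The helper name `split_with_overlap` and its parameters are hypothetical and not part of the original notebook.

```python
def split_with_overlap(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into chunks of chunk_size characters sharing `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


print(split_with_overlap("abcdefghij", chunk_size=4, overlap=1))
# prints: ['abcd', 'defg', 'ghij']
```

Each resulting chunk could then be added to the collection with its index as the ID, in the same way as the loop above.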

Confirm that the data was inserted by counting the documents in the database. We inserted 20 documents, so the count should be 20.

In [10]:
db.count()

20

### Retrievals

Let's switch to query mode to retrieve passages relevant to a question from the stored documents. First, let's find out which documents contain the answer. Because the information is scattered across multiple sections, we fetch the 3 closest documents.

In [11]:
# Switch to query mode when generating embeddings.
embed_fn.document_mode = False

# Search the Chroma DB using the specified query.
query = "What is STR?"

result = db.query(query_texts=[query], n_results=3)
[all_passages] = result["documents"]
len(all_passages)

3
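Conceptually, the query embeds the question and returns the stored documents whose embeddings lie closest to it. The toy sketch below shows that idea with a brute-force search over made-up 2-dimensional vectors and squared Euclidean distance. The actual index structure and distance metric used by Chroma depend on how the collection is configured, so treat `squared_l2`, `top_k`, and the vectors as illustrative assumptions.

```python
def squared_l2(a: list[float], b: list[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(query_vec: list[float], doc_vecs: dict[str, list[float]], k: int) -> list[str]:
    """Return the IDs of the k documents closest to the query vector."""
    ranked = sorted(doc_vecs, key=lambda doc_id: squared_l2(query_vec, doc_vecs[doc_id]))
    return ranked[:k]


# Made-up embeddings keyed by document ID.
toy_embeddings = {"0": [1.0, 0.0], "1": [0.0, 1.0], "2": [0.9, 0.1]}
print(top_k([1.0, 0.0], toy_embeddings, k=2))  # prints: ['0', '2']
```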

Now let's assemble a generation prompt so that the Gemini API can generate a final answer from the retrieved documents.

In [12]:
query = "What is STR?"
query_oneline = query.replace("\n", " ")

prompt = f"""You are a helpful and informative bot that answers questions using text from the reference passage included below.
Be sure to respond in a complete sentence, being comprehensive, including all relevant background information.
However, you are talking to an audience new to the game, so be sure to provide examples
and strike a friendly and conversational tone. If the passage is irrelevant to the answer, you may ignore it.
Use only information provided in the passage.

QUESTION: {query_oneline}"""
print(prompt)

# Add the retrieved documents to the prompt, each on its own line.
for passage in all_passages:
    passage_oneline = passage.replace("\n", " ")
    prompt += f"\nPASSAGE: {passage_oneline}"

# print(prompt)  # Uncomment to inspect the full prompt, including passages.

You are a helpful and informative bot that answers questions using text from the reference passage included below.
Be sure to respond in a complete sentence, being comprehensive, including all relevant background information.
However, you are talking to an audience new to the game, so be sure to provide examples
and strike a friendly and conversational tone. If the passage is irrelevant to the answer, you may ignore it.
Use only information provided in the passage.

QUESTION: What is STR?
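The prompt layout used here (instructions, then the question, then one `PASSAGE:` line per retrieved document) can also be expressed as a small pure function, which makes it easy to check without calling the API. This is a sketch only: `build_prompt` and its arguments are hypothetical names, and the instruction text in the example is shortened.

```python
def build_prompt(instructions: str, question: str, passages: list[str]) -> str:
    """Join instructions, a one-line question, and one-line passages into a prompt."""
    # str.split() with no arguments also collapses embedded newlines.
    lines = [instructions.strip(), "", f"QUESTION: {' '.join(question.split())}"]
    for passage in passages:
        lines.append(f"PASSAGE: {' '.join(passage.split())}")
    return "\n".join(lines)


example = build_prompt(
    "Answer using only the passages below.",
    "What\nis STR?",
    ["STR measures\nraw physical power.", "Characteristics are percentages."],
)
print(example)
```

Keeping every passage on its own line gives the model a clear boundary between the question and each piece of reference text.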


Now let's use the `generate_content` method to generate an answer to the question.

In [13]:
answer = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=prompt)

answer.text

'STR is short for Strength, and it represents the raw physical power that your investigator can bring to bear in the game. As an example, an investigator with a high STR might be better at lifting heavy objects or winning a physical fight.'

Let's assemble a function that replies to a question using the techniques described above:

In [14]:
def get_prompt(question: str) -> str:
    """Build the instruction prompt for a question (passages are appended later)."""
    query_oneline = question.replace("\n", " ")

    return f"""You are a helpful and informative bot that answers questions using text from the reference passage included below.
Be sure to respond in a complete sentence, being comprehensive, including all relevant background information.
However, you are talking to an audience new to the game, so be sure to provide examples
and strike a friendly and conversational tone. If the passage is irrelevant to the answer, you may ignore it.
Use only information provided in the passage.

QUESTION: {query_oneline}"""


def get_passages(question: str) -> list[str]:
    """Return the three stored documents most relevant to the question."""
    # Make sure embeddings are generated in query mode before searching.
    embed_fn.document_mode = False
    result = db.query(query_texts=[question], n_results=3)
    [all_passages] = result["documents"]
    return all_passages


def reply(question: str) -> str:
    """Answer a question using passages retrieved from the database."""
    passages = get_passages(question)
    prompt = get_prompt(question)
    for passage in passages:
        passage_oneline = passage.replace("\n", " ")
        prompt += f"\nPASSAGE: {passage_oneline}"

    answer = client.models.generate_content(
        model="gemini-2.0-flash",
        contents=prompt)
    return answer.text

Let's try a few examples:

In [15]:
Markdown(reply("What is STR?"))

STR, which is short for Strength, is one of the eight characteristics used to measure an investigator's attributes in the game Call of Cthulhu. In particular, STR measures the raw physical power your investigator can bring to bear. For example, if your investigator has STR 60, that means Strength 60%.

In [16]:
Markdown(reply("Can an investigator possess a weapon?"))

Yes, an investigator can possess weapons, and this is noted in the weapons section of their investigator sheet. The investigator sheet notes each weapon's Regular, Hard, and Extreme skill values, the damage it can inflict (usually a dice roll), and the number of attacks per round it can be used. For firearms, it also includes the range, ammunition, and its malfunction number.


In [17]:
Markdown(reply("What do I need to bring for the game?"))

To get started with the Call of Cthulhu roleplaying game, you'll need a few things. You'll definitely want the Quick-Start Rules guide, which introduces you to the game. In addition, you'll need a set of polyhedral dice, or a dice-rolling app if you're playing online, notepaper, pencils, at least one other person to play with, and a quiet place to play for two to four hours. If you're playing online, you can use an online dice roller and share investigator sheets as PDFs. You'll also need a video conferencing platform so everyone can see and hear each other.

In [18]:
Markdown(reply("Is there magic in the game?"))

Yes, there is magic in the game, and it is connected to Magic Points. Magic Points, or MP, are used to cast spells or produce some other magical effect. When a player uses magic points, they regenerate at a rate of one point per hour. If a character uses up all of their magic points, any further expenditure is taken from their hit points, which causes physical damage. For an example of how magic points are used, you can refer to Corbitt's Spells in The Haunting adventure.
