# RAG and AI

## Embeddings

Embeddings can be generated locally with a number of Python modules (most commonly `sentence-transformers`), but depending on your device this can be slow. For now we'll use Gemini's embedding generation functions.

This workbook requires a Google Gemini API key:
- Google provide Gemini with a free tier available to anyone with a Google account.
- To get your own key, visit [aistudio.google.com/app/apikey](https://aistudio.google.com/app/apikey) and click "Create API key".

This API key needs to be saved to your project's `.env` file. Open your `.env` and add `GEMINI_API_KEY="MY-API-KEY-HERE"`, or run
```
# Linux
echo 'GEMINI_API_KEY="MY-API-KEY-HERE"' >> .env

# Windows (>> appends, so existing entries in .env are kept)
echo GEMINI_API_KEY="MY-API-KEY-HERE">> .env
```

This workbook also requires the Gemini Python module, `google-generativeai`, along with `python-dotenv` for loading the `.env` file.

In [None]:
import os
import google.generativeai as genai
from dotenv import load_dotenv                  # Allow us to load environment variables

In [None]:
load_dotenv()
gemini_api_key = os.environ["GEMINI_API_KEY"]
genai.configure(api_key=gemini_api_key)

### Generating a single embedding

Google provide a free embeddings API, rate limited to 1500 calls per minute. For our purposes this is perfect.

In [None]:
result = genai.embed_content(
    model="models/text-embedding-004",          # Embedding model to use
    content="What is the meaning of life?",     # Document text to create an embedding of
    task_type="retrieval_document",             # Task type ("retrieval_query", "retrieval_document", "clustering", "semantic_similarity", "classification")
    title="Embedding of a single string"        # Parameter for retrieval_document tasks. Ostensibly the document title
)

In [None]:
result["embedding"][:10]    # Shortened to the first 10 elements
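
On its own the vector is just a list of floats; it becomes useful when it is compared with other vectors. Below is a minimal sketch of one common comparison, cosine similarity. The `cosine_similarity` helper and the short toy vectors are illustrative assumptions, not Gemini output or part of the Gemini API.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings"; real ones have hundreds of dimensions
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

# Vectors pointing in similar directions score higher
print(cosine_similarity(cat, kitten) > cosine_similarity(cat, invoice))
```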

# Loading and chunking documents

### Loading PDFs

This can be accomplished with a number of tools, including OCR, but we'll be using `pypdf`.

In [None]:
from pypdf import PdfReader    # `pip install pypdf`
from os import listdir
from os.path import isfile, join
import re

#### Loading a single document

In [None]:
reader = PdfReader('./data/sample/ResearchBazaarQueensland2024.pdf')
print(f"File contains {str(len(reader.pages))} pages")
for page in reader.pages:
    print(page.extract_text(0))

#### Approaching cleaning

In [None]:
def clean_text(text: str) -> str:
    """
    Clean the given input text. Removes extra whitespace and unwanted characters.
    """
    # Rejoin words hyphenated across a line break (e.g. "implemen-\ntation"),
    # leaving ordinary hyphens such as "well-known" untouched
    text = re.sub(r'(?<=\w)-\s*\n\s*(?=\w)', '', text)
    # Collapse runs of whitespace (including newlines) into a single space
    text = re.sub(r'\s+', ' ', text)
    # Remove special characters but keep periods and common punctuation
    text = re.sub(r'[^\w\s.,!?-]', '', text)
    return text.strip()

#### Loading all documents

Load all documents and clean the text in preparation for chunking.

In [None]:
data_dir = './data/ethics/'

In [None]:
documents = {}
for document in listdir(data_dir):
    working_document = join(data_dir, document)
    # Skip sub-directories and anything that is not a PDF
    if isfile(working_document) and document.lower().endswith('.pdf'):
        reader = PdfReader(working_document)
        # Join pages with a space so words at page boundaries do not run together
        documents[document] = " ".join(
            clean_text(page.extract_text()) for page in reader.pages
        )

In [None]:
# Sample of cleaned document text
documents[list(documents.keys())[0]]

#### Perform Chunking

We're going to address segmentation/chunking based on the following assumptions
- A new paragraph means a new chunk*
- A new sentence within a paragraph should be added to the paragraph chunk if it will fit
- A new page is a new chunk (this is more a limitation of not being able to differentiate a new page and a new paragraph)

*Determining paragraphs from a PDF is [*hard*](https://pypdf.readthedocs.io/en/stable/user/extract-text.html#why-text-extraction-is-hard). We're going to assume two successive newlines mean a new paragraph.<br>
The paragraph problem is more easily solvable with OCR, however that is beyond the scope of this workbook.
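
The assumptions above can be sketched as follows. This is a minimal illustration under stated assumptions, not a definitive chunker: `chunk_paragraphs` is a hypothetical helper, it counts whitespace-separated words as a stand-in for tokens, it splits sentences with a naive regular expression, and it expects raw text that still contains its newlines. Calling it once per page would give the "new page, new chunk" behaviour.

```python
import re

def chunk_paragraphs(text: str, limit: int = 128) -> list[str]:
    """One chunk per paragraph, packing whole sentences up to `limit` words."""
    chunks = []
    # Assumption: two or more successive newlines mark a paragraph boundary
    for paragraph in re.split(r'\n\s*\n', text):
        # Naive sentence split: break after ., ! or ? followed by whitespace
        sentences = [
            " ".join(s.split())
            for s in re.split(r'(?<=[.!?])\s+', paragraph)
            if s.strip()
        ]
        current, count = [], 0
        for sentence in sentences:
            words = len(sentence.split())
            # Start a new chunk if this sentence would overflow the current one
            if current and count + words > limit:
                chunks.append(" ".join(current))
                current, count = [], 0
            current.append(sentence)
            count += words
        # Flush whatever is left of the paragraph
        if current:
            chunks.append(" ".join(current))
    return chunks

sample = "First sentence here. Second sentence here.\n\nA new paragraph starts. It has two sentences."
for chunk in chunk_paragraphs(sample, limit=5):
    print(repr(chunk))
```

A sentence longer than `limit` is kept whole and becomes its own oversized chunk; splitting it further is left out to keep the sketch short.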

Determining a suitable chunk size is almost as hard, and just as situation specific. Embedding models also enforce limits on the size of the input, further complicating things.
For our purposes we'll use 128 tokens as the chunk size, as this represents a moderately sized paragraph; roughly the resolution we want to be able to trace back to. As a first pass, the code below approximates tokens with whitespace-separated words and simply starts a new chunk every 128 words.

Embedding quality can be improved by actions such as prepending the tail of the previous chunk to the new one in order to provide more context, or even including the title of the document. As Gemini's text embedding API provides a specific parameter for passing the title of a document, we won't worry about providing additional context for the moment.
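
As a sketch of the first idea, the tail of each chunk can be prepended to the chunk that follows it. The `add_overlap` helper and the overlap sizes used here are illustrative assumptions; the workbook itself does not apply overlap.

```python
def add_overlap(chunks: list[str], overlap_words: int = 16) -> list[str]:
    """Prepend the last `overlap_words` words of each chunk to the next chunk."""
    if overlap_words <= 0:
        return list(chunks)
    overlapped = []
    previous_tail: list[str] = []
    for chunk in chunks:
        words = chunk.split()
        overlapped.append(" ".join(previous_tail + words))
        # Remember the tail of the original (non-overlapped) chunk
        previous_tail = words[-overlap_words:]
    return overlapped

example = ["one two three four", "five six seven eight", "nine ten"]
print(add_overlap(example, overlap_words=2))
# ['one two three four', 'three four five six seven eight', 'seven eight nine ten']
```

Overlapped chunks are longer than the originals by up to `overlap_words` words, so the chunk limit would need to leave room for the overlap.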

In [None]:
chunk_token_limit = 128

In [None]:
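chunks = {}
for document, text in documents.items():
    chunks[document] = []
    current_words = []
    # Whitespace-separated words are used as a rough stand-in for tokens
    for word in text.split():
        current_words.append(word)
        if len(current_words) >= chunk_token_limit:
            chunks[document].append(" ".join(current_words))
            current_words = []
    # Keep the final, partially filled chunk instead of dropping it
    if current_words:
        chunks[document].append(" ".join(current_words))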

In [None]:
# An example of the chunked output of the first document
chunks[list(chunks.keys())[0]]

# Generating and Storing embeddings

#### Generating a single embedding
As shown earlier, generating a single embedding is a simple task, so how do we scale?

In [None]:
genai.embed_content(
    model="models/text-embedding-004",
    content="What is the meaning of life?",
    task_type="retrieval_document",
    title="Embedding of a single string"
)
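
One practical concern when scaling up is the rate limit mentioned earlier. Below is a minimal sketch of a retry wrapper with exponential backoff. `with_backoff` is a hypothetical helper, not part of `google-generativeai`; for simplicity it retries on any exception, whereas real code should catch only the rate-limit error raised by the client library. The flaky function is a stand-in for an API call.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def with_backoff(call: Callable[[], T], attempts: int = 5, base_delay: float = 1.0,
                 sleep: Callable[[float], None] = time.sleep) -> T:
    """Run `call`, retrying with exponentially growing pauses if it raises."""
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            # Give up after the final attempt and let the error propagate
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)   # 1s, 2s, 4s, ...
    raise ValueError("attempts must be at least 1")

# Stand-in for an API call that fails twice before succeeding
failures = iter([True, True, False])

def flaky_call() -> str:
    if next(failures):
        raise RuntimeError("simulated rate limit")
    return "embedding"

# Record the pauses instead of actually sleeping
delays = []
print(with_backoff(flaky_call, sleep=delays.append), delays)
```

In this workbook the wrapped call could be something like `lambda: genai.embed_content(...)`, with the default `time.sleep` left in place.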

## How to store embeddings?

There are many vector-compatible databases, as well as databases aimed solely at storing and retrieving vectors. One of the most lightweight options is Chroma DB.

In [None]:
import chromadb
chroma = chromadb.PersistentClient(path="./chroma-db")

In [None]:
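# Delete the collection if it already exists
try:
    chroma.delete_collection(name="gemini-embeddings")
except Exception:
    # Raised when the collection does not exist yet (the exception type varies between Chroma versions)
    pass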

#### Generating embeddings for all of our documents

Create a Chroma DB collection, with the embedding generation function set to use Gemini

In [None]:
from chromadb.utils.embedding_functions.google_embedding_function import GoogleGenerativeAiEmbeddingFunction

# The collection was already deleted above, so it only needs to be created here
db = chroma.get_or_create_collection(
    name="gemini-embeddings",
    embedding_function=GoogleGenerativeAiEmbeddingFunction(
        api_key=gemini_api_key,
        model_name="models/text-embedding-004",
        task_type="RETRIEVAL_DOCUMENT"
    )
)

Add each chunk to the database (this may take a while)

In [None]:
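for document, document_chunks in chunks.items():
    print("Working on: " + document)
    # Remove the ".pdf" extension as a suffix, then keep letters only
    # (str.rstrip('pdf') would strip any trailing p, d or f characters, e.g. "Map.pdf" -> "Ma")
    document_name = re.sub(r'\.pdf$', '', document, flags=re.IGNORECASE)
    document_id = re.sub(r'[^a-zA-Z]', '', document_name)
    print(document_id)
    for i, chunk in enumerate(document_chunks):
        db.add(
            documents=chunk,
            ids=f"{document_id}-{i}",
            metadatas={
                "document": document
            }
        )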
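
Chroma's `add` also accepts lists of documents, ids and metadatas, so chunks can be submitted in batches rather than one at a time. Below is a minimal sketch of the batching step only. The `batched` helper, the batch size of 50 and the sample ids are assumptions, and the `db.add` call is left as a comment so the example runs without a database.

```python
from typing import Iterator

def batched(items: list, size: int) -> Iterator[list]:
    """Yield successive slices of `items` containing at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Hypothetical chunks for one document
document = "example.pdf"
document_chunks = [f"chunk text {i}" for i in range(120)]
records = [(f"example-{i}", chunk) for i, chunk in enumerate(document_chunks)]

for batch in batched(records, size=50):
    ids = [record_id for record_id, _ in batch]
    texts = [text for _, text in batch]
    metadatas = [{"document": document} for _ in batch]
    # db.add(documents=texts, ids=ids, metadatas=metadatas)
    print(len(ids), ids[0], ids[-1])
```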

#### Querying our documents
Now that we have a database full of embeddings, we can start querying.

What do we want to ask?

In [None]:
question = "What are user's perceptions of their twitter data usage?"
# question = "ownership of content"
# question = "What is responsible research?"
# question = "What is responsible research? Consider user concerns regarding data usage as well as animal ethics."
# question = "Is it ethical to collect user's data?"

Create an embedding of the question. Note the changed "task_type" from "retrieval_document" to "retrieval_query"

In [None]:
question_embedding = genai.embed_content(model="models/text-embedding-004", content=question, task_type="retrieval_query")['embedding']
results = db.query(query_embeddings=question_embedding, n_results=15)

We have results! The results dictionary provides several useful pieces of information, including the matched ids, the original text ("documents"), and the distance between the query and each match (the lower the distance, the closer the match).

In [None]:
results.keys()

In [None]:
results['ids']
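
Each value in the results dictionary is a list with one entry per query, which is why the first element is selected with `[0]` throughout. Below is a minimal sketch of pairing ids with their distances and texts. It uses a hand-written dictionary in the same nested-list shape rather than real query output; the ids, texts and distances are made up.

```python
# Hypothetical results in the nested-list shape returned by a Chroma query
sample_results = {
    "ids": [["docA-3", "docB-0", "docA-7"]],
    "documents": [["text of chunk 3", "text of chunk 0", "text of chunk 7"]],
    "distances": [[0.57, 0.42, 0.61]],
}

# Index 0 selects the matches for the first (and only) query
matches = list(zip(
    sample_results["ids"][0],
    sample_results["distances"][0],
    sample_results["documents"][0],
))

# Smaller distance means a closer match, so sort ascending
ranked = sorted(matches, key=lambda match: match[1])
for chunk_id, distance, text in ranked:
    print(f"{distance:.2f}  {chunk_id}  {text}")
```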

#### Passing documents to Gemini

Now that we have our matched documents, how can we use them for discussions?

The code below defines the role of the AI. Try changing the prompt, you'll likely get quite different responses.

In [None]:
prompt = f"""
You are an informative bot that answers questions using only the document texts provided. You will be provided with multiple texts; use all of them as needed.
If you do not know the answer from the provided context and documents, do not use your prior knowledge, and tell the user that you do not know.
Generate at least one paragraph.
Be comprehensive, helpful, human-readable, and provide detailed background information in your answer.
QUESTION: {question}
TEXTS: {results['documents'][0]}
"""

In [None]:
model = genai.GenerativeModel(model_name="gemini-1.5-flash")

Print out the result of our question

In [None]:
result = model.generate_content(prompt).text
print(result)
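
Interpolating the list directly places its Python representation (brackets, quotes and escape sequences) into the prompt. One possible alternative, sketched here with a hypothetical `format_texts` helper and placeholder texts, is to put each text in its own numbered block.

```python
def format_texts(texts: list[str]) -> str:
    """Render each retrieved text as a numbered block separated by blank lines."""
    return "\n\n".join(
        f"[{number}] {text}" for number, text in enumerate(texts, start=1)
    )

# Placeholder stand-ins for results['documents'][0]
retrieved = ["First retrieved chunk.", "Second retrieved chunk."]
sample_question = "An example question?"

sample_prompt = f"""Answer using only the texts provided.
QUESTION: {sample_question}
TEXTS:
{format_texts(retrieved)}
"""
print(sample_prompt)
```

Whether this changes the quality of the answers depends on the model and the documents, so it is worth comparing both forms.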

#### But which document did that come from?

We know it came from one of the (up to) 15 supplied documents, but can we get more detailed than that?

Rearrange the results data structure to include the document IDs

In [None]:
# Map each matched id to its document text
results_with_ids = dict(zip(results['ids'][0], results['documents'][0]))

In [None]:
results_with_ids

In [None]:
question = "What are user's perceptions of their twitter data usage?"
# question = "ownership of content"
# question = "What is responsible research?"
# question = "What is responsible research? Consider user concerns regarding data usage as well as animal ethics."
# question = "Is it ethical to collect user's data?"

In [None]:
prompt = f"""
You are an informative bot that answers questions using only the document texts provided. You will be provided with multiple texts; use all of them as needed.
If you do not know the answer from the provided context and documents, do not use your prior knowledge, and tell the user that you do not know.
Generate at least one paragraph.
Be comprehensive, helpful, human-readable, and provide detailed background information in your answer.
Finish each response with "SOURCE:" followed by a python list of all document IDs that you used to answer the question.
QUESTION: {question}
TEXTS: {results_with_ids}
"""

In [None]:
result = model.generate_content(prompt).text
print(result)

We have a new response, this time with the IDs of the documents used*. We can now look up the document containing the answer in our Chroma DB instance.

*Keep in mind that AI is fallible, and may not have correctly identified the exact documents it used.

In [None]:
containing_documents = result.rsplit("SOURCE:", 1)[1]
containing_documents = containing_documents.split(",")
containing_documents = [re.sub(r'[^a-zA-Z0-9\-]', '', containing_document) for containing_document in containing_documents]

In [None]:
containing_documents

In [None]:
db.get(ids=containing_documents)
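
The parsing above assumes the model always ends its response with a `SOURCE:` line. Below is a slightly more defensive sketch, assuming IDs are made of letters and digits joined by at least one hyphen, as generated earlier. `extract_sources` is a hypothetical helper and the sample responses are made up.

```python
import re

def extract_sources(response: str) -> list[str]:
    """Return the IDs listed after the last "SOURCE:" marker, or [] if there is none."""
    _, marker, tail = response.rpartition("SOURCE:")
    if not marker:
        # The model did not follow the requested format
        return []
    # Pick out hyphenated IDs, ignoring brackets, quotes and whitespace
    return re.findall(r'[A-Za-z0-9]+(?:-[A-Za-z0-9]+)+', tail)

print(extract_sources("Users are wary.\nSOURCE: ['docA-3', 'docB-0']"))
print(extract_sources("I do not know."))
```

The extracted IDs could additionally be filtered against `results['ids'][0]` to discard any ID that was not actually supplied to the model.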

#### What about follow up questions?

Chatting with Gemini

In [None]:
chat = model.start_chat(
    history=[
        {"role": "user", "parts": prompt},
        {"role": "model", "parts": result}
    ]
)

In [None]:
chat_response = chat.send_message("How do user perceptions change?")
print(chat_response.text)

In [None]:
chat_response = chat.send_message("What are some recommendations on how to improve these perceptions?")
print(chat_response.text)

## Further improvements

- Include the page number and paragraph number in the Chroma DB entry. Alternatively, include character offsets.
- Open a PDF viewer and highlight the matched text.
- Provide additional documents as needed.
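
A minimal sketch of the first improvement, assuming the text of each page is kept separate instead of being concatenated. `build_records` is a hypothetical helper, words again stand in for tokens, the page texts are made up, and the id format (`<document>-p<page>-<chunk>`) is only one possible choice.

```python
def build_records(document: str, pages: list[str], limit: int = 128) -> list[dict]:
    """Chunk each page separately and record which page every chunk came from."""
    records = []
    for page_number, page_text in enumerate(pages, start=1):
        words = page_text.split()
        # Chunk numbering restarts on every page
        for chunk_index, start in enumerate(range(0, len(words), limit)):
            records.append({
                "id": f"{document}-p{page_number}-{chunk_index}",
                "document": " ".join(words[start:start + limit]),
                "metadata": {"document": document, "page": page_number},
            })
    return records

pages = ["alpha beta gamma delta epsilon", "zeta eta"]
for record in build_records("example", pages, limit=3):
    print(record["id"], record["metadata"], record["document"])
```

The `id`, `document` and `metadata` values of each record can then be collected into the `ids`, `documents` and `metadatas` lists passed to `db.add`, so that a matched chunk can be traced back to its page.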