# How RAG Works

This is the basic workflow for RAG, which we will implement step by step in this notebook.

1. Load the data \
    a. Tokenize the text in the data \
    b. Break the tokenized data into chunks 

2. Convert the chunked data into vector representations, i.e., embed the data \
    a. Use a pre-trained language model to generate the vector embeddings

3. Store the embeddings (this notebook saves them to a JSON file rather than a dedicated vector database) 
4. Embed the query and compare it against the stored embeddings 
5. Retrieve the most relevant chunks 
6. Rank the documents based on their similarity scores 
7. Return the top-ranked documents 
8. Display the results
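
As a rough illustration of steps 2 through 7, here is a minimal, self-contained sketch. It is not the notebook's implementation: `toy_embed` is a hypothetical stand-in that uses word counts instead of a pre-trained embedding model, and the in-memory `store` list stands in for a vector database.

```python
import math
import re
from collections import Counter

def toy_embed(text):
    """Hypothetical stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Steps 2-3: embed each chunk and keep (vector, text) pairs in memory
chunks = [
    "The Quidditch season had begun in November.",
    "Mr. Dursley was the director of a firm called Grunnings.",
    "Hagrid defrosted broomsticks on the Quidditch field.",
]
store = [(toy_embed(chunk), chunk) for chunk in chunks]

# Steps 4-7: embed the query, score every chunk, rank, and take the best match
query = "When did the Quidditch season begin?"
query_vec = toy_embed(query)
ranked = sorted(((cosine(query_vec, vec), chunk) for vec, chunk in store), reverse=True)
top_chunk = ranked[0][1]
print(top_chunk)
```

A real embedding model captures meaning rather than word overlap, but the embed, score, rank, and return flow is the same.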

## 1. Loading the data

### 1a. Tokenize the text in the data

In [2]:
import fitz # PyMuPDF, a library for reading and writing PDF files
from tqdm import tqdm # a library for progress bars, visualizing progress of loops

In [3]:
pdf_path = "data/harry_potter_sorcerer's_stone.pdf"

# Replace newlines with spaces and strip leading/trailing whitespace
def format_text(text:str) -> str:
    cleaned_text = text.replace("\n", " ").strip()
    return cleaned_text

# Reading the PDF file and returning a list of dictionaries containing the text and page number
def read_pdf(filepath:str) -> list[dict]:
    """Reads a PDF file and returns a list of dictionaries containing the text and page number."""
    doc = fitz.open(filepath) # Open the PDF file
    text_list = [] # Initialize an empty list to store the text and page number
    for page_num, page in tqdm(enumerate(doc)): # Loop through each page in the document
        text = page.get_text() # Get the text from the page
        # Append the cleaned text, page number, and rough size statistics for the page
        text_list.append({
            "text": format_text(text),
            "page": page_num,
            "page_word_count": len(text.split(" ")),  # rough count: splits the raw text on spaces only
            "page_token_count": len(text) / 4,  # rough estimate: 1 token = ~4 characters
        })
    return text_list

# Read the PDF and preview the extracted pages
pages_and_texts = read_pdf(pdf_path)
pages_and_texts[:2] # Displaying the first 2 items in the list

250it [00:01, 125.55it/s]


[{'text': '', 'page': 0, 'page_word_count': 1, 'page_token_count': 0.0},
 {'text': "1 Harry Potter and the Sorcerer's Stone CHAPTER ONE THE BOY WHO LIVED Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say that they were perfectly normal, thank you very much. They were the last people you'd expect to be involved in anything strange or mysterious, because they just didn't hold with such nonsense. Mr. Dursley was the director of a firm called Grunnings, which made drills. He was a big, beefy man with hardly any neck, although he did have a very large mustache. Mrs. Dursley was thin and blonde and had nearly twice the usual amount of neck, which came in very useful as she spent so much of her time craning over garden fences, spying on the neighbors. The Dursleys had a small son called Dudley and in their opinion there was no finer boy anywhere. The Dursleys had everything they wanted, but they also had a secret, and their greatest fear was that somebody would discover it

In [4]:
import pandas as pd # Library used to create dataframes in Python
df = pd.DataFrame(pages_and_texts) # Creating a dataframe from the list of dictionaries

In [5]:
df.describe().round(3) # Displaying the summary statistics of the dataframe

Unnamed: 0,page,page_word_count,page_token_count
count,250.0,250.0,250.0
mean,124.5,284.252,437.725
std,72.313,53.953,77.05
min,0.0,1.0,0.0
25%,62.25,252.25,393.25
50%,124.5,286.0,436.5
75%,186.75,320.0,491.25
max,249.0,472.0,677.75


### 1b. Process text at the sentence level

Split each page into sentences, then group every 10 sentences (`chunk_size`) into a "chunk" of text. The last chunk on a page may contain fewer sentences.

In [6]:
pages_and_texts[0] # Displaying the first item in the list

{'text': '', 'page': 0, 'page_word_count': 1, 'page_token_count': 0.0}

In [7]:
from spacy.lang.en import English # importing spacy library to allow us to create chunks of text using individual sentences.
nlp = English()

# Add a sentencizer pipeline, see https://spacy.io/api/sentencizer 
nlp.add_pipe("sentencizer")


<spacy.pipeline.sentencizer.Sentencizer at 0x132bc2f50>

In [8]:

# Split each page's text into sentences and count them
for item in tqdm(pages_and_texts):
    item["sentences"] = list(nlp(item["text"]).sents)
    item["sentences"] = [str(sentence) for sentence in item["sentences"]]
    item['num_sents'] = len(item["sentences"])
    


100%|██████████| 250/250 [00:00<00:00, 335.27it/s]


In [9]:
chunk_size = 10 # Number of sentences per chunk
def get_chunks(input_list: list[str],
               slice_size: int = chunk_size) -> list[list[str]]:
    """Split a list into chunks of a specified size."""
    return [input_list[i:i + slice_size] for i in range(0, len(input_list), slice_size)]
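
A quick check of how this slicing behaves, assuming 25 toy sentences (the function is repeated here so the example runs on its own). Because chunking is done page by page, the final chunk of a page can be much shorter than `chunk_size`.

```python
def get_chunks(input_list, slice_size=10):
    """Split a list into consecutive slices of at most slice_size items."""
    return [input_list[i:i + slice_size] for i in range(0, len(input_list), slice_size)]

sentences = [f"Sentence {n}." for n in range(1, 26)]  # 25 toy sentences
chunks = get_chunks(sentences, slice_size=10)
chunk_lengths = [len(chunk) for chunk in chunks]
print(chunk_lengths)  # [10, 10, 5]
```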

In [10]:
# Loop through pages and texts and split sentences into chunks
for item in tqdm(pages_and_texts):
    item["sentence_chunks"] = get_chunks(input_list=item["sentences"],
                                         slice_size=chunk_size)
    item["num_chunks"] = len(item["sentence_chunks"])

100%|██████████| 250/250 [00:00<00:00, 195192.85it/s]


In [11]:
import random
random.sample(pages_and_texts, k=1)


[{'text': '227 Ron started to direct the black pieces. They moved silently wherever he sent them. Harry\'s knees were trembling. What if they lost? "Harry -- move diagonally four squares to the right." Their first real shock came when their other knight was taken. The white queen smashed him to the floor and dragged him off the board, where he lay quite still, facedown. "Had to let that happen," said Ron, looking shaken. "Leaves you free to take that bishop, Hermione, go on." Every time one of their men was lost, the white pieces showed no mercy. Soon there was a huddle of limp black players slumped along the wall. Twice, Ron only just noticed in time that Harry and Hermione were in danger. He himself darted around the board, taking almost as many white pieces as they had lost black ones. "We\'re nearly there," he muttered suddenly. "Let me think let me think..." The white queen turned her blank face toward him. "Yes..." said Ron softly, "It\'s the only way... I\'ve got to be taken." "

In [12]:
import re

# Split each chunk into its own item
pages_and_chunks = []
for item in tqdm(pages_and_texts): 
    for sentence_chunk in item["sentence_chunks"]: 
        chunk_dict = {}
        chunk_dict["page_number"] = item["page"]

        # Join the sentences together into a paragraph-like structure, aka join the list of sentences into one paragraph
        joined_sentence_chunk = "".join(sentence_chunk).replace("  ", " ").strip()
        joined_sentence_chunk = re.sub(r'\.([A-Z])', r'. \1', joined_sentence_chunk) # ".A" => ". A" (works for any capital letter)

        chunk_dict["sentence_chunk"] = joined_sentence_chunk

        # Get some stats on our chunks
        chunk_dict["chunk_char_count"] = len(joined_sentence_chunk)
        chunk_dict["chunk_word_count"] = len(joined_sentence_chunk.split(" "))
        chunk_dict["chunk_token_count"] = len(joined_sentence_chunk) / 4 # 1 token = ~4 chars

        pages_and_chunks.append(chunk_dict) 

len(pages_and_chunks)

100%|██████████| 250/250 [00:00<00:00, 8749.22it/s]


753
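
Sampled chunks in this notebook contain run-together text such as `?"Ron asked` and `on."But`. The sketch below, using made-up sentence strings, suggests why: joining with an empty string removes the space between sentences, and the `\.([A-Z])` pattern only repairs a period followed directly by a capital letter. Joining with a single space is one possible alternative; it is shown for comparison only and is not what the notebook does.

```python
import re

# Toy sentences without trailing whitespace, as an assumption about the sentencizer output
sentences = ["He sprinted back upstairs.", '"Did you get it?"', "Ron asked.", "It stared back."]

# The notebook's approach: join with no separator, then patch ".A" into ". A"
no_space = "".join(sentences)
patched = re.sub(r'\.([A-Z])', r'. \1', no_space)

# Alternative for comparison: join with a single space
spaced = " ".join(sentences)

print(patched)
print(spaced)
```

Switching to the space-joined form would change the character and token counts reported for each chunk, so any downstream statistics would need to be regenerated.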

In [13]:
random.sample(pages_and_chunks, k=1) # Displaying a random sample from the list


[{'page_number': 215,
  'sentence_chunk': 'Hermione, you\'d better do that." "Why me?" "It\'s obvious," said Ron. "You can pretend to be waiting for Professor Flitwick, you know."He put on a high voice, "\'Oh Professor Flitwick, I\'m so worried, I think I got question fourteen b wrong....\'" "Oh, shut up," said Hermione, but she agreed to go and watch out for Snape. "And we\'d better stay outside the third-floor corridor," Harry told Ron. "Come on."But that part of the plan didn\'t work. No sooner had they reached the door separating Fluffy from the rest of the school than Professor McGonagall turned up again and this time, she lost her temper. "I suppose you think you\'re harder to get past than a pack of enchantments!"',
  'chunk_char_count': 679,
  'chunk_word_count': 118,
  'chunk_token_count': 169.75}]

In [14]:
df = pd.DataFrame(pages_and_chunks)
df.describe().round(2)


Unnamed: 0,page_number,chunk_char_count,chunk_word_count,chunk_token_count
count,753.0,753.0,753.0,753.0
mean,125.85,578.85,103.09,144.71
std,72.09,275.52,49.85,68.88
min,1.0,12.0,2.0,3.0
25%,64.0,399.0,71.0,99.75
50%,127.0,563.0,99.0,140.75
75%,189.0,729.0,132.0,182.25
max,249.0,1877.0,349.0,469.25


In [15]:
df.head()

Unnamed: 0,page_number,sentence_chunk,chunk_char_count,chunk_word_count,chunk_token_count
0,1,1 Harry Potter and the Sorcerer's Stone CHAPTE...,1327,235,331.75
1,1,The Dursleys knew that the Potters had a small...,788,142,197.0
2,2,2 because Dudley was now having a tantrum and ...,645,126,161.25
3,2,As Mr. Dursley drove around the corner and up ...,878,168,219.5
4,2,They were whispering excitedly together. Mr. D...,934,167,233.5


In [16]:
# Inspect a random sample of chunks at or below the minimum token length
min_token_length = 20 # Minimum number of tokens a chunk must have to be included
for row in df[df["chunk_token_count"] <= min_token_length].sample(5).iterrows():
    print(f'Chunk token count: {row[1]["chunk_token_count"]} | Text: {row[1]["sentence_chunk"]}')

Chunk token count: 3.0 | Text: he squeaked.
Chunk token count: 9.75 | Text: Finch-Fletchley, Justin!" "HUFFLEPUFF!"
Chunk token count: 3.25 | Text: GET OUT!OUT!"
Chunk token count: 17.75 | Text: A murmur ran through the crowd as Adrian Pucey dropped the Quaffle, too
Chunk token count: 8.5 | Text: Harry wished he would blink. Those
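
The token counts here come from the rough "1 token = ~4 characters" estimate rather than a real tokenizer. A minimal sketch of that estimate and the threshold filter, written without pandas; `estimate_tokens` is a hypothetical helper name, not something defined in the notebook.

```python
MIN_TOKEN_LENGTH = 20  # same threshold as min_token_length

def estimate_tokens(text):
    """Rough token estimate: number of characters divided by 4."""
    return len(text) / 4

chunks = [
    "he squeaked.",
    'GET OUT!OUT!"',
    "Mr. Dursley was the director of a firm called Grunnings, which made drills. "
    "He was a big, beefy man with hardly any neck.",
]

# Keep only chunks whose estimated token count exceeds the threshold
kept = [chunk for chunk in chunks if estimate_tokens(chunk) > MIN_TOKEN_LENGTH]

for chunk in chunks:
    print(estimate_tokens(chunk), "|", chunk[:40])
```

Very short chunks like the first two tend to carry little context on their own, which is the motivation for dropping them before embedding.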


In [17]:
# Keep only the rows with more than min_token_length (20) estimated tokens
pages_and_chunks_over_min_token_len = df[df["chunk_token_count"] > min_token_length].to_dict(orient="records")
pages_and_chunks_over_min_token_len[:2]

[{'page_number': 1,
  'sentence_chunk': "1 Harry Potter and the Sorcerer's Stone CHAPTER ONE THE BOY WHO LIVED Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say that they were perfectly normal, thank you very much. They were the last people you'd expect to be involved in anything strange or mysterious, because they just didn't hold with such nonsense. Mr. Dursley was the director of a firm called Grunnings, which made drills. He was a big, beefy man with hardly any neck, although he did have a very large mustache. Mrs. Dursley was thin and blonde and had nearly twice the usual amount of neck, which came in very useful as she spent so much of her time craning over garden fences, spying on the neighbors. The Dursleys had a small son called Dudley and in their opinion there was no finer boy anywhere. The Dursleys had everything they wanted, but they also had a secret, and their greatest fear was that somebody would discover it. They didn't think they could bear it if a

In [18]:
random.sample(pages_and_chunks_over_min_token_len, k=1)

[{'page_number': 64,
  'sentence_chunk': 'Ah yes," said the man. "Yes, yes. I thought I\'d be seeing you soon. Harry Potter."It wasn\'t a question. "You have your mother\'s eyes. It seems only yesterday she was in here herself, buying her first wand. Ten and a quarter inches long, swishy, made of willow. Nice wand for charm work."Mr. Ollivander moved closer to Harry.',
  'chunk_char_count': 324,
  'chunk_word_count': 57,
  'chunk_token_count': 81.0}]

## 2a. Generating embeddings and 3. Storing the embeddings

In [19]:
# Get the data ready to generate embeddings
text_chunks = [item["sentence_chunk"] for item in pages_and_chunks_over_min_token_len]
text_chunks[2]

'2 because Dudley was now having a tantrum and throwing his cereal at the walls. "Little tyke," chortled Mr. Dursley as he left the house. He got into his car and backed out of number four\'s drive. It was on the corner of the street that he noticed the first sign of something peculiar -- a cat reading a map. For a second, Mr. Dursley didn\'t realize what he had seen -- then he jerked his head around to look again. There was a tabby cat standing on the corner of Privet Drive, but there wasn\'t a map in sight. What could he have been thinking of?It must have been a trick of the light. Mr. Dursley blinked and stared at the cat. It stared back.'

In [20]:
len(text_chunks)

735

In [31]:
# Import necessary libraries
import json
import os
import ollama

# Function to create and save embeddings for text chunks

###### 2a. Generating vector embeddings

def save_embeddings(filename, text_chunks):
    """
    Create embeddings for given text chunks and save them to a JSON file.

    Args:
    filename (str): Name of the file to save embeddings (without extension).
    text_chunks (list): List of text chunks to create embeddings for.

    Returns:
    None

    This function does the following:
    1. Creates a directory 'embeddings' if it doesn't exist.
    2. Checks if the embeddings file already exists. If it does, it skips creation.
    3. If the file doesn't exist, it creates embeddings for each text chunk using the 'nomic-embed-text' model.
    4. Saves the embeddings to a JSON file in the 'embeddings' directory.

    The embeddings are stored in a list, where each element corresponds to the embedding of a text chunk.
    """
    # Create 'embeddings' directory if it doesn't exist
    os.makedirs("embeddings", exist_ok=True)
    
    # Check if embeddings file already exists
    file_path = f"embeddings/{filename}.json"
    if os.path.exists(file_path):
        print(f"Embeddings file {file_path} already exists. Skipping creation.")
        return
    
    # Create embeddings if file doesn't exist
    print("Creating embeddings...")
    embeddings = [
        ollama.embeddings(model="nomic-embed-text", prompt=chunk)["embedding"]
        for chunk in text_chunks
    ]

    ###### 3. Storing embeddings
    
    # Save embeddings to JSON file
    with open(file_path, "w") as f:
        json.dump(embeddings, f)
    print(f"Embeddings saved to {file_path}")

save_embeddings("sorcerer's_stone", text_chunks)

Embeddings file embeddings/sorcerer's_stone.json already exists. Skipping creation.


In [25]:
filename = "sorcerer's_stone"
with open(f"embeddings/{filename}.json", "r") as f:
    embedding_file = json.load(f) # loading the embeddings as embedding_file
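
Since `save_embeddings` skips creation whenever the file already exists, a cached file could in principle fall out of step with `text_chunks` if the chunking changes. One possible safeguard is sketched below with toy data; `load_embeddings_checked` is a hypothetical helper, not part of the notebook.

```python
import json
import os
import tempfile

def load_embeddings_checked(file_path, expected_count):
    """Hypothetical helper: load embeddings and verify there is one per text chunk."""
    with open(file_path, "r") as f:
        embeddings = json.load(f)
    if len(embeddings) != expected_count:
        raise ValueError(
            f"Found {len(embeddings)} embeddings but expected {expected_count}; "
            "the cached file may be stale."
        )
    return embeddings

toy_chunks = ["first chunk", "second chunk"]
toy_embeddings = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.json")
    with open(path, "w") as f:
        json.dump(toy_embeddings, f)

    loaded = load_embeddings_checked(path, expected_count=len(toy_chunks))

    # A mismatched count raises instead of silently misaligning chunks and vectors
    try:
        load_embeddings_checked(path, expected_count=3)
        mismatch_detected = False
    except ValueError:
        mismatch_detected = True
```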

## 4 and 5. Querying the vector database and retrieving the most similar chunks

In [26]:
# This cell contains the code for calculating the cosine similarity between two vectors and finding the most similar chunks to a given query.

from numpy.linalg import norm
import numpy as np

# Function to calculate cosine similarity between two vectors
def cosine_similarity(vec1, vec2):
    """
    Calculate the cosine similarity between two vectors.

    Args:
    vec1 (list): First vector.
    vec2 (list): Second vector.

    Returns:
    float: Cosine similarity between the two vectors.

    This function calculates the cosine similarity between two vectors using the dot product
    and the magnitudes of the vectors. The formula used is:
    cosine_similarity = dot_product(vec1, vec2) / (magnitude(vec1) * magnitude(vec2))

    The cosine similarity ranges from -1 to 1, where:
    1 indicates perfect similarity
    0 indicates no similarity
    -1 indicates perfect dissimilarity (opposite vectors)
    """
    # Calculate dot product of the two vectors
    dot_product = np.dot(vec1, vec2)
    
    # Calculate magnitudes of both vectors
    magnitude1 = norm(vec1)
    magnitude2 = norm(vec2)
    
    # Return cosine similarity
    return dot_product / (magnitude1 * magnitude2)

# Function to find the most similar chunks to a given query
# Parameters:
#   query: The input query string
#   embeddings: List of embeddings for all text chunks
#   pages_and_chunks_over_min_token_len: List of dictionaries containing page and chunk information
#   top_k: Number of most similar chunks to return (default: 5)
# Returns:
#   List of tuples containing similarity score and chunk information
def find_most_similar(query, embeddings, pages_and_chunks_over_min_token_len, top_k=5):
    # Generate query embedding
    query_embedding = ollama.embeddings(model="nomic-embed-text", prompt=query)["embedding"]
    query_norm = norm(query_embedding)
    
    # Calculate similarity scores
    similarity_scores = [
        (np.dot(query_embedding, embedding) / (query_norm * norm(embedding)), idx)
        for idx, embedding in enumerate(embeddings)
    ]
    
    # Sort by similarity score in descending order and return top K results with text
    most_similar = sorted(similarity_scores, reverse=True)[:top_k]
    
    return [(score, pages_and_chunks_over_min_token_len[idx]) for score, idx in most_similar]
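
To make the formula concrete, here is a small worked example in plain Python with hand-picked 2-D vectors. The vectors and labels are illustrative assumptions; real embeddings have hundreds of dimensions. `heapq.nlargest` is used as one way to pick the top results without sorting the full list.

```python
import heapq
import math

def cosine_similarity(vec1, vec2):
    """dot(vec1, vec2) / (|vec1| * |vec2|), written without NumPy."""
    dot = sum(a * b for a, b in zip(vec1, vec2))
    magnitude1 = math.sqrt(sum(a * a for a in vec1))
    magnitude2 = math.sqrt(sum(b * b for b in vec2))
    return dot / (magnitude1 * magnitude2)

query_vec = [1.0, 0.0]
chunk_vecs = {
    "same direction": [2.0, 0.0],   # longer, but parallel to the query
    "orthogonal": [0.0, 3.0],
    "opposite": [-1.0, 0.0],
    "45 degrees": [1.0, 1.0],
}

scores = {name: cosine_similarity(query_vec, vec) for name, vec in chunk_vecs.items()}

# Top 2 entries by score
top_2 = heapq.nlargest(2, scores.items(), key=lambda item: item[1])
print(top_2)
```

The "same direction" vector scores 1.0 even though it is twice as long as the query: cosine similarity depends only on the angle between vectors, not on their magnitudes.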

## 6. Ranking the documents based on their similarity scores and 7. Returning the top-ranked documents

In [27]:
query = "When was quidditch played" # general query, can ask about anything
similar_chunks = find_most_similar(query, embedding_file, pages_and_chunks_over_min_token_len)
similar_chunks

[(0.6823017966216157,
  {'page_number': 143,
   'sentence_chunk': 'CHAPTER ELEVEN QUIDDITCH As they entered November, the weather turned very cold. The mountains around the school became icy gray and the lake like chilled steel. Every morning the ground was covered in frost. Hagrid could be seen from the upstairs windows defrosting broomsticks on the Quidditch field, bundled up in a long moleskin overcoat, rabbit fur gloves, and enormous beaverskin boots. The Quidditch season had begun. On Saturday, Harry would be playing in',
   'chunk_char_count': 463,
   'chunk_word_count': 74,
   'chunk_token_count': 115.75}),
 (0.6652738694498586,
  {'page_number': 60,
   'sentence_chunk': 'Play Quidditch at all?" "No," Harry said again, wondering what on earth Quidditch could be. "I do -- Father says it\'s a crime if I\'m not picked to play for my house, and I must say, I agree. Know what house you\'ll be in yet?" "No," said Harry, feeling more stupid by the minute. "Well, no one really knows unt

In [28]:
print("Text chunks: ", text_chunks[423])
print("Pages and chunks: ", pages_and_chunks_over_min_token_len[423])

Text chunks:  146 Harry left, before Snape could take any more points from Gryffindor. He sprinted back upstairs. "Did you get it?"Ron asked as Harry joined them. "What's the matter?"In a low whisper, Harry told them what he'd seen. "You know what this means?"he finished breathlessly. "He tried to get past that three-headed dog at Halloween!That's where he was going when we saw him -- he's after whatever it's guarding!
Pages and chunks:  {'page_number': 146, 'sentence_chunk': '146 Harry left, before Snape could take any more points from Gryffindor. He sprinted back upstairs. "Did you get it?"Ron asked as Harry joined them. "What\'s the matter?"In a low whisper, Harry told them what he\'d seen. "You know what this means?"he finished breathlessly. "He tried to get past that three-headed dog at Halloween!That\'s where he was going when we saw him -- he\'s after whatever it\'s guarding!', 'chunk_char_count': 408, 'chunk_word_count': 68, 'chunk_token_count': 102.0}


## 8. Displaying the results using the Mistral model

In [30]:
import textwrap
query = "Which month was quidditch played" # edit this line with your query and watch the magic unfold!

# Find the most similar chunks
similar_chunks = find_most_similar(query, embedding_file, text_chunks)

# Prepare the context snippets
context_snippets = "\n".join(f"Source {i+1}: {item[1]}" for i, item in enumerate(similar_chunks))

# Generate response from the Mistral model
SYSTEM_PROMPT = """You are a helpful reading assistant who answers questions 
                based on snippets of text provided in context. Answer only using the context provided, 
                being as concise as possible. If you're unsure, just say that you don't know.
                Context:
                """
response = ollama.chat(
    model="mistral",
    messages=[
        {
            "role": "system",
            "content": SYSTEM_PROMPT + context_snippets,
        },
        {"role": "user", "content": query},
    ],
)

line_width = 80
# Print the response and the sources
print("Answer:", response["message"]["content"])
print("\nSources:")
for i, (score, snippet) in enumerate(similar_chunks):
    wrapped_text = textwrap.fill(snippet, width=line_width)
    print(f"Source {i+1} (Score: {score:.4f}):\n{wrapped_text}\n")


Answer:  The text does not specify a specific month when Quidditch is played, but it indicates that it starts in November, as they entered November, the weather turned cold and the Quidditch season had begun. However, since the story continues after November, it's possible that Quidditch matches could take place during other months as well.

Sources:
Source 1 (Score: 0.7072):
CHAPTER ELEVEN QUIDDITCH As they entered November, the weather turned very cold.
The mountains around the school became icy gray and the lake like chilled steel.
Every morning the ground was covered in frost. Hagrid could be seen from the
upstairs windows defrosting broomsticks on the Quidditch field, bundled up in a
long moleskin overcoat, rabbit fur gloves, and enormous beaverskin boots. The
Quidditch season had begun. On Saturday, Harry would be playing in

Source 2 (Score: 0.6373):
Play Quidditch at all?" "No," Harry said again, wondering what on earth
Quidditch could be. "I do -- Father says it's a crime if I
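
The prompt assembly step can be looked at in isolation. The sketch below builds a system prompt from `(score, text)` pairs without calling any model; `build_system_prompt` and the shortened instruction text are hypothetical, and `textwrap.dedent` is used so the indentation of the triple-quoted string does not end up in the prompt.

```python
import textwrap

# Shortened, hypothetical instruction text; dedent strips the source-code indentation
INSTRUCTIONS = textwrap.dedent("""\
    You are a helpful reading assistant who answers questions
    based on snippets of text provided in context.
    Answer only using the context provided.""")

def build_system_prompt(retrieved):
    """Join (score, text) pairs into numbered sources under the instructions."""
    sources = "\n".join(
        f"Source {i + 1}: {text}" for i, (_score, text) in enumerate(retrieved)
    )
    return f"{INSTRUCTIONS}\nContext:\n{sources}"

retrieved = [
    (0.71, "The Quidditch season had begun."),
    (0.64, "Play Quidditch at all?"),
]
system_prompt = build_system_prompt(retrieved)
print(system_prompt)
```

Printing the assembled prompt before sending it to the model is a cheap way to confirm that the retrieved context is what you expect.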

# Experimentation

In [28]:
import re
# Raw strings keep regex escapes such as \s from being parsed as string escapes
alphabets = r"([A-Za-z])"
prefixes = r"(Mr|St|Mrs|Ms|Dr)[.]"
suffixes = r"(Inc|Ltd|Jr|Sr|Co)"
starters = r"(Mr|Mrs|Ms|Dr|Prof|Capt|Cpt|Lt|He\s|She\s|It\s|They\s|Their\s|Our\s|We\s|But\s|However\s|That\s|This\s|Wherever)"
acronyms = r"([A-Z][.][A-Z][.](?:[A-Z][.])?)"
websites = r"[.](com|net|org|io|gov|edu|me)"
digits = r"([0-9])"
multiple_dots = r'\.{2,}'

def split_into_sentences(text: str) -> list[str]:
    """
    Split the text into sentences.

    If the text contains substrings "<prd>" or "<stop>", they would lead 
    to incorrect splitting because they are used as markers for splitting.

    :param text: text to be split into sentences
    :type text: str

    :return: list of sentences
    :rtype: list[str]
    """
    text = " " + text + "  "
    text = text.replace("\n"," ")
    text = re.sub(prefixes,"\\1<prd>",text)
    text = re.sub(websites,"<prd>\\1",text)
    text = re.sub(digits + "[.]" + digits,"\\1<prd>\\2",text)
    text = re.sub(multiple_dots, lambda match: "<prd>" * len(match.group(0)) + "<stop>", text)
    if "Ph.D" in text: text = text.replace("Ph.D.","Ph<prd>D<prd>")
    text = re.sub(r"\s" + alphabets + "[.] "," \\1<prd> ",text)
    text = re.sub(acronyms+" "+starters,"\\1<stop> \\2",text)
    text = re.sub(alphabets + "[.]" + alphabets + "[.]" + alphabets + "[.]","\\1<prd>\\2<prd>\\3<prd>",text)
    text = re.sub(alphabets + "[.]" + alphabets + "[.]","\\1<prd>\\2<prd>",text)
    text = re.sub(" "+suffixes+"[.] "+starters," \\1<stop> \\2",text)
    text = re.sub(" "+suffixes+"[.]"," \\1<prd>",text)
    text = re.sub(" " + alphabets + "[.]"," \\1<prd>",text)
    if "”" in text: text = text.replace(".”","”.")
    if "\"" in text: text = text.replace(".\"","\".")
    if "!" in text: text = text.replace("!\"","\"!")
    if "?" in text: text = text.replace("?\"","\"?")
    text = text.replace(".",".<stop>")
    text = text.replace("?","?<stop>")
    text = text.replace("!","!<stop>")
    text = text.replace("<prd>",".")
    sentences = text.split("<stop>")
    sentences = [s.strip() for s in sentences]
    if sentences and not sentences[-1]: sentences = sentences[:-1]
    return sentences

In [29]:
for item in tqdm(pages_and_texts):
    item["sentences"] = split_into_sentences(item["text"])

100%|██████████| 250/250 [00:00<00:00, 2885.47it/s]


In [31]:
pages_and_texts[58]

{'text': '58 Griphook held the door open for them. Harry, who had expected more marble, was surprised. They were in a narrow stone passageway lit with flaming torches. It sloped steeply downward and there were little railway tracks on the floor. Griphook whistled and a small cart came hurtling up the tracks toward them. They climbed in -- Hagrid with some difficulty -- and were off. At first they just hurtled through a maze of twisting passages. Harry tried to remember, left, right, right, left, middle fork, right, left, but it was impossible. The rattling cart seemed to know its own way, because Griphook wasn\'t steering. Harry\'s eyes stung as the cold air rushed past them, but he kept them wide open. Once, he thought he saw a burst of fire at the end of a passage and twisted around to see if it was a dragon, but too late - - they plunged even deeper, passing an underground lake where huge stalactites and stalagmites grew from the ceiling and floor. I never know," Harry called to Hag