# Building a Local RAG Pipeline Using ***Harry Potter: The Complete Collection*** PDF

## Introduction

In this project, the goal is to build a local **Retrieval Augmented Generation (RAG)** pipeline. This pipeline will process the **Harry Potter** PDF, allowing users to query the content of the novel and get responses based on the text. By using RAG, the project combines information retrieval with language generation, which helps generate more accurate and relevant answers by grounding the outputs in real text passages.

RAG can be particularly helpful in reducing hallucinations by providing fact-based information, ensuring the LLM generates accurate responses. The RAG pipeline will run locally in Google Colab, using Python-based tools for text extraction, embedding generation, and language modeling.

### What is RAG?

**Retrieval Augmented Generation (RAG)** is a three-step process:

1. **Retrieval**: Relevant information is retrieved from a document or dataset based on a query.
2. **Augmentation**: The retrieved information is combined with the user’s query to form an augmented prompt.
3. **Generation**: The augmented prompt is then passed to a large language model (LLM), which generates a response based on both the query and the retrieved information.

This process ensures that the generated outputs are more factually grounded and context-aware compared to just querying the LLM directly.

### Why Use RAG?

There are several reasons why RAG is beneficial:

- **Prevent Hallucinations**: LLMs are prone to generating incorrect or irrelevant information when not provided with context. By retrieving and augmenting the prompt with factual passages, the model’s response is more likely to be grounded in reality.
- **Work with Custom Data**: By using a RAG pipeline, the model can work with specific datasets (like internal documents or in this case, a Harry Potter novel), creating more specific and relevant answers.
- **Privacy and Local Setup**: Running everything locally in Colab ensures privacy and eliminates the need to send data to external APIs, which can also reduce cost and latency.

---

## Project Overview

The goal of this project is to develop a system that allows a user to interact with the **Harry Potter** PDF in a Q&A format, where the user can ask questions and receive answers based on the content of the novel. The pipeline will be divided into two main sections: **preprocessing and embedding creation** (steps 1-3) and **search and answer generation** (steps 4-6).

The workflow can be summarized as follows:

1. **Preprocess the PDF**: Extract the text from the Harry Potter PDF and prepare it for further processing.
2. **Create Text Embeddings**: Convert the chunks of text into vector embeddings for efficient retrieval.
3. **Build a Retrieval System**: Set up a vector search system to retrieve relevant text chunks based on a user’s query.
4. **Augment the Prompt**: Combine the retrieved text with the user’s query to create an augmented input for the LLM.
5. **Generate a Response**: Use a language model to generate an answer based on the augmented prompt.
6. **Return the Response**: Provide the generated answer to the user.


## Step 1: Download the PDF (if it doesn't already exist)

The following code checks if the specified PDF exists locally. If it does not exist, the code will download the PDF from the provided URL and save it with the specified filename. In this case, we're using the **Harry Potter** file stored on Google Drive, but you can modify this for other files as needed.


In [14]:
import os
import requests

# Path to save the downloaded PDF (modify this if needed)
pdf_path = "/content/drive/MyDrive/harrypotter.pdf"

# Check if the file already exists
if not os.path.exists(pdf_path):
    print("[INFO] File doesn't exist, downloading...")

    # Enter the URL of the PDF to download (change if using another document)
    url = "https://kvongcmehsanalibrary.wordpress.com/wp-content/uploads/2021/07/harrypotter.pdf"

    try:
        # Send a GET request to download the PDF
        response = requests.get(url)

        # If the download is successful (status code 200)
        if response.status_code == 200:
            # Save the file locally
            with open(pdf_path, "wb") as file:
                file.write(response.content)
            print(f"[INFO] The file has been downloaded and saved as {pdf_path}")
        else:
            print(f"[ERROR] Failed to download the file. Status code: {response.status_code}")

    except requests.exceptions.RequestException as e:
        print(f"[ERROR] An error occurred while downloading the file: {e}")

else:
    print(f"[INFO] File '{pdf_path}' already exists.")

[INFO] File '/content/drive/MyDrive/harrypotter.pdf' already exists.


## Step 2: Extract and Format Text from the PDF

This section uses **PyMuPDF** (also known as `fitz`) to open and read the PDF. Each page of the PDF is processed, and the text is extracted and cleaned up using a simple formatter. We also collect additional metadata like the word count, character count, and estimated token count (for use with models like GPT).

The `tqdm` library is used to display a progress bar as the PDF is processed page by page.


In [15]:
!pip install PyMuPDF tqdm

Collecting PyMuPDF
  Downloading PyMuPDF-1.24.10-cp310-none-manylinux2014_x86_64.whl.metadata (3.4 kB)
Collecting PyMuPDFb==1.24.10 (from PyMuPDF)
  Downloading PyMuPDFb-1.24.10-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.4 kB)
Downloading PyMuPDF-1.24.10-cp310-none-manylinux2014_x86_64.whl (3.5 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m3.5/3.5 MB[0m [31m44.6 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading PyMuPDFb-1.24.10-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (15.9 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m15.9/15.9 MB[0m [31m91.6 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: PyMuPDFb, PyMuPDF
Successfully installed PyMuPDF-1.24.10 PyMuPDFb-1.24.10


In [16]:
import fitz  # PyMuPDF, for working with PDFs
from tqdm.auto import tqdm  # For progress bars

# Function to clean and format the extracted text
def text_formatter(text: str) -> str:
    """Performs basic formatting on the extracted text."""
    # Replace newlines with spaces and strip leading/trailing spaces
    cleaned_text = text.replace("\n", " ").strip()

    # Further text cleaning functions can be added here if needed
    return cleaned_text

# Function to open the PDF and extract the text from each page
def open_and_read_pdf(pdf_path: str) -> list:
    """
    Opens a PDF, extracts text from each page, and returns
    a list of dictionaries containing page information and text.
    """
    # Open the PDF document using fitz
    doc = fitz.open(pdf_path)

    # List to store text and metadata for each page
    pages_and_texts = []

    # Loop through each page in the PDF with a progress bar
    for page_number, page in tqdm(enumerate(doc), total=len(doc)):
        # Extract text from the current page
        text = page.get_text()

        # Clean and format the extracted text
        text = text_formatter(text=text)

        # Append a dictionary containing the page number and text metadata
        pages_and_texts.append({
            "page_number": page_number + 1,  # Adjust to start
            "page_char_count": len(text),  # Character count
            "page_word_count": len(text.split(" ")),  # Word count
            "page_sentence_count_raw": len(text.split(". ")),  # Sentence count
            "page_token_count": len(text) // 4,  # Approximate token count (1 token ≈ 4 chars)
            "text": text  # The extracted and formatted text
        })

    # Return the list of page information and texts
    return pages_and_texts

# Extract text and metadata from the PDF
pages_and_texts = open_and_read_pdf(pdf_path=pdf_path)

  0%|          | 0/3623 [00:00<?, ?it/s]

In [17]:
# Preview the first page
# pages_and_texts[0]
# Preview the chapter one (page 12)
pages_and_texts[11]

{'page_number': 12,
 'page_char_count': 1500,
 'page_word_count': 271,
 'page_sentence_count_raw': 19,
 'page_token_count': 375,
 'text': 'M   CHAPTER  ONE THE BOY WHO LIVED r. and Mrs. Dursley, of number four, Privet Drive, were proud to say that they were perfectly normal, thank you very much. They were the last people you’d expect to be involved in anything strange or mysterious, because they just didn’t hold with such nonsense. Mr. Dursley was the director of a firm called Grunnings, which made drills. He was a big, beefy man with hardly any neck, although he did have a very large mustache. Mrs. Dursley was thin and blonde and had nearly twice the usual amount of neck, which came in very useful as she spent so much of her time craning over garden fences, spying on the neighbors. The Dursleys had a small son called Dudley and in their opinion there was no finer boy anywhere. The Dursleys had everything they wanted, but they also had a secret, and their greatest fear was that somebod

## Step 3: Analyze and Process Token Count and Sentences

In this step, we create a DataFrame from the extracted text data and analyze the token counts. Token counts are critical because both embedding models and language models (LLMs) have limitations on the number of tokens they can process at a time.

We also further process the text by splitting it into sentences using **spaCy**, an NLP library. This allows for better control of text chunks, ensuring that we don't exceed token limits when embedding the data or generating text.


In [18]:
import pandas as pd
from spacy.lang.en import English
from tqdm.auto import tqdm
import random

# Step 1: Create a DataFrame from the extracted pages and text
df = pd.DataFrame(pages_and_texts)

# Display a few rows of the DataFrame
df

Unnamed: 0,page_number,page_char_count,page_word_count,page_sentence_count_raw,page_token_count,text
0,1,0,1,1,0,
1,2,0,1,1,0,
2,3,0,1,1,0,
3,4,0,1,1,0,
4,5,0,1,1,0,
...,...,...,...,...,...,...
3618,3619,1771,306,17,442,"Rose, who was already wearing her brand-new Ho..."
3619,3620,1568,276,13,392,"“You’re right, sorry,” said Ron, but unable to..."
3620,3621,1652,292,20,413,“But you know Neville —” James rolled his eyes...
3621,3622,1558,276,24,389,“But just say —” “— then Slytherin House will ...


In [19]:
# Step 2: Get summary statistics about the pages, including token counts, word counts, etc.
df.describe().round(2)

Unnamed: 0,page_number,page_char_count,page_word_count,page_sentence_count_raw,page_token_count
count,3623.0,3623.0,3623.0,3623.0,3623.0
mean,1812.0,1731.71,310.02,22.82,432.56
std,1046.01,393.21,71.15,10.75,98.3
min,1.0,0.0,1.0,1.0,0.0
25%,906.5,1635.5,290.5,16.0,408.5
50%,1812.0,1814.0,324.0,21.0,453.0
75%,2717.5,1965.0,351.0,27.0,491.0
max,3623.0,2432.0,463.0,90.0,608.0


### Why Token Count is Important

- **Embedding models**: Most embedding models are designed to handle a fixed number of tokens. For example, some models like `sentence-transformers/all-mpnet-base-v2` are trained to embed sequences with a token limit (e.g., 384 tokens).
  
- **LLMs**: Large language models also have limits on how many tokens they can process within a single context window. Sending too many tokens could lead to inefficiency, higher costs, and longer processing times.

Thus, breaking down text into meaningful chunks (tokens and sentences) ensures that the data we send to LLMs is manageable, cost-effective, and relevant.


In [20]:
# Step 3: Initialize spaCy's NLP pipeline for sentence splitting
nlp = English()

# Add a sentencizer component to split text into sentences
nlp.add_pipe("sentencizer")

# Step 4: Process all pages and split the text into sentences
for item in tqdm(pages_and_texts):
    # Split the page text into sentences
    item["sentences"] = list(nlp(item["text"]).sents)

    # Convert spaCy sentence objects to strings
    item["sentences"] = [str(sentence) for sentence in item["sentences"]]

    # Count the number of sentences on the page
    item["page_sentence_count_spacy"] = len(item["sentences"])

# Step 5: Sample a random page and display the sentence split result
random.sample(pages_and_texts, k=1)

  0%|          | 0/3623 [00:00<?, ?it/s]

[{'page_number': 483,
  'page_char_count': 1992,
  'page_word_count': 371,
  'page_sentence_count_raw': 22,
  'page_token_count': 498,
  'text': 'becoming untidier, as though he was hurrying to tell all he knew. “Of course I know about the Chamber of Secrets. In my day, they told us it was a legend, that it did not exist. But this was a lie. In my fifth year, the Chamber was opened and the monster attacked several students, finally killing one. I caught the person who’d opened the Chamber and he was expelled. But the headmaster, Professor Dippet, ashamed that such a thing had happened at Hogwarts, forbade me to tell the truth. A story was given out that the girl had died in a freak accident. They gave me a nice, shiny, engraved trophy for my trouble and warned me to keep my mouth shut. But I knew it could happen again. The monster lived on, and the one who had the power to release it was not imprisoned.” Harry nearly upset his ink bottle in his hurry to write back. “It’s happening agai

In [21]:
df = pd.DataFrame(pages_and_texts)
df.describe().round(2)

Unnamed: 0,page_number,page_char_count,page_word_count,page_sentence_count_raw,page_token_count,page_sentence_count_spacy
count,3623.0,3623.0,3623.0,3623.0,3623.0,3623.0
mean,1812.0,1731.71,310.02,22.82,432.56,24.12
std,1046.01,393.21,71.15,10.75,98.3,7.58
min,1.0,0.0,1.0,1.0,0.0,0.0
25%,906.5,1635.5,290.5,16.0,408.5,20.0
50%,1812.0,1814.0,324.0,21.0,453.0,25.0
75%,2717.5,1965.0,351.0,27.0,491.0,29.0
max,3623.0,2432.0,463.0,90.0,608.0,50.0


## Step 4: Chunking Sentences Together

In this step, we will chunk the sentences that were split in the previous step into smaller groups for easier processing. Chunking is the process of splitting larger pieces of text into smaller, more manageable groups.

We'll use a simple method of grouping sentences into chunks of 10 (you can customize this number). This helps in ensuring the text chunks are small enough to fit within the token limit of an embedding model and also allows more focused contexts when passing the data to an LLM.

### Why is Chunking Important?
1. **Ease of Filtering**: Smaller chunks of text are easier to inspect, filter, and work with.
2. **Embedding Model Limits**: Embedding models often have limits on the number of tokens they can process. For example, a model might accept up to 384 tokens in a single input.
3. **Specific Contexts**: Smaller, well-defined chunks help provide more specific contexts for the LLM, leading to more accurate and focused responses.


In [22]:
# Helper function to chunk a list into smaller groups
def split_list(input_list, slice_size):
    """Splits a list into chunks of a given size."""
    return [input_list[i:i + slice_size] for i in range(0, len(input_list), slice_size)]

# Specify the number of sentences per chunk
num_sentence_chunk_size = 10  # Customize this value based on your needs

# Loop through each page's text and split its sentences into chunks
for item in tqdm(pages_and_texts):
    # Chunk the sentences into groups of `num_sentence_chunk_size`
    item["sentence_chunks"] = split_list(input_list=item["sentences"], slice_size=num_sentence_chunk_size)

    # Count the number of chunks created
    item["num_chunks"] = len(item["sentence_chunks"])

# Sample one random page to inspect the result of chunking
random.sample(pages_and_texts, k=1)

  0%|          | 0/3623 [00:00<?, ?it/s]

[{'page_number': 2321,
  'page_char_count': 1641,
  'page_word_count': 300,
  'page_sentence_count_raw': 24,
  'page_token_count': 410,
  'text': 'eyes. “You’ve got him,” said Harry, ignoring the rising panic in his chest, the dread he had been fighting since they had first entered the ninety-seventh row. “He’s here. I know he is.” “The little baby woke up fwightened and fort what it dweamed was twoo,” said the woman in a horrible, mock-baby voice. Harry felt Ron stir beside him. “Don’t do anything,” he muttered. “Not yet —” The woman who had mimicked him let out a raucous scream of laughter. “You hear him? You hear him? Giving instructions to the other children as though he thinks of fighting us!” “Oh, you don’t know Potter as I do, Bellatrix,” said Malfoy softly. “He has a great weakness for heroics; the Dark Lord understands this about him. Now give me the prophecy, Potter.” “I know Sirius is here,” said Harry, though panic was causing his chest to constrict and he felt as though he

In [23]:
# Convert the updated pages and texts with sentence chunks into a DataFrame
df = pd.DataFrame(pages_and_texts)

# Display the statistical summary of the new data, including the chunk counts
df.describe().round(2)

Unnamed: 0,page_number,page_char_count,page_word_count,page_sentence_count_raw,page_token_count,page_sentence_count_spacy,num_chunks
count,3623.0,3623.0,3623.0,3623.0,3623.0,3623.0,3623.0
mean,1812.0,1731.71,310.02,22.82,432.56,24.12,2.87
std,1046.01,393.21,71.15,10.75,98.3,7.58,0.79
min,1.0,0.0,1.0,1.0,0.0,0.0,0.0
25%,906.5,1635.5,290.5,16.0,408.5,20.0,2.0
50%,1812.0,1814.0,324.0,21.0,453.0,25.0,3.0
75%,2717.5,1965.0,351.0,27.0,491.0,29.0,3.0
max,3623.0,2432.0,463.0,90.0,608.0,50.0,5.0


## Step 5: Splitting Sentence Chunks into Separate Items

Now that we’ve chunked the sentences from the PDF into groups (e.g., 10 sentences per chunk), we want to further process each chunk into its own item. This will allow us to embed each chunk of sentences into a numerical representation for use in downstream tasks, such as vector retrieval or LLM input.

### Why Split Each Chunk into Separate Items?
1. **Granularity**: This allows us to work with more specific parts of the document, making it easier to retrieve and use relevant information.
2. **Embedding**: By splitting chunks into separate items, we can efficiently embed each chunk into a vector space for later retrieval.
3. **Chunk Statistics**: For each chunk, we gather important statistics like character count, word count, and token count, helping us monitor the size of the chunks and ensure they fit within model constraints.


In [24]:
import re
from tqdm.auto import tqdm
import pandas as pd

# Step 1: Initialize a list to store pages and chunks
pages_and_chunks = []

# Step 2: Loop through each page and its sentence chunks
for item in tqdm(pages_and_texts):
    for sentence_chunk in item["sentence_chunks"]:
        chunk_dict = {}  # Create a dictionary to hold chunk information

        # Store the page number
        chunk_dict["page_number"] = item["page_number"]

        # Join the list of sentences in the chunk into a paragraph-like structure
        joined_sentence_chunk = "".join(sentence_chunk).replace("  ", " ").strip()

        # Ensure sentences are properly spaced after periods (".A" => ". A")
        joined_sentence_chunk = re.sub(r'\.([A-Z])', r'. \1', joined_sentence_chunk)

        # Store the joined sentence chunk
        chunk_dict["sentence_chunk"] = joined_sentence_chunk

        # Get some statistics on the chunk
        chunk_dict["chunk_char_count"] = len(joined_sentence_chunk)  # Character count
        chunk_dict["chunk_word_count"] = len(joined_sentence_chunk.split(" "))  # Word count
        chunk_dict["chunk_token_count"] = len(joined_sentence_chunk) / 4  # Estimate token count (1 token = ~4 characters)

        # Add the chunk data to our pages_and_chunks list
        pages_and_chunks.append(chunk_dict)

# Step 3: Check the total number of chunks created
len(pages_and_chunks)


  0%|          | 0/3623 [00:00<?, ?it/s]

10396

In [25]:
# Step 4: Sample one random chunk and inspect its structure
random.sample(pages_and_chunks, k=1)

[{'page_number': 2732,
  'sentence_chunk': 'A stone taken from the stomach of a goat, which will protect from most poisons.”It was not an answer to the Golpalott problem, and had Snape still been their teacher, Harry would not have dared do it, but this was a moment for desperate measures. He hastened toward the store cupboard and rummaged within it, pushing aside unicorn horns and tangles of dried herbs until he found, at the very back, a small cardboard box on which had been scribbled the word BEZOARS. He opened the box just as Slughorn called, “Two minutes left, everyone!”Inside were half a dozen shriveled brown objects, looking more like dried-up kidneys than real stones. Harry seized one, put the box back in the cupboard, and hurried back to his cauldron. “Time’s . . . UP!”called Slughorn genially. “Well, let’s see how you’ve done!',
  'chunk_char_count': 804,
  'chunk_word_count': 139,
  'chunk_token_count': 201.0}]

In [26]:
# Step 5: Create a DataFrame from the processed chunks
df = pd.DataFrame(pages_and_chunks)

# Display statistical summary of the chunks (character count, word count, token count, etc.)
df.describe().round(2)

Unnamed: 0,page_number,chunk_char_count,chunk_word_count,chunk_token_count
count,10396.0,10396.0,10396.0,10396.0
mean,1794.17,601.16,106.35,150.29
std,1045.34,306.58,54.11,76.65
min,6.0,1.0,1.0,0.25
25%,888.0,400.0,71.0,100.0
50%,1760.5,579.0,103.0,144.75
75%,2700.0,779.0,138.0,194.75
max,3623.0,2336.0,399.0,584.0


In [27]:
# Display the first few rows of the DataFrame to inspect the chunked data
df.head()

Unnamed: 0,page_number,sentence_chunk,chunk_char_count,chunk_word_count,chunk_token_count
0,6,CONTENTS Harry Potter and the Sorcerer’s Stone...,281,48,70.25
1,9,"FOR JESSICA, WHO LOVES STORIES, FOR ANNE, WHO ...",99,19,24.75
2,10,CONTENTS ONE The Boy Who Lived TWO The Vanishi...,292,50,73.0
3,11,The Mirror of Erised THIRTEEN Nicolas Flamel F...,176,26,44.0
4,12,M CHAPTER ONE THE BOY WHO LIVED r. and Mrs. D...,1289,230,322.25


## Step 6: Filter Chunks of Text for Minimum Token Length

After chunking the text into manageable pieces, it’s important to filter out very short chunks that may not contain much useful information. By setting a minimum token length, we can ensure that only chunks with sufficient content are retained for further processing.

### Why Filter Short Chunks?
1. **Relevance**: Short chunks might lack sufficient context or information, making them less useful for embedding or generating responses.
2. **Efficiency**: Removing short chunks reduces the volume of data we need to process, making downstream operations more efficient.

In this step, we will:
1. **Inspect**: Display random chunks with token counts under the specified minimum length.
2. **Filter**: Remove chunks with token counts below the threshold and keep only those with more substantial content.


In [28]:
# Step 1: Define the minimum token length for filtering
min_token_length = 30

# Step 2: Display random chunks with under the minimum token length
print("Random chunks with under 30 tokens in length:")
short_chunks = df[df["chunk_token_count"] <= min_token_length]
for _, row in short_chunks.sample(5).iterrows():
    print(f'Chunk token count: {row["chunk_token_count"]} | Text: {row["sentence_chunk"]}')

Random chunks with under 30 tokens in length:
Chunk token count: 25.0 | Text: And batteries. Got a very large collection of batteries. My wife thinks I’m mad, but there you are.”
Chunk token count: 13.5 | Text: said Hagrid, looking anxious. “See — it’s like I say —
Chunk token count: 29.25 | Text: But since when has Malfoy been one of the world’s great thinkers?”asked Harry. Neither Ron nor Hermione answered him.
Chunk token count: 8.25 | Text: Let’s just see what’s inside it!”
Chunk token count: 20.5 | Text: squeaked tiny Professor Flitwick, whose feet were dangling a foot from the ground.


In [29]:
# Step 3: Filter the DataFrame to include only chunks with token counts greater than the minimum length
pages_and_chunks_over_min_token_len = df[df["chunk_token_count"] > min_token_length].to_dict(orient="records")

# Step 4: Display a sample of the filtered chunks
import random
print("Sample of filtered chunks:")
random.sample(pages_and_chunks_over_min_token_len, k=1)

Sample of filtered chunks:


[{'page_number': 3423,
  'sentence_chunk': 'Okay, Ron, come here so I can do you. . . .” “Right, but remember, I don’t like the beard too long —” “Oh, for heaven’s sake, this isn’t about looking handsome —” “It’s not that, it gets in the way!But I liked my nose a bit shorter, try and do it the way you did last time.”Hermione sighed and set to work, muttering under her breath as she transformed various aspects of Ron’s appearance. He was to be given a completely fake identity, and they were trusting to the malevolent aura cast by Bellatrix to protect him. Meanwhile Harry and Griphook were to be concealed under the Invisibility Cloak. “There,” said Hermione, “how does he look, Harry?”It was just possible to discern Ron under his disguise, but only, Harry thought, because he knew him so well. Ron’s hair was now long and wavy; he had a thick brown beard and mustache, no freckles, a short, broad nose, and',
  'chunk_char_count': 869,
  'chunk_word_count': 159,
  'chunk_token_count': 217.25}

In [30]:
# Step 5: Check the number of chunks remaining after filtering
len(pages_and_chunks_over_min_token_len)

9836

## Step 7: Embedding Our Text Chunks

Embeddings are crucial in natural language processing as they convert text into numerical representations that machines can understand. These embeddings capture the semantic meaning of the text and are essential for tasks like retrieval and text similarity.

### What Are Embeddings?
- **Definition**: Embeddings are vector representations of text that encode semantic meaning. Unlike raw text, which machines cannot process directly, embeddings turn text into numerical vectors that capture the context and relationships between words.
- **Purpose**: The goal is to transform text chunks into a format that can be used by machine learning models, enabling effective text retrieval and comparison.

### The Model: `all-mpnet-base-v2`
- **Description**: The `all-mpnet-base-v2` model from Sentence Transformers is designed to produce high-quality embeddings for sentences and text chunks.
- **Library**: `sentence-transformers` provides an easy-to-use interface for generating these embeddings.

### How to Use the Model

1. **Initialize the Model**: Load the model and specify the device (CPU or GPU) for computations.
2. **Encode Text**: Convert sentences or text chunks into embeddings.
3. **Batch Processing**: For large datasets, process text in batches to improve efficiency.



In [None]:
!pip install sentence-transformers

In [33]:
from sentence_transformers import SentenceTransformer
from tqdm.auto import tqdm
import time

# Initialize the GPU model
embedding_model_gpu = SentenceTransformer(model_name_or_path="all-mpnet-base-v2", device="cuda")

# Prepare a list of text chunks for embedding
text_chunks = [item["sentence_chunk"] for item in pages_and_chunks_over_min_token_len]

# Function to batch process embeddings with tqdm
def batch_process_embeddings_gpu(model, text_chunks, batch_size):
    embeddings = []
    for i in tqdm(range(0, len(text_chunks), batch_size), desc="Processing Batches on GPU"):
        batch = text_chunks[i:i + batch_size]
        batch_embeddings = model.encode(batch, convert_to_tensor=True)
        embeddings.extend(batch_embeddings)
    return embeddings

# Time GPU embedding with tqdm progress
start_time = time.time()

# Process embeddings in batches using GPU
text_chunk_embeddings_gpu = batch_process_embeddings_gpu(embedding_model_gpu, text_chunks, batch_size=128)

gpu_time = time.time() - start_time
print(f"GPU Embedding Time: {gpu_time:.2f} seconds")

Processing Batches on GPU:   0%|          | 0/77 [00:00<?, ?it/s]

GPU Embedding Time: 126.83 seconds


In [34]:
# Check the embeddings
print("Sample GPU Embeddings:", text_chunk_embeddings_gpu[:1])

Sample GPU Embeddings: [tensor([ 4.8815e-02,  3.1361e-02, -1.0537e-02,  5.4614e-02, -3.0401e-02,
         6.9892e-02,  2.9673e-02, -3.7371e-03,  6.3727e-02, -1.1713e-02,
         1.8052e-02,  6.1999e-02,  5.3205e-02, -6.7199e-02,  8.7563e-02,
        -8.5908e-02,  2.5470e-02, -2.4550e-02, -8.8356e-02,  5.2592e-03,
        -3.6130e-02,  7.0244e-02, -2.2533e-02,  1.3000e-03, -5.0225e-03,
        -5.2841e-02, -2.7148e-02,  3.8182e-02, -3.0226e-03, -6.1530e-02,
         1.4622e-02, -6.0864e-02,  4.5739e-03,  5.1725e-02,  2.0775e-06,
        -2.4414e-02,  4.5324e-02, -1.4382e-03, -3.2586e-02, -2.8219e-02,
        -1.1079e-03,  4.9997e-02,  3.1503e-02, -6.7808e-02,  2.9233e-02,
        -2.5799e-02,  3.2630e-02,  5.9321e-02, -5.7071e-02,  1.9496e-02,
         1.6945e-02, -1.8168e-02, -4.3047e-02, -2.0095e-02,  9.7193e-02,
        -3.5849e-02, -1.8391e-02, -3.9925e-02,  2.7684e-02,  5.7206e-02,
         2.1256e-03,  4.3017e-03, -3.0411e-02, -3.7727e-02,  1.1235e-02,
         1.0681e-03, -2.785

In [35]:
# Check the shape of the embeding
text_chunk_embeddings_gpu[0].shape

torch.Size([768])

## Step 8: Save the Embeddings

After generating embeddings for our text chunks, the next step is to save these embeddings to a file. This allows for easy retrieval and use in future processes without needing to recompute embeddings.

### Why Save Embeddings?
- **Persistence**: Save embeddings so that you don’t have to recalculate them each time you run your code.
- **Efficiency**: Loading precomputed embeddings from a file is faster than recomputing them.
- **Reusability**: Allows you to reuse embeddings across different sessions or projects.



In [36]:
import pandas as pd
import numpy as np

# Convert tensor embeddings to NumPy arrays and add to the list
for i in range(len(pages_and_chunks_over_min_token_len)):
    pages_and_chunks_over_min_token_len[i]["embedding"] = text_chunk_embeddings_gpu[i].cpu().numpy()

# Convert the updated list to a DataFrame
df = pd.DataFrame(pages_and_chunks_over_min_token_len)

# Specify the path for saving the CSV file
embeddings_df_save_path = "/content/drive/MyDrive/text_chunks_and_embeddings_df.csv"

# Save the DataFrame to a CSV file
df.to_csv(embeddings_df_save_path, index=False)

print(f"Embeddings have been saved to {embeddings_df_save_path}")

Embeddings have been saved to /content/drive/MyDrive/text_chunks_and_embeddings_df.csv


In [37]:
# See the df
df

Unnamed: 0,page_number,sentence_chunk,chunk_char_count,chunk_word_count,chunk_token_count,embedding
0,6,CONTENTS Harry Potter and the Sorcerer’s Stone...,281,48,70.25,"[0.048815023, 0.031361405, -0.010537495, 0.054..."
1,10,CONTENTS ONE The Boy Who Lived TWO The Vanishi...,292,50,73.00,"[0.04385911, 0.029304622, -0.015308299, 0.0698..."
2,11,The Mirror of Erised THIRTEEN Nicolas Flamel F...,176,26,44.00,"[0.060104646, 0.05541316, 0.0027609102, 0.0397..."
3,12,M CHAPTER ONE THE BOY WHO LIVED r. and Mrs. D...,1289,230,322.25,"[0.012113275, 0.018570924, -0.024781471, 0.041..."
4,12,The Dursleys knew that the Potters had a small...,208,39,52.00,"[-0.008543112, 0.009673865, -0.028602293, -0.0..."
...,...,...,...,...,...,...
9831,3621,What if I’m in Slytherin?”The whisper was for ...,605,109,151.25,"[0.019555904, -0.03222299, -0.022861393, 0.060..."
9832,3622,“But just say —” “— then Slytherin House will ...,671,117,167.75,"[-0.0060430956, -0.017671531, -0.0197197, 0.04..."
9833,3622,"A great number of faces, both on the train and...",585,104,146.25,"[-0.0049901884, -0.028134331, -0.023570377, 0...."
9834,3622,The train rounded a corner. Harry’s hand was s...,296,51,74.00,"[-0.004144725, 0.024385262, -0.019662349, -0.0..."


## Step 9: RAG - Search and Answer

The goal of **RAG** (Retrieval-Augmented Generation) is to retrieve relevant information from a large corpus of data and then use that information to generate responses. By augmenting a language model with relevant context, the output generated by the model becomes more specific and accurate.

In this case, we’ll perform **semantic search** over the textbook embeddings we’ve created to find relevant passages. This approach uses **embeddings** rather than keyword-based search, making it possible to retrieve results based on the meaning of the query, rather than the exact words.

### Similarity Search
- **Embeddings** can represent many types of data, including images, sounds, and text.
- **Similarity search** (or vector search) is comparing embeddings to find how similar two items are.
- **Semantic search** allows us to retrieve relevant passages based on a query, even if the query doesn't contain the exact keywords present in the passages.

For example, if we search for "macronutrient functions," the search will return relevant passages even if the text doesn't explicitly use the words "macronutrient" and "functions."

### Code Breakdown

1. **Query Embedding**: We first convert the user’s query into an embedding using the same model we used for the textbook.
2. **Similarity Calculation**: We compute the similarity between the query embedding and all the passage embeddings using dot product or cosine similarity.
3. **Ranking and Retrieval**: We retrieve the top-k most relevant passages based on their similarity scores and display the results.



In [2]:
import random
import torch
import numpy as np
import pandas as pd
from sentence_transformers import util, SentenceTransformer
import textwrap
import time
from timeit import default_timer as timer

In [3]:
# Step 1: Load data and prepare embeddings
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the texts and embeddings from the saved CSV file
text_chunks_and_embedding_df = pd.read_csv("/content/drive/MyDrive/text_chunks_and_embeddings_df.csv")

# Convert the 'embedding' column back to numpy arrays (CSV saves it as string)
text_chunks_and_embedding_df["embedding"] = text_chunks_and_embedding_df["embedding"].apply(
    lambda x: np.fromstring(x.strip("[]"), sep=" ")
)

text_chunks_and_embedding_df.head()

Unnamed: 0,page_number,sentence_chunk,chunk_char_count,chunk_word_count,chunk_token_count,embedding
0,6,CONTENTS Harry Potter and the Sorcerer’s Stone...,281,48,70.25,"[0.0488150232, 0.0313614048, -0.0105374949, 0...."
1,10,CONTENTS ONE The Boy Who Lived TWO The Vanishi...,292,50,73.0,"[0.0438591093, 0.0293046217, -0.0153082991, 0...."
2,11,The Mirror of Erised THIRTEEN Nicolas Flamel F...,176,26,44.0,"[0.0601046458, 0.0554131605, 0.0027609102, 0.0..."
3,12,M CHAPTER ONE THE BOY WHO LIVED r. and Mrs. D...,1289,230,322.25,"[0.012113275, 0.0185709242, -0.0247814711, 0.0..."
4,12,The Dursleys knew that the Potters had a small...,208,39,52.0,"[-0.00854311232, 0.00967386458, -0.0286022928,..."


In [4]:
# Convert the embeddings into a torch tensor for faster computation
embeddings = torch.tensor(np.stack(text_chunks_and_embedding_df["embedding"].tolist()), dtype=torch.float32).to(device)

# Convert the DataFrame into a list of dictionaries for easier access
pages_and_chunks = text_chunks_and_embedding_df.to_dict(orient="records")

# Step 2: Load the embedding model
embedding_model = SentenceTransformer(model_name_or_path="all-mpnet-base-v2", device=device)
print("Embedding model ready!")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Embedding model ready!


In [5]:
# Step 3: Define the query and embed it
query = "What did Harry Potter see in the Mirror of Erised?"
print(f"Query: {query}")

# Embed the query using the same model that was used for the passages
query_embedding = embedding_model.encode(query, convert_to_tensor=True).to(device)

# Step 4: Compute similarity scores between query embedding and text embeddings
start_time = timer()
dot_scores = util.dot_score(query_embedding, embeddings)[0]  # Dot product similarity
end_time = timer()

print(f"[INFO] Time taken to compute similarity for {len(embeddings)} embeddings: {end_time-start_time:.5f} seconds.")

Query: What did Harry Potter see in the Mirror of Erised?
[INFO] Time taken to compute similarity for 9836 embeddings: 0.02466 seconds.


In [6]:
# Step 5: Get top-k results (e.g., top 5)
top_results = torch.topk(dot_scores, k=5)

# Step 6: Display the top results with pretty formatting
def print_wrapped(text, wrap_length=80):
    """Helper function to wrap text for cleaner output display."""
    wrapped_text = textwrap.fill(text, wrap_length)
    print(wrapped_text)

# Print the top results
print(f"Query: '{query}'\n")
print("Top Results: \n")

# Loop through the top results and display them
for score, idx in zip(top_results.values, top_results.indices):
    print(f"Score: {score:.4f}")
    print("Text:")
    print_wrapped(pages_and_chunks[idx]["sentence_chunk"])
    print(f"Page number: {pages_and_chunks[idx]['page_number']}\n")


Query: 'What did Harry Potter see in the Mirror of Erised?'

Top Results: 

Score: 0.7464
Text:
“I — I didn’t see you, sir.” “Strange how nearsighted being invisible can make
you,” said Dumbledore, and Harry was relieved to see that he was smiling. “So,”
said Dumbledore, slipping off the desk to sit on the floor with Harry, “you,
like hundreds before you, have discovered the delights of the Mirror of Erised.”
“I didn’t know it was called that, sir.” “But I expect you’ve realized by now
what it does?” “It — well — it shows me my family —” “And it showed your friend
Ron himself as Head Boy.” “How did you know — ?” “I don’t need a cloak to become
invisible,” said Dumbledore gently. “Now, can you think what the Mirror of
Erised shows us all?”Harry shook his head. “
Page number: 192

Score: 0.7338
Text:
The dark shapes of desks and chairs were piled against the walls, and there was
an upturned wastepaper basket — but propped against the wall facing him was
something that didn’t look as if i

## Step 10: Working with Local LLMs

### Overview

This step involves evaluating GPU memory to determine which local Large Language Model (LLM) can be efficiently run on your hardware. Specifically, we'll use the Gemma model series from Google, and we will configure the model based on the available GPU memory.

### Code Explanation

1. **Get Available GPU Memory**:
   - We check the GPU memory to decide which model can be loaded onto the GPU.

2. **Select Model Based on GPU Memory**:
   - Depending on the GPU memory, we choose between different versions of the Gemma model, adjusting precision settings as needed.

3. **Load and Configure Model**:
   - Authenticate with Hugging Face, load the tokenizer and model, and configure them for efficient use with GPU acceleration.

4. **Generate Text**:
   - Use the model to generate text based on an input query, and analyze the model's size and parameters.


In [7]:
# Get GPU available memory
import torch
gpu_memory_bytes = torch.cuda.get_device_properties(0).total_memory
gpu_memory_gb = round(gpu_memory_bytes / (2**30))
print(f"Available GPU memory: {gpu_memory_gb} GB")

Available GPU memory: 15 GB


In [8]:
# Note: the following is Gemma focused, however, there are more and more LLMs of the 2B and 7B size appearing for local use.
if gpu_memory_gb < 5.1:
    print(f"Your available GPU memory is {gpu_memory_gb}GB, you may not have enough memory to run a Gemma LLM locally without quantization.")
elif gpu_memory_gb < 8.1:
    print(f"GPU memory: {gpu_memory_gb} | Recommended model: Gemma 2B in 4-bit precision.")
    use_quantization_config = True
    model_id = "google/gemma-2b-it"
elif gpu_memory_gb < 19.0:
    print(f"GPU memory: {gpu_memory_gb} | Recommended model: Gemma 2B in float16 or Gemma 7B in 4-bit precision.")
    use_quantization_config = False
    model_id = "google/gemma-2b-it"
elif gpu_memory_gb > 19.0:
    print(f"GPU memory: {gpu_memory_gb} | Recommend model: Gemma 7B in 4-bit or float16 precision.")
    use_quantization_config = False
    model_id = "google/gemma-7b-it"

print(f"use_quantization_config set to: {use_quantization_config}")
print(f"model_id set to: {model_id}")

GPU memory: 15 | Recommended model: Gemma 2B in float16 or Gemma 7B in 4-bit precision.
use_quantization_config set to: False
model_id set to: google/gemma-2b-it


In [9]:
# Step 1: Import necessary libraries
from google.colab import userdata
from huggingface_hub import login
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Step 2: Hugging Face authentication
# Fetch the token from Colab user data
token = userdata.get('token')
print(f"Using Hugging Face token: {token}")

# Log in to Hugging Face using the token
login(token=token)

# Step 3: Load the tokenizer and model (Gemma-2.2B) with CUDA support
# This will load the model to the GPU (if available) for faster generation
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b-it")

# Load the model with automatic device mapping and set it to use bf16 (bfloat16) precision for efficient memory usage on CUDA
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-2b-it",
    device_map="auto",  # Automatically load the model onto the GPU
    torch_dtype=torch.bfloat16,  # Use bfloat16 for efficiency
)

Using Hugging Face token: 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
The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: fineGrained).
Your token has been saved to /root/.cache/huggingface/token
Login successful


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [10]:
# Step 4: Define input query and generate text
# Input text for the model to generate output
input_text = "Hi how is your day going?, Who are you?"

# Tokenize the input text and send it to the CUDA device
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

# Generate a response from the model with a maximum of 32 new tokens
outputs = model.generate(**input_ids, max_new_tokens=32)

# Step 5: Decode the output and print the generated text
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Generated text: {generated_text}")

Generated text: Hi how is your day going?, Who are you?

Hi there! My day is going pretty well, thanks for asking. 😊 

I'm Gemma, an AI assistant created by the Gemma team.


In [11]:
# Print model structure
model

Gemma2ForCausalLM(
  (model): Gemma2Model(
    (embed_tokens): Embedding(256000, 2304, padding_idx=0)
    (layers): ModuleList(
      (0-25): 26 x Gemma2DecoderLayer(
        (self_attn): Gemma2SdpaAttention(
          (q_proj): Linear(in_features=2304, out_features=2048, bias=False)
          (k_proj): Linear(in_features=2304, out_features=1024, bias=False)
          (v_proj): Linear(in_features=2304, out_features=1024, bias=False)
          (o_proj): Linear(in_features=2048, out_features=2304, bias=False)
          (rotary_emb): Gemma2RotaryEmbedding()
        )
        (mlp): Gemma2MLP(
          (gate_proj): Linear(in_features=2304, out_features=9216, bias=False)
          (up_proj): Linear(in_features=2304, out_features=9216, bias=False)
          (down_proj): Linear(in_features=9216, out_features=2304, bias=False)
          (act_fn): PytorchGELUTanh()
        )
        (input_layernorm): Gemma2RMSNorm((2304,), eps=1e-06)
        (post_attention_layernorm): Gemma2RMSNorm((2304,), 

In [12]:
# Analyze Model Parameters
def get_model_num_params(model: torch.nn.Module):
    return sum([param.numel() for param in model.parameters()])

get_model_num_params(model)

2614341888

In [13]:
# Analyze Model Memory Usage
def get_model_mem_size(model: torch.nn.Module):
    # Get model parameters and buffer sizes
    mem_params = sum([param.nelement() * param.element_size() for param in model.parameters()])
    mem_buffers = sum([buf.nelement() * buf.element_size() for buf in model.buffers()])

    # Calculate model sizes
    model_mem_bytes = mem_params + mem_buffers
    model_mem_mb = model_mem_bytes / (1024**2)
    model_mem_gb = model_mem_bytes / (1024**3)

    return {"model_mem_bytes": model_mem_bytes,
            "model_mem_mb": round(model_mem_mb, 2),
            "model_mem_gb": round(model_mem_gb, 2)}

get_model_mem_size(model)

{'model_mem_bytes': 5228697088, 'model_mem_mb': 4986.47, 'model_mem_gb': 4.87}

In [14]:
# Define a new input query in the "messages" format (template)
messages = [
    {"role": "user", "content": "Explain the importance of deep learning in modern AI."},
]

# Apply the chat template and convert it to input tokens
input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt", return_dict=True).to("cuda")

# Generate a response from the model with a maximum of 256 new tokens
outputs = model.generate(**input_ids, max_new_tokens=256)

# Decode the output and print the generated text
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Generated text: \n {generated_text}")


Generated text: 
 user
Explain the importance of deep learning in modern AI.
* **What is deep learning?**
* **How does it work?**
* **What are its advantages and disadvantages?**
* **What are some real-world applications of deep learning?**

## Deep Learning: The Powerhouse of Modern AI

Deep learning is a subset of machine learning that uses artificial neural networks with multiple layers to learn complex patterns from data. It's like teaching a computer to think by mimicking the human brain's structure and function.

**How does it work?**

1. **Data Input:** Deep learning models are trained on massive datasets, which can be images, text, audio, or any other type of data.
2. **Feature Extraction:** The model analyzes the data and extracts relevant features, like edges in an image or words in a sentence.
3. **Hidden Layers:** These features are passed through multiple layers of interconnected nodes, each performing a specific transformation.
4. **Learning and Optimization:** The model 

## Step 11: Generating an Answer Using Contextual Information

In this step, we use the relevant context and user query to generate an answer with a language model. The process includes retrieving relevant resources, formatting a prompt, and generating the answer.

In [58]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

def retrieve_relevant_resources(query: str, embeddings: torch.Tensor, model, top_k: int = 5) -> tuple[list[int], list[float]]:
    """
    Retrieve relevant resources based on the query by computing similarity scores.

    Parameters:
    - query (str): The user's query to be answered.
    - embeddings (torch.Tensor): Embeddings of the text chunks.
    - model (SentenceTransformer): The model used to encode the query.
    - top_k (int): Number of top results to return.

    Returns:
    - Tuple containing:
        - List of indices of the top_k most relevant resources.
        - List of scores for the top_k most relevant resources.
    """
    # Embed the query
    query_embedding = model.encode(query, convert_to_tensor=True).to(device)

    # Compute similarity scores
    dot_scores = util.dot_score(query_embedding, embeddings)[0]  # Dot product similarity

    # Get top-k results
    top_results = torch.topk(dot_scores, k=top_k)

    return top_results.indices.tolist(), top_results.values.tolist()

def format_prompt(query: str, context_items: list[dict]) -> str:
    """
    Format a prompt for an LLM based on a user query and relevant context items.

    Parameters:
    - query (str): The user's query to be answered.
    - context_items (list[dict]): A list of context items containing relevant passages.

    Returns:
    - str: The formatted prompt for the LLM.
    """
    context = "\n".join([f"- {item['sentence_chunk']}" for item in context_items])

    base_prompt = """Based on the following context items, please answer the query concisely.

    - Provide a direct and clear response.
    - Do not repeat information or include unnecessary details.
    - If the answer is not known, state "I don't know" rather than guessing.

    Context:
    {context}

    User Query:
    {query}

    Answer:
    """

    return base_prompt.format(context=context, query=query)


In [59]:
# Define the query
query = "What did Harry Potter see in the Mirror of Erised?"

# Retrieve relevant resources
indices, scores = retrieve_relevant_resources(query=query, embeddings=embeddings, model=embedding_model)

# Create a list of context items from the top results
context_items = [pages_and_chunks[i] for i in indices]

# Format the prompt with the query and context items
prompt = format_prompt(query=query, context_items=context_items)

# Prepare the input for the model
input_ids = tokenizer(prompt, return_tensors="pt").to("cuda")

# Generate the response with specified settings
outputs = model.generate(
    input_ids=input_ids['input_ids'],
    attention_mask=input_ids['attention_mask'],
    max_new_tokens=256,
    temperature=0.2,
    do_sample=True
)

# Decode the output and print the generated text
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Generated Answer:\n{generated_text}")


Generated Answer:
Based on the following context items, please answer the query concisely. 

    - Provide a direct and clear response.
    - Do not repeat information or include unnecessary details.
    - If the answer is not known, state "I don't know" rather than guessing.

    Context:
    - “I — I didn’t see you, sir.” “Strange how nearsighted being invisible can make you,” said Dumbledore, and Harry was relieved to see that he was smiling. “So,” said Dumbledore, slipping off the desk to sit on the floor with Harry, “you, like hundreds before you, have discovered the delights of the Mirror of Erised.” “I didn’t know it was called that, sir.” “But I expect you’ve realized by now what it does?” “It — well — it shows me my family —” “And it showed your friend Ron himself as Head Boy.” “How did you know — ?” “I don’t need a cloak to become invisible,” said Dumbledore gently. “Now, can you think what the Mirror of Erised shows us all?”Harry shook his head. “
- The dark shapes of desks 

In [34]:
# Only print the answer
answer= generated_text.split("Answer:")[1].strip()
print(f"Generated Answer:\n{answer}")

Generated Answer:
Harry saw his family, for the first time in his life, in the Mirror of Erised. He saw his parents, smiling and waving at him, and he saw other people with green eyes, noses, and even a little old man who looked like him.


## Step 12: Creating an Interactive Query Answering System with Gradio

In this step, we integrate the query-answering functionality into an interactive web interface using Gradio. Here's a breakdown of the process:

1. **Generate the Answer Function:**
   - **Retrieve Relevant Resources:** For the given query, the system fetches relevant chunks of text based on precomputed embeddings. This helps in narrowing down the context related to the query.
   - **Format the Prompt:** The selected context chunks are combined with the query into a structured prompt. This prompt is designed to elicit a concise answer from the language model, ensuring that only the direct answer is provided.
   - **Model Input and Generation:** The prompt is then tokenized and fed into the language model. The model generates a response based on the input, with the temperature parameter controlling the creativity of the output. The response is decoded to extract the final answer.
   - **Extract the Answer:** After generating the text, the response is cleaned to retain only the relevant answer portion.

2. **Gradio Interface:**
   - **Setup the Interface:** Gradio is used to create a user-friendly web interface where users can input their queries and adjust the temperature setting using a slider.
   - **Inputs and Outputs:** The interface has a textbox for entering the query and a slider to set the temperature. The output is displayed as text.
   - **Launch the Interface:** Once set up, the interface is launched, allowing users to interact with the model through a web browser.

This setup allows users to submit queries and receive direct answers, with the flexibility to adjust the model's response creativity via the temperature slider.


In [None]:
!pip install gradio

In [60]:
def generate_answer(query: str, temperature: float) -> str:

    # Retrieve relevant resources
    indices, scores = retrieve_relevant_resources(query=query, embeddings=embeddings, model=embedding_model)

    # Create a list of context items from the top results
    context_items = [pages_and_chunks[i] for i in indices]

    # Format the prompt with the query and context items
    prompt = format_prompt(query=query, context_items=context_items)

    # Prepare the input for the model
    input_ids = tokenizer(prompt, return_tensors="pt").to("cuda")

    # Generate the response with specified settings
    outputs = model.generate(
        input_ids=input_ids['input_ids'],
        attention_mask=input_ids['attention_mask'],
        max_new_tokens=256,
        temperature=temperature,
        do_sample=True
    )

    # Decode the output and return the generated text
    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    # Keep only the answer
    generated_text = generated_text.split("Answer:")[1].strip()
    return generated_text

# Create Gradio interface
interface = gr.Interface(
    fn=generate_answer,
    inputs=[
        gr.Textbox(label="Enter your query"), # Removed gr.inputs
        gr.Slider(minimum=0, maximum=1, step=0.1, value=0.2, label="Temperature") # Changed default to value
    ],
    outputs="text",
    title="Query Answering System",
    description="Enter a query and adjust the temperature to generate a detailed answer based on contextual information."
)

# Launch the interface
interface.launch()

Setting queue=True in a Colab notebook requires sharing enabled. Setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
Running on public URL: https://0b04901379bdce944e.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)


