# Retrieval Augmented Generation

**Learning Goals**
- Call LLMs via Groq (`llama-3.3-70b-versatile`) and Gemini (free tier)

- Build a basic RAG pipeline

- Improve retrieval via chunking and metadata filtering

In [1]:
!pip install groq
!pip install -q -U google-genai
!pip install chromadb
!pip install sentence-transformers


**`Restart the Session!!!`**

## Open-source vs. Closed-source Models

Open-source models, such as Llama 3, are released with publicly available weights and can be freely downloaded and run locally. Closed-source models, such as Google’s Gemini or OpenAI’s GPT series, are proprietary and only accessible through the provider’s API. The choice between them usually balances flexibility and transparency (open-source) against convenience, performance, and managed infrastructure (closed-source).


## Ways to Use Open-source Models

Open-source models can be run locally using tools like Ollama, which gives you full control and privacy but typically needs a GPU and sufficient hardware resources. Alternatively, cloud platforms like Groq host open-source models behind an API, removing the need for a local GPU but requiring you to send data over the internet. The local setup maximizes privacy, while API-based access is more convenient but less private for sensitive data.

# Part 1 - LLM API quickstart (Groq & Gemini)

## Groq

- Hosted open-source models via API. GroqCloud serves models like Llama-3-70B over a simple API; you don't need a local GPU - just an API key.

- Create an account: Sign up on Groq and click **Start Building**. You can register with your Utas email (or any email address you prefer): https://groq.com

- Rate limits apply. On a free account, the `llama-3.3-70b-versatile` model used in this tutorial is limited to 30 requests per minute. See details here: https://console.groq.com/docs/rate-limits


## Gemini
- Sign up: Use Google AI Studio (https://aistudio.google.com/welcome) to create a free account; if you already have a Google account, you can use it directly.

- Rate limits apply. On the free tier, `Gemini 2.5 Pro` is limited to 5 requests per minute and `Gemini 2.5 Flash` to 10 requests per minute.


For both Groq and Gemini, the free tiers are enough for our tutorials and assignment. No paid upgrade is required for this unit.


## Setup (For both Groq and Gemini)
1. Go to Groq, click "API keys" in the top bar, and create a new API key.

2. Copy and paste the API key into the following cell as `GROQ_API_KEY=<your_groq_key_here>`

 **Do not close the window until you've copied the key; once the window is closed, you cannot view the key again.**


3. Go to Google AI Studio, click "API Keys" in the left sidebar, and create a new API key.

4. Copy and paste the API key into the following cell as `GOOGLE_API_KEY=<your_google_key_here>`

**Do not close the window until you've copied the key; once the window is closed, you cannot view the key again.**
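
Pasting a key directly into a cell is convenient, but the key then travels with the notebook whenever it is shared. As one possible alternative, the sketch below reads a key from an environment variable and can optionally prompt for it without echoing. The helper `load_api_key` and the variable name `DEMO_API_KEY` are hypothetical and used only for illustration.

```python
import os
from getpass import getpass

def load_api_key(env_name, prompt_if_missing=False):
    """Return the key stored in the environment variable env_name."""
    key = os.environ.get(env_name, "").strip()
    if not key and prompt_if_missing:
        # getpass hides the typed value, so it does not appear in cell output.
        key = getpass(f"Enter {env_name}: ").strip()
    if not key:
        raise ValueError(f"{env_name} is not set")
    return key

# Demo with a placeholder value; a real key would be set outside the notebook.
os.environ["DEMO_API_KEY"] = "placeholder-not-a-real-key"
demo_key = load_api_key("DEMO_API_KEY")
```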

In [2]:
GROQ_API_KEY = "<your_groq_key_here>"  # @param {type:"string"}

In [4]:
GOOGLE_API_KEY = "<your_google_key_here>"  # @param {type:"string"}

## Groq API Call

- Learn the interface for calling LLMs via the Groq API

- How `messages` are formatted with `role` and `content`
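
As a small illustration of the `messages` format, each message is a dictionary with a `role` and a `content` field, and the list is ordered like a conversation. The helper `build_messages` below is a hypothetical convenience function, not part of the Groq client; it only builds the list and makes no API call.

```python
def build_messages(user_prompt, system_prompt=None):
    """Build a chat-style message list with an optional system instruction."""
    messages = []
    if system_prompt:
        # The system message sets overall behaviour and comes first.
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_prompt})
    return messages

messages = build_messages(
    "Explain the importance of large language models",
    system_prompt="Answer in one sentence.",
)
print(messages)
```

A list like this can be passed as the `messages` argument of `groq_client.chat.completions.create`.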

In [5]:
import warnings
warnings.filterwarnings("ignore")

In [6]:
import os

from groq import Groq

groq_client = Groq(
    api_key=GROQ_API_KEY,
)

In [7]:


chat_completion = groq_client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": "Explain the importance of large language models",
        }
    ],
    model="llama-3.3-70b-versatile",
)

print(chat_completion.choices[0].message.content)

Large language models (LLMs) have revolutionized the field of natural language processing (NLP) and have numerous applications in various industries. The importance of LLMs can be summarized as follows:

1. **Improved Language Understanding**: LLMs are trained on vast amounts of text data, enabling them to learn complex language patterns, nuances, and relationships. This allows them to better understand the context and meaning of language, leading to more accurate and informative responses.
2. **Enhanced Text Generation**: LLMs can generate coherent, context-dependent text that is often indistinguishable from human-written content. This capability has far-reaching implications for applications such as content creation, chatbots, and language translation.
3. **Automated Content Creation**: LLMs can automate content creation tasks such as writing articles, product descriptions, and social media posts, freeing up human resources for more creative and high-level tasks.
4. **Personalized Cu

## Gemini API Call

- Learn the interface for calling Gemini via the Google GenAI client

- See how `contents` (Gemini) differs from `messages` (Groq) when passing a question to an LLM
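
To make the difference concrete: Groq expects a list of role/content dictionaries, while the Gemini call in this tutorial passes a single plain string as `contents`. The sketch below flattens a Groq-style list into one string. The function `messages_to_contents` is a hypothetical helper and a deliberate simplification; the Gemini SDK also accepts structured content, which is not shown here.

```python
def messages_to_contents(messages):
    """Flatten role/content dicts into one plain string."""
    return "\n".join(f"{m['role']}: {m['content']}" for m in messages)

groq_style = [
    {"role": "system", "content": "Answer in a few words."},
    {"role": "user", "content": "Explain how AI works"},
]

# A single string, suitable for the `contents` argument used in this tutorial.
gemini_style = messages_to_contents(groq_style)
print(gemini_style)
```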


In [8]:
from google import genai
import os

gemini_client = genai.Client(
    api_key=GOOGLE_API_KEY
)

In [9]:

response = gemini_client.models.generate_content(
    model="gemini-2.5-flash", contents="Explain how AI works in a few words"
)
print(response.text)

AI learns patterns from data to make decisions or predictions.


# Part 2 - RAG with ChromaDB

- Download `data.zip` and unzip it to get the three `txt` files. Create a `data` folder in Colab and upload the three `txt` files.

## Overview
Chroma is an open-source database for AI applications. It supports embeddings, vector search, document storage, full-text search, metadata filtering, and multi-modal data. ChromaDB manages the text documents, converts text to embeddings, and performs similarity searches.

See the documentation for more information:

https://www.trychroma.com

https://docs.trychroma.com/docs/collections/manage-collections
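
To give a feel for what a similarity search does, the sketch below ranks three made-up vectors by cosine similarity to a query vector. The three-dimensional vectors are invented for illustration only (real sentence embeddings have hundreds of dimensions), and the distance metric a Chroma collection actually uses depends on its configuration, so treat this as a conceptual picture rather than a description of Chroma's internals.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Invented toy "embeddings" for three text snippets.
toy_vectors = {
    "rabbit hole": [0.9, 0.1, 0.0],
    "bottle label": [0.1, 0.9, 0.1],
    "red rose": [0.0, 0.2, 0.9],
}
query_vector = [0.8, 0.2, 0.1]

# Highest similarity first.
ranked = sorted(
    toy_vectors,
    key=lambda name: cosine_similarity(query_vector, toy_vectors[name]),
    reverse=True,
)
print(ranked)
```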

## Objective
- For this task, we explore ChromaDB similarity search

- Make sure you **`understand`** the code provided


### Ground truth preparation

The pairing is:

The relevant context for answering Q1 is C1, and the grounded answer is A1.

Q2 and Q3 follow the same pattern.

In [10]:
# Define the ground truth for RAG
Queries = {
    "Q1": "Which animal did Alice follow down the hole?", # for assessment 1
    "Q2": "What instruction was printed on the bottle that Alice drank from?", # for assessment 2
    "Q3": "Which flower did he need to bring to her from the garden?" # for assessment 3
}

Answers = {
    "A1": "rabbit",
    "A2": "drink me",
    "A3": "red rose"
}

Citations = {
    "C1": """Alice started to her feet, for it flashed across her mind that she had never before seen a rabbit with either a waistcoat-pocket, or a watch to take out of it, and burning with curiosity, she ran across the field after it, and fortunately was just in time to see it pop down a large rabbit-hole under the hedge.""",
    "C2": """and round the neck of the bottle was a paper label, with the words “DRINK ME,”""",
    "C3": """“SHE said that she would dance with me if I brought her red roses,” cried the young Student; “but in all my garden there is no red rose.”"""
}

### Step 1 - Chunking

The first step of the RAG pipeline is chunking the raw text into small units.


The following function takes the raw text, a chunk size, and an overlap as parameters. `chunk_size` and `overlap` are counted in words and default to 500 and 0.

Then we read the contents of the `alice_ch1.txt` file and split it into chunks.

In [19]:
def chunk_text(text, chunk_size=500, overlap=0):
    words = text.split()
    out, i = [], 0
    while i < len(words):
        out.append(" ".join(words[i:i+chunk_size]))
        i += max(1, chunk_size - overlap)
    return out

**`Note:`** The provided chunking function is a simple one based on word count. Frameworks like `LangChain` provide more chunking strategies. See this for more information:

https://python.langchain.com/docs/concepts/text_splitters/
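
To see how `chunk_size` and `overlap` interact, here is the same word-based chunking function applied to a ten-word toy string. The toy input and the variable name `demo_chunks` are illustrative. With `chunk_size=4` and `overlap=1`, each chunk starts three words after the previous one, so neighbouring chunks share one word.

```python
def chunk_text(text, chunk_size=500, overlap=0):
    # Same word-based chunking as in the tutorial.
    words = text.split()
    out, i = [], 0
    while i < len(words):
        out.append(" ".join(words[i:i + chunk_size]))
        # Advance by the stride; max(1, ...) avoids an infinite loop.
        i += max(1, chunk_size - overlap)
    return out

demo_chunks = chunk_text("a b c d e f g h i j", chunk_size=4, overlap=1)
for chunk in demo_chunks:
    print(repr(chunk))
```

Note that the final chunk can be much shorter than `chunk_size`; in this toy run it contains a single word.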

In [20]:
with open("alice_ch1.txt","r",encoding="utf-8") as f:
    raw_text = f.read()

chunks = chunk_text(raw_text, chunk_size=500, overlap=0)

In [21]:
print(f"Total chunks created: {len(chunks)}")
print("\nPreview of first chunk:\n")
print(chunks[0][:500], "...")

Total chunks created: 5

Preview of first chunk:

Alice’s Adventures in Wonderland by Lewis Carroll CHAPTER I. Down the Rabbit-Hole Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, “and what is the use of a book,” thought Alice “without pictures or conversations?” So she was considering in her own mind (as well as she could, for the hot day made her feel very sleepy and stupid) ...


In [16]:
from google.colab import files

# Upload your file manually
uploaded = files.upload()


Saving alice_ch1.txt to alice_ch1.txt
Saving alice_ch8.txt to alice_ch8.txt
Saving wilde_nightingale_rose.txt to wilde_nightingale_rose.txt


### Step 2 - Create a ChromaDB collection and vectors for chunks

In [22]:
import chromadb
from chromadb.utils.embedding_functions import SentenceTransformerEmbeddingFunction

embed_fn = SentenceTransformerEmbeddingFunction(model_name="all-MiniLM-L6-v2")
client = chromadb.Client()


collection = client.get_or_create_collection(  # Hints: check the documentation for `get_or_create_collection`
    name="alice_ch1",
    embedding_function=embed_fn,
)



In [24]:
# Upsert the source text into the collection
def upsert_source(chunks, collection):
    ids = [f"alice:ch1:{i}" for i in range(len(chunks))]
    metas = [{"source":"alice_ch1","chunk_id":ids[i]} for i in range(len(chunks))]
    collection.upsert(ids=ids, documents=chunks, metadatas=metas)

upsert_source(chunks, collection)


**`Note:`** Apart from ChromaDB, there are other options for similarity-based search, such as FAISS.

FAISS: https://ai.meta.com/tools/faiss/

- Check the configuration of the collection

In [25]:
collection.get(include=["metadatas"])

{'ids': ['alice:ch1:0',
  'alice:ch1:1',
  'alice:ch1:2',
  'alice:ch1:3',
  'alice:ch1:4'],
 'embeddings': None,
 'documents': None,
 'uris': None,
 'included': ['metadatas'],
 'data': None,
 'metadatas': [{'chunk_id': 'alice:ch1:0', 'source': 'alice_ch1'},
  {'chunk_id': 'alice:ch1:1', 'source': 'alice_ch1'},
  {'chunk_id': 'alice:ch1:2', 'source': 'alice_ch1'},
  {'chunk_id': 'alice:ch1:3', 'source': 'alice_ch1'},
  {'chunk_id': 'alice:ch1:4', 'source': 'alice_ch1'}]}

### Step 3 - Similarity-based search for a given query

The two queries we are going to use for **Part 2** are:

- Which animal did Alice follow down the hole?

- What instruction was printed on the bottle that Alice drank from?

In [26]:
# Define the retrieval function
def search(query, collection, k=3):
    r = collection.query(query_texts=[query], n_results=k)
    hits = list(zip(r["ids"][0], r["documents"][0], r["metadatas"][0]))
    return hits


# Retrieve the top 3 results for Q1
hits = search(Queries["Q1"], collection, k=3)

In [27]:
# Look into the details of the retrieved results
# (use doc_id rather than id to avoid shadowing the built-in id())
for i, (doc_id, text, meta) in enumerate(hits):
    print(f"Hit {i+1}:")
    print(f"ID: {doc_id}")
    print(f"Text: {text}")
    print("\n")


Hit 1:
ID: alice:ch1:0
Text: Alice’s Adventures in Wonderland by Lewis Carroll CHAPTER I. Down the Rabbit-Hole Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, “and what is the use of a book,” thought Alice “without pictures or conversations?” So she was considering in her own mind (as well as she could, for the hot day made her feel very sleepy and stupid), whether the pleasure of making a daisy-chain would be worth the trouble of getting up and picking the daisies, when suddenly a White Rabbit with pink eyes ran close by her. There was nothing so _very_ remarkable in that; nor did Alice think it so _very_ much out of the way to hear the Rabbit say to itself, “Oh dear! Oh dear! I shall be late!” (when she thought it over afterwards, it occurred to her that she ought to have wondered at this, but at the time it all see
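
The `search` function indexes the query result with `[0]` because results are nested: one inner list per query text. The sketch below shows that shape with a hand-written mock dictionary, so it runs without ChromaDB. The mock values and the helper `unpack_hits` are illustrative, and the `distances` field is assumed to be included in the result.

```python
# Hand-written stand-in for the dictionary returned by collection.query().
mock_result = {
    "ids": [["doc:0", "doc:2"]],
    "documents": [["first chunk text", "third chunk text"]],
    "metadatas": [[{"source": "demo"}, {"source": "demo"}]],
    "distances": [[0.41, 0.87]],
}

def unpack_hits(result, query_index=0):
    """Zip the parallel lists for one query into (id, text, meta, distance) tuples."""
    return list(zip(
        result["ids"][query_index],
        result["documents"][query_index],
        result["metadatas"][query_index],
        result["distances"][query_index],
    ))

mock_hits = unpack_hits(mock_result)
for doc_id, text, meta, distance in mock_hits:
    print(doc_id, distance, text)
```

Printing the distance next to each hit can help when judging how close the retrieved chunks are to the query.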

# **Important!**

Only complete the code where it shows `# TODO`; do **NOT** change other parts or rewrite the entire function.

## Assessment 1

Based on the previous code, you should now be able to implement the end-to-end pipeline: take the user query and the relevant retrieved context, and have the LLM answer the query based on that context. Grounding the LLM's response in relevant context is the core of RAG.

Complete the following code using only **`query 1`** for Assessment 1.
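
Grounding comes down to placing the retrieved text in the prompt next to the question. The sketch below is one possible way to do that, not the prompt used in this notebook: the helper `build_grounded_prompt`, the numbered-chunk layout, and the "not found" instruction are all assumptions made for illustration.

```python
def build_grounded_prompt(query, chunks):
    """Combine a question with numbered context chunks into one prompt string."""
    numbered = "\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(chunks))
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, reply 'not found'.\n\n"
        f"Question: {query}\n\n"
        f"Context:\n{numbered}\n"
    )

demo_prompt = build_grounded_prompt(
    "Which animal did Alice follow down the hole?",
    ["a White Rabbit with pink eyes ran close by her", "she ran across the field after it"],
)
print(demo_prompt)
```

Numbering the chunks makes it easier to ask the model which chunk supported its answer.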

In [28]:
# Generate the final response for query 1, grounding the LLM's response with the retrieved results

query = Queries["Q1"]

hits = search(query, collection, k=3)

# Define the context based on the retrieved results
# Hint: join the retrieved texts with `\n`

def get_context(hits):
    # Join the retrieved chunk texts into one context string
    context = "\n".join([text for (_id, text, _meta) in hits])
    return context

context = get_context(hits)


In [29]:
# Define the prompt for the LLM, which combines the question and the retrieved context
prompt = f"""
Answer the following question based on the context provided.

Question: {query}

Context:
{context}
"""


In [30]:
# Define the LLM call, model="gemini-2.5-flash"

def llm_call(query, context):
    if not GOOGLE_API_KEY:
        # Offline fallback when no API key is set: keyword match on the context
        c = context.lower()

        if "rabbit" in c:
            llm_output = "rabbit"
        elif "drink me" in c:
            llm_output = "drink me"
        elif "rose" in c:
            llm_output = "red rose"
        else:
            llm_output = "unknown"
    else:
        # --- real Gemini API call ---
        from google import genai
        gemini_client = genai.Client(api_key=GOOGLE_API_KEY)

        prompt = f"""Answer the following question based on the context provided.
Question: {query}

Context:
{context}
"""
        response = gemini_client.models.generate_content(
            model="gemini-2.5-flash",
            contents=prompt
        )
        llm_output = (response.text or "").strip().lower()

    return llm_output



response = llm_call(query, context)
print("Model response is:")
print(response)

print("\nCorrect Answer is:")
print(Answers['A1'])


Model response is:
alice followed the **white rabbit** down the hole.

Correct Answer is:
rabbit


In [31]:
# Verify: 1: whether C1 is included in the retrieved results; 2: whether the answer is correct compared with the ground truth A1

def verify_retrival(citation, retrievals):


    citation_norm = " ".join(citation.split())
    for (_id, text, _meta) in retrievals:
        text_norm = " ".join(text.split())
        if citation_norm in text_norm:
            print("\nCitation is included in the retrieved results")
            break  # stop once found


verify_retrival(Citations['C1'], hits)



Citation is included in the retrieved results
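
The second check, comparing the model's answer with the ground truth, is done by eye in this notebook. A minimal sketch of automating it is shown below, assuming that "correct" means the expected answer appears somewhere in the response after lowercasing and stripping punctuation. The helpers `normalize` and `answer_is_correct` are hypothetical, and substring matching can give false positives (for example, a response saying "not a rabbit").

```python
import re

def normalize(text):
    """Lowercase, replace non-alphanumeric characters with spaces, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return " ".join(text.split())

def answer_is_correct(response, expected):
    """True if the normalized expected answer occurs in the normalized response."""
    return normalize(expected) in normalize(response)

print(answer_is_correct("alice followed the **white rabbit** down the hole.", "rabbit"))
print(answer_is_correct("the context does not mention any instruction.", "drink me"))
```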


## Assessment 2

Now let's try **`query 2`**.

- Verify your model response to Q2





In [32]:
# Generate the final response for query 2
# Verify your model response to Q2 with default RAG pipeline

query = Queries["Q2"]

hits = search(query, collection, k=3)
context = get_context(hits)

# get model response
response = llm_call(query, context)
print("Model response is:")
print(response)

print("\nCorrect Answer is:")
print(Answers['A2'])

# Verify: 1: whether C2 is included in the retrieved results; 2: whether the answer is correct compared with the ground truth A2
verify_retrival(Citations['C2'], hits)


Model response is:
the context states: "however, this bottle was _not_ marked “poison,”". it does not provide any information about what instruction *was* printed on the bottle that alice drank from.

Correct Answer is:
drink me


## Assessment 2 - continued

Now let's improve the RAG pipeline.

- If the answer was incorrect, pick one component of the RAG pipeline (chunking --> embedding --> searching --> generating) and improve it so that the model gives the correct answer.

- Try Q1 with the improved RAG pipeline and check that it still gives the correct answer.


**`Hints`** When you improve the RAG pipeline, you may need to `restart the session` or rebuild the Chroma collection under a `different collection name`.
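
One reason retrieval can fail is that a chunk boundary splits the passage you need. The sketch below checks, for a few chunking settings, whether a citation fits entirely inside at least one chunk. It uses a toy text of numbered words; the helper `citation_in_some_chunk`, the toy data, and the settings tried are illustrative assumptions, and passing this check does not guarantee that the chunk will be retrieved.

```python
def chunk_text(text, chunk_size=500, overlap=0):
    # Same word-based chunking as in the tutorial.
    words = text.split()
    out, i = [], 0
    while i < len(words):
        out.append(" ".join(words[i:i + chunk_size]))
        i += max(1, chunk_size - overlap)
    return out

def citation_in_some_chunk(citation, chunks):
    """True if the whitespace-normalized citation lies fully inside one chunk."""
    citation_norm = " ".join(citation.split())
    return any(citation_norm in chunk for chunk in chunks)

# Toy corpus: w0 w1 ... w19, with a citation that straddles position 10.
toy_text = " ".join(f"w{i}" for i in range(20))
toy_citation = "w8 w9 w10 w11"

results = {}
for chunk_size, overlap in [(10, 0), (10, 4), (6, 3)]:
    toy_chunks = chunk_text(toy_text, chunk_size=chunk_size, overlap=overlap)
    results[(chunk_size, overlap)] = citation_in_some_chunk(toy_citation, toy_chunks)

print(results)
```

Running the same check with `raw_text` and `Citations['C2']` is one way to tell a chunking problem apart from an embedding or search problem.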

In [33]:
# If the answer was incorrect, improve the RAG pipeline (chunking --> embedding --> searching --> generating) so that the model gives the correct answer.

# ✅ TODO completed: rebuild the collection with a smaller chunk size, a slight overlap, and a different embedding model
import chromadb
from chromadb.utils.embedding_functions import SentenceTransformerEmbeddingFunction

client = chromadb.Client()
embed_fn = SentenceTransformerEmbeddingFunction(model_name="paraphrase-MiniLM-L3-v2")


collection_improved = client.get_or_create_collection(
    name="alice_ch1_improved",
    embedding_function=embed_fn
)


chunks_improved = chunk_text(raw_text, chunk_size=300, overlap=50)

upsert_source(chunks_improved, collection_improved)



In [34]:
# Verify after the improvement

query = Queries["Q2"]

hits = search(query, collection_improved, k=3)

context = get_context(hits)

response = llm_call(query, context)
print("Model response is:")
print(response)

print("\nCorrect Answer is:")
print(Answers['A2'])


verify_retrival(Citations['C2'], hits)

Model response is:
the context states: "however, this bottle was _not_ marked 'poison,' so alice ventured to taste it..."

the context does not mention any other instruction printed on the bottle that alice drank from, only that it was *not* marked "poison."

Correct Answer is:
drink me


In [35]:
# Try Q1 with the improved RAG pipeline; it should still give the correct answer

query = Queries["Q1"]

hits = search(query, collection_improved, k=3)

context = get_context(hits)

response = llm_call(query, context)
print("Model response is:")
print(response)

print("\nCorrect Answer is:")
print(Answers['A1'])

verify_retrival(Citations['C1'], hits)


Model response is:
alice followed a **white rabbit** down the hole.

Correct Answer is:
rabbit

Citation is included in the retrieved results


# Part 3 - Metadata filtering (two sources mixed)

Now we’ll create a mixed corpus of two public-domain texts that both mention **roses**:

- `data/wilde_nightingale_rose.txt` — The Nightingale and the Rose (Oscar Wilde) → `source="wilde"`


- `data/alice_ch8.txt` — Alice Chapter 8 (The Queen’s Croquet-Ground) → `source="alice"`



### Build the RAG pipeline with the default settings and search method

In [39]:
# Chunking function
def chunk_text(text, chunk_size=500, overlap=0):
    words = text.split()
    out, i = [], 0
    while i < len(words):
        out.append(" ".join(words[i:i+chunk_size]))
        i += max(1, chunk_size - overlap)
    return out


# Create a ChromaDB collection
embed_fn = SentenceTransformerEmbeddingFunction(model_name="all-MiniLM-L6-v2")
client = chromadb.Client()

col = client.get_or_create_collection(name="mixed_rag", embedding_function=embed_fn)


# Upsert the source text into the collection
def upsert_source(text, source_tag):
    chunks = chunk_text(text, chunk_size=500, overlap=0)
    ids = [f"{source_tag}:{i}" for i in range(len(chunks))]
    metas = [{"source": source_tag, "chunk_id": ids[i]} for i in range(len(chunks))]
    col.upsert(ids=ids, documents=chunks, metadatas=metas)


# Read the contents from the two files
with open("wilde_nightingale_rose.txt","r",encoding="utf-8") as f:
    wilde_raw = f.read()
with open("alice_ch8.txt","r",encoding="utf-8") as f:
    alice_raw = f.read()

upsert_source(wilde_raw, "wilde")
upsert_source(alice_raw, "alice")


# Search the collection for Q3 and generate the response
def search(query, k=3):
    r = col.query(query_texts=[query], n_results=k)
    hits = list(zip(r["ids"][0], r["documents"][0], r["metadatas"][0]))
    return hits


query = Queries["Q3"]
hits = search(query, k=3)


## Assessment 3

Generate the response for Q3 with the retrieved results and verify both the retrieval and generation

In [40]:
# Generate the response for Q3 with the retrieved results

context = get_context(hits)

response = llm_call(query, context)
print("Model response is:")
print(response)

print("\nCorrect Answer is:")
print(Answers['A3'])

# Verify: 1: whether C3 is included in the retrieved results; 2: whether the answer is correct compared with the ground truth A3

verify_retrival(Citations['C3'], hits)



Model response is:
based on the first part of the context:

he needed to bring a **red rose** to her from the garden.

Correct Answer is:
red rose


## Assessment 3 - continued

Now let's improve the RAG pipeline using metadata filtering.

The syntax for adding a metadata filter in ChromaDB is:

`col.query(query_texts=[query], n_results=k, where={"source": "wilde"})`

The `where` argument in `query` is used to filter records by their metadata.

For more information about metadata filtering, see: https://docs.trychroma.com/docs/querying-collections/metadata-filtering

Other vector search frameworks have similar filtering methods as well.
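
To illustrate what a `where` filter expresses, the sketch below applies filter dictionaries to a small list of mock records in plain Python. It mimics plain equality and the `$in` and `$ne` operators; the `matches` and `filter_ids` helpers and the mock records are illustrative stand-ins, not Chroma's implementation.

```python
def matches(metadata, where):
    """Toy matcher: supports plain equality plus the $in and $ne operators."""
    for key, condition in where.items():
        value = metadata.get(key)
        if isinstance(condition, dict):
            if "$in" in condition and value not in condition["$in"]:
                return False
            if "$ne" in condition and value == condition["$ne"]:
                return False
        elif value != condition:
            return False
    return True

# Mock records standing in for chunks stored with a "source" tag.
records = [
    {"id": "wilde:0", "source": "wilde"},
    {"id": "wilde:1", "source": "wilde"},
    {"id": "alice:0", "source": "alice"},
]

def filter_ids(where):
    return [r["id"] for r in records if matches(r, where)]

only_wilde = filter_ids({"source": "wilde"})
either = filter_ids({"source": {"$in": ["wilde", "alice"]}})
not_wilde = filter_ids({"source": {"$ne": "wilde"}})
print(only_wilde, either, not_wilde)
```

In a vector store the filter narrows the candidate set, and the similarity ranking is then applied to the records that remain.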



In [42]:
def search(query, metas, k=3):

    r = col.query(
        query_texts=[query],
        n_results=k,
        where=metas   # filter results by metadata (e.g., {"source": "wilde"})
    )
    hits = list(zip(r["ids"][0], r["documents"][0], r["metadatas"][0]))
    return hits


In [43]:
# retrieve the top 3 results for Q3 with meta filtering

query = Queries["Q3"]
metas = {"source": "wilde"}
hits = search(query, metas, k=3)

context = get_context(hits)

# generate the response for Q3 with the retrieved results

response = llm_call(query, context)
print("Model response is:")
print(response)

print("\nCorrect Answer is:")
print(Answers['A3'])

# Verify: this time C3 should be included in the retrieved results, and the model should give the correct answer for Q3

verify_retrival(Citations['C3'], hits)

Model response is:
he needed to bring her a **red rose**.

Correct Answer is:
red rose

Citation is included in the retrieved results
