# 📦 Day 3: Building a RAG System (Retrieval-Augmented Generation)

Welcome to the third session of the Generative AI workshop!

Today we'll learn how to **build a Retrieval-Augmented Generation (RAG) pipeline** using open-source tools. You'll see how to process documents, embed them into a vector store, and query them with a language model to generate intelligent responses grounded in real content.

🎯 **Objectives**

- Understand the concept and benefits of Retrieval-Augmented Generation (RAG)
- Chunk and embed a document using `sentence-transformers`
- Store and search document vectors using `ChromaDB`
- Query a document using a local language model (`FLAN-T5`)
- Build and test a simple QA system over a PDF — no API keys required!


## 🔧 Step 1: Install Required Packages

We'll begin by installing all the necessary Python libraries for this RAG pipeline:

- `chromadb` for vector storage and retrieval
- `PyPDF2` for extracting text from PDF documents (installed here, although the extraction cell in Step 3 uses `PyMuPDF` instead)
- `transformers` for loading our language model (FLAN-T5)
- `sentence-transformers` for generating text embeddings

This may take a minute the first time you run it.


In [None]:
!pip install chromadb PyPDF2 transformers sentence-transformers --quiet


## 📦 Step 2: Import Required Libraries

Now that we've installed our dependencies, let's import the necessary libraries:

- `PyPDF2` to read PDF files (Step 3 ends up using `PyMuPDF`, imported as `fitz`, for this)
- `sentence-transformers` to embed text chunks
- `transformers` to load and run our LLM (FLAN-T5)
- `chromadb` to store and retrieve vectorized document chunks
- `torch` as the backend for running the language model


In [None]:
import os
from PyPDF2 import PdfReader
from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM  # FLAN-T5 is a seq2seq model
import torch
import chromadb
from chromadb.config import Settings

## 📄 Step 3: Upload and Extract Text from a PDF

We'll now upload a PDF file using Colab's file uploader and extract its text content.

- This step reads each page of the PDF using `PyMuPDF` (imported as `fitz`)
- It joins the extracted text into a single string
- The resulting `full_text` variable will be used for chunking and embedding in the next steps


In [None]:
pip install pymupdf

Collecting pymupdf
  Downloading pymupdf-1.26.3-cp39-abi3-manylinux_2_28_x86_64.whl.metadata (3.4 kB)
Downloading pymupdf-1.26.3-cp39-abi3-manylinux_2_28_x86_64.whl (24.1 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 24.1/24.1 MB 94.5 MB/s eta 0:00:00
Installing collected packages: pymupdf
Successfully installed pymupdf-1.26.3


In [None]:
from google.colab import files
import fitz  # PyMuPDF

# TODO: Upload a PDF file from your computer
uploaded = files.upload()  # Hint: What method allows users to upload files?

# TODO: Get the filename of the uploaded file
filename = next(iter(uploaded))  # Hint: What variable contains the uploaded files?

# TODO: Create a PDF reader object
reader = fitz.open(filename)  # Hint: What variable contains the filename?

# Extract all text from the PDF
# TODO: Extract text from each page and join into a single string
cleaned_pages = []
for page in reader:
    words = page.get_text("words")  # Extract individual words as (x0, y0, x1, y1, "word", block_no, line_no, word_no)
    words.sort(key=lambda w: (w[1], w[0]))  # Sort by y0 (top-down), then x0 (left-right)
    text = " ".join(w[4] for w in words)  # Join just the word text with a single space
    cleaned_pages.append(text.strip())

full_text = "\n".join(cleaned_pages)

print(f"✅ PDF uploaded and processed!")
print(f"📄 Filename: {filename}")
print(f"📝 Total text length: {len(full_text)} characters")

# 💡 LEARNING NOTES:
# - This step reads each word on the page and ensures clean spacing
# - It joins words in the correct visual order (left to right, top to bottom)
# - We use 'words' mode in PyMuPDF to avoid messy newlines or irregular gaps


Saving Eman_CV.pdf to Eman_CV.pdf
✅ PDF uploaded and processed!
📄 Filename: Eman_CV.pdf
📝 Total text length: 5010 characters


In [None]:
# TODO: Display the first 1000 characters of the extracted text
print(full_text[:1000])  # Hint: What variable contains our text and how many characters should we show?

# 💡 This is helpful for:
# - Verifying that the PDF contains valid text (not just images)
# - Understanding what content the model will later use for answering questions
# - Checking if the text extraction worked properly

EMAN ALHAJRI Artificial intelligence and Data Science Specialist Muscat, Al-Seeb| +968 94428223 | emaanhajri@gmail.com Github: https://github.com/1iEman | Linkedin: www.linkedin.com/in/eman-al-hajri EDUCATION​ Sultan Qaboos University ALkhoud, Muscat Bachelor of Computer Science 2019-2025​ Major in Artificial Intelligence and Data Science Cumulative GPA: 3.55/4.0, Dean’s List: 2021,2022,2023,2024 Relevant Coursework: Artificial Intelligence courses; Data Analysis and Visualization with Python, Machine learning, Deep Learning, Computer Vision, Pattern Recognition and analysis, Digital Image Processing, Mobile robotics, Natural Language Processing. Final Year Project in AI and Data Science. EXPERIENCE Makeen Bootcamp – AI & Data Science Stream Apr 2025 – Present Data Collection & Pipelines: Web scraping for data collection, building automated workflows using Apache Airflow.​ management.​ Infrastructure & Databases: Devops using Docker & MySQL for data storage and Visualization, Statistic

## 👀 Optional: View Extracted Text

The cell above previews the extracted text from the PDF so we can confirm it was loaded correctly.

This is helpful for:
- Verifying that the PDF contains valid text (not just images)
- Understanding what content the model will later use for answering questions


## ✂️ Step 4: Chunk the Text

To make the text manageable for embedding and retrieval, we'll break the PDF content into smaller chunks.

- This function splits the text into sentences using regular expressions
- It groups sentences together until a character limit (e.g., 300) is reached
- The result is a list of `chunks`, each suitable for embedding in the next step


In [None]:
import re

# Clean the text first
full_text1 = re.sub(r'\s+', ' ', full_text).strip()

def chunk_text(text, max_length=300):
    sentences = re.split(r'(?<=[.!?])\s+', text)
    chunks, chunk = [], ""
    for sentence in sentences:
        if len(chunk) + len(sentence) <= max_length:
            chunk += sentence + " "
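        else:
            # Note: if the very first sentence is longer than max_length, `chunk` is still
            # empty here, so an empty string is appended (the blank "Chunk 1" in the output)
            chunks.append(chunk.strip())
            chunk = sentence + " "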
    if chunk:
        chunks.append(chunk.strip())
    return chunks

chunks = chunk_text(full_text1)

print(f"✅ Text chunked successfully!")
print(f"📊 Total chunks created: {len(chunks)}")
print(f"📏 Average chunk length: {sum(len(c) for c in chunks) / len(chunks):.0f} characters")
print(f"🔍 First chunk preview:\n{chunks[0]}")


✅ Text chunked successfully!
📊 Total chunks created: 21
📏 Average chunk length: 238 characters
🔍 First chunk preview:



## 📚 Optional: Inspect the Chunks

Let’s inspect a few individual chunks to understand how the original text was segmented.

This helps you:
- See how the chunking logic grouped sentences together
- Verify whether the chunks are clean and meaningful for embedding


In [None]:
# TODO: Loop through the first 7 chunks (or fewer, if the document is short)
for i in range(min(7, len(chunks))):  # Hint: How many chunks do we want to see?
    print(f"--- Chunk {i + 1} ---")  # Hint: i is zero-based, so the label is i + 1
    print(chunks[i])  # Hint: What list contains our chunks and what index are we at?
    print()

--- Chunk 1 ---


--- Chunk 2 ---
EMAN ALHAJRI Artificial intelligence and Data Science Specialist Muscat, Al-Seeb| +968 94428223 | emaanhajri@gmail.com Github: https://github.com/1iEman | Linkedin: www.linkedin.com/in/eman-al-hajri EDUCATION​ Sultan Qaboos University ALkhoud, Muscat Bachelor of Computer Science 2019-2025​ Major in Artificial Intelligence and Data Science Cumulative GPA: 3.55/4.0, Dean’s List: 2021,2022,2023,2024 Relevant Coursework: Artificial Intelligence courses; Data Analysis and Visualization with Python, Machine learning, Deep Learning, Computer Vision, Pattern Recognition and analysis, Digital Image Processing, Mobile robotics, Natural Language Processing.

--- Chunk 3 ---
Final Year Project in AI and Data Science.

--- Chunk 4 ---
EXPERIENCE Makeen Bootcamp – AI & Data Science Stream Apr 2025 – Present Data Collection & Pipelines: Web scraping for data collection, building automated workflows using Apache Airflow.​ management.​ Infrastructure & Databases: Devop

## 🧬 Step 5: Generate Embeddings

We’ll now convert each text chunk into a numerical vector using a pre-trained sentence embedding model.

- We're using the `all-MiniLM-L6-v2` model from `sentence-transformers`
- These embeddings will later be stored in a vector database for retrieval

Each chunk is now represented in a way that a machine learning model can understand semantically.
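
To make "semantically" a little more concrete, here is a small sketch of cosine similarity, a common way to compare two embedding vectors. The three-dimensional vectors are made up for illustration; the real embeddings in this notebook have 384 dimensions and come from the model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1.0 means similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for embeddings (hypothetical values)
toy_education = [0.9, 0.1, 0.0]
toy_degree = [0.8, 0.2, 0.1]
toy_docker = [0.0, 0.2, 0.9]

sim_related = cosine_similarity(toy_education, toy_degree)
sim_unrelated = cosine_similarity(toy_education, toy_docker)
print(f"related:   {sim_related:.3f}")
print(f"unrelated: {sim_unrelated:.3f}")
```

Chunks about related topics should end up with vectors that point in similar directions, which is what retrieval relies on later.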


In [None]:
# TODO: Initialize a sentence transformer model for creating embeddings
embedder = SentenceTransformer("all-MiniLM-L6-v2")  # Hint: What's the model name for all-MiniLM-L6-v2?

# TODO: Convert all chunks into embedding vectors
embeddings = embedder.encode(chunks)  # Hint: What method creates embeddings and what variable contains our chunks?

print(f"✅ Embeddings created successfully!")
print(f"📊 Number of embeddings: {len(embeddings)}")
print(f"📏 Embedding dimension: {embeddings.shape[1]}")
print(f"🔍 First embedding preview (first 10 values):")
print(embeddings[0][:10])


✅ Embeddings created successfully!
📊 Number of embeddings: 21
📏 Embedding dimension: 384
🔍 First embedding preview (first 10 values):
[-0.11883842  0.04829862 -0.00254814 -0.01101124  0.05195079  0.0102918
  0.11543332  0.0007007  -0.08592541 -0.070654  ]


## 🔎 Optional: Inspect Embeddings

Let’s take a quick look at the generated embeddings.

- Each embedding is a high-dimensional vector representing the meaning of a chunk
- These vectors are what the model uses to retrieve relevant information later

Note: Embeddings are large arrays of numbers, so we’ll only display the first one for illustration.


In [None]:
# TODO: Print information about the first embedding
print(f"Embedding for Chunk 1 (dimension: {len(embeddings[0])}):")
print(embeddings[0])

Embedding for Chunk 1 (dimension: 384):
[-1.18838422e-01  4.82986234e-02 -2.54813721e-03 -1.10112354e-02
  5.19507863e-02  1.02917971e-02  1.15433320e-01  7.00698933e-04
 -8.59254077e-02 -7.06540048e-02  1.33175042e-03 -3.54724042e-02
  1.84340514e-02 -6.73720986e-03  2.44030710e-02 -2.95030996e-02
 -5.81384338e-02 -5.04396111e-02 -2.07655355e-02  2.90359166e-02
 -6.36760369e-02  2.40299329e-02  2.62433402e-02 -6.03735121e-03
 -1.10766171e-02 -1.40066631e-03 -1.86198559e-02  3.27700749e-02
  2.88602617e-03 -5.69439270e-02 -4.39416729e-02  2.54140683e-02
  8.79094303e-02 -2.49911081e-02 -3.66832465e-02  6.24138303e-03
 -6.64680302e-02 -6.71444014e-02  2.05642469e-02  4.23887521e-02
  2.18802430e-02 -4.28824984e-02 -3.43770310e-02  6.14668801e-02
  6.56373054e-02 -7.85202682e-02  2.94870026e-02  1.07982671e-02
  6.33241981e-02 -4.50847223e-02 -1.82340480e-02 -2.77211033e-02
 -3.67373996e-03 -3.65946442e-02  5.42501844e-02 -2.08566878e-02
  1.50349056e-02 -6.00950569e-02  1.63938235e-02 -

## 🗂️ Step 6: Store Chunks in ChromaDB

Now we’ll store the chunks and their corresponding embeddings in a ChromaDB collection.

- ChromaDB is an efficient local vector database; the plain `chromadb.Client()` used here keeps its data in memory for the current session
- We create a collection named `"pdf_chunks"` (or reuse it if it already exists)
- Each chunk is added along with its embedding and a unique ID

This setup allows us to later search for relevant chunks based on user questions.
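
As a rough mental model of what a vector store does, the sketch below keeps IDs, documents and embeddings side by side and searches them by brute force. `TinyVectorStore` is a hypothetical teaching class, not ChromaDB's API or implementation, and the 2-D vectors are made up.

```python
import math

class TinyVectorStore:
    """Hypothetical in-memory store that searches by brute force (illustration only)."""

    def __init__(self):
        self.ids = []
        self.documents = []
        self.embeddings = []

    def add(self, ids, documents, embeddings):
        # Every document needs exactly one embedding and one ID
        if not (len(ids) == len(documents) == len(embeddings)):
            raise ValueError("ids, documents and embeddings must have the same length")
        self.ids.extend(ids)
        self.documents.extend(documents)
        self.embeddings.extend(embeddings)

    def query(self, query_embedding, n_results=3):
        # Score every stored vector by Euclidean distance (smaller means closer)
        scored = [
            (math.dist(query_embedding, emb), doc_id, doc)
            for doc_id, doc, emb in zip(self.ids, self.documents, self.embeddings)
        ]
        scored.sort(key=lambda item: item[0])
        return scored[:n_results]

store = TinyVectorStore()
store.add(
    ids=["0", "1", "2"],
    documents=["education section", "work experience", "technical skills"],
    embeddings=[[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]],  # made-up 2-D vectors
)
top = store.query([0.9, 0.1], n_results=2)
for distance, doc_id, doc in top:
    print(doc_id, doc, round(distance, 3))
```

Scanning every vector is fine for a few dozen chunks. Dedicated vector databases typically add an index so that search stays fast on much larger collections.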


In [None]:
# TODO: Create a ChromaDB client with anonymized telemetry disabled
chroma_client = chromadb.Client(Settings(anonymized_telemetry=False))  # Hint: Should telemetry be enabled? True or False?

# TODO: Create a collection to store our documents and embeddings
collection = chroma_client.create_collection(name="pdf_chunks", get_or_create=True)  # Hint: What should we name our collection and should we create if it exists?

# TODO: Add documents and embeddings to the collection
for i, (chunk, emb) in enumerate(zip(chunks, embeddings)):  # Hint: What two lists should we iterate through together?
    collection.add(
        documents=[chunk],  # Hint: What text chunk are we adding?
        embeddings=[emb.tolist()],  # Hint: What embedding (converted to list) are we adding?
        ids=[str(i)]  # Hint: What should be the unique ID for this document?
    )

print(f"✅ Vector database created successfully!")
print(f"📊 Total documents stored: {collection.count()}")


✅ Vector database created successfully!
📊 Total documents stored: 21


## 📋 Optional: Preview Stored Chunks in ChromaDB

Let’s confirm that the chunks and embeddings were properly added to the ChromaDB collection.

This quick check allows us to:
- View some of the stored chunk texts
- Ensure each one has a unique ID


In [None]:
# TODO: Retrieve documents from the collection
results = collection.get(include=["documents"])  # Hint: What type of data do we want to retrieve from the collection?

# TODO: Display the first 3 documents
for i in range(min(3, len(results["documents"]))):  # Hint: How many documents to show and what key contains the documents?
    print(f"📄 Chunk ID: {results['ids'][i]}")  # Hint: What key contains IDs and what index are we at?
    print(results["documents"][i])  # Hint: What key contains documents and what index are we at?
    print("-" * 80)  # Hint: What character should create a separator line?


📄 Chunk ID: 0

--------------------------------------------------------------------------------
📄 Chunk ID: 1
EMAN ALHAJRI Artificial intelligence and Data Science Specialist Muscat, Al-Seeb| +968 94428223 | emaanhajri@gmail.com Github: https://github.com/1iEman | Linkedin: www.linkedin.com/in/eman-al-hajri EDUCATION​ Sultan Qaboos University ALkhoud, Muscat Bachelor of Computer Science 2019-2025​ Major in Artificial Intelligence and Data Science Cumulative GPA: 3.55/4.0, Dean’s List: 2021,2022,2023,2024 Relevant Coursework: Artificial Intelligence courses; Data Analysis and Visualization with Python, Machine learning, Deep Learning, Computer Vision, Pattern Recognition and analysis, Digital Image Processing, Mobile robotics, Natural Language Processing.
--------------------------------------------------------------------------------
📄 Chunk ID: 2
Final Year Project in AI and Data Science.
--------------------------------------------------------------------------------


In [None]:
from google.colab import files
import fitz  # PyMuPDF
import re
from sentence_transformers import SentenceTransformer

# TODO: Upload multiple PDF files from your computer
uploaded = files.upload()  # This allows you to upload more than one PDF

all_texts = []

# TODO: Loop through each uploaded file
for filename in uploaded:
    reader = fitz.open(filename)
    cleaned_pages = []
    for page in reader:
        words = page.get_text("words")
        words.sort(key=lambda w: (w[1], w[0]))  # Sort top-down, then left-right
        text = " ".join(w[4] for w in words)
        cleaned_pages.append(text.strip())
    full_text = "\n".join(cleaned_pages)

    # Clean whitespace
    full_text_clean = re.sub(r'\s+', ' ', full_text).strip()
    all_texts.append(full_text_clean)

# Combine all cleaned PDF texts into one
combined_text = " ".join(all_texts)

# ✅ Sentence-based chunking
def chunk_text(text, max_length=300):
    sentences = re.split(r'(?<=[.!?])\s+', text)
    chunks, chunk = [], ""
    for sentence in sentences:
        if len(chunk) + len(sentence) <= max_length:
            chunk += sentence + " "
        else:
            chunks.append(chunk.strip())
            chunk = sentence + " "
    if chunk:
        chunks.append(chunk.strip())
    return chunks

chunks = chunk_text(combined_text)

print(f"✅ Text chunked successfully!")
print(f"📊 Total chunks created: {len(chunks)}")
print(f"📏 Average chunk length: {sum(len(c) for c in chunks) / len(chunks):.0f} characters")
print(f"🔍 First chunk preview:\n{chunks[0]}")

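# ✅ Create embeddings
# Note: these new chunks and embeddings are not added to ChromaDB in this cell;
# re-run Step 6 afterwards so that `collection` exists and contains them
embedder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedder.encode(chunks)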

print(f"✅ Embeddings created successfully!")
print(f"📊 Number of embeddings: {len(embeddings)}")
print(f"📏 Embedding dimension: {embeddings.shape[1]}")
print(f"🔍 First embedding preview (first 10 values):")
print(embeddings[0][:10])


Saving Aya Al Hasani.pdf to Aya Al Hasani (2).pdf
✅ Text chunked successfully!
📊 Total chunks created: 9
📏 Average chunk length: 363 characters
🔍 First chunk preview:
Aya Khamis AL-Hasani Data Scientist & Data Engineer ABOUT ME EDUCATION I am passionate about AI and machine • University of Technology and Applied Sciences, learning, driven to apply these Muscat, Oman - 2024 technologies to solve real-world Bachelor’s Degree challenges.




✅ Embeddings created successfully!
📊 Number of embeddings: 9
📏 Embedding dimension: 384
🔍 First embedding preview (first 10 values):
[-0.055553    0.03918847  0.05182587  0.01580185 -0.0172099  -0.13430738
  0.03710235 -0.02356444 -0.04780589  0.02980794]


## 🤖 Step 7: Load the Language Model (FLAN-T5)

We’ll now load a lightweight instruction-tuned language model to generate answers based on retrieved context.

- `google/flan-t5-base` is a small (roughly 250M-parameter) encoder-decoder model suitable for Q&A tasks, which is why it is loaded with `AutoModelForSeq2SeqLM`
- We load both the tokenizer and the model using Hugging Face Transformers

This model will take the retrieved document chunks and generate context-aware answers to user questions.


In [None]:
# TODO: Load the tokenizer for the T5 model
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")  # Hint: What's the model name for google/flan-t5-base?

# TODO: Load the T5 model for sequence-to-sequence generation
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")  # Hint: Should we use the same name?

## ❓ Step 8: Define a Question-Answering Function

This function allows us to query the document using natural language and receive an answer generated by the language model.

Here’s how it works:

- It first embeds the user's question using the same embedding model as before
- It then queries the ChromaDB collection to retrieve the most relevant text chunks
- These chunks are used as context in a prompt passed to the `flan-t5-base` model
- The model generates an answer based on the context and the question

You can now ask the model questions about the uploaded PDF!
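
The prompt-assembly step can be sketched on its own, without the model. This version adds one thing the description above does not mention: a character budget for the context, as a rough and hypothetical stand-in for the model's limited input length. The function name `build_prompt` and the value `max_context_chars=1200` are assumptions for illustration.

```python
def build_prompt(query, retrieved_chunks, max_context_chars=1200):
    """Assemble instruction, context and question; keep context within a character budget."""
    instruction = "You are a helpful assistant. Use the context to answer the question."
    selected, used = [], 0
    # Chunks are assumed to be ordered from most to least relevant
    for chunk in retrieved_chunks:
        if used + len(chunk) > max_context_chars:
            break
        selected.append(chunk)
        used += len(chunk) + 1  # +1 for the joining space
    # If even the best chunk is too long, keep a truncated version of it
    if not selected and retrieved_chunks:
        selected = [retrieved_chunks[0][:max_context_chars]]
    context = " ".join(selected)
    return (
        f"{instruction}\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\n\n"
        "Answer:"
    )

prompt = build_prompt(
    "Which university is mentioned?",
    ["a" * 10, "b" * 10, "c" * 10],  # placeholder chunks
    max_context_chars=25,
)
print(prompt)
```

Dropping the least relevant chunks first keeps the question and the `Answer:` cue at the end of the prompt, where plain truncation of the whole prompt would cut them off.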


In [None]:
def ask_question(query):
    # TODO: Convert the query into an embedding vector
    query_vec = embedder.encode([query])[0]  # Hint: What method creates embeddings? What should we encode? What index for first result?

    # TODO: Search for similar documents in the vector database
    results = collection.query(query_embeddings=[query_vec.tolist()], n_results=3)  # Hint: What vector to search with? How many similar chunks to retrieve? Bonus: What if we are able to add a threshold?

    # TODO: Combine retrieved documents into context
    context = " ".join(results["documents"][0])  # Hint: What key contains the retrieved documents? What index for our query results?

    # TODO: Create a prompt that includes context and question
    instruction = "You are a helpful assistant. Use the context to answer the question."
    prompt = (
        f"{instruction}\n\n"
        f"Context:\n{context}\n\n"  # Hint: What variable contains our retrieved context?
        f"Question: {query}\n\n"  # Hint: What variable contains the user's question?
        "Answer:"
    )

    # Display the components before generating answer
    print("=" * 80)
    print("🔍 QUERY:")
    print(f"'{query}'")
    print("\n" + "=" * 80)
    print("📋 INSTRUCTION:")
    print(instruction)
    print("\n" + "=" * 80)
    print("📄 RETRIEVED CONTEXT:")
    print(context)
    print("\n" + "=" * 80)
    print("🤖 GENERATED ANSWER:")
    print("-" * 40)

    # TODO: Tokenize the prompt for the model
    inputs = tokenizer(prompt, return_tensors="pt")  # Hint: What should we tokenize? What tensor format does PyTorch use? ("pt")

    # TODO: Generate an answer using the model
    outputs = model.generate(**inputs, max_new_tokens=200)  # Hint: What inputs should we pass? What's a reasonable token limit for answers?

    # TODO: Decode the generated answer (this function prints it rather than returning it)
    full_response = tokenizer.decode(outputs[0], skip_special_tokens=True)  # Hint: What outputs to decode? What index for first result? Should we skip special tokens?

    # Extract only the answer part (after "Answer:")
    if "Answer:" in full_response:
        answer_start = full_response.find("Answer:") + len("Answer:")
        answer = full_response[answer_start:].lstrip()  # Use lstrip() instead of strip() to preserve trailing whitespace
    else:
        answer = full_response.strip()  # Fallback if "Answer:" not found

    print(answer)
    print("=" * 80)


## ▶️ Step 9: Ask a Question!

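Let's test the full pipeline by asking a question about the uploaded PDF.

- This example asks what the person described in the uploaded CV specializes in
- You can replace the question with any other that is relevant to the document
- `ask_question` relies on the `collection` created in Step 6; if that cell has not been run in the current session, the call fails with the `NameError` shown in the output below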

Try experimenting with different question styles to explore the model's capabilities!


In [None]:
# TODO: Ask a question about the document content
ask_question("What does Aya specialize in?")  # Hint: Write a question that would require information from your uploaded PDF

# 💡 Try different types of questions:
# - Factual questions about specific content
# - Summary requests
# - Questions that require combining information from multiple chunks

NameError: name 'collection' is not defined

# 🚀 Advanced RAG Experiments
## For Students Who Want to Go Further!

Congratulations! You've built a complete RAG system. Now it's time to become a **real RAG researcher** and explore what makes these systems work better.

---

## 🔬 **Choose Your Experiment Track**

### 🧩 **Track 1: Chunking Strategy Optimization**
**The Question**: How does text splitting affect answer quality?

**Experiments to Try:**
- **Chunk size comparison**: Test 100, 300, 500, 1000 character chunks
- **Overlap experiments**: Add 50-100 character overlap between chunks
- **Smart boundaries**: Split by paragraphs vs. sentences vs. fixed length
- **Hybrid approaches**: Combine multiple splitting strategies

**Success Metrics**: Answer quality, retrieval accuracy, response coherence
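
As a starting point for the overlap experiment, here is a minimal sketch of fixed-length chunking with a sliding window. The function name `chunk_with_overlap` and its defaults are hypothetical. It splits on characters only and ignores sentence boundaries.

```python
def chunk_with_overlap(text, chunk_size=300, overlap=50):
    """Split text into fixed-size character windows that share `overlap` characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks

demo_chunks = chunk_with_overlap("abcdefghij", chunk_size=4, overlap=2)
print(demo_chunks)
```

Each window repeats the last `overlap` characters of the previous one, so a fact that straddles a boundary appears whole in at least one chunk, at the cost of storing some text twice.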

---

### 🎯 **Track 2: Embedding Model Showdown**
**The Question**: Which embedding model gives the best retrieval results?

**Models to Compare:**
- `all-MiniLM-L6-v2` (what we used - fast and small)
- `all-mpnet-base-v2` (larger, potentially better quality)
- `sentence-transformers/all-MiniLM-L12-v2` (larger variant)
- Domain-specific models for your document type

**Success Metrics**: Retrieval precision, answer relevance, speed comparison

---

### 🔍 **Track 3: Retrieval Strategy Enhancement**
**The Question**: How many chunks should we retrieve and how should we rank them?

**Experiments to Try:**
- **Retrieval count**: Test 1, 3, 5, 10 retrieved chunks
- **Similarity thresholds**: Only use chunks above 0.5, 0.7, 0.8 similarity
- **Re-ranking**: Use different similarity metrics
- **Context limits**: How much context can the model handle effectively?

**Success Metrics**: Answer completeness, hallucination reduction, context utilization
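
One way to try the threshold experiment is to filter the retrieved chunks after the query. The sketch below assumes a result dictionary shaped like the one `collection.query` returns, with one inner list per query under `"documents"` and `"distances"`. ChromaDB reports distances, where smaller means closer, so the threshold here is an upper bound and not a similarity score. The helper name, the cutoff of `1.0`, and the sample values are made up.

```python
def filter_by_distance(results, max_distance=1.0):
    """Keep only (document, distance) pairs whose distance is at most max_distance."""
    documents = results["documents"][0]  # results for the first (only) query
    distances = results["distances"][0]
    return [
        (doc, dist)
        for doc, dist in zip(documents, distances)
        if dist <= max_distance
    ]

# Hypothetical query result with the assumed shape
fake_results = {
    "documents": [["chunk about education", "chunk about skills", "unrelated chunk"]],
    "distances": [[0.42, 0.95, 1.60]],
}
kept = filter_by_distance(fake_results, max_distance=1.0)
print(kept)
```

If the filtered list comes back empty, the pipeline could answer "not found in the document" and skip the language model call, which is one way to reduce hallucinations.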

---

### 📚 **Track 4: Multi-Document Mastery**
**The Question**: How well does RAG work with multiple different documents?

**Experiments to Try:**
- Upload 2-3 different PDFs and ask cross-document questions
- Test document type mixing (PDFs + text files + web content)
- Source attribution: Can you track which document answered what?
- Conflicting information: How does the system handle contradictions?

**Success Metrics**: Cross-document reasoning, source accuracy, conflict resolution
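
For the source-attribution experiment, one possible approach is to record the originating file for every chunk before storing it. The sketch below builds parallel lists of documents, metadata and IDs. The helper name, the metadata keys, and the sample texts are hypothetical, and the simple `split` stands in for whichever chunking function you use.

```python
def chunks_with_sources(texts_by_file, chunk_fn):
    """Chunk each file separately and remember which file every chunk came from."""
    documents, metadatas, ids = [], [], []
    for filename, text in texts_by_file.items():
        for i, chunk in enumerate(chunk_fn(text)):
            documents.append(chunk)
            metadatas.append({"source": filename, "chunk_index": i})
            ids.append(f"{filename}-{i}")  # unique across files
    return documents, metadatas, ids

# Made-up texts; a plain split stands in for the notebook's chunk_text
sample = {"a.pdf": "One. Two", "b.pdf": "Three"}
docs, metas, chunk_ids = chunks_with_sources(sample, lambda t: t.split(". "))
for doc, meta in zip(docs, metas):
    print(meta["source"], "->", doc)
```

These lists line up with the `documents`, `metadatas` and `ids` arguments of ChromaDB's `collection.add`, so the stored metadata can be read back next to each retrieved chunk. Chunking per file, instead of joining all texts first, also prevents a chunk from mixing content from two documents.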

---

### ⚡ **Track 5: Real-World Application**
**The Question**: Can you build something actually useful?

**Project Ideas:**
- **Study Assistant**: Upload your course materials, create a personal tutor
- **Research Helper**: Upload papers from your field, ask comparative questions
- **Policy Bot**: Upload company/school policies, create an internal help system
- **Personal Knowledge Base**: Upload your notes, papers, articles

**Success Metrics**: Practical utility, user satisfaction, real-world accuracy

---

### 📊 **Track 6: Evaluation & Quality Analysis**
**The Question**: How do we measure if our RAG system is actually good?

**Evaluation Methods to Build:**
- **Answer quality rubric**: Rate responses on accuracy, relevance, completeness
- **Retrieval evaluation**: Check if the right chunks were found
- **Speed benchmarking**: Measure response times across configurations
- **Hallucination detection**: Identify when the model makes things up

**Success Metrics**: Systematic quality measurement, performance optimization
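
For the retrieval evaluation idea, a simple measure is the hit rate at k: the fraction of test questions for which the known relevant chunk appears among the top k retrieved IDs. The sketch below assumes you have labelled one relevant chunk ID per question by hand. The function name and all IDs are made up.

```python
def hit_rate_at_k(retrieved_ids_per_question, relevant_id_per_question, k=3):
    """Fraction of questions whose relevant chunk ID is among the top-k retrieved IDs."""
    if not relevant_id_per_question:
        raise ValueError("need at least one labelled question")
    hits = 0
    for retrieved, relevant in zip(retrieved_ids_per_question, relevant_id_per_question):
        if relevant in retrieved[:k]:
            hits += 1
    return hits / len(relevant_id_per_question)

# Hypothetical evaluation set: retrieved IDs per question, best match first
retrieved = [["1", "4", "7"], ["2", "3", "9"], ["5", "6", "0"]]
relevant = ["4", "8", "0"]  # hand-labelled correct chunk for each question
print(hit_rate_at_k(retrieved, relevant, k=3))
print(hit_rate_at_k(retrieved, relevant, k=1))
```

Comparing this number across chunk sizes, embedding models or retrieval counts gives the other tracks a shared yardstick for the retrieval step, independent of how well the language model phrases its answer.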

---

## 📝 **Documentation Tips**

As you experiment, keep track of:
- ✅ **What you tried** (specific configurations, parameters)
- ✅ **What worked** (successful approaches and why)
- ✅ **What didn't work** (failures teach us too!)
- ✅ **Surprising discoveries** (unexpected results often lead to breakthroughs)
- ✅ **Practical insights** (what would you use in a real project?)

---

## 🤝 **Collaboration Encouraged!**

- **Team up** with classmates to tackle different tracks
- **Share findings** - compare results across different approaches
- **Peer review** each other's experiments
- **Present discoveries** to the class

---

## 🌟 **Remember**

> *"The best way to understand RAG is not just to build it, but to break it, improve it, and push its boundaries."*

**Every expert started as a curious experimenter. Every breakthrough began with someone asking "What if...?"**

Ready to become a RAG researcher? Pick your track and start experimenting! 🚀