# RAG Workshop - Part 1: Fundamentals

In this notebook, we'll explore the core concepts behind RAG:
1. **The Problem**: Why LLMs need external knowledge
2. **Chunking**: How to split documents for retrieval
3. **Embeddings**: How to represent text as vectors
4. **Vector Search**: How to find similar content

## Setup

First, let's load our environment variables and confirm the API key is available.

In [1]:
import os
from pathlib import Path
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

# Verify API key is set
if not os.getenv("GOOGLE_API_KEY"):
    print("⚠️ GOOGLE_API_KEY not found. Please set it in your .env file.")
else:
    print("✅ API key loaded successfully!")

✅ API key loaded successfully!


## 1. The Problem: LLM Hallucination

Let's first see what happens when we ask an LLM about our fictional university courses.

In [3]:
from langchain_google_genai import ChatGoogleGenerativeAI

# Initialize Gemini
llm = ChatGoogleGenerativeAI(
    model="gemini-2.0-flash",
    temperature=0
)

# Ask about our fictional courses
response = llm.invoke("What are the prerequisites for CS401 at Fictional University?")
print("LLM Response:")
print(response.content)

LLM Response:
To answer your question accurately, I need to know the prerequisites for CS401 at Fictional University. Here's how we can find that information:

1.  **Check the Fictional University's Course Catalog:** This is the most reliable source. Look for the official course catalog (usually available online) and search for CS401. The prerequisites will be listed in the course description.

2.  **Search the Fictional University's Computer Science Department Website:** The CS department website might have a list of courses and their prerequisites.

3.  **Contact the Fictional University's Computer Science Department:** If you can't find the information online, you can email or call the CS department directly. They will be able to tell you the prerequisites.

**Example of what you might find:**

*   **CS401: Data Structures and Algorithms**
    *   **Prerequisites:** CS201 (Introduction to Programming) and MATH220 (Discrete Mathematics)

**Without knowing the specific prerequisites l

### LLM Control Parameters

| Parameter | What it does | Typical values | When to use |
|-----------|--------------|----------------|-------------|
| **temperature** | Controls randomness of token sampling | 0-1 | 0-0.3 for RAG/factual, 0.7-1 for creative writing |
| **max_output_tokens** | Caps the response length in tokens | 256-4096 | Set based on expected answer length |
| **top_p** | Nucleus sampling: samples from the smallest set of tokens whose cumulative probability reaches p | 0.1-1.0 | Lower (0.1-0.5) for focused, higher for diverse |
| **top_k** | Samples only from the K most likely tokens | 1-100 | Lower for deterministic, higher for variety |

**For RAG applications**: Use `temperature=0` to `0.3` for factual, consistent answers.
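To make the table concrete, here is a minimal sketch of how these three sampling parameters reshape a next-token distribution. The function `next_token_probs`, the toy logits, and the order of the filters (temperature, then top-k, then top-p) are assumptions for illustration; real providers may apply them in a different order.

```python
import math

def next_token_probs(logits, temperature=1.0, top_k=None, top_p=None):
    """Toy model of sampling controls (hypothetical helper, not a library API)."""
    if temperature <= 0:
        # Treat temperature 0 as greedy decoding: all mass on the best token.
        best = max(logits, key=logits.get)
        return {best: 1.0}

    # Temperature-scaled softmax (subtract the max for numerical stability).
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())
    weights = {tok: math.exp(v - peak) for tok, v in scaled.items()}
    total = sum(weights.values())
    ranked = sorted(((tok, w / total) for tok, w in weights.items()),
                    key=lambda pair: pair[1], reverse=True)

    # top_k: keep only the K most likely tokens.
    if top_k is not None:
        ranked = ranked[:top_k]

    # top_p: keep the smallest prefix whose cumulative probability reaches p.
    if top_p is not None:
        kept, cumulative = [], 0.0
        for tok, p in ranked:
            kept.append((tok, p))
            cumulative += p
            if cumulative >= top_p:
                break
        ranked = kept

    # Renormalize the surviving tokens so they sum to 1.
    total = sum(p for _, p in ranked)
    return {tok: p / total for tok, p in ranked}

# Made-up logits for four candidate tokens.
toy_logits = {"CS301": 2.0, "CS201": 1.0, "MATH220": 0.5, "none": 0.0}

for t in (0, 0.3, 1.0):
    probs = next_token_probs(toy_logits, temperature=t)
    print(t, {tok: round(p, 3) for tok, p in probs.items()})
```

Lower temperatures concentrate probability on the top token, which is why values near 0 give more consistent answers.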

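**Observation**: Without access to the course catalog, the LLM either:
- Admits it doesn't know (good!), or
- Makes up an answer (hallucination - bad!)

In the response above it did both: it said it needs the catalog, then offered an invented "example" (CS401 as Data Structures and Algorithms). The actual syllabus, which we search later in this notebook, describes CS401 as Deep Learning with CS301 as its prerequisite.

This is the fundamental problem RAG solves. Let's learn how!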

---

## 2. Chunking: Breaking Documents into Pieces

### Why do we chunk?
1. **Context window limits**: LLMs can only process a limited amount of text at once
2. **Retrieval granularity**: We want to retrieve relevant portions, not entire documents
3. **Semantic coherence**: Each chunk should be a meaningful unit

In [4]:
# Load a sample document
data_path = Path("../data/syllabi/CS301.md")
with open(data_path, "r", encoding="utf-8") as f:
    document = f.read()

print(f"Document length: {len(document)} characters")
print("\nFirst 500 characters:")
print(document[:500])

Document length: 4197 characters

First 500 characters:
# CS301: Introduction to Machine Learning

## Course Information
- **Course Code:** CS301
- **Credits:** 4
- **Prerequisites:** CS201 (Data Structures), MATH201 (Linear Algebra), STAT101 (Statistics)
- **Instructor:** Dr. Emily Watson
- **Office:** AI Research Center, Room 205
- **Email:** emily.watson@fictional.edu
- **Semester:** Fall only

## Course Description

This course provides a comprehensive introduction to machine learning, covering both theoretical foundations and practical applicati


### Chunking Strategies

Let's compare different chunk sizes and see the impact.

In [5]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

def chunk_document(text, chunk_size, chunk_overlap):
    """Split a document into chunks and return them as a list of strings."""
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        separators=["\n\n", "\n", ". ", " ", ""]
    )
    chunks = splitter.split_text(text)
    return chunks

# Compare different chunk sizes
chunk_sizes = [200, 500, 1000]

print("Chunk Size Comparison:")
print("=" * 50)

for size in chunk_sizes:
    chunks = chunk_document(document, chunk_size=size, chunk_overlap=50)
    avg_len = sum(len(c) for c in chunks) / len(chunks)
    print(f"\nChunk size: {size}")
    print(f"  Number of chunks: {len(chunks)}")
    print(f"  Average chunk length: {avg_len:.0f} chars")

Chunk Size Comparison:

Chunk size: 200
  Number of chunks: 37
  Average chunk length: 115 chars

Chunk size: 500
  Number of chunks: 12
  Average chunk length: 352 chars

Chunk size: 1000
  Number of chunks: 5
  Average chunk length: 841 chars


### Let's Look at the Chunks

In [6]:
# Create chunks with our chosen size
chunks = chunk_document(document, chunk_size=500, chunk_overlap=50)

print("Sample Chunks (500 chars):")
print("=" * 50)

for i, chunk in enumerate(chunks[:3]):
    print(f"\n--- Chunk {i+1} ({len(chunk)} chars) ---")
    print(chunk[:300] + "..." if len(chunk) > 300 else chunk)

Sample Chunks (500 chars):

--- Chunk 1 (366 chars) ---
# CS301: Introduction to Machine Learning

## Course Information
- **Course Code:** CS301
- **Credits:** 4
- **Prerequisites:** CS201 (Data Structures), MATH201 (Linear Algebra), STAT101 (Statistics)
- **Instructor:** Dr. Emily Watson
- **Office:** AI Research Center, Room 205
- **Email:** emily.wat...

--- Chunk 2 (303 chars) ---
## Course Description

This course provides a comprehensive introduction to machine learning, covering both theoretical foundations and practical applications. Students will learn the fundamental algorithms and techniques used to build systems that can learn from data and make predictions or decisio...

--- Chunk 3 (477 chars) ---
The course bridges the gap between mathematical theory and real-world implementation. Students will gain hands-on experience with popular machine learning libraries (scikit-learn, pandas, numpy) while developing a deep understanding of the underlying algorithms.

Machine learni

### Chunking Trade-offs

| Chunk Size | Pros | Cons |
|------------|------|------|
| **Small (200)** | Precise retrieval | May lose context |
| **Medium (500)** | Good balance of precision and context; usually the sweet spot | Can still split related content across chunks |
| **Large (1000)** | More context | May include irrelevant info |
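To see what `chunk_overlap` buys us, here is a minimal sketch of a fixed-size sliding-window chunker. This is a simplified stand-in, not how `RecursiveCharacterTextSplitter` works (that splitter prefers to break on separators such as paragraphs and sentences); the function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size windows that share `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
pieces = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=3)
print(pieces)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk starts with the last three characters of the previous one, so a sentence that straddles a boundary is still partly visible in both chunks.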

---

## 3. Embeddings: Text to Vectors

Embeddings convert text into numerical vectors that capture semantic meaning.

**Key insight**: Similar meanings → similar vectors

In [7]:
from sentence_transformers import SentenceTransformer
import numpy as np

# Load embedding model (runs locally, no API needed)
print("Loading embedding model...")
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
print("✅ Model loaded!")

Loading embedding model...



✅ Model loaded!


In [13]:
# Let's embed some example sentences
sentences = [
    "The cat sat on the mat",
    "A feline rested on a rug",
    "The stock market crashed yesterday",
    "Machine learning is a subset of AI",
    "Deep learning uses neural networks"
]

# Generate embeddings
embeddings = embedding_model.encode(sentences)

print(f"Embedding shape: {embeddings.shape}")
print(f"Each sentence becomes a vector of {embeddings.shape[1]} dimensions")

Embedding shape: (5, 384)
Each sentence becomes a vector of 384 dimensions


### Semantic Similarity

Let's compute similarity between sentences using cosine similarity.

In [14]:
from sklearn.metrics.pairwise import cosine_similarity

# Compute similarity matrix
similarity_matrix = cosine_similarity(embeddings)

print("Sentence Similarity Matrix:")
print("=" * 60)

# Print as a readable table
for i, sent in enumerate(sentences):
    print(f"\n[{i}] {sent[:40]}..." if len(sent) > 40 else f"\n[{i}] {sent}")

print("\n\nSimilarity scores:")
print("       ", "  ".join([f"[{i}]" for i in range(len(sentences))]))
for i, row in enumerate(similarity_matrix):
    scores = "  ".join([f"{s:.2f}" for s in row])
    print(f"[{i}]    {scores}")

Sentence Similarity Matrix:

[0] The cat sat on the mat

[1] A feline rested on a rug

[2] The stock market crashed yesterday

[3] Machine learning is a subset of AI

[4] Deep learning uses neural networks


Similarity scores:
        [0]  [1]  [2]  [3]  [4]
[0]    1.00  0.56  0.11  -0.05  -0.08
[1]    0.56  1.00  0.08  -0.11  -0.04
[2]    0.11  0.08  1.00  0.05  0.04
[3]    -0.05  -0.11  0.05  1.00  0.43
[4]    -0.08  -0.04  0.04  0.43  1.00


**Key observations:**
- Sentences [0] and [1] share no content words (cat/mat vs feline/rug) but mean nearly the same thing, and they get the highest off-diagonal score (0.56)
- Sentences [3] and [4] are both about ML and score 0.43
- Sentence [2] (stock market) is unrelated to the others, and its scores stay at or below 0.11
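For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. Here is a minimal pure-Python sketch of that formula; the helper name `cosine` and the tiny 2-dimensional vectors are illustrative only, and returning 0.0 for a zero vector is a convention chosen for this sketch.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|), in the range [-1, 1]."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention for this sketch: a zero vector matches nothing
    return dot / (norm_u * norm_v)

print(cosine([1, 0], [1, 0]))   # same direction
print(cosine([1, 0], [0, 1]))   # orthogonal
print(cosine([1, 0], [-1, 0]))  # opposite direction
print(cosine([1, 2], [2, 4]))   # same direction, different length
```

Because only direction matters, scaling a vector does not change its score, and negative values (like the -0.05 in the matrix) simply mean the vectors point slightly away from each other.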

---

## 4. Vector Search with ChromaDB

Now let's put it together: chunk our course documents, embed them, and search!
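Before handing this to a database, it helps to see the idea in miniature: score every stored vector against the query vector and keep the best `k`. The sketch below is a brute-force version under stated assumptions: the chunk IDs and 3-dimensional vectors are made up, and `top_k_search` is a hypothetical helper. ChromaDB uses an approximate index and its own configured distance metric rather than this exhaustive loop.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def top_k_search(query_vec, store, k=3):
    """Return the k (chunk_id, score) pairs most similar to query_vec."""
    scored = [(chunk_id, cosine(query_vec, vec)) for chunk_id, vec in store.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Made-up 3-dimensional "embeddings" keyed by hypothetical chunk IDs.
toy_store = {
    "ml_chunk": [0.9, 0.1, 0.0],
    "dl_chunk": [0.7, 0.3, 0.1],
    "algebra_chunk": [0.1, 0.9, 0.2],
    "history_chunk": [0.0, 0.1, 0.9],
}

query_vec = [1.0, 0.2, 0.0]  # pretend this is the embedded question
for chunk_id, score in top_k_search(query_vec, toy_store, k=2):
    print(f"{chunk_id}: {score:.3f}")
```

The exhaustive loop costs one similarity computation per stored chunk, which is fine for a few dozen chunks and is the reason larger collections rely on an index.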

In [15]:
import chromadb
from chromadb.utils import embedding_functions

# Create a ChromaDB client (in-memory for this demo)
client = chromadb.Client()

# Use sentence-transformers for embeddings
embedding_fn = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)

# Create a collection
collection = client.create_collection(
    name="course_syllabus",
    embedding_function=embedding_fn
)

print("✅ ChromaDB collection created!")

✅ ChromaDB collection created!


In [16]:
# Load all syllabi and add to collection
syllabi_path = Path("../data/syllabi")

all_chunks = []
all_ids = []
all_metadata = []

for filepath in syllabi_path.glob("*.md"):
    with open(filepath, "r", encoding="utf-8") as f:
        content = f.read()
    
    # Chunk the document
    chunks = chunk_document(content, chunk_size=500, chunk_overlap=50)
    
    # Create IDs and metadata
    for i, chunk in enumerate(chunks):
        all_chunks.append(chunk)
        all_ids.append(f"{filepath.stem}_{i}")
        all_metadata.append({"source": filepath.stem})

# Add to collection
collection.add(
    documents=all_chunks,
    ids=all_ids,
    metadatas=all_metadata
)

print(f"✅ Added {len(all_chunks)} chunks from {len(list(syllabi_path.glob('*.md')))} documents")

✅ Added 86 chunks from 8 documents


### Let's Search!

In [17]:
def search(query, n_results=3):
    """Search for relevant chunks."""
    results = collection.query(
        query_texts=[query],
        n_results=n_results
    )
    return results

# Test search
query = "What topics does the machine learning course cover?"
results = search(query)

print(f"Query: {query}")
print("=" * 60)

for i, (doc, metadata) in enumerate(zip(results['documents'][0], results['metadatas'][0])):
    print(f"\n--- Result {i+1} (from {metadata['source']}) ---")
    print(doc[:300] + "..." if len(doc) > 300 else doc)

Query: What topics does the machine learning course cover?

--- Result 1 (from CS301) ---
The course bridges the gap between mathematical theory and real-world implementation. Students will gain hands-on experience with popular machine learning libraries (scikit-learn, pandas, numpy) while developing a deep understanding of the underlying algorithms.

Machine learning is transforming ind...

--- Result 2 (from CS401) ---
# CS401: Deep Learning

## Course Information
- **Course Code:** CS401
- **Credits:** 4
- **Prerequisites:** CS301 (Introduction to Machine Learning)
- **Instructor:** Dr. James Liu
- **Office:** AI Research Center, Room 312
- **Email:** james.liu@fictional.edu
- **Semester:** Spring only

## Course...

--- Result 3 (from CS301) ---
## Topics Covered

### Module 1: Foundations (Weeks 1-2)
- What is machine learning?
- Types of ML: supervised, unsupervised, reinforcement
- The ML pipeline: data collection, preprocessing, modeling, evaluation
- Python ML ecosystem (numpy

In [18]:
# Try another query
query = "Who teaches linear algebra?"
results = search(query)

print(f"Query: {query}")
print("=" * 60)

for i, (doc, metadata) in enumerate(zip(results['documents'][0], results['metadatas'][0])):
    print(f"\n--- Result {i+1} (from {metadata['source']}) ---")
    print(doc[:300] + "..." if len(doc) > 300 else doc)

Query: Who teaches linear algebra?

--- Result 1 (from MATH201) ---
Linear algebra provides the mathematical foundation for machine learning, computer graphics, quantum mechanics, and many other fields. The concepts learned in this course are essential for advanced study in mathematics and for practical applications in science and technology.

--- Result 2 (from MATH201) ---
## Course Description

This course introduces the fundamental concepts of linear algebra, including vectors, matrices, linear transformations, and eigenvalues. Linear algebra is one of the most widely used areas of mathematics, with applications in computer science, engineering, physics, economics, ...

--- Result 3 (from MATH201) ---
## Why Linear Algebra Matters

Linear algebra is essential for:
- **Machine Learning:** Neural networks, dimensionality reduction
- **Computer Graphics:** 3D transformations, rendering
- **Data Science:** Principal component analysis, regression
- **Engineering:** Signal processing, c

### The Limitation: Relationship Questions

In [19]:
# This query is harder - it requires understanding relationships
query = "Can I take CS401 if I've only completed CS101?"
results = search(query)

print(f"Query: {query}")
print("=" * 60)

for i, (doc, metadata) in enumerate(zip(results['documents'][0], results['metadatas'][0])):
    print(f"\n--- Result {i+1} (from {metadata['source']}) ---")
    print(doc[:300] + "..." if len(doc) > 300 else doc)

Query: Can I take CS401 if I've only completed CS101?

--- Result 1 (from CS201) ---
Building on the programming foundations from CS101, this course takes students deeper into how data is organized, stored, and manipulated efficiently. Understanding these concepts is essential for success in advanced computer science courses and professional software development.

## Learning Object...

--- Result 2 (from CS201) ---
# CS201: Data Structures and Algorithms

## Course Information
- **Course Code:** CS201
- **Credits:** 4
- **Prerequisites:** CS101 (Introduction to Programming)
- **Instructor:** Prof. Michael Torres
- **Office:** Engineering Building, Room 415
- **Email:** michael.torres@fictional.edu
- **Semester...

--- Result 3 (from CS401) ---
## Prerequisites in Detail

**CS301 is strictly required.** This course assumes familiarity with:
- Machine learning fundamentals (supervised/unsupervised learning)
- Model evaluation and validation
- Python programming and numpy
- Basic neural 

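**Notice**: Vector search returns chunks that are similar to the query, but it does not follow the full prerequisite chain (CS401 → CS301 → CS201 → CS101). The top two results come from CS201, and the one CS401 chunk only states the direct prerequisite (CS301).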

This is where:
- **RAG with good prompting** can help
- **Graph RAG** excels (captures relationships explicitly)
- **Agentic RAG** can reason through multiple steps

We'll explore these in the next notebooks!
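As a preview of why explicit relationships help, here is a minimal sketch that stores the prerequisites seen in the search results above as a small graph and walks it. The `PREREQS` dictionary and both helper functions are hypothetical; courses without an entry (CS101, MATH201, STAT101) are assumed to have no prerequisites, because their prerequisite lines do not appear in the outputs shown in this notebook.

```python
# Direct prerequisites as listed in the syllabus chunks printed above.
PREREQS = {
    "CS401": ["CS301"],
    "CS301": ["CS201", "MATH201", "STAT101"],
    "CS201": ["CS101"],
}

def all_prerequisites(course, prereqs):
    """Every course required before `course`, directly or transitively."""
    required = set()
    stack = list(prereqs.get(course, []))
    while stack:
        current = stack.pop()
        if current not in required:
            required.add(current)
            # Assumption: a course missing from the dict has no prerequisites.
            stack.extend(prereqs.get(current, []))
    return required

def missing_for(course, completed, prereqs):
    """Courses still needed before taking `course`, given completed ones."""
    return sorted(all_prerequisites(course, prereqs) - set(completed))

print(missing_for("CS401", {"CS101"}, PREREQS))
# ['CS201', 'CS301', 'MATH201', 'STAT101']
```

A similarity search has no notion of "follow the edge to the next course", whereas this traversal answers the CS401 question directly once the relationships are stored.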

---

## Summary

In this notebook, we learned:

1. **The Problem**: LLMs hallucinate when they don't have the right knowledge

2. **Chunking**: Breaking documents into meaningful pieces
   - Balance chunk size: not too small (loses context), not too large (noise)
   - Use overlap to preserve context at boundaries

3. **Embeddings**: Converting text to vectors
   - Similar meanings → similar vectors
   - Enable semantic search (not just keyword matching)

4. **Vector Search**: Finding relevant content
   - ChromaDB stores and searches embeddings
   - Returns top-k most similar chunks

**Next**: In notebook 02, we'll build a complete RAG pipeline that combines retrieval with LLM generation!