# RAG Workshop - Part 2: Building the RAG Pipeline

In this notebook, we'll build a complete RAG system:
1. **Load & Index**: Process documents into a vector store
2. **Retrieve**: Find relevant context for queries
3. **Generate**: Use an LLM with the retrieved context
4. **Test**: See what works and what doesn't

## Setup

In [9]:
import os
from pathlib import Path
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

# Verify API key
assert os.getenv("GOOGLE_API_KEY"), "Please set GOOGLE_API_KEY in .env file"
print("✅ Environment loaded!")

✅ Environment loaded!


---

## 1. Load Documents

We'll use LangChain's document loaders to read our course syllabi.

In [10]:
from langchain_community.document_loaders import DirectoryLoader, TextLoader

# Load all markdown files from syllabi folder
loader = DirectoryLoader(
    "../data/syllabi",
    glob="**/*.md",
    loader_cls=TextLoader,
    loader_kwargs={"encoding": "utf-8"}
)

documents = loader.load()

print(f"Loaded {len(documents)} documents")
for doc in documents:
    print(f"  - {Path(doc.metadata['source']).name}: {len(doc.page_content)} chars")

Loaded 8 documents
  - CS101.md: 3066 chars
  - CS201.md: 3516 chars
  - CS301.md: 4197 chars
  - CS401.md: 4455 chars
  - CS501.md: 4324 chars
  - MATH101.md: 3569 chars
  - MATH201.md: 3968 chars
  - STAT101.md: 4084 chars


## 2. Chunk Documents

In [11]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Create text splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=["\n## ", "\n### ", "\n\n", "\n", ". ", " ", ""]
)

# Split documents
chunks = text_splitter.split_documents(documents)

print(f"Created {len(chunks)} chunks from {len(documents)} documents")
print(f"\nSample chunk:")
print("-" * 40)
print(chunks[5].page_content[:300])

Created 90 chunks from 8 documents

Sample chunk:
----------------------------------------
### Module 3: Functions (Weeks 5-6)
- Defining and calling functions
- Parameters and return values
- Scope and lifetime of variables
- Built-in functions and modules

### Module 4: Data Structures (Weeks 7-9)
- Lists and list operations
- Strings and string manipulation
- Dictionaries and sets
- Ne
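
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a fixed-size sliding window. It is a simplification: `RecursiveCharacterTextSplitter` also tries to break on the listed separators first, so its chunks will not match these exactly. The function name `sliding_window_chunks` and the tiny sizes are made up for illustration.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size windows that share `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
pieces = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=3)
for piece in pieces:
    print(piece)
```

Each chunk starts with the last three characters of the previous one. That repeated tail is what lets a sentence cut at a chunk boundary still appear whole in one of the two chunks.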


## 3. Create Vector Store

In [5]:
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_chroma import Chroma

# Use local embedding model (no API cost!)
print("Loading embedding model...")
embeddings = HuggingFaceEmbeddings(
    model_name="all-MiniLM-L6-v2",
    model_kwargs={"device": "cpu"}
)

# Create vector store
print("Creating vector store...")
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    collection_name="course_advisor"
)

print(f"✅ Vector store created with {len(chunks)} chunks!")

Loading embedding model...
Creating vector store...
✅ Vector store created with 90 chunks!
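
Under the hood, a vector store ranks chunks by how close their embedding vectors are to the query's vector. The sketch below shows that idea with cosine similarity on made-up 3-dimensional vectors (real `all-MiniLM-L6-v2` embeddings have 384 dimensions). The names `cosine_similarity`, `top_k`, and `toy_index` are hypothetical, and the distance metric Chroma actually applies depends on how the collection is configured.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, index, k):
    """Return the names of the k entries most similar to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), name) for name, vec in index.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:k]]


# Toy "embeddings": invented numbers, not real model output
toy_index = {
    "CS301 topics": [0.9, 0.1, 0.0],
    "MATH201 topics": [0.2, 0.9, 0.1],
    "CS101 grading": [0.0, 0.2, 0.9],
}
query_vec = [1.0, 0.2, 0.0]

ranked = top_k(query_vec, toy_index, k=2)
print(ranked)
```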


## 4. Create Retriever

In [12]:
# Create retriever
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 4}  # Return top 4 chunks
)

# Test retrieval
query = "What topics are covered in machine learning?"
docs = retriever.invoke(query)

print(f"Query: {query}")
print(f"Retrieved {len(docs)} chunks:")
for i, doc in enumerate(docs):
    source = Path(doc.metadata['source']).stem
    print(f"\n[{i+1}] From {source}:")
    print(doc.page_content[:200] + "...")

Query: What topics are covered in machine learning?
Retrieved 4 chunks:

[1] From CS301:
## Course Description

This course provides a comprehensive introduction to machine learning, covering both theoretical foundations and practical applications. Students will learn the fundamental algo...

[2] From CS301:
## Topics Covered

### Module 1: Foundations (Weeks 1-2)
- What is machine learning?
- Types of ML: supervised, unsupervised, reinforcement
- The ML pipeline: data collection, preprocessing, modeling,...

[3] From CS301:
Upon successful completion of this course, students will be able to:
- Understand the mathematical foundations of machine learning algorithms
- Implement and apply supervised learning algorithms (regr...

[4] From CS301:
The course bridges the gap between mathematical theory and real-world implementation. Students will gain hands-on experience with popular machine learning libraries (scikit-learn, pandas, numpy) while...


## 5. Set Up LLM

In [13]:
from langchain_google_genai import ChatGoogleGenerativeAI

# Initialize Gemini
llm = ChatGoogleGenerativeAI(
    model="gemini-2.0-flash",
    temperature=0,
    max_output_tokens=1024
)

print("✅ LLM initialized!")

✅ LLM initialized!


## 6. Build RAG Chain

Now we combine retrieval + LLM generation.

In [15]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

# RAG prompt template
template = """You are a helpful course advisor for Fictional University.
Answer the question based ONLY on the following context. 
If the context doesn't contain enough information, say "I don't have enough information to answer that."

Context:
{context}

Question: {question}

Answer:"""

prompt = ChatPromptTemplate.from_template(template)

# Helper to format documents
def format_docs(docs):
    return "\n\n---\n\n".join(
        f"[Source: {Path(doc.metadata['source']).stem}]\n{doc.page_content}"
        for doc in docs
    )

# Build the RAG chain
rag_chain = (
    {
        "context": retriever | format_docs,
        "question": RunnablePassthrough()
    }
    | prompt
    | llm
    | StrOutputParser()
)

print("✅ RAG chain built!")

✅ RAG chain built!


### Understanding the LCEL Chain Syntax

The `|` operator is part of the **LangChain Expression Language (LCEL)**: it pipes each step's output into the next step, much like Unix pipes.

```
"What is CS301?" ──► { parallel execution } ──► prompt ──► llm ──► StrOutputParser ──► "Answer..."
```

**Breaking it down:**

| Component | What it does |
|-----------|--------------|
| `{"context": ..., "question": ...}` | Creates a dict with two parallel branches |
| `retriever \| format_docs` | Finds relevant docs → formats them as a string |
| `RunnablePassthrough()` | Passes input unchanged (identity function) |
| `prompt` | Fills template placeholders with the dict values |
| `llm` | Sends prompt to Gemini, returns AI message |
| `StrOutputParser()` | Extracts just the text from the response |

**Data flow example:**
```python
# Input: "What is CS301?"

# After first stage (parallel dict):
{"context": "[Source: CS301]\nMachine learning course...", "question": "What is CS301?"}

# After prompt: ChatPromptValue (chat messages) with context + question filled in
# After llm: AIMessage object with response
# After StrOutputParser: "CS301 is a machine learning course..."
```
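
If the `|` syntax feels magical, the toy sketch below shows one way such piping can be built with Python's `__or__` method. It is not LangChain's implementation: `Step`, `parallel`, and the fake retriever and prompt are hypothetical stand-ins that only mimic the shape of the chain above.

```python
class Step:
    """Wraps a function so steps can be chained with `|` (toy stand-in for a Runnable)."""

    def __init__(self, func):
        self.func = func

    def invoke(self, value):
        return self.func(value)

    def __or__(self, other):
        # a | b builds a new Step that runs a, then feeds its output to b
        return Step(lambda value: other.invoke(self.invoke(value)))


def parallel(branches):
    """Run every branch on the same input and collect the results in a dict."""
    return Step(lambda value: {key: step.invoke(value) for key, step in branches.items()})


# Hypothetical stand-ins for the retriever, formatter, passthrough, and prompt
fake_retriever = Step(lambda q: ["[Source: CS301]\nIntro to machine learning"])
join_docs = Step(lambda docs: "\n\n---\n\n".join(docs))
passthrough = Step(lambda q: q)
fill_prompt = Step(lambda d: f"Context:\n{d['context']}\n\nQuestion: {d['question']}")

chain = parallel({"context": fake_retriever | join_docs, "question": passthrough}) | fill_prompt
result = chain.invoke("What is CS301?")
print(result)
```

The question enters once, fans out to both branches of the dict, and the merged dict flows into the prompt step, which mirrors the data flow example above.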

## 7. Test the RAG System

Let's test with various types of questions.

In [16]:
def ask(question):
    """Ask a question and show the answer."""
    print(f"Question: {question}")
    print("-" * 50)
    answer = rag_chain.invoke(question)
    print(f"Answer: {answer}")
    print("\n")

In [17]:
# Test 1: Simple factual question
ask("What topics are covered in the Machine Learning course?")

Question: What topics are covered in the Machine Learning course?
--------------------------------------------------
Answer: The Machine Learning course covers the following topics:

Module 1: Foundations (Weeks 1-2)
- What is machine learning?
- Types of ML: supervised, unsupervised, reinforcement
- The ML pipeline: data collection, preprocessing, modeling, evaluation
- Python ML ecosystem (numpy, pandas, scikit-learn)

Module 2: Supervised Learning - Regression (Weeks 3-4)
- Linear regression
- Polynomial regression
- Regularization (Ridge, Lasso)
- Gradient descent optimization




In [18]:
# Test 2: Who teaches a course?
ask("Who teaches Linear Algebra?")

Question: Who teaches Linear Algebra?
--------------------------------------------------
Answer: I don't have enough information to answer that.




In [19]:
# Test 3: Prerequisites question
ask("What are the prerequisites for the Deep Learning course?")

Question: What are the prerequisites for the Deep Learning course?
--------------------------------------------------
Answer: The prerequisite for CS401 (Deep Learning) is CS301 (Introduction to Machine Learning).




In [20]:
# Test 4: Comparison question
ask("What's the difference between CS301 and CS401?")

Question: What's the difference between CS301 and CS401?
--------------------------------------------------
Answer: I don't have enough information to answer that.




### The Challenge: Complex Relationship Questions

In [21]:
# Test 5: This is harder!
ask("Can I take CS401 (Deep Learning) if I've only completed CS101?")

Question: Can I take CS401 (Deep Learning) if I've only completed CS101?
--------------------------------------------------
Answer: No, you cannot take CS401 (Deep Learning) if you've only completed CS101. CS401 requires CS301 (Introduction to Machine Learning) as a prerequisite.




In [22]:
# Test 6: Learning path question
ask("I want to become an NLP specialist. What courses should I take and in what order?")

Question: I want to become an NLP specialist. What courses should I take and in what order?
--------------------------------------------------
Answer: I don't have enough information to answer that.




### Analysis: What Works and What Doesn't?

| Query Type | Works Well? | Why |
|------------|-------------|-----|
| Simple facts | ✅ Yes | Direct retrieval |
| Who teaches X? | ⚠️ Sometimes | Info sits in a single chunk, but that chunk must be retrieved (Test 2 missed it) |
| Prerequisites for X? | ✅ Mostly | Usually in same doc |
| Can I take X given Y? | ⚠️ Sometimes | Needs reasoning across chunks |
| Full learning path | ❌ Often fails | Needs multi-hop reasoning |

**The limitation**: Simple RAG retrieves relevant chunks, but can't reason across them or traverse relationship chains.
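
To see what "traversing a relationship chain" involves, here is a small sketch that walks a prerequisite graph stored as a plain dict. Only the `CS401` to `CS301` edge comes from the Test 3 answer above; the other edges are hypothetical placeholders, and `prerequisite_chain` is an illustrative helper, not part of the pipeline.

```python
def prerequisite_chain(course, prereqs):
    """Return every course required before `course`, earliest first."""
    ordered, seen = [], set()

    def visit(name):
        for required in prereqs.get(name, []):
            if required not in seen:
                seen.add(required)
                visit(required)           # resolve deeper prerequisites first
                ordered.append(required)

    visit(course)
    return ordered


prereqs = {
    "CS401": ["CS301"],  # from the Test 3 answer above
    "CS301": ["CS201"],  # hypothetical edge for illustration
    "CS201": ["CS101"],  # hypothetical edge for illustration
}

chain = prerequisite_chain("CS401", prereqs)
completed = {"CS101"}
missing = [c for c in chain if c not in completed]
print("Full chain:", chain)
print("Still missing:", missing)
```

Answering a learning-path question needs every hop in this walk, yet each hop lives in a different syllabus. A single similarity search for the original question has no reason to return all of them.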

---

## 8. Adding Source Attribution

Let's enhance our system to show which sources were used.

In [23]:
def ask_with_sources(question):
    """Ask a question and show answer with sources."""
    # Get relevant documents
    docs = retriever.invoke(question)
    
    # Format context
    context = format_docs(docs)
    
    # Generate answer
    messages = prompt.format_messages(context=context, question=question)
    response = llm.invoke(messages)
    
    print(f"Question: {question}")
    print("=" * 50)
    print(f"\nAnswer: {response.content}")
    
    print(f"\n📚 Sources used:")
    sources = set(Path(doc.metadata['source']).stem for doc in docs)
    for source in sources:
        print(f"  - {source}")
    print()

In [24]:
ask_with_sources("What programming languages are used in the courses?")

Question: What programming languages are used in the courses?
==================================================

Answer: CS101 uses Python. I don't have information about the programming languages used in CS201.

📚 Sources used:
  - CS201
  - CS101
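
One detail worth knowing: the `set` in `ask_with_sources` removes duplicates but does not keep retrieval order, so the printed list says nothing about which source ranked highest. A possible alternative is `dict.fromkeys`, which deduplicates while preserving first-seen order. The helper name `unique_sources` and the example paths below are hypothetical.

```python
from pathlib import Path


def unique_sources(paths):
    """Deduplicate source names while keeping first-seen (retrieval) order."""
    return list(dict.fromkeys(Path(p).stem for p in paths))


# Hypothetical retrieval result: four chunks drawn from two files
retrieved_paths = [
    "../data/syllabi/CS101.md",
    "../data/syllabi/CS201.md",
    "../data/syllabi/CS101.md",
    "../data/syllabi/CS201.md",
]

sources = unique_sources(retrieved_paths)
print(sources)
```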



---

## 9. Experimenting with Parameters

Let's see how changing parameters affects results.

In [26]:
# More chunks = more context
retriever_more = vectorstore.as_retriever(search_kwargs={"k": 8})

# Fewer chunks = more focused
retriever_fewer = vectorstore.as_retriever(search_kwargs={"k": 2})

question = "What math is needed for machine learning?"

print("With k=2 (fewer chunks):")
docs = retriever_fewer.invoke(question)
for doc in docs:
    print(f"  - {Path(doc.metadata['source']).stem}")

print("\nWith k=8 (more chunks):")
docs = retriever_more.invoke(question)
for doc in docs:
    print(f"  - {Path(doc.metadata['source']).stem}")

With k=2 (fewer chunks):
  - CS301
  - CS301

With k=8 (more chunks):
  - CS301
  - CS301
  - CS301
  - CS301
  - CS501
  - MATH201
  - CS301
  - CS401
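
Tallying sources per setting makes the trade-off easier to read. The sketch below simply counts the source names copied from the output above; `summarize` is a hypothetical helper.

```python
from collections import Counter

# Source names copied from the retrieval output above
k2_sources = ["CS301", "CS301"]
k8_sources = ["CS301", "CS301", "CS301", "CS301", "CS501", "MATH201", "CS301", "CS401"]


def summarize(sources):
    """Count chunks per source and the number of distinct sources."""
    counts = Counter(sources)
    return counts, len(counts)


k2_counts, k2_distinct = summarize(k2_sources)
k8_counts, k8_distinct = summarize(k8_sources)

print(f"k=2: {k2_distinct} distinct source(s), {dict(k2_counts)}")
print(f"k=8: {k8_distinct} distinct source(s), {dict(k8_counts)}")
```

With k=2 every chunk comes from CS301. With k=8 the math and deep learning syllabi start to appear, at the cost of a longer prompt.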


---

## Summary

In this notebook, we built a complete RAG pipeline:

1. **Document Loading**: Used LangChain loaders to read markdown files
2. **Chunking**: Split documents into 500-character chunks with a 50-character overlap
3. **Vector Store**: Indexed chunks in ChromaDB with sentence-transformers
4. **Retrieval**: Found relevant chunks using semantic search
5. **Generation**: Combined context with Gemini to generate answers
6. **Source Attribution**: Showed which documents were used

### Limitations Discovered
- Works great for simple factual questions
- Struggles with multi-hop reasoning (prerequisite chains)
- Can't plan learning paths effectively

**Next**: In notebook 03, we'll build an Agentic RAG system that can reason through complex queries!