# Tutorial 08: Basic RAG (Retrieval-Augmented Generation)

In this tutorial, you'll build a foundational RAG system that retrieves relevant documents and uses them to generate informed responses.

**What you'll learn:**
- **Document Loading**: Loading PDFs and text files
- **Chunking**: Splitting documents into searchable pieces
- **Embeddings**: Converting text to vectors with sentence-transformers
- **Vector Storage**: Storing and querying with ChromaDB
- **RAG Pipeline**: Combining retrieval with generation in LangGraph

By the end, you'll have a working RAG system that answers questions using your local documents.

## Why RAG?

Large Language Models have knowledge cutoffs and can hallucinate. RAG mitigates both problems by:

1. **Retrieving** relevant documents for the user's question
2. **Augmenting** the prompt with this context
3. **Generating** an answer grounded in the retrieved information

```
User Question → Retrieve Documents → Augment Prompt → Generate Answer
```
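To make the three stages concrete before introducing any libraries, here is a minimal sketch in plain Python. It is a toy, not the pipeline built in this tutorial: retrieval is a simple word-overlap ranking rather than embedding search, and `toy_generate` is a hypothetical placeholder standing in for the LLM call.

```python
import re

def tokenize(text: str) -> set[str]:
    """Lowercase a string and split it into a set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def toy_retrieve(question: str, corpus: list[str], k: int = 2) -> list[str]:
    """Retrieve: rank passages by how many words they share with the question."""
    q_tokens = tokenize(question)
    ranked = sorted(corpus, key=lambda p: len(q_tokens & tokenize(p)), reverse=True)
    return ranked[:k]

def augment(question: str, passages: list[str]) -> str:
    """Augment: place the retrieved passages in front of the question."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

def toy_generate(prompt: str) -> str:
    """Generate: placeholder for an LLM call; it only reports the prompt size."""
    return f"[LLM would answer here, given a {len(prompt)}-character prompt]"

corpus = [
    "RAG combines retrieval with generation.",
    "ChromaDB stores embeddings on disk.",
    "Bananas are rich in potassium.",
]
question = "What does RAG combine?"

passages = toy_retrieve(question, corpus, k=1)
prompt = augment(question, passages)
print(prompt)
print(toy_generate(prompt))
```

The rest of the tutorial replaces each toy stage with a real component: embedding-based retrieval, a prompt template, and a local Ollama model.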

## Prerequisites

Make sure you have:
1. Ollama running with a model pulled
2. RAG dependencies installed: `pip install -e ".[rag]"`
3. PDF files in the `sources/` directory

In [None]:
# Verify setup
from langgraph_ollama_local import LocalAgentConfig
import os

config = LocalAgentConfig()
print(f"Ollama server: {config.ollama.base_url}")
print(f"Model: {config.ollama.model}")

# Check sources directory
sources_dir = "../../sources"
if os.path.exists(sources_dir):
    files = os.listdir(sources_dir)
    print(f"\nFound {len(files)} files in sources/")
    for f in files[:5]:
        print(f"  - {f}")
else:
    print("\nWarning: sources/ directory not found")

## Step 1: Load Documents

First, we need to load our source documents. The `DocumentLoader` class handles PDFs, text files, and Markdown:

In [None]:
from langgraph_ollama_local.rag import DocumentLoader

# Create loader
loader = DocumentLoader()

# Load all documents from sources/
documents = loader.load_directory("../../sources")

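print(f"Loaded {len(documents)} document pages")

# Preview the first page only if something was loaded (avoids an IndexError)
if documents:
    first = documents[0]
    print("\nFirst document preview:")
    print(f"  Source: {first.metadata.get('source', 'unknown')}")
    print(f"  Page: {first.metadata.get('page', 'N/A')}")
    print(f"  Content: {first.page_content[:200]}...")
else:
    print("\nNo documents loaded; check the path passed to load_directory()")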

## Step 2: Chunk Documents

Documents need to be split into smaller chunks for effective retrieval. Key considerations:

- **Chunk size**: ~1000 characters is a good starting point
- **Overlap**: Some overlap (200 chars) reduces the chance that context is lost at chunk boundaries
- **Sentence boundaries**: Try to break at natural points
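The effect of chunk size and overlap can be sketched with a plain character-window splitter. This is only an illustration under simple assumptions: `chunk_text` is a hypothetical helper, it ignores sentence boundaries entirely, and it is not how `DocumentIndexer` is implemented.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into fixed-size character windows that overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks

sample = "abcdefghij" * 5  # 50 characters
pieces = chunk_text(sample, chunk_size=20, chunk_overlap=5)

for i, piece in enumerate(pieces):
    print(i, len(piece), piece)

# The last 5 characters of one chunk reappear at the start of the next
print(pieces[0][-5:] == pieces[1][:5])
```

Each window advances by `chunk_size - chunk_overlap` characters, so a larger overlap produces more chunks for the same text.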

In [None]:
from langgraph_ollama_local.rag import DocumentIndexer
from langgraph_ollama_local.rag.indexer import IndexerConfig

# Configure chunking (using default collection "documents")
indexer_config = IndexerConfig(
    chunk_size=1000,       # Characters per chunk
    chunk_overlap=200,     # Overlap between chunks
    embedding_model="all-mpnet-base-v2",
)

# Create indexer
indexer = DocumentIndexer(config=indexer_config)

# Chunk the documents
chunks = indexer.chunk_documents(documents)

print(f"Created {len(chunks)} chunks from {len(documents)} pages")
print("\nExample chunk:")
print(f"  Source: {chunks[0].metadata.get('filename', 'unknown')}")
print(f"  Chunk index: {chunks[0].metadata.get('chunk_index', 0)}")
print(f"  Length: {len(chunks[0].page_content)} characters")

## Step 3: Generate Embeddings

Embeddings convert text into numerical vectors that capture semantic meaning. We use `sentence-transformers` for local embedding generation:

| Model | Dimensions | Quality | Speed |
|-------|------------|---------|-------|
| `all-mpnet-base-v2` | 768 | High | Medium |
| `all-MiniLM-L6-v2` | 384 | Good | Fast |
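Semantic similarity between two embeddings is commonly measured with cosine similarity. The sketch below uses hand-written 3-dimensional vectors as stand-ins for real 384- or 768-dimensional embeddings; the vector values are made up for illustration, and the metric actually used by the retriever in this tutorial may differ.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as having no similarity
    return dot / (norm_a * norm_b)

# Made-up vectors standing in for real embeddings
query = [1.0, 0.0, 1.0]
close = [0.9, 0.1, 0.8]   # points in nearly the same direction as the query
far = [0.0, 1.0, 0.0]     # orthogonal to the query

print(f"query vs close: {cosine_similarity(query, close):.3f}")
print(f"query vs far:   {cosine_similarity(query, far):.3f}")
```

Texts with related meaning should produce vectors that score closer to 1.0, which is what makes nearest-neighbor search over embeddings useful for retrieval.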

In [None]:
from langgraph_ollama_local.rag import LocalEmbeddings

# Create embedding model
embeddings = LocalEmbeddings(model_name="all-mpnet-base-v2")

# Test embedding generation
test_texts = [
    "What is Retrieval-Augmented Generation?",
    "RAG combines retrieval with generation.",
]

vectors = embeddings.embed_documents(test_texts)

print(f"Model: {embeddings.model_name}")
print(f"Dimensions: {embeddings.dimensions}")
print(f"\nGenerated {len(vectors)} embeddings")
print(f"First vector shape: {len(vectors[0])} dimensions")
print(f"First 5 values: {vectors[0][:5]}")

## Step 4: Store in ChromaDB

ChromaDB is a vector database that runs locally, stores embeddings, and supports similarity search. The index is persisted to disk, so you don't need to re-index on every run.
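As a rough mental model of what a persistent vector store does, the sketch below keeps `(id, text, vector)` records in a list, saves them to a JSON file, reloads them, and answers a nearest-neighbor query. `ToyVectorStore` is a hypothetical class written for this illustration; it is not ChromaDB's API and leaves out the indexing structures a real vector database relies on.

```python
import json
import math
import tempfile
from pathlib import Path

class ToyVectorStore:
    """Tiny stand-in for a vector database (illustration only)."""

    def __init__(self) -> None:
        self.records: list[dict] = []

    def add(self, doc_id: str, text: str, vector: list[float]) -> None:
        self.records.append({"id": doc_id, "text": text, "vector": vector})

    def query(self, vector: list[float], k: int = 2) -> list[tuple[str, float]]:
        """Return the k most similar (id, cosine similarity) pairs."""
        def cosine(a: list[float], b: list[float]) -> float:
            dot = sum(x * y for x, y in zip(a, b))
            norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
            return dot / norms if norms else 0.0

        scored = [(r["id"], cosine(vector, r["vector"])) for r in self.records]
        return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

    def save(self, path: Path) -> None:
        path.write_text(json.dumps(self.records))

    @classmethod
    def load(cls, path: Path) -> "ToyVectorStore":
        store = cls()
        store.records = json.loads(path.read_text())
        return store

store = ToyVectorStore()
store.add("a", "RAG overview", [1.0, 0.0])
store.add("b", "Cooking tips", [0.0, 1.0])

# Persist, then reload as a later session would, without re-adding anything
with tempfile.TemporaryDirectory() as tmp:
    index_path = Path(tmp) / "index.json"
    store.save(index_path)
    reloaded = ToyVectorStore.load(index_path)

top = reloaded.query([0.9, 0.1], k=1)
print(top)
```

The reloaded store answers queries without recomputing any vectors, which is the practical benefit of a persisted index.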

In [None]:
# Index all chunks into ChromaDB
# This generates embeddings and stores them
num_indexed = indexer.index_documents(chunks)

print(f"Indexed {num_indexed} document chunks")

# Check stats
stats = indexer.get_stats()
print("\nCollection stats:")
for key, value in stats.items():
    print(f"  {key}: {value}")

## Step 5: Query the Index

Now we can search for relevant documents using semantic similarity:

In [None]:
from langgraph_ollama_local.rag import LocalRetriever
from langgraph_ollama_local.rag.retriever import RetrieverConfig

# Create retriever (using same default collection as indexer)
retriever_config = RetrieverConfig(
    default_k=4,  # Number of results returned when k is not passed
)
retriever = LocalRetriever(config=retriever_config)

# Search for relevant documents (k=3 overrides default_k for this call)
query = "What is Self-RAG and how does it work?"
results = retriever.retrieve(query, k=3)

print(f"Query: {query}\n")
print(f"Found {len(results)} relevant documents:\n")

for i, (doc, score) in enumerate(results, 1):
    print(f"--- Result {i} (score: {score:.3f}) ---")
    print(f"Source: {doc.metadata.get('filename', 'unknown')}")
    print(f"Content: {doc.page_content[:300]}...\n")

## Step 6: Build the RAG Graph

Now let's build a LangGraph that combines retrieval and generation:

```
START → retrieve → generate → END
```
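To make the data flow in this diagram concrete, the sketch below simulates it in plain Python: each node receives the current state and returns only the keys it wants to update, and those keys are merged into the state before the next node runs. This mirrors how a linear graph behaves for state keys without a reducer, but `run_linear`, `fake_retrieve`, and `fake_generate` are hypothetical helpers, not LangGraph code.

```python
from typing import Callable

def fake_retrieve(state: dict) -> dict:
    """Stand-in retrieve node: returns only the 'documents' key."""
    return {"documents": [f"passage about: {state['question']}"]}

def fake_generate(state: dict) -> dict:
    """Stand-in generate node: reads 'documents', returns only 'generation'."""
    return {"generation": f"answer using {len(state['documents'])} document(s)"}

def run_linear(nodes: list[Callable[[dict], dict]], initial: dict) -> dict:
    """Run nodes in order, merging each partial update into the state."""
    state = dict(initial)
    for node in nodes:
        update = node(state)   # partial update, e.g. {"documents": [...]}
        state.update(update)   # returned keys overwrite; other keys are kept
    return state

final_state = run_linear([fake_retrieve, fake_generate], {"question": "What is RAG?"})

for key, value in final_state.items():
    print(f"{key}: {value}")
```

Because nodes return partial updates, `question` survives untouched to the end even though neither node returns it.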

In [None]:
from typing import List
from typing_extensions import TypedDict
from langchain_core.documents import Document
from langgraph.graph import StateGraph, START, END

# Define the state for our RAG graph
class RAGState(TypedDict):
    """State for the RAG pipeline."""
    question: str                    # User's question
    documents: List[Document]        # Retrieved documents
    generation: str                  # Generated answer

print("State schema defined!")

In [None]:
# Create the LLM
from langchain_ollama import ChatOllama

llm = ChatOllama(
    model=config.ollama.model,
    base_url=config.ollama.base_url,
    temperature=0,  # More deterministic for RAG
)

print(f"Using model: {config.ollama.model}")

In [None]:
# Define the retrieve node
def retrieve(state: RAGState) -> dict:
    """Retrieve relevant documents for the question."""
    question = state["question"]
    
    # Get documents (without scores for simplicity)
    docs = retriever.retrieve_documents(question, k=4)
    
    print(f"Retrieved {len(docs)} documents")
    return {"documents": docs}

print("Retrieve node defined!")

In [None]:
from langchain_core.prompts import ChatPromptTemplate

# RAG prompt template
RAG_PROMPT = """You are an assistant for question-answering tasks.
Use the following retrieved context to answer the question.
If you don't know the answer from the context, say so.
Keep your answer concise and focused.

Context:
{context}

Question: {question}

Answer:"""

rag_prompt = ChatPromptTemplate.from_template(RAG_PROMPT)

# Define the generate node
def generate(state: RAGState) -> dict:
    """Generate an answer using retrieved documents."""
    question = state["question"]
    documents = state["documents"]
    
    # Format documents into context
    context = "\n\n".join([
        f"Document {i+1}:\n{doc.page_content}"
        for i, doc in enumerate(documents)
    ])
    
    # Generate response
    messages = rag_prompt.format_messages(
        context=context,
        question=question
    )
    response = llm.invoke(messages)
    
    return {"generation": response.content}

print("Generate node defined!")

In [None]:
# Build the graph
graph_builder = StateGraph(RAGState)

# Add nodes
graph_builder.add_node("retrieve", retrieve)
graph_builder.add_node("generate", generate)

# Add edges
graph_builder.add_edge(START, "retrieve")
graph_builder.add_edge("retrieve", "generate")
graph_builder.add_edge("generate", END)

# Compile
rag_graph = graph_builder.compile()

print("RAG graph compiled!")

In [None]:
# Visualize the graph
from IPython.display import Image, display

try:
    display(Image(rag_graph.get_graph().draw_mermaid_png()))
except Exception as e:
    print(f"Could not render graph: {e}")
    print(rag_graph.get_graph().draw_ascii())

## Step 7: Test the RAG Pipeline

In [None]:
# Test the RAG pipeline
question = "What is Self-RAG and how does it improve upon traditional RAG?"

print(f"Question: {question}\n")
print("Processing...\n")

result = rag_graph.invoke({"question": question})

print("=" * 50)
print("ANSWER:")
print("=" * 50)
print(result["generation"])
print("\n" + "=" * 50)
print(f"Based on {len(result['documents'])} retrieved documents")

In [None]:
# Try another question
question2 = "What are the key components of a CRAG system?"

print(f"Question: {question2}\n")

result2 = rag_graph.invoke({"question": question2})

print("Answer:")
print(result2["generation"])

## Step 8: Add Source Citations

A key benefit of RAG is that answers can cite their sources. Let's enhance our pipeline to include citations:

In [None]:
def format_sources(documents: List[Document]) -> str:
    """Format document sources for citation."""
    sources = []
    seen = set()
    
    for doc in documents:
        filename = doc.metadata.get('filename', 'Unknown')
        page = doc.metadata.get('page')  # None when no page is recorded
        
        # Create unique source identifier
        source_id = f"{filename}:{page}"
        if source_id not in seen:
            seen.add(source_id)
            # Check for None/'' explicitly so that page 0 is still cited
            if page is not None and page != '':
                sources.append(f"- {filename} (page {page})")
            else:
                sources.append(f"- {filename}")
    
    return "\n".join(sources)

# Test with our last result
print("Sources:")
print(format_sources(result2["documents"]))

In [None]:
def rag_query(question: str, show_sources: bool = True) -> str:
    """Complete RAG query with optional source display."""
    result = rag_graph.invoke({"question": question})
    
    output = []
    output.append(f"Question: {question}")
    output.append("")
    output.append("Answer:")
    output.append(result["generation"])
    
    if show_sources:
        output.append("")
        output.append("Sources:")
        output.append(format_sources(result["documents"]))
    
    return "\n".join(output)

# Test the helper function
print(rag_query("What techniques does Adaptive RAG use?"))

## Complete Code

Here's the complete Basic RAG implementation. It covers the query side only and assumes the documents were already indexed as in Steps 1-4:

In [None]:
# Complete Basic RAG Implementation

from typing import List
from typing_extensions import TypedDict
from langchain_core.documents import Document
from langchain_core.prompts import ChatPromptTemplate
from langchain_ollama import ChatOllama
from langgraph.graph import StateGraph, START, END
from langgraph_ollama_local import LocalAgentConfig
from langgraph_ollama_local.rag import (
    DocumentLoader,
    DocumentIndexer,
    LocalRetriever,
)

# 1. State definition
class RAGState(TypedDict):
    question: str
    documents: List[Document]
    generation: str

# 2. Setup components
config = LocalAgentConfig()
llm = ChatOllama(
    model=config.ollama.model,
    base_url=config.ollama.base_url,
    temperature=0,
)

# Retriever uses default "documents" collection
retriever = LocalRetriever()

# 3. RAG prompt
RAG_PROMPT = ChatPromptTemplate.from_template(
    """Answer based on the context. If unknown, say so.

Context:
{context}

Question: {question}

Answer:"""
)

# 4. Node functions
def retrieve(state: RAGState) -> dict:
    docs = retriever.retrieve_documents(state["question"], k=4)
    return {"documents": docs}

def generate(state: RAGState) -> dict:
    context = "\n\n".join([d.page_content for d in state["documents"]])
    messages = RAG_PROMPT.format_messages(
        context=context, question=state["question"]
    )
    response = llm.invoke(messages)
    return {"generation": response.content}

# 5. Build graph
graph = StateGraph(RAGState)
graph.add_node("retrieve", retrieve)
graph.add_node("generate", generate)
graph.add_edge(START, "retrieve")
graph.add_edge("retrieve", "generate")
graph.add_edge("generate", END)
rag_app = graph.compile()

# 6. Use it!
result = rag_app.invoke({"question": "What is RAG?"})
print(result["generation"])

## Key Concepts Recap

| Component | Purpose |
|-----------|--------|
| **DocumentLoader** | Loads PDF, text, and Markdown files |
| **Chunking** | Splits documents into searchable pieces |
| **Embeddings** | Converts text to vectors (semantic meaning) |
| **ChromaDB** | Stores vectors and enables similarity search |
| **Retriever** | Finds relevant documents for a query |
| **RAG Prompt** | Combines context with question for LLM |

## Limitations of Basic RAG

Basic RAG has some limitations:

1. **No quality check**: Retrieved documents might not be relevant
2. **No hallucination detection**: LLM might still hallucinate
3. **Single retrieval**: One-shot retrieval might miss information
4. **No fallback**: No alternative if retrieval fails

We'll address these in the upcoming tutorials:
- **Self-RAG** (Tutorial 09): Grades documents and answers
- **CRAG** (Tutorial 10): Web search fallback
- **Adaptive RAG** (Tutorial 11): Routes to best strategy
- **Agentic RAG** (Tutorial 12): Multi-step retrieval

## Exercises

1. **Tune chunk size**: Try different chunk sizes (500, 1500) and see how they affect retrieval quality
2. **Adjust k**: Retrieve more/fewer documents and observe the impact
3. **Different embedding model**: Try `all-MiniLM-L6-v2` for faster embeddings
4. **Add metadata filtering**: Filter by source file or page number
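As a starting point for the metadata filtering exercise, one simple approach is to filter documents after retrieval by comparing their metadata against required values. The sketch below assumes documents expose `page_content` and a `metadata` dict, as in the cells above; `ToyDoc` and `filter_by_metadata` are hypothetical names, and the retriever may offer its own filtering options that are not shown here.

```python
from dataclasses import dataclass, field

@dataclass
class ToyDoc:
    """Minimal stand-in for a document with content and metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def filter_by_metadata(docs: list[ToyDoc], **required) -> list[ToyDoc]:
    """Keep only documents whose metadata matches every required key/value."""
    return [
        doc for doc in docs
        if all(doc.metadata.get(key) == value for key, value in required.items())
    ]

docs = [
    ToyDoc("First Self-RAG passage.", {"filename": "self_rag.pdf", "page": 1}),
    ToyDoc("A CRAG passage.", {"filename": "crag.pdf", "page": 3}),
    ToyDoc("Second Self-RAG passage.", {"filename": "self_rag.pdf", "page": 2}),
]

only_self_rag = filter_by_metadata(docs, filename="self_rag.pdf")
page_two = filter_by_metadata(docs, filename="self_rag.pdf", page=2)

print([d.page_content for d in only_self_rag])
print([d.page_content for d in page_two])
```

Filtering after retrieval is easy to add but can leave fewer than `k` results, so you may want to retrieve more documents than you need before filtering.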

## What's Next?

In [Tutorial 09: Self-RAG](09_self_rag.ipynb), you'll learn:
- How to grade retrieved documents for relevance
- Detecting hallucinations in generated answers
- Building retry loops for improved quality
- The Self-RAG architecture pattern