# Session 5: Retrieval-Augmented Generation (RAG)

In this notebook, we build a **minimal RAG pipeline**, using Lewis Carroll's two Alice books as our corpus, as in earlier sessions:

- *Alice's Adventures in Wonderland*
- *Through the Looking-Glass*

> **IMPORTANT**: It is **highly recommended** to use a virtual environment for this session!  
> The packages and downloaded models (embeddings, transformers) can easily reach over **1 GB** in size.  
> Using a venv keeps your system clean and makes it easy to manage these large dependencies and delete them once they are no longer needed.

## What is RAG?

**Retrieval-Augmented Generation (RAG)** is a technique that enhances LLM responses by:
1. **Retrieving** relevant information from a knowledge base (your documents)
2. **Augmenting** the LLM prompt with this retrieved context
3. **Generating** an answer based on both the question and the retrieved information

This approach allows LLMs to answer questions about documents they weren't trained on, and reduces hallucinations by grounding responses in actual source material.
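
The three steps can be illustrated without any libraries. The following is only a toy sketch: it ranks documents by shared words instead of embeddings, the three sentences are hand-written stand-ins for real chunks, and the helper names (`tokenize`, `retrieve`, `augment`) are hypothetical. The generation step is left out; the printed prompt is what would be sent to an LLM.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase a string and return its set of word tokens."""
    return set(re.findall(r"[a-z']+", text.lower()))


def retrieve(question: str, documents: list[str], k: int = 2) -> list[str]:
    """Step 1 (toy version): rank documents by word overlap with the question."""
    q_tokens = tokenize(question)
    ranked = sorted(documents, key=lambda d: len(q_tokens & tokenize(d)), reverse=True)
    return ranked[:k]


def augment(question: str, context_docs: list[str]) -> str:
    """Step 2: place the retrieved text in the prompt, ahead of the question."""
    context = "\n".join(f"- {doc}" for doc in context_docs)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


# Hand-written toy "knowledge base" (not actual chunks from the corpus)
toy_docs = [
    "Alice followed the White Rabbit down the rabbit-hole.",
    "The Cheshire Cat vanished slowly, ending with its grin.",
    "The Red Queen told Alice that it takes all the running you can do to keep in the same place.",
]

toy_question = "Who did Alice follow down the rabbit-hole?"
toy_prompt = augment(toy_question, retrieve(toy_question, toy_docs, k=1))
print(toy_prompt)  # Step 3 would pass this prompt to an LLM
```

The rest of the notebook replaces the word-overlap ranking with embedding similarity and the `print` with a real LLM call.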

## Pipeline Overview

We will:

1. **Load** the two books as plain text  
2. **Split** them into **overlapping chunks** (text segmentation)  
3. **Create embeddings** for each chunk (convert text to vectors)  
4. **Store** them in a **vector database (FAISS)** for efficient similarity search  
5. **Build** a **retrieval + generation chain** to answer questions about the books  
6. **Query** the system with natural language questions

The focus is on understanding the *pipeline*, not on perfect model choices. You can swap components (embeddings, LLMs, vector stores) as needed.

In [1]:
# Core imports
from pathlib import Path
import warnings
warnings.filterwarnings('ignore', category=DeprecationWarning)

# LangChain components
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_core.prompts import ChatPromptTemplate

# Hub for pulling prompts
from langsmith import Client
hub = Client()

# LLM: we use Ollama (local) here to avoid API keys
# Make sure you have installed and started Ollama, and pulled a model, e.g.:
#   - install from https://ollama.com
#   - in a terminal, run: `ollama pull llama3.2`
from langchain_ollama import OllamaLLM

# A small helper for nicer printing
import textwrap

## Setup & Configuration

### LLM Options

In this notebook, we use **Ollama** for local LLM inference (no API keys required).

**Alternative LLM options:**
- **OpenAI**: `from langchain_openai import ChatOpenAI` → requires API key
- **Groq**: `from langchain_groq import ChatGroq` → requires API key  
- **Anthropic**: `from langchain_anthropic import ChatAnthropic` → requires API key
- **HuggingFace**: `from langchain_huggingface import HuggingFaceEndpoint` → requires API key

**To use Ollama:**
1. Install from [https://ollama.com/download](https://ollama.com/download)
2. Run in terminal: `ollama pull llama3.2` (or another model)
3. Ollama runs on `localhost:11434` by default
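
Before running the cells below, it can help to confirm that the Ollama server is actually reachable. The sketch below assumes the default address from step 3 and uses Ollama's `/api/tags` endpoint, which lists locally pulled models; the helper name `list_ollama_models` is hypothetical and only the standard library is used.

```python
import json
import urllib.request


def list_ollama_models(base_url: str = "http://localhost:11434", timeout: float = 2.0):
    """Return the names of locally pulled models, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as response:
            payload = json.load(response)
    except (OSError, ValueError):
        # OSError covers refused connections and timeouts; ValueError covers invalid JSON
        return None
    if not isinstance(payload, dict):
        return None
    return [model.get("name", "") for model in payload.get("models", [])]


models = list_ollama_models()
if models is None:
    print("Ollama does not seem to be running on localhost:11434.")
else:
    print("Installed Ollama models:", models)
```

If the list comes back empty, the server is running but no model has been pulled yet.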

In [None]:
# Paths to the Alice books (plain text)
# Adjust these paths if your files live somewhere else.
DATA_DIR = Path("../data")
WONDERLAND_PATH = DATA_DIR / "Wonderland.txt"
LOOKING_GLASS_PATH = DATA_DIR / "Looking-Glass.txt"

# Set up local LLM via Ollama
# If you prefer Groq or OpenAI, you can swap this block for your own client.
llm = OllamaLLM(
    model="llama3.2",
    temperature=0.0  # Controls randomness: 0.0 = deterministic, 1.0 = creative
)

print("Data directory:", DATA_DIR.resolve())
print("Using LLM model:", "llama3.2 (Ollama)")

Data directory: /Users/nargeschinichian/Desktop/Teaching_Uni/SRH/Applied_NLP/applied-NLP-week5/data
Using LLM model: llama3.2 (Ollama)


### Configuration Notes

**Model Selection:**
- `llama3.2`: Fast, good for local testing (3B parameters)
- Other Ollama models: `llama3.1`, `mistral`, `phi3` (run `ollama list` to see installed models)

**Temperature Setting:**
- `temperature=0.0`: (Near-)deterministic responses (the model always picks the most likely next token, so repeated runs give the same or almost the same answer)
- `temperature=0.7`: More creative/varied responses
- `temperature=1.0`: Maximum creativity (may be less factual)

For RAG applications, **lower temperatures (0.0-0.3)** are recommended to keep answers focused on retrieved content.
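
To see why lower temperatures give more focused output, here is a small sketch of how temperature rescales next-token probabilities. The logits are made-up numbers, the function name is hypothetical, and treating `temperature=0` as a pure "pick the top token" rule is a common convention rather than a statement about how Ollama implements it.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Turn raw scores into probabilities, sharpened or flattened by the temperature."""
    if temperature <= 0:
        # Convention assumed here: temperature 0 puts all probability on the best token
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for t in (0.0, 0.3, 0.7, 1.0):
    probs = softmax_with_temperature(toy_logits, t)
    print(f"temperature={t}: {[round(p, 3) for p in probs]}")
```

As the temperature rises, probability mass moves from the top token to the alternatives, which is where the extra variety (and the extra risk of drifting from the context) comes from.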

## 1. Load books

We reuse the idea of the **`load_book`** helper from earlier sessions, but keep it simple:

**Steps:**
1. **Read** the text file from disk
2. **Strip** Project Gutenberg header/footer (boilerplate text)
3. **Return** clean text ready for processing

**Why clean the text?**
- Project Gutenberg files contain legal notices and metadata
- These sections aren't part of the actual book content
- Including them would pollute our embeddings with irrelevant information

**Data Sources:**
- You can use any plain text files (`.txt`)
- For other formats: PDF → use `pypdf` (the successor to `PyPDF2`) or `pdfplumber`, DOCX → use `python-docx`

In [3]:
def load_book(filepath: Path, name: str) -> str:
    # Load and roughly clean a Project Gutenberg text file.
    if not filepath.exists():
        raise FileNotFoundError(f"File not found: {filepath}")

    with open(filepath, "r", encoding="utf-8") as f:
        text = f.read()

    # Very simple cleaning: try to cut away Gutenberg boilerplate
    start_markers = ["CHAPTER I", "*** START OF"]
    end_markers = ["*** END OF", "End of Project Gutenberg"]

    start_idx = 0
    for marker in start_markers:
        if marker in text:
            start_idx = text.find(marker)
            break

    end_idx = len(text)
    for marker in end_markers:
        if marker in text:
            end_idx = text.find(marker)
            break

    cleaned = text[start_idx:end_idx].strip()
    print(f"{name}: {len(cleaned):,} characters after cleaning")
    return cleaned

wonderland_text = load_book(WONDERLAND_PATH, "Alice's Adventures in Wonderland")
looking_glass_text = load_book(LOOKING_GLASS_PATH, "Through the Looking-Glass")

Alice's Adventures in Wonderland: 144,481 characters after cleaning
Through the Looking-Glass: 161,373 characters after cleaning
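
`load_book` keeps things deliberately rough: when it falls back to the `*** START OF` marker, the marker line itself stays in the text. A slightly stricter variant is sketched below, assuming the files use the usual `*** START OF ...` / `*** END OF ...` marker lines; the function name and the sample string are hypothetical and not part of the notebook's pipeline.

```python
def strip_gutenberg_boilerplate(text: str) -> str:
    """Keep only the text between the START and END marker lines, if present."""
    start_marker = "*** START OF"
    end_marker = "*** END OF"

    start_idx = 0
    marker_pos = text.find(start_marker)
    if marker_pos != -1:
        # Skip to the end of the marker line so the marker itself is dropped
        line_end = text.find("\n", marker_pos)
        start_idx = len(text) if line_end == -1 else line_end + 1

    end_idx = text.find(end_marker, start_idx)
    if end_idx == -1:
        end_idx = len(text)

    return text[start_idx:end_idx].strip()


# Tiny made-up sample in the shape of a Project Gutenberg file
sample = (
    "Header and licence notes\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK SAMPLE ***\n"
    "CHAPTER I\nBody text.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK SAMPLE ***\n"
    "Footer\n"
)
print(strip_gutenberg_boilerplate(sample))
```

Swapping this in would change the character counts printed above, so treat it as an optional refinement.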


## 2. Chunk the texts for retrieval

Large documents are **too long** to embed and retrieve as a single vector.  
Instead, we split the books into **overlapping chunks**:

### Parameters Explained

- **`chunk_size`**: Maximum number of characters per chunk (default: 800)
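  - This is an **upper bound** the splitter aims for; a chunk only exceeds it if a piece cannot be split by any of the separators (e.g., a single "word" longer than `chunk_size`)
  - Too small → loses context, more chunks to search
  - Too large → less precise retrieval, may exceed embedding model limits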
  - **Typical range**: 500-1500 characters

- **`chunk_overlap`**: How much neighboring chunks overlap (default: 150)
  - Ensures sentences near boundaries aren't split awkwardly
  - Helps maintain context across chunk boundaries
  - **Typical range**: 10-20% of chunk_size

- **`separators`**: Priority order for splitting points
  - These determine **where** the text may be cut
  - The splitter tries each separator in order, moving to the next one only for pieces that are still too long:
    1. `\n\n` → paragraph breaks (preferred - most context preserved)
    2. `\n` → line breaks
    3. `. ` → sentence endings
    4. ` ` → word boundaries (last resort)
  - **Key point**: The splitter first cuts the text at the highest-priority separator, then merges neighboring pieces back together as long as the result stays within ~800 chars

### How It Works Together

Example: The text is first split at every `\n\n` (paragraph break). Consecutive paragraphs are merged into one chunk until adding the next one would exceed 800 characters, so a chunk may end at, say, 750 chars. A single paragraph longer than 800 chars is split again at `\n`, then `. `, then ` `. This keeps chunks **under 800 chars** while breaking at **natural boundaries**.
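
To make the interaction between `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified chunker. It is not the recursive splitter used in this notebook: it cuts at fixed character offsets and ignores separators entirely, so it only illustrates the overlap arithmetic. The function name and the toy text are hypothetical.

```python
def sliding_window_chunks(text: str, chunk_size: int = 800, chunk_overlap: int = 150) -> list[str]:
    """Cut text into fixed-size windows where each window repeats the tail of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


toy_text = "".join(str(i % 10) for i in range(2000))  # 2,000 characters of digits
toy_chunks = sliding_window_chunks(toy_text)
print(len(toy_chunks), [len(c) for c in toy_chunks])
print(toy_chunks[0][-150:] == toy_chunks[1][:150])  # the overlap region is shared
```

Each new chunk advances by `chunk_size - chunk_overlap` characters, which is why a larger overlap produces more chunks for the same text.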

### Experimentation

Try adjusting these values to see how they affect:
- Number of chunks created
- Retrieval quality
- Answer accuracy

In [4]:
def chunk_text(text: str, book_name: str, chunk_size: int = 800, chunk_overlap: int = 150):
    # Split a long text into overlapping chunks using RecursiveCharacterTextSplitter.
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        separators=["\n\n", "\n", ". ", " "],  # try to keep chunks on sentence/paragraph boundaries
    )
    docs = splitter.create_documents([text])
    print(f"{book_name}: {len(docs)} chunks (chunk_size={chunk_size}, overlap={chunk_overlap})")
    return docs

wonderland_chunks = chunk_text(wonderland_text, "Wonderland")
looking_glass_chunks = chunk_text(looking_glass_text, "Looking-Glass")

# Combine chunks from both books into a single corpus
all_chunks = wonderland_chunks + looking_glass_chunks
print("Total chunks in corpus:", len(all_chunks))

Wonderland: 240 chunks (chunk_size=800, overlap=150)
Looking-Glass: 255 chunks (chunk_size=800, overlap=150)
Total chunks in corpus: 495


## 3. Create embeddings & build a vector database (FAISS)

### What are embeddings? (We've been using them since session 3)

**Embeddings** convert text into numerical vectors (arrays of numbers) that capture semantic meaning:
- Similar texts → similar vectors
- Enables mathematical similarity comparisons
- Typical dimensions: 384, 768, or 1536 numbers per chunk
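
As a reminder of what "similar vectors" means in practice, the sketch below computes cosine similarity by hand. The three-dimensional vectors are invented for illustration; real embeddings from the models used here have hundreds of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up 3-dimensional "embeddings" (purely illustrative values)
rabbit = [0.9, 0.1, 0.0]
hare = [0.8, 0.2, 0.1]
teapot = [0.0, 0.2, 0.9]

print("rabbit vs hare:  ", round(cosine_similarity(rabbit, hare), 3))
print("rabbit vs teapot:", round(cosine_similarity(rabbit, teapot), 3))
```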

### Vector Database (FAISS)

**FAISS** (Facebook AI Similarity Search) is a library for efficient similarity search:
- Stores all chunk embeddings
- Quickly finds the most similar chunks to a query
- Works entirely offline (no API needed)

### Embedding Model Options

**Current**: `sentence-transformers/all-mpnet-base-v2`
- Dimensions: 768
- Quality: High for general-purpose tasks
- Speed: Medium

**Alternatives:**
- `all-MiniLM-L6-v2` → Faster, smaller (384 dim), slightly lower quality (you've already used this one)
- `all-mpnet-base-v1` → Similar to v2
- OpenAI embeddings â†’ `text-embedding-3-small` (requires API key)

**To change**: Just replace the `model_name` parameter in `HuggingFaceEmbeddings()`

In [5]:
VECTOR_DB_DIR = Path("../vector_databases")
VECTOR_DB_DIR.mkdir(parents=True, exist_ok=True)
VECTOR_DB_PATH = VECTOR_DB_DIR / "vector_db_alice"

def create_embedding_vector_db(chunks, db_path: Path):
    # 1. Instantiate an embedding model (HuggingFace embeddings)
    # 2. Create a FAISS vector store from the chunks
    # 3. Save it locally so we can reload it later
    embedding = HuggingFaceEmbeddings(
        model_name="sentence-transformers/all-mpnet-base-v2"
    )

    vectorstore = FAISS.from_documents(
        documents=chunks,
        embedding=embedding
    )

    vectorstore.save_local(str(db_path))
    print(f"Vector database saved to: {db_path}")

create_embedding_vector_db(all_chunks, VECTOR_DB_PATH)

Vector database saved to: ../vector_databases/vector_db_alice


### Performance Notes

**First run:**
- Downloads the embedding model (~420MB for all-mpnet-base-v2)
- Creates embeddings for all chunks (may take 1-2 minutes)
- Saves the vector database to disk

**Subsequent runs:**
- Model is cached locally
- Can skip this step if vector database already exists
- Just load the saved database (next section)
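
One way to skip the rebuild is to check for the saved files first. The sketch below assumes the default file names that `save_local` writes (`index.faiss` and `index.pkl`); the helper name `vector_db_exists` is hypothetical, and the demonstration uses a temporary folder with empty placeholder files rather than a real index.

```python
import tempfile
from pathlib import Path


def vector_db_exists(db_path: Path, index_name: str = "index") -> bool:
    """Return True if both files of a saved FAISS store are present in db_path."""
    return (
        (db_path / f"{index_name}.faiss").is_file()
        and (db_path / f"{index_name}.pkl").is_file()
    )


# Demonstration with a temporary folder and empty placeholder files
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "vector_db_demo"
    demo_path.mkdir()
    before = vector_db_exists(demo_path)
    (demo_path / "index.faiss").touch()
    (demo_path / "index.pkl").touch()
    after = vector_db_exists(demo_path)

print("before:", before, "| after:", after)
```

In the notebook this could guard the build call, for example `if not vector_db_exists(VECTOR_DB_PATH): create_embedding_vector_db(all_chunks, VECTOR_DB_PATH)`. Remember to delete the folder and rebuild whenever the chunking parameters or the embedding model change.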

## 4. Build a retriever from the vector database

To use RAG, we need a **retriever** object that:

1. Takes a user question  
2. Converts it to an embedding (using the same model as the chunks)
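3. Finds the **k most similar chunks** in the vector store by vector distance (LangChain's FAISS wrapper defaults to Euclidean/L2 distance, which ranks normalized embeddings in the same order as cosine similarity)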
4. Returns those chunks to be passed to the LLM
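
Step 3 can be sketched as a brute-force search over a handful of toy vectors. The example also shows that, for unit-length vectors, ranking by smallest L2 distance and ranking by largest cosine similarity pick the same chunks. The vectors and helper names are made up for illustration; FAISS performs the same kind of search far more efficiently.

```python
import math


def normalize(v: list[float]) -> list[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def l2_distance(a: list[float], b: list[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def top_k_by_l2(query, vectors, k):
    """Indices of the k vectors with the smallest L2 distance to the query."""
    return sorted(range(len(vectors)), key=lambda i: l2_distance(query, vectors[i]))[:k]


def top_k_by_cosine(query, vectors, k):
    """Indices of the k unit vectors with the largest cosine similarity to the query."""
    return sorted(range(len(vectors)), key=lambda i: dot(query, vectors[i]), reverse=True)[:k]


# Made-up "chunk embeddings" and a made-up "question embedding", all unit length
chunk_vectors = [normalize(v) for v in ([1, 0.2, 0], [0.9, 0.4, 0.1], [0, 1, 0.3], [0.1, 0.2, 1])]
query_vector = normalize([1, 0.3, 0.05])

print("top-2 by L2 distance:      ", top_k_by_l2(query_vector, chunk_vectors, k=2))
print("top-2 by cosine similarity:", top_k_by_cosine(query_vector, chunk_vectors, k=2))
```

The two rankings agree because, for unit vectors, the squared L2 distance equals `2 - 2 * cosine`.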

### The `k` Parameter

**`k=4`** means "retrieve the 4 most similar chunks"

**Trade-offs:**
- **Low k (1-3)**: Faster, more focused, but might miss relevant information
- **Medium k (4-6)**: Balanced approach (recommended starting point)
- **High k (7-15)**: More comprehensive, but may include irrelevant chunks and slow down the LLM

**Experiment:** Try different `k` values to see how they affect answer quality and response time.

### Search Strategies

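LangChain retrievers built on a vector store support different search types (set via `search_type` in `as_retriever`):
- **Similarity search** (`"similarity"`, default): Returns top-k most similar chunks
- **MMR** (`"mmr"`, Maximal Marginal Relevance): Balances relevance with diversity among the results
- **Similarity with score threshold** (`"similarity_score_threshold"`): Only returns chunks above a certain similarity score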
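
To show what MMR adds over plain similarity search, here is a minimal pure-Python sketch of the selection rule: each pick maximizes relevance to the query minus a penalty for resembling chunks that were already chosen. The two-dimensional vectors, the function names, and the `lambda_mult=0.3` setting are illustrative assumptions, not the library's implementation.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr_select(query, candidates, k, lambda_mult=0.5):
    """Greedy MMR: lambda_mult=1.0 is pure relevance, lower values favor diversity."""
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        def mmr_score(i):
            relevance = cosine(query, candidates[i])
            # Similarity to the closest already-selected candidate (0 if none selected yet)
            redundancy = max((cosine(candidates[i], candidates[j]) for j in selected), default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Candidates 0 and 1 are near-duplicates; candidate 2 is less relevant but different
toy_query = [1.0, 0.0]
toy_candidates = [[1.0, 0.1], [1.0, 0.12], [0.6, 0.8]]

print("pure relevance:", mmr_select(toy_query, toy_candidates, k=2, lambda_mult=1.0))
print("with diversity:", mmr_select(toy_query, toy_candidates, k=2, lambda_mult=0.3))
```

With the retriever in this notebook, the equivalent switch would be along the lines of `vectorstore.as_retriever(search_type="mmr", search_kwargs={"k": 4, "fetch_k": 20})`, where `fetch_k` is the size of the candidate pool MMR chooses from.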

In [6]:
def load_retriever(db_path: Path, k: int = 4):
    # Reload the FAISS vector store from disk and create a retriever.
    embedding = HuggingFaceEmbeddings(
        model_name="sentence-transformers/all-mpnet-base-v2"
    )

    vectorstore = FAISS.load_local(
        folder_path=str(db_path),
        embeddings=embedding,
        allow_dangerous_deserialization=True,  # required because part of the store is a pickle file; only load files you created or trust
    )

    retriever = vectorstore.as_retriever(search_kwargs={"k": k})
    print(f"Retriever ready (k={k}) from {db_path}")
    return retriever

alice_retriever = load_retriever(VECTOR_DB_PATH, k=4)

Retriever ready (k=4) from ../vector_databases/vector_db_alice


## 5. Connect retriever + LLM = RAG chain

We now create a **retrieval chain** using **LCEL** (LangChain Expression Language):

### Pipeline Flow

1. **Input** → User's question
2. **Retriever** → Fetches relevant chunks from vector database
3. **Format** → Combines chunks into context string
4. **Prompt** → Creates LLM prompt with context + question
5. **LLM** → Generates answer based on context
6. **Output Parser** → Extracts clean string from LLM response
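
The `|` syntax can look like magic at first. The toy class below mimics the idea with plain Python: each step wraps a function, and `a | b` builds a new step that feeds the output of `a` into `b`. This is a hypothetical stand-in to illustrate the flow, not how LangChain implements its runnables, and the retriever and LLM are fakes that return fixed strings.

```python
class Step:
    """A tiny stand-in for a pipeline component that supports the | operator."""

    def __init__(self, func):
        self.func = func

    def __or__(self, other: "Step") -> "Step":
        # a | b -> a new step that runs a first, then passes its result to b
        return Step(lambda value: other.invoke(self.invoke(value)))

    def invoke(self, value):
        return self.func(value)


# Fake components that follow the six stages listed above
fake_retriever = Step(lambda question: ["chunk about the White Rabbit", "chunk about the rabbit-hole"])
format_step = Step(lambda docs: "\n\n".join(docs))
gather_inputs = Step(lambda q: {"context": (fake_retriever | format_step).invoke(q), "input": q})
prompt_step = Step(lambda d: f"Context:\n{d['context']}\n\nQuestion: {d['input']}")
fake_llm = Step(lambda prompt: f"[model output for a prompt of {len(prompt)} characters]")
parse_step = Step(lambda output: output.strip())

toy_chain = gather_inputs | prompt_step | fake_llm | parse_step
print(toy_chain.invoke("Where does Alice fall?"))
```

The real chain has the same shape: a dictionary that gathers `context` and `input`, followed by the prompt, the LLM, and the output parser.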

### Custom Prompt Design

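Our prompt instructs the LLM to:
- Use the provided context (retrieved chunks) to answer
- **Cite specific passages** from the books
- Include brief quotes to support answers

Note that the prompt does not explicitly forbid answering from outside knowledge; adding a line such as "If the context does not contain the answer, say so" is a common way to discourage made-up information.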

### Prompt Customization Options

You can modify the system message to change LLM behavior:
- Add stricter citation requirements
- Request different answer formats (bullet points, summaries, etc.)
- Specify answer length constraints
- Add domain-specific instructions

In [9]:
def build_rag_chain(retriever):
    # Connects the retriever with an LLM using a custom prompt that asks for references.
    
    # Custom prompt that instructs the LLM to cite sources
    prompt = ChatPromptTemplate.from_messages([
        ("system", """You are a helpful assistant answering questions about Lewis Carroll's Alice books.
Use the following context to answer the question. Always cite specific passages from the books in your answer.
When you use information from the context, include a brief quote or reference to show where it came from.

Context:
{context}"""),
        ("human", "{input}")
    ])
    
    def format_docs(docs):
        return "\n\n".join(doc.page_content for doc in docs)
    
    # Build RAG chain using LCEL
    rag_chain = (
        {"context": retriever | format_docs, "input": RunnablePassthrough()}
        | prompt
        | llm
        | StrOutputParser()
    )
    
    return rag_chain

alice_rag_chain = build_rag_chain(alice_retriever)
print("RAG chain ready.")

RAG chain ready.


### Alternative Prompt Strategies

**Without citations** (original hub prompt):
```python
prompt = hub.pull_prompt("langchain-ai/retrieval-qa-chat")
```

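**With structured output:**
```python
# Keep the {context} placeholder, otherwise the retrieved chunks never reach the LLM
prompt = ChatPromptTemplate.from_messages([
    ("system", """Answer using the context below, in this format:
    ANSWER: [your answer]
    SOURCES: [relevant quotes]

    Context:
    {context}"""),
    ("human", "{input}")
])
```

**With confidence levels:**
```python
# Keep the {context} placeholder, otherwise the retrieved chunks never reach the LLM
prompt = ChatPromptTemplate.from_messages([
    ("system", """Answer the question and rate your confidence (low/medium/high) 
    based on how well the context supports your answer.

    Context:
    {context}"""),
    ("human", "{input}")
])
```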

## 6. Ask questions about the *Alice* books

Now we can **chat with the corpus**!

### How It Works

When you ask a question:
1. Question → embedding vector
2. Vector database → finds 4 most similar chunks
3. Chunks + question → sent to LLM as context
4. LLM → generates answer with citations
5. Answer → displayed with text wrapping
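
One detail about the last step: `textwrap.fill` treats the whole answer as a single paragraph, so blank lines between paragraphs in the LLM output are collapsed into spaces. If you prefer to keep paragraph breaks, a small variant like the following sketch wraps each paragraph separately. The helper name `wrap_paragraphs` and the sample answer are hypothetical.

```python
import textwrap


def wrap_paragraphs(text: str, width: int = 100) -> str:
    """Wrap each blank-line-separated paragraph on its own and keep the blank lines."""
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    return "\n\n".join(textwrap.fill(p, width=width) for p in paragraphs)


sample_answer = "First paragraph of the answer.\n\nSecond paragraph of the answer."

print(textwrap.fill(sample_answer, width=20))  # paragraphs are merged
print("-" * 20)
print(wrap_paragraphs(sample_answer, width=20))  # paragraphs are preserved
```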

### Key Points

- The LLM **does NOT answer from its pretraining alone**
- It first retrieves relevant chunks from the *Alice* books
- Answers are **grounded in the actual text**
- Citations help verify the information

### Evaluation Tips

When testing your RAG system, consider:
- **Relevance**: Does the answer address the question?
- **Accuracy**: Is the information correct per the source?
- **Citation quality**: Are quotes/references provided?
- **Completeness**: Does it cover all relevant aspects?
- **No hallucination**: Does it avoid making up information?

Try questions that:
- Require specific details (names, events)
- Need synthesis across multiple passages
- Ask about comparisons between the books
- Test the system's limits (questions not answerable from the text)
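
For the citation-quality check, part of the work can be automated: pull the quoted spans out of an answer and test whether each one really occurs in the retrieved context. The sketch below is a rough heuristic under stated assumptions: it only recognizes straight double quotes, ignores very short quotes, and compares text case-insensitively after collapsing whitespace. The function names and the sample answer are hypothetical.

```python
import re


def normalize_ws(text: str) -> str:
    """Lowercase and collapse all whitespace so line wrapping does not break matches."""
    return " ".join(text.split()).lower()


def check_quotes(answer: str, context: str, min_words: int = 3) -> dict[str, bool]:
    """Map each quoted span in the answer to whether it appears verbatim in the context."""
    normalized_context = normalize_ws(context)
    results = {}
    for quote in re.findall(r'"([^"]+)"', answer):
        if len(quote.split()) < min_words:
            continue  # skip short quotes such as single words or titles
        results[quote] = normalize_ws(quote) in normalized_context
    return results


toy_context = (
    "In another moment down went Alice after it, never once considering "
    "how in the world she was to get out again."
)
# Made-up answer: the first quote is in the context, the second is not
toy_answer = (
    'Alice jumped in "never once considering how in the world she was to get out again" '
    'and later felt "completely terrified of the dark".'
)

for quote, found in check_quotes(toy_answer, toy_context).items():
    print(("FOUND    " if found else "NOT FOUND"), "|", quote)
```

A quote marked as not found is not automatically a hallucination (the model may have paraphrased), but it is a good candidate for manual review.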

In [11]:
def ask_alice(question: str, chain=alice_rag_chain):
    # Send a question to the RAG chain and print a nicely wrapped answer.
    print(f"\nQUESTION:\n{question}\n" + "-"*80)
    answer = chain.invoke(question)
    print("\nANSWER:\n")
    print(textwrap.fill(answer, width=100))

# Example questions
ask_alice("How does Alice feel when she falls down the rabbit hole?")
ask_alice("What differences are there between Wonderland and the world behind the looking-glass?")
ask_alice("How is Alice referred to in both books?")


QUESTION:
How does Alice feel when she falls down the rabbit hole?
--------------------------------------------------------------------------------

ANSWER:

When Alice falls down the rabbit hole, she doesn't seem to feel any fear or anxiety about her
situation. In fact, she is described as being "never once considering how in the world she was to
get out again" (CHAPTER I. Down the Rabbit-Hole). This suggests that she is somewhat detached from
her predicament and is more focused on her immediate surroundings.  As she falls, Alice experiences
a sense of disorientation and confusion, but it doesn't seem to evoke any strong emotions. She is
simply swept up in the moment and carried along by the rabbit hole's sudden descent (CHAPTER I. Down
the Rabbit-Hole).  It's only when she finds herself in the long, low hall lit by lamps that Alice
begins to feel a sense of unease and disorientation. However, even then, her emotions are more those
of frustration and disappointment rather than fear o

---

## 🎯 Exercises & Extensions

### Beginner Level
1. Change `k=4` to `k=2` or `k=8` in the retriever and observe the difference
2. Modify `chunk_size` and `chunk_overlap` and rebuild the vector database
3. Try different temperatures (0.0, 0.5, 1.0) and compare answers
4. Ask your own questions about the Alice books

### Intermediate Level
5. Switch to a different embedding model (e.g., `all-MiniLM-L6-v2`)
6. Modify the prompt to request bullet-point answers
7. Add a function to show which chunks were retrieved for each question
8. Implement a conversation history (multi-turn dialogue)

### Advanced Level
9. Add metadata to chunks (book name, chapter) and use it in retrieval
10. Implement a hybrid search (keyword + semantic)
11. Use a different vector database (Chroma, Pinecone)
12. Build a Gradio or Streamlit UI for the RAG system
13. Evaluate retrieval quality using metrics (precision@k, recall@k)

---

## 📚 Additional Resources

- [LangChain Documentation](https://python.langchain.com/)
- [FAISS Documentation](https://faiss.ai/)
- [Sentence Transformers](https://www.sbert.net/)
- [Ollama Models](https://ollama.com/library)
- [RAG Survey Paper](https://arxiv.org/abs/2312.10997)