# Retrieval-Augmented Generation (RAG) Basics with **ChromaDB** (Local)

Target audience: **4th-year CS students**

## Learning Objectives
- Understand the core idea of RAG: *retrieve relevant context first, then generate an answer grounded in that context*.
- Learn the basic steps: **chunk → embed → index → retrieve → (optionally re-rank) → assemble context → prompt**.
- Use a **local ChromaDB** vector database (no external server) and **Sentence-Transformers** for embeddings.
- Practice writing simple queries and inspecting retrieved passages.

### RAG flow (ASCII sketch)
```
Raw Document(s)
   │
   ├── Chunking (fixed size + overlap)
   │       ↓
   ├── Embeddings (Sentence-Transformers)
   │       ↓
   ├── Vector DB (ChromaDB: local, file-backed)
   │       ↓
   ├── Retrieval (vector search on query embedding)
   │       ↓
   └── RAG Answer Construction (assemble top-k chunks → prompt an LLM)
```

> **Note:** This notebook demonstrates retrieval and context building. You can plug in any open-source LLM later.
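
To make the flow above concrete before bringing in real libraries, here is a toy, dependency-free sketch of the same stages. It assumes a bag-of-words count vector as a stand-in for a real embedding model, one sentence per chunk, and a plain Python list as a stand-in for the vector database. The names (`toy_embed`, `toy_cosine`, `toy_retrieve`) are illustrative only and are not part of ChromaDB or Sentence-Transformers.

```python
import math
import re
from collections import Counter

def toy_embed(text):
    """Stand-in 'embedding': lowercase word counts (NOT a real embedding model)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def toy_cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def toy_retrieve(query, index, k=1):
    """Rank (chunk, vector) pairs by similarity to the query; return the top-k chunks."""
    q_vec = toy_embed(query)
    ranked = sorted(index, key=lambda item: toy_cosine(q_vec, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

# Chunk: one sentence per chunk (a simplification of fixed-size chunking).
toy_chunks = [
    "Alice was tired of sitting on the bank.",
    "A White Rabbit with pink eyes ran close by her.",
    "The Rabbit took a watch out of its waistcoat-pocket.",
]

# Embed + index: keep (chunk, vector) pairs in a list.
toy_index = [(c, toy_embed(c)) for c in toy_chunks]

# Retrieve, then assemble a prompt from the best chunk.
toy_question = "What did the Rabbit take out? A watch?"
toy_top = toy_retrieve(toy_question, toy_index, k=1)
toy_prompt = f"Context:\n{toy_top[0]}\n\nQuestion: {toy_question}"
print(toy_prompt)
```

Word counts only match exact tokens, which is why the rest of the notebook uses a learned embedding model instead.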


## 1) Environment Setup

This notebook uses only open-source Python libraries.

**Dependencies:**
- `chromadb` for a local, file-backed vector database
- `sentence-transformers` for embeddings
- `numpy`, `pandas`, `tqdm` for data handling and progress bars
- `scikit-learn` for cosine similarity utilities
- `matplotlib` (optional visualization)


In [None]:
# %%
# Install minimal dependencies (uncomment if needed). Running this may take a few minutes.
# If you're inside a managed environment, prefer installing via terminal beforehand.

# !pip install --quiet chromadb sentence-transformers numpy pandas tqdm scikit-learn matplotlib


### 1.a) Use ChromaDB Locally (No Server Needed)

**ChromaDB** runs locally in-process and persists to a folder (default here: `./chroma_data`).  
No Docker or external services are required, which makes it convenient for teaching and small RAG demos.

**Example architecture**

```
Sentence-Transformers → Embeddings → ChromaDB (local) → Retrieval → Context
```

Below we initialize a persistent Chroma client. The collection itself is created later, in Section 5.


In [None]:
# %%
import os

# Choose a persistent directory for Chroma data (you can delete this to reset)
CHROMA_DIR = "./chroma_data"
os.makedirs(CHROMA_DIR, exist_ok=True)

try:
    import chromadb
except ImportError as e:
    raise SystemExit("chromadb is not installed. Please run the pip cell above and retry.") from e

# Initialize a persistent Chroma client
client = chromadb.PersistentClient(path=CHROMA_DIR)

print(f"✅ ChromaDB initialized. Data directory: {CHROMA_DIR}")


## 2) Choose / Load a Document

We'll use **Alice's Adventures in Wonderland – Chapter 1** (public domain) as a default, because it's short, familiar, and approachable.

To keep this notebook fully runnable **offline**, we provide a compact inline fallback excerpt. If you prefer, you can replace the content with your own `.txt` file(s) or add network code to download from Project Gutenberg.
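
If you do want to swap in your own `.txt` file, one possible pattern is to try the file first and fall back to the inline excerpt when it is missing or empty. This is a minimal sketch: `load_text` and the filename `my_document.txt` are hypothetical names chosen for illustration, and UTF-8 encoding is an assumption about your file.

```python
from pathlib import Path

def load_text(path: str, fallback: str) -> str:
    """Return the file's contents if it exists and is non-empty, else the fallback."""
    p = Path(path)
    if p.is_file():
        text = p.read_text(encoding="utf-8").strip()
        if text:
            return text
    return fallback.strip()

# "my_document.txt" is a hypothetical filename; point it at your own file.
sample_text = load_text("my_document.txt", fallback="Inline fallback excerpt goes here.")
print(f"Loaded {len(sample_text)} characters")
```

In this notebook you would pass `ALICE_CH1_FALLBACK` as the fallback and assign the result to `raw_text`.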


In [None]:
# %%
ALICE_CH1_FALLBACK = (
    """
Alice was beginning to get very tired of sitting by her sister on the bank, 
and of having nothing to do: once or twice she had peeped into the book her 
sister was reading, but it had no pictures or conversations in it, 'and what 
is the use of a book,' thought Alice 'without pictures or conversations?' 

So she was considering in her own mind (as well as she could, for the hot day 
made her feel very sleepy and stupid), whether the pleasure of making a 
daisy-chain would be worth the trouble of getting up and picking the daisies, 
when suddenly a White Rabbit with pink eyes ran close by her. 

There was nothing so very remarkable in that; nor did Alice think it so very 
much out of the way to hear the Rabbit say to itself, 'Oh dear! Oh dear! I 
shall be late!' (when she thought it over afterwards, it occurred to her that 
she ought to have wondered at this, but at the time it all seemed quite 
natural); but when the Rabbit actually took a watch out of its waistcoat-
pocket, and looked at it, and then hurried on, Alice started to her feet, for 
it flashed across her mind that she had never before seen a rabbit with either 
a waistcoat-pocket, or a watch to take out of it, and burning with curiosity, 
she ran across the field after it, and fortunately was just in time to see it 
pop down a large rabbit-hole under the hedge.
"""
)

raw_text = ALICE_CH1_FALLBACK.strip()
print(f"Loaded text length (chars): {len(raw_text)}")
print(raw_text[:300] + ("..." if len(raw_text) > 300 else ""))


## 3) Chunking Basics (Why and How)

We split a long document into **chunks** so that:
- Each chunk fits typical model/context limits
- We can **retrieve** only the relevant pieces for a query
- Overlap preserves continuity across chunk boundaries

**Rules of thumb (first pass):**
- Use ~800–1200 characters per chunk (roughly 200–300 tokens, assuming about 4 characters per English token)
- 10–20% overlap (to avoid losing context at boundaries)
- Adjust based on your queries, domain, and evaluation
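
To see how chunk size and overlap interact, it helps to work out the arithmetic. With a sliding window, each new chunk starts `chunk_chars - overlap_chars` characters after the previous one, so a text longer than one chunk needs `1 + ceil((n - chunk_chars) / (chunk_chars - overlap_chars))` chunks. The sketch below checks that formula against an explicit window walk; the function names are illustrative, and the 2000-character length is a made-up example rather than a measurement of any real document.

```python
def expected_chunk_count(n_chars: int, chunk_chars: int, overlap_chars: int) -> int:
    """Closed-form number of chunks for a sliding window with overlap."""
    if not 0 <= overlap_chars < chunk_chars:
        raise ValueError("overlap_chars must satisfy 0 <= overlap_chars < chunk_chars")
    if n_chars <= 0:
        return 0
    if n_chars <= chunk_chars:
        return 1
    stride = chunk_chars - overlap_chars
    # Integer ceiling division of (n_chars - chunk_chars) by stride.
    return 1 + (n_chars - chunk_chars + stride - 1) // stride

def chunk_spans(n_chars: int, chunk_chars: int, overlap_chars: int):
    """Explicit (start, end) offsets, used here to cross-check the formula."""
    if not 0 <= overlap_chars < chunk_chars:
        raise ValueError("overlap_chars must satisfy 0 <= overlap_chars < chunk_chars")
    spans, start = [], 0
    while start < n_chars:
        end = min(start + chunk_chars, n_chars)
        spans.append((start, end))
        if end == n_chars:
            break
        start += chunk_chars - overlap_chars
    return spans

example_spans = chunk_spans(2000, 800, 120)
print(expected_chunk_count(2000, 800, 120), example_spans)
```

For 2000 characters with 800-character chunks and 120 characters of overlap, the stride is 680, giving three chunks: `(0, 800)`, `(680, 1480)`, and `(1360, 2000)`.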


In [None]:
# %%
from typing import List, Dict

def simple_char_chunk(
    text: str,
    chunk_chars: int = 1000,
    overlap_chars: int = 150,
) -> List[Dict]:
    """Split text into overlapping character chunks.
    Returns a list of dicts: {id, text, start_idx, end_idx}
    """
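    # overlap must be strictly smaller than the chunk size, otherwise `start` never advances
    assert chunk_chars > 0 and 0 <= overlap_chars < chunk_chars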
    chunks = []
    n = len(text)
    start = 0
    chunk_id = 0
    while start < n:
        end = min(start + chunk_chars, n)
        chunk_text = text[start:end]
        chunks.append({
            "id": f"chunk_{chunk_id}",
            "text": chunk_text,
            "start_idx": start,
            "end_idx": end,
        })
        chunk_id += 1
        if end == n:
            break
        start = max(0, end - overlap_chars)
    return chunks

chunks = simple_char_chunk(raw_text, chunk_chars=800, overlap_chars=120)
print(f"Created {len(chunks)} chunk(s)")
for c in chunks[:3]:
    preview = c["text"][:120].replace("\n", " ")
    print(f"{c['id']} [{c['start_idx']}:{c['end_idx']}] → {preview}...")


## 4) Embeddings: Sentence-Transformers

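We'll default to **`sentence-transformers/all-MiniLM-L6-v2`** because it's small, fast, and a reasonable baseline: it maps each text to a 384-dimensional vector and, by default, truncates input beyond 256 word pieces.  
We'll embed each chunk and **L2-normalize** the vectors, so that the dot product of two vectors equals their cosine similarity.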
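
The reason for normalizing is that, once every vector has unit length, cosine similarity reduces to a plain dot product. The following sketch illustrates this with two hand-picked 2-D vectors in pure Python; the helper names are illustrative and separate from the NumPy-based `l2_normalize` used in the notebook.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def l2_normalize_vec(v):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v] if norm > 0 else list(v)

def cosine(a, b):
    """Cosine similarity computed directly from the raw vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

vec_a = [3.0, 4.0]
vec_b = [4.0, 3.0]
unit_a, unit_b = l2_normalize_vec(vec_a), l2_normalize_vec(vec_b)

print(cosine(vec_a, vec_b))  # cosine similarity of the raw vectors
print(dot(unit_a, unit_b))   # same value up to rounding, as a plain dot product
```

Both lines print 24/25 = 0.96, up to floating-point rounding.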


In [None]:
# %%
import numpy as np

try:
    from sentence_transformers import SentenceTransformer
except ImportError as e:
    raise SystemExit("sentence-transformers not installed. Please run the pip install cell above.") from e

EMBEDDING_MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"
model = SentenceTransformer(EMBEDDING_MODEL_NAME)
print(f"Loaded embedding model: {EMBEDDING_MODEL_NAME}")

def l2_normalize(mat: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    norms = np.linalg.norm(mat, axis=1, keepdims=True)
    norms = np.maximum(norms, eps)
    return mat / norms

texts = [c["text"] for c in chunks]
embs = model.encode(texts, batch_size=32, show_progress_bar=True)
embs = np.asarray(embs, dtype=np.float32)
embs = l2_normalize(embs)
print("Embeddings shape:", embs.shape)


## 5) Create a Chroma Collection & Ingest Chunks

We'll create a collection named `RAGChunk` and add each chunk with its **precomputed embedding** and some basic metadata.


In [None]:
# %%
COLLECTION_NAME = "RAGChunk"

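# Drop any existing collection with the same name to keep re-runs clean.
# Depending on the chromadb version, list_collections() yields Collection objects
# or plain name strings, so handle both.
existing = [getattr(c, "name", c) for c in client.list_collections()]
if COLLECTION_NAME in existing:
    client.delete_collection(COLLECTION_NAME)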

# Create new collection; we supply embeddings explicitly
collection = client.create_collection(name=COLLECTION_NAME, metadata={"hnsw:space": "cosine"})

ids = [c["id"] for c in chunks]
documents = [c["text"] for c in chunks]
metadatas = [{"source": "alice_ch1_fallback", "start_idx": int(c["start_idx"]), "end_idx": int(c["end_idx"])} for c in chunks]
emb_list = [e.tolist() for e in embs]

collection.add(ids=ids, documents=documents, metadatas=metadatas, embeddings=emb_list)
print(f"Ingestion complete. Collection '{COLLECTION_NAME}' now has {collection.count()} items.")


## 6) Retrieval

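We perform **vector search** in ChromaDB by embedding the query with the same model, then asking the collection for the nearest neighbors.  
We also show a transparent **cosine re-ranking** step (computed locally) on the top-N. Because the collection already uses cosine distance over the same embeddings, this step should reproduce Chroma's ordering; its purpose is to make the similarity scores visible.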
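
Conceptually, nearest-neighbor search scores every stored vector against the query and keeps the best `k`. The sketch below does this by brute force in pure Python using cosine distance (1 minus cosine similarity), which is the quantity a collection configured with `"hnsw:space": "cosine"` reports. It is an illustration only: Chroma uses an approximate HNSW index rather than an exhaustive scan, and the vectors and the `brute_force_top_k` helper are made up for this example.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; 0 means same direction, 1 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def brute_force_top_k(query_vec, stored, k=2):
    """stored: dict of id -> vector. Returns [(id, distance)], nearest first."""
    scored = [(doc_id, cosine_distance(query_vec, vec)) for doc_id, vec in stored.items()]
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]

# Hypothetical 2-D "embeddings" chosen by hand for readability.
stored_vecs = {
    "chunk_0": [1.0, 0.0],
    "chunk_1": [0.6, 0.8],
    "chunk_2": [0.0, 1.0],
}
hits = brute_force_top_k([1.0, 0.1], stored_vecs, k=2)
for doc_id, dist in hits:
    print(doc_id, "distance:", round(dist, 4), "similarity:", round(1.0 - dist, 4))
```

Sorting by ascending distance and sorting by descending similarity give the same order, which is why the local re-rank agrees with the database ranking.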


In [None]:
# %%
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd

def chroma_vector_search(query_text: str, top_k: int = 5):
    q_emb = model.encode([query_text])
    q_emb = l2_normalize(np.asarray(q_emb, dtype=np.float32))
    res = collection.query(query_embeddings=q_emb.tolist(), n_results=top_k, include=["metadatas", "documents", "embeddings", "distances"])
    return res, q_emb

def cosine_rerank(res, q_emb, top_k: int = 5):
    if not res or not res.get("documents"):
        return pd.DataFrame(columns=["cosine", "preview", "start_idx", "end_idx"])
    docs = res["documents"][0]
    metas = res["metadatas"][0]
   
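    # Use returned embeddings if present; otherwise re-embed.
    # Some Chroma versions may return NumPy arrays, whose truth value is ambiguous,
    # so test with `is not None` and `len()` instead of relying on truthiness.
    returned = res.get("embeddings")
    if returned is not None and len(returned) > 0 and returned[0] is not None:
        doc_embs = np.asarray(returned[0], dtype=np.float32)
        doc_embs = l2_normalize(doc_embs)
    else:
        doc_embs = model.encode(docs)
        doc_embs = l2_normalize(np.asarray(doc_embs, dtype=np.float32))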

    sims = cosine_similarity(q_emb, doc_embs)[0]
    rows = []
    for i, (d, m) in enumerate(zip(docs, metas)):
        rows.append({
            "cosine": float(sims[i]),
            "preview": (d[:160].replace("\n", " ") + ("..." if len(d) > 160 else "")),
            "start_idx": m.get("start_idx"),
            "end_idx": m.get("end_idx"),
        })
    df = pd.DataFrame(rows).sort_values("cosine", ascending=False).head(top_k).reset_index(drop=True)
    return df

example_queries = [
    "Why is Alice bored at the beginning?",
    "What unusual thing did the White Rabbit do?",
    "Where did Alice see the rabbit go?",
]

for q in example_queries:
    print("\n=== Query:", q)
    res, q_emb = chroma_vector_search(q, top_k=5)
    df = cosine_rerank(res, q_emb, top_k=5)
    display(df)


## 7) Build a RAG Answer (Template)

Below we assemble the **top-k** retrieved chunks into a single **context**. In a real RAG system, you'd pass this context to an LLM along with the user question.

We also show a **prompt template** that encourages grounded answers and citing chunk IDs/offsets.
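
One practical concern when assembling context is staying within a size budget while keeping each chunk identifiable for citation. The sketch below shows one possible approach, assuming each ranked chunk is a dict with `id`, `start_idx`, `end_idx`, and `text` keys. The helpers `assemble_context` and `build_prompt`, the character budget, and the sample offsets are all hypothetical and independent of the notebook's own `build_context`.

```python
def assemble_context(ranked_chunks, max_chars=2000):
    """Join ranked chunks into one context string.

    Stops before the character budget is exceeded, but always keeps the first chunk.
    """
    blocks, used = [], 0
    for ch in ranked_chunks:
        block = f"[{ch['id']} @ {ch['start_idx']}:{ch['end_idx']}]\n{ch['text']}"
        if blocks and used + len(block) > max_chars:
            break
        blocks.append(block)
        used += len(block) + 2  # account for the blank-line separator
    return "\n\n".join(blocks)

def build_prompt(question, context):
    """Wrap the context and question in grounding instructions."""
    return (
        "Answer using only the context below. "
        "If the context is insufficient, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Cite the chunk labels you used."
    )

# Made-up ranked results, best match first.
ranked_example = [
    {"id": "chunk_1", "start_idx": 680, "end_idx": 1400,
     "text": "the Rabbit actually took a watch out of its waistcoat-pocket"},
    {"id": "chunk_0", "start_idx": 0, "end_idx": 800,
     "text": "Alice was beginning to get very tired of sitting by her sister"},
]
small_ctx = assemble_context(ranked_example, max_chars=100)
print(build_prompt("What did the Rabbit take out?", small_ctx))
```

With the deliberately small 100-character budget only the top-ranked chunk fits; raising `max_chars` lets both through.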


In [None]:
# %%
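def build_context(query: str, k: int = 3) -> str:
    res, q_emb = chroma_vector_search(query, top_k=max(k, 5))
    df = cosine_rerank(res, q_emb, top_k=k)
    # The re-rank table keeps only a 160-char preview, so look up each chunk's
    # full text by its start offset; the LLM should see whole chunks, not previews.
    full_text = {m.get("start_idx"): d for d, m in zip(res["documents"][0], res["metadatas"][0])}
    context_blocks = []
    for i, row in df.iterrows():
        text = full_text.get(row["start_idx"], row["preview"])
        context_blocks.append(f"[CHUNK {i}] start={row['start_idx']} end={row['end_idx']}\n{text}")
    context = "\n\n".join(context_blocks)
    return context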

user_question = "Why did Alice run after the rabbit?"
context = build_context(user_question, k=3)
print("Constructed context:\n\n" + context)

prompt_template = f"""
You are a helpful assistant. Answer the user's question **using only** the provided context.
If the answer isn't in the context, say you don't know.

Question: {user_question}

Context:
{context}

Instructions:
- Ground your answer in the context.
- If the context is insufficient, say "I don't know based on the provided context."
- Cite the chunk indices you used (e.g., [CHUNK 0, CHUNK 2]).
"""
print("\n--- RAG Prompt Template ---\n")
print(prompt_template)


## 8) Wrap-Up & Next Steps
- **Adjust chunk size & overlap** and evaluate retrieval quality with a few representative questions.
- **Swap the embedding model** (e.g., `intfloat/e5-small-v2`) and consider its query formatting (`"query: ..."` vs `"passage: ..."`).
- **Scale up**: multiple documents, PDFs → text extraction, deduplication, metadata hygiene.
- **Evaluation**: add a small labeled set of queries and compute Recall@k/Precision@k.
- **Rerankers**: try stronger re-rankers if needed (cross-encoder or LLM-based), once students grasp the basics.
- **Guardrails**: instruct the LLM to answer only from retrieved context and to cite chunks.
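
For the evaluation step, Precision@k is the fraction of the top-k retrieved chunks that are relevant, and Recall@k is the fraction of all relevant chunks that appear in the top k. A minimal sketch follows, assuming you have hand-labeled the relevant chunk IDs for each query; the function name and the labels below are invented for illustration.

```python
def precision_recall_at_k(retrieved_ids, relevant_ids, k):
    """Return (precision@k, recall@k) for one query.

    retrieved_ids: ranked list of chunk IDs returned by the retriever.
    relevant_ids:  set of chunk IDs a human judged relevant.
    """
    top = retrieved_ids[:k]
    hits = sum(1 for doc_id in top if doc_id in relevant_ids)
    precision = hits / k if k > 0 else 0.0
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall

# Hypothetical labeled example: two relevant chunks, one of them retrieved in the top 2.
retrieved_example = ["chunk_1", "chunk_0", "chunk_3"]
relevant_example = {"chunk_1", "chunk_2"}

p_at_2, r_at_2 = precision_recall_at_k(retrieved_example, relevant_example, k=2)
print(f"Precision@2 = {p_at_2:.2f}, Recall@2 = {r_at_2:.2f}")
```

Averaging these values over a small set of labeled queries gives a simple baseline for comparing chunk sizes or embedding models.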
