
# RAG Lab 2: Query → Retrieve (Chroma) → Rerank → HF LLM Answer

**Updated:** 2025-11-08

In this short lab you will:
1. Enter a natural-language **query** via a widget
2. **Embed** the query using the **same model** as your document embeddings
3. Retrieve a candidate pool larger than k (**k×10** in the code below) from an **existing ChromaDB** collection
4. **Rerank** those candidates with a **cross-encoder** reranker
5. Keep the **top k** passages and build a grounded prompt
6. Run **Hugging Face LLM** inference and display the **answer**

> Goal: illustrate a lean RAG pipeline with reranking. 


In [1]:
# ✅ Install minimal dependencies. The command is commented out; uncomment it if any package is missing.
# !pip install langchain langchain-community pypdf chromadb sentence-transformers transformers tqdm ipywidgets pandas --quiet

In [2]:

# ---- Imports & configuration ----
import os, numpy as np, pandas as pd
from typing import List, Dict
from tqdm import tqdm
import ipywidgets as w

import chromadb
from sentence_transformers import SentenceTransformer, CrossEncoder
from transformers import pipeline

from pprint import PrettyPrinter
# Create a PrettyPrinter with custom indentation
pp = PrettyPrinter(indent=4)

# === Paths & model names (match prior lab defaults) ===
PERSIST_DIR      = "./rag_chroma"                                      # where the prior lab persisted vectors
COLLECTION_NAME  = "cnu_rag_lab"                                       # collection name used previously
EMBED_MODEL_NAME = "sentence-transformers/msmarco-distilbert-cos-v5"   # must match the document embedder
RERANK_MODEL     = "cross-encoder/ms-marco-electra-base"               # cross-encoder reranker
LLM_MODEL        = "google/flan-t5-small"                              # compact text2text generation model
K_DEFAULT        = 5                                                   # default number of passages kept after reranking


## 1) Load embedder, reranker and a text generation model to give answers

In [3]:

# Load the SAME embedder used for documents, so query embeddings live in the same space.
embedder = SentenceTransformer(EMBED_MODEL_NAME)
print("Embedder:", EMBED_MODEL_NAME)

# Cross-encoder reranker: computes relevance for (query, passage) pairs. Higher = better.
reranker = CrossEncoder(RERANK_MODEL)
print("Reranker:", RERANK_MODEL)

# Lightweight HF text2text model for grounded answering (fast on CPU relative to larger LLMs).
# You can swap to a chat model later if you have more compute.
# Note: the answering cell later in this notebook calls the chat pipeline `llm`, not `gen`.
gen = pipeline("text2text-generation", model=LLM_MODEL)
print("HF LLM:", LLM_MODEL)


Embedder: sentence-transformers/msmarco-distilbert-cos-v5
Reranker: cross-encoder/ms-marco-electra-base


Device set to use cuda:0


HF LLM: google/flan-t5-small


## 2) Load the Chroma database generated in Lab 1

In [4]:
import chromadb

# Connect to existing Chroma collection
client = chromadb.PersistentClient(path=PERSIST_DIR)
try: 
    collection = client.get_collection(name=COLLECTION_NAME)
    print(f"Connected to collection '{COLLECTION_NAME}' with {collection.count()} vectors at {PERSIST_DIR}")
except Exception as e:
    # Include the underlying error and chain it so the original cause is not lost.
    raise SystemExit(
        f"[Error] Could not open Chroma collection '{COLLECTION_NAME}' at {PERSIST_DIR}: {e}\n"
        "Run the previous RAG lab to build it, then re-run this notebook."
    ) from e


Connected to collection 'cnu_rag_lab' with 1256 vectors at ./rag_chroma



## 3) Typical RAG Retrieval Flow (two‑stage)

```
Query  ──► Bi‑encoder vector ──► ANN index (top‑k docs)
                               └─► k candidates
Query + each candidate doc ──► Cross‑encoder (re‑ranker) ──► final ordered list
Top‑m chunks ──► Prompt context for LLM
```
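
The flow above can be sketched without loading any models. The block below is a minimal illustration, assuming two stand-in scoring functions (`word_overlap` and `overlap_ratio`, both hypothetical) in place of the bi-encoder and cross-encoder, and a handful of made-up documents. Only the control flow mirrors the diagram: a cheap scorer builds a candidate pool, then a more careful scorer reorders just that pool.

```python
from typing import Callable, List, Tuple

def two_stage_search(
    query: str,
    docs: List[str],
    cheap_score: Callable[[str, str], float],    # stand-in for bi-encoder similarity
    careful_score: Callable[[str, str], float],  # stand-in for cross-encoder relevance
    n_candidates: int,
    top_m: int,
) -> List[Tuple[str, float]]:
    # Stage 1: score every doc cheaply and keep a generous candidate pool.
    candidates = sorted(docs, key=lambda d: cheap_score(query, d), reverse=True)[:n_candidates]
    # Stage 2: rescore only the candidates with the expensive scorer.
    rescored = [(d, careful_score(query, d)) for d in candidates]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)[:top_m]

def word_overlap(query: str, doc: str) -> float:
    # Toy "fast" scorer: number of distinct words shared by query and doc.
    return float(len(set(query.lower().split()) & set(doc.lower().split())))

def overlap_ratio(query: str, doc: str) -> float:
    # Toy "careful" scorer: fraction of the doc's words that appear in the query.
    words = doc.lower().split()
    q = set(query.lower().split())
    return sum(w in q for w in words) / len(words)

# Made-up documents, for illustration only.
docs = [
    "the library opens at nine",
    "course prerequisites are listed in the catalog",
    "prerequisites for the course include an intro course and a math course in the catalog somewhere",
    "parking permits are sold online",
]

results = two_stage_search("course prerequisites", docs, word_overlap, overlap_ratio,
                           n_candidates=2, top_m=1)
print(results)
```

Both catalog documents tie under the cheap scorer; the second stage breaks the tie in favor of the shorter, more focused one. In the lab itself, Chroma plays the role of stage 1 and the `CrossEncoder` plays the role of stage 2.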



## 4) Why is a re‑ranker more accurate than a bi‑encoder? (Core intuition)

| Model | Input/Scoring | Strengths | Weaknesses |
|---|---|---|---|
| **Bi‑encoder** | Encodes **query** and **doc** *independently* into vectors; score = cosine/dot | Very **fast**, **scalable** (precompute doc vectors; index with FAISS/Chroma/Pinecone) | **No token‑level interaction** between query & doc; can miss subtle meaning (negation, entities, context) |
| **Cross‑encoder** (Re‑ranker) | Reads **query + doc together** (e.g., `[CLS] query [SEP] doc`); predicts **relevance** | **Deep token‑level attention**; **context‑sensitive** scoring; higher **accuracy** | **Slower** (must score each pair), no doc precomputation |

**Example:**

**Query:** “Documents not about CNNs”

**Doc:** “This paper discusses convolutional networks”

- Bi-encoder: high similarity (misses “not”)

- Re-ranker: low relevance (understands negation)


**Reason the re‑ranker wins:** It attends across tokens of query **and** document jointly, so it can model negation, long‑distance dependencies, and nuanced phrasing that a single fixed vector (bi‑encoder output) cannot fully capture.
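
To make the "single fixed vector" limitation concrete, the toy below uses a bag-of-words count vector as a crude stand-in for a bi-encoder. This is an assumption for illustration only: real embedding models are far richer and may capture some negation. The point it shows is structural: each text is encoded on its own, so adding "not" to the query changes its vector only slightly and the cosine score against the document barely moves.

```python
import math
from collections import Counter

def bow_vector(text: str) -> Counter:
    # Encode a text independently of any other text (as a bi-encoder does).
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

doc = bow_vector("this paper discusses convolutional networks")
sim_pos = cosine(bow_vector("papers about convolutional networks"), doc)
sim_neg = cosine(bow_vector("papers not about convolutional networks"), doc)

print(f"without 'not': {sim_pos:.3f}")
print(f"with 'not':    {sim_neg:.3f}")
```

The two scores differ only slightly even though the queries ask for opposite things. A cross-encoder avoids this by reading the query and the document in one pass, so "not" can directly affect how the document's tokens are weighed.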



## 5) Retrieval: Top‑k Nearest Chunks

We'll embed the user query with the **same model** used to create the Chroma database in the last lab, then query ChromaDB for the nearest chunks.


In [5]:

def embed_query(q: str) -> np.ndarray:
    '''Embed a query string using the same SentenceTransformer model (normalized).'''
    # For asymmetric semantic search, sentence-transformers recommends encode_query for queries
    # (available in sentence-transformers 5.0 and later).
    v = embedder.encode_query([q], convert_to_numpy=True, normalize_embeddings=True)
    return v[0].astype("float32")

def chroma_retrieve(query: str, k2: int) -> List[Dict]:
    '''Retrieve the top `k2` candidates from Chroma using vector similarity.'''
    # q_emb = embed_query(query).tolist()
    q_emb = embed_query(query)

    res = collection.query(
        query_embeddings=[q_emb],
        n_results=k2,
        # Return the chunk text, its metadata (e.g. PDF page and chunk info),
        # and the distance to the query (smaller = closer). IDs are always returned.
        include=["documents", "metadatas", "distances"],
    )
    
    # Flatten results into a list of dicts for easy handling
    items = []
    for cid, doc, meta, dist in zip(res["ids"][0], res["documents"][0], res["metadatas"][0], res["distances"][0]):
        items.append({
            "id": cid,
            "text": doc,
            "source": meta.get("source"),
            "page": meta.get("page"),
            "method": meta.get("method"),
            "distance": float(dist)
        })
    return items

def rerank(query: str, candidates: List[Dict], k: int) -> List[Dict]:
    '''Use a cross-encoder to rerank and return the top-k passages for the final prompt.'''
    pairs = [(query, c["text"]) for c in candidates]
    scores = reranker.predict(pairs)  # vector of relevance scores (higher is better)

    # Attach scores to candidates
    for c, s in zip(candidates, scores):
        c["score"] = float(s)
        
    # Sort by score descending and keep top-k
    return sorted(candidates, key=lambda x: x["score"], reverse=True)[:k]
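
Note that the `distance` field stored by `chroma_retrieve` is a distance, not a similarity, so smaller values mean closer matches. As a hedged sketch: assuming the collection was created with the cosine space (`hnsw:space` set to `"cosine"`), the distance is `1 - cosine_similarity`; with Chroma's default squared-L2 space and unit-normalized embeddings it works out to `2 - 2 * cosine_similarity`. The helper name `distance_to_cosine` is hypothetical, and which space applies depends on how the collection was built in the previous lab.

```python
def distance_to_cosine(distance: float, space: str = "cosine") -> float:
    '''Convert a Chroma distance to cosine similarity (assumes unit-length vectors for "l2").'''
    if space == "cosine":
        # Cosine space: distance = 1 - cos
        return 1.0 - distance
    if space == "l2":
        # Squared L2 between unit vectors: distance = 2 - 2 * cos
        return 1.0 - distance / 2.0
    raise ValueError(f"Unsupported space: {space!r}")

print(distance_to_cosine(0.25))         # cosine space
print(distance_to_cosine(0.5, "l2"))    # squared-L2 space, normalized vectors
```

Both calls describe the same underlying similarity, which is why the raw distance values should not be compared across collections built with different spaces.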




## 6) Build LLM prompt with context from reranker

In [6]:
def build_context(passages: List[Dict]) -> str:
    '''Build a compact context block with source/page for grounding.'''
    lines = []
    for p in passages:
        tag = f"[{p.get('source')} p.{p.get('page')} | {p.get('method')}] "
        text = p["text"].strip().replace("\n", " ")
        lines.append(tag + text)
    return "\n- " + "\n- ".join(lines) if lines else ""

#same function from week11/embeddings_cosine_similarity_mini_lab.ipynb
def build_prompt(passages: List[Dict], question: str) -> List[Dict]:
    '''Build chat messages (system + user) that ground the answer in the given passages.'''
    context = build_context(passages)

    system_msg = ("""You are a helpful assistant. Answer the user's question **using only** the provided context.
    Instructions:
    - Ground your answer in the context.
    - If the answer is not in the context, say "I don't know based on the provided context."
    """)

    user_msg = f"Context:\n{context}\n\nQuestion:\n{question}\n\nAnswer:"

    messages = [
        {"role": "system", "content": system_msg},
        {"role": "user", "content": user_msg}
    ]
    return messages
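
`build_prompt` returns chat-style messages, which suits chat models. A plain text2text model, such as the `gen` pipeline loaded in section 1, expects a single string instead. The block below is a minimal sketch of that flattening step, assuming a hypothetical helper `messages_to_text` and an invented sample message list.

```python
from typing import Dict, List

def messages_to_text(messages: List[Dict[str, str]]) -> str:
    '''Join chat messages into one prompt string, preserving their order.'''
    parts = [m["content"].strip() for m in messages if m.get("content")]
    return "\n\n".join(parts)

# Invented sample in the same shape that build_prompt returns.
sample = [
    {"role": "system", "content": "Answer using only the provided context."},
    {"role": "user", "content": "Context:\n- [toy.pdf p.1 | demo] The sky is blue.\n\n"
                                "Question:\nWhat color is the sky?\n\nAnswer:"},
]

flat = messages_to_text(sample)
print(flat)
```

The flattened string could then be passed to a text2text pipeline. Small models such as flan-t5-small accept only a short input (roughly 512 tokens), so long contexts would likely be truncated.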

In [7]:
from transformers import pipeline
model_name = "mistralai/Mistral-7B-Instruct-v0.2"
llm = pipeline("text-generation", model=model_name,tokenizer=model_name) 


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

Device set to use cuda:0


In [8]:

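def answer_with_llm(query: str, passages: List[Dict], max_new_tokens: int = 256, verbose: bool = True) -> str:
    '''Build the grounded prompt, run the chat LLM, and return the assistant's reply text.'''
    prompt = build_prompt(passages, query)
    if verbose:
        pp.pprint(prompt)
    # For chat-style input, generated_text is the message list with the assistant reply appended last.
    return llm(prompt, max_new_tokens=max_new_tokens)[0]['generated_text'][-1]['content']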
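
The indexing at the end of `answer_with_llm` depends on the shape of the pipeline's return value. For chat-style input, the `text-generation` pipeline returns the input messages with the assistant's reply appended. The sketch below uses a hand-written mock of that structure (not real model output) and a hypothetical helper `extract_reply` that selects the final message and checks its role.

```python
from typing import Any, Dict, List

def extract_reply(pipeline_output: List[Dict[str, Any]]) -> str:
    '''Return the assistant's text from a text-generation pipeline result.'''
    generated = pipeline_output[0]["generated_text"]
    if isinstance(generated, str):
        # Plain-string prompts produce a plain string.
        return generated
    last = generated[-1]
    if last.get("role") != "assistant":
        raise ValueError("No assistant message found at the end of the conversation.")
    return last["content"]

# Hand-written mock of the structure returned for chat input.
mock_output = [{
    "generated_text": [
        {"role": "system", "content": "Answer using only the provided context."},
        {"role": "user", "content": "Question: ..."},
        {"role": "assistant", "content": "I don't know based on the provided context."},
    ]
}]

print(extract_reply(mock_output))
```

Selecting the last message keeps the extraction correct if the prompt later gains or loses a message, for example if the system message is merged into the user turn.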

In [12]:
# Simple UI components, displayed below in a vertical box layout:
# - q_box: text area for entering the query
# - k_slider: slider to select the number of final passages (k)
# - run_btn: button to trigger the retrieve → rerank → answer pipeline
# - out: output area to display results
q_box   = w.Textarea(value="", placeholder="Type Query Here", description="Query:", layout=w.Layout(width="100%", height="100px"))
k_slider= w.IntSlider(value=K_DEFAULT, min=2, max=10, step=1, description="k")
run_btn = w.Button(description="Retrieve → Rerank → Answer", button_style="primary")
out     = w.Output()  # must exist before on_click and the VBox reference it

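def generate_df(topk: List[Dict]) -> pd.DataFrame:
    '''Tabulate passages with their cross-encoder score (set by rerank) and a short snippet.'''
    return pd.DataFrame([{
            "rank": i+1,
            "rerank_score": round(p["score"], 4),  # cross-encoder relevance, not cosine similarity
            "source": p["source"],
            "page": p["page"],
            "snippet": (p["text"][:220] + "…") if len(p["text"]) > 220 else p["text"]
        } for i, p in enumerate(topk)])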

# Flag to control verbosity of output (detailed logging and intermediate results)
verbose = False

def on_click(_):
    """
    Event handler for the 'Retrieve → Rerank → Answer' button.
    Executes the full RAG pipeline:
    1. Embed and retrieve top k*10 candidates from ChromaDB
    2. Rerank candidates using cross-encoder to get top k
    3. Build prompt with top k passages and generate answer with LLM
    """
    out.clear_output(wait=True)
    query = q_box.value.strip()
    k = int(k_slider.value)
    
    # Validate query input
    if not query:
        with out: print("Please type a query.")
        return
    
    with out:
        # Step 1: Retrieve k*10 candidates using bi-encoder (fast, approximate)
        cands = chroma_retrieve(query, k2=k*10)
        if not cands:
            print("No results. Make sure your Chroma collection is populated (run the first lab).")
            return
        
        # Step 2: Rerank candidates using cross-encoder (slower, more accurate)
        topk = rerank(query, cands, k=k)
        
        # Display intermediate results if verbose mode is on
        if verbose: 
            print(f'Query length={len(query)}')
            print("Retrieved from Chroma…")

            print(f"Retrieved {len(cands)} candidates.")
            df = generate_df(cands)
            display(df)

            print(f"Top-{k} Reranked Passages:")
            df = generate_df(topk)
            display(df)

        # Step 3: Generate answer using LLM with top-k reranked passages as context
        ans = answer_with_llm(query, topk, max_new_tokens=1000, verbose=verbose)
        pp.pprint(ans)

# Attach click handler to button
run_btn.on_click(on_click)

# Display UI components in vertical layout
w.VBox([q_box, k_slider, run_btn, out])

# Example query: what are the prerequisites for CPSC 475?

VBox(children=(Textarea(value='', description='Query:', layout=Layout(height='100px', width='100%'), placehold…


### Tips & next steps
- **Why rerankers help:** bi-encoders (embeddings) are fast but approximate; cross-encoders read the *pair* (query, passage) together and usually reorder the top candidates more accurately.
- Try other rerankers: `cross-encoder/ms-marco-MiniLM-L-12-v2`, `BAAI/bge-reranker-base` (larger models are slower but usually more accurate).
- Keep the **same embedder** for documents and queries. The reranker can be swapped freely because it scores raw (query, passage) text, not stored vectors.
- Experiment: different `k`, prompt templates, or add citation formatting to your final answer.
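
As one way to approach the citation idea in the last tip, the sketch below numbers each passage and builds a matching source list, so the model can be asked to cite `[1]`, `[2]`, and so on. It assumes passages are dicts with the `source`, `page`, and `text` keys used earlier in this lab; the function name `format_with_citations` and the sample data are hypothetical.

```python
from typing import Dict, List, Tuple

def format_with_citations(passages: List[Dict]) -> Tuple[str, str]:
    '''Return (numbered context block, numbered source list) for a prompt.'''
    context_lines, source_lines = [], []
    for i, p in enumerate(passages, start=1):
        text = " ".join(p["text"].split())  # collapse newlines and repeated spaces
        context_lines.append(f"[{i}] {text}")
        source_lines.append(f"[{i}] {p.get('source')} p.{p.get('page')}")
    return "\n".join(context_lines), "\n".join(source_lines)

# Invented passages, for illustration only.
sample_passages = [
    {"source": "toy.pdf", "page": 3, "text": "First passage\nsplit over lines."},
    {"source": "toy.pdf", "page": 7, "text": "Second passage."},
]

context, sources = format_with_citations(sample_passages)
print(context)
print(sources)
```

The numbered context could replace the output of `build_context` in the prompt, with an added instruction to cite passage numbers, and the source list could be printed under the final answer.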
