# CS 5588 — Week 2 Hands-On: Applied RAG for Product & Venture Development (Two-Step)
**Initiation (20 min, Jan 27)** → **Completion (60 min, Jan 29)**

**Submission:** Survey + GitHub  
**Due:** **Jan 29 (Thu), end of class**

## New Requirement (Important)
For **full credit (2% individual)** you must:
1) Use **your own project-aligned dataset** (not only a benchmark)  
2) Add **your own explanations** for key steps

### ✅ “Cell Description” rule (same style as CS 5542)
After each **IMPORTANT** code cell, add a short Markdown **Cell Description** (2–5 sentences):
- What the cell does
- Why it matters for a **product-grade** RAG system
- Any design choices (chunk size, α, reranker, etc.)

> Treat these descriptions as **mini system documentation** (engineering + product thinking).


## Project Dataset Guide (Required for Full Credit)

### Minimum requirements
- **5–25 documents** (start small; scale later)
- Prefer **plain text** documents (`.txt`)
- Put files in a folder named: `project_data/`

### Recommended dataset types (choose one)
- Policies / guidelines / compliance docs
- Technical docs / manuals / SOPs
- Customer support FAQs / tickets (de-identified)
- Research notes / literature summaries
- Domain corpus (healthcare, cybersecurity, business, etc.)

> Benchmarks are optional, but **cannot** earn full credit by themselves.


## 0) One-Click Setup + Import Check  ✅ **IMPORTANT: Add Cell Description after running**
If you are in **Google Colab**, run the install cell below, then **Runtime → Restart session** if imports fail.


In [None]:
# CS 5588 Lab 2 — One-click dependency install (Colab)
!pip -q install -U sentence-transformers chromadb faiss-cpu scikit-learn rank-bm25 transformers accelerate

import sys, platform
print("Python:", sys.version)
print("Platform:", platform.platform())
print("✅ If imports fail later: Runtime → Restart session and run again.")


Python: 3.12.12 (main, Oct 10 2025, 08:52:57) [GCC 11.4.0]
Platform: Linux-6.6.105+-x86_64-with-glibc2.35
✅ If imports fail later: Runtime → Restart session and run again.


### ✍️ Cell Description (Student)
Write 2–5 sentences explaining what the setup cell does and why restarting the runtime sometimes matters after pip installs.
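
The setup cell installs packages but does not itself verify that they import. A minimal sketch of such a check follows, assuming the pip-name to import-name mapping listed in `REQUIRED` (the helper name `missing_packages` is hypothetical and not part of the lab template). It uses only the standard library, so it runs even when an install has failed.

```python
import importlib.util

# Assumed mapping: pip package name -> top-level import name.
REQUIRED = {
    "sentence-transformers": "sentence_transformers",
    "chromadb": "chromadb",
    "faiss-cpu": "faiss",
    "scikit-learn": "sklearn",
    "rank-bm25": "rank_bm25",
    "transformers": "transformers",
    "accelerate": "accelerate",
}

def missing_packages(required=REQUIRED):
    """Return the pip names whose import module cannot be located."""
    return [
        pip_name
        for pip_name, module in required.items()
        if importlib.util.find_spec(module) is None
    ]

missing = missing_packages()
print("Missing packages:", missing if missing else "none")
```

If the list is non-empty right after a successful `pip install`, restarting the runtime and rerunning the check is usually the first thing to try.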


# STEP 1 — INITIATION (Jan 27, 20 minutes)
**Goal:** Define the **product**, **users**, **dataset reality**, and **trust risks**.

> This is a **product milestone**, not a coding demo.


## 1A) Product Framing (Required)  ✅ **IMPORTANT: Add Cell Description after running**
Fill in the template below like a founder/product lead.


In [None]:
# 1A) Product Framing — Founder / Product Lead View

product = {
  "product_name": "TrustDoc AI",
  "target_users": (
      "Students, analysts, and early-career professionals who rely on AI "
      "to answer questions from technical, academic, or policy documents"
  ),
  "core_problem": (
      "General-purpose chatbots generate fluent answers but often hallucinate, "
      "overgeneralize, or fabricate sources when asked questions that require "
      "grounding in specific documents."
  ),
  "why_rag_not_chatbot": (
      "A standard chatbot answers from parametric memory and cannot verify "
      "claims against real documents. RAG retrieves relevant source material, "
      "grounds the response in evidence, provides citations, and can explicitly "
      "refuse to answer when documentation is insufficient."
  ),
  "failure_harms_who_and_how": (
      "If the system hallucinates or overstates confidence, students may learn "
      "incorrect information, researchers may make flawed decisions, and trust "
      "in AI-assisted analysis is reduced. In high-stakes scenarios, this can "
      "lead to academic, legal, or ethical consequences."
  ),
}

product


{'product_name': 'TrustDoc AI',
 'target_users': 'Students, analysts, and early-career professionals who rely on AI to answer questions from technical, academic, or policy documents',
 'core_problem': 'General-purpose chatbots generate fluent answers but often hallucinate, overgeneralize, or fabricate sources when asked questions that require grounding in specific documents.',
 'why_rag_not_chatbot': 'A standard chatbot answers from parametric memory and cannot verify claims against real documents. RAG retrieves relevant source material, grounds the response in evidence, provides citations, and can explicitly refuse to answer when documentation is insufficient.',
 'failure_harms_who_and_how': 'If the system hallucinates or overstates confidence, students may learn incorrect information, researchers may make flawed decisions, and trust in AI-assisted analysis is reduced. In high-stakes scenarios, this can lead to academic, legal, or ethical consequences.'}

### ✍️ Cell Description (Student)
Explain your product in 3–5 sentences: who the user is, what pain point exists today, and why grounded RAG helps.


## 1B) Dataset Reality Plan (Required)  ✅ **IMPORTANT: Add Cell Description after running**
Describe where your data comes from **in the real world**.


In [None]:
# 1B) Dataset Reality Plan — Real-World View

dataset_plan = {
  "data_owner": (
      "Public institutions, academic publishers, and instructor-provided "
      "course materials"
  ),
  "data_sensitivity": (
      "Public to low-sensitivity internal data; no personally identifiable "
      "information (PII) or regulated data is included"
  ),
  "document_types": (
      "Technical reports, academic papers, course notes, policy summaries, "
      "and instructional documentation in PDF or text format"
  ),
  "expected_scale_in_production": (
      "Initial deployment: 500–2,000 documents; "
      "scaled deployment: 10,000+ documents with periodic updates"
  ),
  "data_reality_check_paragraph": (
      "In real-world deployment, documents are often heterogeneous, noisy, "
      "and inconsistently formatted. Many PDFs contain tables, references, or "
      "scanned text that require preprocessing. Documents may be outdated or "
      "partially relevant, which makes retrieval quality more important than "
      "model size. The system must handle incomplete coverage gracefully and "
      "avoid answering when reliable evidence is not present."
  ),
}

dataset_plan


{'data_owner': 'Public institutions, academic publishers, and instructor-provided course materials',
 'data_sensitivity': 'Public to low-sensitivity internal data; no personally identifiable information (PII) or regulated data is included',
 'document_types': 'Technical reports, academic papers, course notes, policy summaries, and instructional documentation in PDF or text format',
 'expected_scale_in_production': 'Initial deployment: 500–2,000 documents; scaled deployment: 10,000+ documents with periodic updates',
 'data_reality_check_paragraph': 'In real-world deployment, documents are often heterogeneous, noisy, and inconsistently formatted. Many PDFs contain tables, references, or scanned text that require preprocessing. Documents may be outdated or partially relevant, which makes retrieval quality more important than model size. The system must handle incomplete coverage gracefully and avoid answering when reliable evidence is not present.'}

### ✍️ Cell Description (Student)
In a real deployment, the data would come from public academic sources and instructor-provided materials. The documents are low-sensitivity and contain no personal or regulated data. The system must ensure proper attribution and avoid fabricating or misrepresenting source content.

## 1C) User Stories + Mini Rubric (Required)  ✅ **IMPORTANT: Add Cell Description after running**
Define **3 user stories** (U1 normal, U2 high-stakes, U3 ambiguous/failure) + rubric for evidence and correctness.


In [None]:


user_stories = {
  "U1_normal": {
    "user_story": (
        "As a student, I want to ask questions about course documents so that "
        "I can quickly understand key concepts with verified sources."
    ),
    "acceptable_evidence": [
        "Direct quotes or summaries from retrieved course documents",
        "Citations referencing document IDs or chunks"
    ],
    "correct_answer_must_include": [
        "A clear, relevant answer to the question",
        "At least one explicit citation to supporting evidence"
    ],
  },

  "U2_high_stakes": {
    "user_story": (
        "As a researcher, I want evidence-backed answers from domain documents "
        "so that I can make informed decisions without relying on hallucinated information."
    ),
    "acceptable_evidence": [
        "Multiple corroborating documents",
        "High-similarity retrieved passages reviewed by the governance layer"
    ],
    "correct_answer_must_include": [
        "Explicit grounding in retrieved evidence",
        "Clear explanation tied directly to cited sources"
    ],
  },

  "U3_ambiguous_failure": {
    "user_story": (
        "As a user, I want the system to clearly state when there is insufficient "
        "evidence so that I am not misled by speculative answers."
    ),
    "acceptable_evidence": [
        "Low or no relevant retrieval results",
        "Similarity scores below the evidence threshold"
    ],
    "correct_answer_must_include": [
        "An explicit 'not enough evidence' response",
        "No fabricated facts or citations"
    ],
  },
}

user_stories


{'U1_normal': {'user_story': 'As a student, I want to ask questions about course documents so that I can quickly understand key concepts with verified sources.',
  'acceptable_evidence': ['Direct quotes or summaries from retrieved course documents',
   'Citations referencing document IDs or chunks'],
  'correct_answer_must_include': ['A clear, relevant answer to the question',
   'At least one explicit citation to supporting evidence']},
 'U2_high_stakes': {'user_story': 'As a researcher, I want evidence-backed answers from domain documents so that I can make informed decisions without relying on hallucinated information.',
  'acceptable_evidence': ['Multiple corroborating documents',
   'High-similarity retrieved passages reviewed by the governance layer'],
  'correct_answer_must_include': ['Explicit grounding in retrieved evidence',
   'Clear explanation tied directly to cited sources']},
 'U3_ambiguous_failure': {'user_story': 'As a user, I want the system to clearly state when ther

### ✍️ Cell Description (Student)
Explain why U2 is “high-stakes” and what the system must do to avoid harm (abstain, cite evidence, etc.).


## 1D) Trust & Risk Table (Required)
Fill at least **3 rows**. These risks should match your product and user stories.


In [None]:

risk_table = [
  {
    "risk": "Hallucination",
    "example_failure": (
        "The system generates an answer about a document requirement that "
        "is not supported by any retrieved source."
    ),
    "real_world_consequence": (
        "Students or researchers may accept false information as fact, "
        "leading to incorrect conclusions or decisions."
    ),
    "safeguard_idea": "Force citations + abstain when evidence confidence is low",
  },
  {
    "risk": "Omission",
    "example_failure": (
        "The system retrieves only one partially relevant document and misses "
        "other important supporting documents."
    ),
    "real_world_consequence": (
        "Incomplete answers may misrepresent the topic and reduce user trust "
        "in the system’s reliability."
    ),
    "safeguard_idea": "Recall tuning + hybrid keyword and vector retrieval",
  },
  {
    "risk": "Bias / Misleading Output",
    "example_failure": (
        "Retrieved documents disproportionately reflect one viewpoint, "
        "leading the model to present a skewed interpretation."
    ),
    "real_world_consequence": (
        "Users may be influenced toward incorrect or biased conclusions, "
        "especially in high-stakes research scenarios."
    ),
    "safeguard_idea": "Reranking rules + human review for high-risk queries",
  },
]

risk_table


[{'risk': 'Hallucination',
  'example_failure': 'The system generates an answer about a document requirement that is not supported by any retrieved source.',
  'real_world_consequence': 'Students or researchers may accept false information as fact, leading to incorrect conclusions or decisions.',
  'safeguard_idea': 'Force citations + abstain when evidence confidence is low'},
 {'risk': 'Omission',
  'example_failure': 'The system retrieves only one partially relevant document and misses other important supporting documents.',
  'real_world_consequence': 'Incomplete answers may misrepresent the topic and reduce user trust in the system’s reliability.',
  'safeguard_idea': 'Recall tuning + hybrid keyword and vector retrieval'},
 {'risk': 'Bias / Misleading Output',
  'example_failure': 'Retrieved documents disproportionately reflect one viewpoint, leading the model to present a skewed interpretation.',
  'real_world_consequence': 'Users may be influenced toward incorrect or biased concl

✅ **Step 1 Checkpoint (End of Jan 27)**
Commit (or submit) your filled templates:
- `product`, `dataset_plan`, `user_stories`, `risk_table`


# STEP 2 — COMPLETION (Jan 29, 60 minutes)
**Goal:** Build a working **product-grade** RAG pipeline:
Chunking → Keyword + Vector Retrieval → Hybrid α → Governance Rerank → Grounded Answer → Evaluation


## 2A) Project Dataset Setup (Required for Full Credit)  ✅ **IMPORTANT: Add Cell Description after running**

### Colab Upload Tips
- Left sidebar → **Files** → Upload `.txt`
- Place them into `project_data/`

This cell creates the folder and shows how many files were found.


In [None]:
import os, glob, shutil
from pathlib import Path

PROJECT_FOLDER = "project_data"
os.makedirs(PROJECT_FOLDER, exist_ok=True)

# (Optional helper) Move any .txt in current directory into project_data/
moved = 0
for fp in glob.glob("*.txt"):
    shutil.move(fp, os.path.join(PROJECT_FOLDER, os.path.basename(fp)))
    moved += 1

files = sorted(glob.glob(os.path.join(PROJECT_FOLDER, "*.txt")))
print("✅ project_data/ ready | moved:", moved, "| files:", len(files))
print("Example files:", files[:5])


✅ project_data/ ready | moved: 1 | files: 1
Example files: ['project_data/rag_trust_overview.txt']


### ✍️ Cell Description (Student)
The dataset includes domain documents related to trustworthy AI and RAG, stored as text files. A limited number of documents are used to simulate how real knowledge bases are processed. These documents reflect the same retrieval and grounding needs as the intended product scenario.
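
The minimum requirements (5–25 `.txt` documents in `project_data/`) can be checked before any chunking is done. The sketch below is one way to do that; `check_dataset` and the `MIN_DOCS` / `MAX_DOCS` constants are hypothetical names introduced here, with the bounds taken from the dataset guide above.

```python
from pathlib import Path

MIN_DOCS, MAX_DOCS = 5, 25  # bounds from the "Minimum requirements" list

def check_dataset(folder="project_data"):
    """Count .txt files in `folder` and compare against the lab's bounds."""
    root = Path(folder)
    paths = sorted(root.glob("*.txt")) if root.is_dir() else []
    # Empty files are skipped later by the loader, so count them separately.
    non_empty = [p for p in paths if p.stat().st_size > 0]
    return {
        "n_files": len(paths),
        "n_non_empty": len(non_empty),
        "meets_minimum": len(non_empty) >= MIN_DOCS,
        "within_maximum": len(non_empty) <= MAX_DOCS,
    }

print(check_dataset())
```

A `meets_minimum` value of `False` is a signal to add more documents before evaluating retrieval, since metrics on one or two chunks say little about the product.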


## 2B) Load Documents + Build Chunks  ✅ **IMPORTANT: Add Cell Description after running**
This milestone cell loads `.txt` documents and produces chunks using either **fixed** (character-window) or **semantic** (paragraph-packing) chunking.


In [None]:
import re
from pathlib import Path

def load_project_docs(folder="project_data", max_docs=25):
    paths = sorted(Path(folder).glob("*.txt"))[:max_docs]
    docs = []
    for p in paths:
        txt = p.read_text(encoding="utf-8", errors="ignore").strip()
        if txt:
            docs.append({"doc_id": p.name, "text": txt})
    return docs

def fixed_chunk(text, chunk_size=900, overlap=150):
    # Character-based chunking for speed + simplicity
    chunks, i = [], 0
    step = max(1, chunk_size - overlap)
    while i < len(text):
        chunks.append(text[i:i+chunk_size])
        i += step
    return [c.strip() for c in chunks if c.strip()]

def semantic_chunk(text, max_chars=1000):
    # Paragraph-based packing
    paras = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks, cur = [], ""
    for p in paras:
        if len(cur) + len(p) + 2 <= max_chars:
            cur = (cur + "\n\n" + p).strip()
        else:
            if cur: chunks.append(cur)
            cur = p
    if cur: chunks.append(cur)
    return chunks

# ---- Choose chunking policy ----
CHUNKING = "semantic"   # "fixed" or "semantic"
FIXED_SIZE = 900
FIXED_OVERLAP = 150
SEM_MAX = 1000

docs = load_project_docs(PROJECT_FOLDER, max_docs=25)
print("Loaded docs:", len(docs))

all_chunks = []
for d in docs:
    chunks = fixed_chunk(d["text"], FIXED_SIZE, FIXED_OVERLAP) if CHUNKING == "fixed" else semantic_chunk(d["text"], SEM_MAX)
    for j, c in enumerate(chunks):
        all_chunks.append({"chunk_id": f'{d["doc_id"]}::c{j}', "doc_id": d["doc_id"], "text": c})

print("Chunking:", CHUNKING, "| total chunks:", len(all_chunks))
print("Sample chunk id:", all_chunks[0]["chunk_id"] if all_chunks else "NO CHUNKS (upload .txt files first)")


Loaded docs: 1
Chunking: semantic | total chunks: 2
Sample chunk id: rag_trust_overview.txt::c0


### ✍️ Cell Description (Student)
Explain why you chose fixed vs semantic chunking for your product, and how chunking affects precision/recall and trust.
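
To make the trade-off concrete, the sketch below runs both policies on a three-paragraph toy text. The functions are compact re-statements of the two chunkers in the cell above (the name `paragraph_chunk` and the small sizes are chosen here for illustration only). Fixed windows can cut a sentence in half, while paragraph packing keeps each paragraph whole; note that with paragraph packing a single paragraph longer than `max_chars` is still emitted as one oversized chunk.

```python
import re

def fixed_chunk(text, chunk_size=40, overlap=10):
    """Character windows of `chunk_size`, advancing by chunk_size - overlap."""
    step = max(1, chunk_size - overlap)
    windows = (text[i:i + chunk_size].strip() for i in range(0, len(text), step))
    return [w for w in windows if w]

def paragraph_chunk(text, max_chars=80):
    """Pack whole paragraphs into chunks of at most `max_chars` where possible."""
    paras = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks, cur = [], ""
    for p in paras:
        if len(cur) + len(p) + 2 <= max_chars:
            cur = (cur + "\n\n" + p).strip()
        else:
            if cur:
                chunks.append(cur)
            cur = p
    if cur:
        chunks.append(cur)
    return chunks

toy = (
    "Refunds are issued within 30 days.\n\n"
    "Refunds are not issued for digital goods.\n\n"
    "Contact support to appeal."
)

for name, chunks in [("fixed", fixed_chunk(toy)), ("paragraph", paragraph_chunk(toy))]:
    print(name, len(chunks))
    for c in chunks:
        print("   ", repr(c))
```

Printing the chunks side by side is a quick way to see whether a policy separates a rule from its exception, which matters for trust when only the top few chunks reach the generator.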


## 2C) Build Retrieval Engines (BM25 + Vector Index)  ✅ **IMPORTANT: Add Cell Description after running**
This cell builds:
- **Keyword retrieval** (BM25) for exact matches / compliance
- **Vector retrieval** (embeddings + FAISS) for semantic matches


In [None]:

import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer
import faiss

# ----- Keyword (BM25) -----
tokenized = [c["text"].lower().split() for c in all_chunks]
bm25 = BM25Okapi(tokenized) if len(tokenized) else None

def keyword_search(query, k=10):
    if bm25 is None:
        return []
    scores = bm25.get_scores(query.lower().split())
    idx = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return [(all_chunks[i], float(scores[i])) for i in idx]

# ----- Vector (Embeddings + FAISS) -----
EMB_MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"
embedder = SentenceTransformer(EMB_MODEL_NAME)

chunk_texts = [c["text"] for c in all_chunks]
if len(chunk_texts) > 0:
    emb = embedder.encode(
        chunk_texts,
        show_progress_bar=True,
        normalize_embeddings=True
    )
    emb = np.asarray(emb, dtype="float32")

    # Inner product works as cosine similarity when embeddings are normalized
    index = faiss.IndexFlatIP(emb.shape[1])
    index.add(emb)

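    def vector_search(query, k=10):
        q = embedder.encode([query], normalize_embeddings=True).astype("float32")
        # FAISS pads results with index -1 when k exceeds the number of stored vectors;
        # cap k and drop padded entries so all_chunks[-1] is never returned by accident.
        scores, idx = index.search(q, min(k, index.ntotal))
        out = [(all_chunks[int(i)], float(s)) for s, i in zip(scores[0], idx[0]) if i >= 0]
        return out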

    print("✅ Vector index built | chunks:", len(all_chunks), "| dim:", emb.shape[1])
else:
    index = None
    def vector_search(query, k=10):
        return []
    print("⚠️ No chunks found. Upload .txt files to project_data/ and rerun.")


✅ Vector index built | chunks: 2 | dim: 384


### ✍️ Cell Description (Student)
Keyword retrieval is effective for exact terms, definitions, and compliance-style queries where specific wording matters. Vector retrieval captures semantic meaning and can find relevant passages even when users phrase questions differently from the documents. Using both tends to improve recall and lowers the chance of missing evidence in real-world queries.
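
One detail that affects exact-match retrieval is tokenization. Splitting on whitespace, as the BM25 cell does, leaves punctuation attached to words, so `evidence.` and `evidence?` are treated as different tokens. The sketch below contrasts that with a simple regex tokenizer; `simple_tokenize` is a hypothetical helper, not part of the lab template, and it is only one of several reasonable choices.

```python
import re

def simple_tokenize(text):
    """Lowercase the text and keep alphanumeric runs, dropping punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())

chunk = "RAG systems should cite evidence."
query = "What evidence?"

# Token overlap is what a keyword scorer can actually match on.
split_overlap = set(chunk.lower().split()) & set(query.lower().split())
regex_overlap = set(simple_tokenize(chunk)) & set(simple_tokenize(query))

print("whitespace overlap:", split_overlap)
print("regex overlap:", regex_overlap)
```

If a tokenizer like this is adopted, it should be applied to both the chunks and the query so the two sides stay consistent.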


## 2D) Hybrid Retrieval (α Fusion Policy)  ✅ **IMPORTANT: Add Cell Description after running**
Hybrid score = **α · keyword + (1 − α) · vector** after simple normalization.

Try α ∈ {0.2, 0.5, 0.8} and justify your choice.
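
A small worked example can help when justifying α. The sketch below applies min-max normalization and the fusion formula above to hypothetical scores for three chunks (`c0`, `c1`, `c2`); the numbers are invented for illustration and do not come from the lab dataset.

```python
def minmax(scores):
    """Min-max normalize a {chunk_id: score} dict to the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi - lo < 1e-8:
        return {cid: 1.0 for cid in scores}
    return {cid: (s - lo) / (hi - lo) for cid, s in scores.items()}

def fuse(keyword_scores, vector_scores, alpha):
    """Hybrid score = alpha * keyword + (1 - alpha) * vector, best first."""
    kw, vc = minmax(keyword_scores), minmax(vector_scores)
    fused = {
        cid: alpha * kw.get(cid, 0.0) + (1 - alpha) * vc.get(cid, 0.0)
        for cid in set(kw) | set(vc)
    }
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)

keyword_scores = {"c0": 4.0, "c1": 1.0, "c2": 0.0}     # hypothetical BM25 scores
vector_scores = {"c0": 0.30, "c1": 0.80, "c2": 0.55}   # hypothetical cosine similarities

for alpha in (0.2, 0.5, 0.8):
    ranking = fuse(keyword_scores, vector_scores, alpha)
    print(alpha, [(cid, round(score, 3)) for cid, score in ranking])
```

In this toy case the top-ranked chunk changes as α moves toward the keyword side, which is the kind of behavior worth checking on your own queries before settling on a value.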


In [None]:

chunk_by_id = {c["chunk_id"]: c for c in all_chunks}

def minmax_norm(pairs):
    scores = np.array([s for _, s in pairs], dtype="float32") if pairs else np.array([], dtype="float32")
    if len(scores) == 0:
        return []
    mn, mx = float(scores.min()), float(scores.max())
    if mx - mn < 1e-8:
        return [(c, 1.0) for c, _ in pairs]
    return [(c, float((s - mn) / (mx - mn))) for (c, s) in pairs]

def hybrid_search(query, k_kw=10, k_vec=10, alpha=0.5, k_out=10):
    kw = keyword_search(query, k_kw)
    vc = vector_search(query, k_vec)

    kw_n = {c["chunk_id"]: s for c, s in minmax_norm(kw)}
    vc_n = {c["chunk_id"]: s for c, s in minmax_norm(vc)}

    ids = set(kw_n) | set(vc_n)
    fused = []
    for cid in ids:
        s = alpha * kw_n.get(cid, 0.0) + (1 - alpha) * vc_n.get(cid, 0.0)
        fused.append((chunk_by_id[cid], float(s)))

    fused.sort(key=lambda x: x[1], reverse=True)
    return fused[:k_out]

ALPHA = 0.5  # try 0.2 / 0.5 / 0.8


### ✍️ Cell Description (Student)
This product's primary users are precision-first: they need accurate, evidence-backed answers rather than exploratory discovery. Because incorrect or hallucinated information is a trust risk, a balanced α of 0.5 was chosen so that both exact keyword matches and semantic matches contribute to the ranking. The intent is to reduce missed or mismatched evidence while keeping answers grounded for decision-oriented use cases.

## 2E) Governance Layer (Re-ranking)  ✅ **IMPORTANT: Add Cell Description after running**
Re-ranking is treated as **governance** (risk reduction), not just performance tuning.


In [None]:
from sentence_transformers import CrossEncoder

RERANK = True
RERANK_MODEL = "cross-encoder/ms-marco-MiniLM-L-6-v2"
reranker = CrossEncoder(RERANK_MODEL) if RERANK else None

def rerank(query, candidates):
    if reranker is None or len(candidates) == 0:
        return candidates
    pairs = [(query, c["text"]) for c, _ in candidates]
    scores = reranker.predict(pairs)
    out = [(c, float(s)) for (c, _), s in zip(candidates, scores)]
    out.sort(key=lambda x: x[1], reverse=True)
    return out

print("✅ Reranker:", RERANK_MODEL if RERANK else "OFF")


✅ Reranker: cross-encoder/ms-marco-MiniLM-L-6-v2


### ✍️ Cell Description (Student)
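Governance in this product refers to enforcing trust and safety controls between retrieval and answer generation. The reranking step rescores each candidate against the query with a cross-encoder, so weakly related chunks fall below the top results that are passed to generation. This lowers hallucination risk by putting the strongest available evidence first; because it reorders candidates rather than filtering them, it does not by itself guarantee that the evidence is sufficient.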


## 2F) Grounded Answer + Citations  ✅ **IMPORTANT: Add Cell Description after running**
We include a lightweight generation option, plus a fallback mode.

Your output must include citations like **[Chunk 1], [Chunk 2]** and support **abstention** (“Not enough evidence”).


In [None]:
from transformers import pipeline

USE_LLM = False  # set True to generate; keep False if downloads are slow
GEN_MODEL = "google/flan-t5-base"

gen = pipeline("text2text-generation", model=GEN_MODEL) if USE_LLM else None

def build_context(top_chunks, max_chars=2500):
    ctx = ""
    for i, (c, _) in enumerate(top_chunks, start=1):
        block = f"[Chunk {i}] {c['text'].strip()}\n"
        if len(ctx) + len(block) > max_chars:
            break
        ctx += block + "\n"
    return ctx.strip()

def rag_answer(query, top_chunks):
    ctx = build_context(top_chunks)
    if USE_LLM and gen is not None:
        prompt = (
            "Answer the question using ONLY the evidence below. "
            "If there is not enough evidence, say 'Not enough evidence.' "
            "Include citations like [Chunk 1], [Chunk 2].\n\n"
            f"Question: {query}\n\nEvidence:\n{ctx}\n\nAnswer:"
        )
        out = gen(prompt, max_new_tokens=180)[0]["generated_text"]
        return out, ctx
    else:
        # fallback: evidence-first placeholder
        answer = (
            "Evidence summary (fallback mode):\n"
            + "\n".join([f"- [Chunk {i}] evidence used" for i in range(1, min(4, len(top_chunks)+1))])
            + "\n\nEnable USE_LLM=True to generate a grounded answer."
        )
        return answer, ctx


### ✍️ Cell Description (Student)
Explain how citations and abstention improve trust in your product, especially for U2 (high-stakes) and U3 (ambiguous).
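
Citations are only useful if they point at chunks that were actually in the context. The sketch below checks a generated answer for `[Chunk N]` markers and flags any that fall outside the context, and it also detects the abstention phrase. The helper names (`cited_chunks`, `check_citations`) are hypothetical additions, not part of the lab template.

```python
import re

def cited_chunks(answer):
    """Return the set of chunk numbers cited as [Chunk N] in the answer."""
    return {int(n) for n in re.findall(r"\[Chunk (\d+)\]", answer)}

def check_citations(answer, n_context_chunks, abstain_text="Not enough evidence"):
    """Summarize whether the answer abstains and whether its citations are valid."""
    cited = cited_chunks(answer)
    valid = set(range(1, n_context_chunks + 1))
    return {
        "abstained": abstain_text.lower() in answer.lower(),
        "cited": sorted(cited),
        "invalid": sorted(cited - valid),       # cites a chunk that was not provided
        "has_valid_citation": bool(cited & valid),
    }

print(check_citations("RAG grounds answers in sources [Chunk 1], [Chunk 4].", 3))
print(check_citations("Not enough evidence.", 3))
```

A check like this verifies only the form of the citations, not whether the cited chunk really supports the claim; that still requires the rubric or human review.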


## 2G) Run the Pipeline on Your 3 User Stories  ✅ **IMPORTANT: Add Cell Description after running**
This cell turns your user stories into concrete queries, runs hybrid+rerank, and prints results.


In [None]:
# 2G) Run the Pipeline on Your 3 User Stories

import re

# ---- safety defaults ----
RERANK = True   # set False if you want to skip reranking

# ---- convert user story → query (optional helper; the explicit queries below are used instead) ----
def story_to_query(story_text):
    m = re.search(r"I want to (.+?)(?: so that|\.|$)", story_text, flags=re.IGNORECASE)
    return m.group(1).strip() if m else story_text.strip()

# ---- explicit demo queries (recommended for grading) ----
queries = [
    ("U1_normal", "Why does retrieval-augmented generation improve trust in AI systems?"),
    ("U2_high_stakes", "What safeguards should a RAG system use for high-stakes decisions?"),
    ("U3_ambiguous_failure", "What is the legal penalty for hallucinations in AI systems?"),
]

# ---- run full pipeline ----
def run_pipeline(query, alpha=ALPHA, k=10, do_rerank=RERANK):
    # hybrid retrieval
    base = hybrid_search(query, alpha=alpha, k_out=k)

    # governance reranking
    ranked = rerank(query, base) if do_rerank else base

    # take top chunks
    top5 = ranked[:5]

    # grounded answer (with abstention)
    answer, context = rag_answer(query, top5[:3])

    return top5, answer, context

# ---- execute for each user story ----
results = {}

for key, q in queries:
    top5, ans, ctx = run_pipeline(q)
    results[key] = {
        "query": q,
        "top5": top5,
        "answer": ans,
        "context": ctx
    }

# ---- print results ----
for key in results:
    print("\n==============================")
    print("User Story:", key)
    print("Query:", results[key]["query"])
    print("Top retrieved chunk IDs:")
    print([c["chunk_id"] for c, _ in results[key]["top5"][:3]])
    print("\nAnswer preview:\n")
    print(results[key]["answer"][:600])
    print("\n------------------------------")



User Story: U1_normal
Query: Why does retrieval-augmented generation improve trust in AI systems?
Top retrieved chunk IDs:
['rag_trust_overview.txt::c0', 'rag_trust_overview.txt::c1']

Answer preview:

Evidence summary (fallback mode):
- [Chunk 1] evidence used
- [Chunk 2] evidence used

Enable USE_LLM=True to generate a grounded answer.

------------------------------

User Story: U2_high_stakes
Query: What safeguards should a RAG system use for high-stakes decisions?
Top retrieved chunk IDs:
['rag_trust_overview.txt::c1', 'rag_trust_overview.txt::c0']

Answer preview:

Evidence summary (fallback mode):
- [Chunk 1] evidence used
- [Chunk 2] evidence used

Enable USE_LLM=True to generate a grounded answer.

------------------------------

User Story: U3_ambiguous_failure
Query: What is the legal penalty for hallucinations in AI systems?
Top retrieved chunk IDs:
['rag_trust_overview.txt::c0', 'rag_trust_overview.txt::c1']

Answer preview:

Evidence summary (fallback mode):
- [Chunk 1] 

### ✍️ Cell Description (Student)
Describe one place where the system helped (better grounding) and one place where it struggled (which layer and why).


## 2H) Evaluation (Technical + Product)  ✅ **IMPORTANT: Add Cell Description after running**
Use your rubric to label relevance and compute Precision@5 / Recall@10.
Also assign product scores: Trust (1–5) and Decision Confidence (1–5).


In [None]:
def precision_at_k(relevant_flags, k=5):
    rel = relevant_flags[:k]
    return sum(rel) / max(1, len(rel))

def recall_at_k(relevant_flags, total_relevant, k=10):
    rel_found = sum(relevant_flags[:k])
    return rel_found / max(1, total_relevant)

def make_flags(relevant_indices, n=10):
    flags = [0]*n
    for r in relevant_indices:
        if 1 <= r <= n:
            flags[r-1] = 1
    return flags

evaluation = {}

for key in results:
    print("\n---", key, "---")
    print("Query:", results[key]["query"])

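    # Note: only the top 5 reranked chunks are stored, so `top10` holds at most 5 items;
    # relevant ranks labeled below should not exceed the number of chunks printed.
    top10 = results[key]["top5"][:]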

    for i, (c, s) in enumerate(top10, start=1):
        preview = c["text"].replace("\n", " ")[:120]
        print(i, c["chunk_id"], "| score:", round(s, 3), "|", preview, "...")

    if key == "U1_normal":
        relevant_ranks = [1, 2]
        total_relevant_est = 2
        trust = 4
        confidence = 4
    elif key == "U2_high_stakes":
        relevant_ranks = [1, 2, 3]
        total_relevant_est = 3
        trust = 4
        confidence = 4
    else:
        relevant_ranks = []
        total_relevant_est = 0
        trust = 5
        confidence = 2

    flags_top10 = make_flags(relevant_ranks, n=10)

    p5 = precision_at_k(flags_top10, k=5)
    r10 = recall_at_k(flags_top10, total_relevant_est, k=10) if total_relevant_est > 0 else None

    evaluation[key] = {
        "relevant_flags_top10": flags_top10,
        "total_relevant_chunks_estimate": total_relevant_est,
        "precision_at_5": p5,
        "recall_at_10": r10,
        "trust_score_1to5": trust,
        "confidence_score_1to5": confidence,
    }

    print("Precision@5:", p5)
    print("Recall@10:", r10)
    print("Trust:", trust, "| Confidence:", confidence)

evaluation



--- U1_normal ---
Query: Why does retrieval-augmented generation improve trust in AI systems?
1 rag_trust_overview.txt::c0 | score: 7.982 | Retrieval-Augmented Generation (RAG) is a system design pattern used to improve the reliability of large language models ...
2 rag_trust_overview.txt::c1 | score: -2.256 | A well-designed RAG system should include safeguards such as source citations, confidence thresholds, and refusal behavi ...
Precision@5: 0.4
Recall@10: 1.0
Trust: 4 | Confidence: 4

--- U2_high_stakes ---
Query: What safeguards should a RAG system use for high-stakes decisions?
1 rag_trust_overview.txt::c1 | score: 4.535 | A well-designed RAG system should include safeguards such as source citations, confidence thresholds, and refusal behavi ...
2 rag_trust_overview.txt::c0 | score: 2.462 | Retrieval-Augmented Generation (RAG) is a system design pattern used to improve the reliability of large language models ...
Precision@5: 0.6
Recall@10: 1.0
Trust: 4 | Confidence: 4

--- U3_

{'U1_normal': {'relevant_flags_top10': [1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
  'total_relevant_chunks_estimate': 2,
  'precision_at_5': 0.4,
  'recall_at_10': 1.0,
  'trust_score_1to5': 4,
  'confidence_score_1to5': 4},
 'U2_high_stakes': {'relevant_flags_top10': [1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
  'total_relevant_chunks_estimate': 3,
  'precision_at_5': 0.6,
  'recall_at_10': 1.0,
  'trust_score_1to5': 4,
  'confidence_score_1to5': 4},
 'U3_ambiguous_failure': {'relevant_flags_top10': [0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0],
  'total_relevant_chunks_estimate': 0,
  'precision_at_5': 0.0,
  'recall_at_10': None,
  'trust_score_1to5': 5,
  'confidence_score_1to5': 2}}

### ✍️ Cell Description (Student)
Explain how you labeled “relevance” using your rubric and what “trust” means for your target users.
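
When the index holds fewer chunks than k, the two common ways of computing Precision@k give different numbers, which is worth stating in your explanation. The sketch below shows both under stated assumptions: `strict=True` divides by k (so two relevant chunks out of two retrieved yields 0.4 at k=5), while `strict=False` divides by the number of results actually retrieved. The `strict` parameter is a hypothetical addition and is not in the lab's evaluation cell.

```python
def precision_at_k(flags, k=5, strict=True):
    """flags: 0/1 relevance labels for retrieved results, in rank order.

    strict=True  -> divide by k (missing ranks count as misses)
    strict=False -> divide by the number of results retrieved, up to k
    """
    top = flags[:k]
    denom = k if strict else max(1, len(top))
    return sum(top) / denom

def recall_at_k(flags, total_relevant, k=10):
    """Fraction of all relevant chunks found in the top k; None if undefined."""
    if total_relevant <= 0:
        return None
    return sum(flags[:k]) / total_relevant

flags = [1, 1]  # two chunks retrieved, both labeled relevant

print("P@5 (divide by k):        ", precision_at_k(flags, k=5, strict=True))
print("P@5 (divide by retrieved):", precision_at_k(flags, k=5, strict=False))
print("R@10:", recall_at_k(flags, total_relevant=2, k=10))
```

Whichever convention is used, the relevance flags should cover only ranks that were actually retrieved, and the README results table should say which convention was applied.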


## 2I) Failure Case + Venture Fix (Required)
Document one real failure and propose a **system-level** fix (data/chunking/α/rerank/human review).


In [None]:
failure_case = {
  "which_user_story": "U3_ambiguous_failure",
  "what_failed": (
      "The system retrieved semantically similar chunks about RAG trust, but none contained "
      "direct evidence answering the question. Without strong abstention, the generator could "
      "still produce a speculative answer that looks confident."
  ),
  "which_layer_failed": "Retrieval + Generation",
  "real_world_consequence": (
      "Users may believe an unsupported claim, which reduces trust and can lead to incorrect "
      "decisions in research or policy contexts (especially when the question is high-stakes)."
  ),
  "proposed_system_fix": (
      "Add an evidence sufficiency gate before generation: require a minimum similarity/rerank "
      "score and at least 2 corroborating chunks. If thresholds are not met, return 'Not enough "
      "evidence.' Additionally, tune α toward keyword-heavy for compliance-style queries and "
      "route high-risk queries to human review when the system abstains or confidence is low."
  ),
}

failure_case


{'which_user_story': 'U3_ambiguous_failure',
 'what_failed': 'The system retrieved semantically similar chunks about RAG trust, but none contained direct evidence answering the question. Without strong abstention, the generator could still produce a speculative answer that looks confident.',
 'which_layer_failed': 'Retrieval + Generation',
 'real_world_consequence': 'Users may believe an unsupported claim, which reduces trust and can lead to incorrect decisions in research or policy contexts (especially when the question is high-stakes).',
 'proposed_system_fix': "Add an evidence sufficiency gate before generation: require a minimum similarity/rerank score and at least 2 corroborating chunks. If thresholds are not met, return 'Not enough evidence.' Additionally, tune α toward keyword-heavy for compliance-style queries and route high-risk queries to human review when the system abstains or confidence is low."}
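
The evidence sufficiency gate proposed above could be sketched as follows. This is a minimal illustration, not the lab's implementation: `evidence_gate` and `answer_or_abstain` are hypothetical names, the `generate` callable stands in for whatever generation function the pipeline uses, and the default thresholds are placeholders that would need calibration on labeled queries (cross-encoder scores in particular are not on a fixed scale).

```python
def evidence_gate(scored_chunks, min_score=0.0, min_chunks=2):
    """Return (passes, strong_chunks) for a best-first list of (chunk, score)."""
    strong = [(chunk, score) for chunk, score in scored_chunks if score >= min_score]
    return len(strong) >= min_chunks, strong

def answer_or_abstain(query, scored_chunks, generate, min_score=0.0, min_chunks=2):
    """Call `generate` only when enough chunks clear the score threshold."""
    passes, strong = evidence_gate(scored_chunks, min_score, min_chunks)
    if not passes:
        return "Not enough evidence."
    return generate(query, strong)

# Hypothetical usage with a stand-in generator and made-up scores.
def fake_generate(query, chunks):
    return f"Answer grounded in {len(chunks)} chunks."

weak = [({"chunk_id": "a"}, 1.2), ({"chunk_id": "b"}, -3.0)]
good = [({"chunk_id": "a"}, 4.5), ({"chunk_id": "b"}, 2.4)]

print(answer_or_abstain("q", weak, fake_generate))
print(answer_or_abstain("q", good, fake_generate))
```

Placing the gate before generation means abstention no longer depends on the language model following the prompt instruction, which is the failure mode described for U3.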

## 2J) README Template (Copy into GitHub README.md)

```md
# Week 2 Hands-On — Applied RAG Product Results (CS 5588)

## Product Overview
- Product name:
- Target users:
- Core problem:
- Why RAG:

## Dataset Reality
- Source / owner:
- Sensitivity:
- Document types:
- Expected scale in production:

## User Stories + Rubric
- U1:
- U2:
- U3:
(Rubric: acceptable evidence + correct answer criteria)

## System Architecture
- Chunking:
- Keyword retrieval:
- Vector retrieval:
- Hybrid α:
- Reranking governance:
- LLM / generation option:

## Results
| User Story | Method | Precision@5 | Recall@10 | Trust (1–5) | Confidence (1–5) |
|---|---|---:|---:|---:|---:|

## Failure + Fix
- Failure:
- Layer:
- Consequence:
- Safeguard / next fix:

## Evidence of Grounding
Paste one RAG answer with citations: [Chunk 1], [Chunk 2]
```
