# CS 5588 — Week 2 Hands-On: Applied RAG for Product & Venture Development (Two-Step)
**Initiation (20 min, Jan 27)** → **Completion (60 min, Jan 29)**

**Submission:** Survey + GitHub  
**Due:** **Jan 29 (Thu), end of class**

## New Requirement (Important)
For **full credit (2% individual)** you must:
1) Use **your own project-aligned dataset** (not only benchmark)  
2) Add **your own explanations** for key steps

### ✅ “Cell Description” rule (same style as CS 5542)
After each **IMPORTANT** code cell, add a short Markdown **Cell Description** (2–5 sentences):
- What the cell does
- Why it matters for a **product-grade** RAG system
- Any design choices (chunk size, α, reranker, etc.)

> Treat these descriptions as **mini system documentation** (engineering + product thinking).


## Project Dataset Guide (Required for Full Credit)

### Minimum requirements
- **5–25 documents** (start small; scale later)
- Prefer **plain text** documents (`.txt`)
- Put files in a folder named: `project_data/`

### Recommended dataset types (choose one)
- Policies / guidelines / compliance docs
- Technical docs / manuals / SOPs
- Customer support FAQs / tickets (de-identified)
- Research notes / literature summaries
- Domain corpus (healthcare, cybersecurity, business, etc.)

> Benchmarks are optional, but **cannot** earn full credit by themselves.


## 0) One-Click Setup + Import Check  ✅ **IMPORTANT: Add Cell Description after running**
If you are in **Google Colab**, run the install cell below, then **Runtime → Restart session** if imports fail.


In [5]:
# CS 5588 Lab 2 — One-click dependency install (Colab)
!pip -q install -U sentence-transformers chromadb faiss-cpu scikit-learn rank-bm25 transformers accelerate

import sys, platform
print("Python:", sys.version)
print("Platform:", platform.platform())
print("✅ If imports fail later: Runtime → Restart session and run again.")



### ✍️ Cell Description (Student)
Write 2–5 sentences explaining what the setup cell does and why restarting the runtime sometimes matters after pip installs.


This cell installs every library the RAG system needs in one step. Doing this upfront avoids version conflicts and makes the notebook easy for others to run without extra setup. Printing the Python and platform info helps quickly spot environment-related issues if something breaks. The restart note matters because a Colab session can keep an older, already-imported version of a package in memory after pip upgrades it; restarting the runtime loads the newly installed version.
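
The import check named in this section's title can be sketched as below. This is a minimal, standard-library-only sketch, not part of the provided lab code: the `check_imports` helper and the `REQUIRED_MODULES` list are illustrative names, and the import names are assumed from the pip packages in the install cell.

```python
import importlib.util

# Import names can differ from pip names
# (pip "faiss-cpu" -> import "faiss", pip "rank-bm25" -> import "rank_bm25").
REQUIRED_MODULES = [
    "sentence_transformers", "chromadb", "faiss",
    "sklearn", "rank_bm25", "transformers", "accelerate",
]

def check_imports(module_names):
    """Return {module_name: bool} by looking each module up without importing it."""
    return {name: importlib.util.find_spec(name) is not None for name in module_names}

status = check_imports(REQUIRED_MODULES)
for name, ok in status.items():
    print(("OK      " if ok else "MISSING ") + name)
```

If any module is reported as missing right after the install cell, restarting the runtime and rerunning is the first thing to try.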

# STEP 1 — INITIATION (Jan 27, 20 minutes)
**Goal:** Define the **product**, **users**, **dataset reality**, and **trust risks**.

> This is a **product milestone**, not a coding demo.


## 1A) Product Framing (Required)  ✅ **IMPORTANT: Add Cell Description after running**
Fill in the template below like a founder/product lead.


In [6]:
product = {
  "product_name": "AI-Powered Weather & Climate Intelligence System for Personalized Decision Support",
  "target_users": "General users, travelers, researchers, and analysts who need actionable weather insights for travel, daily activities, or climate research.",
  "core_problem": "Existing apps mainly focus on forecasts and alerts but lack contextual interpretation, personalization, and explainable insights.",
  "why_rag_not_chatbot": "RAG enables the system to retrieve real-time weather data, historical climate records, and severe weather alerts that are accurate and grounded. A standard chatbot relies on pre-trained knowledge and could hallucinate or provide outdated weather information.",
  "failure_harms_who_and_how": "Incorrect forecasts, misinterpreted AI advice, or missed severe weather alerts could mislead users, causing travel disruptions, safety risks, or property damage. Poor explainability could also reduce trust in the system.",
}
product


{'product_name': 'AI-Powered Weather & Climate Intelligence System for Personalized Decision Support',
 'target_users': 'General users, travelers, researchers, and analysts who need actionable weather insights for travel, daily activities, or climate research.',
 'core_problem': 'Existing apps mainly focus on forecasts and alerts but lack contextual interpretation, personalization, and explainable insights.',
 'why_rag_not_chatbot': 'RAG enables the system to retrieve real-time weather data, historical climate records, and severe weather alerts that are accurate and grounded. A standard chatbot relies on pre-trained knowledge and could hallucinate or provide outdated weather information.',
 'failure_harms_who_and_how': 'Incorrect forecasts, misinterpreted AI advice, or missed severe weather alerts could mislead users, causing travel disruptions, safety risks, or property damage. Poor explainability could also reduce trust in the system.'}

### ✍️ Cell Description (Student)
Explain your product in 3–5 sentences: who the user is, what pain point exists today, and why grounded RAG helps.


This AI-powered weather system helps users (travelers, students, or anyone planning daily activities) get clear and useful weather information. Most apps only show forecasts and alerts, leaving users to figure out what they mean. Our system combines real-time weather, historical trends, and severe weather alerts to give personalized, easy-to-understand guidance. Using RAG, the AI bases its answers on retrieved data, so users get recommendations they can trace back to a source. This helps people plan trips and daily activities and stay safe during extreme weather.

## 1B) Dataset Reality Plan (Required)  ✅ **IMPORTANT: Add Cell Description after running**
Describe where your data comes from **in the real world**.


In [7]:
dataset_plan = {
  "data_owner": "Public agencies and APIs (NOAA, NASA, city open data portals, OpenWeather API)",              # company / agency / public / internal team
  "data_sensitivity": "Public, no confidential data involved",        # public / internal / regulated / confidential
  "document_types": "Real-time weather readings, historical climate data, severe weather alerts, geospatial data",          # policies, manuals, reports, research, etc.
  "expected_scale_in_production": "Millions of data points daily from multiple cities; historical datasets spanning years for trend analysis",  # e.g., 200 docs, 10k docs, etc.
  "data_reality_check_paragraph": "The system gets real-time weather data from APIs like OpenWeather and city open data portals, while historical climate data comes from NOAA and NASA. The data comes in different formats (such as JSON, CSV, or NetCDF), so it needs to be cleaned and standardized before use. Real-time APIs can have rate limits or occasional downtime, and historical datasets may have missing or inconsistent records. By handling these challenges carefully, the system can provide accurate forecasts, trend analysis, and personalized weather insights to users.",
}
dataset_plan


{'data_owner': 'Public agencies and APIs (NOAA, NASA, city open data portals, OpenWeather API)',
 'data_sensitivity': 'Public, no confidential data involved',
 'document_types': 'Real-time weather readings, historical climate data, severe weather alerts, geospatial data',
 'expected_scale_in_production': 'Millions of data points daily from multiple cities; historical datasets spanning years for trend analysis',
 'data_reality_check_paragraph': 'The system gets real-time weather data from APIs like OpenWeather and city open data portals, while historical climate data comes from NOAA and NASA. The data comes in different formats (such as JSON, CSV, or NetCDF), so it needs to be cleaned and standardized before use. Real-time APIs can have rate limits or occasional downtime, and historical datasets may have missing or inconsistent records. By handling these challenges carefully, the system can provide accurate forecasts, trend analysis, and personalized weather insights to users.'}

### ✍️ Cell Description (Student)
Write 2–5 sentences describing where this data would come from in a real deployment and any privacy/regulatory constraints.


In a real deployment, our system would collect real-time weather from services like OpenWeather and city open data portals, while historical climate records come from NOAA and NASA. The data comes in different formats and sometimes has missing or inconsistent entries, so we clean and standardize it before use. All the sources are public, so privacy isn’t an issue, but we still need to handle large volumes carefully to make sure users always get accurate and reliable weather insights.
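
The "clean and standardize" step can be sketched as a small converter that turns one structured reading into a plain-text line suitable for a `.txt` document in `project_data/`. This is only an illustration: the field names (`station`, `date`, `temp_c`, `precip_mm`, `wind_kph`) and the sample record are made up for the example, and real API payloads will look different.

```python
import json

def reading_to_text(reading):
    """Render one weather reading (a dict) as a plain-text sentence."""
    # Hypothetical field names, chosen for illustration only.
    station = reading.get("station", "unknown station")
    date = reading.get("date", "unknown date")
    parts = []
    for key, label, unit in [("temp_c", "temperature", "C"),
                             ("precip_mm", "precipitation", "mm"),
                             ("wind_kph", "wind speed", "km/h")]:
        value = reading.get(key)
        # Missing values are stated explicitly instead of being silently dropped,
        # so the retriever never sees a misleading "complete" record.
        parts.append(f"{label} {value} {unit}" if value is not None else f"{label} not reported")
    return f"{station} on {date}: " + ", ".join(parts) + "."

# Made-up sample record (not taken from any real source).
raw = '{"station": "Station_1", "date": "2024-01-15", "temp_c": -2.5, "precip_mm": null}'
line = reading_to_text(json.loads(raw))
print(line)
```

Writing "not reported" for gaps keeps missing or inconsistent entries visible to the answer-generation step.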

## 1C) User Stories + Mini Rubric (Required)  ✅ **IMPORTANT: Add Cell Description after running**
Define **3 user stories** (U1 normal, U2 high-stakes, U3 ambiguous/failure) + rubric for evidence and correctness.


In [8]:
user_stories = {
  "U1_normal": {
    "user_story": "As a general user, I want to check the current weather and hourly forecast for my city so that I can plan my day effectively.",
    "acceptable_evidence": ["The dashboard shows real-time weather and hourly forecasts for the selected city.",
        "User can see temperature, precipitation, wind, and humidity for their location."],
    "correct_answer_must_include": ["Real-time weather data is displayed accurately", "Forecast information is easily readable and understandable"],
  },
  "U2_high_stakes": {
    "user_story": "As a traveler, I want to receive alerts for severe weather events so that I can avoid dangerous conditions and plan my trips safely.",
    "acceptable_evidence":[ "System sends notifications or shows alerts for floods, storms, snow, or hurricanes.",
        "User can view details of the alert and recommended precautions."
    ],
    "correct_answer_must_include": ["Severe weather alerts are timely and accurate",
        "AI provides actionable advice based on the alerts"],
  },
  "U3_ambiguous_failure": {
    "user_story": "As a researcher, I want to compare current weather with historical trends so that I can understand anomalies or patterns over time.",
    "acceptable_evidence": ["Dashboard visualizes historical climate trends alongside current data.",
        "User can analyze deviations from typical patterns for their city or region."],
    "correct_answer_must_include": ["Historical climate data is correctly integrated and visualized",
        "Anomalies or patterns are highlighted clearly"],
  },
}
user_stories


{'U1_normal': {'user_story': 'As a general user, I want to check the current weather and hourly forecast for my city so that I can plan my day effectively.',
  'acceptable_evidence': ['The dashboard shows real-time weather and hourly forecasts for the selected city.',
   'User can see temperature, precipitation, wind, and humidity for their location.'],
  'correct_answer_must_include': ['Real-time weather data is displayed accurately',
   'Forecast information is easily readable and understandable']},
 'U2_high_stakes': {'user_story': 'As a traveler, I want to receive alerts for severe weather events so that I can avoid dangerous conditions and plan my trips safely.',
  'acceptable_evidence': ['System sends notifications or shows alerts for floods, storms, snow, or hurricanes.',
   'User can view details of the alert and recommended precautions.'],
  'correct_answer_must_include': ['Severe weather alerts are timely and accurate',
   'AI provides actionable advice based on the al

### ✍️ Cell Description (Student)
Explain why U2 is “high-stakes” and what the system must do to avoid harm (abstain, cite evidence, etc.).


U2 is considered “high-stakes” because it affects users' safety: if the system misses a severe weather alert, users could be caught in dangerous situations like floods, storms, or snowstorms. To prevent harm, the system must use trusted sources like NOAA, NASA, and city open data portals, show clearly where the information comes from, and explain the alert in simple terms. If the data is uncertain or missing, the system should warn users and avoid giving misleading advice, so they can make safe, informed decisions.


## 1D) Trust & Risk Table (Required)
Fill at least **3 rows**. These risks should match your product and user stories.


In [9]:
risk_table = [
  {
    "risk": "Hallucination",
    "example_failure": "The AI says there will be a snowstorm in New York tomorrow, but no such storm is predicted in any data source.",
    "real_world_consequence": "Users might cancel plans unnecessarily or panic, reducing trust in the system.",
    "safeguard_idea": "Force citations from trusted sources (NOAA, NASA, city portals) and abstain if data is insufficient."
  },
  {
    "risk": "Omission",
    "example_failure": "The system fails to alert users about an approaching hurricane because the data wasn’t retrieved properly.",
    "real_world_consequence": "Users could be caught in dangerous conditions, risking life, health, and property.",
    "safeguard_idea": "Use recall tuning and hybrid retrieval from multiple sources to ensure alerts are not missed."
  },
  {
    "risk": "Bias/Misleading",
    "example_failure": "The AI highlights favorable weather for one city while underestimating nearby regions at risk of severe weather.",
    "real_world_consequence": "Users may make unsafe travel or activity decisions, lowering trust and increasing exposure to hazards.",
    "safeguard_idea": "Apply reranking rules, incorporate human review for critical alerts, and clearly communicate uncertainty."
  },
]

risk_table


[{'risk': 'Hallucination',
  'example_failure': 'The AI says there will be a snowstorm in New York tomorrow, but no such storm is predicted in any data source.',
  'real_world_consequence': 'Users might cancel plans unnecessarily or panic, reducing trust in the system.',
  'safeguard_idea': 'Force citations from trusted sources (NOAA, NASA, city portals) and abstain if data is insufficient.'},
 {'risk': 'Omission',
  'example_failure': 'The system fails to alert users about an approaching hurricane because the data wasn’t retrieved properly.',
  'real_world_consequence': 'Users could be caught in dangerous conditions, risking life, health, and property.',
  'safeguard_idea': 'Use recall tuning and hybrid retrieval from multiple sources to ensure alerts are not missed.'},
 {'risk': 'Bias/Misleading',
  'example_failure': 'The AI highlights favorable weather for one city while underestimating nearby regions at risk of severe weather.',
  'real_world_consequence': 'Users may make unsafe t

✅ **Step 1 Checkpoint (End of Jan 27)**
Commit (or submit) your filled templates:
- `product`, `dataset_plan`, `user_stories`, `risk_table`


# STEP 2 — COMPLETION (Jan 29, 60 minutes)
**Goal:** Build a working **product-grade** RAG pipeline:
Chunking → Keyword + Vector Retrieval → Hybrid α → Governance Rerank → Grounded Answer → Evaluation


## 2A) Project Dataset Setup (Required for Full Credit)  ✅ **IMPORTANT: Add Cell Description after running**

### Colab Upload Tips
- Left sidebar → **Files** → Upload `.txt`
- Place them into `project_data/`

This cell creates the folder and shows how many files were found.


In [32]:
import os, glob, shutil
from pathlib import Path

PROJECT_FOLDER = "project_data"
os.makedirs(PROJECT_FOLDER, exist_ok=True)

# (Optional helper) Move any .txt in current directory into project_data/
moved = 0
for fp in glob.glob("*.txt"):
    shutil.move(fp, os.path.join(PROJECT_FOLDER, os.path.basename(fp)))
    moved += 1

files = sorted(glob.glob(os.path.join(PROJECT_FOLDER, "*.txt")))
print("✅ project_data/ ready | moved:", moved, "| files:", len(files))
print("Example files:", files[:5])


✅ project_data/ ready | moved: 1 | files: 10
Example files: ['project_data/Station_10_weather.txt', 'project_data/Station_1_weather.txt', 'project_data/Station_2_weather.txt', 'project_data/Station_3_weather.txt', 'project_data/Station_4_weather.txt']


In [33]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### ✍️ Cell Description (Student)
List what dataset you used, how many docs, and why they reflect your product scenario (not just a toy example).


Files used: ghcnd-stations.txt and weather.txt

Number of documents: 2 text files (CSV/structured text format)

Reason for use: These datasets reflect real-world weather data (station metadata and historical weather records), which aligns with a product scenario like climate analysis, forecasting, or environmental monitoring. They are more than toy examples because they contain structured information that can be programmatically queried, filtered, and analyzed to support meaningful insights or applications.

## 2B) Load Documents + Build Chunks  ✅ **IMPORTANT: Add Cell Description after running**
This milestone cell loads `.txt` documents and produces chunks using either **fixed** or **semantic** chunking.


In [34]:
import re

def load_project_docs(folder="project_data", max_docs=25):
    paths = sorted(Path(folder).glob("*.txt"))[:max_docs]
    docs = []
    for p in paths:
        txt = p.read_text(encoding="utf-8", errors="ignore").strip()
        if txt:
            docs.append({"doc_id": p.name, "text": txt})
    return docs

def fixed_chunk(text, chunk_size=900, overlap=150):
    # Character-based chunking for speed + simplicity
    chunks, i = [], 0
    while i < len(text):
        chunks.append(text[i:i+chunk_size])
        i += (chunk_size - overlap)
    return [c.strip() for c in chunks if c.strip()]

def semantic_chunk(text, max_chars=1000):
    # Paragraph-based packing
    paras = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks, cur = [], ""
    for p in paras:
        if len(cur) + len(p) + 2 <= max_chars:
            cur = (cur + "\n\n" + p).strip()
        else:
            if cur: chunks.append(cur)
            cur = p
    if cur: chunks.append(cur)
    return chunks

# ---- Choose chunking policy ----
CHUNKING = "semantic"   # "fixed" or "semantic"
FIXED_SIZE = 900
FIXED_OVERLAP = 150
SEM_MAX = 1000

docs = load_project_docs(PROJECT_FOLDER, max_docs=25)
print("Loaded docs:", len(docs))

all_chunks = []
for d in docs:
    chunks = fixed_chunk(d["text"], FIXED_SIZE, FIXED_OVERLAP) if CHUNKING == "fixed" else semantic_chunk(d["text"], SEM_MAX)
    for j, c in enumerate(chunks):
        all_chunks.append({"chunk_id": f'{d["doc_id"]}::c{j}', "doc_id": d["doc_id"], "text": c})

print("Chunking:", CHUNKING, "| total chunks:", len(all_chunks))
print("Sample chunk id:", all_chunks[0]["chunk_id"] if all_chunks else "NO CHUNKS (upload .txt files first)")


Loaded docs: 10
Chunking: semantic | total chunks: 20
Sample chunk id: Station_10_weather.txt::c0


### ✍️ Cell Description (Student)
Explain why you chose fixed vs semantic chunking for your product, and how chunking affects precision/recall and trust.


This cell loads all .txt files from the project_data/ folder (up to 25 files) and prepares them for analysis by splitting each document into smaller, manageable pieces called chunks. Each document is stored with a doc_id and its text content, while each chunk keeps track of which document it came from and is assigned a unique chunk_id. The cell supports two chunking strategies: fixed chunking, which breaks text into equal-sized character blocks with some overlap to preserve context, and semantic chunking, which groups paragraphs together up to a maximum length to retain logical meaning. This process ensures that even long documents can be efficiently processed, searched, or embedded for AI-driven tasks, while maintaining context and structure. After running this cell, the documents are fully loaded and converted into chunks, ready for downstream analysis or retrieval.
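
To make the fixed-chunking arithmetic concrete, the sketch below lists the start offsets that a character-based chunker visits for the default `chunk_size=900` and `overlap=150` (a stride of 750). `chunk_starts` is a hypothetical helper written for this illustration; it mirrors the loop in `fixed_chunk` and adds a guard that the original does not have.

```python
def chunk_starts(text_len, chunk_size=900, overlap=150):
    """Start offsets visited by a fixed-size chunker with the given overlap."""
    # Guard added for illustration: overlap >= chunk_size would give a
    # zero or negative stride, and the chunking loop would never advance.
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    stride = chunk_size - overlap
    return list(range(0, text_len, stride))

starts = chunk_starts(2000)
print(starts)  # [0, 750, 1500]
```

A 2,000-character document therefore yields three chunks, and each neighbouring pair shares 150 characters. A larger overlap raises the chance that a fact split across a boundary appears whole in some chunk, at the cost of more chunks to index.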

## 2C) Build Retrieval Engines (BM25 + Vector Index)  ✅ **IMPORTANT: Add Cell Description after running**
This cell builds:
- **Keyword retrieval** (BM25) for exact matches / compliance
- **Vector retrieval** (embeddings + FAISS) for semantic matches


In [35]:
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer
import faiss

# ----- Keyword (BM25) -----
tokenized = [c["text"].lower().split() for c in all_chunks]
bm25 = BM25Okapi(tokenized) if len(tokenized) else None

def keyword_search(query, k=10):
    if bm25 is None:
        return []
    scores = bm25.get_scores(query.lower().split())
    idx = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return [(all_chunks[i], float(scores[i])) for i in idx]

# ----- Vector (Embeddings + FAISS) -----
EMB_MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"
embedder = SentenceTransformer(EMB_MODEL_NAME)

chunk_texts = [c["text"] for c in all_chunks]
if len(chunk_texts) > 0:
    emb = embedder.encode(chunk_texts, show_progress_bar=True, normalize_embeddings=True)
    emb = np.asarray(emb, dtype="float32")

    index = faiss.IndexFlatIP(emb.shape[1])
    index.add(emb)

    def vector_search(query, k=10):
        q = embedder.encode([query], normalize_embeddings=True).astype("float32")
        scores, idx = index.search(q, k)
        # FAISS pads results with index -1 when fewer than k vectors exist; skip those slots
        out = [(all_chunks[int(i)], float(s)) for s, i in zip(scores[0], idx[0]) if i >= 0]
        return out
    print("✅ Vector index built | chunks:", len(all_chunks), "| dim:", emb.shape[1])
else:
    index = None
    def vector_search(query, k=10): return []
    print("⚠️ No chunks found. Upload .txt files to project_data/ and rerun.")



✅ Vector index built | chunks: 20 | dim: 384


### ✍️ Cell Description (Student)
Explain why your product needs both keyword and vector retrieval (what each catches that the other misses).


This cell builds two types of search engines for our weather AI system. Keyword search (BM25) helps find exact terms, like a specific station name, date, or weather metric, which is important for precise queries or compliance checks. Vector search (embeddings + FAISS) understands the meaning behind the text, so it can retrieve relevant information even when the wording differs—for example, finding “rainfall trends” even if the text says “precipitation patterns.” Combining both ensures users can quickly get accurate weather data or insights, whether they know the exact terms or are exploring broader patterns.
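
One detail that affects the exact-match claim: the BM25 cell tokenizes with `.lower().split()`, which leaves punctuation attached to words, so `rainfall:` and `rainfall` count as different terms. The sketch below compares that with a regex tokenizer. `TOKEN_RE` and `tokenize` are illustrative names and one possible alternative, not part of the lab code.

```python
import re

# Runs of letters/digits, optionally followed by a decimal part (e.g. "12.5").
TOKEN_RE = re.compile(r"[a-z0-9]+(?:\.[0-9]+)?")

def tokenize(text):
    """Lowercase the text and return alphanumeric tokens without punctuation."""
    return TOKEN_RE.findall(text.lower())

sample = "Rainfall: 12.5 mm, wind gusts!"
whitespace_tokens = sample.lower().split()
regex_tokens = tokenize(sample)
print(whitespace_tokens)  # ['rainfall:', '12.5', 'mm,', 'wind', 'gusts!']
print(regex_tokens)       # ['rainfall', '12.5', 'mm', 'wind', 'gusts']
```

If a tokenizer like this is adopted, it has to be applied to both the chunks and the query so that the two sides produce the same terms.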

## 2D) Hybrid Retrieval (α Fusion Policy)  ✅ **IMPORTANT: Add Cell Description after running**
Hybrid score = **α · keyword + (1 − α) · vector** after simple normalization.

Try α ∈ {0.2, 0.5, 0.8} and justify your choice.


In [36]:
def minmax_norm(pairs):
    scores = np.array([s for _, s in pairs], dtype="float32") if pairs else np.array([], dtype="float32")
    if len(scores) == 0:
        return []
    mn, mx = float(scores.min()), float(scores.max())
    if mx - mn < 1e-8:
        return [(c, 1.0) for c, _ in pairs]
    return [(c, float((s - mn) / (mx - mn))) for (c, s) in pairs]

def hybrid_search(query, k_kw=10, k_vec=10, alpha=0.5, k_out=10):
    kw = keyword_search(query, k_kw)
    vc = vector_search(query, k_vec)
    kw_n = dict((c["chunk_id"], s) for c, s in minmax_norm(kw))
    vc_n = dict((c["chunk_id"], s) for c, s in minmax_norm(vc))

    ids = set(kw_n) | set(vc_n)
    fused = []
    for cid in ids:
        s = alpha * kw_n.get(cid, 0.0) + (1 - alpha) * vc_n.get(cid, 0.0)
        chunk = next(c for c in all_chunks if c["chunk_id"] == cid)
        fused.append((chunk, float(s)))

    fused.sort(key=lambda x: x[1], reverse=True)
    return fused[:k_out]

ALPHA = 0.5  # try 0.2 / 0.5 / 0.8


### ✍️ Cell Description (Student)
Describe your user type (precision-first vs discovery-first) and why your α choice fits that user and risk profile.


This cell sets up a hybrid search for our Weather AI project, combining keyword search and semantic vector search. Some users want precise results, like checking a specific weather station or a historical reading, while others are exploring patterns, like rainfall or temperature trends. The α parameter balances the two: with α = 0.5, the system treats exact matches and semantic relevance equally. This way, our Weather AI can give both accurate data and meaningful insights, helping users find what they need whether they know the exact terms or are just exploring.
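
A small worked example shows how α changes the ranking. The scores below are made-up, already-normalized values for three hypothetical chunks (A is a strong keyword match, B a strong semantic match, C is moderate on both); `fuse` applies the same formula as `hybrid_search`, α · keyword + (1 − α) · vector.

```python
def fuse(kw_scores, vec_scores, alpha):
    """Rank chunk ids by alpha * keyword + (1 - alpha) * vector (missing scores count as 0)."""
    ids = set(kw_scores) | set(vec_scores)
    fused = {
        cid: alpha * kw_scores.get(cid, 0.0) + (1 - alpha) * vec_scores.get(cid, 0.0)
        for cid in ids
    }
    return sorted(fused, key=fused.get, reverse=True)

# Made-up normalized scores in [0, 1], for illustration only.
kw = {"A": 1.0, "B": 0.1, "C": 0.4}
vec = {"A": 0.2, "B": 1.0, "C": 0.6}

rankings = {alpha: fuse(kw, vec, alpha) for alpha in (0.2, 0.5, 0.8)}
for alpha, order in rankings.items():
    print(alpha, order)
# 0.2 ['B', 'C', 'A']
# 0.5 ['A', 'B', 'C']
# 0.8 ['A', 'C', 'B']
```

With α = 0.2 the semantic match B leads; with α = 0.8 the exact-term match A leads and B drops to last. Running the same comparison on real queries is one way to justify the chosen α.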

## 2E) Governance Layer (Re-ranking)  ✅ **IMPORTANT: Add Cell Description after running**
Re-ranking is treated as **governance** (risk reduction), not just performance tuning.


In [37]:
from sentence_transformers import CrossEncoder

RERANK = True
RERANK_MODEL = "cross-encoder/ms-marco-MiniLM-L-6-v2"
reranker = CrossEncoder(RERANK_MODEL) if RERANK else None

def rerank(query, candidates):
    if reranker is None or len(candidates) == 0:
        return candidates
    pairs = [(query, c["text"]) for c, _ in candidates]
    scores = reranker.predict(pairs)
    out = [(c, float(s)) for (c, _), s in zip(candidates, scores)]
    out.sort(key=lambda x: x[1], reverse=True)
    return out

print("✅ Reranker:", RERANK_MODEL if RERANK else "OFF")



✅ Reranker: cross-encoder/ms-marco-MiniLM-L-6-v2


### ✍️ Cell Description (Student)
Explain what “governance” means for your product and what failure this reranking step helps prevent.


In our Weather AI project, governance means making sure the system provides reliable, accurate, and trustworthy results. Even after hybrid search, some chunks may appear relevant by keyword or semantic similarity but still be misleading, outdated, or only partially relevant. This reranking step uses a CrossEncoder model to reassess and reorder the top candidates, prioritizing the most contextually appropriate chunks. It helps prevent mistakes like showing irrelevant or low-quality information first, ensuring that users see the most accurate and actionable weather data at the top.
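
Reranking only reorders candidates; it does not remove weak ones. One possible governance rule on top of it is a score floor, sketched below. `apply_score_floor`, the `min_score=0.0` threshold, and the sample scores are assumptions made for this illustration; a real threshold would need to be calibrated on the project's own queries.

```python
def apply_score_floor(ranked, min_score=0.0, max_keep=3):
    """Keep at most max_keep (chunk, score) pairs whose score is >= min_score.

    `ranked` is assumed to be sorted best-first, as returned by a reranker.
    """
    kept = [(chunk, score) for chunk, score in ranked if score >= min_score]
    return kept[:max_keep]

# Made-up reranker scores for three hypothetical chunks.
ranked = [
    ({"chunk_id": "doc_a::c0"}, 4.2),
    ({"chunk_id": "doc_b::c1"}, 0.3),
    ({"chunk_id": "doc_c::c0"}, -5.1),
]
kept = apply_score_floor(ranked)
print([c["chunk_id"] for c, _ in kept])  # ['doc_a::c0', 'doc_b::c1']
```

When nothing clears the floor the function returns an empty list, which the answer step can treat as a signal to abstain instead of generating from weak evidence.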

## 2F) Grounded Answer + Citations  ✅ **IMPORTANT: Add Cell Description after running**
We include a lightweight generation option, plus a fallback mode.

Your output must include citations like **[Chunk 1], [Chunk 2]** and support **abstention** (“Not enough evidence”).


In [38]:
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# --- Configuration ---
USE_LLM = True
GEN_MODEL = "google/flan-t5-base"

tokenizer = None
model = None
device = "cuda" if torch.cuda.is_available() else "cpu"

# --- Load model if enabled ---
if USE_LLM:
    tokenizer = AutoTokenizer.from_pretrained(GEN_MODEL)
    model = AutoModelForSeq2SeqLM.from_pretrained(GEN_MODEL).to(device)

# --- Helper: build context from top chunks ---
def build_context(top_chunks, max_chars=2500):
    ctx = ""
    for i, (c, _) in enumerate(top_chunks, start=1):
        block = f"[Chunk {i}] {c['text'].strip()}\n"
        if len(ctx) + len(block) > max_chars:
            break
        ctx += block + "\n"
    return ctx.strip()

# --- Helper: generate answer from prompt ---
def _generate(prompt, max_new_tokens=180):
    inputs = tokenizer(
        prompt, return_tensors="pt", truncation=True, max_length=2048
    ).to(device)
    with torch.no_grad():
        out_ids = model.generate(
            **inputs, max_new_tokens=max_new_tokens, do_sample=False
        )
    return tokenizer.decode(out_ids[0], skip_special_tokens=True)

# --- Main RAG answer function ---
def rag_answer(query, top_chunks):
    """
    Generate a grounded answer using top retrieved chunks.
    Returns: (answer_text, used_context)
    """
    ctx = build_context(top_chunks)

    if USE_LLM and model is not None and tokenizer is not None:
        prompt = (
            "Answer the question using ONLY the evidence below. "
            "If there is not enough evidence, say 'Not enough evidence.' "
            "Include citations like [Chunk 1], [Chunk 2].\n\n"
            f"Question: {query}\n\nEvidence:\n{ctx}\n\nAnswer:"
        )
        out = _generate(prompt, max_new_tokens=180)
        return out, ctx
    else:
        # fallback if model is not loaded
        answer = (
            "Evidence summary (fallback mode):\n"
            + "\n".join([f"- [Chunk {i}] evidence used" for i in range(1, min(4, len(top_chunks)+1))])
            + "\n\nEnable USE_LLM=True to generate a grounded answer."
        )
        return answer, ctx





### ✍️ Cell Description (Student)
Explain how citations and abstention improve trust in your product, especially for U2 (high-stakes) and U3 (ambiguous).


In our Weather AI project, citations and abstention are key to building trust. By citing the exact chunks used to answer a question (e.g., [Chunk 1], [Chunk 2]), users can verify the source of each piece of information, which is especially important for U2 users who rely on precise, high-stakes data like official weather readings. Abstention—saying “Not enough evidence” when the system can’t confidently answer—prevents misleading or incorrect responses, which is critical for U3 users exploring ambiguous trends or incomplete data. Together, these features make the system more transparent, accountable, and reliable.
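
The prompt asks the model for citations, but nothing in the pipeline checks that it complied. A minimal post-check, assuming the `[Chunk N]` citation format used in the prompt, could look like the sketch below; `check_citations` and `enforce_grounding` are hypothetical helpers, not part of the lab code.

```python
import re

ABSTAIN = "Not enough evidence."

def check_citations(answer, n_chunks):
    """Return (cited, valid): all cited chunk numbers, and those within 1..n_chunks."""
    cited = {int(n) for n in re.findall(r"\[Chunk (\d+)\]", answer)}
    valid = {n for n in cited if 1 <= n <= n_chunks}
    return cited, valid

def enforce_grounding(answer, n_chunks):
    """Pass the answer through only if it cites at least one chunk and all citations exist."""
    if answer.strip().lower().startswith("not enough evidence"):
        return ABSTAIN
    cited, valid = check_citations(answer, n_chunks)
    if not cited or cited != valid:
        return ABSTAIN
    return answer

print(enforce_grounding("Heavy rain is expected [Chunk 1].", n_chunks=3))
print(enforce_grounding("Heavy rain is expected.", n_chunks=3))
```

This only verifies that citations are present and point at chunks that were supplied; it does not verify that the cited chunk supports the claim, which would need a separate check.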

## 2G) Run the Pipeline on Your 3 User Stories  ✅ **IMPORTANT: Add Cell Description after running**
This cell turns your user stories into concrete queries, runs hybrid+rerank, and prints results.


In [39]:
import re

def story_to_query(story_text):
    m = re.search(r"I want to (.+?)(?: so that|\.|$)", story_text, flags=re.IGNORECASE)
    return m.group(1).strip() if m else story_text.strip()

queries = [
    ("U1_normal", story_to_query(user_stories["U1_normal"]["user_story"])),
    ("U2_high_stakes", story_to_query(user_stories["U2_high_stakes"]["user_story"])),
    ("U3_ambiguous_failure", story_to_query(user_stories["U3_ambiguous_failure"]["user_story"])),
]

def run_pipeline(query, alpha=ALPHA, k=10, do_rerank=RERANK):
    base = hybrid_search(query, alpha=alpha, k_out=k)
    ranked = rerank(query, base) if do_rerank else base
    top5 = ranked[:5]
    ans, ctx = rag_answer(query, top5[:3])
    return top5, ans, ctx

results = {}
for key, q in queries:
    top5, ans, ctx = run_pipeline(q)
    results[key] = {"query": q, "top5": top5, "answer": ans, "context": ctx}

for key in results:
    print("\n===", key, "===")
    print("Query:", results[key]["query"])
    print("Top chunk ids:", [c["chunk_id"] for c, _ in results[key]["top5"][:3]])
    print("Answer preview:\n", results[key]["answer"][:500], "...\n")



=== U1_normal ===
Query: check the current weather and hourly forecast for my city
Top chunk ids: ['Station_1_weather.txt::c1', 'Station_6_weather.txt::c1', 'Station_7_weather.txt::c1']
Answer preview:
 Not enough evidence ...


=== U2_high_stakes ===
Query: receive alerts for severe weather events
Top chunk ids: ['Station_6_weather.txt::c1', 'Station_3_weather.txt::c0', 'Station_9_weather.txt::c0']
Answer preview:
 Not enough evidence ...


=== U3_ambiguous_failure ===
Query: compare current weather with historical trends
Top chunk ids: ['Station_9_weather.txt::c1', 'Station_10_weather.txt::c1', 'Station_8_weather.txt::c1']
Answer preview:
 Not enough evidence ...



### ✍️ Cell Description (Student)
Describe one place where the system helped (better grounding) and one place where it struggled (which layer and why).


In our pipeline, one place where the system helped was U1_normal, where the hybrid search and reranking layers retrieved chunks containing current weather data; the top chunk (Station_1_weather.txt::c1) scored 2.842, well above the rest of the top 5. In the run shown above, however, the generator still returned “Not enough evidence” for this query. When the evidence sentences are fully contained in the top chunks, the RAG model can generate grounded answers with citations, giving users confidence in the data.

One place where the system struggled was U2_high_stakes, the severe weather alerts scenario. Even though the alerts exist in the dataset, they were often split across chunks or ranked lower in the BM25/vector search, so the top chunks didn’t always include them. This caused the LLM to abstain, returning “Not enough evidence.” The issue mainly occurs at the retrieval layer: the search and chunking didn’t surface the critical alert evidence reliably, which prevented the generation layer from producing a grounded answer.

In short, good grounding happens when evidence is easily reachable in top chunks, and failures happen when critical information is split or ranked too low for the LLM to see.
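
One way to confirm that a failure sits at the retrieval layer is to look for the rank at which the critical evidence first appears. The sketch below assumes the ranked list has the same `(chunk, score)` shape as `results[key]["top5"]`; the helper name `first_rank_with_keyword` and the `"text"` key on each chunk are assumptions, and the toy chunks are invented for illustration.

```python
def first_rank_with_keyword(ranked, keyword, text_key="text"):
    """Return the 1-based rank of the first chunk whose text contains
    the keyword (case-insensitive), or None if no chunk contains it."""
    needle = keyword.lower()
    for rank, (chunk, _score) in enumerate(ranked, start=1):
        if needle in chunk.get(text_key, "").lower():
            return rank
    return None

# Toy ranked list; real chunks may store their text under a different key.
toy_ranked = [
    ({"chunk_id": "doc_a::c1", "text": "Temperature and humidity readings."}, 0.9),
    ({"chunk_id": "doc_b::c0", "text": "Station metadata."}, 0.5),
    ({"chunk_id": "doc_c::c2", "text": "ALERT: flood warning issued."}, 0.1),
]

print(first_rank_with_keyword(toy_ranked, "alert"))    # 3
print(first_rank_with_keyword(toy_ranked, "tornado"))  # None
```

If the returned rank is larger than the number of chunks passed to the generator (three in `run_pipeline`), the generator never saw the evidence, which points to chunking or ranking rather than generation.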

## 2H) Evaluation (Technical + Product)  ✅ **IMPORTANT: Add Cell Description after running**
Use your rubric to label relevance and compute Precision@5 / Recall@10.
Also assign product scores: Trust (1–5) and Decision Confidence (1–5).


In [40]:
def precision_at_k(relevant_flags, k=5):
    rel = relevant_flags[:k]
    return sum(rel) / max(1, len(rel))

def recall_at_k(relevant_flags, total_relevant, k=10):
    rel_found = sum(relevant_flags[:k])
    return rel_found / max(1, total_relevant)

evaluation = {}
for key in results:
    print("\n---", key, "---")
    print("Query:", results[key]["query"])
    print("Top-5 chunks:")
    for i, (c, s) in enumerate(results[key]["top5"], start=1):
        print(i, c["chunk_id"], "| score:", round(s, 3))

    evaluation[key] = {
        "relevant_flags_top10": [0]*10,             # set 1 for each relevant chunk among top-10
        "total_relevant_chunks_estimate": 0,        # estimate from your rubric
        "precision_at_5": None,
        "recall_at_10": None,
        "trust_score_1to5": 0,
        "confidence_score_1to5": 0,
    }

evaluation



--- U1_normal ---
Query: check the current weather and hourly forecast for my city
Top-5 chunks:
1 Station_1_weather.txt::c1 | score: 2.842
2 Station_6_weather.txt::c1 | score: 0.326
3 Station_7_weather.txt::c1 | score: -0.805
4 Station_10_weather.txt::c1 | score: -1.022
5 Station_5_weather.txt::c1 | score: -1.194

--- U2_high_stakes ---
Query: receive alerts for severe weather events
Top-5 chunks:
1 Station_6_weather.txt::c1 | score: -2.639
2 Station_3_weather.txt::c0 | score: -5.667
3 Station_9_weather.txt::c0 | score: -6.18
4 Station_1_weather.txt::c0 | score: -6.239
5 Station_4_weather.txt::c0 | score: -6.285

--- U3_ambiguous_failure ---
Query: compare current weather with historical trends
Top-5 chunks:
1 Station_9_weather.txt::c1 | score: 4.819
2 Station_10_weather.txt::c1 | score: 4.787
3 Station_8_weather.txt::c1 | score: 4.731
4 Station_5_weather.txt::c1 | score: 4.706
5 Station_3_weather.txt::c1 | score: 4.681


{'U1_normal': {'relevant_flags_top10': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
  'total_relevant_chunks_estimate': 0,
  'precision_at_5': None,
  'recall_at_10': None,
  'trust_score_1to5': 0,
  'confidence_score_1to5': 0},
 'U2_high_stakes': {'relevant_flags_top10': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
  'total_relevant_chunks_estimate': 0,
  'precision_at_5': None,
  'recall_at_10': None,
  'trust_score_1to5': 0,
  'confidence_score_1to5': 0},
 'U3_ambiguous_failure': {'relevant_flags_top10': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
  'total_relevant_chunks_estimate': 0,
  'precision_at_5': None,
  'recall_at_10': None,
  'trust_score_1to5': 0,
  'confidence_score_1to5': 0}}

### ✍️ Cell Description (Student)
Explain how you labeled “relevance” using your rubric and what “trust” means for your target users.


I labeled each chunk based on whether it actually helps answer the user’s question. For example, chunks showing the current weather support U1, chunks with severe weather alerts support U2, and chunks with historical trends support U3. If a chunk had the needed information, it was marked relevant; otherwise, it was marked irrelevant.

Trust is about how confident users can be in the system’s answers. The level of trust required differs by user: moderate for everyday users (U1), critical for travelers who need alerts (U2), and high for researchers analyzing trends (U3). By checking which chunks are relevant, we can see how much users can rely on the system to give correct and safe guidance.
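
Once the relevance flags have been labeled, the two metrics can be filled in from them. The sketch below repeats the notebook's `precision_at_k` and `recall_at_k` so that it runs on its own; `fill_metrics` is a hypothetical helper, and the flags and the relevant-chunk estimate are made-up labels, not results from this run.

```python
def precision_at_k(relevant_flags, k=5):
    """Fraction of the top-k results that were labeled relevant."""
    rel = relevant_flags[:k]
    return sum(rel) / max(1, len(rel))

def recall_at_k(relevant_flags, total_relevant, k=10):
    """Fraction of all relevant chunks that appear in the top-k results."""
    return sum(relevant_flags[:k]) / max(1, total_relevant)

def fill_metrics(entry):
    """Compute both metrics in place from the hand-labeled fields."""
    flags = entry["relevant_flags_top10"]
    entry["precision_at_5"] = precision_at_k(flags, k=5)
    entry["recall_at_10"] = recall_at_k(
        flags, entry["total_relevant_chunks_estimate"], k=10
    )
    return entry

# Made-up labels: ranks 1, 3, and 7 judged relevant, 4 relevant chunks overall.
example = {
    "relevant_flags_top10": [1, 0, 1, 0, 0, 0, 1, 0, 0, 0],
    "total_relevant_chunks_estimate": 4,
    "precision_at_5": None,
    "recall_at_10": None,
}
fill_metrics(example)
print(example["precision_at_5"], example["recall_at_10"])  # 0.4 0.75
```

Note that the notebook's `run_pipeline` keeps only the top 5 after reranking, so labeling ten flags requires looking at the full top-10 list rather than `results[key]["top5"]`.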

## 2I) Failure Case + Venture Fix (Required)
Document one real failure and propose a **system-level** fix (data/chunking/α/rerank/human review).


In [42]:
failure_case = {
    "which_user_story": "U2_high_stakes",
    "what_failed": "The system failed to surface severe weather alerts, returning 'Not enough evidence' even though alerts exist in the dataset.",
    "which_layer_failed": "Retrieval (BM25 + Vector search)",
    "real_world_consequence": "A traveler relying on the system might miss critical flood, storm, or hurricane warnings, putting them at risk.",
    "proposed_system_fix": (
        "Move critical evidence (alerts) to the top of each document so chunking preserves it, "
        "switch to semantic chunking to keep alerts together, "
        "and adjust α to give more weight to keyword search. "
        "Optionally, add a human-in-the-loop review for high-stakes queries to ensure alerts are never missed."
    ),
}
failure_case


{'which_user_story': 'U2_high_stakes',
 'what_failed': "The system failed to surface severe weather alerts, returning 'Not enough evidence' even though alerts exist in the dataset.",
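 'which_layer_failed': 'Retrieval (BM25 + Vector search)',
 'real_world_consequence': 'A traveler relying on the system might miss critical flood, storm, or hurricane warnings, putting them at risk.',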
 'proposed_system_fix': 'Move critical evidence (alerts) to the top of each document so chunking preserves it, switch to semantic chunking to keep alerts together, and adjust α to give more weight to keyword search. Optionally, add a human-in-the-loop review for high-stakes queries to ensure alerts are never missed.'}
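
The effect of the proposed α adjustment can be previewed with a toy score fusion. This is a minimal sketch and not the notebook's `hybrid_search`: it assumes that α weights the keyword score and (1 − α) weights the vector score, which may be the opposite of the convention used in this lab, and the helper names, chunk ids, and scores are all invented.

```python
def min_max(scores):
    """Scale a non-empty {id: score} dict to the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {k: 0.0 for k in scores}
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}

def fuse(keyword_scores, vector_scores, alpha):
    """Rank ids by alpha * keyword + (1 - alpha) * vector.
    ASSUMPTION: alpha is the keyword weight in this sketch."""
    kw, vec = min_max(keyword_scores), min_max(vector_scores)
    ids = set(kw) | set(vec)
    fused = {i: alpha * kw.get(i, 0.0) + (1 - alpha) * vec.get(i, 0.0) for i in ids}
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

# Invented scores: the alert chunk matches keywords well but embeds poorly.
keyword_scores = {"alert_chunk": 8.0, "readings_chunk": 2.0, "metadata_chunk": 1.0}
vector_scores = {"alert_chunk": 0.30, "readings_chunk": 0.62, "metadata_chunk": 0.55}

for alpha in (0.2, 0.8):
    print(alpha, [cid for cid, _ in fuse(keyword_scores, vector_scores, alpha)])
```

With these toy numbers the alert chunk is ranked last at α = 0.2 and first at α = 0.8, which is the behaviour the fix relies on; whether the real corpus behaves the same way has to be checked by rerunning the U2 query.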

## 2J) README Template (Copy into GitHub README.md)

```md
# Week 2 Hands-On — Applied RAG Product Results (CS 5588)

## Product Overview
- Product name:
- Target users:
- Core problem:
- Why RAG:

## Dataset Reality
- Source / owner:
- Sensitivity:
- Document types:
- Expected scale in production:

## User Stories + Rubric
- U1:
- U2:
- U3:
(Rubric: acceptable evidence + correct answer criteria)

## System Architecture
- Chunking:
- Keyword retrieval:
- Vector retrieval:
- Hybrid α:
- Reranking governance:
- LLM / generation option:

## Results
| User Story | Method | Precision@5 | Recall@10 | Trust (1–5) | Confidence (1–5) |
|---|---|---:|---:|---:|---:|

## Failure + Fix
- Failure:
- Layer:
- Consequence:
- Safeguard / next fix:

## Evidence of Grounding
Paste one RAG answer with citations: [Chunk 1], [Chunk 2]
```
