---

### 🎓 **Professor**: Apostolos Filippas

### 📘 **Class**: AI Engineering

### 📋 **Topic**: You Can Just Build Things

🚫 **Note**: You are not allowed to share the contents of this notebook with anyone outside this class without written permission by the professor.

---

## Welcome!

In our first four lectures, we've covered how:
1. We can call LLMs via APIs and get structured responses
2. We can build lexical search with BM25
3. We can build semantic search with embeddings
4. We can combine lexical and semantic search into hybrid search

Today you will put it all together by building a Retrieval Augmented Generation (RAG) system.
- This is a question-answering bot that can answer questions about Fordham University
- You will use real data scraped from the Fordham website.


Your RAG pipeline will look like this:

```
User Question
     ↓
1. RETRIEVE: Find relevant documents (search!)
     ↓
2. AUGMENT: Stuff those documents into a prompt
     ↓
3. GENERATE: Ask an LLM to answer using the context
     ↓
Answer
```


---

# 1. Look at your data

In `data/fordham-website.zip` you'll find **~9,500 Markdown files** scraped from Fordham's website. Each file is one page: admissions info, program descriptions, faculty pages, financial aid, campus life, and more.

Your task: **look at the data**
- The first step in any AI engineering or data science project should always be to familiarize yourself with the data.
- I cannot stress this enough: without this step, it's hard to build anything useful.

Tips:
- Unzip the archive and look at some of the files.
- Open a few in a text editor.
- Get a feel for what you're working with.
- The first line of every file is always the **URL** of the page it was scraped from. The rest is the page content converted to Markdown. Here's an example, `gabelli-school-of-business_veterans.md`:

```markdown
https://www.fordham.edu/gabelli-school-of-business/veterans

# Military Veterans & Active Duty Members of the Military

## Transform Your Knowledge & Skills Into a Business Career for the Future

As a veteran or an active duty member of the United States Armed Services,
you have gained or are currently acquiring the invaluable organizational,
leadership, analytics, and technical knowledge and skills that hiring
managers seek. These transferrable skills provide a major advantage in
emerging, business-related industries where innovation, a global mind-set,
and the ability to lead individuals and teams in the continuously evolving
work environment, are critical for success.

By completing a graduate or undergraduate business degree at the Gabelli
School of Business, you can prepare for a lifelong career in some of
today's fastest-growing fields. ...

### Study at a Top-Ranked, Military-Friendly University

The Gabelli School of Business is part of Fordham University, the only
New York City university to be among those ranked "Best for Vets" by
Military Times. ...

### Learn How the Yellow Ribbon Program Works

The Yellow Ribbon GI Education Enhancement Program, or the Yellow Ribbon
Program, is a part of the Post-9/11 Veterans Educational Assistance Act
of 2008. ...
```

The filenames mirror the URL structure: underscores replace path separators (e.g. `gabelli-school-of-business_veterans.md` came from `/gabelli-school-of-business/veterans`). Some files are short (a few lines), others are quite long.

- Once you've looked around, load the files into Python. Python's built-in `zipfile` module can read zip archives without extracting to disk. Load them into a list of dictionaries or a DataFrame with at least two fields: the filename (or a clean page name) and the content.
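
As a minimal sketch of the loading step, the snippet below reads `.md` entries straight from a zip archive and splits each one into a URL (first line) and content (the rest). The function name `load_markdown_zip`, the in-memory demo archive, and the example filename and URL are all made up for illustration; with the real data you would pass the path to `data/fordham-website.zip` instead.

```python
import io
import zipfile


def load_markdown_zip(zip_source) -> list[dict]:
    """Read every .md file in a zip archive into a list of dicts.

    Assumes the first line of each file is the page URL and the rest is
    the page content, as described above.
    """
    docs = []
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            if not name.endswith(".md"):
                continue  # skip directories and non-Markdown entries
            text = archive.read(name).decode("utf-8", errors="replace")
            url, _, content = text.partition("\n")
            docs.append(
                {"filename": name, "url": url.strip(), "content": content.strip()}
            )
    return docs


# Tiny in-memory archive (made-up page) so the sketch runs without the real file.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr(
        "example_page.md",
        "https://www.fordham.edu/example/page\n\n# Example\nPlaceholder text.",
    )
buffer.seek(0)

docs = load_markdown_zip(buffer)
print(docs[0]["url"], "|", len(docs[0]["content"]), "characters")
```

The resulting list of dicts can be handed to `pd.DataFrame(docs)` if you prefer to work with a DataFrame.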

In [2]:
# Imports used throughout the notebook
# Core
import os
import json
import zipfile
from pathlib import Path

# Data
import pandas as pd
import numpy as np

# Text processing / retrieval
import nltk
from nltk.tokenize import sent_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Embeddings (OpenAI example; adapt to your provider)
from openai import OpenAI

# Progress bars
from tqdm.auto import tqdm


In [3]:
# Embeddings (inexpensive; the same model must be used for chunks and queries)
EMBED_MODEL = "text-embedding-3-small"

# Generation
GEN_MODEL = "gpt-4o-mini"


In [4]:
import json
from pathlib import Path

DATA_JSON_PATH = Path("fordham-website-windows.json")  # Your uploaded file

def load_json_to_df(json_path: Path) -> pd.DataFrame:
    """
    Load your JSON file into a DataFrame.
    Assumes it's a list of dicts with 'filename', 'url', 'content' 
    OR a dict with keys=filenames, values=content.
    """
    with open(json_path, 'r', encoding='utf-8') as f:
        data = json.load(f)
    
    if isinstance(data, list):
        # List of dicts format
        rows = data
    elif isinstance(data, dict):
        # Dict format: {"filename1": "content1", "filename2": "content2"}
        rows = []
        for filename, content in data.items():
            # Try to extract URL from filename (common pattern)
            url = f"https://www.fordham.edu/{filename.replace('.md', '').replace('_', '/')}/"
            rows.append({"filename": filename, "url": url, "content": content})
    else:
        raise ValueError("Unexpected JSON format")
    
    return pd.DataFrame(rows)

docs_df = load_json_to_df(DATA_JSON_PATH)
print(f"Loaded {len(docs_df)} documents")
docs_df.head()


Loaded 9560 documents


|   | filename | url | content |
|---|----------|-----|---------|
| 0 | d680e8a854a7cbad6d490c445cba2eba.md | https://www.fordham.edu/d680e8a854a7cbad6d490c... | index.md |
| 1 | 42f623b3bad309d5d6619d450af47d40.md | https://www.fordham.edu/42f623b3bad309d5d6619d... | research.md |
| 2 | cbcc2bf9adc75ea7fce66bd4eb246203.md | https://www.fordham.edu/cbcc2bf9adc75ea7fce66b... | ccel.md |
| 3 | 717c9560db9ebed56c22664925c4c1e6.md | https://www.fordham.edu/717c9560db9ebed56c2266... | fordham-college-at-lincoln-center.md |
| 4 | d8a2a1a177f439f2ca185e084642f713.md | https://www.fordham.edu/d8a2a1a177f439f2ca185e... | academics.md |


In [3]:
import json
DATA_JSON_PATH = Path("fordham-website-windows.json")
print("Loading JSON...")

def load_json_to_df(json_path: Path):
    with open(json_path, 'r', encoding='utf-8') as f:
        data = json.load(f)
    # Handle dict format {filename: content}
    if isinstance(data, dict):
        rows = [{"filename": k, "url": f"https://fordham.edu/{k.replace('.md','').replace('_','/')}", "content": v} 
                for k, v in data.items()]
    else:
        rows = data  # List format
    return pd.DataFrame(rows)

docs_df = load_json_to_df(DATA_JSON_PATH)
print(f"✅ Loaded {len(docs_df)} documents")


Loading JSON...
✅ Loaded 9560 documents


---

# 2. Chunk the Documents

Some pages are too long to embed as a single unit, and even when they can be embedded, they may be too long to stuff into the LLM's prompt during the generation step. For this reason, most RAG systems break big documents into smaller **chunks**.

> 📚 **TERM: Chunking**  
> Splitting documents into smaller, self-contained pieces for embedding and retrieval. The goal is chunks that are small enough to be specific, but large enough to be meaningful.

Your task: **write a function that splits each document into chunks.**

Things to think about:
- What's a reasonable chunk size? (Think about what fits in a prompt vs. what's too vague)
- Should you split on sentences? Paragraphs? A fixed character/word count?
- Should chunks overlap? What happens if an answer spans two chunks?
- How do you keep track of which document each chunk came from? You may need that information down the line.
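
One possible answer to these questions, sketched under assumptions: fixed-size character windows that back up to the last sentence end, with a small overlap between neighbours. The function name `chunk_text` and the default sizes are illustrative choices, not values prescribed by the assignment.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into overlapping chunks, preferring sentence boundaries."""
    if overlap >= chunk_size // 2:
        raise ValueError("overlap must be smaller than half of chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            # Back up to the last sentence end in the second half of the window.
            cut = text.rfind(". ", start + chunk_size // 2, end)
            if cut != -1:
                end = cut + 1  # keep the period with this chunk
        chunk = text[start:end].strip()
        if chunk:
            chunks.append(chunk)
        if end >= len(text):
            break  # final chunk emitted; avoid a tail that is pure overlap
        start = end - overlap  # step back so neighbouring chunks share context
    return chunks


# Synthetic text so the sketch runs on its own.
sample = " ".join(f"This is sentence {i}." for i in range(50))
chunks = chunk_text(sample, chunk_size=200, overlap=20)
print(len(chunks), "chunks; first chunk ends with:", repr(chunks[0][-20:]))
```

To keep track of provenance, store each chunk alongside its document's filename and URL (for example as a dict per chunk) rather than as a bare string.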

In [6]:
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')  # Newer NLTK releases load this resource instead of 'punkt'
print("✅ NLTK punkt downloaded!")


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Manikandan\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\Manikandan\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt_tab.zip.


✅ NLTK punkt downloaded!


In [6]:
import json
from pathlib import Path

DATA_JSON_PATH = Path("fordham-website-windows.json")

# Peek at the data structure
with open(DATA_JSON_PATH, 'r', encoding='utf-8') as f:
    data = json.load(f)

print("JSON type:", type(data))
print("JSON keys/len:", len(data))  # len() works for both dict and list
print("\nFirst 3 keys/items:")
if isinstance(data, dict):
    keys = list(data.keys())[:3]
    for k in keys:
        content_start = str(data[k])[:200] + "..." if len(str(data[k])) > 200 else str(data[k])
        print(f"  {k}: {content_start}")
        print(f"  Content len: {len(str(data[k]))}")
elif isinstance(data, list):
    for i, item in enumerate(data[:3]):
        print(f"  Item {i}: {item}")
else:
    print("Unexpected format:", data)


JSON type: <class 'dict'>
JSON keys/len: 9560

First 3 keys/items:
  d680e8a854a7cbad6d490c445cba2eba.md: index.md
  Content len: 8
  42f623b3bad309d5d6619d450af47d40.md: research.md
  Content len: 11
  cbcc2bf9adc75ea7fce66bd4eb246203.md: ccel.md
  Content len: 7


In [7]:
import os
import pandas as pd
from pathlib import Path
from tqdm.auto import tqdm
import gc

# STEP 1+2: Load .md files from folder + chunk (memory safe)
FOLDER_PATH = Path("fordham-website-windows")
print(f"Scanning {FOLDER_PATH}...")

# 1. LOAD ALL .md FILES FROM FOLDER
docs_list = []
for md_file in tqdm(list(FOLDER_PATH.glob("*.md")), desc="Loading .md files"):
    try:
        with open(md_file, 'r', encoding='utf-8', errors='ignore') as f:
            raw_content = f.read()
        
        # First line = URL, rest = content (as described in Section 1)
        lines = raw_content.strip().split('\n')
        if lines:
            url = lines[0].strip()
            content = '\n'.join(lines[1:]).strip()
            
            if len(content) > 50:  # Skip tiny files
                docs_list.append({
                    "filename": md_file.name,
                    "url": url,
                    "content": content,
                    "file_path": str(md_file)
                })
    except Exception as e:
        print(f"Skipped {md_file.name}: {e}")
        continue

docs_df = pd.DataFrame(docs_list)
print(f"✅ Loaded {len(docs_df)} valid .md documents")
print("\nSample:")
print(docs_df[["filename", "url"]].head())

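# 2. CHUNKER: fixed-size character windows with overlap
def chunk_doc_streaming(text: str, source_id: str, url: str, filename: str):
    """Split one document into overlapping chunks of at most ~2000 characters."""
    chunks = []
    start = 0
    chunk_size = 2000
    overlap = 200
    
    while start < len(text):
        end = start + chunk_size
        if end < len(text):
            # Prefer to cut at a sentence boundary: the first ". " found in the
            # second half of the window. Fall back to a hard cut if there is none.
            end = text.find('. ', start + chunk_size//2, end)
            if end == -1: 
                end = start + chunk_size
        
        chunk = text[start:end].strip()
        if len(chunk) > 100:  # drop fragments too short to be useful
            chunks.append({
                "source_id": source_id, 
                "url": url, 
                "filename": filename,
                "chunk_id": len(chunks), 
                "text": chunk
            })
        # Step back by `overlap` so consecutive chunks share some context
        start = max(end - overlap, 0)
        if start >= len(text): 
            break
    return chunks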

# Process 500 docs at a time (memory safe)
BATCH_SIZE_DOCS = 500
chunks_all = []

print(f"\nChunking {len(docs_df)} docs in batches of {BATCH_SIZE_DOCS}...")
for batch_start in tqdm(range(0, len(docs_df), BATCH_SIZE_DOCS)):
    batch_end = min(batch_start + BATCH_SIZE_DOCS, len(docs_df))
    batch_docs = docs_df.iloc[batch_start:batch_end]
    
    batch_chunks = []
    for idx, row in batch_docs.iterrows():
        doc_chunks = chunk_doc_streaming(
            row["content"], str(idx), row["url"], row["filename"]
        )
        batch_chunks.extend(doc_chunks)
    
    chunks_all.extend(batch_chunks)
    print(f"Batch {batch_start//BATCH_SIZE_DOCS + 1}: {len(batch_chunks)} chunks, "
          f"Total so far: {len(chunks_all)}")
    
    del batch_docs, batch_chunks
    gc.collect()

# Final save
chunks_df = pd.DataFrame(chunks_all)
chunks_df.to_parquet("fordham_chunks_final.parquet")
print(f"\n✅ FINAL: {len(chunks_df)} chunks saved!")
print(chunks_df[["filename", "url", "text"]].head())


Scanning fordham-website-windows...


Loading .md files:   0%|          | 0/9560 [00:00<?, ?it/s]

✅ Loaded 9560 valid .md documents

Sample:
                              filename  \
0  0001144e6d954f94682637e541ad5d7f.md   
1  000121f75daed3fee3eb14cdb934a788.md   
2  001fb84f7d97bdd24d99b581d920a602.md   
3  002239821b73a238c5b698e20708f3e5.md   
4  002a8f1a0801c409834b8b47158f33d8.md   

                                                 url  
0  https://www.fordham.edu/about/living-the-missi...  
1  https://www.fordham.edu/academics/departments/...  
2  https://www.fordham.edu/information-technology...  
3  https://www.fordham.edu/summer-session/pre-col...  
4  https://www.fordham.edu/school-of-professional...  

Chunking 9560 docs in batches of 500...


  0%|          | 0/20 [00:00<?, ?it/s]

Batch 1: 1827 chunks, Total so far: 1827
Batch 2: 2430 chunks, Total so far: 4257
Batch 3: 1808 chunks, Total so far: 6065
Batch 4: 2539 chunks, Total so far: 8604
Batch 5: 1616 chunks, Total so far: 10220
Batch 6: 2397 chunks, Total so far: 12617
Batch 7: 1708 chunks, Total so far: 14325
Batch 8: 1543 chunks, Total so far: 15868
Batch 9: 1910 chunks, Total so far: 17778
Batch 10: 1895 chunks, Total so far: 19673
Batch 11: 1725 chunks, Total so far: 21398
Batch 12: 2183 chunks, Total so far: 23581
Batch 13: 2032 chunks, Total so far: 25613
Batch 14: 1916 chunks, Total so far: 27529
Batch 15: 1455 chunks, Total so far: 28984
Batch 16: 1944 chunks, Total so far: 30928
Batch 17: 1883 chunks, Total so far: 32811
Batch 18: 1783 chunks, Total so far: 34594
Batch 19: 2056 chunks, Total so far: 36650
Batch 20: 203 chunks, Total so far: 36853

✅ FINAL: 36853 chunks saved!
                              filename  \
0  0001144e6d954f94682637e541ad5d7f.md   
1  0001144e6d954f94682637e541ad5d7f.md

---

# 3. Embed the Chunks

Now we need to turn each chunk into a vector so we can search over them. You've done this before in Lecture 4.

Your task: **embed all chunks using an embedding model.**

Tips:
- You could use a local model or an API model. What are the tradeoffs?
- This will take a while if you do it serially. You might want to use async or batched requests.
- Once you've created your embeddings, you may want to save them to disk so you don't have to redo this step every time.
- You'll need to embed queries with the **same model** at search time.
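
To illustrate the batching tip, here is a rough sketch that splits the texts into batches and runs several batches concurrently with a thread pool. `embed_batch` is a hypothetical stand-in that returns fake two-number vectors so the example runs offline; in practice it would wrap one call to whichever embedding provider you use, and you would need to respect that provider's rate limits.

```python
from concurrent.futures import ThreadPoolExecutor


def embed_batch(texts: list[str]) -> list[list[float]]:
    """Hypothetical stand-in for one embedding API call (one vector per text)."""
    return [[float(len(t)), float(t.count(" "))] for t in texts]


def embed_all(texts: list[str], batch_size: int = 100, workers: int = 4) -> list[list[float]]:
    """Embed texts in batches, running several batches concurrently."""
    batches = [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in input order, so vectors stay aligned with texts.
        results = list(pool.map(embed_batch, batches))
    return [vector for batch in results for vector in batch]


texts = ["a b", "hello", "x", "one two three", "z"]
vectors = embed_all(texts, batch_size=2, workers=2)
print(len(vectors), "vectors; first:", vectors[0])
```

Because the order of results matches the order of inputs, row `i` of the output still corresponds to chunk `i`, which is what the retrieval step relies on.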

In [8]:
# Embed every chunk in batches and cache the result on disk
import numpy as np
from openai import OpenAI
from pathlib import Path
from tqdm.auto import tqdm
import os

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
EMBED_MODEL = "text-embedding-3-small"
BATCH_SIZE = 500

def batch_embed(texts: list, batch_size: int = BATCH_SIZE) -> np.ndarray:
    all_embeddings = []
    for i in tqdm(range(0, len(texts), batch_size), desc="Embedding"):
        batch_texts = [t[:4000] for t in texts[i:i + batch_size]]  # Truncate
        response = client.embeddings.create(model=EMBED_MODEL, input=batch_texts)
        batch_embs = [item.embedding for item in response.data]
        all_embeddings.extend(batch_embs)
    return np.array(all_embeddings, dtype="float32")

# Load chunks (if restarted)
chunks_df = pd.read_parquet("fordham_chunks_final.parquet")

# Embed (cached)
EMB_PATH = Path("fordham_embeddings.npy")
if EMB_PATH.exists():
    print("✅ Loading cached embeddings...")
    chunk_embeddings = np.load(EMB_PATH)
else:
    print(f"🔄 Embedding {len(chunks_df)} chunks...")
    texts = chunks_df["text"].tolist()
    chunk_embeddings = batch_embed(texts)
    np.save(EMB_PATH, chunk_embeddings)
    print(f"✅ Saved {chunk_embeddings.shape}")

print(f"✅ Embeddings ready: {chunk_embeddings.shape}")


🔄 Embedding 36853 chunks...


Embedding:   0%|          | 0/74 [00:00<?, ?it/s]

✅ Saved (36853, 1536)
✅ Embeddings ready: (36853, 1536)


---

# 4. Retrieve

Now build the **R** in RAG. Given a user's question, find the most relevant chunks.

Your task: **write a retrieval function that takes a question and returns the most relevant chunks.**

Tips:
- You can use lexical or semantic search or both!
- How many chunks should you retrieve? Too few and you might miss the answer; too many and you'll overwhelm the LLM (and pay more tokens)
- Try a few test questions and eyeball whether the retrieved chunks are relevant. For example:
  - "What programs does the Gabelli School of Business offer?"
  - "How do I apply for financial aid?"
  - "Where is Fordham's campus?"
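
If you decide to combine lexical and semantic search, one common option is reciprocal rank fusion (RRF), which merges ranked lists without needing the two score scales to be comparable. This is a sketch of that general technique, not necessarily the fusion method used in lecture; the chunk IDs and the constant `k=60` are illustrative assumptions.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked lists of IDs; each hit contributes 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up chunk IDs standing in for the output of two retrievers.
lexical_ranking = ["c1", "c2", "c3"]
semantic_ranking = ["c2", "c4", "c1"]

fused = reciprocal_rank_fusion([lexical_ranking, semantic_ranking])
print([doc_id for doc_id, _ in fused])
```

Chunks that appear near the top of both lists rise to the top of the fused ranking, while a chunk found by only one retriever can still make the cut.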

In [10]:
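def embed_query(query: str) -> np.ndarray:
    # Embed the query with the same model that was used for the chunks
    resp = client.embeddings.create(model=EMBED_MODEL, input=query[:4000])
    return np.array(resp.data[0].embedding, dtype="float32")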

def retrieve_semantic(query: str, k: int = 5) -> pd.DataFrame:
    q_emb = embed_query(query)
    sims = cosine_similarity(
        q_emb.reshape(1, -1),
        chunk_embeddings
    )[0]
    top_idx = np.argsort(sims)[::-1][:k]
    results = chunks_df.iloc[top_idx].copy()
    results["score"] = sims[top_idx]
    return results


In [11]:
# Fit TF-IDF on chunks once
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(max_features=50000)
tfidf_matrix = tfidf.fit_transform(chunks_df["text"].tolist())

def retrieve_hybrid(query: str, k: int = 5,
                    alpha: float = 0.5) -> pd.DataFrame:
    # alpha: weight for semantic, (1-alpha) for lexical
    q_emb = embed_query(query)
    sem_sims = cosine_similarity(
        q_emb.reshape(1, -1),
        chunk_embeddings
    )[0]

    q_tfidf = tfidf.transform([query])
    lex_sims = cosine_similarity(q_tfidf, tfidf_matrix)[0]

    combined = alpha * sem_sims + (1 - alpha) * lex_sims
    top_idx = np.argsort(combined)[::-1][:k]
    res = chunks_df.iloc[top_idx].copy()
    res["semantic_score"] = sem_sims[top_idx]
    res["lexical_score"] = lex_sims[top_idx]
    res["score"] = combined[top_idx]
    return res

In [12]:
from sklearn.metrics.pairwise import cosine_similarity

def embed_query(query: str) -> np.ndarray:
    resp = client.embeddings.create(model=EMBED_MODEL, input=query[:4000])
    return np.array(resp.data[0].embedding, dtype="float32")

def retrieve_top_k(query: str, k: int = 6) -> pd.DataFrame:
    q_emb = embed_query(query)
    sims = cosine_similarity(q_emb.reshape(1, -1), chunk_embeddings)[0]
    top_idx = np.argsort(sims)[::-1][:k]
    results = chunks_df.iloc[top_idx].copy()
    results["score"] = sims[top_idx]
    return results.sort_values("score", ascending=False)

GEN_MODEL = "gpt-4o-mini"

def rag(question: str, k: int = 6) -> dict:
    retrieved = retrieve_top_k(question, k)
    context = "\n\n---\n\n".join([
        f"[Source: {row['url']}]\n{row['text'][:1000]}..." 
        for _, row in retrieved.iterrows()
    ])
    
    # Fill the template with str.format; the space before the closing triple quote keeps the final '"' separate from it
    prompt = """Using ONLY this Fordham University context, answer the question.

CONTEXT:
{}
QUESTION: {}

Answer briefly using only the context above. If not in context, say "Not found in Fordham docs." """.format(context, question)
    
    resp = client.chat.completions.create(
        model=GEN_MODEL,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.1
    )
    
    return {
        "question": question,
        "answer": resp.choices[0].message.content.strip(),
        "sources": retrieved[["url", "filename", "score"]].to_dict("records")
    }

# TEST IT!
result = rag("What programs does Gabelli School offer?")
print("Q:", result["question"])
print("\nA:", result["answer"])
print("\nSources:", [s["url"] for s in result["sources"][:2]])


Q: What programs does Gabelli School offer?

A: The Gabelli School offers three types of M.B.A. programs (full-time, professional, and executive M.B.A.), 12 Master of Science programs (two offered online), and two doctoral programs (Ph.D. and Doctor of Professional Studies).

Sources: ['https://www.fordham.edu/gabelli-school-of-business/academic-programs-and-admissions', 'https://www.fordham.edu/gabelli-school-of-business/academic-programs-and-admissions/graduate-programs/academic-programs']


---

# 5. Generate

Now build the **G** in RAG. Take the retrieved chunks and pass them to an LLM along with the user's question.

Your task: **write a function that takes a question and the retrieved chunks, builds a prompt, and calls an LLM to generate an answer.**

Tips:
- How should you structure the prompt? The LLM needs to know: (1) what is the context of the application, (2) what is the question, (3) what it should include in its answer
- What should the LLM do if the context doesn't contain the answer?
- Start with a cheap model; try a better one when you've figured out the pipeline
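
One way to structure such a prompt, sketched under assumptions: put the application context and rules in a system message, then number the retrieved chunks so the model can cite them. The function name `build_messages`, the wording of the instructions, and the example chunk are illustrative; the returned list follows the usual chat format of `role`/`content` dicts.

```python
def build_messages(question: str, chunks: list[dict], max_chars: int = 1000) -> list[dict]:
    """Build chat messages that ground the answer in numbered sources."""
    sources = []
    for i, chunk in enumerate(chunks, start=1):
        # Truncate each chunk so the prompt stays within a predictable size.
        sources.append(f"[{i}] {chunk['url']}\n{chunk['text'][:max_chars]}")
    context = "\n\n".join(sources) if sources else "(no sources retrieved)"
    system = (
        "You answer questions about Fordham University using only the numbered "
        "sources provided. Cite sources like [1]. If the sources do not contain "
        "the answer, say you could not find it."
    )
    user = f"Sources:\n{context}\n\nQuestion: {question}"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]


# Made-up chunk for demonstration only.
example_chunks = [{"url": "https://www.fordham.edu/example", "text": "x" * 2000}]
messages = build_messages("Where is the example office?", example_chunks)
print(messages[1]["content"][:80])
```

Keeping prompt construction in a pure function like this makes it easy to inspect exactly what the model sees before spending any tokens.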

In [16]:
GEN_MODEL = "gpt-4o-mini"

def build_context(retrieved_chunks: pd.DataFrame) -> str:
    """Build context string from retrieved chunks"""
    context_parts = []
    for _, row in retrieved_chunks.iterrows():
        context_parts.append(f"[Source: {row['url']}]\n{row['text'][:1000]}...")
    return "\n\n---\n\n".join(context_parts)

def generate_answer(question: str, retrieved_chunks: pd.DataFrame) -> str:
    """Step 5: GENERATE - LLM answers using retrieved context"""
    context = build_context(retrieved_chunks)
    
    prompt = """Using ONLY this Fordham University context, answer the question.

CONTEXT:
{}
QUESTION: {}

Answer briefly using only the context above. If not in context, say "Not found in Fordham docs." """.format(context, question)
    
    resp = client.chat.completions.create(
        model=GEN_MODEL,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.1
    )
    
    return resp.choices[0].message.content.strip()

# Test generation
question = "Where is Fordham's campus?"
retrieved = retrieve_top_k(question, k=5)
answer = generate_answer(question, retrieved)
print("Generated Answer:")
print(answer)


Generated Answer:
Fordham University has two campuses in New York City: Rose Hill in the Bronx and Lincoln Center in Manhattan.


---

# 6. Wire everything together

Combine the previous steps into a simple function that takes in a question and returns an answer.

Your task: **write a `rag(question)` function that retrieves relevant chunks and generates an answer.**

In [17]:
def rag(question: str, k: int = 6) -> dict:
    """Step 6: RAG - Retrieve + Augment + Generate"""
    # RETRIEVE
    retrieved = retrieve_top_k(question, k)
    
    # AUGMENT + GENERATE  
    answer = generate_answer(question, retrieved)
    
    return {
        "question": question,
        "answer": answer,
        "sources": retrieved[["url", "filename", "score"]].to_dict("records")
    }

# Test complete pipeline
test_questions = [
    "Where is Fordham's campus?",
    "What programs does Gabelli School offer?",
    "How do I apply for financial aid?"
]

print("=== FULL RAG PIPELINE TEST ===\n")
for question in test_questions:
    result = rag(question, k=5)
    print(f"Q: {result['question']}")
    print(f"A: {result['answer']}")
    print(f"Sources: {len(result['sources'])} pages")
    print("-" * 80)


=== FULL RAG PIPELINE TEST ===

Q: Where is Fordham's campus?
A: Fordham University has two campuses in New York City: Rose Hill in the Bronx and Lincoln Center in Manhattan.
Sources: 5 pages
--------------------------------------------------------------------------------
Q: What programs does Gabelli School offer?
A: The Gabelli School offers three types of M.B.A. programs (full-time, professional, and executive M.B.A.), 12 Master of Science programs (two offered online), and two doctoral programs (Ph.D. and Doctor of Professional Studies).
Sources: 5 pages
--------------------------------------------------------------------------------
Q: How do I apply for financial aid?
A: To apply for financial aid at Fordham University, you need to fill out the Free Application for Federal Student Aid (FAFSA) and, if applicable, the CSS Profile. Ensure you are a U.S. citizen or eligible non-citizen, making satisfactory academic progress, and enrolled as a matriculated student in an approved progr

---

# 7. Evaluate, experiment and improve

Your RAG system works, but there's always room to make it better.

Your task: **evaluate, experiment, and improve your system**

Tips:
- How do you know that your system is working or that your changes are improving it?
- Try different questions: where does it do well? Where does it struggle?
- Adjust the number of retrieved chunks: what happens with more or fewer?
- Try different chunking strategies: bigger chunks? Smaller? Overlap?
- Try a different embedding model: does it change retrieval quality?
- Improve the prompt: can you get better, more concise answers?
- Add source attribution: can the system tell the user which pages the answer came from?
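
To make "is it improving?" measurable, one simple option is a hit rate at k: for a handful of questions where you know which page holds the answer, check whether that page's URL shows up in the top-k retrieved results. The sketch below assumes a `retrieve(question, k)` callable that returns a list of URLs; the labeled questions, URLs, and `fake_retrieve` stub are made up so the example runs without the real index.

```python
def hit_rate_at_k(labeled: list[tuple[str, str]], retrieve, k: int = 5) -> float:
    """Fraction of questions whose expected URL appears in the top-k results."""
    if not labeled:
        return 0.0
    hits = 0
    for question, expected_url in labeled:
        urls = retrieve(question, k)
        hits += expected_url in urls[:k]  # True counts as 1
    return hits / len(labeled)


# Made-up labeled set and canned retriever for demonstration only.
labeled = [
    ("Where is the example campus?", "https://www.fordham.edu/example/campus"),
    ("How do I apply for example aid?", "https://www.fordham.edu/example/aid"),
]


def fake_retrieve(question: str, k: int) -> list[str]:
    canned = {
        "Where is the example campus?": [
            "https://www.fordham.edu/example/campus",
            "https://www.fordham.edu/example/other",
        ],
        "How do I apply for example aid?": ["https://www.fordham.edu/example/other"],
    }
    return canned.get(question, [])[:k]


score = hit_rate_at_k(labeled, fake_retrieve, k=2)
print(f"hit rate@2 = {score:.2f}")
```

With a DataFrame-returning retriever, a small wrapper such as `lambda q, k: retriever(q, k)["url"].tolist()` adapts it to this interface. Rerunning the same labeled set after each change to chunking, k, or the embedding model gives a like-for-like comparison.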

In [18]:
# Step 7: EVALUATION FRAMEWORK
test_questions = [
    "Where is Fordham's campus?",
    "What programs does Gabelli School of Business offer?",
    "How do I apply for financial aid?",
    "What is the Yellow Ribbon Program?",
    "Gabelli School veterans benefits",
    "Fordham medieval studies program"
]

def evaluate_rag(n_questions=20, k_values=(3, 6, 10)):
    """Run each test question at several k values and collect rough quality signals."""
    results = []
    
    print("🧪 RAG EVALUATION RESULTS")
    print("=" * 80)
    
    for question in test_questions[:n_questions]:
        for k in k_values:
            result = rag(question, k=k)
            
            # Crude proxy: counts as "answered" if the reply is non-trivial and not a refusal.
            # It does not check correctness, so spot-check the answers by hand.
            score = 1 if len(result["answer"]) > 20 and "not found" not in result["answer"].lower() else 0
            relevance = result["sources"][0]["score"] if result["sources"] else 0
            
            results.append({
                "question": question,
                "k": k,
                "answer": result["answer"][:100] + "...",
                "top_source": result["sources"][0]["url"] if result["sources"] else "None",
                "top_score": relevance,
                "quality": score
            })
    
    df_results = pd.DataFrame(results)
    
    # Summary stats
    print("\n📊 PERFORMANCE SUMMARY")
    print(df_results.groupby("k")[["quality", "top_score"]].agg(["mean", "count"]))
    
    # Show the rows with the highest retrieval similarity for the top source
    print("\n🏆 BEST ANSWERS:")
    best = df_results.nlargest(3, "top_score")
    for _, row in best.iterrows():
        print(f"Q: {row['question'][:60]}...")
        print(f"  Top source score: {row['top_score']:.3f}")
        print(f"  Answer: {row['answer'][:80]}...\n")
    
    return df_results

# Run evaluation
eval_results = evaluate_rag(n_questions=6, k_values=[3, 6])

# Experiment: Different K values
print("\n🔬 EXPERIMENT: Impact of K (more context)")
k_experiment = evaluate_rag(n_questions=3, k_values=[1, 3, 6, 10])


🧪 RAG EVALUATION RESULTS

📊 PERFORMANCE SUMMARY
  quality       top_score      
     mean count      mean count
k                              
3     1.0     6  0.684613     6
6     1.0     6  0.684610     6

🏆 BEST ANSWERS:
Q: What programs does Gabelli School of Business offer?...
  Top source score: 0.756
  Answer: The Gabelli School of Business offers three variants of the MBA program: a full-...

Q: What programs does Gabelli School of Business offer?...
  Top source score: 0.756
  Answer: The Gabelli School of Business offers three types of MBA programs (full-time, pr...

Q: Where is Fordham's campus?...
  Top source score: 0.745
  Answer: Fordham University has two campuses in New York City: Rose Hill in the Bronx and...


🔬 EXPERIMENT: Impact of K (more context)
🧪 RAG EVALUATION RESULTS

📊 PERFORMANCE SUMMARY
     quality       top_score      
        mean count      mean count
k                                 
1   0.666667     3  0.704771     3
3   1.000000 

In [24]:
import pandas as pd
import numpy as np
from pathlib import Path

print("🔍 EMBEDDING COMPLETENESS CHECK")
print("=" * 50)

# 1. Check files
chunks_file = Path("fordham_chunks_final.parquet")
emb_file = Path("fordham_embeddings.npy")

print(f"Chunks file:   {chunks_file.exists()}")
print(f"Embeddings:    {emb_file.exists()}")

# Load data
chunks_df = pd.read_parquet(chunks_file)
chunk_embeddings = np.load(emb_file)

# 2. Core check: there must be exactly one embedding row per chunk
n_chunks = len(chunks_df)
n_embeds = len(chunk_embeddings)

print(f"\n✅ SHAPE MATCH:")
print(f"   Chunks:     {n_chunks:,}")
print(f"   Embeddings: {n_embeds:,}")
print(f"   ✅ MATCH:    {n_chunks == n_embeds}")

# 3. Embedding quality
print(f"\n🔧 QUALITY CHECK:")
print(f"   Dimensions:  {chunk_embeddings.shape[1]}")
print(f"   Total size:  {chunk_embeddings.shape}")
print(f"   NaN count:   {np.isnan(chunk_embeddings).sum()}")

# 4. Norm of one embedding. OpenAI embeddings come back unit-normalized,
#    so in practice this should be very close to 1.0.
norm_value = np.linalg.norm(chunk_embeddings[0])
print(f"   Sample norm: {norm_value:.3f} (should be ~0.8-2.0)")

# 5. Document coverage
print(f"\n📂 COVERAGE:")
print(f"   Unique docs:  {chunks_df['filename'].nunique():,}")
print(f"   Chunks/doc:   {n_chunks/chunks_df['filename'].nunique():.1f} avg")

# 6. Sample data
print(f"\n📋 SAMPLE:")
print(f"File: {chunks_df.iloc[0]['filename'][:40]}...")
print(f"Text: {chunks_df.iloc[0]['text'][:80]}...")
print(f"URL:  {chunks_df.iloc[0]['url']}")

# 7. Final verdict
status = "✅ 100% COMPLETE" if n_chunks == n_embeds else "❌ RE-EMBED NEEDED"
print(f"\n🎯 STATUS: {status}")
print(f"RAG Ready: {'YES' if n_chunks == n_embeds else 'NO'}")


🔍 EMBEDDING COMPLETENESS CHECK
Chunks file:   True
Embeddings:    True

‚úÖ SHAPE MATCH:
   Chunks:     36,853
   Embeddings: 36,853
   ‚úÖ MATCH:    True

üîß QUALITY CHECK:
   Dimensions:  1536
   Total size:  (36853, 1536)
   NaN count:   0
   Sample norm: 1.000 (should be ~0.8-2.0)

üìÇ COVERAGE:
   Unique docs:  9,560
   Chunks/doc:   3.9 avg

üìã SAMPLE:
File: 0001144e6d954f94682637e541ad5d7f.md...
Text: # 2021-2022 Duffy Fellows

**Afrah Bandagi (FCLC 2023)**

‚Äú[Supera las fronteras...
URL:  https://www.fordham.edu/about/living-the-mission/center-on-religion-and-culture/duffy-fellows-program/past-duffy-fellows/2021-2022-duffy-fellows

üéØ STATUS: ‚úÖ 100% COMPLETE
RAG Ready: YES
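
The check above inspects only the norm of the first embedding. The sample norm of 1.000 suggests the vectors are unit length, so a fuller check could scan every row. Below is a minimal sketch of that idea, assuming the embeddings are a 2-D NumPy array; the helper name `embedding_health` and the tolerance `tol` are hypothetical and not part of the notebook.

```python
import numpy as np

def embedding_health(embeddings: np.ndarray, tol: float = 1e-2) -> dict:
    """Count suspicious rows in a (n_chunks, n_dims) embedding matrix."""
    norms = np.linalg.norm(embeddings, axis=1)  # one L2 norm per row
    return {
        "rows": int(embeddings.shape[0]),
        # rows containing at least one NaN value
        "nan_rows": int(np.isnan(embeddings).any(axis=1).sum()),
        # all-zero rows often indicate a failed or skipped embedding call
        "zero_rows": int((norms == 0).sum()),
        # rows whose norm is further than `tol` from 1.0 (assumed unit length)
        "off_unit_rows": int((np.abs(norms - 1.0) > tol).sum()),
    }

# Toy matrix: two unit-length rows and one all-zero row
demo = np.array([[1.0, 0.0], [0.6, 0.8], [0.0, 0.0]], dtype="float32")
report = embedding_health(demo)
print(report)
```

On the real data this could be called as `embedding_health(chunk_embeddings)`; any non-zero count would be a reason to re-embed the affected chunks.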


---

# 8. (Optional) Make it an app

So far your RAG system lives inside a notebook. That's great for development, but nobody is going to use your Jupyter notebook to ask questions about Fordham. Let's turn it into a real web app.

> 📚 **TERM: Streamlit**  
> A Python library that turns plain Python scripts into interactive web apps. You write Python (no HTML, CSS, or JavaScript) and Streamlit renders it as a web page with inputs, buttons, and formatted output. It's the fastest way to go from "I have a function" to "I have a web app."

Your task: **create a Streamlit app that lets a user type a question about Fordham and get an answer from your RAG system.**

To get started:
- Install it: `uv pip install streamlit` 
- A Streamlit app is just a `.py` file (not a notebook). Create something like `scripts/fordham_rag_app.py`
- Run it: `streamlit run scripts/fordham_rag_app.py`. This opens a browser tab with your app

Tips:
- Check out the [Streamlit docs](https://docs.streamlit.io/); the "Get started" tutorial is very short
- Your best bet is to vibecode your way to this. You'll be surprised how fast you can get it up and running

In [22]:
import streamlit as st
import pandas as pd
import numpy as np
from openai import OpenAI
from sklearn.metrics.pairwise import cosine_similarity

# Call set_page_config before any other Streamlit command
# (older Streamlit versions require it to come first)
st.set_page_config(page_title="Fordham RAG", layout="wide")

# Load the chunks and their embeddings; Streamlit caches the result across reruns
@st.cache_data
def load_data():
    chunks_df = pd.read_parquet("fordham_chunks_final.parquet")
    chunk_embeddings = np.load("fordham_embeddings.npy")
    return chunks_df, chunk_embeddings

chunks_df, chunk_embeddings = load_data()

# Config (expects OPENAI_API_KEY in .streamlit/secrets.toml)
client = OpenAI(api_key=st.secrets["OPENAI_API_KEY"])
EMBED_MODEL = "text-embedding-3-small"
GEN_MODEL = "gpt-4o-mini"

# Your functions (same as notebook)
def embed_query(query: str) -> np.ndarray:
    """Embed the query text (truncated to 4,000 characters)."""
    resp = client.embeddings.create(model=EMBED_MODEL, input=query[:4000])
    return np.array(resp.data[0].embedding, dtype="float32")

def retrieve_top_k(query: str, k: int = 6) -> pd.DataFrame:
    """Return the k chunks most similar to the query, best first."""
    q_emb = embed_query(query)
    sims = cosine_similarity(q_emb.reshape(1, -1), chunk_embeddings)[0]
    top_idx = np.argsort(sims)[::-1][:k]
    results = chunks_df.iloc[top_idx].copy()
    results["score"] = sims[top_idx]
    return results.sort_values("score", ascending=False)

def rag(question: str, k: int = 6) -> dict:
    """Retrieve context, build the prompt, and ask the LLM for an answer."""
    retrieved = retrieve_top_k(question, k)
    context = "\n\n---\n\n".join([
        f"[Source: {row['url']}]\n{row['text'][:800]}..."
        for _, row in retrieved.iterrows()
    ])

    prompt = """Using ONLY this Fordham University context, answer the question.

CONTEXT:
{}

QUESTION: {}

Answer briefly using only the context above. If not in context, say "Not found in Fordham docs." """.format(context, question)

    resp = client.chat.completions.create(
        model=GEN_MODEL,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.1
    )

    return {
        "question": question,
        "answer": resp.choices[0].message.content.strip(),
        "sources": retrieved[["url", "filename", "score"]].to_dict("records")
    }

def render_sources(sources: list) -> None:
    """Show up to five retrieved sources in a collapsible panel."""
    with st.expander("📚 Sources"):
        for i, source in enumerate(sources[:5]):
            st.markdown(f"**{i+1}.** [{source['filename']}]({source['url']}) (score: {source['score']:.3f})")

# === STREAMLIT UI ===
st.title("🚀 Fordham University RAG Assistant")
st.markdown("Ask questions about Fordham programs, admissions, campus life... powered by 36K+ chunks from ~9,500 Fordham web pages")

# Sidebar controls
st.sidebar.header("⚙️ Settings")
k_value = st.sidebar.slider("Context chunks (K)", 3, 15, 6)
show_sources = st.sidebar.checkbox("Show sources", True)

# Main chat interface: replay the conversation so far on every rerun
if "messages" not in st.session_state:
    st.session_state.messages = []

for message in st.session_state.messages:
    with st.chat_message(message["role"]):
        st.markdown(message["content"])
        if message["role"] == "assistant" and show_sources and "sources" in message:
            render_sources(message["sources"])

# Chat input
if prompt := st.chat_input("Ask about Fordham..."):
    # Add user message
    st.session_state.messages.append({"role": "user", "content": prompt})
    with st.chat_message("user"):
        st.markdown(prompt)

    # Generate response
    with st.chat_message("assistant"):
        with st.spinner("Searching Fordham docs..."):
            result = rag(prompt, k=k_value)

        st.markdown(result["answer"])
        if show_sources:
            render_sources(result["sources"])

        # Store the answer together with its sources for later reruns
        full_result = {"role": "assistant", "content": result["answer"],
                      "sources": result["sources"]}
        st.session_state.messages.append(full_result)

# Footer
st.markdown("---")
st.markdown("**Built with:** 36,853 Fordham web page chunks | OpenAI GPT-4o-mini | Semantic search")




StreamlitSecretNotFoundError: No secrets found. Valid paths for a secrets.toml file or secret directories are: C:\Users\Manikandan\.streamlit\secrets.toml, d:\Spring 2026\RAG and Context Engineering\HW\.streamlit\secrets.toml
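
The error above means `st.secrets["OPENAI_API_KEY"]` was read without a `secrets.toml` file at either of the listed paths. Creating that file with an `OPENAI_API_KEY` entry is one fix. Another option is to fall back to an environment variable. The sketch below is a hedged illustration only: `resolve_api_key` is a hypothetical helper, not part of Streamlit or the OpenAI client, and it assumes the secrets object supports `secrets[name]` lookups.

```python
import os

def resolve_api_key(secrets=None, env=None, name="OPENAI_API_KEY"):
    """Return the key from a secrets mapping if possible, else from the environment."""
    env = os.environ if env is None else env
    if secrets is not None:
        try:
            return secrets[name]
        except Exception:
            # Broad on purpose (assumption): a missing secrets file and a
            # missing key raise different exception types.
            pass
    key = env.get(name)
    if not key:
        raise RuntimeError(f"{name} not found in secrets or environment")
    return key

# Plain dicts stand in for st.secrets and os.environ in this demo
from_secrets = resolve_api_key({"OPENAI_API_KEY": "sk-demo-secret"}, env={})
from_env = resolve_api_key({}, env={"OPENAI_API_KEY": "sk-demo-env"})
print(from_secrets, from_env)
```

In the app this would look something like `client = OpenAI(api_key=resolve_api_key(st.secrets))`, so the same script works with a secrets file or with an exported variable.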

---

# Summary

## What You Built

| Step | What You Did | What It Does |
|------|-------------|-------------|
| **Load** | Read 9,500+ Fordham web pages | Get raw content |
| **Chunk** | Split pages into smaller pieces | Make content searchable and promptable |
| **Embed** | Turn chunks into vectors | Enable semantic search |
| **Retrieve** | Find relevant chunks for a question | The **R** in RAG |
| **Generate** | Ask an LLM to answer using the chunks | The **G** in RAG |
| **RAG** | Wire it all together | Question in, answer out |
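
To see how the steps in the table fit together without any API calls, here is a toy sketch of the retrieve, augment, generate flow. Everything in it is a stand-in: `embed` is a bag-of-words counter rather than a real embedding model, `echo_llm` simply returns its prompt, and the three chunks are made-up sentences, not Fordham data.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy 'embedding': word counts (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question, chunks, k=2):
    """RETRIEVE: rank chunks by similarity to the question and keep the top k."""
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def build_prompt(question, context_chunks):
    """AUGMENT: stuff the retrieved chunks into the prompt."""
    context = "\n\n---\n\n".join(context_chunks)
    return f"Answer using ONLY this context.\n\nCONTEXT:\n{context}\n\nQUESTION: {question}"

def rag(question, chunks, llm, k=2):
    """GENERATE: hand the augmented prompt to any callable that acts as the LLM."""
    return llm(build_prompt(question, retrieve(question, chunks, k)))

def echo_llm(prompt):
    """Stand-in LLM that returns the prompt so we can inspect it."""
    return prompt

chunks = [
    "The library opens at 8am on weekdays.",
    "Parking permits are sold at the campus safety office.",
    "The gym closes at 10pm.",
]
question = "When does the library open?"
top = retrieve(question, chunks, k=1)
answer = rag(question, chunks, echo_llm, k=1)
print(answer)
```

Swapping `embed` for an embedding API call and `echo_llm` for a chat completion call gives the same shape as the pipeline you built in this notebook.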

## The Big Picture

RAG is one of the most common patterns in AI engineering today. What you built here is the same core architecture behind tools like ChatGPT with search, Perplexity, enterprise Q&A bots, and more. The details get more sophisticated (vector databases, reranking, query rewriting, evaluation) but the pattern is the same:

**Find relevant stuff → give it to an LLM → get an answer.**

You can just build things.