# Simple Retrieval-Augmented Generation

Lecture 10 | CMU ANLP Fall 2025 | Instructor: Sean Welleck

This notebook shows a minimal implementation of Retrieval-Augmented Generation (RAG) for answering questions about the course.


## Setup

In [None]:
import torch
import numpy as np
from transformers import AutoTokenizer, AutoModelForCausalLM
from sentence_transformers import SentenceTransformer
from typing import List, Tuple
import json

In [12]:
# Load embedding model
embed_model = SentenceTransformer('Qwen/Qwen3-Embedding-0.6B')

# Load language model
lm_model_name = "Qwen/Qwen3-0.6B"
tokenizer = AutoTokenizer.from_pretrained(lm_model_name)
lm_model = AutoModelForCausalLM.from_pretrained(lm_model_name)
lm_model.eval()

# Set padding token
tokenizer.pad_token = tokenizer.eos_token

## Document Preparation

Load pre-parsed documents from a JSON file. The file is expected to contain a `documents` list and a `source_url` string.
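
As a minimal sketch of the file layout, the snippet below writes and re-reads a toy JSON file with the same two keys that the loading cell accesses (`documents` and `source_url`). The file name, URL, and passages are placeholders invented for illustration, not the real course data.

```python
import json
import os
import tempfile

# Hypothetical toy payload with the two keys the loading cell reads.
toy = {
    "source_url": "https://example.com/course",  # placeholder URL, not the real source
    "documents": ["First passage of text.", "Second passage of text."],
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_documents.json")  # hypothetical file name
    with open(path, "w", encoding="utf-8") as f:
        json.dump(toy, f)
    with open(path, "r", encoding="utf-8") as f:
        loaded = json.load(f)

toy_documents = loaded["documents"]
print(f"Loaded {len(toy_documents)} documents from {loaded['source_url']}")
```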

In [13]:
with open('documents.json', 'r', encoding='utf-8') as f:
    data = json.load(f)

documents = data['documents']
source_url = data['source_url']

print(f"Loaded {len(documents)} documents from {source_url}")
print(f"\nExample documents:")
for i, doc in enumerate(documents[:5] + documents[-5:]):
    print(f"{i+1}. {doc}")
        

Loaded 43 documents from https://cmu-l3.github.io/anlp-fall2025/

Example documents:
1. Advanced Natural Language Processing / Fall 2025 Advanced natural language processing is an introductory graduate-level course on natural language processing aimed at students who are interested in doing cutting-edge research in the field. In it, we describe fundamental tasks in natural language processing as well as methods to solve these tasks. The course focuses on modern methods using neural networks, and covers the basic modeling, learning, and inference algorithms required therefore.
2. The class culminates in a project in which students attempt to reimplement and improve upon a research paper in a topic of their choosing.
3. Course Details Instructor Sean Welleck Teaching Assistants Joel Mire Chen Wu Dareen Alharthi Neel Bhandari Akshita Gupta Ashish Marisetty Manan Sharma Sanidhya Vijayvargiya Logistics Class times: TR 2:00pm - 3:20pm Room: TEP 1403 Course identifier: LTI 11-711 Piazza: Piaz

### Simple chunking

In [14]:
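def chunk_text(text, chunk_size, overlap):
    """Split text into windows of chunk_size words; consecutive windows share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        # A step of zero or less would make range() fail or produce no chunks
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each iteration
    chunks = []
    for i in range(0, len(words), step):
        chunk = ' '.join(words[i:i + chunk_size])
        if chunk:
            chunks.append(chunk)
    return chunks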

all_chunks = []
for doc in documents:
    chunks = chunk_text(doc, chunk_size=100, overlap=10)
    all_chunks.extend(chunks)

print(f"Created {len(all_chunks)} chunks")
print(f"\nExample chunk:\n{all_chunks[0]}")
print(f"\nExample chunk:\n{all_chunks[-1]}")

Created 48 chunks

Example chunk:
Advanced Natural Language Processing / Fall 2025 Advanced natural language processing is an introductory graduate-level course on natural language processing aimed at students who are interested in doing cutting-edge research in the field. In it, we describe fundamental tasks in natural language processing as well as methods to solve these tasks. The course focuses on modern methods using neural networks, and covers the basic modeling, learning, and inference algorithms required therefore.

Example chunk:
Quiz: There will be a quiz covering the reading material and/or lecture material that you can fill out on Canvas. The quiz will be released by the end of the day of the class (11:59pm) and will be due at the end of the following day (11:59pm).
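
To see the sliding window at a small scale, here is a sketch that restates the same word-window idea on a toy string. `chunk_words` and the `w0 ... w11` text are made up for illustration. The second call shows a side effect worth knowing about: when the remaining words are no more than `overlap`, the final chunk is fully contained in the one before it.

```python
def chunk_words(text, chunk_size, overlap):
    # Same sliding-window idea as chunk_text, restated so this snippet runs on its own.
    words = text.split()
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), step)]

# 12 words: the window of 5 advances by 3, so neighbours share 2 words.
twelve = " ".join(f"w{i}" for i in range(12))
chunks_12 = chunk_words(twelve, chunk_size=5, overlap=2)
for c in chunks_12:
    print(c)

# 11 words: the last window starts at w9 and only repeats words already seen.
eleven = " ".join(f"w{i}" for i in range(11))
chunks_11 = chunk_words(eleven, chunk_size=5, overlap=2)
print(chunks_11[-2], "|", chunks_11[-1])
```

Whether such a redundant tail chunk matters depends on the corpus; with short documents it can surface as near-duplicate retrieval results.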


## Embedding and Indexing

Create embeddings for all chunks.

In [15]:
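# Note: this applies the "query" prompt to the chunks as well. The Qwen3-Embedding model card
# encodes documents without a prompt, so dropping prompt_name here may be worth trying
# (the outputs recorded in this notebook were produced with the line as written).
chunk_embeddings = embed_model.encode(all_chunks, prompt_name="query")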
print(f"Created embeddings with shape: {chunk_embeddings.shape}")

Created embeddings with shape: (48, 1024)
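
The retrieval step in this notebook scores chunks by cosine similarity, which divides by the vector norms on every query. A common alternative is to normalize the chunk matrix once at indexing time so that a plain dot product gives the same scores. The sketch below checks that equivalence on random stand-in vectors (the shapes and the seed are arbitrary choices, not the real embeddings).

```python
import numpy as np

rng = np.random.default_rng(0)
toy_chunk_emb = rng.normal(size=(4, 8))  # stand-in for chunk_embeddings
toy_query_emb = rng.normal(size=8)       # stand-in for one query embedding

# Cosine similarity computed directly from the raw vectors.
cos = toy_chunk_emb @ toy_query_emb / (
    np.linalg.norm(toy_chunk_emb, axis=1) * np.linalg.norm(toy_query_emb)
)

# Normalize each vector to unit length once, then use a plain dot product.
unit_chunks = toy_chunk_emb / np.linalg.norm(toy_chunk_emb, axis=1, keepdims=True)
unit_query = toy_query_emb / np.linalg.norm(toy_query_emb)
dot = unit_chunks @ unit_query

print(np.allclose(cos, dot))
```

With Sentence Transformers, passing `normalize_embeddings=True` to `encode` should give unit-length vectors directly, which would make the explicit normalization unnecessary.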


## Retrieval

Find the k chunks most similar to a query, ranked by cosine similarity between embeddings.

In [16]:
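def retrieve(query: str, k: int = 3) -> List[Tuple[str, float]]:
    """Return the k chunks most similar to the query as (chunk, cosine similarity) pairs."""
    query_embedding = embed_model.encode([query], prompt_name="query")[0]
    # Cosine similarity: dot product divided by the product of the vector norms
    similarities = np.dot(chunk_embeddings, query_embedding)
    similarities = similarities / (np.linalg.norm(chunk_embeddings, axis=1) * np.linalg.norm(query_embedding))
    # argsort is ascending, so take the last k indices and reverse them to get best-first order
    top_indices = np.argsort(similarities)[-k:][::-1]
    results = [(all_chunks[i], float(similarities[i])) for i in top_indices]
    return results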

# Test retrieval
query = "Who is the instructor?"
results = retrieve(query, k=3)

print(f"Query: {query}\n")
for i, (chunk, score) in enumerate(results):
    print(f"Result {i+1} (score: {score:.3f}):\n{chunk}\n")

Query: Who is the instructor?

Result 1 (score: 0.605):
Course Details Instructor Sean Welleck Teaching Assistants Joel Mire Chen Wu Dareen Alharthi Neel Bhandari Akshita Gupta Ashish Marisetty Manan Sharma Sanidhya Vijayvargiya

Result 2 (score: 0.488):
Course Details Instructor Sean Welleck Teaching Assistants Joel Mire Chen Wu Dareen Alharthi Neel Bhandari Akshita Gupta Ashish Marisetty Manan Sharma Sanidhya Vijayvargiya Logistics Class times: TR 2:00pm - 3:20pm Room: TEP 1403 Course identifier: LTI 11-711 Piazza: Piazza Code: GitHub Office hours: Location Day Time Sean Welleck GHC 6513 Tuesday 4:00-5:00 PM Joel Mire WEH 3110 Tuesday 3:30-4:30 PM Chen Wu GHC 5417 Tuesday 4:00-5:00 PM Dareen Alharthi GHC 5417 Monday 10:00-11:00 AM Neel Bhandari GHC 5417 Friday 12:00-1:00 PM Akshita Gupta GHC 5417 Friday 4:00-5:00 PM Ashish Marisetty GHC 5417 Friday 2:00-3:00 PM Manan Sharma GHC 8115 Monday

Result 3 (score: 0.450):
If you don't have much experience with NLP, it will be helpful to con
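
The top-k selection inside `retrieve` is compact, so here is a small sketch of what the `argsort` slicing does, using made-up similarity scores rather than real ones.

```python
import numpy as np

toy_scores = np.array([0.12, 0.61, 0.45, 0.49, 0.30])  # made-up similarity scores
k = 3

ascending = np.argsort(toy_scores)  # indices ordered from lowest to highest score
top_k = ascending[-k:][::-1]        # last k indices, reversed so the best comes first

print(top_k.tolist())               # indices of the 3 highest scores, best first
print(toy_scores[top_k].tolist())   # the corresponding scores, in descending order
```

If `k` exceeds the number of chunks, the slice simply returns every index, so the function degrades to a full ranking instead of raising an error.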

## Generators

In [41]:
def generate_with_context(query: str, context_chunks: List[str], max_new_tokens: int = 100) -> str:
    context = "\n".join([f"- {chunk}" for chunk in context_chunks])
    
    # Create messages for chat template
    messages = [
        {
            "role": "system",
            "content": "You are a helpful assistant answering questions about a course. Use only the provided context to answer questions. Be concise and accurate. Only generate your answer."
        },
        {
            "role": "user",
            "content": f"""Here is some context about the course:

{context}

Based on this context, please answer the following question:
{query}"""
        }
    ]
    
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True, enable_thinking=False)
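    # Caution: tokenizers usually truncate from the right, so a prompt longer than 512 tokens
    # can lose its end, which is where the question and the generation prompt sit.
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512)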
    
    # Generate
    with torch.no_grad():
        outputs = lm_model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=0.1,
            top_p=0.95,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id
        )
    
    # Decode only the newly generated tokens so the prompt is not echoed in the answer
    prompt_length = inputs["input_ids"].shape[1]
    answer = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
    # Drop any thinking block that may still precede the answer
    answer = answer.split("</think>")[-1].strip()
    return answer

def generate_without_context(query: str, max_new_tokens: int = 100) -> str:
    
    messages = [
        {
            "role": "system",
            "content": "You are a helpful assistant answering questions about a course, Advanced NLP Fall 2025. Use only the provided context to answer questions. Be concise and accurate. Only generate your answer."
        },
        {
            "role": "user",
            "content": query
        }
    ]
    
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True, enable_thinking=False)
    inputs = tokenizer(prompt, return_tensors="pt")
    
    # Generate
    with torch.no_grad():
        outputs = lm_model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=0.1,
            top_p=0.95,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id
        )
    
    # Decode only the newly generated tokens so the prompt is not echoed in the answer
    prompt_length = inputs["input_ids"].shape[1]
    answer = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
    # Drop any thinking block that may still precede the answer
    answer = answer.split("</think>")[-1].strip()
    return answer
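
`generate_with_context` tokenizes the prompt with `truncation=True, max_length=512`. If the tokenizer truncates from the right (the usual default), an overlong prompt loses its tail, and the tail is where the question lives. One possible guard, sketched below, is to trim the context before building the prompt. `fit_context` is a hypothetical helper, and counting whitespace-separated words is only a rough proxy for tokens.

```python
from typing import List

def fit_context(chunks: List[str], max_words: int) -> List[str]:
    """Keep chunks in rank order until a word budget is used up (hypothetical helper)."""
    kept, used = [], 0
    for chunk in chunks:
        n = len(chunk.split())
        if used + n > max_words:
            break  # stop at the first chunk that would exceed the budget
        kept.append(chunk)
        used += n
    return kept

ranked = ["a b c", "d e f g", "h i"]  # toy chunks, best match first
fit_7 = fit_context(ranked, max_words=7)
fit_6 = fit_context(ranked, max_words=6)
print(fit_7)
print(fit_6)
```

Stopping at the first chunk that does not fit keeps the highest-ranked chunks intact; skipping it and trying smaller lower-ranked chunks would be another reasonable design.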

## RAG Pipeline

Retrieve the top-k chunks for a query, then generate an answer conditioned on them.

In [42]:
def rag_answer(query: str, k: int = 3) -> Tuple[str, List[Tuple[str, float]]]:
    results = retrieve(query, k=k)
    context_chunks = [chunk for chunk, _ in results]
    answer = generate_with_context(query, context_chunks)
    return answer, results
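
The data flow of the pipeline can be exercised without loading any model by passing the retriever and generator in as arguments. The sketch below does that with stubs. `rag_answer_with`, `fake_retrieve`, and `fake_generate` are hypothetical names, and the scores are made up.

```python
from typing import Callable, List, Tuple

Retrieved = List[Tuple[str, float]]

def rag_answer_with(
    query: str,
    retrieve_fn: Callable[[str, int], Retrieved],
    generate_fn: Callable[[str, List[str]], str],
    k: int = 3,
) -> Tuple[str, Retrieved]:
    # Same two steps as rag_answer: retrieve, then generate from the retrieved text.
    results = retrieve_fn(query, k)
    context_chunks = [chunk for chunk, _ in results]
    return generate_fn(query, context_chunks), results

def fake_retrieve(query: str, k: int) -> Retrieved:
    # Stub retriever: a fixed, already ranked list with made-up scores.
    corpus = [("Instructor Sean Welleck", 0.9), ("Room: TEP 1403", 0.2)]
    return corpus[:k]

def fake_generate(query: str, context_chunks: List[str]) -> str:
    # Stub generator: reports how much context it received.
    return f"{len(context_chunks)} chunk(s) seen"

answer, retrieved = rag_answer_with("Who is the instructor?", fake_retrieve, fake_generate, k=1)
print(answer)
print(retrieved)
```

Injecting the two functions also makes it easy to swap in a different retriever or generator when comparing configurations.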

## Comparison: With vs Without RAG

*Do you notice any errors in the RAG outputs? Also try making additional queries, and find ones that lead to errors.*
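
One way to make error-spotting a little more systematic is a containment check: does the generated answer include the string you expect? The sketch below is a crude proxy for correctness, not a real metric. `contains_answer` is a hypothetical helper, and the generated strings are illustrative rather than actual model outputs. The expected string comes from the course page text loaded earlier.

```python
def contains_answer(generated: str, expected: str) -> bool:
    """Case-insensitive substring check (a crude, hypothetical correctness proxy)."""
    return expected.lower() in generated.lower()

# Illustrative generated answers paired with the expected string.
checks = [
    ("The instructor is Sean Welleck.", "Sean Welleck"),
    ("The instructor is Dr. Emily Carter.", "Sean Welleck"),
    ("Sean", "Sean Welleck"),
]
flags = [contains_answer(generated, expected) for generated, expected in checks]
for (generated, _), ok in zip(checks, flags):
    print(ok, "|", generated)
```

A partial answer such as "Sean" fails this check even though it points at the right person, which shows why string matching alone can undercount correct answers.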

In [47]:
test_queries = [
    "Who's the instructor?",
    "Who teaches the course?",
    "What is the late policy?",
    "How much are quizzes worth?",
    "When is assignment 3.1 released?",
    "What time does the class meet?"
]

for query in test_queries:
    print("=" * 60)
    print(f"Query: {query}\n")
    
    # Without RAG
    print("WITHOUT RAG:")
    answer_no_rag = generate_without_context(query)
    print(f"{answer_no_rag}\n")
    
    # With RAG
    print("WITH RAG:")
    answer_rag, retrieved = rag_answer(query, k=2)
    
    print("Retrieved chunks:")
    for i, (chunk, score) in enumerate(retrieved):
        print(f"  {i+1}. (score: {score:.3f}) {chunk}...")
    
    print(f"\nGenerated answer:")
    print(f"{answer_rag}\n")

Query: Who's the instructor?

WITHOUT RAG:
The instructor is Dr. Emily Carter.

WITH RAG:
Retrieved chunks:
  1. (score: 0.603) Course Details Instructor Sean Welleck Teaching Assistants Joel Mire Chen Wu Dareen Alharthi Neel Bhandari Akshita Gupta Ashish Marisetty Manan Sharma Sanidhya Vijayvargiya...
  2. (score: 0.487) Course Details Instructor Sean Welleck Teaching Assistants Joel Mire Chen Wu Dareen Alharthi Neel Bhandari Akshita Gupta Ashish Marisetty Manan Sharma Sanidhya Vijayvargiya Logistics Class times: TR 2:00pm - 3:20pm Room: TEP 1403 Course identifier: LTI 11-711 Piazza: Piazza Code: GitHub Office hours: Location Day Time Sean Welleck GHC 6513 Tuesday 4:00-5:00 PM Joel Mire WEH 3110 Tuesday 3:30-4:30 PM Chen Wu GHC 5417 Tuesday 4:00-5:00 PM Dareen Alharthi GHC 5417 Monday 10:00-11:00 AM Neel Bhandari GHC 5417 Friday 12:00-1:00 PM Akshita Gupta GHC 5417 Friday 4:00-5:00 PM Ashish Marisetty GHC 5417 Friday 2:00-3:00 PM Manan Sharma GHC 8115 Monday...

Generated answer:
Sean