# Part 2: Retrieval Augmented Generation (RAG)

In this part of the assignment, you will implement a Retrieval Augmented Generation (RAG) system that combines information retrieval with text generation. RAG systems address a key limitation of language models: their inability to access information beyond their training data. By retrieving relevant documents from a knowledge base and incorporating them into the generation process, RAG systems can provide more accurate, up-to-date, and factually grounded responses.

## Learning Objectives
You will:
1. Build a question-answering system using the [SQUAD Dataset](https://rajpurkar.github.io/SQuAD-explorer/) consisting of natural language questions, answers, and contexts.
2. Prompt a Phi-2 causal Transformer Language Model to generate answers
3. Use an encoder Transformer language model to compute dense embeddings of a dataset of contexts drawn from Wikipedia articles
4. Implement RAG by embedding a query, searching for and retrieving relevant contexts, and providing the retrieved context to the Phi-2 model to improve generated answers
5. Evaluate the performance of the question-answering system with and without RAG for accuracy and efficiency

Note: This assignment is intended to use GPU resources such as `CUDA`, available through the CS department cluster, Google Colab, or a local machine with GPU support. The **code below assumes CUDA**; you will need to modify it if working with the [`mps` backend](https://docs.pytorch.org/docs/stable/notes/mps.html).
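
Because the cells below hard-code `"cuda"`, one way to adapt them to another backend is to choose the device string once and reuse it. The following is a minimal sketch and not part of the assignment code: `pick_device` is a hypothetical helper, and in the notebook its two flags would presumably come from `torch.cuda.is_available()` and `torch.backends.mps.is_available()`.

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Return a device string, preferring CUDA, then MPS, then CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


# In the notebook the flags would be queried from PyTorch, for example:
#   device = pick_device(torch.cuda.is_available(),
#                        torch.backends.mps.is_available())
# Here they are passed by hand so the sketch runs without a GPU.
print(pick_device(cuda_available=False, mps_available=True))  # prints "mps"
```

The resulting string would then stand in for the literal `"cuda"` wherever the cells below move a model or tensors to a device.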

To start, run the following code to download the Phi-2 model and tokenizer.

In [1]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2",
    dtype=torch.float16,
    device_map="cuda",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2", trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
print("Model Download Successful\n\n")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Model Download Successful




Now run the following to demonstrate basic generation with the model. Note that this code snippet uses sampling with temperature for generation -- you can run the code cell multiple times and get different short stories.

In [2]:
print("Example model generation:")
print("========================")
inputs = tokenizer("Once upon a time", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=100,
                         pad_token_id=tokenizer.eos_token_id,
                         do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Example model generation:
Once upon a time, in the small town of Oakville, there lived a young man named John. He was known for his extraordinary intelligence and his passion for environmental science. John had always been fascinated by the intricate web of nutrient cycling and how it affected the delicate balance of ecosystems.

John's day began like any other. He woke up early in the morning, eager to delve into his research on atmospheric oxygen production. After a quick breakfast, he made his way to the laboratory where he spent countless hours studying


Run the following to download the [SQUAD Dataset](https://rajpurkar.github.io/SQuAD-explorer/). We take the `validation` split consisting of just over 10,000 examples, each with a question, an answer, and a *context*: a short passage from a Wikipedia article in which the answer can be found. We use the smaller validation set for efficiency; note that we will not be doing any model training in this part of the assignment, only retrieval augmented generation with the pretrained Phi-2 model.

After downloading the dataset, the code prints a single example.

In [3]:
from datasets import load_dataset

# Load SQuAD dataset
squad = load_dataset("squad", split="validation")
print("\nDownload Successful\n\n")

# Get one example
example = squad[0]
question = example["question"]
context = example["context"]
answer = example["answers"]["text"][0]

print("Question:", question)
print("\nGround truth answer:", answer)
print("\nContext containing the answer:", context)
print("\n" + "="*50)


Download Successful


Question: Which NFL team represented the AFC at Super Bowl 50?

Ground truth answer: Denver Broncos

Context containing the answer: Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi's Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the "golden anniversary" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as "Super Bowl L"), so that the logo could prominently feature the Arabic numerals 50.



## Task 1

First we need to discuss prompt engineering. Phi-2 has a tendency (by default, from its pretraining) to continue generating long sequences well beyond the immediate question or task posed in the prompt. Indeed, often it will continue asking and answering additional lists of questions.

However, Phi-2 has also been instruction tuned. Following the instruction-tuning prompt format helps the model to better follow the specific intent of a user query.

The code below loops through three different example questions. For each, generate two different responses from the Phi-2 model:
- One using the question itself as the only prompt/input to the model
- Another using the instruction format from the [Phi-2 Documentation](https://huggingface.co/microsoft/phi-2). Specifically, the prompt/input to the model should be `f"Instruct: {question}\nOutput:"`.

Then briefly explain the qualitative differences you observe. Describe how instruction-tuning using SFT can give rise to the differences you observe. Answer in 1-2 paragraphs.
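
To make the two prompt styles concrete, here is a small sketch that builds both strings for one question without calling the model. `build_prompts` is a hypothetical helper introduced only for illustration; the instruction template is the one quoted above.

```python
def build_prompts(question: str) -> dict:
    """Return the basic prompt and the instruction-format prompt for a question."""
    return {
        "basic": question,                              # the question alone
        "instruct": f"Instruct: {question}\nOutput:",   # Phi-2 instruction template
    }


prompts = build_prompts("Which NFL team represented the AFC at Super Bowl 50?")
for name, prompt in prompts.items():
    print(f"--- {name} ---")
    print(prompt)
```

The instruction prompt ends with `Output:`, so the model's continuation begins exactly where an answer is expected.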

In [4]:
for i in [0, 1000, 2000]:
  example = squad[i]
  question = example["question"]

  print("\n" + "="*50)
  print("Question", i, ":", question)
  print("\n" + "="*50)
  print("Basic Prompting:\n")
  # TODO: Generate and print answer to question
  # using basic prompting of just the question
  input_1 = tokenizer(question, return_tensors = "pt").to("cuda")
  output_1 = model.generate(**input_1, max_new_tokens = 128)
  print(tokenizer.decode(output_1[0], skip_special_tokens = True).strip())

  print("\n" + "="*50)
  print("Instruction Prompting:\n")
  # TODO: Generate and print answer to question
  # using instruction prompting of just the question
  instruction_prompt = f"Instruct: {question}\nOutput:"
  input_2 = tokenizer(instruction_prompt, return_tensors = "pt").to("cuda")
  output_2 = model.generate(**input_2, max_new_tokens = 128)
  print(tokenizer.decode(output_2[0], skip_special_tokens = True).strip())

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.



Question 0 : Which NFL team represented the AFC at Super Bowl 50?

Basic Prompting:



Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Which NFL team represented the AFC at Super Bowl 50?
Answer: The Denver Broncos.

Exercise: Who was the head coach of the Denver Broncos during Super Bowl 50?
Answer: John Fox.

Exercise: How many times had the Denver Broncos appeared in a Super Bowl before Super Bowl 50?
Answer: Four times.

Exercise: Who was the quarterback for the Denver Broncos during Super Bowl 50?
Answer: Peyton Manning.

Exercise: Who was the head coach of the Carolina Panthers during Super Bowl 50?
Answer: Ron Rivera.

Exercise: How many times had the Carolina Panthers appeared in a Super Bowl before Super

Instruction Prompting:



Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Instruct: Which NFL team represented the AFC at Super Bowl 50?
Output: The Denver Broncos represented the AFC at Super Bowl 50.

Question 1000 : Where is a palm house with subtropic plants from all over the world on display?

Basic Prompting:



Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Where is a palm house with subtropic plants from all over the world on display?

Answer: The Palm House at Kew Gardens in London.

Exercise 3:
How did the Palm House at Kew Gardens get its name?

Answer: It was named after the palm trees that were originally planted in the greenhouse.

Exercise 4:
What is the purpose of the Palm House at Kew Gardens?

Answer: It is a place for people to see and learn about different types of palm trees from all over the world.

Exercise 5:
Why is the Palm House at Kew Gardens important?

Answer: It is important because it is a

Instruction Prompting:



Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Instruct: Where is a palm house with subtropic plants from all over the world on display?
Output: The Palm House.

Question 2000 : What is dramatic gesturing an example of?

Basic Prompting:



Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


What is dramatic gesturing an example of?
Answer: Dramatic gesturing is an example of nonverbal communication.

Exercise 3:
What is the purpose of using a microphone in a presentation?
Answer: The purpose of using a microphone in a presentation is to amplify the speaker's voice so that everyone in the audience can hear them clearly.

Exercise 4:
What is the purpose of using a projector in a presentation?
Answer: The purpose of using a projector in a presentation is to display visual aids such as slides or images on a screen for the audience to see.

Exercise 5:
What is the purpose of using a

Instruction Prompting:

Instruct: What is dramatic gesturing an example of?
Output: What is an example of dramatic gesturing?


**TODO**: Briefly explain the qualitative differences you observe. Describe how instruction-tuning using SFT can give rise to the differences you observe. Answer in 1-2 paragraphs.

**Your Answer:** Looking at the results, the first thing I notice is that basic prompting tends to produce longer and less focused responses. For a given question, the model does manage to give the correct answer, but it then often asks and answers related questions that were never posed. This shows that, without explicit instructions, the model tends to keep generating text, since it was pretrained on long, open-ended text sequences.

With instruction prompting, on the other hand, the model produced much more concise and direct answers. Instead of continuing with extra questions as it did under basic prompting, it focused on giving a single, well-structured response. This is most likely because Phi-2 was trained using SFT on examples that follow the instruction (Instruct/Output) format. As a result, the model interprets user instructions more clearly, and its responses stay focused on the question or instruction it was actually given.

## Task 2

In this task we will measure an empirical baseline of performance to motivate the implementation of a RAG.

Several key utility functions will be provided for you. The first is `get_answer`, defined and documented below. You do not need to edit this code, but you should read and familiarize yourself with the function as you will be using it next.

In [5]:
def get_answer(model, tokenizer, question, context=None, max_len=50):
  """
    Generate an answer to a question using the language model.
    This function constructs a prompt in the instruction-tuned format that
    Phi-2 was trained on, generates a response, and returns only the newly
    generated tokens (excluding the input prompt). For coherence, uses greedy
    decoding, i.e., no sampling or temperature.

    Args:
        model: The Phi-2 language model
        tokenizer: The tokenizer for Phi-2
        question (str): The question to answer
        context (str, optional): Context passage to help answer the question.
                                If None, model answers without context.
        max_len (int, optional): Maximum number of tokens to generate.

    Returns:
        str: The model's generated answer, stripped of whitespace
    """
  if context is None:
    prompt = f"Instruct: {question}\nOutput:"
  else:
    prompt = f"Context: {context}\nInstruct: {question}\nOutput:"

  inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
  outputs = model.generate(**inputs, max_new_tokens=max_len, pad_token_id=tokenizer.eos_token_id)
  outputs = outputs[:, inputs['input_ids'].shape[-1]:] # Take only newly generated tokens
  response = tokenizer.decode(outputs[0], skip_special_tokens=True).strip()
  return response
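
The slicing step in `get_answer` is needed because, for a decoder-only model such as Phi-2, `generate` returns the prompt tokens followed by the continuation. The sketch below illustrates the same idea on plain Python lists; the token ids are made up for illustration and `strip_prompt` is a hypothetical helper.

```python
def strip_prompt(full_ids, prompt_len):
    """Keep only the tokens that come after the first prompt_len positions."""
    return full_ids[prompt_len:]


# Hypothetical token ids; real ids would come from the tokenizer and generate().
prompt_ids = [101, 102, 103, 104]          # tokens of the prompt
generated = prompt_ids + [201, 202, 203]   # prompt followed by the continuation

print(strip_prompt(generated, len(prompt_ids)))  # prints [201, 202, 203]
```

Decoding only the sliced tokens is what keeps the prompt text out of the returned answer.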

First we demonstrate qualitative examples where providing the relevant context helps the model to answer correctly/truthfully. Run the following code to see three (hand-picked, not random) examples of model answers **with** versus **without** the relevant passage from a Wikipedia article provided in context to the model.

In [6]:
for i in [100, 500, 900]:
  example = squad[i]
  question = example["question"]
  context = example["context"]
  answer = example["answers"]["text"][0]

  print("Question:", question)
  print("\nCorrect Answer:", answer)

  print("\nModel WITHOUT context:")
  print(get_answer(model, tokenizer, question))

  # Test 2: With context

  print("\nModel WITH context:")
  print(get_answer(model, tokenizer, question, context))

  print("\n" + "="*50)

Question: Who were special guests for the Super Bowl halftime show?

Correct Answer: Beyoncé and Bruno Mars

Model WITHOUT context:
Beyoncé and Jay-Z.

Model WITH context:
The special guests for the Super Bowl halftime show were Beyoncé and Bruno Mars.

Question: What was media day called for Super Bowl 50?

Correct Answer: Super Bowl Opening Night.

Model WITHOUT context:
What was the name of the media day for Super Bowl 50?

Model WITH context:
Super Bowl 50 media day was called Super Bowl Opening Night.

Question: When was Warsaw ranked as the 32nd most liveable city in the world?

Correct Answer: 2012

Model WITHOUT context:
In 2015.

Model WITH context:
Warsaw was ranked as the 32nd most liveable city in the world in 2012.



In the generations above, you should observe that:

- Even without context, the model correctly identifies that Beyoncé performed, but it goes on to **hallucinate** that Jay-Z performed with Beyoncé, which is incorrect.

- In the second example, without context, the model reverts to a pretraining behavior of just rephrasing the question, whereas with context it correctly restates the question **and** answers it.

- In the third example, without context, again the model hallucinates and gives a reasonable sounding but incorrect answer.

Now let us evaluate model performance **quantitatively** with and without context provided. To do so, the `evaluate_answer` function is defined and documented for you below. You do not need to modify it but you should review the function which is used to evaluate the model later.

In [7]:
def evaluate_answer(predicted, ground_truth):
    """
    Evaluate if a predicted answer matches the ground truth answer.

    This function uses flexible string matching to handle various answer formats:
    - Exact matches
    - Ground truth contained in prediction (e.g., "Paris" in "The capital is Paris")
    - Prediction contained in ground truth (e.g., "Broncos" matches "Denver Broncos")
    - Order-independent word matching (e.g., "Bruno Mars and Beyoncé" matches
      "Beyoncé and Bruno Mars")

    Args:
        predicted (str): The model's predicted answer
        ground_truth (str): The correct answer from the dataset

    Returns:
        int: 1 if the answer is considered correct, 0 otherwise

    Note:
        This is a simplistic evaluation metric that tends to underestimate
        model performance. It is intended for demonstration purposes only.
    """

    pred = predicted.lower().strip()
    gt = ground_truth.lower().strip()

    # Exact match
    if pred == gt:
        return 1

    # Ground truth contained in prediction
    if gt in pred:
        return 1

    # Prediction contained in ground truth
    if pred in gt:
        return 1

    # Check if all words from ground truth appear in prediction (order-independent)
    gt_words = set(gt.split())
    pred_words = set(pred.split())
    if gt_words.issubset(pred_words):
        return 1

    return 0
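
To see how these matching rules behave on specific inputs, the sketch below applies a condensed restatement of them to a few hand-written cases. `flexible_match` is a hypothetical name used so that the example is self-contained; it is intended to apply the same four rules as `evaluate_answer`.

```python
def flexible_match(predicted: str, ground_truth: str) -> int:
    """Condensed restatement of the matching rules in evaluate_answer."""
    pred = predicted.lower().strip()
    gt = ground_truth.lower().strip()
    # Exact match, or either string contained in the other
    if pred == gt or gt in pred or pred in gt:
        return 1
    # Order-independent check: every ground-truth word appears in the prediction
    return int(set(gt.split()).issubset(set(pred.split())))


cases = [
    ("The Denver Broncos represented the AFC.", "Denver Broncos"),  # ground truth inside prediction
    ("Bruno Mars and Beyoncé", "Beyoncé and Bruno Mars"),           # same words, different order
    ("Beyoncé and Jay-Z.", "Beyoncé and Bruno Mars"),               # wrong second performer
    ("", "Denver Broncos"),                                         # empty prediction
]
for pred, gt in cases:
    print(repr(pred), "vs", repr(gt), "->", flexible_match(pred, gt))
```

The last case is worth noting: an empty prediction is scored as correct, because the empty string is a substring of every string. If that matters for an evaluation, one option would be to return 0 for empty predictions before applying the other rules.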

Now that we can evaluate one prediction against the correct answer, the code below uses this as a helper function to evaluate a model on a dataset of questions and answers.

Read the function and its documentation, then run two evaluations:

- First, run `evaluate_model` using `mode="no_context"` (the default) to assess the model's performance without any context provided beyond the question itself.
- Second, run `evaluate_model` using `mode="gold_context"` to assess the model's performance when it is provided with the gold standard context (gold standard in the sense that it contains the correct answer).

In both cases, **set `max_examples=100`** to keep things efficient (you could run a longer evaluation if interested, but this is the minimum required for the assignment -- running an evaluation over larger portions of the dataset would take tens of minutes or more).

You should see that the model performs quite poorly with no context but substantially better with the gold standard context.

In [8]:
import numpy as np
from tqdm import tqdm
import time

def evaluate_model(model, tokenizer, dataset, mode="no_context",
                     context_embeddings=None, contexts=None,
                     embedding_model=None, embedding_tokenizer=None,
                     max_examples=None):
    """
    Unified evaluation function for comparing different QA approaches.

    This function evaluates question-answering performance across three modes:
    - "no_context": Model answers questions without any context
    - "gold_context": Model uses the ground-truth context from the dataset
    - "rag": Model uses retrieved context from the RAG system

    For each example, the function:
    1. Determines the appropriate context based on mode. mode="rag" requires
    a retrieve_top_context implementation along with context_embeddings, contexts,
    embedding_model, and embedding_tokenizer.
    2. Generates an answer using get_answer()
    3. Evaluates correctness using evaluate_answer()
    4. Tracks timing for performance analysis

    Args:
        model: The Phi-2 language model
        tokenizer: The tokenizer for Phi-2
        dataset: SQUAD dataset containing questions, contexts, and answers
        mode (str): Evaluation mode - "no_context", "gold_context", or "rag"
        context_embeddings (numpy.ndarray, optional): Precomputed embeddings of all
            contexts. Required for mode="rag".
        contexts (list, optional): List of all context strings. Required for mode="rag".
        embedding_model (optional): Sentence embedding model. Required for mode="rag".
        embedding_tokenizer (optional): Tokenizer for embedding model. Required for mode="rag".
        max_examples (int, optional): Limit evaluation to first N examples.
            If None, evaluates entire dataset.

    Returns:
        tuple: (accuracy, avg_time, results)
            - accuracy (float): Proportion of correct answers (0.0 to 1.0)
            - avg_time (float): Average time per question in seconds
            - results (list): List of dicts with detailed results for each example

    Raises:
        ValueError: If mode is not one of the three valid options
        ValueError: If mode="rag" but required RAG parameters are None

    Note:
        - Use max_examples for faster iteration during development
        - The "gold_context" mode represents the upper bound of what's possible
          with perfect retrieval (always retrieving context containing the answer)
        - Timing includes both retrieval (for RAG) and generation
    """
    correct, total, total_time = 0, 0, 0
    results = []

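    # Validate the mode and the RAG inputs up front, as documented under "Raises"
    if mode not in ("no_context", "gold_context", "rag"):
        raise ValueError(f"Invalid mode: {mode}")
    if mode == "rag" and any(arg is None for arg in (context_embeddings, contexts,
                                                     embedding_model, embedding_tokenizer)):
        raise ValueError('mode="rag" requires context_embeddings, contexts, '
                         'embedding_model, and embedding_tokenizer')

    # Limit dataset size if specified
    data = dataset if max_examples is None else dataset.select(range(max_examples))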

    start_time = time.time()

    for example in tqdm(data, desc=f"Evaluating {mode}"):
        question = example["question"]
        ground_truth = example["answers"]["text"][0]
        gold_context = example["context"]

        # Determine context based on mode
        if mode == "no_context":
            context = None
        elif mode == "gold_context":
            context = gold_context
        elif mode == "rag":
            context = retrieve_top_context(question, context_embeddings, contexts,
                                           embedding_model, embedding_tokenizer)
        else:
            raise ValueError(f"Invalid mode: {mode}")

        # Get prediction
        prediction = get_answer(model, tokenizer, question, context)

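        # Note: start_time is set once before the loop, so this is the cumulative
        # time since the evaluation began, not the duration of this example alone
        elapsed_time = time.time() - start_time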

        # Evaluate
        score = evaluate_answer(prediction, ground_truth)
        correct += score
        total += 1

        results.append({
            "question": question,
            "ground_truth": ground_truth,
            "prediction": prediction,
            "correct": score,
            "time": elapsed_time
        })

    total_time += time.time() - start_time
    accuracy = correct / total
    avg_time = total_time / total

    return accuracy, avg_time, results
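
In `evaluate_model`, `start_time` is set once before the loop, so the `"time"` stored with each result is the time elapsed since the evaluation began rather than the duration of that one example. If per-example durations are of interest, one possible approach is to restart the clock inside the loop. The sketch below shows the pattern on a stand-in workload; `time_each` and `slow_square` are hypothetical names and are not part of the assignment code.

```python
import time


def time_each(items, fn):
    """Apply fn to each item and return (results, per_item_seconds)."""
    results, durations = [], []
    for item in items:
        start = time.perf_counter()      # restart the clock for every item
        results.append(fn(item))
        durations.append(time.perf_counter() - start)
    return results, durations


def slow_square(x):
    time.sleep(0.01)                     # stand-in for retrieval plus generation
    return x * x


results, durations = time_each([1, 2, 3], slow_square)
print(results)
print("average seconds per item:", round(sum(durations) / len(durations), 3))
```

Averaging the per-item durations gives the same kind of figure as `avg_time`, while also exposing which individual examples were slow.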

In [9]:
# TODO: Run evaluate_model twice:
# 1. First with mode="no_context" and max_examples=100
# 2. Then with mode="gold_context" and max_examples=100
# Print the accuracy and average time for each

torch.manual_seed(2025)
model.eval()
with torch.no_grad():
  no_accuracy, no_avg_time, no_results = evaluate_model(model, tokenizer, squad, mode = "no_context", max_examples = 100)
  print("No Context Accuracy:", no_accuracy)
  print("No Context Average Time:", no_avg_time)
  print()
  gold_accuracy, gold_avg_time, gold_results = evaluate_model(model, tokenizer, squad, mode = "gold_context", max_examples = 100)
  print()
  print("Gold Context Accuracy:", gold_accuracy)
  print("Gold Context Average Time:", gold_avg_time)

Evaluating no_context: 100%|██████████| 100/100 [00:50<00:00,  1.98it/s]


No Context Accuracy: 0.21
No Context Average Time: 0.5055790519714356



Evaluating gold_context: 100%|██████████| 100/100 [01:01<00:00,  1.62it/s]


Gold Context Accuracy: 0.87
Gold Context Average Time: 0.6156034994125367





## Task 3

Of course, it is not surprising that context helps -- we are providing the answer to the question directly as input to the model, so the attention mechanisms can learn in-context. We would not be able to do this if the SQUAD dataset did not provide the gold standard context (a short text segment from a wikipedia article) containing the answer to each question.

But what if we don't have the gold standard context? In real-world applications, we might have a large collection of documents (like Wikipedia articles, company documentation, or research papers) and we need to automatically find the most relevant context for each question. This is where Retrieval Augmented Generation (RAG) comes in.

The key idea of RAG is to use an encoder Transformer model (BERT-style) to compute **dense embeddings** to represent both questions and contexts in a shared vector space, where semantically similar texts are close together. We can then use similarity search to find the most relevant context for any given question.

In this task, you will first implement the `compute_embeddings` function that takes a list of texts and returns their dense vector representations. We will use a pre-trained sentence transformer model (all-MiniLM-L6-v2) which has been specifically trained to produce embeddings that capture semantic similarity.

**Implementation hints for `compute_embeddings`:**
- The function should process each text individually (you could extend this to use batching for efficiency, but it's not required)
- Use `tokenizer()` to convert text to input tensors, with `padding=True`, `truncation=True`, and `max_length=512`
- Pass the tokenized inputs through the model to get outputs
- The model returns a complex object - you want `outputs.last_hidden_state`, which has shape (batch_size, sequence_length, hidden_dim)
- Use **mean pooling** across the sequence dimension: `.mean(dim=1)` to get a single vector per text by simply averaging the contextual embeddings of each token
- You can wrap model inference in `torch.no_grad()` to save memory
- Convert the final embeddings to numpy arrays on CPU: `.cpu().numpy()`
- Stack all embeddings into a single numpy array using `np.vstack()`

In [10]:
from transformers import AutoTokenizer, AutoModel
import torch

# Load a small encoder model for embeddings
embedding_model_name = "sentence-transformers/all-MiniLM-L6-v2"
embedding_tokenizer = AutoTokenizer.from_pretrained(embedding_model_name)
embedding_model = AutoModel.from_pretrained(embedding_model_name).to("cuda")

def compute_embeddings(texts, model, tokenizer, batch_size = 32):
    """
    Compute dense embeddings for a list of texts.

    TODO: Implement this function. Hints:
    - Use model(**inputs) to get outputs
    - Use mean pooling: outputs.last_hidden_state.mean(dim=1)
    - Remember to use torch.no_grad() and move to CPU

    Returns:
        numpy array of shape (len(texts), embedding_dim)
    """
    # YOUR CODE HERE
    model.eval()
    embeddings = []
    with torch.no_grad():
        # Encode the texts in batches of batch_size
        for i in range(0, len(texts), batch_size):
            b_texts = texts[i: i + batch_size]
            inputs = tokenizer(b_texts, return_tensors="pt", padding=True,
                               truncation=True, max_length=512).to("cuda")
            outputs = model(**inputs)
            # Mean pooling over the sequence dimension. Caution: with padding=True,
            # padded positions of shorter texts in a batch are included in this mean.
            embeddings.append(outputs.last_hidden_state.mean(dim=1).cpu().numpy())
    return np.vstack(embeddings)
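
The `.mean(dim=1)` hint is exact when each text is encoded on its own. When texts are batched with `padding=True`, shorter texts gain padding positions in `last_hidden_state`, and a plain mean averages those positions in as well. A common alternative is to weight the mean by the attention mask. The sketch below shows the arithmetic in numpy on a toy batch; `masked_mean_pool` is a hypothetical helper, not the assignment's required method, and in the notebook the mask would presumably be `inputs["attention_mask"]`.

```python
import numpy as np


def masked_mean_pool(hidden, mask):
    """Average token vectors over the sequence axis, ignoring padded positions.

    hidden: (batch, seq_len, dim) array of token embeddings
    mask:   (batch, seq_len) array with 1 for real tokens and 0 for padding
    """
    mask = mask[:, :, None].astype(hidden.dtype)    # (batch, seq_len, 1)
    summed = (hidden * mask).sum(axis=1)            # sum over real tokens only
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # real-token counts, guarded against 0
    return summed / counts


# Toy batch: two sequences of length 3 with embedding dim 2.
# The last position of the second sequence is padding.
hidden = np.array([[[1.0, 1.0], [3.0, 3.0], [5.0, 5.0]],
                   [[2.0, 4.0], [4.0, 8.0], [9.0, 9.0]]])
mask = np.array([[1, 1, 1],
                 [1, 1, 0]])

print(hidden.mean(axis=1))             # plain mean: the padded slot leaks into row 2
print(masked_mean_pool(hidden, mask))  # masked mean: row 2 averages only real tokens
```

For the first row the two results agree; for the second row the plain mean is pulled toward the padded vector while the masked mean is not.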

Now that you can compute embeddings, the next step is to implement retrieval. Given a question, you need to:
1. Compute its embedding
2. Compare it to all context embeddings
3. Return the most similar context

We'll use **cosine similarity** to measure how similar two embeddings are. Cosine similarity ranges from -1 (opposite) to 1 (identical direction) and is computed as: `dot(A, B) / (||A|| * ||B||)`.

**Implementation hints for `retrieve_top_context`:**
- Use your `compute_embeddings` function to get the query embedding
- Use `np.dot(context_embeddings, query_embedding)` to compute all dot products at once with the optimized `np` (numpy) implementation.
- Use `np.linalg.norm(context_embeddings, axis=1)` to compute norms of all context embeddings
- Use `np.linalg.norm(query_embedding)` for the query norm
- Divide the dot products by the product of norms to get cosine similarities
- Use `np.argmax(similarities)` to find the index of the highest similarity
- Return the context at that index

**Important:** Use numpy's vectorized operations rather than Python for loops for efficiency. This allows you to compare the query against all 10,000+ contexts in milliseconds rather than seconds (which is what will happen if you use Python for loops).
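
The hints above can be sketched on toy vectors as follows. This is an illustration under stated assumptions rather than the required implementation: `cosine_similarities` and `top_k_indices` are hypothetical helper names, and the top-k variant goes beyond what the task asks for (it only needs the single best match).

```python
import numpy as np


def cosine_similarities(matrix, query):
    """Cosine similarity between each row of matrix (n, d) and query (d,)."""
    dots = matrix @ query                                            # all dot products at once
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query)   # product of norms per row
    return dots / norms


def top_k_indices(similarities, k):
    """Indices of the k largest similarities, best first."""
    return np.argsort(similarities)[::-1][:k]


# Four toy "context" embeddings and one "query" embedding in 2 dimensions
contexts_toy = np.array([[1.0, 0.0],
                         [1.0, 3.0],
                         [3.0, 3.0],
                         [-1.0, 0.0]])
query = np.array([1.0, 1.0])

sims = cosine_similarities(contexts_toy, query)
print(np.round(sims, 3))
print("best index:", int(np.argmax(sims)))    # row 2 points in the same direction as the query
print("top 2:", top_k_indices(sims, 2))
```

Because cosine similarity depends only on direction, the third row `[3, 3]` scores 1.0 against the query `[1, 1]` even though the two vectors differ in length.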

In [11]:
def retrieve_top_context(query, context_embeddings, contexts, model, tokenizer):
    """
    Retrieve the context most similar to the query.

    TODO: Implement this function. Steps:
    1. Compute query embedding using compute_embeddings
    2. Calculate cosine similarities with all context embeddings
       Formula: dot(A,B) / (norm(A) * norm(B))
    3. Return the context whose embedding has the highest similarity score
    to the query embedding

    Args:
        query: str, the query text
        context_embeddings: numpy array of context embeddings
        contexts: list of context strings
        model: embedding model
        tokenizer: embedding tokenizer

    Returns:
        top_context: most similar context string
    """
    # YOUR CODE HERE
    query_embedding = compute_embeddings([query], model, tokenizer, batch_size = 1)[0]
    dot_products = np.dot(context_embeddings, query_embedding)
    norm_context_embedding = np.linalg.norm(context_embeddings, axis = 1)
    norm_query_embedding = np.linalg.norm(query_embedding)
    similarities = dot_products / (norm_context_embedding * norm_query_embedding)
    return contexts[np.argmax(similarities)]

Now test your implementation. The code below (which you can just run, and do not need to edit) will use the `compute_embeddings` and `retrieve_top_context` functions you defined above to:
1. Extract all contexts from the SQUAD dataset
2. Compute embeddings for all ~10,500 contexts (this may take a minute or two)
3. Evaluate your retrieval system on 1000 examples (this may take a minute or two, assuming an efficient implementation)

**What to expect:**
- **Accuracy:** Your retrieval should achieve at least **50%** accuracy at retrieving the gold standard context. This means that for at least half the questions, you successfully retrieve the exact Wikipedia passage that contains the answer. This is actually quite good - remember that there are thousands of possible passages, and many questions could plausibly be answered by multiple passages.
  
- **Runtime:** Your retrieval should average less than **100ms per query** on a GPU with CUDA. This time comes from computing the query embedding (forward propagation in the small encoder model) as well as the time for the similarity search itself.

If your accuracy is much lower than 50% or your runtime is much slower, double-check your implementation. Common issues include:
- Shape mismatches or incorrect averaging for computing the embeddings
- Using Python loops instead of numpy vectorized operations
- Not using `.to("cuda")` for the embedding model

In [12]:
# Run this code to extract all contexts from the dataset
# and compute their embeddings

contexts = [example["context"] for example in squad]

print(f"Computing embeddings for {len(contexts)} contexts...")
context_embeddings = compute_embeddings(contexts, embedding_model, embedding_tokenizer)

print(f"Embeddings shape: {context_embeddings.shape}")


Computing embeddings for 10570 contexts...
Embeddings shape: (10570, 384)


In [13]:
# Run this code to evaluate how effective your embeddings
# and retrieval are in terms of accuracy and runtime

correct = 0
start_time = time.time()

max_examples = 1000

for i in range(max_examples):
    example = squad[i]
    query = example['question']
    retrieved_context = retrieve_top_context(query, context_embeddings, contexts,
                                             embedding_model, embedding_tokenizer)
    if retrieved_context == example['context']:
        correct += 1

total_time = time.time() - start_time
print(f"Accuracy (proportion of retrieving gold standard context): {(correct/max_examples):.2%}")
print(f"Runtime efficiency (avg retrieval time per query): {(total_time/max_examples)*1000:.2f} ms")

Accuracy (proportion of retrieving gold standard context): 50.60%
Runtime efficiency (avg retrieval time per query): 15.66 ms


## Task 4

You've now implemented all the components of the RAG system. Next, we will evaluate the overall system for question answering. The code below will:
1. For each question, retrieve the most similar context using your embeddings
2. Provide that retrieved context to the Phi-2 model
3. Generate an answer and evaluate its correctness

For the sake of efficiency the evaluation is just run on 100 examples.

**What to expect:**
- **Accuracy:** Your RAG system should achieve **at least 50%** accuracy, substantially better than the 21% without context, though not as good as the 87% with gold standard contexts. This gap is expected - sometimes retrieval finds a relevant but imperfect context, and sometimes it retrieves the wrong passage entirely.

- **Runtime:** Total time should average **less than 1 second per query** (using a GPU with CUDA).

After running the evaluation, **briefly answer the reflection questions that follow in 2-3 sentences each**.

In [14]:
acc, avg_time, _ = evaluate_model(model, tokenizer, dataset=squad, mode="rag",
                                  context_embeddings=context_embeddings, contexts=contexts,
                                  embedding_model=embedding_model, embedding_tokenizer=embedding_tokenizer,
                                  max_examples=100)

print("Using RAG:")
print(f"Accuracy: {acc:.2%}")
print(f"Average time per question: {avg_time:.3f} seconds\n")

Evaluating rag: 100%|██████████| 100/100 [00:55<00:00,  1.81it/s]

Using RAG:
Accuracy: 64.00%
Average time per question: 0.551 seconds






**TODO**: Briefly answer these reflection questions in 2-3 sentences each

**Q1.** How much does RAG improve over no context in terms of correctness? How much runtime overhead do you observe?

**A1.** I find that RAG significantly increases the model's accuracy, from 21% with no context to 64% with RAG. This shows that giving the model relevant context helps it generate accurate answers. The runtime overhead is also quite small: each query takes about 0.55 seconds on average, compared with about 0.51 seconds with no context, which is well below the 1 second threshold.
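
The comparison in A1 can be made concrete with a little arithmetic on the figures reported above. This is only a sketch: the numbers are copied by hand from the single 100-example runs in this notebook, so the timings in particular should be read as rough.

```python
# Accuracies and per-question times reported in the evaluation outputs above
no_context_acc, rag_acc, gold_acc = 0.21, 0.64, 0.87
no_context_time, rag_time = 0.5056, 0.551   # seconds per question

gain = rag_acc - no_context_acc                    # absolute accuracy gain
gap_closed = gain / (gold_acc - no_context_acc)    # share of the gap to gold context
overhead = rag_time - no_context_time              # extra seconds per question

print(f"Accuracy gain: {gain * 100:.0f} percentage points")
print(f"Share of the no-context to gold-context gap closed: {gap_closed:.0%}")
print(f"Runtime overhead: {overhead * 1000:.0f} ms per question")
```

Expressing the gain as a share of the gap to the gold-context accuracy separates what retrieval recovers from what is lost when the wrong passage is retrieved.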

**Q2.** Under what circumstances would you recommend implementing RAG for a real-world question answering system?

**A2.** RAG works well when the knowledge base is large or frequently updated, such as Wikipedia articles, online news, research papers, or company databases. It lets the model draw on the most up-to-date information without being retrained, which makes it a good fit for settings where information changes often.

**Q3.** What are the advantages and disadvantages of RAG as opposed to supervised fine-tuning for improving model performance on a particular dataset (such as these wikipedia articles)?

**A3.** One advantage of RAG is that it can use outside information and stay up to date without retraining the entire model. It is also flexible and can easily handle new data. A disadvantage is that it can occasionally retrieve wrong or incomplete information, which can make the model's answers less accurate than those of a model that has been carefully fine-tuned on the particular dataset.