# COMS W4705 - Homework 4
## Question Answering with Retrieval Augmented Generation

Anubhav Jangra \<aj3228@columbia.edu\>, Emile Al-Billeh \<ea3048@columbia.edu\>, Daniel Bauer \<bauer@cs.columbia.edu\>

In this assignment, you will use a pretrained LLM for question answering on a subset of the Stanford QA Dataset (SQuAD). Here is an example question from SQuAD: 

> *Who is the headmaster of the Christian Brothers of Ireland Stella Maris College?*

The domain knowledge needed to answer questions like this may not be present in the data that the LLM was pre-trained on. As a result, if we simply prompt the LLM to answer this question, it may tell us that it does not know the answer, or worse, it may hallucinate an incorrect answer. Even if we are lucky and the LLM has enough information from pre-training to answer the question, that information may be outdated (the headmaster is likely to change from time to time). 

Luckily, SQuAD provides a context snippet for each question that may contain the answer, such as

> *The Christian Brothers of Ireland Stella Maris College is a private, co-educational, not-for-profit Catholic school located in the wealthy residential southeastern neighbourhood of Carrasco. Established in 1955, it is regarded as one of the best high schools in the country, blending a rigorous curriculum with strong extracurricular activities. **The school's headmaster, history professor Juan Pedro Toni**, is a member of the Stella Maris Board of Governors and the school is a member of the International Baccalaureate Organization (IBO). Its long list of distinguished former pupils includes economists, engineers, architects, lawyers, politicians and even F1 champions. The school has also played an important part in the development of rugby union in Uruguay, with the creation of Old Christians Club, the school's alumni club.*

If we include the context as part of the prompt to the LLM, the model should be able to answer the question correctly. (SQuAD also contains "unanswerable questions", for which the provided context does not contain enough information to answer the question; we will ignore these for the purpose of this assignment.)

We will consider a scenario in which we don't know which context belongs to which question and we will use **Retrieval Augmented Generation (RAG)** techniques to identify the relevant context from the set of all available contexts. 

Specifically, we will experiment with the following systems: 

* A baseline "vanilla QA" system in which we try to answer the question without any additional context (i.e. using the pre-trained LLM only).
* An "oracle" system, in which we provide the correct context for each question. This establishes an upper bound for the retrieval approaches. 
* Two different approaches for retrieving relevant contexts:
  * based on token overlap between the question and each context.
  * based on cosine similarity between question embeddings and candidate context embeddings (obtained using BERT).
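
To make the two retrieval ideas concrete, here is a minimal sketch on toy data. It is only an illustration, not the implementation expected later in the assignment: the tokenizer, the helper names (`tokenize`, `overlap_score`, `cosine_similarity`, `bag_of_words`), and the toy contexts are all assumptions, and the simple count vectors stand in for the BERT embeddings that the real system would use.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase and keep alphanumeric runs (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def overlap_score(question, context):
    """Count distinct question tokens that also occur in the context."""
    return len(set(tokenize(question)) & set(tokenize(context)))

def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def bag_of_words(text, vocabulary):
    """Toy stand-in for an embedding: token counts over a fixed vocabulary."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]

# Hypothetical miniature corpus, for illustration only.
toy_contexts = [
    "The school's headmaster is a history professor.",
    "Tajikistan is home to the Nurek Dam.",
]
toy_question = "Who is the headmaster of the school?"

# Retrieval by token overlap: pick the context sharing the most tokens.
best_by_overlap = max(
    range(len(toy_contexts)),
    key=lambda i: overlap_score(toy_question, toy_contexts[i]),
)

# Retrieval by cosine similarity over the toy vectors.
vocabulary = sorted(
    set(tokenize(toy_question)).union(*(tokenize(c) for c in toy_contexts))
)
question_vec = bag_of_words(toy_question, vocabulary)
best_by_cosine = max(
    range(len(toy_contexts)),
    key=lambda i: cosine_similarity(
        question_vec, bag_of_words(toy_contexts[i], vocabulary)
    ),
)

print(best_by_overlap, best_by_cosine)  # 0 0
```

Note that frequent words such as "the" and "is" count toward the overlap score here, so even the unrelated context receives a nonzero score.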
    
We will evaluate each system using a number of metrics commonly used for QA tasks: 
* Exact Match (EM), which measures the percentage of predictions that exactly match the ground truth answers.
* F1 score, measured on the token overlap between the predicted and ground truth answers.
* ROUGE (specifically, ROUGE-2), which measures bigram overlap between the predicted and ground truth answers.

Follow the instructions in this notebook step by step. Much of the code is provided and just needs to be run, but some sections are marked with **TODO**. Make sure to complete all of these sections.


Requirements: 
Access to a GPU is required for this assignment. If you have a recent Mac, you can try using the MPS backend. Otherwise, I recommend renting a GPU instance through a service like vast.ai or lambdalabs. Google Colab can work in a pinch, but you will have to deal with quotas, and it is somewhat easy to lose unsaved work.

First, we need to ensure that transformers is installed, as well as the accelerate package.

In [2]:
pip install transformers

Collecting transformers
  Downloading transformers-4.57.3-py3-none-any.whl.metadata (43 kB)
Collecting huggingface-hub<1.0,>=0.34.0 (from transformers)
  Downloading huggingface_hub-0.36.0-py3-none-any.whl.metadata (14 kB)
Collecting regex!=2019.12.17 (from transformers)
  Downloading regex-2025.11.3-cp314-cp314-macosx_11_0_arm64.whl.metadata (40 kB)
Collecting tokenizers<=0.23.0,>=0.22.0 (from transformers)
  Downloading tokenizers-0.22.1-cp39-abi3-macosx_11_0_arm64.whl.metadata (6.8 kB)
Collecting safetensors>=0.4.3 (from transformers)
  Downloading safetensors-0.7.0-cp38-abi3-macosx_11_0_arm64.whl.metadata (4.1 kB)
Collecting tqdm>=4.27 (from transformers)
  Downloading tqdm-4.67.1-py3-none-any.whl.metadata (57 kB)
Collecting hf-xet<2.0.0,>=1.1.3 (from huggingface-hub<1.0,>=0.34.0->transformers)
  Downloading hf_xet-1.2.0-cp37-abi3-macosx_11_0_arm64.whl.metadata (4.9 kB)
Downloading transformers-4.57.3-py3-none-any.whl (12.0 MB)

In [3]:
pip install accelerate

Collecting accelerate
  Downloading accelerate-1.12.0-py3-none-any.whl.metadata (19 kB)
Downloading accelerate-1.12.0-py3-none-any.whl (380 kB)
Installing collected packages: accelerate
Successfully installed accelerate-1.12.0
Note: you may need to restart the kernel to use updated packages.


Now all the relevant imports should succeed: 

In [1]:
import os
import json
import tqdm
import copy
import torch
import torch.nn.functional as F

import re
import string
import collections

import transformers

Set GPU for Mac:

In [2]:
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

In [3]:
print(device)

mps


## Data Preparation

This section creates the benchmark data we need to evaluate the QA systems. It has already been implemented for you. We recommend that you run it only once, save the benchmark data in a JSON file, and then load it when needed. The following code may not work on Windows, so we also provide the pre-generated benchmark data for download as an alternative. 
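
The "build once, then load" workflow can be sketched as a small caching helper. This is a hedged illustration rather than part of the provided code: `load_or_build` and `toy_builder` are hypothetical names, and the demonstration writes to a temporary directory instead of `./squad_data`.

```python
import json
import os
import tempfile

def load_or_build(path, build_fn):
    """Return the JSON object stored at `path`, building and saving it first if needed."""
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    data = build_fn()
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2)
    return data

# Toy stand-in for the real benchmark construction (hypothetical).
calls = []

def toy_builder():
    calls.append(1)
    return {"qas": [], "contexts": []}

demo_path = os.path.join(tempfile.mkdtemp(), "rag_qa_benchmark.json")
first = load_or_build(demo_path, toy_builder)   # builds and saves the file
second = load_or_build(demo_path, toy_builder)  # reads the saved file instead
print(len(calls))  # 1
```

Because the second call finds the file on disk, the builder runs only once, which is the behavior you want for the expensive download and sampling step.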

In [4]:
data_dir = "./squad_data"
if not os.path.exists(data_dir):
    os.mkdir(data_dir)

### Downloading the Data and Creating the Benchmark Dataset

In [5]:
training_url = "https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json"
val_url = "https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json"

os.system(f"curl -L {training_url} -o {data_dir}/squad_train.json")  

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 40.1M  100 40.1M    0     0  13.1M      0  0:00:03  0:00:03 --:--:-- 13.1M


0

In [6]:
# load the raw dataset
train_data = json.load(open(f"{data_dir}/squad_train.json"))

# Some details about the dataset

# SQuAD is split up into questions about a number of different topics
print(f"Number of topics: {len(train_data['data'])}")

# Let's explore just one topic. Each topic comes with a number of context paragraphs. 
print("="*30)
print(f"For topic \"{train_data['data'][0]['title']}\"")
print(f"Number of available context paragraphs: {len(train_data['data'][0]['paragraphs'])}")
print("="*30)

print("The first paragraph is:")
print(train_data['data'][0]['paragraphs'][0]['context'])
print("="*30)

# Each paragraph comes with a number of question/answer pairs about the text in the paragraph
print("The first five question-answer pairs are:")
for qa in train_data['data'][0]['paragraphs'][0]['qas'][:5]:
    print(f"Question: {qa['question']}")
    print(f"Answer: {qa['answers'][0]['text']}")
    print("-"*20)

Number of topics: 442
For topic "Beyoncé"
Number of available context paragraphs: 66
The first paragraph is:
Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny's Child. Managed by her father, Mathew Knowles, the group became one of the world's best-selling girl groups of all time. Their hiatus saw the release of Beyoncé's debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy".
The first five question-answer pairs are:
Question: When did Beyonce start becoming popular?
Answer: in the late 1990s
--------------------
Question: What areas did Beyonce compete in when she was

In [7]:
print("Total number of paragraphs in the training set:", sum([len(topic['paragraphs']) for topic in train_data['data']]))
print("Total number of question-answer pairs in the training set:", sum([len(paragraph['qas']) for topic in train_data['data'] for paragraph in topic['paragraphs']]))

Total number of paragraphs in the training set: 19035
Total number of question-answer pairs in the training set: 130319


In [8]:
# Not all questions are answerable given the information in the paragraph. Part of the original SQuAD 2.0 task is to identify such
# unanswerable questions. We will ignore them for the purpose of this assignment. 
print("Avg number of answers per question:", 
      sum([len(qa['answers']) for topic in train_data['data'] for paragraph in topic['paragraphs'] for qa in paragraph['qas']]) / 
      sum([len(paragraph['qas']) for topic in train_data['data'] for paragraph in topic['paragraphs']]))
print("Count of answerable vs unanswerable questions:")
answerable_count = 0
unanswerable_count = 0
for topic in train_data['data']:
    for paragraph in topic['paragraphs']:
        for qa in paragraph['qas']:
            if len(qa['answers']) > 0:
                answerable_count += 1
            else:
                unanswerable_count += 1
print(f"Answerable questions: {answerable_count} ({answerable_count / (answerable_count + unanswerable_count) * 100:.2f}%)")
print(f"Unanswerable questions: {unanswerable_count} ({unanswerable_count / (answerable_count + unanswerable_count) * 100:.2f}%)")

Avg number of answers per question: 0.6662190471074824
Count of answerable vs unanswerable questions:
Answerable questions: 86821 (66.62%)
Unanswerable questions: 43498 (33.38%)


In [9]:
# Finally, create the RAG QA benchmark consisting of 250 answerable questions. 

# We will use all available context paragraphs for RAG
rag_contexts = [paragraph['context'] for topic in train_data['data'] for paragraph in topic['paragraphs']]

qa_pairs = []
for topic in train_data['data']:
    for paragraph in topic['paragraphs']:
        for qa in paragraph['qas']:
            if len(qa['answers']) > 0:
                qa_pairs.append({
                    "question": qa['question'],
                    "answer": qa['answers'][0]['text'],
                    "context": paragraph['context']
                })
            
# randomly sample 250 answerable questions for the benchmark
import random
random.seed(42) # IMPORTANT so everyone is working on the same set of sampled QA pairs
sampled_qa_pairs = random.sample(qa_pairs, 250)


evaluation_benchmark = {'qas': sampled_qa_pairs, 
                        'contexts': rag_contexts}
random.shuffle(evaluation_benchmark['qas'])
random.shuffle(evaluation_benchmark['contexts'])

# save the evaluation benchmark to a file
json.dump(evaluation_benchmark, open(f"{data_dir}/rag_qa_benchmark.json", "w"), indent=2)

### Loading the Benchmark Dataset / Understanding the Data Format

Use the following code to load the benchmark data from a file. Take a look at the example output to see how the data is structured. 

In [10]:
# load the benchmark and display some samples
evaluation_benchmark = json.load(open(f"{data_dir}/rag_qa_benchmark.json"))

print("Sample RAG contexts:")
for context in evaluation_benchmark['contexts'][:2]:
    print(context)
    print("-"*20)
print("="*30)
print("Sample RAG QA pairs:")
for qa in evaluation_benchmark['qas'][:5]:
    print(f"Question: {qa['question']}")
    print(f"Answer: {qa['answer']}")
    print("-"*20)

Sample RAG contexts:
Tajikistan's rivers, such as the Vakhsh and the Panj, have great hydropower potential, and the government has focused on attracting investment for projects for internal use and electricity exports. Tajikistan is home to the Nurek Dam, the highest dam in the world. Lately, Russia's RAO UES energy giant has been working on the Sangtuda-1 hydroelectric power station (670 MW capacity) commenced operations on 18 January 2008. Other projects at the development stage include Sangtuda-2 by Iran, Zerafshan by the Chinese company SinoHydro, and the Rogun power plant that, at a projected height of 335 metres (1,099 ft), would supersede the Nurek Dam as highest in the world if it is brought to completion. A planned project, CASA 1000, will transmit 1000 MW of surplus electricity from Tajikistan to Pakistan with power transit through Afghanistan. The total length of transmission line is 750 km while the project is planned to be on Public-Private Partnership basis with the suppo

The `evaluation_benchmark` is a dictionary with two keys: 
* `evaluation_benchmark['qas']`  provides a list of *qa_items* (see below).
* `evaluation_benchmark['contexts']` provides a list of available candidate contexts. Note that this includes all contexts from SQuAD, not just the ones for the 250 questions we sampled for the benchmark.

Each *qa_item* is a dictionary with the following keys: 
* `qa_item['question']` is the question string
* `qa_item['answer']` is the target answer string
* `qa_item['context']` is the gold context for this question

For example: 


In [11]:
qa_items = evaluation_benchmark['qas']
len(qa_items)

250

In [12]:
qa_item = qa_items[0]
qa_item['question']

'Who is the headmaster of the Christian Brothers of Ireland Stella Maris College?'

In [13]:
qa_item['answer']

'professor Juan Pedro Toni'

In [14]:
qa_item['context']

"The Christian Brothers of Ireland Stella Maris College is a private, co-educational, not-for-profit Catholic school located in the wealthy residential southeastern neighbourhood of Carrasco. Established in 1955, it is regarded as one of the best high schools in the country, blending a rigorous curriculum with strong extracurricular activities. The school's headmaster, history professor Juan Pedro Toni, is a member of the Stella Maris Board of Governors and the school is a member of the International Baccalaureate Organization (IBO). Its long list of distinguished former pupils includes economists, engineers, architects, lawyers, politicians and even F1 champions. The school has also played an important part in the development of rugby union in Uruguay, with the creation of Old Christians Club, the school's alumni club."

## Part 1 - Question Answering Evaluation Functions

In this section, we will define a number of evaluation functions that measure the quality of the QA output, compared to a single target answer for each question. 

Because the evaluation happens at the token level, we will perform some simple pre-processing: 

In [15]:
def normalize_answer(s):
  """Lower text and remove punctuation, articles and extra whitespace."""
  def remove_articles(text):
    regex = re.compile(r'\b(a|an|the)\b', re.UNICODE)
    return re.sub(regex, ' ', text)
  def white_space_fix(text):
    return ' '.join(text.split())
  def remove_punc(text):
    exclude = set(string.punctuation)
    return ''.join(ch for ch in text if ch not in exclude)
  def lower(text):
    return text.lower()
  return white_space_fix(remove_articles(remove_punc(lower(s))))

def get_tokens(s):
  if not s: return []
  return normalize_answer(s).split()

First, Exact Match (EM) measures the percentage of predictions that match the ground truth answer exactly after normalization.
The following function returns 1 if the predicted answer is correct and 0 otherwise. 

In [16]:
def compute_exact(a_gold, a_pred):
  return int(normalize_answer(a_gold) == normalize_answer(a_pred))

The next function calculates the $F_1$ score of the set of predicted tokens against the set of target tokens. 
$F_1$ is the harmonic mean of precision and recall, providing a balance between the two. Specifically 

$F_1 = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$

where $\text{precision}$ is the fraction of predicted tokens that also appear in the target and $\text{recall}$ is the fraction of target tokens that also appear in the prediction. 
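
As a worked example (a sketch using strings that appear in the benchmark), take the target answer "professor Juan Pedro Toni" and the prediction "Juan Pedro Toni". After normalization the target has 4 tokens, the prediction has 3, and they share 3 tokens, so

$\text{precision} = \frac{3}{3} = 1, \qquad \text{recall} = \frac{3}{4}$

$F_1 = \frac{2 \times 1 \times \frac{3}{4}}{1 + \frac{3}{4}} = \frac{6}{7} \approx 0.857$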

**TODO**: Write the function `compute_f1(a_gold, a_pred)` that returns the F1 score as defined above. It should work similarly to the `compute_exact` function above. Test your function on a sample answer and prediction to verify that it works correctly. 

In [17]:
def compute_f1(a_gold, a_pred): # Complete the function
  gold_toks = get_tokens(a_gold)
  pred_toks = get_tokens(a_pred)

  if len(gold_toks) == 0 and len(pred_toks) == 0:
    return 1.0
  if len(gold_toks) == 0 or len(pred_toks) == 0:
    return 0.0

  common = collections.Counter(gold_toks) & collections.Counter(pred_toks)
  num_same = sum(common.values())

  if num_same == 0:
    return 0.0

  precision = num_same / len(pred_toks)
  recall = num_same / len(gold_toks)
  f1 = (2 * precision * recall) / (precision + recall)
  return f1

In [18]:
# Test your function
a_gold = "professor Juan Pedro Toni"
a_pred = "Juan Pedro Toni"
print("F1:", compute_f1(a_gold, a_pred))

print("F1 exact:", compute_f1("the cat", "cat"))
print("F1 no overlap:", compute_f1("red car", "blue"))

F1: 0.8571428571428571
F1 exact: 1.0
F1 no overlap: 0.0


Finally, we also want to compute ROUGE-2 scores (which extend the F1 score above to 2-grams). We can use the `rouge_score` package to do this for us. 
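
To see what extending F1 to 2-grams means in practice, here is a minimal hand-rolled sketch. It is not the `rouge_score` implementation: the function names are hypothetical, and plain whitespace tokenization is a simplifying assumption, so scores can differ from the package on inputs with punctuation.

```python
from collections import Counter

def bigrams(tokens):
    """Multiset of adjacent token pairs."""
    return Counter(zip(tokens, tokens[1:]))

def rouge2_f1_sketch(reference, prediction):
    """F-measure over bigram overlap between reference and prediction."""
    ref_bigrams = bigrams(reference.lower().split())
    pred_bigrams = bigrams(prediction.lower().split())
    overlap = sum((ref_bigrams & pred_bigrams).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_bigrams.values())
    recall = overlap / sum(ref_bigrams.values())
    return 2 * precision * recall / (precision + recall)

score = rouge2_f1_sketch("capital of england is london", "london capital of england")
print(round(score, 4))  # 0.5714

# A one-word answer has no bigrams, so the score is 0 even for a perfect match.
print(rouge2_f1_sketch("1890s", "1890s"))  # 0.0
```

The second call shows why ROUGE-2 tends to be much lower than EM and F1 on this task: many SQuAD answers are a single token.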

In [24]:
pip install rouge_score

Collecting rouge_score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting absl-py (from rouge_score)
  Downloading absl_py-2.3.1-py3-none-any.whl.metadata (3.3 kB)
Collecting nltk (from rouge_score)
  Downloading nltk-3.9.2-py3-none-any.whl.metadata (3.2 kB)
Collecting click (from nltk->rouge_score)
  Downloading click-8.3.1-py3-none-any.whl.metadata (2.6 kB)
Collecting joblib (from nltk->rouge_score)
  Downloading joblib-1.5.3-py3-none-any.whl.metadata (5.5 kB)
Downloading absl_py-2.3.1-py3-none-any.whl (135 kB)
Downloading nltk-3.9.2-py3-none-any.whl (1.5 MB)
Downloading click-8.3.1-py3-none-any.whl (108 kB)
Downloading joblib-1.5.3-py3-none-any.whl (309 kB)
Building wheels for collected pac

In [19]:
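from rouge_score import rouge_scorer

# Use a distinct name so the imported `rouge_scorer` module is not shadowed
# (re-running this cell would otherwise fail).
rouge2_scorer = rouge_scorer.RougeScorer(['rouge2'], use_stemmer=False)

def compute_rouge2(a_gold, a_pred):
    if not a_gold or not a_pred:
        return 0.0
    # RougeScorer.score takes (target, prediction), in that order
    scores = rouge2_scorer.score(a_gold.lower(), a_pred.lower())
    return scores['rouge2'].fmeasure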

Let's test the metrics: 

In [20]:
reference_answers = ["London", "The capital of England is London.", "London is the capital city of England."]
predicted_answers = ["London, capital of England"] * len(reference_answers)

print("Normalized Answers:")
for ref, pred in zip(reference_answers, predicted_answers):
    print(f"Original:")
    print(f"Reference: {ref} | Predicted: {pred}")
    print(f"Normalized:")
    print(f"Reference: {normalize_answer(ref)} | Predicted: {normalize_answer(pred)}")
    print("Exact Match:", compute_exact(normalize_answer(ref), normalize_answer(pred)))
    print("F1 Score:", compute_f1(normalize_answer(ref), normalize_answer(pred)))
    print("ROUGE-2 F1-score:", compute_rouge2(normalize_answer(ref), normalize_answer(pred)))
    print("-"*40)

Normalized Answers:
Original:
Reference: London | Predicted: London, capital of England
Normalized:
Reference: london | Predicted: london capital of england
Exact Match: 0
F1 Score: 0.4
ROUGE-2 F1-score: 0.0
----------------------------------------
Original:
Reference: The capital of England is London. | Predicted: London, capital of England
Normalized:
Reference: capital of england is london | Predicted: london capital of england
Exact Match: 0
F1 Score: 0.888888888888889
ROUGE-2 F1-score: 0.5714285714285715
----------------------------------------
Original:
Reference: London is the capital city of England. | Predicted: London, capital of England
Normalized:
Reference: london is capital city of england | Predicted: london capital of england
Exact Match: 0
F1 Score: 0.8
ROUGE-2 F1-score: 0.25
----------------------------------------


## Part 2 - Vanilla Question Answering

In this part, we will use an off-the-shelf pretrained LLM and attempt to answer the questions from its pretraining knowledge only. 
To keep things simple, we will use the Hugging Face `transformers` pipeline abstraction. The pipeline downloads the model and its parameters when it is created. When we pass an input prompt to the pipeline, it automatically performs preprocessing (tokenization), inference, and postprocessing (removing EOS markers and padding).

### Loading the LLM
The LLM we will use is the 1B version of the instruction-tuned OLMo 2 model. OLMo is an open source language model created by Allen AI and the University of Washington. Unlike other open source models, OLMo is also open data. You can read more about it at https://huggingface.co/allenai/OLMo-2-0425-1B-Instruct and https://allenai.org/olmo.

In [21]:
qa_model = "allenai/OLMo-2-0425-1B-Instruct"

from transformers import pipeline

# Check which GPU device to use. Note, this will likely NOT work on a CPU. 
if torch.cuda.is_available():
    device = "cuda" 
elif torch.backends.mps.is_available():
    device = "mps" 
else:
    device = "cpu"

pipe = pipeline(
    "text-generation",
    model=qa_model,
    dtype=torch.bfloat16,
    device_map=device,
)

Device set to use mps


In [None]:
# Check that the model was placed on the expected device
print(pipe.model.device)

mps:0


We can now pass a prompt to the model and retrieve the completed answer. 

In [23]:
prompt = "My favorite thing to do in fall is"
output = pipe(prompt, 
              max_new_tokens=128,
              do_sample=True, # set to False for greedy decoding below
              pad_token_id=pipe.tokenizer.eos_token_id)
print(output)

[{'generated_text': "My favorite thing to do in fall is to make a wreath. Here are the supplies you need:\n\n    - A wreath mold (you can find one at most craft stores)\n    - Yarn (green, orange, and red)\n    - Hot glue gun and glue sticks\n    - Cork or foam balls (for hanging)\n    - Decorative elements (like leaves, acorns, or dried flowers)\n\nHere's how to make a wreath:\n\n1. Start by cutting the yarn to the desired length for your wreath.\n2. Wrap the yarn around the wreath mold, leaving some extra length on each side for hanging. Secure each end of the yarn"}]


We can skip the prompt that is repeated in the output. 

In [24]:
output[0]['generated_text'][len(prompt):].strip()

"to make a wreath. Here are the supplies you need:\n\n    - A wreath mold (you can find one at most craft stores)\n    - Yarn (green, orange, and red)\n    - Hot glue gun and glue sticks\n    - Cork or foam balls (for hanging)\n    - Decorative elements (like leaves, acorns, or dried flowers)\n\nHere's how to make a wreath:\n\n1. Start by cutting the yarn to the desired length for your wreath.\n2. Wrap the yarn around the wreath mold, leaving some extra length on each side for hanging. Secure each end of the yarn"

### Using the LLM for Question Answering

**TODO:** Write a function `vanilla_qa(qa_item)` that takes a qa_item in the format described above, inserts the question (and only the question!) into a suitable prompt, passes the prompt to the LLM, and then returns the answer as a string. 

A prompt might look like this, but will need a bit of prompt engineering to make it work well. 

> *Answer the following question concisely.* 
>
> *Question: Who played the lead role in Alien?*
> 
> *Answer:*

Once you have a basic version of the vanilla QA system you can tune the prompt (see below). 
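
One way to make prompt experiments easier is to assemble the prompt from parts. The following is a minimal sketch under assumptions: `build_prompt` is a hypothetical helper, and the instruction wording and example pair are placeholders to tune, not a recommended final prompt.

```python
def build_prompt(question, examples=(), instruction="Answer the following question concisely."):
    """Assemble an instruction, optional worked examples, and the target question."""
    parts = [instruction, ""]
    for example_question, example_answer in examples:
        parts.append(f"Question: {example_question}")
        parts.append(f"Answer: {example_answer}")
        parts.append("")
    parts.append(f"Question: {question}")
    parts.append("Answer:")
    return "\n".join(parts)

# Single-shot prompt: one worked example before the real question.
demo_prompt = build_prompt(
    "Who played the lead role in Alien?",
    examples=[("Who wrote Pride and Prejudice?", "Jane Austen")],
)
print(demo_prompt)
```

Passing an empty `examples` list gives the zero-shot template shown above, so the same helper covers both variants.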

In [34]:
def vanilla_qa(qa_item): # Complete this function
    question = qa_item["question"]

    prompt = (
        # Before experimenting with different prompt formats:
        #"Answer the following question concisely.\n\n"
        #f"Question: {question}\n"
        #"Answer:"

        # After experimenting with different prompt formats:
        "Answer the question using only a short phrase.\n\n"
        "Example:\n"
        "Question: Who wrote Pride and Prejudice?\n"
        "Answer: Jane Austen\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

    output = pipe(
        prompt,
        max_new_tokens=64,
        do_sample=False,  # greedy decoding (no sampling)
        pad_token_id=pipe.tokenizer.eos_token_id,
    )

    generated = output[0]["generated_text"]
    answer = generated[len(prompt):].strip()

    return answer
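
With an example in the prompt and a generous `max_new_tokens`, a model may keep writing further "Question:/Answer:" pairs after its answer, which hurts Exact Match. Whether this model does so here is an assumption; if it does, a small post-processing step like the hypothetical `first_answer_line` below could be applied to the returned string.

```python
def first_answer_line(completion):
    """Keep only the first non-empty line of a model completion."""
    for line in completion.splitlines():
        line = line.strip()
        if line:
            return line
    return ""

# Made-up completion that runs on past the answer.
raw = " Jane Austen\n\nQuestion: Who wrote Emma?\nAnswer: Jane Austen"
print(first_answer_line(raw))  # Jane Austen
```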

The following code should return an answer (but possibly not the right one) to the first question in the dataset.

In [26]:
qa_item = evaluation_benchmark['qas'][0]
qa_item['question']

'Who is the headmaster of the Christian Brothers of Ireland Stella Maris College?'

In [27]:
vanilla_qa(qa_item) # inspect the item

The following generation flags are not valid and may be ignored: ['temperature']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


"The headmaster of the Christian Brothers of Ireland Stella Maris College is Fr Michael O'Connell."

And the following function evaluates the performance of your `vanilla_qa` function on a list of qa_items. 

In [35]:
def evaluate_qa(qa_function, qa_items, verbose=False):
    results = []

    
    for i, qa_item in tqdm.tqdm(enumerate(qa_items), desc="Evaluating QA instances", total=len(qa_items)):

        question = qa_item['question'] 
        answer = qa_item['answer']
        context = qa_item['context']
        
        predicted_answer = qa_function(qa_item)

        exact_match = compute_exact(answer, predicted_answer)
        f1_score = compute_f1(answer, predicted_answer)
        rouge2_f1 = compute_rouge2(answer, predicted_answer)

        if verbose:
            print(f"Q: {question}")
            print(f"Gold Answer: {answer}")
            print(f"Predicted Answer: {predicted_answer}")
            print(f"Exact Match: {exact_match}, F1 Score: {f1_score}")
            print(f"ROUGE-2 F1 Score: {rouge2_f1}")
            print("-"*40)

        results.append({
            "question": question,
            "answer": answer,
            "predicted_answer": predicted_answer,
            "context": context if context else None,
            "exact_match": exact_match,
            "f1_score": f1_score,
            "rouge2_f1": rouge2_f1
        })
    return results

In [36]:
vanilla_evaluation_results = evaluate_qa(vanilla_qa, evaluation_benchmark['qas'])

Evaluating QA instances: 100%|██████████| 250/250 [00:48<00:00,  5.17it/s]


The function returns a list of evaluation results, one dictionary for each qa item.

In [37]:
vanilla_evaluation_results[0]

{'question': 'Who is the headmaster of the Christian Brothers of Ireland Stella Maris College?',
 'answer': 'professor Juan Pedro Toni',
 'predicted_answer': 'Christian Brothers of Ireland Stella Maris College',
 'context': "The Christian Brothers of Ireland Stella Maris College is a private, co-educational, not-for-profit Catholic school located in the wealthy residential southeastern neighbourhood of Carrasco. Established in 1955, it is regarded as one of the best high schools in the country, blending a rigorous curriculum with strong extracurricular activities. The school's headmaster, history professor Juan Pedro Toni, is a member of the Stella Maris Board of Governors and the school is a member of the International Baccalaureate Organization (IBO). Its long list of distinguished former pupils includes economists, engineers, architects, lawyers, politicians and even F1 champions. The school has also played an important part in the development of rugby union in Uruguay, with the cre

Finally, the `present_results` function aggregates the results for the various qa items and prints the overall result. 

In [38]:
def present_results(eval_results, exp_name=""):
    print(f"{exp_name} Evaluation Results:")
    exact_matches = [res['exact_match'] for res in eval_results]
    f1_scores = [res['f1_score'] for res in eval_results]
    rouge2_f1 = [res['rouge2_f1'] for res in eval_results]
    print(f"Exact Match: {sum(exact_matches) / len(exact_matches) * 100:.2f}%")
    print(f"F1 Score: {sum(f1_scores) / len(f1_scores) * 100:.2f}%")
    print(f"ROUGE2 F1: {sum(rouge2_f1) / len(rouge2_f1) * 100:.2f}%")

    # print out some evaluation results
    for res in eval_results[:5]:
        print(f"Question: {res['question']}")
        print(f"Gold Answer: {res['answer']}")
        print(f"Predicted Answer: {res['predicted_answer']}")
        print(f"Exact Match: {res['exact_match']}, F1 Score: {res['f1_score']}")
        print("ROUGE-2 F1-score:", res['rouge2_f1'])
        print("-"*40)
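
If you want the aggregate numbers as data rather than printed text, for example to compare several systems side by side, a small helper along these lines could be used. This is a sketch: `summarize` is a hypothetical name, and the toy results below are made up to show the arithmetic.

```python
def summarize(eval_results):
    """Return mean exact match, F1, and ROUGE-2 F1 as percentages."""
    keys = ("exact_match", "f1_score", "rouge2_f1")
    n = len(eval_results)
    if n == 0:
        return {key: 0.0 for key in keys}
    return {key: 100 * sum(res[key] for res in eval_results) / n for key in keys}

# Made-up per-question results in the format returned by evaluate_qa.
toy_results = [
    {"exact_match": 1, "f1_score": 1.0, "rouge2_f1": 1.0},
    {"exact_match": 0, "f1_score": 0.5, "rouge2_f1": 0.0},
]
summary = summarize(toy_results)
print(summary)  # {'exact_match': 50.0, 'f1_score': 75.0, 'rouge2_f1': 50.0}
```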

In [39]:
present_results(vanilla_evaluation_results, "Vanilla QA")

Vanilla QA Evaluation Results:
Exact Match: 7.20%
F1 Score: 13.85%
ROUGE2 F1: 4.11%
Question: Who is the headmaster of the Christian Brothers of Ireland Stella Maris College?
Gold Answer: professor Juan Pedro Toni
Predicted Answer: Christian Brothers of Ireland Stella Maris College
Exact Match: 0, F1 Score: 0.0
ROUGE-2 F1-score: 0.0
----------------------------------------
Question: What is the ratio of black and Asian schoolchildren to white schoolchildren?
Gold Answer: about six to four
Predicted Answer: 1:1
Exact Match: 0, F1 Score: 0.0
ROUGE-2 F1-score: 0.0
----------------------------------------
Question: When did Outcault's The Yellow Kid appear in newspapers?
Gold Answer: 1890s
Predicted Answer: 1894
Exact Match: 0, F1 Score: 0.0
ROUGE-2 F1-score: 0.0
----------------------------------------
Question: When did devolution in the UK begin?
Gold Answer: 1914
Predicted Answer: 1997
Exact Match: 0, F1 Score: 0.0
ROUGE-2 F1-score: 0.0
----------------------------------------
Question

**TODO:** Experiment with the prompt template and try to achieve an Exact Match score of at least 5%. You may want to try including an example in the prompt (single-shot prompting).

## Part 3 - Oracle Question Answering

We will now establish an upper bound for a retrieval augmented QA system by providing the correct ("gold") context for each question as part of the prompt. These contexts are available as part of each qa_item in the `evaluation_benchmark['qas']` list. 

**TODO**: Write a function `oracle_qa(qa_item)` that takes in a qa_item, inserts both the question **and** the gold context into a prompt template, then passes the prompt to the LLM and returns the answer. The function should behave like the `vanilla_qa` function above, so that we can evaluate it using the same evaluation steps. 

In [42]:
def oracle_qa(qa_item): # Write this function
    question = qa_item["question"]
    context = qa_item["context"]

    prompt = (
        "You are a question answering system.\n"
        "Using ONLY the context below, extract the shortest exact text span that answers the question.\n"
        "Return ONLY the answer span exactly as it appears in the context. Do not add any extra words.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

    output = pipe(
        prompt,
        max_new_tokens=32,
        do_sample=False,
        pad_token_id=pipe.tokenizer.eos_token_id,
    )

    generated = output[0]["generated_text"]
    answer = generated[len(prompt):].strip()

    return answer

**TODO**: Run the `evaluate_qa` function on your `oracle_qa` function and display the results. You should see Exact Match scores above 50% (if not, tinker with the prompt template). 

In [43]:
oracle_evaluation_results = evaluate_qa(oracle_qa, evaluation_benchmark['qas'])
present_results(oracle_evaluation_results)

Evaluating QA instances: 100%|██████████| 250/250 [01:49<00:00,  2.29it/s]

 Evaluation Results:
Exact Match: 53.60%
F1 Score: 69.84%
ROUGE2 F1: 38.07%
Question: Who is the headmaster of the Christian Brothers of Ireland Stella Maris College?
Gold Answer: professor Juan Pedro Toni
Predicted Answer: Juan Pedro Toni
Exact Match: 0, F1 Score: 0.8571428571428571
ROUGE-2 F1-score: 0.8
----------------------------------------
Question: What is the ratio of black and Asian schoolchildren to white schoolchildren?
Gold Answer: about six to four
Predicted Answer: about six to four
Exact Match: 1, F1 Score: 1.0
ROUGE-2 F1-score: 1.0
----------------------------------------
Question: When did Outcault's The Yellow Kid appear in newspapers?
Gold Answer: 1890s
Predicted Answer: 1890s
Exact Match: 1, F1 Score: 1.0
ROUGE-2 F1-score: 0.0
----------------------------------------
Question: When did devolution in the UK begin?
Gold Answer: 1914
Predicted Answer: Government of Ireland Act 1914
Exact Match: 0, F1 Score: 0.33333333333333337
ROUGE-2 F1-score: 0.0
--------------------
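The per-example scores above are consistent with a SQuAD-style token-overlap F1. The `evaluate_qa` implementation is defined earlier in the notebook and is not shown in this section, so the normalization below (lowercasing, stripping punctuation, dropping the articles a/an/the) is an assumption; `normalize` and `token_f1` are hypothetical names. A minimal sketch of that computation:

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, remove punctuation and articles, split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def token_f1(prediction, gold):
    """Harmonic mean of token precision and recall between two answers."""
    pred_tokens = normalize(prediction)
    gold_tokens = normalize(gold)
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


# 3 of 3 predicted tokens match (precision 1.0), 3 of 4 gold tokens match (recall 0.75)
print(token_f1("Juan Pedro Toni", "professor Juan Pedro Toni"))
```

Under these assumptions, "Juan Pedro Toni" against "professor Juan Pedro Toni" gives 6/7, which agrees with the 0.857 reported above. ROUGE-2 is based on bigram overlap, so a one-token answer such as "1890s" has no bigrams to match, which would explain an Exact Match of 1 alongside a ROUGE-2 score of 0.0.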




## Part 4 - Retrieval-Augmented Question Answering - Word Overlap

Next, we will experiment with various approaches for retrieving relevant contexts from the set of available contexts. We first get the list of all 19035 available candidate contexts from the evaluation_benchmark.

In [44]:
candidate_contexts = evaluation_benchmark["contexts"]

In [45]:
len(candidate_contexts)

19035

In [46]:
candidate_contexts[0]

"Tajikistan's rivers, such as the Vakhsh and the Panj, have great hydropower potential, and the government has focused on attracting investment for projects for internal use and electricity exports. Tajikistan is home to the Nurek Dam, the highest dam in the world. Lately, Russia's RAO UES energy giant has been working on the Sangtuda-1 hydroelectric power station (670 MW capacity) commenced operations on 18 January 2008. Other projects at the development stage include Sangtuda-2 by Iran, Zerafshan by the Chinese company SinoHydro, and the Rogun power plant that, at a projected height of 335 metres (1,099 ft), would supersede the Nurek Dam as highest in the world if it is brought to completion. A planned project, CASA 1000, will transmit 1000 MW of surplus electricity from Tajikistan to Pakistan with power transit through Afghanistan. The total length of transmission line is 750 km while the project is planned to be on Public-Private Partnership basis with the support of WB, IFC, ADB a

### Token Overlap Retriever 
Let's first experiment with a simple retriever based on word overlap. Given a question, we measure how many of its tokens appear in each of the candidate contexts. We then retrieve the k contexts with the highest overlap. 

**TODO:** Write the function `retrieve_overlap(question, contexts, top_k)` that takes in the question (a string) and a list of contexts (each context is a string). It should calculate the word overlap between the question and *each* context, and return a list of the *top_k* contexts with the highest overlap. 

In [47]:
# word overlap retriever -- write this function
def retrieve_overlap(question, contexts, top_k=5):
    question_tokens = set(get_tokens(question))
    scores_list = []

    for context in contexts:
        context_tokens = set(get_tokens(context))
        overlap = len(question_tokens & context_tokens)
        scores_list.append((overlap, context))

    scores_list.sort(key=lambda x: x[0], reverse=True)
    top_contexts = [context for i, context in scores_list[:top_k]]

    return top_contexts
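The function above re-tokenizes every candidate context for every question. A possible speed-up, sketched here under assumptions, is to tokenize the contexts once and reuse the token sets. `simple_tokens` is a stand-in for the notebook's `get_tokens` (whose definition is not shown in this section), and `build_token_sets` and `retrieve_overlap_cached` are hypothetical names.

```python
import re


def simple_tokens(text):
    """Stand-in tokenizer (assumption); swap in the notebook's get_tokens."""
    return re.findall(r"\w+", text.lower())


def build_token_sets(contexts, tokenize=simple_tokens):
    """Tokenize every context once so the sets can be reused for all questions."""
    return [set(tokenize(context)) for context in contexts]


def retrieve_overlap_cached(question, contexts, context_token_sets, top_k=5,
                            tokenize=simple_tokens):
    """Return the top_k contexts sharing the most distinct tokens with the question."""
    question_tokens = set(tokenize(question))
    scores = [len(question_tokens & tokens) for tokens in context_token_sets]
    # sorted() is stable, so ties keep their original order in the context list
    ranked = sorted(range(len(contexts)), key=lambda i: scores[i], reverse=True)
    return [contexts[i] for i in ranked[:top_k]]


contexts = ["the cat sat on the mat", "dogs bark loudly", "a cat and a dog"]
token_sets = build_token_sets(contexts)
print(retrieve_overlap_cached("where did the cat sat", contexts, token_sets, top_k=2))
```

To use this with a helper that calls `retriever(question, contexts, top_k)`, the cached sets can be bound with a small wrapper such as `lambda q, ctxs, k: retrieve_overlap_cached(q, ctxs, token_sets, k)`.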

The following function runs the retriever on a list of qa_items. For each qa_item it obtains the list of retrieved contexts and adds them to the qa_item. 

In [48]:
def add_rag_context(qa_items, contexts, retriever, top_k=5):
    result_items = copy.deepcopy(qa_items)
    for inst in tqdm.tqdm(result_items, desc="Retrieving contexts"):
        question = inst['question']
        retrieved_contexts = retriever(question, contexts, top_k)
        inst['rag_contexts'] = retrieved_contexts   
    return result_items

In [49]:
rag_qa_pairs = add_rag_context(evaluation_benchmark['qas'], candidate_contexts, retrieve_overlap)

Retrieving contexts: 100%|██████████| 250/250 [03:34<00:00,  1.17it/s]


It returns a copy of the qa_item list that is now annotated with the additional 'rag_contexts'.

In [50]:
rag_qa_pairs[0]

{'question': 'Who is the headmaster of the Christian Brothers of Ireland Stella Maris College?',
 'answer': 'professor Juan Pedro Toni',
 'context': "The Christian Brothers of Ireland Stella Maris College is a private, co-educational, not-for-profit Catholic school located in the wealthy residential southeastern neighbourhood of Carrasco. Established in 1955, it is regarded as one of the best high schools in the country, blending a rigorous curriculum with strong extracurricular activities. The school's headmaster, history professor Juan Pedro Toni, is a member of the Stella Maris Board of Governors and the school is a member of the International Baccalaureate Organization (IBO). Its long list of distinguished former pupils includes economists, engineers, architects, lawyers, politicians and even F1 champions. The school has also played an important part in the development of rugby union in Uruguay, with the creation of Old Christians Club, the school's alumni club.",
 'rag_contexts': 

Before we run an end-to-end evaluation, we can check the accuracy of the word overlap retriever. In other words, for what fraction of questions is the gold context included in the top-k retrieved contexts? 

In [51]:
# evaluation metric of retriever
def evaluate_retriever(rag_qa_pairs):
    """
    Evaluates the retriever by computing the accuracy of retrieved contexts against reference contexts.
    """
    correct_retrievals = 0
    for qa_item in rag_qa_pairs:
        if qa_item['context'] in qa_item['rag_contexts']:
            correct_retrievals += 1
    accuracy = correct_retrievals / len(rag_qa_pairs)
    return accuracy

In our implementation, we got an accuracy of 0.372. 

In [52]:
evaluate_retriever(rag_qa_pairs)

0.52

**TODO**: Write a function `rag_qa(qa_item)` that behaves like the `vanilla_qa` and `oracle_qa` functions above. Create a prompt from the question and the top-k retrieved contexts (instead of the gold context you used in `oracle_qa`). You can assume that `qa_item` already contains the 'rag_contexts' field. 

In [53]:
def rag_qa(qa_item): # Write this function
    question = qa_item["question"]
    rag_contexts = qa_item["rag_contexts"]

    context_block = "\n\n".join([f"Context {i+1}: {ctx}" for i, ctx in enumerate(rag_contexts)])

    prompt = (
        "You are a question answering system.\n"
        "Using ONLY the contexts below, extract the shortest exact text span that answers the question.\n"
        "Return ONLY the answer span exactly as it appears in the contexts. Do not add any extra words.\n\n"
        f"{context_block}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

    output = pipe(
        prompt,
        max_new_tokens=32,
        do_sample=False,
        pad_token_id=pipe.tokenizer.eos_token_id,
    )

    generated = output[0]["generated_text"]
    answer = generated[len(prompt):].strip()
    # answer = answer.split("\n")[0].strip()

    return answer
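If the model keeps generating after the answer (for example, by starting a new "Question:" line), the extra text lowers Exact Match. The commented-out line in `rag_qa` hints at keeping only the first line of the continuation. Below is a minimal sketch of that post-processing as a standalone helper; `first_line_answer` is a hypothetical name, and whether this helps depends on how the model actually behaves.

```python
def first_line_answer(generated, prompt):
    """Strip the echoed prompt and keep only the first non-empty line of the continuation."""
    if generated.startswith(prompt):
        continuation = generated[len(prompt):]
    else:
        continuation = generated
    stripped = continuation.strip()
    if not stripped:
        return ""
    return stripped.splitlines()[0].strip()


prompt = "Question: When did devolution in the UK begin?\nAnswer:"
generated = prompt + " 1914\nQuestion: What happened next?"
print(first_line_answer(generated, prompt))
```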

**TODO**: Like you did for the vanilla and oracle qa system, evaluate the `rag_qa` function and display the results. In our implementation, we got an exact match of 19.6%. 

In [54]:
rag_overlap_eval = evaluate_qa(rag_qa, rag_qa_pairs)
present_results(rag_overlap_eval)

Evaluating QA instances: 100%|██████████| 250/250 [11:36<00:00,  2.78s/it]  

 Evaluation Results:
Exact Match: 31.20%
F1 Score: 42.02%
ROUGE2 F1: 21.98%
Question: Who is the headmaster of the Christian Brothers of Ireland Stella Maris College?
Gold Answer: professor Juan Pedro Toni
Predicted Answer: Juan Pedro Toni
Exact Match: 0, F1 Score: 0.8571428571428571
ROUGE-2 F1-score: 0.8
----------------------------------------
Question: What is the ratio of black and Asian schoolchildren to white schoolchildren?
Gold Answer: about six to four
Predicted Answer: 10.9%
Exact Match: 0, F1 Score: 0.0
ROUGE-2 F1-score: 0.0
----------------------------------------
Question: When did Outcault's The Yellow Kid appear in newspapers?
Gold Answer: 1890s
Predicted Answer: 1890s
Exact Match: 1, F1 Score: 1.0
ROUGE-2 F1-score: 0.0
----------------------------------------
Question: When did devolution in the UK begin?
Gold Answer: 1914
Predicted Answer: 1908
Exact Match: 0, F1 Score: 0.0
ROUGE-2 F1-score: 0.0
----------------------------------------
Question: Treating the mitrailleu




## Part 5 - Retrieval-Augmented Question Answering - Dense Retrieval

In this step, we will encode each context and question using BERT. We will then retrieve the k contexts whose embeddings have the highest cosine similarity to the question embedding.

### 5.1 Creating Embeddings for Contexts and Questions 

Here is an example of how to use BERT to encode a sentence. Instead of using the CLS embeddings (as discussed in class) we will pool together the token representations at the last layer by averaging. The resulting representation is a (1,768) tensor. 

In [57]:
# device = "cuda" 
device = "mps" if torch.backends.mps.is_available() else "cpu"

from transformers import BertTokenizer, BertModel # If you run into memory issues, you 

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased').to(device)

inputs = tokenizer("This is a sample sentence.", return_tensors="pt", padding=True, truncation=True, max_length=512).to(device)
with torch.no_grad(): 
    outputs = model(**inputs)
    hidden_states = outputs.last_hidden_state
    embedding = torch.mean(hidden_states, dim=1)  # (batch_size=1, embedding size =768)

In [58]:
embedding.shape

torch.Size([1, 768])

**TODO**: Write code to encode each candidate context. Stack the embeddings together into a single (19035, 768) pytorch tensor that we can save to disk and reload as needed (see above for how to access the candidate contexts). On some lower-resource systems you may have trouble instantiating both BERT and OLMo2 at the same time. Storing the encoded representations allows you to run just OLMo for the QA part.

In [91]:
context_embeddings = []

with torch.no_grad():

    for context in tqdm.tqdm(candidate_contexts, desc="Encoding contexts"):
        inputs = tokenizer(
            context,
            return_tensors="pt",
            padding=True,
            truncation=True,
            max_length=512
        ).to(device)

        outputs = model(**inputs)
        hidden_states = outputs.last_hidden_state
        embedding = torch.mean(hidden_states, dim=1)
        context_embeddings.append(embedding.cpu())

context_embeddings = torch.cat(context_embeddings, dim=0)

Encoding contexts: 100%|██████████| 19035/19035 [08:14<00:00, 38.47it/s]


In [92]:
torch.save(context_embeddings, "context_embeddings.pt")

**TODO**: Similarly encode each question and stack the embeddings together into a single (250, 768) pytorch tensor that we can save to disk and reload as needed.

In [93]:
question_embeddings = []

with torch.no_grad():
    # question_embeddings = ...
    for qa_item in tqdm.tqdm(evaluation_benchmark["qas"], desc="Encoding questions"):
        question = qa_item["question"]

        inputs = tokenizer(
            question,
            return_tensors="pt",
            padding=True,
            truncation=True,
            max_length=512
        ).to(device)

        outputs = model(**inputs)
        hidden_states = outputs.last_hidden_state
        embedding = torch.mean(hidden_states, dim=1)
        question_embeddings.append(embedding.cpu())

question_embeddings = torch.cat(question_embeddings, dim=0)


Encoding questions: 100%|██████████| 250/250 [00:04<00:00, 58.87it/s]


In [94]:
torch.save(question_embeddings, "question_embeddings.pt")

### 5.2 Similarity Retriever

In [95]:
context_embeddings = torch.load("context_embeddings.pt")
question_embeddings = torch.load("question_embeddings.pt")

**TODO**: Write a function `retrieve_cosine(question_embedding, contexts, context_embeddings)` that takes in the embedding for a single question (a [1,768] tensor), a list of contexts (each is a string), and the context embedding tensor [19035,768].
Note that the indices of the context list and the rows of the context_embeddings tensor line up. i.e. `context_embeddings[0]` is the embedding for `contexts[0]`, etc.
You can use `torch.nn.functional.cosine_similarity` (or `F.cosine_similarity` since we imported `torch.nn.functional` as `F`, which is conventional) to calculate the similarities efficiently. You may also want to look at `torch.topk`, but other solutions are possible. 

In [96]:
def retrieve_cosine(question_emb, contexts, context_embeddings, top_k=5):
    # Accept either a (768,) or a (1, 768) question embedding.
    if question_emb.dim() == 2:
        question_vec = question_emb.squeeze(0)
    else:
        question_vec = question_emb

    # Cosine similarity between the question and every context embedding.
    sims = F.cosine_similarity(context_embeddings, question_vec.unsqueeze(0), dim=1)
    _, top_idx = torch.topk(sims, k=top_k)

    return [contexts[i] for i in top_idx.tolist()]
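To make the ranking step concrete, here is a plain-Python sketch of cosine similarity followed by a top-k selection on tiny hand-written vectors. It only illustrates the arithmetic and is not a replacement for the tensor version. `cosine` and `top_k_by_cosine` are hypothetical names, and assigning similarity 0 to a zero vector is an arbitrary choice made for this sketch.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # arbitrary convention for zero vectors in this sketch
    return dot / (norm_u * norm_v)


def top_k_by_cosine(query, vectors, k):
    """Indices of the k vectors most similar to the query, best first."""
    sims = [cosine(query, v) for v in vectors]
    order = sorted(range(len(vectors)), key=lambda i: sims[i], reverse=True)
    return order[:k]


query = [1.0, 0.0]
vectors = [[0.0, 1.0], [2.0, 0.0], [1.0, 1.0]]
# [2, 0] points in the same direction as the query, so its length does not matter
print(top_k_by_cosine(query, vectors, k=2))
```

Because cosine similarity depends only on direction, the vector `[2.0, 0.0]` ranks first even though it is longer than the query.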

In [103]:
retrieve_cosine(question_embeddings[0], candidate_contexts, context_embeddings)

["The Christian Brothers of Ireland Stella Maris College is a private, co-educational, not-for-profit Catholic school located in the wealthy residential southeastern neighbourhood of Carrasco. Established in 1955, it is regarded as one of the best high schools in the country, blending a rigorous curriculum with strong extracurricular activities. The school's headmaster, history professor Juan Pedro Toni, is a member of the Stella Maris Board of Governors and the school is a member of the International Baccalaureate Organization (IBO). Its long list of distinguished former pupils includes economists, engineers, architects, lawyers, politicians and even F1 champions. The school has also played an important part in the development of rugby union in Uruguay, with the creation of Old Christians Club, the school's alumni club.",
 'The National Maritime College of Ireland is also located in Cork and is the only college in Ireland in which Nautical Studies and Marine Engineering can be underta

**TODO**: Write a new version of the add_rag_context function we provided above. This function should now additionally take the question embeddings and context embeddings as parameters, run the retrieval for each question (using the retrieve_cosine function above) and populate a new list of qa_items, including the selected 'rag_contexts'.

In [104]:
def add_rag_context(qa_items, contexts, retriever, question_embeddings, context_embeddings, top_k=5):
    result_list = copy.deepcopy(qa_items)

    for i, item in tqdm.tqdm(enumerate(result_list), desc="Retrieving contexts", total=len(result_list)):
        question_emb = question_embeddings[i].unsqueeze(0)
        item["rag_contexts"] = retriever(question_emb, contexts, context_embeddings, top_k)

    return result_list

In [105]:
rag_qa_items = add_rag_context(evaluation_benchmark['qas'], candidate_contexts, retrieve_cosine, question_embeddings, context_embeddings, top_k=5)

Retrieving contexts: 100%|██████████| 250/250 [00:02<00:00, 90.22it/s]


In [106]:
rag_qa_items[0]

{'question': 'Who is the headmaster of the Christian Brothers of Ireland Stella Maris College?',
 'answer': 'professor Juan Pedro Toni',
 'context': "The Christian Brothers of Ireland Stella Maris College is a private, co-educational, not-for-profit Catholic school located in the wealthy residential southeastern neighbourhood of Carrasco. Established in 1955, it is regarded as one of the best high schools in the country, blending a rigorous curriculum with strong extracurricular activities. The school's headmaster, history professor Juan Pedro Toni, is a member of the Stella Maris Board of Governors and the school is a member of the International Baccalaureate Organization (IBO). Its long list of distinguished former pupils includes economists, engineers, architects, lawyers, politicians and even F1 champions. The school has also played an important part in the development of rugby union in Uruguay, with the creation of Old Christians Club, the school's alumni club.",
 'rag_contexts': 

Run the `evaluate_retriever` function on the new qa_items. In our experiments, we got an accuracy of about 0.4.

In [107]:
evaluate_retriever(rag_qa_items)

0.416

Then, evaluate the rag_qa approach using the revised rag_qa_items. You should get an Exact match better than 20%.  

In [108]:
result = evaluate_qa(rag_qa, rag_qa_items)
present_results(result)

Evaluating QA instances: 100%|██████████| 250/250 [06:11<00:00,  1.49s/it]

 Evaluation Results:
Exact Match: 18.00%
F1 Score: 30.41%
ROUGE2 F1: 15.41%
Question: Who is the headmaster of the Christian Brothers of Ireland Stella Maris College?
Gold Answer: professor Juan Pedro Toni
Predicted Answer: Juan Pedro Toni
Exact Match: 0, F1 Score: 0.8571428571428571
ROUGE-2 F1-score: 0.8
----------------------------------------
Question: What is the ratio of black and Asian schoolchildren to white schoolchildren?
Gold Answer: about six to four
Predicted Answer: 1:1
Exact Match: 0, F1 Score: 0.0
ROUGE-2 F1-score: 0.0
----------------------------------------
Question: When did Outcault's The Yellow Kid appear in newspapers?
Gold Answer: 1890s
Predicted Answer: 1896
Exact Match: 0, F1 Score: 0.0
ROUGE-2 F1-score: 0.0
----------------------------------------
Question: When did devolution in the UK begin?
Gold Answer: 1914
Predicted Answer: with the Government of Ireland Act 1914
Exact Match: 0, F1 Score: 0.2857142857142857
ROUGE-2 F1-score: 0.0
---------------------------




## Part 6 - Experiments

**TODO** For the overlap and dense retrievers (from Parts 4 and 5), what happens when you change the number of retrieved contexts? Present a table of results for k=1, k=5 (already done), k=10, and k=20. 


I tried to write some code for this part to rerun my previous code with different k values and then show the results in a table.

In [112]:
def add_rag_context_overlap(qa_items, contexts, retriever, top_k=5):
    result_items = copy.deepcopy(qa_items)
    for item in tqdm.tqdm(result_items, desc=f"Retrieving contexts (overlap, k={top_k})"):
        question = item["question"]
        retrieved_contexts = retriever(question, contexts, top_k)
        item["rag_contexts"] = retrieved_contexts
    return result_items

def add_rag_context_dense(qa_items, contexts, retriever, question_embeddings, context_embeddings, top_k=5):
    result_items = copy.deepcopy(qa_items)
    for i, item in tqdm.tqdm(enumerate(result_items), desc=f"Retrieving contexts (dense, k={top_k})", total=len(result_items)):
        q_emb = question_embeddings[i].unsqueeze(0)
        retrieved_contexts = retriever(q_emb, contexts, context_embeddings, top_k)
        item["rag_contexts"] = retrieved_contexts
    return result_items


In [113]:
def run_rag_experiment_overlap(qa_items, candidate_contexts, ks=(1, 5, 10, 20), label="Overlap"):
    results = []
    for k in ks:
        rag_items = add_rag_context_overlap(qa_items, candidate_contexts, retrieve_overlap, top_k=k)
        eval_results = evaluate_qa(rag_qa, rag_items)

        exact = sum(r["exact_match"] for r in eval_results) / len(eval_results) * 100
        f1 = sum(r["f1_score"] for r in eval_results) / len(eval_results) * 100
        rouge = sum(r["rouge2_f1"] for r in eval_results) / len(eval_results) * 100

        results.append({
            "Retriever": label,
            "k": k,
            "Exact Match (%)": round(exact, 2),
            "F1 (%)": round(f1, 2),
            "ROUGE-2 (%)": round(rouge, 2),
        })
    return results


def run_rag_experiment_dense(qa_items, candidate_contexts, question_embeddings, context_embeddings, ks=(1, 5, 10, 20), label="Dense"):
    results = []
    for k in ks:
        rag_items = add_rag_context_dense(
            qa_items, candidate_contexts, retrieve_cosine,
            question_embeddings, context_embeddings,
            top_k=k
        )
        eval_results = evaluate_qa(rag_qa, rag_items)

        exact = sum(r["exact_match"] for r in eval_results) / len(eval_results) * 100
        f1 = sum(r["f1_score"] for r in eval_results) / len(eval_results) * 100
        rouge = sum(r["rouge2_f1"] for r in eval_results) / len(eval_results) * 100

        results.append({
            "Retriever": label,
            "k": k,
            "Exact Match (%)": round(exact, 2),
            "F1 (%)": round(f1, 2),
            "ROUGE-2 (%)": round(rouge, 2),
        })
    return results

In [None]:
import pandas as pd

ks = (1, 5, 10)

overlap_results = run_rag_experiment_overlap(
    evaluation_benchmark["qas"],
    candidate_contexts,
    ks=ks,
    label="Overlap"
)

dense_results = run_rag_experiment_dense(
    evaluation_benchmark["qas"],
    candidate_contexts,
    question_embeddings,
    context_embeddings,
    ks=ks,
    label="Dense"
)

df = pd.DataFrame(overlap_results + dense_results)
df

Retrieving contexts (overlap, k=1): 100%|██████████| 250/250 [03:31<00:00,  1.18it/s]
Evaluating QA instances: 100%|██████████| 250/250 [02:15<00:00,  1.85it/s]
Retrieving contexts (overlap, k=5): 100%|██████████| 250/250 [03:32<00:00,  1.18it/s]
Evaluating QA instances: 100%|██████████| 250/250 [09:02<00:00,  2.17s/it]
Retrieving contexts (overlap, k=10): 100%|██████████| 250/250 [03:32<00:00,  1.18it/s]
Evaluating QA instances: 100%|██████████| 250/250 [18:54<00:00,  4.54s/it]
Retrieving contexts (dense, k=1): 100%|██████████| 250/250 [00:03<00:00, 66.09it/s]
Evaluating QA instances: 100%|██████████| 250/250 [02:20<00:00,  1.78it/s]
Retrieving contexts (dense, k=5): 100%|██████████| 250/250 [00:04<00:00, 57.26it/s]
Evaluating QA instances: 100%|██████████| 250/250 [05:52<00:00,  1.41s/it]
Retrieving contexts (dense, k=10): 100%|██████████| 250/250 [00:04<00:00, 59.85it/s]
Evaluating QA instances: 100%|██████████| 250/250 [11:47<00:00,  2.83s/it]


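|   | Retriever | k | Exact Match (%) | F1 (%) | ROUGE-2 (%) |
|---|-----------|---|-----------------|--------|-------------|
| 0 | Overlap | 1 | 23.2 | 32.03 | 14.63 |
| 1 | Overlap | 5 | 31.2 | 42.02 | 21.98 |
| 2 | Overlap | 10 | 31.2 | 42.28 | 23.02 |
| 3 | Dense | 1 | 11.2 | 18.23 | 8.13 |
| 4 | Dense | 5 | 18.0 | 30.41 | 15.41 |
| 5 | Dense | 10 | 23.2 | 35.56 | 18.8 |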
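The sweep above reruns retrieval from scratch for every k, which is the slow step for the overlap retriever. Since both retrievers return contexts ranked best first, one option is to retrieve once at the largest k and take prefixes for the smaller values. The sketch below assumes each item already carries a ranked `'rag_contexts'` list; `truncate_rag_items` and `retrieval_accuracy` are hypothetical helper names, and the toy data is made up for illustration.

```python
def truncate_rag_items(rag_items, k):
    """Copy each item, keeping only its k highest-ranked retrieved contexts."""
    return [{**item, "rag_contexts": item["rag_contexts"][:k]} for item in rag_items]


def retrieval_accuracy(rag_items):
    """Fraction of items whose gold context is among the retrieved contexts."""
    hits = sum(1 for item in rag_items if item["context"] in item["rag_contexts"])
    return hits / len(rag_items)


# Toy items (made up): each has a gold context and a ranked retrieval list
items_at_max_k = [
    {"question": "q1", "context": "A", "rag_contexts": ["B", "A", "C"]},
    {"question": "q2", "context": "X", "rag_contexts": ["X", "Y", "Z"]},
]

for k in (1, 2, 3):
    subset = truncate_rag_items(items_at_max_k, k)
    print(k, retrieval_accuracy(subset))
```

Each truncated list can then be passed to the QA evaluation in place of a fresh retrieval run, provided the ranking is deterministic so that the top-k list is a prefix of the longer one.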


## Part 7 - Improving the QA System

**TODO**
In this part, we ask you to come up with one interesting or novel idea for improving the QA system. Your system does *not* have to outperform the models from part 4 or 5, but for full credit you should implement at least one new idea, beyond just changing parameters. You can either work on better retrieval or better QA/LLM performance. Show the full code for the necessary steps and evaluation results. 

Ideas for improving the retriever include: improved word overlap (better tokenization/ text normalization, using TF-IDF, ...), or choosing a different approach or different model (other than BERT) for calculating context and question embeddings.

For the LLM, you could try a different transformer model, including text-to-text models (e.g. T5).
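To illustrate why TF-IDF-style weighting can help over plain word overlap, here is a small plain-Python sketch in which shared tokens are weighted by inverse document frequency, so words that occur in every context contribute nothing. The regex tokenizer, the IDF variant `log((1 + N) / (1 + df))`, and the function names are all assumptions chosen for this example, not the assignment's reference method.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Simple lowercase word tokenizer (assumption for this sketch)."""
    return re.findall(r"\w+", text.lower())


def build_idf(contexts):
    """Map each token to log((1 + N) / (1 + df)); tokens in every context get 0."""
    n = len(contexts)
    doc_freq = Counter()
    for context in contexts:
        doc_freq.update(set(tokenize(context)))
    return {tok: math.log((1 + n) / (1 + df)) for tok, df in doc_freq.items()}


def retrieve_idf_overlap(question, contexts, idf, top_k=5):
    """Rank contexts by the summed IDF of the tokens they share with the question."""
    question_tokens = set(tokenize(question))
    scores = []
    for context in contexts:
        shared = question_tokens & set(tokenize(context))
        scores.append(sum(idf.get(tok, 0.0) for tok in shared))
    order = sorted(range(len(contexts)), key=lambda i: scores[i], reverse=True)
    return [contexts[i] for i in order[:top_k]]


contexts = ["where is the city", "the dam", "where is the river", "where is the town"]
idf = build_idf(contexts)
# Plain overlap would favor "where is the city" (3 shared words);
# IDF weighting favors the context containing the rare word "dam".
print(retrieve_idf_overlap("where is the dam", contexts, idf, top_k=1))
```

In this toy corpus the question shares three common words with "where is the city" but only two with "the dam"; the IDF weights reverse that ranking because "dam" is rare.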


In [116]:
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

tfidf_vectorizer = TfidfVectorizer(
    lowercase=True,
    stop_words="english",
    ngram_range=(1, 2),
    max_features=50000
)

context_tfidf = tfidf_vectorizer.fit_transform(candidate_contexts)

In [117]:
def retrieve_tfidf(question, contexts, context_tfidf_matrix, vectorizer, top_k=5):
    q_vec = vectorizer.transform([question])
    scores = (context_tfidf_matrix @ q_vec.T).toarray().squeeze(1)

    top_idx = np.argsort(scores)[::-1][:top_k]
    return [contexts[i] for i in top_idx]

In [118]:
rag_qa_items_tfidf = copy.deepcopy(evaluation_benchmark["qas"])

for inst in tqdm.tqdm(rag_qa_items_tfidf, desc="Retrieving contexts (TF-IDF)"):
    inst["rag_contexts"] = retrieve_tfidf(
        inst["question"],
        candidate_contexts,
        context_tfidf,
        tfidf_vectorizer,
        top_k=5
    )

Retrieving contexts (TF-IDF): 100%|██████████| 250/250 [00:00<00:00, 677.01it/s]
