# Assignment 2: Retrieval-Augmented Question Answering with LLMs

In this assignment, we will use a large language model (LLM) in a retrieval-augmented setup to answer questions from the Natural Questions dataset. We will evaluate the performance of the LLM using different prompting strategies and compare the results. The steps involved are as follows:

1. **Evaluate an LLM on Natural Questions without context.**
2. **Evaluate an LLM on Natural Questions with provided context.**
3. **Set up a retriever to find relevant passages.**
4. **Evaluate the LLM using retrieved context instead of provided context.**


In [1]:
import pandas as pd
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer

nq_data = pd.read_csv('nq_simplified.val.tsv', sep='\t', header=None, names=['question', 'answer', 'gold_context'], quoting=3)

nq_data.head()

| | question | answer | gold_context |
|---|---|---|---|
| 0 | what purpose did seasonal monsoon winds have o... | enabled European empire expansion into the Ame... | The westerlies (blue arrows) and trade winds (... |
| 1 | who got the first nobel prize in physics | Wilhelm Conrad Röntgen, of Germany | The award is presented in Stockholm at an annu... |
| 2 | when is the next deadpool movie being released | May 18, 2018 | Though the original creative team of Reynolds,... |
| 3 | where did the idea of fortnite come from | as a cross between Minecraft and Left 4 Dead | Fortnite is set in contemporary Earth, where t... |
| 4 | which mode is used for short wave broadcast se... | MFSK Olivia | All one needs is a pair of transceivers, each ... |


In [2]:
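def rouge1(gold, predicted):
    """Micro-averaged, set-based unigram precision, recall and F1."""
    assert len(gold) == len(predicted)
    n_p = 0  # total unique tokens in the predictions
    n_g = 0  # total unique tokens in the gold answers
    n_c = 0  # total tokens shared by prediction and gold
    for g, p in zip(gold, predicted):
        g = set(cleanup(g).strip().split())
        p = set(cleanup(p).strip().split())
        n_g += len(g)
        n_p += len(p)
        n_c += len(p.intersection(g))
    # Guard against empty predictions or gold answers (division by zero).
    pr = n_c / n_p if n_p > 0 else 0.0
    re = n_c / n_g if n_g > 0 else 0.0
    if pr > 0 and re > 0:
        f1 = 2*pr*re/(pr + re)
    else:
        f1 = 0.0
    return pr, re, f1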

def cleanup(text):
    text = text.replace(',', ' ')
    text = text.replace('.', ' ')
    return text
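
As a quick sanity check of the metric, the sketch below recomputes the same set-based unigram overlap for a single gold/predicted pair. The helper name `unigram_overlap` is hypothetical, the gold answer is borrowed from row 2 of the preview above, and the prediction is invented for illustration. The tokenization mirrors `cleanup` (commas and periods become spaces, no lowercasing). Because tokens are collected into sets, repeated words count once, which differs slightly from count-based ROUGE-1.

```python
def unigram_overlap(gold, predicted):
    """Set-based unigram precision, recall and F1 for one gold/predicted pair."""
    def tokens(text):
        # Same normalization as cleanup(): commas and periods become spaces.
        return set(text.replace(',', ' ').replace('.', ' ').split())

    g, p = tokens(gold), tokens(predicted)
    common = len(g & p)
    precision = common / len(p) if p else 0.0
    recall = common / len(g) if g else 0.0
    f1 = 2 * precision * recall / (precision + recall) if common else 0.0
    return precision, recall, f1


gold = "May 18, 2018"
# Invented, deliberately verbose prediction.
verbose = "The movie was released on May 18, 2018 in the United States."
pr, re, f1 = unigram_overlap(gold, verbose)
print(f"precision={pr:.2f} recall={re:.2f} f1={f1:.2f}")
```

A verbose answer that contains every gold token gets full recall but low precision, so capping the answer length or asking for short answers mainly affects the precision side of the score.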

## Step 1: Evaluating an LLM on Natural Questions

First, we will pick an LLM from the Hugging Face Hub that is small enough to fit on our system and fast enough to run.
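
Whether a model fits is largely a question of weight memory. The sketch below is a back-of-the-envelope estimate only: it assumes a nominal 7 billion parameters for a "7B" model and counts the weights alone, ignoring activations, the KV cache, and the extra scales and zero-points that quantized formats store. The function name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Approximate size of the model weights alone, in gigabytes (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


n_params = 7e9  # nominal parameter count for a "7B" model (assumption)
for label, bits in [("fp16", 16), ("8-bit", 8), ("4-bit (e.g. AWQ)", 4)]:
    print(f"{label:>18}: ~{weight_memory_gb(n_params, bits):.1f} GB")
```

Under these assumptions a 4-bit quantized 7B model needs roughly a quarter of the weight memory of its fp16 counterpart, which is why quantized checkpoints are attractive on a single consumer GPU.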

In [18]:
# Load the model and tokenizer.
# Candidate models:
#   "mistralai/Mixtral-8x7B-Instruct-v0.1"
#   "microsoft/Phi-3-mini-4k-instruct"
#   "mtgv/MobileLLaMA-1.4B-Base"
#   "TheBloke/Mistral-7B-Instruct-v0.2-AWQ"
#   "TheBloke/Llama-2-7B-Chat-AWQ"
model_name = "TheBloke/Mistral-7B-Instruct-v0.2-AWQ" 
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name,
    low_cpu_mem_usage=True,
    device_map="cuda:0")


Let's create a simple pipeline for QA, run it on a small subset of the dataset to get predicted answers, and evaluate the model using ROUGE-1 scores.

In [19]:
%%time
# Create a pipeline for question answering
qa_pipeline = pipeline("text-generation", model=model, tokenizer=tokenizer, return_full_text=False, 
                       pad_token_id=tokenizer.eos_token_id, max_new_tokens=400,)

# Define a function to get answers from the model
def get_answers(questions, prompt, max_new_tokens):
    answers = []
    for question in questions:
        p = prompt.format(question) 
        response = qa_pipeline(p,max_new_tokens=max_new_tokens)
        answers.append(response[0]['generated_text'])
    return answers

# Use a small subset of the dataset for testing
prompts = [ 
    "{}",
    '[INST] {} [/INST]',
    '[INST] {} [/INST] Answer:',
    '<s>[INST] You are a question answering system. Please give a short answer to the following user question.[/INST] Question: {}',
    '<s>Question: {} [INST]Generate a short-form answer of either one word or one sentence to the question.[/INST]',
          ]
subset = nq_data.sample(50)
questions = subset['question'].tolist()
gold_answers = subset['answer'].tolist()
results=[]
for prompt in prompts:
    predicted_answers = get_answers(questions, prompt,max_new_tokens=50)
    pr, re, f1 = rouge1(gold_answers, predicted_answers)
    answer = [prompt, pr, re, f1, "without context", "Mistral-7B-Instruct-v0.2-AWQ", 50]
    results.append(answer)
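columns = ['Prompt', 'Rouge-1 Precision', 'Rouge-1 Recall', 'Rouge-1 F1', 'Method', 'Model', 'Sample Size']
result_1 = pd.DataFrame(results, columns=columns)
# `result` accumulates rows across cells and runs; start it on first use.
result = pd.concat([result, result_1], ignore_index=True) if 'result' in globals() else result_1
display(result)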

| | Prompt | Rouge-1 Precision | Rouge-1 Recall | Rouge-1 F1 | Method | Model | Sample Size |
|---|---|---|---|---|---|---|---|
| 0 | `{}` | 0.043831 | 0.247706 | 0.074483 | without context | Llama-2-7B-Chat-AWQ | 50 |
| 1 | `[INST] {} [/INST]` | 0.050388 | 0.298165 | 0.086207 | without context | Llama-2-7B-Chat-AWQ | 50 |
| 2 | `[INST] {} [/INST] Answer:` | 0.054357 | 0.311927 | 0.09258 | without context | Llama-2-7B-Chat-AWQ | 50 |
| 3 | `<s>[INST] You are a question answering system....` | 0.076065 | 0.344037 | 0.124585 | without context | Llama-2-7B-Chat-AWQ | 50 |
| 4 | `<s>Question: {} [INST]Generate a short-form an...` | 0.067293 | 0.197248 | 0.10035 | without context | Llama-2-7B-Chat-AWQ | 50 |
| 5 | `Context:{}, Question: {}, Answer:` | 0.10195 | 0.527523 | 0.170877 | with context | Llama-2-7B-Chat-AWQ | 50 |
| 6 | `[INST] Context:{}\n Question:{}[/INST]` | 0.110522 | 0.573394 | 0.185322 | with context | Llama-2-7B-Chat-AWQ | 50 |
| 7 | `[INST] Context:{}\n Question:{}[/INST] Answer:` | 0.11343 | 0.573394 | 0.189394 | with context | Llama-2-7B-Chat-AWQ | 50 |
| 8 | `<s>Context:{}\n [INST] You are a question answ...` | 0.132495 | 0.62844 | 0.21885 | with context | Llama-2-7B-Chat-AWQ | 50 |
| 9 | `<s>Context:{}\n Question:{} [INST] Using the i...` | 0.153285 | 0.385321 | 0.219321 | with context | Llama-2-7B-Chat-AWQ | 50 |


CPU times: user 5min 15s, sys: 148 ms, total: 5min 15s
Wall time: 5min 15s


## Step 2: An Idealized Retrieval-Augmented LLM

Now, we will include the `gold_context` in our prompts and evaluate the model again. This gives an idealized upper bound for the retrieval-augmented system, since a real retriever can at best return the gold passage.
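
The only change to the prompt is a second placeholder: the context fills the first `{}` and the question the second. The minimal sketch below makes that ordering explicit; `build_prompt` is a hypothetical helper and the passage text is a stand-in, not data from the dataset.

```python
def build_prompt(template, question, context=None):
    """Fill a prompt template; the context, when given, goes in the first slot."""
    if context is None:
        return template.format(question)
    return template.format(context, question)


template = '[INST] Context:{}\n Question:{}[/INST] Answer:'
prompt = build_prompt(
    template,
    question="who got the first nobel prize in physics",
    context="(passage text would go here)",  # stand-in for a gold passage
)
print(prompt)
```

Since `str.format` only parses the template, passages that happen to contain literal braces are inserted unchanged. Swapping the two positional arguments would silently put the question in the context slot, which is an easy mistake to make with unnamed placeholders.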


In [20]:
%%time
# Define a function to get answers from the model with context
def get_answers_with_context(questions, contexts, prompt, max_new_tokens):
    answers = []
    for question, context in zip(questions, contexts):
        p = prompt.format(context,question) 
        response = qa_pipeline(p, max_new_tokens=max_new_tokens)
        answers.append(response[0]['generated_text'])
    return answers

contexts = subset['gold_context'].tolist()

prompts = [ 
    'Context:{}, Question: {}, Answer:',
    '[INST] Context:{}\n Question:{}[/INST]',
    '[INST] Context:{}\n Question:{}[/INST] Answer:',
    '<s>Context:{}\n [INST] You are a question answering system. Based on the above context, generate a short answer to the following question.[/INST]\n Question: {}',
    '<s>Context:{}\n Question:{} [INST] Using the information provided in the context, generate a short-form answer of either one word or one sentence to the question. [/INST]',
          ]
results=[]
for prompt in prompts:
    predicted_answers_with_context = get_answers_with_context(questions, contexts, prompt, max_new_tokens=50)
    pr_context, re_context, f1_context = rouge1(gold_answers, predicted_answers_with_context)
    answer = [prompt, pr_context, re_context, f1_context, "with context", "Mistral-7B-Instruct-v0.2-AWQ", 50]
    results.append(answer)
result_2 = pd.DataFrame(results, columns=['Prompt', 'Rouge-1 Precision', 'Rouge-1 Recall', 'Rouge-1 F1', 'Method', 'Model', 'Sample Size'])
result = pd.concat([result, result_2], ignore_index=True)
result

CPU times: user 5min 10s, sys: 1.39 s, total: 5min 11s
Wall time: 5min 11s


| | Prompt | Rouge-1 Precision | Rouge-1 Recall | Rouge-1 F1 | Method | Model | Sample Size |
|---|---|---|---|---|---|---|---|
| 0 | `{}` | 0.043831 | 0.247706 | 0.074483 | without context | Llama-2-7B-Chat-AWQ | 50 |
| 1 | `[INST] {} [/INST]` | 0.050388 | 0.298165 | 0.086207 | without context | Llama-2-7B-Chat-AWQ | 50 |
| 2 | `[INST] {} [/INST] Answer:` | 0.054357 | 0.311927 | 0.09258 | without context | Llama-2-7B-Chat-AWQ | 50 |
| 3 | `<s>[INST] You are a question answering system....` | 0.076065 | 0.344037 | 0.124585 | without context | Llama-2-7B-Chat-AWQ | 50 |
| 4 | `<s>Question: {} [INST]Generate a short-form an...` | 0.067293 | 0.197248 | 0.10035 | without context | Llama-2-7B-Chat-AWQ | 50 |
| 5 | `Context:{}, Question: {}, Answer:` | 0.10195 | 0.527523 | 0.170877 | with context | Llama-2-7B-Chat-AWQ | 50 |
| 6 | `[INST] Context:{}\n Question:{}[/INST]` | 0.110522 | 0.573394 | 0.185322 | with context | Llama-2-7B-Chat-AWQ | 50 |
| 7 | `[INST] Context:{}\n Question:{}[/INST] Answer:` | 0.11343 | 0.573394 | 0.189394 | with context | Llama-2-7B-Chat-AWQ | 50 |
| 8 | `<s>Context:{}\n [INST] You are a question answ...` | 0.132495 | 0.62844 | 0.21885 | with context | Llama-2-7B-Chat-AWQ | 50 |
| 9 | `<s>Context:{}\n Question:{} [INST] Using the i...` | 0.153285 | 0.385321 | 0.219321 | with context | Llama-2-7B-Chat-AWQ | 50 |


## Step 3: Setting Up the Retriever

In this step, we will set up a retriever to find relevant passages. We will use a SentenceTransformers model (`all-MiniLM-L6-v2`) to create embeddings for the passages and build a FAISS index over them for efficient retrieval.
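
Conceptually, an exact (flat) L2 index compares the query vector against every stored vector and returns the closest ones. The sketch below reproduces that idea in plain Python on invented 3-dimensional vectors; the names `squared_l2` and `search` are hypothetical, and real `all-MiniLM-L6-v2` embeddings have 384 dimensions.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def search(index_vectors, query, k=1):
    """Return (distances, ids) of the k stored vectors closest to the query."""
    scored = sorted((squared_l2(v, query), i) for i, v in enumerate(index_vectors))
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]


# Toy "embeddings" (invented for illustration).
toy_passages = ["passage about physics", "passage about films", "passage about radio"]
toy_vectors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
query_vector = [0.9, 0.1, 0.0]

distances, ids = search(toy_vectors, query_vector, k=1)
print(toy_passages[ids[0]], distances[0])
```

The brute-force scan is linear in the number of passages, which is fine for a small corpus; a library index does the same comparison far more efficiently.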


In [21]:
#!pip install faiss-cpu sentence_transformers

In [22]:
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

retriever_model = SentenceTransformer('all-MiniLM-L6-v2')

with open('passages.txt', 'r', encoding='utf-8') as file:
    passages = file.readlines()

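# FAISS works on float32 NumPy arrays, so encode directly to NumPy.
embedded_passages = retriever_model.encode(passages, convert_to_numpy=True)

# Exact (brute-force) L2 index over the passage embeddings.
index = faiss.IndexFlatL2(embedded_passages.shape[1])
index.add(embedded_passages)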


In [23]:
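def retrieve_best_passage(question):
    """Return the passage whose embedding is closest to the question's."""
    question_embedding = retriever_model.encode([question], convert_to_numpy=True)
    # Search for the single nearest passage; ix has shape (1, 1).
    _, ix = index.search(question_embedding, 1)
    return passages[ix[0][0]]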

question = "Where did the first African American air force unit train?"
best_passage = retrieve_best_passage(question)
print(f"Best Passage: {best_passage}")


Best Passage: Black Americans were not permitted to fly for the U.S. armed services prior to 1940. The Air Corps at that time, which had never had a single black member, was part of an army that possessed exactly two black Regular line officers at the beginning of World War II: Brigadier Generals Benjamin O. Davis, Sr. and Benjamin O. Davis, Jr. The first Civilian Pilot Training Program (CPTP) students completed their instruction in May 1940. The creation of an all-black pursuit squadron resulted from pressure by civil rights organizations and the black press who pushed for the establishment of a unit at Tuskegee, an



## Step 4: Putting the Pieces Together

Finally, we will retrieve passages for each question and evaluate the model using these passages instead of the provided `gold_context`.

In [24]:
def get_answers_with_retrieved_context(questions, prompt, max_new_tokens):
    answers = []
    for question in questions:
        context = retrieve_best_passage(question)
        p = prompt.format(context,question)
        response = qa_pipeline(p, max_new_tokens=max_new_tokens)
        answers.append(response[0]['generated_text'])
    return answers
prompts = [ 
    'Context:{}, Question: {}, Answer:',
    '[INST] Context:{}\n Question:{}[/INST]',
    '[INST] Context:{}\n Question:{}[/INST] Answer:',
    '<s>Context:{}\n [INST] You are a question answering system. Based on the above context, generate a short answer to the following question.[/INST]\n Question: {}',
    '<s>Context:{}\n Question:{} [INST] Using the information provided in the context, generate a short-form answer of either one word or one sentence to the question. [/INST]',
    '<s> Context: {}\n Question: {} [INST] Select the passage from the provided context that best answers the question in either one word or a sentence. [/INST]',
          ]
results=[]
for prompt in prompts:
    predicted_answers_retrieved_context = get_answers_with_retrieved_context(questions,prompt, max_new_tokens=50)
    pr_retrieved, re_retrieved, f1_retrieved = rouge1(gold_answers, predicted_answers_retrieved_context)
    answer = [prompt, pr_retrieved, re_retrieved, f1_retrieved, "Retriever", "Mistral-7B-Instruct-v0.2-AWQ", 50]
    results.append(answer)
result_3 = pd.DataFrame(results, columns=['Prompt', 'Rouge-1 Precision', 'Rouge-1 Recall', 'Rouge-1 F1', 'Method', 'Model', 'Sample Size'])
result = pd.concat([result, result_3], ignore_index=True)
result

| | Prompt | Rouge-1 Precision | Rouge-1 Recall | Rouge-1 F1 | Method | Model | Sample Size |
|---|---|---|---|---|---|---|---|
| 0 | `{}` | 0.043831 | 0.247706 | 0.074483 | without context | Llama-2-7B-Chat-AWQ | 50 |
| 1 | `[INST] {} [/INST]` | 0.050388 | 0.298165 | 0.086207 | without context | Llama-2-7B-Chat-AWQ | 50 |
| 2 | `[INST] {} [/INST] Answer:` | 0.054357 | 0.311927 | 0.09258 | without context | Llama-2-7B-Chat-AWQ | 50 |
| 3 | `<s>[INST] You are a question answering system....` | 0.076065 | 0.344037 | 0.124585 | without context | Llama-2-7B-Chat-AWQ | 50 |
| 4 | `<s>Question: {} [INST]Generate a short-form an...` | 0.067293 | 0.197248 | 0.10035 | without context | Llama-2-7B-Chat-AWQ | 50 |
| 5 | `Context:{}, Question: {}, Answer:` | 0.10195 | 0.527523 | 0.170877 | with context | Llama-2-7B-Chat-AWQ | 50 |
| 6 | `[INST] Context:{}\n Question:{}[/INST]` | 0.110522 | 0.573394 | 0.185322 | with context | Llama-2-7B-Chat-AWQ | 50 |
| 7 | `[INST] Context:{}\n Question:{}[/INST] Answer:` | 0.11343 | 0.573394 | 0.189394 | with context | Llama-2-7B-Chat-AWQ | 50 |
| 8 | `<s>Context:{}\n [INST] You are a question answ...` | 0.132495 | 0.62844 | 0.21885 | with context | Llama-2-7B-Chat-AWQ | 50 |
| 9 | `<s>Context:{}\n Question:{} [INST] Using the i...` | 0.153285 | 0.385321 | 0.219321 | with context | Llama-2-7B-Chat-AWQ | 50 |


## Summary

In this notebook, we evaluated the performance of an LLM on the Natural Questions dataset using different prompting strategies. We observed the following:

1. **Without Context**:
   - ROUGE-1 Precision: *result*
   - Recall: *result*
   - F1 Score: *result*

2. **With Provided Context**:
   - ROUGE-1 Precision: *result*
   - Recall: *result*
   - F1 Score: *result*

3. **With Retrieved Context**:
   - ROUGE-1 Precision: *result*
   - Recall: *result*
   - F1 Score: *result*
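
One way to fill in the placeholders above is to report the best-scoring prompt per method. The sketch below does this in plain Python for the F1 column, using the ten rows of the results table shown earlier in the notebook (labelled Llama-2-7B-Chat-AWQ there). With the `result` DataFrame in memory, grouping by the `Method` column in pandas would be a natural alternative. No retrieved-context rows appear in that table, so that method is absent here.

```python
# (method, F1) pairs copied from the results table earlier in the notebook.
rows = [
    ("without context", 0.074483),
    ("without context", 0.086207),
    ("without context", 0.09258),
    ("without context", 0.124585),
    ("without context", 0.10035),
    ("with context", 0.170877),
    ("with context", 0.185322),
    ("with context", 0.189394),
    ("with context", 0.21885),
    ("with context", 0.219321),
]

# Keep the highest F1 seen for each method.
best = {}
for method, f1 in rows:
    best[method] = max(f1, best.get(method, 0.0))

for method, f1 in best.items():
    print(f"{method}: best F1 = {f1:.3f}")
```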

Including context, whether provided or retrieved, ...
