# Homework 2 Part 2

## Course Name: Large Language Models
#### Lecturers: Dr. Soleimani, Dr. Rohban, Dr. Asgari

---

#### Notebooks Supervised By: MohammadAli SadraeiJavaheri, Omid Ghahroodi
#### Notebook Prepared By: Mahdi Zakizadeh, Ali Razghandi

**Contact**: Ask your questions in Quera

---

### Instructions:
- Complete all exercises presented in this notebook.
- Ensure you run each cell after you've entered your solution.
- After completing the exercises, save the notebook and <font color='red'>follow the submission guidelines provided in the PDF.</font>


---

**Note**: Replace the placeholders (between <font color="green">`## Your code begins ##`</font> and <font color="green">`## Your code ends ##`</font>) with the appropriate details.


## 1. Introduction

The advent of Large Language Models (LLMs) has undeniably shifted the paradigm in the realm of natural language processing, offering capabilities that inch closer to human-like text understanding and generation. Among the vanguards of this shift is the Llama-2 model, a behemoth trained on diverse text corpora, promising adeptness in various NLP tasks. However, as we usher into this era of seemingly intelligent machines, a pertinent question arises - do these models truly understand the text, or do they merely excel in retrieving memorized pieces of information from their training data? This inquiry is not merely academic; the implications of the findings reverberate through the practical applications and the future trajectory of LLMs. In exploring the reasoning capabilities of Large Language Models, a noteworthy investigation was carried out by [Saparov and He](https://openreview.net/pdf?id=qFVVBzXxR2V). Their analytical journey led to the revelation that these models, to a significant extent, harness the knowledge acquired during the pre-training phase when confronted with reasoning tasks. Characterized as "greedy reasoners," these models exhibit a propensity to rely on the reservoir of memorized information, as opposed to showcasing authentic reasoning abilities.

Our exploration is set against the backdrop of the SQuAD dataset, a well-regarded benchmark in the question-answering domain. The choice of SQuAD is motivated by its structured evaluation metrics which offer a tangible measure of a model's ability to retrieve and reason over text. While SQuAD has been instrumental in driving progress in question answering, its conventional usage may not fully expose the nuanced capabilities of models like Llama-2. This homework aims to delve deeper by constructing adversarial datasets that challenge the model beyond mere retrieval, probing its ability to reason and refer to the provided context accurately. Through a systematic evaluation on both the original and adversarially-modified versions of the SQuAD dataset, we aspire to dissect the retrieval and reasoning prowess of Llama-2, shedding light on the model's strengths, weaknesses, and the path towards more robust and interpretable LLMs.

Let's begin by setting up our workspace and loading the Llama-2 model to explore its capabilities.



In [31]:
# @title Environment Setup
# Note: Do NOT make changes to this block.
# ----------------
%pip install ctransformers[cuda]>=0.2.24 transformers datasets
!apt-get -y install -qq aria2

from IPython.display import clear_output
import numpy as np
import random
import spacy
import transformers
from tqdm.notebook import tqdm

SEED=21

np.random.seed(SEED)
random.seed(SEED)

!aria2c --console-log-level=error -c -x 16 -s 16 -k 1M https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF/resolve/main/llama-2-7b-chat.Q5_K_M.gguf -d /content/models -o llama-2-7b-chat.Q5_K_M.gguf

clear_output()
# ----------------

In [None]:
# @title Model Initialization
gpu_layers = 200000
context_length = 2048

from ctransformers import AutoModelForCausalLM

llm = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-13B-GGUF",
    model_file="/content/models/llama-2-7b-chat.Q5_K_M.gguf",
    model_type="llama",
    gpu_layers=gpu_layers,
    context_length=context_length,
)

prompt_template = """
[INST] <<SYS>
%s
<</SYS>>
%s[/INST]
"""

clear_output()

Now, it's a good practice to test the model with some inputs to get a feel for its responses before diving into the core analysis.


In [None]:
# @title Let's Test the Model
preprompt = "Your job is to answer the users' question accurately according to the context in shortest way possible. If the answer is not present in the provided context by the user, refuse to answer and yield \\\"Not enough info.\\\" If the answer is present in the context, only return the part of the context relevant to the question. The shorter your answer be, the more score you receive, even if you write a one word instead of a full sentence. Answering based on your prior knowledge is not considered as a good thing." # @param {type:"string"}
test_input = "Who is the current president of Iran?" # @param {type:"string"}

llm(prompt_template % (preprompt, test_input))


'Not enough info.'

### 2.1. Metrics

Evaluating the performance of Large Language Models (LLMs) on question-answering tasks necessitates employing metrics that accurately reflect the models' ability to provide correct and precise answers. Two widely acknowledged metrics for this purpose are Exact Match (EM) and F1 Score, which offer a lens through which the accuracy and the overall quality of the model’s responses can be gauged.

1. **Exact Match (EM)**:
   - The Exact Match metric measures the percentage of responses that match the ground truth answers exactly. It is a stringent metric that requires the predicted answer to be identical to the ground truth answer.
   - Mathematical Equation:
$\text{EM} = \left( \frac{\text{Number of exact matches}}{\text{Total number of questions}} \right) \times 100$

   - Example:
     Suppose we have $5$ questions, and the model answers $3$ of them exactly as in the ground truth. The EM score would be $(3/5) \times 100 = 60 \\% $.



In [None]:
def compute_exact_match_score(predictions: list[str], ground_truths: list[list[str]]):
    exact_match_score = 0
    for prediction, ground_truth in zip(predictions, ground_truths):
        prediction = prediction.lower().strip()
        exact_match_score += any(gt.lower().strip() == prediction for gt in ground_truth)
    em_percentage = (exact_match_score / len(predictions)) * 100
    return em_percentage

2. **F1 Score**:
   - The F1 Score is the harmonic mean of precision and recall, providing a balance between the two. It measures the overlap between the predicted answers and the ground truth, considering both the words that were correctly included and those that were omitted or added incorrectly.
   - Mathematical Equations:
   
  \begin{align}
  \text{Precision} = \frac{\text{Number of true positive words}}{\text{(Number of true positive words + Number of false positive words)}}
  \end{align}

  \begin{align}
  \text{Recall} = \frac{\text{Number of true positive words}}{\text{(Number of true positive words + Number of false negative words)}}
  \end{align}

  \begin{align}
  \text{F1 Score} = 2 \times \left( \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \right)
  \end{align}
  
   - Example:
     Suppose a predicted answer contains $4$ correct words out of $5$ total words, but misses $2$ words that are in the ground truth answer. The precision would be $4/(4+1) = 0.8$, the recall would be $4/(4+2) = 0.67$, and the F1 Score would be $2 \times (0.8 \times 0.67)/(0.8 + 0.67) ≈ 0.73$.

These metrics provide a nuanced view of the model's performance, offering insights into not only how often the model is correct (EM), but also how well it captures the nuances of the ground truth answers (F1 Score). Through these metrics, the evaluation phase aims to paint a comprehensive picture of the model's proficiency in the question-answering task amidst the structured framework provided by the SQuAD dataset.

In [None]:
def compute_f1_score(predictions: list[str], ground_truths: list[list[str]]):
    total_f1_score = 0
    for prediction, ground_truth in zip(predictions, ground_truths):
        prediction_words = prediction.lower().strip().split()
        best_f1 = 0
        for gt in ground_truth:
            gt_words = gt.lower().strip().split()
            common_words_count = sum(1 for word in prediction_words if word in gt_words)
            precision = common_words_count / len(prediction_words)
            recall = common_words_count / len(gt_words)
            f1 = (2 * precision * recall) / (precision + recall) if (precision + recall) > 0 else 0
            best_f1 = max(best_f1, f1)
        total_f1_score += best_f1
    average_f1_score = total_f1_score / len(predictions)
    return average_f1_score


### 2.2. Loading the Dataset and Evaluating the Model

Now, let's put the model to the test on the vanilla dataset to see how it performs. The steps we are going to follow are quite straightforward: First, we'll load up the dataset, and then we'll feed it to the model and evaluate the results using the score functions you've implemented earlier. To keep things manageable and ensure a quick run time, we'll use a subset of the SQuAD dataset for this evaluation.

In the following step, we'll load a subset of the SQuAD dataset which will be used for evaluating the model. This dataset contains a variety of questions along with the correct answers which we'll compare against the model's responses. After running the code block, you should see a sample row from the dataset, giving you a glimpse of the kind of questions and answers it contains.

In [None]:
!pip install datasets

In [4]:
# @title Loading the SQuAD Dataset Subset
from datasets import load_dataset
dataset = load_dataset('squad', split="validation")
dataset_test = dataset.shard(num_shards=10, index=0)

clear_output()
dataset_test[0]

{'id': '56be4db0acb8001400a502ec',
 'title': 'Super_Bowl_50',
 'context': 'Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi\'s Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the "golden anniversary" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as "Super Bowl L"), so that the logo could prominently feature the Arabic numerals 50.',
 'question': 'Which NFL team represented the AFC at Super Bowl 50?',
 'answers': {'text': ['Denver Broncos', 'Denver Broncos', 'Denver Broncos'],


With the dataset ready, it's time to see how Llama-2 fares. We'll feed the questions from the dataset to the model and collect its answers. Then, we'll use the score functions to calculate the Exact Match and F1 scores for each response, giving us a clear picture of the model's performance on this dataset.

In [None]:
# @title Evaluating Llama-2 on the Dataset
predictions = []
ground_truths = []

for example in tqdm(dataset_test):
    input_text = f"Question: {example['question']} Context: {example['context']}"
    output_text = llm(prompt_template % (preprompt, input_text))
    predictions.append(output_text)
    ground_truths.append(example['answers']['text'])

em_score = compute_exact_match_score(predictions, ground_truths),
f1_score = compute_f1_score(predictions, ground_truths)

print(f"EM Score={em_score}, F1 Score={f1_score}")

  0%|          | 0/1057 [00:00<?, ?it/s]

EM Score=(16.93472090823084,), F1 Score=0.3671884374572664


In [None]:
import pickle

# Save the list to a file using pickle
with open('predictions.pkl', 'wb') as f:
    pickle.dump(predictions, f)

Having seen how the model performs on the vanilla dataset, let’s delve into some analytical reflections:
1. <font color="green"> What do you think is the better metric for evaluating Llama-2 on this dataset and why? </font>
2. <font color="green"> How can preprompt text affect the evaluation and the model's performance? </font>


In [None]:
dataset_test[12]

{'id': '56beb03c3aeaaa14008c920b',
 'title': 'Super_Bowl_50',
 'context': "The league eventually narrowed the bids to three sites: New Orleans' Mercedes-Benz Superdome, Miami's Sun Life Stadium, and the San Francisco Bay Area's Levi's Stadium.",
 'question': 'What venue in Miami was a candidate for the site of Super Bowl 50?',
 'answers': {'text': ['Sun Life Stadium',
   'Sun Life Stadium',
   'Sun Life Stadium'],
  'answer_start': [102, 102, 102]}}

In [None]:
predictions[12]

"Miami's Sun Life Stadium."

Please refer to the report.

## 3. Adversarial Dataset Construction

In this section, we venture into the realm of adversarial evaluation to delve deeper into the abilities of the Llama-2 model. The objective is to scrutinize how the model responds to scenarios that are crafted to challenge its reasoning and retrieval capacities. We propose three methods to create adversarial datasets, each aimed at examining different facets of the model's behavior.

1. **Answer Absence**: In this method, we modify the SQuAD dataset by crafting questions for which the answers do not exist in the provided context.

2. **Entity Substitution**: Here, we substitute entity words in the context with other entities to test whether the model relies on retrieval or refers to the context accurately for answering the question. For instance, changing the context from "The president of the USA lives in the White House. Barack Obama is the current president of the USA." to "The president of the USA lives in the White House. Gall Granuaile is the current president of the USA." and observing if the answer changes appropriately.

3. **Nonsense Word Substitution**: In this method, we replace certain words or entities with nonsensical words in a consistent and meaningful way, defining the nonsense words before asking the question. For example, replacing "White House" with "Glibber House" and explaining that "Glibber" means "White".

Before embarking on the evaluation using adversarial datasets, we encourage students to ponder upon a few analytical questions:
<font color="green">

3. What is your expectation regarding the model's performance on these adversarial datasets?
4. How might the model's behavior on standard versus adversarial datasets inform us about its reasoning and retrieval abilities?

</font>

Please refer to the report.

### 3.1. Answer Absence

#### Modifying the Dataset

For this section we need to modify the original dataset in the way that for each example there will be a new context that is totally different with the original context of the example.

To do so, we suggest that you use the title feature in each example and then swap the context between examples that do not have the same title.

*  Of course, this is just a suggestion and you can feel free to implement this section as you desire, as long as it meets the required criteria.

Some key points:

*   The goal is for each example to have a new context that differs from the original.
* Using the title of each example is one potential way to pair up examples for swapping contexts.
* Feel free to use any approach for generating new contexts as long as they meaningfully differ from the originals.
* The modified dataset should meet the specifications and requirements for the assignment.
* Be creative in how you modify the contexts - the approach suggested is just one option.


    

In [None]:
from collections import defaultdict
original_group_contexts = defaultdict(list)
for ex in dataset_test:
   original_group_contexts[ex['title']].append(ex['context'])

## Your code begins ##
np.random.seed(SEED)
keys_list = list(original_group_contexts.keys())
values_list = list(original_group_contexts.values())
np.random.shuffle(values_list)
adversarial_group_contexts = {key: value for key,value in zip(keys_list,values_list)}
## Your code ends ##

def create_adversarial_example(example):
    ## Your code begins ##
    title = example['title']
    adversarial_context = np.random.choice(adversarial_group_contexts[title])
    adversarial_example = {
        'title': title,
        'context': example['context'],
        'new_context': adversarial_context,
        'question': example['question'],
        'answers': example['answers']
    }
    return adversarial_example
    ## Your code ends ##

shuffled_context_dataset = dataset_test.map(create_adversarial_example)

Map:   0%|          | 0/1057 [00:00<?, ? examples/s]

In [None]:
for ex in shuffled_context_dataset:
    assert(ex['context'] != ex['new_context'])

In [None]:
shuffled_context_dataset[0]

{'id': '56be4db0acb8001400a502ec',
 'title': 'Super_Bowl_50',
 'context': 'Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi\'s Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the "golden anniversary" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as "Super Bowl L"), so that the logo could prominently feature the Arabic numerals 50.',
 'question': 'Which NFL team represented the AFC at Super Bowl 50?',
 'answers': {'text': ['Denver Broncos', 'Denver Broncos', 'Denver Broncos'],


#### Evaluating Model Performance on Modified Dataset

Now we will test the performance of our model on the modified dataset to determine how reliant it is on the original contexts.

- Compare the model predictions to the original correct answers.
- Calculate evaluation metrics.
- Analyze whether there is a significant decrease in model performance on the modified dataset and explain your thoughts.


In [None]:
###########################################
### Evaluating original answers section ###
###########################################

predictions2 = []
ground_truths2 = []

for example in tqdm(shuffled_context_dataset):
    input_text = f"Context: {example['new_context']} \nQuestion: {example['question']}"
    output_text = llm(prompt_template % (preprompt, input_text))
    predictions2.append(output_text)
    ground_truths2.append(example['answers']['text'])

em_score = compute_exact_match_score(predictions2, ground_truths2),
f1_score = compute_f1_score(predictions2, ground_truths2)

print(f"EM Score={em_score}, F1 Score={f1_score}")

  0%|          | 0/1057 [00:00<?, ?it/s]

EM Score=(0.9460737937559129,), F1 Score=0.01965991172715234


In [None]:
import pickle

# Save the list to a file using pickle
with open('predictions2.pkl', 'wb') as f:
    pickle.dump(predictions2, f)

#### Evaluating "Not Enough Info." Responses

In the prompt we specified that the model should respond "Not enough info." if the context lacks the information needed to answer the question.

Now we will evaluate the model's performance on these "not enough info." responses.

Which evaluation metric should we use and why?


In [None]:
###########################################
### Evaluating modified answers section ###
###########################################

## Your code begins ##
ground_truths2 = []

for example in tqdm(shuffled_context_dataset):
    ground_truths2.append(["Not enough info."])

em_score2 = compute_exact_match_score(predictions2, ground_truths2)

print(f"EM Score={em_score2}")
## Your code ends ##

  0%|          | 0/1057 [00:00<?, ?it/s]

EM Score=80.79470198675497


#### Analyzing Model Responses

Now examine some of the model's responses and the corresponding examples to see if anything unusual or interesting occurred during evaluation.

**Steps:**

1. Sample some model responses across the dataset.

2. Analyze the input example and model's response.

4. Dig deeper into the model's response and explain why this is the case.

5. Possible insights:

  - Is model hallucinating or fabricating information?

  - Does model seem biased or inconsistent?

  - Does the model rely too much on the context?


Please refer to the report.

In [28]:
dataset_test[20]

{'id': '56beb3083aeaaa14008c923e',
 'title': 'Super_Bowl_50',
 'context': 'Despite waiving longtime running back DeAngelo Williams and losing top wide receiver Kelvin Benjamin to a torn ACL in the preseason, the Carolina Panthers had their best regular season in franchise history, becoming the seventh team to win at least 15 regular season games since the league expanded to a 16-game schedule in 1978. Carolina started the season 14–0, not only setting franchise records for the best start and the longest single-season winning streak, but also posting the best start to a season by an NFC team in NFL history, breaking the 13–0 record previously shared with the 2009 New Orleans Saints and the 2011 Green Bay Packers. With their NFC-best 15–1 regular season record, the Panthers clinched home-field advantage throughout the NFC playoffs for the first time in franchise history. Ten players were selected to the Pro Bowl (the most in franchise history) along with eight All-Pro selections.',
 'que

In [26]:
predictions[20]

'The Panthers had the best record in the NFC with a 15-1 regular season record.'

In [27]:
predictions2[20]

'The best record in the NFC was held by the Dallas Cowboys with 374 wins and 206 losses.'

### 3.2. Entity Substitution

#### Modifying Entities in Examples

For this section, we need to modify the entities in each example with different entities from the same domain.

For example, the sentence "Joe Biden is the president of the US" could be changed to "Akbar is the king of England".

To do this, we recommend using the spaCy library and its named entity recognition (NER) capabilities.

**Steps:**

1. Load the `en_core_web_sm` model in spaCy.

2. Identify named entities in each example text.

3. Decide which entities could be swapped out.

4. Replace entities with new random ones from the same domain.


**Of course, this is just a suggestion and you can feel free to implement this section as you desire, as long as it meets the required criteria.**


In [33]:
import spacy

nlp = spacy.load("en_core_web_sm")

labels = nlp.get_pipe("ner").labels

for label in labels:
    print(label)
    print(spacy.explain(label))
    print('-------------------------------')


CARDINAL
Numerals that do not fall under another type
-------------------------------
DATE
Absolute or relative dates or periods
-------------------------------
EVENT
Named hurricanes, battles, wars, sports events, etc.
-------------------------------
FAC
Buildings, airports, highways, bridges, etc.
-------------------------------
GPE
Countries, cities, states
-------------------------------
LANGUAGE
Any named language
-------------------------------
LAW
Named documents made into laws.
-------------------------------
LOC
Non-GPE locations, mountain ranges, bodies of water
-------------------------------
MONEY
Monetary values, including unit
-------------------------------
NORP
Nationalities or religious or political groups
-------------------------------
ORDINAL
"first", "second", etc.
-------------------------------
ORG
Companies, agencies, institutions, etc.
-------------------------------
PERCENT
Percentage, including "%"
-------------------------------
PERSON
People, including fict

In [34]:
'''
EVENT
Named hurricanes, battles, wars, sports events, etc.
-------------------------------
FAC
Buildings, airports, highways, bridges, etc.
-------------------------------
GPE
Countries, cities, states
-------------------------------
LANGUAGE
Any named language
-------------------------------
LAW
Named documents made into laws.
-------------------------------
LOC
Non-GPE locations, mountain ranges, bodies of water
-------------------------------
NORP
Nationalities or religious or political groups
-------------------------------
ORG
Companies, agencies, institutions, etc.
-------------------------------
PERSON
People, including fictional
-------------------------------
PRODUCT
Objects, vehicles, foods, etc. (not services)
-------------------------------
WORK_OF_ART
Titles of books, songs, etc.
-------------------------------
'''
entities = {
    "EVENT": [
        "Hurricane Katrina (2005)",
        "Battle of Waterloo (1815)",
        "World War II (1939-1945)",
        "Super Bowl LVI (2022)",
        "Vietnam War (1955-1975)",
        "Hurricane Sandy (2012)",
        "Gulf War (1990-1991)",
        "French Open (annual event)",
        "Battle of Gettysburg (1863)",
        "FIFA World Cup 2022"
    ],
    "FAC": [
        "LaGuardia Airport (New York City)",
        "Golden Gate Bridge (San Francisco, CA)",
        "CN Tower (Toronto, Canada)",
        "Heathrow Airport Terminal 5 (London)",
        "Shanghai Metro (Shanghai, China)",
        "Hoover Dam (Nevada/Arizona, US)",
        "Burj Khalifa (Dubai, UAE)",
        "Cape Canaveral Space Force Station (Florida, US)",
        "CERN Hadron Collider (Geneva, Switzerland)",
        "Shanghai Tunnel (Shanghai, China)"
    ],
    "GPE": [
        "Paris, France",
        "Canada",
        "California, US",
        "India",
        "Mexico",
        "Germany",
        "New South Wales, Australia",
        "Jakarta, Indonesia",
        "Shanghai, China",
        "Texas, US"
    ],
    "LANGUAGE": [
        "English",
        "Mandarin Chinese",
        "Spanish",
        "Arabic",
        "Russian",
        "French",
        "German",
        "Japanese",
        "Hindi",
        "Portuguese"
    ],
    "LAW": [
        "United States Constitution",
        "Magna Carta (England, 1215)",
        "Code of Hammurabi (Babylonia, ~1754 BCE)",
        "Declaration of Independence (US, 1776)",
        "Bill of Rights (US, 1791)",
        "Geneva Conventions (1864, 1906, 1929, 1949)",
        "Universal Declaration of Human Rights (UN, 1948)",
        "Treaty of Versailles (1919)",
        "Patient Protection and Affordable Care Act (US, 2010)",
        "Civil Rights Act (US, 1964)"
    ],
    "LOC": [
        "Sahara Desert (Africa)",
        "Amazon River (South America)",
        "Mount Everest (Asia)",
        "Pacific Ocean",
        "Hudson River (New York, US)",
        "Urals Mountains (Russia)",
        "Lake Victoria (Africa)",
        "Strait of Gibraltar (border of Europe/Africa)",
        "Antarctica",
        "Mariana Trench (western Pacific Ocean)"
    ],

    "NORP": [
        "Arabs",
        "Hispanics",
        "Kurds",
        "Tamils",
        "Hutus",
        "Pashtuns",
        "Hmong",
        "Israelis",
        "Basques",
        "Chechens"
    ],
    "ORG": [
        "United Nations",
        "Microsoft Corporation",
        "Mayo Clinic",
        "Taliban",
        "NASA",
        "Starbucks",
        "FIFA",
        "Centers for Disease Control and Prevention (CDC)",
        "European Union",
        "Harvard University"
    ],
    "PERSON": [
        "Barack Obama",
        "Queen Elizabeth II",
        "Cristiano Ronaldo",
        "J.K. Rowling",
        "Elon Musk",
        "Taylor Swift",
        "Donald Trump",
        "Serena Williams",
        "Jeff Bezos",
        "Malala Yousafzai"
    ],
    "PRODUCT": [
        "iPhone",
        "Coca-Cola",
        "Boeing 747",
        "Harry Potter books",
        "Lego",
        "PlayStation 5",
        "Tesla Model S",
        "Ikea Billy bookcase",
        "Honda Civic",
        "Heinz ketchup"
    ],
    "WORK_OF_ART": [
        "Mona Lisa (painting by Leonardo da Vinci)",
        "Hamlet (play by Shakespeare)",
        "The Starry Night (painting by van Gogh)",
        "Thriller (album by Michael Jackson)",
        "The Odyssey (epic poem by Homer)",
        "The Divine Comedy (poem by Dante)",
        "Pride and Prejudice (novel by Jane Austen)",
        "La Gioconda (opera by Ponchielli)",
        "Broadway musical Hamilton",
        "Hey Jude (song by The Beatles)"
    ]
}

In [35]:
np.random.seed(SEED)

def change_example_entities(example):
    ## Your code begins ##
    doc = nlp(example['context'])
    entity_mapping = {}

    for ent in doc.ents:
        label = ent.label_
        text = ent.text
        if label in entities:
            replacement_entity = np.random.choice(entities[label])
            entity_mapping[text] = replacement_entity

    for old_entity, new_entity in entity_mapping.items():
        example['context'] = example['context'].replace(old_entity, new_entity)
        example['question'] = example['question'].replace(old_entity, new_entity)
        for i in range(len(example['answers']['text'])):
          example['answers']['text'][i] = example['answers']['text'][i].replace(old_entity, new_entity)

    return example
    ## Your code ends ##

changed_entiy_dataset = dataset_test.map(change_example_entities)

Map:   0%|          | 0/1057 [00:00<?, ? examples/s]

In [None]:
changed_entiy_dataset[0]

{'id': '56be4db0acb8001400a502ec',
 'title': 'Super_Bowl_50',
 'context': 'Gulf War (1990-1991) 50 was an Basques football game to determine the champion of NASA (United Nations) for the 2015 season. The Basques Football Conference (AFC) champion Jeff Bezos defeated Taliban (NFC) champion Cristiano Ronaldo 24–10 to earn their third Gulf War (1990-1991) title. The game was played on February 7, 2016, at CERN Hadron Collider (Geneva, Switzerland) in Mariana Trench (western Pacific Ocean) at New South Wales, Australia, Paris, France. As this was the 50th Gulf War (1990-1991), the league emphasized the "golden anniversary" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Gulf War (1990-1991) game with Hutus numerals (under which the game would have been known as "Gulf War (1990-1991) L"), so that the logo could prominently feature the Israelis numerals 50.',
 'question': 'Which United Nations team represented the AFC at Gulf War (1990-199

#### Evaluating Model on Modified Entities

Now we will evaluate our model's performance on the dataset with modified entities.

**Steps:**

1. Check model performance on original correct answers.

2. Check performance on modified answers.

   - Calculate metrics on answers changed to match context.

3. Examine some model responses.

   - Analyze model behavior on modified examples.

   - Explain anything interesting about model responses.

**Key Points**

- Evaluate on original answers as a baseline.

- Also evaluate on modified answers matching context.

- Compare metrics - does performance decrease?

- Inspect some responses for insightful model behaviors.



In [None]:
# @title Evaluating Llama-2 on the Dataset
predictions = []
ground_truths = []

for example in tqdm(dataset_test):
    input_text = f"Question: {example['question']} Context: {example['context']}"
    output_text = llm(prompt_template % (preprompt, input_text))
    predictions.append(output_text)
    ground_truths.append(example['answers']['text'])

em_score = compute_exact_match_score(predictions, ground_truths),
f1_score = compute_f1_score(predictions, ground_truths)

print(f"EM Score={em_score}, F1 Score={f1_score}")

  0%|          | 0/1057 [00:00<?, ?it/s]

EM Score=(16.93472090823084,), F1 Score=0.3671884374572664


In [None]:
###########################################
### Evaluating modified answers section ###
###########################################

## Your code begins ##
predictions3 = []
ground_truths3 = []

for example in tqdm(changed_entiy_dataset):
    input_text = f"Context: {example['context']} \nQuestion: {example['question']}"
    output_text = llm(prompt_template % (preprompt, input_text))
    predictions3.append(output_text)
    ground_truths3.append(example['answers']['text'])

em_score = compute_exact_match_score(predictions3, ground_truths3),
f1_score = compute_f1_score(predictions3, ground_truths3)

print(f"EM Score={em_score}, F1 Score={f1_score}")
## Your code ends ##

  0%|          | 0/1057 [00:00<?, ?it/s]

EM Score=(21.665089877010406,), F1 Score=0.3788809504407445


In [None]:
import pickle

# Save the list to a file using pickle
with open('predictions3.pkl', 'wb') as f:
    pickle.dump(predictions3, f)

In [46]:
dataset_test[78]

{'id': '56d9cb47dc89441400fdb836',
 'title': 'Super_Bowl_50',
 'context': "With 4:51 left in regulation, Carolina got the ball on their own 24-yard line with a chance to mount a game-winning drive, and soon faced 3rd-and-9. On the next play, Miller stripped the ball away from Newton, and after several players dove for it, it took a long bounce backwards and was recovered by Ward, who returned it five yards to the Panthers 4-yard line. Although several players dove into the pile to attempt to recover it, Newton did not and his lack of aggression later earned him heavy criticism. Meanwhile, Denver's offense was kept out of the end zone for three plays, but a holding penalty on cornerback Josh Norman gave the Broncos a new set of downs. Then Anderson scored on a 2-yard touchdown run and Manning completed a pass to Bennie Fowler for a 2-point conversion, giving Denver a 24–10 lead with 3:08 left and essentially putting the game away. Carolina had two more drives, but failed to get a first 

In [49]:
predictions[78]

'Player: Newton'

In [48]:
changed_entiy_dataset[78]

{'id': '56d9cb47dc89441400fdb836',
 'title': 'Super_Bowl_50',
 'context': "With J.K. Rowling left in regulation, Hudson River (New York, US) got the ball on their own 24-yard line with a chance to mount a game-winning drive, and soon faced 3rd-and-9. On the next play, Barack Obama stripped the ball away from Taliban, and after several players dove for it, it took a long bounce backwards and was recovered by Microsoft Corporation, who returned it five yards to the Panthers 4-yard line. Although several players dove into the pile to attempt to recover it, Taliban did not and his lack of aggression later earned him heavy criticism. Meanwhile, New South Wales, Australia's offense was kept out of the end zone for three plays, but a holding penalty on cornerback Donald Trump gave the Lego a new set of downs. Then Cristiano Ronaldo scored on a 2-yard touchdown run and Serena Williams completed a pass to J.K. Rowling for a 2-point conversion, giving New South Wales, Australia a 24–10 lead with

In [50]:
predictions3[78]

'Ronaldo'

### 3.3. Nonsense Word Substitution

In this segment of the adversarial dataset construction, our primary aim is to assess the model's ability to adapt to new, artificially coined terms and evaluate its reasoning capabilities based on the provided context. We will implement a systematic approach to generate nonsense words, replace identifiable entities in the dataset with these generated words, and provide a definition for each nonsense word. This process encapsulates the essence of exploring how well the model can understand and use newly defined terms to answer questions accurately.

The first task at hand is to design a function that generates nonsense words. The goal here is to create a word that doesn't carry any pre-existing meaning. The function `generate_nonsense_word` below is your starting point. Implement the function such that it creates and returns a nonsense word.

In [None]:
# @title Generate Nonsense Words (Your Implementation)
import string
random.seed(SEED)

def generate_nonsense_word():
    ## Your code begins ##
    length = random.randint(2,10)
    nonsense_word = ''.join(random.choice(string.ascii_lowercase) for _ in range(length))
    return nonsense_word

nonsense_word = generate_nonsense_word()
print("Generated Nonsense Word:", nonsense_word)
    ## Your code ends ##

Generated Nonsense Word: nwnu


Having devised a mechanism to create nonsense words, we transition into the heart of this section—creating the adversarial dataset. We will employ the Spacy library's Named Entity Recognition (NER) system to identify entities within the text. Each identified entity will be replaced by a generated nonsense word, and a definition will be provided for every replacement. The create_adversarial_example function below encapsulates this task. Implement the function, and upon executing it, you will observe a sample example from the adversarial dataset that illustrates the substitutions and definitions.

In [None]:
# @title Create Adversarial Dataset (Your Implementation)
nlp = spacy.load("en_core_web_sm")

def create_adversarial_example(example):
    doc = nlp(example['context'])

    ## Your code begins ##
    entity_replacements = {}
    for ent in doc.ents:
      entity = ent.text
      nonsense_word = generate_nonsense_word()
      entity_replacements[entity] = nonsense_word

    altered_context = example['context']
    altered_question = example['question']

    for entity, nonsense_word in entity_replacements.items():
        altered_context = altered_context.replace(entity, nonsense_word)
        altered_question = altered_question.replace(entity, nonsense_word)

    definitions = {v: k for k, v in entity_replacements.items()}
    ## Your code ends ##

    return {
      'altered_context': altered_context,
      'altered_question': altered_question,
      'definitions': ', '.join([f'{k} is another word for {v}' for k, v in definitions.items()]),
    }

adversarial_examples = dataset_test.map(create_adversarial_example)

clear_output()
adversarial_examples[0]

{'id': '56be4db0acb8001400a502ec',
 'title': 'Super_Bowl_50',
 'context': 'Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi\'s Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the "golden anniversary" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as "Super Bowl L"), so that the logo could prominently feature the Arabic numerals 50.',
 'question': 'Which NFL team represented the AFC at Super Bowl 50?',
 'answers': {'text': ['Denver Broncos', 'Denver Broncos', 'Denver Broncos'],


With the adversarial dataset in place, the stage is set for evaluating the model's performance. We aim to uncover how well the model navigates through the maze of newly introduced terms while clinging to the definitions provided. Implement the evaluation code block below to gauge the model's performance on this adversarial dataset. The insights garnered from this exercise will shed light on the model's ability to adapt to new information and reason based on provided definitions, which is a step closer to understanding the model's reasoning faculties.

In [None]:
preprompt2 = preprompt + ' Use definitions list to recover main words and return these main words.'

In [None]:
# @title Evaluating Llama-2 on the Adversarial Dataset
predictions4 = []
ground_truths4 = []

for example in tqdm(adversarial_examples):
    input_text = f"Question: {example['altered_question']} Context: {example['altered_context']} Definitions: {example['definitions']}"
    output_text = llm(prompt_template % (preprompt2, input_text))
    predictions4.append(output_text)
    ground_truths4.append(example['answers']['text'])

em_score = compute_exact_match_score(predictions4, ground_truths4),
f1_score = compute_f1_score(predictions4, ground_truths4)

print(f"EM Score={em_score}, F1 Score={f1_score}")

  0%|          | 0/1057 [00:00<?, ?it/s]

EM Score=(3.5004730368968775,), F1 Score=0.20720127477626685


In [None]:
import pickle

# Save the list to a file using pickle
with open('predictions4.pkl', 'wb') as f:
    pickle.dump(predictions4, f)

## 4. Conclusion
This exercise navigates through the curious interplay of reasoning and retrieval within Large Language Models, particularly focusing on the Llama-2 model. Through meticulous evaluation and crafting adversarial datasets, we aim to provide a window into the model's behavior, shedding light on its strengths, weaknesses, and its approach to deciphering and responding to questions under varying conditions.

Now, reflect upon the model's performance and share your insights:


5. <font color="green"> Did the model's performance align with your expectations? </font>
6. <font color="green"> How do the adversarial evaluations contribute to our understanding of the model's strengths and weaknesses in terms of reasoning and retrieval? </font>



Please refer to the report.