# Homework 2 Part 2

## Course Name: Large Language Models
#### Lecturers: Dr. Soleimani, Dr. Rohban, Dr. Asgari

---

#### Notebooks Supervised By: MohammadAli SadraeiJavaheri, Omid Ghahroodi
#### Notebook Prepared By: Mahdi Zakizadeh, Ali Razghandi

**Contact**: Ask your questions in Quera

---

### Instructions:
- Complete all exercises presented in this notebook.
- Ensure you run each cell after you've entered your solution.
- After completing the exercises, save the notebook and <font color='red'>follow the submission guidelines provided in the PDF.</font>


---

**Note**: Replace the placeholders (between <font color="green">`## Your code begins ##`</font> and <font color="green">`## Your code ends ##`</font>) with the appropriate details.


## 1. Introduction

The advent of Large Language Models (LLMs) has undeniably shifted the paradigm in the realm of natural language processing, offering capabilities that inch closer to human-like text understanding and generation. Among the vanguards of this shift is the Llama-2 model, a behemoth trained on diverse text corpora, promising adeptness in various NLP tasks. However, as we usher into this era of seemingly intelligent machines, a pertinent question arises - do these models truly understand the text, or do they merely excel in retrieving memorized pieces of information from their training data? This inquiry is not merely academic; the implications of the findings reverberate through the practical applications and the future trajectory of LLMs. In exploring the reasoning capabilities of Large Language Models, a noteworthy investigation was carried out by [Saparov and He](https://openreview.net/pdf?id=qFVVBzXxR2V). Their analytical journey led to the revelation that these models, to a significant extent, harness the knowledge acquired during the pre-training phase when confronted with reasoning tasks. Characterized as "greedy reasoners," these models exhibit a propensity to rely on the reservoir of memorized information, as opposed to showcasing authentic reasoning abilities.

Our exploration is set against the backdrop of the SQuAD dataset, a well-regarded benchmark in the question-answering domain. The choice of SQuAD is motivated by its structured evaluation metrics which offer a tangible measure of a model's ability to retrieve and reason over text. While SQuAD has been instrumental in driving progress in question answering, its conventional usage may not fully expose the nuanced capabilities of models like Llama-2. This homework aims to delve deeper by constructing adversarial datasets that challenge the model beyond mere retrieval, probing its ability to reason and refer to the provided context accurately. Through a systematic evaluation on both the original and adversarially-modified versions of the SQuAD dataset, we aspire to dissect the retrieval and reasoning prowess of Llama-2, shedding light on the model's strengths, weaknesses, and the path towards more robust and interpretable LLMs.

Let's begin by setting up our workspace and loading the Llama-2 model to explore its capabilities.



In [4]:
# @title Environment Setup
# Note: Do NOT make changes to this block.
# ----------------
%pip install ctransformers[cuda]>=0.2.24 transformers datasets
!apt-get -y install -qq aria2

from IPython.display import clear_output
import numpy as np
import random
import spacy
import transformers
from tqdm.notebook import tqdm

SEED=21

np.random.seed(SEED)
random.seed(SEED)

!aria2c --console-log-level=error -c -x 16 -s 16 -k 1M https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF/resolve/main/llama-2-7b-chat.Q5_K_M.gguf -d /content/models -o llama-2-7b-chat.Q5_K_M.gguf

clear_output()
# ----------------

In [6]:
# @title Model Initialization
gpu_layers = 200000
context_length = 2048

from ctransformers import AutoModelForCausalLM

llm = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-13B-GGUF",
    model_file="/content/models/llama-2-7b-chat.Q5_K_M.gguf",
    model_type="llama",
    gpu_layers=gpu_layers,
    context_length=context_length,
)

prompt_template = """
[INST] <<SYS>
%s
<</SYS>>
%s[/INST]
"""

clear_output()

Now, it's a good practice to test the model with some inputs to get a feel for its responses before diving into the core analysis.


In [7]:
# @title Let's Test the Model
preprompt = "Your job is to answer the users' question accurately according to the context in shortest way possible. If the answer is not present in the provided context by the user, refuse to answer and yield \\\"Not enough info.\\\" If the answer is present in the context, only return the part of the context relevant to the question. The shorter your answer be, the more score you receive, even if you write a one word instead of a full sentence. Answering based on your prior knowledge is not considered as a good thing." # @param {type:"string"}
test_input = "Who is the current president of Iran?" # @param {type:"string"}

llm(prompt_template % (preprompt, test_input))


'Not enough info.'

### 2.1. Metrics

Evaluating the performance of Large Language Models (LLMs) on question-answering tasks necessitates employing metrics that accurately reflect the models' ability to provide correct and precise answers. Two widely acknowledged metrics for this purpose are Exact Match (EM) and F1 Score, which offer a lens through which the accuracy and the overall quality of the model’s responses can be gauged.

1. **Exact Match (EM)**:
   - The Exact Match metric measures the percentage of responses that match the ground truth answers exactly. It is a stringent metric that requires the predicted answer to be identical to the ground truth answer.
   - Mathematical Equation:
$\text{EM} = \left( \frac{\text{Number of exact matches}}{\text{Total number of questions}} \right) \times 100$

   - Example:
     Suppose we have $5$ questions, and the model answers $3$ of them exactly as in the ground truth. The EM score would be $(3/5) \times 100 = 60 \\% $.



In [8]:
def compute_exact_match_score(predictions: list[str], ground_truths: list[list[str]]):
    exact_match_score = 0
    for prediction, ground_truth in zip(predictions, ground_truths):
        prediction = prediction.lower().strip()
        exact_match_score += any(gt.lower().strip() == prediction for gt in ground_truth)
    em_percentage = (exact_match_score / len(predictions)) * 100
    return em_percentage

2. **F1 Score**:
   - The F1 Score is the harmonic mean of precision and recall, providing a balance between the two. It measures the overlap between the predicted answers and the ground truth, considering both the words that were correctly included and those that were omitted or added incorrectly.
   - Mathematical Equations:
   
  \begin{align}
  \text{Precision} = \frac{\text{Number of true positive words}}{\text{(Number of true positive words + Number of false positive words)}}
  \end{align}

  \begin{align}
  \text{Recall} = \frac{\text{Number of true positive words}}{\text{(Number of true positive words + Number of false negative words)}}
  \end{align}

  \begin{align}
  \text{F1 Score} = 2 \times \left( \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \right)
  \end{align}
  
   - Example:
     Suppose a predicted answer contains $4$ correct words out of $5$ total words, but misses $2$ words that are in the ground truth answer. The precision would be $4/(4+1) = 0.8$, the recall would be $4/(4+2) = 0.67$, and the F1 Score would be $2 \times (0.8 \times 0.67)/(0.8 + 0.67) ≈ 0.73$.

These metrics provide a nuanced view of the model's performance, offering insights into not only how often the model is correct (EM), but also how well it captures the nuances of the ground truth answers (F1 Score). Through these metrics, the evaluation phase aims to paint a comprehensive picture of the model's proficiency in the question-answering task amidst the structured framework provided by the SQuAD dataset.

In [9]:
def compute_f1_score(predictions: list[str], ground_truths: list[list[str]]):
    total_f1_score = 0
    for prediction, ground_truth in zip(predictions, ground_truths):
        prediction_words = prediction.lower().strip().split()
        best_f1 = 0
        for gt in ground_truth:
            gt_words = gt.lower().strip().split()
            common_words_count = sum(1 for word in prediction_words if word in gt_words)
            precision = common_words_count / len(prediction_words)
            recall = common_words_count / len(gt_words)
            f1 = (2 * precision * recall) / (precision + recall) if (precision + recall) > 0 else 0
            best_f1 = max(best_f1, f1)
        total_f1_score += best_f1
    average_f1_score = total_f1_score / len(predictions)
    return average_f1_score


### 2.2. Loading the Dataset and Evaluating the Model

Now, let's put the model to the test on the vanilla dataset to see how it performs. The steps we are going to follow are quite straightforward: First, we'll load up the dataset, and then we'll feed it to the model and evaluate the results using the score functions you've implemented earlier. To keep things manageable and ensure a quick run time, we'll use a subset of the SQuAD dataset for this evaluation.

In the following step, we'll load a subset of the SQuAD dataset which will be used for evaluating the model. This dataset contains a variety of questions along with the correct answers which we'll compare against the model's responses. After running the code block, you should see a sample row from the dataset, giving you a glimpse of the kind of questions and answers it contains.

In [10]:
# @title Loading the SQuAD Dataset Subset
from datasets import load_dataset
dataset = load_dataset('squad', split="validation")
dataset_test = dataset.shard(num_shards=30, index=0)

clear_output()
dataset_test[0]

{'id': '56be4db0acb8001400a502ec',
 'title': 'Super_Bowl_50',
 'context': 'Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi\'s Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the "golden anniversary" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as "Super Bowl L"), so that the logo could prominently feature the Arabic numerals 50.',
 'question': 'Which NFL team represented the AFC at Super Bowl 50?',
 'answers': {'text': ['Denver Broncos', 'Denver Broncos', 'Denver Broncos'],


With the dataset ready, it's time to see how Llama-2 fares. We'll feed the questions from the dataset to the model and collect its answers. Then, we'll use the score functions to calculate the Exact Match and F1 scores for each response, giving us a clear picture of the model's performance on this dataset.

In [None]:
# @title Evaluating Llama-2 on the Dataset
predictions = []
ground_truths = []

for example in tqdm(dataset_test):
    input_text = f"Question: {example['question']} Context: {example['context']}"
    output_text = llm(prompt_template % (preprompt, input_text))
    predictions.append(output_text)
    ground_truths.append(example['answers']['text'])

em_score = compute_exact_match_score(predictions, ground_truths),
f1_score = compute_f1_score(predictions, ground_truths)

print(f"EM Score={em_score}, F1 Score={f1_score}")

  0%|          | 0/353 [00:00<?, ?it/s]

EM Score=(18.69688385269122,), F1 Score=0.37896507808988594


Having seen how the model performs on the vanilla dataset, let’s delve into some analytical reflections:
1. <font color="green"> What do you think is the better metric for evaluating Llama-2 on this dataset and why? </font>
2. <font color="green"> How can preprompt text affect the evaluation and the model's performance? </font>


<font color="red"> Asnwer 1: </font> <br>

The F1 score might be a better metric for evaluating Llama-2 on this dataset. Because there are often many valid ways to express the same idea. The F1 score allows this, rewarding answers that are essentially correct even if they are not word-for-word matches. This metric can provide a better view of the model's understanding and generation capabilities.

<font color="red"> Asnwer 2: </font> <br>

Preprompt text can frame the answer model generates which is useful for extracting and evaluating the results. Also, it could direct the model to refrain from hallucinating and generating extra information not asked of it. Defining constraint that limit the way the model think is another useful effect of preprompt texts.

## 3. Adversarial Dataset Construction

In this section, we venture into the realm of adversarial evaluation to delve deeper into the abilities of the Llama-2 model. The objective is to scrutinize how the model responds to scenarios that are crafted to challenge its reasoning and retrieval capacities. We propose three methods to create adversarial datasets, each aimed at examining different facets of the model's behavior.

1. **Answer Absence**: In this method, we modify the SQuAD dataset by crafting questions for which the answers do not exist in the provided context.

2. **Entity Substitution**: Here, we substitute entity words in the context with other entities to test whether the model relies on retrieval or refers to the context accurately for answering the question. For instance, changing the context from "The president of the USA lives in the White House. Barack Obama is the current president of the USA." to "The president of the USA lives in the White House. Gall Granuaile is the current president of the USA." and observing if the answer changes appropriately.

3. **Nonsense Word Substitution**: In this method, we replace certain words or entities with nonsensical words in a consistent and meaningful way, defining the nonsense words before asking the question. For example, replacing "White House" with "Glibber House" and explaining that "Glibber" means "White".

Before embarking on the evaluation using adversarial datasets, we encourage students to ponder upon a few analytical questions:
<font color="green">

3. What is your expectation regarding the model's performance on these adversarial datasets?
4. How might the model's behavior on standard versus adversarial datasets inform us about its reasoning and retrieval abilities?

</font>

<font color="red"> Asnwer 3: </font> <br>

Answer Absence: I would except this method to have the worst results as the answers don't exist in the context so it impossible for the model to give correct answers. The model might either fabricate a response or indicate that the answer is not available.

Entity Substitution: In this case, the model's performance might vary. If the LLM heavily relies on memorized knowledge, it may not recognize the incorrect or altered context and provide an answer based on its pre-existing knowledge. So the performance of the model might not decrease significantly.

Nonsense Word Substitution: This is the most challenging scenario for the model. If the nonsense word is consistently defined and used, a LLM might adapt to this new vocabulary and answer correctly. However, this requires the model to not only understand the context but also to adapt to new information that contradicts its training data.

<font color="red"> Asnwer 4: </font> <br>

A model's performance on adversarial datasets can highlight whether it truly understands the content or simply retrieves memorized information. Struggles with entity substitution or nonsense word substitution could indicate reliance on memorization over contextual understanding.

How a model handles nonsense word substitution can demonstrate its ability to adapt to new information and context, a crucial aspect of human-like understanding and reasoning.

Differences in performance between standard and adversarial datasets can help identify specific weaknesses or biases in the model, guiding future improvements and training strategies.

The model's ability to handle adversarial scenarios can indicate its robustness and reliability, which are important for practical applications, especially in unpredictable or unconventional contexts.


### 3.1. Answer Absence

#### Modifying the Dataset

For this section we need to modify the original dataset in the way that for each example there will be a new context that is totally different with the original context of the example.

To do so, we suggest that you use the title feature in each example and then swap the context between examples that do not have the same title.

*  Of course, this is just a suggestion and you can feel free to implement this section as you desire, as long as it meets the required criteria.

Some key points:

*   The goal is for each example to have a new context that differs from the original.
* Using the title of each example is one potential way to pair up examples for swapping contexts.
* Feel free to use any approach for generating new contexts as long as they meaningfully differ from the originals.
* The modified dataset should meet the specifications and requirements for the assignment.
* Be creative in how you modify the contexts - the approach suggested is just one option.


    

In [None]:
from collections import defaultdict
original_group_contexts = defaultdict(list)
for ex in dataset_test:
   original_group_contexts[ex['title']].append(ex['context'])

def create_adversarial_example(example):
    ## Your code begins ##
    indices = list(range(len(dataset_test)))
    random.shuffle(indices)
    new_context = None
    for idx in indices:
      if dataset_test[idx]['title'] != example['title']:
        new_context = dataset_test[idx]['context']
        break
    example['new_context'] = new_context
    return example
    ## Your code ends ##

shuffled_context_dataset = dataset_test.map(create_adversarial_example)

## Your code begins ##
adversial_group_contexts = defaultdict(list)
for ex in shuffled_context_dataset:
   adversial_group_contexts[ex['title']].append(ex['context'])
## Your code ends ##

Map:   0%|          | 0/353 [00:00<?, ? examples/s]

In [None]:
for ex in shuffled_context_dataset:
    assert(ex['context'] != ex['new_context'])

#### Evaluating Model Performance on Modified Dataset

Now we will test the performance of our model on the modified dataset to determine how reliant it is on the original contexts.

- Compare the model predictions to the original correct answers.
- Calculate evaluation metrics.
- Analyze whether there is a significant decrease in model performance on the modified dataset and explain your thoughts.


In [None]:
###########################################
### Evaluating original answers section ###
###########################################

predictions = []
ground_truths = []

for example in tqdm(shuffled_context_dataset):
    input_text = f"Context: {example['new_context']} \nQuestion: {example['question']}"
    output_text = llm(prompt_template % (preprompt, input_text))
    predictions.append(output_text)
    ground_truths.append(example['answers']['text'])

em_score = compute_exact_match_score(predictions, ground_truths),
f1_score = compute_f1_score(predictions, ground_truths)

print(f"EM Score={em_score}, F1 Score={f1_score}")

  0%|          | 0/353 [00:00<?, ?it/s]

EM Score=(0.28328611898017,), F1 Score=0.021694415653218217


This clearly shows that the absence of proper context can heavily demolish the performance of the model. Which is an obvious result of the lack of information.

#### Evaluating "Not Enough Info." Responses

In the prompt we specified that the model should respond "Not enough info." if the context lacks the information needed to answer the question.

Now we will evaluate the model's performance on these "not enough info." responses.

Which evaluation metric should we use and why?


In [None]:
###########################################
### Evaluating modified answers section ###
###########################################

## Your code begins ##

## Your code ends ##


I did this part in the cell above it. Considering not enough information is the best output of the model in this dataset, exact match score would not be a good metric for evaluating the model as it would get a very small score eventhough it's generating the best results it can.

#### Analyzing Model Responses

Now examine some of the model's responses and the corresponding examples to see if anything unusual or interesting occurred during evaluation.

**Steps:**

1. Sample some model responses across the dataset.

2. Analyze the input example and model's response.

4. Dig deeper into the model's response and explain why this is the case.

5. Possible insights:


<font color="red"> Asnwer: </font> <br>

  - Is model hallucinating or fabricating information? 

  The model is hallucinating fro  time to time but it seems like that the preprompt text has done a great job of keeping it from fabricating information as it often expresses it doesn't have enough information.

  - Does model seem biased or inconsistent?

  Model outputs are often "not enough information". It seems like the model is biased toward not generating an actual answer. This might be because it is relying too much on the context and isn't using the data learned in pretraining.

  - Does the model rely too much on the context?

  As asnwered before the model seems to be relying way too much on the context and not using it's previous knowledge as it often claimes it doesn't have enough information.


### 3.2. Entity Substitution

#### Modifying Entities in Examples

For this section, we need to modify the entities in each example with different entities from the same domain.

For example, the sentence "Joe Biden is the president of the US" could be changed to "Akbar is the king of England".

To do this, we recommend using the spaCy library and its named entity recognition (NER) capabilities.

**Steps:**

1. Load the `en_core_web_sm` model in spaCy.

2. Identify named entities in each example text.

3. Decide which entities could be swapped out.

4. Replace entities with new random ones from the same domain.


**Of course, this is just a suggestion and you can feel free to implement this section as you desire, as long as it meets the required criteria.**


In [None]:
import spacy
nlp = spacy.load("en_core_web_sm")

doc = nlp("This is a sentence.")
print([(w.text, w.pos_) for w in doc])

[('This', 'PRON'), ('is', 'AUX'), ('a', 'DET'), ('sentence', 'NOUN'), ('.', 'PUNCT')]


In [None]:
import spacy

nlp = spacy.load("en_core_web_sm")
ner = nlp.get_pipe("ner")
labels = ner.labels

for label in labels:
    print(label)
    print(spacy.explain(label))
    print('-------------------------------')


In [17]:
'''
EVENT
Named hurricanes, battles, wars, sports events, etc.
-------------------------------
FAC
Buildings, airports, highways, bridges, etc.
-------------------------------
GPE
Countries, cities, states
-------------------------------
LANGUAGE
Any named language
-------------------------------
LAW
Named documents made into laws.
-------------------------------
LOC
Non-GPE locations, mountain ranges, bodies of water
-------------------------------
NORP
Nationalities or religious or political groups
-------------------------------
ORG
Companies, agencies, institutions, etc.
-------------------------------
PERSON
People, including fictional
-------------------------------
PRODUCT
Objects, vehicles, foods, etc. (not services)
-------------------------------
WORK_OF_ART
Titles of books, songs, etc.
-------------------------------
'''
entities = {
    "EVENT": [
        "Hurricane Katrina",
        "Battle of Waterloo",
        "World War II",
        "Super Bowl",
        "Vietnam War",
        "Hurricane Sandy",
        "Gulf War",
        "French Open",
        "Battle of Gettysburg",
        "FIFA World Cup"
    ],
    "FAC": [
        "LaGuardia Airport",
        "Golden Gate Bridge",
        "CN Tower",
        "Heathrow Airport Terminal",
        "Shanghai Metro",
        "Hoover Dam",
        "Burj Khalifa",
        "Cape Canaveral Space Force Station",
        "CERN Hadron Collider",
        "Shanghai Tunnel"
    ],
    "GPE": [
        "Paris",
        "France",
        "Canada",
        "California",
        "US",
        "India",
        "Mexico",
        "Germany",
        "New South Wales",
        "Australia",
        "Jakarta",
        "Indonesia",
        "Shanghai",
        "China",
        "Texas"
    ],
    "LANGUAGE": [
        "English",
        "Mandarin Chinese",
        "Spanish",
        "Arabic",
        "Russian",
        "French",
        "German",
        "Japanese",
        "Hindi",
        "Portuguese"
    ],
    "LAW": [
        "United States Constitution",
        "Magna Carta",
        "Code of Hammurabi",
        "Declaration of Independence",
        "Bill of Rights",
        "Geneva Conventions",
        "Universal Declaration of Human Rights",
        "Treaty of Versailles",
        "Patient Protection and Affordable Care Act",
        "Civil Rights Act"
    ],
    "LOC": [
        "Sahara Desert",
        "Amazon River",
        "Mount Everest",
        "Pacific Ocean",
        "Hudson River",
        "Urals Mountains",
        "Lake Victoria",
        "Strait of Gibraltar",
        "Antarctica",
        "Mariana Trench"
    ],

    "NORP": [
        "Arabs",
        "Hispanics",
        "Kurds",
        "Tamils",
        "Hutus",
        "Pashtuns",
        "Hmong",
        "Israelis",
        "Basques",
        "Chechens"
    ],
    "ORG": [
        "United Nations",
        "Microsoft Corporation",
        "Mayo Clinic",
        "Taliban",
        "NASA",
        "Starbucks",
        "FIFA",
        "Centers for Disease Control and Prevention",
        "European Union",
        "Harvard University"
    ],
    "PERSON": [
        "Barack Obama",
        "Queen Elizabeth II",
        "Cristiano Ronaldo",
        "J.K. Rowling",
        "Elon Musk",
        "Taylor Swift",
        "Donald Trump",
        "Serena Williams",
        "Jeff Bezos",
        "Malala Yousafzai"
    ],
    "PRODUCT": [
        "iPhone",
        "Coca-Cola",
        "Boeing 747",
        "Harry Potter books",
        "Lego",
        "PlayStation 5",
        "Tesla Model S",
        "Ikea Billy bookcase",
        "Honda Civic",
        "Heinz ketchup"
    ],
    "WORK_OF_ART": [
        "Mona Lisa",
        "Hamlet",
        "The Starry Night",
        "Thriller",
        "The Odyssey",
        "The Divine Comedy",
        "Pride and Prejudice",
        "La Gioconda",
        "Broadway musical Hamilton",
        "Hey Jude"
    ]
}

In [None]:
nlp = spacy.load("en_core_web_sm")

def change_example_entities(example):
    ## Your code begins ##
    context = example['context']
    changed_context = context.lower()
    for v in entities.values():
        for name in v:
            if name.lower() in changed_context:
                new = random.choice(v).upper()
                changed_context.replace(name.lower(), new)

    example['changed_context'] = changed_context.lower()
    return example
    ## Your code ends ##

changed_entiy_dataset = dataset_test.map(change_example_entities)


#### Evaluating Model on Modified Entities

Now we will evaluate our model's performance on the dataset with modified entities.

**Steps:**

1. Check model performance on original correct answers.

2. Check performance on modified answers.

   - Calculate metrics on answers changed to match context.

3. Examine some model responses.

   - Analyze model behavior on modified examples.

   - Explain anything interesting about model responses.

**Key Points**

- Evaluate on original answers as a baseline.

- Also evaluate on modified answers matching context.

- Compare metrics - does performance decrease?

- Inspect some responses for insightful model behaviors.



In [None]:
# ###########################################
# ### Evaluating original answers section ###
# ###########################################

# ## Your code begins ##
# predictions = []
# references = []

# for example in tqdm(dataset):
#     pass

# print(f"EM Score={em_score}, F1 Score={f1_score}")
# ## Your code ends ##


In [None]:
###########################################
### Evaluating modified answers section ###
###########################################

## Your code begins ##
predictions = []
ground_truths = []

for example in tqdm(changed_entiy_dataset):
    input_text = f"Context: {example['changed_context']} \nQuestion: {example['question']}"
    output_text = llm(prompt_template % (preprompt, input_text))
    predictions.append(output_text)
    ground_truths.append(example['answers']['text'])

em_score = compute_exact_match_score(predictions, ground_truths),
f1_score = compute_f1_score(predictions, ground_truths)

print(f"EM Score={em_score}, F1 Score={f1_score}")
## Your code ends ##

  0%|          | 0/353 [00:00<?, ?it/s]

EM Score=(26.345609065155806,), F1 Score=0.4710253522960759



<font color="red"> Asnwer: </font> <br>

After evaluating the dataset, we suprisingly find out that the performance of the model have improved compared to the original dataset. This could happen for a number of reasons:

1 - Because the substituted entities are still within the context that the model is familiar with, the model might adapt well to these changes. For example, substituting one well-known entity with another might not significantly disrupt the model's understanding of the context, allowing it to maintain or even improve its performance.

2 - In some cases, entity substitution can reduce the complexity of potential answers. This can happen if the substituted entities are less ambiguous or have a more direct association with the question, making it easier for the model to identify the correct answer.

3 - If the substituted entities are more widespread or better represented in the model’s pre-training data, the model might respond to them more accurately. Large language models often perform better with concepts and entities they have encountered more frequently during training.

Aside from the reasons mentioned above, It's also possible that the observed improvement is coincidental or due to the dataset. For example, the changes made might align more closely with the model's existing biases or strengths.

### 3.3. Nonsense Word Substitution

In this segment of the adversarial dataset construction, our primary aim is to assess the model's ability to adapt to new, artificially coined terms and evaluate its reasoning capabilities based on the provided context. We will implement a systematic approach to generate nonsense words, replace identifiable entities in the dataset with these generated words, and provide a definition for each nonsense word. This process encapsulates the essence of exploring how well the model can understand and use newly defined terms to answer questions accurately.

The first task at hand is to design a function that generates nonsense words. The goal here is to create a word that doesn't carry any pre-existing meaning. The function `generate_nonsense_word` below is your starting point. Implement the function such that it creates and returns a nonsense word.

In [19]:
# @title Generate Nonsense Words (Your Implementation)
import string
def generate_nonsense_word():
    ## Your code begins ##
    length = random.randint(3, 10)

    alphabet = string.ascii_lowercase

    word = ''.join(random.choices(alphabet, k=length))

    return word
    ## Your code ends ##


Having devised a mechanism to create nonsense words, we transition into the heart of this section—creating the adversarial dataset. We will employ the Spacy library's Named Entity Recognition (NER) system to identify entities within the text. Each identified entity will be replaced by a generated nonsense word, and a definition will be provided for every replacement. The create_adversarial_example function below encapsulates this task. Implement the function, and upon executing it, you will observe a sample example from the adversarial dataset that illustrates the substitutions and definitions.

In [20]:
# @title Create Adversarial Dataset (Your Implementation)
# nlp = spacy.load("en_core_web_sm")

def create_adversarial_example(example):
    # doc = nlp(example['context'])

    ## Your code begins ##
    entity_replacements = {
    random.choice(li): generate_nonsense_word()
    for k, li in entities.items()
}

    altered_context = example['context'].lower()
    altered_question = example['question'].lower()
    for entity, nonsense_word in entity_replacements.items():
        altered_context = altered_context.replace(entity, nonsense_word)
        altered_question = altered_question.replace(entity, nonsense_word)

    definitions = {v: k for k, v in entity_replacements.items()}
    ## Your code ends ##

    return {
      'altered_context': altered_context,
      'altered_question': altered_question,
      'definitions': ', '.join([f'{k} is another word for {v}' for k, v in definitions.items()]),
    }

adversarial_examples = dataset_test.map(create_adversarial_example)

clear_output()
adversarial_examples[0]


{'id': '56be4db0acb8001400a502ec',
 'title': 'Super_Bowl_50',
 'context': 'Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi\'s Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the "golden anniversary" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as "Super Bowl L"), so that the logo could prominently feature the Arabic numerals 50.',
 'question': 'Which NFL team represented the AFC at Super Bowl 50?',
 'answers': {'text': ['Denver Broncos', 'Denver Broncos', 'Denver Broncos'],


With the adversarial dataset in place, the stage is set for evaluating the model's performance. We aim to uncover how well the model navigates through the maze of newly introduced terms while clinging to the definitions provided. Implement the evaluation code block below to gauge the model's performance on this adversarial dataset. The insights garnered from this exercise will shed light on the model's ability to adapt to new information and reason based on provided definitions, which is a step closer to understanding the model's reasoning faculties.

In [21]:
# @title Evaluating Llama-2 on the Adversarial Dataset
predictions = []
ground_truths = []

for example in tqdm(adversarial_examples):
    input_text = f"Question: {example['altered_question']} Context: {example['altered_context']} Definitions: {example['definitions']}"
    output_text = llm(prompt_template % (preprompt, input_text))
    predictions.append(output_text)
    ground_truths.append(example['answers']['text'])

em_score = compute_exact_match_score(predictions, ground_truths)
f1_score = compute_f1_score(predictions, ground_truths)

print(f"EM Score={em_score}, F1 Score={f1_score}")

  0%|          | 0/353 [00:00<?, ?it/s]

EM Score=(5.94900849858357,), F1 Score=0.29327400488431254


<font color="red"> Asnwer: </font> <br>

As expected, the results of the model have slightly worsened compared to the original dataset. This could happedn bacause the model is struggling to align the newly defined words with the original ones. These new words are surely absent from the pretraining dataset so it makes sense for the model to not fully understand and associate them with other words.

## 4. Conclusion
This exercise navigates through the curious interplay of reasoning and retrieval within Large Language Models, particularly focusing on the Llama-2 model. Through meticulous evaluation and crafting adversarial datasets, we aim to provide a window into the model's behavior, shedding light on its strengths, weaknesses, and its approach to deciphering and responding to questions under varying conditions.

Now, reflect upon the model's performance and share your insights:


5. <font color="green"> Did the model's performance align with your expectations? </font>
6. <font color="green"> How do the adversarial evaluations contribute to our understanding of the model's strengths and weaknesses in terms of reasoning and retrieval? </font>



<font color="red"> Asnwer 5: </font> <br>

Eventhough I expected parts 1 and 3 to behave the way they did, the results of the second part came as a suprise to me. By substituding words with random words from the same domain I was anticipating to encounter decreased resluts as the model is recieving somewhat false information but the performance imprtoved, much to my surprise.

<font color="red"> Asnwer 6: </font> <br>

Adversarial evaluations are really helpful for understanding how well a model like Llama-2 can reason and retrieve information. By using tricky tests, like changing parts of a text or asking questions with no clear answers, we can see if the model is just repeating what it knows or if it's actually understanding the context. If a model does well in these tough situations, it shows that it's good at figuring things out and not just relying on what it has learned before. But if it struggles, it might mean the model is mostly remembering things and not really getting the deeper meaning. These kinds of tests are important because they give us a clearer picture of what the model can do and where it needs to get better, especially in understanding and using language.
