## 1. Introduction

The advent of Large Language Models (LLMs) has undeniably shifted the paradigm in the realm of natural language processing, offering capabilities that inch closer to human-like text understanding and generation. Among the vanguards of this shift is the Llama-2 model, a behemoth trained on diverse text corpora, promising adeptness in various NLP tasks. However, as we usher into this era of seemingly intelligent machines, a pertinent question arises - do these models truly understand the text, or do they merely excel in retrieving memorized pieces of information from their training data? This inquiry is not merely academic; the implications of the findings reverberate through the practical applications and the future trajectory of LLMs. In exploring the reasoning capabilities of Large Language Models, a noteworthy investigation was carried out by [Saparov and He](https://openreview.net/pdf?id=qFVVBzXxR2V). Their analytical journey led to the revelation that these models, to a significant extent, harness the knowledge acquired during the pre-training phase when confronted with reasoning tasks. Characterized as "greedy reasoners," these models exhibit a propensity to rely on the reservoir of memorized information, as opposed to showcasing authentic reasoning abilities.

Our exploration is set against the backdrop of the SQuAD dataset, a well-regarded benchmark in the question-answering domain. The choice of SQuAD is motivated by its structured evaluation metrics which offer a tangible measure of a model's ability to retrieve and reason over text. While SQuAD has been instrumental in driving progress in question answering, its conventional usage may not fully expose the nuanced capabilities of models like Llama-2. This homework aims to delve deeper by constructing adversarial datasets that challenge the model beyond mere retrieval, probing its ability to reason and refer to the provided context accurately. Through a systematic evaluation on both the original and adversarially-modified versions of the SQuAD dataset, we aspire to dissect the retrieval and reasoning prowess of Llama-2, shedding light on the model's strengths, weaknesses, and the path towards more robust and interpretable LLMs.

Let's begin by setting up our workspace and loading the Llama-2 model to explore its capabilities.



In [None]:
# @title Environment Setup
# Note: Do NOT make changes to this block.
# ----------------
%pip install ctransformers[cuda]>=0.2.24 transformers datasets
!apt-get -y install -qq aria2

from IPython.display import clear_output
import numpy as np
import random
import spacy
import transformers
from tqdm.notebook import tqdm

SEED=21

np.random.seed(SEED)
random.seed(SEED)

!aria2c --console-log-level=error -c -x 16 -s 16 -k 1M https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF/resolve/main/llama-2-7b-chat.Q5_K_M.gguf -d /content/models -o llama-2-7b-chat.Q5_K_M.gguf

clear_output()
# ----------------

In [None]:
# @title Model Initialization
gpu_layers = 200000
context_length = 2048

from ctransformers import AutoModelForCausalLM

llm = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-13B-GGUF",
    model_file="/content/models/llama-2-7b-chat.Q5_K_M.gguf",
    model_type="llama",
    gpu_layers=gpu_layers,
    context_length=context_length,
)

prompt_template = """
[INST] <<SYS>
%s
<</SYS>>
%s[/INST]
"""

clear_output()

Now, it's a good practice to test the model with some inputs to get a feel for its responses before diving into the core analysis.


In [None]:
# @title Let's Test the Model
preprompt = "Your job is to answer the users' question accurately according to the context in shortest way possible. If the answer is not present in the provided context by the user, refuse to answer and yield \\\"Not enough info.\\\" If the answer is present in the context, only return the part of the context relevant to the question. The shorter your answer be, the more score you receive, even if you write a one word instead of a full sentence. Answering based on your prior knowledge is not considered as a good thing." # @param {type:"string"}
test_input = "Who is the current president of Iran?" # @param {type:"string"}

llm(prompt_template % (preprompt, test_input))


'Not enough info.'

### 2.1. Metrics

Evaluating the performance of Large Language Models (LLMs) on question-answering tasks necessitates employing metrics that accurately reflect the models' ability to provide correct and precise answers. Two widely acknowledged metrics for this purpose are Exact Match (EM) and F1 Score, which offer a lens through which the accuracy and the overall quality of the model’s responses can be gauged.

1. **Exact Match (EM)**:
   - The Exact Match metric measures the percentage of responses that match the ground truth answers exactly. It is a stringent metric that requires the predicted answer to be identical to the ground truth answer.
   - Mathematical Equation:
$\text{EM} = \left( \frac{\text{Number of exact matches}}{\text{Total number of questions}} \right) \times 100$

   - Example:
     Suppose we have $5$ questions, and the model answers $3$ of them exactly as in the ground truth. The EM score would be $(3/5) \times 100 = 60 \\% $.



In [None]:
def compute_exact_match_score(predictions: list[str], ground_truths: list[list[str]]):
    exact_match_score = 0
    for prediction, ground_truth in zip(predictions, ground_truths):
        prediction = prediction.lower().strip()
        exact_match_score += any(gt.lower().strip() == prediction for gt in ground_truth)
    em_percentage = (exact_match_score / len(predictions)) * 100
    return em_percentage

2. **F1 Score**:
   - The F1 Score is the harmonic mean of precision and recall, providing a balance between the two. It measures the overlap between the predicted answers and the ground truth, considering both the words that were correctly included and those that were omitted or added incorrectly.
   - Mathematical Equations:
   
  \begin{align}
  \text{Precision} = \frac{\text{Number of true positive words}}{\text{(Number of true positive words + Number of false positive words)}}
  \end{align}

  \begin{align}
  \text{Recall} = \frac{\text{Number of true positive words}}{\text{(Number of true positive words + Number of false negative words)}}
  \end{align}

  \begin{align}
  \text{F1 Score} = 2 \times \left( \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \right)
  \end{align}
  
   - Example:
     Suppose a predicted answer contains $4$ correct words out of $5$ total words, but misses $2$ words that are in the ground truth answer. The precision would be $4/(4+1) = 0.8$, the recall would be $4/(4+2) = 0.67$, and the F1 Score would be $2 \times (0.8 \times 0.67)/(0.8 + 0.67) ≈ 0.73$.

These metrics provide a nuanced view of the model's performance, offering insights into not only how often the model is correct (EM), but also how well it captures the nuances of the ground truth answers (F1 Score). Through these metrics, the evaluation phase aims to paint a comprehensive picture of the model's proficiency in the question-answering task amidst the structured framework provided by the SQuAD dataset.

In [None]:
def compute_f1_score(predictions: list[str], ground_truths: list[list[str]]):
    total_f1_score = 0
    for prediction, ground_truth in zip(predictions, ground_truths):
        prediction_words = prediction.lower().strip().split()
        best_f1 = 0
        for gt in ground_truth:
            gt_words = gt.lower().strip().split()
            common_words_count = sum(1 for word in prediction_words if word in gt_words)
            precision = common_words_count / len(prediction_words)
            recall = common_words_count / len(gt_words)
            f1 = (2 * precision * recall) / (precision + recall) if (precision + recall) > 0 else 0
            best_f1 = max(best_f1, f1)
        total_f1_score += best_f1
    average_f1_score = total_f1_score / len(predictions)
    return average_f1_score


### 2.2. Loading the Dataset and Evaluating the Model

Now, let's put the model to the test on the vanilla dataset to see how it performs. The steps we are going to follow are quite straightforward: First, we'll load up the dataset, and then we'll feed it to the model and evaluate the results using the score functions you've implemented earlier. To keep things manageable and ensure a quick run time, we'll use a subset of the SQuAD dataset for this evaluation.

In the following step, we'll load a subset of the SQuAD dataset which will be used for evaluating the model. This dataset contains a variety of questions along with the correct answers which we'll compare against the model's responses. After running the code block, you should see a sample row from the dataset, giving you a glimpse of the kind of questions and answers it contains.

In [None]:
# @title Loading the SQuAD Dataset Subset
from datasets import load_dataset
dataset = load_dataset('squad', split="validation")
dataset_test = dataset.shard(num_shards=15, index=0)

clear_output()
dataset_test[0]

{'id': '56be4db0acb8001400a502ec',
 'title': 'Super_Bowl_50',
 'context': 'Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi\'s Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the "golden anniversary" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as "Super Bowl L"), so that the logo could prominently feature the Arabic numerals 50.',
 'question': 'Which NFL team represented the AFC at Super Bowl 50?',
 'answers': {'text': ['Denver Broncos', 'Denver Broncos', 'Denver Broncos'],


With the dataset ready, it's time to see how Llama-2 fares. We'll feed the questions from the dataset to the model and collect its answers. Then, we'll use the score functions to calculate the Exact Match and F1 scores for each response, giving us a clear picture of the model's performance on this dataset.

In [None]:
# @title Evaluating Llama-2 on the Dataset (with periodic updates)
import time # It's good practice to import any new modules at the top

predictions = []
ground_truths = []

# Use enumerate to get both the index (i) and the item (example)
for i, example in enumerate(tqdm(dataset_test)):
    input_text = f"Question: {example['question']} Context: {example['context']}"
    output_text = llm(prompt_template % (preprompt, input_text))

    predictions.append(output_text)
    ground_truths.append(example['answers']['text'])

    # --- NEW CODE BLOCK FOR PERIODIC SAVING/PRINTING ---
    # The (i + 1) is because enumerate starts counting from 0
    # We check if the sample number is a multiple of 50
    # We also add a check to print the final results on the very last item
    if (i + 1) % 50 == 0 or (i + 1) == len(dataset_test):
        # Calculate scores on the data collected SO FAR
        em_score = compute_exact_match_score(predictions, ground_truths)
        f1_score = compute_f1_score(predictions, ground_truths)

        # Print a clear, formatted update
        print(f"\n--- Intermediate Results after {i + 1} samples ---")
        print(f"EM Score: {em_score:.2f}%, F1 Score: {f1_score:.4f}")
        print("--------------------------------------------------\n")

# The final print statement is now handled by the last iteration of the loop

output:  

```
--- Intermediate Results after 705 samples ---
EM Score: 29.36%, F1 Score: 0.4685
--------------------------------------------------




Having seen how the model performs on the vanilla dataset, let’s delve into some analytical reflections:
1. <font color="green"> What do you think is the better metric for evaluating Llama-2 on this dataset and why? </font>

While both metrics are useful, for a reading comprehension task like SQuAD, the F1 Score is the better and more informative primary metric.


It's More Forgiving of Natural Language Variation: The F1 score measures the overlap of words between the prediction and the ground truth. This is crucial because a question can often have a correct answer that is phrased slightly differently.

Example:

Question: "Who was the first U.S. President?"

Ground Truth Answer: "George Washington"

Model's Answer: "President George Washington"

EM Score: 0 (The strings are not an exact match). This is unfairly harsh.

F1 Score: High (e.g., 0.8) because the important words ("George", "Washington") are all there. The F1 score correctly identifies this as a good answer.

Our Results Prove This Point: our EM score is 27.4%, while our F1 score is 43.1%. This large gap tells us that in many cases, the model is providing answers that are substantially correct but not perfectly identical to the ground truth. It's getting the right idea but failing on the details of exact phrasing. The F1 score captures this partial success, while the EM score dismisses it entirely.

The Exact Match (EM) score is still valuable, but as a secondary, more stringent metric. It tells us how often the model is "perfectly precise," which can be important for tasks where the answer must be a specific entity, date, or number. However, for overall comprehension, F1 gives a more realistic picture.


2. <font color="green"> How can preprompt text affect the evaluation and the model's performance? </font>

The preprompt (or system prompt) is enormously influential; it's like setting the "rules of the game" for the model before it even sees the question. It fundamentally changes the model's behavior and has a direct, measurable impact on the evaluation scores. Think of it as the puppet master pulling the strings.

Here’s how our specific preprompt affected the results:

Constrains the Information Source: The instruction "answer the users' question accurately according to the context" is the most important rule. It explicitly tells the model not to use its own vast internal knowledge. Without this rule, the model could get a high score simply by "remembering" facts about the world, not by demonstrating that it can read and comprehend the text we provide. This rule is the foundation of our entire experiment to test reasoning vs. retrieval.

Influences Answer Length and Style: The rule "The shorter your answer be, the more score you receive" directly impacts the output. This is likely a major reason why our EM score is low. The ground truth answers in SQuAD might be longer phrases (e.g., "the first President of the United States"), but the model, following our instructions, might correctly extract just the essential part ("George Washington") to be as short as possible. It is correctly following our rules, but those rules lead to a lower EM score. This shows how a preprompt can change the definition of a "good" answer.

Provides a Critical Failure Condition: The rule "If the answer is not present... yield 'Not enough info.'" is a crucial guardrail. Without it, the model would likely try to guess or "hallucinate" an answer when the context doesn't contain the information. This would lead to incorrect answers and a much lower F1 score.

In short, the preprompt sets the entire context for the evaluation. The scores we got are a measure of how well the model followed that specific set of instructions, not just how "smart" it is in a general sense. Change the preprompt, and we would get completely different results.


## 3. Adversarial Dataset Construction

In this section, we venture into the realm of adversarial evaluation to delve deeper into the abilities of the Llama-2 model. The objective is to scrutinize how the model responds to scenarios that are crafted to challenge its reasoning and retrieval capacities. We propose three methods to create adversarial datasets, each aimed at examining different facets of the model's behavior.

1. **Answer Absence**: In this method, we modify the SQuAD dataset by crafting questions for which the answers do not exist in the provided context.

2. **Entity Substitution**: Here, we substitute entity words in the context with other entities to test whether the model relies on retrieval or refers to the context accurately for answering the question. For instance, changing the context from "The president of the USA lives in the White House. Barack Obama is the current president of the USA." to "The president of the USA lives in the White House. Gall Granuaile is the current president of the USA." and observing if the answer changes appropriately.

3. **Nonsense Word Substitution**: In this method, we replace certain words or entities with nonsensical words in a consistent and meaningful way, defining the nonsense words before asking the question. For example, replacing "White House" with "Glibber House" and explaining that "Glibber" means "White".

Before embarking on the evaluation using adversarial datasets, we encourage students to ponder upon a few analytical questions:
<font color="green">

3. What is your expectation regarding the model's performance on these adversarial datasets?

My expectation is that the model's performance will decrease significantly compared to the baseline scores we just calculated, but it will struggle differently on each task.

Answer Absence: I expect the model to perform relatively well on this task. In our very first test ("Who is the president of Iran?"), the model correctly followed the preprompt and answered "Not enough info." when no context was provided. This task is a more complex version of that test. As long as the model continues to strictly adhere to the system prompt, it should be able to identify that the answer is missing and produce the correct "Not enough info." response.

Entity Substitution: This is where I expect the model to fail the most. This method creates a direct conflict between the model's vast, pre-trained "memory" and the new "fact" presented in the context. The "greedy reasoner" hypothesis from  introduction suggests that the model will default to what it already knows. The model is very likely to ignore the fake entity (e.g., "Gall Granuaile") and answer with the real entity it has memorized (e.g., the actual US president). This will cause both the EM and F1 scores to plummet, as the answers will be factually correct according to the real world but completely wrong according to the provided context.

Nonsense Word Substitution: I predict the model's performance here will be somewhere in the middle—better than on Entity Substitution, but worse than on Answer Absence. This task tests a different skill: in-context learning. The model has no prior knowledge of the word "Glibber," so it cannot rely on its memory. It is forced to reason based only on the new rules you've provided. Modern instruction-tuned models are often quite good at this, but it's a complex reasoning task that can still trip them up. I expect a moderate drop in scores due to potential confusion or inconsistent application of the new rule.

4. How might the model's behavior on standard versus adversarial datasets inform us about its reasoning and retrieval abilities?

</font>

Comparing the model's performance across these datasets is the entire point of the experiment. The difference in scores is not just a number; it's a powerful signal that helps us understand how the model is "thinking."

Standard Dataset Performance (The Baseline): This score tells us how good the model is at playing the game on an "easy mode," where the provided context generally aligns with the facts it has already memorized. A high score here simply confirms the model is a capable question-answerer. It doesn't tell us how it's getting the answers.

Adversarial Dataset Performance (The Stress Test): This is where we separate the true readers from the mere rememberers.

If the scores drop dramatically (especially on the Entity Substitution task), it provides strong evidence that the model relies heavily on retrieval (or memory). It suggests the model isn't truly "reading" and "reasoning" over the context you provide, but is instead using the question as a key to look up an answer in its internal knowledge base. This would support the "greedy reasoner" hypothesis.

If the scores remain relatively high, it suggests the model has strong contextual reasoning abilities. It would mean the model is successfully prioritizing the instructions and facts given in the prompt over its pre-existing knowledge. This would show a more robust and trustworthy form of reasoning.

In essence, we've created a scientific control (the standard dataset) and a series of experiments (the adversarial datasets). The gap in performance between the control and the experiments is what allows us to draw a conclusion about the model's internal mechanisms. A big gap suggests reliance on retrieval, while a small gap suggests a stronger capacity for true reasoning.

### 3.1. Answer Absence

#### Modifying the Dataset

For this section we need to modify the original dataset in the way that for each example there will be a new context that is totally different with the original context of the example.

To do so, we suggest that you use the title feature in each example and then swap the context between examples that do not have the same title.

*  Of course, this is just a suggestion and you can feel free to implement this section as you desire, as long as it meets the required criteria.

Some key points:

*   The goal is for each example to have a new context that differs from the original.
* Using the title of each example is one potential way to pair up examples for swapping contexts.
* Feel free to use any approach for generating new contexts as long as they meaningfully differ from the originals.
* The modified dataset should meet the specifications and requirements for the assignment.
* Be creative in how you modify the contexts - the approach suggested is just one option.


    

In [None]:
# @title 3.1. Answer Absence (Your Corrected Logic)
from datasets import Dataset

# --- Step 1 & 2: Create a diverse test set of 705 samples ---
# First, create a dictionary {context: example} from the full dataset.
# This is a clever way to get one example for each unique context.
unique_context_examples = {ex['context']: ex for ex in dataset}

# Now, create our new dataset_test from the first 705 unique examples.
# We convert the dictionary values back to a list, take a slice, and create a Dataset object.
dataset_test_list = list(unique_context_examples.values())[:200]
dataset_test = Dataset.from_list(dataset_test_list)

print(f"Created a new test set with {len(dataset_test)} examples, each with a unique context.")


# --- Step 3 & 4: Create the adversarial context map ---
# The template requires 'adversial_group_contexts', so we will create our map here.

## Your code begins ##

# Get the list of original contexts from our new, clean dataset_test
original_contexts = [ex['context'] for ex in dataset_test]

# Rotate the list of contexts to create the shuffled version
shuffled_contexts = original_contexts[-100:] + original_contexts[:-100]

# Create the map: {original_context_A: shuffled_context_B, ...}
# We name it according to the template's variable name.
adversial_group_contexts = dict(zip(original_contexts, shuffled_contexts))

## Your code ends ##


# --- Step 5: Define the function to apply the map ---
def create_adversarial_example(example):
    """
    Takes an example, finds its original context in our map,
    and returns a new example with the context replaced by the shuffled one.
    """
    ## Your code begins ##

    # Get the original context from the input example
    original_context = example['context']

    # Make a copy to modify
    new_example = example.copy()

    # Look up the new, adversarial context in our map and overwrite the original
    new_example['context'] = adversial_group_contexts[original_context]

    return new_example

    ## Your code ends ##

# --- Step 6: Create the final dataset using the .map() function ---
shuffled_context_dataset = dataset_test.map(create_adversarial_example)



Created a new test set with 200 examples, each with a unique context.


Map:   0%|          | 0/200 [00:00<?, ? examples/s]

In [None]:
# @title Sanity Check for Answer Absence Dataset

# Get the first example from our original test set
original_example = dataset_test[10]

# Get the first example from our new adversarial set
adversarial_example = shuffled_context_dataset[10]

print("--- VERIFYING THE SHUFFLE ---")
print(f"Original Question: {original_example['question']}")
print(f"Adversarial Question: {adversarial_example['question']}")
print("\n" + "="*50 + "\n")

print("Original Context (Snippet):")
print(f"'{original_example['context'][:150]}...'")
print("\nAdversarial Context (Snippet):")
print(f"'{adversarial_example['context'][:150]}...'")
print("\n" + "="*50 + "\n")

# The final, definitive check:
is_shuffled_correctly = original_example['context'] != adversarial_example['context']
print(f"Are the contexts different? -> {is_shuffled_correctly}")

for i in range(dataset_test.num_rows):
    is_shuffled_correctly = dataset_test[i]['context'] != shuffled_context_dataset[i]['context']
    assert is_shuffled_correctly

--- VERIFYING THE SHUFFLE ---
Original Question: How many receptions did Cotchery  get for the 2015 season?
Adversarial Question: How many receptions did Cotchery  get for the 2015 season?


Original Context (Snippet):
'The Panthers offense, which led the NFL in scoring (500 points), was loaded with talent, boasting six Pro Bowl selections. Pro Bowl quarterback Cam Ne...'

Adversarial Context (Snippet):
'Opportunistic bands of Normans successfully established a foothold in Southern Italy (the Mezzogiorno). Probably as the result of returning pilgrims' ...'


Are the contexts different? -> True


In [None]:
shuffled_context_dataset[0]

{'id': '56d9895ddc89441400fdb510',
 'title': 'Super_Bowl_50',
 'context': "Warsaw's first stock exchange was established in 1817 and continued trading until World War II. It was re-established in April 1991, following the end of the post-war communist control of the country and the reintroduction of a free-market economy. Today, the Warsaw Stock Exchange (WSE) is, according to many indicators, the largest market in the region, with 374 companies listed and total capitalization of 162 584 mln EUR as of 31 August 2009. From 1991 until 2000, the stock exchange was, ironically, located in the building previously used as the headquarters of the Polish United Workers' Party (PZPR).",
 'question': 'What 2015 NFL team one the AFC playoff?',
 'answers': {'answer_start': [177, 177, 177],
  'text': ['Denver Broncos', 'Denver Broncos', 'Denver Broncos']}}

#### Evaluating Model Performance on Modified Dataset

Now we will test the performance of our model on the modified dataset to determine how reliant it is on the original contexts.

- Compare the model predictions to the original correct answers.
- Calculate evaluation metrics.
- Analyze whether there is a significant decrease in model performance on the modified dataset and explain your thoughts.


In [None]:
###########################################
### Evaluating original answers section ###
###########################################

predictions = []
ground_truths = []

# Use enumerate to get both the index (i) and the item (example)
for i, example in enumerate(tqdm(shuffled_context_dataset)):
    input_text = f"Question: {example['question']} Context: {example['context']}"
    output_text = llm(prompt_template % (preprompt, input_text))

    predictions.append(output_text)
    ground_truths.append(example['answers']['text'])

    # --- NEW CODE BLOCK FOR PERIODIC SAVING/PRINTING ---
    # The (i + 1) is because enumerate starts counting from 0
    # We check if the sample number is a multiple of 50
    # We also add a check to print the final results on the very last item
    if (i + 1) % 50 == 0 or (i + 1) == len(shuffled_context_dataset):
        # Calculate scores on the data collected SO FAR
        em_score = compute_exact_match_score(predictions, ground_truths)
        f1_score = compute_f1_score(predictions, ground_truths)

        # Print a clear, formatted update
        print(f"\n--- Intermediate Results after {i + 1} samples ---")
        print(f"EM Score: {em_score:.2f}%, F1 Score: {f1_score:.4f}")
        print("--------------------------------------------------\n")



output:

```
--------------------------------------------------
--- Intermediate Results after 200 samples ---
EM Score: 1.00%, F1 Score: 0.0505
--------------------------------------------------

#### Evaluating "Not Enough Info." Responses

In the prompt we specified that the model should respond "Not enough info." if the context lacks the information needed to answer the question.

Now we will evaluate the model's performance on these "not enough info." responses.

Which evaluation metric should we use and why?


In [None]:
###########################################
### Evaluating modified answers section ###
###########################################

## Your code begins ##

# For this task, the "correct" answer for every single example is the
# exact string we asked the model to produce.
# We create a new ground_truths list to reflect this.
ground_truths_nei = [["Not enough info."]] * len(predictions)

# We will use the Exact Match score, as it's the most appropriate metric.
em_score_nei = compute_exact_match_score(predictions, ground_truths_nei)

print(f"--- Evaluation of 'Not enough info.' Responses ---")
print(f"The model correctly responded with 'Not enough info.' in {em_score_nei:.2f}% of the cases.")

# We could also calculate F1, but it will be identical to EM in this case
# because any exact match has a perfect F1 of 1, and any non-match has an F1 of 0.
f1_score_nei = compute_f1_score(predictions, ground_truths_nei)
print(f"F1 Score for this task: {f1_score_nei:.4f}")

## Your code ends ##



output:

```
--- Evaluation of 'Not enough info.' Responses ---
The model correctly responded with 'Not enough info.' in 25.00% of the cases.
F1 Score for this task: 0.3618

#### Analyzing Model Responses

Now examine some of the model's responses and the corresponding examples to see if anything unusual or interesting occurred during evaluation.

**Steps:**

1. Sample some model responses across the dataset.

2. Analyze the input example and model's response.

4. Dig deeper into the model's response and explain why this is the case.

5. Possible insights:

  - Is model hallucinating or fabricating information?

  - Does model seem biased or inconsistent?

  - Does the model rely too much on the context?


#### Analysis of the "Answer Absence" Experiment
**Executive Summary:**
The "Answer Absence" experiment was designed to test the model's ability to follow a negative constraint: when an answer is not present in the provided context, it must respond with "Not enough info." The results indicate that the model struggles significantly with this task. While it successfully avoids retrieving correct answers from its own memory, it only follows the refusal instruction in 25% of cases. In the remaining 75% of instances, the model defaults to hallucination, either by fabricating an answer from the irrelevant context or by retrieving a plausible but incorrect fact from its internal knowledge base.

**Quantitative Results:**
The evaluation was performed using two different ground truths to measure distinct model behaviors:

Scores vs. Original Factual Answers:

EM Score: 1.00%

F1 Score: 0.0505

This near-zero score is a positive initial finding. It demonstrates that the model is not simply ignoring the prompt and retrieving the correct real-world answer from memory. It successfully altered its behavior in response to the adversarial context.

Scores vs. the phrase "Not enough info.":

EM Score: 25.00%

This is the direct measure of success for this task. The score indicates that the model correctly identified the absence of an answer and followed the output instruction perfectly in only 1 out of every 4 cases.

**In-Depth Qualitative Analysis of Model Behavior**

A detailed review of the model's predictions reveals several distinct and recurring behaviors, primarily centered around different forms of hallucination.

Failure Mode 1: Contextual Fabrication

This is the most common failure. The model attempts to fulfill its role as an answer-provider by seizing on any word or phrase in the irrelevant context that matches the type of entity the question is asking for. This leads to nonsensical but structurally plausible answers.

**Example:**

Question: What was Maxwell's job?

Irrelevant Context: ...the mayor of Warsaw is called President.

Prediction: Maxwell's job was Mayor of Warsaw.

Analysis: The model identified that the question requires a "job title." It scanned the irrelevant context, found the phrase "Mayor of Warsaw," and incorrectly assigned this job to Maxwell, demonstrating a failure of logical reasoning.

**Example:**

Question: ...in the place of Tesla's system?

Irrelevant Context: A story about Triton's daughters and a mermaid.

Prediction: Tesla's system was replaced by streetcars in the place of Triton's daughters.

Analysis: This is a more bizarre fabrication. The model has stitched a keyword from the question ("Tesla's system") directly into a fantastical phrase from the context ("Triton's daughters"), creating a nonsensical sentence.

Failure Mode 2: Retrieval Hallucination
In some cases, the model seems to recognize the context as useless but still attempts to answer. It ignores the context and queries its own internal knowledge, but retrieves a related but incorrect fact.

**Example:**

Question: What 2015 NFL team one the AFC playoff?

Irrelevant Context: A paragraph about the Warsaw Stock Exchange.

Prediction: AFC Playoffs - Pittsburgh Steelers

Analysis: The model recognized the keywords "NFL" and "AFC." It ignored the finance-related context but failed to retrieve the correct answer ("Denver Broncos"), instead retrieving another plausible but incorrect AFC team. This reveals a "sloppiness" in its internal knowledge retrieval.

Partial Success: Correct Reasoning, Flawed Formatting
Sometimes, the model's core reasoning is successful, but it fails to adhere to the strict output format required by the prompt.

**Example:**

Question: What year did the Carolina Panthers form?

Irrelevant Context: A paragraph about a Polish car factory.

Prediction: Carolina Panthers formed? Not enough info.

Analysis: The model correctly identified that the context lacked the required information. However, it provided a conversational response instead of the simple, required phrase, thus failing the Exact Match test but succeeding in the underlying reasoning task.

**Final Conclusions**

1. Is the model hallucinating or fabricating information?

Yes. This is the model's default behavior when faced with this challenge, occurring in roughly 75% of cases. The experiment reveals two distinct types of hallucination: 1) Contextual Fabrication, where it weaves incorrect answers from the irrelevant text it's given, and 2) Retrieval Hallucination, where it ignores the context and pulls plausible but incorrect information from its own memory.

2. Does the model seem biased or inconsistent?

Yes, it is highly inconsistent. A 25% success rate demonstrates a lack of reliability. For any given question, it is difficult to predict whether the model will succeed, fail by fabricating from the context, or fail by hallucinating from its memory.

3. Does the model rely too much on the context?

The model's primary failure is an improper reliance on the context. It is so committed to the instruction to "answer from the context" that it will fabricate a nonsensical answer from irrelevant words rather than correctly performing the meta-reasoning task of identifying the context as entirely unhelpful and refusing to answer.

In [None]:
print(shuffled_context_dataset[0])
print(f"the prediction is: {predictions[0]}\n")

print(shuffled_context_dataset[1])
print(f"the prediction is: {predictions[1]}\n")

print(shuffled_context_dataset[170])
print(f"the prediction is: {predictions[170]}\n")

print(shuffled_context_dataset[198])
print(f"the prediction is: {predictions[198]}\n")

print(shuffled_context_dataset[40])
print(f"the prediction is: {predictions[40]}\n")

print(shuffled_context_dataset[121])
print(f"the prediction is: {predictions[121]}\n")

print(shuffled_context_dataset[100])
print(f"the prediction is: {predictions[100]}\n")

print(shuffled_context_dataset[81])
print(f"the prediction is: {predictions[81]}\n")

{'id': '56d9895ddc89441400fdb510', 'title': 'Super_Bowl_50', 'context': "Warsaw's first stock exchange was established in 1817 and continued trading until World War II. It was re-established in April 1991, following the end of the post-war communist control of the country and the reintroduction of a free-market economy. Today, the Warsaw Stock Exchange (WSE) is, according to many indicators, the largest market in the region, with 374 companies listed and total capitalization of 162 584 mln EUR as of 31 August 2009. From 1991 until 2000, the stock exchange was, ironically, located in the building previously used as the headquarters of the Polish United Workers' Party (PZPR).", 'question': 'What 2015 NFL team one the AFC playoff?', 'answers': {'answer_start': [177, 177, 177], 'text': ['Denver Broncos', 'Denver Broncos', 'Denver Broncos']}}
the prediction is: AFC Playoffs - Pittsburgh Steelers

{'id': '56d98a59dc89441400fdb52e', 'title': 'Super_Bowl_50', 'context': 'The FSO Car Factory wa

### 3.2. Entity Substitution

#### Modifying Entities in Examples

For this section, we need to modify the entities in each example with different entities from the same domain.

For example, the sentence "Joe Biden is the president of the US" could be changed to "Akbar is the king of England".

To do this, we recommend using the spaCy library and its named entity recognition (NER) capabilities.

**Steps:**

1. Load the `en_core_web_sm` model in spaCy.

2. Identify named entities in each example text.

3. Decide which entities could be swapped out.

4. Replace entities with new random ones from the same domain.


**Of course, this is just a suggestion and you can feel free to implement this section as you desire, as long as it meets the required criteria.**


In [None]:
import spacy
import random

nlp = spacy.load("en_core_web_sm")

labels = nlp.get_pipe("ner").labels

for label in labels:
    print(label)
    print(spacy.explain(label))
    print('-------------------------------')


CARDINAL
Numerals that do not fall under another type
-------------------------------
DATE
Absolute or relative dates or periods
-------------------------------
EVENT
Named hurricanes, battles, wars, sports events, etc.
-------------------------------
FAC
Buildings, airports, highways, bridges, etc.
-------------------------------
GPE
Countries, cities, states
-------------------------------
LANGUAGE
Any named language
-------------------------------
LAW
Named documents made into laws.
-------------------------------
LOC
Non-GPE locations, mountain ranges, bodies of water
-------------------------------
MONEY
Monetary values, including unit
-------------------------------
NORP
Nationalities or religious or political groups
-------------------------------
ORDINAL
"first", "second", etc.
-------------------------------
ORG
Companies, agencies, institutions, etc.
-------------------------------
PERCENT
Percentage, including "%"
-------------------------------
PERSON
People, including fict

In [None]:
'''
EVENT
Named hurricanes, battles, wars, sports events, etc.
-------------------------------
FAC
Buildings, airports, highways, bridges, etc.
-------------------------------
GPE
Countries, cities, states
-------------------------------
LANGUAGE
Any named language
-------------------------------
LAW
Named documents made into laws.
-------------------------------
LOC
Non-GPE locations, mountain ranges, bodies of water
-------------------------------
NORP
Nationalities or religious or political groups
-------------------------------
ORG
Companies, agencies, institutions, etc.
-------------------------------
PERSON
People, including fictional
-------------------------------
PRODUCT
Objects, vehicles, foods, etc. (not services)
-------------------------------
WORK_OF_ART
Titles of books, songs, etc.
-------------------------------
'''
entities = {
    "EVENT": [
        "Hurricane Katrina (2005)",
        "Battle of Waterloo (1815)",
        "World War II (1939-1945)",
        "Super Bowl LVI (2022)",
        "Vietnam War (1955-1975)",
        "Hurricane Sandy (2012)",
        "Gulf War (1990-1991)",
        "French Open (annual event)",
        "Battle of Gettysburg (1863)",
        "FIFA World Cup 2022"
    ],
    "FAC": [
        "LaGuardia Airport (New York City)",
        "Golden Gate Bridge (San Francisco, CA)",
        "CN Tower (Toronto, Canada)",
        "Heathrow Airport Terminal 5 (London)",
        "Shanghai Metro (Shanghai, China)",
        "Hoover Dam (Nevada/Arizona, US)",
        "Burj Khalifa (Dubai, UAE)",
        "Cape Canaveral Space Force Station (Florida, US)",
        "CERN Hadron Collider (Geneva, Switzerland)",
        "Shanghai Tunnel (Shanghai, China)"
    ],
    "GPE": [
        "Paris, France",
        "Canada",
        "California, US",
        "India",
        "Mexico",
        "Germany",
        "New South Wales, Australia",
        "Jakarta, Indonesia",
        "Shanghai, China",
        "Texas, US"
    ],
    "LANGUAGE": [
        "English",
        "Mandarin Chinese",
        "Spanish",
        "Arabic",
        "Russian",
        "French",
        "German",
        "Japanese",
        "Hindi",
        "Portuguese"
    ],
    "LAW": [
        "United States Constitution",
        "Magna Carta (England, 1215)",
        "Code of Hammurabi (Babylonia, ~1754 BCE)",
        "Declaration of Independence (US, 1776)",
        "Bill of Rights (US, 1791)",
        "Geneva Conventions (1864, 1906, 1929, 1949)",
        "Universal Declaration of Human Rights (UN, 1948)",
        "Treaty of Versailles (1919)",
        "Patient Protection and Affordable Care Act (US, 2010)",
        "Civil Rights Act (US, 1964)"
    ],
    "LOC": [
        "Sahara Desert (Africa)",
        "Amazon River (South America)",
        "Mount Everest (Asia)",
        "Pacific Ocean",
        "Hudson River (New York, US)",
        "Urals Mountains (Russia)",
        "Lake Victoria (Africa)",
        "Strait of Gibraltar (border of Europe/Africa)",
        "Antarctica",
        "Mariana Trench (western Pacific Ocean)"
    ],

    "NORP": [
        "Arabs",
        "Hispanics",
        "Kurds",
        "Tamils",
        "Hutus",
        "Pashtuns",
        "Hmong",
        "Israelis",
        "Basques",
        "Chechens"
    ],
    "ORG": [
        "United Nations",
        "Microsoft Corporation",
        "Mayo Clinic",
        "Taliban",
        "NASA",
        "Starbucks",
        "FIFA",
        "Centers for Disease Control and Prevention (CDC)",
        "European Union",
        "Harvard University"
    ],
    "PERSON": [
        "Barack Obama",
        "Queen Elizabeth II",
        "Cristiano Ronaldo",
        "J.K. Rowling",
        "Elon Musk",
        "Taylor Swift",
        "Donald Trump",
        "Serena Williams",
        "Jeff Bezos",
        "Malala Yousafzai"
    ],
    "PRODUCT": [
        "iPhone",
        "Coca-Cola",
        "Boeing 747",
        "Harry Potter books",
        "Lego",
        "PlayStation 5",
        "Tesla Model S",
        "Ikea Billy bookcase",
        "Honda Civic",
        "Heinz ketchup"
    ],
    "WORK_OF_ART": [
        "Mona Lisa (painting by Leonardo da Vinci)",
        "Hamlet (play by Shakespeare)",
        "The Starry Night (painting by van Gogh)",
        "Thriller (album by Michael Jackson)",
        "The Odyssey (epic poem by Homer)",
        "The Divine Comedy (poem by Dante)",
        "Pride and Prejudice (novel by Jane Austen)",
        "La Gioconda (opera by Ponchielli)",
        "Broadway musical Hamilton",
        "Hey Jude (song by The Beatles)"
    ]
}

In [None]:
nlp = spacy.load("en_core_web_sm")

def change_example_entities(example):
    ## Your code begins ##

    # 1. Make a copy to work on
    new_example = example.copy()
    context = new_example['context']
    answers = new_example['answers']['text']

    # 2. Use SpaCy to find entities in the context
    doc = nlp(context)

    # 3. Iterate through entities in REVERSE to avoid index shifting issues
    for ent in reversed(doc.ents):
        original_entity_text = ent.text
        entity_label = ent.label_

        # 4. Check if we have a list of replacements for this entity type
        if entity_label in entities:
            # Get potential replacements and make sure we don't pick the same one
            possible_replacements = [e for e in entities[entity_label] if e != original_entity_text]

            if possible_replacements:
                # Pick a random new entity
                new_entity_text = random.choice(possible_replacements)

                # 5. Replace the entity in the context string
                context = context[:ent.start_char] + new_entity_text + context[ent.end_char:]

                # 5b. IMPORTANT: Also replace the entity in the ground-truth answers
                answers = [ans.replace(original_entity_text, new_entity_text) for ans in answers]

    # Update the example with the fully modified context and answers
    new_example['context'] = context
    new_example['answers']['text'] = answers

    return new_example


    ## Your code ends ##

changed_entity_dataset = dataset_test.map(change_example_entities)


Map:   0%|          | 0/200 [00:00<?, ? examples/s]

In [None]:
# @title Sanity Check for Entity Substitution

original_sample = None
modified_sample = None

# Find the first example that was actually changed
for i in range(len(dataset_test)):
    if dataset_test[i]['context'] != changed_entity_dataset[i]['context']:
        original_sample = dataset_test[i]
        modified_sample = changed_entity_dataset[i]
        print(f"Found a modified example at index {i}!")
        break

if original_sample:
    print("\n--- ORIGINAL EXAMPLE ---")
    print(f"Question: {original_sample['question']}")
    print(f"Answer: {original_sample['answers']['text']}")
    print(f"Context Snippet: ...{original_sample['context'][100:300]}...")

    print("\n--- MODIFIED EXAMPLE ---")
    print(f"Question: {modified_sample['question']}")
    print(f"Answer: {modified_sample['answers']['text']}")
    print(f"Context Snippet: ...{modified_sample['context'][100:300]}...")

Found a modified example at index 0!

--- ORIGINAL EXAMPLE ---
Question: What 2015 NFL team one the AFC playoff?
Answer: ['Denver Broncos', 'Denver Broncos', 'Denver Broncos']
Context Snippet: ...e (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super B...

--- MODIFIED EXAMPLE ---
Question: What 2015 NFL team one the AFC playoff?
Answer: ['Paris, France iPhone', 'Paris, France iPhone', 'Paris, France iPhone']
Context Snippet: ...niversity) for the 2015 season. Starbucks (AFC) champion Paris, France iPhone defeated TalibanEuropean Union) champion European Union 24–10 to earn their third World War II (1939-1945) title. The game...


#### Evaluating Model on Modified Entities

Now we will evaluate our model's performance on the dataset with modified entities.

**Steps:**

1. Check model performance on original correct answers.

2. Check performance on modified answers.

   - Calculate metrics on answers changed to match context.

3. Examine some model responses.

   - Analyze model behavior on modified examples.

   - Explain anything interesting about model responses.

**Key Points**

- Evaluate on original answers as a baseline.

- Also evaluate on modified answers matching context.

- Compare metrics - does performance decrease?

- Inspect some responses for insightful model behaviors.



In [None]:
###########################################
### Evaluating original answers section ###
###########################################

## Your code begins ##
predictions = []
references_original = []

# We need to loop through both datasets using an index to keep them aligned.
# len(changed_entity_dataset) will be 705.
for i in tqdm(range(len(changed_entity_dataset)), desc="Evaluating vs Original Answers"):
    # This is the example with the FAKE context (e.g., "Zorgoth...")
    modified_example = changed_entity_dataset[i]

    # This is the corresponding example with the REAL context and answer
    original_example = dataset_test[i]

    # Create the input for the model using the MODIFIED context
    input_text = f"Question: {modified_example['question']} Context: {modified_example['context']}"

    # Get the model's output
    output_text = llm(prompt_template % (preprompt, input_text))

    # Store the prediction
    predictions.append(output_text)

    # Store the ORIGINAL answer as our ground truth for this test
    references_original.append(original_example['answers']['text'])

# Score the predictions against the original, real-world answers
em_score_original = compute_exact_match_score(predictions, references_original)
f1_score_original = compute_f1_score(predictions, references_original)

print("\n--- Scores Against ORIGINAL (Memorized) Answers ---")
print(f"EM Score={em_score_original}, F1 Score={f1_score_original}")
## Your code ends ##

Evaluating vs Original Answers:   0%|          | 0/200 [00:00<?, ?it/s]


--- Scores Against ORIGINAL (Memorized) Answers ---
EM Score=6.0, F1 Score=0.1548964830701837


In [None]:
###########################################
### Evaluating modified answers section ###
###########################################

## Your code begins ##
references_modified = []

# We only need to build the list of modified ground truths.
# The predictions list from the previous cell is what we'll use.
for example in tqdm(changed_entity_dataset, desc="Building Modified References"):
    references_modified.append(example['answers']['text'])

# Score the SAME predictions against the NEW (modified) answers
em_score_modified = compute_exact_match_score(predictions, references_modified)
f1_score_modified = compute_f1_score(predictions, references_modified)

print("\n--- Scores Against MODIFIED (Contextual) Answers ---")
print(f"EM Score={em_score_modified}, F1 Score={f1_score_modified}")
## Your code ends ##

Building Modified References:   0%|          | 0/200 [00:00<?, ?it/s]


--- Scores Against MODIFIED (Contextual) Answers ---
EM Score=7.5, F1 Score=0.18269607256437342


In [None]:
# @title Inspecting "Entity Substitution" Predictions

# We need the predictions from your evaluation run.
# Make sure the 'predictions' variable is still in memory from the previous step.

print("--- Analyzing Model Behavior on Entity Substitution Task ---\n")

num_samples_to_show = 20
samples_shown = 0

# Loop through the datasets using an index
for i in range(len(dataset_test)):
    # Get the corresponding items for this index
    original_example = dataset_test[i]
    modified_example = changed_entity_dataset[i]
    prediction = predictions[i]

    original_answer_list = original_example['answers']['text']
    modified_answer_list = modified_example['answers']['text']

    # We only care about examples where a substitution actually happened
    if original_answer_list != modified_answer_list:

        print(f"--- Sample {samples_shown + 1} (Original Index: {i}) ---")

        # Display the question to understand the context
        print(f"Question: {original_example['question']}")

        # Display the three key pieces of information
        print(f"  -> ORIGINAL Answer (Memory):    {original_answer_list}")
        print(f"  -> MODIFIED Answer (Context):   {modified_answer_list}")
        print(f"  -> MODEL'S Prediction:          '{prediction}'")

        print("-" * 50)

        samples_shown += 1
        if samples_shown >= num_samples_to_show:
            break

--- Analyzing Model Behavior on Entity Substitution Task ---

--- Sample 1 (Original Index: 0) ---
Question: What 2015 NFL team one the AFC playoff?
  -> ORIGINAL Answer (Memory):    ['Denver Broncos', 'Denver Broncos', 'Denver Broncos']
  -> MODIFIED Answer (Context):   ['Paris, France iPhone', 'Paris, France iPhone', 'Paris, France iPhone']
  -> MODEL'S Prediction:          'The AFC playoff team in 2015 was the Paris, France iPhone.'
--------------------------------------------------
--- Sample 2 (Original Index: 3) ---
Question: What performer lead the Super Bowl XLVIII halftime show?
  -> ORIGINAL Answer (Memory):    ['Bruno Mars', 'Coldplay', 'Coldplay']
  -> MODIFIED Answer (Context):   ['Serena Williams', 'CN Tower (Toronto, Canada)', 'CN Tower (Toronto, Canada)']
  -> MODEL'S Prediction:          'Not enough info.'
--------------------------------------------------
--- Sample 3 (Original Index: 6) ---
Question: Who decided not to approve paying for renovations at Sun Life Stadi

#### Analysis of the "Entity Substitution" Experiment
**Executive Summary**

The Entity Substitution experiment provides a powerful insight into the model's core limitations when faced with a direct conflict between its internal, memorized knowledge and a contradictory fact presented in the context. The quantitative results show a profound failure, with the model correctly reasoning over the provided context in only 18% of cases (F1 score) and defaulting to its memory in 15% of cases.

The qualitative analysis of the model's predictions reveals the reason for this poor performance: the model lacks a consistent strategy for resolving logical conflicts. Instead of reliably prioritizing the provided context as the source of truth, its behavior is erratic. It may follow the context, retrieve from memory, refuse to answer, or—most troublingly—hallucinate entirely new information by blending keywords from the question with unrelated facts from its memory. This demonstrates a critical failure in higher-order reasoning, indicating the model is more of a pattern-matcher and fact-retriever than a logical reasoner.

Quantitative Results: A Story of Confusion

The scores reveal a model that is "lost," unable to consistently perform either contextual reasoning or fact retrieval when the two are in conflict.


Metric: F1 vs. Original Answers
Score: 15.48%
Interpretation: This measures the model's "greedy reasoner" tendency. In a minority of cases, the model ignored the fake context and produced an answer aligned with its real-world, memorized knowledge. This is a direct failure of contextual adherence.
Metric: F1 vs. Modified Answers
Score: 18.26%
Interpretation: This measures the model's contextual reasoning ability. In only a small fraction of cases did the model successfully follow the fake context and provide the correct adversarial answer. This is a direct failure of reasoning.


The key takeaway is that both scores are exceptionally low. The model doesn't reliably choose a source of truth, leading to an overall failure rate of over 80%.

**In-Depth Qualitative Analysis of Model Behavior**

The sample predictions provide clear evidence of four distinct and inconsistent behaviors when the model is faced with a logical conflict.

Behavior 1: Successful Contextual Reasoning (Rare Success)
This is the desired behavior, where the model correctly prioritizes the provided text as the source of truth, ignoring its own memory.

Sample 6:

Question: What two Denver players ranked at 5 percent for sacks?

Modified Answer (Context): ['Elon Musk and Queen Elizabeth II']

Prediction: 'Elon Musk', 'Queen Elizabeth II'

Analysis: This is a perfect success. The model read the absurd statement in the context and obediently reported it as fact, demonstrating it can follow instructions. This behavior, however, is the exception, not the rule.

Behavior 2: Greedy Retrieval (Defaulting to Memory)
This is the classic "greedy reasoner" failure, where the model ignores the contradictory context and answers from its internal knowledge base.

Sample 8:

Question: What is the name of the quarterback who was 38 in Super Bowl XXXIII?

Modified Answer (Context): ['Donald Trump']

Prediction: 'John Elway'

Analysis: The model recognized a trivia question it knew the answer to. It completely disregarded the context stating the answer was "Donald Trump" and retrieved the correct real-world fact, "John Elway," from its memory.

Behavior 3: Confused Refusal (System Paralysis)
In many cases, the logical conflict between memory and context seems to paralyze the model, causing it to incorrectly refuse to answer.

Sample 2:

Question: What performer lead the Super Bowl XLVIII halftime show?

Modified Answer (Context): ['Serena Williams']

Prediction: 'Not enough info.'

Analysis: The preprompt instructs the model to answer from the context. The answer, "Serena Williams," is clearly in the context. However, the model's internal knowledge that Serena Williams is a tennis player, not a musician, likely created a logical conflict it couldn't resolve, leading it to give up.

Behavior 4: Complex Hallucination (The Most Dangerous Failure)
This is the most unpredictable and concerning failure mode. The model ignores both the context and its correct internal knowledge, instead fabricating a new, incorrect answer by blending keywords or concepts.

Sample 5:

Question: Who was the defensive coordinator for the Broncos in 2015?

Modified Answer (Context): ['Malala Yousafzai']

Prediction: 'Barack Obama was the defensive coordinator for the Broncos in 2015.'

Analysis: This is a profound failure. The model ignored the context's "Malala Yousafzai." It ignored the correct answer ("Wade Phillips"). Instead, it hallucinated a completely new and false reality, associating another famous name, "Barack Obama," with the role.

Sample 17:

Question: What was the Doritos customer Super Bowl ad campaign called?

Modified Answer (Context): ['Crash the Battle of Gettysburg (1863)']

Prediction: 'Super Bowl commercial campaign was referred to as "Hey Jude."'

Analysis: Here, the model has blended concepts. It sees "Super Bowl ad campaign" and instead of using the nonsensical context or the correct answer ("Crash the Super Bowl"), it pulls another famous piece of pop culture, the song "Hey Jude," out of thin air and presents it as the answer.

**Final Conclusions**

1. Is the model hallucinating or fabricating information?

Yes, frequently and unpredictably. The "Entity Substitution" task is a powerful trigger for hallucinations. The model fabricates answers by retrieving incorrect facts from memory (e.g., Barack Obama as a coach) or by synthesizing entirely new information (e.g., the "Hey Jude" ad campaign).

2. Does the model seem biased or inconsistent?

It is profoundly inconsistent. The sample set shows the model responding to the same type of logical conflict in at least four different ways. This unpredictability makes it unreliable for any task requiring factual consistency based on a provided source of truth.

3. Does the model rely too much on the context? Or its memory?

The model has a conflict-resolution failure. It has a slight statistical preference for the provided context, but this is overshadowed by its tendency to get "stuck" when its memory conflicts with the text. Rather than having a clear hierarchy (e.g., "always trust the context"), it behaves erratically, with the conflict itself often leading to a total breakdown in logical response generation.










### 3.3. Nonsense Word Substitution

In this segment of the adversarial dataset construction, our primary aim is to assess the model's ability to adapt to new, artificially coined terms and evaluate its reasoning capabilities based on the provided context. We will implement a systematic approach to generate nonsense words, replace identifiable entities in the dataset with these generated words, and provide a definition for each nonsense word. This process encapsulates the essence of exploring how well the model can understand and use newly defined terms to answer questions accurately.

The first task at hand is to design a function that generates nonsense words. The goal here is to create a word that doesn't carry any pre-existing meaning. The function `generate_nonsense_word` below is your starting point. Implement the function such that it creates and returns a nonsense word.

In [None]:

# @title Generate Nonsense Words (Your Implementation)
import random
import string

def generate_nonsense_word():
    ## Your code begins ##
    vowels = "aeiou"
    consonants = "".join(set(string.ascii_lowercase) - set(vowels))

    # Create a word between 6 and 10 characters long
    word_length = random.randint(6, 10)

    word = []
    # Start with a consonant
    start_with_consonant = random.choice([True, False])

    for i in range(word_length):
        if (i % 2 == 0) == start_with_consonant:
            word.append(random.choice(consonants))
        else:
            word.append(random.choice(vowels))

    # Capitalize the first letter to make it look like a proper noun
    return "".join(word).capitalize()
    ## Your code ends ##

# Let's test it
print("Generated nonsense words:", [generate_nonsense_word() for _ in range(5)])


Generated nonsense words: ['Eqowadi', 'Axaqadu', 'Wuvibecilu', 'Omefofovat', 'Qijigoduta']


Having devised a mechanism to create nonsense words, we transition into the heart of this section—creating the adversarial dataset. We will employ the Spacy library's Named Entity Recognition (NER) system to identify entities within the text. Each identified entity will be replaced by a generated nonsense word, and a definition will be provided for every replacement. The create_adversarial_example function below encapsulates this task. Implement the function, and upon executing it, you will observe a sample example from the adversarial dataset that illustrates the substitutions and definitions.

In [None]:
# @title Create Adversarial Dataset (Your Implementation)
nlp = spacy.load("en_core_web_sm")

def create_adversarial_example(example):
    doc = nlp(example['context'])
    new_example = example.copy()

    ## Your code begins ##

    # This will store our mapping of {original_entity: nonsense_word}
    entity_replacements = {}

    # First, find all unique entities and generate a nonsense word for each
    for ent in doc.ents:
        if ent.text not in entity_replacements:
            entity_replacements[ent.text] = generate_nonsense_word()

    altered_context = new_example['context']
    altered_question = new_example['question']

    # A pro-tip: replace from longest to shortest entity to avoid issues
    # where a short entity is a substring of a long one (e.g., "New York" vs "York").
    # We sort the items by the length of the entity string, in reverse.
    sorted_replacements = sorted(entity_replacements.items(), key=lambda item: len(item[0]), reverse=True)

    for entity, nonsense_word in sorted_replacements:
        altered_context = altered_context.replace(entity, nonsense_word)
        altered_question = altered_question.replace(entity, nonsense_word)

    # This reverses the map to {nonsense_word: original_entity} for the definitions
    definitions = {v: k for k, v in entity_replacements.items()}

    ## Your code ends ##

    # This part of the template correctly formats the final output
    return {
      'altered_context': altered_context,
      'altered_question': altered_question,
      'definitions': ', '.join([f'{k} is another word for {v}' for k, v in definitions.items()]),
      # We need to carry the original answers along for the evaluation!
      'answers': new_example['answers']
    }

adversarial_examples = dataset_test.map(create_adversarial_example)

clear_output()
# Let's inspect a sample to see the transformation
sample = adversarial_examples[0]
print("--- ADVERSARIAL EXAMPLE ---")
print("DEFINITIONS:\n", sample['definitions'])
print("\nALTERED QUESTION:\n", sample['altered_question'])
print("\nALTERED CONTEXT (Snippet):\n", sample['altered_context'][:400] + "...")
print("\nORIGINAL ANSWERS:\n", sample['answers']['text'])

--- ADVERSARIAL EXAMPLE ---
DEFINITIONS:
 Holuyaxafu is another word for Super Bowl 50, Ufejuda is another word for American, Pofosas is another word for the National Football League, Yiqumucaq is another word for NFL, Biheqixoh is another word for the 2015 season, Canoyogeh is another word for The American Football Conference, Ekedet is another word for Denver, Egaxoyusu is another word for Broncos, Agerup is another word for the National Football Conference (, Yuxope is another word for NFC, Alononaki is another word for Carolina Panthers, Usihoziy is another word for 24–10, Lidopupi is another word for third, Aheyif is another word for Super Bowl, Idemowex is another word for February 7, 2016, Vijuvufiko is another word for Levi's Stadium, Qurusa is another word for Santa Clara, Lidumuma is another word for California, Epavalis is another word for the 50th Super Bowl, Qusewuco is another word for Roman, Iheguc is another word for Super Bowl L, Iqewah is another word for Arabic, Jege

With the adversarial dataset in place, the stage is set for evaluating the model's performance. We aim to uncover how well the model navigates through the maze of newly introduced terms while clinging to the definitions provided. Implement the evaluation code block below to gauge the model's performance on this adversarial dataset. The insights garnered from this exercise will shed light on the model's ability to adapt to new information and reason based on provided definitions, which is a step closer to understanding the model's reasoning faculties.

In [None]:
# @title Evaluating Llama-2 on the Adversarial Dataset (Corrected)

predictions = []
ground_truths = [] # Changed from 'references' for consistency

# The loop must go over our newly created 'adversarial_examples'
for example in tqdm(adversarial_examples, desc="Evaluating Nonsense Words"):
    # Construct the full input with the new 'Definitions' field
    input_text = f"Question: {example['altered_question']} Context: {example['altered_context']} Definitions: {example['definitions']}"

    output_text = llm(prompt_template % (preprompt, input_text))

    predictions.append(output_text)

    # The ground truth is the ORIGINAL answer, which we carried over in our mapping function.
    ground_truths.append(example['answers']['text'])

# Remove the trailing comma
em_score = compute_exact_match_score(predictions, ground_truths)
f1_score = compute_f1_score(predictions, ground_truths)

print("\n--- Scores on Nonsense Word Substitution ---")
print(f"EM Score={em_score}, F1 Score={f1_score}")

Evaluating Nonsense Words:   0%|          | 0/200 [00:00<?, ?it/s]


--- Scores on Nonsense Word Substitution ---
EM Score=3.0, F1 Score=0.1609188685697973


In [None]:
# @title Inspecting "Nonsense Word Substitution" Predictions

# We need the predictions from your evaluation run.
# Make sure the 'predictions' variable is still in memory from the final evaluation.

# We also need the original dataset_test and the adversarial_examples dataset.

print("--- Analyzing Model Behavior on Nonsense Word Substitution Task ---\n")

num_samples_to_show = 10

# Loop through the datasets using an index
for i in range(num_samples_to_show):
    # Get the corresponding items for this index
    original_example = dataset_test[i]
    adversarial_example = adversarial_examples[i]
    prediction = predictions[i]

    original_answer_list = original_example['answers']['text']

    print(f"--- Sample {i + 1} ---")

    # Display the full context the model was given
    print(f"\n[DEFINITIONS PROVIDED TO MODEL]")
    print(f"'{adversarial_example['definitions']}'")

    print(f"\n[ALTERED QUESTION]")
    print(f"'{adversarial_example['altered_question']}'")

    print("\n[ANALYSIS]")
    print(f"  -> CORRECT Answer (Original): {original_answer_list}")
    print(f"  -> MODEL'S Prediction:        '{prediction}'")

    print("\n" + "="*70 + "\n")

--- Analyzing Model Behavior on Nonsense Word Substitution Task ---

--- Sample 1 ---

[DEFINITIONS PROVIDED TO MODEL]
'Holuyaxafu is another word for Super Bowl 50, Ufejuda is another word for American, Pofosas is another word for the National Football League, Yiqumucaq is another word for NFL, Biheqixoh is another word for the 2015 season, Canoyogeh is another word for The American Football Conference, Ekedet is another word for Denver, Egaxoyusu is another word for Broncos, Agerup is another word for the National Football Conference (, Yuxope is another word for NFC, Alononaki is another word for Carolina Panthers, Usihoziy is another word for 24–10, Lidopupi is another word for third, Aheyif is another word for Super Bowl, Idemowex is another word for February 7, 2016, Vijuvufiko is another word for Levi's Stadium, Qurusa is another word for Santa Clara, Lidumuma is another word for California, Epavalis is another word for the 50th Super Bowl, Qusewuco is another word for Roman, Ih

#### Analysis of the "Nonsense Word Substitution" Experiment

**Executive Summary:**
The "Nonsense Word Substitution" experiment provides the most compelling evidence of the model's latent reasoning abilities. This task is designed to be impossible to solve with memorized knowledge, forcing the model to rely entirely on in-context learning by parsing and applying a list of newly provided definitions.

While the quantitative scores are low (EM: 3.0%, F1: 0.16), a qualitative analysis of the model's predictions reveals that these scores are highly misleading. The model successfully reasons through the nonsense words and definitions to arrive at the correct underlying answer in a remarkable number of cases. Its failure is not one of logic, but of formatting. The model frequently provides the correct answer in a slightly different format (e.g., the number 5 instead of the word five, or by using the nonsense word itself as a placeholder), which causes the strict EM and F1 metrics to register a failure. This suggests the model's core reasoning engine is more powerful than its ability to adhere to precise output formats under complex conditions.

**Quantitative Results: A Deceptive Picture**

The numerical scores, viewed in isolation, suggest near-total failure.

EM Score: 3.0%

F1 Score: 16.09%

These scores indicate that the model's output rarely matched the ground truth answers exactly. However, as the following qualitative analysis shows, this is not due to a failure in reasoning but rather a mismatch in expression.

**In-Depth Qualitative Analysis of Model Behavior:**

The sample predictions are the key to understanding the model's true performance. They reveal a surprisingly high success rate in the underlying reasoning task, masked by superficial formatting differences.

Behavior 1: Perfect Reasoning and Formatting (Rare Success)
In some cases, the model performs the entire task perfectly. It uses the definitions to translate the nonsense words, finds the answer in the context, and formats it to match the ground truth.

Sample 2:

Altered Question: What year did the Carolina Hoxekilic form? (Hoxekilic = Panthers)

Correct Answer: ['1995.']

Prediction: '1995.'

Analysis: A flawless execution. The model correctly identified that the answer was 1995 and formatted it perfectly.

Behavior 2: Correct Reasoning, Mismatched Format (The Dominant Behavior)
This is the most common and insightful category. The model correctly solves the logic puzzle but presents the answer in a way that the automated metrics score as incorrect.

Sample 3:

Altered Question: How many tackles did Kixiqes accomlish...? (Kixiqes = Von Miller)

Correct Answer: ['five']

Prediction: 'Kixiqes accomplished 5 tackles in the game.'

Analysis: The model's reasoning was perfect. It understood Kixiqes was the subject and found the correct number of tackles. However, it outputted the digit 5 instead of the word five, resulting in a zero for the EM score.

Sample 9:

Altered Question: ...what was the coach's name that coached both teams...

Correct Answer: ['John Fox']

Prediction: 'Ohozid.' (Ohozid = John Fox)

Analysis: Here, the model correctly identified the answer but chose to use the nonsense word Ohozid as a placeholder in its response instead of translating it back. The reasoning is correct, but the expression is not what the scoring function expects.

Behavior 3: Reasoning Failure (Genuine Error)
While less common than formatting mismatches, there are still instances where the model's logic breaks down entirely, leading to a truly incorrect answer or a refusal to answer.

Sample 14:

Altered Question: What was Ulanojiwuy's average yards per carry...? (Ulanojiwuy = Ronnie Hillman)

Correct Answer: ['4.7']

Prediction: 'Not enough information.'

Analysis: In this case, the cognitive load of processing the many definitions and the complex question seems to have overwhelmed the model, causing it to incorrectly default to a refusal.

Final Conclusions
Is the model hallucinating or fabricating information?

No, surprisingly rarely. Unlike the previous experiments, the model's primary failure mode here is not making things up. Instead, it either correctly reasons and fails on formatting, or it gives up. The strict set of definitions seems to anchor the model and prevent it from inventing facts.

Does the model seem biased or inconsistent?

Its output formatting is inconsistent, which is a significant issue for automated evaluation. However, its underlying reasoning is surprisingly consistent and robust. It repeatedly shows that it can understand and apply newly defined terms from the prompt, which is a sophisticated cognitive task.

Does the model rely too much on the context?

Yes, in the best way possible. This experiment proves that the model can rely exclusively on the context (including the provided definitions) to perform complex reasoning. It successfully ignores its vast pre-trained knowledge about football players and instead works within the artificial rules of the game set by the prompt. This demonstrates a powerful, albeit imperfect, capacity for in-context learning. The low scores are therefore not an indictment of the model's reasoning, but rather a reflection of the brittleness of automated, text-based evaluation metrics.

## 4. Conclusion
This exercise navigates through the curious interplay of reasoning and retrieval within Large Language Models, particularly focusing on the Llama-2 model. Through meticulous evaluation and crafting adversarial datasets, we aim to provide a window into the model's behavior, shedding light on its strengths, weaknesses, and its approach to deciphering and responding to questions under varying conditions.

Now, reflect upon the model's performance and share your insights:


5. <font color="green"> Did the model's performance align with your expectations? </font>

In some ways, yes, but in many ways, the results were more nuanced and surprising than initially hypothesized.

My initial expectation was that the model would be a "greedy reasoner," consistently failing adversarial tests by defaulting to its memorized knowledge. The results paint a more complex picture:

**Answer Absence:** The model performed worse than expected. My hypothesis was that it would handle this task well, similar to the initial simple test. However, with a success rate of only 25%, it proved that correctly identifying a lack of information within a full but irrelevant context is a significant challenge. Its primary failure was fabricating answers from the irrelevant text, a behavior I underestimated.

**Entity Substitution:** The model's failure was as expected, but for different reasons. I predicted it would fail by defaulting to its memory (greedy retrieval). While this did happen, the more dominant failure modes were confused refusal ("Not enough info.") and complex hallucination (inventing entirely new, unrelated answers). The direct conflict between memory and context didn't just lead to a simple choice between the two; it often caused the model's entire reasoning process to break down, which was a deeper failure than anticipated.

**Nonsense Word Substitution:** The model performed far better than expected. The low numerical scores (16% F1) were deceptive. A qualitative look revealed that the model's underlying reasoning was successful in a majority of cases. It correctly applied the new definitions but failed on minor output formatting. Its ability to perform this abstract, in-context learning task was surprisingly robust and a genuine strength.



6. <font color="green"> How do the adversarial evaluations contribute to our understanding of the model's strengths and weaknesses in terms of reasoning and retrieval? </font>

These adversarial evaluations are critical because they dissect the model's abilities in a way standard benchmarks cannot. They act as carefully designed scientific experiments that isolate specific cognitive tasks, revealing the true nature of the model's reasoning and retrieval systems.

1. The "Answer Absence" test revealed a weakness in meta-reasoning. The model's primary challenge isn't just answering questions; it's first determining if a question is answerable from a given text. Its tendency to fabricate answers from irrelevant context shows a compulsion to be helpful, even when the correct action is to state ignorance.

2. The "Entity Substitution" test revealed a critical failure in conflict resolution. This was the most insightful experiment. Llama-2 does not have a clear rule like "always trust the provided context over memory." When its internal knowledge directly contradicts the provided text, it becomes inconsistent. This test exposes its biggest weakness: it is not a pure logical reasoner but a probabilistic system that gets easily paralyzed or confused by contradictions.

3. The "Nonsense Word Substitution" test revealed a surprising strength in in-context learning. This test completely neutralized the model's ability to retrieve from memory. Its success in this area shows a remarkable ability to adapt to new, artificial rules within a single prompt. This is a powerful form of reasoning that goes beyond simple fact retrieval. It demonstrates that the model can manipulate abstract symbols according to newly given instructions, which is a cornerstone of higher-level intelligence.

In conclusion, these evaluations show that Llama-2 is not a simple "greedy reasoner" that just parrots memorized facts. It is a more complex system that genuinely attempts to reason based on the provided context. However, its reasoning is brittle, easily broken by logical contradictions, and prone to hallucination when information is absent. Its greatest strength lies not in its existing knowledge, but in its impressive—though imperfect—ability to learn and apply new rules on the
