Sheet 7.1: Behavioral assessment & Evaluation
======
**Author**: Polina Tsvilodub

This sheet focuses on evaluating the input-output (I/O) behavior of LLMs. The approach is inspired by experimental paradigms and terminology from cognitive science and psychology, which investigate a black box (the human mind) by observing its behavior across different interesting conditions (inputs). Such assessment of LLMs (also black boxes) can therefore be called "behavioral assessment". It is one piece that should work in combination with the attribution methods discussed in the previous sheet in order to provide a fuller understanding of what LLMs can or cannot do (I/O testing) and how they do it (attributions).
Following the structure of the lecture, we will first look at practical aspects of benchmark testing, and then look at "machine psychology", which often draws on the same methods but addresses somewhat different research questions.

Therefore, the learning goals of this sheet are:
* look at examples of a few different benchmarks and how they are usually constructed
* become familiar with standard evaluation metrics and methods used for evaluating LLMs on benchmarks (these include PPL, log-probability-based scores, accuracy, F1, free generation etc.)
* become familiar with more advanced evaluation measures, e.g., those used in HELM
* look at examples of machine psychology and how, in practice, LLM performance can be compared to various human data
* understand how LLMs can be tested as "psychology subjects".

## Benchmark testing

Such I/O evaluations are the most common approach to LLM evaluation. From a more technical / engineering-oriented perspective, which aims at building LLMs for specific applications, it is very common to use large benchmark datasets designed to test models’ performance on a variety of tasks in an automated way. This is often done by checking the models’ outputs against ground truth answers or by computing standard scores for certain datasets. The quality of LLMs is then measured by their scores on these benchmarks.

Initially, these benchmarks were designed to test LLMs’ linguistic performance, since the goal of building the model is a system that predicts grammatical and fluent natural language. Therefore, some of the first benchmarks (or commonly used textual datasets) are, for instance, Wikipedia texts, the Penn Treebank, and the GLUE benchmark. Wikipedia texts are often used for measuring the perplexity of the model on this standard text (see below for details). The Penn Treebank was often used for fine-tuning or evaluating models, e.g., on part-of-speech tagging as an approximation of syntactic performance, while the GLUE benchmark contains tasks which are supposed to approximate (semantic) natural language understanding in the form of paraphrase tasks, sentiment classification, natural language inference tasks etc.

Recent LLMs have shown perhaps unexpectedly impressive generalization to tasks which seem to require more than linguistic fluency, like solving math and reasoning problems. Therefore, more recent benchmarks incorporate tests of various tasks going beyond linguistic capability. Two of the most widely used benchmarks are the [MMLU](https://arxiv.org/abs/2009.03300) and the [BIG-Bench](https://arxiv.org/abs/2206.04615) datasets.
Given that SOTA LLMs are also often designed as assistants and embedded in user-facing applications, it has also become crucial to evaluate potential social impacts of their outputs, e.g., by assessing biases and toxicity of the generations. To this end, specialized benchmarks like [RealToxicityPrompts](https://arxiv.org/abs/2009.11462) or [WinoGender](https://arxiv.org/abs/1804.09301) were created.

One crucial assumption behind benchmark evaluation is that benchmarks are representative of tasks and cover a wide variety of data that the model should perform well on in order to count as a good model for its target deployment. Although benchmarks arguably provide a wide coverage (they commonly contain thousands of inputs and answers), they often test only an approximation of what the model does in deployment (i.e., free text generation).
Furthermore, with newer models trained on newer crawls of the internet, there are increasing worries about so-called *contamination*, i.e., the test datasets being included in the training data of the models, which potentially inflates the scores that are meant to measure the models' generalization. For instance, Wikipedia is included in the training data of most modern models.

Scalably evaluating longer generated texts is quite a difficult task. This is because, intuitively, there is no single "ground truth answer" when it comes to writing; there are many equally good ways of writing a summary of a text, or even multiple ways of translating a sentence. This makes generated text difficult to evaluate automatically. This is still a largely unsolved issue (!), so that human or machine evaluation is often used. The available methods for automated text scoring are rooted in work on summarization and machine translation, and require (human-written) gold-standard texts.

Note that when mentioning a *model* in the explanations, we refer to trained models which are evaluated with respect to their performance, i.e., in *inference mode*. If one wanted to track the performance on certain benchmarks during training, one could also run evaluations on intermediate model checkpoints during training, too. Just note that the model is "frozen" and runs in inference mode during all of the testing described in this sheet.

In sum, the reasons why benchmarks are so widely used are a few core advantages: 
* the availability of a few well-known datasets leads to (somewhat of a) standardization of the evaluation procedure across different work.
* their large scale often provides high coverage and more reliable results (although coverage might not always mean consistent quality or the variability expected, e.g., by linguists).
* **crucially**: they are designed to be evaluated with easy-to-compute *automatic evaluation metrics*. You have heard about them in the lecture; we will recap these below and then work with them in practice.

### Metrics

**Perplexity**: It is computed as:
$$PPL_{LM}(x_0 ... x_n) = \exp\left(\frac{1}{n}\sum_{i=1}^n - \log P_{LM}(x_i \mid x_{<i})\right) $$

Note that this is only applicable to causal language models. This is the metric commonly used, e.g., on the Wikipedia texts. For instance, the PPL of GPT-2 on the Penn Treebank dataset is 35.76, while the perplexity of GPT-3 on the same dataset is 20.50. 
The idea is that an ideal model should have a perplexity as close to 1 as possible (the lower bound, reached when the average negative log probability is 0) for a naturally occurring text that it has learned, thereby approximating good fit to the "ground truth distribution of natural language".
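
To make the formula concrete, here is a minimal sketch that computes perplexity from a list of per-token probabilities. The probabilities are made up for illustration, and the helper name `perplexity` is hypothetical (it is not part of any library used in this sheet).

```python
import math

def perplexity(token_probs):
    """Perplexity = exp of the average negative log probability per token."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# a model that is certain about every token reaches the lower bound of 1
ppl_certain = perplexity([1.0, 1.0, 1.0])
# a model that assigns probability 1/4 to every observed token
ppl_uniform = perplexity([0.25, 0.25, 0.25])
# made-up probabilities for a three-token text
ppl_mixed = perplexity([0.5, 0.1, 0.8])

print(ppl_certain, ppl_uniform, ppl_mixed)
```

One way to read the result: perplexity behaves like an effective branching factor. A model that spreads its probability evenly over four options at each step ends up with a perplexity of (approximately, up to floating point) 4.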

Below is some code for computing the perplexity of different sizes of GPT-2 for an excerpt from Wikipedia.
> <strong><span style="color:#D83D2B;">Exercise 7.1.1: Calculating perplexity</span></strong>
>
> 1. Please complete the code below. (Hint: only one simple transformation is required in order to calculate the perplexity from the NLL loss)
> 2. Compare the results for the models of different sizes. Does their comparison (ordering) match your intuition?

In [2]:
import torch
if torch.cuda.is_available():
    device = torch.device('cuda')
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device('cpu')

In [14]:
# perplexity evaluation
from datasets import load_dataset
import numpy as np
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
test = load_dataset("wikitext", 'wikitext-2-raw-v1', split="test")

input_tokens = tokenizer(
    "\n\n".join(test["text"][:10]), 
    return_tensors="pt",
).input_ids.to(device)

# select a part of the text
# input_tokens = input_tokens[:, 10:50]

# load models of different sizes
model_s = AutoModelForCausalLM.from_pretrained("gpt2").to(device)
model_xl = AutoModelForCausalLM.from_pretrained("gpt2-xl").to(device)

# evaluation only: disable gradient tracking to save memory
with torch.no_grad():
    output_s = model_s(input_tokens, labels=input_tokens)
    output_xl = model_xl(input_tokens, labels=input_tokens)
print("Average NLL for wikipedia chunk under small model ", output_s.loss.item())
print("Average NLL for wikipedia chunk under xl model ", output_xl.loss.item())

### your code for computing the perplexity goes here ###
perplexity_s = np.exp( output_s.loss.item())
perplexity_xl = np.exp( output_xl.loss.item())

print(f"PPL of smaller model: {perplexity_s}, PPL of larger model: {perplexity_xl}")

Average NLL for wikipedia chunk under small model  9.524829864501953
Average NLL for wikipedia chunk under xl model  12.172213554382324
PPL of smaller model: 13695.599618637785, PPL of larger model: 193341.5425429966


[This](https://huggingface.co/docs/transformers/en/perplexity) page discusses how to deal with the fixed length of the context window of LMs when computing the perplexity of longer texts (e.g., Wikipedia).
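
The sliding-window strategy discussed on that page can be illustrated without loading a model. The sketch below only computes the bookkeeping: which token span each window covers and how many of its final tokens are newly scored (the earlier tokens in the window serve purely as context). The function name `strided_windows` is hypothetical, and the sketch assumes `stride <= max_length`.

```python
def strided_windows(seq_len, max_length, stride):
    """Yield (begin, end, n_scored) for each window.

    begin/end are token indices of the window; n_scored is the number of
    tokens at the end of the window that have not been scored by a
    previous window. Assumes stride <= max_length.
    """
    prev_end = 0
    for begin in range(0, seq_len, stride):
        end = min(begin + max_length, seq_len)
        yield begin, end, end - prev_end
        prev_end = end
        if end == seq_len:
            break

# toy setting: 10 tokens, a context window of 4 tokens, moved by 2 tokens
windows = list(strided_windows(seq_len=10, max_length=4, stride=2))
print(windows)
```

Every token is scored exactly once, and a smaller stride gives each scored token more preceding context at the cost of more forward passes.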


**Accuracy**: this is a standard metric widely used in many domains, not only NLP. It computes the proportion of correct responses in a set of tasks and presupposes that there is a single correct answer for a given input. We have seen in the lecture that one way to compute accuracy is to score each answer option, given the input, under the LLM, and retrieve the predicted option via $argmax$; i.e., the option to which the model assigns the highest (log) probability is taken to be the chosen option. If this option is the ground truth option, the model's prediction is correct for this test item (i.e., correctness = 1); otherwise, correctness = 0. Accuracy is then the average correctness across all the test items in the benchmark. The lecture pointed out limitations of the argmax approach. Just as a recap, the underlying assumption is that a model that can perform a task correctly will predict:
$$\log P_{LM}(\text{correct label} \mid \text{context}) >  \log P_{LM}(\text{incorrect label} \mid \text{context})$$
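
As a small illustration of this procedure, the sketch below applies the argmax rule to made-up log probabilities and averages the per-item correctness. The scores, labels and the helper name `predict_option` are invented for this example; in practice the scores would come from a model.

```python
def predict_option(option_scores):
    """Return the index of the highest-scoring answer option."""
    return max(range(len(option_scores)), key=lambda i: option_scores[i])

# made-up log probabilities for three test items with three options each
scores_per_item = [
    [-4.2, -1.3, -6.0],
    [-2.5, -2.9, -3.1],
    [-5.0, -4.1, -0.7],
]
# made-up ground truth indices
gold_labels = [1, 2, 2]

predictions = [predict_option(s) for s in scores_per_item]
correctness = [int(p == g) for p, g in zip(predictions, gold_labels)]
accuracy = sum(correctness) / len(correctness)
print(predictions, accuracy)
```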

The advantage of this approach is that it scores only the available answer options under the model, which is an especially important constraint for weaker models. However, more powerful SOTA LLMs, especially if they are instruction-tuned, are often also tested via *text generation*. I.e., the input is given with an appropriate instruction, and the model's generated text is evaluated via string matching (e.g., regex or simple matching). If the correct answer option was generated, the model's correctness is 1 for this trial, and 0 otherwise.
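
A minimal sketch of such string matching is shown below. The generated strings are made up, the helper `extract_choice` is hypothetical, and the regular expression is just one simple possibility; real evaluation harnesses typically use more careful answer extraction.

```python
import re

def extract_choice(generation, labels=("A", "B", "C", "D", "E")):
    """Return the first standalone option label in the generated text, or None."""
    pattern = r"\b(" + "|".join(labels) + r")\b"
    match = re.search(pattern, generation)
    return match.group(1) if match else None

# made-up model generations and ground truth labels
generations = ["The answer is B.", "C. department store", "I am not sure."]
gold = ["B", "A", "D"]

extracted = [extract_choice(g) for g in generations]
correctness = [int(e == g) for e, g in zip(extracted, gold)]
accuracy = sum(correctness) / len(correctness)
print(extracted, accuracy)
```

Note that such simple matching is brittle: a generation that begins with the article "A" would be read as option A, which is one reason why the output format is usually constrained via the instruction.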

Below is some code exemplifying evaluating a model on a question answering benchmark CommonsenseQA which we have already used in the homework, via scoring answers under the model. This now provides an automatic implementation of the last task of HW1 / task 2 in HW2. For retrieving conditional log probabilities of different options, given a context, we will be using the package [`minicons`](https://github.com/kanishkamisra/minicons).

Note that here we are interested in scoring the different response options, given the questions, under the model, rather than prompting the model with a list of possible options and letting it generate the option label. Therefore, the wrangling of the dataset is slightly different than in the homework.

> <strong><span style="color:#D83D2B;">Exercise 7.1.2: Calculating accuracy</span></strong>
>
> 1. Please complete the code below.
> 2. Compare the results to your results from the homework. Which are better? Do you think the log probability based evaluation is better than the strategy we used in the homework? Why (not)?
> 3. What is the expected chance accuracy on this dataset? Why is it important to consider chance accuracy when interpreting the results of a system?
> 4. The lecture mentioned effects of various bias corrections that can be applied to the raw scores. In the code below, by default, a length correction is applied (i.e., average log probabilities are used). Use the docs / examples of the minicons package [here](https://github.com/kanishkamisra/minicons/blob/master/examples/surprisals.md) to retrieve "raw" log probabilities of the completions (i.e., sums over the token log probabilities) and use those to calculate the accuracy. Do the results change?

In [15]:
# load dataset 
dataset = load_dataset("tau/commonsense_qa")



In [21]:
def massage_input_text(example):
    """
    Helper for converting labels, answer options
    into a single string.

    Arguments
    ---------
    example: dict
        Sample input from the dataset which contains the 
        question, answer labels (e.g. A, B, C, D),
        the answer options for the question, and which 
        of the answers is correct.
    
    Returns
    -------
    answer_options: list[str]
        Formatted list of answer options, one string per option
        (e.g., ['A. <option 1>', 'B. <option 2>', ...]).
    """
    # combine each label with its corresponding text
    answer_options_list = list(zip(
        example["choices"]["label"],
        example["choices"]["text"]
    ))
    # join each label and text with . and space
    answer_options = [f"{label}. {text}" for label, text in answer_options_list]

    return answer_options

# process input texts of validation dataset
massaged_dataset_val = dataset["validation"].map(
    lambda example: {
        "text": example["question"],
        "answers": massage_input_text(example),
        # get the index of the correct answer
        "label": example["choices"]["label"].index(example["answerKey"])
    }
)

In [23]:
massaged_dataset_val[0]

{'id': '1afa02df02c908a558b4036e80242fac',
 'question': 'A revolving door is convenient for two direction travel, but it also serves as a security measure at a what?',
 'question_concept': 'revolving door',
 'choices': {'label': ['A', 'B', 'C', 'D', 'E'],
  'text': ['bank', 'library', 'department store', 'mall', 'new york']},
 'answerKey': 'A',
 'text': 'A revolving door is convenient for two direction travel, but it also serves as a security measure at a what?',
 'answers': ['A. bank',
  'B. library',
  'C. department store',
  'D. mall',
  'E. new york'],
 'label': 0}

In [None]:
# iterate over part of the validation set and compute accuracy 
# (the test set doesn't have ground truth labels)

# set up a scorer 
from minicons import scorer 

lm_scorer = scorer.IncrementalLMScorer(
    'gpt2',
    device=device,
)
# initialize list for storing the correctness of the model predictions
correctness = []

for i in range(100):
    # get the ith example from the validation set
    example = massaged_dataset_val[i]
    # get the text of the question
    question = example['text']
    # get the list of answer options
    answer_options = example['answers']
    # get the ground truth label
    label = example['label']
    
    # pass a list of contexts and a list of continuations to be scored
    answer_scores = lm_scorer.conditional_score(
        # format the question into a list of same length as the number of answer options
        [question] * len(answer_options), 
        answer_options,
    ) 
    # get the predicted answer (Hint: check above how we determine what the model predicts is the correct answer)
    predicted_label = ### YOUR CODE HERE ###
    # check if the prediction is correct
    is_correct = predicted_label == label
    correctness.append(is_correct)

# compute the accuracy
print("Accuracy: ", np.mean(correctness))

**F1-score**:

This is a score that is commonly used on *binary* tasks (i.e., tasks with only two possible answer options) instead of accuracy. It is calculated from the *precision* and *recall* of the test results.  The precision is the number of true positive results divided by the number of all samples predicted to be positive, including those not identified correctly. The recall is the number of true positive results divided by the number of all samples that should have been identified as positive. Here, positive and negative results refer to predictions in each of the two answer categories, respectively. 

The F1 score is the harmonic mean of the precision and recall. It thus symmetrically represents both precision and recall in one metric:
$$F1 = 2 \times \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$
The more generic $F_{\beta}$ score applies additional weights, valuing one of precision or recall more than the other.
The highest possible value of an F-score is 1.0, indicating perfect precision and recall, and the lowest possible value is 0, if precision and recall are zero. 
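
The definitions above can be checked on a toy example. The following sketch computes precision, recall and $F_{\beta}$ (with $\beta = 1$ giving F1) from binary labels; the label lists are made up, and the helper name `precision_recall_fbeta` is hypothetical. It uses the standard form $F_{\beta} = (1 + \beta^2) \cdot \frac{\text{precision} \cdot \text{recall}}{\beta^2 \cdot \text{precision} + \text{recall}}$.

```python
def precision_recall_fbeta(y_true, y_pred, beta=1.0):
    """Compute precision, recall and F-beta for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    b2 = beta ** 2
    fbeta = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, fbeta

# made-up ground truth and predictions
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 1, 1, 0, 0, 1]
precision, recall, f1 = precision_recall_fbeta(y_true, y_pred)

# a degenerate system that always predicts the positive class
_, recall_all_pos, f1_all_pos = precision_recall_fbeta(y_true, [1] * len(y_true))
print(precision, recall, f1, f1_all_pos)
```

The second call illustrates why F1 should be interpreted with care: on data with a positive majority, always predicting the positive class yields perfect recall and a fairly high F1, even though the system has not learned anything about the task.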

We will use the BoolQ dataset from the SuperGLUE benchmark and evaluate GPT-2's performance in terms of F1 scores on it. This is a task wherein the model has to predict an answer (true/false) to a question, given context. Therefore, the positive prediction here will be "true", and the negative "false".

You can find the test dataset [here](https://github.com/CogSciPrag/Understanding-LLMs-course/tree/main/understanding-llms/tutorials/files/super_glue_boolq.csv).
We will retrieve the model's predictions similarly to the accuracy evaluation above. Specifically, we will retrieve the log probabilities of "True" and "False", given the context and the question.

> <strong><span style="color:#D83D2B;">Exercise 7.1.3: Calculating F1 scores</span></strong>
>
> 1. Please complete the code below.
> 2. Calculate the results. Does GPT-2 do well in this task? 
> 3. Evaluate the performance of the model using accuracy. What is the conceptual difference between the two results? Which one might be more reliable and why?
> 4. Find out how to compute the F1 score with the `sklearn.metrics` package.

In [39]:
import pandas as pd
df_boolq = pd.read_csv("files/super_glue_boolq.csv")

In [40]:
# inspect the dataset to understand its structure
# if is_true = 1, it means that the answer to the question is "True"
df_boolq.head()

Unnamed: 0,sentence1,sentence2,is_true
0,English Football League play-offs -- Before th...,do away goals count in the league 2 playoffs,0
1,1986 NBA Playoffs -- Second-year player Michae...,did the bulls get swept by the celtics,1
2,Water table -- The groundwater may be from pre...,is the depth of the water table always the same,0
3,FIFA World Cup qualification -- The hosts of t...,does the host country for the world cup get an...,1
4,"Bad Lip Reading -- In December of 2015, Bad Li...",is seagulls stop it now a real song,1


In [53]:
predicted_answer = []
true_answers = []

for i, r in df_boolq[:200].iterrows():
    # get the context for the question
    context = r['sentence1']
    # get the text of the question
    question = r['sentence2']
    # construct the list of answer options
    answer_options = ["False", "True"]
    # get the ground truth label
    true_answer = r["is_true"]
    
    # pass a list of contexts and a list of continuations to be scored
    try:
        answer_scores = lm_scorer.conditional_score(
            # format the context + question into a list of same length as the number of answer options
            [context + " " + question + "?"] * len(answer_options), 
            answer_options,
        ) 
    except Exception:
        # skip items for which scoring fails
        continue
    # get the predicted answer (Hint: check above how we determine what the model predicts is the correct answer)
    predicted_label = ### YOUR CODE HERE ###
    # record the predicted answer
    predicted_answer.append(predicted_label)
    true_answers.append(true_answer)



0
[-17.42737579345703, -16.241905212402344]
1
[-15.057735443115234, -12.546226501464844]
2
[-21.462608337402344, -17.94011688232422]
3
[-17.27320098876953, -13.855903625488281]
4
[-16.0611572265625, -13.052642822265625]
5
[-21.04007339477539, -17.46932601928711]
6
[-20.496910095214844, -18.105430603027344]
7
[-21.313457489013672, -16.074993133544922]
8
[-19.694324493408203, -16.635204315185547]
9
[-20.120223999023438, -16.774398803710938]
10
[-19.766246795654297, -16.45175552368164]
11
[-12.953720092773438, -10.639785766601562]
12
[-19.742599487304688, -17.128097534179688]
13
[-18.70870590209961, -14.96548080444336]
14
[-18.65872573852539, -15.077144622802734]
15
[-15.560127258300781, -14.01436996459961]
16
[-17.079673767089844, -13.116645812988281]
17
[-16.172325134277344, -12.018722534179688]
18
[-6.572422027587891, -4.708042144775391]
19
[-19.05229949951172, -14.953254699707031]
20
[-17.088207244873047, -13.424301147460938]
21
[-16.65937042236328, -14.122299194335938]
22
[-20.477622

In [54]:
# compute the F1 score
true_positive = sum([(i == j) & (i == 1) for i, j in zip(predicted_answer, true_answers)])
print("True positive: ", true_positive)
false_positive = sum([(i != j) & (i == 1) for i, j in zip(predicted_answer, true_answers)]) 
print("False positive: ", false_positive)
false_negative = sum([(i != j) & (i == 0) for i, j in zip(predicted_answer, true_answers)])
f1_score = # YOUR CODE HERE
print("F1 score: ", f1_score)

True positive:  130
False positive:  68
F1 score:  0.7926829268292683


**NLG metrics**: The lecture discussed the common metrics for generation evaluation: BLEU, ROUGE and METEOR. We already used ROUGE in task 2 of HW 3. These metrics all check whether the predicted text overlaps with ground truth texts. Often different overlap measures are used; for instance, overlaps of unigrams, bigrams or trigrams can be computed. 
These metrics originate from summarization and machine translation, where corpora of reference human summaries or translations are available. They are also applied to other generation tasks, as long as reference texts are available.
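
To illustrate what n-gram overlap means, here is a minimal sketch of clipped n-gram precision, the quantity that BLEU combines across n-gram orders (together with a brevity penalty). This is a simplified illustration with whitespace tokenization and made-up sentences, not a full BLEU implementation; the helper names are hypothetical.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(candidate, reference, n):
    """Share of candidate n-grams that also occur in the reference,
    counting each n-gram at most as often as it occurs in the reference."""
    cand_counts = Counter(ngrams(candidate.split(), n))
    ref_counts = Counter(ngrams(reference.split(), n))
    overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
    total = sum(cand_counts.values())
    return overlap / total if total else 0.0

reference = "the cat is on the mat"
candidate = "the cat sat on the mat"
p1 = clipped_precision(candidate, reference, 1)  # unigram overlap
p2 = clipped_precision(candidate, reference, 2)  # bigram overlap
print(p1, p2)
```

A single substituted word costs one unigram match but two bigram matches, which is why scores drop quickly when higher-order n-grams are included.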

Below is space for trying out the BLEU score, in order to evaluate the translation predicted by FLAN-T5 small. 

> <strong><span style="color:#D83D2B;">Exercise 7.1.4: Calculating NLG scores</span></strong>
>
> 1. Please complete the code below by referring to the docs [here](https://huggingface.co/spaces/evaluate-metric/bleu).
> 2. Calculate the results. What happens if you change the values of the `max_order` parameter, for this example and in general?
> 3. If possible, try this out with a different language pair / a different sentence pair.

In [None]:
# load the implementation of the BLEU score computation
# (the Hugging Face `evaluate` metric linked in the exercise, which exposes `max_order`)
import evaluate
bleu_metric = evaluate.load("bleu")
# load model and tokenizer
from transformers import T5Tokenizer, T5ForConditionalGeneration
import torch

tokenizer_t5 = T5Tokenizer.from_pretrained("google/flan-t5-small")
model_t5 = T5ForConditionalGeneration.from_pretrained("google/flan-t5-small")

# define example sentences for translating from English to German
text_en = "All of the others were of a different opinion."
text_de = "Alle anderen waren anderer Meinung."
# define task 
prefix = "Translate to German: "

# encode the source and the target sentences
encoding_en = tokenizer_t5(
    [prefix + text_en],
    return_tensors="pt",
).input_ids
# we don't need the task prefix before the target
encoding_de = tokenizer_t5(
    [text_de],
    return_tensors="pt",
).input_ids

# predict with model
predicted_de = model_t5.generate(encoding_en)



# decode the prediction
predicted_decoded_de = tokenizer_t5.decode(
    predicted_de[0],
    skip_special_tokens=True,
)
print("Predicted translation: ", predicted_decoded_de)

# compute BLEU for the prediction
### YOUR CODE CALLING THE HELPER ABOVE GOES HERE ###
bleu = 

**TODO**
* Calibration

### Outlook

**TODO**:
* Pseudo-LL for masked LMs
* Jenn's metalinguistic paper.

## Machine psychology

More recently, work informed by human language use and processing has compared LLMs’ performance to aspects of human behavior. Here, the assessment of LLMs is guided more by the question of how human-like certain aspects of their performance are.

**TODO**
* compute grammaticality judgements and own intuition -- interpret
* hands-on methods of comparing quantitatively human and machine data: Ethan's code and data

In [None]:
import minicons
import openai
import time
import numpy as np
import pandas as pd
from dotenv import load_dotenv
import os
import argparse 
import json
from pprint import pprint
# set the OpenAI key in a separate .env file containing the variable OPENAI_API_KEY
load_dotenv() 
openai.api_key = os.getenv("OPENAI_API_KEY")

# read test cases with single token prediction
grammaticality_test_cases = pd.read_csv("data/grammaticality_tests.csv")

def get_surprisal(
        masked_sequence, 
        full_sequence,
        preface = 'start ', 
        model_name =  "text-davinci-003", 
        mask_token = "[MASK]",
        return_region_surprisal=True,
    ):
    """
    Helper for retrieving surprisal of different response types from GPT-3.

    Parameters:
    -----------
    masked_sequence: str
        Sequence with masked critical region.
    full_sequence: str
        Full sequence with critical region.
    preface: str
        Preface (instructions or few-shot) to be added to the sequence.
    model_name: str
        Name of the GPT-3 model to be used.
    mask_token: str
        Token used for masking the critical region.
    return_region_surprisal: bool
        Whether to return surprisal of the critical region only or average for full sequence.

    Returns:
    --------
    mask_surprisals: list
        Surprisal of the critical region or average for full sentence.
    """
    # get log probs for sequence
    if model_name not in ["gpt-3.5-turbo", "gpt-4"]:
        response = openai.Completion.create(
                engine      = model_name, 
                prompt      = preface + full_sequence,
                max_tokens  = 0, # sample 0 new tokens, i.e., only get scores of input sentence
                temperature = 1, 
                logprobs    = 0, 
                echo        = True,
            ) 
        pprint(response)
    else:
        raise ValueError("GPT-4 and turbo cannot return log probs!")

    text_offsets = response.choices[0]['logprobs']['text_offset']
    # allow to use few shot examples
    if preface != '':
        cutIndex = text_offsets.index(max(i for i in text_offsets if i <= len(preface))) 
        endIndex = response.usage.total_tokens
    else:
        cutIndex = 0
        endIndex = len(response.choices[0]["logprobs"]["tokens"])
    answerTokens = response.choices[0]["logprobs"]["tokens"][cutIndex:endIndex]
    answerTokenLogProbs = response.choices[0]["logprobs"]["token_logprobs"][cutIndex:endIndex] 
    # retrieve critical region surprisal
    if return_region_surprisal:
        # get target region surprisal
        # for grammaticality judgment comparison
        # retrieve the target region which is masked in the masked sequence
        # get its index in the full sentence
        mask_ind = [i for i, e in 
                    enumerate(masked_sequence.replace(".", " .").split(" ")) 
                    if e == mask_token
                    ]
        # get target region
        masked_words = [full_sequence.replace(".", " .").split(" ")[mask_i] for mask_i in mask_ind]
        # get tokens corresponding to the target region
        # and handle subword tokenization of GPT
        mask_log_probs = []
        mask_log_prob = np.nan
        for masked_word in masked_words:
            for i, t in enumerate(answerTokens):
                if t.strip() == masked_word.strip():
                    mask_log_prob = answerTokenLogProbs[i]
                    mask_log_probs.append(mask_log_prob)
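                elif t.strip() in masked_word:
                    # subword token: check it together with the next token,
                    # guarding against running past the end of the token list
                    if i + 1 < len(answerTokens) and t.strip() + answerTokens[i+1] in masked_word:
                        mask_log_prob = answerTokenLogProbs[i]
                        mask_log_probs.append(mask_log_prob)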
                else:
                    continue
                
        mask_surprisals = [-m for m in mask_log_probs]
    # get full sentence surprisal
    else:
        mask_surprisals = [- np.mean(
            np.asarray(answerTokenLogProbs)
        )]

    return mask_surprisals

def compare_surprisals(row, return_region_surprisal):
    """
    Helper for comparing surprisals of grammatical and ungrammatical sentences.

    Parameters:
    -----------
    row: pd.Series
        Row of the dataframe containing the test case.
    return_region_surprisal: bool
        Whether to return surprisal of the critical region only or average for full sentence.
    
    Returns:
    --------
    is_grammatical: bool
        Whether the grammatical sentence has lower surprisal than the ungrammatical one.
    """
    # get surprisal of grammatical sentence
    grammatical_surprisal = get_surprisal(
        row["masked_sentence"],
        row["grammatical_sentence"],
        return_region_surprisal=return_region_surprisal,
    )
    print(f"--- Surprisal of grammatical sentence {row['grammatical_sentence']}: {grammatical_surprisal} ---\n\n")
    # get surprisal of ungrammatical sentence
    ungrammatical_surprisal = get_surprisal(
        row["masked_sentence"],
        row["ungrammatical_sentence"],
        return_region_surprisal=return_region_surprisal,
    )
    print(f"--- Surprisal of ungrammatical sentence {row['ungrammatical_sentence']}: {ungrammatical_surprisal} ---\n\n")
    
    # check LM accuracy (in terms of surprisal)
    is_grammatical = all([g < u for g, u in zip(grammatical_surprisal, ungrammatical_surprisal)])
    return is_grammatical

print("Grammaticality test cases: \n\n", grammaticality_test_cases.head(10))
# call surprisal computation for single test cases from the slides
print("--- Agreement test case --- \n Is grammatical sentence less surprising than ungrammatical one?", 
      compare_surprisals(grammaticality_test_cases.iloc[0], return_region_surprisal=False), "\n\n")

print("--- Reflexive test case --- \n Is grammatical sentence less surprising than ungrammatical one?", 
      compare_surprisals(grammaticality_test_cases.iloc[11], return_region_surprisal=False), "\n\n")


def main():
    """
    Runs all test cases.
    """
    for _, r in grammaticality_test_cases.iterrows():
        print("--------------------")
        is_grammatical = compare_surprisals(r, return_region_surprisal=False)
        print(f"Grammatical sentence: {r['grammatical_sentence']} \n\n")
        print(f"Ungrammatical sentence: {r['ungrammatical_sentence']} \n\n")
        # check LM accuracy (in terms of surprisal)
        print("Is the grammatical sentence more likely than the ungrammatical one under LM?", 
              is_grammatical)
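
The comparison logic of `compare_surprisals` can also be tried out without API access. The sketch below uses made-up token log probabilities for two minimal pairs and aggregates the comparisons into an accuracy; the numbers and the helper name `mean_surprisal` are invented for illustration only.

```python
def mean_surprisal(token_logprobs):
    """Average surprisal (negative log probability) per token."""
    return -sum(token_logprobs) / len(token_logprobs)

# made-up token log probabilities for two minimal pairs
minimal_pairs = [
    {"grammatical": [-2.1, -0.4, -1.0], "ungrammatical": [-2.1, -0.4, -4.5]},
    {"grammatical": [-1.5, -3.0, -0.9], "ungrammatical": [-1.5, -2.2, -0.9]},
]

# the model "passes" a pair if the grammatical sentence is less surprising
results = [
    mean_surprisal(pair["grammatical"]) < mean_surprisal(pair["ungrammatical"])
    for pair in minimal_pairs
]
accuracy = sum(results) / len(results)
print(results, accuracy)
```

Since each pair has only two alternatives, chance performance on such a test is 0.5, which is the baseline against which the resulting accuracy should be read.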
    

## SuperGLUE

Finally, we will get our hands dirty with evaluating LLMs which already have been trained. In this task, we will use a few tasks from one of the most-used LM benchmarks, the SuperGLUE benchmark:

* a natural language inference (NLI) task “rte”: a task wherein the model has to predict whether a second sentence is entailed by the first one (i.e., predict the label ‘entailment’ or ‘no entailment’),
* a question answering task “boolq”: a task wherein the model has to predict an answer (yes/no) to a question, given context,
* and a sentence continuation task “copa”: a task wherein the model has to select one of two sentences as the more plausible continuation given an input sentence.

We will be using (a subset of) the validation splits of the tasks for our evaluation.

With the introduction of first language models like BERT, a common approach to using benchmarks like SuperGLUE was to fine-tune the pretrained model on the train split of the benchmark datasets, and then use the test splits for evaluation. With SOTA LLMs, it is more common to do zero- or few-shot evaluation where the model has to, e.g., predict labels or select answer options without special fine-tuning, just given instructions.

We are also not going to fine-tune our model on these specific tasks. Instead, as introduced in class, we are going to compare the log probabilities of different answer options (e.g., log probabilities of “entailment” vs. “no entailment” following a pair of sentences from the RTE-task). With this method, the assumption is that a model’s output prediction for a particular trial is correct iff: 

$$\log P_{LM}(\text{correct label} \mid \text{context}) >  \log P_{LM}(\text{incorrect label} \mid \text{context})$$

For tasks like “copa” where there is no single label but instead a sentence continuation, we are going to compute the average token log probability as a single-number representation of the continuation. Here, the model’s prediction will count as correct iff the average log probability of the correct continuation sentence is higher, given the input, than that of the incorrect continuation. We will not be using task instructions in our evaluation since the model wasn’t fine-tuned on instruction-following.
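
The reason for averaging rather than summing token log probabilities can be seen in a toy example: a sum penalizes longer continuations simply because they contain more tokens. The sketch below uses made-up token log probabilities, and the helper name `pick` is hypothetical.

```python
def pick(option_logprobs, normalize):
    """Return the index of the best option, using the mean (normalize=True)
    or the sum (normalize=False) of the token log probabilities."""
    scores = [
        sum(lp) / len(lp) if normalize else sum(lp)
        for lp in option_logprobs
    ]
    return max(range(len(scores)), key=lambda i: scores[i])

# made-up token log probabilities:
# a short continuation (2 tokens) and a longer one (5 tokens)
options = [
    [-2.0, -2.0],
    [-1.0, -1.0, -1.0, -1.0, -1.0],
]
choice_sum = pick(options, normalize=False)   # sums: -4.0 vs. -5.0
choice_mean = pick(options, normalize=True)   # means: -2.0 vs. -1.0
print(choice_sum, choice_mean)
```

Here the two scoring rules disagree: the sum prefers the shorter continuation, while the per-token average prefers the longer one, whose individual tokens are each more likely.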

Your job is to complete the code below, evaluate the model which you have fine-tuned above and summarize the results you find in a few words (see below for more detailed step-by-step instructions). If you have issues with the previous task and cannot use your own fine-tuned model, please use the initial IMDB fine-tuned GPT-2 with which we initialized the policy in exercise 2. Please indicate which model you are testing on Moodle in the respective exercise responses.
