Sheet 7.1: Behavioral assessment & Evaluation
======
**Author**: Polina Tsvilodub

This sheet focuses on evaluating the input-output (I/O) behavior of LLMs. The approach is inspired by experimental paradigms and terminology from cognitive science and psychology, which investigate a black box (the human mind) by observing behavior across different conditions of interest (inputs). By analogy, such assessment of LLMs (also black boxes) can be called "behavioral assessment". It is one piece that should work in combination with the attribution methods discussed in the previous sheet in order to provide a fuller understanding of what LLMs can or cannot do (I/O testing) and how they do it (attributions).
Following the structure of the lecture, we will first look at practical aspects of benchmark testing, and then look at "machine psychology", which often draws on the same methods but addresses somewhat different research questions.

Therefore, the learning goals of this sheet are to:
* look at examples of a few different benchmarks and how they are usually constructed
* become familiar with standard evaluation metrics and methods used for evaluating LLMs on benchmarks (these include PPL, log-probability-based scores, accuracy, F1, free generation etc.)
* become familiar with more advanced evaluation measures such as those used in HELM
* look at examples of machine psychology and how, in practice, LLM performance can be compared to various human data
* understand how LLMs can be tested as "psychology subjects".

## Benchmark testing

Such I/O evaluations are the most common approach to LLM evaluation. From a more technical, engineering-oriented perspective, which aims at building LLMs for specific applications, it is very common to use large benchmark datasets designed to test models’ performance on a variety of tasks in an automated way. This is often done by checking the models’ outputs against ground truth answers or by computing standard scores for certain datasets. The quality of LLMs is then measured by their scores on these benchmarks.

Initially, these benchmarks were designed to test LLMs’ linguistic performance, since the goal of building the model is a system that predicts grammatical and fluent natural language. Therefore, some of the first benchmarks (or commonly used textual datasets) are, for instance, Wikipedia texts, the Penn Treebank, and the GLUE benchmark. Wikipedia texts are often used for measuring the perplexity of the model on this standard text (see below for details). The Penn Treebank was often used for fine-tuning or evaluating models, e.g., on part-of-speech tagging as an approximation of syntactic performance, while the GLUE benchmark contains tasks which are supposed to approximate (semantic) natural language understanding in the form of paraphrase tasks, sentiment classification, natural language inference tasks etc.

Recent LLMs have shown perhaps unexpectedly impressive generalization to tasks which seem to require more than linguistic fluency, like solving math and reasoning problems. Therefore, more recent benchmarks incorporate tests of various tasks going beyond linguistic capability. Two of the most widely used benchmarks are the MMLU and the BIG-Bench datasets.
Given that SOTA LLMs are also often designed as assistants and embedded in user-facing applications, it has also become crucial to evaluate potential social impacts of their outputs, for example by assessing biases and toxicity of the generations. To this end, specialized benchmarks like RealToxicityPrompts or Winogender were created.
**TODO:** refs.

**TODO** NLG benchmarks.

**TODO:** complete: model refers to trained models which are evaluated with respect to their performance. However, if one wanted to track the performance during training, one could also run evaluations on intermediate model checkpoints during training too. Just note that the model is "frozen" and runs in inference mode during all of the testing described in this sheet.

The core advantages of benchmarks are (somewhat of a) standardization of the evaluation procedure, large scale (higher coverage, more reliable results), and a design that allows evaluation with metrics that are easy to compute automatically. You have heard about these metrics in the lecture; we will recap them below and then work with them in practice.

### Metrics

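**Perplexity**: the exponentiated average surprisal (i.e., the exponentiated average negative log probability per token) that the trained model assigns to some piece of text. It is computed as:
$$PPL_{LM}(x_0 \dots x_n) = \exp\left(-\frac{1}{n+1}\sum_{i=0}^{n} \log P_{LM}(x_i \mid x_{<i})\right)$$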

Note that this is only applicable to causal language models. This is the metric commonly used, e.g., on the Wikipedia texts. For instance, the PPL of GPT-2 on XXXX is YYY **TODO**. The idea is that an ideal model should have a perplexity as close to 1 as possible (i.e., an average surprisal close to 0) for a naturally occurring text that it has learned, thereby approximating a good fit to the "ground truth distribution of natural language".
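
To make the relation between token probabilities, average surprisal and perplexity concrete, here is a minimal sketch on toy numbers. The probabilities are made up for illustration and do not come from any real model; the helper name `perplexity` is hypothetical. It assumes the standard definition of perplexity as the exponentiated average negative log probability.

```python
import math

def perplexity(token_probs):
    """Perplexity from the conditional probabilities P(x_i | x_<i) of each token."""
    # surprisal of each token: -log P(x_i | x_<i)
    surprisals = [-math.log(p) for p in token_probs]
    # average surprisal, i.e., the average negative log likelihood (NLL)
    avg_nll = sum(surprisals) / len(surprisals)
    # perplexity is the exponentiated average NLL
    return math.exp(avg_nll)

# hypothetical probabilities that two models assign to the same four tokens
probs_good_fit = [0.5, 0.5, 0.5, 0.5]
probs_poor_fit = [0.1, 0.1, 0.1, 0.1]

ppl_good = perplexity(probs_good_fit)
ppl_poor = perplexity(probs_poor_fit)
# a model that predicts every token with probability 1 reaches the minimum
ppl_perfect = perplexity([1.0, 1.0, 1.0])

print(ppl_good, ppl_poor, ppl_perfect)
```

One way to read the result: a perplexity of about 2 means the model is, on average, as uncertain as if it were choosing uniformly between two tokens at each step.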

Below is some code for computing the perplexity of different sizes of GPT-2 for an excerpt from Wikipedia.
> <strong><span style="color:#D83D2B;">Exercise 7.1.1: Calculating perplexity</span></strong>
>
> 1. Please complete the code below. (Hint: only one simple transformation is required in order to calculate the perplexity from the NLL loss)
> 2. Compare the results for the models of different sizes. Does their comparison (ordering) match your intuition?

In [None]:
# perplexity evaluation
from datasets import load_dataset
import numpy as np
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
test = load_dataset("wikitext", 'wikitext-2-raw-v1', split="test")
encodings = tokenizer(
    "\n\n".join(test["text"]), 
    return_tensors="pt",
).input_ids
# select a part of the text
input_tokens = encodings[:,10:50]

# load models of different sizes
model_s = AutoModelForCausalLM.from_pretrained("gpt2")
model_xl = AutoModelForCausalLM.from_pretrained("gpt2-xl")

output_s = model_s(input_tokens, labels = input_tokens)
output_xl = model_xl(input_tokens, labels = input_tokens)
print("Average NLL for wikipedia chunk under small model ", output_s.loss.item())
print("Average NLL for wikipedia chunk under xl model ", output_xl.loss.item())

### your code for computing the perplexity goes here ###
perplexity_s =
perplexity_xl =

print(f"PPL of smaller model: {perplexity_s}, PPL of larger model: {perplexity_xl}")


**Accuracy**: this is a standard metric widely used in many domains, not only NLP. It computes the proportion of correct responses in a set of tasks and presupposes that there is a single correct answer for a given input. We have seen in the lecture that one way to compute accuracy is to score each answer option, given the input, under the LLM, and retrieve the predicted option via $argmax$; i.e., the option to which the model assigns the highest (log) probability is taken to be the chosen option. If this option is the ground truth option, the model's prediction is correct for this test item (i.e., correctness = 1); otherwise, correctness = 0. Accuracy is then the average correctness across all the test items in the benchmark. The lecture pointed out limitations of the argmax approach.
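
The argmax-based procedure described above can be sketched on toy data as follows. The log probabilities are made up; in practice they would come from scoring each answer option under the LLM. The helper names `predict_option` and `accuracy` are hypothetical.

```python
def predict_option(option_log_probs):
    """Return the index of the option with the highest log probability (argmax)."""
    return max(range(len(option_log_probs)), key=lambda i: option_log_probs[i])

def accuracy(items):
    """Average correctness (1 or 0 per item) across all test items."""
    correctness = [
        int(predict_option(item["log_probs"]) == item["gold"])
        for item in items
    ]
    return sum(correctness) / len(correctness)

# hypothetical test items: one log probability per answer option,
# plus the index of the ground truth option
toy_items = [
    {"log_probs": [-2.3, -0.7, -4.1], "gold": 1},  # argmax = 1, correct
    {"log_probs": [-1.2, -1.5, -3.0], "gold": 1},  # argmax = 0, incorrect
    {"log_probs": [-5.0, -4.2, -0.9], "gold": 2},  # argmax = 2, correct
    {"log_probs": [-0.4, -2.2, -2.8], "gold": 0},  # argmax = 0, correct
]

toy_accuracy = accuracy(toy_items)
print(toy_accuracy)
```

Note that in the second item the gold option loses only narrowly; argmax discards this information, which is one of the limitations mentioned above.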

The advantage of this approach is that it makes sure to score only the available answer options under the model, which is an especially important constraint for weaker models. However, more powerful SOTA LLMs, especially if they are instruction-tuned, are often also tested via text generation. That is, the input is given with an appropriate instruction, and the model's generated text is evaluated via string matching. If the correct answer option was generated, the model's correctness is 1 for this trial, and 0 otherwise.
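
A minimal sketch of the string-matching evaluation of generated text could look as follows. The generations are made up, and the normalization steps (lowercasing, removing punctuation, collapsing whitespace) are an assumption about what counts as "the same string", not a fixed standard. The helper names `normalize` and `is_correct` are hypothetical.

```python
import string

def normalize(text):
    """Lowercase, remove punctuation and collapse whitespace."""
    text = text.lower().strip()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())

def is_correct(generation, gold_answer):
    """Correctness is 1 if the normalized strings match exactly, else 0."""
    return int(normalize(generation) == normalize(gold_answer))

# hypothetical model generations and ground truth answers
generations = ["Paris.", "berlin", "The answer is Rome"]
gold_answers = ["Paris", "Berlin", "Rome"]

correctness = [is_correct(g, a) for g, a in zip(generations, gold_answers)]
generation_accuracy = sum(correctness) / len(correctness)
print(correctness, generation_accuracy)
```

The third trial counts as incorrect under exact matching although the generation contains the right answer. Whether to use exact matching or a more lenient containment check is a design choice that should be reported together with the results.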

Below is some code exemplifying evaluating a model on a question answering benchmark which we have already used in the homework, via scoring answers under the model.

In [None]:
# TODO

**F1-score**:

**correlation**:

Better explain some QA benchmarks which we have seen in the HW tasks already. 

### Outlook

* Pseudo-LL for masked LMs
* Jenn's metalinguistic paper.

NLG scores:

In [None]:
# import the implementation of the bleu score computation
from torchtext.data.metrics import bleu_score
# load model and tokenizer
from transformers import T5Tokenizer, T5ForConditionalGeneration
import torch

tokenizer_t5 = T5Tokenizer.from_pretrained("t5-small")
model_t5 = T5ForConditionalGeneration.from_pretrained("t5-small")

# define example sentences for translating from English to German
text_en = "All of the others were of a different opinion."
text_de = "Alle anderen waren anderer Meinung."
# define task 
prefix = "translate English to German: "

# define helper function taking prediction and gold standard
def compute_bleu(prediction, target, n):
    """
    Parameters:
    ----------
    prediction: str
        Predicted translation.
    target: str
        Gold standard translation.
    n: int
        n-gram over which to compute BLEU.
    Returns:
    -------
    score: float
        BLEU-n score.
    """
    score = bleu_score(
        [prediction.split()], 
        [[target.split()]], 
        max_n=n, 
        weights = [1/n] * n,
    )
    return score 

# encode the source and the target sentences
encoding_en = tokenizer_t5(
    [prefix + text_en],
    return_tensors="pt",
).input_ids
# we don't need the task prefix before the target
encoding_de = tokenizer_t5(
    [text_de],
    return_tensors="pt",
).input_ids

# predict with model
predicted_de = model_t5.generate(encoding_en)

# decode the prediction
predicted_decoded_de = tokenizer_t5.decode(
    predicted_de[0],
    skip_special_tokens=True,
)

print("Predicted translation: ", predicted_decoded_de)

# compute BLEU-1 for the prediction
### YOUR CODE CALLING THE HELPER ABOVE GOES HERE ###
# bleu1 = 

Surprisals:

In [None]:
#!pip install minicons

In [None]:
import minicons
import openai  # note: the calls below use the legacy (pre-1.0) interface of the openai package
import time
import numpy as np
import pandas as pd
from dotenv import load_dotenv
import os
import argparse 
import json
from pprint import pprint
# set openAI key in separate .env file w/ content
load_dotenv() 
openai.api_key = os.getenv("OPENAI_API_KEY")

# read test cases with single token prediction
grammaticality_test_cases = pd.read_csv("data/grammaticality_tests.csv")

def get_surprisal(
        masked_sequence, 
        full_sequence,
        preface = 'start ', 
        model_name =  "text-davinci-003", 
        mask_token = "[MASK]",
        return_region_surprisal=True,
    ):
    """
    Helper for retrieving surprisal of different response types from GPT-3.

    Parameters:
    -----------
    masked_sequence: str
        Sequence with masked critical region.
    full_sequence: str
        Full sequence with critical region.
    preface: str
        Preface (instructions or few-shot) to be added to the sequence.
    model_name: str
        Name of the GPT-3 model to be used.
    mask_token: str
        Token used for masking the critical region.
    return_region_surprisal: bool
        Whether to return surprisal of the critical region only or average for full sequence.

    Returns:
    --------
    mask_surprisals: list
        Surprisal of the critical region or average for full sentence.
    """
    # get log probs for sequence
    if model_name not in ["gpt-3.5-turbo", "gpt-4"]:
        response = openai.Completion.create(
                engine      = model_name, 
                prompt      = preface + full_sequence,
                max_tokens  = 0, # sample 0 new tokens, i.e., only get scores of input sentence
                temperature = 1, 
                logprobs    = 0, 
                echo        = True,
            ) 
        pprint(response)
    else:
        raise ValueError("GPT-4 and turbo cannot return log probs!")

    text_offsets = response.choices[0]['logprobs']['text_offset']
    # allow to use few shot examples
    if preface != '':
        cutIndex = text_offsets.index(max(i for i in text_offsets if i <= len(preface))) 
        endIndex = response.usage.total_tokens
    else:
        cutIndex = 0
        endIndex = len(response.choices[0]["logprobs"]["tokens"])
    answerTokens = response.choices[0]["logprobs"]["tokens"][cutIndex:endIndex]
    answerTokenLogProbs = response.choices[0]["logprobs"]["token_logprobs"][cutIndex:endIndex] 
    # retrieve critical region surprisal
    if return_region_surprisal:
        # get target region surprisal
        # for grammaticality judgment comparison
        # retrieve the target region which is masked in the masked sequence
        # get its index in the full sentence
        mask_ind = [i for i, e in 
                    enumerate(masked_sequence.replace(".", " .").split(" ")) 
                    if e == mask_token
                    ]
        # get target region
        masked_words = [full_sequence.replace(".", " .").split(" ")[mask_i] for mask_i in mask_ind]
        # get tokens corresponding to the target region
        # and handle subword tokenization of GPT
        mask_log_probs = []
        mask_log_prob = np.nan
        for masked_word in masked_words:
            for i, t in enumerate(answerTokens):
                if t.strip() == masked_word.strip():
                    mask_log_prob = answerTokenLogProbs[i]
                    mask_log_probs.append(mask_log_prob)
                elif t.strip() in masked_word:
                    # guard against indexing past the last token
                    if i + 1 < len(answerTokens) and t.strip() + answerTokens[i+1] in masked_word:
                        mask_log_prob = answerTokenLogProbs[i]
                        mask_log_probs.append(mask_log_prob)
                else:
                    continue
                
        mask_surprisals = [-m for m in mask_log_probs]
    # get full sentence surprisal
    else:
        mask_surprisals = [- np.mean(
            np.asarray(answerTokenLogProbs)
        )]

    return mask_surprisals

def compare_surprisals(row, return_region_surprisal):
    """
    Helper for comparing surprisals of grammatical and ungrammatical sentences.

    Parameters:
    -----------
    row: pd.Series
        Row of the dataframe containing the test case.
    return_region_surprisal: bool
        Whether to return surprisal of the critical region only or average for full sentence.
    
    Returns:
    --------
    is_grammatical: bool
        Whether the grammatical sentence has lower surprisal than the ungrammatical one.
    """
    # get surprisal of grammatical sentence
    grammatical_surprisal = get_surprisal(
        row["masked_sentence"],
        row["grammatical_sentence"],
        return_region_surprisal=return_region_surprisal,
    )
    print(f"--- Surprisal of grammatical sentence {row['grammatical_sentence']}: {grammatical_surprisal} ---\n\n")
    # get surprisal of ungrammatical sentence
    ungrammatical_surprisal = get_surprisal(
        row["masked_sentence"],
        row["ungrammatical_sentence"],
        return_region_surprisal=return_region_surprisal,
    )
    print(f"--- Surprisal of ungrammatical sentence {row['ungrammatical_sentence']}: {ungrammatical_surprisal} ---\n\n")
    
    # check LM accuracy (in terms of surprisal)
    is_grammatical = all([g < u for g, u in zip(grammatical_surprisal, ungrammatical_surprisal)])
    return is_grammatical

print("Grammaticality test cases: \n\n", grammaticality_test_cases.head(10))
# call surprisal computation for single test cases from the slides
print("--- Agreement test case --- \n Is grammatical sentence less surprising than ungrammatical one?", 
      compare_surprisals(grammaticality_test_cases.iloc[0], return_region_surprisal=False), "\n\n")

print("--- Reflexive test case --- \n Is grammatical sentence less surprising than ungrammatical one?", 
      compare_surprisals(grammaticality_test_cases.iloc[11], return_region_surprisal=False), "\n\n")


def main():
    """
    Runs all test cases.
    """
    for _, r in grammaticality_test_cases.iterrows():
        print("--------------------")
        is_grammatical = compare_surprisals(r, return_region_surprisal=False)
        print(f"Grammatical sentence: {r['grammatical_sentence']} \n\n")
        print(f"Ungrammatical sentence: {r['ungrammatical_sentence']} \n\n")
        # check LM accuracy (in terms of surprisal)
        print("Is the grammatical sentence more likely than the ungrammatical one under LM?", 
              is_grammatical)
    

### SuperGLUE

Finally, we will get our hands dirty with evaluating LLMs which already have been trained. In this task, we will use a few tasks from one of the most-used LM benchmarks, the SuperGLUE benchmark:

* a natural language inference (NLI) task “rte”: a task wherein the model has to predict whether a second sentence is entailed by the first one (i.e., predict the label ‘entailment’ or ‘no entailment’),
* a question answering task “boolq”: a task wherein the model has to predict an answer (yes/no) to a question, given context,
* and a sentence continuation task “copa”: a task wherein the model has to select one of two sentences as the more plausible continuation given an input sentence.

We will be using (a subset of) the validation splits of the tasks for our evaluation.

With the introduction of early pretrained language models like BERT, a common approach to using benchmarks like SuperGLUE was to fine-tune the pretrained model on the train split of the benchmark datasets, and then use the test splits for evaluation. With SOTA LLMs, it is more common to do zero- or few-shot evaluation where the model has to, e.g., predict labels or select answer options without special fine-tuning, just given instructions.

We are also not going to fine-tune our model on these specific tasks. Instead, as introduced in class, we are going to compare the log probabilities of different answer options (e.g., log probabilities of “entailment” vs. “no entailment” following a pair of sentences from the RTE-task). With this method, the assumption is that a model’s output prediction for a particular trial is correct iff: 

$$\log P_{LM}(\text{correct label} \mid \text{context}) >  \log P_{LM}(\text{incorrect label} \mid \text{context})$$

For tasks like “copa” where there is no single label but instead a sentence continuation, we are going to compute the average token log probability as a single-number representation of the continuation. Here, the model’s prediction will count as correct iff the average log probability of the correct continuation sentence, given the input, is higher than that of the incorrect continuation. We will not be using task instructions in our evaluation since the model wasn’t fine-tuned on instruction-following.
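
The comparison of average token log probabilities can be sketched on toy numbers as follows. The per-token log probabilities are made up; in the actual task they would be retrieved from the model for the tokens of each continuation, given the input sentence. The helper names `avg_log_prob` and `is_trial_correct` are hypothetical.

```python
def avg_log_prob(token_log_probs):
    """Average token log probability of a continuation."""
    return sum(token_log_probs) / len(token_log_probs)

def is_trial_correct(correct_log_probs, incorrect_log_probs):
    """A trial is correct iff the correct continuation has the higher average."""
    return avg_log_prob(correct_log_probs) > avg_log_prob(incorrect_log_probs)

# hypothetical log probabilities of the tokens of two continuations;
# the correct continuation is longer (5 tokens) than the incorrect one (2 tokens)
correct_continuation = [-1.0, -1.2, -0.8, -1.0, -1.0]
incorrect_continuation = [-1.5, -1.5]

by_average = is_trial_correct(correct_continuation, incorrect_continuation)
# for comparison: the summed log probabilities favor the shorter continuation
by_sum = sum(correct_continuation) > sum(incorrect_continuation)

print(by_average, by_sum)
```

The example illustrates why averaging matters here: summed log probabilities penalize longer continuations simply for containing more tokens, while the average puts continuations of different lengths on a comparable scale.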

Your job is to complete the code below, evaluate the model which you have fine-tuned above and summarize the results you find in a few words (see below for more detailed step-by-step instructions). If you have issues with the previous task and cannot use your own fine-tuned model, please use the initial IMDB fine-tuned GPT-2 with which we initialized the policy in exercise 2. Please indicate which model you are testing on Moodle in the respective exercise responses.


**TODO**
* Calibration
* implementing various bias corrections
* 


## Machine psychology

More recently, work informed by human language use and processing has compared LLMs’ performance to aspects of human behavior. Here, the assessment of LLMs is guided more by the question of how human-like certain aspects of their performance are.

**TODO**
* compute SG scores with some example
* coming up with testing some psychological aspect
* hands-on methods of comparing quantitatively human and machine data