## Project 1
# Zero-Shot Question Answering
The aim of this project is to get familiar with Language Models for Zero-Shot Question Answering and with possible pitfalls when it comes to measuring LLM performance on common benchmarks.  
The project is divided into two parts:
1. **Encoder Models**:
    - Here you will see how to use predefined HF / transformers classes to solve this task.
2. **Decoder Models**:
    - Here you will see how to adapt a decoder model to solve this task and how modern LLMs are benchmarked on it.

### What is Zero-Shot Question Answering?
Zero-shot question answering is a task where a model is given a question and a context, and is expected to predict the answer without having been trained on that context or question. The model has to generalize to unseen contexts and questions. From a practical perspective, this is the situation where we want to apply a model to our task without any fine-tuning.

### Part 0: Setup

In [1]:
%pip install datasets
%pip install 'transformers[torch]'

Note: you may need to restart the kernel to use updated packages.


### Part 1: Dataset

We will work on the [MMLU dataset](https://huggingface.co/datasets/CohereForAI/Global-MMLU). Let's have a look at examples from the dataset. For each question we are given four answer options, the letter of the correct one, and the subject of the question.

In [2]:
from datasets import load_dataset, Dataset

ds = load_dataset("CohereForAI/Global-MMLU", "en", split="test")

def preprocess(sample: dict):
    return {
        "options": [
            sample[option]
            for option in ["option_a", "option_b", "option_c", "option_d"]
        ],
    }

ds = ds.map(preprocess)

In [3]:
print(f"N Examples: {len(ds)}")
print(f"Mean length: {sum(len(x['question']) for x in ds) / len(ds):4.2f}")
print(f"Max length: {max(len(x['question']) for x in ds)}")

N Examples: 14042
Mean length: 274.54
Max length: 4671


In [4]:
sample_idx = 0

sample_question = ds[sample_idx]["question"]
sample_subject = ds[sample_idx]["subject"]
options = ds[sample_idx]["options"]
answer = ds[sample_idx]["answer"]

print("Sample question:", sample_question)
print("Sample subject:", sample_subject)
print("Options:\n", "\n".join([f"{c.upper()}: {o}" for c, o in zip("abcd", options)]))
print("Answer:", answer)

Sample question: Find the degree for the given field extension Q(sqrt(2), sqrt(3), sqrt(18)) over Q.
Sample subject: abstract_algebra
Options:
 A: 0
B: 4
C: 2
D: 6
Answer: B


### Part 2: Encoder Models

Let's have a look at how to use the out-of-the-box transformers pipeline to solve this task.

In [5]:
from transformers import pipeline, set_seed

set_seed(42)

zero_shot_classifier = pipeline(
    "zero-shot-classification", model="MoritzLaurer/ModernBERT-large-zeroshot-v2.0"
)


In [6]:
zero_shot_classifier(
    sample_question,
    options,
    hypothesis_template="The correct answer is: {}",
    multi_label=False,
)

Compiling the model with `torch.compile` and using a `torch.cpu` device is not supported. Falling back to non-compiled mode.


{'sequence': 'Find the degree for the given field extension Q(sqrt(2), sqrt(3), sqrt(18)) over Q.',
 'labels': ['6', '4', '2', '0'],
 'scores': [0.2790496051311493,
  0.275924414396286,
  0.23752973973751068,
  0.20749622583389282]}

#### How does it work under the hood?

If you go to the [source code](https://github.com/huggingface/transformers/blob/9e94801146ceeb3b215bbdb9492be74d7d7b7210/src/transformers/pipelines/zero_shot_classification.py#L49) you can see that the pipeline uses `ModelForSequenceClassification`, and in the [model card](https://huggingface.co/MoritzLaurer/ModernBERT-large-zeroshot-v2.0) you can read that the model was fine-tuned for zero-shot classification framed as an entailment task (entailment vs. not entailment), rather than on question answering as such.  
The base model used for fine-tuning is ModernBERT, a modernized version of the BERT model that makes use of various advancements in the *attention* mechanism, improving both performance and efficiency.  
If you are interested in the details, we highly recommend the following [Hugging Face blogpost](https://huggingface.co/blog/modernbert).

By digging deeper into the [model config](https://huggingface.co/MoritzLaurer/ModernBERT-large-zeroshot-v2.0/blob/a51e07b524299e309dd2b88d48b0cfa2bd9ec598/config.json#L24) we can see that the only labels the model knows about are
```
"id2label": {
    "0": "entailment",
    "1": "not_entailment"
  }
```

For each option the model classifies the text
```
{question}
{hypothesis_template} {option}
```
as either entailment or not entailment. The option with the highest entailment score is the answer.
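
The scoring step described above can be sketched in plain Python. This is a minimal illustration, not the pipeline's actual code: `rank_options` and the logit values are hypothetical, and it assumes that with `multi_label=False` the entailment logits of all options are normalized together with a softmax.

```python
import math


def rank_options(options: list[str], entailment_logits: list[float]) -> list[tuple[str, float]]:
    """Softmax over per-option entailment logits, sorted from most to least likely."""
    shift = max(entailment_logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in entailment_logits]
    total = sum(exps)
    scored = [(option, e / total) for option, e in zip(options, exps)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical entailment logits, one per (question, "The correct answer is: {option}") pair
ranking = rank_options(["0", "4", "2", "6"], [-0.3, 0.0, -0.15, 0.01])
best_option, best_score = ranking[0]
print(best_option, round(best_score, 3))
```

Because the softmax runs across the options of a single question, the scores of one question always sum to 1, which matches the `scores` list returned by the pipeline above.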

#### Task: evaluate the model on the dataset
Your task is to evaluate the model on the dataset and calculate some metrics (accuracy, potentially some other metrics, and more granular insight, e.g. per question subject).  
Additionally, you will implement batching to improve the evaluation performance and use a profiler to analyze the improvements.

Note that our problem is not a typical classification task, because the classes (here: the available answers) are different for each question.  
The "zero-shot-classification" pipeline expects that the *classes* passed to it are the same for all examples in the batch.  
To overcome this limitation we need to reimplement the pipeline.

The task involves the following steps:

1. First, implement a naive function which, given the dataset (or its subset), processes it row by row using the zero-shot pipeline. (1 pt)
2. Implement a vectorized (batched) version of the pipeline. (4 pts)
3. Write a test function comparing the results of batching with the naive version. (1 pt)
4. Profile the batched version and (adaptively) choose the best batch size for processing the whole dataset. (2 pts)
5. Calculate the accuracy of the model and gain some more insight into the results. (2 pts)  
   Batching is not strictly required for this part.
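
One possible reading of "adaptively choosing the batch size" in step 4 is to start large and halve the batch size whenever a batch does not fit in memory. The sketch below only illustrates that idea: `run_with_adaptive_batch_size` and `fake_model` are hypothetical names, and the toy model raises a plain `MemoryError` where a real GPU run would raise `torch.cuda.OutOfMemoryError`.

```python
from typing import Callable, Sequence


def run_with_adaptive_batch_size(
    items: Sequence,
    process_batch: Callable[[Sequence], list],
    initial_batch_size: int = 64,
) -> tuple[list, int]:
    """Process items in batches, halving the batch size whenever a batch runs out of memory."""
    batch_size = initial_batch_size
    results: list = []
    start = 0
    while start < len(items):
        batch = items[start : start + batch_size]
        try:
            results.extend(process_batch(batch))
        except MemoryError:  # on GPU with PyTorch: torch.cuda.OutOfMemoryError
            if batch_size == 1:
                raise  # even a single example does not fit
            batch_size //= 2
            continue  # retry from the same position with a smaller batch
        start += len(batch)
    return results, batch_size


# Toy stand-in for a model call: pretends that more than 10 items do not fit in memory
def fake_model(batch: Sequence) -> list:
    if len(batch) > 10:
        raise MemoryError("batch too large")
    return [x * 2 for x in batch]


outputs, final_batch_size = run_with_adaptive_batch_size(list(range(100)), fake_model)
print(final_batch_size, len(outputs))
```

Halving only ever shrinks the batch, so this finds a safe size rather than the fastest one; the profiler results can then be used to decide whether a smaller batch is actually quicker.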

#### Utilities

In [7]:
import gc
from textwrap import dedent
import torch
import numpy as np


QUESTION_TEMPLATE = dedent(
    """
    Question: {question}
"""
)
HYPOTHESIS_TEMPLATE = "The correct answer is: {}"


def flush():
    gc.collect()
    if torch.cuda.is_available():  # skip the CUDA bookkeeping on CPU-only or MPS machines
        torch.cuda.empty_cache()
        torch.cuda.reset_peak_memory_stats()

#### Naive implementation

In [8]:
from typing import TypedDict
from tqdm.auto import tqdm

class PipelineResult(TypedDict):
    labels: list[list[str]] #sorted according to scores
    scores: list[list[float]] #sorted descending
    top_inds: list[list[int]] #for each label, its index in the input's options list


def naive_zero_shot_classifier_pipeline(zero_shot_classifier, pipeline_input: Dataset) -> PipelineResult:
    """A naive ZeroShotClassificationPipeline which iterates over examples and processes them one by one."""
    labels_list, scores_list, top_inds_list = [], [], []
    
    i = 0
    for entry in pipeline_input:
        if i % 20 == 0:
            print(i)
        i += 1
        question = entry['question']
        options = entry['options']

        result = zero_shot_classifier(
            question,
            options,
            hypothesis_template = 'The correct answer is: {}',
            multi_label = False,
        )

        scores = np.array(result['scores'])
        labels = np.array(result['labels'])
        options = np.array(options)

        # The pipeline returns scores sorted in descending order; tied scores
        # (e.g. duplicated options) must not trip the check
        assert np.all(scores[:-1] >= scores[1:])

        labels_perm = np.argsort(labels)
        options_perm = np.argsort(options)
        
        def inverse_perm(p):
            inv = np.zeros_like(p, dtype = int)
            inv[p] = np.arange(len(p))
            return inv

        perm_labels_options = np.arange(len(labels))[options_perm]
        perm_labels_options = perm_labels_options[inverse_perm(labels_perm)]

        # print(labels_perm, options_perm)
        # print(labels)
        # print(options)
        # print(perm_labels_options)
        # print(options[perm_labels_options])

        assert all(options[perm_labels_options] == labels)

        scores_list.append(scores.tolist())
        labels_list.append(labels.tolist())
        top_inds_list.append(perm_labels_options.tolist())

    return PipelineResult(labels=labels_list, scores=scores_list, top_inds=top_inds_list)

In [140]:
# %%time
r = naive_zero_shot_classifier_pipeline(
    zero_shot_classifier, ds.take(256)
)

0
20
40
60
80
100
120
140
160
180
200
220
240


#### Batched implementation
Rewrite the pipeline to process the dataset in batches to improve efficiency.  
ModernBERT supports a special batching mode called *sequence packing* but its usage requires FlashAttention and is beyond the scope of this task.  
Your goal is to implement batching in such a way that processing the whole dataset is fast, GPU utilization is high, and you don't run out of memory.

**Hint (general):** group inputs in some specific way to minimize the amount of padding tokens.   
**Hint (implementation):** You may (but don't have to) check the implementation of the "zero-shot-classification" pipeline in Hugging Face transformers.
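
To make the general hint concrete, here is a small sketch of length-based grouping. It is not the required solution: `length_sorted_batches` and `padding_tokens` are hypothetical helpers, and the token counts are made up. In this task the lengths would be the token counts of the (question, hypothesis) pairs.

```python
def length_sorted_batches(lengths: list[int], batch_size: int) -> list[list[int]]:
    """Return batches of example indices, grouped so that similar lengths share a batch."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    return [order[i : i + batch_size] for i in range(0, len(order), batch_size)]


def padding_tokens(lengths: list[int], batches: list[list[int]]) -> int:
    """Count padding tokens when every batch is padded to its longest example."""
    total = 0
    for batch in batches:
        longest = max(lengths[i] for i in batch)
        total += sum(longest - lengths[i] for i in batch)
    return total


token_lengths = [12, 300, 15, 280, 11, 310]  # hypothetical token counts
in_order = [[0, 1, 2], [3, 4, 5]]  # batches in dataset order
bucketed = length_sorted_batches(token_lengths, batch_size=3)
print(padding_tokens(token_lengths, in_order), padding_tokens(token_lengths, bucketed))
```

Since the batches no longer follow the dataset order, the results have to be written back to their original positions using the stored indices.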

In [9]:
def zero_shot_classifier_with_batching(
    zero_shot_classifier, pipeline_input: Dataset
) -> PipelineResult:
    """A batched ZeroShotClassificationPipeline which processes examples in batches.

    Choosing the batch size is part of the function and can be done adaptively.
    """
    labels_list, scores_list, top_inds_list = [], [], []
    
    i = 0
    for entry in pipeline_input:
        print(entry)
    #TODO: Your code here

In [11]:
#flush()

In [12]:
%%time
r_batched = zero_shot_classifier_with_batching(
    zero_shot_classifier, ds.take(256)
)

{'sample_id': 'abstract_algebra/test/0', 'subject': 'abstract_algebra', 'subject_category': 'STEM', 'question': 'Find the degree for the given field extension Q(sqrt(2), sqrt(3), sqrt(18)) over Q.', 'option_a': '0', 'option_b': '4', 'option_c': '2', 'option_d': '6', 'answer': 'B', 'required_knowledge': '[]', 'time_sensitive': '[]', 'reference': '[]', 'culture': '[]', 'region': '[]', 'country': '[]', 'cultural_sensitivity_label': '-', 'is_annotated': False, 'options': ['0', '4', '2', '6']}
{'sample_id': 'abstract_algebra/test/1', 'subject': 'abstract_algebra', 'subject_category': 'STEM', 'question': 'Let p = (1, 2, 5, 4)(2, 3) in S_5 . Find the index of <p> in S_5.', 'option_a': '8', 'option_b': '2', 'option_c': '24', 'option_d': '120', 'answer': 'C', 'required_knowledge': '[]', 'time_sensitive': '[]', 'reference': '[]', 'culture': '[]', 'region': '[]', 'country': '[]', 'cultural_sensitivity_label': '-', 'is_annotated': False, 'options': ['8', '2', '24', '120']}
{'sample_id': 'abstract_

#### Test naive vs batched
Write a test checking that the naive and vectorized implementations produce the same results.

**Hint**: there might be some examples in the data which break the comparison.  
You may remove them or adjust the function to handle them correctly.
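
One way to make the comparison robust is to compare scores per original option index instead of comparing the sorted lists directly, so that tied scores (for instance from duplicated options) may appear in either order. The sketch below assumes results shaped like `PipelineResult`; `scores_by_option` and `results_match` are hypothetical names, and the tolerance is an arbitrary choice.

```python
import math


def scores_by_option(scores: list[float], top_inds: list[int]) -> list[float]:
    """Undo the descending sort: put each score back at its option's original index."""
    restored = [0.0] * len(scores)
    for score, option_index in zip(scores, top_inds):
        restored[option_index] = score
    return restored


def results_match(naive: dict, batched: dict, atol: float = 1e-4) -> bool:
    """Compare two PipelineResult-like dicts example by example, option by option."""
    if len(naive["scores"]) != len(batched["scores"]):
        return False
    for n_scores, n_inds, b_scores, b_inds in zip(
        naive["scores"], naive["top_inds"], batched["scores"], batched["top_inds"]
    ):
        a = scores_by_option(n_scores, n_inds)
        b = scores_by_option(b_scores, b_inds)
        if not all(math.isclose(x, y, abs_tol=atol) for x, y in zip(a, b)):
            return False
    return True


# Two options tie at 0.25, so the two implementations may order them differently
naive = {"scores": [[0.5, 0.25, 0.25]], "top_inds": [[2, 0, 1]]}
batched = {"scores": [[0.5, 0.25, 0.25]], "top_inds": [[2, 1, 0]]}
print(results_match(naive, batched))
```

An absolute tolerance is used because batched and unbatched forward passes can differ slightly in floating point, for example due to padding.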

In [12]:
def compare_naive_and_batched_zero_shot_classifiers(zero_shot_classifier, data: Dataset):
    # TODO: your code here
    raise NotImplementedError

In [None]:
compare_naive_and_batched_zero_shot_classifiers(zero_shot_classifier, ds.shuffle(42).take(256))

#### Profiling
Profile both implementations with the Torch profiler.  
Include the results as screenshots and comment on them.

**TODO:** your profiling results HERE

### Process the whole dataset & calculate metrics
Here you should process the whole dataset.  
Note the time it took.  
Then calculate some metrics (accuracy and any others you like) and comment on them.  
If you don't have the batched implementation, you may process the dataset with the naive version.  
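
As a starting point for the metrics, overall and per-subject accuracy can be computed as sketched below. The helper `accuracy_by_subject` and the sample values are hypothetical; with a `PipelineResult`, the predicted index of example `i` would be `top_inds[i][0]`.

```python
from collections import defaultdict


def accuracy_by_subject(
    predicted: list[int], answers: list[str], subjects: list[str]
) -> dict[str, float]:
    """Accuracy per subject; `predicted` holds option indices, `answers` letters 'A'..'D'."""
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for pred, answer, subject in zip(predicted, answers, subjects):
        total[subject] += 1
        correct[subject] += int(pred == ord(answer) - ord("A"))
    return {subject: correct[subject] / total[subject] for subject in total}


# Hypothetical predictions: index of the top-scoring option for each example
predicted = [1, 0, 2, 3]
answers = ["B", "C", "C", "D"]
subjects = ["abstract_algebra", "abstract_algebra", "anatomy", "anatomy"]

per_subject = accuracy_by_subject(predicted, answers, subjects)
overall = sum(p == ord(a) - ord("A") for p, a in zip(predicted, answers)) / len(answers)
print(overall, per_subject)
```

With four options per question, random guessing gives about 25% accuracy, which is the baseline to compare against.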

In [None]:
%%time
r_batched_whole_dataset = zero_shot_classifier_with_batching(
    zero_shot_classifier, ds
)

**TODO:** your evaluation and comments HERE

### Part 3: Decoder Models  

In this section, we will explore how to adapt a decoder model to solve this task and how modern LLMs are benchmarked on it.  

Recall that decoder models are used for autoregressive text generation, meaning they predict one token at a time, conditioning each prediction on previously generated tokens. A natural way to solve this task would be to prompt the model with different answer options and let it generate a response. However, this approach presents two major challenges:  

1. The model may not generate the answer in the expected format, making automatic evaluation difficult.  
2. Since decoder models generate text step by step, they do not directly assign a single probability to an entire answer, making it hard to compare different answer choices.  

To address this, we use **perplexity** to evaluate how likely the model considers each possible answer.  

### Perplexity-Based Evaluation  

Since a decoder model predicts a probability distribution over the vocabulary for each token, we can compute the likelihood of any given sequence by multiplying the probabilities assigned to its tokens. Perplexity is a measure of how well the model predicts a sequence, defined as the exponentiated negative average log-likelihood of the sequence. Formally, for a sequence of tokens $\mathbf{w} = (w_1, w_2, ..., w_n)$, perplexity is computed as:  

$$
PPL(\mathbf{w}) = \exp \left( -\frac{1}{n} \sum_{i=1}^{n} \log P(w_i \mid w_{<i}) \right)
$$

where $P(w_i \mid w_{<i})$ is the probability assigned by the model to token $w_i$ given the preceding tokens.  

A lower perplexity score indicates that the model assigns a higher probability to the given answer, making it a more likely choice. By computing perplexity for each possible answer and selecting the one with the lowest value, we can systematically rank the answers without requiring the model to generate them explicitly.  
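
As a small worked example of the formula, perplexity can be computed directly from per-token log-probabilities. The probabilities below are made up for illustration and do not come from any model.

```python
import math


def perplexity(token_log_probs: list[float]) -> float:
    """exp of the negative mean log-likelihood of the tokens."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# Hypothetical per-token log-probabilities log P(w_i | w_<i) for two candidate answers
candidates = {
    "answer_1": [math.log(0.5), math.log(0.25)],
    "answer_2": [math.log(0.1), math.log(0.1), math.log(0.1)],
}
ppl = {name: perplexity(log_probs) for name, log_probs in candidates.items()}
best = min(ppl, key=ppl.get)  # the lowest perplexity wins
print(ppl, best)
```

Here `answer_1` has perplexity $\sqrt{8} \approx 2.83$ and `answer_2` has perplexity 10, so `answer_1` is selected.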

This approach ensures reliable and scalable evaluation, making it a standard technique for benchmarking decoder models on multiple-choice tasks.  

You can read more about perplexity and what problems there are when it comes to using it as a metric in [this short blog](https://blog.eleuther.ai/multiple-choice-normalization/). Notice the challenges when it comes to models with different tokenizers and how to overcome them.

Last but not least, there is a reproducibility issue when you deploy a big, optimized model on a modern GPU: its outputs may not be fully deterministic. You can read more about it [here](https://community.openai.com/t/a-question-on-determinism/8185).

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2").eval()

#### Revisiting the prompt

The prompt and the response format also matter. [This blog post](https://huggingface.co/blog/open-llm-leaderboard-mmlu) shows that benchmark results can vary with the prompt and the evaluation strategy, and it discusses the different ways of deciding which answer the model chose.

We will use the HELM prompt and normalize each answer's log-likelihood by its token count.
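
To see why the normalization matters, compare the raw sum of log-probabilities with the per-token average on two options of different lengths. The numbers below are invented for illustration; `pick` is a hypothetical helper.

```python
def pick(scores: list[float]) -> int:
    """Index of the highest-scoring option."""
    return max(range(len(scores)), key=lambda i: scores[i])


# Hypothetical per-token log-probabilities for two answer options
token_log_probs = [
    [-2.0],              # a one-token option
    [-0.9, -0.8, -0.7],  # a three-token option
]

unnormalised = [sum(lp) for lp in token_log_probs]
token_normalised = [sum(lp) / len(lp) for lp in token_log_probs]
print(pick(unnormalised), pick(token_normalised))
```

The raw sum prefers the short option simply because it has fewer terms, while the per-token average prefers the longer option whose individual tokens are more likely.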

In [None]:
from textwrap import dedent

HELM_PROMPT_TEMPLATE = dedent("""
The following are multiple choice questions (with answers) about {subject}:

{question}
A. {option_a}
B. {option_b}
C. {option_c}
D. {option_d}
Answer:
""")

print(
    HELM_PROMPT_TEMPLATE.format(
        subject=sample_subject,
        question=sample_question,
        option_a=options[0],
        option_b=options[1],
        option_c=options[2],
        option_d=options[3],
    )
)



The following are multiple choice questions (with answers) about abstract_algebra:

Find the degree for the given field extension Q(sqrt(2), sqrt(3), sqrt(18)) over Q.
A. 0
B. 4
C. 2
D. 6
Answer:



Let's generate sample answers

In [None]:
generator = pipeline("text-generation", model="gpt2")

sample_prompt_formatted = HELM_PROMPT_TEMPLATE.format(
    subject=sample_subject,
    question=sample_question,
    option_a=options[0],
    option_b=options[1],
    option_c=options[2],
    option_d=options[3],
)

generations = generator(
    sample_prompt_formatted, max_new_tokens=30, num_return_sequences=5
)

for i, generation in enumerate(generations):
    print(
        f"Attempt {i+1}:", generation["generated_text"][len(sample_prompt_formatted) :]
    )

Device set to use mps:0
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Attempt 1: 
Sqrt(1/sqrt) is the number of elements with the same name in a field. Q is a number in the range
Attempt 2: 
E. A for Q Q(x) A E B (X, Y) E f 1 Q (x)

E. b
Attempt 3: 
The value of a Q (a) (n) and a B (i) and of a (b) (n) is the value
Attempt 4: 
Let M define an element with four elements containing zero, or M choose the element for which "M" is true. (

If M
Attempt 5: 
a) Q is less than the mean from the given model, for instance the function Q(x,y) or Q{x,y


As you can see, the free-form generations rarely contain a parsable answer, so evaluating them automatically would be rather a mess!

We start with a simple implementation where we will also utilise [caching](https://huggingface.co/docs/transformers/en/kv_cache) to speed up the process.

In [None]:
import copy
from typing import List

from transformers import PreTrainedModel, PreTrainedTokenizer


def compute_unnormalised_log_prob_sequentially(
    model: PreTrainedModel,
    tokenizer: PreTrainedTokenizer,
    prompt: str,
    completions: List[str],
    correct: str,
):
    """
    Sequentially computes log probabilities of completions using KV caching.
    """
    prompt_inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    # Generate KV cache for the prompt, the part shared by every completion
    with torch.no_grad():
        outputs = model(
            **prompt_inputs,
            use_cache=True,
        )
        prompt_kv_cache = outputs.past_key_values
        # Logits at the last prompt position predict the first completion token
        prompt_last_logits = outputs.logits[:, -1:, :]

    log_probs_list = []

    # Process all completions sequentially
    for completion in completions:
        # Tokenize only the completion
        completion_inputs = tokenizer(completion, return_tensors="pt").to(model.device)
        completion_ids = completion_inputs.input_ids  # shape (1, T)

        # Run the model with the cached KV from the prompt.
        # The cache is copied because recent transformers versions update cache objects in place.
        with torch.no_grad():
            outputs = model(
                input_ids=completion_ids,
                past_key_values=copy.deepcopy(prompt_kv_cache),
                use_cache=True,
            )

        # Logits at position i predict token i + 1, so shift them right by one
        logits = torch.cat([prompt_last_logits, outputs.logits[:, :-1, :]], dim=1)
        log_probs = torch.nn.functional.log_softmax(logits, dim=-1)

        # Get the log prob of each completion token at its own position
        token_log_probs = torch.gather(log_probs, 2, completion_ids[..., None]).squeeze(-1)

        # Sum the log probs to get the sequence log prob
        seq_log_prob = token_log_probs.sum()
        log_probs_list.append(seq_log_prob.item())

    log_probs_list = np.array(log_probs_list)
    # Map the answer letter to an option index: "A" -> 0, ..., "D" -> 3
    is_correct = np.argmax(log_probs_list) == ord(correct) - ord("A")
    return log_probs_list, is_correct


scores_sequential, is_correct = compute_unnormalised_log_prob_sequentially(
    model, tokenizer, sample_prompt_formatted, options, answer
)

print("Scores:", scores_sequential)
print("Is correct:", is_correct)


##### TASK decoder vectorized:

Now your task is to implement a vectorized version of this code. We don't want to make forward passes through the model with batch size 1 in a for loop, which is very inefficient. Instead, we want a single forward pass with batch size equal to the number of options (4 in this case).

The perplexity calculation after the forward pass doesn't need to be vectorized.

1. Create a KV cache with past key values for the shared prompt part, the question (2 pts)
2. Repeat the KV cache to make the shapes right for the batched options (2 pts)
3. Calculate perplexity for each option. Make sure not to include padding tokens! (2 pts)
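
The padding requirement in step 3 can be illustrated without a model: only positions where the attention mask is 1 may contribute to an option's score. The helper name and the numbers below are hypothetical, and plain lists stand in for tensors.

```python
def masked_sequence_log_probs(
    token_log_probs: list[list[float]], attention_mask: list[list[int]]
) -> list[float]:
    """Sum token log-probabilities per row, skipping positions where the mask is 0 (padding)."""
    return [
        sum(lp for lp, keep in zip(row, mask) if keep)
        for row, mask in zip(token_log_probs, attention_mask)
    ]


# Two completions padded to length 3; the first one has a single real token
token_log_probs = [
    [-1.5, -7.0, -7.0],  # the last two values belong to padding positions
    [-0.5, -1.0, -0.25],
]
attention_mask = [
    [1, 0, 0],
    [1, 1, 1],
]
sequence_scores = masked_sequence_log_probs(token_log_probs, attention_mask)
print(sequence_scores)
```

With tensors of shape `(batch, tokens)`, the same result can be obtained with `(token_log_probs * attention_mask).sum(dim=-1)`.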

In [None]:
def compute_unnormalised_log_prob_vectorized(
    model: PreTrainedModel,
    tokenizer: PreTrainedTokenizer,
    prompt: str,
    completions: List[str],
    correct: str,
):
    """
    Computes log probabilities of completions using KV caching with vectorized computation.
    """
    # Tokenize the prompt once
    prompt_inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

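    # Generate KV cache for question - shared part of each completion
    with torch.no_grad():
        outputs = model(
            **prompt_inputs,
            use_cache=True,
        )
        prompt_kv_cache = outputs.past_key_values
        # Logits at the last prompt position predict the first completion token
        prompt_last_logits = outputs.logits[:, -1:, :]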

    # Tokenize all completions together
    # Tokenizer doesn't have padding token, so we use EOS token as padding. It doesn't matter since attention mask will exclude it.
    tokenizer.pad_token = tokenizer.eos_token
    completion_inputs = tokenizer(
        completions, return_tensors="pt", padding=True, truncation=True
    ).to(model.device)

    # Repeat KV cache for each completion
    # Keep in mind that cache is for K and V, for each element of initial batch and for each model layer
    batch_size = len(completions)
    batched_prompt_kv_cache = """
        TODO: your code here
        """

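    # The attention mask has to cover the cached prompt tokens as well as the completion tokens
    full_attention_mask = torch.cat(
        [
            prompt_inputs.attention_mask.expand(batch_size, -1),
            completion_inputs.attention_mask,
        ],
        dim=1,
    )

    # Run the model with the cached KV from the prompt
    with torch.no_grad():
        outputs = model(
            input_ids=completion_inputs.input_ids,
            attention_mask=full_attention_mask,
            past_key_values=batched_prompt_kv_cache,
            use_cache=True,
        )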

    logits = outputs.logits
    log_probs = torch.nn.functional.log_softmax(logits, dim=-1)

    # Calculate log probabilities for each completion
    log_probs_list = []

    for i, completion in enumerate(completions):
        """
        TODO: your code here
        """

        log_probs_list.append(seq_log_prob.item())

    log_probs_list = np.array(log_probs_list)
    # Map the answer letter to an option index: "A" -> 0, ..., "D" -> 3
    is_correct = np.argmax(log_probs_list) == ord(correct) - ord("A")
    return log_probs_list, is_correct


scores_vectorized, is_correct = compute_unnormalised_log_prob_vectorized(
    model, tokenizer, sample_prompt_formatted, options, answer
)

print("Scores:", scores_vectorized)
print("Is correct:", is_correct)

In [None]:
assert np.allclose(scores_sequential, scores_vectorized)