**Heidelberg University**

**Data Science  Group**
    
Prof. Dr. Michael Gertz  

Ashish Chouhan, Satya Almasian, John Ziegler, Jayson Salazar, Nicolas Reuter
    
December 4, 2023
    
Natural Language Processing with Transformers

Winter Semster 2023/2024     
***

# **Assignment 3: “Transformers”**
**Due**: Monday, January 8, 2024, 2pm, via [Moodle](https://moodle.uni-heidelberg.de/course/view.php?id=19251)



### **Submission Guidelines**

- Solutions need to be uploaded as a **single** Jupyter notebook. You will find several pre-filled code segments in the notebook, your task is to fill in the missing cells.
- For the written solution, use LaTeX in markdown inside the same notebook. Do **not** hand in a separate file for it.
- Download the .zip file containing the dataset but do **not** upload it with your solution.
- It is sufficient if one person per group uploads the solution to Moodle, but make sure that the full names of all team members are given in the notebook.

***

## **Task 1: Diving into Attention** (3 + 4 + 4 + 1 = 12 points)

In this task, you work with self-attention equations and find out why multi-head attention is preferable to single-head attention.

Recall the equation of attention on slide 5-9 to compute self-attention on a series of input tokens. We simplify the formula by focusing on a single query vector $q \in R^d$, value vectors ($\{ v_1,v_2,...,v_i \},v_i \in R^d$), and key vectors ($\{ k_1,k_2,...,k_i \},k_i \in R^d$). We then have

$$
a_i=\frac{exp(q^Tk_i)}{\Sigma^n_{j=1}exp(q^Tk_j)}
$$

$$
 o= \Sigma^n_{i=1} a_i v_i
$$

with $a_i$ being the attention weight for query $q$ with respect to key $k_i$. Then the output $o$ is the new representation for the query token as a weighted average of value vectors with weights $a=\{ a_1,a_2,...,a_i \},a_i \in R^d$.
Answer the following questions with the help of the equations and the intuition behind attention that you learned in the class:



### Subtask 1: Copying  

1.   Explain why $a$ can be interpreted as a categorical distribution.
2.   This distribution is typically diffuse, where the mass is spread out between different values of $a_i$. Describe a scenario in which the categorical distribution puts all the weight on a single element, e.g., $a_j \gg \Sigma_{j\neq i}a_i$. What are the conditions on key and/or query for this to happen?
3. In this case of a single large $a$, what would the output $c$ look like and what it means intuitively?

In attention, it is easy to **copy** a value vector $v_i$ to the output $o$.





**Answer**


1. Because the calculation of $a$ is essentially the application of softmax to the matrix of $q^Tk_i$. Since softmax yields a propabilitiy distribution for the probability of the query regarding all the keys (value between 0 and 1 with a sum over all keys of 1), this can be seen as categorical distribution

2. The scenario where one element $a_j$ dominates the distribution ($a_j \gg \sum_{j\neq i} a_i$) occurs when the query vector $q$ strongly aligns with a specific key vector $k_j$. In other words, the dot product $q^Tk_j$ is significantly larger than the dot products of $q$ with other key vectors. This condition indicates that the query has a strong association with a particular key, leading to a focused attention on that specific element.

3. When a single $a_j$ is much larger than the others, the output $o$ becomes a weighted sum where the dominant weight is associated with the value vector $v_j$. Intuitively, this means that the output representation is heavily influenced by the information in the value vector corresponding to the key that strongly matches the query. The output essentially becomes a "copy" of the information from the value vector $v_j$, demonstrating the mechanism's capability to selectively focus on relevant information during the attention process.





#### ${\color{red}{Comments\ 1.1}}$

${\color{red}{⚠️Comments\ begin⚠️}}$


```
cross-feedback comment section
```


${\color{red}{⚠️Comments\ end⚠️}}$

### Subtask 2: Averaging


Instead of focusing on just one value vector $v_j$, the Transformer model can incorporate information from multiple inputs. Consider the situation where we want to incorporate information from two value vectors $v_b$ and $v_c$ with keys $k_b$ and $k_c$. In machine learning one of the ways to combine this information is through averaging of vectors $o= \frac{1}{2}(v_b+v_c)$.  It might seem hard to extract information about the original vectors $v_b$ and $v_c$ from the resulting average. But under certain conditions, one can do so. In this subtask, we look at the following cases:

1. Suppose we know the following:


* $v_b$ lies in a subspace $B$ formed by the $m$ basis vectors $\{b_1, b_2, .. , b_m\}$, while $v_c$ lies in a subspace $C$ formed by the $p$ basis vectors $\{c_1, c_2, . . . , c_p\}$ (This means that any $v_b$ and $v_c$ can be expressed as a linear combination of their basis vectors).
*   All basis vectors have the norm 1 and are orthogonal to each other.
*   The two subspaces $B$ and $C$ are orthogonal, meaning $b_j^Tc_k=0$ for all $j$ and $k$.
* Given that $\{b_1, b_2, .. , b_m\}$ are both orthogonal and form a basis for $v_b$, we know that there exists some $d_1, ..., d_m$ such that $v_b=d_1 b_1+d_2 b_2+...+d_m b_m$. Use these $d\text{s}$ to solve this task.

Using the basis vectors $\{b_1, b_2, .. , b_m\}$, construct a matrix $M$ such that for arbitrary vectors $v_b$ and $v_c$ with the given conditions, we can use $M$ to extract $v_b$ from the sum of the vector $s = v_b + v_c$. In other words, construct an $M$ such that  $ Ms = v_b$ holds.


2. If we assume that
* all key vectors are orthogonal, i.e., $k_i^Tk_j=0$ for all $i \neq j$, and
* all key vectors have the norm 1.

Find an expression for the query vector $q$ such that $o \approx \frac{1}{2}(v_b+v_c)$. Justify your answer.

**Hint:** Use your finding in subtask 1 to solve part 2.

**Hint:** If the norm of a vector $x$ is 1, then $x^Tx=1$

**Hint:** Start with writing $v_b$ and $v_c$ as the linear combination of the bases.


**Answer**

1. 
We know that $v_b = d_1 b_1 + d_2 b_2 + ... + d_m b_m$, and $v_c$ can be expressed similarly with other coefficents, we call them $d'_1, .., d'_p$. Let's define the matrix $M$ as follows:

$$
M = \begin{bmatrix}
    b_1^2 & 0 & \cdots & 0 \\
    0 & b_2^2 &\cdots & 0 \\
    \vdots & \vdots & \ddots & \vdots \\
    0 & 0 & \cdots & b_m^2 \\
\end{bmatrix}
$$

It is a diagonal matrix containing b_x / b_x. This allows the following:
Since s = $v_b + v_c$ both must be of equal length. Also we know that $b_j^T c_k = 0$ 
Hence, for each part of the sum of s it holds:
$ b_j^2 (b_j d_j + c_j d'_j) = b_j b_j b_j d_j + c_j b_j b_j d'_j = 1 b_j d_j + 0 b_j d'_j = v_b[j] $

Now, for any vectors $v_b$ and $v_c$ satisfying the given conditions, the matrix multiplication $Ms$ will yield $v_b$.



2. To achieve $o \approx \frac{1}{2}(v_b+v_c)$, we want the attention weights $a_i$ to be such that they give equal importance to $v_b$ and $v_c$. Let $a_b$ and $a_c$ be the attention weights for $v_b$ and $v_c$ respectively.

$ o \approx a_b \cdot v_b + a_c \cdot v_c $

Now, let's find $q$ such that $o \approx \frac{1}{2}(v_b + v_c)$. To achieve this, we need $a_b \approx a_c$ and $a_b + a_c = 1$.

Given the softmax attention weight calculation:

$a_i = \frac{\exp(q^Tk_i)}{\sum^n_{j=1}\exp(q^Tk_j)} $

To simplify the expression and achieve $o \approx \frac{1}{2}(v_b + v_c)$, we can aim for $q^Tk_b \approx q^Tk_c$ and $q^Tk_b + q^Tk_c = 0$ (to satisfy the orthogonality condition).

Let $q$ be:

$ q = \frac{1}{\sqrt{2}}(k_b - k_c) $

With this choice of $q$, let's analyze the terms:

$ q^Tk_b = \frac{1}{\sqrt{2}}(k_b - k_c)^Tk_b = \frac{1}{\sqrt{2}}\|k_b\|^2 - \frac{1}{\sqrt{2}}(k_b^Tk_c) $

$ q^Tk_c = \frac{1}{\sqrt{2}}(k_b - k_c)^Tk_c = -\frac{1}{\sqrt{2}}(k_b^Tk_c) + \frac{1}{\sqrt{2}}\|k_c\|^2 $

Therefore, $ q^Tk_b + q^Tk_c = 0 $, and $ q^Tk_b \approx q^Tk_c $. This implies that the attention weights $a_b$ and $a_c$ will be approximately equal.

As a result, the output \(o\) becomes:

$ o \approx a_b \cdot v_b + a_c \cdot v_c \approx \frac{1}{2}(v_b + v_c) $


#### ${\color{red}{Comments\ 1.2}}$

${\color{red}{⚠️Comments\ begin⚠️}}$


```
cross-feedback comment section
```


${\color{red}{⚠️Comments\ end⚠️}}$

### Subtask 3: Drawbacks of Single-head Attention

You might have wondered why we need multi-heads for attention. In this subtask, we look at some of the drawbacks of having a single head attention. As shown in the previous subtask, it is possible for single head attention to focus equally on two values. The same can apply to any subset of values, which therefor can become problematic.

Consider a set of key vectors $\{ k_1,k_2,...,k_n \}$, randomly sampled from a normal distribution with a known mean value of $\mu_i \in R^d$ and unknown covariance $Σ_i, i \in \{1, \ldots, n\}$, where


*   $\mu_i\text{s}$ are all orthogonal $\mu_i^T\mu_j=0$ if $i \neq j$.
*   $\mu_i\text{s}$ all have unit norm $||\mu_i||=1$.

1. For a vanishingly small $\alpha$ (not to be confused with attention weights), the covariance matrices are  $Σ_i=\alpha I, \forall i  \in \{1,2,..,n\}$, design a query $q$ in terms of the $\mu_i$ such that as before, $o= \frac{1}{2}(v_b+v_c)$ and describe why it works.

2.  Large perturbations in key value might cause problems for single head attention.  Specifically, in some cases, one key vector $k_b$ may be larger or smaller in norm than the others, while still pointing in the same direction as $\mu_b$. As an example of such a case,
consider a covariance matrix for item $b$ for vanishingly small $\alpha$ as $Σ_b=\alpha I + \frac{1}{2}(\mu_b^T\mu_b)$. This causes $k_a$ to point to roughly the same direction as $\mu_b$ but with large differences in magnitude, while for other items. Further, let $Σ_i=\alpha I\  \forall_i i \neq b$. When you sample multiple keys from the distribution $\{ k_1,k_2,...,k_n \}$ and use the $q$ vector from the pervious part, what do you expect vector $o$ to look like? Explain why this shows the drawback of single-head attention.

**Hint:**
Think about how it differs from pervious part and how $o$'s variance would be affected by the change in $Σ_b$.

**Hint:** Considering that $\mu_b^T\mu_b=1$, think of what are the ranges $Σ_b$ can take and how does that effect a sampled $k_b$ value.

**Hint:** $\frac{exp(b)}{exp(b)+exp(c)}=\frac{exp(b)}{exp(b)+exp(c)}\frac{exp(-b)}{exp(-b)}= \frac{1}{1+exp(c-b)}$

**Answer:**


1. Given that the covariance matrices are $Σ_i = \alpha I$, for vanishingly small $\alpha$, we can design the query vector $q$ as follows:

$ q = \frac{1}{\sqrt{2\alpha}}(\mu_b - \mu_c) $

Now, let's analyze why this choice of $q$ works:

- In the softmax attention weight calculation, $a_i = \frac{\exp(q^Tk_i)}{\sum^n_{j=1}\exp(q^Tk_j)}$.
- With $q = \frac{1}{\sqrt{2\alpha}}(\mu_b - \mu_c)$, the inner product $q^Tk_i$ simplifies to $\frac{1}{\sqrt{2\alpha}}(\mu_b^Tk_i - \mu_c^Tk_i)$.
- Since $\mu_i^T\mu_j = 0$ for $i \neq j$ (orthogonality), the terms $\mu_b^Tk_i$ and $\mu_c^Tk_i$ effectively isolate $k_b$ and $k_c$ respectively in the inner product.
- The denominators $\sum^n_{j=1}\exp(q^Tk_j)$ involve terms similar to $q^Tk_i$, and the softmax function ensures that $a_b \approx a_c$.

Therefore, with this choice of $q$, the attention weights $a_b$ and $a_c$ should be approximately equal, and the output $o$ becomes a weighted average of $v_b$ and $v_c$ with equal weights, achieving the desired result.

---

2. 
In this scenario, where $Σ_b = \alpha I + \frac{1}{2}(\mu_b^T\mu_b)$ and $Σ_i = \alpha I$ for all $i \neq b$, the covariance matrix for item $b$ introduces a perturbation term proportional to $\mu_b^T\mu_b$. This perturbation term causes $k_b$ to have large differences in magnitude compared to the other key vectors.

Considering the query vector $q = \frac{1}{\sqrt{2\alpha}}(\mu_b - \mu_c)$ from the previous part, the inner product $q^Tk_b$ simplifies to $\frac{1}{\sqrt{2\alpha}}(\mu_b^Tk_b - \mu_c^Tk_b)$. Due to the large perturbation in $k_b$, the term $\mu_b^Tk_b$ dominates, causing $q^Tk_b$ to be significantly larger than other $q^Tk_i$ terms.

As a result, in the softmax attention weight calculation, $a_b$ becomes much larger than the other attention weights $a_i$. This leads to a situation where the attention mechanism predominantly focuses on $v_b$ while largely neglecting the other values.

Therefore, in this case, the vector $o$ is expected to be heavily skewed towards $v_b$, indicating a drawback of single-head attention in handling large perturbations in key values.



#### ${\color{red}{Comments\ 1.3}}$

${\color{red}{⚠️Comments\ begin⚠️}}$


```
cross-feedback comment section
```


${\color{red}{⚠️Comments\ end⚠️}}$

### Subtask 4: Model Size  
1. Imagine you have an input sequence of  $l$ tokens, how much memory is required and what time complexity do we have for a single self-attention layer? (give your answer in terms of $l$)
2. If you have $N$ layers of self-attention, how  would the memory requirements and the time complexity change? (give your answer in terms of $l$ and $N$)
3. If you have $l=10,000$ and $10$ layers, with the ability to perform $10M$ operations per second, how long would it take to compute the attention output?


**Answer**

1. The memory requirement for a single self-attention layer involves storing the key, query, and value vectors for each token. Assuming each vector has a dimension of $d$, the memory complexity is $O(l \cdot d)$. The time complexity for self-attention is approximately $O(l^2 \cdot d)$.

2. For $N$ layers of self-attention, the memory requirements would scale linearly with the number of layers, and the time complexity would scale quadratically. The memory complexity becomes $O(N \cdot l \cdot d)$, and the time complexity becomes $O(N \cdot l^2 \cdot d)$.

3. Substitute $l=10,000$, $N=10$, and $d$ with the specific dimensionality of your vectors into the time complexity formula:
$ \text{Time} = O(N \cdot l^2 \cdot d) = 10 \cdot (10,000)^2 \cdot d \text{ operations} $

Given the ability to perform $10M$ (million) operations per second:

$ \text{Time taken} = \frac{1.000.000.000 \cdot d}{10.000.000} = 100 \cdot d \text{   seconds} $

#### ${\color{red}{Comments\ 1.4}}$

${\color{red}{⚠️Comments\ begin⚠️}}$


```
cross-feedback comment section
```


${\color{red}{⚠️Comments\ end⚠️}}$

## **Task 2: Multiple Choice Question Answering** (4 + 3 + 5 + 2 = 14 points)

In this task, you will fine-tune a transformer model on a multiple-choice task, which is the task of selecting the most plausible inputs in a given selection. The dataset used here is [SWAG](https://www.aclweb.org/anthology/D18-1009/), which is available via the Hugging Face [hub](https://huggingface.co/datasets/swag). Check the link for an overview of the dataset. SWAG is a dataset about commonsense reasoning, where each example describes a situation and then proposes four options that could apply for it.
Let's start by installing the necessary packages.

In [3]:
# %pip install transformers
# %pip install datasets
# %pip install evaluate
# %pip install accelerate -U
# %pip install sentencepiece

In this task, you will use a BERT model with a `MultipleChoice` head from the Hugging Face library and then create your custom model.   Recall from the class that the BERT model has an auxiliary next sentence prediction task, in which two sentences are given to BERT separated by a `[SEP]` token and a classifier head decides if the second sentence logically follows the first one. Hugging Face has
 a `*ForMultipleChoice` architecture that uses the representation of the `[CLS]` token and a linear layer to classify if one sentence follows the other. We first start with this default architecture and then build a more complicated one in a later subtask.

### Subtask 1: Loading and Processing the Data

We use the `dataset` library to download the SWAG dataset, which already contains train, validation, and test splits.

In [3]:
import os
os.environ['CUDA_LAUNCH_BLOCKING'] = "1"
from datasets import load_dataset, load_metric
datasets = load_dataset("swag", "regular")
datasets

  from .autonotebook import tqdm as notebook_tqdm
Found cached dataset swag (/Users/simon/.cache/huggingface/datasets/swag/regular/0.0.0/9640de08cdba6a1469ed3834fcab4b8ad8e38caf5d1ba5e7436d8b1fd067ad4c)
100%|██████████| 3/3 [00:00<00:00, 281.09it/s]


DatasetDict({
    train: Dataset({
        features: ['video-id', 'fold-ind', 'startphrase', 'sent1', 'sent2', 'gold-source', 'ending0', 'ending1', 'ending2', 'ending3', 'label'],
        num_rows: 73546
    })
    validation: Dataset({
        features: ['video-id', 'fold-ind', 'startphrase', 'sent1', 'sent2', 'gold-source', 'ending0', 'ending1', 'ending2', 'ending3', 'label'],
        num_rows: 20006
    })
    test: Dataset({
        features: ['video-id', 'fold-ind', 'startphrase', 'sent1', 'sent2', 'gold-source', 'ending0', 'ending1', 'ending2', 'ending3', 'label'],
        num_rows: 20005
    })
})

Lets look at the first item to see how the data looks like:

In [4]:
datasets["train"][0]

{'video-id': 'anetv_jkn6uvmqwh4',
 'fold-ind': '3416',
 'startphrase': 'Members of the procession walk down the street holding small horn brass instruments. A drum line',
 'sent1': 'Members of the procession walk down the street holding small horn brass instruments.',
 'sent2': 'A drum line',
 'gold-source': 'gold',
 'ending0': 'passes by walking down the street playing their instruments.',
 'ending1': 'has heard approaching them.',
 'ending2': "arrives and they're outside dancing and asleep.",
 'ending3': 'turns the lead singer watches the performance.',
 'label': 0}

**Question:**
Look at the dataset card on the Hugging Face hub and define what each of these fields means, with respect to the task:

*   `sent1`:
*   `sent2`:
*    `ending0`, `ending1`, `ending2` and `ending3`:
*   `label`:




**Answer**

`
- sent1: the first sentence
- sent2: the start of the second sentence (to be filled)
- ending0: first proposition
- ending1: second proposition
- ending2: third proposition
- ending3: fourth proposition
- label: the correct proposition
`

Write a function that displays the context and each of the four choices, following the format


```
Context:...
A-
B-
C-
D-
Ground truth: option ...
```

How you display the results is not important. You should be able to extract different parts of the data correctly and know what each field represents.

In [5]:
def explain_example(example):
    # Extract relevant fields from the example
    context = example['startphrase']
    choices = [example[f'ending{i}'] for i in range(4)]
    # Convert label to corresponding option letter
    ground_truth = chr(ord('A') + example['label'])  

    # Display the information
    print(f"Context: {context}\n")
    
    # Display choices
    for i, choice in enumerate(choices):
        print(f"{chr(ord('A') + i)}: {choice}")

    # Display ground truth
    print(f"\nGround truth: Option {ground_truth}")

In [6]:
explain_example(datasets["train"][0])

Context: Members of the procession walk down the street holding small horn brass instruments. A drum line

A: passes by walking down the street playing their instruments.
B: has heard approaching them.
C: arrives and they're outside dancing and asleep.
D: turns the lead singer watches the performance.

Ground truth: Option A


Before feeding the data into the model, we need to preprocess the text using `Tokenizer` to tokenize the inputs into tokens and put it in a format that the model expects. The tokenizer specific to the model we want to use for this task is `distilbert-base-uncased`. Complete the code below to load a fast tokenizer for this model. DistilBERT is similar to the BERT model, and we only use this particular architecture for faster training.


In [7]:
from transformers import AutoTokenizer

# Download vocabulary from huggingface.co and cache.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

In [43]:
tokenizer("This is the first sentence!", "And this is the second one.")

{'input_ids': [101, 2023, 2003, 1996, 2034, 6251, 999, 102, 1998, 2023, 2003, 1996, 2117, 2028, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

Write a function that preprocesses the samples.
The tricky part is to put all the possible pairs of sentences in two big lists before passing them to the tokenizer.
Each **first** sentence has to be repeated 4 times to go with different ending options.
There should be a separator token between the first and second sentence, to follow the BERT input logic.
The final output is a list of 4 elements, one for each choice, where the input is transformed by the tokenizer.
For example, with a list of 2 training examples, the output includes 2 lists, where each contains 4 elements. Each of those elements is the converted input ID of the first sentence followed by the second sentence with different endings.
When calling the `tokenizer`, we use the argument `truncation=True`. This will ensure that an input longer than what the model selected can handle will be truncated to the maximum length accepted by the model.

**Hint:** Flatten the lists (all choices are flattened into a single list) before feeding them into the tokenizer and unflatten them once again for the final output.

In [9]:
datasets["train"][:2]

{'video-id': ['anetv_jkn6uvmqwh4', 'anetv_jkn6uvmqwh4'],
 'fold-ind': ['3416', '3417'],
 'startphrase': ['Members of the procession walk down the street holding small horn brass instruments. A drum line',
  'A drum line passes by walking down the street playing their instruments. Members of the procession'],
 'sent1': ['Members of the procession walk down the street holding small horn brass instruments.',
  'A drum line passes by walking down the street playing their instruments.'],
 'sent2': ['A drum line', 'Members of the procession'],
 'gold-source': ['gold', 'gen'],
 'ending0': ['passes by walking down the street playing their instruments.',
  'are playing ping pong and celebrating one left each in quick.'],
 'ending1': ['has heard approaching them.', 'wait slowly towards the cadets.'],
 'ending2': ["arrives and they're outside dancing and asleep.",
  'continues to play as well along the crowd along with the band being interviewed.'],
 'ending3': ['turns the lead singer watches the

In [165]:
############# 
# Unoptimized
#############

# import torch

# ending_names = ['ending0', 'ending1', 'ending2', 'ending3']

# def preprocess_function(examples):
  
#     # repeat each first sentence four times
#     # Results is a list with n=number of options=4 items
#     # Each list inside contains  n=number of examples sentences
#     first_sentences = [examples['sent1'] for _ in range(4)]

#     # second sentences possible are combination of header and ending
#     question_headers = []
#     for first_sentence in first_sentences: 
#         first_option_temp = []
#         for first_option in first_sentence: 
#             first_option_temp.append(f"{first_option} [SEP]")
#         question_headers.append(first_option_temp)

#     assert len(first_sentences) == len(question_headers), "Dimensions don't match"
#     assert len(first_sentences[0]) == len(question_headers[0]), "Dimensions don't match"

#     # Get second sentence and merge each ending with it
#     # Results is a list with n=number of options=4 items
#     # Each list inside contains  n=number of examples sentences
#     second_sentences = []
#     for ending_name in ending_names: 
#         second_sentences_temp = []
#         for ending, start in zip(examples[ending_name], examples["sent2"]):
#             second_sentences_temp.append(start + " " + ending)
#         second_sentences.append(second_sentences_temp)

#     # flatten everything
#     flat_first_sentences = [item for sublist in first_sentences for item in sublist]
#     flat_second_sentences = [item for sublist in second_sentences for item in sublist]

#     assert len(flat_second_sentences) == len(examples['sent1']) * 4, "Dimensions don't match"
#     assert len(first_sentences) == len(second_sentences), "Dimensions don't match"

#     # tokenize
#     inputs = []
#     for flat_first, flat_second in zip(flat_first_sentences, flat_second_sentences):
#         inputs_temp = tokenizer(flat_first, flat_second, return_tensors='pt', truncation=True)
#         inputs.append(inputs_temp)

#     # un-flatten
#     input_ids = []
#     attention_masks = []

#     num_lists = len(examples["sent1"])
#     for i in range(num_lists):
#         input_id = [input["input_ids"][0] for input in inputs][i::num_lists]
#         attention_mask = [input["attention_mask"][0] for input in inputs][i::num_lists]
        
#         input_ids.append(input_id)
#         attention_masks.append(attention_mask)

#     inputs = {
#         'input_ids': input_ids,
#         'attention_mask': attention_masks,
#         'labels': torch.tensor([label for label in examples["label"]], dtype=torch.long)
#     }

#     return inputs


In [176]:
import torch

ending_names = ['ending0', 'ending1', 'ending2', 'ending3']

def preprocess_function(examples):
    # Repeat each first sentence four times
    first_sentences = [examples['sent1'] for _ in range(4)]

    # Generate question headers
    question_headers = [
        [f"{first_option} [SEP]" for first_option in first_sentence]
        for first_sentence in first_sentences
    ]

    # Generate second sentences by combining each ending with it
    second_sentences = [
        [start + " " + ending for ending, start in zip(examples[ending_name], examples["sent2"])]
        for ending_name in ending_names
    ]

    # Flatten the lists
    flat_first_sentences = [item for sublist in first_sentences for item in sublist]
    flat_second_sentences = [item for sublist in second_sentences for item in sublist]

    # Tokenize
    inputs = [
        tokenizer(flat_first, flat_second, return_tensors='pt', truncation=True)
        for flat_first, flat_second in zip(flat_first_sentences, flat_second_sentences)
    ]

    # Extract input_ids and attention_masks
    input_ids = [input["input_ids"][0] for input in inputs]
    attention_masks = [input["attention_mask"][0] for input in inputs]

    # Un-flatten the lists
    num_lists = len(examples["sent1"])
    input_ids = [input_ids[i::num_lists] for i in range(num_lists)]
    attention_masks = [attention_masks[i::num_lists] for i in range(num_lists)]

    # Return the processed inputs
    return {
        'input_ids': input_ids,
        'attention_mask': attention_masks,
        'label': torch.tensor([label for label in examples["label"]], dtype=torch.long)
    }


In [177]:
examples = datasets["train"][:2]
features = preprocess_function(examples)
print(len(features["input_ids"]), len(features["input_ids"][0]), [len(x) for x in features["input_ids"][0]])# output should be 2 4 [30, 25, 30, 28]

2 4 [30, 25, 30, 28]


We can now apply our function to all the examples in the dataset. We use the `map` method to apply the function on all the elements of all the splits in the dataset (training, validation, and testing).
Note that we passed `batched=True` to leverage the fast tokenizer and use multi-threading to process the texts in batches concurrently.

In [178]:
encoded_datasets = datasets.map(preprocess_function, batched=True)

Map:   0%|          | 0/73546 [00:00<?, ? examples/s]                                                                  

Our dataset is still not converted to tensors and not padded. This is the job of the `data collator`. A data collator takes a list of examples and converts them to a batch.
There is no data collator in the Hugging Face default library that works on our specific problem. We thus need to write our own one. In this collator:

*  All the inputs/attention masks are flattened.
* A flattened list is passed to the `tokenizer.pad ` method to apply dynamic padding to pad inputs to the maximum length in the batch. Output will be the size of `(batch_size * 4) x seq_length`.
* Everything needs to be unflattened for the output of the data collator.
* `input_ids` and `labels` should be returned as tensors.
* The output is a dictionary called `batch` that contains features needed for training (`input_ids`, `attention_mask`, `label`).



In [279]:
from dataclasses import dataclass
from transformers.tokenization_utils_base import PreTrainedTokenizerBase, PaddingStrategy
from typing import Optional, Union
import torch

@dataclass
class MultipleChoiceDataCollator:
    tokenizer: PreTrainedTokenizerBase
    padding: Union[bool, str, PaddingStrategy] = True
    max_length: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None

    def __call__(self, features):
        accepted_keys = ["input_ids", "attention_mask", "label"]
        if len(features[0]) > len(accepted_keys):
            features = [{k: v for k, v in i.items() if k in accepted_keys} for i in features]

        labels = [feature["label"] for feature in features]

        # flatten
        flattened_features = {k: [item for sublist in [f[k] for f in features] for item in sublist] for k in ["input_ids"]}
        print(flattened_features)

        # Use the tokenizer and attributes from the class to pad the input
        batch = self.tokenizer.pad(
            flattened_features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors="pt",
        )

        # Un-flatten
        batch = {k: batch[k].view(len(features), *batch[k].shape[1:]) for k in batch}

        return batch

In [280]:
accepted_keys = ["input_ids", "attention_mask", "label"]
features = [{k: v for k, v in encoded_datasets["train"][i].items() if k in accepted_keys} for i in range(2)]

batch=MultipleChoiceDataCollator(tokenizer)(features)
print(batch["input_ids"].shape)
print(batch["attention_mask"].shape)
print(batch["labels"].shape)

{'input_ids': [[101, 2372, 1997, 1996, 14385, 3328, 2091, 1996, 2395, 3173, 2235, 7109, 8782, 5693, 1012, 102, 1037, 6943, 2240, 5235, 2011, 3788, 2091, 1996, 2395, 2652, 2037, 5693, 1012, 102], [101, 2372, 1997, 1996, 14385, 3328, 2091, 1996, 2395, 3173, 2235, 7109, 8782, 5693, 1012, 102, 1037, 6943, 2240, 2038, 2657, 8455, 2068, 1012, 102], [101, 2372, 1997, 1996, 14385, 3328, 2091, 1996, 2395, 3173, 2235, 7109, 8782, 5693, 1012, 102, 1037, 6943, 2240, 8480, 1998, 2027, 1005, 2128, 2648, 5613, 1998, 6680, 1012, 102], [101, 2372, 1997, 1996, 14385, 3328, 2091, 1996, 2395, 3173, 2235, 7109, 8782, 5693, 1012, 102, 1037, 6943, 2240, 4332, 1996, 2599, 3220, 12197, 1996, 2836, 1012, 102], [101, 1037, 6943, 2240, 5235, 2011, 3788, 2091, 1996, 2395, 2652, 2037, 5693, 1012, 102, 2372, 1997, 1996, 14385, 2024, 2652, 17852, 13433, 3070, 1998, 12964, 2028, 2187, 2169, 1999, 4248, 1012, 102], [101, 1037, 6943, 2240, 5235, 2011, 3788, 2091, 1996, 2395, 2652, 2037, 5693, 1012, 102, 2372, 1997, 1996

RuntimeError: shape '[2, 35]' is invalid for input of size 280

In [None]:
for i in range(4):
  print(batch["input_ids"][0][i])
  print(tokenizer.decode(batch["input_ids"][0][i]))

#### ${\color{red}{Comments\ 2.1}}$

${\color{red}{⚠️Comments\ begin⚠️}}$


```
cross-feedback comment section
```


${\color{red}{⚠️Comments\ end⚠️}}$

### Subtask 2: Fine-tuning a Hugging Face Model

To fine-tune our model, we first need to download the correct architecture from Hugging Face. Import the correct class for this task and download the pre-trained checkpoint for the base class from `distilbert-base-uncased`. Note that the weights in the classification head are initialized at random.

In [None]:
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')
model_hf = AutoModel.from_pretrained("distilbert-base-uncased", torch_dtype=torch.float16)


Next, we need to define our `Trainer` and pass in the correct `TrainingArguments` (a class that contains all the attributes to customize the training). Define a `TrainingArguments` that


* creates an output directory `distilbert-base-uncased-swag` to save the checkpoints and logs.
*   evaluates the model on the validation set after the `300` steps.
* a checkpoint should be saved after each `600` step and no more than 2 checkpoints should be saved in total.
* the random seed for training is `77`.
* batch size for training and evaluation: `48` (if you are running out of memory, feel free to change this setting but indicate it as a comment in your notebook, on a T4 GPU from google colab this takes about `13.2GB` of `15.0GB`).
* train for `1800` steps with a learning rate of `5e-5`, and add weight decay of `0.01` to the optimizer.
* the trainer should remove the columns from the data that are not used by the model.
* The final checkpoint should be the checkpoint that had the best overall validation metric not necessarily the last checkpoint.

**Note:** Please use GPU for to train your model. If on colab, you can use T4 GPU for free.

In [None]:
from transformers import TrainingArguments, Trainer

output_directory = "distilbert-base-uncased-swag"

training_args = TrainingArguments(
    output_dir=output_directory,
    evaluation_strategy="steps",
    eval_steps=300,
    save_total_limit=2,
    save_steps=600,
    seed=77,
    per_device_train_batch_size=48,
    per_device_eval_batch_size=48,
    max_steps=1800,
    learning_rate=5e-5,
    weight_decay=0.01,
    remove_unused_columns=True,
    logging_dir=output_directory,
)


Before we initialize the `Trainer`, we create a function that tells the trainer how to compute the metrics from the predictions. Fill the `compute_metrics` function to compute the accuracy based on the `predictions`. This object contains the prediction of the model, as well as the ground truth labels.

**Hint 1:** Keep in mind that the output of this function should be a dictionary containing the metric name and value.

**Hint 2:** Consider the shape of the example input. This is similar to the logits produced by the model.

In [None]:
import numpy as np

def compute_metrics(predictions):
    logits, labels = predictions[0], predictions[1]

    preds = np.argmax(logits, axis=-1)
    accuracy = (preds == labels).mean()

    return {"accuracy": accuracy}

In [None]:
preds=np.array([[0.9,0.2,0,0],
                [0.2,0.2,0.9,0.1],
                [0.2,0.9,0,0],
                [0.2,0.1,0.8,0],
                [0.9,0.1,0.8,0],
                [0.2,1,0.4,0],
                [0.2,1,0.4,0.9],
                [1,0.1,0.4,0.3],
                [0.1,0.1,0.9,0.3],
                [0.1,0.1,0.2,1]])
label_ids=np.array([0,3,1,2,0,1,3,0,2,3])
compute_metrics((preds,label_ids))

{'accuracy': 0.8}

Now it's time to pass everything to a `Trainer` object to start the training process. Initialize a `Trainer` object and pass all the necessary information, keep in mind that we also have the optional metric computation and that we tend to run an evaluation on the validation set during training. The training should take around 30 min on Google Colab T4 GPU.

In [None]:
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
device

device(type='cpu')

In [None]:
trainer = Trainer(
    model=model_hf.to(device),
    args=training_args,
    train_dataset=datasets["train"],  # must be replaced with processed data from 2.1
    eval_dataset=datasets["validation"],    # must be replace with processed data from 2.1
)

In [None]:
trainer.train()# should take around 30 min on Google Colab T4 GPU

  0%|          | 0/1800 [00:00<?, ?it/s]

TypeError: forward() got an unexpected keyword argument 'labels'

Save the model in `distilbert-base-uncased-swag/final_model`.

In [None]:
trainer.save_model(output_directory)

Look at the saved files and answer the following questions (it is possible to answer these questions by writing some code, but we want you to explore the saved files):

**Question:**


1.   What is the vocabulary id for the `[CLS]` and `[MASK]` tokens?
2.   What is the dropout probability for the attention layer?

**Dropout:** With dropout, certain nodes are set to the value zero in a training run, i.e. removed from the network. Thus, they have no influence on the prediction and also in the backpropagation. Thus, a new, slightly modified network architecture is built in each run and the network learns to produce good predictions without certain inputs. Read more [here](https://databasecamp.de/en/ml/dropout-layer-en).



**Answer**

`
Enter your answer here
`

#### ${\color{red}{Comments\ 2.2}}$

${\color{red}{⚠️Comments\ begin⚠️}}$


```
cross-feedback comment section
```


${\color{red}{⚠️Comments\ end⚠️}}$

### Subtask 3: Fine-tune a Custom Model


In this case, we were lucky that Hugging Face had a pre-implemented architecture available for us to use. However, that is not always the case. Moreover, we might want to experiment beyond the default architectures to find a suitable one for a task. Therefore, it is important to learn to extend the Hugging Face models and train a custom model. The good news is that except for the model architecture the rest of the code can remain as it is.

Design a model for multiple choice model as follows:


1.   the config file for a feature extractor (must be a distilbert type) is  passed during initialization. The config file determines which model is used for feature extraction.
2.   From the `last_hidden_state` of the feature extractor, choose the `[CLS]` embedding (first one). This embedding is used as the compressed representation of first and second sentences. During pre-training it is used  for classifying whether these two sentences follow one another, making it a good candidate for our task.
3. `[CLS]` embedding is passed through a linear layer **that does not change the size of the embedding** and is passed through a tanh nonlinearity.
4. The output of tanh is passed through a dropout layer, where the dropout probability is the same as the dropout probability used for the `distilbert` model used as feature extractor.
5. The output of the previous stage is fed into another linear layer that shrinks the size of the embedding dimension to a quarter of the original size, e.g., if the embedding size is 12, the new embedding dimension is 3.
6. The output is followed by another dropout layer (you can use the one from stage 4).
7. Finally, a binary classifier is applied to determine the probability of sentence 1 being followed by sentence 2.
8. the cross-entropy loss is used to compute the loss.

**Hint:** Keep in mind that for a 4 choice system, you classify each of the four solutions independently. However, the final output should group the four logits together. For example, if input ids have the shape `[2, 4, 35]` (batch size=2, num choices=4, seq len=35), then the logits have the `[2, 4]` and labels have the dimension `[2, 1]`.



In [None]:
from transformers import DistilBertModel,BertConfig,DistilBertConfig,PretrainedConfig,PreTrainedModel,DistilBertPreTrainedModel
from torch import nn

class CustomMultipleChoice(DistilBertPreTrainedModel):
    def __init__(self, config: PretrainedConfig):
        super().__init__(config)
        ###your code ###
        self.distilbert =
        self.dense =
        self.activation =
        self.dropout =
        self.dense2 =
        self.classifier =
        ###your code ###


    def forward(
        self,
        input_ids: Optional[torch.Tensor] = None,
        attention_mask: Optional[torch.Tensor] = None,
        labels: Optional[torch.Tensor] = None,
    ):
        """
        input_ids: input sentences converted to ids
        attention_mask: the attention mask
        labels:  Labels for computing the multiple choice classification loss. Indices should be in `[0, ...,num_choices-1]` where `num_choices` is the size of the second dimension of the input tensors.
        """

        num_choices = input_ids.shape[1]

        ###your code ###
        input_ids =
        attention_mask =



        loss = None
        if labels is not None:

        ###your code ###
        return {"loss":loss,"logits":reshaped_logits}


Initialize the feature extractor with `distilbert-base-uncased` and create your custome model.

In [None]:
from transformers import AutoConfig
###your code ###
config=
model_custom =
###your code ###

In [None]:
for name, param in model_custom.named_parameters():
    if param.requires_grad and not name.startswith("distilbert."):
        print(name, param.data.shape)

We keep the same training arguments but change the directory in which we save the model logs, the directory in which we save the model output and the name of the run, to `custom_model`.



In [None]:
###your code ###


###your code ###

Initialize the trainer for training the custom model.The training should take around 30 min on Google Colab T4 GPU.


In [None]:
trainer =
###your code ###

###your code ###


Save the model in `distilbert-base-uncased-swag/final_model`.

In [None]:
### your code ###

### your code ###

Look at the saved files and answer the following questions (it is possible to answer these questions by writing some code, but we want you to explore the saved files):

**Question:**


1.   What is the vocabulary id for the `[CLS]` and `[MASK]` tokens?
2.   What is the dropout probability for the attention layer?

**Dropout:** With dropout, certain nodes are set to the value zero in a training run, i.e. removed from the network. Thus, they have no influence on the prediction and also in the backpropagation. Thus, a new, slightly modified network architecture is built in each run and the network learns to produce good predictions without certain inputs. Read more [here](https://databasecamp.de/en/ml/dropout-layer-en).



**Answer**

`
Enter your answer here
`

#### ${\color{red}{Comments\ 2.2}}$

${\color{red}{⚠️Comments\ begin⚠️}}$


```
cross-feedback comment section
```


${\color{red}{⚠️Comments\ end⚠️}}$

### Subtask 3: Fine-tune a Custom Model


In this case, we were lucky that Hugging Face had a pre-implemented architecture available for us to use. However, that is not always the case. Moreover, we might want to experiment beyond the default architectures to find a suitable one for a task. Therefore, it is important to learn to extend the Hugging Face models and train a custom model. The good news is that except for the model architecture the rest of the code can remain as it is.

Design a model for multiple choice model as follows:


1.   the config file for a feature extractor (must be a distilbert type) is  passed during initialization. The config file determines which model is used for feature extraction.
2.   From the `last_hidden_state` of the feature extractor, choose the `[CLS]` embedding (first one). This embedding is used as the compressed representation of first and second sentences. During pre-training it is used  for classifying whether these two sentences follow one another, making it a good candidate for our task.
3. `[CLS]` embedding is passed through a linear layer **that does not change the size of the embedding** and is passed through a tanh nonlinearity.
4. The output of tanh is passed through a dropout layer, where the dropout probability is the same as the dropout probability used for the `distilbert` model used as feature extractor.
5. The output of the previous stage is fed into another linear layer that shrinks the size of the embedding dimension to a quarter of the original size, e.g., if the embedding size is 12, the new embedding dimension is 3.
6. The output is followed by another dropout layer (you can use the one from stage 4).
7. Finally, a binary classifier is applied to determine the probability of sentence 1 being followed by sentence 2.
8. the cross-entropy loss is used to compute the loss.

**Hint:** Keep in mind that for a 4 choice system, you classify each of the four solutions independently. However, the final output should group the four logits together. For example, if input ids have the shape `[2, 4, 35]` (batch size=2, num choices=4, seq len=35), then the logits have the `[2, 4]` and labels have the dimension `[2, 1]`.



In [None]:
from transformers import DistilBertModel,BertConfig,DistilBertConfig,PretrainedConfig,PreTrainedModel,DistilBertPreTrainedModel
from torch import nn

class CustomMultipleChoice(DistilBertPreTrainedModel):
    def __init__(self, config: PretrainedConfig):
        super().__init__(config)
        ###your code ###
        self.distilbert =
        self.dense =
        self.activation =
        self.dropout =
        self.dense2 =
        self.classifier =
        ###your code ###


    def forward(
        self,
        input_ids: Optional[torch.Tensor] = None,
        attention_mask: Optional[torch.Tensor] = None,
        labels: Optional[torch.Tensor] = None,
    ):
        """
        input_ids: input sentences converted to ids
        attention_mask: the attention mask
        labels:  Labels for computing the multiple choice classification loss. Indices should be in `[0, ...,num_choices-1]` where `num_choices` is the size of the second dimension of the input tensors.
        """

        num_choices = input_ids.shape[1]

        ###your code ###
        input_ids =
        attention_mask =



        loss = None
        if labels is not None:

        ###your code ###
        return {"loss":loss,"logits":reshaped_logits}


Initialize the feature extractor with `distilbert-base-uncased` and create your custome model.

In [None]:
from transformers import AutoConfig
###your code ###
config=
model_custom =
###your code ###

In [None]:
for name, param in model_custom.named_parameters():
    if param.requires_grad and not name.startswith("distilbert."):
        print(name, param.data.shape)

We keep the same training arguments but change the directory in which we save the model logs, the directory in which we save the model output and the name of the run, to `custom_model`.



In [None]:
###your code ###


###your code ###

Initialize the trainer for training the custom model.The training should take around 30 min on Google Colab T4 GPU.


In [None]:
trainer =
###your code ###

###your code ###


In [None]:
trainer.train()# should take around 30 min on Colab T4 GPU

Save the model in `custom_model/final_model`. Note that with the custom model, you need to save it without the help of the trainer. The trainer would save the configuration but since this model is not a registered Hugging Face model only the base model would be saved. Loading the model weights is also effected by this.

In [None]:
###your code ###

###your code ###

#### ${\color{red}{Comments\ 2.3}}$

${\color{red}{⚠️Comments\ begin⚠️}}$


```
cross-feedback comment section
```


${\color{red}{⚠️Comments\ end⚠️}}$

### Subtask 4: Evaluation and Model Comparison

Many times you do not perform the final evaluation right after training, but load the checkpoints and evaluate them on the fly. To this end, load the two models from  disk.

In [None]:
from transformers import AutoModelForMultipleChoice,AutoConfig
### your code ###
model_hf =
model_custom =
### your code ###

To evaluate the data we load the validation split using a data loader and our previously defined data collator. Note that although we had a test split we cannot use it, since there are no labels available for this split (you can check the data to confirm this).

In [None]:
from torch.utils.data import DataLoader
import evaluate

eval_dataloader = DataLoader(encoded_datasets["validation"], batch_size=64, collate_fn=MultipleChoiceDataCollator(tokenizer))

To make things easier, let's use the `evaluate` library from Hugging Face to compute the accuracy metric. Here we load `accuracy` from the `evaluate` library two times, one for the custom model and one for the Hugging Face model. Further, we put the models on eval mode. Complete the code for evaluation using the capabilities of the `evaluate` library to simultaneously compute the metric for both models.


In [None]:
from tqdm import tqdm
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
metric_dict={"custom":evaluate.load("accuracy"),"hf":evaluate.load("accuracy")} #use to compute accuracy
models_dict= {"custom":model_custom,"hf":model_hf}# use to access models

for name, model in models_dict.items():
  model.to(device)
  model.eval()

for i,batch in tqdm(enumerate(eval_dataloader), total=len(eval_dataloader)):
  ### your code ###
  #evaluate on both model on each batch

acc_hf=
acc_custom
  ### your code ###
print("Hugging Face Model :",acc_hf)
print("Custom Model :",acc_custom)

#### ${\color{red}{Comments\ 2.4}}$

${\color{red}{⚠️Comments\ begin⚠️}}$


```
cross-feedback comment section
```


${\color{red}{⚠️Comments\ end⚠️}}$

## **Task 3: Encoder-Decoder Architecture** (5 + 2 + 2 + 5 = 14 points)

We explored an encoder-based model (BERT) in the previous exercise. In this task, we look at another family of transformer architectures, the encoder-decoder. We use the [T5](https://arxiv.org/pdf/1910.10683.pdf) model, presented by Raffel et al.  T5 is an encoder-decoder architecture pre-trained on a multi-task mixture of unsupervised and supervised tasks. In this task, we set up a fine-tuning example for question answering using the [SQUAD](https://huggingface.co/datasets/squad) dataset. Since the actual fine-tuning is time-consuming and computational intensive for inference, we use an already pre-trained model. The main goal is to introduce you to the structure of the fine-tuning and its simplicity with the Hugging Face framework.

To fine-tune the BERT-based models, we usually add a task-specific head. On the other hand, T5 converts all NLP problems into a text-to-text format.  
It is trained using teacher forcing, meaning that we require an input sequence and a corresponding target sequence.


1.   The input sequence is fed to the model using `input_ids` from the tokenizer.
2.   The target sequence is shifted to the right, i.e., prepended by a start-sequence token and fed to the decoder using the `decoder_input_ids` (input_ids of the encoded target sequence). The target sequence is appended by EOS (end of the sentence) to denote the end of a generation and corresponds to the `labels`.
3. The task prefix defines what task is expected of T5. For example, we prepend the input sequence with `translate English to German: ` before encoding the input to tell the model to translate. T5 already has a set of pre-defined task prefixes, and it is best to stick to those since they were used during pre-training. With enough training data, you can also introduce your own custom task.


In contrast to the encoder model, where only a single `max_length` is required, for encoder-decoder architectures, one typically defines a `max_source_length` and `max_target_length`, which determine the maximum length of the input and output sequences, respectively. We must also ensure that the padding ID of the `labels` is not taken into account by the loss function. This can be done by replacing them with `-100`, which is the `ignore_index` of the `CrossEntropyLoss`.

### Subtask 1: Data Processing

We first start by loading the dataset from Hugging Face hub:

In [None]:
from datasets import load_dataset

datasets_squad = load_dataset("squad")
datasets_squad

In [None]:
print("context ---->" ,datasets_squad["train"][0]["context"])
print("question ---->",datasets_squad["train"][0]["question"])
print("answers ---->",datasets_squad["train"][0]["answers"])

Now let's load the needed pre-trained tokenizer for `t5-small`, which is the smallest T5 model. Set the maximum sequence length to `512`.

In [None]:
import torch
### your code ###
from transformers import ...
t5_tokenizer =
### your code ###

The next step is to pre-process the dataset using the tokenizer to convert the sequences to IDs and add the special tokens.
T5 is based on the SentencePiece tokenizer, and the end of sentence token is denoted by `</s>`.
Complete the function `add_eos_to_examples` to format the input and target sequence. Your input as `input_text` should have the format `question:{question_text} context:{context_text} <EOS_Token>` and your target as `target_text` should have the format `{answer_text} <EOS_Token>`.

In [None]:
def add_eos_to_examples(example):
    ### your code ###
    example['input_text'] =
    example['target_text'] =
    ### your code ###
    return example

Use the `map` function to process the data, and do not set the `batched` argument.

In [None]:
### your code ###
encoded_squad =
### your code ###

In [None]:
print(encoded_squad["train"][0]["input_text"])
print(encoded_squad["train"][0]["target_text"])

Complete the function `convert_to_features` that takes in the examples from the dataset and tokenizes them using the T5 tokenizer. However, our answers in this dataset are relatively short and do not require `512` tokens, in contrast to the input sequence which is a combination of question and context paragraphs and is usually long. To this end, we want to truncate the input sequence at `512` and the target sequence at `16`. If any input or target is smaller than the specified length, make sure you pad them. Finally, convert everything to PyTorch tensors to be easily used by the data collator and place them in the dictionary `encodings`.

In [None]:
def convert_to_features(examples):
    ### your code ###


    encodings = {
        'input_ids': ....
    }
    ### your code ###
    return encodings

Use the `map` function to process the data.

In [None]:
### your code ###
encoded_squad =
### your code ###

In [None]:
encoded_squad #new columns are added

Interestingly, although we specified PyTorch tensors as output, the type of the `input_ids` is still a list. To remedy this problem, you need to explicitly set the type of the column that contains PyTorch tensors.

In [None]:
type(encoded_squad["train"][0]["input_ids"])

In [None]:
### your code ###

### your code ###
type(encoded_squad["train"][0]["input_ids"])

In [None]:
print("Shape of the input_ids:",encoded_squad["train"][0]["input_ids"].shape)
print("Shape of the target_ids:",encoded_squad["train"][0]["target_ids"].shape)

The final step in the data processing is the creation of the data collator to
prepare `labels` from `target_ids` and return examples with keys as expected by the forward method of T5.
This is necessary because the trainer directly passes this dict as argument to the model so you need to check the input of T5 and rename the column based on that.
`input_ids`, `target_ids`, `attention_mask`, and `target_attention_mask` need to be stacked in a batch and the pad tokens in the target need to be set to `-100` to avoid loss computation.

In [None]:
from dataclasses import dataclass
from transformers import DataCollator
from transformers.tokenization_utils_base import PreTrainedTokenizerBase, PaddingStrategy
from typing import Optional, Union
@dataclass
class T2TDataCollator:
    tokenizer: PreTrainedTokenizerBase
    padding: Union[bool, str, PaddingStrategy] = True
    max_length: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None
    def __call__(self, batch):

      ### your code ###


        feature_dict=
        return feature_dict
      ### your code ###


In [None]:
accepted_keys = ['input_text', 'target_text', 'input_ids', 'attention_mask', 'target_ids', 'target_attention_mask']
features = [{k: v for k, v in encoded_squad["train"][i].items() if k in accepted_keys} for i in range(2)]
batch=T2TDataCollator(t5_tokenizer)(features)
print(batch["input_ids"].shape)
print(batch["attention_mask"].shape)
print(batch["labels"].shape)

#### ${\color{red}{Comments\ 3.1}}$

${\color{red}{⚠️Comments\ begin⚠️}}$


```
cross-feedback comment section
```


${\color{red}{⚠️Comments\ end⚠️}}$

### Subtask 2: Training

For training and inference, we can use `T5ForConditionalGeneration`, which includes the language modeling head on top of the decoder. Load the `t5-small` model.

In [None]:
### your code ###
from transformers import ...
t5 =
### your code ###

Next, similar to the previous task we initiate training arguments. Note that this time we are using a `Seq2SeqTrainingArguments` for a `Seq2SeqTrainer`. Set the parameters for training as follows:


*   T5 doesn't support GPU and TPU evaluation for now, so we only focus on training. You do not need to pass any parameters for evaluation setup.
*   The output directory should be named `t5-squad`.
* The T5 models need a slightly higher learning rate than the default one set in the `Trainer` when using the `AdamW` optimizer. Set the learning rate to `1e-4` and the regularization parameter to `0.01`.
* Random seed should be `77`, and we train for a maximum of `200` steps and save a checkpoint every `100` steps. A complete training of the T5 model requires far more than `200` steps, however, that is beyond the scope of this assignment.
* T5 models require a large batch size. The default model was trained with a batch size of `128`. However, we cannot fit that into a single GPU, therefore we use gradient accumulation. Set the batch size to `32` and choose the gradient accumulation step to reach the effective batch size of `128`.
* Make sure that your trainer does not remove unused columns during training, as this will cause a runtime error later on.


**Gradient accumulation:** is a technique that simulates a larger batch size by accumulating gradients from multiple small batches before performing a weight update.



In [None]:
from transformers import ...

training_args = ### your code ###


    ### your code ###


Once again make sure that you are using GPU before running the cell below.
Initilize your `Seq2SeqTrainer` with inputs necessary for training. The training should take around 15 min on Google Colab T4 GPU.


In [None]:
# Initialize our Trainer
from transformers import ...
trainer =
    ### your code ###

    ### your code ###


In [None]:
trainer.train()

#### ${\color{red}{Comments\ 3.2}}$

${\color{red}{⚠️Comments\ begin⚠️}}$


```
cross-feedback comment section
```


${\color{red}{⚠️Comments\ end⚠️}}$

### Subtask 3: Inference

Our trained model has seen far too few instances to make a coherent prediction. To this end, we load an already trained checkpoint from Hugging Face and perform inference. Load this [model](https://huggingface.co/mrm8488/t5-base-finetuned-squadv2) and the respective tokenizer. Note that we are loading a `base` model that is slightly larger than `t5-small`.

In [None]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
### your code ###
t5_tokenizer =
t5_model =
### your code ###

At inference time for T5, it is recommended to use the `generate()` function. This auto-regressively generates the decoder output. Complete the code for the `get_answer` function, which gives a model, a tokenizer, and a question and context pair, and generates the answer from the context given. The output should be the answer to the given question in natural text (without the special tokens).

**Hint:** Many of the steps are similar to how you prepared your input data for the model.

In [None]:
def get_answer(tokenizer,model, question, context):
  ### your code ###
  input_text =
  features =


  answer=
  ### your code ###
  return answer

Let's try it with an example.

In [None]:
context = "Sarah has joined NLP for transformers class and is working on her research project with the support of Harry."
question = "Who is supporting Sarah?"

get_answer(t5_tokenizer,t5_model,question, context)###your answer should be "Harry"

In [None]:
context = "TPUs are more power efficient in comparison to GPUs making them a better choice for machine learning projects."
question = "What is better for machine learning projects?"

get_answer(t5_tokenizer,t5_model,question, context)###your answer should be "TPUs"

#### ${\color{red}{Comments\ 3.3}}$

${\color{red}{⚠️Comments\ begin⚠️}}$


```
cross-feedback comment section
```


${\color{red}{⚠️Comments\ end⚠️}}$

### Subtask 4: T5 Paper

To answer questions of the final subtask you need to have a general overview of the [T5 paper](https://arxiv.org/pdf/1910.10683.pdf).



1.   Describe what a “text-to-text format" is and how T5 processes input and output for text classification tasks? What are the possible complications with a predefined set of classes?
2.   Describe the "masked language modeling" and "word dropout" unsupervised objective with sentinel tokens. Give an example of how this would look in a single sentence.
3. Explain "fully-visible", "causal" and "causal masking with prefix" masking.
4. Briefly describe "adapter layers" and "gradual unfreezing" as methods for fine-tuning on fewer parameters.



**Answer**

`
Enter your answer here
`

**Answer**

`
Enter your answer here
`

**Answer**

`
Enter your answer here
`

**Answer**

`
Enter your answer here
`

#### ${\color{red}{Comments\ 3.4}}$

${\color{red}{⚠️Comments\ begin⚠️}}$


```
cross-feedback comment section
```


${\color{red}{⚠️Comments\ end⚠️}}$