**Heidelberg University**

**Data Science  Group**
    
Prof. Dr. Michael Gertz  

Ashish Chouhan, Satya Almasian, John Ziegler, Jayson Salazar, Nicolas Reuter
    
December 4, 2023
    
Natural Language Processing with Transformers

Winter Semester 2023/2024
***

# **Assignment 3: “Transformers”**
**Due**: Monday, January 8, 2024, 2pm, via [Moodle](https://moodle.uni-heidelberg.de/course/view.php?id=19251)



### **Submission Guidelines**

- Solutions need to be uploaded as a **single** Jupyter notebook. You will find several pre-filled code segments in the notebook, your task is to fill in the missing cells.
- For the written solution, use LaTeX in markdown inside the same notebook. Do **not** hand in a separate file for it.
- Download the .zip file containing the dataset but do **not** upload it with your solution.
- It is sufficient if one person per group uploads the solution to Moodle, but make sure that the full names of all team members are given in the notebook.

***

## **Task 1: Diving into Attention** (3 + 4 + 4 + 1 = 12 points)

In this task, you work with self-attention equations and find out why multi-head attention is preferable to single-head attention.

Recall the equation of attention on slide 5-9 to compute self-attention on a series of input tokens. We simplify the formula by focusing on a single query vector $q \in R^d$, value vectors ($\{ v_1,v_2,...,v_n \},v_i \in R^d$), and key vectors ($\{ k_1,k_2,...,k_n \},k_i \in R^d$). We then have

$$
a_i=\frac{exp(q^Tk_i)}{\Sigma^n_{j=1}exp(q^Tk_j)}
$$

$$
 o= \Sigma^n_{i=1} a_i v_i
$$

with $a_i$ being the attention weight for query $q$ with respect to key $k_i$. The output $o$ is then the new representation of the query token, a weighted average of the value vectors with weights $a=\{ a_1,a_2,...,a_n \},a_i \in R$.
Answer the following questions with the help of the equations and the intuition behind attention that you learned in the class:
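The two equations above can be sketched in a few lines of plain Python. This is only a minimal illustration, not the course's reference implementation; the function name `single_query_attention` and the toy vectors are assumptions made for this example.

```python
import math

def single_query_attention(q, keys, values):
    """Return (weights, output) for one query over n key/value pairs."""
    # Scores q^T k_i for every key.
    scores = [sum(qd * kd for qd, kd in zip(q, k)) for k in keys]
    # Softmax; subtracting the max score keeps exp() numerically stable.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Output o = sum_i a_i v_i, computed per dimension.
    dim = len(values[0])
    output = [sum(a * v[d] for a, v in zip(weights, values)) for d in range(dim)]
    return weights, output

# Toy data (hypothetical): n = 3 tokens, d = 2 dimensions.
q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
weights, output = single_query_attention(q, keys, values)
print(weights, output)
```

The weights are positive and sum to one, and the key that is best aligned with the query receives the largest weight.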



### Subtask 1: Copying  

1.   Explain why $a$ can be interpreted as a categorical distribution.
2.   This distribution is typically diffuse, with the mass spread out over the different $a_i$. Describe a scenario in which the categorical distribution puts almost all the weight on a single element, e.g., $a_j \gg \Sigma_{i\neq j}a_i$. What are the conditions on the keys and/or the query for this to happen?
3. In this case of a single large $a_j$, what would the output $o$ look like and what does it mean intuitively?

In attention, it is easy to **copy** a value vector $v_i$ to the output $o$.





**Answer**



1.   Due to the softmax, each $a_i$ lies strictly between 0 and 1, $0<a_i<1$, and the weights sum to one, $\Sigma_i a_i=1$. Hence $a$ can be interpreted as a categorical distribution over $n$ individually identified items, where $n$ is the number of key-value pairs (input tokens), not the dimensionality $d$ of the vectors.
2.  For $a_i$ to be much larger than the others, the dot product $q^Tk_i$ has to be much larger than $q^Tk_j$ for all $j \neq i$. This happens when $k_i$ has a large norm and/or is closely aligned with $q$ while the other keys are not. The softmax then puts most of the probability on this item.

3. If $a_i$ carries almost all the weight and the others are close to zero, then most of the weight is placed on $v_i$. The output is $o \approx v_i$: the attention output for the query approaches the value of the $i\text{th}$ token, as if that value were copied to the output.
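The copying behaviour can be checked numerically. The sketch below assumes orthonormal toy keys and a query of the form $q = \beta k_i$ (both are choices made for this illustration, not part of the assignment); as $\beta$ grows, the weight on the target key approaches one.

```python
import math

def attention_weights(q, keys):
    """Softmax over the scores q^T k_i."""
    scores = [sum(qd * kd for qd, kd in zip(q, k)) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Orthonormal toy keys (an assumption that makes the effect easy to see).
keys = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
target = 1  # index of the key whose value we want to "copy"

for beta in (1.0, 5.0, 20.0):
    q = [beta * kd for kd in keys[target]]  # q = beta * k_target
    w = attention_weights(q, keys)
    print(beta, [round(x, 4) for x in w])

# Weights for the largest beta, kept for inspection.
copy_weights = attention_weights([20.0 * kd for kd in keys[target]], keys)
```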



✅ Point distribution ✅
- 1 point for each question, for a total of 3 points.

#### ${\color{red}{Comments\ 1.1}}$

${\color{red}{⚠️Comments\ begin⚠️}}$


```
cross-feedback comment section
```


${\color{red}{⚠️Comments\ end⚠️}}$

### Subtask 2: Averaging


Instead of focusing on just one value vector $v_j$, the Transformer model can incorporate information from multiple inputs. Consider the situation where we want to incorporate information from two value vectors $v_b$ and $v_c$ with keys $k_b$ and $k_c$. In machine learning one of the ways to combine this information is through averaging of vectors $o= \frac{1}{2}(v_b+v_c)$.  It might seem hard to extract information about the original vectors $v_b$ and $v_c$ from the resulting average. But under certain conditions, one can do so. In this subtask, we look at the following cases:

1. Suppose we know the following:


* $v_b$ lies in a subspace $B$ formed by the $m$ basis vectors $\{b_1, b_2, .. , b_m\}$, while $v_c$ lies in a subspace $C$ formed by the $p$ basis vectors $\{c_1, c_2, . . . , c_p\}$ (This means that any $v_b$ and $v_c$ can be expressed as a linear combination of their basis vectors).
*   All basis vectors have the norm 1 and are orthogonal to each other.
*   The two subspaces $B$ and $C$ are orthogonal, meaning $b_j^Tc_k=0$ for all $j$ and $k$.
* Given that $\{b_1, b_2, .. , b_m\}$ are orthogonal and form a basis for $B$, we know that there exist some $d_1, ..., d_m$ such that $v_b=d_1 b_1+d_2 b_2+...+d_m b_m$. Use these $d\text{s}$ to solve this task.

Using the basis vectors $\{b_1, b_2, .. , b_m\}$, construct a matrix $M$ such that for arbitrary vectors $v_b$ and $v_c$ with the given conditions, we can use $M$ to extract $v_b$ from the sum $s = v_b + v_c$. In other words, construct an $M$ such that $Ms = v_b$ holds.


2. If we assume that
* all key vectors are orthogonal, i.e., $k_i^Tk_j=0$ for all $i \neq j$, and
* all key vectors have the norm 1.

Find an expression for the query vector $q$ such that $o \approx \frac{1}{2}(v_b+v_c)$. Justify your answer.

**Hint:** Use your finding in subtask 1 to solve part 2.

**Hint:** If the norm of a vector $x$ is 1, then $x^Tx=1$

**Hint:** Start with writing $v_b$ and $v_c$ as the linear combination of the bases.


**Answer**

1. We can rewrite $v_b$ and $v_c$ as linear combinations of their bases, where $B$ and $C$ are the matrices whose columns are the basis vectors of the respective subspace:
$$
v_b= d_1 b_1+d_2 b_2+...+d_m b_m= Bd
$$
$$
v_c= f_1 c_1+f_2 c_2+...+f_p c_p= Cf
$$
Now we need to construct an $M$ that when multiplied by $v_b$ produces the same vector ($v_b$) but when multiplied by $v_c$ results in zero. ($M v_b=v_b$ and $M v_c=0$)

$$
Ms=v_b\\
M(v_b+v_c)=v_b\\
Mv_b+Mv_c=v_b
$$

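Since the subspaces are orthogonal, $b_j^Tc_k=0$ for all $j$ and $k$, and therefore $B^TC=0$.

Because the basis vectors are normalized and orthogonal to each other, $b_i^Tb_j=1$ for $i=j$ and $0$ otherwise, so $B^TB=I$.

Then, if we substitute $M$ with $B^T$:

$$
B^Ts = B^TBd+ B^TCf= Id+0f = d
$$
Expressed in the basis $\{b_1, .. , b_m\}$, $v_b$ is just the collection of coefficients $d$, so $M=B^T$ extracts it. To obtain $v_b$ in the original coordinates, multiply by $B$ once more: $BB^Ts = Bd = v_b$, i.e., $M=BB^T$.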


2. For $o \approx \frac{1}{2}(v_b+v_c)$ to hold, the weights associated with $v_b$ and $v_c$ should be $a_b=a_c=\frac{1}{2}$.
From subtask 1 we know that this requires the scores of the keys $k_b$ and $k_c$ to be equal and much larger than all other scores:

$$
q^Tk_b=q^Tk_c \gg q^Tk_i
$$
for all $i \neq b,c$.

Let:
$$
q^T k_b=q^Tk_c= \beta,
$$
 where $\beta$ is a large value.

Considering that $exp(0)=1$ and $exp(\beta) \rightarrow \infty$ when $\beta$ is very large, and assuming $q^Tk_i=0$ for all $i \neq b,c$:
$$
a_b=a_c=\frac{exp(q^Tk_b)}{\Sigma_{i \neq b,c}exp(q^Tk_i)+exp(q^Tk_b)+exp(q^Tk_c)}=\frac{exp(\beta)}{n-2+2exp(\beta)} \approx \frac{1}{2}
$$

Therefore, $q = \beta (k_b+k_c)$ with $\beta \gg 0$ satisfies these conditions. Because the keys are orthonormal:

$$
q^T k_b= \beta (k_b^T+k_c^T)k_b=\beta k_b^Tk_b+\beta k_c^Tk_b =\beta, \qquad q^T k_i= \beta (k_b^Tk_i+k_c^Tk_i) = 0 \ \text{for} \ i \neq b,c
$$
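Both parts of this answer can be verified on toy data. The sketch below is an illustration under stated assumptions: the vectors, the helper names `project_onto` and `attention_weights`, and the choice $\beta = 20$ are all hypothetical. Note that $B^Ts$ alone yields the coefficients $d$; applying $B$ afterwards maps them back to $v_b$, which is what `project_onto` computes.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def project_onto(basis, s):
    """Compute B B^T s for orthonormal basis vectors (the columns of B)."""
    coords = [dot(b, s) for b in basis]  # B^T s = d
    return [sum(c * b[i] for c, b in zip(coords, basis)) for i in range(len(s))]

def attention_weights(q, keys):
    scores = [dot(q, k) for k in keys]
    m = max(scores)
    exps = [math.exp(x - m) for x in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Part 1: B spans the first two axes, C the last two (orthogonal subspaces).
basis_b = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
v_b = [2.0, -1.0, 0.0, 0.0]
v_c = [0.0, 0.0, 3.0, 5.0]
s = [x + y for x, y in zip(v_b, v_c)]
recovered = project_onto(basis_b, s)
print("recovered v_b:", recovered)

# Part 2: orthonormal keys and q = beta * (k_b + k_c).
keys = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0],
        [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
b, c = 0, 2
beta = 20.0
q = [beta * (kb + kc) for kb, kc in zip(keys[b], keys[c])]
weights = attention_weights(q, keys)
print("attention weights:", [round(w, 4) for w in weights])
```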

✅ Point distribution ✅
- 2 points for each part.


#### ${\color{red}{Comments\ 1.2}}$

${\color{red}{⚠️Comments\ begin⚠️}}$


```
cross-feedback comment section
```


${\color{red}{⚠️Comments\ end⚠️}}$

### Subtask 3: Drawbacks of Single-head Attention

You might have wondered why we need multiple heads for attention. In this subtask, we look at some of the drawbacks of single-head attention. As shown in the previous subtask, single-head attention can focus equally on two values. The same applies to any subset of values, which can therefore become problematic.

Consider a set of key vectors $\{ k_1,k_2,...,k_n \}$, randomly sampled from a normal distribution with a known mean value of $\mu_i \in R^d$ and unknown covariance $Σ_i, i \in \{1, \ldots, n\}$, where


*   $\mu_i\text{s}$ are all orthogonal $\mu_i^T\mu_j=0$ if $i \neq j$.
*   $\mu_i\text{s}$ all have unit norm $||\mu_i||=1$.

1. For a vanishingly small $\alpha$ (not to be confused with attention weights), the covariance matrices are  $Σ_i=\alpha I, \forall i  \in \{1,2,..,n\}$, design a query $q$ in terms of the $\mu_i$ such that as before, $o= \frac{1}{2}(v_b+v_c)$ and describe why it works.

2.  Large perturbations in the keys might cause problems for single-head attention. Specifically, in some cases one key vector $k_b$ may be larger or smaller in norm than the others while still pointing in the same direction as $\mu_b$. As an example of such a case,
consider the covariance matrix $\Sigma_b=\alpha I + \frac{1}{2}(\mu_b\mu_b^T)$ for item $b$, with vanishingly small $\alpha$. This causes $k_b$ to point in roughly the same direction as $\mu_b$ but with large variation in magnitude. For all other items, let $\Sigma_i=\alpha I$ for all $i \neq b$. When you sample the keys $\{ k_1,k_2,...,k_n \}$ multiple times and use the query $q$ from the previous part, what do you expect the vector $o$ to look like? Explain why this shows the drawback of single-head attention.

**Hint:**
Think about how this differs from the previous part and how the variance of $o$ would be affected by the change in $\Sigma_b$.

**Hint:** Considering that $\mu_b^T\mu_b=1$, think about what range of values $\Sigma_b$ allows and how that affects a sampled $k_b$.

**Hint:** $\frac{exp(b)}{exp(b)+exp(c)}=\frac{exp(b)}{exp(b)+exp(c)}\frac{exp(-b)}{exp(-b)}= \frac{1}{1+exp(c-b)}$

**Answer:**

1. Since $\alpha$ is vanishingly small, the diagonals of the covariance matrices are also vanishingly small, so the sampled $k_i$ converge to their means: $k_i \sim N(\mu_i,\Sigma_i)$ yields $k_i \approx \mu_i$.
The $\mu_i\text{s}$ are all orthogonal, $\mu_i^T\mu_j=0$ if $i \neq j$, and have unit norm, so the situation matches the previous subtask, where all keys were orthonormal. Hence $q = \beta (\mu_b+\mu_c)$ with $\beta \gg 0$.

2. We know that $\mu_i$ has unit norm, $\mu_i^T\mu_i=1$, that $\alpha$ is vanishingly small, and that the covariance matrix for $b$ is:
$$
\Sigma_b=\alpha I + \frac{1}{2}(\mu_b\mu_b^T)
$$
The extra term adds variance only along the direction of $\mu_b$, so a sampled $k_b$ is a rescaled copy of $\mu_b$, roughly in the range:

$$
k_b \in [0.5\mu_b, 1.5\mu_b]
$$

All other $k_i$ barely vary due to the vanishingly small $\alpha$, so $k_i \approx \mu_i$ as in the previous part. As a result:

$$
k_b \approx \lambda \mu_b \ \text{where} \ \lambda \sim N(1,0.5)\\
k_i \approx \mu_i \ \forall i \neq b
$$

Using the query from the previous part, $q = \beta (\mu_b+\mu_c)$, we need $q^Tk_i$ to compute the attention weights. Here $q$ is a scaled sum of $\mu_b$ and $\mu_c$. Since all means are orthogonal, the dot product between $q$ and every key that does not point along $\mu_b$ or $\mu_c$ is approximately zero:
$$
\forall_i i \neq b,c : \ q^Tk_i= \beta (\mu_b^T+\mu_c^T) \mu_i = \beta (\mu_b^T\mu_i+\mu_c^T\mu_i)= \beta (0+0)=0
$$
For items $b$ and $c$:
$$
q^Tk_b \approx \lambda \beta (\mu_b^T+\mu_c^T)\mu_b\approx \lambda\beta \ where \ \beta>>0\\
q^Tk_c \approx \beta (\mu_b^T+\mu_c^T)\mu_c \approx \beta\ where  \ \beta>>0
$$
Now we can compute the attention weights $a_b$ and $a_c$, knowing:
$$
\frac{exp(b)}{exp(b)+exp(c)}=\frac{exp(b)}{exp(b)+exp(c)}\frac{exp(-b)}{exp(-b)}= \frac{1}{1+exp(c-b)}
$$
we have:
$$
a_b \approx \frac{exp(\lambda \beta)}{exp(\lambda \beta)+exp( \beta)}\approx \frac{1}{1+exp(\beta(1-\lambda))}\\
a_c \approx \frac{exp(\beta )}{exp(\lambda \beta)+exp( \beta)}\approx \frac{1}{1+exp(\beta(\lambda-1))}
$$

We also know that $\lambda$ varies between $0.5$ and $1.5$ and $\beta>>0$:

$$
For \ \lambda \rightarrow 0.5 \ and \ \beta>>0: a_b \approx \frac{1}{1+exp(\beta(1-0.5))} \approx \frac{1}{1+\infty} \approx 0 \ , a_c \approx \frac{1}{1+exp(\beta(0.5-1))} \approx \frac{1}{1+0} \approx 1
$$
$$
For \ \lambda \rightarrow 1.5 \ and \ \beta>>0: a_b \approx \frac{1}{1+exp(\beta(1-1.5))} \approx \frac{1}{1+0} \approx 1 \ , a_c \approx \frac{1}{1+exp(\beta(1.5-1))} \approx \frac{1}{1+\infty} \approx 0
$$
Since $o \approx a_b v_b + a_c v_c$, then:
$$
\text{For } \lambda \rightarrow 0.5 \text{ and } \beta \gg 0: o \approx v_c \\
\text{For } \lambda \rightarrow 1.5 \text{ and } \beta \gg 0: o \approx v_b
$$
As a result, the output $o$ oscillates between $v_b$ and $v_c$ from sample to sample instead of being their true average. A single head is therefore very sensitive to the norm of the keys.
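The instability can be illustrated by evaluating the closed form $a_b \approx \frac{1}{1+exp(\beta(1-\lambda))}$ from the derivation above. The following sketch assumes $\beta = 20$, reads the $0.5$ in $N(1, 0.5)$ as a variance, and uses a fixed random seed; all three are choices made for this illustration only.

```python
import math
import random

def weight_on_b(lam, beta):
    """a_b = 1 / (1 + exp(beta * (1 - lam))), taken from the derivation above."""
    return 1.0 / (1.0 + math.exp(beta * (1.0 - lam)))

beta = 20.0  # assumed "large" scale of the query
for lam in (0.5, 1.0, 1.5):
    a_b = weight_on_b(lam, beta)
    print(f"lambda={lam}: a_b={a_b:.4f}, a_c={1 - a_b:.4f}")

# Sample lambda ~ N(1, 0.5), interpreting 0.5 as the variance (an assumption).
rng = random.Random(0)
samples = [weight_on_b(rng.gauss(1.0, math.sqrt(0.5)), beta) for _ in range(1000)]
near_extremes = sum(1 for a in samples if a < 0.1 or a > 0.9) / len(samples)
print(f"fraction of samples with a_b near 0 or 1: {near_extremes:.2f}")
```

Only $\lambda$ very close to 1 gives balanced weights; most sampled keys push nearly all the attention onto one of the two values.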

✅ Point distribution ✅
- 1 point for part 1.
- 3 points for part 2.

#### ${\color{red}{Comments\ 1.3}}$

${\color{red}{⚠️Comments\ begin⚠️}}$


```
cross-feedback comment section
```


${\color{red}{⚠️Comments\ end⚠️}}$

### Subtask 4: Model Size  
1. Imagine you have an input sequence of  $l$ tokens, how much memory is required and what time complexity do we have for a single self-attention layer? (give your answer in terms of $l$)
2. If you have $N$ layers of self-attention, how  would the memory requirements and the time complexity change? (give your answer in terms of $l$ and $N$)
3. If you have $l=10,000$ and $10$ layers, with the ability to perform $10M$ operations per second, how long would it take to compute the attention output?


**Answer**



1. We have to compute the attention from each token to every other token, so both time and memory complexity are $O(l^2)$.
2.  With $N$ layers, the complexity is $O(N \times l^2)$.

3. $10000 \times 10000 \times 10=10^9$ operations and $10^9/10^7=100$, so it takes 100 seconds.
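The arithmetic for part 3 can be reproduced as follows. This is a back-of-the-envelope sketch that counts exactly $l^2$ operations per layer, as the answer does, and ignores constant factors such as the hidden dimension; the variable names are chosen for this example.

```python
seq_len = 10_000              # l
num_layers = 10               # N
ops_per_second = 10_000_000   # 10M operations per second

ops_per_layer = seq_len ** 2              # one score per token pair
total_ops = num_layers * ops_per_layer    # N * l^2
seconds = total_ops / ops_per_second
print(total_ops, seconds)
```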



✅ Point distribution ✅
- 0.25 points for part 1.
- 0.25 points for part 2.
- 0.5 points for part 3.

#### ${\color{red}{Comments\ 1.4}}$

${\color{red}{⚠️Comments\ begin⚠️}}$


```
cross-feedback comment section
```


${\color{red}{⚠️Comments\ end⚠️}}$

## **Task 2: Multiple Choice Question Answering** (4 + 3 + 5 + 2 = 14 points)

In this task, you will fine-tune a transformer model on a multiple-choice task, which is the task of selecting the most plausible option from a given set of candidates. The dataset used here is [SWAG](https://www.aclweb.org/anthology/D18-1009/), which is available via the Hugging Face [hub](https://huggingface.co/datasets/swag). Check the link for an overview of the dataset. SWAG is a dataset about commonsense reasoning, where each example describes a situation and then proposes four options that could apply to it.
Let's start by installing the necessary packages.

In [1]:
%pip install transformers
%pip install datasets
%pip install evaluate
%pip install accelerate -U
%pip install sentencepiece


In this task, you will use a BERT model with a `MultipleChoice` head from the Hugging Face library and then create your custom model. Recall from the class that the BERT model has an auxiliary next sentence prediction task, in which two sentences are given to BERT separated by a `[SEP]` token and a classifier head decides if the second sentence logically follows the first one. Hugging Face has a `*ForMultipleChoice` architecture that uses the representation of the `[CLS]` token and a linear layer to classify if one sentence follows the other. We first start with this default architecture and then build a more complicated one in a later subtask.

### Subtask 1: Loading and Processing the Data

We use the `datasets` library to download the SWAG dataset, which already contains train, validation, and test splits.

In [None]:
import os
os.environ['CUDA_LAUNCH_BLOCKING'] = "1"
from datasets import load_dataset, load_metric
datasets = load_dataset("swag", "regular")
datasets

DatasetDict({
    train: Dataset({
        features: ['video-id', 'fold-ind', 'startphrase', 'sent1', 'sent2', 'gold-source', 'ending0', 'ending1', 'ending2', 'ending3', 'label'],
        num_rows: 73546
    })
    validation: Dataset({
        features: ['video-id', 'fold-ind', 'startphrase', 'sent1', 'sent2', 'gold-source', 'ending0', 'ending1', 'ending2', 'ending3', 'label'],
        num_rows: 20006
    })
    test: Dataset({
        features: ['video-id', 'fold-ind', 'startphrase', 'sent1', 'sent2', 'gold-source', 'ending0', 'ending1', 'ending2', 'ending3', 'label'],
        num_rows: 20005
    })
})

Let's look at the first item to see what the data looks like:

In [None]:
datasets["train"][0]

{'video-id': 'anetv_jkn6uvmqwh4',
 'fold-ind': '3416',
 'startphrase': 'Members of the procession walk down the street holding small horn brass instruments. A drum line',
 'sent1': 'Members of the procession walk down the street holding small horn brass instruments.',
 'sent2': 'A drum line',
 'gold-source': 'gold',
 'ending0': 'passes by walking down the street playing their instruments.',
 'ending1': 'has heard approaching them.',
 'ending2': "arrives and they're outside dancing and asleep.",
 'ending3': 'turns the lead singer watches the performance.',
 'label': 0}

**Question:**
Look at the dataset card on the Hugging Face hub and define what each of these fields means, with respect to the task:

*   `sent1`:
*   `sent2`:
*    `ending0`, `ending1`, `ending2` and `ending3`:
*   `label`:




**Answer**
Each item has the following elements:


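*   `sent1`: the first sentence, which provides the context
*   `sent2`: the beginning of the second sentence, which each ending continues
*    `ending0`, `ending1`, `ending2` and `ending3`: the four candidate endings for the second sentence
*   `label`: the index (0 to 3) of the correct ending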


Write a function that displays the context and each of the four choices, following the format


```
Context:...
A-
B-
C-
D-
Ground truth: option ...
```

How you display the results is not important. You should be able to extract different parts of the data correctly and know what each field represents.

In [None]:
def explain_example(example):
    ### your code ###
    # The context is the first sentence; every option is sent2 plus one ending.
    print(f"Context: {example['sent1']}")
    print(f"  A - beginning:{example['sent2']}... ending:{example['ending0']}")
    print(f"  B - beginning:{example['sent2']}... ending:{example['ending1']}")
    print(f"  C - beginning:{example['sent2']}... ending:{example['ending2']}")
    print(f"  D - beginning:{example['sent2']}... ending:{example['ending3']}")
    # The label is an index into the four options.
    print(f"\nGround truth: option {['A', 'B', 'C', 'D'][example['label']]}")
    ### your code ###

In [None]:
explain_example(datasets["train"][0])

Context: Members of the procession walk down the street holding small horn brass instruments.
  A - beginning:A drum line... ending:passes by walking down the street playing their instruments.
  B - beginning:A drum line... ending:has heard approaching them.
  C - beginning:A drum line... ending:arrives and they're outside dancing and asleep.
  D - beginning:A drum line... ending:turns the lead singer watches the performance.

Ground truth: option A


Before feeding the data into the model, we need to preprocess the text with a `Tokenizer`, which splits the inputs into tokens and puts them in the format that the model expects. The tokenizer must match the model we use for this task, `distilbert-base-uncased`. Complete the code below to load a fast tokenizer for this model. DistilBERT is similar to the BERT model, and we only use this particular architecture for faster training.


In [None]:
from transformers import AutoTokenizer

###your code###
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased', use_fast=True)
###your code###

In [None]:
tokenizer("This is the first sentence!", "And this is the second one.")

{'input_ids': [101, 2023, 2003, 1996, 2034, 6251, 999, 102, 1998, 2023, 2003, 1996, 2117, 2028, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

Write a function that preprocesses the samples.
The tricky part is to put all the possible pairs of sentences in two big lists before passing them to the tokenizer.
Each **first** sentence has to be repeated 4 times to go with different ending options.
There should be a separator token between the first and second sentence, to follow the BERT input logic.
The final output is a list of 4 elements, one for each choice, where the input is transformed by the tokenizer.
For example, with a list of 2 training examples, the output includes 2 lists, where each contains 4 elements. Each of those elements is the converted input ID of the first sentence followed by the second sentence with different endings.
When calling the `tokenizer`, we use the argument `truncation=True`. This ensures that any input longer than what the selected model can handle is truncated to the maximum length accepted by the model.

**Hint:** Flatten the lists (all choices are flattened into a single list) before feeding them into the tokenizer and unflatten them once again for the final output.
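The flatten and unflatten pattern from the hint can be tried on toy data before involving the real tokenizer. In this sketch the batch contents are made up and `fake_tokenize` is a hypothetical stand-in that only splits on whitespace; it is not the Hugging Face tokenizer and returns words instead of IDs.

```python
# Toy batch with the same column layout as SWAG (contents are made up).
toy = {
    "sent1": ["A man sits down.", "The dog barks."],
    "sent2": ["He", "It"],
    "ending0": ["eats.", "runs."],
    "ending1": ["sleeps.", "jumps."],
    "ending2": ["reads.", "sits."],
    "ending3": ["sings.", "naps."],
}
ending_names = ["ending0", "ending1", "ending2", "ending3"]

def fake_tokenize(first, second):
    """Hypothetical stand-in for a real tokenizer: whitespace split only."""
    return {"input_ids": [(a + " [SEP] " + b).split() for a, b in zip(first, second)]}

# Flatten: one (context, candidate) pair per choice, 4 pairs per example.
first = [ctx for ctx in toy["sent1"] for _ in ending_names]
second = [
    f"{header} {toy[name][i]}"
    for i, header in enumerate(toy["sent2"])
    for name in ending_names
]
flat = fake_tokenize(first, second)

# Unflatten: regroup every 4 consecutive entries into one example.
n = len(ending_names)
grouped = {k: [v[i:i + n] for i in range(0, len(v), n)] for k, v in flat.items()}
print(len(grouped["input_ids"]), len(grouped["input_ids"][0]))
```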

In [None]:
### your code ###
ending_names = ["ending0", "ending1", "ending2", "ending3"]
### your code ###
def preprocess_function(examples):
  ### your code ###
    # Repeat each first sentence four times to go with the four possibilities of second sentences.
    first_sentences = [[context] * 4 for context in examples["sent1"]]
    # Grab all second sentences possible for each context.
    question_headers = examples["sent2"]
    second_sentences = [[f"{header} {examples[end][i]}" for end in ending_names] for i, header in enumerate(question_headers)]

    # Flatten everything
    first_sentences = sum(first_sentences, [])
    second_sentences = sum(second_sentences, [])

    # Tokenize
    tokenized_examples = tokenizer(first_sentences, second_sentences, truncation=True)

    # Un-flatten
    return {k: [v[i:i+4] for i in range(0, len(v), 4)] for k, v in tokenized_examples.items()}
    ### your code ###

In [None]:
examples = datasets["train"][:2]
features = preprocess_function(examples)
print(len(features["input_ids"]), len(features["input_ids"][0]), [len(x) for x in features["input_ids"][0]])# output should be 2 4 [30, 25, 30, 28]

2 4 [30, 25, 30, 28]


✅ Point distribution ✅
- 0.25 if all answers to the questions about the meaning of fields are correct.
- 0.5 if for the `explain_example` function, the first and second sentences and the endings are extracted correctly and the correct label is displayed. Formatting does not matter. The goal is the correct extraction of different fields from the dataset.
- 0.25 the tokenizer is created with the correct model.
- 0.5 if the output is correct as given in the comment.
- 1 point for the logic, make sure that the truncation is set and the input to the tokenizer is the list of both sentences. For the second sentence, make sure that the beginning of the sentence is added to the respective endings.

We can now apply our function to all the examples in the dataset. We use the `map` method to apply the function on all the elements of all the splits in the dataset (training, validation, and testing).
Note that we passed `batched=True` to leverage the fast tokenizer and use multi-threading to process the texts in batches concurrently.

In [None]:
encoded_datasets = datasets.map(preprocess_function, batched=True)

Our dataset is still not converted to tensors and not padded. This is the job of the `data collator`. A data collator takes a list of examples and converts them to a batch.
There is no data collator in the Hugging Face default library that works on our specific problem. We thus need to write our own one. In this collator:

*  All the inputs/attention masks are flattened.
* A flattened list is passed to the `tokenizer.pad` method, which applies dynamic padding up to the maximum length in the batch. The output has size `(batch_size * 4) x seq_length`.
* Everything needs to be unflattened for the output of the data collator.
* `input_ids` and `labels` should be returned as tensors.
* The output is a dictionary called `batch` that contains the features needed for training (`input_ids`, `attention_mask`, `labels`).
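The steps in this list can be sketched with plain Python lists before using tensors. The helper `pad_and_group` and the toy IDs below are hypothetical and stand in for `tokenizer.pad`; the sketch uses two choices per example instead of four to stay short and assumes a padding ID of `0`.

```python
def pad_and_group(features, pad_id=0):
    """features: list of examples, each {"input_ids": [choice0_ids, choice1_ids, ...]}."""
    batch_size = len(features)
    num_choices = len(features[0]["input_ids"])
    # Flatten to batch_size * num_choices sequences.
    flat = [ids for feat in features for ids in feat["input_ids"]]
    # Dynamic padding: pad only up to the longest sequence in this batch.
    max_len = max(len(ids) for ids in flat)
    padded = [ids + [pad_id] * (max_len - len(ids)) for ids in flat]
    mask = [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in flat]

    # Unflatten back to (batch_size, num_choices, max_len).
    def regroup(rows):
        return [rows[i * num_choices:(i + 1) * num_choices] for i in range(batch_size)]

    return {"input_ids": regroup(padded), "attention_mask": regroup(mask)}

# Toy input: 2 examples with 2 choices each (made-up token IDs).
features = [
    {"input_ids": [[101, 7, 102], [101, 7, 8, 9, 102]]},
    {"input_ids": [[101, 5, 6, 102], [101, 102]]},
]
batch = pad_and_group(features)
print(batch["input_ids"])
print(batch["attention_mask"])
```

Padding to the batch maximum rather than to a global maximum keeps batches of short sequences small.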



In [None]:
from dataclasses import dataclass
from transformers.tokenization_utils_base import PreTrainedTokenizerBase, PaddingStrategy
from typing import Optional, Union
import torch

@dataclass
class MultipleChoiceDataCollator:
    tokenizer: PreTrainedTokenizerBase
    padding: Union[bool, str, PaddingStrategy] = True
    max_length: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None

    def __call__(self, features):
        accepted_keys = ["input_ids", "attention_mask", "label"]

        if len(features[0])>len(accepted_keys):
          features=[{k: v for k, v in i.items() if k in accepted_keys} for i in features]
      ### your code ###

        label_name = "label" if "label" in features[0].keys() else "labels"
        labels = [feature.pop(label_name) for feature in features]
        batch_size = len(features)
        num_choices = len(features[0]["input_ids"])
        flattened_features = [[{k: v[i] for k, v in feature.items()} for i in range(num_choices)] for feature in features]
        flattened_features = sum(flattened_features, [])
        #use the tokenizer and attributes from the class to pad the input
        batch = self.tokenizer.pad(
            flattened_features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors="pt",
        )

        # Un-flatten
        batch = {k: v.view(batch_size, num_choices, -1) for k, v in batch.items()}
        # Add back labels
        batch["labels"] = torch.tensor(labels, dtype=torch.int64)
        ### your code ###
        return batch

In [None]:
accepted_keys = ["input_ids", "attention_mask", "label"]
features = [{k: v for k, v in encoded_datasets["train"][i].items() if k in accepted_keys} for i in range(2)]
batch=MultipleChoiceDataCollator(tokenizer)(features)
print(batch["input_ids"].shape)
print(batch["attention_mask"].shape)
print(batch["labels"].shape)

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


torch.Size([2, 4, 35])
torch.Size([2, 4, 35])
torch.Size([2])


In [None]:
for i in range(4):
  print(batch["input_ids"][0][i])
  print(tokenizer.decode(batch["input_ids"][0][i]))

tensor([  101,  2372,  1997,  1996, 14385,  3328,  2091,  1996,  2395,  3173,
         2235,  7109,  8782,  5693,  1012,   102,  1037,  6943,  2240,  5235,
         2011,  3788,  2091,  1996,  2395,  2652,  2037,  5693,  1012,   102,
            0,     0,     0,     0,     0])
[CLS] members of the procession walk down the street holding small horn brass instruments. [SEP] a drum line passes by walking down the street playing their instruments. [SEP] [PAD] [PAD] [PAD] [PAD] [PAD]
tensor([  101,  2372,  1997,  1996, 14385,  3328,  2091,  1996,  2395,  3173,
         2235,  7109,  8782,  5693,  1012,   102,  1037,  6943,  2240,  2038,
         2657,  8455,  2068,  1012,   102,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0])
[CLS] members of the procession walk down the street holding small horn brass instruments. [SEP] a drum line has heard approaching them. [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]
tensor([  101,  2372,  1997,  1996,

✅ Point distribution ✅
- 0.5 if the output shapes are correct

```
[2, 4, 35]
[2, 4, 35]
[2]
```
- 0.25 points if the outputs have the type tensor.
- 0.25 points for the correct use of `tokenizer.pad`.
- 0.5 points if flattening and unflattening are performed correctly.

#### ${\color{red}{Comments\ 2.1}}$

${\color{red}{⚠️Comments\ begin⚠️}}$


```
cross-feedback comment section
```


${\color{red}{⚠️Comments\ end⚠️}}$

### Subtask 2: Fine-tuning a Hugging Face Model

To fine-tune our model, we first need to download the correct architecture from Hugging Face. Import the correct class for this task and download the pre-trained checkpoint for the base class from `distilbert-base-uncased`. Note that the weights in the classification head are initialized at random.

In [None]:
### your code ###
from transformers import AutoModelForMultipleChoice
model_hf = AutoModelForMultipleChoice.from_pretrained("distilbert-base-uncased")
### your code ###

Some weights of DistilBertForMultipleChoice were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'classifier.weight', 'classifier.bias', 'pre_classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Next, we need to define our `Trainer` and pass in the correct `TrainingArguments` (a class that contains all the attributes to customize the training). Define a `TrainingArguments` that


* creates an output directory `distilbert-base-uncased-swag` to save the checkpoints and logs.
*   evaluates the model on the validation set every `300` steps.
* saves a checkpoint every `600` steps and keeps no more than 2 checkpoints in total.
* the random seed for training is `77`.
* batch size for training and evaluation: `48` (if you are running out of memory, feel free to change this setting but indicate it as a comment in your notebook, on a T4 GPU from google colab this takes about `13.2GB` of `15.0GB`).
* train for `1800` steps with a learning rate of `5e-5`, and add weight decay of `0.01` to the optimizer.
* the trainer should remove the columns from the data that are not used by the model.
* The final checkpoint should be the checkpoint that had the best overall validation metric not necessarily the last checkpoint.

**Note:** Please use a GPU to train your model. On Colab, you can use a T4 GPU for free.
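To get a feeling for what these settings imply, the schedule can be worked out with simple arithmetic. The sketch below assumes a single device and no gradient accumulation, so that one optimizer step consumes exactly one batch of 48 examples; the variable names are chosen for this illustration.

```python
import math

train_examples = 73_546   # size of the SWAG train split shown earlier
batch_size = 48
max_steps = 1_800
eval_steps = 300
save_steps = 600

num_evaluations = max_steps // eval_steps          # validation runs during training
num_checkpoints_written = max_steps // save_steps  # at most 2 of them are kept on disk
examples_seen = max_steps * batch_size
steps_per_epoch = math.ceil(train_examples / batch_size)
epochs = examples_seen / train_examples

print(num_evaluations, num_checkpoints_written, examples_seen)
print(steps_per_epoch, round(epochs, 2))
```

Under these assumptions, training for `1800` steps covers slightly more than one pass over the training split.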

In [None]:
from transformers import TrainingArguments, Trainer
training_args = TrainingArguments(
    ### your code ###
    output_dir="distilbert-base-uncased-swag",
    evaluation_strategy = "steps",
    save_strategy="steps",
    save_steps=600,
    save_total_limit=2,
    eval_steps=300,
    learning_rate=5e-5,
    seed=77,
    do_predict=True,
    remove_unused_columns=True,
    per_device_train_batch_size=48,
    per_device_eval_batch_size=48,
    max_steps=1800,
    weight_decay=0.01,
    load_best_model_at_end=True
    ### your code ###
)

Before we initialize the `Trainer`, we create a function that tells the trainer how to compute the metrics from the predictions. Fill in the `compute_metrics` function to compute the accuracy based on `predictions`. This object contains the predictions of the model as well as the ground truth labels.

**Hint 1:** Keep in mind that the output of this function should be a dictionary containing the metric name and value.

**Hint 2:** Consider the shape of the example input. This is similar to the logits produced by the model.

In [None]:
import numpy as np
def compute_metrics(predictions):
  ### your code ###
    preds, label_ids = predictions
    ps = np.argmax(preds, axis=1)
    return_dict={"accuracy": (ps == label_ids).astype(np.float32).mean().item()}
  ### your code ###
    return return_dict

In [None]:
preds=np.array([[0.9,0.2,0,0],
                [0.2,0.2,0.9,0.1],
                [0.2,0.9,0,0],
                [0.2,0.1,0.8,0],
                [0.9,0.1,0.8,0],
                [0.2,1,0.4,0],
                [0.2,1,0.4,0.9],
                [1,0.1,0.4,0.3],
                [0.1,0.1,0.9,0.3],
                [0.1,0.1,0.2,1]])
label_ids=np.array([0,3,1,2,0,1,3,0,2,3])
compute_metrics((preds,label_ids))

{'accuracy': 0.800000011920929}

Now it's time to pass everything to a `Trainer` object to start the training process. Initialize a `Trainer` object and pass all the necessary information. Keep in mind that we also have the optional metric computation and that we intend to run an evaluation on the validation set during training. The training should take around 30 min on a Google Colab T4 GPU.


In [None]:
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
device

device(type='cuda')

In [None]:
### your code ###
trainer = Trainer(
    model_hf,
    training_args,
    train_dataset=encoded_datasets["train"],
    eval_dataset=encoded_datasets["validation"],
    tokenizer=tokenizer,
    data_collator=MultipleChoiceDataCollator(tokenizer),
    compute_metrics=compute_metrics,
)
### your code ###

In [None]:
trainer.train()

Step,Training Loss,Validation Loss,Accuracy
300,No log,0.918462,0.631111


Save the model in `distilbert-base-uncased-swag/final_model`.

In [None]:
### your code ###
trainer.save_model("distilbert-base-uncased-swag/final_model")
### your code ###

Look at the saved files and answer the following questions (it is possible to answer these questions by writing some code, but we want you to explore the saved files):

**Question:**


1.   What is the vocabulary id for the `[CLS]` and `[MASK]` tokens?
2.   What is the dropout probability for the attention layer?

**Dropout:** With dropout, randomly chosen nodes are set to zero during a training step, i.e., they are temporarily removed from the network. They therefore have no influence on the prediction or on backpropagation in that step. As a result, a slightly different network architecture is used in each step, and the network learns to produce good predictions even when some inputs are missing. Read more [here](https://databasecamp.de/en/ml/dropout-layer-en).
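To make the mechanism concrete, here is a minimal sketch of dropout on a list of activations in plain Python. It uses the "inverted" variant, which additionally rescales the surviving values by `1 / (1 - p)` so that the expected activation stays the same; this rescaling is an assumption beyond the description above, although it is how `torch.nn.Dropout` behaves. The function name `dropout` is only for illustration.

```python
import random


def dropout(values, p, training=True, rng=random):
    """Zero each value with probability p and rescale the survivors by 1 / (1 - p)."""
    if not training or p == 0.0:
        # At inference time dropout is a no-op.
        return list(values)
    keep = 1.0 - p
    return [v / keep if rng.random() >= p else 0.0 for v in values]


rng = random.Random(0)
activations = [1.0] * 10
print(dropout(activations, p=0.1, rng=rng))         # some entries are 0.0, the rest 1 / 0.9
print(dropout(activations, p=0.1, training=False))  # unchanged in evaluation mode
```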



**Answer**


1.   101 and 103.
2.   0.1.
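These values can also be cross-checked with a few lines of code. The sketch below assumes that the saved directory contains a one-token-per-line `vocab.txt` (where the line index is the vocabulary ID) and a `config.json` with an `attention_dropout` key; both file names and the key are assumptions about what the trainer writes for this model, and the helper functions are hypothetical.

```python
import json
from pathlib import Path


def lookup_token_ids(vocab_file, tokens):
    """Map each token to its line index in a one-token-per-line vocabulary file."""
    with open(vocab_file, encoding="utf-8") as handle:
        vocab = {line.rstrip("\n"): idx for idx, line in enumerate(handle)}
    return {token: vocab[token] for token in tokens}


def read_config_value(config_file, key):
    """Read a single entry from a JSON configuration file."""
    with open(config_file, encoding="utf-8") as handle:
        return json.load(handle)[key]


# Assumed location of the saved model; the lookup is skipped if the files are missing.
model_dir = Path("distilbert-base-uncased-swag/final_model")
if (model_dir / "vocab.txt").exists() and (model_dir / "config.json").exists():
    print(lookup_token_ids(model_dir / "vocab.txt", ["[CLS]", "[MASK]"]))
    print(read_config_value(model_dir / "config.json", "attention_dropout"))
```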

✅ Point distribution ✅
- 0.5 if the correct model is imported and downloaded.
- 0.5 if the arguments are set correctly.
- 1 if the output of the example accuracy is `0.800000011920929` and the logic is correct.
- 0.25 point if the parameters of the trainer are set correctly.
- 0.25 point if the model is saved correctly.
- 0.25 point for the question regarding the `[CLS]` and `[MASK]` ids.
- 0.25 point for question regarding the dropout probability.

#### ${\color{red}{Comments\ 2.2}}$

${\color{red}{⚠️Comments\ begin⚠️}}$


```
cross-feedback comment section
```


${\color{red}{⚠️Comments\ end⚠️}}$

### Subtask 3: Fine-tune a Custom Model


In this case, we were lucky that Hugging Face had a pre-implemented architecture available for us to use. However, that is not always the case. Moreover, we might want to experiment beyond the default architectures to find a suitable one for a task. Therefore, it is important to learn to extend the Hugging Face models and train a custom model. The good news is that except for the model architecture the rest of the code can remain as it is.

Design a multiple-choice model as follows:


1.   The config file for a feature extractor (must be of the DistilBERT type) is passed during initialization. The config file determines which model is used for feature extraction.
2.   From the `last_hidden_state` of the feature extractor, choose the `[CLS]` embedding (the first one). This embedding is used as the compressed representation of the first and second sentences. In BERT's pre-training it is used to classify whether two sentences follow one another, making it a good candidate for our task.
3. The `[CLS]` embedding is passed through a linear layer **that does not change the size of the embedding**, followed by a tanh nonlinearity.
4. The output of tanh is passed through a dropout layer, where the dropout probability is the same as the one used by the `distilbert` model that serves as the feature extractor.
5. The output of the previous stage is fed into another linear layer that shrinks the embedding dimension to a quarter of the original size, e.g., if the embedding size is 12, the new embedding dimension is 3.
6. The output is followed by another dropout layer (you can use the one from stage 4).
7. Finally, a binary classifier is applied to determine the probability of sentence 1 being followed by sentence 2.
8. The cross-entropy loss is used to compute the loss.

**Hint:** Keep in mind that for a 4-choice system, you classify each of the four solutions independently. However, the final output should group the four logits together. For example, if the input IDs have the shape `[2, 4, 35]` (batch size=2, num choices=4, seq len=35), then the logits have the shape `[2, 4]` and the labels have the shape `[2]` (one class index per example).
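The reshaping described in the hint can be illustrated without any model. In the following sketch, NumPy arrays stand in for the PyTorch tensors and random numbers stand in for the scores of the encoder and classification head (both are assumptions made purely for illustration). The choices are first folded into the batch dimension, scored one sequence at a time, and then grouped again per example.

```python
import numpy as np

batch_size, num_choices, seq_len = 2, 4, 35
input_ids = np.zeros((batch_size, num_choices, seq_len), dtype=np.int64)

# Fold the choices into the batch dimension: each choice becomes its own sequence.
flat_input_ids = input_ids.reshape(-1, seq_len)  # shape (8, 35)

# Stand-in for the encoder and head: one score per flattened sequence.
rng = np.random.default_rng(0)
flat_logits = rng.normal(size=(flat_input_ids.shape[0], 1))  # shape (8, 1)

# Group the scores of the four choices that belong to the same example.
logits = flat_logits.reshape(-1, num_choices)  # shape (2, 4)
predictions = logits.argmax(axis=-1)           # shape (2,)

print(flat_input_ids.shape, logits.shape, predictions.shape)
```

Because the flattening keeps the choices of one example next to each other, the final reshape restores exactly the original grouping.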



In [None]:
from typing import Optional  # needed for the type hints in forward()

import torch
from torch import nn
from transformers import DistilBertModel,BertConfig,DistilBertConfig,PretrainedConfig,PreTrainedModel,DistilBertPreTrainedModel

class CustomMultipleChoice(DistilBertPreTrainedModel):
    def __init__(self, config: PretrainedConfig):
        super().__init__(config)
        ###your code ###
        self.distilbert = DistilBertModel(config)
        self.dense = nn.Linear(self.distilbert.config.hidden_size, self.distilbert.config.hidden_size)
        self.activation = nn.Tanh()
        self.dropout = nn.Dropout(self.distilbert.config.dropout)
        self.dense2 = nn.Linear(self.distilbert.config.hidden_size, int(self.distilbert.config.hidden_size/4))
        self.classifier = nn.Linear(int(self.distilbert.config.hidden_size/4), 1)
        ###your code ###


    def forward(
        self,
        input_ids: Optional[torch.Tensor] = None,
        attention_mask: Optional[torch.Tensor] = None,
        labels: Optional[torch.Tensor] = None,
    ):
        """
        input_ids: input sentences converted to ids
        attention_mask: the attention mask
        labels:  Labels for computing the multiple choice classification loss. Indices should be in `[0, ...,num_choices-1]` where `num_choices` is the size of the second dimension of the input tensors.
        """

        num_choices = input_ids.shape[1]

        ###your code ###
        input_ids = input_ids.view(-1, input_ids.size(-1))
        attention_mask = attention_mask.view(-1, attention_mask.size(-1))


        outputs = self.distilbert(
            input_ids,
            attention_mask=attention_mask,
            return_dict=True
        )

        last_hidden_state = outputs.last_hidden_state
        cls_output= last_hidden_state[:, 0]
        cls_output=self.dense(cls_output)
        cls_output=self.activation(cls_output)
        cls_output=self.dropout(cls_output)
        compressed_cls=self.dense2(cls_output)
        compressed_cls=self.dropout(compressed_cls)
        logits = self.classifier(compressed_cls)
        reshaped_logits = logits.view(-1, num_choices)

        loss = None
        if labels is not None:
            loss_fct = nn.CrossEntropyLoss()
            loss = loss_fct(reshaped_logits, labels)
        ###your code ###
        return {"loss":loss,"logits":reshaped_logits}


Initialize the feature extractor with `distilbert-base-uncased` and create your custom model.

In [None]:
from transformers import AutoConfig
###your code ###
config= AutoConfig.from_pretrained("distilbert-base-uncased")
model_custom = CustomMultipleChoice(config)
###your code ###

In [None]:
for name, param in model_custom.named_parameters():
    if param.requires_grad and not name.startswith("distilbert."):
        print(name, param.data.shape)

dense.weight torch.Size([768, 768])
dense.bias torch.Size([768])
dense2.weight torch.Size([192, 768])
dense2.bias torch.Size([192])
classifier.weight torch.Size([1, 192])
classifier.bias torch.Size([1])


✅ Point distribution ✅
- 1 point if the shapes of the  head layers are correct. We attach the shape of our parameters as a reference.


```
dense.weight torch.Size([768, 768])
dense.bias torch.Size([768])
dense2.weight torch.Size([192, 768])
dense2.bias torch.Size([192])
classifier.weight torch.Size([1, 192])
classifier.bias torch.Size([1])
```


- 0.5 point: check the init function for correctness and clarity, and make sure the configuration file of the `feature_model` is used to extract the value of the hidden dimension.
- 0.25 if the reshaping of `input_ids` and `attention_mask` is done correctly.
- 1 point if all the code related to extraction of the `cls_output` is correct, meaning that the `[CLS]` token is correctly selected and passed through a linear layer with activation and dropout.
- 0.5 points for correct use of the classifier and reshaping of the logits.
- 0.5 for correct calculation of loss.
- 0.5 for correct initialization of the model using the config file from `distilbert-base-uncased`.

We keep the same training arguments but change the directory in which we save the model logs, the directory in which we save the model output and the name of the run, to `custom_model`.




In [None]:
###your code ###
training_args.output_dir="custom_model"
training_args.logging_dir="custom_model/runs"
training_args.run_name="custom_model"
###your code ###

Initialize the trainer for training the custom model. The training should take around 30 min on Google Colab T4 GPU.

In [None]:
trainer = Trainer(
    ###your code ###
    model_custom,
    training_args,
    train_dataset=encoded_datasets["train"],
    eval_dataset=encoded_datasets["validation"],
    tokenizer=tokenizer,
    data_collator=MultipleChoiceDataCollator(tokenizer),
    compute_metrics=compute_metrics,
    ###your code ###
)

In [None]:
trainer.train()

Step,Training Loss,Validation Loss,Accuracy
300,No log,1.383474,0.273418
600,1.380200,1.386152,0.274268
900,1.380200,1.400406,0.274918
1200,1.371400,1.400567,0.274668
1500,1.360100,1.389181,0.277067
1800,1.360100,1.400988,0.275317


TrainOutput(global_step=1800, training_loss=1.3665211147732206, metrics={'train_runtime': 2606.9583, 'train_samples_per_second': 33.142, 'train_steps_per_second': 0.69, 'total_flos': 5706281996793504.0, 'train_loss': 1.3665211147732206, 'epoch': 1.17})

Save the model in `custom_model/final_model`. Note that with the custom model, you need to save the weights without the help of the trainer. The trainer would save the configuration, but since this model is not a registered Hugging Face model, only the base model would be saved. Loading the model weights is affected by this as well.

In [None]:
###your code ###
trainer.save_model("custom_model/final_model")
torch.save(model_custom.state_dict(),"custom_model/final_model/model.bin")
###your code ###

✅ Point distribution ✅
- 0.25 point if the output directory, logging directory and run name are changed.
- 0.25 if the parameters for training are set correctly.
- 0.25 point if the model is saved correctly (still give the points if `trainer.save_model` is missing).

#### ${\color{red}{Comments\ 2.3}}$

${\color{red}{⚠️Comments\ begin⚠️}}$


```
cross-feedback comment section
```


${\color{red}{⚠️Comments\ end⚠️}}$

### Subtask 4: Evaluation and Model Comparison

Often, you do not perform the final evaluation right after training, but load the checkpoints later and evaluate them. To this end, load the two models from disk.

In [None]:
from transformers import AutoModelForMultipleChoice,AutoConfig
### your code ###
model_hf = AutoModelForMultipleChoice.from_pretrained("distilbert-base-uncased-swag/final_model")
config= AutoConfig.from_pretrained("distilbert-base-uncased")
model_custom = CustomMultipleChoice(config)
model_custom.load_state_dict(torch.load("custom_model/final_model/model.bin"))
### your code ###

<All keys matched successfully>

To evaluate the models, we load the validation split using a data loader and our previously defined data collator. Note that although we have a test split, we cannot use it, since there are no labels available for this split (you can check the data to confirm this).

In [None]:
from torch.utils.data import DataLoader
import evaluate

eval_dataloader = DataLoader(encoded_datasets["validation"], batch_size=64, collate_fn=MultipleChoiceDataCollator(tokenizer))

To make things easier, let's use the `evaluate` library from Hugging Face to compute the accuracy metric. Here we load `accuracy` from the `evaluate` library twice, once for the custom model and once for the Hugging Face model. Further, we put the models in eval mode. Complete the code for evaluation using the capabilities of the `evaluate` library to simultaneously compute the metric for both models.
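The idea behind computing a metric batch by batch is to accumulate counts and only form the final ratio at the end. The following sketch shows this pattern with a hypothetical stand-in class, `StreamingAccuracy`, written in plain Python. It only mirrors the `add_batch` and `compute` method names and is not the implementation used by the `evaluate` library; the predictions and references are made-up values.

```python
class StreamingAccuracy:
    """Hypothetical stand-in that accumulates accuracy counts batch by batch."""

    def __init__(self):
        self.correct = 0
        self.total = 0

    def add_batch(self, predictions, references):
        # Update the running counts with one batch of predictions.
        for pred, ref in zip(predictions, references):
            self.correct += int(pred == ref)
            self.total += 1

    def compute(self):
        # Form the final ratio from the accumulated counts.
        return {"accuracy": self.correct / self.total}


# One metric object per model, so the counts are not mixed up.
metrics = {"custom": StreamingAccuracy(), "hf": StreamingAccuracy()}
batches = [([0, 1, 2], [0, 1, 1]), ([3, 3], [3, 0])]
for preds, refs in batches:
    for metric in metrics.values():
        metric.add_batch(predictions=preds, references=refs)

print(metrics["hf"].compute())  # {'accuracy': 0.6}
```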


In [None]:
from tqdm import tqdm
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
metric_dict={"custom":evaluate.load("accuracy"),"hf":evaluate.load("accuracy")}
models_dict= {"custom":model_custom,"hf":model_hf}

for name, model in models_dict.items():
  model.to(device)
  model.eval()

for i,batch in tqdm(enumerate(eval_dataloader), total=len(eval_dataloader)):
  ### your code ###
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
      for name, model in models_dict.items():
        outputs= model(**batch)
        logits = outputs["logits"]
        predictions = torch.argmax(logits, dim=-1)
        metric_dict[name].add_batch(predictions=predictions, references=batch["labels"])
acc_hf=metric_dict["hf"].compute()["accuracy"]
acc_custom=metric_dict["custom"].compute()["accuracy"]
  ### your code ###
print("Hugging Face Model :",acc_hf)
print("Custom Model :",acc_custom)

100%|██████████| 313/313 [05:06<00:00,  1.02it/s]

Hugging Face Model : 0.7099870038988304
Custom Model : 0.27426771968409475





✅ Point distribution ✅
- 0.5 point if both models are loaded correctly.
- 0.5 point if batches are put on the device and `torch.no_grad()` is present
- 0.5 point if predictions are computed correctly.
- 0.5 point if the `.add_batch` and `.compute` are used correctly to find the accuracy.


#### ${\color{red}{Comments\ 2.4}}$

${\color{red}{⚠️Comments\ begin⚠️}}$


```
cross-feedback comment section
```


${\color{red}{⚠️Comments\ end⚠️}}$

## **Task 3: Encoder-Decoder Architecture** (5 + 2 + 2 + 5 = 14 points)

We explored an encoder-based model (BERT) in the previous exercise. In this task, we look at another family of transformer architectures, the encoder-decoder. We use the [T5](https://arxiv.org/pdf/1910.10683.pdf) model, presented by Raffel et al. T5 is an encoder-decoder architecture pre-trained on a multi-task mixture of unsupervised and supervised tasks. In this task, we set up a fine-tuning example for question answering using the [SQuAD](https://huggingface.co/datasets/squad) dataset. Since the actual fine-tuning is time-consuming and computationally intensive, we use an already fine-tuned model for inference. The main goal is to introduce you to the structure of fine-tuning and its simplicity with the Hugging Face framework.

To fine-tune the BERT-based models, we usually add a task-specific head. On the other hand, T5 converts all NLP problems into a text-to-text format.  
It is trained using teacher forcing, meaning that we require an input sequence and a corresponding target sequence.


1.   The input sequence is fed to the model using `input_ids` from the tokenizer.
2.   The target sequence is shifted to the right, i.e., prepended with a start-sequence token, and fed to the decoder as `decoder_input_ids` (the input IDs of the encoded target sequence). The unshifted target sequence, with an EOS (end-of-sequence) token appended to mark the end of generation, corresponds to the `labels`.
3. The task prefix defines what task is expected of T5. For example, we prepend the input sequence with `translate English to German: ` before encoding the input to tell the model to translate. T5 already has a set of pre-defined task prefixes, and it is best to stick to those since they were used during pre-training. With enough training data, you can also introduce your own custom task.
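The relationship between `labels` and `decoder_input_ids` in step 2 can be sketched on plain lists. The token IDs of the target below are made up, and the sketch assumes that the start-sequence token has ID `0` and the EOS token has ID `1`, which matches the usual T5 vocabulary as far as we know. The helper `shift_right` is a simplified illustration, not the library function.

```python
def shift_right(labels, decoder_start_token_id=0):
    """Build the decoder inputs: prepend the start token and drop the last label."""
    return [decoder_start_token_id] + labels[:-1]


EOS_ID = 1              # assumed ID of the </s> token
target = [71, 12, 9]    # hypothetical token IDs of the answer

labels = target + [EOS_ID]
decoder_input_ids = shift_right(labels)

print(labels)             # [71, 12, 9, 1]
print(decoder_input_ids)  # [0, 71, 12, 9]
```

At every position, the decoder sees the tokens up to the previous step and is trained to predict the label at the current step, which is what teacher forcing refers to.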


In contrast to the encoder model, where only a single `max_length` is required, for encoder-decoder architectures, one typically defines a `max_source_length` and `max_target_length`, which determine the maximum length of the input and output sequences, respectively. We must also ensure that the padding ID of the `labels` is not taken into account by the loss function. This can be done by replacing them with `-100`, which is the `ignore_index` of the `CrossEntropyLoss`.
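To see why replacing the padding IDs matters, consider the following minimal sketch of a cross-entropy loss that skips positions labeled `-100`. It is written in plain Python for illustration and only mimics the effect of `ignore_index`; the helper names, the pad ID of `0`, and the toy logits are assumptions.

```python
import math

IGNORE_INDEX = -100


def mask_padding(label_ids, pad_token_id):
    """Replace padding positions so that the loss skips them."""
    return [IGNORE_INDEX if t == pad_token_id else t for t in label_ids]


def cross_entropy(logits_per_step, labels):
    """Mean negative log-likelihood over all positions that are not ignored."""
    losses = []
    for logits, label in zip(logits_per_step, labels):
        if label == IGNORE_INDEX:
            continue
        log_norm = math.log(sum(math.exp(x) for x in logits))
        losses.append(log_norm - logits[label])
    return sum(losses) / len(losses)


labels = mask_padding([2, 1, 0, 0], pad_token_id=0)
print(labels)  # [2, 1, -100, -100]

# Uniform logits over three classes: only the two real positions contribute.
logits = [[0.0, 0.0, 0.0]] * 4
print(cross_entropy(logits, labels))  # equals log(3)
```

Without the masking, the two padding positions would be averaged into the loss and the model would be rewarded for predicting pad tokens.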

### Subtask 1: Data Processing

We start by loading the dataset from the Hugging Face hub:

In [2]:
from datasets import load_dataset

datasets_squad = load_dataset("squad")
datasets_squad

Downloading builder script:   0%|          | 0.00/5.27k [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/2.36k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/7.67k [00:00<?, ?B/s]

Downloading data files:   0%|          | 0/2 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/8.12M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/1.05M [00:00<?, ?B/s]

Extracting data files:   0%|          | 0/2 [00:00<?, ?it/s]

Generating train split:   0%|          | 0/87599 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/10570 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})

In [3]:
print("context ---->" ,datasets_squad["train"][0]["context"])
print("question ---->",datasets_squad["train"][0]["question"])
print("answers ---->",datasets_squad["train"][0]["answers"])

context ----> Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.
question ----> To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?
answers ----> {'text': ['Saint Bernadette Soubirous'], 'answer_start': [515]}


Now let's load the needed pre-trained tokenizer for `t5-small`, which is the smallest T5 model. Set the maximum sequence length to `512`.

In [4]:
import torch
### your code ###
from transformers import T5Tokenizer
t5_tokenizer = T5Tokenizer.from_pretrained('t5-small',model_max_length=512)
### your code ###

tokenizer_config.json:   0%|          | 0.00/2.32k [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.39M [00:00<?, ?B/s]

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


The next step is to pre-process the dataset using the tokenizer to convert the sequences to IDs and add the special tokens.
T5 is based on the SentencePiece tokenizer, and the end of sentence token is denoted by `</s>`.
Complete the function `add_eos_to_examples` to format the input and target sequence. Your input as `input_text` should have the format `question:{question_text} context:{context_text} <EOS_Token>` and your target as `target_text` should have the format `{answer_text} <EOS_Token>`.

In [5]:
def add_eos_to_examples(example):
    ### your code ###
    example['input_text'] = 'question: %s  context: %s </s>' % (example['question'], example['context'])
    example['target_text'] = '%s </s>' % example['answers']['text'][0]
    ### your code ###
    return example

Use the `map` function to process the data, and do not set the `batched` argument.

In [6]:
### your code ###
encoded_squad = datasets_squad.map(add_eos_to_examples)
### your code ###

Map:   0%|          | 0/87599 [00:00<?, ? examples/s]

Map:   0%|          | 0/10570 [00:00<?, ? examples/s]

In [None]:
print(encoded_squad["train"][0]["input_text"])
print(encoded_squad["train"][0]["target_text"])

question: To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?  context: Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary. </s>
Saint Bernadette Soubirous </s>


Complete the function `convert_to_features`, which takes examples from the dataset and tokenizes them using the T5 tokenizer. The answers in this dataset are relatively short and do not require `512` tokens, whereas the input sequence, a combination of the question and the context paragraph, is usually long. We therefore truncate the input sequence at `512` tokens and the target sequence at `16` tokens. If an input or target is shorter than the specified length, make sure you pad it. Finally, convert everything to PyTorch tensors so that the data collator can use them easily, and place them in the dictionary `encodings`.
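The effect of truncating and padding to a fixed length can be sketched on a plain list of token IDs. The helper `truncate_and_pad` below is hypothetical and simplified: it assumes a pad ID of `0` and cuts the sequence off at the end, whereas a real tokenizer may additionally take care of special tokens when truncating.

```python
def truncate_and_pad(token_ids, max_length, pad_token_id=0):
    """Cut a sequence to max_length and pad it on the right, returning IDs and mask."""
    ids = token_ids[:max_length]
    attention_mask = [1] * len(ids)
    padding = max_length - len(ids)
    return ids + [pad_token_id] * padding, attention_mask + [0] * padding


short_ids, short_mask = truncate_and_pad([5, 6, 7, 1], max_length=6)
print(short_ids)   # [5, 6, 7, 1, 0, 0]
print(short_mask)  # [1, 1, 1, 1, 0, 0]

long_ids, long_mask = truncate_and_pad(list(range(10, 20)), max_length=6)
print(long_ids)    # [10, 11, 12, 13, 14, 15]
print(long_mask)   # [1, 1, 1, 1, 1, 1]
```

The attention mask marks which positions hold real tokens, so that the model can ignore the padding.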

In [7]:
def convert_to_features(examples):
    ### your code ###
    input_encodings = t5_tokenizer(examples['input_text'],max_length=512, truncation=True)
    target_encodings = t5_tokenizer(examples['target_text'], max_length=16, truncation=True)
    input_encodings=t5_tokenizer.pad(input_encodings,padding='max_length',max_length=512,return_tensors="pt")
    target_encodings=t5_tokenizer.pad(target_encodings,padding='max_length',max_length=16,return_tensors="pt")

    encodings = {
        'input_ids': input_encodings['input_ids'],
        'attention_mask': input_encodings['attention_mask'],
        'target_ids': target_encodings['input_ids'],
        'target_attention_mask': target_encodings['attention_mask']
    }
    ### your code ###
    return encodings

Use the `map` function to process the data.

In [8]:
### your code ###
encoded_squad = encoded_squad.map(convert_to_features, batched=True)
### your code ###

Map:   0%|          | 0/87599 [00:00<?, ? examples/s]



Map:   0%|          | 0/10570 [00:00<?, ? examples/s]

In [None]:
encoded_squad #new columns are added

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers', 'input_text', 'target_text', 'input_ids', 'attention_mask', 'target_ids', 'target_attention_mask'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers', 'input_text', 'target_text', 'input_ids', 'attention_mask', 'target_ids', 'target_attention_mask'],
        num_rows: 10570
    })
})

Interestingly, although we specified PyTorch tensors as output, the type of the `input_ids` is still a list. To remedy this problem, you need to explicitly set the type of the column that contains PyTorch tensors.

In [None]:
type(encoded_squad["train"][0]["input_ids"])

list

In [None]:
### your code ###
encoded_squad.set_format(type="torch", columns=["input_ids", "attention_mask", "target_ids", "target_attention_mask"])
### your code ###
type(encoded_squad["train"][0]["input_ids"])

torch.Tensor

In [None]:
print("Shape of the input_ids:",encoded_squad["train"][0]["input_ids"].shape)
print("Shape of the target_ids:",encoded_squad["train"][0]["target_ids"].shape)

Shape of the input_ids: torch.Size([512])
Shape of the target_ids: torch.Size([16])


✅ Point distribution ✅
- 0.5 point if the tokenizer is loaded correctly.
- 0.5 if the EOS sentence and prefixes are added correctly in `add_eos_to_examples`.
- 0.5 if the map function is used correctly for `add_eos_to_examples`.
- 1 point if the input and target are tokenized and padded correctly in `convert_to_features`.
- 0.5 point for use of the map function for `convert_to_features`.
- 0.5 point for conversion of the columns to tensors.

The final step in the data processing is to create a data collator that prepares `labels` from `target_ids` and returns examples with the keys expected by the forward method of T5.
This is necessary because the trainer passes this dictionary directly to the model as arguments, so you need to check the inputs of T5 and rename the columns accordingly.
`input_ids`, `target_ids`, `attention_mask`, and `target_attention_mask` need to be stacked into a batch, and the pad tokens in the target need to be set to `-100` so that they are ignored in the loss computation.

In [None]:
from dataclasses import dataclass
from transformers import DataCollator
from transformers.tokenization_utils_base import PreTrainedTokenizerBase, PaddingStrategy
from typing import Optional, Union
@dataclass
class T2TDataCollator:
    tokenizer: PreTrainedTokenizerBase
    padding: Union[bool, str, PaddingStrategy] = True
    max_length: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None
    def __call__(self, batch):

      ### your code ###
        input_ids = torch.stack([example['input_ids'] for example in batch])
        labels = torch.stack([example['target_ids'] for example in batch])
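        # Replace padding positions with -100 so that the loss ignores them (the T5 pad token ID is 0).
        labels[labels == self.tokenizer.pad_token_id] = -100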
        attention_mask = torch.stack([example['attention_mask'] for example in batch])
        decoder_attention_mask = torch.stack([example['target_attention_mask'] for example in batch])

        feature_dict={
            'input_ids': input_ids,
            'attention_mask': attention_mask,
            'labels': labels,
            'decoder_attention_mask': decoder_attention_mask
        }
        return feature_dict
      ### your code ###


In [None]:
accepted_keys = ['input_text', 'target_text', 'input_ids', 'attention_mask', 'target_ids', 'target_attention_mask']
features = [{k: v for k, v in encoded_squad["train"][i].items() if k in accepted_keys} for i in range(2)]
batch=T2TDataCollator(t5_tokenizer)(features)
print(batch["input_ids"].shape)
print(batch["attention_mask"].shape)
print(batch["labels"].shape)

torch.Size([2, 512])
torch.Size([2, 512])
torch.Size([2, 16])


✅ Point distribution ✅
- 0.5 point if the tensors are stacked properly, look at the output shapes for hints.
- 0.5 points if the names are correctly defined.
- 0.5 point if the label pads are set to `-100`.

#### ${\color{red}{Comments\ 3.1}}$

${\color{red}{⚠️Comments\ begin⚠️}}$


```
cross-feedback comment section
```


${\color{red}{⚠️Comments\ end⚠️}}$

### Subtask 2: Training

For training and inference, we can use `T5ForConditionalGeneration`, which includes the language modeling head on top of the decoder. Load the `t5-small` model.

In [None]:
### your code ###
from transformers import T5ForConditionalGeneration
t5 = T5ForConditionalGeneration.from_pretrained("t5-small")
### your code ###

config.json:   0%|          | 0.00/1.21k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/242M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

Next, similar to the previous task we initiate training arguments. Note that this time we are using a `Seq2SeqTrainingArguments` for a `Seq2SeqTrainer`. Set the parameters for training as follows:


*   T5 doesn't support GPU and TPU evaluation for now, so we only focus on training. You do not need to pass any parameters for evaluation setup.
*   The output directory should be named `t5-squad`.
* The T5 models need a slightly higher learning rate than the default one set in the `Trainer` when using the `AdamW` optimizer. Set the learning rate to `1e-4` and the regularization parameter to `0.01`.
* The random seed should be `77`. We train for a maximum of `200` steps and save a checkpoint every `100` steps. A complete training of the T5 model requires far more than `200` steps, but that is beyond the scope of this assignment.
* T5 models require a large batch size. The default model was trained with a batch size of `128`. However, we cannot fit that into a single GPU, so we use gradient accumulation. Set the batch size to `32` and choose the number of gradient accumulation steps to reach an effective batch size of `128`.
* Make sure that your trainer does not remove unused columns during training, as this will cause a runtime error later on.


**Gradient accumulation** is a technique that simulates a larger batch size by accumulating gradients from multiple small batches before performing a weight update.
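The following sketch illustrates this on a toy problem in plain Python: a one-parameter linear model with a mean squared error loss (both chosen only for illustration, with hypothetical helper names). Averaging the gradients of four micro-batches of size `32` gives the same gradient as one batch of size `128`, which is why `128 / 32 = 4` accumulation steps reach the desired effective batch size.

```python
def mean_gradient(weight, xs, ys):
    """Gradient of the mean squared error of the model y = weight * x w.r.t. weight."""
    return sum(2 * (weight * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_gradient(weight, xs, ys, micro_batch_size):
    """Accumulate the scaled gradients of consecutive micro-batches."""
    steps = len(xs) // micro_batch_size
    total = 0.0
    for i in range(steps):
        lo, hi = i * micro_batch_size, (i + 1) * micro_batch_size
        # Each micro-batch gradient is divided by the number of accumulation steps.
        total += mean_gradient(weight, xs[lo:hi], ys[lo:hi]) / steps
    return total


xs = [float(i % 7) for i in range(128)]
ys = [3.0 * x + 1.0 for x in xs]

full = mean_gradient(0.5, xs, ys)
accumulated = accumulated_gradient(0.5, xs, ys, micro_batch_size=32)

print(128 // 32)                       # 4 accumulation steps
print(abs(full - accumulated) < 1e-9)  # True
```

Only one micro-batch has to fit into memory at a time, while the weight update is based on all `128` examples.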



In [None]:
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    ### your code ###
    output_dir="t5-squad",
    save_strategy="steps",
    save_steps=100,
    learning_rate=1e-4,
    seed=77,
    per_device_train_batch_size=32,
    max_steps=200,
    weight_decay=0.01,
    gradient_accumulation_steps= 4,
    remove_unused_columns=False
    ### your code ###
)

Once again, make sure that you are using a GPU before running the cell below.
Initialize your `Seq2SeqTrainer` with the inputs necessary for training. The training should take around 15 min on a Google Colab T4 GPU.

In [None]:
# Initialize our Trainer
from transformers import  Seq2SeqTrainer
trainer = Seq2SeqTrainer(
    ### your code ###
        model=t5,
        args=training_args,
        tokenizer=t5_tokenizer,
        train_dataset=encoded_squad["train"],
        data_collator=T2TDataCollator(t5_tokenizer),
        ### your code ###
    )

In [None]:
trainer.train()

Step,Training Loss


TrainOutput(global_step=200, training_loss=0.47513809204101565, metrics={'train_runtime': 797.5955, 'train_samples_per_second': 32.096, 'train_steps_per_second': 0.251, 'total_flos': 3464750117683200.0, 'train_loss': 0.47513809204101565, 'epoch': 0.29})

✅ Point distribution ✅
- 0.5 points for initializing the correct model.
- 1 point if all the training parameters are set correctly.
- 0.5 points if the trainer is correctly initialized.

#### ${\color{red}{Comments\ 3.2}}$

${\color{red}{⚠️Comments\ begin⚠️}}$


```
cross-feedback comment section
```


${\color{red}{⚠️Comments\ end⚠️}}$

### Subtask 3: Inference

Our trained model has seen far too few instances to make coherent predictions. We therefore load an already fine-tuned checkpoint from Hugging Face and perform inference with it. Load this [model](https://huggingface.co/mrm8488/t5-base-finetuned-squadv2) and the respective tokenizer. Note that we are loading a `base` model, which is larger than `t5-small`. Nonetheless, we can use the same tokenizer, since the tokenizer does not depend on the model size.


In [None]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
### your code ###
t5_tokenizer = AutoTokenizer.from_pretrained("mrm8488/t5-base-finetuned-squadv2")
t5_model = AutoModelForSeq2SeqLM.from_pretrained("mrm8488/t5-base-finetuned-squadv2")
### your code ###


The `xla_device` argument has been deprecated in v4.4.0 of Transformers. It is ignored and you can safely remove it from your `config.json` file.



At inference time, it is recommended to use the `generate()` function with T5, which generates the decoder output auto-regressively. Complete the code for the `get_answer` function, which, given a model, a tokenizer, and a question-context pair, generates the answer from the given context. The output should be the answer to the question in natural text (without the special tokens).

**Hint:** Many of the steps are similar to how you prepared your input data for the model.

In [None]:
def get_answer(tokenizer, model, question, context):
    ### your code ###
    # Build the input string in the "question: ... context: ..." format.
    input_text = "question: %s  context: %s" % (question, context)
    print(input_text)
    features = tokenizer([input_text], return_tensors="pt")

    # Generate the answer tokens auto-regressively.
    output = model.generate(
        input_ids=features["input_ids"],
        attention_mask=features["attention_mask"],
    )
    # Decode to plain text and drop the special tokens.
    answer = tokenizer.decode(output[0], skip_special_tokens=True)
    ### your code ###
    return answer

Let's try it with an example.

In [None]:
context = "Sarah has joined NLP for transformers class and is working on her research project with the support of Harry."
question = "Who is supporting Sarah?"

get_answer(t5_tokenizer, t5_model, question, context)  # your answer should be "Harry"

question: Who is supporting Sarah?  context: Sarah has joined NLP for transformers class and is working on her research project with the support of Harry.




'Harry'

In [None]:
context = "TPUs are more power efficient in comparison to GPUs making them a better choice for machine learning projects."
question = "What is better for machine learning projects?"

get_answer(t5_tokenizer, t5_model, question, context)  # your answer should be "TPUs"

question: What is better for machine learning projects?  context: TPUs are more power efficient in comparison to GPUs making them a better choice for machine learning projects.


'TPUs'

✅ Point distribution ✅
- 0.5 points for initializing the correct model and tokenizer.
- 0.5 points for preparing the input correctly.
- 1 point for generation and decoding using the tokenizer.


#### ${\color{red}{Comments\ 3.3}}$

${\color{red}{⚠️Comments\ begin⚠️}}$


```
cross-feedback comment section
```


${\color{red}{⚠️Comments\ end⚠️}}$

### Subtask 4: T5 Paper

To answer the questions of the final subtask, you need a general overview of the [T5 paper](https://arxiv.org/pdf/1910.10683.pdf).



1.   Describe what a "text-to-text format" is and how T5 processes input and output for text classification tasks. What are the possible complications with a predefined set of classes?
2.   Describe the unsupervised objective with sentinel tokens that is inspired by "masked language modeling" and "word dropout". Give an example of how this would look for a single sentence.
3. Explain "fully-visible", "causal" and "causal masking with prefix" masking.
4. Briefly describe "adapter layers" and "gradual unfreezing" as methods for fine-tuning on fewer parameters.



1.
**Text-to-text format** is a framework in which every task is cast as feeding the model some text for context or conditioning and asking it to produce some output text. For text classification tasks, the model simply predicts a single word corresponding to the target label. For example, on the MNLI benchmark, the goal is to predict whether a premise implies (*entailment*), contradicts (*contradiction*), or neither (*neutral*) a hypothesis. The input sequence then becomes “mnli premise: I hate pigeons. hypothesis: My feelings towards pigeons are filled with animosity.” with the corresponding target word *entailment*. The complication with a predefined set of classes is that the model generates free text: if it outputs text that does not correspond to any of the possible labels, the output is counted as wrong.
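
As a rough illustration of the answer above, the following sketch casts an MNLI example as text and scores a generated string against the label set. The helper names `to_text_to_text` and `score_prediction` are hypothetical and not part of any library, and no model is called; the "generated" strings are written by hand.

```python
MNLI_LABELS = ("entailment", "contradiction", "neutral")


def to_text_to_text(premise, hypothesis):
    """Cast one MNLI example as an input string for a text-to-text model."""
    return f"mnli premise: {premise} hypothesis: {hypothesis}"


def score_prediction(generated_text, target_label, labels=MNLI_LABELS):
    """Return True only if the generated text is the correct label.

    Free-form generation can yield text outside the label set; such an
    output is simply counted as wrong.
    """
    prediction = generated_text.strip()
    if prediction not in labels:
        return False
    return prediction == target_label


source = to_text_to_text(
    "I hate pigeons.",
    "My feelings towards pigeons are filled with animosity.",
)
print(source)
print(score_prediction("entailment", "entailment"))  # valid and correct label
print(score_prediction("hamburger", "entailment"))   # not a label, counted wrong
```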



2.
Inspired by BERT's "masked language modeling" objective and the "word dropout" regularization technique, the authors design an objective that randomly samples and then drops out 15% of the tokens in the input sequence. Each consecutive span of dropped-out tokens is replaced by a single sentinel token. Each sentinel token is assigned a token ID that is unique to the sequence. The sentinel IDs are special tokens that are added to the vocabulary and do not correspond to any wordpiece.

The example from the paper: the input "Thank you for inviting me to your party last week." becomes "Thank you `<X>` me to your party `<Y>` week.", and the target is "`<X>` for inviting `<Y>` last `<Z>`". The words "for", "inviting" and "last" are randomly chosen for corruption. Each consecutive span of corrupted tokens is replaced by a sentinel token (shown as `<X>` and `<Y>`) that is unique over the example. Since "for" and "inviting" occur consecutively, they are replaced by a single sentinel `<X>`. The output sequence then consists of the dropped-out spans, delimited by the sentinel tokens used to replace them in the input, plus a final sentinel token `<Z>`.
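
The corruption step can be sketched in a few lines of plain Python. This is a simplified illustration, not the paper's preprocessing code: the function name `corrupt_spans` is hypothetical, tokens are whitespace-separated words rather than wordpieces, the dropped positions are fixed by hand instead of sampled at a 15% rate, and only three sentinels are available.

```python
def corrupt_spans(tokens, drop_indices, sentinels=("<X>", "<Y>", "<Z>")):
    """Replace each run of dropped tokens by one sentinel and build the target."""
    drop = set(drop_indices)
    inputs, targets = [], []
    next_sentinel = 0
    previous_dropped = False
    for i, token in enumerate(tokens):
        if i in drop:
            if not previous_dropped:
                # A new span starts: emit the same sentinel on both sides.
                inputs.append(sentinels[next_sentinel])
                targets.append(sentinels[next_sentinel])
                next_sentinel += 1
            targets.append(token)
            previous_dropped = True
        else:
            inputs.append(token)
            previous_dropped = False
    # A final sentinel closes the target sequence.
    targets.append(sentinels[next_sentinel])
    return " ".join(inputs), " ".join(targets)


tokens = "Thank you for inviting me to your party last week.".split()
# Positions 2, 3 ("for", "inviting") and 8 ("last") are chosen by hand here.
corrupted, target = corrupt_spans(tokens, drop_indices=[2, 3, 8])
print(corrupted)
print(target)
```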




3.
The encoder uses a **fully-visible** attention mask. Fully-visible masking allows a self-attention mechanism to attend to any entry of the input when producing each entry of its output.

The self-attention operations in the Transformer's decoder use a **causal** masking pattern. When producing the $i$-th entry of the output sequence, causal masking prevents the model from attending to the $j$-th entry of the input sequence for $j > i$.

**Causal with prefix** changes the causal masking to have fully-visible masking during the prefix portion of the sequence.
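
The three patterns can be written out as small 0/1 matrices. In this sketch, which uses hypothetical helper names and plain nested lists instead of tensors, `mask[i][j] == 1` means that output position `i` may attend to input position `j`.

```python
def fully_visible_mask(n):
    """Every position may attend to every other position."""
    return [[1] * n for _ in range(n)]


def causal_mask(n):
    """Position i may attend only to positions j <= i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def prefix_causal_mask(n, prefix_len):
    """Causal masking, except that the first prefix_len positions are visible to all."""
    return [[1 if (j <= i or j < prefix_len) else 0 for j in range(n)]
            for i in range(n)]


for name, mask in [
    ("fully-visible", fully_visible_mask(4)),
    ("causal", causal_mask(4)),
    ("causal with prefix (prefix_len=2)", prefix_causal_mask(4, 2)),
]:
    print(name)
    for row in mask:
        print(" ", row)
```

With a prefix of length 2, the first two rows differ from the causal mask: position 0 can already see position 1, because both belong to the prefix.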


4.
**Adapter layers** are an architectural component that allows a pre-trained model to be adapted to new tasks quickly and effectively without retraining the entire model. They are additional dense-ReLU-dense blocks that are added after each of the preexisting feed-forward networks in each block of the Transformer. These new feed-forward networks are designed so that their output dimensionality matches their input. This allows them to be inserted into the network with no additional changes to the structure or parameters. When fine-tuning, only the adapter layer and layer normalization parameters are updated. The main hyperparameter of this approach is the inner dimensionality $d$ of the feed-forward network, which changes the number of new parameters added to the model.
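
To see how the inner dimensionality $d$ controls the number of added parameters, here is a back-of-the-envelope count. It rests on assumptions that are not taken from the text above: biases are ignored, the model dimension is 768, and the model has 24 feed-forward networks (12 encoder blocks plus 12 decoder blocks). The function name `adapter_parameter_count` and the values of $d$ are illustrative.

```python
def adapter_parameter_count(d_model, d_inner, num_ffn_blocks):
    """Weights added by dense-ReLU-dense adapters (biases ignored).

    Each adapter maps d_model -> d_inner -> d_model, so that its output
    dimensionality matches its input.
    """
    per_adapter = d_model * d_inner + d_inner * d_model
    return per_adapter * num_ffn_blocks


# Assumed sizes: d_model=768 and 24 feed-forward networks.
for d_inner in (32, 128, 512):
    print(d_inner, adapter_parameter_count(768, d_inner, 24))
```

The count grows linearly in $d$, which is why $d$ is the natural knob for trading off capacity against the number of parameters that are fine-tuned.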

In **gradual unfreezing**, more and more of the model parameters are fine-tuned over time. Gradual unfreezing was originally applied to a language model architecture consisting of a single stack of layers. In this setting, at the start of fine-tuning only the parameters of the final layer are updated. After training for a certain number of updates, the parameters of the second-to-last layer are also included, and so on until the parameters of the entire network are being fine-tuned. To adapt this approach to the encoder-decoder model, the authors gradually unfreeze layers in the encoder and decoder in parallel, starting from the top in both cases.
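
A possible unfreezing schedule can be sketched as follows. The function name `trainable_layers`, the equal-length episodes, and the step counts are assumptions for illustration. For an encoder-decoder model, the same schedule would be applied to the encoder stack and the decoder stack in parallel.

```python
def trainable_layers(step, total_steps, num_layers):
    """Return the 1-based indices of the layers that are updated at `step`.

    Training is split into `num_layers` equal episodes. Episode n (counted
    from 1) trains the top n layers, so the last episode trains all layers.
    """
    episode_len = max(total_steps // num_layers, 1)
    episode = min(step // episode_len, num_layers - 1) + 1
    return list(range(num_layers - episode + 1, num_layers + 1))


# Hypothetical setting: 12 layers per stack, 1200 fine-tuning steps.
for step in (0, 100, 650, 1199):
    print(step, trainable_layers(step, total_steps=1200, num_layers=12))
```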

✅ Point distribution ✅
- 1.5 points for part 1.
- 1 point for part 2.
- 1.5 points for part 3.
- 1 point for part 4.


#### ${\color{red}{Comments\ 3.4}}$

${\color{red}{⚠️Comments\ begin⚠️}}$


```
cross-feedback comment section
```


${\color{red}{⚠️Comments\ end⚠️}}$