**Heidelberg University**

**Data Science  Group**
    
Prof. Dr. Michael Gertz  

Ashish Chouhan, Satya Almasian, John Ziegler, Jayson Salazar, Nicolas Reuter
    
December 4, 2023
    
Natural Language Processing with Transformers

Winter Semester 2023/2024
***

# **Assignment 3: “Transformers”**
**Due**: Monday, January 8, 2024, 2pm, via [Moodle](https://moodle.uni-heidelberg.de/course/view.php?id=19251)



### **Submission Guidelines**

- Solutions need to be uploaded as a **single** Jupyter notebook. You will find several pre-filled code segments in the notebook; your task is to fill in the missing cells.
- For the written solution, use LaTeX in markdown inside the same notebook. Do **not** hand in a separate file for it.
- Download the .zip file containing the dataset but do **not** upload it with your solution.
- It is sufficient if one person per group uploads the solution to Moodle, but make sure that the full names of all team members are given in the notebook.

***

## **Task 1: Diving into Attention** (3 + 4 + 4 + 1 = 12 points)

In this task, you work with self-attention equations and find out why multi-head attention is preferable to single-head attention.

Recall the equation of attention on slide 5-9 to compute self-attention on a series of input tokens. We simplify the formula by focusing on a single query vector $q \in R^d$, value vectors ($\{ v_1,v_2,...,v_n \},v_i \in R^d$), and key vectors ($\{ k_1,k_2,...,k_n \},k_i \in R^d$). We then have

$$
a_i=\frac{exp(q^Tk_i)}{\Sigma^n_{j=1}exp(q^Tk_j)}
$$

$$
 o= \Sigma^n_{i=1} a_i v_i
$$

with $a_i$ being the attention weight for query $q$ with respect to key $k_i$. Then the output $o$ is the new representation for the query token as a weighted average of value vectors with weights $a=\{ a_1,a_2,...,a_n \},a_i \in R$.
Answer the following questions with the help of the equations and the intuition behind attention that you learned in the class:



### Subtask 1: Copying  

1.   Explain why $a$ can be interpreted as a categorical distribution.
2.   This distribution is typically diffuse, where the mass is spread out between different values of $a_i$. Describe a scenario in which the categorical distribution puts all the weight on a single element, e.g., $a_j \gg \Sigma_{i\neq j}a_i$. What are the conditions on key and/or query for this to happen?
3. In this case of a single large $a_j$, what would the output $o$ look like, and what does this mean intuitively?

In attention, it is easy to **copy** a value vector $v_i$ to the output $o$.





**Answer**




1.1

As described on slide 5-9, each $a_i$ measures the similarity between the query and the key $k_i$. Applying the softmax to the scores $q^Tk_i$ makes every $a_i$ non-negative and makes all $a_i$ sum to 1. Each $a_i$ can therefore be read as the probability of selecting key $k_i$, which makes the vector $a$ a categorical probability distribution over the keys.

1.2

This scenario occurs whenever the query has a much larger dot product with a single key $k_{i^*}$ than with all other keys, i.e., $q^Tk_{i^*} \gg q^Tk_i$ for all $i \neq i^*$. The softmax function then pushes $a_{i^*}$ to almost 1 and all other weights to $\approx 0$.

1.3

In the case of a single large $a_{i^*}$, the output $o$ is approximately equal to the value vector $v_{i^*}$ that belongs to the key matching the query best. All $a_i$ with $i \neq i^*$ contribute almost nothing to the sum, so only the summand $a_{i^*} \cdot v_{i^*}$ remains. Since $a_{i^*} \approx 1$, the total sum is approximately $v_{i^*}$.

In practical terms, this means that all the attention of the given query is focused on a single token, namely the best matching token.
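The copying behaviour can be checked with a small numerical sketch. The helper `attention` and the toy keys and values below are illustrative assumptions, not part of the assignment; the code only implements the two equations for $a_i$ and $o$ given above.

```python
import math

def attention(q, keys, values):
    """Single-query attention: softmax over q.k_i, then weighted sum of values."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    output = [sum(w * v[t] for w, v in zip(weights, values)) for t in range(dim)]
    return weights, output

# Toy data (assumption): three orthonormal keys and arbitrary 2-d values.
keys = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

# A query that is a large multiple of the second key makes its score dominate.
q = [0.0, 20.0, 0.0]
weights, output = attention(q, keys, values)
print(weights)  # almost all mass on the second key
print(output)   # approximately the second value vector [3.0, 4.0]
```

With a smaller scale for `q`, the weights become more diffuse and the output moves towards a mixture of all value vectors.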





#### ${\color{red}{Comments\ 1.1}}$

${\color{red}{⚠️Comments\ begin⚠️}}$


```
cross-feedback comment section
```


${\color{red}{⚠️Comments\ end⚠️}}$

### Subtask 2: Averaging


Instead of focusing on just one value vector $v_j$, the Transformer model can incorporate information from multiple inputs. Consider the situation where we want to incorporate information from two value vectors $v_b$ and $v_c$ with keys $k_b$ and $k_c$. In machine learning one of the ways to combine this information is through averaging of vectors $o= \frac{1}{2}(v_b+v_c)$.  It might seem hard to extract information about the original vectors $v_b$ and $v_c$ from the resulting average. But under certain conditions, one can do so. In this subtask, we look at the following cases:

1. Suppose we know the following:


* $v_b$ lies in a subspace $B$ formed by the $m$ basis vectors $\{b_1, b_2, .. , b_m\}$, while $v_c$ lies in a subspace $C$ formed by the $p$ basis vectors $\{c_1, c_2, . . . , c_p\}$ (This means that any $v_b$ and $v_c$ can be expressed as a linear combination of their basis vectors).
*   All basis vectors have the norm 1 and are orthogonal to each other.
*   The two subspaces $B$ and $C$ are orthogonal, meaning $b_j^Tc_k=0$ for all $j$ and $k$.
* Given that $\{b_1, b_2, .. , b_m\}$ are both orthogonal and form a basis for $v_b$, we know that there exists some $d_1, ..., d_m$ such that $v_b=d_1 b_1+d_2 b_2+...+d_m b_m$. Use these $d\text{s}$ to solve this task.

Using the basis vectors $\{b_1, b_2, .. , b_m\}$, construct a matrix $M$ such that for arbitrary vectors $v_b$ and $v_c$ with the given conditions, we can use $M$ to extract $v_b$ from the sum of the vector $s = v_b + v_c$. In other words, construct an $M$ such that  $ Ms = v_b$ holds.


2. If we assume that
* all key vectors are orthogonal, i.e., $k_i^Tk_j=0$ for all $i \neq j$, and
* all key vectors have the norm 1.

Find an expression for the query vector $q$ such that $o \approx \frac{1}{2}(v_b+v_c)$. Justify your answer.

**Hint:** Use your finding in subtask 1 to solve part 2.

**Hint:** If the norm of a vector $x$ is 1, then $x^Tx=1$

**Hint:** Start with writing $v_b$ and $v_c$ as the linear combination of the bases.


**Answer**

1. We want to solve $Ms = v_b$ for $s = v_b + v_c$ using the properties of orthogonal projection matrices. To do so, we introduce the matrix $X$, whose columns are the basis vectors $b_1, \dots, b_m$.
$$
X = \begin{pmatrix} b_1 & \dots & b_m \end{pmatrix}
$$

  We now define the matrix $M$ based on $X$ as follows:

  $$
  M = X(X^TX)^{-1}X^T
  $$

  Since the $b_j$ are orthonormal, $X^TX = I$, so $M$ is simply $XX^T$. We further note two things about $X$ and $X^T$, writing $v_c = e_1c_1 + \dots + e_pc_p$ with coefficients $e_1, \dots, e_p$:
  - $X^Tv_c = 0$, since $X^Tv_c = \begin{pmatrix}
  b^T_1 \\
  \dots \\
  b^T_m
  \end{pmatrix}
  (e_1c_1 + \dots + e_pc_p) = e_1\begin{pmatrix}
  b^T_1 \\
  \dots \\
  b^T_m
  \end{pmatrix}c_1 + \dots + e_p\begin{pmatrix}
  b^T_1 \\
  \dots \\
  b^T_m
  \end{pmatrix}c_p =
  e_1\begin{pmatrix}
  0 \\
  \dots \\
  0
  \end{pmatrix} + \dots + e_p\begin{pmatrix}
  0 \\
  \dots \\
  0
  \end{pmatrix}
  =
  0
  $, due to the orthogonality of $b_j$ and $c_k$.

  - $v_b = X\begin{pmatrix}
  d_1 \\
  \dots \\
  d_m
  \end{pmatrix}$ since $X\begin{pmatrix}
  d_1\\
  \dots \\
  d_m
  \end{pmatrix} = \begin{pmatrix} b_1 & \dots & b_m \end{pmatrix}\begin{pmatrix}
  d_1\\
  \dots \\
  d_m
  \end{pmatrix} = d_1b_1 + \dots + d_mb_m$.

  Combining these two observations, we get:

  \begin{align}
  Ms &= M(v_b + v_c) \\
      &= X(X^TX)^{-1}X^T(v_b + v_c) \\
      &= X(X^TX)^{-1}X^Tv_b + X(X^TX)^{-1}\underbrace{X^Tv_c}_{= 0} \\
      &= X\underbrace{(X^TX)^{-1}X^TX}_{= I}\begin{pmatrix}
      d_1 \\
      \cdots \\
      d_m
      \end{pmatrix} + 0 \\
      &= X\begin{pmatrix}
      d_1 \\
      \cdots \\
      d_m
      \end{pmatrix} \\
      &= v_b
  \end{align}

2. We propose $q = \alpha (k_b + k_c)$ with $\alpha \gg 0$ as a solution that yields $o \approx \frac{1}{2} (v_b + v_c)$.
Given this $q$, the orthogonality of the keys and $||k_j|| = 1$ give for each $k_j$:
\begin{align}
    q^Tk_j&=\alpha(k_b+k_c)^Tk_j \\
        &= \begin{cases}
    \alpha,& \text{if } j = b,c\\
    0,              & \text{otherwise}
    \end{cases} \\
\end{align}

It follows:

\begin{align}
    \exp(q^Tk_j)&=\exp(\alpha(k_b+k_c)^Tk_j) \\
        &= \begin{cases}
    \exp(\alpha),& \text{if } j = b,c\\
    1,              & \text{otherwise}
    \end{cases} \\
\end{align}

Therefore, $\Sigma^n_{j=1}exp(q^Tk_j) = 2\exp(\alpha)+(n-2)$.

It follows:
\begin{align}
    \Rightarrow o&= \Sigma^n_{i=1} a_i v_i \\
    &= \Sigma^n_{i=1} \frac{exp(q^Tk_i)}{\Sigma^n_{j=1}exp(q^Tk_j)} v_i \\
    &= \frac{1}{2\exp(\alpha)+(n-2)}\Sigma^n_{i=1} \exp(q^Tk_i) v_i \\
    &= \frac{1}{2\exp(\alpha)+(n-2)}\left(\exp(\alpha)v_b + \exp(\alpha)v_c + \Sigma^n_{i=1, i \neq b,c} v_i\right) \\
    &= \frac{1}{2\exp(\alpha)+(n-2)}\left(\exp(\alpha)(v_b + v_c) + \Sigma^n_{i=1, i \neq b,c} v_i\right)
\end{align}

Because we choose a very large $\alpha$, $\exp(\alpha)$ dominates the remaining terms and we can make two approximations:
- $\exp(\alpha)(v_b + v_c) + \Sigma^n_{i=1, i \neq b,c} v_i \approx \exp(\alpha)(v_b + v_c)$
- $2\exp(\alpha)+(n-2) \approx 2\exp(\alpha)$

In other words: The impact of the keys and values which aren't $b$ or $c$ becomes nearly 0 and therefore we finally get:

\begin{align}
    o&= \Sigma^n_{i=1} a_i v_i \\
    &= \dots \\
    &= \frac{1}{2\exp(\alpha)+(n-2)}\left(\exp(\alpha)(v_b + v_c) + \Sigma^n_{i=1, i \neq b,c} v_i\right) \\
    &\approx \frac{1}{2\exp(\alpha)}(\exp(\alpha)(v_b + v_c)) \\
    &= \frac{1}{2}(v_b + v_c)
\end{align}
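Both parts of this answer can be checked numerically. The following is a minimal sketch under stated assumptions: the helpers `dot` and `attention`, the concrete bases, keys, values and the choice $\alpha = 20$ are made up for illustration. Part 1 applies $M = XX^T$ (which equals $X(X^TX)^{-1}X^T$ for orthonormal columns) to $s = v_b + v_c$; part 2 evaluates the attention output for $q = \alpha(k_b + k_c)$.

```python
import math

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def attention(q, keys, values):
    """Single-query attention: softmax over q.k_i, then weighted sum of values."""
    scores = [dot(q, k) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    return [sum(w * v[t] for w, v in zip(weights, values))
            for t in range(len(values[0]))]

# Part 1: M = X X^T recovers v_b from s = v_b + v_c (toy subspaces in R^4).
r = 1 / math.sqrt(2)
basis_b = [[r, r, 0.0, 0.0], [r, -r, 0.0, 0.0]]  # orthonormal basis of B
v_b = [2.0, -1.0, 0.0, 0.0]                       # lies in B
v_c = [0.0, 0.0, 3.0, 5.0]                        # lies in C = span{e_3, e_4}
s = [x + y for x, y in zip(v_b, v_c)]
# M s = sum_j b_j (b_j^T s), i.e. the projection of s onto B.
recovered = [sum(b[t] * dot(b, s) for b in basis_b) for t in range(4)]
print(recovered)  # approximately v_b

# Part 2: q = alpha * (k_b + k_c) with orthonormal keys averages v_b and v_c.
keys = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0],
        [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
values = [[1.0, 0.0], [10.0, 10.0], [0.0, 1.0], [-10.0, 5.0]]
b, c = 0, 2
alpha = 20.0
q = [alpha * (kb + kc) for kb, kc in zip(keys[b], keys[c])]
o = attention(q, keys, values)
print(o)  # approximately (v_b + v_c) / 2 = [0.5, 0.5]
```

Lowering `alpha` towards 1 lets the other two value vectors leak into `o`, which is why the derivation requires a large $\alpha$.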

#### ${\color{red}{Comments\ 1.2}}$

${\color{red}{⚠️Comments\ begin⚠️}}$


```
cross-feedback comment section
```


${\color{red}{⚠️Comments\ end⚠️}}$

### Subtask 3: Drawbacks of Single-head Attention

You might have wondered why we need multiple heads for attention. In this subtask, we look at some of the drawbacks of single-head attention. As shown in the previous subtask, it is possible for single-head attention to focus equally on two values. The same can apply to any subset of values, which can therefore become problematic.

Consider a set of key vectors $\{ k_1,k_2,...,k_n \}$, where each $k_i$ is randomly sampled from a normal distribution with a known mean value of $\mu_i \in R^d$ and unknown covariance $Σ_i, i \in \{1, \ldots, n\}$, where


*   $\mu_i\text{s}$ are all orthogonal $\mu_i^T\mu_j=0$ if $i \neq j$.
*   $\mu_i\text{s}$ all have unit norm $||\mu_i||=1$.

1. Suppose the covariance matrices are $Σ_i=\alpha I, \forall i  \in \{1,2,..,n\}$, for a vanishingly small $\alpha$ (not to be confused with attention weights). Design a query $q$ in terms of the $\mu_i$ such that, as before, $o \approx \frac{1}{2}(v_b+v_c)$, and describe why it works.

2.  Large perturbations in key values might cause problems for single-head attention. Specifically, in some cases, one key vector $k_b$ may be larger or smaller in norm than the others, while still pointing in the same direction as $\mu_b$. As an example of such a case,
consider a covariance matrix for item $b$ for vanishingly small $\alpha$ as $Σ_b=\alpha I + \frac{1}{2}(\mu_b\mu_b^T)$. This causes $k_b$ to point in roughly the same direction as $\mu_b$ but with large differences in magnitude. For all other items, let $Σ_i=\alpha I$, $i \neq b$. When you sample multiple keys from the distribution $\{ k_1,k_2,...,k_n \}$ and use the $q$ vector from the previous part, what do you expect vector $o$ to look like? Explain why this shows the drawback of single-head attention.

**Hint:**
Think about how it differs from the previous part and how $o$'s variance would be affected by the change in $Σ_b$.

**Hint:** Considering that $\mu_b^T\mu_b=1$, think about what range of values $Σ_b$ can take and how that affects a sampled $k_b$ value.

**Hint:** $\frac{exp(b)}{exp(b)+exp(c)}=\frac{exp(b)}{exp(b)+exp(c)}\frac{exp(-b)}{exp(-b)}= \frac{1}{1+exp(c-b)}$

**Answer:**

1. We propose $q = \beta (\mu_b + \mu_c)$ with $0 \ll \beta \ll \frac{1}{\sqrt{\alpha}}$ as a solution that yields $o \approx \frac{1}{2} (v_b + v_c)$.
Before deriving $o$, we note an important result that follows from the given assumptions.

  Since each $k_j \sim N(\mu_j, \alpha I)$ and each $\mu_j$ is fixed and known, the rules for linear transformations of Gaussian random variables and the orthonormality of all $\mu_j$ give:

  \begin{align}
    \beta\mu_b^Tk_j &\sim N(\beta\mu_b^T\mu_j, \beta^2\mu_b^T\Sigma_j\mu_b) \\
        &\sim N(\beta\mu_b^T\mu_j, \beta^2\mu_b^T\alpha I\mu_b) \\
        &\sim N(\beta\mu_b^T\mu_j, \beta^2\alpha \cdot 1) \\
        &\sim \begin{cases}
        N(\beta, \beta^2\alpha),& \text{if } j = b\\
        N(0, \beta^2\alpha),& \text{otherwise }\\
        \end{cases}
  \end{align}

  Since $\alpha$ is by definition vanishingly small and $0 \ll \beta \ll \frac{1}{\sqrt{\alpha}}$, the variance $\beta^2\alpha$ of $\beta\mu_b^Tk_j$ is $\approx 0$. Therefore, with probability $\approx 1$, we have
  $$
  \beta\mu_b^Tk_j \approx
  \begin{cases}
      \beta,& \text{if } j = b\\
      0,& \text{otherwise }\\
  \end{cases}
  $$

  The same applies for $\mu_c$ yielding
  $$
  \beta\mu_c^Tk_j \approx
  \begin{cases}
      \beta,& \text{if } j = c\\
      0,& \text{otherwise }\\
  \end{cases}
  $$

  For the proposed query $q = \beta (\mu_b + \mu_c)$ we get:
  
  \begin{align}
      q^Tk_j &= \beta(\mu_b + \mu_c)^Tk_j \\
      &= \beta\mu_b^Tk_j + \beta\mu_c^Tk_j \\
      &\approx
  \begin{cases}
      \beta,& \text{if } j = b,c\\
      0,& \text{otherwise }\\
  \end{cases}
  \end{align}

  We are now in the same setting as in the first equation of Subtask 2.2 (with $\beta$ in place of $\alpha$), which leads to the desired result $o \approx \frac{1}{2}(v_b + v_c)$ as derived above.


2. If we now assume that $\Sigma_b = \alpha I + \frac{1}{2}(\mu_b\mu_b^T)$, we get the following based on the fact that $||\mu_j|| = 1$:

    \begin{align}
    \beta\mu_b^Tk_b &\sim N(\beta\mu_b^T\mu_b, \beta^2\mu_b^T\Sigma_b\mu_b)\\
        &\sim N(\beta, \beta^2\mu_b^T(\alpha I + \frac{1}{2}(\mu_b\mu_b^T))\mu_b) \\
        &\sim N(\beta, \beta^2\mu_b^T\alpha I\mu_b + \frac{1}{2}\beta^2\mu_b^T\mu_b\mu_b^T\mu_b) \\
        &\sim N(\beta, \beta^2\alpha + \frac{1}{2}\beta^2) \\
    \end{align}

    Further, for $\beta\mu_c^Tk_b$ the orthonormality of all $\mu_j$ gives:

    \begin{align}
    \beta\mu_c^Tk_b &\sim N(\beta\mu_c^T\mu_b, \beta^2\mu_c^T\Sigma_b\mu_c)\\
        &\sim N(0, \beta^2\mu_c^T(\alpha I + \frac{1}{2}(\mu_b\mu_b^T))\mu_c) \\
        &\sim N(0, \beta^2\mu_c^T\alpha I\mu_c + \frac{1}{2}\beta^2\mu_c^T\mu_b\mu_b^T\mu_c) \\
        &\sim N(0, \beta^2\alpha) \\
    \end{align}

    For the query $q = \beta (\mu_b + \mu_c)$ we get:
    
    \begin{align}
        q^Tk_b &= \beta(\mu_b + \mu_c)^Tk_b \\
        &= \beta\mu_b^Tk_b + \beta\mu_c^Tk_b \\
        &\sim N(\beta, \beta^2(\frac{1}{2} + 2\alpha)) \\
    \end{align}

    Here the two summands are uncorrelated because $\mu_b^T\Sigma_b\mu_c = 0$, so their variances add. The variance of $q^Tk_b$ does not vanish for small $\alpha$ because of the factor $\frac{1}{2} + 2\alpha$, and it can be quite large since $\beta \gg 0$. In particular, applying the exponential function within the softmax can lead to large variations of $\exp(q^Tk_b)$ around the value $\approx \exp(\beta)$ obtained in the setting of subtask 3.1.

    As a consequence, $o = \Sigma^n_{i=1} a_i v_i = \Sigma^n_{i=1} \frac{exp(q^Tk_i)}{\Sigma^n_{j=1}exp(q^Tk_j)} v_i$ also deviates considerably from the solution derived in subtasks 3.1 and 2.2. Specifically, a large value of $\exp(q^Tk_b)$ biases $o$ towards $v_b$, and a small value of $\exp(q^Tk_b)$ biases it towards $v_c$, instead of yielding the average.

    This problem may be resolved in the multi-head setting, as multiple sampling might compensate for the variation. A higher value of $exp(q^Tk_b)$ of a single head would be evened out by other heads that have a lower value in $exp(q^Tk_b)$.
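The effect of the noisy key norm can be illustrated with a small simulation. This is a rough sketch under simplifying assumptions: we take the limit $\alpha \to 0$, so that $k_b \approx (1 + \epsilon)\mu_b$ with $\epsilon \sim N(0, \frac{1}{2})$ and $k_c \approx \mu_c$, and we ignore all keys other than $k_b$ and $k_c$. The helper `weight_on_b`, the value $\beta = 20$ and the thresholds are made up for illustration. The weight on $v_b$ follows from the third hint.

```python
import math
import random

def weight_on_b(beta, eps):
    """Attention weight a_b when q.k_b = beta * (1 + eps) and q.k_c = beta.

    Uses the hint: exp(x) / (exp(x) + exp(y)) = 1 / (1 + exp(y - x)).
    """
    x = beta * eps  # difference of the scores q.k_b - q.k_c
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

rng = random.Random(0)   # fixed seed so the run is repeatable
beta = 20.0              # assumed scale of the query
n_samples = 10_000
weights = [weight_on_b(beta, rng.gauss(0.0, math.sqrt(0.5)))
           for _ in range(n_samples)]

# Fraction of samples where o is almost a copy of v_b or of v_c.
near_extreme = sum(w < 0.05 or w > 0.95 for w in weights) / n_samples
# Fraction of samples where o is close to the intended average.
near_half = sum(0.45 <= w <= 0.55 for w in weights) / n_samples
print(f"near 0 or 1: {near_extreme:.2f}, near 0.5: {near_half:.2f}")
```

In this setting the output is close to the intended average only for a small share of the samples; most of the time it is dominated by either $v_b$ or $v_c$.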


#### ${\color{red}{Comments\ 1.3}}$

${\color{red}{⚠️Comments\ begin⚠️}}$


```
cross-feedback comment section
```


${\color{red}{⚠️Comments\ end⚠️}}$

### Subtask 4: Model Size  
1. Imagine you have an input sequence of  $l$ tokens, how much memory is required and what time complexity do we have for a single self-attention layer? (give your answer in terms of $l$)
2. If you have $N$ layers of self-attention, how  would the memory requirements and the time complexity change? (give your answer in terms of $l$ and $N$)
3. If you have $l=10,000$ and $10$ layers, with the ability to perform $10M$ operations per second, how long would it take to compute the attention output?


**Answer**


1. - Memory: Assuming the model utilizes scaled dot-product attention, the memory requirement for a single self-attention layer can be calculated as approximately $O(l^2)$ for storing the attention weights and $O(l \times d)$ for storing the intermediate representations, where $d$ is the dimension of the embeddings.
   - Time Complexity: The time taken for a single self-attention layer is dominated by matrix multiplications, leading to a time complexity proportional to the square of the sequence length multiplied by the embedding dimension, i.e., $O(l^2 \times d)$.

2. - Memory: With $N$ layers of self-attention, the overall memory demand is the sum of the per-layer requirements, i.e., $O(N \times (l^2 + l \times d))$, assuming the attention weights and intermediate representations of all layers are kept.
   - Time Complexity: The time complexity grows linearly with $N$, i.e., $O(N \times l^2 \times d)$.

3. Given:
   - $l = 10,000$ tokens (input sequence length)
   - $N = 10$ layers
   - Computational capacity = $10,000,000$ ops/sec
   - Let's assume a typical embedding dimension, $d = 512$

   Calculations:
      - Time complexity per layer is approximately $O(l^2 \times d)$. So, for one layer: $10,000^2 \times 512$ operations.
      - Since there are $10$ layers, the total operations are $10 \times (10,000^2 \times 512) = 512,000,000,000$.
      - Total time = $512,000,000,000$ ops / $10,000,000$ ops/sec = $51,200 \text{ seconds} \approx 14.22 \text{ hours}$.
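The arithmetic above can be reproduced in a few lines. The embedding dimension `d = 512` is the same assumption made in the answer; it is not given in the task.

```python
l = 10_000                    # sequence length
n_layers = 10                 # number of self-attention layers
d = 512                       # assumed embedding dimension (not given in the task)
ops_per_second = 10_000_000   # 10M operations per second

ops_per_layer = l**2 * d
total_ops = n_layers * ops_per_layer
seconds = total_ops / ops_per_second
hours = seconds / 3600

print(total_ops)         # 512000000000
print(seconds)           # 51200.0
print(round(hours, 2))   # 14.22
```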



#### ${\color{red}{Comments\ 1.4}}$

${\color{red}{⚠️Comments\ begin⚠️}}$


```
cross-feedback comment section
```


${\color{red}{⚠️Comments\ end⚠️}}$

## **Task 2: Multiple Choice Question Answering** (4 + 3 + 5 + 2 = 14 points)

In this task, you will fine-tune a transformer model on a multiple-choice task, which is the task of selecting the most plausible inputs in a given selection. The dataset used here is [SWAG](https://www.aclweb.org/anthology/D18-1009/), which is available via the Hugging Face [hub](https://huggingface.co/datasets/swag). Check the link for an overview of the dataset. SWAG is a dataset about commonsense reasoning, where each example describes a situation and then proposes four options that could apply for it.
Let's start by installing the necessary packages.

In [1]:
%pip install transformers
%pip install datasets
%pip install evaluate
%pip install accelerate -U
%pip install sentencepiece


In this task, you will use a BERT model with a `MultipleChoice` head from the Hugging Face library and then create your custom model. Recall from the class that the BERT model has an auxiliary next sentence prediction task, in which two sentences are given to BERT separated by a `[SEP]` token and a classifier head decides if the second sentence logically follows the first one. Hugging Face has a `*ForMultipleChoice` architecture that uses the representation of the `[CLS]` token and a linear layer to classify if one sentence follows the other. We first start with this default architecture and then build a more complicated one in a later subtask.

### Subtask 1: Loading and Processing the Data

We use the `datasets` library to download the SWAG dataset, which already contains train, validation, and test splits.

In [21]:
import os
os.environ['CUDA_LAUNCH_BLOCKING'] = "1"
from datasets import load_dataset, load_metric
datasets = load_dataset("swag", "regular")
datasets

Using the latest cached version of the module from /home/l/.cache/huggingface/modules/datasets_modules/datasets/swag/9640de08cdba6a1469ed3834fcab4b8ad8e38caf5d1ba5e7436d8b1fd067ad4c (last modified on Mon Dec 11 12:45:22 2023) since it couldn't be found locally at swag., or remotely on the Hugging Face Hub.


DatasetDict({
    train: Dataset({
        features: ['video-id', 'fold-ind', 'startphrase', 'sent1', 'sent2', 'gold-source', 'ending0', 'ending1', 'ending2', 'ending3', 'label'],
        num_rows: 73546
    })
    validation: Dataset({
        features: ['video-id', 'fold-ind', 'startphrase', 'sent1', 'sent2', 'gold-source', 'ending0', 'ending1', 'ending2', 'ending3', 'label'],
        num_rows: 20006
    })
    test: Dataset({
        features: ['video-id', 'fold-ind', 'startphrase', 'sent1', 'sent2', 'gold-source', 'ending0', 'ending1', 'ending2', 'ending3', 'label'],
        num_rows: 20005
    })
})

Let's look at the first item to see what the data looks like:

In [22]:
datasets["train"][0]

{'video-id': 'anetv_jkn6uvmqwh4',
 'fold-ind': '3416',
 'startphrase': 'Members of the procession walk down the street holding small horn brass instruments. A drum line',
 'sent1': 'Members of the procession walk down the street holding small horn brass instruments.',
 'sent2': 'A drum line',
 'gold-source': 'gold',
 'ending0': 'passes by walking down the street playing their instruments.',
 'ending1': 'has heard approaching them.',
 'ending2': "arrives and they're outside dancing and asleep.",
 'ending3': 'turns the lead singer watches the performance.',
 'label': 0}

**Question:**
Look at the dataset card on the Hugging Face hub and define what each of these fields means, with respect to the task:

*   `sent1`:
*   `sent2`:
*    `ending0`, `ending1`, `ending2` and `ending3`:
*   `label`:




**Answer**


*   `sent1`: the first sentence
*   `sent2`: the beginning of the second sentence
*    `ending0`, `ending1`, `ending2` and `ending3`: candidate endings of the second sentence
*   `label`: the index (0, 1, 2 or 3) of the correct ending



Write a function that displays the context and each of the four choices, following the format


```
Context:...
A-
B-
C-
D-
Ground truth: option ...
```

How you display the results is not important. You should be able to extract different parts of the data correctly and know what each field represents.

In [23]:
def explain_example(example):
  ### your code ###
  alphabet = "ABCD"
  print("Context:", example["startphrase"], "...")
  for i, letter in enumerate(alphabet):
    print(f"{letter}- ... {example['ending' + str(i)]}")
  print("Ground truth: option", alphabet[example["label"]])
  ### your code ###

In [24]:
explain_example(datasets["train"][0])

Context: Members of the procession walk down the street holding small horn brass instruments. A drum line ...
A- ... passes by walking down the street playing their instruments.
B- ... has heard approaching them.
C- ... arrives and they're outside dancing and asleep.
D- ... turns the lead singer watches the performance.
Ground truth: option A


Before feeding the data into the model, we need to preprocess the text with a tokenizer, which splits the inputs into tokens and puts them in the format that the model expects. The model we want to use for this task is `distilbert-base-uncased`, and we need the tokenizer that belongs to it. Complete the code below to load a fast tokenizer for this model. DistilBERT is similar to the BERT model, and we only use this particular architecture for faster training.


In [25]:
from transformers import AutoTokenizer

###your code###
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
###your code###

In [26]:
tokenizer("This is the first sentence!", "And this is the second one.")

{'input_ids': [101, 2023, 2003, 1996, 2034, 6251, 999, 102, 1998, 2023, 2003, 1996, 2117, 2028, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [27]:
# demonstration: There is a [SEP] token at the end of each sentence (102)
tokenizer("This.", "This.")

{'input_ids': [101, 2023, 1012, 102, 2023, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}

In [28]:
# demonstration: [SEP] token only at the end of overall input (102), not between two sentences
tokenizer("This. This.")

{'input_ids': [101, 2023, 1012, 2023, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1]}

In [29]:
# demonstration: [SEP] token at the end of overall input (102) and where manually inserted
# - equivalent to passing the two sentences as separate arguments (see above)
tokenizer("This. [SEP] This.")

{'input_ids': [101, 2023, 1012, 102, 2023, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}

In [30]:
# demonstration: [CLS] is represented as 101 and is automatically inserted at the beginning of the input
tokenizer("This. [CLS] This.")

{'input_ids': [101, 2023, 1012, 101, 2023, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}

In [31]:
# demonstration: you can pass the list of first sentences as the first argument
# and the list of second sentences as the second; they are then merged
# and glued with [SEP] (102) automatically
tokenizer(["A", "A", "D"], ["B", "C", "E"])

{'input_ids': [[101, 1037, 102, 1038, 102], [101, 1037, 102, 1039, 102], [101, 1040, 102, 1041, 102]], 'attention_mask': [[1, 1, 1, 1, 1], [1, 1, 1, 1, 1], [1, 1, 1, 1, 1]]}

Write a function that preprocesses the samples.
The tricky part is to put all the possible pairs of sentences in two big lists before passing them to the tokenizer.
Each **first** sentence has to be repeated 4 times to go with different ending options.
There should be a separator token between the first and second sentence, to follow the BERT input logic.
The final output is a list of 4 elements, one for each choice, where the input is transformed by the tokenizer.
For example, with a list of 2 training examples, the output includes 2 lists, where each contains 4 elements. Each of those elements is the converted input ID of the first sentence followed by the second sentence with different endings.
When calling the `tokenizer`, we use the argument `truncation=True`. This ensures that an input longer than what the selected model can handle is truncated to the maximum length accepted by the model.

**Hint:** Flatten the lists (all choices are flattened into a single list) before feeding them into the tokenizer and unflatten them once again for the final output.
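The flatten and unflatten steps can be sketched without a tokenizer. The toy sentences below are made up for illustration, and pairing the two lists with `zip` is only a stand-in for the tokenizer call; the point is the order of the flattened pairs and how they are regrouped.

```python
N_ENDINGS = 4

# Toy batch (made-up data) in the column-oriented layout of a batched example.
examples = {
    "sent1": ["She opens the door.", "He picks up the ball."],
    "sent2": ["She", "He"],
    "ending0": ["walks in.", "throws it."],
    "ending1": ["flies away.", "eats it."],
    "ending2": ["sings a door.", "paints the sky."],
    "ending3": ["melts.", "sleeps on it."],
}

# Flattened lists: each first sentence appears once per candidate ending.
first = [s for s in examples["sent1"] for _ in range(N_ENDINGS)]
second = [
    header + " " + examples[f"ending{j}"][i]
    for i, header in enumerate(examples["sent2"])
    for j in range(N_ENDINGS)
]

# Stand-in for the tokenizer output: one item per flattened sentence pair.
flat = list(zip(first, second))

# Un-flatten: regroup consecutive runs of N_ENDINGS items, one run per example.
grouped = [flat[i:i + N_ENDINGS] for i in range(0, len(flat), N_ENDINGS)]
print(len(grouped), len(grouped[0]))
print(grouped[1][0])
```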

In [32]:
### your code ###
def unflatten(data: list, n_endings: int) -> list[list]:
    assert len(data) % n_endings == 0
    n_sets = len(data) // n_endings
    return [[data[i * n_endings + j] for j in range(n_endings)]
            for i in range(n_sets)]

n_endings = 4
ending_names = [f"ending{i}" for i in range(n_endings)]
### your code ###
def preprocess_function(examples):
  ### your code ###
    # repeat each first sentence four times
    first_sentences = [[ex]*4 for ex in examples["sent1"]]
    # second sentences possible are combination of header and ending
    question_headers = [[ex]*4 for ex in examples["sent2"]]  # ← why do we need this?
    second_sentences = [[" ".join([header, examples[en][i]]) for en in ending_names]
                        for i, header in enumerate(examples["sent2"])]

    # flatten everything
    # ↑ It would have been easier to define the sentences flattened from the beginning,
    #   but your wish is my command ;)
    first_sentences_flat = [sent for four in first_sentences for sent in four]
    second_sentences_flat = [sent for four in second_sentences for sent in four]

    # tokenize
    tokenized = tokenizer(first_sentences_flat, second_sentences_flat, truncation=True)

    # un-flatten
    # demonstration: attention_mask is garbage, as always ones:
    assert all(el == 1 for mask in tokenized["attention_mask"] for el in mask)

    n_sets = len(first_sentences)
    assert n_sets == len(second_sentences)
    assert n_sets * n_endings == len(second_sentences_flat)

    for key, value in tokenized.items():
      tokenized[key] = unflatten(value, n_endings=n_endings)

    return tokenized
    ### your code ###

In [33]:
datasets["train"]

Dataset({
    features: ['video-id', 'fold-ind', 'startphrase', 'sent1', 'sent2', 'gold-source', 'ending0', 'ending1', 'ending2', 'ending3', 'label'],
    num_rows: 73546
})

In [34]:
examples = datasets["train"][:2]
features = preprocess_function(examples)
print(len(features["input_ids"]), len(features["input_ids"][0]), [len(x) for x in features["input_ids"][0]])# output should be 2 4 [30, 25, 30, 28]

2 4 [30, 25, 30, 28]


In [35]:
print(features)

{'input_ids': [[[101, 2372, 1997, 1996, 14385, 3328, 2091, 1996, 2395, 3173, 2235, 7109, 8782, 5693, 1012, 102, 1037, 6943, 2240, 5235, 2011, 3788, 2091, 1996, 2395, 2652, 2037, 5693, 1012, 102], [101, 2372, 1997, 1996, 14385, 3328, 2091, 1996, 2395, 3173, 2235, 7109, 8782, 5693, 1012, 102, 1037, 6943, 2240, 2038, 2657, 8455, 2068, 1012, 102], [101, 2372, 1997, 1996, 14385, 3328, 2091, 1996, 2395, 3173, 2235, 7109, 8782, 5693, 1012, 102, 1037, 6943, 2240, 8480, 1998, 2027, 1005, 2128, 2648, 5613, 1998, 6680, 1012, 102], [101, 2372, 1997, 1996, 14385, 3328, 2091, 1996, 2395, 3173, 2235, 7109, 8782, 5693, 1012, 102, 1037, 6943, 2240, 4332, 1996, 2599, 3220, 12197, 1996, 2836, 1012, 102]], [[101, 1037, 6943, 2240, 5235, 2011, 3788, 2091, 1996, 2395, 2652, 2037, 5693, 1012, 102, 2372, 1997, 1996, 14385, 2024, 2652, 17852, 13433, 3070, 1998, 12964, 2028, 2187, 2169, 1999, 4248, 1012, 102], [101, 1037, 6943, 2240, 5235, 2011, 3788, 2091, 1996, 2395, 2652, 2037, 5693, 1012, 102, 2372, 1997, 1

We can now apply our function to all the examples in the dataset. We use the `map` method to apply the function on all the elements of all the splits in the dataset (training, validation, and testing).
Note that we passed `batched=True` to leverage the fast tokenizer and use multi-threading to process the texts in batches concurrently.

In [36]:
encoded_datasets = datasets.map(preprocess_function, batched=True)

Our dataset is still not converted to tensors and not padded. This is the job of the `data collator`. A data collator takes a list of examples and converts them to a batch.
There is no data collator in the Hugging Face default library that works on our specific problem. We thus need to write our own one. In this collator:

*  All the inputs/attention masks are flattened.
* A flattened list is passed to the `tokenizer.pad` method to apply dynamic padding, i.e., to pad inputs to the maximum length in the batch. The output has size `(batch_size * 4) x seq_length`.
* Everything needs to be unflattened for the output of the data collator.
* `input_ids` and `labels` should be returned as tensors.
* The output is a dictionary called `batch` that contains features needed for training (`input_ids`, `attention_mask`, `label`).
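The steps listed above can be sketched in plain Python before involving the tokenizer or tensors. In this sketch, the function `collate`, the padding id `PAD_ID = 0` and the toy token ids are assumptions for illustration; the manual padding stands in for `tokenizer.pad`, and nested lists stand in for tensors.

```python
N_CHOICES = 4
PAD_ID = 0  # assumed padding id for this toy example

def collate(features):
    # 1) Flatten: one sequence per (example, choice) pair.
    flat = [ids for ex in features for ids in ex["input_ids"]]
    # 2) Dynamic padding: pad every sequence to the longest one in the batch.
    max_len = max(len(ids) for ids in flat)
    padded = [ids + [PAD_ID] * (max_len - len(ids)) for ids in flat]
    masks = [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in flat]

    # 3) Un-flatten back to batch_size x N_CHOICES x max_len.
    def group(rows):
        return [rows[i:i + N_CHOICES] for i in range(0, len(rows), N_CHOICES)]

    return {
        "input_ids": group(padded),
        "attention_mask": group(masks),
        "labels": [ex["label"] for ex in features],
    }

# Two toy examples with four choices each (made-up token ids).
features = [
    {"input_ids": [[101, 7, 102], [101, 7, 8, 102], [101, 102], [101, 9, 102]],
     "label": 0},
    {"input_ids": [[101, 5, 6, 7, 102], [101, 5, 102], [101, 102], [101, 6, 102]],
     "label": 2},
]
batch = collate(features)
shape = (len(batch["input_ids"]),
         len(batch["input_ids"][0]),
         len(batch["input_ids"][0][0]))
print(shape)  # (2, 4, 5)
```

Unlike in the preprocessing step, the attention mask now carries information: padded positions are marked with 0 so that the model can ignore them.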



In [37]:
from dataclasses import dataclass
from transformers.tokenization_utils_base import PreTrainedTokenizerBase, PaddingStrategy
from typing import Optional, Union
import torch

@dataclass
class MultipleChoiceDataCollator:
    tokenizer: PreTrainedTokenizerBase
    padding: Union[bool, str, PaddingStrategy] = True
    max_length: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None

    def __call__(self, features: list[dict]):
        accepted_keys = ["input_ids", "attention_mask", "label"]
        if len(features[0])>len(accepted_keys):
          features = [{k: v for k, v in i.items() if k in accepted_keys} for i in features]
        ### your code ###

        # flatten
        keys_flatten = {"input_ids", "attention_mask"}
        flattened_features = []
        for row in features:
          for key in keys_flatten:
            assert len(row[key]) == n_endings
          for j in range(n_endings):
            flattened_features.append({
              key: row[key][j] if key in keys_flatten else row[key]
              for key in accepted_keys
            })


        # pad
        batch = self.tokenizer.pad(flattened_features,
                          padding=self.padding,
                          max_length=self.max_length,
                          pad_to_multiple_of=self.pad_to_multiple_of)


        # un-flatten
        for key, flat in batch.items():
          values = []
          for i in range(len(features)):
            unflattened = flat[i*n_endings:(i+1)*n_endings]
            if key not in keys_flatten:
              value = unflattened[0]
              for j in range(n_endings):
                assert unflattened[j] == value
              unflattened = value
            values.append(unflattened)

          batch[key] = torch.tensor(values)

        # very ugly, but it seems to be necessary to rename this to "labels", as
        # otherwise we encounter an unexpected keyword argument error on training
        batch["labels"] = batch["label"]
        del batch["label"]

        ### your code ###
        return batch

In [38]:
accepted_keys = ["input_ids", "attention_mask", "label"]  # note "label" without "s"
features = [{k: v for k, v in encoded_datasets["train"][i].items() if k in accepted_keys} for i in range(2)]
batch=MultipleChoiceDataCollator(tokenizer)(features)
print(batch["input_ids"].shape)
print(batch["attention_mask"].shape)
print(batch["labels"].shape)  # note the "s" here

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


torch.Size([2, 4, 35])
torch.Size([2, 4, 35])
torch.Size([2])


In [39]:
for i in range(4):
  print(batch["input_ids"][1][i])
  print(tokenizer.decode(batch["input_ids"][1][i]))

tensor([  101,  1037,  6943,  2240,  5235,  2011,  3788,  2091,  1996,  2395,
         2652,  2037,  5693,  1012,   102,  2372,  1997,  1996, 14385,  2024,
         2652, 17852, 13433,  3070,  1998, 12964,  2028,  2187,  2169,  1999,
         4248,  1012,   102,     0,     0])
[CLS] a drum line passes by walking down the street playing their instruments. [SEP] members of the procession are playing ping pong and celebrating one left each in quick. [SEP] [PAD] [PAD]
tensor([  101,  1037,  6943,  2240,  5235,  2011,  3788,  2091,  1996,  2395,
         2652,  2037,  5693,  1012,   102,  2372,  1997,  1996, 14385,  3524,
         3254,  2875,  1996, 15724,  1012,   102,     0,     0,     0,     0,
            0,     0,     0,     0,     0])
[CLS] a drum line passes by walking down the street playing their instruments. [SEP] members of the procession wait slowly towards the cadets. [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]
tensor([  101,  1037,  6943,  2240,  5235,  2011, 

#### ${\color{red}{Comments\ 2.1}}$

${\color{red}{⚠️Comments\ begin⚠️}}$


```
cross-feedback comment section
```


${\color{red}{⚠️Comments\ end⚠️}}$

### Subtask 2: Fine-tuning a Hugging Face Model

To fine-tune our model, we first need to download the correct architecture from Hugging Face. Import the correct class for this task and download the pre-trained checkpoint for the base class from `distilbert-base-uncased`. Note that the weights in the classification head are initialized at random.

In [21]:
### your code ###
from transformers import DistilBertForMultipleChoice
model_hf = DistilBertForMultipleChoice.from_pretrained("distilbert-base-uncased")

### your code ###

Some weights of DistilBertForMultipleChoice were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias', 'pre_classifier.weight', 'pre_classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Next, we need to define our `Trainer` and pass in the correct `TrainingArguments` (a class that contains all the attributes to customize the training). Define a `TrainingArguments` object that satisfies the following:


* The output directory `distilbert-base-uncased-swag` is used to save the checkpoints and logs.
* The model is evaluated on the validation set every `300` steps.
* A checkpoint is saved every `600` steps, and no more than 2 checkpoints are kept in total.
* The random seed for training is `77`.
* The batch size for training and evaluation is `48` (if you run out of memory, feel free to change this setting, but indicate it as a comment in your notebook; on a T4 GPU from Google Colab this takes about `13.2GB` of `15.0GB`).
* Training runs for `1800` steps with a learning rate of `5e-5` and a weight decay of `0.01` added to the optimizer.
* The trainer removes the columns from the data that are not used by the model.
* The final checkpoint is the one with the best overall validation metric, not necessarily the last checkpoint.

**Note:** Please use a GPU to train your model. On Colab, you can use a T4 GPU for free.

In [22]:
from transformers import TrainingArguments, Trainer
training_args = TrainingArguments(
    ### your code ###
    output_dir="distilbert-base-uncased-swag",
    evaluation_strategy="steps",
    eval_steps=300,
    save_strategy="steps",
    save_steps=600,
    save_total_limit=2,
    seed=77,
    # reducing batch size to half of the suggested value due to GPU memory pressure
    per_device_train_batch_size=24,
    per_device_eval_batch_size=24,
    max_steps=1800,
    learning_rate=5e-5,  # also the default
    weight_decay=0.01,
    load_best_model_at_end=True,  # reload the best checkpoint (by validation loss) once training ends
)
    ### your code ###


Before we initialize the `Trainer`, we create a function that tells the trainer how to compute the metrics from the predictions. Fill in the `compute_metrics` function to compute the accuracy based on the `predictions`. This object contains the predictions of the model, as well as the ground truth labels.

**Hint 1:** Keep in mind that the output of this function should be a dictionary containing the metric name and value.

**Hint 2:** Consider the shape of the example input. This is similar to the logits produced by the model.

In [23]:
import numpy as np
from transformers import EvalPrediction

def compute_metrics(predictions: EvalPrediction):
    ### your code ###
    preds, label_ids = predictions

    choices = np.argmax(preds, axis=-1)
    return_dict = {"accuracy": np.mean(choices == label_ids)}
    ### your code ###
    return return_dict

In [24]:
preds=np.array([[0.9,0.2,0,0],
                [0.2,0.2,0.9,0.1],
                [0.2,0.9,0,0],
                [0.2,0.1,0.8,0],
                [0.9,0.1,0.8,0],
                [0.2,1,0.4,0],
                [0.2,1,0.4,0.9],
                [1,0.1,0.4,0.3],
                [0.1,0.1,0.9,0.3],
                [0.1,0.1,0.2,1]])
label_ids=np.array([0,3,1,2,0,1,3,0,2,3])
compute_metrics((preds,label_ids))

{'accuracy': 0.8}

Now it's time to pass everything to a `Trainer` object to start the training process. Initialize a `Trainer` object and pass all the necessary information. Keep in mind that we also have the optional metric computation and that we intend to run an evaluation on the validation set during training. The training should take around 30 min on a Google Colab T4 GPU.

In [25]:
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
device

device(type='cuda')

In [26]:
### your code ###
trainer = Trainer(
    model=model_hf,
    args=training_args,
    # data collator removes columns from the data not used by model
    data_collator=MultipleChoiceDataCollator(tokenizer),
    train_dataset=encoded_datasets["train"],
    eval_dataset=encoded_datasets["validation"],
    tokenizer=tokenizer,
)
### your code ###

Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


In [27]:
trainer.train()  # should take around 30 min on Google Colab T4 GPU

| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 300  | No log        | 0.978314        |
| 600  | 1.110400      | 0.914558        |
| 900  | 1.110400      | 0.868163        |
| 1200 | 0.994700      | 0.847908        |
| 1500 | 0.932300      | 0.814828        |
| 1800 | 0.932300      | 0.789878        |


TrainOutput(global_step=1800, training_loss=0.9945607503255208, metrics={'train_runtime': 1076.1998, 'train_samples_per_second': 40.141, 'train_steps_per_second': 1.673, 'total_flos': 2564887381385472.0, 'train_loss': 0.9945607503255208, 'epoch': 0.59})

Save the model in `distilbert-base-uncased-swag/final_model`.

In [28]:
### your code ###
trainer.save_model("distilbert-base-uncased-swag/final_model")
### your code ###

Look at the saved files and answer the following questions (it is possible to answer these questions by writing some code, but we want you to explore the saved files):

**Question:**


1.   What is the vocabulary id for the `[CLS]` and `[MASK]` tokens?
2.   What is the dropout probability for the attention layer?

**Dropout:** With dropout, certain nodes are set to zero during a training run, i.e., temporarily removed from the network. They therefore have no influence on the prediction or on backpropagation. As a result, a new, slightly modified network architecture is built in each run, and the network learns to produce good predictions without certain inputs. Read more [here](https://databasecamp.de/en/ml/dropout-layer-en).
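
As a rough illustration of this mechanism, the following sketch applies dropout to a plain Python list. The helper name `dropout` and the rescaling of the surviving values by `1 / (1 - p)` are assumptions made for this example; the rescaling is the "inverted dropout" convention used by `torch.nn.Dropout`, and the note above does not state it.

```python
import random

def dropout(values, p, training=True, rng=None):
    """Zero each value with probability p and rescale the survivors (hypothetical helper)."""
    if not training or p == 0.0:
        # at inference time dropout is the identity
        return list(values)
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    rng = rng or random.Random()
    scale = 1.0 / (1.0 - p)  # keeps the expected activation unchanged
    return [0.0 if rng.random() < p else v * scale for v in values]

activations = [1.0, 2.0, 3.0, 4.0]
dropped = dropout(activations, p=0.1, rng=random.Random(0))
```

In evaluation mode (`training=False`) the values pass through unchanged, which mirrors what `model.eval()` does for dropout layers.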



**Answer**

```
1. [CLS]: 101
   [MASK]: 103
   (see tokenizer_config.json, added_tokens_decoder)
2. attention_dropout: 0.1
   (see config.json)
```

#### ${\color{red}{Comments\ 2.2}}$

${\color{red}{⚠️Comments\ begin⚠️}}$


```
cross-feedback comment section
```


${\color{red}{⚠️Comments\ end⚠️}}$

### Subtask 3: Fine-tune a Custom Model


In this case, we were lucky that Hugging Face had a pre-implemented architecture available for us to use. However, that is not always the case. Moreover, we might want to experiment beyond the default architectures to find a suitable one for a task. Therefore, it is important to learn to extend the Hugging Face models and train a custom model. The good news is that except for the model architecture the rest of the code can remain as it is.

Design a multiple-choice model as follows:


1.   The config file for a feature extractor (must be a DistilBERT type) is passed during initialization. The config file determines which model is used for feature extraction.
2.   From the `last_hidden_state` of the feature extractor, choose the `[CLS]` embedding (the first one). This embedding is used as the compressed representation of the first and second sentences. During pre-training it is used for classifying whether these two sentences follow one another, making it a good candidate for our task.
3. The `[CLS]` embedding is passed through a linear layer **that does not change the size of the embedding** and then through a tanh nonlinearity.
4. The output of tanh is passed through a dropout layer, where the dropout probability is the same as the dropout probability of the `distilbert` model used as feature extractor.
5. The output of the previous stage is fed into another linear layer that shrinks the embedding dimension to a quarter of the original size, e.g., if the embedding size is 12, the new embedding dimension is 3.
6. The output is followed by another dropout layer (you can use the one from stage 4).
7. Finally, a binary classifier is applied to determine the probability of sentence 1 being followed by sentence 2.
8. The cross-entropy loss is used to compute the loss.

**Hint:** Keep in mind that for a 4-choice system, you classify each of the four solutions independently. However, the final output should group the four logits together. For example, if the input ids have the shape `[2, 4, 35]` (batch size=2, num choices=4, seq len=35), then the logits have the shape `[2, 4]` and the labels have the shape `[2, 1]`.
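
The grouping described in the hint can be sketched without any deep learning library. The example below is only an illustration: the helper names `group_logits` and `cross_entropy` and the score values are made up, and a real model would perform the same steps on tensors.

```python
import math

def group_logits(flat_scores, num_choices):
    """Reshape a flat list of per-ending scores into rows of num_choices."""
    if len(flat_scores) % num_choices != 0:
        raise ValueError("number of scores must be a multiple of num_choices")
    return [flat_scores[i:i + num_choices]
            for i in range(0, len(flat_scores), num_choices)]

def cross_entropy(logits_row, label):
    """Negative log-softmax of the correct choice for one example."""
    m = max(logits_row)  # subtract the maximum for numerical stability
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits_row))
    return log_sum - logits_row[label]

# 2 examples x 4 endings, each ending scored independently (hypothetical values)
flat = [2.0, 0.1, -1.0, 0.3, 0.0, 0.5, 3.0, -0.2]
logits = group_logits(flat, num_choices=4)  # shape [2, 4]
labels = [0, 2]                             # index of the correct ending per example
loss = sum(cross_entropy(row, y) for row, y in zip(logits, labels)) / len(labels)
```

The softmax runs over the four grouped scores of one example, so the endings compete with each other even though each was scored on its own.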



In [79]:
from transformers import DistilBertModel,BertConfig,DistilBertConfig,PretrainedConfig,PreTrainedModel,DistilBertPreTrainedModel
from torch import nn

class CustomMultipleChoice(DistilBertPreTrainedModel):
    def __init__(self, config: PretrainedConfig):
        super().__init__(config)
        ###your code ###
        self.distilbert = DistilBertModel(config)
        self.dense = nn.Linear(config.dim, config.dim)
        self.activation = nn.Tanh()
        self.dropout = nn.Dropout(config.seq_classif_dropout)
        self.dense2 = nn.Linear(config.dim, config.dim//4)
        self.classifier = nn.Linear(config.dim//4, 1)
        ###your code ###


    def forward(
        self,
        input_ids: Optional[torch.Tensor] = None,
        attention_mask: Optional[torch.Tensor] = None,
        labels: Optional[torch.Tensor] = None,
    ):
        """
        input_ids: input sentences converted to ids
        attention_mask: the attention mask
        labels:  Labels for computing the multiple choice classification loss. Indices should be in `[0, ...,num_choices-1]` where `num_choices` is the size of the second dimension of the input tensors.
        """

        num_choices = input_ids.shape[1]

        ###your code ###
        batch_size = input_ids.shape[0]
        seq_len = input_ids.shape[2]
        input_ids = input_ids.view(batch_size*num_choices, seq_len) if input_ids is not None else None
        attention_mask = attention_mask.view(batch_size*num_choices, seq_len) if attention_mask is not None else None


        distilbert_out = self.distilbert(input_ids=input_ids, attention_mask=attention_mask)
        cls_embedding = distilbert_out.last_hidden_state.view(batch_size, num_choices, seq_len, -1)[:, :, 0]

        x = self.dense(cls_embedding)
        x = self.activation(x)
        x = self.dropout(x)
        x = self.dense2(x)
        x = self.dropout(x)
        logits = self.classifier(x).squeeze(-1)
        reshaped_logits = logits.view(batch_size, num_choices)

        loss = None
        if labels is not None:
            loss_fn = nn.CrossEntropyLoss()
            loss = loss_fn(reshaped_logits, labels.view(-1))

        ###your code ###
        return {"loss":loss, "logits":reshaped_logits}

Initialize the feature extractor with `distilbert-base-uncased` and create your custom model.

In [80]:
from transformers import AutoConfig
###your code ###
config = AutoConfig.from_pretrained("distilbert-base-uncased")
model_custom = CustomMultipleChoice(config)
###your code ###

In [81]:
for name, param in model_custom.named_parameters():
    if param.requires_grad and not name.startswith("distilbert."):
      print(name, param.data.shape)

dense.weight torch.Size([768, 768])
dense.bias torch.Size([768])
dense2.weight torch.Size([192, 768])
dense2.bias torch.Size([192])
classifier.weight torch.Size([1, 192])
classifier.bias torch.Size([1])


We keep the same training arguments but change the directory in which we save the model logs, the directory in which we save the model output and the name of the run, to `custom_model`.



In [82]:
###your code ###
training_args = TrainingArguments(
    ### your code ###
    output_dir="custom_model",  # only kwarg changed compared to subtask 2
    evaluation_strategy="steps",
    eval_steps=300,
    save_strategy="steps",
    save_steps=600,
    save_total_limit=2,
    seed=77,
    # reducing batch size to half of the suggested value due to GPU memory pressure
    per_device_train_batch_size=24,
    per_device_eval_batch_size=24,
    max_steps=1800,
    learning_rate=5e-5,  # also the default
    weight_decay=0.01,
    load_best_model_at_end=True,  # reload the best checkpoint (by validation loss) once training ends
)

###your code ###

Initialize the trainer for training the custom model. The training should take around 30 min on a Google Colab T4 GPU.


In [83]:
###your code ###
trainer = Trainer(
    model=model_custom,  # only kwarg changed
    args=training_args,
    # data collator removes columns from the data not used by model
    data_collator=MultipleChoiceDataCollator(tokenizer),
    train_dataset=encoded_datasets["train"],
    eval_dataset=encoded_datasets["validation"],
    tokenizer=tokenizer,
)
###your code ###

Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


In [84]:
trainer.train()  # should take around 30 min on Colab T4 GPU

| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 300  | No log        | 1.392269        |
| 600  | 1.391200      | 1.383739        |
| 900  | 1.391200      | 1.386545        |
| 1200 | 1.381300      | 1.389654        |
| 1500 | 1.372600      | 1.389665        |
| 1800 | 1.372600      | 1.388726        |


TrainOutput(global_step=1800, training_loss=1.379572516547309, metrics={'train_runtime': 1070.6342, 'train_samples_per_second': 40.35, 'train_steps_per_second': 1.681, 'total_flos': 2573635572211968.0, 'train_loss': 1.379572516547309, 'epoch': 0.59})
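
For orientation, the cross-entropy loss of a model that assigns equal probability to each of the four endings is $\ln 4 \approx 1.386$, and the validation losses logged for this run stay close to that value. The short check below computes this reference value. It only restates a property of the loss and does not by itself explain the behaviour of the run. One possibility, an assumption not verified here, is that building the model from a config alone leaves the DistilBERT encoder randomly initialized.

```python
import math

num_choices = 4
# cross-entropy of a uniform prediction over the choices: -log(1 / num_choices)
uniform_loss = -math.log(1.0 / num_choices)
print(round(uniform_loss, 6))  # 1.386294
```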

Save the model in `custom_model/final_model`. Note that with the custom model, you need to save it without the help of the trainer. The trainer would save the configuration, but since this model is not a registered Hugging Face model, only the base model would be saved. Loading the model weights is also affected by this.

In [85]:
###your code ###
model_custom.save_pretrained("custom_model/final_model")
###your code ###

#### ${\color{red}{Comments\ 2.3}}$

${\color{red}{⚠️Comments\ begin⚠️}}$


```
cross-feedback comment section
```


${\color{red}{⚠️Comments\ end⚠️}}$

### Subtask 4: Evaluation and Model Comparison

Often you do not perform the final evaluation right after training, but load the checkpoints and evaluate them later. To this end, load the two models from disk.

In [40]:
from transformers import AutoModelForMultipleChoice,AutoConfig
### your code ###
model_hf = AutoModelForMultipleChoice.from_pretrained("distilbert-base-uncased-swag/final_model")
# the custom architecture is not registered with the Auto* classes, so it is loaded through its own class
model_custom = CustomMultipleChoice.from_pretrained("custom_model/final_model")
### your code ###

To evaluate the data we load the validation split using a data loader and our previously defined data collator. Note that although we had a test split we cannot use it, since there are no labels available for this split (you can check the data to confirm this).

In [41]:
from torch.utils.data import DataLoader
import evaluate

eval_dataloader = DataLoader(encoded_datasets["validation"], batch_size=64, collate_fn=MultipleChoiceDataCollator(tokenizer))

To make things easier, let's use the `evaluate` library from Hugging Face to compute the accuracy metric. Here we load `accuracy` from the `evaluate` library two times, one for the custom model and one for the Hugging Face model. Further, we put the models in eval mode. Complete the code for evaluation using the capabilities of the `evaluate` library to simultaneously compute the metric for both models.


In [45]:
from tqdm import tqdm
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
metric_dict = {"custom": evaluate.load("accuracy"), "hf": evaluate.load("accuracy")}  # use to compute accuracy
models_dict = {"custom": model_custom, "hf": model_hf}  # use to access models

for name, model in models_dict.items():
  model.to(device)
  model.eval()

for i, batch in tqdm(enumerate(eval_dataloader), total=len(eval_dataloader)):
  ### your code ###
  # move the batch to the same device as the models
  batch = {k: v.to(device) for k, v in batch.items()}
  # evaluate both models on each batch
  for name, model in models_dict.items():
    with torch.no_grad():
      outputs = model(**batch)
    # index by key: the custom model returns a plain dict, the Hugging Face model a ModelOutput
    preds = torch.argmax(outputs["logits"], dim=-1)
    metric_dict[name].add_batch(predictions=preds, references=batch["labels"])
acc_hf = metric_dict["hf"].compute()
acc_custom = metric_dict["custom"].compute()
### your code ###
print("Hugging Face Model :", acc_hf)
print("Custom Model :", acc_custom)

  1%|          | 2/313 [00:25<1:07:05, 12.94s/it]

#### ${\color{red}{Comments\ 2.4}}$

${\color{red}{⚠️Comments\ begin⚠️}}$


```
cross-feedback comment section
```


${\color{red}{⚠️Comments\ end⚠️}}$

## **Task 3: Encoder-Decoder Architecture** (5 + 2 + 2 + 5 = 14 points)

We explored an encoder-based model (BERT) in the previous exercise. In this task, we look at another family of transformer architectures, the encoder-decoder. We use the [T5](https://arxiv.org/pdf/1910.10683.pdf) model, presented by Raffel et al. T5 is an encoder-decoder architecture pre-trained on a multi-task mixture of unsupervised and supervised tasks. In this task, we set up a fine-tuning example for question answering using the [SQuAD](https://huggingface.co/datasets/squad) dataset. Since the actual fine-tuning is time-consuming and computationally intensive, we use an already pre-trained model for inference. The main goal is to introduce you to the structure of the fine-tuning and its simplicity with the Hugging Face framework.

To fine-tune the BERT-based models, we usually add a task-specific head. On the other hand, T5 converts all NLP problems into a text-to-text format.  
It is trained using teacher forcing, meaning that we require an input sequence and a corresponding target sequence.


1.   The input sequence is fed to the model using `input_ids` from the tokenizer.
2.   The target sequence is shifted to the right, i.e., prepended by a start-sequence token, and fed to the decoder using the `decoder_input_ids` (the `input_ids` of the encoded target sequence). The target sequence followed by an EOS (end-of-sentence) token, which denotes the end of a generation, corresponds to the `labels`.
3. The task prefix defines what task is expected of T5. For example, we prepend the input sequence with `translate English to German: ` before encoding the input to tell the model to translate. T5 already has a set of pre-defined task prefixes, and it is best to stick to those since they were used during pre-training. With enough training data, you can also introduce your own custom task.
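
The right shift from item 2 can be sketched on plain lists of token ids. This is a simplified illustration, not the library implementation: the helper name `shift_right` and the example ids are hypothetical, and the default ids (start token `0`, pad `0`, EOS `1`) are assumed to match `t5-small` and should be checked against `model.config` and the tokenizer.

```python
def shift_right(label_ids, decoder_start_token_id=0, pad_token_id=0, ignore_index=-100):
    """Build decoder inputs by prepending the start token and dropping the last label."""
    shifted = [decoder_start_token_id] + list(label_ids[:-1])
    # positions that were masked for the loss must become real token ids again
    return [pad_token_id if tok == ignore_index else tok for tok in shifted]

labels = [71, 1712, 1]                   # hypothetical target ids ending in EOS (assumed id 1)
decoder_input_ids = shift_right(labels)  # [0, 71, 1712]
```

At each position the decoder therefore sees the gold tokens up to the previous step and is trained to predict the current one, which is what teacher forcing means here.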


In contrast to the encoder model, where only a single `max_length` is required, for encoder-decoder architectures, one typically defines a `max_source_length` and `max_target_length`, which determine the maximum length of the input and output sequences, respectively. We must also ensure that the padding ID of the `labels` is not taken into account by the loss function. This can be done by replacing them with `-100`, which is the `ignore_index` of the `CrossEntropyLoss`.
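
A minimal sketch of this replacement on plain Python lists is shown below. The helper name `mask_padding`, the example ids, and the pad id of `0` are assumptions for illustration; in practice the pad id should be taken from `tokenizer.pad_token_id`.

```python
def mask_padding(target_ids, pad_token_id=0, ignore_index=-100):
    """Replace padding ids in the targets so the loss skips those positions."""
    return [
        [ignore_index if tok == pad_token_id else tok for tok in row]
        for row in target_ids
    ]

# two hypothetical target sequences padded to length 5
target_ids = [[71, 1712, 1, 0, 0], [2788, 1, 0, 0, 0]]
labels = mask_padding(target_ids)
```

Only the `labels` are masked this way; the encoder inputs keep their padding ids and rely on the attention mask instead.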

### Subtask 1: Data Processing

We start by loading the dataset from the Hugging Face Hub:

In [None]:
from datasets import load_dataset

datasets_squad = load_dataset("squad")
datasets_squad


DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})

In [None]:
print("context ---->" ,datasets_squad["train"][0]["context"])
print("question ---->",datasets_squad["train"][0]["question"])
print("answers ---->",datasets_squad["train"][0]["answers"])

context ----> Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.
question ----> To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?
answers ----> {'text': ['Saint Bernadette Soubirous'], 'answer_start': [515]}


Now let's load the needed pre-trained tokenizer for `t5-small`, which is the smallest T5 model. Set the maximum sequence length to `512`.

In [None]:
import torch
### your code ###
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

t5_tokenizer = AutoTokenizer.from_pretrained("t5-small", model_max_length=512)
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
### your code ###

The next step is to pre-process the dataset using the tokenizer to convert the sequences to IDs and add the special tokens.
T5 is based on the SentencePiece tokenizer, and the end of sentence token is denoted by `</s>`.
Complete the function `add_eos_to_examples` to format the input and target sequence. Your input as `input_text` should have the format `question:{question_text} context:{context_text} <EOS_Token>` and your target as `target_text` should have the format `{answer_text} <EOS_Token>`.

In [None]:
def add_eos_to_examples(example):
    #print(example)
    ### your code ###
    example['input_text'] = "question:" + example["question"] + " context:" + example["context"] + "</s>"
    example['target_text'] = example["answers"]["text"][0] + "</s>"
    ### your code ###
    return example

Use the `map` function to process the data, and do not set the `batched` argument.

In [None]:
### your code ###
encoded_squad = datasets_squad.map(add_eos_to_examples)
### your code ###


In [None]:
print(encoded_squad["train"][0]["input_text"])
print(encoded_squad["train"][0]["target_text"])

question:To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France? context:Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.</s>
Saint Bernadette Soubirous</s>


Complete the function `convert_to_features` that takes in the examples from the dataset and tokenizes them using the T5 tokenizer. However, our answers in this dataset are relatively short and do not require `512` tokens, in contrast to the input sequence which is a combination of question and context paragraphs and is usually long. To this end, we want to truncate the input sequence at `512` and the target sequence at `16`. If any input or target is smaller than the specified length, make sure you pad them. Finally, convert everything to PyTorch tensors to be easily used by the data collator and place them in the dictionary `encodings`.

In [None]:
def convert_to_features(examples):
    ### your code ###
    # reuse the tokenizer loaded above instead of re-loading it on every call
    encodings = {
        'input_ids': t5_tokenizer(examples['input_text'], truncation=True, padding='max_length', max_length=512, return_tensors="pt").input_ids,
        'target_ids': t5_tokenizer(examples['target_text'], truncation=True, padding='max_length', max_length=16, return_tensors="pt").input_ids
    }
    ### your code ###
    return encodings

Use the `map` function to process the data.

**Takes long (>10min)**

In [None]:
### your code ###
encoded_squad = encoded_squad.map(convert_to_features)
### your code ###

Map:   0%|          | 1/87599 [00:00<4:46:59,  5.09 examples/s]

{'id': '5733be284776f41900661182', 'title': 'University_of_Notre_Dame', 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.', 'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?', 'answers': {'text': ['Saint Bernadette Soubirous'], 'answer_start': [515]}, 'input_text': 'question:To whom did the Virgin Mary al

Map:   0%|          | 3/87599 [00:00<4:03:57,  5.98 examples/s]

{'id': '5733be284776f41900661180', 'title': 'University_of_Notre_Dame', 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.', 'question': 'The Basilica of the Sacred heart at Notre Dame is beside to which structure?', 'answers': {'text': ['the Main Building'], 'answer_start': [279]}, 'input_text': 'question:The Basilica of the Sacred heart a

Map:   0%|          | 5/87599 [00:00<3:57:50,  6.14 examples/s]

{'id': '5733be284776f4190066117e', 'title': 'University_of_Notre_Dame', 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.', 'question': 'What sits on top of the Main Building at Notre Dame?', 'answers': {'text': ['a golden statue of the Virgin Mary'], 'answer_start': [92]}, 'input_text': 'question:What sits on top of the Main Building at N

Map:   0%|          | 7/87599 [00:01<4:01:10,  6.05 examples/s]

{'id': '5733bf84d058e614000b61bf', 'title': 'University_of_Notre_Dame', 'context': "As at most other universities, Notre Dame's students run a number of news media outlets. The nine student-run outlets include three newspapers, both a radio and television station, and several magazines and journals. Begun as a one-page journal in September 1876, the Scholastic magazine is issued twice monthly and claims to be the oldest continuous collegiate publication in the United States. The other magazine, The Juggler, is released twice a year and focuses on student literature and artwork. The Dome yearbook is published annually. The newspapers have varying publication interests, with The Observer published daily and mainly reporting university and other news, and staffed by students from both Notre Dame and Saint Mary's College. Unlike Scholastic and The Dome, The Observer is an independent publication and does not have a faculty advisor or any editorial oversight from the University. In 1987, wh

                                                               

KeyboardInterrupt: 

In [None]:
encoded_squad #new columns are added

<map at 0x28c32eef0>

Interestingly, although we specified PyTorch tensors as output, the type of the `input_ids` is still a list. To remedy this problem, you need to explicitly set the type of the column that contains PyTorch tensors.

In [None]:
type(encoded_squad["train"][0]["input_ids"])

TypeError: 'map' object is not subscriptable

In [None]:
### your code ###
encoded_squad.set_format(type='torch', columns=['input_ids', 'target_ids'])
### your code ###
type(encoded_squad["train"][0]["input_ids"])

In [None]:
print("Shape of the input_ids:",encoded_squad["train"][0]["input_ids"].shape)
print("Shape of the target_ids:",encoded_squad["train"][0]["target_ids"].shape)

The final step in the data processing is the creation of the data collator to
prepare `labels` from `target_ids` and return examples with keys as expected by the forward method of T5.
This is necessary because the trainer directly passes this dict as argument to the model so you need to check the input of T5 and rename the column based on that.
`input_ids`, `target_ids`, `attention_mask`, and `target_attention_mask` need to be stacked in a batch and the pad tokens in the target need to be set to `-100` to avoid loss computation.
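As a rough illustration of the collation step described above, the following framework-free sketch builds `labels` from `target_ids` and renames the keys. It uses nested Python lists instead of stacked PyTorch tensors so that it runs without any dependencies. The function name `collate`, the constant `IGNORE_INDEX`, and the default `pad_token_id=0` are assumptions made for this sketch; the output keys follow the argument names of the T5 forward method (`input_ids`, `attention_mask`, `labels`, `decoder_attention_mask`).

```python
from typing import Dict, List

IGNORE_INDEX = -100  # label value that the loss ignores, as stated in the task


def collate(features: List[Dict[str, List[int]]],
            pad_token_id: int = 0) -> Dict[str, List[List[int]]]:
    """Group already-padded examples into one batch (hypothetical helper).

    Assumes every example has been padded to the same length beforehand.
    """
    # Replace pad tokens in the targets so they do not contribute to the loss.
    labels = [
        [IGNORE_INDEX if tok == pad_token_id else tok for tok in f["target_ids"]]
        for f in features
    ]
    return {
        "input_ids": [f["input_ids"] for f in features],
        "attention_mask": [f["attention_mask"] for f in features],
        "labels": labels,
        "decoder_attention_mask": [f["target_attention_mask"] for f in features],
    }


example = {
    "input_ids": [5, 6, 0],
    "attention_mask": [1, 1, 0],
    "target_ids": [7, 1, 0, 0],
    "target_attention_mask": [1, 1, 0, 0],
}
batch = collate([example, example])
print(batch["labels"])
```

In the notebook itself, the list comprehensions would typically be replaced by `torch.stack` over the per-example tensors, followed by masking the pad positions in the stacked target tensor.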

In [None]:
from dataclasses import dataclass
from transformers import DataCollator
from transformers.tokenization_utils_base import PreTrainedTokenizerBase, PaddingStrategy
from typing import Optional, Union
@dataclass
class T2TDataCollator:
    tokenizer: PreTrainedTokenizerBase
    padding: Union[bool, str, PaddingStrategy] = -100
    max_length: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None
    def __call__(self, batch):

      ### your code ###


        feature_dict = DataCollator(tokenizer=self.tokenizer, padding=self.padding, max_length=self.max_length, pad_to_multiple_of=self.pad_to_multiple_of)(batch)
        return feature_dict
      ### your code ###


In [None]:
accepted_keys = ['input_text', 'target_text', 'input_ids', 'attention_mask', 'target_ids', 'target_attention_mask']
features = [{k: v for k, v in encoded_squad["train"][i].items() if k in accepted_keys} for i in range(2)]
batch=T2TDataCollator(t5_tokenizer)(features)
print(batch["input_ids"].shape)
print(batch["attention_mask"].shape)
print(batch["labels"].shape)

NameError: name 't5_tokenizer' is not defined

#### ${\color{red}{Comments\ 3.1}}$

${\color{red}{⚠️Comments\ begin⚠️}}$

Point distribution \
✅ 0.5 point if the tokenizer is loaded correctly. \
Note: Slightly different syntax (with `AutoTokenizer`) but virtually the same. \
✅ 0.5 if the EOS sentence and prefixes are added correctly in `add_eos_to_examples`. \
✅ 0.5 if the map function is used correctly for `add_eos_to_examples`. \
✅ 1 point if the input and target are tokenized and padded correctly in `convert_to_features`. \
Note: Slightly different syntax but virtually the same. \
❌ 0.5 point for use of the map function for `convert_to_features`. \
Note: They didn't set batched=True, which by default is False \
❌✅ (0.25) 0.5 point for conversion of the columns to tensors. \
Note: They called the right method, `set_format`, but did not specify the correct columns.


✅ 0.5 point if the tensors are stacked properly, look at the output shapes for hints. \
Note: Their code does not run because of cascading earlier errors, so we cannot check the shapes. \
      Moreover, their implementation is substantially different from ours and from the one in the solutions. \
      Still, I read through the data_collator.py script, where the DataCollator class is defined, and it seems to me that the stacking is done properly by that call. \
      Their implementation is less transparent than the one in the solutions, but it is also much more compact. \
      I suggest reading the definition of `torch_default_data_collator(...)`, which is invoked by default because they did not pass a `return_tensors` parameter. \
❌ 0.5 points if the names are correctly defined. \
Note: Using the DataCollator, they had no control over this. \
✅ 0.5 point if the label pads are set to `-100`. \
Note: They did it straight away in the beginning.

Points: \
3.75/5.0

${\color{red}{⚠️Comments\ end⚠️}}$

### Subtask 2: Training

For training and inference, we can use `T5ForConditionalGeneration`, which includes the language modeling head on top of the decoder. Load the `t5-small` model.

In [None]:
### your code ###
from transformers import AutoModelForSeq2SeqLM

t5 = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

### your code ###

Next, similar to the previous task, we initialize the training arguments. Note that this time we are using `Seq2SeqTrainingArguments` for a `Seq2SeqTrainer`. Set the parameters for training as follows:


*   T5 doesn't support GPU and TPU evaluation for now, so we only focus on training. You do not need to pass any parameters for evaluation setup.
*   The output directory should be named `t5-squad`.
* The T5 models need a slightly higher learning rate than the default one set in the `Trainer` when using the `AdamW` optimizer. Set the learning rate to `1e-4` and the regularization parameter to `0.01`.
* Random seed should be `77`, and we train for a maximum of `200` steps and save a checkpoint every `100` steps. A complete training of the T5 model requires far more than `200` steps, however, that is beyond the scope of this assignment.
* T5 models require a large batch size. The default model was trained with a batch size of `128`. However, we cannot fit that into a single GPU, therefore we use gradient accumulation. Set the batch size to `32` and choose the gradient accumulation step to reach the effective batch size of `128`.
* Make sure that your trainer does not remove unused columns during training, as this will cause a runtime error later on.


**Gradient accumulation** is a technique that simulates a larger batch size by accumulating gradients from multiple small batches before performing a weight update.
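To make the idea concrete, here is a small dependency-free sketch on a toy one-parameter regression. It checks that averaging the gradients of equally sized micro-batches gives the same gradient as one large batch, and it derives the accumulation steps for the batch sizes named in the task. The toy model, the data, and the helper names `grad_mse` and `accumulated_grad` are assumptions for illustration only and are not part of the assignment code.

```python
from typing import List, Tuple

Example = Tuple[float, float]  # (x, y) pair


def grad_mse(w: float, batch: List[Example]) -> float:
    """Gradient of the mean squared error of the model y = w * x over one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w: float, data: List[Example], micro_batch_size: int) -> float:
    """Accumulate micro-batch gradients before a (hypothetical) weight update."""
    micro_batches = [
        data[i:i + micro_batch_size] for i in range(0, len(data), micro_batch_size)
    ]
    total = 0.0
    for micro_batch in micro_batches:
        # Scale each contribution so the sum equals the mean over all micro-batches.
        total += grad_mse(w, micro_batch) / len(micro_batches)
    return total


data = [(float(x), 3.0 * x) for x in range(1, 9)]  # 8 toy examples
full_batch_grad = grad_mse(0.5, data)
accumulated = accumulated_grad(0.5, data, micro_batch_size=2)
print(full_batch_grad, accumulated)

# Batch sizes from the task description: 32 per device, 128 effective.
per_device_batch_size = 32
target_effective_batch_size = 128
gradient_accumulation_steps = target_effective_batch_size // per_device_batch_size
print(gradient_accumulation_steps)
```

The equivalence holds exactly only when all micro-batches have the same size and the loss is a mean over examples; with unevenly sized micro-batches the contributions would need to be weighted by their sizes.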



In [None]:
from transformers import TrainingArguments
### your code ###
training_args = TrainingArguments(output_dir="./t5_squad",
                                  learning_rate=1e-4,
                                  weight_decay=0.01,
                                  per_device_train_batch_size=32,
                                  gradient_accumulation_steps=4,
                                  num_train_epochs=1,
                                  max_steps=200,
                                  save_steps=100,
                                  load_best_model_at_end=True,
                                  metric_for_best_model="eval_loss",
                                  remove_unused_columns=False,
                                  greater_is_better=False,
                                  random_seed=77)


    ### your code ###


Once again make sure that you are using GPU before running the cell below.
Initialize your `Seq2SeqTrainer` with the inputs necessary for training. The training should take around 15 minutes on a Google Colab T4 GPU.


In [None]:
# Initialize our Trainer
from transformers import Trainer
### your code ###
trainer = Trainer(TrainingArguments=training_args,
                  model=t5,
                  data_collator=T2TDataCollator(t5_tokenizer),
                  train_dataset=encoded_squad["train"],
                  eval_dataset=encoded_squad["validation"],
                  device=torch.device("mps") if torch.cuda.is_available() else torch.device("cpu"))


    ### your code ###


In [None]:
trainer.train()

#### ${\color{red}{Comments\ 3.2}}$

${\color{red}{⚠️Comments\ begin⚠️}}$


Point distribution \
✅ 0.5 point for initializing the correct model \
Note: Different call, but AutoModelForSeq2SeqLM loads the same underlying model class as T5ForConditionalGeneration. \
❌✅ (-0.25 points) 1 point if all the training parameters are set correctly. \
Note: Using TrainingArguments instead of Seq2SeqTrainingArguments isn't entirely wrong, but it lacks all the additional functionalities that Seq2SeqTrainingArguments specifically has to tackle seq2seq tasks. \
❌✅ (-0.25 points) 0.5 point if the trainer is correctly initiated. \
Note: Same as for previous point. They used Trainer instead of Seq2SeqTrainer. \

1.50/2.00 points


${\color{red}{⚠️Comments\ end⚠️}}$

### Subtask 3: Inference

Our trained model has seen far too few instances to make a coherent prediction. To this end, we load an already trained checkpoint from Hugging Face and perform inference. Load this [model](https://huggingface.co/mrm8488/t5-base-finetuned-squadv2) and the respective tokenizer. Note that we are loading a `base` model that is slightly larger than `t5-small`.

In [1]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, pipeline
### your code ###
t5_tokenizer = AutoTokenizer.from_pretrained("t5-small", model_max_length=512)
t5_model = AutoModelForSeq2SeqLM.from_pretrained("./t5_squad/t5-small")

### your code ###


At inference time for T5, it is recommended to use the `generate()` function, which auto-regressively generates the decoder output. Complete the code for the `get_answer` function, which is given a model, a tokenizer, and a question and context pair, and generates the answer from the given context. The output should be the answer to the given question in natural text (without the special tokens).

**Hint:** Many of the steps are similar to how you prepared your input data for the model.
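To illustrate what "auto-regressively" means here, the following toy sketch runs a greedy decoding loop and then strips the special tokens before producing text. It is not the Hugging Face implementation of `generate()`; the three-entry vocabulary, the stand-in `toy_next_token` function, and the helper names are all hypothetical and only mimic the control flow (start token, one token per step, stop at end-of-sequence).

```python
from typing import Callable, List, Set


def greedy_decode(next_token: Callable[[List[int]], int],
                  start_id: int,
                  eos_id: int,
                  max_new_tokens: int = 20) -> List[int]:
    """Generate one token at a time, feeding the output so far back in."""
    output = [start_id]
    for _ in range(max_new_tokens):
        token = next_token(output)  # a real model would pick the argmax of its logits
        output.append(token)
        if token == eos_id:  # stop as soon as end-of-sequence is produced
            break
    return output


def strip_special(ids: List[int], special_ids: Set[int]) -> List[int]:
    """Drop special tokens so that only the natural-text answer remains."""
    return [t for t in ids if t not in special_ids]


# Hypothetical toy vocabulary and "model": emits one answer token, then EOS.
VOCAB = {0: "<pad>", 1: "</s>", 2: "Harry"}


def toy_next_token(prefix: List[int]) -> int:
    return 2 if len(prefix) == 1 else 1


ids = greedy_decode(toy_next_token, start_id=0, eos_id=1)
answer = " ".join(VOCAB[t] for t in strip_special(ids, {0, 1}))
print(ids, answer)
```

With a real tokenizer, the final two steps correspond to decoding the generated ids while skipping special tokens.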

In [None]:
def get_answer(tokenizer,model, question, context):
  ### your code ###
  t5_pipeline = pipeline("text2text-generation", model=model, tokenizer=tokenizer)


  answer = t5_pipeline(question, context)[0]['generated_text']
  ### your code ###
  return answer

Let's try it with an example.

In [None]:
context = "Sarah has joined NLP for transformers class and is working on her research project with the support of Harry."
question = "Who is supporting Sarah?"

get_answer(t5_tokenizer,t5_model,question, context) ###your answer should be "Harry"

In [None]:
context = "TPUs are more power efficient in comparison to GPUs making them a better choice for machine learning projects."
question = "What is better for machine learning projects?"

get_answer(t5_tokenizer,t5_model,question, context)###your answer should be "TPUs"

#### ${\color{red}{Comments\ 3.3}}$

${\color{red}{⚠️Comments\ begin⚠️}}$

Point distribution \
❌ 0.5 points for initializing the correct model and tokenizer. \
Note: The loaded tokenizer is wrong. They may have downloaded the right model and loaded it from a local path, but we don't have access to that. Indeed, they load it locally: \
\>\>\> t5_model = AutoModelForSeq2SeqLM.from_pretrained("./t5_squad/t5-small")

✅ 0.5 points for preparing the input correctly. \
Note: Nice approach with the pipeline, very elegant. We checked and it should achieve the same results as the provided solution. \
❌ 1 point for generation and decoding using the tokenizer. \
Note: The notebook we received contains no output in the corresponding cells. Since they did not even provide a working initialization of the model, we will unfortunately not run the code for them.

0.5/2.0 Points


${\color{red}{⚠️Comments\ end⚠️}}$

### Subtask 4: T5 Paper

To answer questions of the final subtask you need to have a general overview of the [T5 paper](https://arxiv.org/pdf/1910.10683.pdf).



1.   Describe what a "text-to-text format" is and how T5 processes input and output for text classification tasks. What are the possible complications with a predefined set of classes?
2.   Describe the "masked language modeling" and "word dropout" unsupervised objectives with sentinel tokens. Give an example of how this would look in a single sentence.
3. Explain "fully-visible", "causal" and "causal masking with prefix" masking.
4. Briefly describe "adapter layers" and "gradual unfreezing" as methods for fine-tuning on fewer parameters.



**Answer**

1. **Text-to-Text Format**: This format refers to a method in natural language processing (NLP) where both the input and output are in text form. Unlike other formats where input might be text and output could be a label, category, or action, in text-to-text, everything is converted into a textual representation. This approach simplifies the processing model since it deals with only one type of data - text.

2. **T5 (Text-to-Text Transfer Transformer) Processing for Text Classification Tasks**: T5 treats every NLP problem as a text-to-text problem. For text classification tasks, both the input and the desired output (e.g., a class label) are formulated as text.

   - Input Processing: The input text is prefixed with a task-specific identifier (like "classify:") to provide context to the model about what task it needs to perform. This text is then tokenized and fed into the model.
   
   - Output Processing: T5 generates text as output, which in the case of classification, would be the name of the class. This output is interpreted as the class to which the input text belongs.

3. **Complications with a Predefined Set of Classes**:

   - **Limited Flexibility**: The model can only classify inputs into the predefined classes, which might not cover all possible or relevant categories, leading to misclassification or oversimplification.
   
   - **Bias and Imbalance**: If the predefined classes are not representative of the diversity in the real world, the model could be biased. This is especially problematic in datasets where some classes are overrepresented compared to others.
   
   - **Adaptability**: In dynamic domains where new categories might emerge over time (like in news topics), a fixed set of classes can render the model outdated or less effective.

**Answer**

**Masked Language Modeling (MLM):** This is a technique used in language model training where some of the words in a sentence are randomly masked or hidden, and the model's objective is to predict these masked words. This approach helps the model learn context, grammar, and word relationships.

**Word Dropout:** This involves randomly dropping words from the input during training, forcing the model to make predictions or understand context without relying on all available information. It's a form of regularization that prevents overfitting and encourages the model to learn more robust features.

**Sentinel Tokens:** These are special tokens used to represent masked words. In MLM, original words are replaced with sentinel tokens, and the model predicts the original word based on the context provided by the surrounding words.

**Example:**
- Original Sentence: "The quick brown fox jumps over the lazy dog."
- With MLM and Sentinel Tokens: "The quick [MASK] fox jumps over the [MASK] dog."
- With Word Dropout: "The quick brown jumps over the lazy."

In the MLM example, the model would aim to predict the words "brown" and "lazy" based on the context provided by the rest of the sentence. The sentinel tokens (e.g., [MASK]) indicate the positions of the masked words. In the Word Dropout example, the model sees a sentence with missing words ("fox" and "dog") and must understand or process the sentence without them.
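For comparison, the T5 paper illustrates its sentinel-based objective with numbered sentinels rather than a single `[MASK]` symbol: each dropped span is replaced by its own sentinel in the input, and the target lists each sentinel followed by the tokens it replaced, closed by a final sentinel. The sketch below reproduces that formatting on the example sentence from the paper. The function name `span_corrupt` and the whitespace tokenization are assumptions for illustration, and the `<extra_id_N>` spelling follows the Hugging Face T5 tokenizer (the paper writes `<X>`, `<Y>`, `<Z>`).

```python
from typing import List, Tuple


def span_corrupt(tokens: List[str], spans: List[Tuple[int, int]]) -> Tuple[str, str]:
    """Replace token spans by sentinels and build the matching target.

    `spans` must be sorted, non-overlapping [start, end) index ranges.
    """
    inputs: List[str] = []
    targets: List[str] = []
    cursor = 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        inputs.extend(tokens[cursor:start])  # keep the text before the span
        inputs.append(sentinel)              # one sentinel per dropped span
        targets.append(sentinel)
        targets.extend(tokens[start:end])    # the dropped tokens go to the target
        cursor = end
    inputs.extend(tokens[cursor:])
    targets.append(f"<extra_id_{len(spans)}>")  # final sentinel closes the target
    return " ".join(inputs), " ".join(targets)


tokens = "Thank you for inviting me to your party last week .".split()
corrupted, target = span_corrupt(tokens, [(2, 4), (8, 9)])
print(corrupted)
print(target)
```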

**Answer**

1. **Fully-Visible Masking**: In fully-visible masking, every token in the input sequence can attend to every other token in the sequence. This means there's no restriction on what each token can see during the processing.

2. **Causal Masking**: Causal masking, also known as autoregressive masking, is employed for text generation tasks. Here, a token can only attend to previous tokens in the sequence, not the future ones. This is akin to reading or generating text left-to-right; each word only "knows" about the words that came before it, not the ones that follow. This type of masking ensures that the model generates text in a forward direction, predicting one word at a time based on the preceding context.

3. **Causal Masking with Prefix**: This is a variation of causal masking where the model is provided with a prefixed context or initial sequence of tokens. The model then continues generating text based on this prefixed input. The masking still ensures that each token can only attend to the prefix and the previously generated tokens, not to any of the subsequent tokens. This approach is useful in tasks where you want the model to start with a specific context or theme and then generate content that logically follows from that starting point.
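The three patterns can be written down as small 0/1 matrices, where entry `mask[i][j] == 1` means that position `i` may attend to position `j`. The sketch below is a minimal illustration with a hypothetical helper `attention_mask`; for the prefix variant it follows the description in the T5 paper, where positions inside the prefix are visible to every position (including earlier prefix positions) and the remaining positions are masked causally.

```python
from typing import List


def attention_mask(n: int, kind: str, prefix_len: int = 0) -> List[List[int]]:
    """Build an n x n attention mask; 1 = may attend, 0 = masked out."""
    mask = []
    for i in range(n):
        row = []
        for j in range(n):
            if kind == "fully_visible":
                allowed = True                      # every token sees every token
            elif kind == "causal":
                allowed = j <= i                    # only current and earlier tokens
            elif kind == "prefix":
                allowed = j <= i or j < prefix_len  # the prefix is visible to all
            else:
                raise ValueError(f"unknown mask kind: {kind}")
            row.append(int(allowed))
        mask.append(row)
    return mask


for kind in ("fully_visible", "causal", "prefix"):
    print(kind)
    for row in attention_mask(4, kind, prefix_len=2):
        print(row)
```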

**Answer**

1. **Adapter Layers**: Adapter layers are small neural network modules inserted between the layers of a pre-trained model. When fine-tuning on a specific task, only these adapter layers are trained, while the original pre-trained layers are kept frozen. This method significantly reduces the number of parameters that need to be trained, making the fine-tuning process more efficient and requiring less computational resources. Adapters enable the model to adapt to new tasks or datasets with minimal changes to the overall network architecture.

2. **Gradual Unfreezing**: Gradual unfreezing is a technique where layers of a pre-trained model are unfrozen and fine-tuned incrementally, rather than all at once. Initially, only the top layers (the ones closest to the output) are unfrozen and trained. As training progresses, more layers are gradually unfrozen. This approach allows the model to retain much of its pre-trained knowledge while adapting to new data in a controlled manner. It helps in preventing catastrophic forgetting and ensures that the fine-tuning process is more stable and efficient, especially when dealing with a limited dataset or fewer parameters.
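A gradual unfreezing schedule of the kind described above can be sketched as a function from the training step to the set of trainable layers. This is only a schematic, assuming a single stack of layers indexed from bottom (0) to top and a fixed number of steps per stage; the helper name `trainable_layers` and the numbers used are hypothetical and not taken from the paper.

```python
from typing import List


def trainable_layers(num_layers: int, step: int, steps_per_stage: int) -> List[int]:
    """Return the indices of unfrozen layers at a given training step.

    Training starts with only the top layer unfrozen; one more layer
    (moving downward) is unfrozen after every `steps_per_stage` steps.
    """
    num_unfrozen = min(num_layers, 1 + step // steps_per_stage)
    return list(range(num_layers - num_unfrozen, num_layers))


for step in (0, 150, 250, 1000):
    print(step, trainable_layers(num_layers=4, step=step, steps_per_stage=100))
```

In a framework such as PyTorch, the returned indices would be used to toggle the `requires_grad` flag of the parameters in the corresponding layers.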

#### ${\color{red}{Comments\ 3.4}}$

${\color{red}{⚠️Comments\ begin⚠️}}$


Point distribution

✅ 1.5 point for part 1.

✅ 1 point for part 2.

✅ 1.5 point for part 3.

✅ 1 point for part 4.

Note: All of them are very extensive answers.


5.0/5.0 points



${\color{red}{⚠️Comments\ end⚠️}}$