# Praktikum 2 - Transformer

Note: the praktikums are for your own practice. They will **not be graded**!

You have around one week to work on it. Then we will go over the solutions together in the praktikum time slots!

Remember to make a copy of this notebook to your own Colab. Changes made directly here will not be stored!

In this exercise, you'll implement a basic encoder-only Transformer architecture with PyTorch. We will start with the basic building blocks and then integrate them into a fully fledged Transformer model.


<!-- We train the model to solve a POS-Tagging problem (more on that later). In the previous exercise, you implemented your work in numpy. Now, we will switch to PyTorch, which will track the gradients for us and allows us to focus more on the network itself. -->

**Notice**: Whenever you see an ellipsis `...`, you're supposed to insert code or text answers.

In [1]:
import tensorflow as tf

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from tensorflow import keras
from tensorflow.keras import layers

print(tf.__version__)

tf.random.set_seed(42)


2.15.1


Let's start with a few basic functions that we will need throughout the exercise, namely **Softmax** and **ReLU**.

$\text{Softmax}(x_{i}) = \frac{\exp(x_i)}{\sum_j \exp(x_j)}$

$\text{ReLU}(x) = \max(0, x)$
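The two formulas above can be sketched in a few lines. This is a minimal NumPy illustration rather than the PyTorch solution of the exercise; the function names and the max-subtraction trick for numerical stability are choices made here, not something the exercise prescribes.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    # Subtracting the maximum does not change the result
    # but keeps exp() from overflowing for large inputs.
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

def relu(x: np.ndarray) -> np.ndarray:
    # Element-wise max(0, x).
    return np.maximum(0.0, x)

probs = softmax(np.array([1.0, 2.0, 3.0]))      # non-negative, sums to 1
activated = relu(np.array([-1.5, 0.0, 2.0]))    # negative entries become 0
```

The same functions work on batched inputs because both operate along the last axis by default.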

## Transformer Block

A typical transformer block consists of the following:
- Multi-Head Attention
- Layer Normalization
- Linear Layer
- Residual Connections

<center><img src="https://i.imgur.com/ZKgcoe4.png" alt="transformer block visualization" width="200">

In the next few subsections, we will build these basic building blocks.

### Multi-Head Attention

Multi-Head Attention concatenates the outputs of several so-called **attention heads**.

$\textrm{MHA}(Q,K,V) = \textrm{Concat}(H_1,...,H_h)$

<center><img src="https://www.tensorflow.org/images/tutorials/transformer/multi_head_attention.png" width=300>

One attention head consists of linear projections for each of $Q, K$ and $V$ and an attention mechanism called **Scaled Dot-Product Attention**. The attention mechanism scales down the dot products by $\sqrt{d_k}$.

$\textrm{Attention}(Q,K,V)=\textrm{softmax}(\frac{QK^T}{\sqrt{d_k}})V$



If we assume that $q$ and $k$ are $d_k$-dimensional vectors whose components are independent random variables with mean $0$ and variance $1$, then their dot product has a mean of $0$ and a variance of $d_k$. A variance of $1$ is preferred, which is why we scale the dot products down by $\sqrt{d_k}$.
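The effect of the scaling can be checked empirically. The following is a small NumPy sketch under the assumption that all components are drawn independently from a standard normal distribution (mean $0$, variance $1$); the sample size and `d_k = 64` are arbitrary illustration values.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 64
n_samples = 20_000

# Components drawn independently with mean 0 and variance 1.
q = rng.standard_normal((n_samples, d_k))
k = rng.standard_normal((n_samples, d_k))

dots = np.sum(q * k, axis=1)     # one dot product per sample
scaled = dots / np.sqrt(d_k)     # scaled as in the attention formula

var_raw = dots.var()             # expected to be close to d_k
var_scaled = scaled.var()        # expected to be close to 1
print(var_raw, var_scaled)
```

Without the scaling, the softmax inputs grow with $d_k$, which pushes the softmax towards very peaked distributions with small gradients.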

The dot product $q \cdot k$ serves as a measure of similarity between a query and a key.
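As a rough illustration of the attention formula, here is a NumPy sketch for a single, unbatched sequence. The function name, the optional `mask` argument (1 = attend, 0 = ignore) and the value `-1e9` used for masked scores are assumptions made for this example; the exercise itself asks for a PyTorch `nn.Module`.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Q: [S_q, d_k], K: [S_k, d_k], V: [S_k, d_v], mask: [S_q, S_k] or None."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # similarity of every query with every key
    if mask is not None:
        scores = np.where(mask == 1, scores, -1e9)  # masked positions get (almost) zero weight
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))
K = rng.standard_normal((5, 8))
V = rng.standard_normal((5, 16))
out, weights = scaled_dot_product_attention(Q, K, V)
print(out.shape, weights.shape)
```

Each row of `weights` is a probability distribution over the keys, and each output row is the corresponding weighted average of the rows of `V`.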


<center><img src="https://www.tensorflow.org/images/tutorials/transformer/scaled_attention.png" width="350">

Let's start implementing these components. Note that our classes inherit from PyTorch's `nn.Module`. These modules allow us to hold our parameters and easily move them to the GPU (with `.to(...)`). They also let us define the computation that is performed at every call, in the `forward()` method. For example, when we have an `Attention` module and initialize it like `attention = Attention(...)`, we are able to call it with `attention(Q, K, V)` (it'll execute the `forward` function in an optimized way).

### Layer Normalization

Layer normalization normalizes the values across the feature dimension, independently for each sample in the batch. First calculate the mean and standard deviation across the feature dimension, then shift and scale the values such that the mean is 0 and the standard deviation is 1. Finally, introduce **two sets of learnable parameters** that are applied to the normalized features: one for shifting (addition) and one for scaling (multiplication), i.e., two parameters for each feature. Tip: Use `nn.Parameter` for that.

$y_{\textrm{norm}}=\frac{x-\mu}{\sqrt{\sigma^2+\epsilon}}$

$y=y_{\textrm{norm}}\cdot\beta+\alpha$
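A minimal NumPy sketch of these two steps, using the same symbols as the formulas ($\beta$ for scaling, $\alpha$ for shifting). The function name and the default `eps` are assumptions of this example, and the statistics are the mean and variance over the last (feature) axis. In the PyTorch version, `beta` and `alpha` would be `nn.Parameter`s instead of fixed arrays.

```python
import numpy as np

def layer_norm(x, beta, alpha, eps=1e-5):
    # Statistics are computed per sample, across the feature (last) axis.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    y_norm = (x - mu) / np.sqrt(var + eps)
    # beta scales and alpha shifts each feature.
    return y_norm * beta + alpha

x = np.array([[1.0, 2.0, 3.0, 4.0],
              [10.0, 20.0, 30.0, 40.0]])
beta = np.ones(4)     # scaling initialised to 1
alpha = np.zeros(4)   # shifting initialised to 0
y = layer_norm(x, beta, alpha)
print(y)
```

Both rows end up with almost identical normalized values, since the second row is just a rescaled copy of the first.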

<center>
<img src="https://i.stack.imgur.com/E3104.png" alt="visualization of layer norm vs. batch norm" width="420">

### Transformer Block

Here, we bring all ingredients together into a single module. Don't forget to add the residual connections. Let's use a 2-layer MLP with ReLU activation.

In [2]:
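def transformer_block(hidden_n: int, h: int = 2, emd_n: int = 128):
    """Build one encoder block: self-attention followed by a 2-layer MLP,
    each wrapped in a residual connection and layer normalization.

    Args:
        hidden_n (int): Number of neurons in the hidden MLP layer.
        h (int, optional): Number of attention heads. Defaults to 2.
        emd_n (int, optional): Embedding (model) dimension. Defaults to 128.

    Returns:
        keras.Model: Maps a tensor of shape [B, S, emd_n] to the same shape.
    """
    inputs = keras.Input(shape=(None, emd_n))

    # The attention layer needs query, value and key; for self-attention
    # all three are the block input (key defaults to value).
    attn = layers.MultiHeadAttention(num_heads=h, key_dim=emd_n // h)(inputs, inputs)
    x = layers.LayerNormalization()(layers.Add()([inputs, attn]))  # residual connection

    # 2-layer MLP with ReLU; the second layer projects back to emd_n
    # so that the residual addition is shape-compatible.
    mlp = layers.Dense(hidden_n, activation='relu')(x)
    mlp = layers.Dense(emd_n)(mlp)
    outputs = layers.LayerNormalization()(layers.Add()([x, mlp]))  # residual connection

    return keras.Model(inputs=inputs, outputs=outputs)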

## A Simple Transformer Architecture

Let's stack our transformer blocks and add an embedding layer for a simple transformer architecture. You are allowed to use `nn.Embedding` here.

In [3]:
def stacl_transformer(emb_n: int, hidden_n: int, n:int =3, h:int =2):
    """Stack `n` transformer blocks into one sequential model.

    Args:
        emb_n (int): Embedding dimension that is passed on to every block.
        hidden_n (int): Number of neurons in the hidden layer.
        n (int, optional): Number of layers. Defaults to 3.
        h (int, optional): Number of heads for Multihead attention layer. Defaults to 2.

    Returns:
        keras.Sequential: The stacked blocks (no embedding layer or classification head yet).
    """

    model = keras.Sequential()

    for _ in range(n):
        model.add(transformer_block(hidden_n, h, emd_n=emb_n))

    return model

## POS-Tagging

Part-Of-Speech-Tagging (**POS-Tagging**) is a **sequence labeling problem** where we assign each word in a text its part of speech (e.g., "noun" or "adjective"). A few examples and classes are shown in the following table:

|  POS Tag  |  Description  |  Examples  |
|-----------|------------|------------|
|  NN | Noun (singular, common) | mass, wind, ...  |
|  NNP | Noun (singular, proper) | Obama, Liverpool, ...  |
| CD  | Numeral (cardinal)  | 1890, 0.5, ...  |
|  DT | Determiner  | all, any, ... |
| JJ | Adjective or ordinal numeral | oiled, third, ... |
... many more

### CoNLL2000 Dataset

Let's load our dataset which is the **CoNLL2000 dataset** and look at an example.

In [4]:
import tensorflow_datasets as tfds
ds = tfds.load('huggingface:conll2000')

train_df = pd.DataFrame(ds['train'])

train_df.head(100)


|  | chunk_tags | id | pos_tags | tokens |
|---|---|---|---|---|
| 0 | (tf.Tensor(11, shape=(), dtype=int64), tf.Tens... | tf.Tensor(b'0', shape=(), dtype=string) | (tf.Tensor(19, shape=(), dtype=int64), tf.Tens... | (tf.Tensor(b'Confidence', shape=(), dtype=stri... |
| 1 | (tf.Tensor(0, shape=(), dtype=int64), tf.Tenso... | tf.Tensor(b'1', shape=(), dtype=string) | (tf.Tensor(20, shape=(), dtype=int64), tf.Tens... | (tf.Tensor(b'Chancellor', shape=(), dtype=stri... |
| 2 | (tf.Tensor(0, shape=(), dtype=int64), tf.Tenso... | tf.Tensor(b'2', shape=(), dtype=string) | (tf.Tensor(9, shape=(), dtype=int64), tf.Tenso... | (tf.Tensor(b'But', shape=(), dtype=string), tf... |
| 3 | (tf.Tensor(11, shape=(), dtype=int64), tf.Tens... | tf.Tensor(b'3', shape=(), dtype=string) | (tf.Tensor(11, shape=(), dtype=int64), tf.Tens... | (tf.Tensor(b'This', shape=(), dtype=string), t... |
| 4 | (tf.Tensor(0, shape=(), dtype=int64), tf.Tenso... | tf.Tensor(b'4', shape=(), dtype=string) | (tf.Tensor(8, shape=(), dtype=int64), tf.Tenso... | (tf.Tensor(b'\`\`', shape=(), dtype=string), tf.... |
| ... | ... | ... | ... | ... |
| 95 | (tf.Tensor(17, shape=(), dtype=int64), tf.Tens... | tf.Tensor(b'95', shape=(), dtype=string) | (tf.Tensor(14, shape=(), dtype=int64), tf.Tens... | (tf.Tensor(b'After', shape=(), dtype=string), ... |
| 96 | (tf.Tensor(11, shape=(), dtype=int64), tf.Tens... | tf.Tensor(b'96', shape=(), dtype=string) | (tf.Tensor(25, shape=(), dtype=int64), tf.Tens... | (tf.Tensor(b'We', shape=(), dtype=string), tf.... |
| 97 | (tf.Tensor(11, shape=(), dtype=int64), tf.Tens... | tf.Tensor(b'97', shape=(), dtype=string) | (tf.Tensor(11, shape=(), dtype=int64), tf.Tens... | (tf.Tensor(b'No', shape=(), dtype=string), tf.... |
| 98 | (tf.Tensor(11, shape=(), dtype=int64), tf.Tens... | tf.Tensor(b'98', shape=(), dtype=string) | (tf.Tensor(11, shape=(), dtype=int64), tf.Tens... | (tf.Tensor(b'The', shape=(), dtype=string), tf... |


First, we need to create a vocabulary. Our dataset is already tokenized. However, we need to assign ids to the tokens in order to feed them to the embedding layer. We also need the number of embeddings (`num_embeddings`) for the size of the lookup table of `nn.Embedding`.

Thus, we will iterate over all sentences, assign an id to every new token and store the mapping in our vocabulary. It'll be handy to have two different mappings: from id to token as well as from token to id. Note that we will add a special token `<unk>` with id `0` for words that are unknown (that are not in the training dataset but could possibly be in the test dataset).
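One possible way to build the two mappings is sketched below in plain Python. The helper names `build_vocab` and `encode` and the toy sentences are made up for this illustration; in the exercise the sentences would come from the `tokens` column of the training split.

```python
from typing import Dict, List, Tuple

UNK_TOKEN = "<unk>"

def build_vocab(sentences: List[List[str]]) -> Tuple[Dict[str, int], Dict[int, str]]:
    # <unk> is reserved with id 0; every new token gets the next free id.
    token_to_id = {UNK_TOKEN: 0}
    for sentence in sentences:
        for token in sentence:
            if token not in token_to_id:
                token_to_id[token] = len(token_to_id)
    id_to_token = {i: t for t, i in token_to_id.items()}
    return token_to_id, id_to_token

def encode(sentence: List[str], token_to_id: Dict[str, int]) -> List[int]:
    # Tokens that are not in the vocabulary are mapped to the <unk> id.
    return [token_to_id.get(tok, token_to_id[UNK_TOKEN]) for tok in sentence]

# Toy data, only for illustration.
toy_sentences = [["Confidence", "in", "the", "pound"], ["in", "the", "market"]]
token_to_id, id_to_token = build_vocab(toy_sentences)
ids = encode(["the", "dollar"], token_to_id)   # "dollar" was never seen
print(ids)
```

`len(token_to_id)` is then the value to use for `num_embeddings`.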

Now, let's use PyTorch's `Dataset` and `DataLoader` to help us batch our data. Let's also replace tokens and classes with our ids. For that, complete `get_token_ids` and `get_class_ids`.

We will use a **batch size of 32**.

In [5]:
BATCH_SIZE = 32

However, since our examples are of different lengths, we need to pad shorter examples to the length of the longest example in our batch. So, let's define a special **padding token** in our vocabulary.

The `collate_fn` is the function that actually receives a batch and needs to add the padding tokens, then returns `src` and `tgt` as `Tensor`s of size `[B, S]` where `B` is our batch size and `S` our maximum sequence length. This function should additionally return a `mask`, a `Tensor` with binary values to indicate whether the specific element is a padding token or not (0 if it's a padding token, 1 if not), such that we can ignore padding tokens in our attention mechanism and loss calculation.
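The padding logic can be sketched without any framework. In the version below, `SRC_PAD_ID` and `TGT_PAD_ID` are hypothetical ids chosen for this example, and the function returns nested lists; in the PyTorch solution these would be converted to `Tensor`s of size `[B, S]`.

```python
from typing import List, Tuple

SRC_PAD_ID = 1   # hypothetical id of the padding token (0 is taken by <unk>)
TGT_PAD_ID = 0   # hypothetical filler class id; masked out of the loss anyway

def collate_fn(batch: List[Tuple[List[int], List[int]]]):
    # Pad every example to the longest sequence in this batch.
    max_len = max(len(src) for src, _ in batch)
    src_batch, tgt_batch, mask = [], [], []
    for src, tgt in batch:
        n_pad = max_len - len(src)
        src_batch.append(src + [SRC_PAD_ID] * n_pad)
        tgt_batch.append(tgt + [TGT_PAD_ID] * n_pad)
        mask.append([1] * len(src) + [0] * n_pad)   # 1 = real token, 0 = padding
    return src_batch, tgt_batch, mask

batch = [([5, 6, 7], [2, 3, 2]), ([8, 9], [4, 4])]
src, tgt, mask = collate_fn(batch)
print(src, tgt, mask)
```

Padding per batch rather than to the global maximum length keeps the tensors as small as possible.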

In [6]:
test_df = pd.DataFrame(ds['test'])

test_df.head(10)

|  | chunk_tags | id | pos_tags | tokens |
|---|---|---|---|---|
| 0 | (tf.Tensor(11, shape=(), dtype=int64), tf.Tens... | tf.Tensor(b'0', shape=(), dtype=string) | (tf.Tensor(20, shape=(), dtype=int64), tf.Tens... | (tf.Tensor(b'Rockwell', shape=(), dtype=string... |
| 1 | (tf.Tensor(11, shape=(), dtype=int64), tf.Tens... | tf.Tensor(b'1', shape=(), dtype=string) | (tf.Tensor(20, shape=(), dtype=int64), tf.Tens... | (tf.Tensor(b'Rockwell', shape=(), dtype=string... |
| 2 | (tf.Tensor(11, shape=(), dtype=int64), tf.Tens... | tf.Tensor(b'2', shape=(), dtype=string) | (tf.Tensor(11, shape=(), dtype=int64), tf.Tens... | (tf.Tensor(b'These', shape=(), dtype=string), ... |
| 3 | (tf.Tensor(13, shape=(), dtype=int64), tf.Tens... | tf.Tensor(b'3', shape=(), dtype=string) | (tf.Tensor(14, shape=(), dtype=int64), tf.Tens... | (tf.Tensor(b'Under', shape=(), dtype=string), ... |
| 4 | (tf.Tensor(11, shape=(), dtype=int64), tf.Tens... | tf.Tensor(b'4', shape=(), dtype=string) | (tf.Tensor(20, shape=(), dtype=int64), tf.Tens... | (tf.Tensor(b'Rockwell', shape=(), dtype=string... |
| 5 | (tf.Tensor(11, shape=(), dtype=int64), tf.Tens... | tf.Tensor(b'5', shape=(), dtype=string) | (tf.Tensor(20, shape=(), dtype=int64), tf.Tens... | (tf.Tensor(b'Frank', shape=(), dtype=string), ... |
| 6 | (tf.Tensor(11, shape=(), dtype=int64), tf.Tens... | tf.Tensor(b'6', shape=(), dtype=string) | (tf.Tensor(20, shape=(), dtype=int64), tf.Tens... | (tf.Tensor(b'Mr.', shape=(), dtype=string), tf... |
| 7 | (tf.Tensor(13, shape=(), dtype=int64), tf.Tens... | tf.Tensor(b'7', shape=(), dtype=string) | (tf.Tensor(14, shape=(), dtype=int64), tf.Tens... | (tf.Tensor(b'In', shape=(), dtype=string), tf.... |
| 8 | (tf.Tensor(11, shape=(), dtype=int64), tf.Tens... | tf.Tensor(b'8', shape=(), dtype=string) | (tf.Tensor(20, shape=(), dtype=int64), tf.Tens... | (tf.Tensor(b'SHEARSON', shape=(), dtype=string... |
| 9 | (tf.Tensor(11, shape=(), dtype=int64), tf.Tens... | tf.Tensor(b'9', shape=(), dtype=string) | (tf.Tensor(20, shape=(), dtype=int64), tf.Tens... | (tf.Tensor(b'Thomas', shape=(), dtype=string),... |


### Architecture

Let's build a transformer model with three layers, three attention heads and an embedding dimension of 128. Also, let's not forget to add a classification head to our model.

In [7]:
transformer = stacl_transformer(emb_n=128, hidden_n=64, n=3, h=2)


### Training

Initialize the **AdamW** optimizer from the `torch.optim` module and choose the most appropriate loss function for our task.

In [None]:
optimizer = tf.keras.optimizers.AdamW(learning_rate=0.001)  # AdamW, as asked for in the task
criterion = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

Build a basic training loop and train the network for three epochs.
- Use everything we've built so far, including `train_data_loader`, `model`, `optimizer` and `criterion`.
- At every 50th step, print the average loss of the last 50 steps.
- It is suggested to get a basic training procedure working on the CPU first. Once it runs successfully on the CPU, you can switch to the GPU (click on change runtime and add a hardware accelerator if you use Colab) and run for the whole three epochs. Note: For this to work, you need to transfer the `model` and the input tensors to the GPU memory. This simply works by calling `.to(device)` on the model and tensors, where `device` can either be `cpu` or `cuda` (for the GPU).
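The logging requirement from the list above can be sketched independently of the framework. In this plain-Python sketch, `train_step` is a hypothetical callable that performs one optimization step on a batch and returns the loss as a float; in PyTorch it would wrap the forward pass, `loss.backward()` and `optimizer.step()`.

```python
from typing import Callable, Iterable, List

def train_with_logging(batches: Iterable, train_step: Callable[[object], float],
                       epochs: int = 3, log_every: int = 50) -> List[float]:
    logged = []   # the averages that were printed, returned for inspection
    window = []   # losses since the last log message
    step = 0
    for epoch in range(epochs):
        for batch in batches:   # `batches` must be re-iterable (e.g. a DataLoader or a list)
            window.append(train_step(batch))
            step += 1
            if step % log_every == 0:
                avg = sum(window) / len(window)
                print(f"epoch {epoch + 1}, step {step}: average loss {avg:.4f}")
                logged.append(avg)
                window = []
    return logged

# Dummy stand-ins: 100 "batches" and a step function with a constant loss.
dummy_batches = list(range(100))
logged = train_with_logging(dummy_batches, train_step=lambda batch: 1.0, epochs=1)
```

Resetting `window` after each message makes every printed value the average over exactly the last 50 steps instead of a running average over the whole run.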

In [None]:
EPOCHS = 3

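# `ds` is a dict of splits, so index it by name instead of unpacking its keys.
train, test = ds['train'], ds['test']

# Assumption: every example has already been mapped to an (input_ids, pos_tag_ids) pair.
# Sentences differ in length, so `padded_batch` pads each batch to its longest sequence.
train = train.shuffle(1000).padded_batch(BATCH_SIZE)
test = test.padded_batch(BATCH_SIZE)

transformer.compile(optimizer=optimizer, loss=criterion, metrics=['accuracy'])

# Keras models are trained with `fit`; there is no `train` method.
transformer.fit(train, epochs=EPOCHS)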




### Evaluation

Let's see what the accuracy of our model is. Since we already implemented accuracy in the previous exercise, we'll now let you use the torchmetrics package.

In [None]:
from torchmetrics import Accuracy

accuracy = Accuracy(average='micro')

Calculate the average accuracy of all examples in the test dataset.
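To make explicit what is being computed, here is a framework-free sketch, not the torchmetrics solution. It assumes predictions, targets and the padding mask (1 = real token, 0 = padding) are available as nested lists per batch; the toy values are invented. Counts are accumulated over all batches so that batches with few tokens do not get too much weight.

```python
def masked_accuracy_counts(preds, targets, mask):
    # Count correct and total predictions, skipping padding positions.
    correct = total = 0
    for p_row, t_row, m_row in zip(preds, targets, mask):
        for p, t, m in zip(p_row, t_row, m_row):
            if m == 1:
                total += 1
                correct += int(p == t)
    return correct, total

# Toy batches of (predictions, targets, mask), only for illustration.
batches = [
    ([[2, 3, 2], [4, 4, 0]], [[2, 3, 1], [4, 4, 0]], [[1, 1, 1], [1, 1, 0]]),
    ([[7, 7]], [[7, 5]], [[1, 1]]),
]

correct = total = 0
for preds, targets, mask in batches:
    c, n = masked_accuracy_counts(preds, targets, mask)
    correct += c
    total += n

accuracy = correct / total
print(f"{correct}/{total} correct, accuracy {accuracy:.3f}")
```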

In [None]:
...

Let's also look at the accuracy **for each class separately**:
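A per-class breakdown only needs the counts to be grouped by the target class. The sketch below is plain Python with invented toy values and a hypothetical helper name; classes that never occur in the targets are simply absent from the result.

```python
from collections import defaultdict

def per_class_accuracy(preds, targets, mask):
    correct = defaultdict(int)
    total = defaultdict(int)
    for p_row, t_row, m_row in zip(preds, targets, mask):
        for p, t, m in zip(p_row, t_row, m_row):
            if m == 1:                     # skip padding positions
                total[t] += 1
                correct[t] += int(p == t)
    # Accuracy per target class id.
    return {cls: correct[cls] / total[cls] for cls in total}

preds   = [[2, 3, 2], [4, 4, 0]]
targets = [[2, 3, 3], [4, 2, 0]]
mask    = [[1, 1, 1], [1, 1, 0]]
acc_by_class = per_class_accuracy(preds, targets, mask)
print(acc_by_class)
```

Mapping the class ids back to their tag names makes the output easier to read.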

In [None]:
...

## Positional Embeddings

The attention mechanism does not consider the position of the tokens, which hurts its performance for many problems. We can solve this issue in several ways. We can either add a positional encoding (via trigonometric functions) or we can learn positional embeddings along the way, in a similar way as BERT does. Here, we will add learnable positional embeddings to our existing model with another embedding layer.

The longest sequence in our dataset has 78 tokens (you can trust us on that). So, let's set the number of embeddings for our positional embedding layer to that number. Again, you should use `nn.Embedding`.
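The lookup-and-add idea can be illustrated with two random tables in NumPy. This is only a sketch: `vocab_size` and the token ids are toy values, the tables are random instead of learned, and in the exercise both tables would be `nn.Embedding` layers.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, max_len, d_model = 100, 78, 128   # vocab_size is a toy value

token_table = rng.standard_normal((vocab_size, d_model))    # one row per token id
position_table = rng.standard_normal((max_len, d_model))    # one row per position

token_ids = np.array([[5, 17, 42, 3],
                      [8, 8, 1, 1]])               # [B, S]
positions = np.arange(token_ids.shape[1])          # [S] = 0, 1, ..., S-1

# Look up both embeddings and add them; the positions broadcast over the batch.
x = token_table[token_ids] + position_table[positions]     # [B, S, d_model]
print(x.shape)
```

The token `8` appears twice in the second sequence, and its two vectors in `x` differ exactly by the difference of the two position embeddings, which is what lets attention tell the positions apart.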

Copy the inner parts of your `Transformer` class and add positional embeddings to it.

In [None]:
class TransformerPos(nn.Module):
    def __init__(self, emb_n: int, pos_emb_n: int, hidden_n: int, n:int =3, h:int =2):
        """
        emb_n: number of token embeddings
        pos_emb_n: number of position embeddings
        hidden_n: hidden dimension
        n: number of layers
        h: number of heads per layer
        """
        super().__init__()
        self.positional_embeddings = ...
        ...

    def forward(self):
        ...

In [None]:
model_pos = CoNLL2000Transformer(TransformerPos(...), ...)

### Training

Same procedure as before. Let's reinitialize our optimizer and our loss function and run the same training loop with our new model `model_pos`.

In [None]:
optimizer = ...
criterion = ...

In [None]:
...

### Evaluation

Now, let's check whether the accuracy improved.

In [None]:
...

Again, let's also check each class. Which classes improved the most by adding positional embeddings?

In [None]:
...

As an optional task, you can play around with the model by switching out the transformer component for another architecture, e.g., an LSTM, and observe the change in performance.