# Attention and transformers

In this exercise you will apply a transformer model to sentiment classification. We highlight two important properties of transformer models:
* self-attention: this mechanism enables the model to capture the context of each word within an input text.
* positional encoding: just like RNNs, transformers process sequence data, but instead of using hidden states to capture the order of the words, transformers use positional embeddings.

The first part of the exercise is meant to give an intuition of the self-attention principle. In addition to solving the exercise, you can watch the video https://www.youtube.com/watch?v=g2BRIuln4uc, which illustrates the idea of attention very clearly. Positional encodings are a rather abstract topic and will not be handled in this exercise. However, the video https://www.youtube.com/watch?v=1biZfFLPRSY offers a simple and understandable illustration of this topic.
 
In the second part of the tutorial you need to apply a BERT (Bidirectional Encoder Representations from Transformers) model, which belongs to the family of transformer models.  



In [3]:
## required libraries
import numpy as np
import pandas as pd
from scipy.special import softmax
import seaborn as sns
import matplotlib.pyplot as plt

In [4]:
## definition of the vocabulary
voc = ['I', 'swam', 'across', 'the', 'river', 'to', 'get', 'other', 'bank', 'drove', 'road']

In [5]:
## artificial two dimensional embeddings
emb = pd.DataFrame([[1, 1], [2, 2], [1.2, 1.2], [0.9,0.9], [1.9,1.9], [0.8,0.8], [0.85,0.85], [0.95,0.95], [0,2],[2,-2],[2,-1.9]])

## Exercise 1
In this exercise we want to generate contextualized embeddings. Let $[v_{1},...,v_{n}]$ be a sentence, where $v_{i}$ is the (non-contextualized) embedding of token $i$. A contextualized embedding $y_{i}$ of token $i$ is the weighted sum of the (non-contextualized) embeddings of the tokens in that sentence: $y_{i}=\sum_{j=1}^{n} w_{ij}v_{j}$. The $w_{ij}$ are the attention weights, which measure the importance of token $j$ for the context of token $i$.

a) Your first exercise is to calculate the attention weights for each token in the sentence "I swam across the river, to get to the other bank." The weights should be stored in a matrix
$$\begin{bmatrix}
w_{11} & \cdots & w_{1n}\\
\vdots & \ddots & \vdots \\
w_{n1} & \cdots & w_{nn}
\end{bmatrix}$$
Below, we provide a function to calculate the weights.

In [1]:
## function to calculate the attention weights
def attention_weights(token, sent, emb,voc):
    idx_token=voc.index(token)
    
    idx_sentence = []
    for i in range(0, len(sent)):
        idx_sentence.append(voc.index(sent[i]))
    
    weights = softmax([np.dot(emb.iloc[idx_token],emb.iloc[i]) for i in idx_sentence])

    return weights

b) Plot a heat map of the weights, using sns.heatmap. 

c) Calculate the contextualized embedding of the word "bank" in the two sentences "I swam across the river, to get to the other bank." and "I drove across the road, to get to the other bank." You can do this by multiplying the transposed embedding matrix $E'=\begin{bmatrix}
v_{1} & ... & v_{n}\\
\end{bmatrix}$ with the column vector $\begin{bmatrix}
w_{i1} & ... & w_{in}\\
\end{bmatrix}'$ of the weights of "bank".

d) You have now generated contextualized embeddings in a very simple way: by calculating scalar products between the non-contextualized embeddings $s_{ij}=\langle\,v_{i},v_{j}\rangle$, turning them into weights with a softmax $w_{ij}=\frac{e^{s_{ij}}}{\sum_{k}e^{s_{ik}}}$, and building a weighted sum of the non-contextualized embeddings $y_{i}=\sum_{j}w_{ij}v_{j}$. Does it make sense to integrate this transformation of the embeddings into a machine learning model as it is, or could we modify the procedure so that it makes more sense?
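The three steps in d) (scores $s_{ij}$, weights $w_{ij}$, outputs $y_{i}$) can be traced on a small input. The following is a minimal sketch in plain Python, not a replacement for the `attention_weights` function above; the names `softmax_weights` and `contextualize` and the three toy vectors are made up for illustration.

```python
import math

def softmax_weights(scores):
    """Turn a list of scores into positive weights that sum to 1."""
    shift = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def contextualize(vectors):
    """Return (weights, outputs) of plain dot-product self-attention."""
    weights, outputs = [], []
    for v_i in vectors:
        # s_ij = <v_i, v_j> for every token j in the sentence
        scores = [sum(a * b for a, b in zip(v_i, v_j)) for v_j in vectors]
        w_i = softmax_weights(scores)
        # y_i = sum_j w_ij * v_j, computed one coordinate d at a time
        y_i = [sum(w * v_j[d] for w, v_j in zip(w_i, vectors))
               for d in range(len(v_i))]
        weights.append(w_i)
        outputs.append(y_i)
    return weights, outputs

# Hypothetical three-token "sentence" with made-up 2-d embeddings
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_weights, toy_outputs = contextualize(toy_vectors)
for w_i, y_i in zip(toy_weights, toy_outputs):
    print([round(w, 3) for w in w_i], [round(y, 3) for y in y_i])
```

Each row of `toy_weights` sums to 1, so every output is a weighted average of the input vectors. Note that nothing in this procedure is learned from data.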

## Exercise 2 (demonstration of BERT model use)
In this exercise you will use a pre-trained BERT model. You will load the model and then fine-tune its weights. We recommend doing the exercise in Google Colab, because we ran into errors when loading the transformer packages in our own environment. The exercise follows the original tutorial [TF-Tutorial](https://www.tensorflow.org/text/tutorials/classify_text_with_bert).

### Install the required packages
Colab does not have the packages needed for this exercise pre-installed: `tensorflow-text`, which provides the text operations used by the BERT preprocessor, and `tf-models-official`, which provides the AdamW optimizer. You will have to install them every time you start a new Colab session.

In [None]:
!pip install -q -U "tensorflow-text==2.8.*"


In [None]:
!pip install -q tf-models-official==2.7.0



### Import libraries
Now that the packages are installed, import them together with the other libraries needed for the task. Here we need tensorflow, tensorflow_hub, tensorflow_text, os and shutil for basic tasks, as well as the optimization module from tf-models-official for the optimizer.

In [None]:
import os
import shutil

import tensorflow as tf
import tensorflow_hub as hub # for BERT models
import tensorflow_text as text
from official.nlp import optimization  # for AdamW optimizer

import matplotlib.pyplot as plt

tf.get_logger().setLevel('ERROR')

### Load and set up the dataset

In this task, we will be using the IMDB reviews dataset. Unlike in the previous exercises, we download the data and store it in a directory.

In [None]:
url = 'https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz'

df = tf.keras.utils.get_file('aclImdb_v1.tar.gz', url,
                                  untar=True, cache_dir='.',
                                  cache_subdir='')

df_dir = os.path.join(os.path.dirname(df), 'aclImdb')
X_train_dir = os.path.join(df_dir, 'train')
X_test_dir = os.path.join(df_dir, 'test')

# we only need labeled data (data for supervised learning), so we can remove the unsupervised folder
remove_dir = os.path.join(X_train_dir, 'unsup')
shutil.rmtree(remove_dir)

Downloading data from https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz


BERT is going to take up a lot of processing power. It is highly advisable to organize your data into batches so that the amount of data that you are working with is manageable. For now, we will set the size of the batches of data that we will take to 32. You can experiment with this number when working with the program for later tasks.

In [None]:
batch_size = 32


Note that [prefetch](https://www.tensorflow.org/guide/data_performance#prefetching) overlaps data preparation with model execution: while the model is training on one batch, the input pipeline already prepares the next one, so that the next batch is ready when it is needed. `cache` keeps the data in memory after the first pass over the dataset.
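The idea behind prefetching can be pictured with a plain-Python analogy: a background thread fills a small buffer while the consumer works on the current batch. This is only a conceptual sketch and not how `tf.data` is implemented; the function name `prefetch_iter` and the toy batches are hypothetical.

```python
import queue
import threading

def prefetch_iter(batches, buffer_size=1):
    """Yield items from `batches` while a background thread prepares the next ones."""
    buffer = queue.Queue(maxsize=buffer_size)
    done = object()  # sentinel marking the end of the input

    def producer():
        for batch in batches:
            buffer.put(batch)   # blocks while the buffer is full
        buffer.put(done)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        batch = buffer.get()    # blocks until the producer has a batch ready
        if batch is done:
            return
        yield batch

# Hypothetical stand-in for a dataset: five "batches" of two numbers each
toy_batches = ([i, i + 1] for i in range(0, 10, 2))
for batch in prefetch_iter(toy_batches, buffer_size=2):
    print(batch)
```

The order of the batches is unchanged; only the waiting time between them shrinks when preparing a batch is slow.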

In [None]:
# set seed for reproducibility in train-test split
seed = 888

# Build the training split (80% of the labeled training data)
X_train_raw = tf.keras.preprocessing.text_dataset_from_directory(
    X_train_dir,
    batch_size=batch_size,
    validation_split=0.2,
    subset='training',
    seed=seed)

X_train = X_train_raw.cache().prefetch(buffer_size=tf.data.AUTOTUNE)

# Take the validation data subset for processing
X_val = tf.keras.preprocessing.text_dataset_from_directory(
    X_train_dir,
    batch_size=batch_size,
    validation_split=0.2,
    subset='validation',
    seed=seed)

X_val = X_val.cache().prefetch(buffer_size=tf.data.AUTOTUNE)

# Prepare the test data for processing
X_test = tf.keras.preprocessing.text_dataset_from_directory(
    X_test_dir,
    batch_size=batch_size)

X_test = X_test.cache().prefetch(buffer_size=tf.data.AUTOTUNE)

Found 25000 files belonging to 2 classes.
Using 20000 files for training.
Found 25000 files belonging to 2 classes.
Using 5000 files for validation.
Found 25000 files belonging to 2 classes.


### Load a BERT model

You can explore a [large list of versions of BERT here](https://huggingface.co/models). These pretrained versions of BERT differ mainly in size and/or in the kind of text they were trained on. Choosing a specific version can sometimes improve the performance of your model, though this is not always the case, so it is a good idea to test several versions to see which one works best for your situation. For our purposes, we will use a small uncased BERT. Here, uncased means that the text is lowercased before tokenization, so BERT ignores capitalization, and small means that the model has fewer and narrower layers than the original BERT (the model name `L-4_H-512_A-8` stands for 4 transformer layers, a hidden size of 512 and 8 attention heads).

As the BERT model we chose needs input of a specific format, we also need to load a customized pre-processor, which converts the text in exactly the right format. 

In [None]:
## load the preprocessor
bert_preprocessor = hub.KerasLayer('https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3')

## load the BERT model
bert_model = hub.KerasLayer('https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-512_A-8/1')
#example_bert_results = bert_model(example_preprocessed)

Now let us look at how the preprocessor works. For this purpose we define an example sentence and then examine the preprocessed sentence.


In [None]:
example_text = ['hated every minute of it.']
example_preprocessed = bert_preprocessor(example_text)

print(f'Shape of input_word_ids: {example_preprocessed["input_word_ids"].shape}')
print(f'First 12 input_word_ids: {example_preprocessed["input_word_ids"][0, :12]}')


Shape of input_word_ids: (1, 128)
First 12 input_word_ids: [ 101 6283 2296 3371 1997 2009 1012  102    0    0    0    0]


First, note that the preprocessor produces input sequences of length 128: longer texts are truncated and shorter ones are padded. We can also see that all the words in the example sentence are converted to IDs in the vocabulary, which shows that BERT has already encountered these words during pre-training.

Was the number of non-zero tokens what you expected? You may have anticipated only the following tokens: 'hated', 'every', 'minute', 'of', 'it', '.'. Why are there 2 extra tokens? The preprocessor automatically adds special tokens that mark the beginning (ID 101, `[CLS]`) and the end (ID 102, `[SEP]`) of the input. The rest of the sequence is filled with 0s as padding, so that all inputs have the same length and can be stacked into one tensor per batch.
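The packing step described above can be imitated in a few lines. This is a minimal sketch of the idea, not the preprocessor's actual implementation; the helper `pack_ids` is hypothetical, and it assumes that 101 and 102 are the IDs of the start and end markers, as the printed output suggests.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special-token IDs as seen in the printed output

def pack_ids(token_ids, seq_length=128):
    """Wrap IDs in start/end markers, truncate, then pad with zeros to seq_length."""
    body = token_ids[: seq_length - 2]   # leave room for the two special tokens
    ids = [CLS_ID] + body + [SEP_ID]
    mask = [1] * len(ids)                # 1 marks a real token, 0 marks padding
    padding = seq_length - len(ids)
    return ids + [PAD_ID] * padding, mask + [0] * padding

# Word IDs copied from the printed example ('hated every minute of it.')
word_ids = [6283, 2296, 3371, 1997, 2009, 1012]
input_word_ids, input_mask = pack_ids(word_ids)
print(len(input_word_ids))    # 128
print(input_word_ids[:12])    # [101, 6283, 2296, 3371, 1997, 2009, 1012, 102, 0, 0, 0, 0]
print(sum(input_mask))        # 8
```

The mask plays the role of the `input_mask` entry returned by the preprocessor: it tells the encoder which positions hold real tokens and which are padding.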

### Build the classifier
Now we will stack some layers to create a classifier model. We will use:
- an input layer which receives the raw text
- a layer to preprocess the text for the BERT encoder
- an encoding layer which returns BERT outputs
- a dropout layer to prevent overfitting
- a final dense layer for the final classification

In [None]:
def build_classifier_model():
  # create input layer
  input_layer = tf.keras.layers.Input(shape=(), dtype=tf.string, name='input text')
  # add preprocessing layer and input text
  preprocessor = hub.KerasLayer('https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3', name='preprocessing')
  encoder_inputs = preprocessor(input_layer)
  # add encoding layer and feed preprocessed text into layer
  encoder = hub.KerasLayer('https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-512_A-8/1', trainable=True, name='BERT_encoder')
  outputs = encoder(encoder_inputs)
  # take the pooled output and apply a dropout layer to it to prevent overfitting
  pooled = outputs['pooled_output']
  pooled = tf.keras.layers.Dropout(0.1)(pooled)
  # create output layer which is the final classifier
  pooled = tf.keras.layers.Dense(1, activation=None, name='classifier')(pooled)
  return tf.keras.Model(input_layer, pooled) 

In [None]:
## instantiate model
classifier_model = build_classifier_model()

## check architecture
classifier_model.summary()

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input text (InputLayer)        [(None,)]            0           []                               
                                                                                                  
 preprocessing (KerasLayer)     {'input_word_ids':   0           ['input text[0][0]']             
                                (None, 128),                                                      
                                 'input_mask': (Non                                               
                                e, 128),                                                          
                                 'input_type_ids':                                                
                                (None, 128)}                                                  

### Setting up loss metric and meta parameters

BERT is usually fine-tuned for fewer epochs than traditional deep learning models are trained. Depending on your task and system abilities, you can of course experiment with adding more epochs to see how it affects the model's performance. Here we set the number of epochs especially low, to get quick results during the tutorial.

`steps_per_epoch` is the number of steps (batches of observations) that make up one epoch. We will set this equal to the cardinality of the training dataset (the number of batches it contains), as recommended by tensorflow.
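As a quick sanity check, the number of batches can be worked out by hand from the number of training files reported earlier and the batch size. The helper `batches_per_epoch` below is a hypothetical name used only for this sketch.

```python
import math

def batches_per_epoch(num_examples, batch_size):
    """Number of batches needed to see every example once (last batch may be partial)."""
    return math.ceil(num_examples / batch_size)

# 20000 training files in batches of 32, as used in this notebook
print(batches_per_epoch(20000, 32))  # 625
```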

The learning rate increases linearly to its peak value `init_lr` during the first 10% of the training steps (the warmup phase) and then follows a linear decay. According to the paper on BERT, small initial learning rates work best for fine-tuning; we use 3e-5, and you can also try 5e-5 and 2e-5 if you'd like to experiment.
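The shape of a learning rate schedule with a warmup phase over the first 10% of the steps followed by a linear decay can be sketched as a small function. This is a simplified illustration under the assumption of a linear ramp-up and a decay to zero; the function `warmup_linear_decay` is hypothetical, and the values produced by the optimizer helper used below may differ in detail.

```python
def warmup_linear_decay(step, init_lr, num_train_steps, num_warmup_steps):
    """Learning rate at a given step: linear ramp-up, then linear decay to zero."""
    if num_warmup_steps > 0 and step < num_warmup_steps:
        return init_lr * step / num_warmup_steps
    remaining = max(num_train_steps - step, 0)
    return init_lr * remaining / max(num_train_steps - num_warmup_steps, 1)

# Values matching this notebook: 625 steps for one epoch, 10% of them for warmup
total_steps = 625
warmup_steps = int(0.1 * total_steps)  # 62
for s in (0, 31, 62, 343, 625):
    print(s, warmup_linear_decay(s, 3e-5, total_steps, warmup_steps))
```

The rate starts at zero, reaches `init_lr` at the end of the warmup phase, and is back at zero after the last training step.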

Lastly, as the optimizer we use AdamW, a variant of Adam (Adaptive Moment Estimation) with decoupled weight decay: in addition to the usual Adam update, the weights are shrunk slightly at every step, which acts as a regularizer.

In [None]:
## loss
loss = tf.keras.losses.BinaryCrossentropy(from_logits=True)
metrics = tf.metrics.BinaryAccuracy()

## meta parameters
epochs = 1
steps_per_epoch = tf.data.experimental.cardinality(X_train).numpy()
num_train_steps = steps_per_epoch * epochs
num_warmup_steps = int(0.1*num_train_steps)

init_lr = 3e-5 # Best options for BERT: 5e-5, 3e-5, 2e-5
optimizer = optimization.create_optimizer(init_lr=init_lr,
                                          num_train_steps=num_train_steps,
                                          num_warmup_steps=num_warmup_steps,
                                          optimizer_type='adamw')

Now that we have all of these set, we can compile the model with them.

In [None]:
classifier_model.compile(optimizer=optimizer,
                         loss=loss,
                         metrics=metrics)

### Fit and evaluate model 


In [None]:
## fit model
history = classifier_model.fit(x=X_train,
                               validation_data=X_val,
                               epochs=epochs)



In [None]:
## evaluate model
loss_res, acc_res = classifier_model.evaluate(X_test)

print(f'Loss: {loss_res}')
print(f'Accuracy: {acc_res}')

Loss: 0.37105128169059753
Accuracy: 0.8310800194740295
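The classifier's last layer has no activation and the loss is created with `from_logits=True`, so the model outputs raw scores (logits) rather than probabilities. The sketch below shows, in plain Python, how a logit maps to a probability and how binary cross-entropy can be computed from a logit for a single example. The function names and the example logits are made up, and this is not the Keras implementation.

```python
import math

def sigmoid(logit):
    """Map a raw score to a probability in (0, 1) without overflowing."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    e = math.exp(logit)
    return e / (1.0 + e)

def bce_from_logit(logit, label):
    """Numerically stable binary cross-entropy for one example (label is 0 or 1)."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))

# Made-up logits for three hypothetical reviews with labels 1, 0, 1
for logit, label in [(2.0, 1), (-1.5, 0), (-0.3, 1)]:
    print(round(sigmoid(logit), 3), round(bce_from_logit(logit, label), 3))
```

A positive logit corresponds to a probability above 0.5, and the loss is small when the sign of the logit agrees with the label.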


## Exercise 3
You have now loaded a pre-trained BERT model and fine-tuned its parameters. Which parameters of the model did you modify during training, compared to the pre-trained model? Load the same pre-trained model again and train only the parameters of the dense layer. What differences do you notice?