<a target="_blank" href="https://colab.research.google.com/github/PaulLerner/aivancity_nlp/blob/main/pw2_transformers.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Make a group for the Homework

Before starting this Practical Work, make sure that you have a group (3 students) for the Homework. 

If you have trouble finding a group, please tell the teacher. If you do the homework alone without authorization, you will get 0/20.


# Installation and imports

Hit `Ctrl+S` to save a copy of the Colab notebook to your drive

Run on a Google Colab GPU:
- Click **Connect**
- Select **Change runtime type**
- Choose **GPU**

![image.png](https://paullerner.github.io/aivancity_nlp/_static/colab_gpu.png)

In [None]:
%pip install transformers datasets

In [1]:
import torch
import torch.nn as nn
from torch.nn import functional as F


In [2]:
assert torch.cuda.is_available(), "Connect to GPU and try again (ask teacher for help)"

# Attention

Attention is a crucial component of the Transformer: it captures dependencies between the positions of two sequences of elements. In our case, as in most NLP applications, sequences are sentences and elements are (sub)words.
It is a powerful operation that learns an alignment between the elements of two sequences: it produces a score for how related each element of the first sequence is to each element of the second.
Understanding how attention works and being able to implement it are essential for anyone working with Transformers.

Given query ($Q$), key ($K$), and value ($V$) tensors, the attention mechanism computes a weighted sum of the value tensor, where the weights are based on the similarity between the query and key tensors, as shown in the following equation:

$$
\text{Attention}(Q,K,V) = \text{softmax}\Big(\frac{QK^T}{\sqrt{d_k}}\Big)V
$$

where 
- $Q$ represents the query tensor.
- $K$ represents the key tensor.
- $V$ represents the value tensor.
- $d_k$ represents the dimensionality of the key tensor.
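
The equation above can be sketched in a few lines of PyTorch. This is only a minimal illustration: the helper name `scaled_dot_product_attention` and the tensor sizes (5 queries, 7 keys and values, dimension 4) are assumptions chosen for the example, not part of the exercise.

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V):
    """Illustrative helper: softmax(Q K^T / sqrt(d_k)) V for batched tensors."""
    d_k = K.size(-1)
    # (B, q, d_k) @ (B, d_k, c) -> (B, q, c): one score per (query, key) pair
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
    # normalize over the keys so that the weights of each query sum to 1
    weights = torch.softmax(scores, dim=-1)
    # (B, q, c) @ (B, c, d) -> (B, q, d): weighted sum of the values
    return weights @ V, weights

torch.manual_seed(0)
Q = torch.rand(1, 5, 4)  # 5 queries of dimension 4
K = torch.rand(1, 7, 4)  # 7 keys of dimension 4
V = torch.rand(1, 7, 4)  # 7 values of dimension 4
attended, weights = scaled_dot_product_attention(Q, K, V)
print(attended.shape)    # torch.Size([1, 5, 4])
print(weights.sum(-1))   # each row of weights sums to 1 (up to rounding)
```

In self-attention, $Q$, $K$ and $V$ are all computed from the same sequence, so the number of queries equals the number of keys.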

The figure below, from the [original Transformer paper](https://proceedings.neurips.cc/paper_files/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html), shows the computations used in attention.

Ignore the right part for now; we will get back to it later in the lab.

![image](https://miro.medium.com/v2/resize:fit:1270/1*LpDpZojgoKTPBBt8wdC4nQ.png)


In this exercise, we will dive into the attention mechanism. To do so, we are going to build a simple self-attention function that we will then extend to a more complex multi-head self-attention module that incorporates the concept of causality.

## Building a Simple Self-Attention Function

In self-attention, a single sequence acts as the query $Q$, key $K$, and value $V$, allowing attention to be computed within the sequence itself. This can be useful for syntax, where an attention head can model the relationship between parts of the sentence such as subjects and verbs.


Given an input sequence $S$ and the transformation weights $W_Q$, $W_K$ and $W_V$, complete the `self_attention` function in the cell below. 

You need to implement the following:
- Calculate the query, key, and value projections using linear transformations.
- Compute the attention scores by performing the dot product between the query and key tensors.
- Apply softmax activation to the attention scores to obtain the attention weights.
- Multiply the attention weights with the value tensor to get the attended values.
- Return the attended values.



Hint: Matrix sizes

- q: query length
- d: hidden dimension
- c: context length


- Q: 1xqxd
- K, V: 1xcxd
- Q x K.T: 1xqxd x 1xdxc = 1xqxc
- (Q x K.T) x V: 1xqxc x 1xcxd = 1xqxd


- Attn: 1xqxd

In [2]:
def self_attention(S, W_Q, W_K, W_V):
    pass

In [6]:
# Sequence
S = torch.rand((1,13,3))

# Projections
W_Q = torch.rand((3, 2))  # Query weights
W_K = torch.rand((3, 2))  # Key weights
W_V = torch.rand((3, 2))  # Value weights

# Perform self-attention
attended_values = self_attention(S, W_Q, W_K, W_V)

# Expected output # 1,13, 2 (B, Sequence, Projection)
print(f"Output Shape: {attended_values.shape}")

Output Shape: torch.Size([1, 13, 2])


## Multi-Head

However, even a single sentence contains more than one kind of relation. Think of number and gender agreement, the semantic relation between subject and object, the functional roles of verb arguments, etc. A single head cannot model all of these.

For this reason, we are going to extend the single-head attention function to **multi-head attention**. In the previous implementation, we had one set of weights for the input query, resulting in a single type of _relationship between the source and target sequence_. With multi-head attention, we can use _multiple parallel single-head attention modules_ to obtain diverse relationships between the query and the values. The attention operation works by projecting the sequences through a multiplication with a projection matrix, and then computing the alignment score. These operations can all be parallelized since the heads do not depend on each other. As a result, each head can learn to model a different linguistic interaction useful for downstream tasks, be it syntactic, semantic or generation-based.


As we've seen in class, this can be done simply by reshaping queries, keys and values.

Project back the results using $W_O$
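
The reshaping step can be sketched as follows. The helper names `split_heads` and `merge_heads` are hypothetical, and the sketch only covers the reshaping, not the attention computation or the $W_O$ projection.

```python
import torch

def split_heads(x, num_heads):
    """(B, T, D) -> (B, num_heads, T, D // num_heads)."""
    B, T, D = x.shape
    assert D % num_heads == 0, "hidden size must be divisible by num_heads"
    # cut the hidden dimension into num_heads chunks, then move heads before time
    return x.view(B, T, num_heads, D // num_heads).transpose(1, 2)

def merge_heads(x):
    """(B, num_heads, T, head_dim) -> (B, T, num_heads * head_dim)."""
    B, H, T, head_dim = x.shape
    # undo the transpose, then concatenate the heads along the hidden dimension
    return x.transpose(1, 2).reshape(B, T, H * head_dim)

x = torch.rand(1, 13, 8)
heads = split_heads(x, num_heads=4)
print(heads.shape)               # torch.Size([1, 4, 13, 2])
restored = merge_heads(heads)
print(torch.equal(restored, x))  # True: merging undoes the split
```

With the heads in their own dimension, the batched matrix products of attention run over all heads at once, since `@` treats every leading dimension as a batch dimension.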

In [3]:
def multi_head_attention(S, W_Q, W_K, W_V, W_O, num_heads=4):    
    return output

In [13]:
# Sequence
S = torch.rand((1,13,8))

# Projections
W_Q = torch.rand((8, 8))  # Query weights
W_K = torch.rand((8, 8))  # Key weights
W_V = torch.rand((8, 8))  # Value weights
W_O = torch.rand((8, 8))  # Output proj

# Perform self-attention
attended_values = multi_head_attention(S, W_Q, W_K, W_V, W_O)

# Expected output # 1, 13, 8 (B, Sequence, Projection)
print(f"Output Shape: {attended_values.shape}")

Output Shape: torch.Size([1, 13, 8])


## Causal Mask

GPT uses a version of self-attention called causal self-attention. When training our models for tasks like language modeling and machine translation, in practice we feed the entire training sequence to the model but, at every timestep, we want to prevent it from computing the alignment with future tokens. For this reason we use a mask that we incrementally lift at every timestep. For instance, take the sentence "Lisbon is a great city to live in". At time 0, we feed the entire sentence to the model, masking everything but the first token. Using strikethrough to denote masking, this is what the model sees at step 0:

- Time 0: Lisbon ~is a great city to live in~

We then let the model generate a token and move to step 1, where we mask everything but the first two tokens:

- Time 1: Lisbon is ~a great city to live in~

and so on...

- Time 2: Lisbon is a ~great city to live in~
- Time 3: Lisbon is a great ~city to live in~
- Time 4: Lisbon is a great city ~to live in~
- Time 5: Lisbon is a great city to ~live in~
- Time 6: Lisbon is a great city to live ~in~


![transformer](https://paullerner.github.io/aivancity_nlp/_static/attention_mask.png)

Apply mask on attention using `torch.tril` and `masked_fill`
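
As a small sketch of how these two functions combine (the random `scores` tensor is only a stand-in for $QK^T/\sqrt{d_k}$, and `T = 3` is an arbitrary choice):

```python
import torch

torch.manual_seed(0)
T = 3
scores = torch.rand(1, T, T)         # stand-in for the attention scores
mask = torch.tril(torch.ones(T, T))  # 1 on and below the diagonal, 0 above
# future positions (above the diagonal) get -inf, so softmax gives them weight 0
masked = scores.masked_fill(mask == 0, float("-inf"))
weights = torch.softmax(masked, dim=-1)
print(masked)
print(weights)
```

Each row of `weights` still sums to 1, but the first token can only attend to itself, the second to the first two tokens, and so on.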

In [4]:
def causal_multi_head_attention(S, W_Q, W_K, W_V, W_O, num_heads=4):    
    return output

In [16]:
# Sequence
S = torch.rand((1,3,8))

# Projections
W_Q = torch.rand((8, 8))  # Query weights
W_K = torch.rand((8, 8))  # Key weights
W_V = torch.rand((8, 8))  # Value weights
W_O = torch.rand((8, 8))  # Output proj

# Perform self-attention
attended_values = causal_multi_head_attention(S, W_Q, W_K, W_V, W_O)

# Expected output # 1, 3, 8 (B, Sequence, Projection)
print(f"Output Shape: {attended_values.shape}")

bidirectional attention:
tensor([[[[ 7.6750,  9.8763,  7.8718],
          [11.7096, 14.9262, 11.9657],
          [ 8.8149, 11.3644,  9.0476]],

         [[ 6.9848,  8.4330,  6.1530],
          [ 8.9807, 10.8143,  7.9368],
          [ 7.9229,  9.5573,  6.9868]],

         [[ 8.4009, 10.4845,  9.2640],
          [10.3813, 12.9258, 11.4504],
          [ 8.6193, 10.7093,  9.5088]],

         [[ 9.4824, 11.5997, 10.2568],
          [13.4200, 16.4683, 14.5326],
          [10.5152, 12.8738, 11.3774]]]])

causal attention:
tensor([[[[ 7.6750,    -inf,    -inf],
          [11.7096, 14.9262,    -inf],
          [ 8.8149, 11.3644,  9.0476]],

         [[ 6.9848,    -inf,    -inf],
          [ 8.9807, 10.8143,    -inf],
          [ 7.9229,  9.5573,  6.9868]],

         [[ 8.4009,    -inf,    -inf],
          [10.3813, 12.9258,    -inf],
          [ 8.6193, 10.7093,  9.5088]],

         [[ 9.4824,    -inf,    -inf],
          [13.4200, 16.4683,    -inf],
          [10.5152, 12.8738, 11.3774]]]])
Output Shape: torch.Size([1, 3, 8])


We can now look back at the attention figure from the paper. Hopefully, you can now also understand the right side of the figure.

![image](https://miro.medium.com/v2/resize:fit:1270/1*LpDpZojgoKTPBBt8wdC4nQ.png)

## PyTorch Module

The last modification involves embedding our function into a PyTorch module. As you may have noticed, in the previous exercise, we passed the transformation weights as inputs to the function. In a real-world scenario, these matrices are learned, and PyTorch can keep track of them for us.

- Complete the missing lines on the initialization of the module and the forward pass.
- add dropout on the attention weights and the output
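
To illustrate how PyTorch keeps track of learned weights, here is a toy module. `ToyModule` is hypothetical and is not the expected solution: it only shows that an `nn.Linear` layer is registered as a parameter, while a constant such as the causal mask can be stored with `register_buffer`.

```python
import torch
import torch.nn as nn

class ToyModule(nn.Module):
    """Hypothetical module, only meant to show parameter and buffer tracking."""
    def __init__(self, hidden_size=8, seq_len=3, dropout=0.1):
        super().__init__()
        # learned weights: registered automatically as parameters
        self.proj = nn.Linear(hidden_size, hidden_size, bias=False)
        self.dropout = nn.Dropout(dropout)
        # constant mask: a buffer moves with .cuda()/.to() but is not trained
        self.register_buffer("mask", torch.tril(torch.ones(seq_len, seq_len)))

    def forward(self, x):
        return self.dropout(self.proj(x))

module = ToyModule()
print([name for name, _ in module.named_parameters()])  # ['proj.weight']
print([name for name, _ in module.named_buffers()])     # ['mask']
print(module(torch.rand(1, 3, 8)).shape)                # torch.Size([1, 3, 8])
```

Dropout is only active in training mode; calling `module.eval()` turns it off.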


In [7]:
class CausalSelfAttention(nn.Module):
    def __init__(self, hidden_size=8, num_heads=2, dropout=0.1, seq_len=3):
        super().__init__()

    def forward(self, x):
        return output

In [None]:
attention_module = CausalSelfAttention()

In [100]:
# Sequence
S = torch.rand((1,3,8))

# Perform self-attention
attended_values = attention_module(S)

# Expected output # 1, 3, 8 (B, Sequence, Projection)
print(f"Output Shape: {attended_values.shape}")

Output Shape: torch.Size([1, 3, 8])


# Transformer


![transformer](https://paullerner.github.io/aivancity_nlp/_static/transformer_decoder.png)

## Attention is almost all you need: feedforward neural network

A simple two-layer neural network with a ReLU activation in between and dropout at the output. The intermediate dimension should be 4 times `hidden_size`.

In [9]:
class FeedForward(nn.Module):
    def __init__(self, hidden_size, dropout):
        super().__init__()

    def forward(self, x):
        pass

## Transformer Block

- stack CausalSelfAttention and FeedForward
- add residual connections
- add layer norms
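
The residual and layer-norm pattern can be sketched on its own. `ResidualSublayer` is a hypothetical wrapper, and `nn.Linear` merely stands in for the attention or feedforward module. Normalizing before the sublayer (pre-norm) is an assumption made here; the figure above may place the layer norm differently.

```python
import torch
import torch.nn as nn

class ResidualSublayer(nn.Module):
    """Hypothetical wrapper computing x + sublayer(norm(x)) (pre-norm variant)."""
    def __init__(self, hidden_size, sublayer):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_size)
        self.sublayer = sublayer

    def forward(self, x):
        # the input is added back to the sublayer output (residual connection)
        return x + self.sublayer(self.norm(x))

# nn.Linear is a placeholder for CausalSelfAttention or FeedForward
layer = ResidualSublayer(8, nn.Linear(8, 8))
x = torch.rand(1, 3, 8)
print(layer(x).shape)  # torch.Size([1, 3, 8])
```

Because of the residual connection, the input and output of the sublayer must have the same shape.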

In [10]:
class Block(nn.Module):
    def __init__(self, hidden_size=8, num_heads=2, dropout=0.1, seq_len=3):
        super().__init__()

    def forward(self, x):
        pass

In [None]:
# Sequence
S = torch.rand((1,3,8))

block = Block()
# Perform self-attention
output = block(S)

# Expected output # 1, 3, 8 (B, Sequence, Projection)
print(f"Output Shape: {output.shape}")

## Complete Transformer
- word embeddings
- position embeddings
- as many blocks as you like to stack
- output layer back to the vocabulary (no need for softmax)
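
The input and output ends of the model can be sketched as below, leaving out the blocks in between. Learned position embeddings via a second `nn.Embedding` are an assumption made for this sketch; other position encodings exist.

```python
import torch
import torch.nn as nn

vocab_size, hidden_size, seq_len = 100, 8, 3
word_embeddings = nn.Embedding(vocab_size, hidden_size)
position_embeddings = nn.Embedding(seq_len, hidden_size)  # learned positions (assumption)
output_layer = nn.Linear(hidden_size, vocab_size)

input_ids = torch.randint(0, vocab_size, (1, seq_len))
positions = torch.arange(input_ids.size(1))  # tensor([0, 1, 2])
# (1, T, D) + (T, D): the position vectors broadcast over the batch dimension
x = word_embeddings(input_ids) + position_embeddings(positions)
print(x.shape)       # torch.Size([1, 3, 8])

# the blocks would transform x here; the output layer maps back to the vocabulary
logits = output_layer(x)
print(logits.shape)  # torch.Size([1, 3, 100])
```

With this choice the model cannot process sequences longer than `seq_len`, since there is no embedding for later positions.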

In [11]:
class Transformer(nn.Module):

    def __init__(self, vocab_size=100, hidden_size=8, num_heads=2, dropout=0.1, seq_len=3, num_layers=2):
        super().__init__()        
            
    def forward(self, input_ids):
        return logits


In [None]:
transformer = Transformer()

In [71]:
input_ids = torch.randint(0,100,(1, 3))

In [72]:
input_ids

tensor([[85, 10,  4]])

In [73]:
logits = transformer(input_ids)

In [74]:
# scores (not probabilities because not normalized) over the complete vocabulary, for each token in the sentence
# shape: batch size, seq_len, V
logits.shape

torch.Size([1, 3, 100])

# Training

A peek into Language Modeling (next class)


![lm](https://paullerner.github.io/aivancity_nlp/_static/lm.png)

A language model estimates the probability of a sequence of words $w$:
$$P(w)=\prod_{t=1}^{|w|} P(w_t | w_{<t}) = P(w_1)  P(w_2|w_1)  P(w_3 | w_1 w_2) \dots$$

See how this turns into a sequence of classification problems:
- first $P(w_1)$
- then $P(w_2|w_1)$
- etc.

The model "predicts the next word" given a context
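
As a worked example of the factorization, with made-up conditional probabilities for a three-word sequence (the numbers are purely illustrative, not model outputs):

```python
import math

# Hypothetical conditional probabilities for the sequence "the cat sleeps"
conditionals = [
    0.20,  # P(the)
    0.05,  # P(cat | the)
    0.10,  # P(sleeps | the cat)
]

# probability of the sequence: product of the conditionals (about 0.001)
probability = math.prod(conditionals)
# in practice we sum log-probabilities, which avoids underflow on long sequences
log_probability = sum(math.log(p) for p in conditionals)

print(probability)
print(math.exp(log_probability))  # same value, up to rounding
```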

## data

In [2]:
from datasets import load_dataset, DatasetDict

dataset = load_dataset('wikitext', 'wikitext-103-raw-v1')

dataset = {k: v["text"] for k, v in dataset.items()}

for k, v in dataset.items():
    print(k, len(v))



test 4358
train 1801350
validation 3760


In [3]:
dataset["train"][3]

' Senjō no Valkyria 3 : Unrecorded Chronicles ( Japanese : 戦場のヴァルキュリア3 , lit . Valkyria of the Battlefield 3 ) , commonly referred to as Valkyria Chronicles III outside Japan , is a tactical role @-@ playing video game developed by Sega and Media.Vision for the PlayStation Portable . Released in January 2011 in Japan , it is the third game in the Valkyria series . Employing the same fusion of tactical and real @-@ time gameplay as its predecessors , the story runs parallel to the first game and follows the " Nameless " , a penal military unit serving the nation of Gallia during the Second Europan War who perform secret black operations and are pitted against the Imperial unit " Calamaty Raven " . \n'

## tokenization

We have barely talked about tokenization so far! We have assumed that tokens are words, which is impractical: a finite vocabulary cannot cover every word.

Instead, LLMs rely on Byte-Pair Encoding (BPE), originally a data compression technique, which segments rare words into subwords.
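
To give an intuition of how BPE learns its subwords, here is a toy sketch of its merge loop on a made-up word-frequency table. It works on characters, whereas GPT-2's tokenizer works on bytes, and the helper names are hypothetical.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs over a {tuple of symbols: frequency} table."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of `pair` by a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Toy corpus (made up): each word is split into characters, with its frequency
words = {tuple("low"): 5, tuple("lower"): 2, tuple("newest"): 6, tuple("widest"): 3}
for step in range(3):
    pair = most_frequent_pair(words)
    words = merge_pair(words, pair)
    print(step, pair, list(words))
```

After a few merges, the frequent ending `est` becomes a single symbol, while rarer character sequences stay split into smaller pieces.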

In [4]:
from transformers import AutoTokenizer

In [5]:
tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")

tokenizer.pad_token = tokenizer.eos_token

seq_len=128

In [6]:
print(tokenizer.tokenize(dataset["train"][3]))

['ĠSen', 'j', 'Åį', 'Ġno', 'ĠV', 'alky', 'ria', 'Ġ3', 'Ġ:', 'ĠUn', 'recorded', 'ĠChronicles', 'Ġ(', 'ĠJapanese', 'Ġ:', 'Ġæ', 'Ī', '¦', 'å', 'ł', '´', 'ãģ®', 'ãĥ´ãĤ¡', 'ãĥ«', 'ãĤŃ', 'ãĥ¥', 'ãĥª', 'ãĤ¢', '3', 'Ġ,', 'Ġlit', 'Ġ.', 'ĠV', 'alky', 'ria', 'Ġof', 'Ġthe', 'ĠBattlefield', 'Ġ3', 'Ġ)', 'Ġ,', 'Ġcommonly', 'Ġreferred', 'Ġto', 'Ġas', 'ĠV', 'alky', 'ria', 'ĠChronicles', 'ĠIII', 'Ġoutside', 'ĠJapan', 'Ġ,', 'Ġis', 'Ġa', 'Ġtactical', 'Ġrole', 'Ġ@', '-', '@', 'Ġplaying', 'Ġvideo', 'Ġgame', 'Ġdeveloped', 'Ġby', 'ĠSega', 'Ġand', 'ĠMedia', '.', 'Vision', 'Ġfor', 'Ġthe', 'ĠPlayStation', 'ĠPortable', 'Ġ.', 'ĠReleased', 'Ġin', 'ĠJanuary', 'Ġ2011', 'Ġin', 'ĠJapan', 'Ġ,', 'Ġit', 'Ġis', 'Ġthe', 'Ġthird', 'Ġgame', 'Ġin', 'Ġthe', 'ĠV', 'alky', 'ria', 'Ġseries', 'Ġ.', 'ĠEmploy', 'ing', 'Ġthe', 'Ġsame', 'Ġfusion', 'Ġof', 'Ġtactical', 'Ġand', 'Ġreal', 'Ġ@', '-', '@', 'Ġtime', 'Ġgameplay', 'Ġas', 'Ġits', 'Ġpredecessors', 'Ġ,', 'Ġthe', 'Ġstory', 'Ġruns', 'Ġparallel', 'Ġto', 'Ġthe', 'Ġfirst', 'Ġgame', 'Ġan

In [7]:
text_batch = dataset["train"][:4]

Hugging Face's `transformers` provides a convenient way to tokenize text. It also takes care of padding the texts so that we can wrap all examples of a batch in the same `Tensor`.

In [8]:
input_ids = tokenizer(text_batch, return_tensors='pt', padding=True, truncation=True, max_length=seq_len)['input_ids']

In [9]:
input_ids

tensor([[50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 5

In [10]:
input_ids.shape

torch.Size([4, 128])

Notice the padding: short texts are padded with `tokenizer.eos_token_id`.

In [13]:
tokenizer.eos_token_id

50256

In [29]:
transformer = Transformer(vocab_size=tokenizer.vocab_size, seq_len=seq_len)

In [31]:
# same as before (only larger seq_len and V)
logits = transformer(input_ids)
logits.shape

torch.Size([4, 128, 50257])

## Self-supervision

Remember the greatest thing about Language Modeling: we don't need to annotate data!

The model should predict the next word given the context so we just need to shift the input by 1 to get the labels!

Compute the loss on one batch using `nn.CrossEntropyLoss`. Be careful about the padding! We don't want our model to learn to predict padding at the end of text!

Like in the previous Practical Work, remember to flatten the batch dimension with the sequence dimension
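
One possible way to set this up is sketched below on a toy batch. The random `logits` stand in for `transformer(input_ids)`, and the token ids are made up. Using the `attention_mask` (which the tokenizer call returns alongside `input_ids`) to find the padding is a design choice of this sketch: here the padding token is also the end-of-sequence token, so masking by token id would hide genuine end-of-sequence tokens too.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, pad_id = 10, 9
# toy batch: the second sequence ends with two padding tokens
input_ids = torch.tensor([[1, 2, 3, 4, 5],
                          [6, 7, 8, pad_id, pad_id]])
attention_mask = torch.tensor([[1, 1, 1, 1, 1],
                               [1, 1, 1, 0, 0]])
logits = torch.randn(2, 5, vocab_size)  # stand-in for transformer(input_ids)

# position t predicts token t+1: drop the last logit and the first label
shifted_logits = logits[:, :-1]            # (B, T-1, V)
labels = input_ids[:, 1:].clone()          # (B, T-1)
labels[attention_mask[:, 1:] == 0] = -100  # padded targets are ignored

criterion = nn.CrossEntropyLoss(ignore_index=-100)
# flatten batch and sequence dimensions: (B*(T-1), V) against (B*(T-1),)
loss = criterion(shifted_logits.reshape(-1, vocab_size), labels.reshape(-1))
print(labels)
print(loss)
```

The value `-100` is the default `ignore_index` of `nn.CrossEntropyLoss`, so positions with that label do not contribute to the loss or to the gradients.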

In [32]:
# loss of randomly initialized model

tensor(11.0828, grad_fn=<NllLossBackward0>)

Notice anything about this value? What about its exponential? Ever heard of perplexity? More about this in the next class.
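
One way to explore these questions is to compare the loss with the cross-entropy of a uniform guess over the vocabulary, as in this small sketch. The two constants are copied from the outputs above.

```python
import math

vocab_size = 50257       # size of the vocabulary, from the logits shape above
observed_loss = 11.0828  # loss of the randomly initialized model, from above

# cross-entropy of a model that assigns probability 1/V to every token
uniform_loss = math.log(vocab_size)
# perplexity is the exponential of the cross-entropy
perplexity = math.exp(observed_loss)

print(f"{uniform_loss:.4f}")  # 10.8249
print(f"{perplexity:.0f}")
```

A perplexity of $k$ can be read as the model hesitating between $k$ equally likely tokens at each step.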

## Training loop

Ensure that everything is on GPU by calling `.cuda()` or passing `device="cuda"` on init

In [15]:
%load_ext tensorboard

In [16]:
import torch
from torch.utils.tensorboard import SummaryWriter
writer = SummaryWriter("logs")

Run tensorboard before training. Refresh during training.

In [None]:
%tensorboard --logdir logs

In [None]:
seq_len=128
transformer = Transformer(vocab_size=tokenizer.vocab_size, hidden_size=256, num_layers=3, num_heads=4, dropout=0.1, seq_len=seq_len).cuda()

optimizer = torch.optim.AdamW(transformer.parameters(), lr=0.0001)

batch_size = 32
# in the interest of time, we simply overfit on a single batch
# try to train on the complete texts when you have more time
train_loader = torch.utils.data.DataLoader(dataset["train"][:batch_size], batch_size=batch_size, shuffle=True)
validation_loader = torch.utils.data.DataLoader(dataset["validation"], batch_size=batch_size, shuffle=False)

steps = 0
for epoch in range(1000):
    for text_batch in train_loader:
        input_ids = tokenizer(text_batch, return_tensors='pt', padding=True, truncation=True, max_length=seq_len)['input_ids'].cuda()
        logits = transformer(input_ids)
        raise NotImplementedError("TODO compute loss")
        writer.add_scalar("Loss/train", loss.item(), steps)
        optimizer.zero_grad()  # reset gradients left over from the previous step
        loss.backward()
        nn.utils.clip_grad_norm_(transformer.parameters(), 1.0)
        optimizer.step()
        steps += 1

        # validation
        if steps % 100 == 0:
            with torch.no_grad():
                transformer.eval()
                valid_loss = 0
                valid_batches = 0
                for text_batch in validation_loader:
                    input_ids = tokenizer(text_batch, return_tensors='pt', padding=True, truncation=True, max_length=seq_len)['input_ids'].cuda()
                    logits = transformer(input_ids)
                    raise NotImplementedError("TODO compute loss")
                    valid_loss += loss.item()
                    valid_batches += 1
                transformer.train()
                writer.add_scalar("Loss/validation", valid_loss/valid_batches, steps)

Save model

In [20]:
torch.save(transformer.state_dict(), "transformer.bin")

## Generate text

In [14]:
@torch.no_grad()
def decode(model, input_ids, max_new_tokens=32):
    # input_ids is a (B, T) tensor of token indices in the current context
    for _ in range(max_new_tokens):
        # get the predictions
        logits = model(input_ids)
        # focus only on the last time step
        logits = logits[:, -1] # becomes (B, V)
        # greedy decoding: pick the highest-scoring token for each sequence
        idx_next = logits.argmax(1).unsqueeze(1) # (B, 1)
        # append the selected index to the running sequence
        input_ids = torch.cat((input_ids, idx_next), dim=1) # (B, T+1)
    return input_ids

In [16]:
## load previously saved model
#transformer.load_state_dict(torch.load("transformer.bin"))

In [16]:
text_batch[0]

" As with previous Valkyira Chronicles games , Valkyria Chronicles III is a tactical role @-@ playing game where players take control of a military unit and take part in missions against enemy forces . Stories are told through comic book @-@ like panels with animated character portraits , with characters speaking partially through voiced speech bubbles and partially through unvoiced text . The player progresses through a series of linear missions , gradually unlocked as maps that can be freely scanned through and replayed as they are unlocked . The route to each story location on the map varies depending on an individual player 's approach : when one option is selected , the other is sealed off to the player . Outside missions , the player characters rest in a camp , where units can be customized and character growth occurs . Alongside the main story missions are character @-@ specific sub missions relating to different squad members . After the game 's completion , additional episodes

When overfitting on one single batch, the model simply memorizes training data

In [37]:
prompt = " As with previous"


In [None]:
input_ids = tokenizer([prompt], return_tensors='pt', padding=True, truncation=True, max_length=seq_len)['input_ids'].cuda()
output = decode(transformer, input_ids)

tokenizer.batch_decode(output)

It gets a bit better when you train on 10,000 examples for 20,000 steps, but that takes roughly one hour on a laptop GPU.

In [None]:
prompt = " The"

# Bonus: Visualize Attentions

Now that we understand the basic mechanisms of attention, we can check the activated attention patterns in a pretrained BERT model (Devlin et al. 2018). Recall that BERT is an encoder-based transformer model which is based on a stack of self-attention blocks.

In [None]:
from transformers import BertTokenizer, BertModel
from bertviz import head_view  # requires `%pip install bertviz`

# Define a sample input text
text = "I will go for a run and will jump into a lake."

# Instantiate the BERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Tokenize the input text
tokens = tokenizer.tokenize(text)

# Convert tokens to token IDs
token_ids = tokenizer.convert_tokens_to_ids(tokens)

# Create attention mask
attention_mask = [1] * len(token_ids)

# Convert token IDs and attention mask to tensors
input_ids = torch.tensor([token_ids])
attention_mask = torch.tensor([attention_mask])

# Generate the transformer output
outputs = model(input_ids, attention_mask=attention_mask, output_attentions=True)

# Extract attentions and check the shape
outputs.attentions[0].shape

As you can see, we extracted the attention from the first layer. The first dimension is the batch, the second is the number of heads in that layer, and the last two dimensions are both the sequence length. Since this is a self-attention block, the last two numbers are equal.

We can now use a method from the [bertviz library](https://github.com/jessevig/bertviz) and plot all the heads.

You'll see a dropdown menu that allows you to select a layer of the model (BERT-base has 12). You'll then see a color for every head used in that layer (BERT-base has 12 heads per layer). By default all heads are shown; click on a color to activate/deactivate that head. It can help to start by activating only one head and checking the relation learned by that self-attention head. By hovering over each word you can see the attention weights that link that word to all the others.

**Question** Do you notice any interesting (linguistic) pattern?

In [None]:
head_view(outputs.attentions, tokens=tokens)