# Transformer Day Exercises

In [1]:
# Set Up
!git clone https://github.com/LxMLS/lxmls-toolkit.git
%cd lxmls-toolkit/
import sys
import os
sys.path.append(os.getcwd())
!git checkout transformer-day

Cloning into 'lxmls-toolkit'...


remote: Enumerating objects: 4099, done.
remote: Counting objects: 100% (63/63), done.
remote: Compressing objects: 100% (46/46), done.
remote: Total 4099 (delta 28), reused 42 (delta 17), pack-reused 4036
Receiving objects: 100% (4099/4099), 26.25 MiB | 2.75 MiB/s, done.
Resolving deltas: 100% (2699/2699), done.
/Users/israfelsalazar/Documents/lxmls-toolkit/labs/notebooks/transformers/lxmls-toolkit
branch 'transformer-day' set up to track 'origin/transformer-day'.
Switched to a new branch 'transformer-day'


## Exercise 1: Tokenization
Tokenization is a crucial step in NLP. It involves splitting a sentence or a document into individual tokens, which are the basic units of language used for further analysis. Tokenization allows us to represent text data in a format that can be understood by machine learning models. In this exercise, we will explore tokenization to understand how it helps in NLP tasks.
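As a minimal sketch of the idea (this is not the BPE tokenizer used later in this notebook), the snippet below splits a sentence on whitespace and maps each distinct token to an integer ID. The helper names `build_vocab` and `encode` are hypothetical and exist only for this illustration.

```python
def build_vocab(tokens):
    """Assign an integer ID to each distinct token, in order of first appearance."""
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab


def encode(tokens, vocab):
    """Replace each token by its integer ID."""
    return [vocab[tok] for tok in tokens]


sentence = "the cat sat on the mat"
tokens = sentence.lower().split()  # naive whitespace tokenization
vocab = build_vocab(tokens)
ids = encode(tokens, vocab)

print(tokens)
print(ids)  # the repeated word "the" maps to the same ID both times
```

A whitespace tokenizer like this cannot handle words it has never seen, which is one motivation for subword schemes such as BPE.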

<details>
  <summary>1. Why is tokenization an important preprocessing step in NLP tasks?</summary>

Answer:
It helps in creating a structured representation of text by breaking it down into smaller units (tokens) such as words or subwords. Tokenization enables the creation of a vocabulary or dictionary of unique tokens, facilitates text normalization, accounts for language-specific considerations, and provides the necessary input for various text analysis and processing tasks.

</details>

**Your answer:**

<summary>2. Write a function that takes in a sentence as input and demonstrates tokenization with various preprocessing options using the transformers library. Incorporate lowercasing, removing stop words, and applying stemming or lemmatization to the tokens. Display the list of processed tokens obtained from the sentence, highlighting the impact of each preprocessing step on the resulting tokens.</summary>

add more info or link to other resources

In [2]:
import numpy as np
from lxmls.transformers.bpe import BPETokenizer

In [3]:
tokenizer = BPETokenizer()

downloading https://openaipublic.blob.core.windows.net/gpt-2/models/124M/encoder.json to /Users/israfelsalazar/.cache/mingpt/encoder.json
downloading https://openaipublic.blob.core.windows.net/gpt-2/models/124M/vocab.bpe to /Users/israfelsalazar/.cache/mingpt/vocab.bpe


In [4]:
# Tokenize a sample sentence
sentence = "I like to cirkumnavigate the globe every year"
tokenizer.encoder.encode_and_show_work(sentence)

{'bpe_idx': [40, 588, 284, 10774, 74, 4182, 615, 10055, 262, 13342, 790, 614],
 'tokens': ['I',
  ' like',
  ' to',
  ' cirkumnavigate',
  ' the',
  ' globe',
  ' every',
  ' year'],
 'parts': [{'token': 'I',
   'token_bytes': b'I',
   'token_translated': 'I',
   'token_merged': ['I'],
   'token_ix': [40]},
  {'token': ' like',
   'token_bytes': b' like',
   'token_translated': 'Ġlike',
   'token_merged': ['Ġlike'],
   'token_ix': [588]},
  {'token': ' to',
   'token_bytes': b' to',
   'token_translated': 'Ġto',
   'token_merged': ['Ġto'],
   'token_ix': [284]},
  {'token': ' cirkumnavigate',
   'token_bytes': b' cirkumnavigate',
   'token_translated': 'Ġcirkumnavigate',
   'token_merged': ['Ġcir', 'k', 'umn', 'av', 'igate'],
   'token_ix': [10774, 74, 4182, 615, 10055]},
  {'token': ' the',
   'token_bytes': b' the',
   'token_translated': 'Ġthe',
   'token_merged': ['Ġthe'],
   'token_ix': [262]},
  {'token': ' globe',
   'token_bytes': b' globe',
   'token_translated': 'Ġglobe',
   

## Exercise 2: Attention

Attention is a crucial component of the transformer: it allows the model to capture dependencies between different positions in the input sequence. Understanding how attention works and being able to implement it are essential for anyone working with transformers or natural language processing tasks.

Given a query ($Q$), key ($K$), and value ($V$) tensors, the attention mechanism computes a weighted sum of the value tensor based on the similarity between the query and key tensors as shown in the following equation:

$$
\text{Attention}(Q,K,V) = \text{softmax}\Big(\frac{QK^T}{\sqrt{d_k}}\Big)V
$$

where 
- $Q$ represents the query tensor.
- $K$ represents the key tensor.
- $V$ represents the value tensor.
- $d_k$ represents the dimensionality of each key vector (the last dimension of $K$).
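To make the shapes in the equation concrete, here is a small sketch with random tensors. The sizes `n`, `m`, `d_k` and `d_v` are arbitrary values chosen for illustration; they do not come from the exercises.

```python
import math

import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical sizes: n queries, m key/value pairs
n, m, d_k, d_v = 2, 4, 3, 5
Q = torch.randn(n, d_k)
K = torch.randn(m, d_k)
V = torch.randn(m, d_v)

scores = Q @ K.T / math.sqrt(d_k)    # shape (n, m): one score per query/key pair
weights = F.softmax(scores, dim=-1)  # shape (n, m): each row sums to 1
output = weights @ V                 # shape (n, d_v): weighted sum of value rows

print(scores.shape, output.shape)
print(weights.sum(dim=-1))
```

Each output row is a convex combination of the rows of $V$, with coefficients given by the corresponding row of the softmax.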

In this exercise, we will dive into the attention mechanism. To do so, we are going to build a simple cross-attention function that we will then extend to a more complex multi-head self-attention module that incorporates the concept of causality.

### Exercise 2.1: Building a Simple Cross-Attention Function

Cross-attention refers to the case where the input sequences used to compute $Q$, $K$, and $V$ come from different sources. It allows a model to incorporate contextual information from one sequence ($S_1$) into another ($S_2$). In this exercise the queries are computed from $S_2$, and the keys and values from $S_1$.

Given two input sequences $S_1$ and $S_2$ and the transformation weights $W_Q$, $W_K$ and $W_V$, complete the `cross_attention` function in the cell below. 

You need to implement the following:
- Calculate the query, key, and value projections using linear transformations.
- Compute the attention scores by taking the dot product between the query and key tensors, then scale them by $\sqrt{d_k}$.
- Apply softmax activation to the attention scores to obtain the attention weights.
- Multiply the attention weights with the value tensor to get the attended values.
- Return the attended values.

In [5]:
import torch
import torch.nn.functional as F

def cross_attention(S1, S2, W_Q, W_K, W_V):
    # Calculate Queries from sequence S2
    queries = torch.matmul(S2, W_Q)

    # Calculate Key and Value from sequence S1
    keys = torch.matmul(S1, W_K)
    values = torch.matmul(S1, W_V)

    # Compute attention scores
    attention_scores = torch.matmul(queries, keys.transpose(-2, -1))
    
    # Scale the attention scores
    d_k = queries.size(-1)
    attention_scores = attention_scores / torch.sqrt(torch.tensor(d_k, dtype=torch.float32))
    
    # Apply softmax to obtain attention weights
    attention_weights = F.softmax(attention_scores, dim=-1)
    
    # Compute the attended values
    attended_values = torch.matmul(attention_weights, values)
    
    return attended_values

In [6]:
# Do something more interesting, maybe using the tokenizer from the previous exercise.
S1 = torch.tensor([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]], dtype=torch.float)  # Sequence S1
S2 = torch.tensor([[2, 4], [1, 3]], dtype=torch.float)  # Sequence S2

W_Q = torch.tensor([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]])  # Query weights
W_K = torch.tensor([[0.7, 0.8, 0.9], [1.0, 1.1, 1.2], [1.3, 1.4, 1.5]])  # Key weights
W_V = torch.tensor([[1.6, 1.7, 1.8], [1.9, 2.0, 2.1], [2.2, 2.3, 2.4]])  # Value weights

# Perform cross-attention
attended_values = cross_attention(S1, S2, W_Q, W_K, W_V)

# Expected output
expected_output = torch.tensor([[63.3000, 66.6000, 69.9000],[63.3000, 66.6000, 69.9000]])

# Compare the output with the expected values
assert torch.allclose(attended_values, expected_output), "Oops! You need to check your function!"
print("Test passed!")

Test passed!
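Both rows of the expected output are identical, which may look surprising. The sketch below recomputes the intermediate tensors for the same inputs and suggests why: the scaled scores for the last key are so much larger than the others that the softmax is effectively one-hot, so each query simply returns the last row of the value tensor.

```python
import math

import torch
import torch.nn.functional as F

# Same inputs as in the test cell
S1 = torch.tensor([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]], dtype=torch.float)
S2 = torch.tensor([[2, 4], [1, 3]], dtype=torch.float)
W_Q = torch.tensor([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]])
W_K = torch.tensor([[0.7, 0.8, 0.9], [1.0, 1.1, 1.2], [1.3, 1.4, 1.5]])
W_V = torch.tensor([[1.6, 1.7, 1.8], [1.9, 2.0, 2.1], [2.2, 2.3, 2.4]])

queries = S2 @ W_Q
keys = S1 @ W_K
values = S1 @ W_V

scores = queries @ keys.T / math.sqrt(queries.size(-1))
weights = F.softmax(scores, dim=-1)

print(weights)     # almost all of the weight falls on the last key
print(values[-1])  # the last value row, which matches the expected output rows
```

With smaller weights or inputs the softmax would be less saturated and the two output rows would differ.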


### Exercise 2.2: Extending to Multi-Head Self-Attention
Great! You have successfully implemented cross-attention. Now, let's make some modifications so we can train a real GPT model.

**1. PyTorch Module**

The first modification involves embedding our function into a PyTorch module. As you may have noticed, in the previous exercise, we passed the transformation weights as inputs to the function. In a real-world scenario, these matrices are learned, and PyTorch can keep track of them for us.

**2. Self-Attention**

We will be replacing the cross-attention mechanism with self-attention. In self-attention, a single sequence acts as the query, key, and value, allowing attention to be computed within the sequence itself.

**3. Multi-Head**

Finally, we are going to extend the single-head attention function to multi-head attention. In the previous implementation, we had one set of weights for the input query, resulting in a single type of relationship between the query and the values. With multi-head attention, we can utilize multiple parallel single-head attention modules to obtain diverse relationships between the query and the values.
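One common way to run the heads in parallel is to slice the embedding dimension into `num_heads` chunks and move the head axis next to the batch axis. The sketch below shows only this reshaping step, with hypothetical sizes, and checks that merging the heads again recovers the original tensor.

```python
import torch

# Hypothetical sizes; C must be divisible by num_heads
B, T, C, num_heads = 2, 5, 8, 4
head_size = C // num_heads

x = torch.arange(B * T * C, dtype=torch.float32).view(B, T, C)

# (B, T, C) -> (B, T, num_heads, head_size) -> (B, num_heads, T, head_size)
heads = x.view(B, T, num_heads, head_size).transpose(1, 2)

# Inverse: (B, num_heads, T, head_size) -> (B, T, num_heads, head_size) -> (B, T, C)
merged = heads.transpose(1, 2).contiguous().view(B, T, C)

print(heads.shape)
print(torch.equal(merged, x))
```

Head `h` sees the slice `x[..., h * head_size:(h + 1) * head_size]` of every token, so attention can then be computed independently per head with batched matrix multiplications.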

Complete the missing lines on the initialization of the module and the forward pass.

In [7]:
import math
import torch.nn as nn

class CausalSelfAttention(nn.Module):

    def __init__(self, config):
        super().__init__()
        
        # Initialize layers and parameters
        self.hidden_size = config.n_embd
        self.num_heads = config.n_head

        # Create the linear projections for queries, keys, values and the output

        self.query_proj = nn.Linear(config.n_embd, config.n_embd)
        self.key_proj = nn.Linear(config.n_embd, config.n_embd)
        self.value_proj = nn.Linear(config.n_embd, config.n_embd)

        self.output_proj = nn.Linear(config.n_embd, config.n_embd)

        self.attn_dropout = nn.Dropout(config.attn_pdrop)
        self.resid_dropout = nn.Dropout(config.resid_pdrop)

        self.register_buffer(
            "bias",
            torch.tril(torch.ones(config.block_size, config.block_size)).view(
                1, 1, config.block_size, config.block_size))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.size()

        # Project the input into query, key, and value tensors
        query = self.query_proj(x)
        key = self.key_proj(x)
        value = self.value_proj(x)

        # Reshape and transpose tensors for multi-head computation
        query = query.view(B, T, self.num_heads,
                           self.hidden_size // self.num_heads).transpose(1, 2)
        key = key.view(B, T, self.num_heads,
                       self.hidden_size // self.num_heads).transpose(1, 2)
        value = value.view(B, T, self.num_heads,
                           self.hidden_size // self.num_heads).transpose(1, 2)

        # Compute attention scores
        scores = torch.matmul(query, key.transpose(-2, -1))

        # Scale the scores by dividing by the square root of the per-head dimension
        scores = scores / math.sqrt(self.hidden_size // self.num_heads)

        # Apply causal mask to restrict attention to the left in the input sequence
        mask = self.bias[:, :, :T, :T]
        scores = scores.masked_fill(mask == 0, float('-inf'))

        # Apply softmax activation to get attention weights
        weights = F.softmax(scores, dim=-1)

        # Multiply attention weights with values to get attended values
        attended_values = torch.matmul(self.attn_dropout(weights), value)

        # Transpose and reshape attended values to restore original shape
        attended_values = attended_values.transpose(1, 2).contiguous().view(
            B, T, C)

        # Apply output projection and dropout
        output = self.resid_dropout(self.output_proj(attended_values))

        return output

### Exercise 2.3: [Optional]
<details>
<summary>What is the purpose of applying a causal mask in the attention computation?</summary>
The causal mask ensures that, in the attention computation, each position in the sequence can only attend to itself and the positions to its left, preventing information leakage from future positions. This is essential in tasks where the model should generate output sequentially, such as language generation or other autoregressive tasks.
</details>
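The effect of the mask can be seen in isolation with a small sketch. It uses all-zero scores for a hypothetical sequence of length 4, so any structure in the resulting weights comes from the mask alone.

```python
import torch
import torch.nn.functional as F

T = 4  # hypothetical sequence length
scores = torch.zeros(T, T)  # uniform scores, for illustration only

# Lower-triangular matrix: position i may look at positions 0..i
mask = torch.tril(torch.ones(T, T))
masked_scores = scores.masked_fill(mask == 0, float('-inf'))

# exp(-inf) = 0, so masked positions receive exactly zero weight
weights = F.softmax(masked_scores, dim=-1)
print(weights)
```

Row `i` spreads its weight uniformly over positions `0..i` and puts nothing on later positions.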

<details>
<summary>How does the number of attention heads affect the model's capacity to capture different types of dependencies in the input sequence?</summary>
Multiple heads allow the model to attend to different parts of the input sequence simultaneously. By increasing the number of attention heads, the model can capture more diverse dependencies and patterns in the data. Each head can focus on different aspects of the input, enabling the model to learn complex relationships and improve performance on tasks that require capturing multiple types of dependencies.
</details>


<details>
<summary>What is the purpose of the residual dropout and attention dropout in the CausalSelfAttention module?</summary>
The residual dropout and attention dropout are regularization techniques used to prevent overfitting and improve the generalization of the model. The residual dropout applies dropout to the output of the attention module, helping to regularize the model during training. The attention dropout applies dropout to the attention weights, which helps to reduce over-reliance on specific tokens and encourages the model to attend to a broader range of tokens in the sequence.
</details>
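As a small illustration of how `nn.Dropout` behaves, the sketch below applies it to a vector of ones in training and evaluation mode. The probability `p=0.5` is an arbitrary choice for the example, not a value taken from the model configuration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

drop = nn.Dropout(p=0.5)  # p chosen for illustration
x = torch.ones(1000)

drop.train()
y_train = drop(x)  # each element is either zeroed or scaled by 1 / (1 - p) = 2

drop.eval()
y_eval = drop(x)   # dropout is a no-op in evaluation mode

print(y_train.unique())
print(torch.equal(y_eval, x))
```

The rescaling during training keeps the expected value of each activation unchanged, which is why no correction is needed at inference time.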



## Exercise 3:
Here I will do:
1. Create a small model
2. Train a model on a small dataset.
3. Visualize the attention before and after training.
4. Compare the performance during inference of a trained model and the small trained model.

Based on this
https://pytorch.org/tutorials/beginner/transformer_tutorial.html

Train on shakespeare and harry potter.
Check that the loss/perplexity decreases.


In [8]:
import torch
from torch.utils.data import Dataset
from torch.utils.data.dataloader import DataLoader
import numpy as np

import random
random.seed(42)

from lxmls.transformers.utils import set_seed
from lxmls.transformers.bpe import BPETokenizer
from lxmls.transformers.model import GPT
from lxmls.transformers.trainer import Trainer

In [9]:
import pickle
class WeatherDataset(Dataset):

    """Dataset for training an auto regressive transformer on a sequence of weather/actions
    Input (observations): ['clean', 'clean', 'shop', 'walk', 'shop', 'read']
    Input (IDs): [0, 0, 2, 4, 2, 1]
    Output (states): ['sunny', 'rainy', 'rainy', 'sunny', 'snowy', 'sunny']
    Output (IDs): [7, 5, 5, 7, 6, 7]
    Which we will feed into the transformer concatenated as:
    Input: [0, 0, 2, 4, 2, 1, 7, 5, 5, 7, 6]
    Output: [-1, -1, -1, -1, -1, 7, 5, 5, 7, 6, 7]
    where each observation and state are converted to an index and -1 indicates "ignore",
    as the transformer is reading the input sequence but not predicting it.
    """

    def __init__(self, split, seq_len = 6, num_instances=10000, proba = False):
        assert split in {'train', 'test'}
        self.split = split
        self.size = num_instances

        # Generate vocabulary
        self.obs, self.states = self.generate_voc()

        # Get HMM probabilities for dataset generation
        # We should work with a fixed proba, but there is a function for random generation
        if proba:
            self.proba = proba
        else:
            self.generate_random_proba()

        self.length = seq_len

    def __len__(self):
        return(self.size)

    def get_block_size(self):
        # the length of the sequence that will feed into transformer,
        # containing concatenated input and the output, but -1 because
        # the transformer starts making predictions at the last input element
        return self.length*2 -1

    def get_vocab_size(self):
        # Our vocabulary is the size of observation + states
        return(len(self.obs) + len(self.states))


    def generate_voc(self):
        """Generating vocabulary for the HMM model.
        Should not change that."""

        observations = ["walk", "shop", "clean", "tennis", "read"]
        states = ["sunny", "rainy", "snowy"]

        # Sort them alphabetically, just to be on the safe side
        observations.sort()
        states.sort()

        return(observations, states)

    # Dummy functions for decoding
    def decode_obs(self,obs):
        return([self.obs[i] for i in obs])

    # State IDs are offset by number of observations
    def decode_st(self,st):
        ofs = len(self.obs)
        return([self.states[i-ofs] for i in st])

    def decode_seq(self,x,y):

        return(self.decode_obs(x),self.decode_st(y))

    # Dummy function for converting random logits to probabilities
    def logits_to_probs(self,logits):
        logits = np.array(logits)  # Convert the list to a numpy array for efficient calculations
        exp_logits = np.exp(logits)  # Apply the exponential function to each element
        probabilities = exp_logits / np.sum(exp_logits)  # Divide each element by the sum of all elements
        return probabilities.tolist()  # Convert the numpy array back to a Python list

    # We should NOT use that.
    # Mostly for debugging purposes
    # The resulting dataset is almost unlearnable as it's randomly generated
    def generate_random_proba(self):

        # Generating a probability distribution for HMM
        self.proba = {}

        # Initial probabilities
        self.proba["initial"] = []

        # Generate random initial probabilities for each state
        for state in self.states:
            self.proba["initial"].append(random.random())

        # Convert to probabilities
        self.proba["initial"] = self.logits_to_probs(self.proba["initial"])

        # Transition probabilities
        self.proba["transition"] = []

        # Generate transition from state x to any other state
        for state in self.states:
            c_t_pr = []

            # Generate random tr probabilities for all states
            for state in self.states:
                c_t_pr.append(random.random())

            # N.B. we do NOT generate "Final" probabilities
            # We will generate a fixed length sequence instead
            # Lazy solution, I know...


            # Convert to probabilities
            c_t_pr = self.logits_to_probs(c_t_pr)

            self.proba["transition"].append(c_t_pr)

        # Emission probabilities
        self.proba["emission"] = []

        # Generate emission from state x to any observation
        for state in self.states:
            c_e_pr = []

            # Generate random em probabilities for all observations
            for obs in self.obs:
                c_e_pr.append(random.random())

            c_e_pr = self.logits_to_probs(c_e_pr)

            self.proba["emission"].append(c_e_pr)

    # Dummy function for sampling w.r.t probability
    def sample_p(self,p_l):
        items = np.arange(len(p_l))
        sample = np.random.choice(items, p=p_l)
        return sample

    def generate_seq(self):

        """Generating a random sequence given probas"""

        # Variable initialization
        eos = False
        c_s = 99
        x = []
        y = []

        while not eos:

            # Start of sequence
            if c_s == 99:
                # Sample from initial
                c_s = self.sample_p(self.proba["initial"])

            # Consecutive iterations

            # We generate until we get length of self length
            elif len(x) < self.length:
                # Sample from transition of last state
                c_s = self.sample_p(self.proba["transition"][c_s])

                # Generate emission

                # Note that we append the states as labels and observations as input
                y.append(c_s)
                x.append(self.sample_p(self.proba["emission"][c_s]))

            else:
                eos = True

        # We get the state ID by offsetting their idx by the length of observations
        ofs = len(self.obs)
        y = [i+ofs for i in y]
        return(x,y)


    def __getitem__(self, idx):

        # use rejection sampling to generate an input example from the desired split
        while True:

            # Generate observation and its states
            obs, st = self.generate_seq()

            # figure out if this generated example is train or test based on its hash
            h = hash(pickle.dumps(obs))
            inp_split = 'test' if h % 4 == 0 else 'train' # designate 25% of examples as test
            if inp_split == self.split:
                break # ok


        # concatenate the observation and labels
        cat = torch.cat((torch.LongTensor(obs), torch.LongTensor(st)), dim=0)

        # the inputs to the transformer will be the offset sequence
        x = cat[:-1].clone()
        y = cat[1:].clone()
        # we only want to predict at output locations, mask out the loss at the input locations
        y[:self.length-1] = -1
        return x, y
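The construction at the end of `__getitem__` can be traced by hand on the example from the class docstring. The sketch below repeats those steps outside the class, with the observation and state IDs hard-coded from the docstring rather than sampled.

```python
import torch

# IDs taken from the docstring example
obs = [0, 0, 2, 4, 2, 1]  # observations
st = [7, 5, 5, 7, 6, 7]   # states, already offset by the number of observations
seq_len = len(obs)

cat = torch.cat((torch.LongTensor(obs), torch.LongTensor(st)), dim=0)

x = cat[:-1].clone()  # model input: everything except the last token
y = cat[1:].clone()   # target: the same sequence shifted left by one
y[:seq_len - 1] = -1  # ignore targets that are still observations

print(x.tolist())
print(y.tolist())
```

The first unmasked target sits at index `seq_len - 1`, where the model has just read the last observation and must predict the first state.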

In [10]:
# Fixed probabilities, easier to learn
fixed_proba = {}
fixed_proba["initial"] = [.5,.3,.2]
fixed_proba["transition"] = [
    [.5,.5,0],
    [0,.5,.5],
    [.5,0,.5]
]
fixed_proba["emission"] = [
    [.5,0,.2,0,.3],
    [0,.5,.4,0,.1],
    [0,0,.1,.5,.4]

]
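When writing HMM tables by hand it is easy to end up with a row that does not sum to one, which would make `np.random.choice` raise an error during sampling. The following sketch is an optional sanity check; the helper `is_distribution` is a hypothetical name introduced here. Because the vocabulary is sorted alphabetically, the rows follow the state order rainy, snowy, sunny and the emission columns follow the order clean, read, shop, tennis, walk.

```python
import math

# Same tables as in the cell above
fixed_proba = {
    "initial": [.5, .3, .2],
    "transition": [[.5, .5, 0], [0, .5, .5], [.5, 0, .5]],
    "emission": [[.5, 0, .2, 0, .3], [0, .5, .4, 0, .1], [0, 0, .1, .5, .4]],
}


def is_distribution(row):
    """True if all entries are non-negative and they sum to 1 (up to rounding)."""
    return all(p >= 0 for p in row) and math.isclose(sum(row), 1.0)


rows = [fixed_proba["initial"]] + fixed_proba["transition"] + fixed_proba["emission"]
all_valid = all(is_distribution(row) for row in rows)
print(all_valid)
```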

In [11]:
# print an example instance of the dataset
train_dataset = WeatherDataset('train',proba=fixed_proba)
test_dataset = WeatherDataset('test',proba=train_dataset.proba)
x, y = train_dataset[0]
print(x.tolist())
print(y.tolist())
print(train_dataset.decode_obs(x.tolist()[:6]))
print(train_dataset.decode_st(y.tolist()[5:]))

[4, 2, 1, 4, 2, 4, 5, 6, 6, 7, 5]
[-1, -1, -1, -1, -1, 5, 6, 6, 7, 5, 5]
['walk', 'shop', 'read', 'walk', 'shop', 'walk']
['rainy', 'snowy', 'snowy', 'sunny', 'rainy', 'rainy']


In [12]:
# create a GPT instance

model_config = GPT.get_default_config()
model_config.model_type = 'gpt-nano'
model_config.vocab_size = train_dataset.get_vocab_size()
model_config.block_size = train_dataset.get_block_size()
model = GPT(model_config)

number of parameters: 0.09M


In [13]:
# create a Trainer object

train_config = Trainer.get_default_config()
train_config.learning_rate = 5e-4 # the model we're using is so small that we can go a bit faster
train_config.max_iters = 2000
train_config.num_workers = 0
trainer = Trainer(train_config, model, train_dataset)

running on device cpu


In [14]:
import time

In [15]:
def batch_end_callback(trainer):
    if trainer.iter_num % 100 == 0:
        print(f"iter_dt {trainer.iter_dt * 1000:.2f}ms; iter {trainer.iter_num}: train loss {trainer.loss.item():.5f}")
trainer.set_callback('on_batch_end', batch_end_callback)

start_time = time.time()
trainer.run()
end_time = time.time()
elapsed_time = end_time - start_time

# Print the training time
print("Training time: {:.2f} seconds".format(elapsed_time))

iter_dt 0.00ms; iter 0: train loss 2.09532
iter_dt 35.29ms; iter 100: train loss 0.58396
iter_dt 27.47ms; iter 200: train loss 0.33435
iter_dt 26.30ms; iter 300: train loss 0.36664
iter_dt 27.40ms; iter 400: train loss 0.34780
iter_dt 26.39ms; iter 500: train loss 0.39525
iter_dt 26.82ms; iter 600: train loss 0.28695
iter_dt 26.82ms; iter 700: train loss 0.22949
iter_dt 26.60ms; iter 800: train loss 0.29234
iter_dt 25.34ms; iter 900: train loss 0.31274
iter_dt 27.38ms; iter 1000: train loss 0.26250
iter_dt 30.06ms; iter 1100: train loss 0.27067
iter_dt 26.03ms; iter 1200: train loss 0.25588
iter_dt 26.97ms; iter 1300: train loss 0.24768
iter_dt 28.73ms; iter 1400: train loss 0.28382
iter_dt 25.36ms; iter 1500: train loss 0.27474
iter_dt 26.06ms; iter 1600: train loss 0.29187
iter_dt 26.22ms; iter 1700: train loss 0.26703
iter_dt 30.16ms; iter 1800: train loss 0.30363
iter_dt 26.42ms; iter 1900: train loss 0.29262
Training time: 55.49 seconds
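To relate these numbers to perplexity, the sketch below assumes that the reported loss is a mean cross-entropy in nats (this is how `torch.nn.functional.cross_entropy` behaves, but the toolkit's loss is not shown here, so treat it as an assumption). Under that assumption, perplexity is the exponential of the loss, and a model guessing uniformly over the 8 vocabulary items (5 observations plus 3 states) would have a loss of $\ln 8$.

```python
import math

vocab_size = 8  # 5 observations + 3 states in WeatherDataset
uniform_loss = math.log(vocab_size)  # loss of a uniform guess over the vocabulary

# Values copied from the training log above
first_loss = 2.09532
last_loss = 0.29262

print(f"uniform-guess loss: {uniform_loss:.4f}")
print(f"perplexity at iter 0:    {math.exp(first_loss):.2f}")
print(f"perplexity at iter 1900: {math.exp(last_loss):.2f}")
```

The loss at iteration 0 is close to the uniform baseline, as expected for an untrained model. Since the HMM is stochastic, the loss cannot reach zero even for a perfect model.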


In [16]:
!pip install transformers bertviz

Collecting bertviz
  Downloading bertviz-1.4.0-py3-none-any.whl (157 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 157.6/157.6 kB 1.0 MB/s eta 0:00:00
Installing collected packages: bertviz
Successfully installed bertviz-1.4.0


In [17]:
from transformers import BertTokenizer, BertModel
from bertviz import head_view

# Define a sample input text
text = "I will go for a run and will jump into a lake."

# Instantiate the BERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Tokenize the input text
tokens = tokenizer.tokenize(text)

# Convert tokens to token IDs
token_ids = tokenizer.convert_tokens_to_ids(tokens)

# Create attention mask
attention_mask = [1] * len(token_ids)

# Convert token IDs and attention mask to tensors
input_ids = torch.tensor([token_ids])
attention_mask = torch.tensor([attention_mask])

# Generate the transformer output
outputs = model(input_ids, attention_mask=attention_mask, output_attentions=True)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [18]:
outputs.attentions[0].shape

torch.Size([1, 12, 13, 13])

In [19]:
head_view(outputs.attentions, tokens=tokens)

<IPython.core.display.Javascript object>