# Transformer Day Exercises

In [1]:
# Set Up
#!git clone https://github.com/LxMLS/lxmls-toolkit.git
#%cd lxmls-toolkit/
import sys
import os
#sys.path.append(os.getcwd())
#sys.path.append("/Users/israfelsalazar/Documents/lxmls-toolkit")
#!git checkout transformer-day

## Exercise 1: Tokenization ✅

*Tokenization* is a fundamental process in modern NLP pipelines. It works by splitting a sentence, which consists of a sequence of characters, into atomic elements called **tokens**, possibly discarding other elements such as punctuation. These tokens are the basic language units used by the model.

As we will see, the granularity of such units may vary.
In general, *tokenization allows us to represent text data in a format that can be processed by standard deep learning models*. In this exercise, we will explore tokenization to understand how it helps in NLP tasks.

### Word-based tokenizers

The easiest way to split a text into units is probably to divide it on whitespace. The idea is to chunk a sentence into segments every time a space character is found. In Python, we can do this with the `split()` method.

_string.split(separator, maxsplit)_

where `separator` defaults to whitespace and, by default, there is no limit on the number of splits (`maxsplit`). You can read the Python docs for `split()` [here](https://docs.python.org/3/library/stdtypes.html#str.split).
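As a quick illustration of the two arguments, here is a minimal sketch using only the built-in `str.split` method. The example strings are made up for illustration.

```python
text = "I travelled to Lisbon in July"

# Default call: split on any run of whitespace, with no limit on the number of splits.
default_split = text.split()
print(default_split)

# maxsplit=2 performs at most two splits and leaves the rest of the string intact.
limited_split = text.split(maxsplit=2)
print(limited_split)

# An explicit separator splits on that exact string instead of on whitespace.
date_parts = "2023-07-14".split("-")
print(date_parts)
```

Note that with the default separator, consecutive spaces are treated as a single delimiter, which is usually what we want for word-level tokenization.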

In [2]:
text = "I travelled to Lisbon in July to attend an NLP summer school"
text.split()

['I',
 'travelled',
 'to',
 'Lisbon',
 'in',
 'July',
 'to',
 'attend',
 'an',
 'NLP',
 'summer',
 'school']

Now, this list of words can be fed to a model to perform tasks like next-token prediction, machine translation, story generation, etc.

Tokenizing based on words allows the model to work with the basic units of language without having to learn things like word boundaries. However, a downside of this approach is that we end up with an extremely large vocabulary, with an entry for each word of the language.
Because of limited computational resources, deep learning models still struggle to handle vocabularies that are larger than tens of thousands of tokens. For this reason, some words need to be left out of the vocabulary and are all mapped to a shared token usually called UNK, which stands for unknown (word).
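To make the role of the UNK token concrete, here is a minimal sketch of a capped word-level vocabulary. The helper names `build_vocab` and `encode`, the `<unk>` spelling, and the toy corpus are all assumptions made for illustration; they are not part of the lxmls toolkit.

```python
from collections import Counter

def build_vocab(sentences, max_size):
    """Keep the max_size most frequent words; index 0 is reserved for <unk>."""
    counts = Counter(word for s in sentences for word in s.split())
    vocab = {"<unk>": 0}
    for word, _ in counts.most_common(max_size):
        vocab[word] = len(vocab)
    return vocab

def encode(sentence, vocab):
    """Map each word to its index, falling back to <unk> for unseen words."""
    return [vocab.get(word, vocab["<unk>"]) for word in sentence.split()]

# Hypothetical toy corpus: "dog" and "ran" are too rare to make the cut.
corpus = ["the cat sat", "the cat ran", "the dog sat"]
vocab = build_vocab(corpus, max_size=3)

print(vocab)
print(encode("the dog ran", vocab))
```

In this toy setting "dog" and "ran" both collapse onto the same index, so the model can no longer tell them apart. This is the information loss that subword tokenizers try to avoid.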

#### Text normalization

In actual tokenization pipelines, text is usually normalized before being tokenized. Normalization means reducing each word to a basic form that is stripped of suffixes or functional information. For example, the verb "run" might appear as "running", "runs" or "ran", and the word "program" might appear as "programmer"; normalization maps such variants to a shared base form.
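The sketch below shows the idea with a deliberately naive normalizer that lowercases a word and strips a few suffixes. The suffix list and the `normalize` helper are hypothetical toy rules, not a real stemmer; actual pipelines typically rely on a dedicated stemming or lemmatization library.

```python
# Hypothetical toy rules, checked in order. Real stemmers use far richer rule sets.
SUFFIXES = ("ning", "ing", "s")

def normalize(word):
    """Lowercase the word and strip the first matching suffix, keeping at least 3 characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

normalized = [normalize(w) for w in ["Running", "runs", "run", "ran"]]
print(normalized)
```

Suffix stripping handles "running" and "runs", but the irregular form "ran" is left untouched. Mapping it to "run" requires a lemmatizer backed by a dictionary.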


### Character-based tokenizers

Another option is to tokenize text based on **characters**. This gives us a much smaller model vocabulary and a way to handle new words, since any word can be composed from known characters. However, it does not give the model any notion of words, which must instead be learned during training.
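The trade-off can be sketched in a few lines of plain Python. The variable names are illustrative only; the snippet builds a character vocabulary for one sentence and compares the resulting sequence length with the word-level one.

```python
text = "I travelled to Lisbon in July"

# Character-level vocabulary: one entry per distinct character in the text.
char_vocab = {ch: idx for idx, ch in enumerate(sorted(set(text)))}
char_ids = [char_vocab[ch] for ch in text]

# Word-level tokens, for comparison.
word_tokens = text.split()

print("distinct characters:", len(char_vocab))
print("character-level sequence length:", len(char_ids))
print("word-level sequence length:", len(word_tokens))

# Decoding simply inverts the mapping.
inv_vocab = {idx: ch for ch, idx in char_vocab.items()}
decoded = "".join(inv_vocab[i] for i in char_ids)
print(decoded)
```

The vocabulary stays tiny, but the same sentence now takes many more tokens to represent.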

In [3]:
text = "I will travel to Lisbon in July, I will attend an NLP summer school but I hope to visit around: it's a beatiful city!"
tokenized = [c for c in text if c not in [",", ";", ":", "'", "!", "?"]]

for i in tokenized:
    print(i)

I
 
w
i
l
l
 
t
r
a
v
e
l
 
t
o
 
L
i
s
b
o
n
 
i
n
 
J
u
l
y
 
I
 
w
i
l
l
 
a
t
t
e
n
d
 
a
n
 
N
L
P
 
s
u
m
m
e
r
 
s
c
h
o
o
l
 
b
u
t
 
I
 
h
o
p
e
 
t
o
 
v
i
s
i
t
 
a
r
o
u
n
d
 
i
t
s
 
a
 
b
e
a
t
i
f
u
l
 
c
i
t
y


**Question** What other problems do character-based tokenizers pose to NLP models?

**Your Answer**: 

<span style="color:yellow">

### Word tokenizers
The vocabulary space becomes very large. The model will also struggle with new or rare words. 

### Character tokenizers
While the vocabulary is small (typically 26 + 10 for English alphanumeric characters), the token sequences become very long.

</span>

### The best of both worlds: subword-based tokenizers

In order to combine the best of both worlds, the most common tokenization strategy in modern NLP systems is to tokenize based on **subwords**. Subwords are sequences of characters that can be shorter than entire words.

How words are split into subwords depends on *how frequent a given sequence of characters is*. The core idea is that very frequent sequences are not split, since they are very likely to be used and appear often in corpora.

For instance, "unexpectedly" can be a **rare word** in a corpus and be split into the subwords "un", "expected", "ly".
These are standalone subwords that can be reused across other words, while the meaning of "unexpectedly" can be recovered by combining the three subwords. On the other hand, words like "cats", "people", and "running", given their high frequency, will not be split by the tokenizer.


#### BPE (byte-pair encoding) tokenizer

One of the most widely used subword tokenizers is based on a method called <span style="color:magenta"> **BPE (byte-pair encoding)**</span>, which was introduced for subword segmentation in a paper by [Sennrich et al., 2016](https://aclanthology.org/P16-1162/). BPE relies on an *iterative algorithm based on the frequency of character sequences*. At each step, the algorithm counts how often each pair of adjacent symbols occurs and merges the most frequent pair into a new symbol.
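The merge loop can be sketched on a toy corpus as follows. This is a simplified character-level version written for illustration: the helper names `pair_counts` and `merge_pair` and the word frequencies are assumptions, and real tokenizers such as GPT-2's operate on bytes and learn tens of thousands of merges.

```python
from collections import Counter

def pair_counts(corpus):
    """Count adjacent symbol pairs, weighted by word frequency."""
    counts = Counter()
    for symbols, freq in corpus.items():
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts

def merge_pair(corpus, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in corpus.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Toy corpus: each word is a tuple of characters mapped to a made-up frequency.
corpus = {tuple("low"): 5, tuple("lower"): 2, tuple("newest"): 6, tuple("widest"): 3}

merges = []
for _ in range(3):
    # Pick the most frequent adjacent pair and merge it everywhere.
    best = pair_counts(corpus).most_common(1)[0][0]
    corpus = merge_pair(corpus, best)
    merges.append(best)

print(merges)
print(list(corpus))
```

After the first two merges the frequent ending "est" has become a single symbol, while rarer character sequences remain split. The ordered list of merges is what a trained BPE tokenizer stores and replays on new text.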

##### Notes

Note that after we split words into subwords, we are left with all the elements that will form the *vocabulary of our model*. Each subword is then mapped to an **index in the model vocabulary**. This is a standard mapping between string representations of (sub)words and vocabulary entries.

Remember that subword tokenizers need to be "trained": they need to learn how to split words based on the data that they see. This is not the same training that we do with our neural network models, since tokenizers are *non-parametric deterministic methods*.

Lastly, note that different data leads to different word splitting choices and that a tokenizer is directly tied to a model. For this reason, keep in mind that a model trained with a given tokenizer is not guaranteed to perform equally well when coupled with a different tokenizer.

##### Extra:

Each vocabulary index, which corresponds to a subword, is then used by the model to load and process a related (sub)word embedding representation. These are dense vectors that the model will use when doing computation for any NLP task. If you want to learn more about word embeddings, check out this [website](https://lena-voita.github.io/nlp_course/word_embeddings.html).

### Using BPE

We will now look at a real example using BPE. We can import the BPE tokenizer from the lxmls toolkit. BPE, which is used in models like GPT-2, and other commonly used subword tokenizers like WordPiece (used in BERT) are available in practically any standard NLP library, such as Hugging Face.

In [4]:
from lxmls.transformers.bpe import BPETokenizer

In [5]:
tokenizer = BPETokenizer()

We will now split the sentence: *"Your drawing is charmingly anachronistic."*

**Question:** Do you have a guess as to which words will be split into subwords and which ones won't?

<span style="color:yellow">
I would guess 'ing$' 'ly$' '^an'

</span>

In [6]:
# Tokenize a sample sentence
sentence = "Your drawing is charmingly anachronistic."
tokens = tokenizer.encoder.encode_and_show_work(sentence)

In [7]:
import numpy as np 

merged_tokens = [x['token_merged'] for x in tokens['parts']]
merged_tokens = list(np.concatenate(merged_tokens).flat)

print(merged_tokens)

['Your', 'Ġdrawing', 'Ġis', 'Ġcharming', 'ly', 'Ġan', 'ach', 'ron', 'istic', '.']


We can now look at the Python dictionary returned by the tokenizer. If you look at `bpe_idx`, you see all the subwords that the tokenizer produced. These numbers are the indexes in the vocabulary of our model. The `parts` field contains the actual splitting in the `token_merged` field.

As you can see, some words have been split. Do they match your initial guess? Why?

#### Whitespaces

You have probably noticed a special `Ġ` character inserted before each word except the first one. This is because spaces are converted into this special character by the BPE tokenizer, so that "run" and " run" are not treated equally and the tokenizer can tell whether a word is preceded by a space (for example, whether it is at the beginning of a sentence or not).

This was found to provide better performance in the original GPT-2 model. For more information you can read [here](https://discuss.huggingface.co/t/bpe-tokenizers-and-spaces-before-words/475?u=joaogante).
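Where does `Ġ` come from? The sketch below is modeled on the byte-to-character table used by GPT-2's reference encoder: printable bytes keep their own character, and every other byte (including the space, byte 32) is shifted to a code point above 255. Whether the lxmls tokenizer builds its table in exactly this way is an assumption.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a distinct printable Unicode character."""
    # Bytes that already correspond to printable, non-whitespace characters keep their code point.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    for b in range(256):
        if b not in printable:
            # Remaining bytes (control characters, space, ...) are moved past code point 255.
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(cp) for b, cp in zip(byte_values, code_points)}

byte_encoder = bytes_to_unicode()

# The space is byte 32, the 33rd non-printable byte, so it lands on code point 256 + 32.
space_symbol = byte_encoder[ord(" ")]
print(space_symbol)

# A word preceded by a space therefore starts with that symbol.
spaced_word = "".join(byte_encoder[b] for b in " very".encode("utf-8"))
print(spaced_word)
```

With this table every byte has a visible stand-in, so the tokenizer never has to deal with raw whitespace or control characters inside its vocabulary.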

Now look at the next two tokenized sentences. Can you notice how the word "very" is assigned two different tokens?

In [8]:
sentence = "running is very cool"
tokens = tokenizer.encoder.encode_and_show_work(sentence)
merged_tokens = [x['token_merged'] for x in tokens['parts']]
merged_tokens = list(np.concatenate(merged_tokens).flat)

print(merged_tokens)

['running', 'Ġis', 'Ġvery', 'Ġcool']


In [9]:
sentence = "very cool!"
tokens = tokenizer.encoder.encode_and_show_work(sentence)
merged_tokens = [x['token_merged'] for x in tokens['parts']]
merged_tokens = list(np.concatenate(merged_tokens).flat)

print(merged_tokens)

['very', 'Ġcool', '!']


#### Handling typos

Now we are going to tokenize another two very similar sentences.

In [10]:
sentence = "I like to circumnavigate the globe every year"
tokens = tokenizer.encoder.encode_and_show_work(sentence)
merged_tokens = [x['token_merged'] for x in tokens['parts']]
merged_tokens = list(np.concatenate(merged_tokens).flat)

print(merged_tokens)

['I', 'Ġlike', 'Ġto', 'Ġcirc', 'umn', 'av', 'igate', 'Ġthe', 'Ġglobe', 'Ġevery', 'Ġyear']


In [11]:
sentence = "I like to cirkumnavigate the globe every year"
tokens = tokenizer.encoder.encode_and_show_work(sentence)
merged_tokens = [x['token_merged'] for x in tokens['parts']]
merged_tokens = list(np.concatenate(merged_tokens).flat)

print(merged_tokens)

['I', 'Ġlike', 'Ġto', 'Ġcir', 'k', 'umn', 'av', 'igate', 'Ġthe', 'Ġglobe', 'Ġevery', 'Ġyear']


Why have they been tokenized differently? 

The only difference between these two sentences is the **typo** in the word "circumnavigate". As you can see, a _simple change_ in the word spelling alters the tokenization and leads to a different result. However, unlike word-based tokenizers, where the misspelled word would have been processed as an _unknown_ word, here we still retain the other, correct subwords, and our favorite NLP model can hopefully partially make up for the typo while processing the sentence.

#### Determinism

Finally, recall that another important aspect of the tokenization process is that it's fully deterministic.  <span style="color:magenta"> Once we split a sentence into chunks and obtain the list of word indexes, we can fully revert the process and decode back the original text.</span>

In [12]:
original_sentence = "We are about to start exercise 2 about attention, let's have fun!"
tokenized_sentence = tokenizer.encoder.encode(original_sentence)
reconstructed_sentence = tokenizer.encoder.decode(tokenized_sentence)

print(reconstructed_sentence)

We are about to start exercise 2 about attention, let's have fun!


## Exercise 2: Attention ✅

Attention is a crucial component of the transformer: it captures dependencies between the positions of two sequences of elements. In our case, and in most NLP applications, sequences are sentences and elements are (sub)words.
It is a powerful operation that learns an alignment between the elements of the two sequences: it generates a score of how related each element of one sequence is to each element of the other.
Understanding how attention works and being able to implement it are essential for anyone working with transformers.

Given query ($Q$), key ($K$), and value ($V$) tensors, the attention mechanism computes a weighted sum of the value tensor based on the similarity between the query and key tensors, as shown in the following equation:

$$
\text{Attention}(Q,K,V) = \text{softmax}\Big(\frac{QK^T}{\sqrt{d_k}}\Big)V
$$

where 
- $Q$ represents the query tensor.
- $K$ represents the key tensor.
- $V$ represents the value tensor.
- $d_k$ represents the dimensionality of the key tensor.
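The equation translates almost line by line into PyTorch. The following is a minimal sketch on random tensors; the function name `scaled_dot_product_attention` and the tensor sizes are chosen here for illustration.

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """Return softmax(Q K^T / sqrt(d_k)) V together with the attention weights."""
    d_k = K.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # (n_queries, n_keys)
    weights = F.softmax(scores, dim=-1)                # each row sums to 1
    return weights @ V, weights

torch.manual_seed(0)
Q = torch.randn(2, 4)  # 2 queries with d_k = 4
K = torch.randn(3, 4)  # 3 keys with d_k = 4
V = torch.randn(3, 5)  # 3 values of dimension 5

output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape)          # one output vector per query
print(weights.sum(dim=-1))   # rows of the weight matrix sum to 1
```

Each output row is a convex combination of the rows of $V$, with the mixing weights given by how well the corresponding query matches each key.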

This is the figure from the original Transformer paper that shows the computations used in attention.

Forget about the right part for now; we'll get back to it later in the lab.

![image](https://miro.medium.com/v2/resize:fit:1270/1*LpDpZojgoKTPBBt8wdC4nQ.png)


In this exercise, we will dive into the attention mechanism. To do so, we are going to build a simple cross-attention function that we will then extend to a more complex multi-head self-attention module that incorporates the concept of causality.

### Exercise 2.1: Building a Simple Cross-Attention Function ✅

Cross-attention refers to the case where the input sequences to compute $Q$, $K$, and $V$ come from different sources. It allows models to incorporate contextual information from one sequence (S1) into another (S2). <a name="cite_ref-1"></a>[<sup>[1]</sup>](#cite_note-1)


Given two input sequences $S_1$ and $S_2$ and the transformation weights $W_Q$, $W_K$ and $W_V$, complete the `cross_attention` function in the cell below. 

You need to implement the following:
- Calculate the query, key, and value projections using linear transformations.
- Compute the attention scores by performing the dot product between the query and key tensors.
- Apply softmax activation to the attention scores to obtain the attention weights.
- Multiply the attention weights with the value tensor to get the attended values.
- Return the attended values.

<a name="cite_note-1"></a>1. [^](#cite_ref-1) Conceptually, the self attention variant that you might have heard is the same, with the only difference that the S1 and S2 sequences are the same.

In [13]:
import torch
import torch.nn.functional as F

def cross_attention(S1, S2, W_Q, W_K, W_V):
    
    # Calculate Queries from sequence S2
    Q = torch.matmul(S2.T, W_Q)

    # Calculate Key and Value from sequence S1
    K = torch.matmul(S1.T, W_K)
    V = torch.matmul(S1.T, W_V)  

    # Compute attention scores
    SAS = torch.matmul(Q, K.transpose(-2, -1)) 
    
    # Scale the attention scores
    d_k = Q.size(-1)
    SAS = SAS / torch.sqrt(torch.tensor(d_k, dtype=torch.float32))
    
    
    # Apply softmax to obtain attention weights
    SAS = F.softmax(SAS, dim=1) 
    
    # Compute the attended values
    attended_values = torch.matmul(SAS, V)
    
    return attended_values

In [14]:
sentence = "I travelled to Lisbon in July to attend an NLP summer school"
query = "I like traveling"
S1 = torch.tensor([tokenizer.encoder.encode(sentence)], dtype=torch.float)
S2 = torch.tensor([tokenizer.encoder.encode(query)], dtype=torch.float)

W_Q = torch.tensor([[0.1, 0.2, 0.3]])  # Query weights
W_K = torch.tensor([[0.7, 0.8, 0.9]])  # Key weights
W_V = torch.tensor([[1.6, 1.7, 1.8]])  # Value weights

# Perform cross-attention
attended_values = cross_attention(S1, S2, W_Q, W_K, W_V)

print(attended_values)

# Expected output
expected_output = torch.tensor([[67036.8047, 71226.6016, 75416.3984]])

# Compare the output with the expected values
assert torch.allclose(attended_values, expected_output), "Oops! You need to check your function!"
print("Test passed!")

tensor([[67036.8047, 71226.6016, 75416.3984],
        [67036.8047, 71226.6016, 75416.3984],
        [67036.8047, 71226.6016, 75416.3984]])
Test passed!


### Exercise 2.2: Extending to Multi-Head Self-Attention ✅
Great! You have successfully implemented cross-attention. Now, let's make some modifications so we can train a real GPT model.

**1. Self-Attention**

We will be replacing the cross-attention mechanism with self-attention. In self-attention, a single sequence acts as the query, key, and value, allowing attention to be computed within the sequence itself. This can be useful for modeling syntax, where an attention head can capture the relationship between words such as subjects and verbs.

**2. Multi-Head**

However, even a single sentence contains more than one kind of relation. Think about number and gender agreement, the semantic relation between subject and object, the functional role of verb arguments, etc. All of this cannot be modeled by a single head.

For this reason, we are going to extend the single-head attention function to **multi-head attention**. In the previous implementation, we had one set of weights for the input query, resulting in a single type of _relationship between the source and target sequence_. With multi-head attention, we can use _multiple parallel single-head attention modules_ to obtain diverse relationships between the query and the values. The attention operation works by projecting the sequences through a multiplication with a projection matrix, and then computing the alignment score. These operations can all be parallelized, since there is no interdependency between heads. As a result, each head can learn to model a different linguistic interaction useful for many downstream tasks, be it syntactic, semantic or generation-based.
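In practice the heads are usually not separate modules: the embedding dimension is split into `num_heads` slices and all heads are processed in one batched matrix multiplication. The sketch below illustrates only this reshaping trick, with made-up sizes and without any learned projections.

```python
import torch

B, T, C, num_heads = 2, 5, 12, 3   # batch, sequence length, embedding size, heads
head_dim = C // num_heads

x = torch.randn(B, T, C)

# Split the last dimension into heads, then move the head axis next to the batch axis.
heads = x.view(B, T, num_heads, head_dim).transpose(1, 2)   # (B, num_heads, T, head_dim)

# One batched matmul computes a (T, T) score matrix for every head independently.
scores = heads @ heads.transpose(-2, -1)                     # (B, num_heads, T, T)

# Undo the split: move the head axis back and flatten it into the embedding dimension.
merged = heads.transpose(1, 2).contiguous().view(B, T, C)

print(heads.shape, scores.shape, merged.shape)
```

Because the split and the merge are exact inverses, no information is lost; each head simply works on its own slice of the embedding.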

**3. PyTorch Module**

The last modification involves embedding our function into a PyTorch module. As you may have noticed, in the previous exercise, we passed the transformation weights as inputs to the function. In a real-world scenario, these matrices are learned, and PyTorch can keep track of them for us.

Complete the missing lines on the initialization of the module and the forward pass.

###### Note

GPT uses a version of self-attention called causal self-attention. When training our models for tasks like language modeling and machine translation, in practice we feed the entire training sequence to the model but, at every timestep, we want to prevent it from computing the alignment with future tokens. For this reason we use a mask that we incrementally lift at every timestep. For instance, take the sentence "Lisbon is a great city to live in". At time 0, we feed the entire sentence to the model, masking everything but the first token. Using strikethrough to indicate masking, this is what the model sees at step 0:

- Time 0: Lisbon ~is a great city to live in~

We then let the model generate a token and move to step 1, where we mask everything but the first two tokens:
 
- Time 1: Lisbon is ~a great city to live in~ 

and so on...

- Time 2: Lisbon is a ~great city to live in~ 
- Time 3: Lisbon is a great ~city to live in~ 
- Time 4: Lisbon is a great city ~to live in~ 
- Time 5: Lisbon is a great city to ~live in~ 
- Time 6: Lisbon is a great city to live ~in~ 
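In an implementation, all of these timesteps are handled at once with a single lower-triangular mask: row `t` of the mask corresponds to "Time t" in the list above. The following is a minimal sketch with uniform raw scores, chosen so that the effect of the mask is easy to read.

```python
import torch
import torch.nn.functional as F

T = 4
scores = torch.zeros(T, T)            # uniform raw scores, for readability
mask = torch.tril(torch.ones(T, T))   # 1 where attention is allowed, 0 for future positions

# Future positions get -inf so that softmax assigns them exactly zero weight.
masked_scores = scores.masked_fill(mask == 0, float("-inf"))
weights = F.softmax(masked_scores, dim=-1)

print(mask)
print(weights)
```

Row 0 puts all its weight on the first token, row 1 splits it evenly over the first two tokens, and so on: no position receives any weight from tokens to its right.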

We can now look back at the attention figure from the paper. Hopefully, you are now able to understand also the right side of the figure.

![image](https://miro.medium.com/v2/resize:fit:1270/1*LpDpZojgoKTPBBt8wdC4nQ.png)

In [15]:
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):

    def __init__(self, config):
        super().__init__()
        
        # Initialize layers and parameters
        self.hidden_size = config.n_embd
        self.num_heads = config.n_head

        # Create the linear projections for query, key, and value tensors
        # Note: the input and output size of all these projections is n_embd
        self.query_proj = nn.Linear(config.n_embd, config.n_embd)
        self.key_proj = nn.Linear(config.n_embd, config.n_embd)
        self.value_proj = nn.Linear(config.n_embd, config.n_embd)

        self.output_proj = nn.Linear(config.n_embd, config.n_embd)

        self.attn_dropout = nn.Dropout(config.attn_pdrop)
        self.resid_dropout = nn.Dropout(config.resid_pdrop)

        self.register_buffer(
            "bias",
            torch.tril(torch.ones(config.block_size, config.block_size)).view(
                1, 1, config.block_size, config.block_size))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.size()

        # Create the projections for query, key, and value tensors
        # Note: In self-attention these are all over the same tensor x
        Q = self.query_proj(x)
        K = self.key_proj(x)
        V = self.value_proj(x)

        # Reshape and transpose tensors for multi-head computation.
        # We reshape the output from (B, T, C) to (B, T, num_heads, head_dim),
        # where head_dim = hidden_size // num_heads,
        # and transpose the result to (B, num_heads, T, head_dim)
        # so the multi-head computation can be implemented as a single matrix multiplication.
        head_dim = self.hidden_size // self.num_heads
        Q = Q.view(B, T, self.num_heads, head_dim).transpose(1, 2)
        K = K.view(B, T, self.num_heads, head_dim).transpose(1, 2)
        V = V.view(B, T, self.num_heads, head_dim).transpose(1, 2)

        # Compute attention scores. The shape of scores should be (B, num_heads, T, T)
        # Hint: You can use tensor.transpose() to adapt the order of the axes.
        scores = torch.matmul(Q, K.transpose(-2, -1))

        # Normalize the scores by dividing by the square root of the per-head dimension
        # Take into account that you are using multi-head attention!
        scores = scores / math.sqrt(head_dim)

        # Apply causal mask to restrict attention to the left in the input sequence
        mask = self.bias[:, :, :T, :T]
        scores = scores.masked_fill(mask == 0, float('-inf'))

        # Apply softmax activation to get attention weights
        # Check the correct axis for the softmax function! What should be the shape of the weights?
        weights = F.softmax(scores, dim=-1)

        # Apply dropout to the attention weights
        weights = self.attn_dropout(weights)

        # Multiply attention weights with values to get attended values
        attended_values = torch.matmul(weights, V)

        # Transpose and reshape attended values to restore original shape
        attended_values = attended_values.transpose(1, 2).contiguous().view(B, T, C)

        # Apply output projection and dropout
        output = self.resid_dropout(self.output_proj(attended_values))

        return output

### Exercise 2.3: Questions [Optional]
<details>
<summary>What is the purpose of applying a causal mask in the attention computation?</summary>
The causal mask ensures that, in the attention computation, each position in the sequence can only attend to itself and to the positions on its left, preventing information leakage from future positions. This is essential in tasks where the model should generate output sequentially, such as language generation or autoregressive tasks.
</details>

<details>
<summary>How does the number of attention heads affect the model's capacity to capture different types of dependencies in the input sequence?</summary>
Multiple heads allow the model to attend to different parts of the input sequence simultaneously. By increasing the number of attention heads, the model can capture more diverse dependencies and patterns in the data. Each head can focus on different aspects of the input, enabling the model to learn complex relationships and improve performance on tasks that require capturing multiple types of dependencies.
</details>


<details>
<summary>What is the purpose of the residual dropout and attention dropout in the CausalSelfAttention module?</summary>
The residual dropout and attention dropout are regularization techniques used to prevent overfitting and improve the generalization of the model. The residual dropout applies dropout to the output of the attention module, helping to regularize the model during training. The attention dropout applies dropout to the attention weights, which helps to reduce over-reliance on specific tokens and encourages the model to attend to a broader range of tokens in the sequence.
</details>



### Exercise 2.4: Visualize Attentions [Optional]

Now that we understand the basic mechanisms of attention, we can check the activated attention patterns in a pretrained BERT model (Devlin et al. 2018). Recall that BERT is an encoder-based transformer model which is based on a stack of self-attention blocks.

In [16]:
from transformers import BertTokenizer, BertModel
from bertviz import head_view

# Define a sample input text
text = "I will go for a run and will jump into a lake."

# Instantiate the BERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Tokenize the input text
tokens = tokenizer.tokenize(text)

# Convert tokens to token IDs
token_ids = tokenizer.convert_tokens_to_ids(tokens)

# Create attention mask
attention_mask = [1] * len(token_ids)

# Convert token IDs and attention mask to tensors
input_ids = torch.tensor([token_ids])
attention_mask = torch.tensor([attention_mask])

# Generate the transformer output
outputs = model(input_ids, attention_mask=attention_mask, output_attentions=True)

# Extract attentions and check the shape
outputs.attentions[0].shape

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


torch.Size([1, 12, 13, 13])

As you can see, we extracted the attention from the first layer. The first dimension is the batch, the second one is the number of heads used in the first layer, and the last two dimensions are the sequence length. Given that this is a self-attention block, the last two numbers are equal.

We can now use a method from the [bertviz library](https://github.com/jessevig/bertviz) and plot all the heads.

You'll see a dropdown menu that allows you to select a layer of the model (BERT-base has 12). You'll then see a color for every head used in that layer (BERT-base has 12 heads per layer). By default all heads are shown; click on a color to activate or deactivate that head. It can help to start by activating only one head and checking the relation learned by that self-attention head. By hovering over each word you can see the attention weights that link that word to all the others.

**Question** Do you notice any interesting (linguistic) pattern?

In [17]:
head_view(outputs.attentions, tokens=tokens)

<IPython.core.display.Javascript object>

## Exercise 3 ✅
Now that we know everything about attention, we can go ahead and train a GPT-based model, which relies heavily on attention.

We will now:
1. Create a GPT-2 model
2. Train this model on a small dataset.
3. Check the loss of our model 

This exercise is based on the following tutorial:
https://pytorch.org/tutorials/beginner/transformer_tutorial.html

### Exercise 3.1: Training a Weather Prediction Model using Autoregressive Transformer

In this exercise, we will work with a dummy weather dataset that consists of sequences of weather observations and corresponding states. The goal is to train a small model using an autoregressive transformer to predict the weather state based on the previous observations.

In [18]:
import torch
from torch.utils.data.dataloader import DataLoader
import numpy as np
import time

import random
random.seed(42)

from lxmls.transformers.utils import set_seed
from lxmls.transformers.bpe import BPETokenizer
from lxmls.transformers.model import GPT
from lxmls.transformers.trainer import Trainer
from lxmls.transformers.dataset import WeatherDataset

We start by initializing the dataset, which is responsible for providing the training data for our model. The dataset contains sequences of weather observations and their corresponding states. These sequences are converted into indices and concatenated to form the input and output sequences for the transformer model.

You can check in detail the dataset in `lxmls/transformers/dataset.py`. 

In [19]:
# Fixed probabilities, easier to learn
# This is just to create the sequence in the dataset
fixed_proba = {}
fixed_proba["initial"] = [.5,.3,.2]
fixed_proba["transition"] = [
    [.5,.5,0],
    [0,.5,.5],
    [.5,0,.5]
]
fixed_proba["emission"] = [
    [.5,0,.2,0,.3],
    [0,.5,.4,0,.1],
    [0,0,.1,.5,.4]
]
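The three tables above have the shape of a hidden Markov model: an initial state distribution, a state transition matrix, and a per-state emission matrix. How `WeatherDataset` actually draws its sequences is not shown in this notebook, so the sampler below is only a sketch under that assumption; `sample_sequence` is a hypothetical helper, and states and observations are kept as plain indices.

```python
import random

initial = [.5, .3, .2]
transition = [[.5, .5, 0], [0, .5, .5], [.5, 0, .5]]
emission = [[.5, 0, .2, 0, .3], [0, .5, .4, 0, .1], [0, 0, .1, .5, .4]]

def sample_sequence(length, rng):
    """Sample hidden states and observations from the three probability tables."""
    states, observations = [], []
    state = rng.choices(range(3), weights=initial)[0]
    for _ in range(length):
        states.append(state)
        # Emit an observation conditioned on the current state.
        observations.append(rng.choices(range(5), weights=emission[state])[0])
        # Move to the next state according to the transition row of the current state.
        state = rng.choices(range(3), weights=transition[state])[0]
    return states, observations

rng = random.Random(42)
states, observations = sample_sequence(6, rng)
print("states:      ", states)
print("observations:", observations)
```

The zeros in the tables make some moves impossible (for example, state 0 is never followed by state 2), which is part of what makes the sequences easy for a small model to learn.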

In [20]:
# print an example instance of the dataset
train_dataset = WeatherDataset('train', proba=fixed_proba)
test_dataset = WeatherDataset('test', proba=train_dataset.proba)
x, y = train_dataset[0]

print("Sampling from the dataset:")
print(f"Input: {train_dataset.decode_obs(x.tolist()[:6])}")
print(f"Labels: {train_dataset.decode_st(y.tolist()[5:])}")
print("-"*50)
print("Tokenized sequences:")
print(f"Input: {x.tolist()}")
print(f"Labels: {y.tolist()}")

Sampling from the dataset:
Input: ['read', 'walk', 'walk', 'walk', 'tennis', 'shop']
Labels: ['snowy', 'sunny', 'sunny', 'sunny', 'sunny', 'sunny']
--------------------------------------------------
Tokenized sequences:
Input: [1, 4, 4, 4, 3, 2, 6, 7, 7, 7, 7]
Labels: [-1, -1, -1, -1, -1, 6, 7, 7, 7, 7, 7]


Next, we create a model using the default configuration for the GPT model. This configuration includes parameters which determine the size and structure of the model. The GPT model is a small version called GPT Nano.

In [21]:
# create a GPT instance
model_config = GPT.get_default_config()
model_config.model_type = 'gpt-nano'
model_config.vocab_size = train_dataset.get_vocab_size()
model_config.block_size = train_dataset.get_block_size()
model = GPT(model_config)

print(model_config)

number of parameters: 0.09M
model_type: gpt-nano
n_layer: 3
n_head: 3
n_embd: 48
vocab_size: 8
block_size: 11
embd_pdrop: 0.1
resid_pdrop: 0.1
attn_pdrop: 0.1
pretrained: False



To train our model, we create a Trainer object. The Trainer handles the training process, including defining the learning rate, setting the maximum number of iterations, and specifying the number of workers for data loading. The Trainer is initialized with the training configuration, the model, and the training dataset.

In [22]:
# create a Trainer object
train_config = Trainer.get_default_config()
train_config.learning_rate = 5e-4 # the model we're using is so small that we can go a bit faster
train_config.max_iters = 2000
train_config.num_workers = 0
train_config.device = "mps"
trainer = Trainer(train_config, model, train_dataset)

print(train_config)

running on device mps
device: mps
num_workers: 0
max_iters: 2000
batch_size: 64
learning_rate: 0.0005
betas: (0.9, 0.95)
weight_decay: 0.1
grad_norm_clip: 1.0



With these components in place, we are ready to train our model on the weather dataset and make predictions based on the learned patterns. We just add a minor utility function that shows us intermediate logs. You can safely ignore it, since most of this is usually abstracted away from end users in modern deep learning libraries.

In [23]:
def batch_end_callback(trainer):
    if trainer.iter_num % 100 == 0:
        print(f"iter_dt {trainer.iter_dt * 1000:.2f}ms; iter {trainer.iter_num}: train loss {trainer.loss.item():.5f}")
trainer.set_callback('on_batch_end', batch_end_callback)

start_time = time.time()
trainer.run()
end_time = time.time()
elapsed_time = end_time - start_time

# Print the training time
print("Training time: {:.2f} seconds".format(elapsed_time))

iter_dt 0.00ms; iter 0: train loss 2.12462
iter_dt 74.95ms; iter 100: train loss 0.68922
iter_dt 71.30ms; iter 200: train loss 0.41048
iter_dt 72.52ms; iter 300: train loss 0.27799
iter_dt 70.22ms; iter 400: train loss 0.34290
iter_dt 72.17ms; iter 500: train loss 0.29112
iter_dt 70.00ms; iter 600: train loss 0.32617
iter_dt 70.19ms; iter 700: train loss 0.29911
iter_dt 80.44ms; iter 800: train loss 0.28581
iter_dt 72.55ms; iter 900: train loss 0.28469
iter_dt 71.64ms; iter 1000: train loss 0.24523
iter_dt 71.61ms; iter 1100: train loss 0.26700
iter_dt 70.17ms; iter 1200: train loss 0.25450
iter_dt 69.91ms; iter 1300: train loss 0.31443
iter_dt 75.90ms; iter 1400: train loss 0.27875
iter_dt 72.02ms; iter 1500: train loss 0.28641
iter_dt 69.55ms; iter 1600: train loss 0.23928
iter_dt 70.10ms; iter 1700: train loss 0.30623
iter_dt 72.61ms; iter 1800: train loss 0.29678
iter_dt 70.17ms; iter 1900: train loss 0.24689


As you can see, the loss drops quickly during the first few hundred iterations and then fluctuates between roughly 0.24 and 0.34.
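
Because each logged value is the loss of a single batch, individual numbers are noisy. A quick way to see the plateau is to summarize or smooth the logged values. The snippet below is a small sketch that copies the losses from the log above; the window size of 5 and the choice to treat iteration 300 as the start of the plateau are arbitrary choices made for this illustration.

```python
from statistics import mean

# Training losses copied from the log above, one value every 100 iterations.
logged_losses = [
    2.12462, 0.68922, 0.41048, 0.27799, 0.34290,
    0.29112, 0.32617, 0.29911, 0.28581, 0.28469,
    0.24523, 0.26700, 0.25450, 0.31443, 0.27875,
    0.28641, 0.23928, 0.30623, 0.29678, 0.24689,
]


def moving_average(values, window):
    """Average each value with the window - 1 values that precede it."""
    return [
        mean(values[i - window + 1 : i + 1])
        for i in range(window - 1, len(values))
    ]


plateau = logged_losses[3:]  # iteration 300 onwards
print(f"plateau range: {min(plateau):.3f} to {max(plateau):.3f}")
print(f"plateau mean:  {mean(plateau):.3f}")
print("smoothed:", [round(v, 3) for v in moving_average(logged_losses, 5)])
```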

Great! You have just **trained a small GPT model**! Congrats!
Generating from such a tiny model, trained for only a small number of iterations, won't give us interesting output. Instead, let's rely on one of the many larger and more powerful pretrained models that are publicly available.

### Exercise 3.2: Prompting a pretrained GPT-2 model

We can load a pretrained GPT-2 model from Hugging Face (the GPT class does this behind the scenes) and prompt it with any text of our choice.

In [24]:
model_type = "gpt2"
device = "mps"
model = GPT.from_pretrained(model_type)

# move the model to the device (GPU if available)
# set eval mode to disable dropout during generation
model.to(device)
model.eval()

number of parameters: 124.44M


GPT(
  (transformer): ModuleDict(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): PretrainedCausalSelfAttention(
          (c_attn): Linear(in_features=768, out_features=2304, bias=True)
          (c_proj): Linear(in_features=768, out_features=768, bias=True)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): ModuleDict(
          (c_fc): Linear(in_features=768, out_features=3072, bias=True)
          (c_proj): Linear(in_features=3072, out_features=768, bias=True)
          (act): NewGELU()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head

Great! We can now generate from our pretrained model. We just need to pass a context and let the model continue it.
Besides the context, there are two additional parameters: `tokens` is the number of tokens we want the model to generate, and `num_samples` is the number of different samples we ask the model to produce. Since we are sampling from the model distribution, we can draw as many samples as we want, hence the parameter.
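
To make the difference between sampling and deterministic decoding concrete, here is a toy sketch that does not use the model at all. The four-word vocabulary and the logits are made up for this illustration, and `greedy`, `sample` and `top_k_filter` are hypothetical helpers, not functions of the toolkit's `GPT` class.

```python
import math
import random


def softmax(logits):
    """Turn logits into probabilities (shifted by the max for stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_filter(logits, k):
    """Keep the k largest logits and set the rest to -inf (probability 0).

    Ties at the threshold are all kept, so more than k may survive.
    """
    threshold = sorted(logits, reverse=True)[k - 1]
    return [x if x >= threshold else float("-inf") for x in logits]


def greedy(vocab, logits):
    """Deterministic choice: always return the most likely token."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return vocab[best]


def sample(vocab, logits, rng, k=None):
    """Random choice: draw one token from the (optionally top-k) distribution."""
    if k is not None:
        logits = top_k_filter(logits, k)
    return rng.choices(vocab, weights=softmax(logits), k=1)[0]


# Made-up next-token candidates and scores for some context.
vocab = ["president", "former", "first", "senator"]
logits = [2.0, 1.5, 1.0, 0.2]

rng = random.Random(42)  # fixing the seed makes the random draws repeatable
print("greedy :", [greedy(vocab, logits) for _ in range(3)])
print("sampled:", [sample(vocab, logits, rng, k=3) for _ in range(3)])
```

Greedy decoding returns the same token every time, while sampling can return a different token on each draw, which is why asking for several samples is meaningful. Resetting the seed before sampling reproduces the same draws.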

Feel free to change the context and have fun _generating text from a pretrained **GPT2**_!

In [25]:
# Random prompt, uses sampling (the seed is reset before each call)
for i in range(5): 
    set_seed(42)
    model.prompt("Barack Obama, the", 50, 3)

# Deterministic prompt, does NOT use sampling
for i in range(5):
    model.prompt_topK("Barack Obama, the", 50, 3)

--------------------------------------------------------------------------------
Barack Obama, the second-term Obama, took on a lot of voters on the issue, and he is probably making an impression there. Obama won 60 percent of the black electorate in 2008, according to a Pew survey, and was well ahead of his rivals in a
--------------------------------------------------------------------------------
Barack Obama, the head of the White House's press secretary, told AFP that "our goal is to make sure journalists work with local authorities and continue to present the facts."

But the president also noted that he had appointed a "global warming denier" to
--------------------------------------------------------------------------------
Barack Obama, the incoming governor of Massachusetts, said Monday that the country should be turning over a single black person without a federal form of identification.

"The fact that it is African Americans, even though of Indian descent, and despite the 