The goal of this notebook is to implement part of the famous paper "Attention Is All You Need". We will try to create a text completion generator trained on a part of the dialogue of the Naruto series (Naruto and Nagato/Pain).

In [1]:
import torch.nn as nn 
from torch.nn import functional as F
import torch 

In [2]:
seed = 1337
device = "cuda" if torch.cuda.is_available() else "cpu"
batch_size = 64
learning_rate = 3e-4
n_steps = 5000
eval_iter = 200
eval_inter = 500
n_embeds = 384
block_size = 56
n_heads = 6
n_layer = 6
head_size = 16
dropout = 0.2
torch.manual_seed(seed)

<torch._C.Generator at 0x201856dde30>

***First step: Getting the data***

Let's first read the text file and save it into a variable 

In [3]:
input = "NarutoPain.txt"
with open(input, "r", encoding="utf-8") as file:
    text = file.read()

***Second Step: Tokenization***

There are multiple ways to chose tokens. Let's take the simplest way and assume each character is a token.
Now, let's take a sorted list of the characters used and compute our vocabulary size.
Then, let's code the encoding and decoding functions that assigns numerical value to each characters.

In [4]:
list_chars = sorted(list(set(text)))
vocabulary_size = len(list_chars)
print(f"The vocabulary sizer is {vocabulary_size}")

enc_dic ={ch : i for i,ch in enumerate(list_chars)}
deco_dic = {i : ch for i,ch in enumerate(list_chars)}
encode = lambda s: [enc_dic[c] for c in s]
decode = lambda int_l: "".join([deco_dic[i] for i in int_l])

The vocabulary sizer is 59


Let's transform our textual data (text var) to a more suitable format, i.e: a tensor. And then split it between training and validation sets.

In [7]:
data = torch.tensor(encode(text), dtype=torch.long)

train_perc = 0.9 #Percentage of data dedicated to training
train_size = int(train_perc * len(data))
training_data = data[ : train_size]
val_data = data[train_size : ]


Now that our data is ready, we need to code our model. The picture below shows the general architecture of the Tra,sformer presented in the paper. Let's inspect it block by block.

![Model architecture](modelArchi.png)

As shown in the picture, the Transformer is composed of two stacks, the Encoder (The stack in the left) and the Decoder (stack in the right). 

Encoder stack :  Maps the input sequence (x1, ..., xn) to a continuous representation (z1, ..., zn), where the input sequence is : inputs( from data) + postional encoding vector. The positional encoding vector is a vector that represents the relative position of the tokens. The positional encoding is mandatory to make use of the order of the sequence since there is no recurence in the encoder. 

Decoder stack : Maps the predicted sentence by the encoder (z1, ..., zn) to a new sequence (y1, ..., yn) predicting one sequence at a time, while having the output at time n-1 be part of the inputs for the prediction of the token yn. Example: to generate yn we have as an input (z1, ..., zm, y1, ..., yn-1 )

In this notebook, we will only code the encoder stack. The encoder is composed of N blocks, each of them having Multi-head attention layer, and a feed forward layer.

***What is Attention ?***

![attention layer](AttentionLayer.png)

The picture above shows the architecture of the attention mechanism, following the equation : 
$Attention(Q, K, V) = softmax(\frac{QK^T}{\sqrt{d_{k}}})V$,
where Q, K are vectors of dimension $d_{k}$ and V a vector of dimension $d_{v}$ .


First, let's try and understand the intuition behind this mathematical equation. 
Suppose we have a sequence X of legth T, for example X = Pain (T = 5) that we model by $X_{encoded}= (845, 25, 13, 10, 0)$, to predict the next character(token), my model needs context on the entire sequence.
The simplest way to get this context would be to average all the tokens, and we get : $X_{avg} = (178.6, 178.6, 178.6, 178.6)$.
There are several problems with this, but the biggest one would be that the 4th token knows what the fifth token. If I want my model to predict the word "pain", he can't have the information that after the letter "a" commes an "n" as an input, so to remedy that, we only average each token with the past and we get 
$X_{avg} = (845, 435, 294.3, 220.75, 178.6)$ 

Let's code it bit by bit to better show its working :

In [29]:
import torch
seed = 1337
torch.manual_seed(seed) # For repetability
vector_size = 5
x_encoded = [845, 25, 13, 10, 0]
q = torch.tensor(x_encoded, dtype = torch.float)
x_avg = torch.zeros(vector_size)
for i in range(vector_size):
    x_avg[i] = q[:i+1].mean()


print(f"X_encoded is {q} and X_avg is : {x_avg}")

X_encoded is tensor([845.,  25.,  13.,  10.,   0.]) and X_avg is : tensor([845.0000, 435.0000, 294.3333, 223.2500, 178.6000])


To make this opperation more eficcient computationally, we use matrix multiplication.In order to do that, we multiply the sequence by a weight matrix such as: 
- All future tokens have a weight of 0.
- All the weights add to 1.
To do that we use to tricks:
- A triangular matrix (All the values above the diagonal are null), example this 3 x 3 matrix : 
$\begin{pmatrix}
1 & 0 & 0\\
2 & 3 & 0\\
4 & 5 & 6
\end{pmatrix}$
and by replacing the zeros with $-\infty$ since $softmax(-\infty) = 0$
- The softmax function to normalise the weights

In [39]:
from torch.nn import functional as F
weights = torch.tril(torch.ones(vector_size, vector_size))
weights = weights.masked_fill(weights == 0, float("-inf"))
weights = F.softmax(weights, dim=1)
x_avg_2 = weights @ q.transpose(-1, 0)
print(f"X_avg using softmax is{x_avg_2}")

X_avg using softmax istensor([845.0000, 435.0000, 294.3334, 223.2500, 178.6000])


We get the same but with a faster and more efficient methode. That's the secret behind the attention equation. The $\sqrt{d_K}$ is to avoid having too much descrepency between weights of tokens and loosing information.
Now, all we want is to have the weights for the operation be a function of the inputs, so we get $Q = f_1(inputs)$, $K = f_2(inputs)$, $weights = softmax(\frac{Trig(QK^T)}{\sqrt{d_K}})$ and $outputs = weights . f_3(inputs)$

So, let's code our attention block :

In [49]:
import torch.nn as nn
n_embeds = 384
head_size = 16
dropout = 0.2

class AttentionHead(nn.Module):
    def __init__(self, head_size) :
        super().__init__()
        self.key = nn.Linear(n_embeds, head_size, bias=False)
        self.query = nn.Linear(n_embeds, head_size, bias=False)
        self.value = nn.Linear(n_embeds, head_size, bias=False)
        self.dropout = nn.Dropout(dropout)
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        k = self.key(x)
        q = self.query(x)
        #Compute initial weights or "affinities"
        wei = q @ k.transpose(-2, -1) * (C ** -0.5)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float("-inf"))
        wei = F.softmax(wei, dim=-1)
        wei = self.dropout(wei)
        v = self.value(x)
        out = wei @ v
        return out 
    

Now, in the encoder we use multiheaded attention, which is simply the concatenation of multiple attention heads. So let's code it. (We also add a dropout layer to avoid overfitting)

In [50]:
n_heads = 6
class MultiHead(nn.Module):
    def __init__(self, num_heads, head_size):
        super().__init__()
        self.head = nn.ModuleList([AttentionHead(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(n_embeds, n_embeds)
        self.dropout = nn.Dropout(dropout)
    def forward(self, x):
        y = torch.cat([h(x) for h in self.head], dim=-1)
        y = self.proj(y)
        y = self.dropout(y)
        return y

Let's code the FeedForward layer and wrap this all up into a single block while also adding layer normalisation

In [42]:
class FeedForward(nn.Module):
    def __init__(self, n_embeds):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embeds, 4*n_embeds),
            nn.ReLU(),
            nn.Linear(4*n_embeds, n_embeds),
            nn.Dropout(dropout),
            )
        
    def forward(self, x):
        return self.net(x)

In [43]:
class Block(nn.Module):
    def __init__(self, n_heads, n_embeds):
        super().__init__()
        head_size = n_embeds // n_heads
        self.mheads = MultiHead(n_heads, head_size)
        self.ffwd = FeedForward(n_embeds)
        self.ln1 = nn.LayerNorm(n_embeds)
        self.ln2 = nn.LayerNorm(n_embeds)
    def forward(self, x):
        x_sa = self.mheads(self.ln1(x)) + x
        x_ffwd = self.ffwd(self.ln2(x_sa)) + x_sa
        return x_ffwd


Now let's code our Transformer. We just need to add embedding layers (one for token embedding and the second for positional embedding) and a final linear layer

In [72]:
block_size = 56
class TransformerModel(nn.Module):

    def __init__(self):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocabulary_size, n_embeds)
        self.positional_emb = nn.Embedding(block_size, n_embeds)
        self.block = nn.Sequential(*[Block(n_heads, n_embeds) for _ in range(n_layer)])
        self.lnorm = nn.LayerNorm(n_embeds)
        self.linear = nn.Linear(n_embeds, vocabulary_size)
       

    def forward(self, x, targets=None):
        _, t = x.shape 
        x_t = torch.arange(t, device=device)
        embedded_token = self.token_embedding_table(x) # shape B,T,C: n_embeds
        embedded_pos = self.positional_emb(x_t)
        x_pos = embedded_token + embedded_pos
        x_pos = self.block(x_pos)
        x_pos = self.lnorm(x_pos)
        logits = self.linear(x_pos)  # shape B, T, vocab_size

        if targets is None:
            loss = None
        else:  # returns the loss
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)
        return logits, loss 
    
    def generate(self, x, max_new_tokens):
        x_cond = x[-block_size:]
        for _ in range(max_new_tokens):
            logits, _ = self(x_cond)
            logits = logits[:, -1, :] #Last elem from time dimension 
            probs = F.softmax(logits, dim=1)
            x_next = torch.multinomial(probs, num_samples=1) # Chose the token from a multinomal distribution with probability law of probs
            x = torch.cat((x, x_next), dim=1)
        return x 
        

Now let's code our training function to tran the model .

In [85]:
#Batch the data 
eval_iter = 200
eval_inter = 50
device = "cuda" if torch.cuda.is_available() else "cpu"

def get_batch(split):
    data = training_data if split == "train" else val_data
    idx = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in idx])
    y = torch.stack([data[i+1:i+block_size+1] for i in idx])
    x, y = x.to(device), y.to(device) # To run the code on GPU
    return x,y

#Loss estimation function 
def estimate_loss(model):
    out = {}
    model.eval()
    for split in ["train", "val"]:
        losses = torch.zeros(eval_iter)
        for k in range(eval_iter):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item() 
        out[split] = losses.mean()
    model.train()
    return out

def learning_rate_scheduler(num_step, lr):
    if num_step < 50:
        return lr
    else : 
        return 3e-4
    

def train(model, num_steps):
    history = {"train_losses" : [], "val_losses" : []}
    
    for step in range(n_steps):
    #evaluate loss on val too
        if step % eval_inter == 0:
            losses = estimate_loss(model)
            history["train_losses"].append(losses["train"])
            history["val_losses"].append(losses["val"])
            print(f" for step {step} training loss is {losses["train"] : .4f} validation loss is {losses["val"] : .4f}")

    #sample data 
        xb, yb = get_batch("train")
        optim = torch.optim.AdamW(model.parameters(), lr=learning_rate_scheduler(step, 1e-3))

    #evaluate loss
        logits, loss = model(xb, yb)
        optim.zero_grad(set_to_none=True)
        loss.backward()
        optim.step()
    return history 

Now let's initialise and train our model

In [87]:

naruto = TransformerModel().to(device)

n_steps = 250

history = train(naruto, num_steps=n_steps)

 for step 0 training loss is  4.2643 validation loss is  4.2657
 for step 50 training loss is  2.3151 validation loss is  2.6014
 for step 100 training loss is  1.9899 validation loss is  2.3782
 for step 150 training loss is  1.8084 validation loss is  2.3568
 for step 200 training loss is  1.6092 validation loss is  2.3118


In [88]:
#save the model 
torch.save(naruto.state_dict, "NaruotVsPain.pt")

In [89]:
#x_gen = torch.zeros((1, 1), dtype=torch.long, device=device)
x_gen_text = "Name"
x_gen = torch.tensor(encode(x_gen_text), dtype=torch.long, device=device)
x_gen = x_gen.view(1, x_gen.shape[0])
response = naruto.generate(x_gen, 100)
result = decode(response[0].tolist())
print(result)


Namentn t d tsgtn  n   tdd  tttd xtt,ttttr .c ttttt!o! rsrrdt dnrt ottt  tt rtde,tnst! t tdtdtdt ptn    
