# Transformers in PyTorch

This notebook explains how to code a transformer in PyTorch. We use the original [Attention Is All You Need](https://arxiv.org/abs/1706.03762) paper and this [video](https://www.youtube.com/watch?v=ISNdQcPhsts) as references. The transformer we build is deliberately simple; it translates sentences from English to Spanish.

First we build the transformer component by component, and then we work on the training loop and inference.

## The Transformer

To build the transformer, we first have to build its inner components. The base components are:
- Input Embeddings
- Positional Encoding
- Layer Normalization
- Feed Forward Block
- Multi Head Attention Block

On top of these we build the encoder and the decoder, each composed of a stack of encoder or decoder blocks, and finally a projection layer.

![Transformer Architecture](assets/transformer-network.png)


In [1]:
import torch
import torch.nn as nn
import math

### Input Embeddings
This layer assigns a vector to each token of the input sequence. These vectors are learned during training and represent the "meaning" of the token (or word).

PyTorch already provides an `nn.Module` with this functionality (`nn.Embedding`), but we wrap it in our own module, which also applies the $\sqrt{d_{model}}$ scaling from the paper.

In [2]:
class InputEmbedding(nn.Module):

    def __init__(self, d_model: int, vocab_size: int) -> None:
        super().__init__()
        self.d_model = d_model
        self.vocab_size = vocab_size

        self.embedding = nn.Embedding(vocab_size, d_model)  # learned lookup table: (vocab_size, d_model)

    def forward(self, x):
        # (batch, seq_len) --> (batch, seq_len, d_model)
        return self.embedding(x) * math.sqrt(self.d_model)  # scaled by sqrt(d_model), as specified in the original paper
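
To make the lookup-and-scale step concrete, here is a minimal plain-Python sketch of what the forward pass computes. The table values and the names `toy_table` and `embed` are made up for illustration; in the real module the table is the learned weight matrix of `nn.Embedding`.

```python
import math

d_model = 4  # toy embedding size (the paper's base model uses 512)

# Hypothetical embedding table: one row of d_model numbers per token id.
toy_table = [
    [0.1, 0.2, 0.3, 0.4],  # token id 0
    [0.5, 0.6, 0.7, 0.8],  # token id 1
    [0.9, 1.0, 1.1, 1.2],  # token id 2
]

def embed(token_ids):
    """Look up each token id and scale its vector by sqrt(d_model)."""
    scale = math.sqrt(d_model)
    return [[value * scale for value in toy_table[t]] for t in token_ids]

vectors = embed([2, 0, 2])
print(len(vectors), len(vectors[0]))  # prints: 3 4  (seq_len, d_model)
```

The same token id always maps to the same vector, so the first and last rows of `vectors` are identical.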

### Positional Encoding

The positional encoding adds a vector to each embedding in order to encode the position of the token in the sentence (e.g. first, second, ...). There are many ways to achieve this, but here we use the vectors proposed in the original paper, calculated with the following functions:

![Positional Encodings Functions](assets/positional-encoding-functions.png)

Where $pos$ is the position of the token in the sentence and $i$ indexes pairs of embedding dimensions: dimension $2i$ uses the sine and dimension $2i+1$ uses the cosine.
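
In case the image does not render, these are the two functions from the paper written out in LaTeX:

$$
PE_{(pos,\,2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)
$$

The factor that multiplies $pos$ can be rewritten with an exponential and a logarithm, which is a common way to compute it in code:

$$
\frac{1}{10000^{2i/d_{model}}} = \exp\left(-\frac{2i \, \ln 10000}{d_{model}}\right)
$$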

In [3]:
class PositionalEncoding(nn.Module):

    def __init__(self, d_model: int, seq_len: int, dropout: float) -> None:
        super().__init__()
        self.d_model = d_model
        self.seq_len = seq_len
        self.dropout = nn.Dropout(dropout)  # must be a module, since forward() calls it

        pe = torch.zeros(seq_len, d_model)  # (seq_len, d_model)

        position = torch.arange(0, seq_len, dtype=torch.float).unsqueeze(1)  # (seq_len, 1)
        # 1 / 10000^(2i / d_model), computed in log space for numerical stability
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))  # (d_model / 2)

        pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions (assumes d_model is even)

        pe = pe.unsqueeze(0)  # (1, seq_len, d_model)

        self.register_buffer("pe", pe)  # saved in the state dict, but not a trainable parameter

    def forward(self, x):
        # x: (batch, seq_len, d_model); slice the table to the actual sequence length
        x = x + self.pe[:, :x.shape[1], :].requires_grad_(False)
        return self.dropout(x)
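
As a sanity check of the formulas, here is a minimal plain-Python sketch (no PyTorch) that builds the same kind of table directly from the sine and cosine definitions in the paper. The function name `positional_encoding` and the toy sizes are assumptions made for this example.

```python
import math

def positional_encoding(seq_len, d_model):
    """Return a seq_len x d_model table of sinusoidal encodings (d_model must be even)."""
    table = []
    for pos in range(seq_len):
        row = []
        for two_i in range(0, d_model, 2):
            # exp(-2i * ln(10000) / d_model) equals 1 / 10000^(2i / d_model)
            div_term = math.exp(two_i * (-math.log(10000.0) / d_model))
            row.append(math.sin(pos * div_term))  # even dimension 2i
            row.append(math.cos(pos * div_term))  # odd dimension 2i + 1
        table.append(row)
    return table

pe = positional_encoding(seq_len=3, d_model=4)
print(pe[0])  # position 0: sin(0) = 0 and cos(0) = 1 in every pair
```

Every later position gets a different mix of sines and cosines, with the higher dimensions changing more slowly than the lower ones.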

### Layer Normalization

This component normalizes each input (each vector corresponding to a token) so its values have mean 0 and variance 1. Then it scales the values with a parameter $\alpha$ and shifts them with a parameter $\beta$.

The purpose of this block is to stabilize and accelerate training, since the inputs to the next block stay within a predictable range.

In [4]:
class LayerNormalization(nn.Module):

    def __init__(self, eps: float = 1e-6) -> None:
        super().__init__()
        self.eps = eps  # avoids division by zero when std is close to 0

        self.alpha = nn.Parameter(torch.ones(1))  # multiplicative scale, starts at 1
        self.beta = nn.Parameter(torch.zeros(1))  # additive shift, starts at 0

    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)  # (batch, seq_len, 1)
        std = x.std(dim=-1, keepdim=True)  # (batch, seq_len, 1)
        return self.alpha * (x - mean) / (std + self.eps) + self.beta
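
A small worked example can help check the arithmetic. The sketch below is plain Python (no PyTorch) and follows the same steps for a single token vector, assuming $\alpha = 1$ and $\beta = 0$; the function name `layer_norm` and the input values are made up for illustration. It uses `statistics.stdev`, the sample standard deviation (dividing by $n - 1$), which is also what `torch.Tensor.std` computes by default.

```python
import statistics

def layer_norm(x, alpha=1.0, beta=0.0, eps=1e-6):
    """Normalize one token vector, then scale by alpha and shift by beta."""
    mean = statistics.mean(x)
    std = statistics.stdev(x)  # sample standard deviation (divides by n - 1)
    return [alpha * (v - mean) / (std + eps) + beta for v in x]

x = [1.0, 2.0, 3.0, 4.0]
y = layer_norm(x)
print([round(v, 3) for v in y])  # [-1.162, -0.387, 0.387, 1.162]
```

The output is centered on zero and has a standard deviation very close to 1; it is not exactly 1 because of the `eps` term in the denominator.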

### Feed Forward Block

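This block is a simple fully connected network with two linear layers, with a ReLU activation and dropout in between. It is applied to each position of the sequence independently.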
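
For reference, the paper writes this block as the following formula (dropout is not part of it):

$$
\mathrm{FFN}(x) = \max(0,\; xW_1 + b_1)\,W_2 + b_2
$$

Here $W_1$ has shape $(d_{model}, d_{ff})$ and $W_2$ has shape $(d_{ff}, d_{model})$, so the output has the same size as the input. The paper's base model uses $d_{model} = 512$ and $d_{ff} = 2048$.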

In [5]:
class FeedForwardBlock(nn.Module):

    def __init__(self, d_model: int, d_ff: int, dropout: float) -> None:
        super().__init__()
        self.linear_1 = nn.Linear(d_model, d_ff)
        self.dropout = nn.Dropout(dropout)
        self.linear_2 = nn.Linear(d_ff, d_model)

    def forward(self, x):
        # (batch, seq_len, d_model) --> (batch, seq_len, d_ff) --> (batch, seq_len, d_model)
        return self.linear_2(self.dropout(torch.relu(self.linear_1(x))))
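
As a rough size check, the sketch below counts the learnable parameters of this block in plain Python. It assumes both linear layers keep their bias term (the `nn.Linear` default) and uses the sizes of the paper's base model; the helper name `linear_params` is made up for this example.

```python
def linear_params(in_features, out_features):
    """Parameters of one linear layer with bias: a weight matrix plus a bias vector."""
    return in_features * out_features + out_features

d_model, d_ff = 512, 2048  # sizes of the paper's base model

total = linear_params(d_model, d_ff) + linear_params(d_ff, d_model)
print(total)  # 2099712
```

With these sizes, a single feed-forward block holds roughly 2.1 million parameters.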