# Transformer Encoder from scratch

> This project aims to build a transformer encoder architecture as described in the Transformer model paper. We will use the following diagram to design the architecture:

<style>
figure {
    display: block;
    margin-left: auto;
    margin-right: auto;
    width: 50%;
}
figcaption {
    text-align: center;
}
</style>
<figure>     
    <img src="images/The-Transformer-encoder-diagram.jpg" >
    <figcaption> Fig. 1: The Transformer encoder architecture. </figcaption>
</figure>


In [None]:
import torch
import math
from torch import nn
from torch.nn import functional as F

## Single-head Attention
> The key concept we want the transformer to capture is attention. Attention allows each position in a sequence to attend to other positions (in the encoder, positions of the same input sequence) when building its representation. For large language models (LLMs), this mechanism is crucial for effective performance on language tasks such as sentiment analysis and text summarisation.

> We will start by building single-headed attention using the scaled dot product and then later utilise this for multi-headed attention:

<figure>     
    <img src="images/single-head-attention.jpeg" >
    <figcaption> Fig. 2: Single-head attention architecture. </figcaption>
</figure>
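
> In formula form, scaled dot-product attention can be written as below. The symbols are assumptions based on the usual notation rather than taken from the figure: $Q$, $K$, $V$ are the query, key and value matrices, $d_k$ is the size of the last dimension of the queries and keys, and $M$ is an optional additive mask (zero where attention is allowed).

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}} + M\right) V
$$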


In [None]:
def scaled_dot_product_attention(q, k, v, mask=None):
    """
    Compute scaled dot-product attention for a single head.

    q, k, v must have matching leading dimensions.
    k, v must have matching penultimate dimension, i.e. seq_len_k = seq_len_v.
    The mask has a different shape depending on its type (padding or look-ahead),
    but it must be broadcastable to the score tensor, to which it is added.
    Returns the attended values and the attention weights.
    """
    d_k = q.size(-1)
    # Similarity scores between every query and every key, scaled by sqrt(d_k)
    scaled = torch.matmul(q, k.transpose(-1, -2)) / math.sqrt(d_k)
    if mask is not None:
        # Additive mask: 0 keeps a position, a large negative value (e.g. -inf) hides it
        scaled = scaled + mask
    # Normalise the scores over the key dimension so each row sums to 1
    attention = F.softmax(scaled, dim=-1)
    values = torch.matmul(attention, v)
    return values, attention
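
> As a quick sanity check, the sketch below calls the function on random tensors with an additive padding mask. The tensor sizes and the choice to treat the last key as padding are illustrative assumptions, not part of the original notebook. The function is repeated so that the cell runs on its own.

```python
import math
import torch
from torch.nn import functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    # softmax(q k^T / sqrt(d_k) + mask) v
    d_k = q.size(-1)
    scaled = torch.matmul(q, k.transpose(-1, -2)) / math.sqrt(d_k)
    if mask is not None:
        scaled = scaled + mask
    attention = F.softmax(scaled, dim=-1)
    return torch.matmul(attention, v), attention

torch.manual_seed(0)
batch, seq_len, d_k = 2, 4, 8  # illustrative sizes (assumption)
q = torch.randn(batch, seq_len, d_k)
k = torch.randn(batch, seq_len, d_k)
v = torch.randn(batch, seq_len, d_k)

# Additive padding mask: 0 keeps a key, -inf removes it.
# Here the last key position is treated as padding.
mask = torch.zeros(seq_len, seq_len)
mask[:, -1] = float("-inf")

values, attention = scaled_dot_product_attention(q, k, v, mask=mask)
print(values.shape)     # torch.Size([2, 4, 8])
print(attention.shape)  # torch.Size([2, 4, 4])
```

> Each row of `attention` sums to 1, and the column for the masked key is exactly 0, so the padded position contributes nothing to `values`.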

## Multi-head Attention

> Now we can build the multi-head attention functionality. The diagram below shows how to construct multi-head attention:

<figure>     
    <img src="images/Multi-head-attention.jpeg" >
    <figcaption> Fig. 3: Multi-head attention architecture. </figcaption>
</figure>

In [None]:
class MultiHeadAttention(nn.Module):
    
    def __init__(self, d_model, n_head):
        super().__init__()
        self.d_model = d_model
        self.n_head = n_head
        self.d_head = d_model // n_head
        self.qkv_layer = nn.Linear(d_model, 3 * d_model)
        self.linear_layer = nn.Linear(d_model, d_model)
        
    def forward(self, x, mask=None):
        batch_size, max_input_len, d_model = x.size()
        # Project to q, k, v in one pass, then split the features across heads:
        # (batch, seq, 3 * d_model) -> (batch, n_head, seq, 3 * d_head)
        qkv = self.qkv_layer(x).reshape(batch_size, max_input_len, self.n_head, 3 * self.d_head).permute(0, 2, 1, 3)
        # Split the last dimension into 3 parts; with the parameters used later each is 30x8x200x64
        q, k, v = qkv.chunk(3, dim=-1)
        values, attention = scaled_dot_product_attention(q, k, v, mask=mask)
        # Move the head axis back next to d_head before merging the heads:
        # (batch, n_head, seq, d_head) -> (batch, seq, n_head * d_head)
        values = values.permute(0, 2, 1, 3).reshape(batch_size, max_input_len, self.n_head * self.d_head)
        out = self.linear_layer(values)
        return out
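
> The reshaping in `forward` is easy to get wrong, so here is a small shape trace on random data. It is a sketch with made-up sizes, and a random tensor stands in for the output of `qkv_layer`. One detail worth checking is that the head axis has to be moved back next to the per-head features before the heads are flattened together; otherwise the features of different tokens get mixed.

```python
import torch

torch.manual_seed(0)
batch_size, seq_len, n_head, d_head = 2, 5, 4, 3  # illustrative sizes (assumption)
d_model = n_head * d_head

# Stand-in for qkv_layer(x): (batch, seq, 3 * d_model)
qkv = torch.randn(batch_size, seq_len, 3 * d_model)
qkv = qkv.reshape(batch_size, seq_len, n_head, 3 * d_head).permute(0, 2, 1, 3)
q, k, v = qkv.chunk(3, dim=-1)
print(q.shape)  # torch.Size([2, 4, 5, 3]) = (batch, n_head, seq, d_head)

# Merging heads: permute back to (batch, seq, n_head, d_head), then flatten
merged = v.permute(0, 2, 1, 3).reshape(batch_size, seq_len, n_head * d_head)
# Flattening without the permute keeps the shape but scrambles tokens and heads
scrambled = v.reshape(batch_size, seq_len, n_head * d_head)

# For token 0, the merged vector should be the concatenation of every head's vector
expected_token0 = torch.cat([v[0, h, 0] for h in range(n_head)])
print(torch.equal(merged[0, 0], expected_token0))     # True
print(torch.equal(scrambled[0, 0], expected_token0))  # False
```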

## Layer Normalisation

> In neural networks, activation values can become very large in magnitude, which leads to large gradient steps during backpropagation and therefore unstable training. This can be avoided by normalising the activations of the hidden layers. There are multiple ways to do this, but we will opt for Z-score normalisation.

> Each activation is normalised using the layer mean and the layer standard deviation, with ε added to avoid division by zero, as summarised in the following formula:

<figure>     
    <img src="images/layer_norm.PNG" >
    <figcaption> Fig. 4: Layer Normalisation Formula </figcaption>
</figure>
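
> Written out, and assuming the formula in the figure matches the code that follows, the normalisation over the $d$ features of one position is shown below. Here $\gamma$ and $\beta$ are the learnable scale and shift parameters, and $\mu$ and $\sigma^2$ are the mean and (biased) variance of the features.

$$
\mu = \frac{1}{d}\sum_{i=1}^{d} x_i, \qquad
\sigma^2 = \frac{1}{d}\sum_{i=1}^{d} (x_i - \mu)^2, \qquad
y_i = \gamma_i \, \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta_i
$$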

In [None]:
class LayerNormalisation(torch.nn.Module):
    def __init__(self, parameters_shape, eps=1e-5):
        super().__init__()
        self.parameters_shape = parameters_shape
        self.eps = eps
        self.gamma = torch.nn.Parameter(torch.ones(parameters_shape))
        self.beta = torch.nn.Parameter(torch.zeros(parameters_shape))
        
    def forward(self, x):
        # Normalise over the trailing dimensions covered by parameters_shape
        dims = [-(i + 1) for i in range(len(self.parameters_shape))]
        mean = x.mean(dim=dims, keepdim=True)
        # Biased variance (divides by the number of elements)
        var = ((x - mean) ** 2).mean(dim=dims, keepdim=True)
        # eps keeps the denominator away from zero
        std = (var + self.eps).sqrt()
        y = (x - mean) / std
        # Learnable scale (gamma) and shift (beta)
        out = self.gamma * y + self.beta
        return out
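
> One way to gain confidence in a hand-written layer normalisation is to compare it against PyTorch's built-in `nn.LayerNorm`. The sketch below repeats the same arithmetic on a random tensor (sizes are illustrative assumptions) and checks that the two agree when the built-in layer has its default scale of 1 and shift of 0.

```python
import torch
from torch import nn

torch.manual_seed(0)
d_model, eps = 16, 1e-5  # illustrative sizes (assumption)
x = torch.randn(3, 7, d_model)

# Same arithmetic as the forward pass above, over the last dimension
mean = x.mean(dim=-1, keepdim=True)
var = ((x - mean) ** 2).mean(dim=-1, keepdim=True)  # biased variance
manual = (x - mean) / (var + eps).sqrt()

# Built-in reference; a fresh nn.LayerNorm has weight = 1 and bias = 0
reference = nn.LayerNorm(d_model, eps=eps)(x)

print(torch.allclose(manual, reference, atol=1e-5))  # True
```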

## FeedForward Layer

> This is a position-wise transformation that consists of a linear transformation, a ReLU, and another linear transformation. Its purpose is to process the output of one attention layer so that it better fits the input of the next attention layer. We use this layer to extract any additional information before proceeding to the Add & Norm step.



In [None]:
class PositionwiseFeedForward(nn.Module):
    def __init__(self, d_model, ffn_hidden, drop_prob=0.1):
        super().__init__()
        self.linear1 = nn.Linear(d_model, ffn_hidden)
        self.linear2 = nn.Linear(ffn_hidden, d_model)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(p=drop_prob)
        
    def forward(self, x):
        return self.linear2(self.dropout(self.relu(self.linear1(x))))
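
> "Position-wise" means the same two linear layers are applied to every position independently. The sketch below illustrates this with a stand-in network built from `nn.Sequential` (dropout is left out so the comparison is deterministic; sizes are illustrative assumptions): running the whole sequence at once gives the same result as running each position separately.

```python
import torch
from torch import nn

torch.manual_seed(0)
d_model, ffn_hidden = 8, 32  # illustrative sizes (assumption)

# Stand-in for the feed-forward layer, without dropout
ffn = nn.Sequential(
    nn.Linear(d_model, ffn_hidden),
    nn.ReLU(),
    nn.Linear(ffn_hidden, d_model),
)

x = torch.randn(2, 5, d_model)  # (batch, seq, d_model)

# Whole sequence in one call
full = ffn(x)
# One position at a time, then stacked back along the sequence axis
per_position = torch.stack([ffn(x[:, t]) for t in range(x.size(1))], dim=1)

print(torch.allclose(full, per_position, atol=1e-6))  # True
```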

## Single Encoder Layer

> Now we need to put together the encoder architecture using the components we have built previously. We will follow the structure of the single encoder layer shown below.

<style>
figure {
    display: block;
    margin-left: auto;
    margin-right: auto;
    width: 50%;
}
figcaption {
    text-align: left;
}
</style>
<figure>     
    <img src="images/single-encoder-layer.jpg" >
    <figcaption> Fig. 5: Single Encoder layer structure </figcaption>
</figure>

In [None]:
class EncoderLayer(nn.Module):
    
    def __init__(self, d_model, ffn_hidden, n_heads, drop_prob=0.1):
        super().__init__()
        self.attention = MultiHeadAttention(d_model=d_model, n_head=n_heads)
        self.dropout1 = nn.Dropout(p=drop_prob)
        self.norm1 = LayerNormalisation(parameters_shape=[d_model])
        self.ffn = PositionwiseFeedForward(d_model=d_model, ffn_hidden=ffn_hidden, drop_prob=drop_prob)
        self.dropout2 = nn.Dropout(p=drop_prob)
        self.norm2 = LayerNormalisation(parameters_shape=[d_model])
    
    def forward(self, x):
        # ---- Sub-layer 1: multi-head self-attention ----
        residual_x = x
        x = self.attention(x, mask=None)
        x = self.dropout1(x)
        # 1st Add & Norm step
        x = self.norm1(x + residual_x)
        # ---- Sub-layer 2: position-wise feed-forward ----
        # The residual for this sub-layer is the output of the first Add & Norm
        residual_x = x
        x = self.ffn(x)
        x = self.dropout2(x)
        # 2nd Add & Norm step
        x = self.norm2(x + residual_x)
        return x
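
> Both halves of the encoder layer follow the same pattern: apply a sub-layer, apply dropout, add the sub-layer's own input back, then normalise. A minimal sketch of that pattern is given below. `AddNorm` is a hypothetical helper that is not part of this notebook, and it uses the built-in `nn.LayerNorm` and plain linear layers as stand-ins for the attention and feed-forward modules.

```python
import torch
from torch import nn

class AddNorm(nn.Module):
    """Hypothetical helper: LayerNorm(x + Dropout(sublayer(x)))."""

    def __init__(self, d_model, sublayer, drop_prob=0.1):
        super().__init__()
        self.sublayer = sublayer
        self.dropout = nn.Dropout(p=drop_prob)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        # The residual is always the input of this particular sub-layer
        return self.norm(x + self.dropout(self.sublayer(x)))

torch.manual_seed(0)
d_model = 8  # illustrative size (assumption)
block = nn.Sequential(
    # Stand-in for the attention sub-layer
    AddNorm(d_model, nn.Linear(d_model, d_model)),
    # Stand-in for the feed-forward sub-layer
    AddNorm(d_model, nn.Sequential(nn.Linear(d_model, 32), nn.ReLU(), nn.Linear(32, d_model))),
)

x = torch.randn(2, 5, d_model)
out = block(x)
print(out.shape)  # torch.Size([2, 5, 8])
```

> Because each wrapper adds its own input, the second residual connection uses the output of the first Add & Norm step rather than the original input to the layer.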

## Full Encoder Architecture

> As illustrated in the encoder diagram, the full encoder is made up of N individual encoder layers, so the last step is to stack the encoder layers into a single encoder, with the output of each layer feeding the next.

<style>
figure {
    display: block;
    margin-left: auto;
    margin-right: auto;
    width: 50%;
}
figcaption {
    text-align: left;
}
</style>
<figure>     
    <img src="images/concat-encoder.jpg" >
    <figcaption> Fig. 6: Full Encoder structure </figcaption>
</figure>

In [None]:
class Encoder(nn.Module):
    def __init__(self, d_model, ffn_hidden, n_heads, drop_prob, n_layers):
        super().__init__()
        # Build a list of EncoderLayer objects (size = n_layers).
        # A plain Python list would not register the layers' parameters with the nn.Module,
        # so the list is unpacked with * (e.g. *[1, 2, 3] -> 1, 2, 3) and passed to nn.Sequential.
        # nn.Sequential registers the layers and chains them in the order they are passed in.
        self.layers = nn.Sequential(*[EncoderLayer(d_model, ffn_hidden, n_heads, drop_prob) for _ in range(n_layers)])
    
    def forward(self, x):
        return self.layers(x)
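
> The reason for wrapping the layers in `nn.Sequential` instead of keeping a plain Python list can be seen by counting parameters. In the sketch below, `PlainList` and `Registered` are hypothetical toy modules that use small linear layers in place of encoder layers.

```python
from torch import nn

class PlainList(nn.Module):
    def __init__(self):
        super().__init__()
        # A plain Python list: the layers are NOT registered as sub-modules
        self.layers = [nn.Linear(4, 4) for _ in range(3)]

class Registered(nn.Module):
    def __init__(self):
        super().__init__()
        # nn.Sequential registers every layer, so their parameters are tracked
        self.layers = nn.Sequential(*[nn.Linear(4, 4) for _ in range(3)])

def count_parameters(module):
    return sum(p.numel() for p in module.parameters())

plain_count = count_parameters(PlainList())
registered_count = count_parameters(Registered())
print(plain_count)       # 0
print(registered_count)  # 60  (3 layers x (4*4 weights + 4 biases))
```

> Parameters that are not registered are invisible to `module.parameters()`, so an optimiser would never update them and they would not be moved by `.to(device)`.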

## Parameters

> Finally, we set the hyperparameters, instantiate the encoder, and pass a random batch through it.

In [None]:
d_model = 512
n_head = 8
drop_prob = 0.1
batch_size = 30
max_input_len = 200
ffn_hidden = 2048
n_layers = 5

encoder = Encoder(d_model, ffn_hidden, n_head, drop_prob, n_layers)
x = torch.randn((batch_size, max_input_len, d_model))
out = encoder(x)
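
> The encoder should return a tensor with the same shape as its input, `(batch_size, max_input_len, d_model)`. As an independent reference, the sketch below builds an encoder from PyTorch's built-in `nn.TransformerEncoderLayer` and `nn.TransformerEncoder` and checks the output shape. The sizes are deliberately smaller than the parameters above so that it runs quickly, and the built-in modules are a stand-in for comparison, not the implementation from this notebook.

```python
import torch
from torch import nn

torch.manual_seed(0)
# Smaller, illustrative hyperparameters (assumption)
d_model, n_head, ffn_hidden, drop_prob, n_layers = 64, 4, 128, 0.1, 2

reference_layer = nn.TransformerEncoderLayer(
    d_model=d_model,
    nhead=n_head,
    dim_feedforward=ffn_hidden,
    dropout=drop_prob,
    batch_first=True,  # inputs are (batch, seq, d_model), as in this notebook
)
reference_encoder = nn.TransformerEncoder(reference_layer, num_layers=n_layers)

x = torch.randn(3, 10, d_model)
out = reference_encoder(x)
print(out.shape)  # torch.Size([3, 10, 64])
```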