* In this assignment you will be building the **Encoder** part of the Transformer architecture.
* You will use the **PyTorch** framework to implement the following components:
  * Encoder Layer that contains
    * Multi-Head Attention (MHA) Module
    * Position-wise Feed Forward Neural Network

  * Output layer that takes the encoder output and predicts the token_ids.

  * Optionally, study whether adding positional information is helpful.
  
* **DO NOT** use built-in **TRANSFORMER LAYERS**, as this affects reproducibility.

* You will be given a configuration file that contains hyperparameters such as the embedding dimension, vocabulary size, number of heads, and so on.

* Use the ReLU activation function and the Stochastic Gradient Descent optimizer.
* Here is a list of helpful PyTorch functions (you do not have to use all of them) for this and subsequent assignments:
  * [torch.matmul](https://pytorch.org/docs/stable/generated/torch.matmul.html#torch-matmul)
  * [torch.bmm](https://pytorch.org/docs/stable/generated/torch.bmm.html)
  * torch.swapdims
  * torch.unsqueeze
  * torch.squeeze
  * torch.argmax
  * [torch.Tensor.view](https://pytorch.org/docs/stable/generated/torch.Tensor.view.html)
  * [torch.nn.Embedding](https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html)
  * [torch.nn.Parameter](https://pytorch.org/docs/stable/generated/torch.nn.parameter.Parameter.html)
  * torch.nn.Linear
  * torch.nn.LayerNorm
  * torch.nn.ModuleList
  * torch.nn.Sequential
  * [torch.nn.CrossEntropyLoss](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html)
  
* Important: **Do not** set any global seeds.

* Helpful resources to get started with

 * [The Annotated Transformer](https://nlp.seas.harvard.edu/annotated-transformer/)
 * [PyTorch Source code of Transformer Layer](https://pytorch.org/docs/stable/generated/torch.nn.Transformer.html)



# Import

In [None]:
import torch
from torch import Tensor

import torch.nn as nn
from torch.nn import Parameter
import torch.nn.functional as F
from torch.nn.functional import one_hot

import torch.optim as optim

from  pprint import pprint
from yaml import safe_load
import requests
from io import BytesIO

import math

# Configuration

In [None]:
config_url = "https://raw.githubusercontent.com/Arunprakash-A/LLM-from-scratch-PyTorch/main/config_files/enc_config.yml"
response = requests.get(config_url)
config = response.content.decode("utf-8")
config = safe_load(config)
pprint(config)

{'input': {'batch_size': 10, 'embed_dim': 32, 'seq_len': 8, 'vocab_size': 10},
 'model': {'d_ff': 128,
           'd_model': 32,
           'dk': 4,
           'dq': 4,
           'dv': 4,
           'n_heads': 8,
           'n_layers': 6}}


In [None]:
vocab_size = config['input']['vocab_size']
batch_size = config['input']['batch_size']
seq_len = config['input']['seq_len']
embed_dim = config['input']['embed_dim']

In [None]:
print("Vocabulary Size :", vocab_size)
print("Batch Size :", batch_size)
print("Sequence Length :", seq_len)
print("Embedding Dimension :", embed_dim)

Vocabulary Size : 10
Batch Size : 10
Sequence Length : 8
Embedding Dimension : 32


* Here, you are given the token ids directly.
* Assume that the length of every sequence equals the context length (T), so that we do not need to pad shorter sequences while batching.

In [None]:
data_url = 'https://github.com/Arunprakash-A/LLM-from-scratch-PyTorch/raw/main/config_files/w1_input_tokens'
r = requests.get(data_url)
token_ids = torch.load(BytesIO(r.content))
print(token_ids)

tensor([[5, 7, 5, 6, 3, 8, 7, 5],
        [7, 2, 7, 1, 2, 1, 1, 7],
        [1, 0, 0, 3, 6, 3, 0, 8],
        [5, 0, 2, 8, 6, 5, 5, 3],
        [3, 5, 4, 8, 5, 0, 7, 3],
        [8, 6, 7, 4, 4, 4, 0, 1],
        [5, 8, 1, 0, 1, 1, 0, 3],
        [1, 7, 8, 8, 0, 5, 3, 7],
        [7, 7, 1, 4, 5, 6, 7, 0],
        [1, 7, 2, 8, 3, 0, 0, 4]])



# Building the sub-layers

In [None]:
# dimensions are plain Python ints; they are only used as tensor sizes and in math.sqrt
dq = config['model']['dq']
dk = config['model']['dk']
dv = config['model']['dv']
dmodel = embed_dim
heads = config['model']['n_heads']
d_ff = config['model']['d_ff']

## Multi-Head Attention

 * Be mindful when using `torch.matmul`
 * Make sure you understand what is being computed (the matrix product is not commutative)
 * Randomly initialize the parameters from a normal distribution with the following seed values
  * $W_Q:$(seed=43)
  * $W_K:$(seed=44)
  * $W_V:$(seed=45)
  * $W_O:$(seed=46)
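
The warning about `torch.matmul` can be made concrete with a small shape check. The sketch below uses the dimensions from the configuration above and unseeded stand-in weights (not the seeded parameters of the assignment). It shows how a `[BS, 1, T, dmodel]` input broadcasts against per-head weights of size `[heads, dmodel, dk]`, and how swapping the operands of the score product gives a different shape.

```python
import torch

torch.manual_seed(0)
BS, T, d_model, heads, d_k = 10, 8, 32, 8, 4  # values from the config above

H = torch.randn(BS, T, d_model)          # stand-in for the embeddings
W_q = torch.randn(heads, d_model, d_k)   # stand-in weights, one matrix per head
W_k = torch.randn(heads, d_model, d_k)

# [BS, 1, T, d_model] @ [heads, d_model, d_k] broadcasts to [BS, heads, T, d_k]
Q = torch.matmul(H.unsqueeze(1), W_q)
K = torch.matmul(H.unsqueeze(1), W_k)

# The same projection computed one head at a time, for comparison
Q_loop = torch.stack([H @ W_q[h] for h in range(heads)], dim=1)

# Operand order matters: Q K^T compares positions, K^T Q mixes feature dimensions
scores = torch.matmul(Q, K.transpose(-2, -1))      # [BS, heads, T, T]
not_scores = torch.matmul(K.transpose(-2, -1), Q)  # [BS, heads, d_k, d_k]

print(Q.shape, scores.shape, not_scores.shape)
```

Only `scores` has one row and one column per token position, which is what the softmax over keys requires.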

In [None]:
class MHA(nn.Module):

  def __init__(self,dmodel,dq,dk,dv,heads):
    super(MHA,self).__init__()

    self.dq = dq
    self.dk = dk
    self.dv = dv
    self.dmodel = dmodel
    self.heads = heads

    torch.manual_seed(43)
    self.W_q = nn.Parameter(torch.randn(heads, dmodel, dq))

    torch.manual_seed(44)
    self.W_k = nn.Parameter(torch.randn(heads, dmodel, dk))

    torch.manual_seed(45)
    self.W_v = nn.Parameter(torch.randn(heads, dmodel, dv))

    torch.manual_seed(46)
    self.W_o = nn.Parameter(torch.randn(dmodel, dmodel))

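  def forward(self,H=None):
    '''
    Input: Size [BSxTxdmodel]
    Output: Size[BSxTxdmodel]
    '''

    BS, T, dmodel = H.size()

    # project the input once per head:
    # [BS,1,T,dmodel] x [heads,dmodel,d] -> [BS,heads,T,d]
    Q = torch.matmul(H.unsqueeze(1), self.W_q)
    K = torch.matmul(H.unsqueeze(1), self.W_k)
    V = torch.matmul(H.unsqueeze(1), self.W_v)

    # scaled dot-product attention weights: [BS,heads,T,T]
    attention_score = torch.matmul(Q, K.transpose(2,3))/math.sqrt(self.dk)
    attention_score = torch.softmax(attention_score, dim = -1)

    # weighted sum of the values, then concatenate the heads: [BS,T,heads*dv]
    z = torch.matmul(attention_score, V)
    z = z.permute(0,2,1,3).contiguous()
    z = z.view(BS, T, -1)

    # output projection (W_o is dmodel x dmodel, so this assumes heads*dv == dmodel)
    out = torch.matmul(z, self.W_o)

    return out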

## Pointwise FFN

* Randomly initialize the parameters from a normal distribution with the following seed values
  * $W_{1}:$(seed=47)
  * $W_2:$(seed=48)  
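
Per-parameter seeds like the ones above can also be applied through a local `torch.Generator`, which leaves the global random state untouched. This is a minimal sketch of that alternative, not the approach used in the cells of this notebook. The helper name `seeded_randn` is hypothetical, and the sketch does not check whether its values match those produced by `torch.manual_seed` followed by `torch.randn`.

```python
import torch

def seeded_randn(*shape, seed):
    """Draw from N(0, 1) using a private generator (hypothetical helper)."""
    g = torch.Generator().manual_seed(seed)
    return torch.randn(*shape, generator=g)

state_before = torch.get_rng_state()

W_1a = seeded_randn(32, 128, seed=47)  # dmodel x d_ff
W_1b = seeded_randn(32, 128, seed=47)  # same seed, so the same values
W_2 = seeded_randn(128, 32, seed=48)   # d_ff x dmodel

state_after = torch.get_rng_state()

print(torch.equal(W_1a, W_1b))                # identical draws for identical seeds
print(torch.equal(state_before, state_after)) # global RNG state was not modified
```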

In [None]:
class FFN(nn.Module):
  def __init__(self,dmodel,d_ff,layer=0):
    super(FFN,self).__init__()

    self.dmodel = dmodel
    self.d_ff = d_ff

    torch.manual_seed(47)
    self.W_1 = nn.Parameter(torch.randn(dmodel, d_ff))

    torch.manual_seed(48)
    self.W_2 = nn.Parameter(torch.randn(d_ff, dmodel))

    self.relu = nn.ReLU()


  def forward(self,x):
    '''
    input: size [BSxTxdmodel]
    output: size [BSxTxdmodel]
    '''

    out = torch.matmul(x, self.W_1)

    out = self.relu(out)

    out = torch.matmul(out, self.W_2)

    return out

## Output Layer

* Randomly initialize the linear layer
 * $W_L:$(seed=49)
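
The output layer is expected to return raw logits rather than probabilities. This fits the loss used in this notebook, because `nn.CrossEntropyLoss` applies log-softmax internally. The sketch below checks this on a made-up batch of two examples with three classes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# made-up logits for 2 examples and 3 classes
logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 0.2, 3.0]])
target = torch.tensor([0, 2])

# CrossEntropyLoss consumes logits directly
loss_builtin = nn.CrossEntropyLoss()(logits, target)

# the same quantity written out as log-softmax followed by negative log-likelihood
loss_manual = F.nll_loss(F.log_softmax(logits, dim=-1), target)

print(torch.allclose(loss_builtin, loss_manual))
```

Applying a softmax inside the output layer as well would therefore normalize the scores twice.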


In [None]:
class OutputLayer(nn.Module):

  def __init__(self,dmodel,vocab_size):
    super(OutputLayer,self).__init__()

    torch.manual_seed(49)
    self.W_L = nn.Parameter(torch.randn(dmodel, vocab_size))

  def forward(self,representations):
    '''
    input: size [bsxTxdmodel]
    output: size [bsxTxvocab_size]
    Note: Do not apply the softmax. Just return the output of linear transformation
    '''
    out = torch.matmul(representations, self.W_L)
    return out

## Encoder Layer

In [None]:
class EncoderLayer(nn.Module):

  def __init__(self,dmodel,dq,dk,dv,d_ff,heads):
    super(EncoderLayer,self).__init__()
    self.mha = MHA(dmodel,dq,dk,dv,heads)
    self.layer_norm_mha = torch.nn.LayerNorm(dmodel)
    self.layer_norm_ffn = torch.nn.LayerNorm(dmodel)
    self.ffn = FFN(dmodel,d_ff)

  def forward(self,x):

    out_1 = self.mha(x)

    out_1 = self.layer_norm_mha(out_1+x)

    out = self.ffn(out_1)

    out = self.layer_norm_ffn(out+out_1)

    return out

## Model with one encoder layer

 * The encoder's forward function accepts the token_ids as input
 * Generate the embeddings for the token ids by initializing the embedding weights from a normal distribution with the seed value set to 50
 * Use `torch.nn.Embedding` to generate the required embeddings
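
One way to build an embedding from a seeded normal draw using only public API is `nn.Embedding.from_pretrained` with `freeze=False`. This is a sketch of an alternative, not necessarily how the cell below does it; the two-row `token_ids` batch is made up for illustration.

```python
import torch
import torch.nn as nn

vocab_size, embed_dim = 10, 32  # values from the config above

torch.manual_seed(50)
weights = torch.randn(vocab_size, embed_dim)

# freeze=False keeps the embedding matrix trainable
embed = nn.Embedding.from_pretrained(weights, freeze=False)

token_ids = torch.tensor([[5, 7, 5],
                          [1, 0, 3]])  # made-up batch of size [BS=2, T=3]
out = embed(token_ids)                 # lookup: [BS, T, embed_dim]

print(out.shape)
print(len(list(embed.parameters())))   # the matrix is registered once
```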

In [None]:
class Encoder(nn.Module):

  def __init__(self,vocab_size,embed_dim,dq,dk,dv,d_ff,heads,num_layers=1):
    super(Encoder,self).__init__()

    self.vocab_size = vocab_size
    self.embed_dim = embed_dim

    torch.manual_seed(50)
    self.embed_weights = nn.Parameter(torch.randn(vocab_size, embed_dim))
    self.embed = nn.Embedding(vocab_size, embed_dim, _weight=self.embed_weights)

    self.encoder_layer = EncoderLayer(embed_dim, dq, dk, dv, d_ff, heads)
    self.output_layer = OutputLayer(embed_dim, vocab_size)

  def forward(self,x):
    '''
    The input should be tokens ids of size [BS,T]
    '''
    out = self.embed(x) # get the embeddings of the tokens
    out = self.encoder_layer(out) # pass the embeddings through the encoder layer
    out = self.output_layer(out) # get the logits

    return out

In [None]:
model = Encoder(vocab_size,dmodel,dq,dk,dv,d_ff,heads)
optimizer = optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

# Training the model

 * Train the model for 30 epochs and compute the loss
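
Computing the loss requires matching the logits of size `[BS, T, vocab_size]` to what `nn.CrossEntropyLoss` expects. Two equivalent options are flattening the batch and time dimensions, or moving the class dimension to position 1. The sketch below compares them on random stand-in logits and targets (not the notebook's data).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
BS, T, V = 10, 8, 10  # batch size, sequence length, vocabulary size

logits = torch.randn(BS, T, V)          # stand-in for the model output
targets = torch.randint(0, V, (BS, T))  # stand-in for the token ids
criterion = nn.CrossEntropyLoss()

# option 1: treat every token position as one example, [BS*T, V] vs [BS*T]
loss_flat = criterion(logits.view(-1, V), targets.view(-1))

# option 2: keep the batch structure but put classes second, [BS, V, T] vs [BS, T]
loss_perm = criterion(logits.permute(0, 2, 1), targets)

print(torch.allclose(loss_flat, loss_perm))
```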

In [None]:
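def train(token_ids, model, optimizer, criterion, epochs=30):

  for epoch in range(epochs):
    out = model(token_ids)  # logits of size [BS,T,vocab_size]

    # flatten so that each token position is one classification example
    batch_size, seq_len, vocab_size = out.size()
    out = out.view(-1, vocab_size)
    target = token_ids.view(-1)  # the targets are the input token ids themselves

    loss = criterion(out, target)
    print(f'Epoch {epoch}, Loss: {loss.item()}')

    loss.backward()

    optimizer.step()
    optimizer.zero_grad()  # clear the gradients before the next epoch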


In [None]:
train(token_ids, model, optimizer, criterion, epochs=30)

Epoch 0, Loss: 10.118653297424316
Epoch 1, Loss: 8.861600875854492
Epoch 2, Loss: 8.083215713500977
Epoch 3, Loss: 7.3580474853515625
Epoch 4, Loss: 6.783898830413818
Epoch 5, Loss: 6.39406681060791
Epoch 6, Loss: 6.037106990814209
Epoch 7, Loss: 5.667043209075928
Epoch 8, Loss: 5.352385520935059
Epoch 9, Loss: 5.088061332702637
Epoch 10, Loss: 4.846492290496826
Epoch 11, Loss: 4.6231794357299805
Epoch 12, Loss: 4.4200358390808105
Epoch 13, Loss: 4.247159004211426
Epoch 14, Loss: 4.019876003265381
Epoch 15, Loss: 3.756844997406006
Epoch 16, Loss: 3.4984729290008545
Epoch 17, Loss: 3.398616313934326
Epoch 18, Loss: 3.2541491985321045
Epoch 19, Loss: 3.1784861087799072
Epoch 20, Loss: 3.0574450492858887
Epoch 21, Loss: 2.99891996383667
Epoch 22, Loss: 2.886470317840576
Epoch 23, Loss: 2.8377766609191895
Epoch 24, Loss: 2.7386815547943115
Epoch 25, Loss: 2.690978765487671
Epoch 26, Loss: 2.6101996898651123
Epoch 27, Loss: 2.5524134635925293
Epoch 28, Loss: 2.4962239265441895
Epoch 29, Los

# Inference

In [None]:
with torch.inference_mode():
  predictions =  model(token_ids) # predict the labels
  predicted_labels = predictions.argmax(dim=-1)

* See how many labels are correctly predicted

In [None]:
print(torch.count_nonzero(token_ids==predicted_labels).item())

38


* The loss by now should be about 2.39 and the number of correct predictions should be about 37

# Encoder with N Layers

  * The initialized parameters in all layers are identical
  * Use ModuleList to create **deep-copies** of the encoder layer
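
The difference between deep copies and repeated references can be checked on a small stand-in module. In this sketch an `nn.Linear` plays the role of the encoder layer. Deep copies start from identical values but are trained independently, whereas reusing the same object shares one set of parameters across all positions in the list.

```python
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(4, 4)  # stand-in for an encoder layer (16 weights + 4 biases)

copies = nn.ModuleList([copy.deepcopy(layer) for _ in range(3)])
shared = nn.ModuleList([layer for _ in range(3)])  # same object three times

# deep copies: identical initial values, distinct parameter tensors
same_init = torch.equal(copies[0].weight, copies[1].weight)
distinct = copies[0].weight is not copies[1].weight

# parameters() counts each distinct tensor once
n_copies = sum(p.numel() for p in copies.parameters())
n_shared = sum(p.numel() for p in shared.parameters())

# changing one copy leaves the others untouched
with torch.no_grad():
    copies[0].weight.add_(1.0)
still_original = torch.equal(copies[1].weight, layer.weight)

print(same_init, distinct, n_copies, n_shared, still_original)
```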

In [None]:
import copy

In [None]:
class Encoder(nn.Module):

  def __init__(self,vocab_size,dmodel,dq,dk,dv,d_ff,heads,num_layers=1):
    super(Encoder,self).__init__()

    # use the dmodel argument rather than the global embed_dim
    self.embed_weights = nn.Parameter(torch.randn(vocab_size, dmodel))
    self.embed = nn.Embedding(vocab_size, dmodel, _weight=self.embed_weights)

    enc_layer = EncoderLayer(dmodel, dq, dk, dv, d_ff, heads)
    self.enc_layers = nn.ModuleList([copy.deepcopy(enc_layer) for _ in range(num_layers)])

    self.norm = nn.LayerNorm(dmodel)

    self.output_layer = OutputLayer(dmodel, vocab_size)


  def forward(self,x):
    '''
    1. Get embeddings (and normalize them)
    2. Pass them through encoder layer-1 and successively pass the output to subsequent enc.layers
    3. output the logits
    '''

    out = self.embed(x)

    out = self.norm(out)

    for layer in self.enc_layers:
      out = layer(out)

    out = self.output_layer(out)

    return out

* Train the stack of encoder layers with `num_layers=2` for the same 30 epochs

In [None]:
model = Encoder(vocab_size,dmodel,dq,dk,dv,d_ff,heads,num_layers=2)
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

In [None]:
train(token_ids, model, optimizer, criterion, epochs=30)

Epoch 0, Loss: 12.377585411071777
Epoch 1, Loss: 10.060522079467773
Epoch 2, Loss: 8.443035125732422
Epoch 3, Loss: 7.300762176513672
Epoch 4, Loss: 5.969535827636719
Epoch 5, Loss: 5.122201442718506
Epoch 6, Loss: 4.611206531524658
Epoch 7, Loss: 4.359195709228516
Epoch 8, Loss: 3.807462692260742
Epoch 9, Loss: 3.5612869262695312
Epoch 10, Loss: 3.303182601928711
Epoch 11, Loss: 3.124396324157715
Epoch 12, Loss: 2.9949800968170166
Epoch 13, Loss: 2.8886213302612305
Epoch 14, Loss: 2.7892909049987793
Epoch 15, Loss: 2.6888840198516846
Epoch 16, Loss: 2.624729633331299
Epoch 17, Loss: 2.5725009441375732
Epoch 18, Loss: 2.5149991512298584
Epoch 19, Loss: 2.4602086544036865
Epoch 20, Loss: 2.411038875579834
Epoch 21, Loss: 2.3677639961242676
Epoch 22, Loss: 2.331463098526001
Epoch 23, Loss: 2.246312379837036
Epoch 24, Loss: 2.210348129272461
Epoch 25, Loss: 2.1535091400146484
Epoch 26, Loss: 2.1113452911376953
Epoch 27, Loss: 2.0613067150115967
Epoch 28, Loss: 2.0267081260681152
Epoch 29,

In [None]:
with torch.inference_mode():
  predictions =  model(token_ids) # predict the labels
  predicted_labels = predictions.argmax(dim=-1)

In [None]:
torch.count_nonzero(predicted_labels==token_ids).item()

38

* Now, the loss value should be about 1.9 and the number of correct predictions should be about 38

## Count Number of Parameters

In [None]:
total_num_parameters = 0

for parameter in model.parameters():
  total_num_parameters += parameter.numel()

print('total number of parameters in the model \n including the embedding layer is:', total_num_parameters)

total number of parameters in the model 
 including the embedding layer is: 25856
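
As a cross-check, the count can be derived by hand from the tensor shapes defined in this notebook. The sketch below assumes the two-layer model above and the default `nn.LayerNorm` (one weight and one bias vector of size `dmodel`). It gives 25536, which is exactly 320 (one `vocab_size x dmodel` matrix) less than the printed value. A plausible explanation, which is an inference and not something the notebook states, is that the embedding matrix is registered twice: once as `embed_weights` and once as `embed.weight`.

```python
# values from the config above
vocab_size, d_model, d_ff = 10, 32, 128
heads, dq, dk, dv, n_layers = 8, 4, 4, 4, 2

mha = heads * d_model * (dq + dk + dv) + d_model * d_model  # W_q, W_k, W_v, W_o
ffn = d_model * d_ff + d_ff * d_model                       # W_1, W_2
layer_norm = 2 * d_model                                    # weight and bias
enc_layer = mha + ffn + 2 * layer_norm                      # two LayerNorms per layer

embedding = vocab_size * d_model
output = d_model * vocab_size                               # W_L

total = n_layers * enc_layer + layer_norm + embedding + output
print(total)              # count with the embedding matrix included once
print(total + embedding)  # count with the embedding matrix included twice
```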


## (Optional) Positional Encoding

 * We now use the positional encoding as defined in the [paper](https://arxiv.org/pdf/1706.03762v1.pdf) (it differs slightly from the lecture).
 * Note that the positional encoding for each position is fixed (it is not a learnable parameter).
 * However, we add it to the raw embeddings, which are learnable.
 * Therefore, it is important to define a class for PE and register the PE matrix as a buffer, so that it moves with the model (for example, to the GPU).
 * Simply create a matrix with the same shape as the input embeddings and add it to them.
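
Two points from the list above can be checked in isolation. First, the frequencies computed in log space, `exp(2i * (-ln 10000 / d_model))`, match the direct form `1 / 10000^(2i / d_model)`. Second, a tensor registered as a buffer is stored with the module but is not a trainable parameter. The module name `FixedTable` is hypothetical and only serves this sketch.

```python
import math
import torch
import torch.nn as nn

d_model, max_len = 32, 8
two_i = torch.arange(0, d_model, 2).float()  # even indices 0, 2, ..., d_model-2

div_log = torch.exp(two_i * (-math.log(10000.0) / d_model))  # log-space form
div_direct = 1.0 / (10000.0 ** (two_i / d_model))            # direct form

position = torch.arange(max_len, dtype=torch.float).unsqueeze(1)  # [max_len, 1]
pe = torch.zeros(max_len, d_model)
pe[:, 0::2] = torch.sin(position * div_log)  # even columns
pe[:, 1::2] = torch.cos(position * div_log)  # odd columns

class FixedTable(nn.Module):
    """Hypothetical module that only stores a fixed table as a buffer."""
    def __init__(self, table):
        super().__init__()
        self.register_buffer('pe', table)

table = FixedTable(pe)
print(torch.allclose(div_log, div_direct, rtol=1e-4))
print(len(list(table.parameters())), 'pe' in table.state_dict())
```

At position 0 every sine column is 0 and every cosine column is 1, which is a quick sanity check on the table.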

In [None]:
import math
class PositionalEncoding(nn.Module):
    "Implement the PE function."

    def __init__(self,d_model,max_len=8):
        super(PositionalEncoding, self).__init__()

        #compute it in the log space

        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))

        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)

        pe = pe.unsqueeze(0)

        self.register_buffer('pe', pe)

    def forward(self, x):
        # add positional embedding

        x = x + self.pe[:, :x.size(1), :]

        return x