<a href="https://colab.research.google.com/github/Guche02/transformer-from-scratch/blob/master/transformers_scratch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import torch
import torch.nn as nn
import math

## Input Embeddings: convert tokens into vectors of 512 dimensions

nn.Module is the base class for all neural network modules in PyTorch.

The models that we create inherit from this base class and override methods such as forward().

super() is used to initialize the parent class (nn.Module) for every object that is created later. *If the parent constructor is not called through super(), submodules such as nn.Embedding cannot be registered: assigning one to self raises an AttributeError.*
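
The failure mode can be sketched as follows. `BrokenEmbeddings` is a hypothetical class written only for this illustration; it skips the parent constructor on purpose, and the sizes passed to it are arbitrary.

```python
import torch.nn as nn


class BrokenEmbeddings(nn.Module):  # hypothetical class, for illustration only
    def __init__(self, d_model: int, vocab_size: int):
        # The call to super().__init__() is deliberately omitted here.
        self.embedding = nn.Embedding(vocab_size, d_model)


try:
    BrokenEmbeddings(d_model=8, vocab_size=10)
    error_message = None
except AttributeError as err:
    # nn.Module refuses to register a submodule before its own __init__ has run.
    error_message = str(err)

print(error_message)
```

Adding the parent constructor call as the first line of `__init__` makes the assignment succeed.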


In [None]:
class InputEmbeddings(nn.Module):
  def __init__(self, d_model: int, vocab_size:int):
    super().__init__()
    self.d_model = d_model
    self.vocab_size = vocab_size
    self.embedding = nn.Embedding(num_embeddings = vocab_size, embedding_dim = d_model)

  def forward(self, x):
    return self.embedding(x) * math.sqrt(self.d_model)   # embedding is scaled by sqrt(d_model), as in the original paper.
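
As a rough usage sketch, the same lookup and scaling can be tried on a small batch of token ids. The vocabulary size of 1000 and the token ids below are arbitrary values assumed for the example, not values from this notebook.

```python
import math

import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, vocab_size = 512, 1000  # vocab_size is an assumed value for this example

embedding = nn.Embedding(num_embeddings=vocab_size, embedding_dim=d_model)
token_ids = torch.tensor([[1, 5, 9], [2, 7, 4]])  # (batch=2, seq_len=3)

raw = embedding(token_ids)         # (2, 3, 512): one vector per token
scaled = raw * math.sqrt(d_model)  # same shape, every value multiplied by sqrt(512)

print(scaled.shape)
```

Each token id is replaced by a vector of length d_model, so a (batch, seq_len) input becomes a (batch, seq_len, d_model) output.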

## Positional Encoding: add information about position of tokens

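Dropout is applied to the sum of the token embeddings and the positional encodings, as in the original paper. It acts as a regularizer, so that the model does not rely too heavily on any individual component of the position-augmented input and is pushed to learn the semantic relationships between the tokens.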

torch.arange() is similar to range() in Python, but it returns a tensor.

unsqueeze() adds an extra dimension of size 1 at the specified position.

All the numbers are converted to float so that the multiplications and divisions are done in floating point rather than in integer arithmetic by mistake.

register_buffer() registers a tensor as a buffer: it is part of the model's state (it is saved with the model and moved across devices with it), but it is not a model parameter, so it is not updated during training.

requires_grad_(False) tells autograd not to track gradients for the tensor, so it is not updated during training.
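
The calls described above can be tried in isolation. This is a minimal sketch: `WithBuffer` is a hypothetical module written only for this illustration, and the tensor sizes are arbitrary.

```python
import torch
import torch.nn as nn

position = torch.arange(0, 4, dtype=torch.float)  # shape (4,)
column = position.unsqueeze(1)                    # shape (4, 1)
row = torch.arange(0, 6, 2).float()               # shape (3,), values 0, 2, 4
grid = column * row                               # broadcasts to shape (4, 3)


class WithBuffer(nn.Module):  # hypothetical module, for illustration only
    def __init__(self):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(3))  # trainable parameter
        self.register_buffer('table', grid)        # stored with the model, not trained


module = WithBuffer()
param_names = [name for name, _ in module.named_parameters()]
buffer_names = [name for name, _ in module.named_buffers()]
state_keys = sorted(module.state_dict().keys())

print(grid.shape, param_names, buffer_names, state_keys)
```

The buffer shows up in state_dict() but not in named_parameters(), which is why an optimizer built from the model's parameters never touches it.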
  

In [None]:
class PositionalEncoding(nn.Module):
  def __init__(self, d_model: int, seq_len: int, dropout:float):
    super().__init__()
    self.d_model = d_model
    self.seq_len = seq_len
    self.dropout = nn.Dropout(dropout)

    # create a tensor (seq_len x d_model) filled with zeros as a placeholder for positional encoding
    pe = torch.zeros(seq_len, d_model)

    # numerator term of the formula for positional encoding
    position = torch.arange(0, seq_len, dtype=torch.float).unsqueeze(1)

    # denominator term of the formula
    div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))  # equals 1 / 10000^(2i / d_model); exp and log are used for numerical stability.

    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)

    # add a batch dimension of size 1 so that the same positional encoding broadcasts over every input sequence in the batch
    pe = pe.unsqueeze(0)  # (1, seq_len, d_model)

    # saving the positional encoding info as a constant buffer that is not updated during training.
    self.register_buffer('pe', pe)

  def forward(self, x):
    # x.shape[1] gives the length of the input sequence.
    x = x + (self.pe[:, :x.shape[1], :]).requires_grad_(False)
    return self.dropout(x)
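
The formula referred to in the comments is the sinusoidal encoding from the original paper: PE(pos, 2i) = sin(pos / 10000^(2i / d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model)). As a quick sanity check, the sketch below compares the exp/log form of the denominator with the direct power form. The values d_model = 8 and seq_len = 5 are small sizes assumed only for the example.

```python
import math

import torch

d_model, seq_len = 8, 5  # small values chosen only for this example

position = torch.arange(0, seq_len, dtype=torch.float).unsqueeze(1)  # (seq_len, 1)
even_dims = torch.arange(0, d_model, 2).float()                      # 2i = 0, 2, 4, 6

# exp/log form used in the class above
div_term = torch.exp(even_dims * (-math.log(10000.0) / d_model))
# direct form of 1 / 10000^(2i / d_model)
direct = 1.0 / torch.pow(torch.tensor(10000.0), even_dims / d_model)

pe = torch.zeros(seq_len, d_model)
pe[:, 0::2] = torch.sin(position * div_term)  # even columns
pe[:, 1::2] = torch.cos(position * div_term)  # odd columns

print(torch.allclose(div_term, direct))
print(pe[0])
```

At position 0 every sine term is 0 and every cosine term is 1, so the first row alternates between 0 and 1.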