# Transformers from Scratch

This notebook will guide you through implementing a Transformer model from scratch using PyTorch. We will break down the Transformer architecture into its core components and build it step-by-step.

## 1. Setup and Imports

First, let's import the necessary libraries. We'll primarily be using PyTorch for building the model.

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import math

## 2. Multi-Head Attention

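The core of the Transformer is the multi-head attention mechanism. For each position, attention computes a weighted sum over all positions in the sequence, with weights that reflect how relevant each one is to the position being processed. Multi-head attention runs several such attention operations in parallel on different learned projections of the input and concatenates their results.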

In [None]:
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        # TODO: Initialize the dimensions, number of heads, and linear layers for Q, K, V and the output.
        # Make sure d_model is divisible by num_heads
        pass

    def scaled_dot_product_attention(self, Q, K, V, mask=None):
        # TODO: Implement the scaled dot-product attention.
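        # Formula: softmax((Q @ K^T) / sqrt(d_k)) @ V, where @ is matrix multiplication
        # and the softmax is taken over the key dimension (the last one).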
        pass

    def forward(self, Q, K, V, mask=None):
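        # TODO:
        # 1. Project Q, K, V through their respective linear layers.
        # 2. Split each projection into `num_heads` heads of size d_k = d_model // num_heads,
        #    i.e. reshape (batch, seq_len, d_model) -> (batch, num_heads, seq_len, d_k).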
        # 3. Apply scaled dot-product attention.
        # 4. Concatenate the heads and pass through the final linear layer.
        pass
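
The block below is one possible way to fill in the skeleton above, not the only one. The class name `MultiHeadAttentionSketch`, the helper methods `split_heads` and `combine_heads`, and the mask convention (a mask broadcastable to `(batch, num_heads, query_len, key_len)` in which `0` marks a blocked position) are assumptions made for illustration. Blocked scores are filled with `-1e9` rather than `-inf` so that a fully masked row does not produce NaNs.

```python
import math

import torch
import torch.nn as nn


class MultiHeadAttentionSketch(nn.Module):
    """Illustrative implementation; the mask convention (0 = blocked) is an assumption."""

    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads  # size of each head
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def scaled_dot_product_attention(self, Q, K, V, mask=None):
        # Q, K, V: (batch, num_heads, seq_len, d_k)
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
        if mask is not None:
            # A very negative score receives (numerically) zero weight after softmax.
            scores = scores.masked_fill(mask == 0, -1e9)
        weights = torch.softmax(scores, dim=-1)
        return torch.matmul(weights, V)

    def split_heads(self, x):
        # (batch, seq_len, d_model) -> (batch, num_heads, seq_len, d_k)
        batch, seq_len, _ = x.size()
        return x.view(batch, seq_len, self.num_heads, self.d_k).transpose(1, 2)

    def combine_heads(self, x):
        # (batch, num_heads, seq_len, d_k) -> (batch, seq_len, d_model)
        batch, _, seq_len, _ = x.size()
        return x.transpose(1, 2).contiguous().view(batch, seq_len, self.d_model)

    def forward(self, Q, K, V, mask=None):
        Q = self.split_heads(self.W_q(Q))
        K = self.split_heads(self.W_k(K))
        V = self.split_heads(self.W_v(V))
        out = self.scaled_dot_product_attention(Q, K, V, mask)
        return self.W_o(self.combine_heads(out))


x = torch.randn(2, 5, 16)
mha = MultiHeadAttentionSketch(d_model=16, num_heads=4)
print(mha(x, x, x).shape)  # torch.Size([2, 5, 16])
```

Passing the same tensor as `Q`, `K`, and `V` gives self-attention; passing the encoder output as `K` and `V` gives the encoder-decoder attention used in Section 6.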

## 3. Position-wise Feed-Forward Networks

The feed-forward sub-layer is simple: two linear transformations with a ReLU activation in between, applied to each position independently with the same weights.

In [None]:
class PositionwiseFeedForward(nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()
        # TODO: Initialize the two linear layers and the ReLU activation.
        pass

    def forward(self, x):
        # TODO: Implement the forward pass.
        pass
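
A minimal sketch of the feed-forward sub-layer described above might look like this. The class name `PositionwiseFeedForwardSketch` and the attribute names `fc1` and `fc2` are made up for illustration. Because `nn.Linear` acts on the last dimension only, the same weights are applied at every position without any extra reshaping.

```python
import torch
import torch.nn as nn


class PositionwiseFeedForwardSketch(nn.Module):
    """Illustrative implementation: Linear -> ReLU -> Linear on the last dimension."""

    def __init__(self, d_model, d_ff):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_ff)  # expand to the hidden size d_ff
        self.fc2 = nn.Linear(d_ff, d_model)  # project back to d_model
        self.relu = nn.ReLU()

    def forward(self, x):
        # x: (batch, seq_len, d_model) -> (batch, seq_len, d_model)
        return self.fc2(self.relu(self.fc1(x)))


ffn = PositionwiseFeedForwardSketch(d_model=16, d_ff=64)
x = torch.randn(2, 5, 16)
print(ffn(x).shape)  # torch.Size([2, 5, 16])
```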

## 4. Positional Encoding

Self-attention treats its input as an unordered set, so the Transformer has no built-in sense of sequence order and positional information must be injected explicitly. Here we add fixed sinusoidal encodings to the input embeddings.

In [None]:
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        # TODO: Create a positional encoding matrix of shape (max_len, d_model) using sine and cosine functions:
        #   PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
        #   PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
        pass

    def forward(self, x):
        # TODO: Add the positional encodings to the input tensor `x`.
        pass
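
One way to fill in the skeleton above is sketched here, assuming `d_model` is even and inputs of shape `(batch, seq_len, d_model)`. The class name `PositionalEncodingSketch` is hypothetical. The term `1 / 10000^(2i / d_model)` is computed in log space, and the matrix is stored with `register_buffer` so that it follows the module across devices without being treated as a trainable parameter.

```python
import math

import torch
import torch.nn as nn


class PositionalEncodingSketch(nn.Module):
    """Illustrative sinusoidal encoding; assumes d_model is even."""

    def __init__(self, d_model, max_len=5000):
        super().__init__()
        position = torch.arange(max_len, dtype=torch.float).unsqueeze(1)  # (max_len, 1)
        # 1 / 10000^(2i / d_model) for i = 0 .. d_model/2 - 1
        div_term = torch.exp(
            torch.arange(0, d_model, 2, dtype=torch.float) * (-math.log(10000.0) / d_model)
        )
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)  # even feature indices
        pe[:, 1::2] = torch.cos(position * div_term)  # odd feature indices
        self.register_buffer("pe", pe.unsqueeze(0))  # (1, max_len, d_model)

    def forward(self, x):
        # x: (batch, seq_len, d_model); add the first seq_len encodings.
        return x + self.pe[:, : x.size(1)]


pos_enc = PositionalEncodingSketch(d_model=16, max_len=100)
x = torch.zeros(2, 10, 16)
print(pos_enc(x).shape)  # torch.Size([2, 10, 16])
```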

## 5. Encoder Block

The Transformer Encoder is a stack of identical layers. Each layer has two sub-layers: a multi-head self-attention mechanism and a position-wise feed-forward network. Each sub-layer is wrapped in a residual connection followed by layer normalization.

In [None]:
class EncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        # TODO: Initialize the Multi-Head Attention, Position-wise Feed-Forward, LayerNorms, and Dropout layers.
        pass

    def forward(self, x, mask):
        # TODO: Implement the forward pass for the encoder layer, including residual connections and layer normalization.
        pass
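
The residual-plus-normalization pattern can be sketched as follows. So that the block runs by itself, a compact class `CompactAttention` stands in for the `MultiHeadAttention` class of Section 2, and an `nn.Sequential` stands in for `PositionwiseFeedForward`; both stand-ins, the name `EncoderLayerSketch`, and the mask convention (`0` = blocked, broadcastable to `(batch, num_heads, query_len, key_len)`) are assumptions. The ordering shown, `LayerNorm(x + Dropout(Sublayer(x)))`, is the post-norm arrangement; other orderings exist.

```python
import math

import torch
import torch.nn as nn


class CompactAttention(nn.Module):
    """Compact multi-head attention stand-in (hypothetical helper)."""

    def __init__(self, d_model, num_heads):
        super().__init__()
        self.h, self.d_k = num_heads, d_model // num_heads
        self.wq, self.wk = nn.Linear(d_model, d_model), nn.Linear(d_model, d_model)
        self.wv, self.wo = nn.Linear(d_model, d_model), nn.Linear(d_model, d_model)

    def forward(self, q, k, v, mask=None):
        b = q.size(0)

        def split(t):  # (batch, seq, d_model) -> (batch, heads, seq, d_k)
            return t.view(b, -1, self.h, self.d_k).transpose(1, 2)

        q, k, v = split(self.wq(q)), split(self.wk(k)), split(self.wv(v))
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_k)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)  # 0 = blocked (assumption)
        out = torch.softmax(scores, dim=-1) @ v
        return self.wo(out.transpose(1, 2).reshape(b, -1, self.h * self.d_k))


class EncoderLayerSketch(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.self_attn = CompactAttention(d_model, num_heads)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask):
        # Sub-layer 1: self-attention, then residual connection and layer norm.
        x = self.norm1(x + self.dropout(self.self_attn(x, x, x, mask)))
        # Sub-layer 2: feed-forward, then residual connection and layer norm.
        return self.norm2(x + self.dropout(self.ffn(x)))


layer = EncoderLayerSketch(d_model=16, num_heads=4, d_ff=64)
x = torch.randn(2, 5, 16)
print(layer(x, mask=None).shape)  # torch.Size([2, 5, 16])
```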

## 6. Decoder Block

The Transformer Decoder is also a stack of identical layers. In addition to the two sub-layers in the encoder layer, the decoder inserts a third sub-layer which performs multi-head attention over the output of the encoder stack.

In [None]:
class DecoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        # TODO: Initialize the three main components: self-attention, encoder-decoder attention, and the feed-forward network.
        # Also, initialize LayerNorms and Dropout.
        pass

    def forward(self, x, enc_output, src_mask, tgt_mask):
        # TODO: Implement the forward pass for the decoder layer, including residual connections and layer normalization.
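        # Remember that the first attention block is masked self-attention: pass `tgt_mask` so that
        # each position can only attend to itself and earlier positions. The second attention block
        # takes its queries from the decoder and its keys and values from `enc_output`.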
        pass
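
A possible decoder layer with its three sub-layers is sketched below. As an assumption made only to keep the block self-contained, a compact class `CompactAttention` stands in for the `MultiHeadAttention` class of Section 2 and an `nn.Sequential` stands in for `PositionwiseFeedForward`. The name `DecoderLayerSketch` and the mask convention (`0` = blocked) are likewise illustrative. The usage lines build a lower-triangular `tgt_mask` with `torch.tril` so that position `t` cannot see positions after `t`.

```python
import math

import torch
import torch.nn as nn


class CompactAttention(nn.Module):
    """Compact multi-head attention stand-in (hypothetical helper)."""

    def __init__(self, d_model, num_heads):
        super().__init__()
        self.h, self.d_k = num_heads, d_model // num_heads
        self.wq, self.wk = nn.Linear(d_model, d_model), nn.Linear(d_model, d_model)
        self.wv, self.wo = nn.Linear(d_model, d_model), nn.Linear(d_model, d_model)

    def forward(self, q, k, v, mask=None):
        b = q.size(0)

        def split(t):  # (batch, seq, d_model) -> (batch, heads, seq, d_k)
            return t.view(b, -1, self.h, self.d_k).transpose(1, 2)

        q, k, v = split(self.wq(q)), split(self.wk(k)), split(self.wv(v))
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_k)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)  # 0 = blocked (assumption)
        out = torch.softmax(scores, dim=-1) @ v
        return self.wo(out.transpose(1, 2).reshape(b, -1, self.h * self.d_k))


class DecoderLayerSketch(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.self_attn = CompactAttention(d_model, num_heads)
        self.cross_attn = CompactAttention(d_model, num_heads)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, enc_output, src_mask, tgt_mask):
        # 1. Masked self-attention over the target sequence.
        x = self.norm1(x + self.dropout(self.self_attn(x, x, x, tgt_mask)))
        # 2. Encoder-decoder attention: queries from x, keys/values from the encoder.
        x = self.norm2(
            x + self.dropout(self.cross_attn(x, enc_output, enc_output, src_mask))
        )
        # 3. Position-wise feed-forward network.
        return self.norm3(x + self.dropout(self.ffn(x)))


tgt_len = 6
tgt_mask = torch.tril(torch.ones(tgt_len, tgt_len)).view(1, 1, tgt_len, tgt_len)
layer = DecoderLayerSketch(d_model=16, num_heads=4, d_ff=64)
out = layer(torch.randn(2, tgt_len, 16), torch.randn(2, 9, 16), None, tgt_mask)
print(out.shape)  # torch.Size([2, 6, 16])
```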

## 7. Transformer

Now we assemble the full Transformer model from the components we've built.

In [None]:
class Transformer(nn.Module):
    def __init__(self, src_vocab_size, tgt_vocab_size, d_model, num_heads, num_layers, d_ff, max_len, dropout=0.1):
        super().__init__()
        # TODO: Initialize the embedding layers, positional encodings, encoder and decoder stacks, and the final linear layer.
        pass

    def forward(self, src, tgt, src_mask, tgt_mask):
        # TODO: Implement the full forward pass of the transformer.
        # 1. Pass source and target through their respective embedding and positional encoding layers.
        # 2. Pass the source through the encoder.
        # 3. Pass the target and encoder output through the decoder.
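        # 4. Apply the final linear layer and return the raw logits of shape (batch, tgt_len, tgt_vocab_size).
        #    Do not apply softmax here if you train with CrossEntropyLoss, which expects logits;
        #    apply softmax only when you need explicit probabilities.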
        pass
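
The forward pass above takes `src_mask` and `tgt_mask` as inputs, so it may help to see one way of building them from token ids. The helper below is hypothetical: the name `make_masks`, the default `pad_idx=0`, and the convention that `True` means "may attend" (with shapes broadcastable to `(batch, num_heads, query_len, key_len)`) are all assumptions, and the masks must match whatever convention your attention implementation uses.

```python
import torch


def make_masks(src, tgt, pad_idx=0):
    """Build a padding mask for the source and a padding + causal mask for the target.

    src: (batch, src_len) token ids, tgt: (batch, tgt_len) token ids.
    Returns boolean masks in which True means "may attend".
    """
    # Hide padded source positions from every query: (batch, 1, 1, src_len)
    src_mask = (src != pad_idx).unsqueeze(1).unsqueeze(2)
    # Hide padded target positions: (batch, 1, 1, tgt_len)
    tgt_pad_mask = (tgt != pad_idx).unsqueeze(1).unsqueeze(2)
    # Lower-triangular matrix: position t may attend to positions <= t.
    tgt_len = tgt.size(1)
    causal = torch.tril(
        torch.ones(tgt_len, tgt_len, dtype=torch.bool, device=tgt.device)
    )
    # Broadcasting combines both constraints: (batch, 1, tgt_len, tgt_len)
    tgt_mask = tgt_pad_mask & causal
    return src_mask, tgt_mask


src = torch.tensor([[5, 6, 0]])  # last source token is padding
tgt = torch.tensor([[1, 7, 0]])  # last target token is padding
src_mask, tgt_mask = make_masks(src, tgt)
print(src_mask.shape, tgt_mask.shape)
# torch.Size([1, 1, 1, 3]) torch.Size([1, 1, 3, 3])
```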

## 8. Training

Here you would define your training loop. This includes setting up the optimizer, loss function, and iterating over your dataset.

In [None]:
# TODO: 
# 1. Create a dummy vocabulary and sample data.
# 2. Instantiate the Transformer model.
# 3. Define the loss function (e.g., CrossEntropyLoss) and optimizer (e.g., Adam).
# 4. Write a training loop that:
#    - Feeds a batch of data to the model.
#    - Calculates the loss.
#    - Performs backpropagation and updates the weights.
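
The steps above can be sketched as follows. To keep the block runnable in isolation, a tiny placeholder model, `TinyStandIn`, is used in place of the `Transformer` class from Section 7: it has the same call signature but ignores `src` and both masks, so it only demonstrates the mechanics of the loop. The vocabulary size, padding index, data, and hyperparameters are all made up for illustration. Note the one-position shift between decoder input and labels (teacher forcing) and that the loss is computed on raw logits.

```python
import torch
import torch.nn as nn

PAD_IDX, VOCAB_SIZE = 0, 20  # hypothetical values


class TinyStandIn(nn.Module):
    """Placeholder with the call signature of Transformer.forward (assumption).

    It ignores src and the masks and predicts each next token from the current one.
    """

    def __init__(self, vocab_size, d_model=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, src, tgt, src_mask=None, tgt_mask=None):
        return self.out(self.embed(tgt))  # logits: (batch, tgt_len, vocab_size)


torch.manual_seed(0)

# 1. Dummy data: each sequence counts upward through tokens 1..19, wrapping around.
start = torch.randint(1, VOCAB_SIZE, (64, 1))
data = (start - 1 + torch.arange(10)) % (VOCAB_SIZE - 1) + 1  # (64, 10)
src, tgt = data, data  # toy copy task

# 2. Model, 3. loss and optimizer.
model = TinyStandIn(VOCAB_SIZE)
criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX)  # skip padded labels
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

# 4. Training loop.
losses = []
model.train()
for step in range(50):
    # Teacher forcing: the decoder sees tokens 0..T-2 and predicts tokens 1..T-1.
    tgt_in, tgt_out = tgt[:, :-1], tgt[:, 1:]
    logits = model(src, tgt_in)
    loss = criterion(logits.reshape(-1, VOCAB_SIZE), tgt_out.reshape(-1))

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(f"first loss {losses[0]:.3f}, last loss {losses[-1]:.3f}")
```

With a real model you would also pass `src_mask` and a `tgt_mask` built for `tgt_in`, and iterate over mini-batches instead of reusing one fixed batch.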

## 9. Inference

For inference, you would typically use a greedy or beam search approach to generate the output sequence one token at a time.

In [None]:
# TODO:
# 1. Write a function for greedy decoding.
# 2. Start with a start-of-sequence token.
# 3. In a loop, feed the current output sequence to the model and get the next token prediction.
# 4. Append the predicted token to the output sequence.
# 5. Stop when an end-of-sequence token is generated or a max length is reached.
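
A minimal greedy decoding sketch for a single source sequence is shown below. It assumes the model is called as `model(src, tgt, src_mask, tgt_mask)` and returns logits of shape `(batch, tgt_len, vocab_size)`; the function name `greedy_decode`, its parameters, and the mask convention (`0` = blocked) are assumptions. `CountingStub` is a made-up stand-in model that always predicts "previous token + 1", included only so the example can run without a trained Transformer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


@torch.no_grad()
def greedy_decode(model, src, src_mask, sos_idx, eos_idx, max_len):
    """Generate one sequence (batch size 1) by taking the argmax at each step."""
    model.eval()
    ys = torch.full((1, 1), sos_idx, dtype=torch.long, device=src.device)
    for _ in range(max_len - 1):
        t = ys.size(1)
        # Causal mask for the tokens generated so far.
        tgt_mask = torch.tril(torch.ones(t, t, device=src.device)).view(1, 1, t, t)
        logits = model(src, ys, src_mask, tgt_mask)  # (1, t, vocab_size)
        next_token = logits[:, -1].argmax(dim=-1, keepdim=True)  # (1, 1)
        ys = torch.cat([ys, next_token], dim=1)
        if next_token.item() == eos_idx:
            break
    return ys


class CountingStub(nn.Module):
    """Hypothetical stand-in: the most likely next token is (previous + 1) % vocab_size."""

    def __init__(self, vocab_size):
        super().__init__()
        self.vocab_size = vocab_size

    def forward(self, src, tgt, src_mask, tgt_mask):
        next_ids = (tgt + 1) % self.vocab_size
        return F.one_hot(next_ids, self.vocab_size).float()


stub = CountingStub(vocab_size=6)
src = torch.tensor([[1, 2, 3]])
print(greedy_decode(stub, src, None, sos_idx=1, eos_idx=4, max_len=10))
# tensor([[1, 2, 3, 4]])
```

This version re-runs the whole model, encoder included, at every step, which is simple but wasteful; computing the encoder output once and reusing it is a common optimization.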