# Lab 3 Transformer

This lab covers the implementation of several key components of the Transformer architecture. It does not cover the full architecture described in the original "Attention Is All You Need" paper by Vaswani et al. Instead, it focuses on four components: the self-attention mechanism, the feed-forward network, positional encoding, and a single transformer block.

## Task 1-Self Attention

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import math

class SelfAttention(nn.Module):
    def __init__(self, embed_size, heads):
        """
        Initializes the SelfAttention module.

        Args:
        - embed_size (int): The dimensionality of input embeddings.
        - heads (int): The number of attention heads.

        This module computes the self-attention mechanism. It takes input embeddings and splits them into multiple heads.
        Then, it computes attention scores between query, key, and value vectors and aggregates the values based on these scores.

        """

    def forward(self, values, keys, query, mask):
        """
        Forward pass of the self-attention mechanism.

        Args:
        - values (torch.Tensor): The values tensor.
        - keys (torch.Tensor): The keys tensor.
        - query (torch.Tensor): The query tensor.
        - mask (torch.Tensor): The mask tensor.

        Returns:
        - out (torch.Tensor): The output tensor.

        This function performs the forward pass of the self-attention mechanism. It computes attention scores between
        query and key vectors, applies the mask if provided, computes attention weights using softmax, and finally
        aggregates the values based on these weights.

        """
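
The docstrings above describe multi-head scaled dot-product attention. One possible way to fill in the skeleton is sketched below. It is not the reference solution: the class name `SelfAttentionSketch`, the separate linear projections, the `-1e20` fill value, and the mask convention (0 means "do not attend") are all assumptions chosen for illustration.

```python
import math
import torch
import torch.nn as nn


class SelfAttentionSketch(nn.Module):
    """Illustrative multi-head attention; names and layout are assumptions."""

    def __init__(self, embed_size, heads):
        super().__init__()
        assert embed_size % heads == 0, "embed_size must be divisible by heads"
        self.heads = heads
        self.head_dim = embed_size // heads
        self.values_proj = nn.Linear(embed_size, embed_size)
        self.keys_proj = nn.Linear(embed_size, embed_size)
        self.query_proj = nn.Linear(embed_size, embed_size)
        self.fc_out = nn.Linear(embed_size, embed_size)

    def _split_heads(self, x):
        # (N, length, embed_size) -> (N, heads, length, head_dim)
        N, length, _ = x.shape
        return x.reshape(N, length, self.heads, self.head_dim).transpose(1, 2)

    def forward(self, values, keys, query, mask=None):
        N, query_len, _ = query.shape
        v = self._split_heads(self.values_proj(values))
        k = self._split_heads(self.keys_proj(keys))
        q = self._split_heads(self.query_proj(query))

        # Attention scores, scaled by sqrt(head_dim): (N, heads, query_len, key_len)
        scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.head_dim)
        if mask is not None:
            # Assumed convention: positions where mask == 0 are hidden.
            scores = scores.masked_fill(mask == 0, -1e20)
        weights = torch.softmax(scores, dim=-1)

        # Weighted sum of values, then merge the heads back together.
        out = torch.matmul(weights, v)  # (N, heads, query_len, head_dim)
        out = out.transpose(1, 2).reshape(N, query_len, self.heads * self.head_dim)
        return self.fc_out(out)


torch.manual_seed(0)
attn_demo = SelfAttentionSketch(embed_size=256, heads=8)
attn_source = torch.randn(32, 10, 256)
attn_target = torch.randn(32, 5, 256)
attn_mask = torch.ones(32, 1, 1, 10)
attn_out = attn_demo(attn_source, attn_source, attn_target, attn_mask)
print(attn_out.shape)  # one output vector per query position
```

The output has one row per query position, so its sequence length follows `query`, not `keys` or `values`.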



## Task 2-Feed Forward

In [None]:
class FeedForward(nn.Module):
    def __init__(self, embed_size, ff_hidden_size):
        """
        Initializes the FeedForward module.

        Args:
        - embed_size (int): The dimensionality of input embeddings.
        - ff_hidden_size (int): The hidden layer size of the feed-forward network.

        This module implements a simple feed-forward network with one hidden layer and ReLU activation function.

        """

    def forward(self, x):
        """
        Forward pass of the feed-forward network.

        Args:
        - x (torch.Tensor): The input tensor.

        Returns:
        - x (torch.Tensor): The output tensor.

        This function performs the forward pass of the feed-forward network. It applies the linear transformation
        followed by ReLU activation and another linear transformation.

        """
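
The docstring describes a linear layer, a ReLU, and a second linear layer. A minimal sketch under that description follows. The class name `FeedForwardSketch` and the attribute names `fc1` and `fc2` are assumptions, not part of the lab skeleton.

```python
import torch
import torch.nn as nn


class FeedForwardSketch(nn.Module):
    """Illustrative position-wise feed-forward network."""

    def __init__(self, embed_size, ff_hidden_size):
        super().__init__()
        self.fc1 = nn.Linear(embed_size, ff_hidden_size)  # expand
        self.fc2 = nn.Linear(ff_hidden_size, embed_size)  # project back

    def forward(self, x):
        # Applied independently at every sequence position.
        return self.fc2(torch.relu(self.fc1(x)))


ff_demo = FeedForwardSketch(embed_size=256, ff_hidden_size=512)
ff_out = ff_demo(torch.randn(32, 10, 256))
print(ff_out.shape)  # same shape as the input
```

Because `nn.Linear` acts on the last dimension only, the input and output shapes match, which is what allows a residual connection around this module.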





## Task 3-Positional Encoding

In [None]:
class PositionalEncoding(nn.Module):
    def __init__(self, embed_size, max_len=512):
        """
        Initializes the PositionalEncoding module.

        Args:
        - embed_size (int): The dimensionality of input embeddings.
        - max_len (int): The maximum length of input sequences.

        This module generates positional encodings for input sequences based on their positions.

        """

    def forward(self, x):
        """
        Forward pass of the positional encoding module.

        Args:
        - x (torch.Tensor): The input tensor.

        Returns:
        - x (torch.Tensor): The output tensor.

        This function performs the forward pass of the positional encoding module. It adds positional encodings
        to the input embeddings.

        """
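
The skeleton does not say which encoding scheme to use. The sketch below assumes the fixed sinusoidal encoding from Vaswani et al., an even `embed_size`, and batch-first input of shape `(N, length, embed_size)`. The class name `PositionalEncodingSketch` and the buffer name `pe` are hypothetical.

```python
import math
import torch
import torch.nn as nn


class PositionalEncodingSketch(nn.Module):
    """Illustrative sinusoidal positional encoding (assumes even embed_size)."""

    def __init__(self, embed_size, max_len=512):
        super().__init__()
        position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)
        # Frequencies 1 / 10000^(2i / embed_size), computed in log space.
        div_term = torch.exp(
            torch.arange(0, embed_size, 2, dtype=torch.float32)
            * (-math.log(10000.0) / embed_size)
        )
        pe = torch.zeros(max_len, embed_size)
        pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
        # A buffer moves with the module across devices but is not trained.
        self.register_buffer("pe", pe.unsqueeze(0))  # (1, max_len, embed_size)

    def forward(self, x):
        # Add the encodings for the first x.size(1) positions.
        return x + self.pe[:, : x.size(1)]


pe_demo = PositionalEncodingSketch(embed_size=16, max_len=100)
pe_out = pe_demo(torch.zeros(2, 10, 16))
print(pe_out.shape)
```

Feeding zeros makes the output equal to the encoding itself, which is a convenient way to inspect or plot it.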



## Task 4-Transformer Block

In [None]:

class TransformerBlock(nn.Module):
    def __init__(self, embed_size, heads, ff_hidden_size, dropout):
        """
        Initializes the TransformerBlock module.

        Args:
        - embed_size (int): The dimensionality of input embeddings.
        - heads (int): The number of attention heads.
        - ff_hidden_size (int): The hidden layer size of the feed-forward network.
        - dropout (float): The dropout probability.

        This module implements a single transformer block consisting of multi-head self-attention mechanism,
        feed-forward network, and layer normalization.

        """
        # Define the architectural components in this order: self_attention, feed_forward, layer_norm, and dropout layers.

    def forward(self, value, key, query, mask):
        """
        Forward pass of the transformer block.

        Args:
        - value (torch.Tensor): The value tensor.
        - key (torch.Tensor): The key tensor.
        - query (torch.Tensor): The query tensor.
        - mask (torch.Tensor): The mask tensor.

        Returns:
        - out (torch.Tensor): The output tensor.

        This function performs the forward pass of the transformer block. It applies the self-attention mechanism,
        feed-forward network, and layer normalization.

        """
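
One way the pieces could fit together is sketched below, assuming the post-norm layout of the original paper (residual connection, then layer normalization) with two separate `LayerNorm` layers. `TinyAttention` is a compact, hypothetical stand-in for the Task 1 module, and an `nn.Sequential` stands in for Task 2, so that the example runs on its own. None of these names come from the lab skeleton.

```python
import math
import torch
import torch.nn as nn


class TinyAttention(nn.Module):
    """Compact stand-in for the Task 1 module (hypothetical)."""

    def __init__(self, embed_size, heads):
        super().__init__()
        assert embed_size % heads == 0
        self.heads, self.head_dim = heads, embed_size // heads
        self.v = nn.Linear(embed_size, embed_size)
        self.k = nn.Linear(embed_size, embed_size)
        self.q = nn.Linear(embed_size, embed_size)
        self.out = nn.Linear(embed_size, embed_size)

    def forward(self, values, keys, query, mask=None):
        N, q_len, embed = query.shape

        def split(x):  # (N, L, embed) -> (N, heads, L, head_dim)
            return x.reshape(N, x.shape[1], self.heads, self.head_dim).transpose(1, 2)

        q, k, v = split(self.q(query)), split(self.k(keys)), split(self.v(values))
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.head_dim)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e20)
        ctx = torch.softmax(scores, dim=-1) @ v
        return self.out(ctx.transpose(1, 2).reshape(N, q_len, embed))


class TransformerBlockSketch(nn.Module):
    def __init__(self, embed_size, heads, ff_hidden_size, dropout):
        super().__init__()
        self.self_attention = TinyAttention(embed_size, heads)
        self.feed_forward = nn.Sequential(
            nn.Linear(embed_size, ff_hidden_size),
            nn.ReLU(),
            nn.Linear(ff_hidden_size, embed_size),
        )
        self.norm1 = nn.LayerNorm(embed_size)
        self.norm2 = nn.LayerNorm(embed_size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, value, key, query, mask):
        attention = self.self_attention(value, key, query, mask)
        # Residual connection around attention, then layer norm.
        x = self.norm1(query + self.dropout(attention))
        # Residual connection around the feed-forward network, then layer norm.
        return self.norm2(x + self.dropout(self.feed_forward(x)))


block_demo = TransformerBlockSketch(256, 8, 512, 0.1).eval()  # eval() disables dropout
block_src = torch.randn(32, 10, 256)
block_tgt = torch.randn(32, 5, 256)
block_out = block_demo(block_src, block_src, block_tgt, torch.ones(32, 1, 1, 10))
print(block_out.shape)
```

The residual connections add to `query`, so the block's output keeps the query's sequence length even when `value` and `key` come from a longer sequence.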


In [None]:
class Transformer(nn.Module):
    def __init__(
        self,
        embed_size,
        num_layers,
        heads,
        ff_hidden_size,
        input_vocab_size,
        target_vocab_size,
        max_len,
        dropout,
    ):
        super(Transformer, self).__init__()
        self.encoder_embedding = nn.Embedding(input_vocab_size, embed_size)
        self.decoder_embedding = nn.Embedding(target_vocab_size, embed_size)
        self.positional_encoding = PositionalEncoding(embed_size, max_len)
        self.layers = nn.ModuleList(
            [
                TransformerBlock(embed_size, heads, ff_hidden_size, dropout)
                for _ in range(num_layers)
            ]
        )
        self.fc_out = nn.Linear(embed_size, target_vocab_size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, source, target, source_mask, target_mask):
        N, source_length = source.shape
        N, target_length = target.shape

        source_embedding = self.dropout(self.positional_encoding(self.encoder_embedding(source)))
        target_embedding = self.dropout(self.positional_encoding(self.decoder_embedding(target)))

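        # Each block uses the target as the query and the source as keys and values.
        # Note: target_mask is accepted by forward() but is not used in this simplified model.
        for layer in self.layers:
            target_embedding = layer(source_embedding, source_embedding, target_embedding, source_mask)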

        output = self.fc_out(target_embedding)
        return output


In [None]:
# Define hyperparameters
embed_size = 256
heads = 8
ff_hidden_size = 512
num_layers = 6
dropout = 0.1
input_vocab_size = 1000  # Example vocab size
target_vocab_size = 1000  # Example vocab size
max_len = 100

# Create random input data for testing
source = torch.randint(0, input_vocab_size, (32, 10))  # Batch size 32, sequence length 10
target = torch.randint(0, target_vocab_size, (32, 5))  # Batch size 32, sequence length 5

# Create masks (all ones, so no position is masked out)
source_mask = torch.ones((32, 1, 1, 10))  # Source padding mask: one entry per source position
target_mask = torch.ones((32, 1, 5, 5))  # Target mask: one row per target position

# Initialize transformer model
model = Transformer(
    embed_size,
    num_layers,
    heads,
    ff_hidden_size,
    input_vocab_size,
    target_vocab_size,
    max_len,
    dropout,
)

# Forward pass
output = model(source, target, source_mask, target_mask)

# Print output shape
print("Output shape:", output.shape)  # It should be (batch_size, target_sequence_length, target_vocab_size)


## Task 5-Flow Chart 
Understand the flow of the code, and draw a block diagram/flowchart to explain it. You can hand-draw the diagram and include the picture in the notebook for your lab report.

## Task 6-Changing input data
Also, modify the test data slightly, study how the transformer architecture responds to the changed data, and explain in your own words the changes you observe.
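
As one possible starting point for this task, the sketch below builds test data with padded positions and derives masks from it instead of using all-ones masks. The padding index `pad_idx = 0`, the batch size, and the choice to pad the first sequence are assumptions for illustration. The lower-triangular mask is shown only as an example of a non-trivial mask with the same shape as `target_mask`.

```python
import torch

torch.manual_seed(0)
pad_idx = 0  # hypothetical padding token id
batch_size, source_len, target_len = 4, 10, 5

# Token ids start at 1 so that 0 appears only where padding is inserted.
new_source = torch.randint(1, 1000, (batch_size, source_len))
new_source[0, 7:] = pad_idx  # pretend the first sequence has only 7 real tokens

# Padding mask: 1 for real tokens, 0 for padding, shaped (N, 1, 1, source_len).
new_source_mask = (new_source != pad_idx).unsqueeze(1).unsqueeze(2)

# Lower-triangular mask: position i may look at positions 0..i only.
causal_mask = torch.tril(torch.ones(target_len, target_len)).expand(
    batch_size, 1, target_len, target_len
)

print(new_source_mask.shape, causal_mask.shape)
```

Other changes worth trying include different batch sizes, sequence lengths up to `max_len`, and different vocabulary sizes, checking each time how the output shape responds.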

## Deliverable

Please strictly adhere to the submission guidelines, i.e., submit the .ipynb file with the complete code and the flowchart diagram embedded in the notebook.