<a href="https://colab.research.google.com/github/SharonOBoyle/AI-Launchpad/blob/main/Day2_Assignment1_TransformerModelCode.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Assignment 1**

Implement the transformer model code in https://github.com/toni-ramchandani/AIMasterClassTTT/blob/main/Section6-1_Transformers.ipynb


# **1. Implement the transformer model code**

The code below follows https://github.com/toni-ramchandani/AIMasterClassTTT/blob/main/Section6-1_Transformers.ipynb


**Note:** The explanations of the code were added from ChatGPT.

In [None]:
# Install required libraries
!pip install torch
!pip install torchvision



In [1]:
# Import libraries
import torch
import torch.nn as nn
import torch.nn.functional as F
import math

**What is Scaled Dot-Product Attention?**

* **Purpose:** In Transformer models (like GPT, BERT), attention mechanisms let the model decide which parts of the input sequence to focus on.
* **Key Idea:** Scaled Dot-Product Attention computes a weighted combination of values (value) based on how similar queries (query) are to keys (key).

In [2]:
# Define the ScaledDotProductAttention class, which is a type of attention mechanism used in Transformer models
class ScaledDotProductAttention(nn.Module):
    def __init__(self, d_k):
        super(ScaledDotProductAttention, self).__init__()
        self.d_k = d_k  # d_k is the dimensionality of the keys and queries, used for scaling the dot product

    # The forward method defines how the input data moves through this layer
    def forward(self, query, key, value, mask=None):
        # Compute the dot product between the query and the transpose of the key
        # The transpose operation swaps the last two dimensions of the key
        # This dot product gives us a score matrix that represents the similarity between queries and keys
        scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(self.d_k)

        # If a mask is provided, apply it to the scores
        # This is usually done to ignore certain positions in the input (e.g., padding tokens)
        # The masked positions are filled with a large negative value (-1e9) so that their softmax result is close to zero
        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)

        # Apply the softmax function to the scores to obtain attention weights
        # Softmax is applied along the last dimension to ensure the weights sum up to 1
        attention = F.softmax(scores, dim=-1)

        # Multiply the attention weights with the value vectors
        # This step generates the output by weighting the value vectors according to the attention weights
        output = torch.matmul(attention, value)

        # Return the output and the attention weights
        return output, attention


**Summary**

* **Purpose:** This class implements the Scaled Dot-Product Attention mechanism.
* **Process:**
  1. Compute similarity scores between query and key.
  2. Optionally mask out certain positions.
  3. Convert scores to attention weights using softmax.
  4. Use the weights to compute a weighted sum of value vectors.
* **Output:** The resulting output and the attention weights.
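
The four steps above can be traced on a tiny example. This is a minimal sketch using plain tensor operations rather than the class itself; the toy sizes and the padding-style mask are assumptions chosen only for illustration.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical toy sizes: 1 sequence, 3 tokens, d_k = 4
d_k = 4
query = torch.randn(1, 3, d_k)
key = torch.randn(1, 3, d_k)
value = torch.randn(1, 3, d_k)

# Step 1: similarity scores, scaled by sqrt(d_k) -> shape [1, 3, 3]
scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)

# Step 2: hide the last key position (0 = masked), as for a padding token
mask = torch.tensor([[[1, 1, 0]]])  # shape [1, 1, 3], broadcasts over the query dimension
scores = scores.masked_fill(mask == 0, -1e9)

# Step 3: softmax over the key dimension gives one row of weights per query
weights = F.softmax(scores, dim=-1)

# Step 4: weighted sum of the value vectors -> shape [1, 3, d_k]
output = torch.matmul(weights, value)

print(weights.sum(dim=-1))  # every row sums to 1
print(weights[..., -1])     # the masked column receives zero weight
print(output.shape)
```

Because the masked score is so negative, its exponential underflows to zero and the remaining weights in each row are renormalized among the visible positions.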




**Analogy**

Imagine you’re reading a book:

1. **Query:** What you’re currently looking for (e.g., a character’s name).
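2. **Key:** The labels your query is compared against (e.g., chapter titles or page headings).
3. **Value:** The content stored under each label, which is what you actually read once a label matches.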
4. **Mask:** Ignoring irrelevant sections (e.g., the index or blank pages).
5. **Attention Weights:** Deciding which sentences or paragraphs are most relevant based on your query.

The class essentially performs this process computationally!

**What is Multi-Head Attention?**

* **Purpose:** Multi-head attention is a key part of Transformer models. It allows the model to focus on different parts of the input sequence simultaneously.
* **How it works:** The input (queries, keys, and values) is split into multiple "heads," and each head performs attention independently. The results are then combined.

In [3]:
# Define the Multi-head Attention Layer

import torch
import torch.nn as nn
import torch.nn.functional as F
import math

# Define the MultiHeadAttention class, which is a core component of the Transformer model
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        self.num_heads = num_heads  # The number of attention heads
        self.d_k = d_model // num_heads  # The dimension of each head (d_model divided by num_heads)

        # Linear layers to project the input query, key, and value vectors into the required dimensions
        self.query_linear = nn.Linear(d_model, d_model)
        self.key_linear = nn.Linear(d_model, d_model)
        self.value_linear = nn.Linear(d_model, d_model)

        # Linear layer to project the concatenated output of all heads back into the original d_model dimension
        self.out_linear = nn.Linear(d_model, d_model)

    # The forward method defines how the input data moves through this layer
    def forward(self, query, key, value, mask=None):
        batch_size = query.size(0)  # Get the batch size from the query input

        # Project the query, key, and value inputs into multiple heads
        # Each projection is reshaped to [batch_size, num_heads, sequence_length, d_k]
        query = self.query_linear(query).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        key = self.key_linear(key).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        value = self.value_linear(value).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)

        # Apply scaled dot-product attention for each head
        # The attention function returns the attention-weighted values
        attention, _ = ScaledDotProductAttention(self.d_k)(query, key, value, mask)

        # Transpose and reshape the attention output back to [batch_size, sequence_length, d_model]
        attention = attention.transpose(1, 2).contiguous().view(batch_size, -1, self.num_heads * self.d_k)

        # Apply the final linear transformation to combine the heads' outputs
        output = self.out_linear(attention)
        return output  # Return the final output of the multi-head attention mechanism




# **Summary of the Process**

1. **Input Transformation:**

  * The input is projected into query, key, and value vectors using separate linear layers.
  * These vectors are split into multiple smaller heads for parallel attention computation.
2. **Attention Computation:**

  * Each head computes attention using ScaledDotProductAttention.
  * Attention focuses on different parts of the input based on the query.
3. **Combine Results:**

  * The results of all heads are concatenated and passed through a linear layer to get the final output.
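
The split-and-combine steps above come down to a reshape and a transpose. The following sketch walks through the shapes with the same `view` and `transpose` calls used in the class; the toy sizes are assumptions for illustration, and the linear projections are left out.

```python
import torch

batch_size, seq_len, d_model, num_heads = 2, 5, 16, 4  # hypothetical toy sizes
d_k = d_model // num_heads  # 4 features per head

x = torch.randn(batch_size, seq_len, d_model)

# Split the feature dimension into heads:
# [2, 5, 16] -> [2, 5, 4, 4] -> [2, 4, 5, 4]
heads = x.view(batch_size, -1, num_heads, d_k).transpose(1, 2)
print(heads.shape)

# Merge the heads again:
# [2, 4, 5, 4] -> [2, 5, 4, 4] -> [2, 5, 16]
merged = heads.transpose(1, 2).contiguous().view(batch_size, -1, num_heads * d_k)
print(merged.shape)

# Splitting and merging only rearranges values, so the round trip is lossless
print(torch.equal(x, merged))
```

The `contiguous()` call is needed because `transpose` returns a non-contiguous view, and `view` requires a compatible memory layout.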

# **Analogy**

Think of reading a book with multiple reviewers:

1. Each reviewer (head) focuses on a different aspect of the story (e.g., characters, plot, language).
2. They analyze (compute attention) and provide their perspectives.
3. Their opinions are then combined into a summary (final output).

# **Key Points**
* Multi-head attention splits the input into smaller "attention heads."
* Each head processes attention independently and focuses on different parts of the sequence.
* The results from all heads are concatenated and transformed back into the original dimensionality.

This structure allows the model to learn more diverse and meaningful relationships in the data.

# **Explanation of the PositionalEncoding Class**

The PositionalEncoding class introduces **positional information** into the input embeddings for Transformer models. Self-attention on its own is **order-agnostic** (it treats the input as a set of tokens, so shuffling the tokens would only shuffle the outputs), so positional encodings are needed for the model to understand the relative or absolute positions of words or tokens in a sequence.

In [4]:
# Implement Positional Encoding

# Define the PositionalEncoding class, which is used to inject information about the relative or absolute position of tokens in a sequence.
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super(PositionalEncoding, self).__init__()

        # Initialize an empty encoding matrix with shape [max_len, d_model]
        self.encoding = torch.zeros(max_len, d_model)

        # Create a tensor with shape [max_len, 1], where each element represents the position in the sequence
        position = torch.arange(0, max_len).unsqueeze(1)

        # Calculate the division term, which is based on the position and model dimension
        # The division term varies across dimensions and is used to scale the position
        div_term = torch.exp(torch.arange(0, d_model, 2) * -(math.log(10000.0) / d_model))

        # Apply sine to even indices (starting at 0) in the encoding matrix
        self.encoding[:, 0::2] = torch.sin(position * div_term)

        # Apply cosine to odd indices (starting at 1) in the encoding matrix
        self.encoding[:, 1::2] = torch.cos(position * div_term)

        # Add an extra dimension at the start of the encoding tensor to match batch dimensions
        # The resulting shape is [1, max_len, d_model]
        self.encoding = self.encoding.unsqueeze(0)

    # Forward method that adds the positional encoding to the input tensor x
    def forward(self, x):
        # The encoding tensor is added to the input x, which is expected to have a shape of [batch_size, sequence_length, d_model]
        # The encoding is sliced to match the sequence length of x and moved to the same device as x
        return x + self.encoding[:, :x.size(1), :].to(x.device)


# **Key Insights**
* **Why Sine and Cosine?**

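  * Each pair of dimensions holds the sine and cosine of the same angle, so the encoding at position pos + k is a fixed linear function (a rotation) of the encoding at position pos. This makes it easier for the model to attend by relative offset.
* **Why Exponential Scaling?**

  * The frequencies decrease geometrically across dimensions (wavelengths run from 2π up to roughly 10000 · 2π), so the first dimensions distinguish neighbouring positions while the last dimensions distinguish positions that are far apart.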
* **What Does It Output?**

  * If the input x has shape [batch_size, seq_len, d_model], the output will also have the same shape but with positional information added.
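
The points above can be checked numerically. The sketch below rebuilds the encoding table with a standalone helper (`sinusoidal_encoding` is a hypothetical name, not part of the notebook) and verifies that shifting a position by a fixed offset corresponds to rotating each sine/cosine pair, which follows from the angle-addition identities.

```python
import math
import torch

def sinusoidal_encoding(max_len, d_model):
    """Return a [max_len, d_model] encoding table and the per-pair frequencies."""
    position = torch.arange(0, max_len, dtype=torch.float32).unsqueeze(1)
    div_term = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float32) * -(math.log(10000.0) / d_model)
    )
    table = torch.zeros(max_len, d_model)
    table[:, 0::2] = torch.sin(position * div_term)  # even columns
    table[:, 1::2] = torch.cos(position * div_term)  # odd columns
    return table, div_term

pe, div_term = sinusoidal_encoding(max_len=50, d_model=8)

# Position 0: sin(0) = 0 in even columns, cos(0) = 1 in odd columns
print(pe[0])

# Relative-position property: the pair at pos + k is the pair at pos
# rotated by the angle k * div_term
pos, k = 10, 7
angle = k * div_term
sin_shifted = pe[pos, 0::2] * torch.cos(angle) + pe[pos, 1::2] * torch.sin(angle)
cos_shifted = pe[pos, 1::2] * torch.cos(angle) - pe[pos, 0::2] * torch.sin(angle)
print(torch.allclose(sin_shifted, pe[pos + k, 0::2], atol=1e-4))
print(torch.allclose(cos_shifted, pe[pos + k, 1::2], atol=1e-4))
```

The rotation depends only on the offset `k`, not on `pos`, which is why a fixed relative distance looks the same everywhere in the sequence.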


# **Summary**
* The **PositionalEncoding** class generates positional encodings using sine and cosine functions.
* It adds these encodings to the input embeddings to provide information about the order of tokens.
* The output has the same shape as the input but includes positional information.

This mechanism is crucial for allowing Transformer models to understand the sequence structure of data.

# **What is the Encoder Layer?**
* The EncoderLayer is a single layer in the Transformer encoder architecture.
* It consists of two main components:
  1. **Self-Attention Mechanism:** Helps the model focus on relevant parts of the input sequence.
  2. **Feed-Forward Network (FFN):** Processes and transforms the attention outputs.

In [5]:
# Define the EncoderLayer class, which is a single layer of the Transformer encoder
class EncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super(EncoderLayer, self).__init__()

        # Initialize the MultiHeadAttention mechanism
        self.self_attn = MultiHeadAttention(d_model, num_heads)

        # Define the feedforward network as a sequence of layers
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),  # Linear transformation from d_model to d_ff dimensions
            nn.ReLU(),  # Apply ReLU activation function
            nn.Linear(d_ff, d_model)  # Linear transformation back to d_model dimensions
        )

        # Layer normalization applied after the self-attention sub-layer
        self.layer_norm1 = nn.LayerNorm(d_model)

        # Layer normalization applied after the feedforward sub-layer
        self.layer_norm2 = nn.LayerNorm(d_model)

        # Dropout layer to prevent overfitting
        self.dropout = nn.Dropout(dropout)

    # The forward method defines the flow of data through this layer
    def forward(self, x, mask=None):
        # Apply self-attention mechanism
        # The input x is passed as query, key, and value, which is typical for self-attention
        attn_output = self.self_attn(x, x, x, mask)

        # Add the attention output to the input (residual connection) and apply layer normalization
        x = self.layer_norm1(x + self.dropout(attn_output))

        # Pass the normalized output through the feedforward network
        ff_output = self.ff(x)

        # Add the feedforward output to the input (residual connection) and apply layer normalization
        x = self.layer_norm2(x + self.dropout(ff_output))

        # Return the final output of this encoder layer
        return x


# **Summary of the Encoder Layer**
**Self-Attention:**

* Calculates relationships between all tokens in the sequence.
* Outputs attention-weighted information for each token.

**Add & Normalize (Sublayer 1):**

* Combines the input and self-attention output.
* Normalizes to stabilize training.

**Feed-Forward Network:**

* Applies the same two-layer network (Linear, ReLU, Linear) to each token's vector independently, expanding it to d_ff dimensions and projecting it back to d_model.

**Add & Normalize (Sublayer 2):**

* Combines the FFN output with the input from the first sublayer.
* Normalizes again.
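
The "Add & Normalize" step can be looked at in isolation. This is a minimal sketch with random stand-in tensors (the toy sizes and the `sublayer_output` name are assumptions); dropout is omitted so the result is deterministic.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

d_model = 8  # hypothetical toy size
x = torch.randn(2, 3, d_model)                # input to the sub-layer
sublayer_output = torch.randn(2, 3, d_model)  # stand-in for attention or FFN output

layer_norm = nn.LayerNorm(d_model)

# Residual connection followed by layer normalization (post-norm ordering)
y = layer_norm(x + sublayer_output)

# Each token vector is normalized over its d_model features
print(y.shape)
print(y.mean(dim=-1))                 # close to 0 for every token
print(y.var(dim=-1, unbiased=False))  # close to 1 for every token
```

A freshly created `nn.LayerNorm` has scale 1 and shift 0, so the statistics above hold exactly only before training changes those parameters.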

# **Illustrative Analogy**
Imagine processing a sentence:

1. **Self-Attention:** For each word, find its relationships with all other words (e.g., "The cat sat on the mat" → the word "cat" relates to "sat" and "mat").
2. **Feed-Forward Network:** Process the relationships to extract deeper meaning (e.g., identify "cat sitting on mat").
3. **Add & Normalize:** Combine the original information with the new insights and stabilize it.

# **How It Fits in a Transformer**
* The EncoderLayer is one block in the Transformer encoder stack.
* Multiple EncoderLayers are stacked to build a deep Transformer model capable of understanding complex input relationships.

# **What is the Decoder Layer?**
The DecoderLayer is a single block in the **Transformer decoder**. Its primary function is to:

1. Attend to the target sequence (self-attention).
2. Attend to the encoder output (cross-attention).
3. Transform the resulting representation using a feed-forward network (FFN).

It processes the input from the previous decoder layer (or embeddings) and combines it with the output from the encoder to generate meaningful predictions.
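
What makes cross-attention different from self-attention is that the queries and the keys/values come from sequences of different lengths. The sketch below shows the resulting shapes; it is simplified to a single head without learned projections, and the toy sizes and tensor names are assumptions.

```python
import math
import torch
import torch.nn.functional as F

batch_size, src_len, tgt_len, d_model = 2, 7, 5, 16  # hypothetical toy sizes

decoder_state = torch.randn(batch_size, tgt_len, d_model)   # queries (target side)
encoder_output = torch.randn(batch_size, src_len, d_model)  # keys and values (source side)

# Single head, so the scaling dimension d_k equals d_model here
scores = torch.matmul(decoder_state, encoder_output.transpose(-2, -1)) / math.sqrt(d_model)

# One row of weights per target token, one column per source token
weights = F.softmax(scores, dim=-1)

# Each target token receives a weighted mix of source representations
context = torch.matmul(weights, encoder_output)

print(weights.shape)  # [batch_size, tgt_len, src_len]
print(context.shape)  # [batch_size, tgt_len, d_model]
```

The output keeps the target length, which is why the decoder can add it back to its own state through a residual connection.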

In [6]:
# Define the DecoderLayer class, which is a single layer of the Transformer decoder
class DecoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super(DecoderLayer, self).__init__()

        # Self-attention mechanism for the target sequence (decoder input)
        self.self_attn = MultiHeadAttention(d_model, num_heads)

        # Cross-attention mechanism that attends to the encoder output
        self.cross_attn = MultiHeadAttention(d_model, num_heads)

        # Feedforward network as a sequence of layers
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),  # Linear transformation from d_model to d_ff dimensions
            nn.ReLU(),  # Apply ReLU activation function
            nn.Linear(d_ff, d_model)  # Linear transformation back to d_model dimensions
        )

        # Layer normalization applied after self-attention sub-layer
        self.layer_norm1 = nn.LayerNorm(d_model)

        # Layer normalization applied after cross-attention sub-layer
        self.layer_norm2 = nn.LayerNorm(d_model)

        # Layer normalization applied after the feedforward sub-layer
        self.layer_norm3 = nn.LayerNorm(d_model)

        # Dropout layer to prevent overfitting
        self.dropout = nn.Dropout(dropout)

    # The forward method defines the flow of data through this layer
    def forward(self, x, enc_output, src_mask=None, tgt_mask=None):
        # Apply self-attention mechanism on the decoder input (target sequence)
        # The input x is passed as query, key, and value, similar to the encoder's self-attention
        attn_output = self.self_attn(x, x, x, tgt_mask)

        # Add the attention output to the input (residual connection) and apply layer normalization
        x = self.layer_norm1(x + self.dropout(attn_output))

        # Apply cross-attention mechanism, attending to the encoder output (source sequence)
        # The decoder input x is the query, and the encoder output (enc_output) is the key and value
        attn_output = self.cross_attn(x, enc_output, enc_output, src_mask)

        # Add the cross-attention output to the input and apply layer normalization
        x = self.layer_norm2(x + self.dropout(attn_output))

        # Pass the normalized output through the feedforward network
        ff_output = self.ff(x)

        # Add the feedforward output to the input (residual connection) and apply layer normalization
        x = self.layer_norm3(x + self.dropout(ff_output))

        # Return the final output of this decoder layer
        return x



# **Summary of the Decoder Layer**
1. **Self-Attention:**

  * Focuses on the target sequence to model relationships within it.
  * Applies tgt_mask when one is supplied; passing a causal (lower-triangular) mask ensures no information from future tokens is used. The layer does not build this mask itself.
2. **Cross-Attention:**

  * Focuses on the encoder's output to incorporate source sequence information.
  * Models the relationship between the target and source sequences.
3. **Feed-Forward Network:**

  * Transforms the resulting representation for richer feature extraction.
4. **Add & Normalize:**

  * Ensures stability and allows residual connections for better gradient flow.
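
A causal mask of the kind mentioned above can be built with `torch.tril`. This is a minimal sketch under assumed toy sizes; it uses the same convention as the attention code in this notebook, where positions with mask value 0 are hidden.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)

seq_len, d_k = 4, 8  # hypothetical toy sizes

# Lower-triangular matrix: position i may attend to positions 0..i only
tgt_mask = torch.tril(torch.ones(seq_len, seq_len))
print(tgt_mask)

query = torch.randn(1, seq_len, d_k)
key = torch.randn(1, seq_len, d_k)

scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
scores = scores.masked_fill(tgt_mask == 0, -1e9)  # hide positions where the mask is 0
weights = F.softmax(scores, dim=-1)

# Weights above the diagonal (future positions) are zero
print(torch.triu(weights[0], diagonal=1).abs().max())
```

A `[seq_len, seq_len]` mask like this one also broadcasts against the `[batch_size, num_heads, seq_len, seq_len]` scores produced inside multi-head attention.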

# **Illustrative Analogy**
Imagine a translator working on a sentence:

1. **Self-Attention:** Focuses on the structure and relationships within the partially translated sentence.
2. **Cross-Attention:** Looks back at the source sentence to pick up the necessary context.
3. **Feed-Forward Network:** Refines the translation for fluency and correctness.

# **How It Fits in a Transformer**
* The DecoderLayer is a single block in the Transformer decoder.
* Multiple DecoderLayers are stacked to form a deep decoder, enabling it to generate coherent and contextually accurate outputs.

# **What is the Transformer Class?**
The Transformer class implements the full Transformer architecture, combining:

1. **Embedding Layers:** Converts input tokens into dense vectors.
2. **Positional Encoding:** Adds positional information to embeddings.
3. **Encoder:** Processes the input sequence to create meaningful representations.
4. **Decoder:** Processes the target sequence and integrates information from the encoder.
5. **Final Linear Layer:** Maps the decoder output to the output vocabulary for predictions.


In [7]:
# Build the transformer model

# Define the Transformer class, which represents the complete Transformer model architecture
class Transformer(nn.Module):
    def __init__(self, d_model, num_heads, num_encoder_layers, num_decoder_layers, d_ff, input_vocab_size, output_vocab_size, max_len=5000, dropout=0.1):
        super(Transformer, self).__init__()

        # Embedding layer for the source input sequence (encoder input)
        self.encoder_embedding = nn.Embedding(input_vocab_size, d_model)

        # Embedding layer for the target input sequence (decoder input)
        self.decoder_embedding = nn.Embedding(output_vocab_size, d_model)

        # Positional encoding to add positional information to the input embeddings
        self.positional_encoding = PositionalEncoding(d_model, max_len)

        # Stack of encoder layers, each composed of self-attention and feedforward sub-layers
        self.encoder_layers = nn.ModuleList([EncoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_encoder_layers)])

        # Stack of decoder layers, each composed of self-attention, cross-attention, and feedforward sub-layers
        self.decoder_layers = nn.ModuleList([DecoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_decoder_layers)])

        # Final linear layer that maps the decoder output to the output vocabulary size
        self.final_linear = nn.Linear(d_model, output_vocab_size)

    # The forward method defines how the data flows through the Transformer model
    def forward(self, src, tgt, src_mask=None, tgt_mask=None):
        # Embed the source input sequence (src) and scale the embeddings by the square root of the model dimension
        # embedding_dim equals d_model; reading it from the layer avoids relying on a global d_model variable
        src = self.encoder_embedding(src) * math.sqrt(self.encoder_embedding.embedding_dim)

        # Add positional encodings to the embedded source sequence
        src = self.positional_encoding(src)

        # Pass the source sequence through each encoder layer in the stack
        for layer in self.encoder_layers:
            src = layer(src, src_mask)

        # Embed the target input sequence (tgt) and scale the embeddings by the square root of the model dimension
        # embedding_dim equals d_model; reading it from the layer avoids relying on a global d_model variable
        tgt = self.decoder_embedding(tgt) * math.sqrt(self.decoder_embedding.embedding_dim)

        # Add positional encodings to the embedded target sequence
        tgt = self.positional_encoding(tgt)

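        # Pass the target sequence through each decoder layer in the stack
        # Masks are passed by keyword because DecoderLayer.forward expects src_mask before tgt_mask
        for layer in self.decoder_layers:
            tgt = layer(tgt, src, src_mask=src_mask, tgt_mask=tgt_mask)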

        # Pass the output of the final decoder layer through the final linear layer
        # This maps the output to the vocabulary space, producing unnormalized scores (logits) over the output vocabulary
        output = self.final_linear(tgt)

        # Return the logits; apply softmax to get probabilities, or pass them directly to a cross-entropy loss during training
        return output


# **Complete Data Flow in the Transformer**
1. **Input:**

  * Source (src) and target (tgt) sequences are provided as input.
2. **Encoder:**

  * Processes the source sequence to create a rich representation.
3. **Decoder:**

  * Processes the target sequence while attending to the encoder’s output.
  * Outputs contextualized representations for each target token.
4. **Output Layer:**

  * Maps decoder representations to the output vocabulary for predictions.
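
The output-layer step can be illustrated on its own. The sketch below uses a random stand-in for the decoder output and random labels (both assumptions, as are the toy sizes) to show how the linear layer's logits become probabilities, predictions, and a training loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

batch_size, tgt_len, d_model, vocab_size = 2, 5, 16, 11  # hypothetical toy sizes

decoder_output = torch.randn(batch_size, tgt_len, d_model)  # stand-in for the decoder stack output
final_linear = nn.Linear(d_model, vocab_size)

logits = final_linear(decoder_output)  # [2, 5, 11], unnormalized scores
probs = F.softmax(logits, dim=-1)      # probabilities over the vocabulary
predicted = probs.argmax(dim=-1)       # [2, 5], most likely token id per position

# For training, cross-entropy takes the logits directly (it applies log-softmax internally)
labels = torch.randint(0, vocab_size, (batch_size, tgt_len))  # random stand-in labels
loss = F.cross_entropy(logits.view(-1, vocab_size), labels.view(-1))

print(logits.shape, predicted.shape, loss.item())
```

Flattening the batch and sequence dimensions before the loss is one common way to match the `[N, C]` input shape that `F.cross_entropy` accepts.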


# **Common Use Case**
The Transformer is commonly used for:

* **Machine Translation:** Source (src) is the input language, and target (tgt) is the translated language.
* **Text Summarization:** Source (src) is the full text, and target (tgt) is the summary.
* **Text Generation:** Source (src) could be a prompt, and target (tgt) is the generated text.


# **Define Example Usage**

In [8]:
# Define Example Usage

# Define the vocabulary sizes for the source and target languages
input_vocab_size = 10000  # Source vocabulary size
output_vocab_size = 10000  # Target vocabulary size

# Set the dimensionality of the model, which determines the size of the embeddings and the model's internal representations
d_model = 512  # Dimensionality of the embeddings and model

# Define the number of attention heads in the multi-head attention mechanism
num_heads = 8  # Number of attention heads

# Set the number of layers in the encoder and decoder stacks
num_encoder_layers = 6  # Number of encoder layers
num_decoder_layers = 6  # Number of decoder layers

# Define the dimensionality of the feedforward network within each layer
d_ff = 2048  # Dimensionality of the feedforward network

# Set the maximum length for the input and output sequences
max_len = 100  # Maximum length of the input and output sequences

# Instantiate the Transformer model with the specified parameters
model = Transformer(d_model, num_heads, num_encoder_layers, num_decoder_layers, d_ff, input_vocab_size, output_vocab_size, max_len)

# Generate a batch of random source sentences
# Each sentence has 100 tokens, and there are 32 sentences in the batch
src = torch.randint(0, input_vocab_size, (32, 100))  # Source sentences (randomly generated)

# Generate a batch of random target sentences
# Each sentence has 100 tokens, and there are 32 sentences in the batch
tgt = torch.randint(0, output_vocab_size, (32, 100))  # Target sentences (randomly generated)

# Pass the source and target sentences through the Transformer model
output = model(src, tgt)

# Print the shape of the output tensor
# Expected shape: [32, 100, 10000], corresponding to [batch size, sequence length, output vocab size]
print(output.shape)


torch.Size([32, 100, 10000])


# **Summary**
* The code initializes and runs a Transformer model.
* Randomly generated source (src) and target (tgt) sequences are passed through the model.
* The output has a shape of [batch_size, tgt_seq_len, output_vocab_size].
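
The example above calls the model without masks. As a hedged sketch of what the optional `src_mask` and `tgt_mask` arguments could look like, the code below builds a padding mask and a causal mask and checks that they broadcast against attention scores of shape `[batch, heads, query_len, key_len]`. The padding id of 0 and the toy sizes are assumptions, not something the notebook defines.

```python
import torch

pad_id = 0  # assumption: token id 0 marks padding
src = torch.tensor([[5, 8, 3, 0, 0],
                    [7, 2, 0, 0, 0]])  # [batch_size=2, src_len=5]

# 1 = real token, 0 = padding; shape [2, 1, 1, 5] broadcasts over heads and query positions
src_mask = (src != pad_id).unsqueeze(1).unsqueeze(2).int()

tgt_len = 4
# Causal mask, shape [4, 4]; broadcasts over batch and heads
tgt_mask = torch.tril(torch.ones(tgt_len, tgt_len)).int()

# Stand-ins for the score tensors computed inside multi-head attention
num_heads = 2
enc_scores = torch.zeros(2, num_heads, 5, 5)
dec_scores = torch.zeros(2, num_heads, tgt_len, tgt_len)

print(src_mask.shape)
print(enc_scores.masked_fill(src_mask == 0, -1e9).shape)
print(dec_scores.masked_fill(tgt_mask == 0, -1e9).shape)
```

Both masks follow the convention used by the attention code in this notebook, where entries equal to 0 are filled with a large negative score before the softmax.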
