<a href="https://colab.research.google.com/github/EliasSf73/test_/blob/master/Transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [4]:
## tqdm for loading bars
from tqdm.notebook import tqdm

## PyTorch
import torch
import numpy as np
import random


import torch.nn as nn
import torch.nn.functional as F
import torch.utils.data as data
import torch.optim as optim

## Torchvision
import torchvision
from torchvision.datasets import CIFAR100
from torchvision import transforms
import math

# Scaled Dot-Product Attention

The scaled_dot_product function is the heart of the attention mechanism in the MultiheadAttention module. It calculates the attention scores and applies them to the input values. Here's a step-by-step process:

**Dot Product**: It computes the dot product between the query (q) and key (k) matrices, which indicates the degree to which each element of the query sequence should attend to each element of the key sequence.

**Scaling**: The dot product is divided by the square root of the dimension of the key vectors (d_k). Without this, dot products grow in magnitude as d_k increases, which pushes the softmax into saturated regions where gradients are very small. Scaling keeps the logits in a range where training is stable.

**Masking (Optional)**: If a mask is provided, positions where the mask is 0 are filled with a very large negative value, so they receive near-zero weight after the softmax. This is typically used to ignore padding tokens, or to hide future tokens in sequence-to-sequence tasks and prevent information leakage.

**Softmax Normalization**: A softmax over the last dimension converts the scores into probabilities, so each query's attention weights over the keys are non-negative and sum to 1.

**Apply Attention to Values**: Finally, the function multiplies the attention probabilities by the value (v) matrix. This step aggregates the information from the values, weighted by the computed attention scores.

The output of the scaled_dot_product function is a weighted sum of the values, giving us the final attention output that will be used by the rest of the model.
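
To make the scaling argument concrete, here is a minimal sketch (an illustration, not part of the pipeline; the helper name `dot_product_variance` and the sample sizes are assumptions chosen for this example). It draws random queries and keys with unit variance and compares the variance of the raw dot product with the scaled one for several values of d_k.

```python
import torch

torch.manual_seed(0)
n_samples = 10_000  # assumed sample size, large enough for a stable estimate


def dot_product_variance(d_k):
    """Return (raw_variance, scaled_variance) of q.k for random unit-variance vectors."""
    q = torch.randn(n_samples, d_k)
    k = torch.randn(n_samples, d_k)
    raw = (q * k).sum(dim=-1)      # one dot product per sample
    scaled = raw / d_k ** 0.5      # the same dot product divided by sqrt(d_k)
    return raw.var().item(), scaled.var().item()


variances = {d_k: dot_product_variance(d_k) for d_k in (4, 64, 512)}
for d_k, (raw_var, scaled_var) in variances.items():
    print(f"d_k={d_k:4d}  raw variance={raw_var:8.2f}  scaled variance={scaled_var:5.2f}")
```

The raw variance grows roughly in proportion to d_k, while the scaled variance stays close to 1 regardless of the dimension.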



In [7]:
def scaled_dot_product(q, k, v, mask=None):
  d_k = q.size(-1)
  # Raw attention logits before scaling
  attention_logits = torch.matmul(q, k.transpose(-2, -1))

  # Scale by sqrt(d_k) so the logits do not grow with the key dimension,
  # which helps stabilize gradients during training
  attention_logits = attention_logits / math.sqrt(d_k)

  # Positions where mask == 0 get a very large negative logit and are effectively ignored
  if mask is not None:
    attention_logits = attention_logits.masked_fill(mask == 0, -9e15)

  # Softmax over the last dimension turns each row of logits into attention weights
  # that are non-negative and sum to 1
  attention = F.softmax(attention_logits, dim=-1)
  # Weighted sum of the values, using the attention weights
  values = torch.matmul(attention, v)

  return values, attention




Here is a small example of what the function computes:

In [8]:
# Set a seed value
seed = 42
# Set the seed for generating random numbers in PyTorch
torch.manual_seed(seed)
seq_len, d_k = 3, 2

q = torch.randn(seq_len, d_k)
k = torch.randn(seq_len, d_k)
v = torch.randn(seq_len, d_k)
values, attention = scaled_dot_product(q, k, v)
print("Q\n", q)
print("K\n", k)
print("V\n", v)
print("Values\n", values)
print("Attention\n", attention)

Q
 tensor([[ 0.3367,  0.1288],
        [ 0.2345,  0.2303],
        [-1.1229, -0.1863]])
K
 tensor([[ 2.2082, -0.6380],
        [ 0.4617,  0.2674],
        [ 0.5349,  0.8094]])
V
 tensor([[ 1.1103, -1.6898],
        [-0.9890,  0.9580],
        [ 1.3221,  0.8172]])
Values
 tensor([[ 0.5698, -0.1520],
        [ 0.5379, -0.0265],
        [ 0.2246,  0.5556]])
Attention
 tensor([[0.4028, 0.2886, 0.3086],
        [0.3538, 0.3069, 0.3393],
        [0.1303, 0.4630, 0.4067]])
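
The example above runs without a mask. As a sketch of the optional masking step, the snippet below repeats the same computation inline with a causal (lower-triangular) mask, so that each position can only attend to itself and earlier positions. The sizes and the name `causal_mask` are assumptions made for this illustration.

```python
import math

import torch
import torch.nn.functional as F

torch.manual_seed(42)
seq_len, d_k = 4, 8
q = torch.randn(seq_len, d_k)
k = torch.randn(seq_len, d_k)
v = torch.randn(seq_len, d_k)

# Lower-triangular mask: 1 = may attend, 0 = blocked (a "future" position)
causal_mask = torch.tril(torch.ones(seq_len, seq_len))

# Same steps as scaled_dot_product: logits, scaling, masking, softmax, weighted sum
logits = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k)
logits = logits.masked_fill(causal_mask == 0, -9e15)
attention = F.softmax(logits, dim=-1)
values = torch.matmul(attention, v)

print("Attention\n", attention)
```

Every entry above the diagonal of `attention` is zero, and the first output row equals the first value vector, because the first position can only attend to itself.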


**Multi-Head Attention**



The scaled dot product attention allows a network to attend over a sequence. However, a sequence element often needs to attend to several different aspects at once, and a single weighted average cannot capture them all. This is why we extend the attention mechanism to multiple heads, i.e. multiple query-key-value triplets computed on the same features. Specifically, given a query, key, and value matrix, we transform them into h sub-queries, sub-keys, and sub-values, which we pass through the scaled dot product attention independently. Afterward, we concatenate the heads and combine them with a final weight matrix.
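
The split into heads and the later concatenation are plain reshapes of the embedding dimension. The following sketch shows only the shape bookkeeping, with no learned weights; the sizes are assumptions chosen for the example.

```python
import torch

batch_size, seq_len, embed_dim, num_heads = 2, 5, 8, 4
head_dim = embed_dim // num_heads  # each head sees a slice of size 2

x = torch.randn(batch_size, seq_len, embed_dim)

# Split: (batch, seq, embed) -> (batch, seq, heads, head_dim) -> (batch, heads, seq, head_dim)
heads = x.reshape(batch_size, seq_len, num_heads, head_dim).permute(0, 2, 1, 3)
print("per-head shape:", tuple(heads.shape))

# Concatenate: undo the permutation and merge heads back into the embedding dimension
merged = heads.permute(0, 2, 1, 3).reshape(batch_size, seq_len, embed_dim)
print("merged shape:", tuple(merged.shape))
```

Head 0 corresponds to the first `head_dim` features of every token, head 1 to the next `head_dim` features, and so on, so merging the heads recovers the original tensor exactly.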


In [9]:
# Helper function to support different mask shapes.
# Output shape supports (batch_size, number of heads, seq length, seq length)
# If 2D: broadcasted over batch size and number of heads
# If 3D: broadcasted over number of heads
# If 4D: leave as is
def expand_mask(mask):
    assert mask.ndim >= 2, "Mask must be at least 2-dimensional with seq_length x seq_length"
    if mask.ndim == 3:
        mask = mask.unsqueeze(1)
    while mask.ndim < 4:
        mask = mask.unsqueeze(0)
    return mask
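
As a quick check of the helper, the sketch below feeds it 2D, 3D, and 4D masks and prints the resulting shapes. `expand_mask` is repeated here only so the snippet runs on its own; the tensor sizes are assumptions.

```python
import torch


def expand_mask(mask):
    # Copy of the helper above, repeated so this snippet is self-contained
    assert mask.ndim >= 2, "Mask must be at least 2-dimensional with seq_length x seq_length"
    if mask.ndim == 3:
        mask = mask.unsqueeze(1)
    while mask.ndim < 4:
        mask = mask.unsqueeze(0)
    return mask


batch_size, num_heads, seq_len = 2, 4, 5
mask_2d = torch.ones(seq_len, seq_len)
mask_3d = torch.ones(batch_size, seq_len, seq_len)
mask_4d = torch.ones(batch_size, num_heads, seq_len, seq_len)

shapes = [tuple(expand_mask(m).shape) for m in (mask_2d, mask_3d, mask_4d)]
print(shapes)

# Each expanded mask broadcasts against logits of shape (batch, heads, seq, seq)
logits = torch.zeros(batch_size, num_heads, seq_len, seq_len)
masked_shapes = [
    tuple(logits.masked_fill(expand_mask(m) == 0, -9e15).shape)
    for m in (mask_2d, mask_3d, mask_4d)
]
```

The size-1 dimensions are what allow the 2D and 3D masks to broadcast over the batch and head dimensions.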



# Multihead Attention Mechanism


---


The MultiheadAttention module is a core component of the Transformer architecture that allows the model to efficiently process sequences. It achieves this by focusing on different parts of the sequence simultaneously through multiple 'heads'. Each head learns different aspects of the data, providing a more nuanced understanding of the sequence.

Here's a summary of its role:

**Parallel Attention Heads**: Each head can attend to different parts of the sequence, capturing diverse relationships across the sequence. This parallel processing is key to the Transformer's ability to handle complex dependencies.





**Scalability**: Because each head works on a slice of size embed_dim / num_heads, using several heads keeps the total attention cost close to that of single-head attention over the full embedding, while giving the model more flexibility.





**Learned Interactions**: The model learns different types of interactions between words (or subwords) in the sequence, which is critical for tasks such as language understanding and generation.





**During the forward pass of the MultiheadAttention module**:

1. Input sequences are linearly projected into queries (q), keys (k), and values (v) for each attention head.
2. The scaled dot-product attention is computed for each set of queries, keys, and values.
3. The attention outputs from each head are concatenated and linearly transformed into the final output.

The MultiheadAttention module's ability to capture different representation subspaces at different positions makes it a powerful mechanism for sequence processing tasks.



In [10]:
class MultiheadAttention(nn.Module):
  def __init__(self, input_dim, embed_dim, num_heads):
    super().__init__()
    self.input_dim = input_dim
    self.embed_dim = embed_dim
    self.num_heads = num_heads
    self.head_dim = embed_dim // num_heads
    # Embedding size needs to be divisible by the number of heads
    assert embed_dim == num_heads * self.head_dim, "embed_dim must be divisible by num_heads"
    # forward() reshapes the inputs directly into heads, so the input width must match embed_dim
    assert input_dim == embed_dim, "this implementation requires input_dim == embed_dim"
    # Linear layers for transforming queries, keys, and values (shared across heads)
    self.W_q = nn.Linear(self.head_dim, self.head_dim, bias=False)
    self.W_k = nn.Linear(self.head_dim, self.head_dim, bias=False)
    self.W_v = nn.Linear(self.head_dim, self.head_dim, bias=False)
    # A fully connected output layer that combines the outputs from all heads
    self.fc_out = nn.Linear(num_heads * self.head_dim, embed_dim)

  def forward(self, queries, keys, values, mask=None):
    N = queries.shape[0]
    query_len, key_len, value_len = queries.shape[1], keys.shape[1], values.shape[1]

    # Split the embedding into self.num_heads different pieces
    queries = queries.reshape(N, query_len, self.num_heads, self.head_dim)
    keys = keys.reshape(N, key_len, self.num_heads, self.head_dim)
    values = values.reshape(N, value_len, self.num_heads, self.head_dim)

    queries = self.W_q(queries)
    keys = self.W_k(keys)
    values = self.W_v(values)

    # Raw attention logits, shape: [batch_size, num_heads, query_len, key_len]
    energy = torch.einsum("nqhd,nkhd->nhqk", queries, keys)

    if mask is not None:
      # Bring 2D/3D masks to a shape that broadcasts over batch and heads
      mask = expand_mask(mask)
      energy = energy.masked_fill(mask == 0, float("-inf"))

    # Scale by sqrt(d_k), where d_k is the per-head key dimension, then normalize over the keys
    attention = F.softmax(energy / math.sqrt(self.head_dim), dim=3)

    # Apply attention weights to the values and then concatenate the heads' outputs.
    # 'attention' (shape: [batch_size, num_heads, query_len, seq_len])
    # 'values' (shape: [batch_size, seq_len, num_heads, head_dim])
    # Output shape after einsum: [batch_size, query_len, num_heads, head_dim]
    # Reshape to merge the head dimension with head_dim, forming the final output.
    out = torch.einsum("nhql,nlhd->nqhd", attention, values).reshape(
        N, query_len, self.num_heads * self.head_dim
    )

    # Apply the fully connected output layer to the concatenated heads
    out = self.fc_out(out)
    return out
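
For comparison, PyTorch ships its own `nn.MultiheadAttention`. It is a different implementation from the module described above (for example, it projects the full embedding before splitting into heads), so the sketch below is only a reference point for the expected input and output shapes in self-attention. The sizes are assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch_size, seq_len, embed_dim, num_heads = 2, 5, 8, 4
x = torch.randn(batch_size, seq_len, embed_dim)

# Built-in module, NOT the notebook's class; batch_first=True gives (batch, seq, embed) inputs
builtin_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

# Self-attention: the same tensor is used as queries, keys, and values
out, weights = builtin_attn(x, x, x)

print("output shape:", tuple(out.shape))
print("attention weights shape (averaged over heads):", tuple(weights.shape))
```

The output keeps the input shape, and by default the returned attention weights are averaged over the heads, so each row still sums to 1.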




# Encoder Blocks

**Understanding the EncoderBlock**





The EncoderBlock class is a fundamental building block of the Transformer Encoder. Each block performs two key operations:




**Self-Attention**: This mechanism allows the encoder to consider other words in the input sequence when encoding a specific word, thereby capturing the context more effectively.






**Feed-Forward Network (FFN)**: After attention has been applied, the FFN further processes the attention output with two linear layers and a ReLU activation in between.






Each EncoderBlock wraps these two operations in residual connections, each followed by layer normalization. The residual connections help preserve information from the block's input as it passes through multiple layers and mitigate the vanishing gradient problem often seen in deep networks.


---


Here's what happens during the forward pass of an EncoderBlock:

1. The input x is passed through the self-attention mechanism.
2. The attention output is added to the original input (residual connection), and the sum is normalized.
3. The normalized output is passed through the FFN.
4. The FFN output is added to the normalized attention output, and a second layer normalization is applied.

This structure is repeated for each EncoderBlock within the TransformerEncoder, allowing for a deep and complex transformation of the input sequence.
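
The "add, then normalize" pattern used around both operations can be sketched in isolation. In the snippet below, a single `nn.Linear` is a hypothetical stand-in for the attention or feed-forward sublayer, and the sizes are assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch_size, seq_len, input_dim = 2, 5, 8
x = torch.randn(batch_size, seq_len, input_dim)

sublayer = nn.Linear(input_dim, input_dim)  # hypothetical stand-in for attention or the FFN
norm = nn.LayerNorm(input_dim)

# Residual connection followed by layer normalization
out = norm(x + sublayer(x))

# LayerNorm normalizes each token's feature vector independently
per_token_mean = out.mean(dim=-1)
per_token_std = out.std(dim=-1, unbiased=False)
print("mean per token:", per_token_mean[0])
print("std per token:", per_token_std[0])
```

With the default LayerNorm parameters at initialization, each token's features end up with a mean close to 0 and a standard deviation close to 1, whatever the scale of the sublayer output.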



In [None]:
class EncoderBlock(nn.Module):
    # Initialize the EncoderBlock with attention and feedforward layers
    def __init__(self, input_dim, num_heads, dim_feedforward, dropout=0.0):
        super().__init__()
        # Self-attention layer with multiple heads
        self.self_attn = MultiheadAttention(input_dim, input_dim, num_heads)
        # Fully connected feed-forward network
        self.linear_net = nn.Sequential(
            nn.Linear(input_dim, dim_feedforward),  # First linear transformation
            nn.Dropout(dropout),                    # Dropout for regularization
            nn.ReLU(inplace=True),                  # ReLU activation
            nn.Linear(dim_feedforward, input_dim)   # Second linear transformation
        )
        # Layer normalization modules
        self.norm1 = nn.LayerNorm(input_dim)
        self.norm2 = nn.LayerNorm(input_dim)
        # Dropout module
        self.dropout = nn.Dropout(dropout)

    # Define the forward pass
    def forward(self, x, mask=None):
        # Apply self-attention; queries, keys, and values all come from the same input
        attn_out = self.self_attn(x, x, x, mask=mask)
        # Add the attention output to the original input (residual connection) and normalize
        x = self.norm1(x + self.dropout(attn_out))
        # Apply the feed-forward network to the normalized attention output
        linear_out = self.linear_net(x)
        # Apply dropout to the FFN output and add it to the FFN input (second residual connection)
        x = x + self.dropout(linear_out)
        # Normalize with the second layer normalization
        x = self.norm2(x)
        # Return the final output of the encoder block
        return x


# Transformer Encoder Overview









The TransformerEncoder class is a key component of the Transformer architecture, responsible for processing the input sequence through multiple layers of attention and feed-forward networks. This class is designed to encapsulate the entire encoding mechanism of the Transformer model, which includes several layers of EncoderBlock modules stacked on top of each other.





Here's a brief breakdown of the TransformerEncoder class functionality:

**Initialization**: The TransformerEncoder initializes with a specified number of encoder layers (num_layers). Each layer is an instance of the EncoderBlock, which contains the self-attention mechanism and a position-wise feed-forward network.





**ModuleList**: It employs a nn.ModuleList to hold the EncoderBlock instances. This list ensures that each block's parameters are properly registered for training and that they are accessible during the forward pass.





**Forward Pass**: During the forward pass, the input data x is sequentially processed by each EncoderBlock. If a mask is provided (e.g., to ignore padding in the input sequence), it is applied to each block's attention mechanism to prevent it from attending to padding tokens.





**Output**: The output of the TransformerEncoder is a transformed sequence that has incorporated contextual information from the entire sequence. This output is then ready to be used by the next component in the Transformer model, typically a TransformerDecoder if the model is being used for tasks like machine translation.





The encoder is a vital part of the Transformer model, enabling it to capture complex dependencies within the input data. The self-attention mechanism allows the model to focus on different parts of the input sequence, and the feed-forward networks further transform the data at each position.
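
The point about `nn.ModuleList` is easy to verify. The sketch below builds two toy stacks of linear layers, one stored in a plain Python list and one in an `nn.ModuleList`, and counts the parameters each module reports. The class names are hypothetical and exist only for this comparison.

```python
import torch.nn as nn


class PlainListStack(nn.Module):
    """Hypothetical stack that stores its layers in a plain Python list."""

    def __init__(self, num_layers, dim):
        super().__init__()
        self.layers = [nn.Linear(dim, dim) for _ in range(num_layers)]


class ModuleListStack(nn.Module):
    """Hypothetical stack that stores its layers in an nn.ModuleList."""

    def __init__(self, num_layers, dim):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_layers)])


plain_count = sum(p.numel() for p in PlainListStack(3, 4).parameters())
registered_count = sum(p.numel() for p in ModuleListStack(3, 4).parameters())

print("parameters seen with a plain list:", plain_count)
print("parameters seen with nn.ModuleList:", registered_count)
```

The plain list reports no parameters, so an optimizer built from `parameters()` would never update those layers, while the `nn.ModuleList` version reports all three layers (3 x (16 weights + 4 biases) = 60 values).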

In [None]:
class TransformerEncoder(nn.Module):

    def __init__(self, num_layers, **block_args):
        """
        Inputs:
            num_layers - Number of encoder blocks to stack
            block_args - Arguments to be passed to each encoder block (e.g., input_dim, num_heads, dim_feedforward, dropout)
        """
        super().__init__()
        # Stack multiple encoder blocks to form the encoder
        self.layers = nn.ModuleList([
            EncoderBlock(**block_args) for _ in range(num_layers)
        ])
        # Store the number of layers
        self.num_layers = num_layers

    def forward(self, x, mask=None):
        # Pass the input (and mask) through each encoder block in sequence
        for layer in self.layers:
            x = layer(x, mask=mask)
        return x
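
As a reference for how such a stack is typically used, the sketch below runs a batch through PyTorch's built-in `nn.TransformerEncoder`. This is a swapped-in library implementation rather than the class defined above, and the hyperparameters are assumptions, but the call pattern is the same: a sequence goes in, and a sequence of the same shape comes out.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch_size, seq_len, input_dim = 2, 5, 8

# Built-in encoder layer and stack, NOT the notebook's EncoderBlock / TransformerEncoder
layer = nn.TransformerEncoderLayer(
    d_model=input_dim, nhead=4, dim_feedforward=16, dropout=0.0, batch_first=True
)
encoder = nn.TransformerEncoder(layer, num_layers=3)

x = torch.randn(batch_size, seq_len, input_dim)
out = encoder(x)

print("input shape: ", tuple(x.shape))
print("output shape:", tuple(out.shape))
```

Because every block maps a `(batch, seq, input_dim)` tensor to a tensor of the same shape, blocks can be stacked to any depth without changing the interface.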
