# Understanding Transformers in Deep Learning

This notebook is a guide to Transformers, the architecture behind most current natural language processing (NLP) models and many other deep learning applications. It covers the theoretical concepts, practical implementations, and best practices for using Transformers.

## Table of Contents
1. **Introduction to Transformers**
    - Overview and History
    - Key Components of the Transformer Architecture
2. **Attention Mechanism**
    - Self-Attention and Multi-Head Attention
    - Positional Encoding
3. **Building Transformers from Scratch**
    - Implementing a Basic Transformer Model
    - Understanding the Model Architecture
4. **Using Pre-trained Transformer Models**
    - Leveraging the `transformers` library (Hugging Face)
    - Text Classification and Text Generation Examples
5. **Advanced Topics**
    - Fine-Tuning Transformers
    - Handling Long Sequences with Transformers


## 1. Introduction to Transformers

### 1.1 Overview and History
The Transformer architecture was introduced in the paper ["Attention Is All You Need"](https://arxiv.org/abs/1706.03762) by Vaswani et al. in 2017. It reshaped NLP by replacing recurrence and convolution with an architecture based solely on attention mechanisms. Because attention processes all positions of a sequence at once rather than step by step, Transformers parallelize efficiently during training, and they have become the backbone of models such as BERT, GPT, and T5.

### 1.2 Key Components of the Transformer Architecture
1. **Encoder and Decoder**: The original Transformer consists of an encoder and a decoder. The encoder turns the input sequence into a sequence of contextual representations, while the decoder generates the output sequence one token at a time, attending to the encoder's output.
2. **Attention Mechanisms**: Self-attention lets every token weigh how relevant each other token in the sequence is when building its own representation.
3. **Positional Encoding**: Adds information about the order of tokens in a sequence, because self-attention on its own treats the input as an unordered set.
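
As a concrete illustration of the positional encoding item above, here is a minimal sketch of the sinusoidal scheme described in the Vaswani et al. paper. It is written in plain Python so the arithmetic is visible. The function name `sinusoidal_positional_encoding` and the toy sizes are assumptions made for this example, and learned position embeddings are a common alternative.

```python
import math

def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a seq_len x d_model table of sinusoidal position encodings."""
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            # Dimensions 2k and 2k+1 share one frequency: 1 / 10000^(2k / d_model)
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            # Even dimensions use sine, odd dimensions use cosine
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

# Toy sizes chosen only for readability
pe = sinusoidal_positional_encoding(seq_len=4, d_model=6)
for pos, row in enumerate(pe):
    print(pos, [round(v, 3) for v in row])
```

Each row is added to the token embedding at that position, so two identical tokens at different positions end up with different input vectors.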

In the following sections, we will explore each of these components in detail and see how they work together to form the Transformer architecture.


## 2. Attention Mechanism

### 2.1 Self-Attention and Multi-Head Attention
The core innovation of the Transformer is the self-attention mechanism, which lets each position in a sequence attend to every other position in the same sequence when computing its own representation. Because any two positions are connected directly, long-range dependencies are easier to capture than in recurrent models.

#### Self-Attention Calculation
Given an input sequence, self-attention is computed as follows:

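1. Compute three vectors for each token in the sequence: Query (Q), Key (K), and Value (V), each obtained through a learned linear projection of the token's embedding.
2. Compute the dot product of Q and K, and divide the result by the square root of the dimension of K so that the scores do not grow with the vector size.
3. Apply a softmax function over the scores to get attention weights that sum to 1 for each query.
4. Multiply the attention weights by V and sum the results to get the output representation.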
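
The four steps above can be sketched in plain Python for a single head. This is an illustrative toy, not an efficient implementation: the helper names `softmax` and `scaled_dot_product_attention` are assumptions for this example, and Q, K, and V are given directly here, whereas a real model would produce them with learned projections.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Single-head attention over lists of vectors (one vector per token)."""
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        # Step 2: dot product of the query with every key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        # Step 3: softmax turns scores into weights that sum to 1
        w = softmax(scores)
        # Step 4: weighted sum of the value vectors
        out = [sum(w_j * v[c] for w_j, v in zip(w, V)) for c in range(len(V[0]))]
        weights.append(w)
        outputs.append(out)
    return outputs, weights

# Toy input: 3 tokens with d_k = 2 (values chosen arbitrarily)
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

out, w = scaled_dot_product_attention(Q, K, V)
for token_weights in w:
    print([round(x, 3) for x in token_weights])
```

Each printed row shows how one token distributes its attention over the three tokens. The first token's query matches the first and third keys equally well, so those two weights are equal and larger than the second.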

#### Multi-Head Attention
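Instead of computing attention once, multi-head attention runs several attention operations (heads) in parallel, each with its own set of Q, K, and V. The outputs of all heads are concatenated and passed through a final linear layer, which allows the model to focus on different parts of the sequence simultaneously.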

### Implementation of Self-Attention Mechanism
Below is a PyTorch implementation of multi-head self-attention.


In [None]:
# Self-Attention Mechanism Implementation

import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, embed_size, heads):
        super(SelfAttention, self).__init__()
        self.embed_size = embed_size
        self.heads = heads
        self.head_dim = embed_size // heads

        assert (
            self.head_dim * heads == embed_size
        ), "Embedding size needs to be divisible by heads"

        self.values = nn.Linear(self.head_dim, self.head_dim, bias=False)
        self.keys = nn.Linear(self.head_dim, self.head_dim, bias=False)
        self.queries = nn.Linear(self.head_dim, self.head_dim, bias=False)
        self.fc_out = nn.Linear(heads * self.head_dim, embed_size)

    def forward(self, values, keys, query, mask):
        N = query.shape[0]
        value_len, key_len, query_len = values.shape[1], keys.shape[1], query.shape[1]

        # Split the embedding into self.heads different pieces
        values = values.reshape(N, value_len, self.heads, self.head_dim)
        keys = keys.reshape(N, key_len, self.heads, self.head_dim)
        queries = query.reshape(N, query_len, self.heads, self.head_dim)

        # Apply the learned projections (the same weights are shared by all heads)
        values = self.values(values)
        keys = self.keys(keys)
        queries = self.queries(queries)

        # Calculate the dot product attention
        # queries shape: (N, query_len, heads, head_dim)
        # keys shape:    (N, key_len, heads, head_dim)
        # energy shape:  (N, heads, query_len, key_len)
        energy = torch.einsum("nqhd,nkhd->nhqk", queries, keys)

        if mask is not None:
            # mask must broadcast to (N, heads, query_len, key_len); 0 marks blocked positions
            energy = energy.masked_fill(mask == 0, float("-1e20"))

        # Scale by sqrt(head_dim), the per-head key dimension, then normalize across key_len
        attention = torch.softmax(energy / (self.head_dim ** 0.5), dim=3)
        # attention shape: (N, heads, query_len, key_len); values shape: (N, value_len, heads, head_dim)
        out = torch.einsum("nhql,nlhd->nqhd", attention, values).reshape(
            N, query_len, self.heads * self.head_dim
        )
        out = self.fc_out(out)

        return out

# Test the self-attention mechanism
embed_size = 256
heads = 8
attention = SelfAttention(embed_size, heads)

# Create sample input
x = torch.rand(64, 10, embed_size)  # (batch_size, sequence_length, embed_size)
mask = None
output = attention(x, x, x, mask)

print(f"Output shape: {output.shape}")


## 3. Building Transformers from Scratch

Let's implement a simplified version of a Transformer architecture, including both the encoder and decoder components. We'll use PyTorch for the implementation.

**Note**: This implementation is for educational purposes and does not include advanced optimizations used in actual Transformer models like GPT or BERT.
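
As a starting point, here is a minimal sketch of a single encoder block only; the decoder is not shown. It assumes the post-norm layout of the original paper and, to stay self-contained, uses PyTorch's built-in `nn.MultiheadAttention` rather than the `SelfAttention` class from Section 2. The class name `EncoderBlock` and the sizes used are illustrative choices.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One Transformer encoder block: self-attention followed by a feed-forward network."""

    def __init__(self, embed_size, heads, ff_hidden, dropout=0.1):
        super().__init__()
        # batch_first=True means inputs are (batch, seq_len, embed_size)
        self.attention = nn.MultiheadAttention(
            embed_size, heads, dropout=dropout, batch_first=True
        )
        self.norm1 = nn.LayerNorm(embed_size)
        self.norm2 = nn.LayerNorm(embed_size)
        self.feed_forward = nn.Sequential(
            nn.Linear(embed_size, ff_hidden),
            nn.ReLU(),
            nn.Linear(ff_hidden, embed_size),
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, key_padding_mask=None):
        # Self-attention: queries, keys, and values all come from x
        attn_out, _ = self.attention(x, x, x, key_padding_mask=key_padding_mask)
        # Residual connection followed by layer normalization (post-norm)
        x = self.norm1(x + self.dropout(attn_out))
        ff_out = self.feed_forward(x)
        return self.norm2(x + self.dropout(ff_out))

# Illustrative sizes: batch of 2 sequences, 10 tokens each, 256-dim embeddings
block = EncoderBlock(embed_size=256, heads=8, ff_hidden=1024)
x = torch.rand(2, 10, 256)
out = block(x)
print(f"Output shape: {out.shape}")
```

The block preserves the input shape, so several blocks can be stacked with `nn.ModuleList` to form an encoder. A decoder block would add a causal mask to its self-attention and a second attention layer over the encoder output.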


## 4. Using Pre-trained Transformer Models

The `transformers` library by Hugging Face provides easy access to numerous pre-trained models. In this section, we will see how to leverage these models for downstream NLP tasks.

### 4.1 Text Classification
We'll use BERT to classify text into different categories.

### 4.2 Text Generation
We'll use GPT-2 for generating text based on a given prompt.

Below is the implementation for both examples:


In [None]:
# Text Classification using Pre-trained BERT

from transformers import BertTokenizer, BertForSequenceClassification
from torch.utils.data import DataLoader, Dataset
import torch

# Define a sample dataset
class SampleDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_len):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, item):
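        # Calling the tokenizer directly replaces the older encode_plus method
        encoding = self.tokenizer(
            self.texts[item],
            add_special_tokens=True,
            max_length=self.max_len,
            truncation=True,  # cut texts longer than max_len so every item has the same length
            return_token_type_ids=False,
            padding='max_length',
            return_attention_mask=True,
            return_tensors='pt',
        )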
        return {
            'text': self.texts[item],
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'labels': torch.tensor(self.labels[item], dtype=torch.long)
        }

# Sample data
texts = ["I love programming", "Transformers are powerful models", "Machine learning is fascinating"]
labels = [1, 0, 1]  # Assume 1 for positive sentiment, 0 for neutral/negative

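# Load pre-trained tokenizer and model
# The classification head on top of BERT is newly initialized (2 labels by default),
# so predictions are not meaningful until the model has been fine-tuned
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')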

# Create DataLoader
dataset = SampleDataset(texts, labels, tokenizer, max_len=10)
loader = DataLoader(dataset, batch_size=2)

# Define optimizer (the model computes the cross-entropy loss itself when `labels` are passed)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

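# Training loop (simplified: a single pass over the data)
model.train()
for batch in loader:
    input_ids = batch['input_ids']
    attention_mask = batch['attention_mask']
    batch_labels = batch['labels']  # separate name so the `labels` list above is not overwritten

    # Passing labels makes the model return the loss along with the logits
    outputs = model(input_ids, attention_mask=attention_mask, labels=batch_labels)
    loss = outputs.loss
    logits = outputs.logits

    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

print(f"Training completed. Last batch loss: {loss.item():.4f}")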
