<a href="https://colab.research.google.com/github/LeninGF/CoursesNotes/blob/main/InteligenciaArtificalGenerativa/Problems/transformers/EjercicioTransformersEncoder-IAG-2024B_LeninFalconi.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Transformers Encoder



Coder: Lenin G. Falconí



Course: Special Topics (Artificial Intelligence)



Date: 2024-12-02

# Transformer Encoder

Building a Transformer encoder requires:

1. An embedding layer
2. Positional encoding
3. A stack of encoder layers
4. An output layer, here a classification head

In [11]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.utils.data as data
import math
import copy

### MultiHead attention
The `MultiHeadAttention` class encapsulates the multi-head attention mechanism used in transformer models. It splits the input into multiple attention heads, applies attention within each head, and then combines the results. Each head works on its own slice of size `d_k = d_model / num_heads` of the projected features, so different heads can learn to attend to different relationships in the input, which makes the model more expressive.

`scaled_dot_product_attention`: the attention scores are the dot product of the queries (Q) and keys (K), scaled by the square root of the key dimension (d_k). Positions where the optional mask is 0 are set to -1e9, a softmax over the last dimension turns the scores into attention probabilities, and these probabilities weight the values (V).

`attn_scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)`

`split_heads`: reshapes the input x from (batch_size, seq_length, d_model) to (batch_size, num_heads, seq_length, d_k), so that all heads are processed in parallel by the same batched matrix multiplications.

`combine_heads`: merges the heads back into a single tensor of shape (batch_size, seq_length, d_model).

`forward`: projects Q, K and V with the linear layers `W_q`, `W_k` and `W_v`, splits them into heads, applies scaled dot-product attention, combines the heads, and applies the output projection `W_o`.
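
To make the tensor shapes concrete, here is a small standalone sketch of the same steps (split into heads, scaled dot-product attention, combine heads). The sizes are toy values chosen for illustration, and the linear projections are left out so that only the reshaping and the attention itself are visible.

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V, mask=None):
    # Scores: (batch, heads, seq, seq), scaled by sqrt(d_k)
    d_k = Q.size(-1)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    probs = torch.softmax(scores, dim=-1)
    return torch.matmul(probs, V), probs

# Toy sizes (assumptions for illustration only)
batch_size, seq_length, d_model, num_heads = 2, 5, 16, 4
d_k = d_model // num_heads

x = torch.randn(batch_size, seq_length, d_model)

# split_heads: (batch, seq, d_model) -> (batch, heads, seq, d_k)
heads = x.view(batch_size, seq_length, num_heads, d_k).transpose(1, 2)

out, probs = scaled_dot_product_attention(heads, heads, heads)

# combine_heads: (batch, heads, seq, d_k) -> (batch, seq, d_model)
combined = out.transpose(1, 2).contiguous().view(batch_size, seq_length, d_model)

print(heads.shape, probs.shape, combined.shape)
```

Each row of `probs` sums to 1, since the softmax is taken over the key dimension.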

In [12]:
class MultiHeadAttention(nn.Module):
  """
  d_model: Dimensionality of the input.
  num_heads: The number of attention heads to split the input into.
  d_model is divisible by num_heads

  """
  def __init__(self, d_model, num_heads):
    super(MultiHeadAttention, self).__init__()
    # Ensure that the model dimension (d_model) is divisible by the number of heads
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"

    # Initialize dimensions
    self.d_model = d_model # Model's dimension
    self.num_heads = num_heads # Number of attention heads
    self.d_k = d_model // num_heads # Dimension of each head's key, query, and value

    # Linear layers for transforming inputs
    self.W_q = nn.Linear(d_model, d_model) # Query transformation
    self.W_k = nn.Linear(d_model, d_model) # Key transformation
    self.W_v = nn.Linear(d_model, d_model) # Value transformation
    self.W_o = nn.Linear(d_model, d_model) # Output transformation

  def scaled_dot_product_attention(self, Q, K, V, mask=None):
    # Calculate attention scores
    attn_scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)

    # Apply mask if provided (useful for preventing attention to certain parts like padding)
    if mask is not None:
        attn_scores = attn_scores.masked_fill(mask == 0, -1e9)

    # Softmax is applied to obtain attention probabilities
    attn_probs = torch.softmax(attn_scores, dim=-1)

    # Multiply by values to obtain the final output
    output = torch.matmul(attn_probs, V)
    return output

  def split_heads(self, x):
    # Reshape the input to have num_heads for multi-head attention
    batch_size, seq_length, d_model = x.size()
    return x.view(batch_size, seq_length, self.num_heads, self.d_k).transpose(1, 2)

  def combine_heads(self, x):
    # Combine the multiple heads back to original shape
    batch_size, _, seq_length, d_k = x.size()
    return x.transpose(1, 2).contiguous().view(batch_size, seq_length, self.d_model)

  def forward(self, Q, K, V, mask=None):
    # Apply linear transformations and split heads
    Q = self.split_heads(self.W_q(Q))
    K = self.split_heads(self.W_k(K))
    V = self.split_heads(self.W_v(V))

    # Perform scaled dot-product attention
    attn_output = self.scaled_dot_product_attention(Q, K, V, mask)

    # Combine heads and apply output transformation
    output = self.W_o(self.combine_heads(attn_output))
    return output

### Position Wise Feed Forward
The `PositionWiseFeedForward` class defines a position-wise feed-forward network consisting of two linear layers with a ReLU activation in between. In a transformer, this network is applied to each position of the sequence separately and identically, with the same weights at every position. It further transforms the features produced by the attention mechanism, acting as an additional processing step on the attention outputs.
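
"Separately and identically" can be checked directly: running the network on the whole sequence gives the same result as running it on one position at a time. The following is a minimal sketch with toy sizes; `nn.Sequential` is used here only as a compact stand-in for the class defined in the next cell.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, d_ff = 8, 32  # toy sizes, not the notebook's values

# Same structure as the feed-forward block: Linear -> ReLU -> Linear
ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))

x = torch.randn(2, 5, d_model)  # (batch, seq, d_model)

with torch.no_grad():
    full = ffn(x)  # applied to the whole sequence at once
    # applied to each position t on its own, then stacked back along the sequence axis
    per_position = torch.stack([ffn(x[:, t]) for t in range(x.size(1))], dim=1)

print(torch.allclose(full, per_position, atol=1e-5))
```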

In [13]:
class PositionWiseFeedForward(nn.Module):
  """
  d_model: Dimensionality of the input.
  d_ff: Dimensionality of the inner layer in the feed-forward network.
  """

  def __init__(self, d_model, d_ff):
    super(PositionWiseFeedForward, self).__init__()
    self.fc1 = nn.Linear(d_model, d_ff)
    self.fc2 = nn.Linear(d_ff, d_model)
    self.relu = nn.ReLU()

  def forward(self, x):
    return self.fc2(self.relu(self.fc1(x)))

### Positional Encoding
The `PositionalEncoding` class adds information about the position of each token within the sequence. Self-attention by itself does not depend on the order of its inputs, so without this step the model would have no knowledge of token order. The sinusoidal functions give each position a unique, smoothly varying encoding, chosen so that the model can easily learn to attend to relative positions.

- `max_seq_length`: the maximum sequence length for which positional encodings are pre-computed.
- `pe`: a tensor of shape (max_seq_length, d_model), initialized with zeros and then filled with the positional encodings. It is registered as a buffer, so it is stored with the model but not trained.
- `position`: a column tensor containing the position indices 0, 1, ..., max_seq_length - 1.
- `div_term`: the frequency 10000^(-k / d_model) for each even dimension index k, which multiplies the position indices.

The sine function fills the even dimension indices of `pe` (columns `0::2`) and the cosine function fills the odd ones (columns `1::2`). The code therefore assumes that `d_model` is even.
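
For a single position, the encoding can be written out with plain Python, which makes the role of `div_term` easier to follow. This is a sketch of the same formula using only the `math` module; the function name `sinusoidal_encoding` is introduced here for illustration and is not part of the notebook.

```python
import math

def sinusoidal_encoding(position, d_model):
    """Encoding of one position as a list of d_model floats (d_model assumed even)."""
    encoding = []
    for k in range(0, d_model, 2):
        # Same frequency term as div_term: 10000^(-k / d_model)
        angle = position * math.exp(-math.log(10000.0) * k / d_model)
        encoding.append(math.sin(angle))  # even index k
        encoding.append(math.cos(angle))  # odd index k + 1
    return encoding

pe0 = sinusoidal_encoding(0, 4)
pe1 = sinusoidal_encoding(1, 4)
print(pe0)  # prints [0.0, 1.0, 0.0, 1.0]
print(pe1)
```

Position 0 always encodes to alternating 0 and 1, because sin(0) = 0 and cos(0) = 1. For later positions, the low dimensions oscillate quickly and the high dimensions slowly.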

In [14]:
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_seq_length):
        super(PositionalEncoding, self).__init__()

        pe = torch.zeros(max_seq_length, d_model)
        position = torch.arange(0, max_seq_length, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model))

        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)

        self.register_buffer('pe', pe.unsqueeze(0))

    def forward(self, x):
        return x + self.pe[:, :x.size(1)]

###  Encoder Layer

The EncoderLayer class defines a single layer of the transformer's encoder. It encapsulates a multi-head self-attention mechanism followed by position-wise feed-forward neural network, with residual connections, layer normalization, and dropout applied as appropriate. These components together allow the encoder to capture complex relationships in the input data and transform them into a useful representation for downstream tasks. Typically, multiple such encoder layers are stacked to form the complete encoder part of a transformer model.
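
In the class that follows, each sub-block applies the sublayer, then dropout, then the residual addition, and finally layer normalization. A minimal sketch of that pattern is given here, with a random tensor standing in for the attention or feed-forward output (`sublayer_out` is a placeholder, not something from the notebook).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 16  # toy size

norm = nn.LayerNorm(d_model)
dropout = nn.Dropout(0.1)
dropout.eval()  # disable dropout so the statistics below are deterministic

x = torch.randn(2, 5, d_model)             # input to the sub-block
sublayer_out = torch.randn(2, 5, d_model)  # placeholder for attention / feed-forward output

# Residual connection followed by layer normalization
y = norm(x + dropout(sublayer_out))

# With the default affine parameters, each position is normalized
# to roughly zero mean and unit variance over its d_model features
mean = y.mean(dim=-1)
var = y.var(dim=-1, unbiased=False)
print(y.shape, mean.abs().max().item(), var.mean().item())
```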

In [15]:
class EncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout):
        super(EncoderLayer, self).__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.feed_forward = PositionWiseFeedForward(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask):
        attn_output = self.self_attn(x, x, x, mask)
        x = self.norm1(x + self.dropout(attn_output))
        ff_output = self.feed_forward(x)
        x = self.norm2(x + self.dropout(ff_output))
        return x

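### Transformer Encoder

The `TransformerEncoder` class assembles the full model: a linear projection from `vocab_size` input features to `d_model` (used as the embedding layer), the positional encoding, a stack of `num_layers` encoder layers, global average pooling over the sequence dimension, and a linear classification head with `num_classes` outputs.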

In [16]:
class TransformerEncoder(nn.Module):
    def __init__(self, vocab_size, d_model, num_heads, num_layers, d_ff, dropout, num_classes, max_sequence_length):
        super(TransformerEncoder, self).__init__()
        self.embedding = nn.Linear(vocab_size, d_model)
        self.positional_encoding = PositionalEncoding(d_model, max_sequence_length)
        self.encoder_layers = nn.ModuleList(
            [EncoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)]
        )
        self.fc = nn.Linear(d_model, num_classes)

    def forward(self, x, mask=None):
        x = self.embedding(x)
        x = self.positional_encoding(x)
        for layer in self.encoder_layers:
            x = layer(x, mask)
        x = x.mean(dim=1)  # Global average pooling
        x = self.fc(x)
        return x
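
The `mask` argument is never used in this notebook, since every synthetic sequence has the full length. As a hedged sketch of how a padding mask could be built for the `masked_fill(mask == 0, -1e9)` call in the attention code: the `lengths` tensor below is hypothetical, and the mask is given shape (batch, 1, 1, seq) so that it broadcasts over heads and query positions.

```python
import torch

batch_size, num_heads, seq_length = 2, 4, 6  # toy sizes
lengths = torch.tensor([6, 3])  # hypothetical count of real (non-padding) tokens per sequence

positions = torch.arange(seq_length).unsqueeze(0)  # (1, seq)
keep = (positions < lengths.unsqueeze(1)).long()   # (batch, seq): 1 = real token, 0 = padding
mask = keep[:, None, None, :]                      # (batch, 1, 1, seq)

# Uniform scores, so the effect of the mask is easy to read off
scores = torch.zeros(batch_size, num_heads, seq_length, seq_length)
scores = scores.masked_fill(mask == 0, -1e9)
probs = torch.softmax(scores, dim=-1)

print(probs[1, 0, 0])  # attention of the second sequence is spread over its 3 real tokens only
```

Note that the model's global average pooling (`x.mean(dim=1)`) would still average over padded positions, so a real padded dataset would likely need a masked mean as well.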


## Test with Random Data

A synthetic dataset is generated to evaluate the model's performance on a classification task. Features and labels are drawn independently at random, so the data contains no learnable signal.
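
The first layer of the model is `nn.Linear(vocab_size, d_model)`, which is why each time step of `X` is a dense float vector of length `vocab_size` rather than an integer token id. For integer token ids, `nn.Embedding` is the usual choice. The sketch below (toy sizes, not taken from the notebook) shows that an embedding lookup gives the same result as a bias-free linear layer applied to one-hot vectors.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, d_model = 10, 4  # toy sizes

embedding = nn.Embedding(vocab_size, d_model)
linear = nn.Linear(vocab_size, d_model, bias=False)

# nn.Linear stores its weight as (d_model, vocab_size), i.e. the transposed embedding table
with torch.no_grad():
    linear.weight.copy_(embedding.weight.t())

token_ids = torch.tensor([[1, 5, 9]])                            # (batch, seq) integer ids
one_hot = F.one_hot(token_ids, num_classes=vocab_size).float()  # (batch, seq, vocab_size)

same = torch.allclose(embedding(token_ids), linear(one_hot))
print(same)
```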

In [33]:
d_model = 512
num_heads = 8
num_layers = 6
d_ff = 2048
vocab_size = 1000
sequence_length = 256
dropout = 0.1
num_classes = 3

In [34]:
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
num_samples = 1000

X = np.random.rand(num_samples, sequence_length, vocab_size).astype(np.float32)
y = np.random.randint(0, num_classes, num_samples).astype(np.int64)

# Convert to PyTorch tensors
X_tensor = torch.tensor(X)
y_tensor = torch.tensor(y)

# Create DataLoader
dataset = TensorDataset(X_tensor, y_tensor)
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)


In [35]:
model = TransformerEncoder(vocab_size,
                           d_model,
                           num_heads,
                           num_layers,
                           d_ff,
                           dropout,
                           num_classes,
                           sequence_length)

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)


In [36]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f'Using device: {device}')


Using device: cuda


In [37]:
model = model.to(device)
X_tensor = X_tensor.to(device)
y_tensor = y_tensor.to(device)


In [38]:
num_epochs = 10

for epoch in range(num_epochs):
    model.train()
    total_loss = 0
    correct_predictions = 0
    total_predictions = 0
    for batch_X, batch_y in dataloader:
        batch_X, batch_y = batch_X.to(device), batch_y.to(device)
        optimizer.zero_grad()
        outputs = model(batch_X)
        loss = criterion(outputs, batch_y)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()

        # Compute accuracy: the predicted class is the index of the largest logit
        predicted = outputs.argmax(dim=1)
        total_predictions += batch_y.size(0)
        correct_predictions += (predicted == batch_y).sum().item()

    avg_loss = total_loss / len(dataloader)
    accuracy = correct_predictions / total_predictions
    print(f'Epoch {epoch + 1}/{num_epochs}, Loss: {avg_loss:.4f}, Accuracy:{accuracy:.4f}')


Epoch 1/10, Loss: 1.5117, Accuracy:0.3250
Epoch 2/10, Loss: 1.1159, Accuracy:0.3450
Epoch 3/10, Loss: 1.1351, Accuracy:0.3190
Epoch 4/10, Loss: 1.1200, Accuracy:0.3380
Epoch 5/10, Loss: 1.1208, Accuracy:0.3270
Epoch 6/10, Loss: 1.1245, Accuracy:0.3330
Epoch 7/10, Loss: 1.1160, Accuracy:0.3290
Epoch 8/10, Loss: 1.1233, Accuracy:0.3340
Epoch 9/10, Loss: 1.1240, Accuracy:0.3460
Epoch 10/10, Loss: 1.1127, Accuracy:0.3230
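
As a reference point for the log above: the labels are drawn uniformly from `num_classes = 3` classes, independently of the inputs, so a model that has learned nothing useful should sit near an accuracy of 1/3 and a cross-entropy of ln 3. A quick check of those two baselines:

```python
import math

num_classes = 3

# Expected accuracy when predictions are unrelated to uniformly distributed labels
chance_accuracy = 1 / num_classes

# Cross-entropy of a uniform prediction over num_classes classes
chance_loss = math.log(num_classes)

print(f"{chance_accuracy:.4f} {chance_loss:.4f}")  # prints 0.3333 1.0986
```

From the second epoch on, the reported loss (roughly 1.11 to 1.14) and accuracy (roughly 0.32 to 0.35) stay close to these values, which is consistent with there being no signal to learn.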


In [39]:
model.eval()
with torch.no_grad():
    outputs = model(X_tensor)
    _, predictions = torch.max(outputs, 1)
    accuracy = (predictions == y_tensor).float().mean()
    print(f'Accuracy: {accuracy:.4f}')


Accuracy: 0.3450
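
The evaluation above reuses the training data and pushes all 1000 samples through the model in one forward pass; the input tensor alone is about 1 GB in float32. A more memory-friendly variant iterates over a `DataLoader`. The helper below is a sketch: the name `evaluate` and the tiny stand-in model are assumptions made so that the example runs on its own.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def evaluate(model, loader, device):
    """Return classification accuracy, computed batch by batch."""
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for batch_X, batch_y in loader:
            batch_X, batch_y = batch_X.to(device), batch_y.to(device)
            predicted = model(batch_X).argmax(dim=1)
            correct += (predicted == batch_y).sum().item()
            total += batch_y.size(0)
    return correct / total

# Stand-in model and data (not the notebook's), just to exercise the helper
torch.manual_seed(0)
device = torch.device("cpu")
toy_model = nn.Linear(8, 3)
X = torch.randn(50, 8)
y = torch.randint(0, 3, (50,))
loader = DataLoader(TensorDataset(X, y), batch_size=16)

acc = evaluate(toy_model, loader, device)
print(f"Accuracy: {acc:.4f}")
```

With the notebook's objects, the call would be `evaluate(model, dataloader, device)`. A held-out split would be needed to measure generalization rather than fit to the training set.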


In [40]:
# Ensure the model is in evaluation mode
model.eval()

# Assuming X_tensor is your dataset in tensor form
with torch.no_grad():
    # Move the data to the GPU if available
    X_tensor = X_tensor.to(device)

    # Make predictions
    outputs = model(X_tensor)
    _, predictions = torch.max(outputs, 1)

    # If you want to move predictions back to CPU and convert to numpy array
    predictions = predictions.cpu().numpy()

print(predictions)


[2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 

In [41]:
!pip install torchinfo

Collecting torchinfo
  Downloading torchinfo-1.8.0-py3-none-any.whl.metadata (21 kB)
Downloading torchinfo-1.8.0-py3-none-any.whl (23 kB)
Installing collected packages: torchinfo
Successfully installed torchinfo-1.8.0


In [42]:
from torchinfo import summary
summary(model)

Layer (type:depth-idx)                        Param #
TransformerEncoder                            --
├─Linear: 1-1                                 512,512
├─PositionalEncoding: 1-2                     --
├─ModuleList: 1-3                             --
│    └─EncoderLayer: 2-1                      --
│    │    └─MultiHeadAttention: 3-1           1,050,624
│    │    └─PositionWiseFeedForward: 3-2      2,099,712
│    │    └─LayerNorm: 3-3                    1,024
│    │    └─LayerNorm: 3-4                    1,024
│    │    └─Dropout: 3-5                      --
│    └─EncoderLayer: 2-2                      --
│    │    └─MultiHeadAttention: 3-6           1,050,624
│    │    └─PositionWiseFeedForward: 3-7      2,099,712
│    │    └─LayerNorm: 3-8                    1,024
│    │    └─LayerNorm: 3-9                    1,024
│    │    └─Dropout: 3-10                     --
│    └─EncoderLayer: 2-3                      --
│    │    └─MultiHeadAttention: 3-11          1,050,624
│    │    └─
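
The per-module counts in the summary can be reproduced by hand from the hyperparameters: an `nn.Linear(n_in, n_out)` holds `n_in * n_out` weights plus `n_out` biases, and an `nn.LayerNorm(d)` holds `d` scales and `d` shifts. The helper name `linear_params` is introduced here for illustration.

```python
def linear_params(n_in, n_out):
    # Weight matrix plus bias vector of a fully connected layer
    return n_in * n_out + n_out

d_model, d_ff, vocab_size = 512, 2048, 1000  # values used in this notebook

embedding = linear_params(vocab_size, d_model)      # input projection
attention = 4 * linear_params(d_model, d_model)     # W_q, W_k, W_v, W_o
feed_forward = linear_params(d_model, d_ff) + linear_params(d_ff, d_model)
layer_norm = 2 * d_model                            # scale and shift

print(f"{embedding:,} {attention:,} {feed_forward:,} {layer_norm:,}")
# prints 512,512 1,050,624 2,099,712 1,024
```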

## References
- https://www.datacamp.com/tutorial/building-a-transformer-with-py-torch
- https://campus.datacamp.com/es/courses/introduction-to-llms-in-python/building-a-transformer-architecture?ex=15
