## Import Libraries

In [1]:
import numpy as np
import random
import torch
from torch.utils.data import Dataset, DataLoader
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.optim.lr_scheduler import ReduceLROnPlateau
from torch.optim.lr_scheduler import OneCycleLR

## DummyTextDataset Class

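This code defines a custom PyTorch dataset class, `DummyTextDataset`, which generates random token sequences and assigns balanced binary labels. It is meant for exercising a text classification pipeline when no real data is available. Because the labels are assigned independently of the token values, there is no pattern to learn: a model can only memorize the training samples, so accuracy on a separately generated test set should stay near 50%. Below is a breakdown of the code: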

### Class: `DummyTextDataset`

#### `__init__(self, num_samples=1000, seq_length=50, vocab_size=1000)`
- **Parameters**:
  - `num_samples` (default: 1000): The number of text samples in the dataset.
  - `seq_length` (default: 50): The length of each text sequence.
  - `vocab_size` (default: 1000): The size of the vocabulary for generating text sequences.

- **Functionality**:
  - Generates `num_samples` random integer sequences representing text data. Each sequence is of length `seq_length` and consists of integers between 0 and `vocab_size-1`.
  - The labels are balanced between two classes: class 0 and class 1. Half of the samples are labeled as class 0 and the rest as class 1.
  - The data and labels are shuffled with the same random permutation, so the two classes are interleaved instead of being stored as two contiguous blocks.

#### `__len__(self)`
- Returns the total number of samples in the dataset (i.e., the length of `self.data`).

#### `__getitem__(self, idx)`
- Fetches the text sequence (`data[idx]`) and its corresponding label (`labels[idx]`) at a given index (`idx`).
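
The label construction and joint shuffle described above can be sketched on a tiny example. The sizes and the fixed seed below are illustrative assumptions, not values from the notebook.

```python
import torch

torch.manual_seed(0)  # Assumed seed, only to make the sketch repeatable

num_samples, seq_length, vocab_size = 6, 4, 10  # Hypothetical toy sizes

# Random token IDs in [0, vocab_size)
data = torch.randint(0, vocab_size, (num_samples, seq_length), dtype=torch.long)

# First half class 0, second half class 1
labels = torch.zeros(num_samples, dtype=torch.long)
labels[num_samples // 2:] = 1

# Apply one permutation to both tensors so each sequence keeps its label
perm = torch.randperm(num_samples)
shuffled_data, shuffled_labels = data[perm], labels[perm]

print(shuffled_labels.tolist())
```

Indexing both tensors with the same `perm` is what keeps sample `i` paired with its original label after shuffling.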


In [2]:
class DummyTextDataset(Dataset):
    def __init__(self, num_samples=1000, seq_length=50, vocab_size=1000):
        self.data = torch.randint(0, vocab_size, (num_samples, seq_length), dtype=torch.long)

        # Balanced labels: the first num_class_0 samples are class 0, the rest are class 1
        num_class_0 = num_samples // 2

        # Start with all zeros (class 0), then mark the second half as class 1
        self.labels = torch.zeros(num_samples, dtype=torch.long)
        self.labels[num_class_0:] = 1

        # Shuffle the data and labels together to ensure randomness
        indices = torch.randperm(num_samples)
        self.data = self.data[indices]
        self.labels = self.labels[indices]

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx], self.labels[idx]

## LearnablePositionalEncoding Class

The `LearnablePositionalEncoding` class implements learnable positional encodings for sequence data in PyTorch. It is particularly useful in transformer models, where the position of elements in the input sequence is crucial but not inherently represented by the model. This class allows the model to learn an optimal representation of sequence positions during training.

### Class: `LearnablePositionalEncoding`

#### `__init__(self, d_model, max_seq_length)`
- **Parameters**:
  - `d_model`: The dimension of the model (i.e., the size of the embedding vector for each position).
  - `max_seq_length`: The maximum length of the sequence for which positional embeddings will be generated.

- **Functionality**:
  - Initializes a learnable positional encoding as an `nn.Embedding`, which maps each position in the sequence (up to `max_seq_length`) to a `d_model`-dimensional vector.
  - The `positional_encoding` layer creates a lookup table where each position in the sequence (from 0 to `max_seq_length-1`) is assigned a trainable embedding vector.

#### `forward(self, x)`
- **Input**:
  - `x`: A tensor of shape `[batch_size, seq_length, d_model]`, representing the input sequence.
  
- **Functionality**:
  - Extracts the `batch_size` and `seq_length` from the input tensor `x`.
  - Creates a tensor of positional indices (from `0` to `seq_length-1`) and replicates it for all items in the batch.
  - Uses the `positional_encoding` layer to get a positional embedding for each position in the sequence.
  - Adds the positional embeddings to the input `x`, thereby injecting positional information into the model's representation of the input.

- **Output**:
  - The input `x` with added positional embeddings, ensuring that the sequence order is encoded in the final representation.
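
The index construction in `forward` can be checked in isolation. This is a minimal sketch with assumed toy dimensions; `pos_table` is a hypothetical stand-in for the module's `positional_encoding` layer.

```python
import torch
import torch.nn as nn

batch_size, seq_length, d_model = 2, 5, 8  # Hypothetical toy sizes
pos_table = nn.Embedding(16, d_model)      # Stand-in lookup table, max_seq_length = 16

# Zero input, so the output equals the positional embeddings alone
x = torch.zeros(batch_size, seq_length, d_model)

# [seq_length] -> [1, seq_length] -> [batch_size, seq_length]
positions = torch.arange(seq_length).unsqueeze(0).expand(batch_size, seq_length)

out = x + pos_table(positions)
print(out.shape)
```

Every item in the batch receives the same positional vectors, and position `p` always maps to row `p` of the embedding table.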


In [3]:
class LearnablePositionalEncoding(nn.Module):
    def __init__(self, d_model, max_seq_length):
        super(LearnablePositionalEncoding, self).__init__()
        self.positional_encoding = nn.Embedding(max_seq_length, d_model)

    def forward(self, x):
        # x has shape [batch_size, seq_length, d_model]
        batch_size, seq_length, _ = x.size()

        # Generate positional indices for each position in the sequence
        positions = torch.arange(0, seq_length, device=x.device).unsqueeze(0).expand(batch_size, seq_length)

        # Pass positional indices through learnable embedding
        pos_embeddings = self.positional_encoding(positions)

        # Add positional embeddings to input
        return x + pos_embeddings

## TextClassificationModel Class

The `TextClassificationModel` is a transformer-encoder classifier for text. It combines token embeddings, learnable positional encodings, a stack of transformer encoder layers, and batch normalization on the pooled representation. The model takes sequences of token IDs and outputs one pair of class scores per input sequence.

### Class: `TextClassificationModel`

#### `__init__(self, vocab_size, d_model, num_heads, num_layers, max_len, dropout=0.1)`
- **Parameters**:
  - `vocab_size`: The size of the vocabulary (total number of unique tokens).
  - `d_model`: The dimension of the embedding space and the model's internal representations.
  - `num_heads`: The number of attention heads in the multi-head self-attention mechanism.
  - `num_layers`: The number of transformer encoder layers to stack.
  - `max_len`: The maximum length of input sequences.
  - `dropout`: Dropout rate for regularization (default is 0.1).
  
- **Functionality**:
  - Initializes an embedding layer that maps each token in the input sequence to a `d_model`-dimensional vector.
  - Uses the `LearnablePositionalEncoding` class to generate learnable positional encodings that will be added to the input embeddings, helping the model capture sequential information.
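  - Creates a transformer encoder with `num_layers` stacked layers, each using multi-head self-attention, a feed-forward block of width `4*d_model` with GELU activation, and pre-layer normalization (`norm_first=True`).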
  - Includes a batch normalization layer (`nn.BatchNorm1d`) to normalize the activations across the batch for better convergence during training.
  - Defines a fully connected layer (`self.fc`) that outputs a prediction for two classes (binary classification).
  - The model includes dropout (`self.dropout`) for regularization to prevent overfitting.

#### `forward(self, x)`
- **Input**:
  - `x`: A tensor representing a batch of input sequences with shape `[batch_size, seq_length]`.
  
- **Functionality**:
  - The input sequence `x` is first passed through the embedding layer to convert the tokens into their respective embedding vectors.
  - Positional encodings are added to the embeddings to incorporate the sequence order information.
  - The embeddings are then permuted to match the expected input shape for the transformer encoder (`[seq_length, batch_size, d_model]`).
  - The input is processed by the transformer encoder, which applies multi-head self-attention and feed-forward layers.
  - The output is pooled by taking the mean across the sequence length dimension (`dim=0`) to obtain a fixed-length representation for each input sequence.
  - Batch normalization is applied to normalize the pooled output, followed by dropout for regularization.
  - The final output is passed through a fully connected layer (`fc`), which produces the final class predictions.

- **Output**:
  - A tensor of shape `[batch_size, 2]`, where each row holds the unnormalized scores (logits) for the two classes. `CrossEntropyLoss` applies the softmax internally, so the model does not.
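
The shape changes in `forward` can be traced with a small stand-alone sketch. The sizes below are assumptions chosen to run quickly, and the positional encoding, batch normalization, and dropout steps are omitted because they do not change shapes.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # Assumed seed for repeatability

batch_size, seq_length, d_model = 4, 10, 16  # Hypothetical toy sizes
tokens = torch.randint(0, 100, (batch_size, seq_length))

embedding = nn.Embedding(100, d_model)
layer = nn.TransformerEncoderLayer(d_model, nhead=4, dim_feedforward=4 * d_model)
encoder = nn.TransformerEncoder(layer, num_layers=1)
fc = nn.Linear(d_model, 2)

x = embedding(tokens)    # [batch_size, seq_length, d_model]
x = x.permute(1, 0, 2)   # [seq_length, batch_size, d_model]
x = encoder(x)           # Shape unchanged
pooled = x.mean(dim=0)   # [batch_size, d_model], averaged over positions
logits = fc(pooled)      # [batch_size, 2]

print(pooled.shape, logits.shape)
```

The `permute` is needed because `nn.TransformerEncoderLayer` defaults to `batch_first=False`; constructing the layer with `batch_first=True` and pooling over `dim=1` would be an equivalent alternative.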


In [4]:
class TextClassificationModel(nn.Module):
    def __init__(self, vocab_size, d_model, num_heads, num_layers, max_len, dropout=0.1):
        super(TextClassificationModel, self).__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.positional_encoding = LearnablePositionalEncoding(d_model, max_len)
        encoder_layer = nn.TransformerEncoderLayer(d_model, num_heads, dim_feedforward=4*d_model, dropout=dropout, activation='gelu', norm_first=True)
        self.transformer_encoder = nn.TransformerEncoder(encoder_layer, num_layers)
        self.bn = nn.BatchNorm1d(d_model)  # Add batch normalization
        self.fc = nn.Linear(d_model, 2)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        x = self.embedding(x)
        x = self.positional_encoding(x)
        x = x.permute(1, 0, 2)
        x = self.transformer_encoder(x)
        x = x.mean(dim=0)
        x = self.bn(x)  # Apply batch normalization
        x = self.dropout(x)
        x = self.fc(x)
        return x

## Training and Evaluation Functions

### `train_model` Function

The `train_model` function trains a PyTorch model using a given dataloader, optimizer, and loss function. It includes features like gradient clipping, learning rate scheduling, and logging.

#### Parameters:
- `model`: The model to train (should be a PyTorch neural network model).
- `dataloader`: A PyTorch DataLoader that provides the training data in batches.
- `batch_size`: The batch size. The function accepts it but never uses it; the actual batch size is set by `dataloader`.
- `num_epochs`: The number of epochs (iterations over the entire dataset) for training.
- `learning_rate`: The learning rate for the optimizer.

#### Functionality:
1. **Setup**:
   - The model is set to training mode using `model.train()`.
   - `CrossEntropyLoss` is used as the loss function, which is typical for classification tasks.
   - `AdamW` optimizer is used with a weight decay of `0.01` to prevent overfitting.
   - A learning rate scheduler (`OneCycleLR`) is set up to adjust the learning rate dynamically during training. The learning rate will warm up for the first 10% of the training steps, then decay.

2. **Training Loop**:
   - For each epoch:
     - The model processes batches from the `dataloader`.
     - Gradients are cleared, and predictions are made by the model.
     - The loss is calculated using the `CrossEntropyLoss`.
     - Gradients are computed by backpropagation.
     - Gradient clipping is applied before the parameter update, limiting the total gradient norm to a maximum of `1.0` to avoid exploding gradients.
     - The optimizer updates the parameters, and the scheduler is stepped once per batch to adjust the learning rate.
     - Average loss for the epoch is calculated and printed.
     - The current learning rate is also printed for monitoring.
   
3. **Return**:
   - The function returns the trained model after all epochs are completed.
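
The warm-up and decay behaviour of `OneCycleLR` can be inspected without training a model. This is a minimal sketch that assumes a single dummy parameter and 100 total steps; only `max_lr` and `pct_start` mirror the settings described above.

```python
import torch
from torch.optim.lr_scheduler import OneCycleLR

max_lr, total_steps = 1e-4, 100  # total_steps is an assumed toy value

# A single dummy parameter with a zero gradient stands in for a model
param = torch.nn.Parameter(torch.zeros(1))
param.grad = torch.zeros_like(param)

optimizer = torch.optim.AdamW([param], lr=max_lr, weight_decay=0.01)
scheduler = OneCycleLR(optimizer, max_lr=max_lr, total_steps=total_steps, pct_start=0.1)

lrs = []
for step in range(total_steps):
    lrs.append(optimizer.param_groups[0]["lr"])  # Learning rate used for this step
    optimizer.step()
    scheduler.step()  # Stepped once per batch, as in train_model

peak_step = lrs.index(max(lrs))
print(f"start={lrs[0]:.2e}, peak={max(lrs):.2e} at step {peak_step}, end={lrs[-1]:.2e}")
```

With the scheduler's default settings, the learning rate starts well below `max_lr`, reaches it after roughly the first 10% of steps, and then anneals to a value far below the starting rate.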


In [5]:
def train_model(model, dataloader, batch_size, num_epochs, learning_rate):
    model.train()
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.AdamW(model.parameters(), lr=learning_rate, weight_decay=0.01)

    # Calculate total steps
    total_steps = len(dataloader) * num_epochs

    # Create scheduler with warmup
    scheduler = OneCycleLR(optimizer, max_lr=learning_rate, total_steps=total_steps, pct_start=0.1)

    for epoch in range(num_epochs):
        epoch_loss = 0
        for batch_idx, (data, target) in enumerate(dataloader):
            optimizer.zero_grad()
            output = model(data)
            loss = criterion(output, target)
            loss.backward()

            # Gradient clipping
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

            optimizer.step()
            scheduler.step()

            epoch_loss += loss.item()

            # if batch_idx % 100 == 0:
                # print(f"Epoch {epoch+1}/{num_epochs}, Loss: {loss.item()}")

        avg_loss = epoch_loss / len(dataloader)
        print(f"Epoch {epoch+1}/{num_epochs}, Average Loss: {avg_loss}")

        # Print current learning rate
        current_lr = optimizer.param_groups[0]['lr']
        print(f"Current Learning Rate: {current_lr}")

    return model

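def evaluate_model(model, dataloader):
    """Return the classification accuracy (in percent) over the dataloader."""
    model.eval()  # Disable dropout and use BatchNorm running statistics
    correct = 0
    total = 0
    with torch.no_grad():  # Gradients are not needed for evaluation
        for data, target in dataloader:
            output = model(data)
            predicted = output.argmax(dim=1)  # Index of the highest logit per sample
            total += target.size(0)
            correct += (predicted == target).sum().item()
    accuracy = 100 * correct / total
    return accuracy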

### Defining the Hyperparameters for the model

In [6]:
# Hyperparameters
vocab_size = 1000    # Example vocab size
d_model = 256        # Embedding dimension and transformer model dimension
num_heads = 8        # Number of attention heads
num_layers = 4       # Number of Transformer encoder layers
max_len = 50         # Maximum sequence length
dropout = 0.1        # Dropout rate
batch_size = 32      # Batch size
num_epochs = 10      # Number of training epochs
learning_rate = 1e-4 # Peak learning rate (passed to OneCycleLR as max_lr)

### Dataset Loader

In [7]:
# Create Dataset and DataLoader
train_dataset = DummyTextDataset(num_samples=1000, seq_length=max_len, vocab_size=vocab_size)
test_dataset = DummyTextDataset(num_samples=200, seq_length=max_len, vocab_size=vocab_size)
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

### Model Definition and Model Training Pipeline

In [8]:
model = TextClassificationModel(vocab_size, d_model, num_heads, num_layers, max_len, dropout)



In [9]:
trained_model = train_model(model, train_loader, batch_size, num_epochs, learning_rate)

# Evaluate the Model
accuracy = evaluate_model(trained_model, test_loader)
print(f"Test Accuracy: {accuracy:.2f}%")

Epoch 1/10, Average Loss: 0.7417266722768545
Current Learning Rate: 9.99970252619065e-05
Epoch 2/10, Average Loss: 0.673397159203887
Current Learning Rate: 9.679530915668094e-05
Epoch 3/10, Average Loss: 0.607156009413302
Current Learning Rate: 8.794941226490184e-05
Epoch 4/10, Average Loss: 0.5650016283616424
Current Learning Rate: 7.452628030325177e-05
Epoch 5/10, Average Loss: 0.5031066173687577
Current Learning Rate: 5.814494109063477e-05
Epoch 6/10, Average Loss: 0.4424118036404252
Current Learning Rate: 4.0781225898910744e-05
Epoch 7/10, Average Loss: 0.4077887600287795
Current Learning Rate: 2.452945504134528e-05
Epoch 8/10, Average Loss: 0.36373831517994404
Current Learning Rate: 1.1349831933953822e-05
Epoch 9/10, Average Loss: 0.3531329296529293
Current Learning Rate: 2.8320136340088967e-06
Epoch 10/10, Average Loss: 0.3503647171892226
Current Learning Rate: 3.374738093512601e-09
Test Accuracy: 49.00%
