## **Image Transformers with PyTorch**

This notebook trains a Vision Transformer (ViT) image classification model on the CIFAR-10 dataset using PyTorch. The workflow includes single-process data loading (`num_workers=0`), model definition, training and evaluation loops, and performance tracking across epochs.

The ViT processes image patches as token sequences and is optimized using supervised cross-entropy loss. Training metrics such as loss and accuracy are logged to monitor convergence and generalization.

The notebook is designed to be reproducible and configurable, allowing easy adjustment of hyperparameters such as learning rate, batch size, model depth, and attention heads.

### **Load Minimal Libraries**

Import the core PyTorch modules required for model definition and training, along with *tqdm* for progress bars during the training and evaluation loops.
We'll use torchvision utilities to access the CIFAR-10 dataset.

In [None]:
import torch
import random

import torch.nn as nn
import torch.nn.functional as F

from tqdm import tqdm
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

### **Reproducibility**

This cell sets fixed random seeds to make experiments reproducible. It seeds Python's `random` module and PyTorch (CPU and CUDA). It also configures cuDNN to use deterministic algorithms and disables its benchmark mode, which reduces nondeterminism in training results across runs.

In [None]:

# Random Seeds
seed = 42
random.seed(seed)

torch.manual_seed(seed)
torch.cuda.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
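
As a quick sanity check of the seeding above, the following minimal sketch wraps the same calls in a helper and confirms that re-seeding reproduces the same random draws. The helper name `set_seed` is hypothetical and not part of the notebook. The check only exercises the CPU generators; determinism on GPU can depend on further settings that are not covered here.

```python
import random

import torch


def set_seed(seed: int) -> None:
    """Hypothetical helper: seed Python's and PyTorch's random generators."""
    random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # silently ignored when CUDA is unavailable


# Draw once after seeding
set_seed(42)
first_py, first_torch = random.random(), torch.rand(3)

# Re-seed and draw again: the values should repeat exactly
set_seed(42)
second_py, second_torch = random.random(), torch.rand(3)

same_draws = first_py == second_py and torch.equal(first_torch, second_torch)
print(f"Draws repeat after re-seeding: {same_draws}")
```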

### **Hyperparameters**

Define the main optimization hyperparameters (learning rate, batch size, and number of training epochs) along with Vision Transformer-specific settings: patch size, embedding dimension, number of attention heads, and number of Transformer layers. The number of target classes is set to match the CIFAR-10 dataset.

These parameters control model capacity, optimization behavior, and overall training configuration.

In [None]:
# Training and model hyperparameters
LEARNING_RATE = 3e-4
BATCH_SIZE = 512
EPOCHS = 10

PATCH_SIZE = 4

HEADS = 8
LAYERS = 6
EMBED_DIM = 256

CIFAR_CLASSES = 10
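
The values above fix a few derived quantities that are worth working out by hand. The sketch below does the arithmetic in plain Python; `IMG_SIZE = 32` is taken from the model's default `img_size` argument, and the derived variable names are illustrative only.

```python
# Values copied from the hyperparameter cell; IMG_SIZE matches the model default
IMG_SIZE = 32
PATCH_SIZE = 4
EMBED_DIM = 256
HEADS = 8

# The patch grid must tile the image, and heads must split the embedding evenly
assert IMG_SIZE % PATCH_SIZE == 0
assert EMBED_DIM % HEADS == 0

patches_per_side = IMG_SIZE // PATCH_SIZE   # 32 / 4 = 8
num_patches = patches_per_side ** 2         # 8 * 8 = 64
seq_len = num_patches + 1                   # 64 patches + 1 class token = 65
head_dim = EMBED_DIM // HEADS               # 256 / 8 = 32
values_per_patch = 3 * PATCH_SIZE ** 2      # 3 channels * 4 * 4 pixels = 48

print(f"patches: {num_patches}, sequence length: {seq_len}")
print(f"per-head dimension: {head_dim}, raw values per patch: {values_per_patch}")
```

Each 48-value patch is therefore projected up to a 256-dimensional token, and self-attention operates over 65 tokens per image.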

### **Data Preparation**

This cell prepares the CIFAR-10 dataset for training and evaluation. It defines a basic preprocessing pipeline that converts images to tensors and normalizes each channel by subtracting a per-channel mean and dividing by a per-channel standard deviation. The training and test splits of CIFAR-10 are downloaded and loaded using `torchvision.datasets`, and PyTorch `DataLoader`s handle batching (and, for the training split, shuffling) during training and evaluation.

In [None]:
# Data preparation
transform = transforms.Compose([
      transforms.ToTensor(),
      transforms.Normalize(
            mean=(0.4914, 0.4822, 0.4465),
            std=(0.2470, 0.2435, 0.2616)
      )
])

# Load CIFAR10 from torchvision for simplicity
train_data = datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
test_data = datasets.CIFAR10(root='./data', train=False, download=True, transform=transform)

train_loader = DataLoader(train_data, batch_size=BATCH_SIZE, shuffle=True, num_workers=0)
test_loader = DataLoader(test_data, batch_size=BATCH_SIZE, shuffle=False, num_workers=0)
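
To make the `Normalize` step concrete, here is a minimal sketch of the same per-channel arithmetic written by hand with broadcasting. It uses a random tensor as a stand-in for a `ToTensor()` output so that it runs without downloading the dataset; the variable names are illustrative.

```python
import torch

# Per-channel statistics copied from the transform above, shaped for broadcasting
mean = torch.tensor([0.4914, 0.4822, 0.4465]).view(3, 1, 1)
std = torch.tensor([0.2470, 0.2435, 0.2616]).view(3, 1, 1)

# Synthetic stand-in for a ToTensor() output: shape (C, H, W), values in [0, 1)
img = torch.rand(3, 32, 32)

# The same arithmetic Normalize applies: (x - mean) / std, channel by channel
normalized = (img - mean) / std

# Smallest and largest reachable values per channel (inputs 0.0 and 1.0)
lowest = ((0.0 - mean) / std).flatten()
highest = ((1.0 - mean) / std).flatten()
print("per-channel minimum:", lowest)
print("per-channel maximum:", highest)

# The transform is invertible, which is handy when visualizing images later
restored = normalized * std + mean
```

With these statistics, pixel values end up roughly between -2 and 2.1, centered near zero for typical CIFAR-10 images.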


### **Network Definition**

Define a Vision Transformer (ViT) model for image classification. Images are split into non-overlapping patches using a convolutional layer with stride equal to the patch size. A learnable class token and positional embeddings are also defined.

The resulting token sequence (class token followed by the patch tokens) is processed by a stack of Transformer encoder layers with multi-head self-attention. Classification uses only the class token's output: it is layer-normalized and passed through a linear classification head.

In [None]:
# Vision Transformer Model
class VisionTransformer(nn.Module):
    def __init__(self,
                 num_classes,
                 embed_dim,
                 num_heads,
                 num_layers,
                 img_size=32,
                 patch_size=4):
        super().__init__()
        self.patch_size = patch_size
        num_patches = (img_size // patch_size) ** 2

        # Patch embedding
        self.patch_embedding = nn.Conv2d(3,
                                         embed_dim,
                                         kernel_size=patch_size,
                                         stride=patch_size)
        self.learnable_pos_embed = nn.Parameter(torch.randn(1, num_patches + 1, embed_dim))
        self.cls_token = nn.Parameter(torch.randn(1, 1, embed_dim))

        # Transformer encoder
        encoder_layer = nn.TransformerEncoderLayer(d_model=embed_dim,
                                                   nhead=num_heads,
                                                   dim_feedforward=embed_dim*4,
                                                   activation='gelu',
                                                   dropout=0.1,
                                                   batch_first=True)
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)

        # Classification head
        self.norm = nn.LayerNorm(embed_dim)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        B = x.shape[0]

        # Patch embedding
        x = self.patch_embedding(x)  # B, embed_dim, H/p, W/p
        x = x.flatten(2).transpose(1, 2)  # B, num_patches, embed_dim (1D sequence)

        # Add cls token
        cls_tokens = self.cls_token.expand(B, -1, -1) # expand singleton dimension
        x = torch.cat([cls_tokens, x], dim=1)

        # Add positional embedding
        x = x + self.learnable_pos_embed

        # Run Transformer layers
        x = self.transformer(x)

        # Classification
        x = self.norm(x[:, 0]) # normalize CLS token
        x = self.head(x)
        return x
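
The shape comments in `forward` can be checked in isolation. The sketch below repeats only the patch-embedding and class-token steps on a dummy batch, using a zero tensor as a stand-in for the learnable class token; it is a shape walk-through under those assumptions, not a replacement for the model above.

```python
import torch
import torch.nn as nn

batch, embed_dim, patch_size = 2, 256, 4
imgs = torch.randn(batch, 3, 32, 32)  # dummy CIFAR-10-sized batch

# Same layer configuration as the model's patch embedding
patch_embedding = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)
cls_token = torch.zeros(1, 1, embed_dim)  # stand-in for the learnable parameter

x = patch_embedding(imgs)
shape_after_conv = tuple(x.shape)     # (B, embed_dim, H/p, W/p)

x = x.flatten(2).transpose(1, 2)
shape_as_tokens = tuple(x.shape)      # (B, num_patches, embed_dim)

x = torch.cat([cls_token.expand(batch, -1, -1), x], dim=1)
shape_with_cls = tuple(x.shape)       # (B, num_patches + 1, embed_dim)

print(shape_after_conv, shape_as_tokens, shape_with_cls)
```

Because the kernel size equals the stride, each output position of the convolution sees exactly one 4x4 patch, which is why the convolution acts as a per-patch linear projection.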

### **Initialize the Network**

Initialize the ViT model with the specified hyperparameters and number of output classes. Then detect whether a GPU is available and move the model to it, falling back to the CPU otherwise.


In [None]:
# Initialize model
model = VisionTransformer(num_classes=CIFAR_CLASSES,
                          embed_dim=EMBED_DIM,
                          num_heads=HEADS,
                          num_layers=LAYERS,
                          patch_size=PATCH_SIZE)
# push model to GPU if available
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = model.to(device)
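
Once the model lives on a device, every input batch has to be moved to the same device before the forward pass. The following minimal sketch illustrates this with a small linear layer as a stand-in for the ViT (an assumption made only to keep the example short and self-contained).

```python
import torch
import torch.nn as nn

# Same device-selection logic as above, wrapped in torch.device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Small stand-in module; the notebook's model is moved the same way
layer = nn.Linear(256, 10).to(device)

# Inputs must be created on, or moved to, the module's device
inputs = torch.randn(4, 256).to(device)
out = layer(inputs)

print(f"device: {device}, output shape: {tuple(out.shape)}")
```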


### **Actual Training**

Finally, set up the optimization components along with the training and evaluation loops. First instantiate the cross-entropy loss and the AdamW optimizer, then define a cosine annealing learning rate scheduler that is stepped once per epoch.

The *train_epoch* function performs one training epoch with a gradient update per batch. The *test* function evaluates the model, computing classification accuracy without gradient updates.
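
To see what cosine annealing does to the learning rate, the sketch below evaluates the closed-form expression given in the PyTorch documentation for `CosineAnnealingLR`, using this notebook's `LEARNING_RATE` and `EPOCHS`. The minimum rate `ETA_MIN = 0.0` is assumed to match the scheduler's default, and `cosine_lr` is an illustrative helper, not a PyTorch function.

```python
import math

LEARNING_RATE = 3e-4  # initial learning rate, as in the hyperparameter cell
EPOCHS = 10           # used as T_max
ETA_MIN = 0.0         # assumed: the scheduler's default minimum learning rate


def cosine_lr(epoch: int) -> float:
    """Learning rate in effect during a given 0-based epoch."""
    cosine = math.cos(math.pi * epoch / EPOCHS)
    return ETA_MIN + 0.5 * (LEARNING_RATE - ETA_MIN) * (1 + cosine)


schedule = [cosine_lr(epoch) for epoch in range(EPOCHS)]
for epoch, lr in enumerate(schedule, start=1):
    print(f"epoch {epoch:2d}: lr = {lr:.2e}")
```

The rate starts at the full `LEARNING_RATE`, passes through half of it midway, and decays smoothly toward `ETA_MIN` by the final epoch.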

In [None]:

# Optimization setup
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=LEARNING_RATE, weight_decay=0.05)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=EPOCHS)

# Training loop
def train_epoch(model,
                loader,
                criterion,
                optimizer):
    model.train()
    total_loss, correct, total = 0.0, 0, 0

    pbar = tqdm(loader, desc="Train", leave=False)

    # enumerate gives an exact batch count; pbar.n is only refreshed periodically
    for step, (imgs, labels) in enumerate(pbar, start=1):
        imgs, labels = imgs.to(device), labels.to(device)

        optimizer.zero_grad()
        outputs = model(imgs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        total_loss += loss.item()
        _, predicted = outputs.max(1)
        correct += predicted.eq(labels).sum().item()
        total += labels.size(0)

        # Running averages over the batches seen so far
        pbar.set_postfix(
            loss=f"{total_loss / step:.4f}",
            acc=f"{100. * correct / total:.2f}%"
        )

    return total_loss / len(loader), 100. * correct / total


def test(model, loader):
    model.eval()
    correct, total = 0, 0

    pbar = tqdm(loader, desc="Val", leave=False)
    with torch.no_grad():
        for imgs, labels in pbar:
            imgs, labels = imgs.to(device), labels.to(device)
            outputs = model(imgs)
            _, predicted = outputs.max(1)
            correct += predicted.eq(labels).sum().item()
            total += labels.size(0)

            pbar.set_postfix(
                acc=f"{100. * correct / total:.2f}%"
            )

    return 100. * correct / total

# Train the model
print(f"Training on {device}")
for epoch in range(EPOCHS):
    train_loss, train_acc = train_epoch(model, train_loader, criterion, optimizer)
    test_acc = test(model, test_loader)
    scheduler.step()

    print(f'Epoch {epoch+1}/{EPOCHS} | Train Loss: {train_loss:.3f} | Train Acc: {train_acc:.2f}% | Test Acc: {test_acc:.2f}%')

print("Training complete!")

Training on cuda

Epoch 1/10 | Train Loss: 1.838 | Train Acc: 32.31% | Test Acc: 44.16%

Epoch 2/10 | Train Loss: 1.450 | Train Acc: 47.64% | Test Acc: 50.48%

Epoch 3/10 | Train Loss: 1.280 | Train Acc: 53.81% | Test Acc: 54.25%

Epoch 4/10 | Train Loss: 1.172 | Train Acc: 57.96% | Test Acc: 57.29%

Epoch 5/10 | Train Loss: 1.078 | Train Acc: 61.45% | Test Acc: 59.18%

Epoch 6/10 | Train Loss: 1.000 | Train Acc: 64.22% | Test Acc: 60.89%

KeyboardInterrupt: