## Project: ViT vs CNN

The objective of this project is to compare the performance of two model families on image classification: Vision Transformers (ViT) and Convolutional Neural Networks (CNN). We use three labelled datasets for the comparison: CIFAR-10 (10 classes), CIFAR-100 (100 classes) and ImageNet-200 (200 classes).

This project uses information from these sources:

- An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (Vision Transformer, ViT), Google Research, Brain Team, ICLR 2021. https://arxiv.org/abs/2010.11929

- https://www.geeksforgeeks.org/deep-learning/vision-transformer-vit-architecture/

- https://sh-tsang.medium.com/review-vision-transformer-vit-406568603de0

#### **1. Prepare the data**

In [1]:
# Import necessary libraries
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
import time
import numpy as np
import os
import kagglehub
import shutil
from tqdm import tqdm


In [2]:
def prepare_writable_data(read_only_dir):
    """Copies the dataset from the read-only input path to the writable working path."""

    if read_only_dir is None or not os.path.exists(read_only_dir):
        return None

    # Define the destination path in the writable working directory
    writable_dir = os.path.join('/kaggle/working/', 'tiny-imagenet-writable')

    if os.path.exists(writable_dir):
        print("Writable directory already exists. Skipping copy.")
        return writable_dir

    print(f"Copying data from {read_only_dir} to {writable_dir}...")

    try:
        # Use copytree to copy the entire directory structure
        shutil.copytree(read_only_dir, writable_dir)
        print("Data copied successfully.")
        return writable_dir
    except Exception as e:
        print(f"Error during data copying: {e}")
        return None

In [3]:
def sort_tiny_imagenet_validation(data_dir):
    """
    Sorts the validation images of Tiny ImageNet into class-specific folders.
    This is necessary because the raw dataset dumps all validation images
    into a single folder, while ImageFolder expects class subdirectories.
    """

    val_dir = os.path.join(data_dir, 'val')
    val_images_dir = os.path.join(val_dir, 'images')
    annotations_file = os.path.join(val_dir, 'val_annotations.txt')

    print("Sorting Tiny ImageNet validation set...")

    # 1. Read the annotations file (class directories are created in step 2)
    with open(annotations_file, 'r') as f:
        annotations = f.readlines()

    # 2. Process each annotation line
    for line in tqdm(annotations, desc="Processing validation images"):
        parts = line.strip().split('\t')
        if len(parts) < 2:
            continue

        filename = parts[0]  # e.g., 'val_0.JPEG'
        synset_id = parts[1] # e.g., 'n01440764'

        # Define source and destination paths
        src_path = os.path.join(val_images_dir, filename)
        dst_dir = os.path.join(val_dir, synset_id)
        dst_path = os.path.join(dst_dir, filename)

        # Create the destination class directory if it doesn't exist
        os.makedirs(dst_dir, exist_ok=True)

        # Move the image
        if os.path.exists(src_path):
            shutil.move(src_path, dst_path)

    # 3. Clean up the original 'images' folder (only removed if it is empty)
    if os.path.exists(val_images_dir):
        try:
            os.rmdir(val_images_dir)
        except OSError:
            # Directory might not be empty if some files failed to move
            print("Warning: Could not remove original 'val/images' directory.")

    print("Validation set sorting complete.")

In [4]:
print("Attempting to download Tiny ImageNet via KaggleHub...")

# This returns the local path where the dataset files are stored
TINY_IMAGENET_PATH = kagglehub.dataset_download("nikhilshingadiya/tinyimagenet200")
TINY_IMAGENET_ROOT = None
WRITABLE_DATA_ROOT = None
print(f"Tiny ImageNet downloaded successfully to: {TINY_IMAGENET_PATH}")

TINY_IMAGENET_ROOT = os.path.join(TINY_IMAGENET_PATH, 'tiny-imagenet-200')
if not os.path.isdir(TINY_IMAGENET_ROOT):
    # If the structure is flat, the path itself might be the root
    TINY_IMAGENET_ROOT = TINY_IMAGENET_PATH

Attempting to download Tiny ImageNet via KaggleHub...
Using Colab cache for faster access to the 'tinyimagenet200' dataset.
Tiny ImageNet downloaded successfully to: /kaggle/input/tinyimagenet200


#### **2. Model architecture:**







##### **2.1 Vision Transformer (ViT) Architecture Overview**

![Vision Transformer Architecture](ViTarchi.png)


The Vision Transformer (ViT) adapts the Transformer architecture from natural language processing to computer vision by representing images as sequences of visual tokens, similar to words in a sentence. Its design consists of several key components, each contributing to effective image representation and classification:


**1. Image Patching and embedding:**

In this stage, ViT converts a 2D image into a sequence of fixed-size, non-overlapping patches, which play the same role as tokens in natural language processing (NLP). Each patch of size $P \times P \times C$ is flattened into a one-dimensional vector of length $P^2 \times C$. These vectors are then projected into a shared $D$-dimensional embedding space by a learnable linear transformation, producing the patch embeddings from which the encoder extracts higher-level visual features.
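
The patching step can be sketched in plain Python. This is a minimal illustration, not the notebook's implementation: `patchify` is a hypothetical helper that works on nested lists, and it flattens each patch in row-major, channels-last order (the order does not matter once a learned projection is applied).

```python
def patchify(image, patch_size):
    """Split an H x W x C image (nested lists) into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Collect every channel value of every pixel inside this patch
            flat = [
                value
                for row in image[top:top + patch_size]
                for pixel in row[left:left + patch_size]
                for value in pixel
            ]
            patches.append(flat)
    return patches


# Toy 4 x 4 single-channel image whose pixel value encodes its position
toy_image = [[[r * 4 + c] for c in range(4)] for r in range(4)]
toy_patches = patchify(toy_image, patch_size=2)
print(toy_patches[0])  # top-left patch: [0, 1, 4, 5]

# CIFAR-sized input: 32 x 32 x 3 with P = 4
cifar_like = [[[0.0, 0.0, 0.0] for _ in range(32)] for _ in range(32)]
cifar_patches = patchify(cifar_like, patch_size=4)
print(len(cifar_patches), len(cifar_patches[0]))  # 64 patches, each of length 48
```

For a $32 \times 32 \times 3$ image with $P = 4$ this gives $(32/4)^2 = 64$ patches of length $4^2 \times 3 = 48$, which the linear projection then maps to $D$ dimensions.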

**2. Positional Encoding**

Since self-attention has no built-in notion of token order, positional information must be provided explicitly. ViT adds learnable positional embeddings to the patch embeddings to encode the spatial arrangement of patches within the image. These embeddings let the model learn the absolute and relative positions of patches; when the input resolution changes, the learned embeddings are typically interpolated to the new patch grid.

**3. Adding the Classification Token (CLS Token)**

A learnable classification token (CLS) is prepended to the sequence of patch embeddings. This token aggregates information from all patches through self-attention and serves as a global image representation. Unlike CNNs, which rely on pooling operations, ViT uses the final representation of the CLS token directly for classification.

**4. Transformer Encoder (Pre-LayerNorm Architecture)**

![Pre-LayerNorm](prelayer_norm.png)

The core of ViT consists of stacked Transformer encoder blocks following a pre-layer normalization (Pre-LN) design. In this setup, LayerNorm is applied before both the multi-head self-attention and feed-forward sublayers, improving gradient stability and enabling deeper architectures.

$\mathrm{LayerNorm}(x) = \frac{x - \mu}{\sigma} \odot \gamma + \beta$

where

* $\mu, \sigma$ are the mean and standard deviation across the feature dimension
* $\gamma, \beta$ are learnable scale and shift parameters
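
As a rough sketch, the formula above can be applied to a single token vector in plain Python. The `eps` term is an assumption added for numerical stability (PyTorch's `LayerNorm` uses a similar small constant); the function name and inputs are illustrative only.

```python
import math


def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize one feature vector, then apply the learnable scale and shift."""
    mean = sum(x) / len(x)
    variance = sum((v - mean) ** 2 for v in x) / len(x)
    std = math.sqrt(variance + eps)  # eps is an assumed stabilizer, not in the formula above
    return [g * (v - mean) / std + b for v, g, b in zip(x, gamma, beta)]


token = [1.0, 2.0, 3.0, 4.0]
normalized = layer_norm(token, gamma=[1.0] * 4, beta=[0.0] * 4)
out_mean = sum(normalized) / len(normalized)
out_var = sum((v - out_mean) ** 2 for v in normalized) / len(normalized)
print(out_mean, out_var)  # close to 0 and close to 1
```

With $\gamma = 1$ and $\beta = 0$ the output has approximately zero mean and unit variance across the features of that token, independent of the other tokens in the batch.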


Each encoder block includes multi-head self-attention, a feed-forward network, residual connections, and layer normalization.
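
To get a feel for where the parameters of such a block sit, the count can be worked out by hand. The sketch below assumes the usual layout of a Transformer encoder layer (a fused query/key/value projection with bias, an output projection, two feed-forward linear layers and two LayerNorms) and a patch embedding with bias; all function names are hypothetical.

```python
def linear_params(d_in, d_out):
    """Weights plus bias of a fully connected layer."""
    return d_in * d_out + d_out


def encoder_block_params(embed_dim, mlp_ratio=4.0):
    hidden = int(embed_dim * mlp_ratio)
    # Fused Q/K/V projection plus the output projection
    attention = linear_params(embed_dim, 3 * embed_dim) + linear_params(embed_dim, embed_dim)
    # Expand to the hidden width, then project back
    ffn = linear_params(embed_dim, hidden) + linear_params(hidden, embed_dim)
    # Two LayerNorms, each with a scale and a shift vector
    norms = 2 * (2 * embed_dim)
    return attention + ffn + norms


def vit_params(img_size, patch_size, in_channels, num_classes, embed_dim, depth, mlp_ratio=4.0):
    num_patches = (img_size // patch_size) ** 2
    patch_embed = in_channels * patch_size * patch_size * embed_dim + embed_dim
    cls_and_pos = embed_dim + (num_patches + 1) * embed_dim
    head = linear_params(embed_dim, num_classes)
    return patch_embed + cls_and_pos + depth * encoder_block_params(embed_dim, mlp_ratio) + head


block_total = encoder_block_params(128)
vit_total = vit_params(img_size=32, patch_size=4, in_channels=3,
                       num_classes=10, embed_dim=128, depth=6)
print(block_total, vit_total)
```

Under these assumptions, a block with $D = 128$ has 198,272 parameters, and a 6-block model for $32 \times 32$ inputs with 10 classes comes to about 1.21M, almost all of it in the encoder blocks.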


**5. Multi-Head Self-Attention (MSA)**

Self-attention allows each patch to interact with every other patch, capturing long-range dependencies across the image. Queries, keys, and values are computed through linear projections of the input embeddings, and attention scores are obtained using scaled dot-product attention.

By using multiple attention heads, the model can focus on different aspects of the image simultaneously—such as textures, edges, colors, or global structure—leading to richer feature representations. The outputs of all heads are concatenated and linearly projected to form the final attention output.
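
A single attention head can be sketched in plain Python as follows. This is a simplified illustration: it omits the learned query/key/value projections and the multi-head split, and the helper names are assumptions rather than part of the notebook.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """softmax(Q K^T / sqrt(d_k)) V for lists of row vectors."""
    d_k = len(keys[0])
    scores = [
        [sum(q * k for q, k in zip(q_row, k_row)) / math.sqrt(d_k) for k_row in keys]
        for q_row in queries
    ]
    weights = [softmax(row) for row in scores]
    outputs = [
        [sum(w * v_row[j] for w, v_row in zip(w_row, values)) for j in range(len(values[0]))]
        for w_row in weights
    ]
    return outputs, weights


# Two identical keys receive equal attention, so the output is the mean of the values
attn_out, attn_weights = scaled_dot_product_attention(
    queries=[[1.0, 2.0]],
    keys=[[1.0, 0.0], [1.0, 0.0]],
    values=[[0.0, 2.0], [4.0, 6.0]],
)
print(attn_weights, attn_out)
```

Each row of the weight matrix sums to 1, so every output token is a weighted average of the value vectors. A multi-head layer runs several such heads on different learned projections and concatenates their outputs.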

**6. Feed-Forward Network (FFN)**

Following the attention block, each token is processed independently by a feed-forward network consisting of two fully connected layers with a GELU activation in between. This network expands the embedding dimension and then projects it back, enabling non-linear transformations that enhance representational capacity while sharing weights across tokens.

$\mathrm{FFN}(x) = W_2\, \mathrm{GELU}(W_1 x + b_1) + b_2$
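
The FFN formula can be written out for one token in plain Python. This is a sketch with made-up toy weights; it uses the exact (erf-based) GELU. Note that PyTorch's `nn.TransformerEncoderLayer` uses ReLU unless `activation="gelu"` is passed.

```python
import math


def gelu(x):
    """Exact GELU: x times the standard normal CDF evaluated at x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def ffn(x, w1, b1, w2, b2):
    """FFN(x) = W2 * GELU(W1 x + b1) + b2 for a single token vector x."""
    hidden = [gelu(sum(w * v for w, v in zip(row, x)) + b) for row, b in zip(w1, b1)]
    return [sum(w * h for w, h in zip(row, hidden)) + b for row, b in zip(w2, b2)]


# Toy sizes: embedding dimension 2, hidden dimension 4 (illustrative weights only)
w1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 1.0]]
b1 = [0.0, 0.0, 0.0, 0.0]
w2 = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]
b2 = [0.0, 0.0]
ffn_out = ffn([1.0, -1.0], w1, b1, w2, b2)
print(gelu(1.0), ffn_out)
```

The same weights are applied to every token, so the FFN mixes information across features but not across patches; mixing across patches is left to self-attention.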

**7. Residual Connections and Layer Normalization**

Residual (skip) connections are used throughout the encoder to preserve information from earlier layers and prevent performance degradation in deep networks. Layer normalization stabilizes training by normalizing feature distributions, while the pre-LN design ensures well-conditioned gradients and consistent scaling across layers.

**8. Classification Head (MLP Head)**

The final representation of the CLS token is passed through a small classification head (a multi-layer perceptron in general; a single linear layer in the implementation below) to produce class logits. A softmax function converts these logits into class probabilities, enabling multi-class classification and training with cross-entropy loss.
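
The softmax and cross-entropy step can be sketched in plain Python. The helper names are illustrative. The example also shows a useful reference point: a model that outputs identical logits for all $K$ classes has a loss of $\ln K$, which is the level an untrained classifier should start near.

```python
import math


def softmax(logits):
    """Convert logits to probabilities in a numerically stable way."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Negative log-probability assigned to the correct class."""
    return -math.log(softmax(logits)[target])


# Loss of a uniform guess for the three class counts used in this project
chance_losses = {k: cross_entropy([0.0] * k, target=0) for k in (10, 100, 200)}
for num_classes, loss in chance_losses.items():
    print(num_classes, round(loss, 4))  # roughly 2.30, 4.61 and 5.30
```

Comparing a reported training loss against $\ln K$ is a quick sanity check that the model has learned something beyond guessing.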

**9. Training Vision Transformers**

Unlike CNNs, ViTs have minimal inductive bias, as they do not inherently encode locality or translation invariance. As a result, they typically require large-scale datasets and strong data augmentation to generalize well. To address this, ViTs are often pretrained on large datasets—using supervised or self-supervised methods—before being fine-tuned on downstream tasks with fewer labeled samples. Fine-tuning commonly employs techniques such as layer-wise learning rate decay to improve performance and stability.
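
Layer-wise learning rate decay can be illustrated with a few lines of Python. This is only a sketch of the idea, since this notebook trains from scratch and does not fine-tune; the function name, the decay factor of 0.75 and the indexing convention are all assumptions chosen for illustration.

```python
def layerwise_learning_rates(base_lr, decay, num_layers):
    """Give the last layer base_lr and scale each earlier layer down by `decay`."""
    return [base_lr * decay ** (num_layers - 1 - i) for i in range(num_layers)]


layer_lrs = layerwise_learning_rates(base_lr=1e-3, decay=0.75, num_layers=6)
for index, lr in enumerate(layer_lrs):
    print(f"encoder block {index}: lr = {lr:.6f}")
```

Earlier blocks, which hold more generic pretrained features, are updated more cautiously than later blocks, which need to adapt most to the new task. In PyTorch these values would typically be supplied through per-parameter-group learning rates in the optimizer.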









In [5]:

# Patch Embedding (Tokenization):
# This module converts the original 2D images into
# a 1D sequence of embedded vectors.

class PatchEmbedding(nn.Module):
    def __init__(self, img_size, patch_size, in_channels, embed_dim):
        super().__init__()
        self.patch_size = patch_size
        self.num_patches = (img_size // patch_size) ** 2
        # Conv2d implements the non-overlapping patch embedding/linear projection
        self.proj = nn.Conv2d(in_channels, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        # x: (B, C, H, W) -> (B, D, H', W')
        x = self.proj(x)
        # Flatten H' x W' into sequence length N, transpose to (B, N, D)
        x = x.flatten(2).transpose(1, 2)
        return x


In [6]:
# --- Vision Transformer (ViT) Model ---

class VisionTransformer(nn.Module):
    def __init__(self, img_size=32, patch_size=4, in_channels=3, num_classes=10,
                 embed_dim=128, depth=6, num_heads=8, mlp_ratio=4.0):
        super().__init__()

        self.patch_embed = PatchEmbedding(img_size, patch_size, in_channels, embed_dim)

        num_patches = self.patch_embed.num_patches
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))
        self.pos_drop = nn.Dropout(p=0.1)

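        # Define a single standard Transformer Encoder Layer (Pre-LN via norm_first=True).
        # Note: nn.TransformerEncoderLayer defaults to activation="relu"; pass
        # activation="gelu" to match the GELU feed-forward network described above.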
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim,
            nhead=num_heads,
            dim_feedforward=int(embed_dim * mlp_ratio),
            dropout=0.1,
            batch_first=True,
            norm_first=True
        )

        # Stack the layers using nn.TransformerEncoder
        self.blocks = nn.TransformerEncoder(encoder_layer, num_layers=depth)

        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        B = x.shape[0]
        x = self.patch_embed(x)

        cls_tokens = self.cls_token.expand(B, -1, -1)
        x = torch.cat((cls_tokens, x), dim=1)

        x = self.pos_drop(x + self.pos_embed)

        # The standard PyTorch TransformerEncoder handles the sequence of blocks
        x = self.blocks(x)

        # Classification uses the CLS token output
        cls_output = x[:, 0]
        x = self.head(cls_output)
        return x



##### **2.2 CNN architecture**

In this project, we will build a CNN architecture inspired by ResNet, which is highly effective for image classification.

This model uses $3 \times 3$ convolutions and gradually increases the channel depth while reducing the spatial resolution via stride-2 convolutions in the main blocks.

###### **2.2.1 Structure of the Core Building Block (`CNNBlock`)**

The `CNNBlock` implements the Basic Block structure from ResNet, ensuring efficient training even with limited depth.

$$
\text{Output} = \text{ReLU}\Big(\text{BN}\big(\text{Conv}_2(\text{ReLU}(\text{BN}(\text{Conv}_1(\text{Input}))))\big) + \text{Shortcut}(\text{Input})\Big)
$$

| Step | Operation | Output Size |
| :--- | :--- | :--- |
| **Input** | Feature Map | $H \times W \times C_{in}$ |
| **Conv 1** | Conv $3 \times 3$, Stride $S$, Padding 1 | $H/S \times W/S \times C_{out}$ |
| **Conv 2** | Conv $3 \times 3$, Stride 1, Padding 1 | $H/S \times W/S \times C_{out}$ |
| **Shortcut** | Conv $1 \times 1$ (if stride $\neq 1$ or $C_{in} \neq C_{out}$) | $H/S \times W/S \times C_{out}$ |
| **Output** | Output Feature Map (after addition and ReLU) | $H/S \times W/S \times C_{out}$ |
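
The output sizes in the table follow from the standard convolution size formula, $\lfloor (H + 2p - k) / s \rfloor + 1$. A small sketch, with a hypothetical helper name, checks the cases used in this block:

```python
def conv_output_size(size, kernel, stride, padding):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


# 3x3 convolution with padding 1: stride 1 keeps the size, stride 2 halves it
same = conv_output_size(32, kernel=3, stride=1, padding=1)
halved = conv_output_size(32, kernel=3, stride=2, padding=1)
# The 1x1 shortcut convolution with stride 2 and no padding matches the main path
shortcut = conv_output_size(32, kernel=1, stride=2, padding=0)
print(same, halved, shortcut)  # 32 16 16
```

Because the main path and the shortcut produce the same spatial size and channel count, their outputs can be added element-wise.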

---

###### **2.2.2 Full Model Architecture (`CustomCNN`)**

Assuming an input image size of **$32 \times 32$** (for CIFAR), here is the layer breakdown:

| Layer Name | Module/Operation | Stride | Output Channels (Depth) | Output Size (H x W) | Notes |
| :--- | :--- | :--- | :--- | :--- | :--- |
| **Input** | Image | - | 3 | $32 \times 32$ | |
| **Conv 1** | Conv $3 \times 3$ + BN + ReLU | 1 | 16 | $32 \times 32$ | Initial Feature Map |
| **Layer 1** | **2x** `CNNBlock` | 1 | 16 | $32 \times 32$ | No spatial reduction |
| **Layer 2** | **2x** `CNNBlock` | 2 (in first block) | 32 | $16 \times 16$ | Spatial Downsampling (32 $\to$ 16) |
| **Layer 3** | **2x** `CNNBlock` | 2 (in first block) | 64 | $8 \times 8$ | Spatial Downsampling (16 $\to$ 8) |
| **Avg Pool** | AdaptiveAvgPool2d | - | 64 | $1 \times 1$ | Global Pooling |
| **Linear** | Fully Connected | - | Num Classes | 1 | Final Classification |
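
The parameter count implied by this table can be worked out by hand. The sketch below assumes bias-free convolutions and counts two learnable vectors per batch-norm layer (running statistics are buffers, not parameters); the function names are hypothetical.

```python
def conv_params(c_in, c_out, kernel):
    """Weights of a bias-free convolution."""
    return c_in * c_out * kernel * kernel


def bn_params(channels):
    """Learnable scale and shift of a batch-norm layer."""
    return 2 * channels


def block_params(c_in, c_out, stride):
    total = conv_params(c_in, c_out, 3) + bn_params(c_out)
    total += conv_params(c_out, c_out, 3) + bn_params(c_out)
    if stride != 1 or c_in != c_out:
        # Projection shortcut: 1x1 convolution plus batch norm
        total += conv_params(c_in, c_out, 1) + bn_params(c_out)
    return total


def custom_cnn_params(num_classes):
    total = conv_params(3, 16, 3) + bn_params(16)  # stem
    c_in = 16
    for c_out, stride in [(16, 1), (32, 2), (64, 2)]:
        for block_stride in [stride, 1]:  # two blocks per stage
            total += block_params(c_in, c_out, block_stride)
            c_in = c_out
    total += 64 * num_classes + num_classes  # final linear layer
    return total


cnn_total_10 = custom_cnn_params(10)
cnn_total_100 = custom_cnn_params(100)
print(cnn_total_10, cnn_total_100)
```

Under these assumptions the network has roughly 0.18M parameters, and only the final linear layer grows with the number of classes.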

In [7]:


class CNNBlock(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        # 3x3 Conv, BN, ReLU
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        # 3x3 Conv, BN (Residual part)
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3, stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)

        # Shortcut connection for ResNet-like structure
        self.shortcut = nn.Sequential()
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=stride, bias=False),
                nn.BatchNorm2d(out_channels)
            )

    def forward(self, x):
        # Functional ReLU avoids constructing a new nn.ReLU module on every call
        result = torch.relu(self.bn1(self.conv1(x)))
        result = self.bn2(self.conv2(result))
        result += self.shortcut(x)
        result = torch.relu(result)
        return result

In [8]:
# --- Custom CNN Model ---

class CustomCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.in_channels = 16

        self.conv1 = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=1, bias=False),
            nn.BatchNorm2d(16),
            nn.ReLU()
        )

        # ResNet-like blocks (16 -> 32 -> 64 channels)
        self.layer1 = self._make_layer(16, 2, stride=1)
        self.layer2 = self._make_layer(32, 2, stride=2)
        self.layer3 = self._make_layer(64, 2, stride=2)

        self.avg_pool = nn.AdaptiveAvgPool2d((1, 1))
        self.linear = nn.Linear(64, num_classes)

    def _make_layer(self, out_channels, num_blocks, stride):
        strides = [stride] + [1] * (num_blocks - 1)
        layers = []
        for stride in strides:
            layers.append(CNNBlock(self.in_channels, out_channels, stride))
            self.in_channels = out_channels
        return nn.Sequential(*layers)

    def forward(self, x):
        x = self.conv1(x)
        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.avg_pool(x)
        x = x.view(x.size(0), -1)
        x = self.linear(x)
        return x

#### 2 Training

To help the ViT generalize, we apply data augmentation (random translation, rotation and horizontal flip) when loading the training data. The same augmented loaders are used for the CNN, and the test data is not augmented.

In [9]:
def get_cifar_loaders(dataset_name, batch_size=64):
    """Downloads and prepares data loaders for CIFAR-10 or CIFAR-100 with augmentation."""
    if dataset_name == 'CIFAR-10':
        dataset_class = datasets.CIFAR10
        num_classes = 10
    elif dataset_name == 'CIFAR-100':
        dataset_class = datasets.CIFAR100
        num_classes = 100
    else:
        raise ValueError(f"Unknown dataset: {dataset_name}")

    #Augmentation for Training Data (32x32)
    train_transform = transforms.Compose([
        # Random Augmentations
        transforms.RandomHorizontalFlip(),
        transforms.RandomRotation(10), # Rotate up to 10 degrees
        transforms.RandomAffine(
            degrees=0,
            translate=(0.1, 0.1) # Translate up to 10% horizontally/vertically
        ),
        # Final steps
        transforms.ToTensor(),
        transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
    ])

    # Standard Transform for Test Data (No augmentation)
    test_transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
    ])

    train_data = dataset_class(root='./data', train=True, download=True, transform=train_transform)
    test_data = dataset_class(root='./data', train=False, download=True, transform=test_transform)

    train_loader = DataLoader(train_data, batch_size=batch_size, shuffle=True)
    test_loader = DataLoader(test_data, batch_size=batch_size, shuffle=False)

    print(f"Loaded {dataset_name}: {len(train_data)} training images, {len(test_data)} test images (with augmentation).")

    return train_loader, test_loader, num_classes

In [10]:
from PIL import Image, ImageFile

# Crucial setting for handling truncated images (common in ImageNet)
ImageFile.LOAD_TRUNCATED_IMAGES = True

def is_valid_image_file(path):
    """Checks if a file is a valid, non-corrupted image."""
    try:
        # 1. Check if file is empty
        if os.path.getsize(path) == 0:
            return False

        # 2. Open the image and verify its integrity; the context manager
        #    closes the file handle even if verification fails
        with Image.open(path) as img:
            img.verify()
        return True
    except Exception:
        # If PIL throws any error (UnidentifiedImageError, IOError, etc.), it's invalid
        return False

In [11]:
def get_imagenet_200_loaders(data_dir, batch_size=64):
    """
    Loads ImageNet-200 (Tiny ImageNet) data with augmentation.
    """
    if data_dir is None:
        print("Data directory is invalid. Cannot load data.")
        return None, None, 200  # same 3-tuple shape as the success path

    # Ensure validation data is structured correctly
    sort_tiny_imagenet_validation(data_dir)

    #Augmentation for Training Data (64x64)
    train_transform = transforms.Compose([
        # Initial resizing/cropping
        transforms.Resize(64),
        transforms.CenterCrop(64),
        # Random Augmentations
        transforms.RandomHorizontalFlip(),
        transforms.RandomRotation(15), # Slightly more rotation
        transforms.RandomAffine(
            degrees=0,
            translate=(0.15, 0.15) # Slightly more translation
        ),

        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
    ])

    #Standard Transform for Test Data
    test_transform = transforms.Compose([
        transforms.Resize(64),
        transforms.CenterCrop(64),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
    ])

    try:
        train_data = datasets.ImageFolder(
            root=os.path.join(data_dir, 'train'),
            transform=train_transform,
            is_valid_file=is_valid_image_file
        )
        test_data = datasets.ImageFolder(
            root=os.path.join(data_dir, 'val'),
            transform=test_transform,
            is_valid_file=is_valid_image_file
        )

    except Exception as e:
        print(f"ERROR loading ImageNet-200 structure: {e}")
        return None, None, 200  # same 3-tuple shape as the success path

    train_loader = DataLoader(train_data, batch_size=batch_size, shuffle=True, num_workers=4)
    test_loader = DataLoader(test_data, batch_size=batch_size, shuffle=False, num_workers=4)

    num_classes = 200
    print(f"Loaded ImageNet-200: {len(train_data)} training images (with augmentation).")

    return train_loader, test_loader, num_classes

In [12]:

def train_model(model, train_loader, criterion, optimizer, device, epochs):
    """Trains the model and measures total training time."""
    model.train()
    start_time = time.time()

    for epoch in range(epochs):
        running_loss = 0.0
        # Use tqdm for progress bar if installed
        for i, (inputs, labels) in enumerate(train_loader): # wrap with tqdm(train_loader) if using tqdm
            inputs, labels = inputs.to(device), labels.to(device)

            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

            running_loss += loss.item()

        avg_loss = running_loss / len(train_loader)
        print(f"  Epoch {epoch+1}/{epochs}, Loss: {avg_loss:.4f}")

    end_time = time.time()
    train_time = end_time - start_time
    return train_time


In [13]:

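def test_model(model, test_loader, device, img_size):
    """Evaluates the model, calculating accuracy and average inference time.

    The measured time covers the whole evaluation loop, so it includes data
    loading and host-to-device transfer. `img_size` is currently unused.
    """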
    model.eval()
    correct = 0
    total = 0
    start_time_total = time.time()

    with torch.no_grad():
        for inputs, labels in test_loader:
            inputs, labels = inputs.to(device), labels.to(device)

            # Forward pass (Inference)
            outputs = model(inputs)

            # Accuracy calculation
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

    end_time_total = time.time()


    total_inference_time_s = end_time_total - start_time_total
    accuracy = 100 * correct / total

    # Calculate average time per image based on the total time and total images
    avg_inference_time_ms = (total_inference_time_s / total) * 1000

    print(f"Total images tested: {total}")
    print(f"Total inference time for dataset: {total_inference_time_s:.2f} seconds")

    return accuracy, avg_inference_time_ms

def count_parameters(model):
    """Returns the total number of trainable parameters in the model."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

#### **3. Experiments with different datasets**

In [14]:
# Initialize results list
FINAL_RESULTS = []

In [15]:
# Function to see the results
def print_summary_table(results):

    header = ['Model', 'Dataset', 'Accuracy (%)', 'Total Parameters (M)', 'Train Time (s)', 'Inference Time (ms/img)']
    print(f"{header[0]:<8} | {header[1]:<15} | {header[2]:<15} | {header[3]:<20} | {header[4]:<15} | {header[5]:<25}")
    print("-" * 100)

    for res in results:
        print(f"{res['Model']:<8} | {res['Dataset']:<15} | {res['Accuracy (%)']:<15} | {res['Total Parameters (M)']:<20} | {res['Train Time (s)']:<15} | {res['Inference Time (ms/img)']:<25}")
    print("="*100)

##### **3.1. CIFAR-10:**

First, we use the CIFAR-10 dataset, whose images are divided into 10 classes.

In [16]:
DATASET_NAME = 'CIFAR-10'
NUM_EPOCHS = 10
BATCH_SIZE = 128
LEARNING_RATE = 1e-3
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'  # fall back to CPU when no GPU is present

# ViT Configuration for CIFAR-10 (32x32 input)
IMG_SIZE_C10 = 32
PATCH_SIZE = 4 # (32/4)^2 + 1 = 65 tokens

print(f"\n\n================ Starting Experiment: {DATASET_NAME} ================")

# 1. Load Data
train_loader, test_loader, num_classes = get_cifar_loaders(DATASET_NAME, BATCH_SIZE)

# 2. Instantiate Models
vit_model = VisionTransformer(
    img_size=IMG_SIZE_C10, patch_size=PATCH_SIZE, num_classes=num_classes,
    embed_dim=128, depth=6, num_heads=8
).to(DEVICE)

cnn_model = CustomCNN(num_classes=num_classes).to(DEVICE)

# --- ViT Run ---
print(f"\n--- Running ViT on {DATASET_NAME} ---")
optimizer_vit = optim.Adam(vit_model.parameters(), lr=LEARNING_RATE)
criterion = nn.CrossEntropyLoss()
train_time_vit = train_model(vit_model, train_loader, criterion, optimizer_vit, DEVICE, NUM_EPOCHS)
accuracy_vit, inference_time_vit = test_model(vit_model, test_loader, DEVICE, IMG_SIZE_C10)
params_vit = count_parameters(vit_model)

FINAL_RESULTS.append({
    'Model': 'ViT', 'Dataset': DATASET_NAME,
    'Accuracy (%)': f"{accuracy_vit:.2f}",
    'Train Time (s)': f"{train_time_vit:.1f}",
    'Inference Time (ms/img)': f"{inference_time_vit:.3f}",
    'Total Parameters (M)': f"{params_vit/1e6:.2f}"
})

# --- CNN Run ---
print(f"\n--- Running CNN on {DATASET_NAME} ---")
optimizer_cnn = optim.Adam(cnn_model.parameters(), lr=LEARNING_RATE)
train_time_cnn = train_model(cnn_model, train_loader, criterion, optimizer_cnn, DEVICE, NUM_EPOCHS)
accuracy_cnn, inference_time_cnn = test_model(cnn_model, test_loader, DEVICE, IMG_SIZE_C10)
params_cnn = count_parameters(cnn_model)

FINAL_RESULTS.append({
    'Model': 'CNN', 'Dataset': DATASET_NAME,
    'Accuracy (%)': f"{accuracy_cnn:.2f}",
    'Train Time (s)': f"{train_time_cnn:.1f}",
    'Inference Time (ms/img)': f"{inference_time_cnn:.3f}",
    'Total Parameters (M)': f"{params_cnn/1e6:.2f}"
})

print_summary_table(FINAL_RESULTS)





100%|██████████| 170M/170M [00:13<00:00, 12.8MB/s]


Loaded CIFAR-10: 50000 training images, 10000 test images (with augmentation).





--- Running ViT on CIFAR-10 ---
  Epoch 1/10, Loss: 1.8017
  Epoch 2/10, Loss: 1.4386
  Epoch 3/10, Loss: 1.3356
  Epoch 4/10, Loss: 1.2878
  Epoch 5/10, Loss: 1.2369
  Epoch 6/10, Loss: 1.2012
  Epoch 7/10, Loss: 1.1696
  Epoch 8/10, Loss: 1.1468
  Epoch 9/10, Loss: 1.1223
  Epoch 10/10, Loss: 1.0961
Total images tested: 10000
Total inference time for dataset: 3.47 seconds

--- Running CNN on CIFAR-10 ---
  Epoch 1/10, Loss: 1.4589
  Epoch 2/10, Loss: 1.0747
  Epoch 3/10, Loss: 0.9268
  Epoch 4/10, Loss: 0.8212
  Epoch 5/10, Loss: 0.7495
  Epoch 6/10, Loss: 0.6981
  Epoch 7/10, Loss: 0.6583
  Epoch 8/10, Loss: 0.6230
  Epoch 9/10, Loss: 0.5884
  Epoch 10/10, Loss: 0.5689
Total images tested: 10000
Total inference time for dataset: 2.86 seconds
Model    | Dataset         | Accuracy (%)    | Total Parameters (M) | Train Time (s)  | Inference Time (ms/img)  
----------------------------------------------------------------------------------------------------
ViT      | CIFAR-10        | 

For CIFAR-10, the CNN clearly outperforms the ViT on all three measures: accuracy, training time and inference time.

In [17]:
print(vit_model)

VisionTransformer(
  (patch_embed): PatchEmbedding(
    (proj): Conv2d(3, 128, kernel_size=(4, 4), stride=(4, 4))
  )
  (pos_drop): Dropout(p=0.1, inplace=False)
  (blocks): TransformerEncoder(
    (layers): ModuleList(
      (0-5): 6 x TransformerEncoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=128, out_features=128, bias=True)
        )
        (linear1): Linear(in_features=128, out_features=512, bias=True)
        (dropout): Dropout(p=0.1, inplace=False)
        (linear2): Linear(in_features=512, out_features=128, bias=True)
        (norm1): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
        (norm2): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
        (dropout1): Dropout(p=0.1, inplace=False)
        (dropout2): Dropout(p=0.1, inplace=False)
      )
    )
  )
  (head): Linear(in_features=128, out_features=10, bias=True)
)


In [18]:
print(cnn_model)

CustomCNN(
  (conv1): Sequential(
    (0): Conv2d(3, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
    (1): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): ReLU()
  )
  (layer1): Sequential(
    (0): CNNBlock(
      (conv1): Conv2d(16, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv2): Conv2d(16, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (shortcut): Sequential()
    )
    (1): CNNBlock(
      (conv1): Conv2d(16, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv2): Conv2d(16, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNor

##### **3.2. CIFAR-100**

Next, we test both architectures on the CIFAR-100 dataset, which contains labelled images from 100 different classes. It has ten times as many classes as CIFAR-10 but the same number of training images (50000) and test images (10000).

In [19]:
DATASET_NAME = 'CIFAR-100'
NUM_EPOCHS = 10
BATCH_SIZE = 128
LEARNING_RATE = 1e-3
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'  # fall back to CPU when no GPU is present

# ViT Configuration for CIFAR-100 (32x32 input)
IMG_SIZE_C100 = 32
PATCH_SIZE = 4 # (32/4)^2 + 1 = 65 tokens

print(f"\n\n================ Starting Experiment: {DATASET_NAME} ================")

# 1. Load Data
train_loader, test_loader, num_classes = get_cifar_loaders(DATASET_NAME, BATCH_SIZE)

# 2. Instantiate Models
vit_model = VisionTransformer(
    img_size=IMG_SIZE_C100, patch_size=PATCH_SIZE, num_classes=num_classes,
    embed_dim=128, depth=6, num_heads=8
).to(DEVICE)

cnn_model = CustomCNN(num_classes=num_classes).to(DEVICE)

# --- ViT Run ---
print(f"\n--- Running ViT on {DATASET_NAME} ---")
optimizer_vit = optim.Adam(vit_model.parameters(), lr=LEARNING_RATE)
criterion = nn.CrossEntropyLoss()
train_time_vit = train_model(vit_model, train_loader, criterion, optimizer_vit, DEVICE, NUM_EPOCHS)
accuracy_vit, inference_time_vit = test_model(vit_model, test_loader, DEVICE, IMG_SIZE_C100)
params_vit = count_parameters(vit_model)

FINAL_RESULTS.append({
    'Model': 'ViT', 'Dataset': DATASET_NAME,
    'Accuracy (%)': f"{accuracy_vit:.2f}",
    'Train Time (s)': f"{train_time_vit:.1f}",
    'Inference Time (ms/img)': f"{inference_time_vit:.3f}",
    'Total Parameters (M)': f"{params_vit/1e6:.2f}"
})

# --- CNN Run ---
print(f"\n--- Running CNN on {DATASET_NAME} ---")
optimizer_cnn = optim.Adam(cnn_model.parameters(), lr=LEARNING_RATE)
train_time_cnn = train_model(cnn_model, train_loader, criterion, optimizer_cnn, DEVICE, NUM_EPOCHS)
accuracy_cnn, inference_time_cnn = test_model(cnn_model, test_loader, DEVICE, IMG_SIZE_C100)
params_cnn = count_parameters(cnn_model)

FINAL_RESULTS.append({
    'Model': 'CNN', 'Dataset': DATASET_NAME,
    'Accuracy (%)': f"{accuracy_cnn:.2f}",
    'Train Time (s)': f"{train_time_cnn:.1f}",
    'Inference Time (ms/img)': f"{inference_time_cnn:.3f}",
    'Total Parameters (M)': f"{params_cnn/1e6:.2f}"
})
print_summary_table(FINAL_RESULTS)





100%|██████████| 169M/169M [00:19<00:00, 8.71MB/s]


Loaded CIFAR-100: 50000 training images, 10000 test images (with augmentation).

--- Running ViT on CIFAR-100 ---
  Epoch 1/10, Loss: 3.9135
  Epoch 2/10, Loss: 3.3593
  Epoch 3/10, Loss: 3.1276
  Epoch 4/10, Loss: 2.9644
  Epoch 5/10, Loss: 2.8447
  Epoch 6/10, Loss: 2.7469
  Epoch 7/10, Loss: 2.6607
  Epoch 8/10, Loss: 2.5911
  Epoch 9/10, Loss: 2.5183
  Epoch 10/10, Loss: 2.4560
Total images tested: 10000
Total inference time for dataset: 3.49 seconds

--- Running CNN on CIFAR-100 ---
  Epoch 1/10, Loss: 3.9000
  Epoch 2/10, Loss: 3.3384
  Epoch 3/10, Loss: 2.9782
  Epoch 4/10, Loss: 2.7417
  Epoch 5/10, Loss: 2.5488
  Epoch 6/10, Loss: 2.4081
  Epoch 7/10, Loss: 2.2899
  Epoch 8/10, Loss: 2.1946
  Epoch 9/10, Loss: 2.1052
  Epoch 10/10, Loss: 2.0448
Total images tested: 10000
Total inference time for dataset: 2.46 seconds
Model    | Dataset         | Accuracy (%)    | Total Parameters (M) | Train Time (s)  | Inference Time (ms/img)  
------------------------------------------------

On CIFAR-100 the outcome is the same as on CIFAR-10. Accuracy is lower for both models than on CIFAR-10, because there are ten times as many classes, training is limited to 10 epochs and both architectures are small, but the CNN still beats the ViT on accuracy, training time and inference time.
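
To put the lower accuracy in context, it helps to compare against a uniform random guess. The sketch below is only illustrative arithmetic (`chance_baseline` is a hypothetical helper, not part of the notebook): with `C` classes, guessing uniformly gives `100 / C` percent accuracy and a cross-entropy loss of `ln(C)`.

```python
import math


def chance_baseline(num_classes):
    """Return (accuracy in %, cross-entropy loss) of a uniform random guess."""
    accuracy_pct = 100.0 / num_classes
    loss = math.log(num_classes)  # -ln(1 / C) = ln(C)
    return accuracy_pct, loss


# Class counts of the three datasets used in this project.
for name, num_classes in [("CIFAR-10", 10), ("CIFAR-100", 100), ("ImageNet-200", 200)]:
    acc, loss = chance_baseline(num_classes)
    print(f"{name:<13} chance accuracy: {acc:5.2f}%  chance loss: {loss:.3f}")
```

More classes push the chance baseline down, so raw accuracies are not directly comparable across the three datasets.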

##### **3.3. ImageNet-200:**

Lastly, we use the ImageNet-200 dataset (Tiny ImageNet), a reduced version of ImageNet with 200 classes and 64x64 images. It has 100000 training images and 10000 validation images, which we use as the test set.

In [20]:
DATASET_NAME = 'ImageNet-200'
NUM_EPOCHS = 10
BATCH_SIZE = 64
LEARNING_RATE = 1e-3
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'  # fall back to CPU when no GPU is present

# ViT Configuration for ImageNet-200 (64x64 input)
IMG_SIZE_200 = 64
PATCH_SIZE = 8 # (64/8)^2 + 1 = 65 tokens

print(f"\n\n================ Starting Experiment: {DATASET_NAME} ================")

# 1. Download and Prepare Data
TINY_IMAGENET_ROOT = None
try:
    TINY_IMAGENET_PATH = kagglehub.dataset_download("akash2sharma/tiny-imagenet")
    TINY_IMAGENET_ROOT = os.path.join(TINY_IMAGENET_PATH, 'tiny-imagenet-200')
    if not os.path.isdir(TINY_IMAGENET_ROOT):
        TINY_IMAGENET_ROOT = TINY_IMAGENET_PATH
except Exception as e:
    print(f"KaggleHub download failed. Error: {e}")



Using Colab cache for faster access to the 'tiny-imagenet' dataset.
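
The `PATCH_SIZE` comment above gives the sequence length for this configuration. As a minimal sketch (the helper `vit_sequence_length` is hypothetical and assumes a single class token, as the `+ 1` in the comment suggests), the same arithmetic can be written out to see how the patch size affects the number of tokens and the number of token pairs that self-attention has to compare:

```python
def vit_sequence_length(img_size, patch_size, use_cls_token=True):
    """Number of tokens a ViT sees for a square image split into square patches."""
    if img_size % patch_size != 0:
        raise ValueError("img_size must be divisible by patch_size")
    patches_per_side = img_size // patch_size
    num_patches = patches_per_side ** 2
    return num_patches + (1 if use_cls_token else 0)


# 64x64 input, as used for ImageNet-200 in this notebook.
for patch_size in (4, 8, 16):
    tokens = vit_sequence_length(64, patch_size)
    # Self-attention scores every token against every token: tokens^2 pairs per head.
    print(f"patch {patch_size:>2}: {tokens:>3} tokens, {tokens ** 2:>6} attention pairs per head")
```

Smaller patches give the model finer detail but grow the attention cost quadratically, which is one reason a patch size of 8 is a reasonable middle ground for 64x64 inputs.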


In [21]:


WRITABLE_DATA_ROOT = None
if TINY_IMAGENET_ROOT:
    WRITABLE_DATA_ROOT = prepare_writable_data(TINY_IMAGENET_ROOT)

if WRITABLE_DATA_ROOT is None:
    print("Skipping ImageNet-200 experiment due to data preparation failure.")
else:
    # 2. Load Data (Sorting happens inside this call)
    train_loader, test_loader, num_classes = get_imagenet_200_loaders(WRITABLE_DATA_ROOT, BATCH_SIZE)

    if train_loader is None:
        print("Data loading failed.")
    else:
        # 3. Instantiate Models
        vit_model = VisionTransformer(
            img_size=IMG_SIZE_200, patch_size=PATCH_SIZE, num_classes=num_classes,
            embed_dim=128, depth=6, num_heads=8
        ).to(DEVICE)

        cnn_model = CustomCNN(num_classes=num_classes).to(DEVICE)

        # --- ViT Run ---
        print(f"\n--- Running ViT on {DATASET_NAME} ---")
        optimizer_vit = optim.Adam(vit_model.parameters(), lr=LEARNING_RATE)
        criterion = nn.CrossEntropyLoss()
        train_time_vit = train_model(vit_model, train_loader, criterion, optimizer_vit, DEVICE, NUM_EPOCHS)
        accuracy_vit, inference_time_vit = test_model(vit_model, test_loader, DEVICE, IMG_SIZE_200)
        params_vit = count_parameters(vit_model)

        FINAL_RESULTS.append({
            'Model': 'ViT', 'Dataset': DATASET_NAME,
            'Accuracy (%)': f"{accuracy_vit:.2f}",
            'Train Time (s)': f"{train_time_vit:.1f}",
            'Inference Time (ms/img)': f"{inference_time_vit:.3f}",
            'Total Parameters (M)': f"{params_vit/1e6:.2f}"
        })

        # --- CNN Run ---
        print(f"\n--- Running CNN on {DATASET_NAME} ---")
        optimizer_cnn = optim.Adam(cnn_model.parameters(), lr=LEARNING_RATE)
        train_time_cnn = train_model(cnn_model, train_loader, criterion, optimizer_cnn, DEVICE, NUM_EPOCHS)
        accuracy_cnn, inference_time_cnn = test_model(cnn_model, test_loader, DEVICE, IMG_SIZE_200)
        params_cnn = count_parameters(cnn_model)

        FINAL_RESULTS.append({
            'Model': 'CNN', 'Dataset': DATASET_NAME,
            'Accuracy (%)': f"{accuracy_cnn:.2f}",
            'Train Time (s)': f"{train_time_cnn:.1f}",
            'Inference Time (ms/img)': f"{inference_time_cnn:.3f}",
            'Total Parameters (M)': f"{params_cnn/1e6:.2f}"
        })



Copying data from /kaggle/input/tiny-imagenet/tiny-imagenet-200 to /kaggle/working/tiny-imagenet-writable...
Data copied successfully.
Sorting Tiny ImageNet validation set...


Processing validation images: 100%|██████████| 10000/10000 [00:00<00:00, 29616.92it/s]


Validation set sorting complete.
Loaded ImageNet-200: 100000 training images (with augmentation).

--- Running ViT on ImageNet-200 ---




  Epoch 1/10, Loss: 4.6694
  Epoch 2/10, Loss: 4.0706
  Epoch 3/10, Loss: 3.7755
  Epoch 4/10, Loss: 3.5902
  Epoch 5/10, Loss: 3.4520
  Epoch 6/10, Loss: 3.3436
  Epoch 7/10, Loss: 3.2556
  Epoch 8/10, Loss: 3.1794
  Epoch 9/10, Loss: 3.1173
  Epoch 10/10, Loss: 3.0640
Total images tested: 10000
Total inference time for dataset: 5.55 seconds

--- Running CNN on ImageNet-200 ---
  Epoch 1/10, Loss: 4.6212
  Epoch 2/10, Loss: 4.0275
  Epoch 3/10, Loss: 3.7576
  Epoch 4/10, Loss: 3.5650
  Epoch 5/10, Loss: 3.4166
  Epoch 6/10, Loss: 3.3028
  Epoch 7/10, Loss: 3.2052
  Epoch 8/10, Loss: 3.1293
  Epoch 9/10, Loss: 3.0652
  Epoch 10/10, Loss: 3.0032
Total images tested: 10000
Total inference time for dataset: 4.92 seconds


In [22]:
print_summary_table(FINAL_RESULTS)

Model    | Dataset         | Accuracy (%)    | Total Parameters (M) | Train Time (s)  | Inference Time (ms/img)  
----------------------------------------------------------------------------------------------------
ViT      | CIFAR-10        | 62.92           | 1.21                 | 442.6           | 0.347                    
CNN      | CIFAR-10        | 80.04           | 0.18                 | 323.2           | 0.286                    
ViT      | CIFAR-100       | 36.77           | 1.22                 | 438.8           | 0.349                    
CNN      | CIFAR-100       | 42.87           | 0.18                 | 326.3           | 0.246                    
ViT      | ImageNet-200    | 31.15           | 1.25                 | 883.9           | 0.555                    
CNN      | ImageNet-200    | 29.85           | 0.19                 | 838.6           | 0.492                    


On ImageNet-200 the ViT edges out the CNN in accuracy (31.15% vs 29.85%), while training and inference times are close: the ViT takes about 45 seconds longer to train (883.9 s vs 838.6 s) and 0.063 ms longer per image at inference. This is consistent with the theory that ViTs tend to catch up with, and then outperform, CNNs as the dataset grows in size and number of classes, although a 1.30-point gap from a single 10-epoch run is modest evidence.
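
The gaps between the two models can be read off the summary table directly. The snippet below is a small sketch that recomputes them from the ImageNet-200 rows; the dictionary layout and the `compare` helper are illustrative choices, not part of the notebook's code.

```python
# Values copied from the ImageNet-200 rows of the summary table above.
results = {
    "ViT": {"accuracy": 31.15, "train_s": 883.9, "infer_ms": 0.555, "params_m": 1.25},
    "CNN": {"accuracy": 29.85, "train_s": 838.6, "infer_ms": 0.492, "params_m": 0.19},
}


def compare(vit, cnn):
    """Express the ViT results relative to the CNN results."""
    return {
        "accuracy_gap_pts": vit["accuracy"] - cnn["accuracy"],
        "train_time_ratio": vit["train_s"] / cnn["train_s"],
        "infer_time_ratio": vit["infer_ms"] / cnn["infer_ms"],
        "param_ratio": vit["params_m"] / cnn["params_m"],
    }


summary = compare(results["ViT"], results["CNN"])
for key, value in summary.items():
    print(f"{key:<18} {value:.2f}")
```

Reading the ratios rather than the raw numbers makes it easier to compare this dataset with the CIFAR runs, where the absolute training times are much shorter.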

### **Summary of CNNs and ViTs**

| Features | CNNs | ViTs |
| :--- | :--- | :--- |
| **Attention Scope** | Capture local features via convolutions | Capture global relationships via self-attention |
| **Inductive Bias** | Strong biases (locality, translation invariance) | Minimal biases, more flexible but data-hungry |
| **Data Requirement** | Work well with small datasets | Need large datasets for best performance |
| **Feature Learning** | Learn hierarchical features | Learn context-rich, long-range features |

### **Comparative Analysis and Conclusion**

#### **Performance on Small Datasets (e.g., CIFAR-10, CIFAR-100)**

For datasets with limited data volume and relatively few classes, the Convolutional Neural Network (CNN) outperforms the Vision Transformer (ViT) on every metric we measured.

*   **Performance:** The CNN achieves higher classification accuracy on both datasets (80.04% vs 62.92% on CIFAR-10, 42.87% vs 36.77% on CIFAR-100).
*   **Efficiency:** The CNN takes roughly a quarter less time to train (about 325 s vs 440 s) and has lower inference latency (0.246 to 0.286 ms/img vs about 0.35 ms/img).
*   **Cost:** The CNN has far fewer trainable parameters (0.18M vs about 1.2M), resulting in a lower memory footprint and computational cost.

This outcome is attributed to the strong **inductive biases** (locality and translation invariance) inherent in CNNs, which enable effective feature learning from limited data.
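
The two biases can be illustrated with a toy example. The sketch below uses a 1D signal and plain Python lists (the helpers `circular_conv1d` and `shift` are hypothetical, and wrap-around padding is assumed to keep the arithmetic exact). Each output value depends only on a few neighbouring inputs (locality), and shifting the input simply shifts the output. Strictly speaking this second property is translation equivariance; invariance is usually obtained afterwards through pooling.

```python
def circular_conv1d(signal, kernel):
    """Cross-correlate `signal` with `kernel`, wrapping around at the edges."""
    n = len(signal)
    return [
        sum(kernel[k] * signal[(i + k) % n] for k in range(len(kernel)))
        for i in range(n)
    ]


def shift(values, steps):
    """Cyclically shift a list to the right by `steps` positions."""
    steps %= len(values)
    return values[-steps:] + values[:-steps] if steps else list(values)


signal = [0, 0, 1, 3, 1, 0, 0, 0]
kernel = [1, -2, 1]  # the same three weights are reused at every position

shift_then_conv = circular_conv1d(shift(signal, 2), kernel)
conv_then_shift = shift(circular_conv1d(signal, kernel), 2)
print(shift_then_conv == conv_then_shift)  # prints True
```

A ViT has no such constraint built in: it has to learn from data that nearby patches are related, which is one way to understand why it needs more training examples.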

#### **Performance on Large-Scale Datasets (e.g., ImageNet-200)**

On larger datasets with more data and more classes, the comparative advantage shifts:

*   **Accuracy:** The ViT achieves slightly higher classification accuracy on ImageNet-200 (31.15% vs 29.85%), which suggests it can use the additional data to learn complex, global feature representations without relying on local biases.
*   **Efficiency:** The ViT is still slower than the CNN, but the gap narrows: training takes about 5% longer (883.9 s vs 838.6 s) and inference about 13% longer (0.555 vs 0.492 ms/img), compared with roughly 35% longer training on the CIFAR datasets.
*   **Cost and Resource Allocation:** A major drawback of the ViT is its **high parameter count**. In our experiments the ViT has about 6.6 times as many parameters as the CNN (1.25M vs 0.19M), and ViTs typically require significantly more parameters than comparable CNNs (e.g., ResNet variants) to achieve peak performance. This translates directly to higher memory requirements and greater computational expense during both training and deployment.
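
A rough analytic count shows where the ViT's parameters come from. The sketch below assumes a standard ViT layout: a linear patch embedding, one class token, learned position embeddings, encoder blocks with two LayerNorms and an MLP ratio of 4, biases on every linear layer, and a final LayerNorm before the linear head. The notebook's own `VisionTransformer` class may differ, and `vit_param_count` and `mlp_ratio` are hypothetical names introduced here.

```python
def vit_param_count(img_size, patch_size, num_classes, embed_dim, depth,
                    in_channels=3, mlp_ratio=4):
    """Estimate trainable parameters of a standard ViT (assumed layout, see above)."""
    num_tokens = (img_size // patch_size) ** 2 + 1  # patches + class token

    patch_embed = (patch_size * patch_size * in_channels) * embed_dim + embed_dim
    cls_token = embed_dim
    pos_embed = num_tokens * embed_dim

    # Per encoder block: Q, K, V and output projections, a two-layer MLP, two LayerNorms.
    attention = 4 * (embed_dim * embed_dim + embed_dim)
    hidden = mlp_ratio * embed_dim
    mlp = (embed_dim * hidden + hidden) + (hidden * embed_dim + embed_dim)
    layer_norms = 2 * (2 * embed_dim)
    block = attention + mlp + layer_norms

    final_norm = 2 * embed_dim
    head = embed_dim * num_classes + num_classes
    return patch_embed + cls_token + pos_embed + depth * block + final_norm + head


# ImageNet-200 configuration from this notebook: 64x64 input, patch size 8.
total = vit_param_count(img_size=64, patch_size=8, num_classes=200, embed_dim=128, depth=6)
print(f"{total / 1e6:.2f}M parameters")
```

Under these assumptions the ImageNet-200 configuration comes to about 1.25M parameters, in line with the figure in the summary table, and almost all of them sit in the six encoder blocks. `num_heads` does not appear because splitting the projections into heads does not change their size.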

#### **Conclusion on Architectural Choice**

The choice between a CNN and a ViT should be dictated by the available resources and the project's primary objective:

1.  **Resource-Constrained Environments:** If computational budget, memory constraints, or training time are critical factors, the **CNN architecture (e.g., ResNet)** remains the preferred and most resource-efficient choice, often providing acceptable accuracy with minimal cost.
2.  **Accuracy-Driven Objectives:** If the primary goal is maximizing classification accuracy and resources are abundant, the **Vision Transformer** is the superior choice, as its global attention mechanism allows it to achieve state-of-the-art results by effectively modeling long-range dependencies in large datasets. The higher computational cost is accepted as a necessary trade-off for improved performance.
