# Vision Transformer (ViT) for Image Classification [5 points]
Use a Vision Transformer to classify the Cats and Dogs Dataset. You can use a pre-defined ViT model or implement one from scratch.
Deploy the model and record a short video (~5 mins) showing how it works.

## Steps:

1. Load and preprocess the dataset. This may include resizing images, normalizing pixel values, and splitting the dataset into training, validation, and testing sets.

In [7]:
import os
from torch.utils.data import DataLoader, random_split
from torchvision import datasets, transforms
import torch
import torch.nn as nn
import timm
import torch.optim as optim
from tqdm import tqdm

In [3]:
!unzip PetImages.zip -d PetImages/

Streaming output truncated to the last 5000 lines.
  inflating: PetImages/PetImages/Dog/7272.jpg  
  inflating: PetImages/__MACOSX/PetImages/Dog/._7272.jpg  
  inflating: PetImages/PetImages/Dog/8141.jpg  
  inflating: PetImages/__MACOSX/PetImages/Dog/._8141.jpg  
  inflating: PetImages/PetImages/Dog/1603.jpg  
  inflating: PetImages/__MACOSX/PetImages/Dog/._1603.jpg  
  inflating: PetImages/PetImages/Dog/397.jpg  
  inflating: PetImages/__MACOSX/PetImages/Dog/._397.jpg  
  inflating: PetImages/PetImages/Dog/5465.jpg  
  inflating: PetImages/__MACOSX/PetImages/Dog/._5465.jpg  
  inflating: PetImages/PetImages/Dog/3014.jpg  
  inflating: PetImages/__MACOSX/PetImages/Dog/._3014.jpg  
  inflating: PetImages/PetImages/Dog/11553.jpg  
  inflating: PetImages/__MACOSX/PetImages/Dog/._11553.jpg  
  inflating: PetImages/PetImages/Dog/10895.jpg  
  inflating: PetImages/__MACOSX/PetImages/Dog/._10895.jpg  
  inflating: PetImages/PetImages/Dog/5471.jpg  
  inflating: PetImages/__MACO

In [4]:
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225])
])
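
To make the `Normalize` step concrete, here is a minimal pure-Python sketch of the per-channel arithmetic it applies once `ToTensor` has scaled pixel values to [0, 1]. The helper name `normalize_pixel` is hypothetical and not part of torchvision; the mean and standard deviation values are the ones used in the cell above.

```python
# Per-channel statistics copied from the transform above.
MEAN = (0.485, 0.456, 0.406)
STD = (0.229, 0.224, 0.225)

def normalize_pixel(rgb):
    """Apply (value - mean) / std to one RGB pixel already scaled to [0, 1].

    Hypothetical helper for illustration only; torchvision applies the
    same arithmetic to whole image tensors.
    """
    return tuple((v - m) / s for v, m, s in zip(rgb, MEAN, STD))

# A mid-gray pixel and a pure white pixel.
gray = normalize_pixel((0.5, 0.5, 0.5))
white = normalize_pixel((1.0, 1.0, 1.0))
print([round(c, 3) for c in gray])
print([round(c, 3) for c in white])
```

A pixel equal to the channel means maps to zero, so after normalization the inputs are roughly centered around zero, which matches the statistics the pretrained weights were trained with.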

In [10]:
dataDir = 'PetImages/PetImages/'

In [18]:
dataset = datasets.ImageFolder(root=dataDir, transform=transform)

In [19]:
dataset_size = len(dataset)
train_size = int(0.7 * dataset_size)
val_size = int(0.15 * dataset_size)
test_size = dataset_size - train_size - val_size

In [20]:
train_dataset, val_dataset, test_dataset = random_split(dataset, [train_size, val_size, test_size])

batch_size = 32
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, num_workers=4)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False, num_workers=4)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False, num_workers=4)

print(f"Total samples: {dataset_size}")
print(f"Training samples: {len(train_dataset)}")
print(f"Validation samples: {len(val_dataset)}")
print(f"Testing samples: {len(test_dataset)}")

Total samples: 24998
Training samples: 17498
Validation samples: 3749
Testing samples: 3751
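
The printed split sizes can be reproduced with plain arithmetic. The sketch below mirrors the computation from the cell above in a hypothetical `split_sizes` helper: `int()` truncates the training and validation sizes, and the test split absorbs whatever remains, which is why it ends up two samples larger than the validation split.

```python
def split_sizes(total, train_frac=0.7, val_frac=0.15):
    """Return (train, val, test) sizes; the test split absorbs rounding."""
    train = int(train_frac * total)
    val = int(val_frac * total)
    test = total - train - val
    return train, val, test

sizes = split_sizes(24998)
print(sizes)  # (17498, 3749, 3751)
```

`random_split` also accepts a `generator` argument (for example `torch.Generator().manual_seed(0)`), which would make the split repeatable across sessions. The notebook does not set one, so the exact membership of each split varies between runs.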


In [17]:
import os
from PIL import Image, UnidentifiedImageError

def verify_images(root_dir):
    removed_files = []
    for subdir, _, files in os.walk(root_dir):
        for file in files:
            file_path = os.path.join(subdir, file)
            try:
                with Image.open(file_path) as img:
                    img.verify()
            except (UnidentifiedImageError, OSError) as e:
                print(f"Removing corrupted image: {file_path}")
                os.remove(file_path)
                removed_files.append(file_path)
    print(f"Total corrupted images removed: {len(removed_files)}")

data_dir = 'PetImages/PetImages'
verify_images(data_dir)

Removing corrupted image: PetImages/PetImages/.DS_Store
Removing corrupted image: PetImages/PetImages/Dog/11702.jpg
Removing corrupted image: PetImages/PetImages/Dog/Thumbs.db
Removing corrupted image: PetImages/PetImages/Cat/666.jpg
Removing corrupted image: PetImages/PetImages/Cat/Thumbs.db
Total corrupted images removed: 5
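
Three of the five removed files (`.DS_Store` and the two `Thumbs.db` entries) are operating-system metadata rather than images. One possible refinement, sketched here with a hypothetical `is_candidate_image` helper and an assumed extension list, is to separate such files by extension first so that the "corrupted" count reflects only real image files.

```python
import os

# Assumed set of extensions; the files listed above only use .jpg.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}

def is_candidate_image(path):
    """Return True if the file extension looks like an image format."""
    ext = os.path.splitext(path)[1].lower()
    return ext in IMAGE_EXTENSIONS

# The five paths reported by verify_images above.
removed = [
    "PetImages/PetImages/.DS_Store",
    "PetImages/PetImages/Dog/11702.jpg",
    "PetImages/PetImages/Dog/Thumbs.db",
    "PetImages/PetImages/Cat/666.jpg",
    "PetImages/PetImages/Cat/Thumbs.db",
]
images = [p for p in removed if is_candidate_image(p)]
metadata = [p for p in removed if not is_candidate_image(p)]
print(len(images), len(metadata))  # 2 3
```

With this distinction, only `Dog/11702.jpg` and `Cat/666.jpg` count as corrupted images, which is consistent with the 24998 samples reported by `ImageFolder`.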


2. Choose whether to use a pre-defined ViT model or implement one from scratch. You can use a built-in predefined model for this part.

In [14]:
num_classes = 2

model = timm.create_model('vit_base_patch16_224', pretrained=True)

print("Original classification head:", model.head)

in_features = model.head.in_features
model.head = nn.Linear(in_features, num_classes)
print("Modified classification head:", model.head)

for param in model.parameters():
    param.requires_grad = False
for param in model.head.parameters():
    param.requires_grad = True

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)



Original classification head: Linear(in_features=768, out_features=1000, bias=True)
Modified classification head: Linear(in_features=768, out_features=2, bias=True)


VisionTransformer(
  (patch_embed): PatchEmbed(
    (proj): Conv2d(3, 768, kernel_size=(16, 16), stride=(16, 16))
    (norm): Identity()
  )
  (pos_drop): Dropout(p=0.0, inplace=False)
  (patch_drop): Identity()
  (norm_pre): Identity()
  (blocks): Sequential(
    (0): Block(
      (norm1): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
      (attn): Attention(
        (qkv): Linear(in_features=768, out_features=2304, bias=True)
        (q_norm): Identity()
        (k_norm): Identity()
        (attn_drop): Dropout(p=0.0, inplace=False)
        (proj): Linear(in_features=768, out_features=768, bias=True)
        (proj_drop): Dropout(p=0.0, inplace=False)
      )
      (ls1): Identity()
      (drop_path1): Identity()
      (norm2): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
      (mlp): Mlp(
        (fc1): Linear(in_features=768, out_features=3072, bias=True)
        (act): GELU(approximate='none')
        (drop1): Dropout(p=0.0, inplace=False)
        (norm): Identity(
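
The printed `patch_embed` layer and the replaced head allow a quick sanity check of two numbers: how many tokens the transformer processes per image, and how many parameters remain trainable after freezing the backbone. The sketch below is plain arithmetic with hypothetical helper names. It assumes the standard ViT-B/16 layout with one class token prepended to the patch sequence, which the truncated printout above does not show.

```python
def vit_token_count(image_size=224, patch_size=16, class_tokens=1):
    """Patches per image plus prepended class tokens (assumed to be 1)."""
    patches_per_side = image_size // patch_size
    return patches_per_side ** 2 + class_tokens

def linear_param_count(in_features, out_features, bias=True):
    """Weights plus optional biases of a fully connected layer."""
    return in_features * out_features + (out_features if bias else 0)

tokens = vit_token_count()                # 14 * 14 patches + 1 class token
head_params = linear_param_count(768, 2)  # the only trainable layer
print(tokens, head_params)  # 197 1538
```

Only 1538 parameters receive gradients in this setup, which is why training is fast compared with updating the full model.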

3. Train and evaluate your ViT model. Discuss your results.

In [22]:
criterion = nn.CrossEntropyLoss()

optimizer = optim.Adam(filter(lambda p: p.requires_grad, model.parameters()), lr=1e-4)

num_epochs = 4
train_losses = []
val_losses = []
val_accuracies = []

for epoch in range(num_epochs):
    # Training
    model.train()
    running_loss = 0.0
    for images, labels in tqdm(train_loader, desc=f"Epoch {epoch+1}/{num_epochs} - Training", leave=False):
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        outputs = model(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        running_loss += loss.item() * images.size(0)

    epoch_loss = running_loss / len(train_loader.dataset)
    train_losses.append(epoch_loss)
    print(f"Epoch [{epoch+1}/{num_epochs}] Training Loss: {epoch_loss:.4f}")

    # Validation
    model.eval()
    val_running_loss = 0.0
    correct_preds = 0
    total_samples = 0
    with torch.no_grad():
        for images, labels in val_loader:
            images, labels = images.to(device), labels.to(device)
            outputs = model(images)
            loss = criterion(outputs, labels)
            val_running_loss += loss.item() * images.size(0)

            _, predicted = torch.max(outputs, 1)
            total_samples += labels.size(0)
            correct_preds += (predicted == labels).sum().item()

    epoch_val_loss = val_running_loss / len(val_loader.dataset)
    epoch_val_acc = correct_preds / total_samples
    val_losses.append(epoch_val_loss)
    val_accuracies.append(epoch_val_acc)

    print(f"Epoch [{epoch+1}/{num_epochs}] Validation Loss: {epoch_val_loss:.4f}, Validation Accuracy: {epoch_val_acc:.4f}")

                                                                       

Epoch [1/4] Training Loss: 0.0152
Epoch [1/4] Validation Loss: 0.0233, Validation Accuracy: 0.9933
Epoch [2/4] Training Loss: 0.0133
Epoch [2/4] Validation Loss: 0.0236, Validation Accuracy: 0.9936
Epoch [3/4] Training Loss: 0.0120
Epoch [3/4] Validation Loss: 0.0240, Validation Accuracy: 0.9936
Epoch [4/4] Training Loss: 0.0109
Epoch [4/4] Validation Loss: 0.0245, Validation Accuracy: 0.9931
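
The loop multiplies each batch loss by `images.size(0)` before summing because `nn.CrossEntropyLoss` returns the mean over the batch by default, and the last batch is usually smaller than the rest. The pure-Python sketch below illustrates the difference between that sample-weighted average and a naive mean of batch means. The per-batch loss values are made up for illustration and the helper names are hypothetical; the batch sizes follow from the 17498 training samples and the batch size of 32 used above.

```python
def batch_sizes(num_samples, batch_size):
    """Sizes of consecutive batches when the last batch is not dropped."""
    full, rest = divmod(num_samples, batch_size)
    return [batch_size] * full + ([rest] if rest else [])

def weighted_epoch_loss(batch_losses, sizes):
    """Average per-sample loss, matching running_loss / len(dataset)."""
    return sum(l * n for l, n in zip(batch_losses, sizes)) / sum(sizes)

sizes = batch_sizes(17498, 32)
print(len(sizes), sizes[-1])  # 547 26

# Made-up losses: every full batch has mean loss 0.01, the short last one 0.10.
losses = [0.01] * (len(sizes) - 1) + [0.10]
naive = sum(losses) / len(losses)
weighted = weighted_epoch_loss(losses, sizes)
print(f"naive={naive:.6f} weighted={weighted:.6f}")
```

The naive mean gives the 26-sample batch the same weight as a 32-sample batch, so it overstates the influence of that batch; the weighted form used in the notebook does not.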


In [23]:

model.eval()

test_running_loss = 0.0
correct_test = 0
total_test = 0

with torch.no_grad():
    for images, labels in test_loader:

        images, labels = images.to(device), labels.to(device)

        outputs = model(images)

        loss = criterion(outputs, labels)
        test_running_loss += loss.item() * images.size(0)

        _, predicted = torch.max(outputs, 1)
        total_test += labels.size(0)
        correct_test += (predicted == labels).sum().item()

test_loss = test_running_loss / len(test_loader.dataset)
test_accuracy = correct_test / total_test

print(f"Test Loss: {test_loss:.4f}, Test Accuracy: {test_accuracy:.4f}")

Test Loss: 0.0209, Test Accuracy: 0.9931
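
Because the accuracy is printed to four decimal places, it can be converted back into a count of misclassified test images. The sketch below searches for the numbers of correct predictions that would print as the reported value. `consistent_correct_counts` is a hypothetical helper, and 3751 is the test split size printed earlier.

```python
def consistent_correct_counts(reported, total, decimals=4):
    """Correct-prediction counts whose accuracy prints as `reported`."""
    return [c for c in range(total + 1)
            if f"{c / total:.{decimals}f}" == reported]

counts = consistent_correct_counts("0.9931", 3751)
errors = [3751 - c for c in counts]
print(counts, errors)  # [3725] [26]
```

Under these assumptions the reported test accuracy corresponds to 26 misclassified images out of 3751.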


In [30]:
modelSavePath = "vit_catdog_model.pth"
torch.save(model.state_dict(), modelSavePath)

print(f"Model weights saved to {modelSavePath}")

Model weights saved to vit_catdog_model.pth


The model performs strongly on the training, validation, and test sets. Over four epochs, the training loss decreased steadily from 0.0152 to 0.0109, which indicates that the classification head, the only part trained on top of the frozen backbone, is still fitting the data. The validation loss stayed low but rose slightly each epoch, from 0.0233 to 0.0245, while the validation accuracy stayed between 99.31% and 99.36%. The small, slowly widening gap between training and validation loss suggests that the model generalizes well, with at most mild overfitting by the fourth epoch. The test set confirms the strength of the model with a loss of 0.0209 and an accuracy of 99.31%, showing that it performs reliably on unseen data. Overall, the results indicate that a pretrained Vision Transformer with a retrained classification head is highly effective for the binary task of distinguishing cats from dogs.

4. Deploy your trained ViT model. This could be a simple script or application that takes an image as input and predicts whether it's a cat or a dog.

In [4]:
!pip install gradio

Collecting gradio
  Downloading gradio-5.25.0-py3-none-any.whl.metadata (16 kB)
Collecting aiofiles<25.0,>=22.0 (from gradio)
  Downloading aiofiles-24.1.0-py3-none-any.whl.metadata (10 kB)
Collecting fastapi<1.0,>=0.115.2 (from gradio)
  Downloading fastapi-0.115.12-py3-none-any.whl.metadata (27 kB)
Collecting ffmpy (from gradio)
  Downloading ffmpy-0.5.0-py3-none-any.whl.metadata (3.0 kB)
Collecting gradio-client==1.8.0 (from gradio)
  Downloading gradio_client-1.8.0-py3-none-any.whl.metadata (7.1 kB)
Collecting groovy~=0.1 (from gradio)
  Downloading groovy-0.1.2-py3-none-any.whl.metadata (6.1 kB)
Collecting pydub (from gradio)
  Downloading pydub-0.25.1-py2.py3-none-any.whl.metadata (1.4 kB)
Collecting python-multipart>=0.0.18 (from gradio)
  Downloading python_multipart-0.0.20-py3-none-any.whl.metadata (1.8 kB)
Collecting ruff>=0.9.3 (from gradio)
  Downloading ruff-0.11.5-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (25 kB)
Collecting safehttpx<0.2.0,>=0.1.6 (

In [5]:
import gradio as gr

In [8]:

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

num_classes = 2
model = timm.create_model('vit_base_patch16_224', pretrained=False)
in_features = model.head.in_features
model.head = nn.Linear(in_features, num_classes)

model_save_path = "vit_catdog_model.pth"
model.load_state_dict(torch.load(model_save_path, map_location=device))
model.to(device)

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225])
])

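def predict(image):
    """Classify a PIL image as 'Cat' or 'Dog'."""
    # Uploaded images may be RGBA or grayscale; the model expects 3 channels.
    image = image.convert("RGB")

    # Apply the same preprocessing as in training and add a batch dimension.
    input_tensor = transform(image).unsqueeze(0)
    input_tensor = input_tensor.to(device)

    # Run inference in eval mode without gradient tracking.
    model.eval()
    with torch.no_grad():
        outputs = model(input_tensor)
        _, predicted = torch.max(outputs, 1)

    # ImageFolder assigns class indices alphabetically: Cat=0, Dog=1.
    class_names = ['Cat', 'Dog']
    prediction = class_names[predicted.item()]

    return prediction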

iface = gr.Interface(
    fn=predict,
    inputs=gr.Image(type="pil"),
    outputs="text",
    title="Cat vs Dog Classifier",
    description="Upload an image and the model will predict whether it's a cat or a dog."
)

iface.launch()

Running Gradio in a Colab notebook requires sharing enabled. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://b5491cf41a53dc8925.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)
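
The `predict` function returns only the winning label. If a confidence value is also wanted, the two logits can be turned into probabilities with a softmax. The sketch below shows the arithmetic in plain Python with made-up logits; `softmax` and `label_confidences` are hypothetical helpers. In the notebook the same result would typically come from `torch.softmax(outputs, dim=1)`, and Gradio's `gr.Label` output component accepts a dictionary of label-to-confidence values like the one produced here.

```python
import math

CLASS_NAMES = ["Cat", "Dog"]  # same order as in predict()

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def label_confidences(logits):
    """Map each class name to its softmax probability."""
    return dict(zip(CLASS_NAMES, softmax(logits)))

# Made-up logits for a single image.
scores = label_confidences([-1.2, 3.4])
print(max(scores, key=scores.get), scores)
```

Subtracting the largest logit before exponentiating leaves the result unchanged but avoids overflow for large values.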




5. Record a short video (~5 mins) demonstrating how your deployed ViT model works. The video should showcase the model taking image inputs and providing predictions. Explain the key aspects of your implementation and deployment process in the video.
   a. Upload the video to UBbox and create a shared link
   b. Add the link at the end of your ipynb file.

**Shared UBbox Video Link:** https://buffalo.box.com/s/3ay8aflm5pvzattgct0r81x0mhds19is


6. References. Include details on all the resources used to complete this part.

Hugging Face ViT transformer documentation - https://huggingface.co/docs/transformers/en/model_doc/vit

Gradio user interface - https://www.gradio.app

