# Convolutions-Vision-Transformers

Implementation of [CvT: Introducing Convolutions to Vision Transformers](https://arxiv.org/abs/2103.15808)

![Pipeline](figures/pipeline.svg)

Dataset used: [Tom and Jerry Image classification | Kaggle](https://www.kaggle.com/datasets/balabaskar/tom-and-jerry-image-classification)



University of Rome, La Sapienza. Artificial Intelligence and Robotics. Neural Networks course, A.Y. 2022/23.

Esteban Vincent | Aurélien Lurois

In [6]:
#!pip install -q -r requirements.txt 

In [7]:
#!pip install einops

In [8]:
# archive.zip contains the folders 'cvt' and 'dataset'
#!unzip archive

In [9]:
#@title Imports
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
from cvt.cvt import CvT
import pandas as pd
from sklearn.model_selection import train_test_split
from torch.utils.data import Dataset, DataLoader
from PIL import Image
from tqdm import tqdm

In [10]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = CvT(128, 3, 4).to(device)

df = pd.read_csv("dataset/ground_truth.csv")
df = df[["filename", "class"]]

train_df, test_df = train_test_split(df, test_size=0.2)
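The split above is random and unseeded, so the train and test sets change on every run and class proportions can drift between them. A minimal sketch of a reproducible, class-balanced alternative follows. It uses a hypothetical toy DataFrame (`toy_df`) in place of `ground_truth.csv`, and the seed value is an arbitrary choice, not something taken from the notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for ground_truth.csv: 40 rows, 4 classes, 10 rows each.
toy_df = pd.DataFrame({
    "filename": [f"frame{i}.jpg" for i in range(40)],
    "class": [i % 4 for i in range(40)],
})

# stratify keeps the class proportions equal in both splits;
# random_state makes the split repeatable (the value 0 is arbitrary).
toy_train, toy_test = train_test_split(
    toy_df,
    test_size=0.2,
    stratify=toy_df["class"],
    random_state=0,
)

print(len(toy_train), len(toy_test))
print(toy_test["class"].value_counts().sort_index().tolist())
```

Applied to the notebook's own data, the same two keyword arguments would be passed as `stratify=df["class"]` and a fixed `random_state`.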

In [11]:
class FrameDataset(Dataset):
    def __init__(self, df):
        self.filenames = df["filename"].tolist()
        self.images = []
        self.labels = df["class"].tolist()

        for filename in tqdm(self.filenames):
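            # Force three channels so permute(2, 0, 1) also works for
            # grayscale or RGBA files
            img = Image.open(f"dataset/imgs/{filename}").convert("RGB")
            # Convert the image to a PyTorch tensor, (H, W, C) -> (C, H, W)
            img_tensor = torch.from_numpy(
                np.array(img)).permute(2, 0, 1)

            # Scale pixel values from [0, 255] to [0, 1]
            img_tensor = img_tensor.float() / 255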
            self.images.append(img_tensor)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return (self.images[idx], self.labels[idx])
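The per-image conversion inside `FrameDataset` can be checked in isolation. The sketch below wraps it in a hypothetical helper, `image_to_tensor`, and runs it on a synthetic grayscale image rather than a dataset file. The `convert("RGB")` call is an added safeguard so that non-RGB inputs still produce three channels.

```python
import numpy as np
import torch
from PIL import Image


def image_to_tensor(img):
    """Convert a PIL image to a float32 (C, H, W) tensor scaled to [0, 1]."""
    rgb = img.convert("RGB")   # guarantees exactly three channels
    array = np.array(rgb)      # shape (H, W, 3), dtype uint8
    return torch.from_numpy(array).permute(2, 0, 1).float() / 255


# Synthetic all-white grayscale image, 8 pixels wide and 6 pixels high.
gray = Image.new("L", (8, 6), color=255)
tensor = image_to_tensor(gray)
print(tensor.shape, tensor.dtype, tensor.max().item())
```

Note that PIL reports size as (width, height), while the resulting tensor is laid out as (channels, height, width).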

In [17]:
train_loader = DataLoader(FrameDataset(train_df), batch_size=32, shuffle=True)

# Define the loss function
loss_fun = nn.CrossEntropyLoss()

# Choose an optimizer
# Note: Adam's default learning rate is 1e-3; 0.02 is a comparatively large step size
optimizer = optim.Adam(model.parameters(), lr=0.02)

100%|██████████| 4382/4382 [00:02<00:00, 1631.84it/s]
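`nn.CrossEntropyLoss` expects raw logits and integer class indices, and it helps to know what value it gives for a model that has learned nothing. The sketch below assumes 4 classes (taken here to be the last argument of `CvT(128, 3, 4)`, which the notebook does not state explicitly) and computes the loss for uniform predictions, which equals ln 4 ≈ 1.386.

```python
import math

import torch
import torch.nn as nn

num_classes = 4  # assumption: the last argument of CvT(128, 3, 4)
loss_fun = nn.CrossEntropyLoss()

# Identical logits give a uniform softmax, i.e. probability 1/4 per class.
logits = torch.zeros(8, num_classes)
labels = torch.randint(0, num_classes, (8,))  # integer class indices

uniform_loss = loss_fun(logits, labels).item()
print(f"{uniform_loss:.3f} vs ln({num_classes}) = {math.log(num_classes):.3f}")
```

The training losses logged further down (1.363, 1.338, 1.338) sit only slightly below this baseline, which may indicate that the model has not moved far from uniform predictions during those epochs.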


In [None]:
#@title Train the model
model.train()
for epoch in tqdm(range(15), desc="Epochs"):
    running_loss = 0.0
    for image, label in tqdm(train_loader, desc="Images", position=0, leave=True):
        optimizer.zero_grad()

        output = model(image.to(device))
        loss = loss_fun(output, label.to(device))
        loss.backward()
        optimizer.step()

        running_loss += loss.item()

    print(f'Epoch {epoch+1} training loss: {running_loss/len(train_loader):.3f}')

Images: 100%|██████████| 35/35 [00:25<00:00,  1.38it/s]
Epochs:   7%|▋         | 1/15 [00:25<05:55, 25.41s/it]

Epoch 1 training loss: 1.363


Images: 100%|██████████| 35/35 [00:25<00:00,  1.35it/s]
Epochs:  13%|█▎        | 2/15 [00:51<05:34, 25.70s/it]

Epoch 2 training loss: 1.338


Images: 100%|██████████| 35/35 [00:25<00:00,  1.39it/s]
Epochs:  20%|██        | 3/15 [01:16<05:06, 25.51s/it]

Epoch 3 training loss: 1.338


Images:  43%|████▎     | 15/35 [00:10<00:14,  1.37it/s]
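The notebook creates `test_df` but the cells shown here never use it. One way to measure held-out accuracy is sketched below. The `evaluate` helper is hypothetical, and a small linear stand-in model with random tensors replaces `CvT` and the real images so that the block runs on its own.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset


@torch.no_grad()
def evaluate(model, loader, device="cpu"):
    """Return top-1 accuracy of `model` over all batches in `loader`."""
    model.eval()  # switch off dropout and batch-norm updates
    correct, total = 0, 0
    for images, labels in loader:
        logits = model(images.to(device))
        predictions = logits.argmax(dim=1)
        correct += (predictions == labels.to(device)).sum().item()
        total += labels.size(0)
    return correct / total


# Stand-in data and model (hypothetical, not the notebook's CvT or dataset).
torch.manual_seed(0)
images = torch.rand(16, 3, 8, 8)
labels = torch.randint(0, 4, (16,))
loader = DataLoader(TensorDataset(images, labels), batch_size=4)
stand_in = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 4))

accuracy = evaluate(stand_in, loader)
print(f"accuracy: {accuracy:.3f}")
```

With the notebook's objects, the call would presumably be `evaluate(model, DataLoader(FrameDataset(test_df), batch_size=32), device)`, assuming all images share the same size. Because `evaluate` leaves the model in evaluation mode, `model.train()` should be called again before any further training.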