# Vision Transformer
To assign images to disease classes, I am going to use a Vision Transformer (ViT). I chose the pretrained base model with 86 million parameters, and I am also trying smaller versions. The base model is described in the foundational ViT paper, An Image is Worth 16x16 Words, and at around 88 thousand samples (70,295 for training and 17,572 for validation) the image dataset is not large enough to warrant a bigger model.

In [3]:
options = [
    "google/vit-base-patch16-224-in21k",  # 86M params, baseline, trained on 14M images w/ 21k classes
    "WinKawaks/vit-small-patch16-224",    # 22M params, faster training
    "WinKawaks/vit-tiny-patch16-224",     # smallest model
    "google/vit-base-patch16-224",        # 86M params, fine-tuned on ImageNet-1k classes. DECIDED AGAINST IT
]
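
The checkpoint names encode the patch size and the input resolution, which is where the "16x16 words" in the paper title comes from. As a rough sketch (the helper `describe_checkpoint` is hypothetical, written only for this illustration), the number of image patches can be worked out from the name alone:

```python
import re

def describe_checkpoint(name):
    """Parse patch size and resolution from a name like 'vit-tiny-patch16-224'."""
    match = re.search(r"patch(\d+)-(\d+)", name)
    if match is None:
        raise ValueError(f"no patch size / resolution found in {name!r}")
    patch, resolution = int(match.group(1)), int(match.group(2))
    per_side = resolution // patch  # patches along one edge of the image
    return {"patch": patch, "resolution": resolution, "patches": per_side ** 2}

info = describe_checkpoint("WinKawaks/vit-tiny-patch16-224")
print(info)
```

A 224x224 image cut into 16x16 patches gives 14 patches per side, so every model in the list works on 196 patches per image.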

### Imports

In [79]:
#!pip install transformers
from transformers import ViTForImageClassification, ViTImageProcessor, Trainer
from torch.utils.data import DataLoader  # batching, used in the fine-tuning section
from torchvision.datasets import ImageFolder  # for dataset loading
from torchvision import transforms
import torch
import torch.nn as nn
from tqdm import tqdm  # progress bar for training

### Data loading

In [41]:
trainPath = "data/New Plant Diseases Dataset(Augmented)/train"
validPath = "data/New Plant Diseases Dataset(Augmented)/valid"
# basic transformation
processed = transforms.Compose([
    transforms.Resize((224, 224)),  # ViT expects 224x224. It is in the model's name!
    transforms.ToTensor(),          # converts image to tensor with values in the range [0, 1]
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),  # pixel value 0 -> -1, 0.5 -> 0, 1 -> 1
])
trainDataset = ImageFolder(root=trainPath, transform=processed)
validDataset = ImageFolder(root=validPath, transform=processed)

In [42]:
print(f"Training samples: {len(trainDataset)}")
print(f"Validation samples: {len(validDataset)}")
print(f"Classes: {len(trainDataset.classes)}")
print(f"First 5 class names: {trainDataset.classes[:5]}")

Training samples: 70295
Validation samples: 17572
Classes: 38
First 5 class names: ['Apple___Apple_scab', 'Apple___Black_rot', 'Apple___Cedar_apple_rust', 'Apple___healthy', 'Blueberry___healthy']
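
The five folder names printed above follow a `Plant___Condition` pattern. Assuming the remaining classes use the same triple-underscore separator (I have only shown the first five here), a small helper can split a class name into its two parts. `split_label` is a hypothetical name, not something `ImageFolder` provides:

```python
def split_label(class_name):
    """Split 'Apple___Apple_scab' into ('Apple', 'Apple scab')."""
    plant, _, condition = class_name.partition("___")
    return plant, condition.replace("_", " ")

# the five class names printed by the cell above
classes = [
    "Apple___Apple_scab",
    "Apple___Black_rot",
    "Apple___Cedar_apple_rust",
    "Apple___healthy",
    "Blueberry___healthy",
]

for name in classes:
    print(split_label(name))

# classes that describe a healthy plant rather than a disease
healthy = [c for c in classes if split_label(c)[1] == "healthy"]
```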


### Normalization

The model was pretrained with specific mean/std values:

In [47]:
processor = ViTImageProcessor.from_pretrained('WinKawaks/vit-tiny-patch16-224')
print("Expected image size: " + str(processor.size))
print("Standard deviation: " + str(processor.image_std))
print("Mean: " + str(processor.image_mean))

Expected image size: {'height': 224, 'width': 224}
Standard deviation: [0.5, 0.5, 0.5]
Mean: [0.5, 0.5, 0.5]


Now that I know the mean and standard deviation the model expects, I added them back into the transform in the data loading step.
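
To make the effect of these values concrete, here is the per-pixel arithmetic that normalization applies, `(pixel - mean) / std`, written out in plain Python. The function names are my own and only illustrate the formula:

```python
def normalize(pixel, mean=0.5, std=0.5):
    """Map a pixel value from the [0, 1] range to the [-1, 1] range."""
    return (pixel - mean) / std

def denormalize(value, mean=0.5, std=0.5):
    """Invert normalize(), e.g. before plotting an image again."""
    return value * std + mean

for pixel in (0.0, 0.5, 1.0):
    print(pixel, "->", normalize(pixel))
```

With a mean and standard deviation of 0.5, black (0) maps to -1, mid-gray (0.5) to 0, and white (1) to 1.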

# WinKawaks/vit-tiny-patch16-224
source: https://huggingface.co/WinKawaks/vit-tiny-patch16-224

In [49]:


# pretrained model. I am going to try a couple more of these to find the best result
model = ViTForImageClassification.from_pretrained(
    'WinKawaks/vit-tiny-patch16-224',
    num_labels=38,  # the number of disease classes, discovered with EDA
    ignore_mismatched_sizes=True,  # the checkpoint's head has 1000 classes, just like ImageNet-1k,
    # so the final layer has to be replaced
)

Some weights of ViTForImageClassification were not initialized from the model checkpoint at WinKawaks/vit-tiny-patch16-224 and are newly initialized because the shapes did not match:
- classifier.bias: found shape torch.Size([1000]) in the checkpoint and torch.Size([38]) in the model instantiated
- classifier.weight: found shape torch.Size([1000, 192]) in the checkpoint and torch.Size([38, 192]) in the model instantiated
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


The new classification head is randomly initialized, so I need to fine-tune the model before it can make useful predictions. The paper describes the same approach, although with a zero-initialized head:

*Typically, we pre-train ViT on large datasets, and fine-tune to (smaller) downstream tasks. For this, we remove the pre-trained prediction head and attach a zero-initialized D×K feedforward layer, where K is the number of downstream classes. It is often beneficial to fine-tune at higher resolution than pre-training*
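
For comparison, this is a minimal sketch of the zero-initialized D×K head the paper describes, with D = 192 (the hidden size shown in the loading message for the tiny model) and K = 38. It is not what the loading cell did, and the random `features` tensor only stands in for real embeddings:

```python
import torch
import torch.nn as nn

hidden_dim = 192   # D, hidden size of vit-tiny as reported in the loading message
num_classes = 38   # K, number of disease classes

# a D x K linear layer with weights and bias set to zero, as in the paper
head = nn.Linear(hidden_dim, num_classes)
nn.init.zeros_(head.weight)
nn.init.zeros_(head.bias)

features = torch.randn(4, hidden_dim)  # stand-in for four image embeddings
logits = head(features)
num_params = sum(p.numel() for p in head.parameters())
print(logits.shape, num_params)
```

A zero-initialized head outputs all-zero logits, so before fine-tuning every class gets the same probability of 1/38. Either way, only 192 × 38 + 38 = 7334 parameters start from scratch; the rest of the network keeps its pretrained weights.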

## Fine Tuning

In [69]:
batchSize = 32  # kept small because my laptop has limited memory
trainLoader = DataLoader(trainDataset, batch_size=batchSize, shuffle=True)
validLoader = DataLoader(validDataset, batch_size=batchSize, shuffle=False)
print('Training batches amount: ' + str(len(trainLoader)))#2197
print('Validation batches amount: ' + str(len(validLoader)))# 550, checks out

Training batches amount: 2197
Validation batches amount: 550
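
As a sanity check, these batch counts follow directly from the dataset sizes printed earlier: with the default `drop_last=False`, a `DataLoader` yields the sample count divided by the batch size, rounded up.

```python
import math

batch_size = 32
train_samples = 70295
valid_samples = 17572

train_batches = math.ceil(train_samples / batch_size)
valid_batches = math.ceil(valid_samples / batch_size)

# the final batch of each loader is smaller than 32
last_train_batch = train_samples % batch_size
last_valid_batch = valid_samples % batch_size

print(train_batches, valid_batches, last_train_batch, last_valid_batch)
```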


The paper says:
*We fine-tune all ViT models using SGD with a momentum of 0.9.*


For learning rates, they tried: {0.001, 0.003, 0.01, 0.03} depending on the dataset.
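
To show what momentum adds to plain SGD, here is a small sketch of one update step on a single number. It follows the convention documented for `torch.optim.SGD` (velocity = momentum × velocity + gradient, with no dampening); the toy function f(p) = p² and the helper name are my own assumptions for the illustration.

```python
def sgd_momentum_step(param, grad, velocity, lr=0.003, momentum=0.9):
    """One SGD-with-momentum update for a scalar parameter."""
    velocity = momentum * velocity + grad  # running accumulation of past gradients
    param = param - lr * velocity          # step against the accumulated direction
    return param, velocity

param, velocity = 1.0, 0.0
for step in range(3):
    grad = 2.0 * param  # gradient of the toy function f(p) = p**2
    param, velocity = sgd_momentum_step(param, grad, velocity)
    print(f"step {step}: param={param:.6f} velocity={velocity:.6f}")
```

Because the velocity keeps 90% of its previous value, consecutive gradients pointing the same way make the steps grow, which speeds up progress compared to plain SGD at the same learning rate.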

In [77]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# I don't have an NVIDIA GPU, so CPU it is
model = model.to(device)  # move the model before building the optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=0.003, momentum=0.9)
criterion = nn.CrossEntropyLoss()  # typical for classification

## Training

In [98]:
def trainEpoch(model, trainLoader, optimizer, criterion, device):
    model.train()
    cumuLoss = 0.0
    correct = 0
    total = 0
    for batch in tqdm(trainLoader, desc="Training model..."):
        images, labels = batch
        images = images.to(device)
        labels = labels.to(device)

        # reset gradients left over from the previous step
        optimizer.zero_grad()

        # forward pass: returns an object with the logits inside it
        outputs = model(images)
        logits = outputs.logits
        # loss calculation for this step
        loss = criterion(logits, labels)
        # backward pass and weight update
        loss.backward()
        optimizer.step()
        # metrics update
        cumuLoss += loss.item()
        _, predicted = torch.max(logits, 1)
        correct += (predicted == labels).sum().item()  # if prediction matches label, it counts
        total += labels.size(0)
    epochLoss = cumuLoss / len(trainLoader)  # mean loss per batch
    epochAccuracy = correct / total
    return epochLoss, epochAccuracy
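
The validation loader has not been used yet. A matching evaluation pass could look like the sketch below. This is an assumption about how I would measure validation accuracy, not code that has been run on the real data: `evalEpoch` and `DummyModel` are made-up names, and the dummy model and random batches exist only so the example runs without downloading a checkpoint.

```python
import torch
import torch.nn as nn
from types import SimpleNamespace

def evalEpoch(model, loader, criterion, device):
    """Compute mean loss and accuracy without updating any weights."""
    model.eval()  # switch off dropout and similar training-only behavior
    cumuLoss, correct, total = 0.0, 0, 0
    with torch.no_grad():  # no gradients needed for evaluation
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            logits = model(images).logits
            cumuLoss += criterion(logits, labels).item()
            correct += (logits.argmax(dim=1) == labels).sum().item()
            total += labels.size(0)
    return cumuLoss / len(loader), correct / total

class DummyModel(nn.Module):
    """Stand-in that mimics the `.logits` attribute of the ViT output."""
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 3)

    def forward(self, x):
        return SimpleNamespace(logits=self.linear(x))

# two random batches of 8 samples with 4 features and 3 classes
loader = [(torch.randn(8, 4), torch.randint(0, 3, (8,))) for _ in range(2)]
loss, acc = evalEpoch(DummyModel(), loader, nn.CrossEntropyLoss(), torch.device("cpu"))
print(f"Loss: {loss:.4f}, Accuracy: {acc:.4f}")
```

With the real model, the call would take `validLoader` in place of the random batches.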

In [None]:
print("Starting training (1 epoch test)")
loss, acc = trainEpoch(model, trainLoader, optimizer, criterion, device)
print(f"Loss: {loss:.4f}, Accuracy: {acc:.4f}")

Starting training (1 epoch test)


Training model...:   7%|████▎                                                     | 161/2197 [08:29<1:52:50,  3.33s/it]
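
The progress bar already hints at how long one epoch takes on the CPU. A back-of-the-envelope estimate, assuming the 3.33 s/it rate shown above stays roughly constant for the whole epoch:

```python
batches = 2197            # training batches per epoch
seconds_per_batch = 3.33  # rate reported by the progress bar

epoch_seconds = batches * seconds_per_batch
epoch_hours = epoch_seconds / 3600

print(f"{epoch_hours:.2f} hours per epoch")
```

That comes to about two hours per epoch for the tiny model, which agrees with the elapsed plus remaining time shown by the progress bar.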

## google/vit-base-patch16-224
Vision Transformer (ViT) model pre-trained on ImageNet-21k (14 million images, 21,843 classes) at resolution 224x224, and fine-tuned on ImageNet 2012 (1 million images, 1,000 classes) at resolution 224x224. It was introduced in the paper An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale by Dosovitskiy et al. and first released in this repository. However, the weights were converted from the timm repository by Ross Wightman, who already converted the weights from JAX to PyTorch. Credits go to him.
source:  https://huggingface.co/google/vit-base-patch16-224-in21k