# Comparison

Over the course of the project I have tried my hand at four models (ConvNeXt, ViT, Swin, BEiT(?)). They have all been competitive after fine-tuning, but they share one trait: they are big.

Now, "size matters" cuts both ways: larger models tend to perform better, but they are usually slower and require more memory, both system RAM and accelerator (GPU/TPU) memory. If you have practically unlimited compute (e.g., 100+ NVIDIA 80GB H100s lying around), no problem. But if you are deploying your model to more modest hardware, or if latency costs you a lot of money, you are in for a big problem. You will want to reduce the size of the model while retaining as much performance as possible.

My first deployment on Hugging Face Spaces was a Swin-Large model. It fits just fine on the Space, but each prediction takes ~5.4s. I want to explore alternatives that may perform somewhat worse but offer better latency.

## Looking around

A helpful rule of thumb I follow is: "People have already done that."

I always expect that whatever I can think of, people have already thought of, achieved, or come very close to. There are many reasons behind this, but the interesting corollary is that the first thing I do is look up what people have done.

I found [Jeremy Howard's visualization of `timm`'s benchmark](https://www.kaggle.com/code/jhoward/which-image-models-are-best/) and [Daniel Bourke's result with ViT and EfficientNet](https://www.learnpytorch.io/09_pytorch_model_deployment/) (and while I am at it, yes, I am following Bourke's course).

Jeremy's visualization suggested that I should check out the following from `timm`:
- LeViT
- ViT (okay, it was not even there - just my pick)
- EVA-02 (who hates teenagers driving robots to save the world?)
- Swin
- ConvNeXt
- BEiT
- EfficientNet

And I planned to do exactly that in this notebook, with the help of PyTorch Lightning.

## Objectives

I want a model with:
- 95%+ accuracy
- As low latency as possible, preferably close to 24 FPS (the standard for old movies with synchronized sound...)
- As small a memory footprint as possible
- A high F1 score
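
To make the latency objective measurable, here is a minimal sketch of how per-image latency could be timed and compared against the 24 FPS budget (1/24 s, roughly 41.7 ms per image). `measure_latency`, `FRAME_BUDGET_S`, and the toy model are hypothetical names used only for illustration; they are not part of the notebook's pipeline.

```python
import time

import torch
from torch import nn


def measure_latency(model: nn.Module, example: torch.Tensor,
                    warmup: int = 3, iters: int = 10) -> float:
    """Return the mean number of seconds per forward pass on `example`."""
    model.eval()
    use_cuda = example.is_cuda
    with torch.no_grad():
        # Warm-up runs absorb one-off costs such as lazy initialization.
        for _ in range(warmup):
            model(example)
        if use_cuda:
            torch.cuda.synchronize()  # CUDA kernels run asynchronously
        start = time.perf_counter()
        for _ in range(iters):
            model(example)
        if use_cuda:
            torch.cuda.synchronize()
        elapsed = time.perf_counter() - start
    return elapsed / iters


# 24 FPS leaves 1/24 s per image.
FRAME_BUDGET_S = 1.0 / 24

# Stand-in model (an assumption); swap in a fine-tuned candidate to benchmark it.
toy_model = nn.Sequential(
    nn.Conv2d(3, 8, 3), nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 101)
)
latency_s = measure_latency(toy_model, torch.randn(1, 3, 224, 224))
print(f"{latency_s * 1e3:.2f} ms per image, within budget: {latency_s <= FRAME_BUDGET_S}")
```

The number depends heavily on hardware, batch size, and input resolution, so it is only comparable between models measured under the same conditions.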

Specifications:
- Dataset: Food101 (100% data)
- Hardware: 2 $\times$ NVIDIA GeForce RTX 3090 + CUDA 11.7 + PyTorch 2.0.1
- Batch size: 64
- Epochs: 3
- Optimizer: AdamW
- Scheduler: OneCycleLR
- Metrics: Accuracy, F1-Score
- Tracking: TensorBoard
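
A one-cycle schedule needs to know how many optimizer steps a run will take, and that follows directly from these specifications. The sketch below works it out, assuming the Food101 training split of 75,750 images (101 classes with 750 images each), an 80/20 train/validation split, and a batch size of 64 per device. `steps_per_epoch` is a hypothetical helper, not a library function.

```python
import math

FOOD101_TRAIN_IMAGES = 75_750  # 101 classes x 750 training images each
TRAIN_FRACTION = 0.8           # the remaining 20% is held out for validation
BATCH_SIZE = 64                # assumed to apply per device


def steps_per_epoch(num_samples: int, batch_size: int, num_devices: int = 1) -> int:
    """Optimizer steps per epoch when each device sees an equal shard of the data."""
    samples_per_device = math.ceil(num_samples / num_devices)
    return math.ceil(samples_per_device / batch_size)


train_samples = round(FOOD101_TRAIN_IMAGES * TRAIN_FRACTION)
print(train_samples)                                              # 60600
print(steps_per_epoch(train_samples, BATCH_SIZE, num_devices=1))  # 947
print(steps_per_epoch(train_samples, BATCH_SIZE, num_devices=2))  # 474
```

Note that the step count roughly halves when the data is sharded across two GPUs, so a value computed for one device does not carry over to a two-device run.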

In [3]:
import lightning as L
import torch
import timm
import torch.nn.functional as F
import torchmetrics

from typing import Union
from pathlib import Path
from torchvision.datasets import Food101
from torch.utils.data import random_split, DataLoader
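from lightning.pytorch.loggers import TensorBoardLogger
from lightning.pytorch.callbacks import ModelCheckpoint

# Allow TF32 kernels on Ampere GPUs for faster matmuls and convolutions.
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True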

In [4]:
class Food101DataModule(L.LightningDataModule):
    def __init__(self, transform, data_dir: Union[str, Path] = "data", batch_size: int = 64) -> None:
        super().__init__()
        self.data_dir = data_dir
        self.batch_size = batch_size
        self.transform = transform

    def prepare_data(self):
        Food101(self.data_dir, split='train', download=True)
        Food101(self.data_dir, split='test', download=True)

    def setup(self, stage: str = 'fit'):
        if stage == 'fit':
            food101_full = Food101(self.data_dir, split='train', download=True, transform=self.transform)
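            # Fixed seed: every model (and every process) sees the same train/val partition.
            self.food101_train, self.food101_val = random_split(
                food101_full, [0.8, 0.2], generator=torch.Generator().manual_seed(42)
            )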

        if stage == 'test':
            self.food101_test = Food101(self.data_dir, split='test', download=True, transform=self.transform)

        if stage == "predict":
            self.food101_predict = Food101(self.data_dir, split='test', download=True, transform=self.transform)

    def train_dataloader(self):
        # Use the configured batch size and shuffle only the training data.
        return DataLoader(self.food101_train, batch_size=self.batch_size, shuffle=True)

    def val_dataloader(self):
        return DataLoader(self.food101_val, batch_size=self.batch_size)

    def test_dataloader(self):
        return DataLoader(self.food101_test, batch_size=self.batch_size)

    def predict_dataloader(self):
        return DataLoader(self.food101_predict, batch_size=self.batch_size)
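
Since the goal is to compare many models, it helps if they are all validated on the same held-out images. A minimal sketch of how a seeded generator makes `random_split` reproducible, using a toy dataset as a stand-in for Food101 (the dataset, the seed value, and `split_indices` are assumptions for illustration):

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Toy stand-in for the Food101 training split (hypothetical data).
dataset = TensorDataset(torch.arange(100))


def split_indices(seed: int) -> tuple[list[int], list[int]]:
    """80/20 split with a seeded generator, returning the chosen indices."""
    generator = torch.Generator().manual_seed(seed)
    train, val = random_split(dataset, [0.8, 0.2], generator=generator)
    return list(train.indices), list(val.indices)


train_a, val_a = split_indices(seed=42)
train_b, val_b = split_indices(seed=42)

print(len(train_a), len(val_a))        # 80 20
print(train_a == train_b)              # True: same seed, same split
print(set(train_a).isdisjoint(val_a))  # True: the subsets never overlap
```

Passing fractions instead of absolute lengths requires a reasonably recent PyTorch; the 2.0.1 release listed in the specifications supports it.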

In [5]:
class Food101Classifier(L.LightningModule):
    def __init__(self, model_name: str, epochs: int, num_classes: int = 101) -> None:
        super().__init__()
        self.num_classes = num_classes
        self.epochs = epochs
        self.model = timm.create_model(model_name, pretrained=True, num_classes=num_classes)
        self.train_acc = torchmetrics.Accuracy(task="multiclass", num_classes=num_classes)
        self.valid_acc = torchmetrics.Accuracy(task="multiclass", num_classes=num_classes)
        self.test_acc = torchmetrics.Accuracy(task="multiclass", num_classes=num_classes)
        # Macro-averaged F1; the default micro average equals accuracy for single-label data.
        self.f1_metric = torchmetrics.F1Score(task="multiclass", num_classes=num_classes, average="macro")

    def forward(self, x):
        return self.model(x)

    def training_step(self, batch, batch_idx):
        inputs, labels = batch
        outputs = self(inputs)
        preds = torch.argmax(outputs, 1)
        loss = F.cross_entropy(outputs, labels)
        self.log("train_loss", loss, on_epoch=True, prog_bar=True)
        self.train_acc(preds, labels)
        self.log("train_acc", self.train_acc, on_epoch=True, prog_bar=True)
        return loss

    def validation_step(self, batch, batch_idx):
        # Lightning already puts the model in eval mode during validation.
        inputs, labels = batch
        outputs = self(inputs)
        preds = torch.argmax(outputs, 1)
        loss = F.cross_entropy(outputs, labels)
        self.log("val_loss", loss, prog_bar=True, sync_dist=True)
        self.valid_acc(preds, labels)
        self.log("val_acc", self.valid_acc, prog_bar=True)
        self.f1_metric(preds, labels)
        self.log("val_f1", self.f1_metric, prog_bar=True)

    def test_step(self, batch, batch_idx):
        # Required because trainer.test() is called after fitting.
        inputs, labels = batch
        preds = torch.argmax(self(inputs), 1)
        self.test_acc(preds, labels)
        self.log("test_acc", self.test_acc)

    def configure_optimizers(self):
        optimizer = torch.optim.AdamW(self.parameters(), lr=0.001, fused=True)
        # Let the Trainer estimate the total step count (it uses max_epochs), so the
        # schedule stays correct for any batch size and number of devices.
        scheduler = torch.optim.lr_scheduler.OneCycleLR(
            optimizer, max_lr=0.01, total_steps=self.trainer.estimated_stepping_batches
        )
        return [optimizer], [{"scheduler": scheduler, "interval": "step"}]
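
To see what the one-cycle schedule does to the learning rate, here is a small sketch that runs `OneCycleLR` on a stand-in model and records the rate at every step. The step count and the `Linear` layer are assumptions chosen so the loop runs instantly; only the peak of 0.01 and the three epochs match the setup above.

```python
import torch

EPOCHS = 3
STEPS_PER_EPOCH = 10  # small hypothetical value so the loop runs instantly
MAX_LR = 0.01

model = torch.nn.Linear(4, 2)  # stand-in for the real network
optimizer = torch.optim.AdamW(model.parameters(), lr=0.001)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=MAX_LR, steps_per_epoch=STEPS_PER_EPOCH, epochs=EPOCHS
)

lrs = []
for _ in range(EPOCHS * STEPS_PER_EPOCH):
    lrs.append(scheduler.get_last_lr()[0])
    optimizer.step()  # no gradients here; this only keeps the call order valid
    scheduler.step()  # stepped once per batch, which is why the interval is "step"

print(f"start={lrs[0]:.5f} peak={max(lrs):.5f} end={lrs[-1]:.8f}")
```

Two things are worth noticing: the `lr` passed to the optimizer is overridden by the scheduler, which starts at `max_lr / 25` by default, and the schedule only completes its cycle if the configured number of steps matches the number of steps actually taken.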

In [6]:
models = ["levit_128s.fb_dist_in1k", "levit_128.fb_dist_in1k", "levit_192.fb_dist_in1k", "levit_256.fb_dist_in1k", "levit_384.fb_dist_in1k",
          "convnextv2_nano.fcmae_ft_in22k_in1k_384", "convnextv2_tiny.fcmae_ft_in22k_in1k_384", "convnextv2_base.fcmae_ft_in22k_in1k_384",
          "convnextv2_large.fcmae_ft_in22k_in1k_384", "tf_efficientnetv2_s.in21k_ft_in1k", "tf_efficientnetv2_m.in21k_ft_in1k",
          "tf_efficientnetv2_l.in21k_ft_in1k", "tf_efficientnetv2_b3.in21k_ft_in1k", "tf_efficientnet_b2.ns_jft_in1k",
          "beitv2_large_patch16_224.in1k_ft_in22k_in1k", "beitv2_base_patch16_224.in1k_ft_in22k_in1k", "vit_base_patch14_dinov2.lvd142m",
          "vit_large_patch14_dinov2.lvd142m", "vit_small_patch14_dinov2.lvd142m", "vit_large_patch14_clip_336.laion2b_ft_in12k_in1k_inat21",
          "vit_large_patch14_clip_336.datacompxl_ft_inat21", "eva02_large_patch14_clip_336.merged2b_ft_inat21", "vit_relpos_medium_patch16_rpn_224.sw_in1k",
          "swinv2_tiny_window8_256.ms_in1k", "swinv2_small_window8_256.ms_in1k", "swinv2_base_window8_256.ms_in1k", "swinv2_large_window12to16_192to256.ms_in22k_ft_in1k"]

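Because memory is one of the objectives, it can be useful to rank candidates by parameter count before training any of them. A minimal sketch, using a plain linear layer as a stand-in so it runs without downloading anything; `parameter_megabytes` is a hypothetical helper, and it counts weights only, not activations or optimizer state.

```python
import torch
from torch import nn


def parameter_megabytes(model: nn.Module) -> tuple[int, float]:
    """Return (parameter count, size of the parameters in MB)."""
    n_params = sum(p.numel() for p in model.parameters())
    n_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
    return n_params, n_bytes / 1e6


# Stand-in module (an assumption): a 101-way linear head on 768 features.
head = nn.Linear(768, 101)
n_params, size_mb = parameter_megabytes(head)
print(n_params, f"{size_mb:.3f} MB")  # 77669 0.311 MB
```

The same helper should work on a `timm` model built with `timm.create_model(name, pretrained=False, num_classes=101)`, which constructs the architecture without fetching weights.
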
In [None]:
for model in models:
    logger = TensorBoardLogger("runs", version=1, name=f"{model}/logs")
    food_model = Food101Classifier("hf_hub:timm/"+model, 3)
    compiled_model = torch.compile(food_model)
    data_cfg = timm.data.resolve_data_config(food_model.model.pretrained_cfg)
    transform = timm.data.create_transform(**data_cfg)
    food_data = Food101DataModule(transform)
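    # Accuracy should be maximized; ModelCheckpoint minimizes the monitored value by default.
    checkpoint_callback = ModelCheckpoint(monitor="val_acc", mode="max", dirpath="models", filename=f"{model}/checkpoints")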
    trainer = L.Trainer(
        logger=logger,
        accelerator='gpu',
        devices=2,
        strategy="ddp",  # pick one strategy: "fsdp" shards the model instead; inside Jupyter use "ddp_notebook"
        precision="bf16-mixed",
        accumulate_grad_batches=1,
        enable_checkpointing=True,
        callbacks=[checkpoint_callback],
        max_epochs=3,
        fast_dev_run=False,
        profiler="advanced",
        inference_mode=False
    )
    trainer.fit(compiled_model, food_data)
    trainer.test(compiled_model, food_data)