This notebook explores three things:

- TensorBoard integration for training.
- Automatic transform generation with PyTorch Image Models (`timm`).
- Model inference on raw images (`.jpg`, `.png`, etc.) with `timm`.

# Setup

In [1]:
from helper import setup_data, utils, plot, engine
import torch
import torchmetrics
from torchinfo import summary
import timm

import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

In [2]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(device)

cuda:0


# Get model

The transforms are derived from the model's pretrained configuration, so the model determines what gets passed to the `DataLoader`. The model is therefore the first thing to load.

In [3]:
timm.list_models()

['bat_resnext26ts',
 'beit_base_patch16_224',
 'beit_base_patch16_384',
 'beit_large_patch16_224',
 'beit_large_patch16_384',
 'beit_large_patch16_512',
 'beitv2_base_patch16_224',
 'beitv2_large_patch16_224',
 'botnet26t_256',
 'botnet50ts_256',
 'caformer_b36',
 'caformer_m36',
 'caformer_s18',
 'caformer_s36',
 'cait_m36_384',
 'cait_m48_448',
 'cait_s24_224',
 'cait_s24_384',
 'cait_s36_384',
 'cait_xs24_384',
 'cait_xxs24_224',
 'cait_xxs24_384',
 'cait_xxs36_224',
 'cait_xxs36_384',
 'coat_lite_medium',
 'coat_lite_medium_384',
 'coat_lite_mini',
 'coat_lite_small',
 'coat_lite_tiny',
 'coat_mini',
 'coat_small',
 'coat_tiny',
 'coatnet_0_224',
 'coatnet_0_rw_224',
 'coatnet_1_224',
 'coatnet_1_rw_224',
 'coatnet_2_224',
 'coatnet_2_rw_224',
 'coatnet_3_224',
 'coatnet_3_rw_224',
 'coatnet_4_224',
 'coatnet_5_224',
 'coatnet_bn_0_rw_224',
 'coatnet_nano_cc_224',
 'coatnet_nano_rw_224',
 'coatnet_pico_rw_224',
 'coatnet_rmlp_0_rw_224',
 'coatnet_rmlp_1_rw2_224',
 'coatnet_rmlp_1_r

Let's try BEiT v2, which was mentioned in an issue of The Batch as performing well.

In [4]:
model = timm.create_model('beitv2_large_patch16_224', pretrained=True, num_classes=101).to(device)
# Note: the number of classes is still needed to initialize the model.
data_cfg = timm.data.resolve_data_config(model.pretrained_cfg)
transform = timm.data.create_transform(**data_cfg)
transform

Compose(
    Resize(size=235, interpolation=bicubic, max_size=None, antialias=warn)
    CenterCrop(size=(224, 224))
    ToTensor()
    Normalize(mean=tensor([0.4850, 0.4560, 0.4060]), std=tensor([0.2290, 0.2240, 0.2250]))
)
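The `Resize(size=235)` step is larger than the 224-pixel crop because `timm` evaluation transforms resize to roughly `crop size / crop_pct` before center-cropping. Below is a minimal sketch of that arithmetic, assuming `crop_pct = 0.95` for this checkpoint (a value inferred from the printed output, not shown in the notebook). `eval_resize_size` is a hypothetical helper name.

```python
import math

def eval_resize_size(crop_size: int, crop_pct: float) -> int:
    """Size of the shorter side before center-cropping (hypothetical helper)."""
    return math.floor(crop_size / crop_pct)

# crop_pct=0.95 is an assumption inferred from the Resize(size=235) output above.
resize_size = eval_resize_size(224, 0.95)
print(resize_size)

# Normalize then maps each channel value x in [0, 1] to (x - mean) / std.
mean, std = (0.485, 0.456, 0.406), (0.229, 0.224, 0.225)
mid_gray = [round((0.5 - m) / s, 4) for m, s in zip(mean, std)]
print(mid_gray)
```

The second part shows where a mid-gray pixel (0.5 in every channel) ends up after normalization with the mean and standard deviation printed in the `Compose` output.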

In [5]:
summary(model, input_size=(32, 3, 224, 224),
        col_names=["input_size", "output_size", "num_params", "trainable"],
        col_width=40,
        row_settings=['var_names'])

Layer (type (var_name))                  Input Shape                              Output Shape                             Param #                                  Trainable
Beit (Beit)                              [32, 3, 224, 224]                        [32, 101]                                1,024                                    True
├─PatchEmbed (patch_embed)               [32, 3, 224, 224]                        [32, 196, 1024]                          --                                       True
│    └─Conv2d (proj)                     [32, 3, 224, 224]                        [32, 1024, 14, 14]                       787,456                                  True
│    └─Identity (norm)                   [32, 196, 1024]                          [32, 196, 1024]                          --                                       --
├─Dropout (pos_drop)                     [32, 197, 1024]                          [32, 197, 1024]                          --                           

# Get data

In [6]:
setup_data.get_data_loaders??

Signature:
setup_data.get_data_loaders(
    train_transforms,
    test_transforms,
    batch_size=64,
)
Docstring:
Returns the data loaders for training, validation, and testing.

Args:
    batch_size (int): Batch size for the data loaders
    
Returns:
    train_loader (torch.utils.data.DataLoader): Data loader for training data
    valid_loader (torch.utils.data.DataLoader): Data loader for validation data
    test_loader (torch.utils.data.DataLoader): Data loader for testing data
    test_data (torchvision.datasets.ImageFolder): Testing data
Source:
def get_data_loaders(train_transforms, test_transforms, batch_size=64

In [7]:
train_loader, val_loader, test_loader, test_data = setup_data.get_data_loaders(transform, transform, batch_size=32)
classes, class_to_idx, idx_to_class = setup_data.get_metadata(test_data)
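`setup_data.get_metadata` is not shown in this notebook. Assuming it reads the `classes` and `class_to_idx` attributes that `torchvision.datasets.ImageFolder` exposes, the three return values would relate as in the sketch below. `FakeImageFolder` is a hypothetical stand-in and the class names are made up.

```python
from typing import Dict, List, Tuple

class FakeImageFolder:
    """Hypothetical stand-in exposing the same attributes as ImageFolder."""
    def __init__(self, classes: List[str]) -> None:
        self.classes = sorted(classes)
        self.class_to_idx = {name: i for i, name in enumerate(self.classes)}

def get_metadata(dataset) -> Tuple[List[str], Dict[str, int], Dict[int, str]]:
    """Assumed behaviour of the helper: class list plus both lookup tables."""
    class_to_idx = dataset.class_to_idx
    idx_to_class = {idx: name for name, idx in class_to_idx.items()}
    return dataset.classes, class_to_idx, idx_to_class

classes, class_to_idx, idx_to_class = get_metadata(
    FakeImageFolder(["pizza", "apple_pie", "sushi"])  # made-up class names
)
print(classes)
print(idx_to_class[class_to_idx["pizza"]])
```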

In [8]:
for param in model.blocks.parameters():
    param.requires_grad = False
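Freezing `model.blocks` leaves the parameters outside the transformer blocks (such as the patch embedding and the classification head) trainable, which is why `summary` reports the model as `Partial` afterwards. The sketch below checks the effect on a small stand-in network. `TinyNet` and `count_params` are hypothetical names, not part of `timm` or the `helper` package.

```python
import torch
from torch import nn

class TinyNet(nn.Module):
    """Stand-in with the same attribute layout: `blocks` (backbone) and `head`."""
    def __init__(self) -> None:
        super().__init__()
        self.blocks = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 8))
        self.head = nn.Linear(8, 3)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.blocks(x))

def count_params(model: nn.Module) -> tuple:
    """Return (trainable, frozen) parameter counts."""
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    frozen = sum(p.numel() for p in model.parameters() if not p.requires_grad)
    return trainable, frozen

tiny = TinyNet()
for param in tiny.blocks.parameters():
    param.requires_grad = False  # same pattern as the cell above

trainable, frozen = count_params(tiny)
print(f"trainable={trainable}, frozen={frozen}")

# Passing only trainable parameters to the optimizer avoids tracking frozen ones.
optimizer = torch.optim.AdamW((p for p in tiny.parameters() if p.requires_grad), lr=1e-3)
```

Filtering the parameters handed to the optimizer is optional. Frozen parameters receive no gradients, so the optimizer skips them either way.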

In [9]:
summary(model, input_size=(32, 3, 224, 224),
        col_names=["input_size", "output_size", "num_params", "trainable"],
        col_width=40,
        row_settings=['var_names'])

Layer (type (var_name))                  Input Shape                              Output Shape                             Param #                                  Trainable
Beit (Beit)                              [32, 3, 224, 224]                        [32, 101]                                1,024                                    Partial
├─PatchEmbed (patch_embed)               [32, 3, 224, 224]                        [32, 196, 1024]                          --                                       True
│    └─Conv2d (proj)                     [32, 3, 224, 224]                        [32, 1024, 14, 14]                       787,456                                  True
│    └─Identity (norm)                   [32, 196, 1024]                          [32, 196, 1024]                          --                                       --
├─Dropout (pos_drop)                     [32, 197, 1024]                          [32, 197, 1024]                          --                        

# Training & Tracking

In [10]:
EPOCHS = 5

In [11]:
loss_fn = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=0.001)
scheduler = torch.optim.lr_scheduler.OneCycleLR(optimizer, max_lr=0.01, steps_per_epoch=len(train_loader), epochs=EPOCHS)
accuracy_fn = torchmetrics.Accuracy(task='multiclass', num_classes=len(classes)).to(device)
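`OneCycleLR` overrides the optimizer's `lr=0.001`: it warms up from `max_lr / div_factor` to `max_lr` and then anneals to a value far below the starting rate. The sketch below traces that schedule with PyTorch's defaults (`div_factor=25`, `pct_start=0.3`). The dummy parameter and the 20 steps per epoch are illustrative assumptions, not values from this run.

```python
import torch

EPOCHS = 5
STEPS_PER_EPOCH = 20  # illustrative; the notebook uses len(train_loader)

# A single dummy parameter is enough to drive the scheduler.
param = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.AdamW([param], lr=0.001)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=0.01, steps_per_epoch=STEPS_PER_EPOCH, epochs=EPOCHS
)

lrs = []
for _ in range(EPOCHS * STEPS_PER_EPOCH):
    lrs.append(optimizer.param_groups[0]["lr"])
    optimizer.step()   # no gradients here, so nothing is actually updated
    scheduler.step()

peak_step = lrs.index(max(lrs))
print(f"start lr: {lrs[0]:.5f}")
print(f"peak lr:  {max(lrs):.5f} at step {peak_step} of {len(lrs)}")
print(f"final lr: {lrs[-1]:.2e}")
```

With the default `pct_start=0.3`, the peak falls about 30% of the way through training, which for a 5-epoch run is during the second epoch.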

In [12]:
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter()

The `train` function needs to be modified to support TensorBoard logging.

In [14]:
from typing import Dict, List
from tqdm.auto import tqdm

def train(model: torch.nn.Module,
          train_loader: torch.utils.data.DataLoader,
          valid_loader: torch.utils.data.DataLoader,
          loss_fn: torch.nn.Module,
          optimizer: torch.optim.Optimizer,
          scheduler: torch.optim.lr_scheduler.LRScheduler,
          accuracy_fn: torchmetrics.Accuracy,
          device: torch.device,
          epochs: int,
          threshold: List[float]) -> Dict[str, List[torch.Tensor]]:
    """
    Trains the model and evaluates it on the validation set.
    
    Args:
        model (torch.nn.Module): Model to be trained
        train_loader (torch.utils.data.DataLoader): Training data loader
        valid_loader (torch.utils.data.DataLoader): Validation data loader
        loss_fn (torch.nn.Module): Loss function
        optimizer (torch.optim.Optimizer): Optimizer
        scheduler (torch.optim.lr_scheduler.LRScheduler): Learning rate scheduler
        accuracy_fn (torchmetrics.Accuracy): Accuracy function
        device (torch.device): Device to run the training on
        epochs (int): Number of epochs to train the model for
        threshold (List[float]): Single-element list holding the minimum decrease
            in validation loss that counts as an improvement (used for early stopping)
    
    Returns:
        Dictionary containing training and validation losses and accuracies for each epoch.
        In the form: {train_losses: [...],
                  train_accuracies: [...],
                  valid_losses: [...],
                  valid_accuracies: [...]} 
        For example if training for epochs=2: 
                    {train_losses: [2.0616, 1.0537],
                    train_accuracies: [0.3945, 0.3945],
                    valid_losses: [1.2641, 1.5706],
                    valid_accuracies: [0.3400, 0.2973]} 
    """
    results = {"train_losses": [], "train_accuracies": [],
               "valid_losses": [], "valid_accuracies": []}
    
    tolerance = 0
    threshold = torch.Tensor(threshold)
    for epoch in tqdm(range(epochs)):
        train_loss, train_acc = engine.train_step(model, train_loader, loss_fn, optimizer, scheduler, accuracy_fn, device)
        valid_loss, valid_acc = engine.test_step(model, valid_loader, loss_fn, accuracy_fn, device)
        print(
            f"Epoch {epoch + 1} of {epochs}"
            f"\n-------------------------------"
        )

        results["train_losses"].append(train_loss.detach())
        results["train_accuracies"].append(train_acc.cpu())
        results["valid_losses"].append(valid_loss.detach())
        results["valid_accuracies"].append(valid_acc.cpu())
        if len(results["valid_losses"]) > 1 and results["valid_losses"][-2] - results["valid_losses"][-1] < threshold:
            tolerance += 1
            if tolerance > 2:
                break
        # Add loss results to SummaryWriter
        writer.add_scalars(main_tag="Loss", 
                           tag_scalar_dict={"train_loss": train_loss,
                                            "valid_loss": valid_loss},
                           global_step=epoch)

        # Add accuracy results to SummaryWriter
        writer.add_scalars(main_tag="Accuracy", 
                           tag_scalar_dict={"train_acc": train_acc,
                                            "valid_acc": valid_acc}, 
                           global_step=epoch)
        
        # Track the PyTorch model architecture (logging the graph once is enough)
        if epoch == 0:
            writer.add_graph(model=model, 
                             input_to_model=torch.randn(32, 3, 224, 224).to(device))
    
    # Close the writer
    writer.close()

    # Return the filled results at the end of the epochs
    return results
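The early-stopping rule inside `train` counts epochs whose validation-loss improvement falls below `threshold` and stops once that count exceeds 2. The counter is never reset. The pure-Python sketch below replays that rule on the validation losses logged in this notebook's training run, which shows why all five epochs ran. `stopping_epoch` is a hypothetical helper name.

```python
from typing import List, Optional

def stopping_epoch(valid_losses: List[float], threshold: float,
                   max_tolerance: int = 2) -> Optional[int]:
    """Return the 1-based epoch at which training would stop, or None."""
    tolerance = 0
    for i in range(1, len(valid_losses)):
        improvement = valid_losses[i - 1] - valid_losses[i]
        if improvement < threshold:
            tolerance += 1
            if tolerance > max_tolerance:
                return i + 1
    return None

# Validation losses copied from the training log of this notebook.
logged = [0.60373, 0.48520, 0.43722, 0.44402, 0.77162]
first = stopping_epoch(logged, threshold=1e-5)
print(first)

# One more non-improving epoch (made-up loss value) would have triggered the stop.
second = stopping_epoch(logged + [0.80], threshold=1e-5)
print(second)
```

Only epochs 4 and 5 failed to improve, so the counter reached 2 and never exceeded it.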

I ran TensorBoard from VS Code. After the first launch, however, TensorBoard could not be launched again, which is an artifact of the plugin. The fix is to kill the process manually and launch TensorBoard from the CLI; this [answer](https://stackoverflow.com/a/68119251) pointed me in that direction. I also needed to install `lsof` to find the process ID, which I then passed to `kill`. The procedure is described [here](https://landoflinux.com/linux_lsof_command_examples.html) and [here](https://linuxhandbook.com/kill-process-port/).
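The manual fix can also be scripted. This is a rough sketch, assuming `lsof` is installed, TensorBoard is listening on its default port 6006, and the logs are in `runs/` (the default directory of `SummaryWriter()`). The function names are hypothetical.

```python
import os
import signal
import subprocess
from typing import List

def parse_pids(lsof_output: str) -> List[int]:
    """Parse the output of `lsof -t`, which prints one PID per line."""
    return [int(line) for line in lsof_output.split() if line.isdigit()]

def pids_on_port(port: int = 6006) -> List[int]:
    """Return the PIDs of processes using the given port (empty if none)."""
    result = subprocess.run(["lsof", "-t", f"-i:{port}"],
                            capture_output=True, text=True)
    return parse_pids(result.stdout)

def restart_tensorboard(logdir: str = "runs", port: int = 6006) -> subprocess.Popen:
    """Terminate whatever holds the port, then launch TensorBoard from the CLI."""
    for pid in pids_on_port(port):
        os.kill(pid, signal.SIGTERM)
    return subprocess.Popen(["tensorboard", "--logdir", logdir, "--port", str(port)])

# Parsing only; nothing is killed or launched until restart_tensorboard() is called.
print(parse_pids("12345\n67890\n"))
```

Calling `restart_tensorboard()` terminates every process bound to the port, so check what `pids_on_port()` returns before using it on a shared machine.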

In [15]:
results = train(model,
                train_loader,
                val_loader,
                loss_fn,
                optimizer,
                scheduler,
                accuracy_fn,
                device,
                EPOCHS,
                threshold=[1e-5])

  0%|          | 0/5 [00:00<?, ?it/s]

Training model::   0%|          | 0/2218 [00:00<?, ?it/s]

Train loss: 1.20150 | Train accuracy: 0.70


Making predictions::   0%|          | 0/1038 [00:00<?, ?it/s]

Test loss: 0.60373 | Test accuracy: 0.83
Epoch 1 of 5
-------------------------------


Training model::   0%|          | 0/2218 [00:00<?, ?it/s]

Train loss: 0.74406 | Train accuracy: 0.80


Making predictions::   0%|          | 0/1038 [00:00<?, ?it/s]

Test loss: 0.48520 | Test accuracy: 0.87
Epoch 2 of 5
-------------------------------


Training model::   0%|          | 0/2218 [00:00<?, ?it/s]

Train loss: 0.56183 | Train accuracy: 0.84


Making predictions::   0%|          | 0/1038 [00:00<?, ?it/s]

Test loss: 0.43722 | Test accuracy: 0.88
Epoch 3 of 5
-------------------------------


Training model::   0%|          | 0/2218 [00:00<?, ?it/s]

Train loss: 0.49756 | Train accuracy: 0.86


Making predictions::   0%|          | 0/1038 [00:00<?, ?it/s]

Test loss: 0.44402 | Test accuracy: 0.87
Epoch 4 of 5
-------------------------------


Training model::   0%|          | 0/2218 [00:00<?, ?it/s]

Train loss: 0.52504 | Train accuracy: 0.85


Making predictions::   0%|          | 0/1038 [00:00<?, ?it/s]

Test loss: 0.77162 | Test accuracy: 0.79
Epoch 5 of 5
-------------------------------


Finished training! `OneCycleLR`, the scheduler from the super-convergence work, is impressive: validation accuracy (labelled "Test accuracy" in the log above) reached 0.88, close to 90%, in just 3 epochs. However, it is also aggressive: validation accuracy dropped to 0.87 in epoch 4 and 0.79 in epoch 5. It is a great scheduler at first, but after 3-4 epochs the model would likely benefit from a more conservative one.

Now check the performance on the test set.

The TensorBoard visualization of these metrics is available [here](https://tensorboard.dev/experiment/WVRhweDbRdKs3WnTEIL34w/).

In [None]:
test_loss, test_acc = engine.test_step(model, test_loader, loss_fn, accuracy_fn, device)