# Table of contents

## [I. Intro](#intro)
## [II. Dataset](#dataset)
## [III. Data augmentation](#data-augmentation)
## [IV. Base model selection](#base-model-selection)
## [V. Hyperparameter tuning](#hpo)
## [VI. K-fold cross validation](#k-fold_cross_validation)

# I. Intro <a class="anchor" id="intro"></a>

This work is shared for beginners who want a solution for the `petals to the metals` competition on the Kaggle platform, using PyTorch and common techniques such as:
- data augmentation
- transfer learning
- hyperparameter search
- custom learning rate scheduler
- cross validation

The entire project can be found on my repo [here](https://github.com/NoeGuedet/Kaggle-efficientnet_v2_l-PyTorch).

WARNING: I am a student and I am still learning! It took me a couple of months to develop this solution, which I am satisfied with. I am not a professional by any means, and this work may contain some basic errors or poor optimizations. Contact me and I'll be happy to correct them :)

Hardware:
- CPU: TR1920x 12C/24T @ 3.9 GHz
- RAM: 128 GB DDR4 @ 3200 MHz
- GPU: RTX 3090 EVGA FTW3

## Import libraries

In [None]:
!pip install tfrecord

In [None]:
import os
import sys
import logging
import gc
import time
import random

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import optuna
from PIL import Image
from tqdm.notebook import tqdm
import tfrecord
import cv2

import torch
import torchvision.transforms.v2 as transforms
from torch.utils.data import Dataset, DataLoader, ConcatDataset, SubsetRandomSampler
import torch.nn as nn
from torch.optim.lr_scheduler import LambdaLR
from torch.optim import Adam

from torchvision.models import densenet161, efficientnet_v2_l, vgg19_bn

from sklearn.model_selection import KFold

In [None]:
print(torch.__version__)

In [None]:
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

print(f'Device : {DEVICE}')

# II. Dataset <a class="anchor" id="dataset"></a>

The dataset class is defined in the `utility.py` file, with functions inspired by [this notebook](https://www.kaggle.com/code/adikaboost/transfer-learning-efficientnet-pytorch) (credit to BOOTLEG).

In [None]:
DATASET_PATH = "/kaggle/input/tpu-getting-started/tfrecords-jpeg-512x512"
# Batch size is limited by the amount of available GPU memory (24 GB)
BATCH_SIZE = 8
NUM_CLASSES = 104
IMG_RESOLUTION = (512, 512)

In [None]:
# From https://www.kaggle.com/code/adikaboost/transfer-learning-efficientnet-pytorch
def transform_tf_to_df(dataset_path, subset_data):
    df = pd.DataFrame({"id": pd.Series(dtype="str"), 
                       "class": pd.Series(dtype="int"), 
                       "img": pd.Series(dtype="object")})    
    tf_files = []
    
    for subdir, dirs, files in os.walk(dataset_path):
        if subdir.split("/")[-1] == subset_data:
            for file in files:
                filepath = subdir + os.sep + file
                tf_files.append(filepath)
                
    for tf_file in tf_files:
        if subset_data == "test":
            loader = tfrecord.tfrecord_loader(tf_file, None, {"id": "byte", "image": "byte"})
        else:
            loader = tfrecord.tfrecord_loader(tf_file, None, {"id": "byte","image": "byte", "class": "int"})
        
        for record in loader:
            id_label = record["id"].decode('utf-8')
            label = record["class"][0].item() if subset_data != "test" else None
            img_bytes = np.frombuffer(record["image"], dtype=np.uint8)
            img = cv2.imdecode(img_bytes, cv2.IMREAD_COLOR)
            img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
            df.loc[len(df.index)] = [id_label, label, img]
    return df

In [None]:
class FlowerDataset(Dataset):
    def __init__(self, dataset_path, subset_data, num_classes=104, transform=None):
        self.num_classes = num_classes
        self.transform = transform
        self.df_data = transform_tf_to_df(dataset_path, subset_data)

    def __len__(self):
        return self.df_data.shape[0]

    def __getitem__(self, idx):
        """Return (image id, one-hot label, image) for the row at position `idx`."""
        img_id = self.df_data.iloc[idx, 0]
        label = self.df_data.iloc[idx, 1]
        image = self.df_data.iloc[idx, 2]
        image = Image.fromarray(image)
        
        if self.transform:
            image = self.transform(image)
        
        # One-hot encode the label; test records have no label, so y stays all zeros
        y = np.zeros(self.num_classes, dtype=np.float32)
        if not pd.isna(label):
            y[int(label)] = 1.0
        return img_id, y, image

In [None]:
train_data = FlowerDataset(DATASET_PATH, 'train', num_classes=NUM_CLASSES)
val_data = FlowerDataset(DATASET_PATH, 'val', num_classes=NUM_CLASSES)

In [None]:
n_rows = 2
n_cols = 3

for _ in range(0, n_rows):
    fig, ax = plt.subplots(1, n_cols)
    for n in range(0, n_cols):    
        idx = random.randrange(len(train_data))  # valid indices are 0 .. len - 1
        img = train_data[idx][2]
        ax[n].imshow(img)

# III. Data augmentation <a class="anchor" id="data-augmentation"></a>

The data transformation can probably be upgraded to better suit the dataset.

## Random dropout

The code below is inspired by this [notebook](https://www.kaggle.com/code/tuckerarrants/kfold-efficientnet-augmentation-s#III.-Augmentation).

In [None]:
class RandomImgDropout(object):
    """
        Randomly drop out rectangular regions of an image by setting them to zero.

        Attributes:
        - p (float): Probability of applying the dropout transformation. Default is 0.5.
        - dim (int): Dimension of the image (assuming a square image). Default is IMG_RESOLUTION[0].
        - n_dropout (int): Number of dropout regions to create in the image. Default is 5.
        - scaled_size (float): Size of the dropout regions as a fraction of the image dimension. Default is 0.1.

        Parameters:
            img (torch.Tensor): Input image tensor of shape (C, H, W).

        Returns:
            torch.Tensor: Output image tensor of shape (C, H, W).
    """
    
    def __init__(self, p=0.5, dim=IMG_RESOLUTION[0], n_dropout=5, scaled_size=0.1):
        self.p = p
        self.dim = dim
        self.n_dropout = n_dropout
        self.scaled_size = scaled_size
            
    def __call__(self, img):
        do_tr = torch.rand(1)[0] < self.p
        
        if not do_tr:
            return img

        for _ in range(0, self.n_dropout):
            x = torch.randint(0, self.dim, ()).type(torch.int32)
            y = torch.randint(0, self.dim, ()).type(torch.int32)
            width = torch.tensor(self.scaled_size * self.dim, dtype=torch.int32)

            ya = torch.maximum(y-width//2, torch.tensor(0, dtype=torch.int32))
            yb = torch.minimum(y+width//2, torch.tensor(self.dim, dtype=torch.int32))
            xa = torch.maximum(x-width//2, torch.tensor(0, dtype=torch.int32))
            xb = torch.minimum(x+width//2, torch.tensor(self.dim, dtype=torch.int32))

            img[:, ya:yb, xa:xb] = 0
            
        return img
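
To make the geometry of the dropout regions concrete, here is a small framework-free sketch of how one region is clamped to the image borders. The helper name `dropout_box` is hypothetical; it only mirrors the integer arithmetic used in `__call__` above, with the `scaled_size=0.12` value used later in the transforms.

```python
def dropout_box(x, y, dim=512, scaled_size=0.12):
    """Return (xa, xb, ya, yb) for a region centred on (x, y).

    Hypothetical helper mirroring the arithmetic of RandomImgDropout.__call__.
    """
    width = int(scaled_size * dim)   # 0.12 * 512 = 61.44, truncated to 61
    half = width // 2                # 30
    xa, xb = max(x - half, 0), min(x + half, dim)
    ya, yb = max(y - half, 0), min(y + half, dim)
    return xa, xb, ya, yb


# A centre near the left edge is clipped on the low side,
# a centre near the bottom edge is clipped on the high side.
print(dropout_box(10, 500))   # (0, 40, 470, 512)
print(dropout_box(256, 256))  # (226, 286, 226, 286)
```

Away from the borders each region is 60 pixels wide (two halves of 30), and regions whose centre falls near an edge are simply smaller.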

In [None]:
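# Per-channel (mean, std) used by `Normalize`. These are the commonly used CIFAR-10
# statistics, not values computed on this dataset. Also note that `ToDtype(torch.float32)`
# below is used without `scale=True`, so pixels stay in [0, 255] before normalization.
stats = ((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010))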

In [None]:
visual_transform = transforms.Compose([
    transforms.RandomResizedCrop(size=IMG_RESOLUTION, scale=(0.8, 1)),
    transforms.RandomEqualize(),
    transforms.RandomVerticalFlip(),
    transforms.RandomHorizontalFlip(),
    transforms.RandomApply([transforms.ElasticTransform(alpha=80.0)]),
    transforms.RandomPerspective(distortion_scale=(0.3), p=0.4),
    transforms.PILToTensor(),
    RandomImgDropout(scaled_size=0.12, n_dropout=10),
    transforms.ToPILImage()
])

# Same as visual_transform, but converts the image to a normalized tensor at the end
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(size=IMG_RESOLUTION, scale=(0.8, 1)),
    transforms.RandomEqualize(),
    transforms.RandomVerticalFlip(),
    transforms.RandomHorizontalFlip(),
    transforms.RandomApply([transforms.ElasticTransform(alpha=80.0)]),
    transforms.RandomPerspective(distortion_scale=(0.3), p=0.4),
    transforms.PILToTensor(),
    RandomImgDropout(scaled_size=0.12, n_dropout=10),
    transforms.ToDtype(torch.float32),
    transforms.Normalize(*stats,inplace=True)
])

val_transform = transforms.Compose([
    transforms.PILToTensor(),
    transforms.ToDtype(torch.float32),
    transforms.Normalize(*stats,inplace=True)
])

In [None]:
train_data.transform = visual_transform

n_rows = 2
n_cols = 3

for _ in range(0, n_rows):
    fig, ax = plt.subplots(1, n_cols)
    for n in range(0, n_cols):    
        idx = random.randrange(len(train_data))  # valid indices are 0 .. len - 1
        img = train_data[idx][2]
        ax[n].imshow(img)

In [None]:
train_data.transform = train_transform
val_data.transform = val_transform

## (Optional) CPU benchmark with the dataloader

The goal of this benchmark is to find an optimal number of workers to process the dataset and avoid a potential bottleneck where the CPU 'feeds' the GPU too slowly.

The cells below have been converted to raw format to avoid accidentally running the benchmark at each restart of the notebook.

In [None]:
# My CPU has 12 cores / 24 threads. I avoid using all the threads, which makes the system crash.
# torch.get_num_threads() returns only the number of cores, not the actual number of threads.
# n_workers = [i for i in range(4, 24, 2)]
# n_workers

In [None]:
# time_score = []

# for n in n_workers:
#     data = DataLoader(train_data, batch_size=BATCH_SIZE, num_workers=n)
#     start_time = time.time()
#     for img_id, label, img in data:
#         pass
#     end_time = time.time()
#     time_score.append(end_time - start_time)

In [None]:
# fig, ax = plt.subplots()

# ax.bar(n_workers, time_score)

# ax.set_ylabel('Time (in seconds)')
# ax.set_xlabel('Number of workers')
# ax.set_title('AVG time to process the dataset per number of workers')

# plt.show()

As shown in the plot below, the optimal number of workers for this hardware is 16 (fast enough without too much power consumption from using all the threads).

![CPU Benchmark](./cpu_benchmark.png "AVG time to process the dataset per number of workers")

In [None]:
# N_WORKERS = 16
# Only 2 CPUs are available on Kaggle
N_WORKERS = 2

In [None]:
train_loader = DataLoader(train_data, batch_size=BATCH_SIZE, num_workers=N_WORKERS, shuffle=True, drop_last=True)
# Validation needs neither shuffling nor dropping the last incomplete batch
val_loader = DataLoader(val_data, batch_size=BATCH_SIZE, num_workers=N_WORKERS, shuffle=False, drop_last=False)

# IV. Base model selection <a class="anchor" id="base-model-selection"></a>

## Training function

In [None]:
def model_trainer(model, criterion, optimizer, train_loader, val_loader=None, epochs=0, scheduler=None, device='cpu', show_progress=False, trial=None):
    # Move the model to the specified device
    model.to(device)
    
    history = {
        'train_loss': [],
        'train_accuracy': []
    }
    
    if val_loader is not None:
        history['val_loss'] = []
        history['val_accuracy'] = []
    
    epochs_loop = tqdm(range(0, epochs), desc="Epoch", leave=True) if show_progress else range(0, epochs)
    for epoch in epochs_loop:        
        model.train()
        running_loss = 0.0
        correct = 0
        total = 0

        # training loop
        training_loop = tqdm(train_loader, leave=False, desc="Training") if show_progress else train_loader
        for _, labels, inputs in training_loop:
            inputs, labels = inputs.to(device), labels.to(device)
            
            # Zero the parameter gradients
            optimizer.zero_grad()
            
            # Forward pass
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            
            # Backward pass and optimize
            loss.backward()
            optimizer.step()
            
            # Calculate statistics
            running_loss += loss.item()
            
            _, predicted = torch.max(outputs.data, 1)
            _, y_class = torch.max(labels.data, 1)

            total += labels.size(0)
            correct += (predicted == y_class).sum().item()

            if show_progress:
                training_loop.set_postfix(training_loss=running_loss / (training_loop.n + 1), training_accuracy=correct / total)
            
        # Calculate average loss and accuracy for the epoch
        train_epoch_loss = running_loss / len(train_loader)
        train_epoch_accuracy = correct / total

        if show_progress:
            epochs_loop.set_postfix(train_loss=train_epoch_loss, train_accuracy=train_epoch_accuracy)
            
        history['train_loss'].append(train_epoch_loss)
        history['train_accuracy'].append(train_epoch_accuracy)
        
        if val_loader is not None:
            model.eval()
            val_loss = 0.0
            val_correct = 0
            val_total = 0

            val_loop = tqdm(val_loader, leave=False, desc="Validating") if show_progress else val_loader
            
            with torch.no_grad():
                for _, labels, inputs in val_loop:
                    inputs, labels = inputs.to(device), labels.to(device)
                    
                    outputs = model(inputs)
                    loss = criterion(outputs, labels)
                    
                    val_loss += loss.item()
                    _, predicted = torch.max(outputs.data, 1)
                    _, y_class = torch.max(labels.data, 1)
                    
                    val_total += labels.size(0)
                    val_correct += (predicted == y_class).sum().item()

                    if show_progress:
                        val_loop.set_postfix(val_loss=val_loss / (val_loop.n + 1), val_accuracy=val_correct / val_total)
            
            val_epoch_loss = val_loss / len(val_loader)
            val_epoch_accuracy = val_correct / val_total

            if show_progress:
                epochs_loop.set_postfix(train_loss=train_epoch_loss, val_loss=val_epoch_loss, train_accuracy=train_epoch_accuracy,  val_accuracy=val_epoch_accuracy)
            
            history['val_loss'].append(val_epoch_loss)
            history['val_accuracy'].append(val_epoch_accuracy)

        if scheduler is not None:
            scheduler.step()

        if trial is not None:
            rep_acc = val_epoch_accuracy if val_loader is not None else train_epoch_accuracy
            trial.report(rep_acc, epoch)
            
            if trial.should_prune():
                raise optuna.exceptions.TrialPruned()
    
    return history

## Model benchmark

Now the base model can be selected. A few preselected CNNs are benchmarked, and the one with the best accuracy after 10 epochs of training is used as the base model.

Here are all the models that will be tested:
- DenseNet161
- EfficientNetV2-L
- VGG-19_BN

In [None]:
# Replace the final layer of each model with a new one that has the right number of classes
densenet_model = densenet161(weights='DEFAULT')
densenet_model.classifier = nn.LazyLinear(out_features=NUM_CLASSES)

efficient_model = efficientnet_v2_l(weights='DEFAULT')
efficient_model.classifier[-1] = nn.LazyLinear(out_features=NUM_CLASSES)

vgg_model = vgg19_bn(weights='DEFAULT')
vgg_model.classifier[-1] = nn.LazyLinear(out_features=NUM_CLASSES)

models = {
    'densenet161': densenet_model,
    'efficientnet_v2_l': efficient_model,
    'vgg19_bn': vgg_model
}

The cell below has been converted to raw format to avoid accidentally running the benchmark at each restart of the notebook.

In [None]:
# models_benchmark = {}

# epochs = 10
# lr = 0.001
# criterion = nn.CrossEntropyLoss()

# benchmark_train_loader = DataLoader(train_data, batch_size=BATCH_SIZE, num_workers=12, shuffle=True)
# benchmark_val_loader = DataLoader(val_data, batch_size=BATCH_SIZE, num_workers=12, shuffle=False)

# for model_name, model in models.items():
#     optimizer = Adam(model.parameters(), lr=lr)

#     print(f"#### {model_name} ####")
#     history = model_trainer(
#         model,
#         criterion,
#         optimizer,
#         benchmark_train_loader,
#         val_loader=benchmark_val_loader,
#         epochs=epochs,
#         device=DEVICE
#     )

#     for metric, data in history.items():
#         models_benchmark[model_name + '_' + metric] = data
    
#     # free gpu memory for the next trial
#     model.cpu()
#     del model
#     gc.collect()
#     torch.cuda.empty_cache()

# models_benchmark_df = pd.DataFrame.from_dict(models_benchmark)
# models_benchmark_df.to_csv('./models_benchmark.csv', index=False)

In [None]:
models_benchmark_df = pd.read_csv('/kaggle/input/petals-to-the-metals-model/models_benchmark.csv')

In [None]:
models_benchmark_df.head()

In [None]:
for model_name in models.keys():
    print(f"#### {model_name} stats ####")
    fig, ax = plt.subplots(1, 2)
    fig.set_size_inches(15, 4)

    # Plotting loss
    ax[0].set_ylim([0, 5])
    ax[0].plot(models_benchmark_df[model_name + '_train_loss'], label='Train Loss')
    ax[0].plot(models_benchmark_df[model_name + '_val_loss'], label='Val Loss')
    ax[0].set_ylabel('Loss')
    ax[0].set_xlabel('Epoch')
    ax[0].set_title('Loss')
    ax[0].legend()

    # Plotting accuracy
    ax[1].set_ylim([0, 1])
    ax[1].plot(models_benchmark_df[model_name + '_train_accuracy'], label='Train Accuracy')
    ax[1].plot(models_benchmark_df[model_name + '_val_accuracy'], label='Val Accuracy')
    ax[1].set_ylabel('Accuracy')
    ax[1].set_xlabel('Epoch')
    ax[1].set_title('Accuracy')
    ax[1].legend()

    plt.show()

The EfficientNetV2-L model performs best, with a better accuracy than the others after 10 epochs of training. It converges faster than DenseNet161 while still having a validation loss smaller than the training loss, which suggests that there is still room for better performance.

None of the models overfit, which suggests that the data augmentation is adequate.

# V. Hyperparameter tuning <a class="anchor" id="hpo"></a>

Now that the base model has been selected (EfficientNetV2-L), I am going to run a hyperparameter search to improve it. The goal of this search is to find a good classifier architecture: the right number of hidden layers, and the right dropout rate and number of units per hidden layer.

The library used for this HPO is Optuna, which is simple to work with.
We only need to create an objective function that asks for hyperparameter values and returns a score. The framework then tries to optimize the score of the objective function by proposing better combinations of values.

## 1. Custom classifier function

This function takes hyperparameter values as input and returns a `Sequential` container built from them.

In [None]:
def get_custom_classifier(linear_layers: list, dropout_layers: list, bn_layers: list):
    layers = []
    for out_features, dropout_rate, is_bn in zip(linear_layers, dropout_layers, bn_layers):
        layers.append(nn.Dropout(p=dropout_rate))
        layers.append(nn.LazyLinear(out_features))
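        if is_bn:
            layers.append(nn.LazyBatchNorm1d())
        layers.append(nn.ReLU())
        
    layers.append(nn.LazyLinear(NUM_CLASSES))
    # Note: nn.CrossEntropyLoss already applies log-softmax to its input, so this
    # Softmax is applied on top of it during training. It is kept here because the
    # saved weights used later in this notebook were trained with this architecture.
    layers.append(nn.Softmax(dim=1))
    return nn.Sequential(*layers)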
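
A word of caution on the last layer: `nn.CrossEntropyLoss` expects raw logits and applies log-softmax internally, so a final `nn.Softmax` means the outputs are normalized twice during training. The framework-free sketch below (plain `math`, hypothetical helper name) illustrates the effect for 104 classes: when the inputs are already probabilities in [0, 1], the loss can only move within a band of width 1. Whether removing the final `nn.Softmax` would improve this particular model is an assumption that would need to be checked by retraining.

```python
import math


def cross_entropy_from_scores(scores, target):
    """Cross-entropy computed as log-softmax over `scores`, evaluated at `target`."""
    m = max(scores)
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scores))
    return -(scores[target] - log_sum_exp)


num_classes = 104

# Scores that are already softmax outputs: a fully confident, correct
# prediction (probability 1 on the target) ...
perfect = [0.0] * num_classes
perfect[0] = 1.0
best_loss = cross_entropy_from_scores(perfect, target=0)

# ... and a fully confident, wrong prediction (probability 1 on another class).
wrong = [0.0] * num_classes
wrong[1] = 1.0
worst_loss = cross_entropy_from_scores(wrong, target=0)

print(f"best loss:  {best_loss:.2f}")   # about 3.66
print(f"worst loss: {worst_loss:.2f}")  # about 4.66
```

With raw logits instead, the same loss could go all the way down to 0 for a confident correct prediction.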

## 2. Objective function and data retrieval

In [None]:
def objective(trial):     
    model = efficientnet_v2_l(weights='DEFAULT')
    
    n_hidden_layers = trial.suggest_int('num_hidden_layers', 0, 3)
    linear_layers = []
    dropout_layers = []
    bn_layers = []
    
    for i in range(0, n_hidden_layers):
        linear_layers.append(trial.suggest_int(f'l{i}_out_features', 128, 1024, step=32))
        dropout_layers.append(trial.suggest_float(f'l{i}_dropout_rate', 0, 0.6))
        bn_layers.append(trial.suggest_categorical(f'l{i}_is_bn', [True, False]))

    model.classifier = get_custom_classifier(linear_layers, dropout_layers, bn_layers)

    criterion = nn.CrossEntropyLoss()
    optimizer = Adam(model.parameters(), lr=1e-5)
    
    hist = model_trainer(model, criterion, optimizer, train_loader, val_loader=val_loader, epochs=15, device=DEVICE, show_progress=False, trial=trial)

    model.cpu()
    del model, criterion, optimizer
    gc.collect()
    torch.cuda.empty_cache()

    return hist['val_accuracy'][-1]

## 3. Let the calculation begin!

I wasn't able to run the hyperparameter search within the JupyterLab interface: it requires keeping the tab open to maintain the connection, and the kernel would die after a certain amount of time (I don't know why).

Here is how I did it:
- Convert the notebook to a .py file using this command line: `jupyter nbconvert --to script petals_to_the_metals.ipynb`
- Rename the file: `mv ./petals_to_the_metals.py ./hyper_parameter_tuning.py`
- Strip the code down to just run the HPO, save the result to the `HPO.csv` file, and add logging to track the process
- Start the HPO with this command line: `nohup python hyper_parameter_tuning.py > hyper_parameter_tuning.log 2>&1 &`, which starts a background process that keeps running after the JupyterLab session closes

In [None]:
# study = optuna.create_study(direction="maximize")
# study.optimize(objective, n_trials=100, gc_after_trial=True, show_progress_bar=False)
# HPO_df = study.trials_dataframe()
# HPO_df.to_csv('./HPO.csv', index=False)

In [None]:
HPO_df = pd.read_csv('/kaggle/input/petals-to-the-metals-model/HPO.csv').sort_values('value', ascending=False)

In [None]:
HPO_df.head()

The best parameters for the classifier seem to be a single hidden layer with 1024 units and a dropout rate of ~0.38. Using batch normalization does not seem to have a significant impact.

In [None]:
hidden_layers = [1024] 
dropout_layers = [0.38] 
bn_layers = [False]
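
Instead of copying the values by hand, the three lists can also be rebuilt from a flat dictionary of trial parameters, such as the one Optuna exposes through `study.best_params`. The sketch below is one possible way to do it; the helper name `params_to_layer_lists` is hypothetical, and the key names are the ones used in `objective`.

```python
def params_to_layer_lists(params):
    """Rebuild the three per-layer lists from a flat Optuna-style dict.

    Hypothetical helper: key names follow those used in `objective`.
    """
    n = params["num_hidden_layers"]
    linear = [params[f"l{i}_out_features"] for i in range(n)]
    dropout = [params[f"l{i}_dropout_rate"] for i in range(n)]
    use_bn = [params[f"l{i}_is_bn"] for i in range(n)]
    return linear, dropout, use_bn


# Same configuration as the one retained above
best = {
    "num_hidden_layers": 1,
    "l0_out_features": 1024,
    "l0_dropout_rate": 0.38,
    "l0_is_bn": False,
}
hidden_layers, dropout_layers, bn_layers = params_to_layer_lists(best)
print(hidden_layers, dropout_layers, bn_layers)  # [1024] [0.38] [False]
```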

# VI. K-fold cross validation <a class="anchor" id="k-fold_cross_validation"></a>

## Learning rate scheduler

### Feature extractor

Since we are using a pretrained model, it is important to have a specific learning rate scheduler which updates the pretrained weights slowly at the beginning, to avoid overshooting or getting stuck in suboptimal points. In the code below, the variables of this scheduler are prefixed with `head`.

Here is the learning rate strategy that will be used:
1. Warmup stage: adjust to the new dataset without making large, potentially harmful updates
2. Aggressive learning rate in the middle of the training to make the model converge faster
3. Decreasing learning rate to settle into a local minimum

The code below is taken from [this notebook](https://www.kaggle.com/code/tuckerarrants/kfold-efficientnet-augmentation-s#IV.-Model-Training).

In [None]:
# since the model is pretrained and the batch size is small, a small lr is better
head_lr_start = 1e-5
head_lr_min = 1e-5
head_lr_max = 1e-4
head_lr_rampup_epochs = 5
head_lr_sustain_epoch = 0
head_lr_decay = .8

def custom_head_lr_scheduler(epoch):
    if epoch < head_lr_rampup_epochs:
        return (head_lr_max - head_lr_start) / head_lr_rampup_epochs * epoch + head_lr_start
        
    elif epoch < head_lr_rampup_epochs + head_lr_sustain_epoch:
        return head_lr_max
        
    else:
        return (head_lr_max - head_lr_min) * head_lr_decay**(epoch - head_lr_rampup_epochs - head_lr_sustain_epoch) + head_lr_min
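
As a quick sanity check of the three stages, the sketch below restates the schedule as a pure function and evaluates it at a few epochs. The name `head_lr` and its keyword arguments are hypothetical; the default values are the ones set above.

```python
def head_lr(epoch, lr_start=1e-5, lr_min=1e-5, lr_max=1e-4,
            rampup_epochs=5, sustain_epochs=0, decay=0.8):
    """Learning rate of the pretrained layers at a given epoch."""
    if epoch < rampup_epochs:
        # Linear warmup from lr_start towards lr_max
        return (lr_max - lr_start) / rampup_epochs * epoch + lr_start
    if epoch < rampup_epochs + sustain_epochs:
        # Optional plateau at lr_max (empty here since sustain_epochs == 0)
        return lr_max
    # Exponential decay from lr_max towards lr_min
    steps = epoch - rampup_epochs - sustain_epochs
    return (lr_max - lr_min) * decay ** steps + lr_min


for epoch in (0, 1, 5, 6, 29):
    print(epoch, f"{head_lr(epoch):.2e}")
```

The rate starts at 1e-5, reaches its peak of 1e-4 at epoch 5 (the first epoch of the decay branch), drops to 8.2e-5 at epoch 6 and then keeps shrinking towards 1e-5.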

In [None]:
# Generate learning rates for each epoch
learning_rates = [custom_head_lr_scheduler(epoch) for epoch in range(0, 30)]

# Plot the learning rate schedule
plt.figure(figsize=(10, 6))
plt.plot(range(0, 30), learning_rates, marker='o')
plt.title('Custom Learning Rate Schedule - Head')
plt.xlabel('Epoch')
plt.ylabel('Learning Rate')
plt.grid(True)

plt.show()

### Classifier

The classifier is made of one hidden layer plus the final layer and is trained from scratch (no pretraining).
A warmup/cooldown strategy is not necessary for non-pretrained weights, so a simple decay of the learning rate over the epochs is used.
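
Written out, the decay used for the classifier is the following, where $e$ is the epoch index and $d$ the decay factor (the symbols are introduced here for illustration only):

$$
\mathrm{lr}(e) = (\mathrm{lr}_{\max} - \mathrm{lr}_{\min})\, d^{\,e} + \mathrm{lr}_{\min}
$$

With the values set in the next cell ($\mathrm{lr}_{\max} = 10^{-4}$, $\mathrm{lr}_{\min} = 10^{-6}$, $d = 0.8$), the rate starts at $10^{-4}$ at epoch 0 and approaches $10^{-6}$ as $e$ grows.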

In [None]:
# same for the classifier, low lr because of small batch size
clr_lr_max = 1e-4
clr_lr_min = 1e-6
clr_lr_decay = 0.8

def custom_clr_lr_scheduler(epoch):
    lr = (clr_lr_max - clr_lr_min) * clr_lr_decay**(epoch) + clr_lr_min
    return lr

In [None]:
# Generate learning rates for each epoch
learning_rates = [custom_clr_lr_scheduler(epoch) for epoch in range(0, 30)]

# Plot the learning rate schedule
plt.figure(figsize=(10, 6))
plt.plot(range(0, 30), learning_rates, marker='o')
plt.title('Custom Learning Rate Schedule - Classifier')
plt.xlabel('Epoch')
plt.ylabel('Learning Rate')
plt.grid(True)

plt.show()

## Early stopping

From this [Stack Overflow topic](https://stackoverflow.com/questions/71998978/early-stopping-in-pytorch).

In [None]:
# Early stopping class based on the trend of a given loss
class EarlyStopper:
    def __init__(self, patience=1, min_delta=0):
        self.patience = patience
        self.min_delta = min_delta
        self.counter = 0
        self.min_validation_loss = float('inf')
        self.early_stop = False

    def step(self, validation_loss):
        if validation_loss < self.min_validation_loss:
            self.min_validation_loss = validation_loss
            self.counter = 0
        elif validation_loss > (self.min_validation_loss + self.min_delta):
            self.counter += 1
            if self.counter >= self.patience:
                self.early_stop = True
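
To see how `patience` and `min_delta` interact, here is a small usage sketch on a made-up sequence of validation losses (the numbers are for illustration only, not results from this notebook). The class is repeated so the snippet runs on its own.

```python
class EarlyStopper:
    """Same logic as the class above, repeated so the snippet is self-contained."""

    def __init__(self, patience=1, min_delta=0):
        self.patience = patience
        self.min_delta = min_delta
        self.counter = 0
        self.min_validation_loss = float('inf')
        self.early_stop = False

    def step(self, validation_loss):
        if validation_loss < self.min_validation_loss:
            # New best loss: remember it and reset the counter
            self.min_validation_loss = validation_loss
            self.counter = 0
        elif validation_loss > (self.min_validation_loss + self.min_delta):
            # Clearly worse than the best loss: count one bad epoch
            self.counter += 1
            if self.counter >= self.patience:
                self.early_stop = True


# Made-up validation losses, for illustration only
losses = [1.00, 0.90, 0.95, 0.96, 0.80]
stopper = EarlyStopper(patience=2, min_delta=0.0)

stopped_at = None
for epoch, loss in enumerate(losses):
    stopper.step(loss)
    if stopper.early_stop:
        stopped_at = epoch
        break

print(stopped_at)  # 3
```

Training stops at epoch 3, after two consecutive epochs worse than the best loss (0.90), so the later improvement to 0.80 is never seen. Note that a loss between the best value and the best value plus `min_delta` neither resets nor increments the counter.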

## Training function with early stopping

Since the dataloaders change from one fold to another during training, I had to pass the dataset as an argument to the training function and switch its `transform` attribute dynamically.

In [None]:
def es_training(model, optimizer, criterion, dataset, train_loader, val_loader, epochs, patience, min_delta, scheluder = None, device='cpu'):
    early_stopper = EarlyStopper(patience, min_delta)
    model.to(device)

    history = {
        'train_loss': [],
        'train_accuracy': [],
        'val_loss': [],
        'val_accuracy': []
    }

    for epoch in range(0, epochs):        
        model.train()
        # change the data augmentation to match the training stage
        dataset.transform = train_transform
        train_loss = 0.0
        train_correct = 0
        train_total = 0
        
        for _, labels, inputs in train_loader:
            inputs, labels = inputs.to(device), labels.to(device)
            # Zero the parameter gradients
            optimizer.zero_grad()
            # Forward pass
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            # Backward pass and optimize
            loss.backward()
            optimizer.step()
            # Calculate statistics            
            _, predicted = torch.max(outputs.data, 1)
            _, y_class = torch.max(labels.data, 1)
            train_loss += loss.item()
            train_total += labels.size(0)
            train_correct += (predicted == y_class).sum().item()

        # Update lr (use the `scheluder` argument rather than a global `scheduler`)
        if scheluder is not None:
            scheluder.step()
            
        # Calculate average training loss and accuracy for the epoch
        train_epoch_loss = train_loss / len(train_loader)
        train_epoch_accuracy = train_correct / train_total

        model.eval()
        # change the data augmentation to match the evaluating stage
        dataset.transform = val_transform
        val_loss = 0.0
        val_correct = 0
        val_total = 0

        with torch.no_grad():
            for _, labels, inputs in val_loader:
                inputs, labels = inputs.to(device), labels.to(device)
                # Forward pass
                outputs = model(inputs)
                loss = criterion(outputs, labels)
                # Calculate statistics
                val_loss += loss.item()
                _, predicted = torch.max(outputs.data, 1)
                _, y_class = torch.max(labels.data, 1)
                val_total += labels.size(0)
                val_correct += (predicted == y_class).sum().item()

        # Calculate average validation loss and accuracy for the epoch
        val_epoch_loss = val_loss / len(val_loader)
        val_epoch_accuracy = val_correct / val_total

        # update history
        history['train_loss'].append(train_epoch_loss)
        history['train_accuracy'].append(train_epoch_accuracy)
        history['val_loss'].append(val_epoch_loss)
        history['val_accuracy'].append(val_epoch_accuracy)

        print(
            f'Epoch {epoch} completed',
            f'training loss = {train_epoch_loss}',
            f'training accuracy = {train_epoch_accuracy}',
            f'val loss = {val_epoch_loss}',
            f'val accuracy = {val_epoch_accuracy}',
            sep=' | '
        )

        # check for early stop
        early_stopper.step(val_epoch_loss)
        if early_stopper.early_stop:
            print('-'*5 + 'Early stop triggered --> stopping training' + '-'*5)
            break

    return history

## Defining K-fold variables

In [None]:
def get_model():
    model = efficientnet_v2_l(weights='DEFAULT')
    
    # Unfreeze all layers
    for param in model.parameters():
        param.requires_grad = True
    
    model.classifier = get_custom_classifier(hidden_layers, dropout_layers, bn_layers)
        
    return model

In [None]:
random_state = 3210
k_folds = 5
kfold = KFold(n_splits=k_folds, shuffle=True, random_state=random_state)
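
For readers new to K-fold, the sketch below shows the idea with the standard library only: the sample indices are shuffled once, cut into `k` disjoint validation folds, and each model trains on the remaining indices. The function `kfold_indices` is a hypothetical stand-in, not scikit-learn's `KFold`, so the exact indices it produces will differ from the ones used in this notebook.

```python
import random


def kfold_indices(n_samples, k, seed):
    """Yield (train_ids, val_ids) pairs; a stdlib stand-in, not sklearn's KFold."""
    ids = list(range(n_samples))
    random.Random(seed).shuffle(ids)
    # The first n_samples % k folds receive one extra sample
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        val_ids = ids[start:start + size]
        train_ids = ids[:start] + ids[start + size:]
        start += size
        yield train_ids, val_ids


folds = list(kfold_indices(n_samples=11, k=5, seed=3210))
for fold, (train_ids, val_ids) in enumerate(folds):
    print(fold, len(train_ids), len(val_ids))
```

Every sample ends up in exactly one validation fold, so each of the `k` models is validated on data it never trained on.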

In [None]:
k_models = [get_model() for _ in range(0, k_folds)]

In [None]:
# the training and validation dataset are no longer needed
del train_data, val_data
gc.collect()

In [None]:
# This simplifies loading the entire dataset (train + val) as a single subset.
# /kaggle/input is read-only, so the merged copy is written to /kaggle/working.
# Assumes the train and val file names do not collide.
MERGED_DATASET_PATH = "/kaggle/working/tfrecords-jpeg-512x512"

!mkdir -p /kaggle/working/tfrecords-jpeg-512x512/train+val
!cp /kaggle/input/tpu-getting-started/tfrecords-jpeg-512x512/train/* /kaggle/working/tfrecords-jpeg-512x512/train+val/
!cp /kaggle/input/tpu-getting-started/tfrecords-jpeg-512x512/val/* /kaggle/working/tfrecords-jpeg-512x512/train+val/

In [None]:
dataset = FlowerDataset(MERGED_DATASET_PATH, 'train+val')

## K-fold training

Same problem as with the hyperparameter search: when running the code below, the kernel and the session shut down automatically after some time, which stops the training.

Here is how I managed it:

1. Convert the notebook to a .py file using this command line: `jupyter nbconvert --to script petals_to_the_metals.ipynb`
2. Rename the file: `mv ./petals_to_the_metals.py ./k-fold_training.py`
3. Strip the code down to just run the training, save the result to the `k-fold_training_histories.csv` file, and add logging to track the process
4. Start the training with this command line: `nohup python k-fold_training.py > k-fold_training.log 2>&1 &`, which starts a background process that keeps running after the JupyterLab session closes

In [None]:
# Early stop variables
es_patience = 5
es_min_delta = 0.02

In [None]:
criterion = nn.CrossEntropyLoss()
epochs = 50
histories = {}

The cell below has been converted to raw format to avoid accidentally running the training at each restart of the notebook.

In [None]:
# for fold, (train_ids, val_ids) in enumerate(kfold.split(dataset)):
#     print(f'###### FOLD {fold} ######')
#     # Sample elements randomly from a given list of ids, no replacement.
#     train_subsampler = torch.utils.data.SubsetRandomSampler(train_ids)
#     val_subsampler = torch.utils.data.SubsetRandomSampler(val_ids)

#     # Define data loaders for training and testing data in this fold
#     train_loader = torch.utils.data.DataLoader(dataset, batch_size=BATCH_SIZE, num_workers=N_WORKERS, sampler=train_subsampler)
#     val_loader = torch.utils.data.DataLoader(dataset, batch_size=BATCH_SIZE, num_workers=N_WORKERS, sampler=val_subsampler)

#     model = k_models[fold]
#     optimizer = Adam([
#                         {'params': model.features.parameters()},
#                         {'params': model.classifier.parameters()}
#                     ], lr=1)
    
#     scheduler = LambdaLR(optimizer, lr_lambda=[custom_head_lr_scheduler, custom_clr_lr_scheduler])
    
#     history = es_training(
#                     model, 
#                     optimizer, 
#                     criterion,
#                     dataset,
#                     train_loader, 
#                     val_loader, 
#                     epochs, 
#                     patience=es_patience, 
#                     min_delta=es_min_delta, 
#                     scheluder = scheduler, 
#                     device=DEVICE
#                 )
    
#     histories[f'model_f-{fold}_train_accuracy'] = history['train_accuracy']
#     histories[f'model_f-{fold}_val_accuracy'] = history['val_accuracy']
#     histories[f'model_f-{fold}_train_loss'] = history['train_loss']
#     histories[f'model_f-{fold}_val_loss'] = history['val_loss']
    
#     save_path = f'./models/model_f-{fold}.pth'
#     torch.save(model.state_dict(), save_path)

#     # free GPU memory
#     model.cpu()
#     del scheduler, optimizer, model
#     gc.collect()
#     torch.cuda.empty_cache()

# # Convert dict to dataframe with all the key not having the same size :
# # https://stackoverflow.com/questions/38446457/filling-dict-with-na-values-to-allow-conversion-to-pandas-dataframe
# histories = pd.DataFrame.from_dict(histories, orient='index').T
# histories.to_csv('k-fold_training_histories.csv', index=False)
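
A note on `lr=1` in the optimizer above: `LambdaLR` multiplies the initial learning rate of each parameter group by the value returned by the matching function in `lr_lambda`. With an initial rate of 1, the two scheduler functions therefore directly give the learning rate of their group. The plain-Python sketch below only mimics that multiplication; `effective_lrs` and the two toy schedules are hypothetical, and this is not PyTorch code.

```python
def effective_lrs(base_lrs, lr_lambdas, epoch):
    """Mimic LambdaLR: one multiplier function per parameter group."""
    return [base * fn(epoch) for base, fn in zip(base_lrs, lr_lambdas)]


# Two toy schedules standing in for the feature-extractor and classifier ones
def features_schedule(epoch):
    return 1e-5 * (epoch + 1)


def classifier_schedule(epoch):
    return 1e-4 * 0.8 ** epoch


schedules = [features_schedule, classifier_schedule]

# With a base rate of 1, the schedule values are used as-is.
print(effective_lrs([1, 1], schedules, epoch=0))

# With a base rate of 1e-3, the same schedules would be scaled down 1000 times.
print(effective_lrs([1e-3, 1e-3], schedules, epoch=0))
```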

In [None]:
histories = pd.read_csv('/kaggle/input/petals-to-the-metals-model/k-fold_training_histories.csv')

## Models metrics

In [None]:
for i in range(0, k_folds):
    print(f"####### MODEL FOLD {i} #######")
    train_acc = histories[f'model_f-{i}_train_accuracy']
    val_acc = histories[f'model_f-{i}_val_accuracy']
    train_loss = histories[f'model_f-{i}_train_loss']
    val_loss = histories[f'model_f-{i}_val_loss']

    fig, ax = plt.subplots(1, 2)
    fig.set_size_inches(15, 4)

    # Plotting loss
    ax[0].set_ylim([0, 5])
    ax[0].plot(train_loss, label='Train Loss')
    ax[0].plot(val_loss, label='Val Loss')
    ax[0].set_ylabel('Loss')
    ax[0].set_xlabel('Epoch')
    ax[0].set_title('Loss')
    ax[0].legend()

    # Plotting accuracy
    ax[1].set_ylim([0, 1])
    ax[1].plot(train_acc, label='Train Accuracy')
    ax[1].plot(val_acc, label='Val Accuracy')
    ax[1].set_ylabel('Accuracy')
    ax[1].set_xlabel('Epoch')
    ax[1].set_title('Accuracy')
    ax[1].legend()

    plt.show()

The models converge very fast (~3 epochs) and reach an accuracy of ~0.95 across all the folds. Could a better learning rate strategy and better data augmentation help the models converge more slowly but generalize better?

# Submission

Loading the test dataset and dataloader

In [None]:
# free memory as `dataset` is no longer needed
del dataset
gc.collect()

In [None]:
test_data = FlowerDataset(DATASET_PATH, 'test', transform=val_transform)
# The batch size can be raised drastically since the model is not trained here, so more GPU memory is available.
test_dataloader = DataLoader(test_data, batch_size=128, num_workers=N_WORKERS)

Loading the previously trained weights

In [None]:
for i in range(0, k_folds):
    path = f'/kaggle/input/petals-to-the-metals-model/model_f-{i}.pth'
    # map_location lets the weights load even when no GPU is available
    k_models[i].load_state_dict(torch.load(path, map_location='cpu'))

In [None]:
ids = []
preds = []

# Make predictions with each model and average them
with torch.no_grad():
    for sample_id, _, inputs in test_dataloader:
        inputs = inputs.to(DEVICE)
        ids.extend(sample_id)
        mean_logps = []

        for model in k_models:
            model.to(DEVICE)
            model.eval()
            outputs = model(inputs)
            mean_logps.append(outputs)
            # free GPU memory
            model.cpu()
            del model
            gc.collect()
        
        mean_logp = torch.mean(torch.stack(mean_logps), dim=0)
        preds.extend(torch.argmax(mean_logp, dim=1).tolist())
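
The averaging step can be illustrated on a single sample without any framework. In the sketch below, `ensemble_predict` is a hypothetical helper and the scores are made up: each model outputs one score per class, the scores are averaged class by class, and the class with the highest mean is predicted.

```python
def ensemble_predict(per_model_scores):
    """Average per-class scores over models, then pick the best class.

    per_model_scores[m][c] is the score model m gives to class c
    for a single sample.
    """
    n_models = len(per_model_scores)
    n_classes = len(per_model_scores[0])
    mean_scores = [
        sum(scores[c] for scores in per_model_scores) / n_models
        for c in range(n_classes)
    ]
    best_class = max(range(n_classes), key=lambda c: mean_scores[c])
    return best_class, mean_scores


# Made-up scores for one sample, three models and three classes
scores = [
    [0.6, 0.3, 0.1],   # model 0 favours class 0
    [0.2, 0.5, 0.3],   # model 1 favours class 1
    [0.1, 0.6, 0.3],   # model 2 favours class 1
]
label, mean_scores = ensemble_predict(scores)
print(label)  # 1
```

Averaging scores rather than taking a majority vote lets a very confident model weigh more than a hesitant one.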

In [None]:
submission = pd.DataFrame({'id': ids, 'label': preds})
submission.head()

In [None]:
submission.to_csv('/kaggle/working/submission.csv', index=False)

# Final accuracy

The final accuracy is 0.967:

![Final Accuracy](./final_acc.png)