<img src="https://i.imgur.com/xRy6VZ2.png">

<center><h1>- Full Training Pipeline - Step by Step Guide with Explanation -</h1></center>

> 🐳 **Competition Goal:** Identify and group all images that contain the same individual through time.

> 🙏 **Inspiration**: A super huge thank you to [Debarshi Chanda](https://www.kaggle.com/debarshichanda) and his very [clean comprehensive notebook](https://www.kaggle.com/debarshichanda/pytorch-arcface-gem-pooling-starter) that helped me get started with this competition.

### ⬇ Libraries

🐳 **What is `timm`?** - It is a library that gathers all **PyTorch Image Models**, for ease of access and convenience. The [full documentation can be found here](https://rwightman.github.io/pytorch-image-models/models/).
* **Feature Extraction** - All of the [models in timm](https://rwightman.github.io/pytorch-image-models/feature_extraction/) have consistent mechanisms for obtaining various types of features from the model for tasks besides classification. This will help us when extracting the embeddings from the pretrained algorithms (aka the backbone).
* **Models** - I am also leaving here a [link](https://rwightman.github.io/pytorch-image-models/results/) to all the models and *their names* that are currently available within the library.

In [None]:
# Helpful Installs
!pip install timm

In [None]:
# Libraries
import os
import gc
import wandb
import time
import random
import math
from scipy import spatial
from tqdm import tqdm
import warnings
import cv2
import pandas as pd
import numpy as np
from numpy import dot, sqrt
import seaborn as sns
import matplotlib as mpl
import matplotlib.patches as patches
import matplotlib.pyplot as plt
from IPython.display import display_html

from sklearn.model_selection import StratifiedKFold
import torch
from torch.utils.data import DataLoader, Dataset
from torch.optim import Adam, lr_scheduler
import torch.nn as nn
import torch.nn.functional as F
from torch.cuda.amp import GradScaler, autocast

from albumentations.pytorch import transforms
import albumentations
import timm

from cuml.neighbors import NearestNeighbors
from sklearn.preprocessing import normalize

# Environment check
warnings.filterwarnings("ignore")
os.environ["WANDB_SILENT"] = "true"
CONFIG = {'competition': 'happywhale', '_wandb_kernel': 'aot'}

# Custom colors
class clr:
    S = '\033[1m' + '\033[96m'
    E = '\033[0m'
    
my_colors = ["#21295C", "#1F4E78", "#1C7293", "#73ABAF", "#C9E4CA", "#87BBA2", "#618E83", "#3B6064"]
print(clr.S+"Notebook Color Scheme:"+clr.E)
sns.palplot(sns.color_palette(my_colors))
plt.show()

### 🐝 W&B Fork & Run

In order to run this notebook you will need to input your own **secret API key** within the `! wandb login $secret_value_0` line. 

🐝**How do you get your own API key?**

Super simple! Go to **https://wandb.ai/site** -> Login -> Click on your profile in the top right corner -> Settings -> Scroll down to API keys -> copy your very own key (for more info check [this amazing notebook for ML Experiment Tracking on Kaggle](https://www.kaggle.com/ayuraj/experiment-tracking-with-weights-and-biases)).

<center><img src="https://i.imgur.com/fFccmoS.png" width=500></center>

In [None]:
# 🐝 Secrets
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()
secret_value_0 = user_secrets.get_secret("wandb")

! wandb login $secret_value_0

### ⬇ Helper Functions

In [None]:
def plot_loss_graph(train_losses, valid_losses, epoch, fold):
    '''Lineplot of the training/validation losses.'''
    
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 2.5))
    fig.suptitle(f"Fold {fold} | Epoch {epoch}", fontsize=12, y=1.05)
    axes = [ax1, ax2]
    data = [train_losses, valid_losses]
    sns.lineplot(y=train_losses, x=range(len(train_losses)),
                 lw=2.3, ls=":", color=my_colors[1], ax=ax1)
    sns.lineplot(y=valid_losses, x=range(len(valid_losses)),
                 lw=2.3, ls="-", color=my_colors[6], ax=ax2)
    for ax, t, d in zip(axes, ["Train", "Valid"], data):
        ax.set_title(f"{t} Evolution", size=12, weight='bold')
        ax.set_xlabel("Iteration", weight='bold', size=9)
        ax.set_ylabel("Loss", weight='bold', size=9)
        ax.tick_params(labelsize=9)
    plt.show()
    
    
def show_values_on_bars(axs, h_v="v", space=0.4):
    '''Plots the value at the end of each bar of a seaborn barplot.
    axs: the ax of the plot
    h_v: whether the barplot is vertical ("v") or horizontal ("h")'''
    
    def _show_on_single_plot(ax):
        if h_v == "v":
            for p in ax.patches:
                _x = p.get_x() + p.get_width() / 2
                _y = p.get_y() + p.get_height()
                value = int(p.get_height())
                ax.text(_x, _y, format(value, ','), ha="center") 
        elif h_v == "h":
            for p in ax.patches:
                _x = p.get_x() + p.get_width() + float(space)
                _y = p.get_y() + p.get_height()
                value = int(p.get_width())
                ax.text(_x, _y, format(value, ','), ha="left")

    if isinstance(axs, np.ndarray):
        for idx, ax in np.ndenumerate(axs):
            _show_on_single_plot(ax)
    else:
        _show_on_single_plot(axs)


# === 🐝 W&B ===
def save_dataset_artifact(run_name, artifact_name, path):
    '''Saves dataset to W&B Artifactory.
    run_name: name of the experiment
    artifact_name: under what name should the dataset be stored
    path: path to the dataset'''
    
    run = wandb.init(project='happywhale', 
                     name=run_name, 
                     config=CONFIG)
    artifact = wandb.Artifact(name=artifact_name, 
                              type='dataset')
    artifact.add_file(path)

    wandb.log_artifact(artifact)
    wandb.finish()
    print("Artifact has been saved successfully.")
    
    
def create_wandb_plot(x_data=None, y_data=None, x_name=None, y_name=None, title=None, log=None, plot="line"):
    '''Create and save lineplot/barplot in W&B Environment.
    x_data & y_data: Pandas Series containing x & y data
    x_name & y_name: strings containing axis names
    title: title of the graph
    log: string containing name of log'''
    
    data = [[label, val] for (label, val) in zip(x_data, y_data)]
    table = wandb.Table(data=data, columns = [x_name, y_name])
    
    if plot == "line":
        wandb.log({log : wandb.plot.line(table, x_name, y_name, title=title)})
    elif plot == "bar":
        wandb.log({log : wandb.plot.bar(table, x_name, y_name, title=title)})
    elif plot == "scatter":
        wandb.log({log : wandb.plot.scatter(table, x_name, y_name, title=title)})
        
        
def create_wandb_hist(x_data=None, x_name=None, title=None, log=None):
    '''Create and save histogram in W&B Environment.
    x_data: Pandas Series containing x values
    x_name: strings containing axis name
    title: title of the graph
    log: string containing name of log'''
    
    data = [[x] for x in x_data]
    table = wandb.Table(data=data, columns=[x_name])
    wandb.log({log : wandb.plot.histogram(table, x_name, title=title)})

### 🌱 Setting the Seed & Device

In [None]:
def set_seed(seed = 1234):
    '''
    🌱src:https://www.kaggle.com/andradaolteanu/melanoma-competiton-aug-resnet-effnet-lb-0-91
    Sets the seed of the entire notebook so results are the same every time we run.
    This is for REPRODUCIBILITY.'''
    np.random.seed(seed)
    random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    # When running on the CuDNN backend, two further options must be set
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    # Set a fixed value for the hash seed
    os.environ['PYTHONHASHSEED'] = str(seed)
    
set_seed()

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(clr.S+'Device available now:'+clr.E, device)

# 1. Data & Parameters

🐳 **The Data:** I am using my preprocessed `.csv` file for training and test data (for more info on how I made them [check out this notebook](https://www.kaggle.com/andradaolteanu/whales-dolphins-effnet-embedding-cos-distance)). I am also using the [images dataset](https://www.kaggle.com/phalanx/whale2-cropped-dataset) that has been resized and cropped by [phalanx](https://www.kaggle.com/phalanx) using the Detic methodology ([Discussion and Explanation here](https://www.kaggle.com/c/happy-whale-and-dolphin/discussion/305503))

In [None]:
# --------- INITIAL PARAMETERS ---------
TRAIN_FOLDER = "../input/whale2-cropped-dataset/cropped_train_images/cropped_train_images/"
TEST_FOLDER = "../input/whale2-cropped-dataset/cropped_test_images/cropped_test_images/"

# Set some parameters for sanity checks & experimenting
N_SPLITS = 5
BATCH_SIZE = 16
MODEL_NAME = 'efficientnet_b0'
NUM_CLASSES = 15587
NO_NEURONS = 250
EMBEDDING_SIZE = 128
# -------------------------------------

In [None]:
# Import the data
train = pd.read_csv("../input/happywhale-2022/train.csv")
test = pd.read_csv("../input/happywhale-2022/test.csv")

# Update path to new image folders
train["path"] = TRAIN_FOLDER + train["image"]
test["path"] = TEST_FOLDER + test["image"]

print(clr.S+"TRAIN:"+clr.E)
display_html(train.head())
print("\n", clr.S+"TEST:"+clr.E)
display_html(test.head())

# 2. PyTorch Dataset

In [None]:
class HappyWhaleDataset(Dataset):
    
    def __init__(self, csv, trainFlag):
        '''Module to create the PyTorch Dataset.
        csv: full dataframe (train or test)
        trainFlag: True if csv is a training/validation dataset, False otherwise
        return: image and class target if trainFlag, otherwise only image'''
        
        self.csv = csv
        self.trainFlag = trainFlag
        if self.trainFlag:
            self.transform = albumentations.Compose([
                albumentations.Resize(128, 128),
                albumentations.HorizontalFlip(),
                albumentations.VerticalFlip(),
                albumentations.Rotate(),
                albumentations.Normalize(),
                # B&W?
            ])
        else:
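            # Resize here too, otherwise images of different sizes cannot be batched
            self.transform = albumentations.Compose([
                albumentations.Resize(128, 128),
                albumentations.Normalize()
            ])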

            
    def __len__(self):
        return self.csv.shape[0]

    
    def __getitem__(self, index):
        # Get data
        row = self.csv.iloc[index]
        
        # Read and transform the image
        image = cv2.imread(row.path)
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
        
        transformed_img = self.transform(image=image)['image'].astype(np.float32)
        image = transformed_img.transpose(2, 0, 1)
        image = torch.tensor(image)            

        if self.trainFlag:
            # Retrieve the target group
            target = torch.tensor(row.individual_key)
            return image, target
        
        else:
            return image

### Dataset Example

🐳 In the below example we have:
* 12 images split into batches of 3 images => 4 batches in total
* The output per each batch:
    * a 4D tensor of 3 images, each with size 3 channels x 128 height x 128 width
    * a 1D tensor of size 3 -> one target class per each image
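
The shapes listed above follow from simple array bookkeeping. Here is a minimal NumPy sketch of it, where random arrays stand in for real images (`fake_images` is a hypothetical name and not part of this notebook):

```python
import math
import numpy as np

# Stand-ins for 12 decoded RGB images in OpenCV layout: height x width x channels
rng = np.random.default_rng(0)
fake_images = [rng.random((128, 128, 3), dtype=np.float32) for _ in range(12)]

batch_size = 3
n_batches = math.ceil(len(fake_images) / batch_size)

# PyTorch expects channels first, so each image is transposed HWC -> CHW
chw_images = [img.transpose(2, 0, 1) for img in fake_images]

# Stacking batch_size images yields one 4D batch: (batch, channels, height, width)
first_batch = np.stack(chw_images[:batch_size])

print("Number of batches:", n_batches)
print("Batch shape:", first_batch.shape)
```

This is the same `transpose(2, 0, 1)` step used inside `__getitem__`; the `DataLoader` takes care of the stacking.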

In [None]:
# Example for the Dataset data
example_dataset = HappyWhaleDataset(train.head(12), trainFlag=True)
example_loader = DataLoader(example_dataset, batch_size=3)

for k, (image, target) in enumerate(example_loader):
    print(clr.S+f"--- Batch {k} ---"+clr.E)
    print("Image Shape:", image.shape)
    print("Target:", target, "\n")

# 3. PyTorch Helper Functions

## 3.1 Generalized Mean (or GeM)

🐳 There is a great article [here](https://amaarora.github.io/2020/08/30/gempool.html) from Aman Arora that explains in detail the mathematics behind GeM Pooling. A notebook comparison between GeM Pooling and Average Pooling can also be found [here](https://github.com/amaarora/amaarora.github.io/blob/master/nbs/GeM%20Pooling.ipynb).

In short, an image (or a feature map produced by the backbone) has 3 dimensions: `K x H x W`, where:
* K: the number of channels
* H: the image height
* W: the image width

Let `Xk` be the **spatial feature map activation**, then the difference between *Max Pooling*, *Average Pooling* and *GeM Pooling* is the following:

<center><img src="https://i.imgur.com/HMaaKjD.png" width=700></center>

🐳 The pooling parameter `pk` can be *set* manually or *learned*: the operation is differentiable, so `pk` can be updated during back-propagation. In other words, *GeM Pooling* can also be trainable.
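
To get a feel for how `pk` moves GeM between average and max pooling, here is a small NumPy sketch. It is only an illustration under my own assumptions (a random 4-channel feature map, a fixed `p` instead of a learned one) and not the PyTorch module used in this notebook:

```python
import numpy as np

def gem_pool(feature_map, p=3.0, eps=1e-6):
    """GeM over the spatial axes of a (channels, height, width) array."""
    clamped = np.clip(feature_map, eps, None)   # keep values positive before the power
    return np.mean(clamped ** p, axis=(1, 2)) ** (1.0 / p)

rng = np.random.default_rng(0)
x = rng.random((4, 7, 7)) + 0.1   # hypothetical 4-channel, 7x7 feature map

avg_pool = x.mean(axis=(1, 2))
max_pool = x.max(axis=(1, 2))

# p = 1 reduces to average pooling
print(np.allclose(gem_pool(x, p=1.0), avg_pool))
# a large p gets close to max pooling (the printed gap is small)
print(np.abs(gem_pool(x, p=200.0) - max_pool).max())
```

With the default `p=3` the result sits between the two, giving more weight to strong activations than plain averaging does.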

In [None]:
# src: https://amaarora.github.io/2020/08/30/gempool.html

class GeM(nn.Module):
    def __init__(self, p=3, eps=1e-6):
        super(GeM,self).__init__()
        self.p = nn.Parameter(torch.ones(1)*p)
        self.eps = eps

    def forward(self, x):
        return self.gem(x, p=self.p, eps=self.eps)
        
    def gem(self, x, p=3, eps=1e-6):
        # Average-pool x^p over the full H x W feature map, then take the p-th root
        return F.avg_pool2d(x.clamp(min=eps).pow(p), (x.size(-2), x.size(-1))).pow(1./p)
        
    def __repr__(self):
        return self.__class__.__name__ + '(' + 'p=' + '{:.4f}'.format(self.p.data.tolist()[0]) + ', ' + 'eps=' + str(self.eps) + ')'

## 3.2 Additive Angular Margin Loss (or ArcFace)

🐳 **ArcFace**, or [Additive Angular Margin Loss](https://paperswithcode.com/method/arcface#:~:text=ArcFace%2C%20or%20Additive%20Angular%20Margin,traditionally%20used%20in%20these%20tasks.), *is a loss function* used in face recognition tasks.

The `softmax` loss is traditionally used in these tasks. However, it does not *explicitly optimise* the feature embedding to enforce **higher similarity for intra-class samples** and **diversity for inter-class samples**. In other words, we want the embeddings of the same individual to be VERY CLOSE to each other and the embeddings of different individuals to be VERY FAR from each other:

<center><img src="https://i.imgur.com/nTJkOUj.png" width=800></center>
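
A tiny numeric sketch of what the angular margin does to a single sample. The cosine values below are made up for illustration; only `s=30.0` and `m=0.50` come from this notebook:

```python
import math
import numpy as np

s, m = 30.0, 0.50  # scale and angular margin, the defaults used in this notebook

# Hypothetical cosine similarities between one embedding and 3 class centres;
# class 0 is the true class.
cosine = np.array([0.80, 0.30, -0.10])
target = 0

# ArcFace adds the margin m to the target angle: cos(theta + m)
theta = math.acos(cosine[target])
arc_logits = cosine.copy()
arc_logits[target] = math.cos(theta + m)

def cross_entropy(logits, target):
    """Softmax cross-entropy for a single sample."""
    shifted = logits - logits.max()            # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[target]

plain_loss = cross_entropy(s * cosine, target)
arc_loss = cross_entropy(s * arc_logits, target)

# The margin lowers the target logit, so the same sample yields a larger loss
print("softmax loss:", plain_loss)
print("arcface loss:", arc_loss)
```

Because the target logit is penalised, the model can only bring the loss down by pulling the embedding closer to its class centre, which is what produces the tight clusters in the figure above.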

In [None]:
# src: https://github.com/lyakaap/Landmark2019-1st-and-3rd-Place-Solution/blob/master/src/modeling/metric_learning.py

class ArcMarginProduct(nn.Module):
    def __init__(self, in_features, out_features, s=30.0, 
                 m=0.50, easy_margin=False, ls_eps=0.0):
        '''
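        in_features: dimension of the input (the embedding size)
        out_features: dimension of the last layer (in our case the number of classes)
        s: scale applied to the cosine logits
        m: angular margin (in radians)
        easy_margin: if True, apply the margin only when cos(theta) > 0
        ls_eps: label smoothing'''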
        
        super(ArcMarginProduct, self).__init__()
        self.in_features, self.out_features = in_features, out_features
        self.s = s
        self.m = m
        self.ls_eps = ls_eps
        self.weight = nn.Parameter(torch.FloatTensor(out_features, in_features))
        # Fills the input `Tensor` with values according to the method described in
        # `Understanding the difficulty of training deep feedforward neural networks`
        # Glorot, X. & Bengio, Y. (2010)
        # using a uniform distribution.
        nn.init.xavier_uniform_(self.weight)

        self.easy_margin = easy_margin
        self.cos_m, self.sin_m = math.cos(m), math.sin(m)
        self.th = math.cos(math.pi - m)
        self.mm = math.sin(math.pi - m) * m

    def forward(self, input, label):
        # --------------------------- cos(theta) & phi(theta) ---------------------
        cosine = F.linear(F.normalize(input), F.normalize(self.weight))
        # Clamp to avoid NaNs when rounding pushes cosine^2 slightly above 1
        sine = torch.sqrt((1.0 - torch.pow(cosine, 2)).clamp(0, 1))
        phi = cosine * self.cos_m - sine * self.sin_m
        if self.easy_margin:
            phi = torch.where(cosine > 0, phi, cosine)
        else:
            phi = torch.where(cosine > self.th, phi, cosine - self.mm)
        # --------------------------- convert label to one-hot ---------------------
        one_hot = torch.zeros(cosine.size(), device=cosine.device)
        one_hot.scatter_(1, label.view(-1, 1).long(), 1)
        if self.ls_eps > 0:
            one_hot = (1 - self.ls_eps) * one_hot + self.ls_eps / self.out_features
        # ------- use phi for the target class, plain cosine for all other classes -------
        output = (one_hot * phi) + ((1.0 - one_hot) * cosine)
        output *= self.s

        return output

# 4. The PyTorch Model

In [None]:
class HappyWhaleModel(nn.Module):
    def __init__(self, modelName, numClasses, noNeurons, embeddingSize):
        
        super(HappyWhaleModel, self).__init__()
        # Retrieve pretrained weights
        self.backbone = timm.create_model(modelName, pretrained=True)
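        # Save the number of features from the backbone
        ### different models have different numbers e.g. EffnetB3 has 1536
        backbone_features = self.backbone.classifier.in_features
        # Drop the ImageNet classification head: we only need the features
        self.backbone.classifier = nn.Identity()
        # Drop the default global pooling so the backbone returns the raw
        # [batch, features, H, W] map, which is pooled by GeM instead
        self.backbone.global_pool = nn.Identity()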
        self.gem = GeM()
        # Embedding layer (what we actually need)
        self.embedding = nn.Sequential(nn.Linear(backbone_features, noNeurons),
                                       nn.BatchNorm1d(noNeurons),
                                       nn.ReLU(),
                                       nn.Dropout(p=0.2),
                                       
                                       nn.Linear(noNeurons, embeddingSize),
                                       nn.BatchNorm1d(embeddingSize),
                                       nn.ReLU(),
                                       nn.Dropout(p=0.2))
        self.arcface = ArcMarginProduct(in_features=embeddingSize, 
                                        out_features=numClasses,
                                        s=30.0, m=0.50, easy_margin=False, ls_eps=0.0)
        
        
    def forward(self, image, target=None, prints=False):
        '''If there is a target it means that the model is training on the dataset.
        If there is no target, that means the model is predicting on the test dataset.
        In this case we would skip the ArcFace layer and return only the image embeddings.
        '''
        
        features = self.backbone(image)
        # flatten transforms from e.g.: [3, 1536, 1, 1] to [3, 1536]
        gem_pool = self.gem(features).flatten(1)
        embedding = self.embedding(gem_pool)
        if target is not None:
            out = self.arcface(embedding, target)
        
        if prints:
            print(clr.S+"0. IN:", "image shape:"+clr.E, image.shape, "target:", target)
            print(clr.S+"1. Backbone Output:"+clr.E, features.shape)
            print(clr.S+"2. GeM Pool Output:"+clr.E, gem_pool.shape)
            print(clr.S+"3. Embedding Output:"+clr.E, embedding.shape)
            if target is not None:
                print(clr.S+"4. ArcFace Output:"+clr.E, out.shape)
        
        if target is not None:
            return out, embedding
        else:
            return embedding

### Model Example

> 🐳 **Note**: Below you can find a simple schema on what the `HappyWhaleModel()` does:
<center><img src="https://i.imgur.com/1EXE1lR.png" width=900></center>

In [None]:
# Create an example model - Effnet
model_example = HappyWhaleModel(MODEL_NAME, NUM_CLASSES, NO_NEURONS, EMBEDDING_SIZE).to(device)

In [None]:
# Criterion
criterion_example = nn.CrossEntropyLoss()

# We'll use previous datasets & dataloader
for k, (image, target) in enumerate(example_loader):
    print(clr.S+f"=== Batch {k} ==="+clr.E)
    image, target = image.to(device), target.to(device)
    out, _ = model_example(image, target, prints=True)
    loss = criterion_example(out, target)
    print(clr.S+'--- LOSS ---'+clr.E, loss.item(), "\n")

In [None]:
del model_example
gc.collect()

# 5. Training Classifier Model

The problem at hand could be divided into 3 steps:
1. Train a `classification` model on the data
2. Extract the `embedding layer` right before the classification layer
3. Use these embeddings in order to group the individuals together

## 5.1 Training Prepping

## I. Data Loaders

In [None]:
def get_loaders(df, train_i, valid_i):
    '''
    df: the full initial dataframe
    train_i, valid_i: list of indexes for train and validation split
    VALID_PERC (global parameter): fraction of the valid data to keep - leave 1 for full dataset
    return: train_loader and valid_loader
    '''
    
    train_df = df.iloc[train_i, :]
    # To go quicker through validation
    valid_df = df.iloc[valid_i, :].sample(int(len(valid_i)*VALID_PERC), random_state=23)

    # Datasets & Dataloader
    train_dataset = HappyWhaleDataset(train_df, trainFlag=True)
    valid_dataset = HappyWhaleDataset(valid_df, trainFlag=True)

    train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
    valid_loader = DataLoader(valid_dataset, batch_size=BATCH_SIZE, shuffle=False)
    
    return train_loader, valid_loader

## II. Model, Optimizer, Criterion

🐳 **How to adjust the learning rate?**: `torch.optim.lr_scheduler` provides [several methods](https://pytorch.org/docs/stable/optim.html) to adjust the learning rate based on the number of epochs (or scheduler steps) completed so far.

Here is a full list of all PyTorch schedulers: https://pytorch.org/docs/stable/optim.html
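
As an illustration, the cosine annealing schedule used in this notebook can be written in closed form. The sketch below is plain Python rather than the PyTorch scheduler itself; the function name is my own, and the values are the `LR`, `MIN_LR` and `T_MAX` set in the training experiments section:

```python
import math

def cosine_annealing_lr(step, base_lr, min_lr, t_max):
    """Closed-form cosine annealing value after `step` scheduler steps."""
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * step / t_max))

base_lr, min_lr, t_max = 1e-4, 1e-6, 500   # LR, MIN_LR, T_MAX from this notebook

for step in (0, 6, 250, 500):
    print(step, cosine_annealing_lr(step, base_lr, min_lr, t_max))
```

Note that the learning rate only reaches `min_lr` after `t_max` scheduler steps. If the scheduler is stepped once per epoch and only a handful of epochs are run, the learning rate stays very close to `base_lr`.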

In [None]:
def get_model_optimizer_criterion():
    
    model = HappyWhaleModel(MODEL_NAME, NUM_CLASSES, NO_NEURONS, EMBEDDING_SIZE).to(device)
    optimizer = Adam(model.parameters(), lr=LR, weight_decay=WEIGHT_DECAY, amsgrad=False)
    scheduler = lr_scheduler.CosineAnnealingLR(optimizer, T_max=T_MAX, eta_min=MIN_LR)
    criterion = nn.CrossEntropyLoss()
    
    return model, optimizer, scheduler, criterion

## III. Training Function

> Below there is a complete step by step schema of what the `train_pipeline()` function does:
<center><img src="https://i.imgur.com/n44OvXa.png" width=900></center>

### ⬇️ Full Training Pipeline and 🐝W&B logging below

In [None]:
def train_pipeline(train):
    '''
    train: the full training dataframe (to be split in train data & valid data)
    '''

    s = time.time()
    
    # 🐝 W&B Initialize  
    RUN_CONFIG = CONFIG.copy()
    params = dict(model=MODEL_NAME, epochs=EPOCHS, split=N_SPLITS, 
                  batch=BATCH_SIZE, lr=LR, weight_decay=WEIGHT_DECAY,
                  t_max=T_MAX, min_lr=MIN_LR)
    RUN_CONFIG.update(params)
    run = wandb.init(project='happywhale', name=RUN_NAME, config=RUN_CONFIG)

    
    # === CV Split ===
    skf = StratifiedKFold(n_splits=N_SPLITS)
    skf_splits = skf.split(X=train, y=train["individual_key"])


    for fold, (train_i, valid_i) in enumerate(skf_splits):

        print("~"*25)
        print("~"*8, clr.S+f"FOLD {fold}"+clr.E, "~"*8)
        print("~"*25)

        # Retrieve data loaders
        train_loader, valid_loader = get_loaders(train, train_i, valid_i)

        # Model/ Optimizer/ Scheduler/ Criterion
        model, optimizer, scheduler, criterion = get_model_optimizer_criterion()
        # Hooks into the torch model to collect gradients and the topology
        wandb.watch(model, log_freq=100)

        # Run Training
        BEST_SCORE = 9999

        for epoch in range(EPOCHS):
            print("~"*8, clr.S+f"Epoch {epoch}"+clr.E, "~"*8)

            # === TRAIN ===
            model.train()
            train_losses = []

            for images, targets in tqdm(train_loader, desc = 'TRAIN'):
                images, targets = images.to(device), targets.to(device)

                # Clear gradients BEFORE prediction
                optimizer.zero_grad()
                # Make predictions
                out, _ = model(images, targets)
                # Compute Loss and Optimize
                loss = criterion(out, targets)             
                loss.backward()
                optimizer.step()

                train_losses.append(loss.cpu().detach().numpy().tolist())

            # Adjust Learning Rate
            scheduler.step()

            mean_train_loss = np.mean(train_losses)
            print(clr.S+"Mean Train Loss:"+clr.E, mean_train_loss)
            wandb.log({"mean_train_loss": float(mean_train_loss)}, step=epoch)


            # === EVAL ===
            model.eval()
            valid_losses, valid_preds, valid_targets = [], [], []
            with torch.no_grad():
                for images, targets in valid_loader:
                    valid_targets.append(targets)
                    images, targets = images.to(device), targets.to(device)

                    out, _ = model(images, targets)
                    loss = criterion(out, targets)

                    valid_preds.append(out)
                    valid_losses.append(loss.cpu().detach().numpy().tolist())

            mean_valid_loss = np.mean(valid_losses)
            print(clr.S+"Mean Valid Loss:"+clr.E, mean_valid_loss)
            wandb.log({"mean_valid_loss": float(mean_valid_loss)}, step=epoch)
            gc.collect()

            plot_loss_graph(train_losses, valid_losses, epoch, fold)
            create_wandb_plot(x_data=range(len(train_losses)), y_data=train_losses,
                      x_name="Iterations", y_name="Loss", title="Train Loss",
                      log="train_loss", plot="line")

            # === UPDATES ===

            if mean_valid_loss < BEST_SCORE:        
                print("! Saving model in fold {} | epoch {} ...".format(fold, epoch), "\n")
                torch.save(model.state_dict(), f"EffNetB0_fold_{fold}_loss_{round(mean_valid_loss, 3)}.pt")

                BEST_SCORE = mean_valid_loss

        # Clean memory before next fold
        del model, optimizer, scheduler, criterion, images, targets, \
                    train_losses, valid_losses, valid_preds, valid_targets
        torch.cuda.empty_cache()
        gc.collect()


    print(clr.S+f"Time to run: {round((time.time() - s)/60, 2)} minutes"+clr.E)
    wandb.finish()

## 5.2 Train Experiments

🐳 **A few things to keep in mind:**
* `NUM_CLASSES`: target labels need to start from 0 and have consecutive numbers, otherwise PyTorch will get upset
    * e.g.: bad labels example: [1, 2, 5, 10]
    * e.g.: good labels example: [0, 1, 2, 3, 4, ......100]
* `BATCH_SIZE`: if set too high the notebook might run out of memory. This also applies to `IMAGE_SIZE` and the model used (I tried EffNet B3 with 512x512 images and I couldn't run the training pipeline)
* `VALID_PERC`: this I use so the pipeline goes faster through the validation part (so the notebook commits faster); you can set it to `1` to run the full validation dataset.
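
If your labels are not consecutive yet, a small mapping fixes that. Below is a minimal sketch in plain Python; the function name and the sample IDs are hypothetical, and in this notebook the encoded column (`individual_key`) already comes from the preprocessed `.csv`:

```python
def encode_labels(raw_ids):
    """Map arbitrary identifiers to consecutive integers 0..n_classes-1."""
    unique_ids = sorted(set(raw_ids))
    id_to_key = {raw: key for key, raw in enumerate(unique_ids)}
    return [id_to_key[raw] for raw in raw_ids], id_to_key

raw_ids = ["whale_10", "whale_2", "whale_10", "whale_5"]   # hypothetical IDs
keys, mapping = encode_labels(raw_ids)

print(keys)           # consecutive integer labels, one per image
print(len(mapping))   # the value to use for NUM_CLASSES
```

Keep the mapping around: it is needed later to translate predicted class indexes back into the original individual IDs.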

In [None]:
# --------- GLOBAL PARAMETERS ---------
NUM_CLASSES = 15587
N_SPLITS = 3
BATCH_SIZE = 32
MODEL_NAME = 'efficientnet_b0'
RUN_NAME = "B0_neurons_200_embed_200_epochs_4"
EPOCHS = 6
VALID_PERC = 0.1
NO_NEURONS = 250
EMBEDDING_SIZE = 128
# -> Optimizer
LR = 0.0001
WEIGHT_DECAY = 0.000001
# -> Scheduler
T_MAX = 500              # Maximum number of scheduler steps (one step per epoch here)
MIN_LR = 0.000001        # Minimum learning rate. Default: 0
# ------------------------------------

In [None]:
train_pipeline(train)

# 6. Model Embeddings

Now that we have let the model see the images, we can take the **outputs** of the second to last layer (not the one that creates the classification output, but the one before that) and use them as the **image embeddings**.

🐳 **Image embeddings** are the *juice* of an image, the very core of its *information*. This layer has learned a compact description of each image and each individual. Moreover, as the model becomes more accurate, the embeddings become more precise too, not only for classifying the individuals, but also for recognizing the differences between them.

## 6.1 Retrieve the Embeddings

We will use `torch.load()` together with `load_state_dict()` to load into the model the weights & biases that were trained during the classification task.

In [None]:
# pretrained_name = "EffNetB0_fold_0_loss_14.979"
# pretrained_name = "EffNetB0_fold_1_loss_14.91"
pretrained_name = "EffNetB0_fold_2_loss_15.325"

# Path to trained model parameters (i.e. weights and biases)
classif_model_path = f"../input/happywhale-2022/{pretrained_name}.pt"

# Load the model and append learned params
model = HappyWhaleModel(MODEL_NAME, NUM_CLASSES, NO_NEURONS, EMBEDDING_SIZE).to(device)
model.load_state_dict(torch.load(classif_model_path, map_location=device))

In [None]:
# DataLoader
dataset = HappyWhaleDataset(train, trainFlag=True)
dataloader = DataLoader(dataset, batch_size=64, shuffle=False)

# Retrieve all embeddings for each image
all_embeddings = []

model.eval()
with torch.no_grad():
    for image, target in tqdm(dataloader): 
        image, target = image.to(device), target.to(device)
        _, embedding = model(image, target)
        embedding = embedding.detach().cpu().numpy()
        all_embeddings.append(embedding)
        
# Concatenate batches together
image_embeddings = np.concatenate(all_embeddings)

# Save embeddings and corresponding image
np.save(f'{pretrained_name}.npy', image_embeddings)

In [None]:
# 🐝Save embeddings to W&B
save_dataset_artifact(run_name=pretrained_name, 
                      artifact_name=pretrained_name, 
                      path=f"{pretrained_name}.npy")  # the file saved in the previous cell

# 7. RAPIDS Clustering

The last step is to create the clusters. `n_neighbors` will be set to 5, as this is the maximum number of predictions allowed per image ([as stated in competition guidelines](https://www.kaggle.com/c/happy-whale-and-dolphin/overview/evaluation)).
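
To make the neighbour search less of a black box, here is a brute-force CPU sketch in NumPy. It is a stand-in for illustration only, assuming the Euclidean metric; it is not how cuML implements the search, and `toy_embeddings` is random made-up data:

```python
import numpy as np

def brute_force_knn(embeddings, n_neighbors=5):
    """Return (distances, indices) of the n_neighbors closest rows for each row."""
    # Pairwise squared Euclidean distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    sq_norms = (embeddings ** 2).sum(axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * embeddings @ embeddings.T
    sq_dists = np.maximum(sq_dists, 0.0)          # guard against tiny negative values
    indices = np.argsort(sq_dists, axis=1)[:, :n_neighbors]
    distances = np.sqrt(np.take_along_axis(sq_dists, indices, axis=1))
    return distances, indices

rng = np.random.default_rng(0)
toy_embeddings = rng.normal(size=(20, 8))   # 20 hypothetical 8-dimensional embeddings
distances, indices = brute_force_knn(toy_embeddings)

print(distances.shape, indices.shape)   # one row of 5 neighbours per embedding
print(indices[:3, 0])                   # first neighbour of the first 3 points
```

When the same data is used for fitting and querying, each point comes back as its own nearest neighbour with a distance of (almost) zero, so only the remaining 4 neighbours carry new information.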

In [None]:
# === CLUSTERING ===
# Use the cuml function from RAPIDS suite
knn_model = NearestNeighbors(n_neighbors=5)
# Train the model
knn_model.fit(image_embeddings)

# Infer on the training data
# distances - the distance from each image to each of its 5 nearest neighbors
# indices - the row index of each of those neighbors
distances, indices = knn_model.kneighbors(image_embeddings)

In [None]:
# === PREDICTION ===
# Create the grouped predictions based on distances & indices
predictions = {"images": [], "embeddings": []}

for i in tqdm(range(len(image_embeddings))):
    index = np.where(distances[i, ] < 6.0)[0]
    split = indices[i, index]
    
    grouped_images = train.iloc[split]["image"].values
    grouped_embeddings = image_embeddings[split]

    predictions["images"].append(grouped_images)
    predictions["embeddings"].append(grouped_embeddings)

### Compute Cosine Similarity

Below you can see a few examples of **clustered train embeddings** and the cosine similarity between them.
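
As a quick reminder of what the numbers in the heatmaps mean, here is a vectorized NumPy sketch of a pairwise cosine similarity matrix on a toy example. The function name and the vectors are my own and only serve as an illustration:

```python
import numpy as np

def cosine_similarity_matrix(embeddings):
    """Pairwise cosine similarity between the rows of a 2D array."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)   # avoid division by zero
    return unit @ unit.T

toy = np.array([[1.0, 0.0],
                [2.0, 0.0],    # same direction as row 0
                [0.0, 3.0],    # orthogonal to row 0
                [-1.0, 0.0]])  # opposite to row 0
sim = cosine_similarity_matrix(toy)

# Similarity of row 0 with every row: 1 = same direction, 0 = orthogonal, -1 = opposite
print(np.round(sim[0], 3))
```

The length of the vectors does not matter, only their direction, so values close to 1 in the heatmaps mean the model sees the two images as nearly the same individual.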

In [None]:
def get_cosine_similarity(embeddings):
    '''Compute the pairwise cosine similarity between n embedding vectors.'''
    similarity_matrix = []
    
    for embed1 in embeddings:
        similarity_row = []
        for embed2 in embeddings:
            similarity_row.append(1 - spatial.distance.cosine(embed1, embed2))
        similarity_matrix.append(similarity_row)
    
    return np.array(similarity_matrix, dtype="float32")


def plot_heatmap(example_paths, cos_matrix, mask):
    '''Plots the cosine similarity heatmap.'''
    
    # Plots
    fig = plt.figure(figsize=(12, 12))
    ax1 = plt.subplot2grid(shape=(6, 6), loc=(5, 1), colspan=1)
    ax2 = plt.subplot2grid(shape=(6, 6), loc=(5, 2), colspan=1)
    ax3 = plt.subplot2grid(shape=(6, 6), loc=(5, 3), colspan=1)
    ax4 = plt.subplot2grid(shape=(6, 6), loc=(5, 4), colspan=1)
    ax5 = plt.subplot2grid(shape=(6, 6), loc=(5, 5), colspan=1)
    h_axes = [ax1, ax2, ax3, ax4, ax5]

    ax6 = plt.subplot2grid(shape=(6, 6), loc=(0, 0), colspan=1)
    ax7 = plt.subplot2grid(shape=(6, 6), loc=(1, 0), colspan=1)
    ax8 = plt.subplot2grid(shape=(6, 6), loc=(2, 0), colspan=1)
    ax9 = plt.subplot2grid(shape=(6, 6), loc=(3, 0), colspan=1)
    ax10 = plt.subplot2grid(shape=(6, 6), loc=(4, 0), colspan=1)
    v_axes = [ax6, ax7, ax8, ax9, ax10]

    ax11 = plt.subplot2grid(shape=(6, 6), loc=(0, 1), colspan=5, rowspan=5)

    fig.suptitle('- Cosine Similarity -', size = 21, color = my_colors[7], weight='bold')
    for k, ax in enumerate(h_axes):
        ax.imshow(plt.imread(example_paths[k]))
        ax.set_axis_off()

    for k, ax in enumerate(v_axes):
        ax.imshow(plt.imread(example_paths[k]))
        ax.set_axis_off()

    sns.heatmap(cos_matrix, ax=ax11, fmt=".5",
                cbar=False, annot=True, linewidths=0.5, mask=mask, square=True, cmap="winter_r")

    plt.tight_layout()
    plt.show();

In [None]:
# Select a clustered group
group = 0

example_paths = ["../input/whale2-cropped-dataset/cropped_train_images/cropped_train_images/"+img \
                     for img in predictions["images"][group]]
example_embeds = predictions["embeddings"][group]

# Compute similarity matrix
cos_matrix = get_cosine_similarity(example_embeds)
mask = np.zeros_like(cos_matrix)
mask[np.triu_indices_from(mask)] = True

plot_heatmap(example_paths, cos_matrix, mask)

In [None]:
# Select a clustered group
group = 24

example_paths = ["../input/whale2-cropped-dataset/cropped_train_images/cropped_train_images/"+img \
                     for img in predictions["images"][group]]
example_embeds = predictions["embeddings"][group]

# Compute similarity matrix
cos_matrix = get_cosine_similarity(example_embeds)
mask = np.zeros_like(cos_matrix)
mask[np.triu_indices_from(mask)] = True

plot_heatmap(example_paths, cos_matrix, mask)

In [None]:
# Select a clustered group
group = 100

example_paths = ["../input/whale2-cropped-dataset/cropped_train_images/cropped_train_images/"+img \
                     for img in predictions["images"][group]]
example_embeds = predictions["embeddings"][group]

# Compute similarity matrix
cos_matrix = get_cosine_similarity(example_embeds)
mask = np.zeros_like(cos_matrix)
mask[np.triu_indices_from(mask)] = True

plot_heatmap(example_paths, cos_matrix, mask)
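
The `mask` passed to `plot_heatmap` hides the upper triangle of the matrix, diagonal included, so each pair of images is shown only once. A tiny sketch with a hypothetical 3x3 matrix (`toy_matrix` is a stand-in, not real data) shows which cells end up hidden:

```python
import numpy as np

toy_matrix = np.ones((3, 3), dtype="float32")   # stand-in for cos_matrix

# Same two lines as in the cells above
mask = np.zeros_like(toy_matrix)
mask[np.triu_indices_from(mask)] = True

# 1 marks a hidden cell: the diagonal and everything above it
print(mask)
```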

# 8. Inference

🐳 Last step! Let's do the inference using the goodies we have created so far!

We will first set up the inference parameters. In this notebook they may differ from the ones used during training, depending on which model I am using for inference, so the inference parameters **must** match the parameters the checkpoint was trained with.

🐝 How do I remember what I used during model training? I check out the *metrics logged* into [my W&B Dashboard](https://wandb.ai/andrada/happywhale?workspace=user-andrada) for that specific experiment.
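
To see why the alignment matters: PyTorch refuses to load saved weights into a model whose layer shapes differ. The sketch below uses plain `nn.Linear` layers as hypothetical stand-ins for the embedding head (they are not the `HappyWhaleModel` from this notebook), with 128 and 64 playing the role of two different `EMBEDDING_SIZE` values.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins: a "trained" head with 128 outputs and two candidates to load into
trained_head = nn.Linear(16, 128)
right_head = nn.Linear(16, 128)
wrong_head = nn.Linear(16, 64)

state = trained_head.state_dict()

# Matching sizes load cleanly
right_head.load_state_dict(state)

# Mismatched sizes raise a RuntimeError instead of loading silently
try:
    wrong_head.load_state_dict(state)
    mismatch_detected = False
except RuntimeError:
    mismatch_detected = True

print("mismatch detected:", mismatch_detected)
```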

In [None]:
# --------- INFERENCE PARAMETERS ---------
PRETRAINED_NAME1 = "EffNetB0_fold_0_loss_14.979"
PRETRAINED_NAME2 = "EffNetB0_fold_1_loss_14.91"
PRETRAINED_NAME3 = "EffNetB0_fold_2_loss_15.325"
MODEL_NAME = 'efficientnet_b0'
NUM_CLASSES = 15587
NO_NEURONS = 250
EMBEDDING_SIZE = 128
# ----------------------------------------

### I. Get Test Embeddings

🐳 The first step in this process is to retrieve the embeddings. We do this by loading each trained checkpoint into a model named `final_model` and extracting the embeddings from the second-to-last layer. I am saving these into `test_embeddings`.

In [None]:
def retrieve_test_embeddings(PRETRAINED_NAME, MODEL_NAME, NUM_CLASSES, NO_NEURONS, EMBEDDING_SIZE):
    '''Get test embeddings using a pretrained model.'''
    
    # Path to [best] model
    classif_model_path = f"../input/happywhale-2022/{PRETRAINED_NAME}.pt"

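    # Load the model and restore the learned params
    final_model = HappyWhaleModel(MODEL_NAME, NUM_CLASSES, NO_NEURONS, EMBEDDING_SIZE).to(device)
    # map_location lets a checkpoint saved on GPU load onto the current device
    final_model.load_state_dict(torch.load(classif_model_path, map_location=device))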

    # DataLoader
    dataset = HappyWhaleDataset(test, trainFlag=False)
    dataloader = DataLoader(dataset, batch_size=64, shuffle=False)

    # Retrieve test embeddings from the model
    all_embeddings = []

    final_model.eval()
    with torch.no_grad():
        for image in tqdm(dataloader): 
            image = image.to(device)
            embedding = final_model(image, target=None)
            embedding = embedding.detach().cpu().numpy()
            all_embeddings.append(embedding)

    # Concatenate batches together
    test_embeddings = np.concatenate(all_embeddings)

    # Save embeddings
    np.save(f'{PRETRAINED_NAME}_test_.npy', test_embeddings)
    
    return test_embeddings

In [None]:
# ===== I. EMBEDDINGS  =====
# Here I am retrieving the 3 test embeddings and averaging them together
test_embeddings1 = retrieve_test_embeddings(PRETRAINED_NAME1, 
                                            MODEL_NAME, NUM_CLASSES, NO_NEURONS, EMBEDDING_SIZE)
test_embeddings2 = retrieve_test_embeddings(PRETRAINED_NAME2, 
                                            MODEL_NAME, NUM_CLASSES, NO_NEURONS, EMBEDDING_SIZE)
test_embeddings3 = retrieve_test_embeddings(PRETRAINED_NAME3, 
                                            MODEL_NAME, NUM_CLASSES, NO_NEURONS, EMBEDDING_SIZE)

test_embeddings = (test_embeddings1+test_embeddings2+test_embeddings3)/3
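
The blend above is a plain element-wise mean, which only works when the three arrays share the same shape and row order. Here is a small sketch with made-up arrays (`fold_embeddings`, `blended` and `blended_unit` are hypothetical names). The L2 re-normalisation at the end is an optional extra and is not applied in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Three hypothetical (n_images, embedding_size) arrays, one per fold model
fold_embeddings = [rng.normal(size=(4, 8)) for _ in range(3)]

# Element-wise mean, equivalent to (e1 + e2 + e3) / 3
blended = np.mean(np.stack(fold_embeddings), axis=0)
manual = (fold_embeddings[0] + fold_embeddings[1] + fold_embeddings[2]) / 3

# Optional: rescale each row to unit length after blending
blended_unit = blended / np.linalg.norm(blended, axis=1, keepdims=True)

print(blended.shape, np.allclose(blended, manual))
```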

### II. Creating Clusters

Now we can group the embeddings into clusters.

🐳 We first load the `train_embeddings` computed during training and fit a final `NearestNeighbors` model on them (an unsupervised neighbour search rather than a classifier). After that we can **query** it with our `test_embeddings` using this `knn_final_model` => the output is the **50 nearest Distances & their Index** for *each test embedding*.
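
To make the shape of that output concrete, here is a toy sketch with scikit-learn's `NearestNeighbors` on hand-made 2-D points. The arrays and the `whale_a` / `whale_b` labels are invented for illustration, and only 2 neighbours are requested instead of 50.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical 2-D "train embeddings" and the individual each one belongs to
toy_train = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
toy_ids = np.array(["whale_a", "whale_a", "whale_b", "whale_b"])

# Fit the neighbour index, then query it with "test embeddings"
knn = NearestNeighbors(n_neighbors=2)
knn.fit(toy_train)
toy_test = np.array([[0.0, 0.2], [5.0, 5.9]])
D_toy, I_toy = knn.kneighbors(toy_test)

# D_toy holds Euclidean distances (nearest first), I_toy the matching train rows
print(D_toy.shape, I_toy.shape)
print(toy_ids[I_toy])
```

Indexing the label array with `I_toy` turns row indices into individual ids, the same lookup that `train_individual_ids[I[k]]` performs later on.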

#### Embedding Blending

In [None]:
# ===== II. CLUSTERS  =====
# Get full train embeddings
# Here I am adding the 3 train embeddings and averaging them together as well
train_embeddings1 = np.load(f"../input/happywhale-2022/{PRETRAINED_NAME1}.npy")
train_embeddings2 = np.load(f"../input/happywhale-2022/{PRETRAINED_NAME2}.npy")
train_embeddings3 = np.load(f"../input/happywhale-2022/{PRETRAINED_NAME3}.npy")

train_embeddings = (train_embeddings1+train_embeddings2+train_embeddings3)/3

train_individual_ids = train["individual_id"].values
print(clr.S+"Train Embeddings:"+clr.E, train_embeddings.shape, "\n"+
      clr.S+"Train Individual Id:"+clr.E, train_individual_ids.shape, "\n")


# Fit a final nearest-neighbour model on the train embeddings
from sklearn.neighbors import NearestNeighbors

knn_final_model = NearestNeighbors(n_neighbors=50)
knn_final_model.fit(train_embeddings)

# Get distances & indexes for test
# test_embeddings = normalize(test_embeddings, axis=1, norm='l2')

D, I = knn_final_model.kneighbors(test_embeddings)
print(clr.S+"Distances shape:"+clr.E, D.shape, "\n"+
      clr.S+"Index shape:"+clr.E, I.shape)

# List of the test dataframe image ids (to loop through it)
test_images = test["image"].tolist()

In [None]:
test_df = []

# Loop through each observation within test data
for k, image_id in enumerate(tqdm(test_images)):
    # Get the neighbours' individual_ids & distances for this observation
    individual_id = train_individual_ids[I[k]]
    distances = D[k]
    # Create a df subset with this info (a dict keeps the distances as floats
    # instead of casting both columns to one shared non-numeric dtype)
    subset_preds = pd.DataFrame({'individual_id': individual_id,
                                 'distances': distances})
    subset_preds['image_id'] = image_id
    test_df.append(subset_preds)
    
    
# Concatenate subset dataframes into 1 dataframe
test_df = pd.concat(test_df).reset_index(drop=True)
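# Keep the smallest (closest) distance for each unique pair of individual_id & image_id,
# then order each image's candidates from nearest to farthest
test_df = test_df.groupby(['image_id','individual_id'])['distances'].min().reset_index()
test_df = test_df.sort_values(['image_id','distances']).reset_index(drop=True)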
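
A toy sketch of this reduction step, assuming that a smaller distance means a closer match, so the minimum per pair is the value worth keeping and candidates are best read from nearest to farthest. The table below is invented; `toy_df` and `best` are hypothetical names.

```python
import pandas as pd

# Hypothetical neighbour table: whale_a appears twice for img_1
toy_df = pd.DataFrame({
    "image_id": ["img_1", "img_1", "img_1", "img_2"],
    "individual_id": ["whale_a", "whale_a", "whale_b", "whale_b"],
    "distances": [0.9, 0.4, 0.6, 1.2],
})

# One row per (image, individual), keeping the closest neighbour
best = toy_df.groupby(["image_id", "individual_id"])["distances"].min().reset_index()
# Order each image's candidates from nearest to farthest
best = best.sort_values(["image_id", "distances"]).reset_index(drop=True)

print(best)
```

Without the explicit sort, `groupby` returns rows ordered by the group keys, so the first candidate per image would be the alphabetically first id rather than the nearest one.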

In [None]:
# Have a look at the predictions dataset now
test_df.sample(n=5, random_state=24)

### III. Final Prediction

🐳 The final step is to create a `predictions` dictionary that maps each `image_id` in the test dataset to a list of `individual_id` predictions. The competition guidelines allow at most 5 *guesses* per image, and one of these guesses can be `new_individual`, meaning that the subject within the image wasn't found within the training data.
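
The ranking logic for a single image can be sketched in plain Python. This is an illustration under stated assumptions, not the exact cell used in this notebook: candidates are assumed to arrive sorted nearest first, a distance above the threshold is assumed to hint at an unseen animal, and `top5_guesses` plus the `f1`..`f5` filler ids are hypothetical.

```python
def top5_guesses(candidates, thresh, fillers):
    """Build up to 5 guesses for one image (hypothetical helper).

    candidates: list of (individual_id, distance) tuples, nearest first.
    """
    best_id, best_dist = candidates[0]
    # A far-away best match suggests an unseen animal: rank "new_individual" first
    if best_dist > thresh:
        guesses = ["new_individual", best_id]
    else:
        guesses = [best_id, "new_individual"]
    # Append the remaining candidates, then the fillers, skipping duplicates
    for individual_id in [c[0] for c in candidates[1:]] + list(fillers):
        if len(guesses) == 5:
            break
        if individual_id not in guesses:
            guesses.append(individual_id)
    return guesses

fillers = ["f1", "f2", "f3", "f4", "f5"]
close = top5_guesses([("whale_a", 1.0), ("whale_b", 2.0)], thresh=5, fillers=fillers)
far = top5_guesses([("whale_a", 7.5), ("whale_b", 8.0)], thresh=5, fillers=fillers)
print(close)
print(far)
```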

In [None]:
# ===== III. PREDICTION  =====

# Dictionary in format: {image_id: [individual_id_1, individual_id_2, ...]}
predictions = {}
thresh = 5

for k, row in tqdm(test_df.iterrows()):
    image_id = row["image_id"]
    individual_id = row["individual_id"]
    distance = row["distances"]
    
    # If the image_id has already been added in predictions before
    if image_id in predictions:
        # If total preds for this image_id are < 5 then add, else continue
        if len(predictions[image_id]) < 5:
            predictions[image_id].append(individual_id)
        else:
            continue
    # If the distance is greater than thresh the match is weak:
    # guess "new_individual" first, then the prediction
    elif distance > thresh:
        predictions[image_id] = ["new_individual", individual_id]
    else:
        predictions[image_id] = [individual_id, "new_individual"]


# Fill in all lists that have less than 5 predictions as of yet
sample_list = ['37c7aba965a5', '114207cab555', 'a6e325d8e924', '19fbb960f07d','c995c043c353']

for image_id, preds in tqdm(predictions.items()):
    if len(preds) < 5:
        remaining = [individ_id for individ_id in sample_list if individ_id not in preds]
        preds.extend(remaining)
        predictions[image_id] = preds[:5]

In [None]:
# Create final submission
predictions = pd.Series(predictions).reset_index()
predictions.columns = ['image','predictions']
predictions['predictions'] = predictions['predictions'].apply(lambda x: ' '.join(x))
predictions.to_csv('submission.csv',index=False)

predictions.head()

<center><img src="https://i.imgur.com/0cx4xXI.png"></center>

### 🐝 W&B Dashboard

> My [W&B Dashboard](https://wandb.ai/andrada/happywhale?workspace=user-andrada).

<center><video src="https://i.imgur.com/LQ1lHNC.mp4" width=800 controls></video></center>

<center><img src="https://i.imgur.com/knxTRkO.png"></center>

### My Specs

* 🖥 Z8 G4 Workstation
* 💾 2 CPUs & 96GB Memory
* 🎮 NVIDIA Quadro RTX 8000
* 💻 Zbook Studio G7 on the go