# ArcFace and EfficientNet with PyTorch Lightning ⚡
Hi Kagglers ! 👋<br>
In this notebook I want to build a baseline with PyTorch Lightning, a library that I find very helpful for iterating quickly.<br>
At a high level, this notebook ports the following solutions to PyTorch and merges them:
1. The template used by the best public notebooks
2. The 3rd place solution of the [Google Landmark Recognition 2020](https://www.kaggle.com/c/landmark-recognition-2020)
3. The current best public pytorch notebook [Pytorch inference notebok (Arcface + GeM Pooling)](https://www.kaggle.com/vladvdv/pytorch-inference-notebok-arcface-gem-pooling)

# What methods are used in this notebook 💡 ?

| Method | Learn more | Description |
|:---|:---|:---|
| EfficientNet | [EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks](https://arxiv.org/abs/1905.11946) | Architecture designed for efficiency. As we have limited GPU quota, I want to be able to run the best model on a limited setup |
| Sub-center ArcFace | [Sub-center ArcFace: Boosting Face Recognition by Large-scale Noisy Web Faces](https://paperswithcode.com/paper/sub-center-arcface-boosting-face-recognition) | Each class has several sub-centers (embeddings), and the score of a class is the similarity to its closest sub-center |
| ArcFace loss | [ArcFace: Additive Angular Margin Loss for Deep Face Recognition](https://arxiv.org/abs/1801.07698) | Adds an angular margin to the target class during training, which makes the right class harder to predict and acts as a regularization method |
| Dynamic margins | [Google Landmark Recognition 2020 Competition Third Place Solution](https://arxiv.org/abs/2010.05350) | Create a specific margin for each class based on its frequency in the dataset |
| GeM Pooling | [Fine-tuning CNN Image Retrieval with No Human Annotation](https://arxiv.org/abs/1711.02512) | Pooling method that sits between max-pooling and average-pooling. How close it is to one or the other is controlled by a learned exponent |
| Gradient accumulation | [Accumulate gradients](https://pytorch-lightning.readthedocs.io/en/stable/advanced/training_tricks.html#accumulate-gradients) | Accumulate gradients over multiple batches to simulate a larger batch size than your GPU memory allows |
| Mixed precision training | [Precision](https://pytorch-lightning.readthedocs.io/en/latest/advanced/mixed_precision.html) | Train models using 16-bit data types to fit models that would otherwise not fit on your hardware |

# 📚 Libraries

In [None]:
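import os
# `!export` only affects a throwaway subshell, so set the variable inside the kernel process instead
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # synchronous kernel launches: easier debugging, slower training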
import numpy as np
import pandas as pd
import os
import torch
import random
import albumentations
import cv2
import math
import copy
import torchmetrics
import torchvision
import pytorch_lightning as pl
import wandb
import json
from torch import nn, Tensor
import torch.nn.functional as F
from sklearn.model_selection import train_test_split
from torch.utils.data import Dataset
import matplotlib.pyplot as plt
from albumentations.pytorch.transforms import ToTensorV2
from torchvision import transforms, models
from pytorch_lightning import Callback, LightningModule, Trainer
from pytorch_lightning.loggers import WandbLogger
from sklearn.neighbors import NearestNeighbors
!pip install timm
import timm

# Configuration ⚙️
Feel free to fork and change the configuration here:

In [None]:
class CFG:
    SEED = 69
    ### Dataset
    ## Effective batch size will be BATCH_SIZE*ACCUMULATE_GRAD_BATCHES
    #BATCH_SIZE = 16
    BATCH_SIZE = 48
    #ACCUMULATE_GRAD_BATCHES = 8
    ACCUMULATE_GRAD_BATCHES = 1
    IMAGE_SIZE = 512
    #IMAGE_SIZE = 224
    NUM_WORKERS = 2
    ### Model
    MODEL_NAME = "efficientnet_b0"
    EMBEDDING_SIZE = 512
    ### Training
    ## Arcfaces
    CENTERS_PER_CLASS = 3
    S = 30
    MARGIN_MIN = 0.2
    MARGIN_MAX = 0.4
    EPOCHS = 20
    MIXED_PRECISION = True
    MODEL_PATH="model.ckpt"
    # Inference
    KNN = 100
    Q_NEW = 0.112 # Proportion of new individuals expected in the dataset
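
The effective batch size mentioned in the configuration comment relies on gradient accumulation. As a rough illustration (a NumPy sketch on a toy least-squares problem, not the Lightning internals; `mse_grad` and the random data are hypothetical), averaging the gradients of equally sized micro-batches gives the same gradient as one large batch:

```python
import numpy as np

def mse_grad(w, X, y):
    """Gradient of 0.5 * mean((X @ w - y) ** 2) with respect to w."""
    return X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(0)
X, y, w = rng.normal(size=(48, 4)), rng.normal(size=48), rng.normal(size=4)

# One large batch of 48 samples
full_grad = mse_grad(w, X, y)

# Four micro-batches of 12 samples, each contribution scaled by 1 / 4
accumulate_grad_batches = 4
accumulated_grad = np.zeros_like(w)
for idx in np.split(np.arange(len(y)), accumulate_grad_batches):
    accumulated_grad += mse_grad(w, X[idx], y[idx]) / accumulate_grad_batches

print(np.allclose(full_grad, accumulated_grad))
```

The equivalence only holds exactly for layers whose output does not depend on the batch composition; batch normalization statistics, for instance, are still computed per micro-batch.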

# Logging 📄
Logging using Weights and Biases 🪄🐝

In [None]:
from kaggle_secrets import UserSecretsClient

user_secrets = UserSecretsClient()

# I have saved my API token with "wandb_api" as Label. 
# If you use some other Label make sure to change the same below. 
wandb_api = user_secrets.get_secret("wandb_api") 

wandb.login(key=wandb_api)

In [None]:
# Make a deterministic pipeline
def fix_seed(seed):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Ask cuDNN for deterministic algorithms and disable its auto-tuner
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

fix_seed(CFG.SEED)
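
A quick way to check what seeding buys you, sketched here for the Python and NumPy generators only (`draw` is a hypothetical helper): drawing after the same seed reproduces the same numbers, while another seed gives different ones. Some GPU operations can remain non-deterministic even with fixed seeds, so treat this as a sanity check rather than a guarantee for the whole pipeline.

```python
import random
import numpy as np

def draw(seed):
    """Seed Python and NumPy, then draw a few random numbers."""
    random.seed(seed)
    np.random.seed(seed)
    return random.random(), np.random.rand(3).tolist()

same_seed_matches = draw(69) == draw(69)
different_seed_differs = draw(69) != draw(70)
print(same_seed_matches, different_seed_differs)
```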

# Dataset 🖼️
The dataset is implemented as a PyTorch `Dataset`, which PyTorch Lightning consumes through a `DataLoader`. It applies some augmentations to the training images:

In [None]:
BASE_PATH = "../input/happywhale-enhanced-dataset-large"

class WandDID(Dataset):
    def __init__(self, data, augment=True, base_path=BASE_PATH):
        self.base_path = os.path.join(base_path, "train_images")
        self.data = data
        # Augmentations
        if augment:
            transformations = albumentations.Compose([
                albumentations.augmentations.transforms.HueSaturationValue(p=0.5),
                albumentations.Cutout(num_holes=1, max_h_size=(CFG.IMAGE_SIZE//9), max_w_size=(CFG.IMAGE_SIZE//9), p=0.5),
                albumentations.RandomBrightnessContrast(brightness_limit=0.10, contrast_limit=(-0.2, 0.2),p=0.5),
                albumentations.Normalize(),
                ToTensorV2(p=1.0)
            ])
        else:
            transformations = albumentations.Compose([
                albumentations.Normalize(),
                ToTensorV2(p=1.0)
            ])

        def make_transform(transform=None):
            def f(image):
                image_np = np.array(image)
                if transform is None:
                    # No pipeline given: return the image under the key albumentations uses
                    return {"image": image_np}
                return transform(image=image_np)
            return f

        self.transforms = transforms.Compose([
            transforms.Lambda(make_transform(transformations)),
        ])
        
    def __getitem__(self, idx):
        image = self.preprocess(self.data["image"].iloc[idx])
        label = self.data["individual_id_integer"].iloc[idx]
        return image, torch.tensor(label, dtype=torch.long)
    
    def preprocess(self, image):
        image = os.path.join(self.base_path, image)
        image = cv2.imread(image)[:, :, ::-1]
        if image.shape[0] < CFG.IMAGE_SIZE or image.shape[1] < CFG.IMAGE_SIZE:
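            # The third positional argument of cv2.resize is `dst`, so pass the interpolation by keyword
            image = cv2.resize(image, (CFG.IMAGE_SIZE, CFG.IMAGE_SIZE), interpolation=cv2.INTER_CUBIC)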
        if self.transforms is not None:
            image = self.transforms(image)["image"]
        return image
    
    def plot_sample(self, idx):
        image = self.data["image"].iloc[idx]
        image = os.path.join(self.base_path, image)
        image = cv2.imread(image)[:, :, ::-1]
        plt.title("{} ({})".format(
            self.data["individual_id"].iloc[idx],
            self.data["species"].iloc[idx]
        ))
        plt.imshow(image)
        plt.show()
    
    def __len__(self):
        return len(self.data)

In [None]:
data = pd.read_csv(os.path.join(BASE_PATH, "train.csv"))
N_CLASSES = len(data["individual_id"].unique())
# Computing an integer mapping for individuals ids
individual_mapping = {k:i for i, k in enumerate(data["individual_id"].unique())}
# Compute margins for ArcFaces with dynamic margins
tmp = np.sqrt(1 / np.sqrt(data['individual_id'].value_counts().loc[list(individual_mapping)].values))
MARGINS = (tmp - tmp.min()) / (tmp.max() - tmp.min()) * (CFG.MARGIN_MAX - CFG.MARGIN_MIN) + CFG.MARGIN_MIN
# Save individual mapping
with open("individual_mapping.json", "w") as f:
    json.dump(individual_mapping, f)

data["individual_id_integer"] = data["individual_id"].map(individual_mapping)
train_dataset = WandDID(data)
# Dataloader
train_loader = torch.utils.data.DataLoader(
    train_dataset,
    batch_size=CFG.BATCH_SIZE,
    num_workers=CFG.NUM_WORKERS,
    pin_memory=True,
    shuffle=True
)
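
To see what the dynamic margin formula does, here is a small sketch on made-up class counts (the counts and the `dynamic_margins` helper are hypothetical; the arithmetic mirrors the `MARGINS` computation). Rare individuals end up with the largest margin and frequent ones with the smallest:

```python
import numpy as np

MARGIN_MIN, MARGIN_MAX = 0.2, 0.4  # same values as CFG.MARGIN_MIN / CFG.MARGIN_MAX

def dynamic_margins(counts, margin_min=MARGIN_MIN, margin_max=MARGIN_MAX):
    """Map per-class sample counts to margins: rare classes get larger margins."""
    counts = np.asarray(counts, dtype=float)
    tmp = np.sqrt(1 / np.sqrt(counts))                    # counts ** -0.25
    scaled = (tmp - tmp.min()) / (tmp.max() - tmp.min())  # rescale to [0, 1]
    return scaled * (margin_max - margin_min) + margin_min

toy_counts = [1, 4, 16, 100]  # hypothetical number of images per individual
toy_margins = dynamic_margins(toy_counts)
print(dict(zip(toy_counts, toy_margins.round(3).tolist())))
```

Note that the rescaling divides by `tmp.max() - tmp.min()`, so this sketch assumes that not all classes have the same count.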

In [None]:
train_dataset.plot_sample(random.randint(0, len(train_dataset)-1))

# Model 🤖
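Implements the model as a PyTorch Lightning module. We use an EfficientNet backbone from `timm` as feature extractor (`CFG.MODEL_NAME`, set to `efficientnet_b0` in the configuration above), GeM pooling, and the upper layers of the 3rd place Google Landmark solution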

In [None]:
"""
    https://github.com/haqishen/Google-Landmark-Recognition-2020-3rd-Place-Solution
"""
class Swish(torch.autograd.Function):

    @staticmethod
    def forward(ctx, i):
        result = i * torch.sigmoid(i)
        ctx.save_for_backward(i)
        return result

    @staticmethod
    def backward(ctx, grad_output):
        i = ctx.saved_tensors[0]  # `saved_variables` is the deprecated name
        sigmoid_i = torch.sigmoid(i)
        return grad_output * (sigmoid_i * (1 + i * (1 - sigmoid_i)))


class Swish_module(nn.Module):
    def forward(self, x):
        return Swish.apply(x)
    
class DenseCrossEntropy(nn.Module):
    def forward(self, x, target):
        x = x.float()
        target = target.float()
        logprobs = torch.nn.functional.log_softmax(x, dim=-1)

        loss = -logprobs * target
        loss = loss.sum(-1)
        return loss.mean()

class ArcMarginProduct_subcenter(nn.Module):
    def __init__(self, in_features=CFG.EMBEDDING_SIZE, out_features=N_CLASSES, k=CFG.CENTERS_PER_CLASS):
        super().__init__()
        self.weight = nn.Parameter(torch.FloatTensor(out_features*k, in_features))
        self.reset_parameters()
        self.k = k
        self.out_features = out_features
        
    def reset_parameters(self):
        stdv = 1. / math.sqrt(self.weight.size(1))
        self.weight.data.uniform_(-stdv, stdv)
        
    def forward(self, features):
        cosine_all = F.linear(F.normalize(features), F.normalize(self.weight))
        cosine_all = cosine_all.view(-1, self.out_features, self.k)
        cosine, _ = torch.max(cosine_all, dim=2)
        return cosine   
    
class ArcFaceLossAdaptiveMargin(nn.modules.Module):
    def __init__(self, margins, out_dim=N_CLASSES, s=CFG.S):
        super().__init__()
        self.crit = DenseCrossEntropy()
        self.s = s
        self.register_buffer('margins', torch.tensor(margins))
        self.out_dim = out_dim
            
    def forward(self, logits, labels):
        # Per-sample margins, looked up from each sample's class label
        ms = self.margins[labels]
        cos_m = torch.cos(ms)
        sin_m = torch.sin(ms)
        # When cosine <= th, theta + m would reach pi or more, so a linear penalty (cosine - mm) is used instead
        th = torch.cos(math.pi - ms)
        mm = torch.sin(math.pi - ms) * ms
        labels = F.one_hot(labels, self.out_dim)
        labels = labels.half() if CFG.MIXED_PRECISION else labels.float()
        cosine = logits
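        # Clamp to avoid NaN when rounding pushes cosine^2 slightly above 1 (e.g. in half precision)
        sine = torch.sqrt((1.0 - cosine * cosine).clamp(min=0.0))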
        phi = cosine * cos_m.view(-1,1) - sine * sin_m.view(-1,1)
        phi = torch.where(cosine > th.view(-1,1), phi, cosine - mm.view(-1,1))
        output = (labels * phi) + ((1.0 - labels) * cosine)
        output *= self.s
        loss = self.crit(output, labels)
        return loss
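
The margin logic in the loss can be checked numerically. The sketch below (NumPy, with a hypothetical helper `arcface_target_logit` and made-up cosine values) applies the same formulas to the target-class cosine: in the normal branch the result equals `cos(theta + m)`, and in both branches the target logit is lowered, which is what makes the right class harder to predict.

```python
import math
import numpy as np

def arcface_target_logit(cosine, m):
    """Cosine of the target class after adding the angular margin m (radians)."""
    sine = np.sqrt(1.0 - cosine ** 2)
    phi = cosine * math.cos(m) - sine * math.sin(m)  # cos(theta + m)
    th = math.cos(math.pi - m)      # at or below this cosine, theta + m reaches pi
    mm = math.sin(math.pi - m) * m  # linear penalty used in that case
    return np.where(cosine > th, phi, cosine - mm)

cosines = np.array([0.9, 0.5, 0.0, -0.99])  # hypothetical target-class cosines
penalised = arcface_target_logit(cosines, m=0.3)
print(penalised.round(4))
```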

In [None]:
"""
    https://www.kaggle.com/vladvdv/pytorch-inference-notebok-arcface-gem-pooling
"""

class GeM(nn.Module):
    def __init__(self, p=3, eps=1e-6):
        super(GeM, self).__init__()
        self.p = nn.Parameter(torch.ones(1)*p)
        self.eps = eps

    def forward(self, x):
        return self.gem(x, p=self.p, eps=self.eps)
        
    def gem(self, x, p=3, eps=1e-6):
        return F.avg_pool2d(x.clamp(min=eps).pow(p), (x.size(-2), x.size(-1))).pow(1./p)
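
To build intuition for the exponent `p`, here is a NumPy re-implementation of the same formula on a tiny hand-written feature map (a sketch for illustration; `gem_pool` is a hypothetical name). With `p=1` it reduces to average pooling, and as `p` grows it approaches max pooling:

```python
import numpy as np

def gem_pool(x, p=3.0, eps=1e-6):
    """Generalized-mean pooling over the last two (spatial) axes of x."""
    return np.mean(np.clip(x, eps, None) ** p, axis=(-2, -1)) ** (1.0 / p)

feature_map = np.array([[[1.0, 2.0], [3.0, 4.0]]])  # shape (channels=1, H=2, W=2)
avg_like = gem_pool(feature_map, p=1.0)    # equals the spatial mean
default = gem_pool(feature_map, p=3.0)     # between mean and max
max_like = gem_pool(feature_map, p=100.0)  # close to the spatial max
print(avg_like, default, max_like)
```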

In [None]:
"""
    Scheduler coming from: https://www.kaggle.com/andrej0marinchenko/happywhale-0-679
"""
    
class PeakScheduler(torch.optim.lr_scheduler._LRScheduler):
        def __init__(
                self, optimizer,
                epoch_size=-1,
                lr_start   = 0.000001,
                lr_max     = 0.000005 * CFG.BATCH_SIZE * CFG.ACCUMULATE_GRAD_BATCHES,
                lr_min     = 0.000001,
                lr_ramp_ep = 4,
                lr_sus_ep  = 0,
                lr_decay   = 0.8,
                verbose = True
            ):
            self.epoch_size = epoch_size
            self.optimizer= optimizer
            self.lr_start = lr_start
            self.lr_max = lr_max
            self.lr_min = lr_min
            self.lr_ramp_ep = lr_ramp_ep
            self.lr_sus_ep = lr_sus_ep
            self.lr_decay = lr_decay
            self.is_plotting = True
            epochs = list(range(CFG.EPOCHS))
            learning_rates = []
            for i in epochs:
                self.epoch = i
                learning_rates.append(self.get_lr())
            self.is_plotting = False
            self.epoch = 0
            plt.scatter(epochs,learning_rates)
            plt.show()
            super(PeakScheduler, self).__init__(optimizer, verbose=verbose)

        def get_lr(self):
            if not self.is_plotting:
                if self.epoch_size == -1:
                    self.epoch = self._step_count - 1
                else:
                    self.epoch = (self._step_count - 1) / self.epoch_size
                    
            if self.epoch < self.lr_ramp_ep:
                lr = (self.lr_max - self.lr_start) / self.lr_ramp_ep * self.epoch + self.lr_start

            elif self.epoch < self.lr_ramp_ep + self.lr_sus_ep:
                lr = self.lr_max
            else:
                lr = (self.lr_max - self.lr_min) * self.lr_decay**(self.epoch - self.lr_ramp_ep - self.lr_sus_ep) + self.lr_min
            return [lr for _ in self.optimizer.param_groups]
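
The shape of this schedule is easier to see in isolation. The following plain-Python sketch reproduces the ramp, sustain and decay formula with the default values used above (`peak_lr` is a hypothetical helper, and `lr_max=2.4e-4` assumes `BATCH_SIZE = 48` and `ACCUMULATE_GRAD_BATCHES = 1`):

```python
def peak_lr(epoch, lr_start=1e-6, lr_max=2.4e-4, lr_min=1e-6,
            ramp_epochs=4, sustain_epochs=0, decay=0.8):
    """Learning rate at a (possibly fractional) epoch."""
    if epoch < ramp_epochs:
        # Linear warm-up from lr_start to lr_max
        return (lr_max - lr_start) / ramp_epochs * epoch + lr_start
    if epoch < ramp_epochs + sustain_epochs:
        # Optional plateau at the peak
        return lr_max
    # Exponential decay towards lr_min
    return (lr_max - lr_min) * decay ** (epoch - ramp_epochs - sustain_epochs) + lr_min

schedule = [peak_lr(epoch) for epoch in range(8)]
print(["{:.2e}".format(lr) for lr in schedule])
```

With these defaults the learning rate starts at `lr_start`, peaks at epoch 4, and then shrinks by a factor of 0.8 per epoch towards `lr_min`.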

In [None]:
class WandDIDNet(LightningModule):
    def __init__(self, continuous_scheduler=True, s=30, m=0.3):
        super().__init__()
        self.save_hyperparameters()
        self.continuous_scheduler = continuous_scheduler
        # Layers
        self.feature_extractor = timm.create_model(CFG.MODEL_NAME, pretrained=True)
        in_features = self.feature_extractor.classifier.in_features
        self.feature_extractor.classifier = nn.Identity()
        self.feature_extractor.global_pool = nn.Identity()
        self.pooling = GeM()
        self.dropout = nn.Dropout()
        self.dense = nn.Linear(in_features, CFG.EMBEDDING_SIZE)
        self.swish = Swish_module()
        self.metric_classify = ArcMarginProduct_subcenter(CFG.EMBEDDING_SIZE)
        # Loss
        self.criterion = ArcFaceLossAdaptiveMargin(margins=MARGINS)
        # Metrics
        self.train_acc = torchmetrics.Accuracy()
        self.train_top_k_acc = torchmetrics.Accuracy(top_k=5)
        self.val_acc = torchmetrics.Accuracy()
        self.val_top_k_acc = torchmetrics.Accuracy(top_k=5)
        
    def forward(self, image):
        """
            Return embedding of the images
        """
        features = self.feature_extractor(image)
        x = self.pooling(features).flatten(1)
        x = self.dropout(x)
        x = self.dense(x)
        return F.normalize(x)
    
    def training_step(self, batch, batch_idx):
        """
            Return the loss to do a step on
        """
        img, label = batch
        embedding = self(img)
        logits = self.metric_classify(embedding)
        loss  = self.criterion(logits, label)
        # Log metrics
        self.train_acc(logits, label)
        self.train_top_k_acc(logits, label)
        self.log("train/loss", loss)
        self.log("train/acc", self.train_acc)
        self.log("train/top_k_acc", self.train_top_k_acc)
        # Return loss, labels and preds
        return {"loss": loss, "preds": logits.detach(), "targets": label.detach()}
    
    def configure_optimizers(self):
        """
            Build optimizer(s) and lr scheduler(s)
        """
        optimizer = torch.optim.AdamW(self.parameters())
        if self.continuous_scheduler:
            sched = {
                "scheduler": PeakScheduler(optimizer, epoch_size=len(train_loader) // CFG.ACCUMULATE_GRAD_BATCHES, verbose=False),
                "interval": "step",
            }
        else:
            sched = {
                "scheduler": PeakScheduler(optimizer),
                "interval": "epoch",
            }
        return {
            "optimizer": optimizer,
            "lr_scheduler": sched
        }
    
    def save_class_weights(self):
        """
            Save the class centers as a tensor
        """
        torch.save(self.metric_classify.weight, 'class_weights.pt')

# Training 🏃
Create a PyTorch Lightning trainer with our configuration, and train our model on our dataset:

In [None]:
model = WandDIDNet()
wandb_logger = WandbLogger(project="W&D - identification")
# Trainer 
trainer = Trainer(
    profiler="simple", # Profiling
    accumulate_grad_batches=CFG.ACCUMULATE_GRAD_BATCHES,# Accumulate gradient over multiple batches
    gpus=1,# Use the one GPU we have
    precision=16 if CFG.MIXED_PRECISION else 32,# Mixed precision
    max_epochs=CFG.EPOCHS,
    logger=wandb_logger,
    log_every_n_steps=10
)
# Let's go ⚡
trainer.fit(model, train_loader)

In [None]:
# Save everything
trainer.save_checkpoint(CFG.MODEL_PATH)
artifact = wandb.log_artifact(CFG.MODEL_PATH, name='w_and_d-id-normal', type='model') 
model.save_class_weights()
artifact = wandb.log_artifact('class_weights.pt', name='w_and_d-id-normal-weights', type="class_weights") 

# Inference 🔮
Now we have to do the inference part. First, we will predict embeddings for the training and testing samples. Then we will use nearest neighbors to make our predictions
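
As a minimal sketch of the retrieval step, assuming embeddings are compared with cosine distance (the notebook relies on scikit-learn's `NearestNeighbors` with `metric='cosine'`; the brute-force `cosine_knn` helper and the toy vectors below are hypothetical):

```python
import numpy as np

def cosine_knn(database, queries, k):
    """Return (distances, indices) of the k nearest database rows by cosine distance."""
    db = database / np.linalg.norm(database, axis=1, keepdims=True)
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    distances = 1.0 - q @ db.T                    # shape (n_queries, n_database)
    order = np.argsort(distances, axis=1)[:, :k]  # closest first
    return np.take_along_axis(distances, order, axis=1), order

toy_database = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # hypothetical train embeddings
toy_queries = np.array([[2.0, 0.1]])                            # hypothetical test embedding
toy_dist, toy_idx = cosine_knn(toy_database, toy_queries, k=2)
print(toy_idx, toy_dist.round(3))
```

The query points almost along the first axis, so database row 0 is its nearest neighbor and row 2 comes second.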

In [None]:
class WandDIDPred(Dataset):
    def __init__(self, data, folder, base_path=BASE_PATH):
        self.base_path = os.path.join(base_path, folder)
        self.data = data
        # Augmentations
        transformations = albumentations.Compose([
            albumentations.Normalize(),
            ToTensorV2(p=1.0)
        ])

        def make_transform(transform=None):
            def f(image):
                image_np = np.array(image)
                if transform is None:
                    # No pipeline given: return the image under the key albumentations uses
                    return {"image": image_np}
                return transform(image=image_np)
            return f

        self.transforms = transforms.Compose([
            transforms.Lambda(make_transform(transformations)),
        ])
        
    def __getitem__(self, idx):
        image = self.preprocess(self.data["image"].iloc[idx])
        return image
    
    def preprocess(self, image):
        image = os.path.join(self.base_path, image)
        image = cv2.imread(image)[:, :, ::-1]
        if image.shape[0] < CFG.IMAGE_SIZE or image.shape[1] < CFG.IMAGE_SIZE:
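            # The third positional argument of cv2.resize is `dst`, so pass the interpolation by keyword
            image = cv2.resize(image, (CFG.IMAGE_SIZE, CFG.IMAGE_SIZE), interpolation=cv2.INTER_CUBIC)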
        if self.transforms is not None:
            image = self.transforms(image)["image"]
        return image
    
    def __len__(self):
        return len(self.data)

In [None]:
# Prediction on the training data
pred_loader = torch.utils.data.DataLoader(
    WandDIDPred(data, "train_images"),
    batch_size=CFG.BATCH_SIZE * 6,
    num_workers=CFG.NUM_WORKERS,
    pin_memory=True,
    shuffle=False
)
preds = trainer.predict(model, dataloaders=pred_loader)
preds = torch.cat(preds, dim=0)
train_data = data.copy()
train_data["embedding"] = preds.tolist()
train_data.to_csv("train.csv")
train_data

In [None]:
# Prediction on test data
test_data = pd.read_csv(os.path.join(BASE_PATH, "sample_submission.csv"), index_col="image")
if "inference_image" in test_data.columns:
    test_data["image"] = test_data["inference_image"]
else:
    test_data["image"] = test_data.index
    
pred_loader = torch.utils.data.DataLoader(
    WandDIDPred(test_data, "test_images"),
    batch_size=CFG.BATCH_SIZE * 6,
    num_workers=CFG.NUM_WORKERS,
    pin_memory=True,
    shuffle=False
)
preds = trainer.predict(model, dataloaders=pred_loader)
preds = torch.cat(preds, dim=0)
test_data["embedding"] = preds.tolist()
test_data.to_csv("test.csv")
test_data

In [None]:
# Log our model in wandb and finish the run
wandb_logger.finalize("success")
wandb.finish()

In [None]:
train_data["individual_id_integer"] = train_data["individual_id"].map(individual_mapping).fillna(-1)
train_embeddings = np.array(train_data["embedding"].values.tolist())
test_embeddings = np.array(test_data["embedding"].values.tolist())
# Move the weights to the CPU first, in case the model is still on the GPU
class_centers = model.metric_classify.weight.detach().cpu().numpy()
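
The class-center matrix has one row per sub-center, so each row needs to be mapped back to its individual. Here is a small sketch of that layout, with made-up similarities and hypothetical individual names, following the same grouping as `view(-1, out_features, k)` in the sub-center layer:

```python
import numpy as np

k = 3                                 # sub-centers per class, as CFG.CENTERS_PER_CLASS
class_names = ["whale_a", "whale_b"]  # hypothetical individual ids

# One query scored against 2 classes * 3 sub-centers = 6 weight rows
cosine_all = np.array([[0.1, 0.7, 0.2, 0.9, 0.3, 0.4]])

# Group consecutive rows by class, then keep the best sub-center of each class
cosine_per_class = cosine_all.reshape(-1, len(class_names), k).max(axis=2)

# Row i of the weight matrix therefore belongs to class i // k
row_to_class = np.repeat(class_names, k)
best_row = int(cosine_all[0].argmax())
print(cosine_per_class, row_to_class[best_row])
```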

In [None]:
class Solution:
    def __init__(self, database, q_prior):
        self.database_embeddings = np.array(database["embeddings"]["embedding"].values.tolist())
        self.database_individuals = database["embeddings"]["individual_id"].values
        self.q_prior = q_prior
        self.embed_neigh = NearestNeighbors(n_neighbors=CFG.KNN,metric='cosine')
        self.embed_neigh.fit(self.database_embeddings)
        self.class_neigh = NearestNeighbors(n_neighbors=CFG.KNN,metric='cosine')
        self.class_neigh.fit(database["class_centers"])
        self.default = ['938b7e931166', '5bf17305f073', '7593d2aee842', '7362d7a01d00','956562ff2888']        
    
    def predict(self, queries):
        embed_distances, embed_idxs = self.embed_neigh.kneighbors(queries, CFG.KNN, return_distance=True)
        class_distances, class_idxs = self.class_neigh.kneighbors(queries, CFG.KNN, return_distance=True)
        
        class_individuals = np.repeat(list(individual_mapping), CFG.CENTERS_PER_CLASS)[class_idxs]
        embed_individuals = self.database_individuals[embed_idxs]
        
        n = embed_distances.size
        embeddings_df = pd.DataFrame(data={
            'distance': embed_distances.ravel(),
            'individual': embed_individuals.ravel(),
            'query_id': np.repeat(np.arange(len(queries)), CFG.KNN)
        }, index=np.arange(n))
        
        class_df = pd.DataFrame(data={
            'distance': class_distances.ravel(),
            'individual': class_individuals.ravel(),
            'query_id': np.repeat(np.arange(len(queries)), CFG.KNN)
        }, index=np.arange(n))
        
        embeddings_topk = embeddings_df.groupby(["query_id", "individual"]).agg("min")['distance'].groupby('query_id', group_keys=False).nsmallest(5)
        class_topk = class_df.groupby(["query_id", "individual"]).agg("min")['distance'].groupby('query_id', group_keys=False).nsmallest(5)
        embeddings_topk = embeddings_topk.reset_index().groupby("query_id").agg(list)
        class_topk = class_topk.reset_index().groupby("query_id").agg(list)
        class_t_new = np.quantile(class_topk["distance"].apply(lambda x: x[0]), 1 - self.q_prior)
        embeddings_t_new = np.quantile(embeddings_topk["distance"].apply(lambda x: x[0]), 1 - self.q_prior)
        
        def insert_new_individuals(x):
            m = np.array(x["distance"]) > class_t_new
            preds = x["individual"]
            if m.any():
                preds.insert(np.argmax(m), "new_individual")
            preds = preds + [y for y in self.default if y not in preds]
            return preds[:5]
        
        preds = class_topk.apply(insert_new_individuals, axis=1)
        return preds.values.tolist()
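
The handling of `new_individual` boils down to two steps: pick a distance threshold from a quantile of the best distances, then insert `new_individual` just before the first candidate that lies beyond it. A stripped-down sketch with toy numbers (both helpers and all values are hypothetical, and the padding with default ids is left out):

```python
import numpy as np

def new_individual_threshold(best_distances, q_new):
    """Distance that roughly a fraction q_new of the best matches exceed."""
    return np.quantile(best_distances, 1 - q_new)

def insert_new_individual(individuals, distances, threshold, max_preds=5):
    """Put "new_individual" just before the first candidate farther than threshold."""
    preds = list(individuals)
    too_far = np.asarray(distances) > threshold
    if too_far.any():
        preds.insert(int(np.argmax(too_far)), "new_individual")
    return preds[:max_preds]

toy_best = np.array([0.1, 0.2, 0.3, 0.9])  # hypothetical best distance per query
toy_threshold = new_individual_threshold(toy_best, q_new=0.25)
toy_preds = insert_new_individual(["a", "b", "c"], [0.1, 0.5, 0.6], toy_threshold)
print(toy_threshold, toy_preds)
```

In this toy case one query out of four has its best match beyond the threshold, which matches the assumed prior of 25% new individuals.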

<p style="background: #ff6363; padding: 25px; border-radius: 10px">
    🚨 The current solution does not use the image sample embeddings (it uses only the class centers), so there is room for improvement here! 🚨
</p>

In [None]:
solution = Solution({
    "embeddings": train_data,
    "class_centers": class_centers
}, CFG.Q_NEW)
predictions = solution.predict(test_embeddings)
predictions = pd.Series(predictions, test_data.index, name="predictions").map(lambda x: " ".join(x))
predictions.to_csv("submission.csv")
predictions

![](https://upload.wikimedia.org/wikipedia/commons/thumb/e/ea/Thats_all_folks.svg/2560px-Thats_all_folks.svg.png)