# BiteMe | Train

This notebook covers the core of the project: the modelling. It tests training methodologies and is where the final algorithm is chosen. Validation also happens here, ahead of final testing in the test notebook. Because this stage is highly iterative, all model artefacts, logs and configurations are saved to disk automatically. This setup is an early form of the MLOps that the final product will need, and it makes it easy to track what works and what doesn't.

Models to try:

 - ~~[SE-ResNet50](https://github.com/Cadene/pretrained-models.pytorch#senet)~~
 - ~~[SE-ResNet101](https://github.com/Cadene/pretrained-models.pytorch#senet)~~
 - ~~[SE-ResNet152](https://github.com/Cadene/pretrained-models.pytorch#senet)~~
 - ~~[SENet154](https://github.com/Cadene/pretrained-models.pytorch#senet)~~
 - ~~[ResNet34](https://github.com/Cadene/pretrained-models.pytorch#torchvision)~~
 - ~~[ResNet50](https://github.com/Cadene/pretrained-models.pytorch#torchvision)~~
 - ~~[ResNet101](https://github.com/Cadene/pretrained-models.pytorch#torchvision)~~
 - ~~[ResNet152](https://github.com/Cadene/pretrained-models.pytorch#torchvision)~~
 - ~~[FBResNet152](https://github.com/Cadene/pretrained-models.pytorch#facebook-resnet)~~
 - ~~[PolyNet](https://github.com/Cadene/pretrained-models.pytorch#polynet)~~
 - ~~[InceptionV4](https://github.com/Cadene/pretrained-models.pytorch#inception)~~
 - ~~[BNInception](https://github.com/Cadene/pretrained-models.pytorch#bninception)~~
 - [InceptionResNetV2](https://github.com/Cadene/pretrained-models.pytorch#inception)
 - [Xception](https://github.com/Cadene/pretrained-models.pytorch#xception)
 - [ResNeXt101_32x4d](https://github.com/Cadene/pretrained-models.pytorch#resnext)
 - [ResNeXt101_64x4d](https://github.com/Cadene/pretrained-models.pytorch#resnext)
 - [SE-ResNeXt50_32x4d](https://github.com/Cadene/pretrained-models.pytorch#senet)
 - [SE-ResNeXt101_32x4d](https://github.com/Cadene/pretrained-models.pytorch#senet)
 - [DenseNet121](https://github.com/Cadene/pretrained-models.pytorch#torchvision)
 - [DenseNet161](https://github.com/Cadene/pretrained-models.pytorch#torchvision)
 - [DenseNet169](https://github.com/Cadene/pretrained-models.pytorch#torchvision)
 - [DenseNet201](https://github.com/Cadene/pretrained-models.pytorch#torchvision)
 - [DualPathNet68](https://github.com/Cadene/pretrained-models.pytorch#dualpathnetworks)
 - [DualPathNet92](https://github.com/Cadene/pretrained-models.pytorch#dualpathnetworks)
 - [DualPathNet98](https://github.com/Cadene/pretrained-models.pytorch#dualpathnetworks)
 - [DualPathNet107](https://github.com/Cadene/pretrained-models.pytorch#dualpathnetworks)
 - [DualPathNet131](https://github.com/Cadene/pretrained-models.pytorch#dualpathnetworks)
 - [NASNet-A-Large](https://github.com/Cadene/pretrained-models.pytorch#nasnet)
 - [PNASNet-5-Large](https://github.com/Cadene/pretrained-models.pytorch#pnasnet)


 - efficientnet_b0
 - efficientnet_b1
 - efficientnet_b2
 - efficientnet_b3
 - efficientnet_b4
 - efficientnet_b5

Initial modelling uses standard image recognition models (CNN architectures) to see how well they handle the problem. I don't expect them to be particularly successful, but they establish baselines and support a holistic approach to modelling.

In [1]:
# Basic imports
import pandas as pd
import numpy as np
import os
import sys
from argparse import ArgumentParser
import datetime
from time import time
import gc
from tqdm import tqdm

# Data visualisation
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn

# Image processing
import cv2
import albumentations as A
import imgaug as ia
import imgaug.augmenters as iaa

# Model evaluation
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score, recall_score, precision_score, roc_auc_score, f1_score

import torch
import pretrainedmodels
import pytorch_lightning as pl
from pytorch_lightning.callbacks import EarlyStopping, ModelCheckpoint

# Local imports
sys.path.append("..")
from utils.dataset import generate_transforms, generate_dataloaders
from models.models import *
from utils.loss_function import CrossEntropyLossOneHot
from utils.lrs_scheduler import WarmRestart, warm_restart
from utils.utils import read_images, augs, get_augs, seed_reproducer, init_logger
from utils.constants import *

plt.rcParams["figure.figsize"] = (14, 8)

In [2]:
# Define directories
base_dir_path = "../"

data_dir_path = os.path.join(base_dir_path, "data")
data_preprocessed_dir_path = os.path.join(data_dir_path, "preprocessed")
data_preprocessed_train_dir_path = os.path.join(data_preprocessed_dir_path, "train")

data_dir = os.listdir(data_dir_path)
data_preprocessed_dir = os.listdir(data_preprocessed_dir_path)
data_preprocessed_train_dir = os.listdir(data_preprocessed_train_dir_path)

metadata_preprocessed_path = os.path.join(data_preprocessed_dir_path, "metadata.csv")
metadata = pd.read_csv(metadata_preprocessed_path)
# Subset to train only
metadata = metadata.loc[metadata.split == "train"]

metadata.head()

,img_name,img_path,label,split
0,7059b14d2aa03ed6c4de11afa32591995181d31c.jpg,../data/cleaned/none/7059b14d2aa03ed6c4de11afa...,none,train
1,ea1b100b581fcdb7ddfae52cc62347a99e304ba4.jpg,../data/cleaned/none/ea1b100b581fcdb7ddfae52cc...,none,train
2,6eac051b9c45ff6821ec8675216f371711b7cea9.jpg,../data/cleaned/none/6eac051b9c45ff6821ec86752...,none,train
3,fc72767f8520df9b2b83941077dc0ee013eb9399.jpg,../data/cleaned/none/fc72767f8520df9b2b8394107...,none,train
4,49850884a00703afe5aab78c3ce074d2d4acae30.jpg,../data/cleaned/none/49850884a00703afe5aab78c3...,none,train


In [3]:
# Read in train images
X_train = read_images(
    data_dir_path=data_preprocessed_train_dir_path, 
    rows=ROWS, 
    cols=COLS, 
    channels=CHANNELS, 
    write_images=False, 
    output_data_dir_path=None,
    verbose=VERBOSE
)

# Get labels
y_train = np.array(pd.get_dummies(metadata["label"]))

Reading images from: ../data/preprocessed/train
Rows set to 1024
Columns set to 1024
Channels set to 3
Writing images is set to: False
Reading images...


100%|███████████████████████████████████████████| 27/27 [00:00<00:00, 47.33it/s]
100%|███████████████████████████████████████████| 55/55 [00:02<00:00, 20.33it/s]
100%|███████████████████████████████████████████| 21/21 [00:01<00:00, 13.28it/s]
100%|███████████████████████████████████████████| 46/46 [00:04<00:00, 10.52it/s]
100%|███████████████████████████████████████████| 25/25 [00:02<00:00,  8.43it/s]
100%|███████████████████████████████████████████| 21/21 [00:02<00:00,  7.47it/s]
100%|███████████████████████████████████████████| 58/58 [00:09<00:00,  6.26it/s]
100%|███████████████████████████████████████████| 46/46 [00:08<00:00,  5.14it/s]


Image reading complete.
Image array shape: (299, 1024, 1024, 3)
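
The label matrix built above is a one-hot encoding with one column per class. A minimal sketch of the same idea in plain Python follows. It assumes the columns are ordered alphabetically by class name, which is how `pd.get_dummies` orders them, and the `one_hot` helper is hypothetical rather than part of the project code.

```python
def one_hot(labels):
    """Return (classes, rows) where each row is a one-hot list for one label."""
    classes = sorted(set(labels))  # alphabetical column order
    index = {name: i for i, name in enumerate(classes)}
    rows = []
    for label in labels:
        row = [0] * len(classes)
        row[index[label]] = 1
        rows.append(row)
    return classes, rows


classes, rows = one_hot(["none", "ant", "tick", "ant"])
print(classes)  # ['ant', 'none', 'tick']
print(rows)     # [[0, 1, 0], [1, 0, 0], [0, 0, 1], [1, 0, 0]]
```

The column order matters later on, because the model's output positions are mapped back to class names using the same ordering.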


## Set Parameters

In [4]:
# Choose augmentations to use in preprocessing
# For full list see helpers.py
#augs_to_select = [
#    "Resize",
#    "HorizontalFlip", 
#    "VerticalFlip",
#    "Normalize"
#]
## Subset augs based on those selected
#AUGS = dict((aug_name, augs[aug_name]) for aug_name in augs_to_select)


def init_hparams():
    """
    Initialise hyperparameters for modelling.
    
    Returns
    ---------
    hparams : argparse.Namespace
        Parsed hyperparameters
    """
    parser = ArgumentParser(add_help=False)
    parser.add_argument("-backbone", "--backbone", type=str, default=MODEL_NAME)
    parser.add_argument("--device_name", type=str, default=DEVICE_NAME)
    parser.add_argument("--gpus", default=[0])
    parser.add_argument("--n_workers", type=int, default=N_WORKERS)
    parser.add_argument("--image_size", nargs="+", default=[ROWS, COLS])
    parser.add_argument("--seed", type=int, default=SEED)
    parser.add_argument("--min_epochs", type=int, default=MIN_EPOCHS)
    parser.add_argument("--max_epochs", type=int, default=MAX_EPOCHS)
    parser.add_argument("--patience", type=int, default=PATIENCE)
    parser.add_argument("-tbs", "--train_batch_size", type=int, default=TRAIN_BATCH_SIZE)
    parser.add_argument("-vbs", "--val_batch_size", type=int, default=VAL_BATCH_SIZE)
    parser.add_argument("--n_splits", type=int, default=N_SPLITS)
    parser.add_argument("--test_size", type=float, default=TEST_SIZE)
    parser.add_argument("--lr", type=float, default=LEARNING_RATE)
    parser.add_argument("--weight_decay", type=float, default=WEIGHT_DECAY)
    parser.add_argument("--epsilon", type=float, default=EPSILON)
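    # type=bool would turn any non-empty string (including "False") into True
    parser.add_argument("--amsgrad", type=lambda s: str(s).lower() in ("true", "1", "yes"), default=AMSGRAD)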
    parser.add_argument("--betas",
                        default=BETAS)
    parser.add_argument("--eta_min", type=float, default=ETA_MIN)
    parser.add_argument("--precision", type=int, default=PRECISION)
    parser.add_argument("--gradient_clip_val", type=float, default=GRADIENT_CLIP_VAL)
    parser.add_argument("--verbose", type=str, default=VERBOSE)
    parser.add_argument("--log_dir", type=str, default=LOG_DIR)
    parser.add_argument("--log_name", type=str, default=LOG_NAME)
    
    
    try:
        hparams, unknown = parser.parse_known_args()
    except SystemExit:
        # argparse exits on a parse error; fall back to the defaults
        hparams = parser.parse_args([])

    hparams.gpus = [int(gpu) for gpu in hparams.gpus]

    hparams.image_size = [int(size) for size in hparams.image_size]
    
    return hparams
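
A note on boolean hyperparameters such as `--amsgrad`: argparse applies `type` to the raw command-line string, and `bool("False")` is `True` because the string is non-empty. The sketch below contrasts a naive parser with one that uses a small converter. The `str2bool` helper is hypothetical and only illustrates one possible approach.

```python
from argparse import ArgumentParser


def str2bool(value):
    """Hypothetical helper: parse common true/false spellings."""
    if isinstance(value, bool):
        return value
    text = value.strip().lower()
    if text in ("true", "1", "yes", "y"):
        return True
    if text in ("false", "0", "no", "n"):
        return False
    raise ValueError(f"Not a boolean: {value!r}")


naive = ArgumentParser(add_help=False)
naive.add_argument("--amsgrad", type=bool, default=False)

safe = ArgumentParser(add_help=False)
safe.add_argument("--amsgrad", type=str2bool, default=False)

naive_value = naive.parse_args(["--amsgrad", "False"]).amsgrad
safe_value = safe.parse_args(["--amsgrad", "False"]).amsgrad
print(naive_value)  # True, because bool("False") is True
print(safe_value)   # False
```

Defaults that are already Python booleans are not passed through `type`, so the problem only appears when a value is supplied on the command line.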

## Create Model

In [5]:
class CoolSystem(pl.LightningModule):
    def __init__(self, hparams):
        super().__init__()
        self.hparams = hparams

        seed_reproducer(self.hparams.seed)

        self.model = inceptionresnetv2()
        self.criterion = CrossEntropyLossOneHot()
        self.logger_kun = init_logger(
            hparams.log_name, 
            hparams.log_dir
        )

    def forward(self, x):
        return self.model(x)

    def configure_optimizers(self):
        self.optimizer = torch.optim.Adam(
            self.parameters(), 
            lr=self.hparams.lr, 
            betas=self.hparams.betas, 
            eps=self.hparams.epsilon, 
            weight_decay=self.hparams.weight_decay,
            amsgrad=self.hparams.amsgrad
        )
        self.scheduler = WarmRestart(
            self.optimizer, 
            T_max=15, 
            T_mult=3, 
            eta_min=self.hparams.eta_min
        )
        return [self.optimizer], [self.scheduler]

    def training_step(self, batch, batch_idx):
        step_start_time = time()
        images, labels, data_load_time = batch

        scores = self(images)
        loss = self.criterion(scores, labels)

        data_load_time = torch.sum(data_load_time)

        return {
            "loss": loss,
            "data_load_time": data_load_time,
            "batch_run_time": torch.Tensor([time() - step_start_time + data_load_time]).to(
                data_load_time.device
            ),
        }

    def training_epoch_end(self, outputs):
        # outputs is the return of training_step
        train_loss_mean = torch.stack([output["loss"] for output in outputs]).mean()
        self.data_load_times = torch.stack([output["data_load_time"] for output in outputs]).sum()
        self.batch_run_times = torch.stack([output["batch_run_time"] for output in outputs]).sum()

        self.current_epoch += 1
        if self.current_epoch < (self.trainer.max_epochs - 4):
            self.scheduler = warm_restart(self.scheduler, T_mult=2)

        return {"train_loss": train_loss_mean}

    def validation_step(self, batch, batch_idx):
        step_start_time = time()
        images, labels, data_load_time = batch
        data_load_time = torch.sum(data_load_time)
        scores = self(images)
        loss = self.criterion(scores, labels)

        # must return key -> val_loss
        return {
            "val_loss": loss,
            "scores": scores,
            "labels": labels,
            "data_load_time": data_load_time,
            "batch_run_time": torch.Tensor([time() - step_start_time + data_load_time]).to(
                data_load_time.device
            ),
        }

    def validation_epoch_end(self, outputs):
        # compute loss
        val_loss_mean = torch.stack([output["val_loss"] for output in outputs]).mean()
        self.data_load_times = torch.stack([output["data_load_time"] for output in outputs]).sum()
        self.batch_run_times = torch.stack([output["batch_run_time"] for output in outputs]).sum()

        # compute roc_auc
        scores_all = torch.cat([output["scores"] for output in outputs]).cpu()
        labels_all = torch.round(torch.cat([output["labels"] for output in outputs]).cpu())

        val_roc_auc = torch.tensor(roc_auc_score(labels_all, scores_all))

        # terminal logs
        self.logger_kun.info(
            f"{self.hparams.fold_i}-{self.current_epoch} | "
            f"lr : {self.scheduler.get_lr()[0]:.6f} | "
            f"val_loss : {val_loss_mean:.4f} | "
            f"val_roc_auc : {val_roc_auc:.4f} | "
            f"data_load_times : {self.data_load_times:.2f} | "
            f"batch_run_times : {self.batch_run_times:.2f}"
        )

        return {"val_loss": val_loss_mean, "val_roc_auc": val_roc_auc}
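
The scheduler configured above anneals the learning rate between restarts. The sketch below shows the standard cosine annealing formula, on the assumption that `WarmRestart` follows it. The function name `cosine_lr` and the `eta_min` value are illustrative and are not taken from `utils/lrs_scheduler.py`.

```python
import math


def cosine_lr(base_lr, eta_min, t_cur, t_max):
    """Cosine-annealed learning rate after t_cur of t_max epochs in a cycle."""
    return eta_min + (base_lr - eta_min) * (1 + math.cos(math.pi * t_cur / t_max)) / 2


base_lr = 1e-4   # matches the learning rate used in this notebook's run
eta_min = 1e-6   # illustrative value; ETA_MIN is not shown in the notebook

for epoch in (0, 5, 10, 15):
    print(epoch, f"{cosine_lr(base_lr, eta_min, epoch, t_max=15):.6f}")
```

The rate starts at `base_lr`, falls smoothly to `eta_min` at the end of the cycle, and a warm restart then resets it to the top for a longer cycle.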

## Cross Validation

In [6]:
# Initialise hyperparameters
hparams = init_hparams()
torch.cuda.empty_cache()

log_notes = "increased patience/min_epochs/max_epochs to 20/80/100 from 11/30/50"

# Initialise logger
logger = init_logger(hparams.log_name, hparams.log_dir)

# Log parameters
logger.info(f"backbone: {hparams.backbone}")
logger.info(f"device_name: {hparams.device_name}")
logger.info(f"gpus: {hparams.gpus}")
logger.info(f"n_workers: {hparams.n_workers}")
logger.info(f"image_size: {hparams.image_size}")
logger.info(f"seed: {hparams.seed}")
logger.info(f"min_epochs: {hparams.min_epochs}")
logger.info(f"max_epochs: {hparams.max_epochs}")
logger.info(f"patience: {hparams.patience}")
logger.info(f"train_batch_size: {hparams.train_batch_size}")
logger.info(f"val_batch_size: {hparams.val_batch_size}")
logger.info(f"n_splits: {hparams.n_splits}")
logger.info(f"test_size: {hparams.test_size}")
logger.info(f"learning rate: {hparams.lr}")
logger.info(f"weight_decay: {hparams.weight_decay}")
logger.info(f"epsilon: {hparams.epsilon}")
logger.info(f"amsgrad: {hparams.amsgrad}")
logger.info(f"betas: {hparams.betas}")
logger.info(f"precision: {hparams.precision}")
logger.info(f"gradient_clip_val: {hparams.gradient_clip_val}")
logger.info(f"eta_min: {hparams.eta_min}")
logger.info(f"log_dir: {hparams.log_dir}")
logger.info(f"log_name: {hparams.log_name}")

# Log any notes if they exist
if "log_notes" in locals():
    logger.info(f"Notes: {log_notes}")


# Create transform pipeline
transforms = generate_transforms(hparams.image_size)

# List for validation scores 
val_loss_scores = []

# Initialise cross validation
folds = StratifiedKFold(n_splits=hparams.n_splits, shuffle=True, random_state=hparams.seed)

# Start cross validation
for fold_i, (train_index, val_index) in enumerate(folds.split(metadata[["img_path"]], metadata[["label"]])):
    hparams.fold_i = fold_i
    # Split train images and validation sets
    train_data = metadata.iloc[train_index][["img_path", "label"]].reset_index(drop=True)
    train_data = pd.get_dummies(train_data, columns=["label"], prefix="", prefix_sep="")

    val_data = metadata.iloc[val_index][["img_path", "label"]].reset_index(drop=True)
    val_data = pd.get_dummies(val_data, columns=["label"], prefix="", prefix_sep="")
    
    logger.info(f"Fold {fold_i} num train records: {train_data.shape[0]}")
    logger.info(f"Fold {fold_i} num val records: {val_data.shape[0]}")
    
    train_dataloader, val_dataloader = generate_dataloaders(hparams, train_data, val_data, transforms)
    
    checkpoint_callback = ModelCheckpoint(
        monitor="val_loss",
        save_top_k=2,
        mode="min",
        filepath=os.path.join(
            hparams.log_dir, 
            hparams.log_name, 
            f"fold={fold_i}" + "-{epoch}-{val_loss:.4f}-{val_roc_auc:.4f}"
        )
    )
    
    early_stop_callback = EarlyStopping(
        monitor="val_loss", 
        patience=hparams.patience, 
        mode="min", 
        verbose=hparams.verbose
    )
    
    # Instance Model, Trainer and train model
    model = CoolSystem(hparams)
    trainer = pl.Trainer(
        gpus=hparams.gpus,
        min_epochs=hparams.min_epochs,
        max_epochs=hparams.max_epochs,
        early_stop_callback=early_stop_callback,
        checkpoint_callback=checkpoint_callback,
        progress_bar_refresh_rate=0,
        precision=hparams.precision,
        num_sanity_val_steps=0,
        profiler=False,
        weights_summary=None,
        gradient_clip_val=hparams.gradient_clip_val,
        default_root_dir=os.path.join(hparams.log_dir, hparams.log_name)
    )
    
    # Fit model
    trainer.fit(model, train_dataloader, val_dataloader)
            
    # Save val scores
    val_loss_scores.append(checkpoint_callback.best)
    
    # Cleanup
    del model
    gc.collect()
    torch.cuda.empty_cache()
    
val_loss_scores = [i.item() for i in val_loss_scores]

# Add val scores to csv with all scores
if not os.path.isfile("../logs/scores.csv"):
    pd.DataFrame(columns=["name", "scores", "mean_score"]).to_csv("../logs/scores.csv", index=False)
    
# Append to current scores csv
all_scores_df = pd.concat([
    pd.read_csv("../logs/scores.csv"),
    pd.DataFrame.from_dict(
        {
            "name": [hparams.log_name],
            "scores": [val_loss_scores],
            "mean_score": [np.mean(val_loss_scores)]
        }
    )],
    ignore_index=True
)
# Write all scores df to csv
all_scores_df.to_csv("../logs/scores.csv", index=False)

logger.info(f"Best scores: {val_loss_scores}")
logger.info("Training complete.")

[2022-11-30 18:33:41] 2801410545.py[  11] : INFO  backbone: inceptionresnetv2
[2022-11-30 18:33:41] 2801410545.py[  12] : INFO  device_name: NVIDIA GeForce RTX 3090
[2022-11-30 18:33:41] 2801410545.py[  13] : INFO  gpus: [0]
[2022-11-30 18:33:41] 2801410545.py[  14] : INFO  n_workers: 128
[2022-11-30 18:33:41] 2801410545.py[  15] : INFO  image_size: [1024, 1024]
[2022-11-30 18:33:41] 2801410545.py[  16] : INFO  seed: 14
[2022-11-30 18:33:41] 2801410545.py[  17] : INFO  min_epochs: 80
[2022-11-30 18:33:41] 2801410545.py[  18] : INFO  max_epochs: 100
[2022-11-30 18:33:41] 2801410545.py[  19] : INFO  patience: 20
[2022-11-30 18:33:41] 2801410545.py[  20] : INFO  train_batch_size: 4
[2022-11-30 18:33:41] 2801410545.py[  21] : INFO  val_batch_size: 4
[2022-11-30 18:33:41] 2801410545.py[  22] : INFO  n_splits: 3
[2022-11-30 18:33:41] 2801410545.py[  23] : INFO  test_size: 0.1
[2022-11-30 18:33:41] 2801410545.py[  24] : INFO  learning rate: 0.0001
[2022-11-30 18:33:41] 2801410545.py[  25] : I

[2022-11-30 19:02:23] 3085653963.py[  95] : INFO  0-38 | lr : 0.000074 | val_loss : 1.4634 | val_roc_auc : 0.7939 | data_load_times : 44.15 | batch_run_times : 45.12
[2022-11-30 19:03:07] 3085653963.py[  95] : INFO  0-39 | lr : 0.000072 | val_loss : 1.5690 | val_roc_auc : 0.7817 | data_load_times : 45.56 | batch_run_times : 46.56
[2022-11-30 19:03:51] 3085653963.py[  95] : INFO  0-40 | lr : 0.000071 | val_loss : 1.6002 | val_roc_auc : 0.7974 | data_load_times : 45.39 | batch_run_times : 46.49
[2022-11-30 19:04:35] 3085653963.py[  95] : INFO  0-41 | lr : 0.000069 | val_loss : 1.4937 | val_roc_auc : 0.7705 | data_load_times : 46.71 | batch_run_times : 47.64
[2022-11-30 19:05:17] 3085653963.py[  95] : INFO  0-42 | lr : 0.000067 | val_loss : 1.5419 | val_roc_auc : 0.7925 | data_load_times : 48.10 | batch_run_times : 49.13
[2022-11-30 19:06:01] 3085653963.py[  95] : INFO  0-43 | lr : 0.000066 | val_loss : 1.4510 | val_roc_auc : 0.8038 | data_load_times : 46.20 | batch_run_times : 47.20
[202

[2022-11-30 19:30:38] 3085653963.py[  95] : INFO  0-77 | lr : 0.000098 | val_loss : 1.9764 | val_roc_auc : 0.7591 | data_load_times : 43.31 | batch_run_times : 44.26
Trainer was signaled to stop but required minimum epochs (80) or minimum steps (None) has not been met. Training will continue...
[2022-11-30 19:31:21] 3085653963.py[  95] : INFO  0-78 | lr : 0.000098 | val_loss : 2.1951 | val_roc_auc : 0.7517 | data_load_times : 46.07 | batch_run_times : 47.02
Trainer was signaled to stop but required minimum epochs (80) or minimum steps (None) has not been met. Training will continue...
[2022-11-30 19:32:04] 3085653963.py[  95] : INFO  0-79 | lr : 0.000098 | val_loss : 1.9606 | val_roc_auc : 0.7684 | data_load_times : 44.76 | batch_run_times : 45.71
Epoch 00080: early stopping triggered.
[2022-11-30 19:32:05] 2801410545.py[  59] : INFO  Fold 1 num train records: 199
[2022-11-30 19:32:05] 2801410545.py[  60] : INFO  Fold 1 num val records: 100
GPU available: True, used: True
TPU available

[2022-11-30 20:07:07] 3085653963.py[  95] : INFO  1-43 | lr : 0.000066 | val_loss : 1.7264 | val_roc_auc : 0.7469 | data_load_times : 49.54 | batch_run_times : 50.50
[2022-11-30 20:07:54] 3085653963.py[  95] : INFO  1-44 | lr : 0.000064 | val_loss : 1.5124 | val_roc_auc : 0.7927 | data_load_times : 46.83 | batch_run_times : 47.78
[2022-11-30 20:08:41] 3085653963.py[  95] : INFO  1-45 | lr : 0.000063 | val_loss : 1.6474 | val_roc_auc : 0.7335 | data_load_times : 48.49 | batch_run_times : 49.47
[2022-11-30 20:09:28] 3085653963.py[  95] : INFO  1-46 | lr : 0.000061 | val_loss : 1.5564 | val_roc_auc : 0.7568 | data_load_times : 47.93 | batch_run_times : 48.91
[2022-11-30 20:10:16] 3085653963.py[  95] : INFO  1-47 | lr : 0.000060 | val_loss : 1.5577 | val_roc_auc : 0.7780 | data_load_times : 47.77 | batch_run_times : 48.80
[2022-11-30 20:11:03] 3085653963.py[  95] : INFO  1-48 | lr : 0.000058 | val_loss : 1.6000 | val_roc_auc : 0.7744 | data_load_times : 47.71 | batch_run_times : 48.69
[202

[2022-11-30 20:33:53] 3085653963.py[  95] : INFO  1-77 | lr : 0.000098 | val_loss : 1.9744 | val_roc_auc : 0.7580 | data_load_times : 48.48 | batch_run_times : 49.48
Trainer was signaled to stop but required minimum epochs (80) or minimum steps (None) has not been met. Training will continue...
[2022-11-30 20:34:39] 3085653963.py[  95] : INFO  1-78 | lr : 0.000098 | val_loss : 1.9268 | val_roc_auc : 0.7719 | data_load_times : 46.70 | batch_run_times : 47.70
Trainer was signaled to stop but required minimum epochs (80) or minimum steps (None) has not been met. Training will continue...
[2022-11-30 20:35:26] 3085653963.py[  95] : INFO  1-79 | lr : 0.000098 | val_loss : 1.8580 | val_roc_auc : 0.7807 | data_load_times : 47.29 | batch_run_times : 48.35
Epoch 00080: early stopping triggered.
[2022-11-30 20:35:27] 2801410545.py[  59] : INFO  Fold 2 num train records: 200
[2022-11-30 20:35:27] 2801410545.py[  60] : INFO  Fold 2 num val records: 99
GPU available: True, used: True
TPU available:

[2022-11-30 21:10:49] 3085653963.py[  95] : INFO  2-43 | lr : 0.000066 | val_loss : 1.3990 | val_roc_auc : 0.8198 | data_load_times : 46.80 | batch_run_times : 47.80
[2022-11-30 21:11:36] 3085653963.py[  95] : INFO  2-44 | lr : 0.000064 | val_loss : 1.4790 | val_roc_auc : 0.7807 | data_load_times : 46.35 | batch_run_times : 47.27
[2022-11-30 21:12:24] 3085653963.py[  95] : INFO  2-45 | lr : 0.000063 | val_loss : 1.3991 | val_roc_auc : 0.7922 | data_load_times : 46.82 | batch_run_times : 47.77
[2022-11-30 21:13:12] 3085653963.py[  95] : INFO  2-46 | lr : 0.000061 | val_loss : 1.3821 | val_roc_auc : 0.8242 | data_load_times : 45.89 | batch_run_times : 46.83
[2022-11-30 21:14:00] 3085653963.py[  95] : INFO  2-47 | lr : 0.000060 | val_loss : 1.3582 | val_roc_auc : 0.8432 | data_load_times : 45.67 | batch_run_times : 46.64
[2022-11-30 21:14:47] 3085653963.py[  95] : INFO  2-48 | lr : 0.000058 | val_loss : 1.4523 | val_roc_auc : 0.7855 | data_load_times : 44.37 | batch_run_times : 45.45
[202

Trainer was signaled to stop but required minimum epochs (80) or minimum steps (None) has not been met. Training will continue...
[2022-11-30 21:39:23] 3085653963.py[  95] : INFO  2-79 | lr : 0.000098 | val_loss : 1.7681 | val_roc_auc : 0.7599 | data_load_times : 46.40 | batch_run_times : 47.39
Epoch 00080: early stopping triggered.
[2022-11-30 21:39:23] 2801410545.py[ 131] : INFO  Best scores: [1.4510170221328735, 1.4473986625671387, 1.341477632522583]
[2022-11-30 21:39:23] 2801410545.py[ 132] : INFO  Training complete.
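
The logs above show each fold stopping exactly at epoch 80 even though the best validation loss came much earlier. A simplified sketch of how patience-based early stopping interacts with a minimum epoch count is given below. It illustrates the general logic only and is not PyTorch Lightning's implementation; `stopping_epoch` is a hypothetical helper and the loss curve is a toy example.

```python
def stopping_epoch(val_losses, patience, min_epochs):
    """Return the number of epochs run before training stops."""
    best = float("inf")
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, wait = loss, 0
        else:
            wait += 1
        # A stop signal is only honoured once the minimum epoch count is reached
        if wait >= patience and epoch >= min_epochs:
            return epoch
    return len(val_losses)


# Toy curve: improves for 5 epochs, then never improves again
losses = [2.0, 1.8, 1.6, 1.5, 1.4] + [1.6] * 20

print(stopping_epoch(losses, patience=3, min_epochs=0))   # 8
print(stopping_epoch(losses, patience=3, min_epochs=12))  # 12
```

Under this logic, with a patience of 20 and a minimum of 80 epochs, any fold whose best loss occurred before epoch 60 keeps training until epoch 80 and stops there, which is consistent with the "signaled to stop" messages in the logs.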


## Validation Inference

In [7]:
# Get model run path and define chosen fold
log_dir = "../logs/logs"
#model_run = "2022_11_08_14:57:52"
model_run = hparams.log_name
model_run_path = os.path.join(log_dir, model_run)
#best_fold = 1
best_fold = val_loss_scores.index(min(val_loss_scores))

# Get best model for chosen fold
model_run_dir = os.listdir(model_run_path)
model_folds = [i for i in model_run_dir if i.startswith(f"fold={best_fold}")]
model_folds_scores = [float(i.split("val_loss=")[1].split("-")[0]) for i in model_folds]
model_name = model_folds[model_folds_scores.index(min(model_folds_scores))]
model_path = os.path.join(model_run_path, model_name)

# Load fold's model
model = CoolSystem(hparams)
model.load_state_dict(
    torch.load(model_path)["state_dict"]
)
model.eval()

# Retrieve validation indices for chosen fold
for fold_i, (train_index, val_index) in enumerate(folds.split(metadata[["img_path"]], metadata[["label"]])):
    if fold_i == best_fold:
        break

# Select fold validation images
X_val = torch.from_numpy(X_train[val_index]).permute(0, 3, 1, 2).float()

# Create predictions looped by batch
counter = 0
val_i_batch = []
val_idx_batch = []
scores_df = pd.DataFrame()

for i, idx in tqdm(enumerate(val_index)):
    counter += 1
    val_i_batch.append(i) # arrays don't preserve index so need ordered index values
    val_idx_batch.append(idx) # for preserved index
    
    # Run inference for val_batch_size
    if counter == hparams.val_batch_size:
        preds = model(X_val[val_i_batch])
        
        # Create activation output
        log_softmax = torch.nn.LogSoftmax(dim=-1)

        # Convert raw output to probabilities
        preds = np.exp(log_softmax(preds).detach().numpy())

        # Create df with img paths and predicted label probs
        scores_df_batch = pd.DataFrame(preds, columns=val_data.columns[1:])
        scores_df_batch = pd.merge(
            metadata.iloc[val_idx_batch, 1:3].reset_index(drop=True),
            scores_df_batch, 
            left_index=True,
            right_index=True
        )
        scores_df = pd.concat([scores_df, scores_df_batch], ignore_index=True, axis=0)

        # Cleanup
        gc.collect()
        torch.cuda.empty_cache()
        # Reset counter and batch
        counter = 0
        val_i_batch = []
        val_idx_batch = []
        
    # Run inference for remaining batch
    elif idx == val_index[-1]:
        preds = model(X_val[val_i_batch])
        
        # Create activation output
        log_softmax = torch.nn.LogSoftmax(dim=-1)

        # Convert raw output to probabilities
        preds = np.exp(log_softmax(preds).detach().numpy())

        # Create df with img paths and predicted label probs
        scores_df_batch = pd.DataFrame(preds, columns=val_data.columns[1:])
        scores_df_batch = pd.merge(
            metadata.iloc[val_idx_batch, 1:3].reset_index(drop=True),
            scores_df_batch, 
            left_index=True,
            right_index=True
        )
        scores_df = pd.concat([scores_df, scores_df_batch], ignore_index=True, axis=0)

        # Cleanup
        gc.collect()
        torch.cuda.empty_cache()

        
# Write predictions to log
scores_df.to_csv(
    os.path.join(model_run_path, f"{model_run}_preds_fold_{best_fold}.csv"),
    index=False
)

99it [04:29,  2.72s/it]
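
The best checkpoint is chosen above by parsing the validation loss out of each file name. A small standalone sketch of that parsing follows. The file names are made up for illustration and assume the `val_loss=` pattern that the selection code relies on; `best_checkpoint` is a hypothetical helper.

```python
def best_checkpoint(file_names, fold):
    """Return the file name with the lowest val_loss for the given fold."""
    candidates = [name for name in file_names if name.startswith(f"fold={fold}-")]

    def val_loss(name):
        # The text between "val_loss=" and the next "-" is the loss value
        return float(name.split("val_loss=")[1].split("-")[0])

    return min(candidates, key=val_loss)


# Hypothetical file names, not taken from an actual run directory
names = [
    "fold=0-epoch=43-val_loss=1.4510-val_roc_auc=0.8038.ckpt",
    "fold=0-epoch=38-val_loss=1.4634-val_roc_auc=0.7939.ckpt",
    "fold=1-epoch=44-val_loss=1.5124-val_roc_auc=0.7927.ckpt",
]
print(best_checkpoint(names, fold=0))
# fold=0-epoch=43-val_loss=1.4510-val_roc_auc=0.8038.ckpt
```

Matching on `fold={fold}-` with the trailing hyphen is a small design choice in this sketch: it keeps fold 1 from also matching fold 10 if more splits are ever used.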


In [8]:
scores_df

,img_path,label,ant,bedbug,bee,horsefly,mite,mosquito,none,tick
0,../data/cleaned/none/74c8654309dbd09440342475d...,none,0.112028,0.243844,0.092944,0.112761,0.162847,0.059513,0.125881,0.090182
1,../data/cleaned/none/9bac4720af91cc18252051d7f...,none,0.111094,0.243905,0.092935,0.113534,0.162828,0.058932,0.126486,0.090285
2,../data/cleaned/none/c0b5bea99fc035e3f866248c1...,none,0.111682,0.243649,0.092991,0.113168,0.162897,0.059349,0.126215,0.090048
3,../data/cleaned/none/86526ab4cf5a3497b3023b50b...,none,0.112879,0.243216,0.093149,0.112542,0.162840,0.060014,0.125566,0.089793
4,../data/cleaned/none/773cb2eaaccc890f795bd41c6...,none,0.111761,0.243554,0.092980,0.113169,0.162940,0.059393,0.126066,0.090136
...,...,...,...,...,...,...,...,...,...,...
94,../data/cleaned/ant/8b40d17bf065d297f372ad607b...,ant,0.111448,0.245068,0.092901,0.112444,0.163038,0.059228,0.125602,0.090272
95,../data/cleaned/ant/66cd2ff3237ef14490c0804123...,ant,0.110862,0.246055,0.092702,0.112187,0.162866,0.059095,0.125724,0.090510
96,../data/cleaned/ant/eba0beda4c8c60dd7dda15dec3...,ant,0.112633,0.243175,0.093180,0.112962,0.162929,0.059673,0.125829,0.089619
97,../data/cleaned/ant/dea1299d7838a747161b70c820...,ant,0.111836,0.243871,0.092964,0.112817,0.162998,0.059500,0.125945,0.090070
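
The probability columns above come from exponentiating a log-softmax of the raw model output, which is equivalent to applying softmax directly. The plain-Python sketch below checks that equivalence on illustrative scores. The `logits` values are invented and the helper functions are not part of the project code.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of raw scores."""
    peak = max(logits)
    log_total = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_total for x in logits]


def softmax(logits):
    """Softmax computed directly, for comparison."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Illustrative raw scores for 8 classes
logits = [0.2, 1.0, 0.0, 0.2, 0.6, -0.4, 0.3, 0.0]

probs = [math.exp(v) for v in log_softmax(logits)]
print(round(sum(probs), 6))  # 1.0
print(all(abs(p - q) < 1e-12 for p, q in zip(probs, softmax(logits))))  # True
```

Each row of the predictions table should therefore sum to 1 across the eight class columns, up to rounding.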


## Validation Analysis

In [9]:
print(f"{len(scores_df['img_path'].unique())} unique image paths.")

99 unique image paths.


In [10]:
print("Validation label counts:")
print(scores_df["label"].value_counts())

Validation label counts:
bedbug      19
tick        18
mosquito    15
ant         15
none         9
horsefly     9
mite         7
bee          7
Name: label, dtype: int64


In [11]:
print("Validation prediction counts:")
print(
    pd.melt(
        scores_df,
        id_vars=["img_path", "label"],
        value_vars=["ant", "bedbug", "bee", "horsefly", "mite", "mosquito" ,"none", "tick"],
        var_name="pred_label",
        value_name="pred_prob"
    ).sort_values(["img_path", "pred_prob"], ascending=False) \
    .groupby(["img_path", "label"]).first()["pred_label"] \
    .value_counts()
)

Validation prediction counts:
bedbug    99
Name: pred_label, dtype: int64
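
Because every validation image is assigned to `bedbug`, accuracy on this fold reduces to the share of bedbug images. The arithmetic below uses the label counts printed earlier. It is a back-of-the-envelope check rather than part of the original analysis.

```python
# Validation label counts copied from the output above
label_counts = {
    "bedbug": 19, "tick": 18, "mosquito": 15, "ant": 15,
    "none": 9, "horsefly": 9, "mite": 7, "bee": 7,
}

total = sum(label_counts.values())
constant_accuracy = label_counts["bedbug"] / total  # every image predicted as bedbug
majority_class = max(label_counts, key=label_counts.get)

print(total)                       # 99
print(f"{constant_accuracy:.3f}")  # 0.192
print(majority_class)              # bedbug
```

Since `bedbug` is also the most frequent class, the model is behaving like a majority-class predictor on this fold.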


In [12]:
# Probability stats by label
pd.concat(
    [
        pd.DataFrame(scores_df.iloc[:, 2:].mean(), columns=["mean"]),
        pd.DataFrame(scores_df.iloc[:, 2:].std(), columns=["std"]),
        pd.DataFrame(scores_df.iloc[:, 2:].min(), columns=["min"]),
        pd.DataFrame(scores_df.iloc[:, 2:].quantile(0.25)),
        pd.DataFrame(scores_df.iloc[:, 2:].median(), columns=["median"]),
        pd.DataFrame(scores_df.iloc[:, 2:].quantile(0.75)),
        pd.DataFrame(scores_df.iloc[:, 2:].max(), columns=["max"]),
        pd.DataFrame(scores_df.iloc[:, 2:].max() - scores_df.iloc[:, 2:].min(), columns=["range"])
    ], 
    axis=1
)

,mean,std,min,0.25,median,0.75,max,range
ant,0.111652,0.001212,0.108801,0.110946,0.111591,0.112599,0.113951,0.00515
bedbug,0.244884,0.001659,0.242249,0.243742,0.244465,0.24572,0.251237,0.008987
bee,0.092858,0.000286,0.091503,0.092723,0.092929,0.093054,0.093277,0.001775
horsefly,0.112425,0.000592,0.110014,0.112088,0.112544,0.112814,0.113534,0.00352
mite,0.162991,0.000332,0.162441,0.162831,0.162897,0.163037,0.165235,0.002794
mosquito,0.059444,0.000503,0.058261,0.059155,0.059413,0.059757,0.060681,0.00242
none,0.125614,0.000528,0.123254,0.125436,0.125731,0.125943,0.126486,0.003231
tick,0.090131,0.000526,0.087915,0.089855,0.090154,0.090436,0.091295,0.003381


In [13]:
pd.melt(
    scores_df,
    id_vars=["img_path", "label"],
    value_vars=["ant", "bedbug", "bee", "horsefly", "mite", "mosquito" ,"none", "tick"],
    var_name="pred_label",
    value_name="pred_prob"
).pivot_table(
    index=["label"],
    columns=["pred_label"],
    aggfunc="mean"
)

label (rows) vs pred_label (columns); values are mean pred_prob
label,ant,bedbug,bee,horsefly,mite,mosquito,none,tick
ant,0.11165,0.24501,0.092853,0.112335,0.162975,0.059415,0.1256,0.090163
bedbug,0.11204,0.244261,0.09297,0.112555,0.162894,0.059619,0.125683,0.089978
bee,0.110871,0.245665,0.092755,0.112431,0.162908,0.059109,0.125803,0.090458
horsefly,0.110795,0.246465,0.092589,0.11207,0.163342,0.059104,0.125252,0.090383
mite,0.111578,0.245908,0.092638,0.111929,0.163203,0.059321,0.125299,0.090123
mosquito,0.111625,0.244859,0.092893,0.112412,0.162944,0.059514,0.125552,0.0902
none,0.112208,0.243379,0.093062,0.112977,0.162843,0.059609,0.125967,0.089955
tick,0.111748,0.244716,0.092875,0.112466,0.162994,0.059493,0.125659,0.090049


In [14]:
pd.melt(
    scores_df,
    id_vars=["img_path", "label"],
    value_vars=["ant", "bedbug", "bee", "horsefly", "mite", "mosquito" ,"none", "tick"],
    var_name="pred_label",
    value_name="pred_prob"
).sort_values(["img_path", "pred_prob"], ascending=False).groupby(["img_path", "label"]).first()

img_path,label,pred_label,pred_prob
../data/cleaned/ant/09201674df1942ec6433a487f195cca68f23310b.jpg,ant,bedbug,0.248366
../data/cleaned/ant/0cf3675aae8601ba680bd9585f8023f66d56a771.jpg,ant,bedbug,0.247571
../data/cleaned/ant/100eac5fb92b879adc0bda8e26b65f9b9fed5099.jpg,ant,bedbug,0.243765
../data/cleaned/ant/4adf2283fed4f3060a8be1e516d1005e21dd1c42.jpg,ant,bedbug,0.244802
../data/cleaned/ant/62830f28c6bfb81efd3d9bf0a3db6f667dec2dad.jpg,ant,bedbug,0.244445
...,...,...,...
../data/cleaned/tick/9e71cd0ae23e77c93c596c65f2e44b6b14dbc62b.jpg,tick,bedbug,0.244748
../data/cleaned/tick/a5db00f3302356054e3f1ba1fb557511faf905a4.jpg,tick,bedbug,0.243202
../data/cleaned/tick/df5025f8bc1363330c965e950fd4aa950efee715.jpg,tick,bedbug,0.243624
../data/cleaned/tick/e014b379de6300f8120a839a97ba142cfe1ba34a.jpg,tick,bedbug,0.243196
