# Assignment — Humpback Whale Identification (Kaggle) + MLflow (OSS)

This notebook is **separate** from your fish notebook (big change in dataset + evaluation).

Important nuance:
- **Train with CrossEntropyLoss** (differentiable → backprop works)
- **Evaluate with MAP@5** (ranking metric used by the Kaggle competition; not differentiable, so it’s an eval metric, not a training loss)
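
The difference can be seen on toy numbers. The sketch below is pure Python with made-up logits (the helper names are illustrative, not part of the notebook): nudging the true-class logit lowers cross-entropy smoothly, while the top-5 reciprocal-rank score stays flat until the ranking order actually changes.

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, true_idx):
    # Negative log-probability of the true class.
    return -math.log(softmax(logits)[true_idx])

def reciprocal_rank_at_k(logits, true_idx, k=5):
    # Rank classes by descending logit and score 1/rank if the true class is in the top k.
    order = sorted(range(len(logits)), key=lambda j: -logits[j])[:k]
    return 1.0 / (order.index(true_idx) + 1) if true_idx in order else 0.0

true_idx = 2
base = [2.0, 1.0, 0.5, 0.1]    # true class is ranked 3rd
nudged = [2.0, 1.0, 0.6, 0.1]  # slightly higher true-class logit, still ranked 3rd

ce_base = cross_entropy(base, true_idx)
ce_nudged = cross_entropy(nudged, true_idx)
rr_base = reciprocal_rank_at_k(base, true_idx)
rr_nudged = reciprocal_rank_at_k(nudged, true_idx)

print(f"cross-entropy: {ce_base:.4f} -> {ce_nudged:.4f}")
print(f"reciprocal rank: {rr_base:.4f} -> {rr_nudged:.4f}")
```

The loss gives a gradient signal for the nudge; the ranking score gives none, which is why it is only reported, never optimized directly.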

You’ll get:
- Kaggle competition download via `kagglehub`
- CSV-based dataset (not ImageFolder)
- A fixed test split plus stratified KFold (train/val) on a filtered subset (filtering is needed because many classes have too few samples to stratify)
- MLflow logging (params, metrics, artifacts)


In [None]:
%pip -q install kagglehub mlflow torch torchvision pandas scikit-learn pillow tqdm matplotlib ipywidgets

# If you run into permission / env issues:
# - restart the kernel after install


In [None]:
import os
import time
import random
from pathlib import Path
from collections import Counter

import numpy as np
import pandas as pd
from PIL import Image

import torch
from torch import nn
from torch.utils.data import Dataset, DataLoader, Subset

import torchvision
import torchvision.models as models
from torchvision import transforms

from sklearn.model_selection import StratifiedShuffleSplit, StratifiedKFold
from sklearn.metrics import accuracy_score

from tqdm.auto import tqdm

RANDOM_SEED = 42
random.seed(RANDOM_SEED)
np.random.seed(RANDOM_SEED)
torch.manual_seed(RANDOM_SEED)

device = "cuda" if torch.cuda.is_available() else "cpu"
device

In [None]:
import mlflow
import mlflow.pytorch

# ---- MLflow (OSS) tracking ----
# Option A (simple): local filesystem runs under ./mlruns
# Path.as_uri() builds a well-formed file:// URI on every OS (including Windows drive paths).
mlflow.set_tracking_uri((Path.cwd() / "mlruns").as_uri())

# Option B (remote server): uncomment if you're running an MLflow server elsewhere
# mlflow.set_tracking_uri("http://127.0.0.1:8080")

mlflow.set_experiment("assignment_humpback_whale_map5")
print("MLflow tracking URI:", mlflow.get_tracking_uri())

## 1) Download the Kaggle competition dataset

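This uses `kagglehub.competition_download(...)`.

`kagglehub.login()` prompts for your Kaggle username and API token. You can skip that call if credentials are already configured locally (for example `~/.kaggle/kaggle.json`, the same setup the Kaggle API uses). You must also have accepted the competition rules on the Kaggle website, otherwise the download is refused.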


In [None]:
import kagglehub
kagglehub.login()


comp_root = Path(kagglehub.competition_download(
    "humpback-whale-identification"))
print("Competition files at:", comp_root)

TRAIN_CSV = comp_root / "train.csv"
TRAIN_DIR = comp_root / "train"
TEST_DIR = comp_root / "test"
SAMPLE_SUB = comp_root / "sample_submission.csv"

print("TRAIN_CSV:", TRAIN_CSV)
print("TRAIN_DIR:", TRAIN_DIR)
print("TEST_DIR :", TEST_DIR)
print("SAMPLE_SUB:", SAMPLE_SUB)

## 2) Load train.csv + (important) filter classes so KFold is possible

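This competition has **many whale IDs with very few images**, plus a catch-all `new_whale` label for whales that are not matched to any known ID.
`StratifiedKFold` needs at least `K` samples per class to place every class in every fold; with fewer, scikit-learn warns and some folds miss the class entirely, which makes the validation score hard to interpret.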

So we do:
- Optionally drop `new_whale`
- Keep only classes with at least `MIN_SAMPLES_PER_CLASS` samples
- Optionally keep only the top-N most frequent IDs (to keep training lightweight)
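
The three filters can be sketched on a toy label list with `collections.Counter`. The labels, thresholds, and the `filter_labels` helper below are made up for illustration; the notebook applies the same steps with pandas.

```python
from collections import Counter

def filter_labels(labels, drop_label="new_whale", min_samples=2, top_n=None):
    """Return the class labels that survive the filters, most frequent first."""
    # 1) Drop the catch-all label.
    kept = [lab for lab in labels if lab != drop_label]
    # 2) Keep only classes with enough samples.
    counts = Counter(kept)
    frequent = {lab: n for lab, n in counts.items() if n >= min_samples}
    # 3) Optionally keep the top-N most frequent classes (ties broken by name).
    ranked = sorted(frequent.items(), key=lambda kv: (-kv[1], kv[0]))
    if top_n is not None:
        ranked = ranked[:top_n]
    return [lab for lab, _ in ranked]

toy = ["new_whale"] * 4 + ["w_a"] * 3 + ["w_b"] * 2 + ["w_c"]

all_frequent = filter_labels(toy, min_samples=2)
surviving = filter_labels(toy, min_samples=2, top_n=1)
print(all_frequent, surviving)
```

Note that the order matters: the minimum-sample filter runs before the top-N cut, so the top-N classes are always drawn from classes that already meet the threshold.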


In [None]:
df = pd.read_csv(TRAIN_CSV)

# Expected columns in train.csv for this comp are usually: Image, Id
# We'll be defensive:
assert {"Image", "Id"}.issubset(
    df.columns), f"Unexpected train.csv columns: {df.columns.tolist()}"

df["path"] = df["Image"].apply(lambda x: str(TRAIN_DIR / x))

print("Raw rows:", len(df))
print("Unique IDs:", df["Id"].nunique())
print(df.head())

In [None]:
# ---- Filtering knobs (tune as needed) ----
DROP_NEW_WHALE = True
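# StratifiedKFold wants >= K samples per class *in the train pool*. The 20% test split is
# taken first, so a class with exactly 5 images leaves only ~4 there; raise this
# (e.g. to 7 for K=5) if scikit-learn warns about the least populated class.
MIN_SAMPLES_PER_CLASS = 5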
# set None to keep all filtered classes (can be huge)
TOP_N_CLASSES = 200

if DROP_NEW_WHALE:
    df = df[df["Id"] != "new_whale"].copy()

counts = df["Id"].value_counts()
keep_ids = counts[counts >= MIN_SAMPLES_PER_CLASS].index
df = df[df["Id"].isin(keep_ids)].copy()

if TOP_N_CLASSES is not None:
    top_ids = df["Id"].value_counts().head(TOP_N_CLASSES).index
    df = df[df["Id"].isin(top_ids)].copy()

df = df.reset_index(drop=True)

print("After filtering rows:", len(df))
print("After filtering unique IDs:", df["Id"].nunique())
df["Id"].value_counts().head(10)

## 3) Dataset + transforms

We are **not** using ImageFolder (labels are in CSV). We build a small Dataset wrapper.


In [None]:
# Label mapping
classes = sorted(df["Id"].unique().tolist())
id2idx = {c: i for i, c in enumerate(classes)}
idx2id = {i: c for c, i in id2idx.items()}
df["y"] = df["Id"].map(id2idx).astype(int)

num_classes = len(classes)
print("num_classes:", num_classes)

In [None]:
class WhaleDataset(Dataset):
    def __init__(self, dataframe: pd.DataFrame, transform=None):
        self.df = dataframe.reset_index(drop=True)
        self.transform = transform

    def __len__(self):
        return len(self.df)

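    def __getitem__(self, i):
        row = self.df.iloc[i]
        # Open inside a context manager so the file handle is closed promptly;
        # convert() returns a new in-memory RGB image (some files are grayscale).
        with Image.open(row["path"]) as im:
            img = im.convert("RGB")
        if self.transform is not None:
            img = self.transform(img)
        y = int(row["y"])
        return img, y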

In [None]:
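IMG_SIZE = 224

# The pretrained ResNet weights were trained on ImageNet-normalized inputs,
# so apply the same per-channel mean/std after ToTensor().
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]

train_tf = transforms.Compose([
    transforms.Resize((IMG_SIZE, IMG_SIZE)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=20),
    transforms.ColorJitter(brightness=0.1, contrast=0.1, saturation=0.1),
    transforms.ToTensor(),
    transforms.Normalize(mean=IMAGENET_MEAN, std=IMAGENET_STD),
])

eval_tf = transforms.Compose([
    transforms.Resize((IMG_SIZE, IMG_SIZE)),
    transforms.ToTensor(),
    transforms.Normalize(mean=IMAGENET_MEAN, std=IMAGENET_STD),
])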

## 4) Train/Val/Test split + loaders

We make a **fixed test split** (20%), then do KFold on the remaining pool.
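
Because KFold runs on the pool rather than on the full table, the indices a fold returns are positions *within the pool* and have to be mapped back to row positions in `df`. A toy sketch of that bookkeeping, with invented indices and plain lists standing in for the NumPy arrays:

```python
all_idx = list(range(10))
test_idx = [1, 6]  # pretend these rows were held out as the fixed 20% test split
pool_idx = [i for i in all_idx if i not in test_idx]  # rows left for KFold

# A fold split yields positions relative to the pool, e.g. validation = pool positions 0 and 5.
val_rel = [0, 5]
train_rel = [p for p in range(len(pool_idx)) if p not in val_rel]

# Map pool-relative positions back to absolute row positions.
val_abs = [pool_idx[p] for p in val_rel]
train_abs = [pool_idx[p] for p in train_rel]

print("pool :", pool_idx)
print("val  :", val_abs)
print("train:", train_abs)
```

Skipping the mapping step would silently index the wrong rows, and could even pull test rows into training.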


In [None]:
# indices and targets for stratification
y_all = df["y"].to_numpy()
idx_all = np.arange(len(df))

# Fixed TEST split
sss = StratifiedShuffleSplit(
    n_splits=1, test_size=0.2, random_state=RANDOM_SEED)
train_pool_idx, test_idx = next(sss.split(idx_all, y_all))

y_train_pool = y_all[train_pool_idx]

print("Train pool:", len(train_pool_idx), "Test:", len(test_idx))

In [None]:
def make_loaders_from_indices(tr_idx, va_idx, te_idx, batch_size=32, num_workers=2):
    # Augmentation only on the training split; val/test use the deterministic eval pipeline.
    train_ds = WhaleDataset(df.iloc[tr_idx], transform=train_tf)
    val_ds = WhaleDataset(df.iloc[va_idx], transform=eval_tf)
    test_ds = WhaleDataset(df.iloc[te_idx], transform=eval_tf)

    common = dict(batch_size=batch_size, num_workers=num_workers)
    train_loader = DataLoader(train_ds, shuffle=True, **common)
    # shuffle=False keeps the test order identical across folds, which the ensemble relies on.
    val_loader = DataLoader(val_ds, shuffle=False, **common)
    test_loader = DataLoader(test_ds, shuffle=False, **common)
    return train_loader, val_loader, test_loader

## 5) MAP@5 metric (competition metric)

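For each sample:
- take the top-5 predicted classes
- if the true class is at rank r (1..5): score = 1/r
- otherwise: score = 0

Then average the scores across samples. Because each image has exactly one true label, MAP@5 here equals the mean reciprocal rank truncated at 5.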
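
A small worked example of this rule, in pure Python with invented whale IDs (the `map_at_k` helper is illustrative and takes already-ranked label lists rather than probabilities):

```python
def map_at_k(true_labels, ranked_predictions, k=5):
    """Mean of 1/rank of the true label within the top-k predictions (0 if absent)."""
    scores = []
    for true, ranked in zip(true_labels, ranked_predictions):
        top = list(ranked[:k])
        scores.append(1.0 / (top.index(true) + 1) if true in top else 0.0)
    return sum(scores) / len(scores)

truth = ["w_a", "w_b", "w_c"]
preds = [
    ["w_a", "w_x", "w_y", "w_z", "w_q"],  # true label at rank 1 -> 1.0
    ["w_x", "w_b", "w_y", "w_z", "w_q"],  # true label at rank 2 -> 0.5
    ["w_x", "w_y", "w_z", "w_q", "w_r"],  # true label missing   -> 0.0
]

score = map_at_k(truth, preds)  # (1.0 + 0.5 + 0.0) / 3
print(score)
```

A correct guess in second place still earns half credit, so the metric rewards models whose near-misses are ranked high.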


In [None]:
def map_at_k_from_probs(y_true: np.ndarray, probs: np.ndarray, k: int = 5) -> float:
    topk = np.argsort(-probs, axis=1)[:, :k]  # (N, k)
    scores = []
    for i in range(len(y_true)):
        true = y_true[i]
        row = topk[i]
        hit = np.where(row == true)[0]
        if len(hit) == 0:
            scores.append(0.0)
        else:
            rank = int(hit[0]) + 1  # 1-based
            scores.append(1.0 / rank)
    return float(np.mean(scores))

## 6) Model: pretrained ResNet18 (simple baseline)

(You can swap later to EfficientNet / DenseNet / etc.)


In [None]:
def make_model_resnet18(num_classes: int):
    weights = models.ResNet18_Weights.DEFAULT
    model = models.resnet18(weights=weights)
    in_f = model.fc.in_features
    model.fc = nn.Linear(in_f, num_classes)
    return model

## 7) Training/eval utilities (loss = CrossEntropy, metrics include MAP@5)

In [None]:
def run_epoch(model, loader, criterion, optimizer=None):
    is_train = optimizer is not None
    model.train(is_train)  # train(True) enables training mode; train(False) is equivalent to eval()

    losses = []
    all_probs = []
    all_targets = []

    with torch.set_grad_enabled(is_train):
        for x, y in tqdm(loader, leave=False):
            x, y = x.to(device), y.to(device)
            logits = model(x)
            loss = criterion(logits, y)

            if is_train:
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

            losses.append(loss.item())
            probs = torch.softmax(logits, dim=1).detach().cpu().numpy()
            all_probs.append(probs)
            all_targets.append(y.detach().cpu().numpy())

    probs = np.vstack(all_probs)
    targets = np.concatenate(all_targets)

    pred = probs.argmax(axis=1)
    acc = float(accuracy_score(targets, pred))
    map5 = map_at_k_from_probs(targets, probs, k=5)

    return float(np.mean(losses)), acc, map5, probs, targets

## 8) KFold training on train_pool + MLflow logging

In [None]:
# ---- Settings ----
K = 5          # set 2 for faster debugging
EPOCHS = 2     # set 1 for faster debugging
BATCH_SIZE = 32
LR = 1e-4
WEIGHT_DECAY = 1e-4

skf = StratifiedKFold(n_splits=K, shuffle=True, random_state=RANDOM_SEED)

fold_rows = []
fold_test_probs = []
test_targets_global = None

for fold, (tr_rel, va_rel) in enumerate(skf.split(np.zeros(len(train_pool_idx)), y_train_pool), start=1):
    tr_idx = train_pool_idx[tr_rel]
    va_idx = train_pool_idx[va_rel]

    train_loader, val_loader, test_loader = make_loaders_from_indices(
        tr_idx, va_idx, test_idx,
        batch_size=BATCH_SIZE, num_workers=2
    )

    model = make_model_resnet18(num_classes).to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.AdamW(
        model.parameters(), lr=LR, weight_decay=WEIGHT_DECAY)

    run_name = f"fold_{fold}_resnet18"
    with mlflow.start_run(run_name=run_name):
        mlflow.log_params({
            "dataset": "humpback-whale-identification",
            "model": "resnet18",
            "num_classes": num_classes,
            "K": K,
            "epochs": EPOCHS,
            "batch_size": BATCH_SIZE,
            "lr": LR,
            "weight_decay": WEIGHT_DECAY,
            "drop_new_whale": DROP_NEW_WHALE,
            "min_samples_per_class": MIN_SAMPLES_PER_CLASS,
            "top_n_classes": TOP_N_CLASSES,
            "img_size": IMG_SIZE,
        })

        for ep in range(1, EPOCHS + 1):
            tr_loss, tr_acc, tr_map5, _, _ = run_epoch(
                model, train_loader, criterion, optimizer)
            va_loss, va_acc, va_map5, _, _ = run_epoch(
                model, val_loader, criterion)

            print(f"Fold {fold} | Epoch {ep:02d} | "
                  f"train: loss={tr_loss:.4f} acc={tr_acc:.4f} map5={tr_map5:.4f} | "
                  f"val:   loss={va_loss:.4f} acc={va_acc:.4f} map5={va_map5:.4f}")

            mlflow.log_metrics({
                "train_loss": tr_loss,
                "train_acc": tr_acc,
                "train_map5": tr_map5,
                "val_loss": va_loss,
                "val_acc": va_acc,
                "val_map5": va_map5,
            }, step=ep)

        te_loss, te_acc, te_map5, te_probs, te_targets = run_epoch(
            model, test_loader, criterion)
        print(
            f"Fold {fold} TEST: loss={te_loss:.4f} acc={te_acc:.4f} map5={te_map5:.4f}")

        mlflow.log_metrics({
            "test_loss": te_loss,
            "test_acc": te_acc,
            "test_map5": te_map5,
        })

        mlflow.pytorch.log_model(model, artifact_path="model")

        fold_test_probs.append(te_probs)
        test_targets_global = te_targets

        fold_rows.append({
            "fold": fold,
            "val_acc_last": va_acc,
            "val_map5_last": va_map5,
            "test_acc": te_acc,
            "test_map5": te_map5,
        })

df_folds = pd.DataFrame(fold_rows)
df_folds

## 9) Mean-of-folds ensemble on TEST (average probabilities)
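
Averaging probabilities lets confident folds outweigh uncertain ones. A minimal sketch for a single image, with three hypothetical fold models and invented probabilities (the `mean_probs` helper is illustrative only):

```python
def mean_probs(per_fold_probs):
    """Average class probabilities across folds for one sample."""
    n_folds = len(per_fold_probs)
    n_classes = len(per_fold_probs[0])
    return [sum(fold[c] for fold in per_fold_probs) / n_folds for c in range(n_classes)]

# Each row: one fold model's probabilities over three classes for the same image.
folds = [
    [0.50, 0.40, 0.10],  # narrowly prefers class 0
    [0.45, 0.50, 0.05],  # narrowly prefers class 1
    [0.10, 0.80, 0.10],  # strongly prefers class 1
]

avg = mean_probs(folds)
top1 = max(range(len(avg)), key=avg.__getitem__)
print([round(p, 4) for p in avg], "-> class", top1)
```

The averaged vector still sums to 1, so it can be passed to the same accuracy and MAP@5 code as any single model's output.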

In [None]:
fold_test_probs_arr = np.stack(fold_test_probs, axis=0)  # (K, Ntest, C)
mean_probs = fold_test_probs_arr.mean(axis=0)              # (Ntest, C)

mean_pred = mean_probs.argmax(axis=1)
ensemble_acc = float(accuracy_score(test_targets_global, mean_pred))
ensemble_map5 = map_at_k_from_probs(test_targets_global, mean_probs, k=5)

print("Ensemble TEST acc :", ensemble_acc)
print("Ensemble TEST map5:", ensemble_map5)

with mlflow.start_run(run_name="ensemble_mean_of_folds"):
    mlflow.log_params({
        "dataset": "humpback-whale-identification",
        "model": "resnet18",
        "ensemble": "mean_probs",
        "K": K,
        "epochs": EPOCHS,
        "batch_size": BATCH_SIZE,
        "lr": LR,
        "weight_decay": WEIGHT_DECAY,
        "num_classes": num_classes,
    })
    mlflow.log_metrics({
        "ensemble_test_acc": ensemble_acc,
        "ensemble_test_map5": ensemble_map5,
    })

df_compare = df_folds.copy()
df_compare["ensemble_test_acc"] = ensemble_acc
df_compare["ensemble_test_map5"] = ensemble_map5
df_compare

## 10) Viewing MLflow results

If you used `file:.../mlruns`, open a terminal in this notebook folder and run:

```bash
mlflow ui --backend-store-uri ./mlruns
```

Then open the printed local URL in your browser.
