1. Environment Setup — Colab Mount, Paths, Assertions, Self-Test
2. Imports, Reproducibility, Device Selection, Self-Check
3. Training Shards — Discovery, Local Staging, Concatenation, Normalization Wrapper
4. FER2013 CSV Splits — Validation/Test Datasets with Consistent Preprocessing
5. DataLoaders and Optional Visualization Hook
6. Metrics — Accuracy, Class Weights, and Composite Losses (Label Smoothing + Focal)
7. MixUp and CutMix Utilities (+ Mixed-Criterion Wrapper)
8. EMA (Weights) — Backup/Restore and Safety Checks
9. Base Training/Evaluation Mixin (training_step/validation_step/aggregation)
10. Training Hyperparameters and Global Knobs
11. Model Definition — EfficientNet-B0 + CBAM + optional Sobel
12. Optimizer & Scheduler (Warmup-Cosine) & EarlyStopping
13. Training Loop — AMP, EMA, MixUp/CutMix, Save-Best Checkpoint
14. Launch Training — Build Optimizer/Scheduler/EMA and Fit
15. Evaluation Utilities — Base / EMA / EMA + TTA
16. Run Evaluation — Validation and Test (Base, EMA, EMA+TTA)
17. FLOPs Measurement (fvcore) for 96×96
18. Efficiency Summary — Map Metrics, Select Best, Compute Accuracy/GFLOPs
19. Save Final Checkpoint — Best Weights + Timestamped Copy
20. Reload Checkpoint for Inference — Sanity Forward Pass
21. FLOPs Calculation (fvcore) — Standalone Report Cell
22. Efficiency Report Preparation — Name Mapping & Best Selection
23. Efficiency Function — Accuracy(%) per GFLOP Utility
24. Run Efficiency Report — Compute and Print Efficiency Score
25. Save Final Checkpoint — Stage C/Final Model Artifact
26. Wrap-Up — Final Metrics Summary and Completion Banner
27. Stage A (AffectNet) — RGB, ImageNet Normalization, Training
28. Stage A Save — Write AffectNet Checkpoint
29. Stage B (RAF-DB) — Load Stage A, Fine-Tune on RAF-DB
30. Stage B Save — Write RAF-DB Checkpoint
31. Stage C (FER2013) — Load Stage B, Final Fine-Tune on FER2013
32. Stage C Save — Write FER2013 Checkpoint (Multi-Stage Complete)

Stage 0 — Boot & Storage Safety

#Cell 01 — Folder & Storage Debug (Auto-Create + Quota Warning)

In [1]:
# Creates core directories locally and prints disk quota.

import os, shutil
from pathlib import Path

PROJECT_NAME  = "facial_expression_recognition_2025"
PROJECT_ROOT  = Path("./project")
DATA_ROOT     = Path("./data")
CKPT_DIR      = PROJECT_ROOT / "checkpoints"
LOG_DIR       = PROJECT_ROOT / "logs"
MIN_FREE_GB   = 2.0  # warn if below this free space

for p in [PROJECT_ROOT, DATA_ROOT, CKPT_DIR, LOG_DIR]:
    p.mkdir(parents=True, exist_ok=True)

usage = shutil.disk_usage(str(PROJECT_ROOT))
free_gb = usage.free / (1024**3)
print(f"[Stage0] Disk @ {PROJECT_ROOT.resolve()} — total={usage.total/(1024**3):.2f} GB, "
      f"used={usage.used/(1024**3):.2f} GB, free={free_gb:.2f} GB")
if free_gb < MIN_FREE_GB:
    print(f"[Stage0][WARN] Low free space (< {MIN_FREE_GB:.1f} GB). Consider freeing local disk space.")


[Stage0] Disk @ /content/project — total=112.64 GB, used=38.79 GB, free=73.84 GB


#Cell 02 — Environment Setup (Colab Mount, Paths, Assertions)

In [2]:
import sys
from pathlib import Path

def in_colab() -> bool:
    return "google.colab" in sys.modules

if in_colab():
    from google.colab import drive

    # First unmount if already mounted
    try:
        drive.flush_and_unmount()
        print("[Stage0] Drive unmounted.")
    except Exception as e:
        print(f"[Stage0] No previous mount or already unmounted. ({e})")

    # Now remount
    drive.mount("/content/drive", force_remount=True)
    print("[Stage0] Google Drive mounted.")

# Path to FER2013
FER_CSV_PATH = Path("/content/drive/MyDrive/fer2013.csv")
print(f"[Stage0] FER CSV: {FER_CSV_PATH if FER_CSV_PATH.exists() else 'NOT FOUND'}")


Drive not mounted, so nothing to flush and unmount.
[Stage0] Drive unmounted.
Mounted at /content/drive
[Stage0] Google Drive mounted.
[Stage0] FER CSV: /content/drive/MyDrive/fer2013.csv


#Cell 03 — Global Switches & Run Config (Single Source of Truth)

In [3]:
# --- CONFIG (single source of truth) — UPDATED ---
from pathlib import Path

# Ensure FER CSV path points to your Drive root
FER_CSV_PATH = Path("/content/drive/MyDrive/fer2013.csv")

CONFIG = {
    # === Feature toggles ===
    "USE_AUG": True,          # enable training-time augmentation
    "USE_AUG_ADV": True,      # advanced FER policy (AugMixLite + occlusion/elastic etc.)
    "AUG_ALPHA": 0.65,        # blend coefficient for AugMixLite
    "USE_MIXUP": True,
    "USE_CUTMIX": True,
    "USE_EMA": True,
    "USE_TTA": False,         # keep Val clean; enable TTA only for Test in Cell 27

    # === Late-phase controls ===
    "AUG_CAP_LATE": True,     # cap augmentation strength in the last ~30% epochs
    "TAPER_MIX_LATE": True,   # taper MixUp/CutMix late

    # === Run routing ===
    "RUN_FER": True,
    "RUN_STAGE_A": False,
    "RUN_STAGE_B": False,
    "RUN_STAGE_C": False,
    "RUN_ALL": False,
    "DRY_RUN": False,

    # === Dataloading & reproducibility (throughput tuned for Colab 83GB/40GB) ===
    "SEED": 42,
    "NUM_WORKERS": 6,         # try 6; if GPU still starves, try 8. If RAM spikes, drop to 4.
    "BATCH_SIZE": 192,        # safe at 96x96 with 40GB GPU; if OOM, use 176/160/128
    "IMG_SIZE": 96,

    # === Paths ===
    "PROJECT_ROOT": PROJECT_ROOT,
    "DATA_ROOT": DATA_ROOT,
    "CKPT_DIR": CKPT_DIR,
    "LOG_DIR": LOG_DIR,
    "SAVE_BEST_PATH": CKPT_DIR / "best_fer.pth",
    "FER_CSV_PATH": FER_CSV_PATH,
}

# Nice, aligned snapshot
print("[Stage0] CONFIG snapshot:")
for k in sorted(CONFIG.keys()):
    print(f"  - {k:16s}: {CONFIG[k]}")


[Stage0] CONFIG snapshot:
  - AUG_ALPHA       : 0.65
  - AUG_CAP_LATE    : True
  - BATCH_SIZE      : 192
  - CKPT_DIR        : project/checkpoints
  - DATA_ROOT       : data
  - DRY_RUN         : False
  - FER_CSV_PATH    : /content/drive/MyDrive/fer2013.csv
  - IMG_SIZE        : 96
  - LOG_DIR         : project/logs
  - NUM_WORKERS     : 6
  - PROJECT_ROOT    : project
  - RUN_ALL         : False
  - RUN_FER         : True
  - RUN_STAGE_A     : False
  - RUN_STAGE_B     : False
  - RUN_STAGE_C     : False
  - SAVE_BEST_PATH  : project/checkpoints/best_fer.pth
  - SEED            : 42
  - TAPER_MIX_LATE  : True
  - USE_AUG         : True
  - USE_AUG_ADV     : True
  - USE_CUTMIX      : True
  - USE_EMA         : True
  - USE_MIXUP       : True
  - USE_TTA         : False


#Cell 04 — Imports, Versions, Reproducibility, Device Self-Check

In [4]:
import math, random, warnings
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
import torchvision
import torchvision.transforms.functional as VF
from torchvision.utils import make_grid
import matplotlib.pyplot as plt

warnings.filterwarnings("ignore")
print(f"[Stage0] torch={torch.__version__}, torchvision={torchvision.__version__}, numpy={np.__version__}")

SEED = int(CONFIG["SEED"])
random.seed(SEED); np.random.seed(SEED); torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)

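# cudnn.benchmark favors throughput over determinism: with it enabled, results can
# differ slightly between runs even though the seeds above are fixed.
torch.backends.cudnn.benchmark = True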

if torch.cuda.is_available():
    dev_id = torch.cuda.current_device()
    props = torch.cuda.get_device_properties(dev_id)
    print(f"[Stage0] CUDA:{dev_id} — {props.name} — {props.total_memory/(1024**3):.1f} GB VRAM")
elif hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
    print("[Stage0] Device: Apple MPS")
else:
    print("[Stage0] Device: CPU")


[Stage0] torch=2.8.0+cu126, torchvision=0.23.0+cu126, numpy=2.0.2
[Stage0] CUDA:0 — NVIDIA A100-SXM4-40GB — 39.6 GB VRAM


#Cell 05 — Training Shards: Discovery & Local Staging (No Normalize)

In [5]:
# Optional shard discovery; safe no-op if you do not pre-generate shards.

import shutil
from pathlib import Path

SHARDS_ROOT = CONFIG["DATA_ROOT"] / "augmented_data"
LOCAL_STAGE = CONFIG["DATA_ROOT"] / "_local_stage"
LOCAL_STAGE.mkdir(parents=True, exist_ok=True)

if SHARDS_ROOT.exists():
    shards = sorted([p for p in SHARDS_ROOT.rglob("*.pt")])
    print(f"[Stage0] Found {len(shards)} shards under {SHARDS_ROOT}")
    if len(shards) > 0:
        sample_dst = LOCAL_STAGE / shards[0].name
        if not sample_dst.exists():
            try:
                shutil.copy2(shards[0], sample_dst)
                print(f"[Stage0] Staged sample shard → {sample_dst}")
            except Exception as e:
                print(f"[Stage0][WARN] Could not stage shard: {e}")
else:
    print("[Stage0] No shards directory present (this is fine).")


[Stage0] No shards directory present (this is fine).


#Cell 06 — FER2013 CSV Parse & Split (Robust)

In [6]:
# Loads FER2013 from the fixed path and makes train/val/test DataFrames.

from pathlib import Path

assert CONFIG["RUN_FER"], "RUN_FER=False; skip if not using FER2013."
assert CONFIG["FER_CSV_PATH"] is not None and Path(CONFIG["FER_CSV_PATH"]).exists(), \
    f"FER2013 CSV not found at {CONFIG['FER_CSV_PATH']}"

fer_df = pd.read_csv(CONFIG["FER_CSV_PATH"])
expected_cols = {"emotion", "pixels"}
# Column names are accessed case-sensitively below, so check them the same way.
assert expected_cols.issubset(set(fer_df.columns)), \
    f"FER CSV missing required columns {expected_cols}. Found: {fer_df.columns.tolist()}"

if "Usage" in fer_df.columns:
    tr_df = fer_df[fer_df["Usage"] == "Training"].reset_index(drop=True)
    va_df = fer_df[fer_df["Usage"] == "PublicTest"].reset_index(drop=True)
    te_df = fer_df[fer_df["Usage"] == "PrivateTest"].reset_index(drop=True)
else:
    perm = np.random.permutation(len(fer_df))
    n = len(fer_df); n_tr = int(0.8*n); n_va = int(0.1*n)
    idx_tr, idx_va = perm[:n_tr], perm[n_tr:n_tr+n_va]
    mask = np.zeros(n, dtype=bool); mask[idx_tr]=True; mask[idx_va]=True
    tr_df = fer_df.iloc[idx_tr].reset_index(drop=True)
    va_df = fer_df.iloc[idx_va].reset_index(drop=True)
    te_df = fer_df.loc[~mask].reset_index(drop=True)

print(f"[Stage0] FER splits — train={len(tr_df)}, val={len(va_df)}, test={len(te_df)}")


[Stage0] FER splits — train=28709, val=3589, test=3589
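
The fallback branch in this cell (used only when the CSV has no `Usage` column) draws its permutation from NumPy's global random state, so the split depends on how many random numbers earlier cells consumed. A minimal sketch of a self-contained, seeded 80/10/10 index split follows. The name `split_indices` and its parameters are illustrative and not part of the notebook.

```python
import random

def split_indices(n, seed=42, frac_train=0.8, frac_val=0.1):
    """Return disjoint (train, val, test) index lists that together cover range(n)."""
    rng = random.Random(seed)      # local RNG, independent of any global random state
    perm = list(range(n))
    rng.shuffle(perm)
    n_tr = int(frac_train * n)
    n_va = int(frac_val * n)
    # Whatever is left after train and val becomes the test split
    return perm[:n_tr], perm[n_tr:n_tr + n_va], perm[n_tr + n_va:]

train_idx, val_idx, test_idx = split_indices(1000)
print(len(train_idx), len(val_idx), len(test_idx))
```

Calling `split_indices` twice with the same seed returns identical lists, which is the property the global-state version cannot guarantee.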


#Cell 07 — Dataset Definition (48×48 → 96×96, No Normalize Yet)

In [7]:
# CSV-backed dataset converting space-separated pixels → [1,H,W] float tensor.
# Resize to 96×96 here; normalization is applied later (in aug or eval transform).

class FER2013Dataset(Dataset):
    def __init__(self, df, img_size=96):
        self.df = df.reset_index(drop=True)
        self.img_size = int(img_size)
        if len(self.df) > 0:
            _ = self._row_to_tensor(0)

    def _row_to_tensor(self, idx: int) -> torch.Tensor:
        px = self.df.iloc[idx]["pixels"]
        arr = np.fromstring(str(px), sep=" ", dtype=np.float32)
        assert arr.size == 48*48, f"Row {idx}: expected 2304 pixels, got {arr.size}"
        img = torch.from_numpy(arr.reshape(48, 48)).unsqueeze(0)  # [1,48,48]
        img = VF.resize(img, [self.img_size, self.img_size],
                        interpolation=torchvision.transforms.InterpolationMode.BILINEAR,
                        antialias=True)
        return img  # still [0..255]

    def __len__(self): return len(self.df)

    def __getitem__(self, idx):
        x = self._row_to_tensor(idx)
        y = int(self.df.iloc[idx]["emotion"])
        return x, y

IMG_SIZE = int(CONFIG["IMG_SIZE"])
train_ds = FER2013Dataset(tr_df, img_size=IMG_SIZE)
valid_ds = FER2013Dataset(va_df, img_size=IMG_SIZE)
test_ds  = FER2013Dataset(te_df, img_size=IMG_SIZE)
print("[Stage0] Dataset objects built.")


[Stage0] Dataset objects built.
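
Each FER2013 row stores a 48×48 image as 2304 space-separated intensities. The sketch below isolates the parse-and-validate step using only the standard library. `parse_pixels` is a hypothetical helper; the dataset class above does the equivalent with NumPy and PyTorch and then resizes.

```python
def parse_pixels(pixels, side=48):
    """Parse a space-separated pixel string into a side x side list of floats."""
    values = [float(v) for v in pixels.split()]
    if len(values) != side * side:
        raise ValueError(f"expected {side * side} pixels, got {len(values)}")
    # Row-major reshape: row r holds values[r*side : (r+1)*side]
    return [values[r * side:(r + 1) * side] for r in range(side)]

# Synthetic row for illustration: intensities cycle through 0..255
row = " ".join(str(i % 256) for i in range(48 * 48))
img = parse_pixels(row)
print(len(img), len(img[0]), img[1][0])
```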


#Cell 08 — DataLoaders (No Aug Yet)

In [8]:
BATCH_SIZE  = int(CONFIG["BATCH_SIZE"])
NUM_WORKERS = int(CONFIG["NUM_WORKERS"])

train_dl = DataLoader(train_ds, batch_size=BATCH_SIZE, shuffle=True,
                      num_workers=NUM_WORKERS, pin_memory=True,
                      persistent_workers=(NUM_WORKERS>0))
valid_dl = DataLoader(valid_ds, batch_size=BATCH_SIZE*2, shuffle=False,
                      num_workers=NUM_WORKERS, pin_memory=True,
                      persistent_workers=(NUM_WORKERS>0))
test_dl  = DataLoader(test_ds,  batch_size=BATCH_SIZE*2, shuffle=False,
                      num_workers=NUM_WORKERS, pin_memory=True,
                      persistent_workers=(NUM_WORKERS>0))

xb, yb = next(iter(train_dl))
print(f"[Stage0] Train batch shape: {xb.shape}, {yb.shape}")


[Stage0] Train batch shape: torch.Size([192, 1, 96, 96]), torch.Size([192])


#Cell 09 — Optional Visualization Hook

In [9]:
from torchvision.utils import make_grid

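def show_batch(dl, nrow=16, title='Raw FER2013 batch (resized)'):
    imgs, labels = next(iter(dl))
    # make_grid expands 1-channel input to 3 channels; scale [0,255] -> [0,1]
    # so imshow does not clip the float RGB image to white.
    grid = make_grid(imgs[:nrow] / 255.0, nrow=nrow)
    plt.figure(figsize=(12,5))
    plt.axis('off'); plt.title(title)
    plt.imshow(grid.permute(1,2,0).clamp(0,1).numpy())
    plt.show()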

# Uncomment to preview
#show_batch(train_dl)


#Cell 10 — Augmentation Primitives (Grayscale-Friendly)

In [10]:
# ===== Advanced grayscale operators (tensor I/O) =====
# INPUT  : x in [C=1,H,W], values in [0,255]
# OUTPUT : same shape/range unless otherwise noted

import math, random
import numpy as np  # used by _from_pil_gray
import torch
import torch.nn.functional as F
import torchvision.transforms.functional as VF
from PIL import Image, ImageFilter, ImageOps
import io

# ---- Small utilities ----
def _to_pil_gray(x255: torch.Tensor) -> Image.Image:
    # [1,H,W] -> PIL 'L'
    x = x255.clamp(0, 255).to(torch.uint8).squeeze(0).cpu().numpy()
    return Image.fromarray(x, mode='L')

def _from_pil_gray(img: Image.Image) -> torch.Tensor:
    # PIL 'L' -> [1,H,W] float32 in [0,255]
    return torch.tensor(np.array(img, dtype=np.uint8), dtype=torch.float32).unsqueeze(0)

def _clamp255(x): return x.clamp(0.0, 255.0)

# ---- Photometric / intensity ----
def gauss_noise(x, sigma=0.02):
    n = torch.randn_like(x) * (sigma * 255.0)
    return _clamp255(x + n)

def rand_gamma(x, gmin=0.8, gmax=1.25):
    g = random.uniform(gmin, gmax)
    x01 = (x / 255.0).clamp(0,1)
    return (x01.pow(g) * 255.0)

def rand_contrast(x, scale=0.25):
    c = 1.0 + random.uniform(-scale, scale)
    mean = x.mean(dim=(1,2), keepdim=True)
    return _clamp255((x - mean) * c + mean)

def rand_equalize(x):
    img = _to_pil_gray(x); img = ImageOps.equalize(img)
    return _from_pil_gray(img).to(x.dtype).to(x.device)

def rand_jpeg(x, qmin=45, qmax=85):
    img = _to_pil_gray(x)
    buf = io.BytesIO()
    img.save(buf, format='JPEG', quality=random.randint(qmin, qmax))
    buf.seek(0)
    img2 = Image.open(buf).convert('L')
    return _from_pil_gray(img2).to(x.dtype).to(x.device)

def rand_vignette(x, strength=0.25):
    _, H, W = x.shape
    yy, xx = torch.meshgrid(torch.linspace(-1,1,H,device=x.device),
                            torch.linspace(-1,1,W,device=x.device), indexing='ij')
    r = torch.sqrt(xx**2 + yy**2)
    mask = 1.0 - strength * (r / r.max()).clamp(0,1)
    return _clamp255(x * mask.unsqueeze(0))

def rand_blur(x, k=3):
    return VF.gaussian_blur(x, kernel_size=k)

# ---- Geometric ----
def rand_affine_small(x, max_rot=12.0, max_trans=0.08, max_shear=8.0, max_scale=0.08):
    H, W = x.shape[-2:]
    angle = random.uniform(-max_rot, max_rot)
    trans = [int(random.uniform(-max_trans, max_trans) * W),
             int(random.uniform(-max_trans, max_trans) * H)]
    scale = 1.0 + random.uniform(-max_scale, max_scale)
    shear = [random.uniform(-max_shear, max_shear), 0.0]
    return VF.affine(x, angle=angle, translate=trans, scale=scale, shear=shear)

def rand_pad_crop(x, pad=3):
    # pad then random crop back to original size
    _, H, W = x.shape
    xpad = F.pad(x, (pad, pad, pad, pad), mode='reflect')
    i = random.randint(0, 2*pad); j = random.randint(0, 2*pad)
    return xpad[:, i:i+H, j:j+W]

def rand_hflip(x, p=0.5):
    return VF.hflip(x) if random.random() < p else x

# Elastic deformation: small, smoothed displacement field
def rand_elastic(x, alpha=1.0, sigma=4.0):
    _, H, W = x.shape
    # displacement fields
    dx = torch.randn(1,1,H,W, device=x.device)
    dy = torch.randn(1,1,H,W, device=x.device)
    # smooth them by Gaussian
    def _gauss_kernel(k=21, s=sigma):
        ax = torch.arange(k, device=x.device) - (k-1)/2
        ker = torch.exp(-(ax**2)/(2*s*s)); ker = ker/ker.sum()
        return ker
    k = 21
    gx = _gauss_kernel(k).view(1,1,1,k)
    gy = _gauss_kernel(k).view(1,1,k,1)
    dx = F.conv2d(dx, gx, padding=(0,k//2)); dx = F.conv2d(dx, gy, padding=(k//2,0))
    dy = F.conv2d(dy, gx, padding=(0,k//2)); dy = F.conv2d(dy, gy, padding=(k//2,0))
    dx = dx.squeeze(0).squeeze(0) * alpha
    dy = dy.squeeze(0).squeeze(0) * alpha

    # grid in [-1,1]
    yy, xx = torch.meshgrid(torch.linspace(-1,1,H,device=x.device),
                            torch.linspace(-1,1,W,device=x.device), indexing='ij')
    xx = (xx + dx / (W/2)).clamp(-1,1)
    yy = (yy + dy / (H/2)).clamp(-1,1)
    grid = torch.stack([xx, yy], dim=-1).unsqueeze(0)  # [1,H,W,2]
    return F.grid_sample(x.unsqueeze(0), grid, mode='bilinear', padding_mode='border', align_corners=True).squeeze(0)

# ---- Occlusion (domain-specific) ----
def band_occlusion(x, mode='eyes', frac=0.18):
    _, H, W = x.shape
    band_h = max(1, int(frac*H))
    y0 = {
        'eyes': int(0.35*H) - band_h//2,
        'mouth': int(0.75*H) - band_h//2,
        'top': int(0.15*H) - band_h//2
    }[mode]
    y1 = max(0, y0); y2 = min(H, y0 + band_h)
    x = x.clone()
    x[:, y1:y2, :] = x[:, y1:y2, :].mean()  # neutral occluder (gray)
    return x

def localized_erasing(x, min_frac=0.01, max_frac=0.05):
    _, H, W = x.shape
    area = random.uniform(min_frac, max_frac) * H * W
    side = int(max(2, math.sqrt(area)))
    cx, cy = random.randint(0,W-1), random.randint(0,H-1)
    x = x.clone()  # do not modify the caller's tensor in place (same as band_occlusion)
    x[:, max(0,cy-side//2):min(H,cy+side//2), max(0,cx-side//2):min(W,cx+side//2)] = 127.5
    return x


#Cell 11 — Augmentation Pipeline Builder (Curriculum; Output Normalized [-1,1])

In [11]:
# ===== AugMixLite: blend multiple sub-policies with the original =====
def _apply_bank(x, bank, k=2):
    y = x.clone()
    for _ in range(k):
        op = random.choice(bank)
        y = op(y)
    return y

def augmix_lite(x, banks, alpha=0.65, branches=2, depth=2):
    mix = x.clone()
    for _ in range(branches):
        b = random.choice(banks)
        y = _apply_bank(x, b, k=depth)
        mix = mix + y
    mix = mix / (branches + 1.0)
    return (1 - alpha) * x + alpha * mix

# ===== Advanced augmentation builder =====
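def build_advanced_fer_augment(strength: float):
    """
    strength: s in [0,1]; scales the probabilities and magnitudes defined below.
    Returns a function that maps x with shape [1,H,W] and values in [0,255]
    to an augmented tensor of the same shape, normalized to [-1,1].
    """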
    s = float(max(0.0, min(1.0, strength)))
    # Probabilities
    p_photo = 0.7 * (0.5 + 0.5*s)
    p_geom  = 0.6 * (0.5 + 0.5*s)
    p_occl  = 0.40 * (0.5 + 0.5*s)
    p_equal = 0.20 * s
    p_blur  = 0.15 * s
    # Magnitudes
    gamma_rng = (0.85 - 0.15*s, 1.20 + 0.05*s)
    contrast_scale = 0.20 + 0.10*s
    jpeg_q = (55 - int(10*s), 85)
    vignette_str = 0.15 + 0.20*s
    elastic_alpha = 0.6 + 0.8*s
    rot = 10 + 5*s
    shear = 6 + 4*s
    trans = 0.06 + 0.03*s
    scale = 0.06 + 0.04*s

    # Banks
    photometric_bank = [
        lambda z: gauss_noise(z, sigma=0.015 + 0.02*s),
        lambda z: rand_gamma(z, *gamma_rng),
        lambda z: rand_contrast(z, scale=contrast_scale),
        lambda z: rand_jpeg(z, qmin=jpeg_q[0], qmax=jpeg_q[1]),
        lambda z: rand_vignette(z, strength=vignette_str),
    ]
    geometric_bank = [
        lambda z: rand_affine_small(z, max_rot=rot, max_trans=trans, max_shear=shear, max_scale=scale),
        lambda z: rand_pad_crop(z, pad=3),
        lambda z: rand_hflip(z, p=0.5),
        lambda z: rand_elastic(z, alpha=elastic_alpha, sigma=4.0),
    ]
    occlusion_bank = [
        lambda z: band_occlusion(z, mode='eyes',  frac=0.16 + 0.06*s),
        lambda z: band_occlusion(z, mode='mouth', frac=0.16 + 0.06*s),
        lambda z: band_occlusion(z, mode='top',   frac=0.14 + 0.06*s),
        lambda z: localized_erasing(z, min_frac=0.01, max_frac=0.05),
    ]
    banks = [photometric_bank, geometric_bank, occlusion_bank]

    def _norm_to_m11(x255):
        x01 = (x255 / 255.0).clamp(0,1)
        return (x01 - 0.5) * 2.0

    def _augment(x):
        # 1) pre-crop/pad + optional blur
        if random.random() < p_geom:  x = rand_pad_crop(x, pad=3)
        if random.random() < p_blur:  x = rand_blur(x, k=3)

        # 2) photometric block
        if random.random() < p_photo: x = random.choice(photometric_bank)(x)

        # 3) AugMixLite composite (2 branches × depth=2)
        x = augmix_lite(x, banks=banks, alpha=CONFIG.get("AUG_ALPHA", 0.65),
                        branches=2, depth=2)

        # 4) more geometric and occlusion chance
        if random.random() < p_geom:  x = random.choice(geometric_bank)(x)
        if random.random() < p_occl:  x = random.choice(occlusion_bank)(x)

        # 5) occasional histogram equalization near the end
        if random.random() < p_equal: x = rand_equalize(x)

        # Normalize to [-1,1] for the model
        return _norm_to_m11(x)

    return _augment

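# Keep the legacy factory as a fallback, but only if it is actually defined:
# build_fer_augment does not appear in this notebook, so referencing it
# unconditionally would raise NameError when USE_AUG_ADV is False.
if CONFIG.get("USE_AUG_ADV", False) or "build_fer_augment" not in globals():
    FER_AUG_FACTORY = build_advanced_fer_augment
else:
    FER_AUG_FACTORY = build_fer_augment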


#Cell 12 — Augmentation Debug Cell

In [12]:
if CONFIG["USE_AUG"]:
    xs, ys = next(iter(train_dl))
    xs = xs[:8].to('cpu')
    for s in [0.0, 0.5, 1.0]:
        f = FER_AUG_FACTORY(s)
        x_aug = torch.stack([f(x) for x in xs])  # [-1,1]
        assert x_aug.shape == xs.shape and torch.isfinite(x_aug).all()
        print(f"[AugDebug] s={s}: min={x_aug.min().item():.3f}, max={x_aug.max().item():.3f}")
        # Optional quick glance
        # grid = (x_aug * 0.5 + 0.5).clamp(0,1)
        # grid = make_grid(grid, nrow=8)
        # plt.figure(figsize=(12,3)); plt.axis('off'); plt.title(f'Advanced Aug s={s}')
        # plt.imshow(grid.permute(1,2,0).squeeze(), cmap='gray'); plt.show()


[AugDebug] s=0.0: min=-1.000, max=0.978
[AugDebug] s=0.5: min=-1.000, max=0.910
[AugDebug] s=1.0: min=-1.000, max=1.000
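
In `augmix_lite`, the input image enters the blend twice: once inside the branch average and once in the final interpolation. Working through that arithmetic gives the effective weight on the input and on each augmented branch. The helper name `augmix_weights` is illustrative only.

```python
def augmix_weights(alpha=0.65, branches=2):
    """Effective blend weights implied by
         mix = (x + y_1 + ... + y_B) / (B + 1)
         out = (1 - alpha) * x + alpha * mix
    """
    w_branch = alpha / (branches + 1)                 # weight on each augmented branch y_i
    w_input = (1.0 - alpha) + alpha / (branches + 1)  # total weight on the input x
    return w_input, w_branch

w_input, w_branch = augmix_weights()
print(f"input={w_input:.3f}, per-branch={w_branch:.3f}")
```

With the notebook's settings (`AUG_ALPHA = 0.65`, two branches), a little over half of each output pixel comes from the image entering the AugMixLite step, and the weights sum to 1.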


#Cell 13 — Metrics: Accuracy & Class Weights

In [13]:
from collections import Counter

def accuracy(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    return (logits.argmax(dim=1) == targets).float().mean()

def compute_class_weights(df) -> torch.Tensor:
    counts = Counter(int(e) for e in df["emotion"].tolist())
    total = sum(counts.values())
    weights = [total / max(1, counts.get(c, 1)) for c in range(7)]
    w = torch.tensor(weights, dtype=torch.float32)
    return w / w.mean()

CLASS_WEIGHTS = compute_class_weights(tr_df)
print(f"[Stage1] Class weights: {CLASS_WEIGHTS.tolist()}")


[Stage1] Class weights: [0.4800022542476654, 4.398185729980469, 0.4680519998073578, 0.26578086614608765, 0.39702051877975464, 0.6047332286834717, 0.3862254023551941]
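
The weights above are inverse class frequencies rescaled to a mean of 1, which is why the rarest class receives by far the largest weight. The same computation on a small label list is sketched below. The label counts are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from collections import Counter

def inverse_freq_weights(labels, num_classes):
    """Inverse-frequency class weights, rescaled so their mean is 1."""
    counts = Counter(labels)
    total = len(labels)
    # A class that never occurs is treated as having count 1, as in compute_class_weights
    raw = [total / max(1, counts.get(c, 1)) for c in range(num_classes)]
    mean = sum(raw) / num_classes
    return [w / mean for w in raw]

labels = [0] * 60 + [1] * 30 + [2] * 10   # hypothetical, imbalanced 3-class labels
weights = inverse_freq_weights(labels, 3)
print([round(w, 3) for w in weights])
```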


#Cell 14 — Losses: Label-Smoothed CE, Focal, Composite

In [14]:
# SmoothedFocal: weighted blend of label-smoothed CE and focal loss (supersedes the earlier CompositeLoss).
class LabelSmoothingCE(nn.Module):
    def __init__(self, eps=0.10, reduction='mean'):
        super().__init__(); self.eps=eps; self.reduction=reduction
    def forward(self, logits, targets):
        n = logits.size(-1)
        logp = F.log_softmax(logits, dim=-1)
        with torch.no_grad():
            true = torch.zeros_like(logp).fill_(self.eps/(n-1))
            true.scatter_(1, targets.unsqueeze(1), 1.0 - self.eps)
        loss = -(true * logp).sum(dim=1)
        return loss.mean() if self.reduction=='mean' else loss

class FocalLoss(nn.Module):
    def __init__(self, gamma=1.5, reduction='mean'):
        super().__init__(); self.g=gamma; self.reduction=reduction
    def forward(self, logits, targets):
        ce = F.cross_entropy(logits, targets, reduction='none')
        pt = torch.exp(-ce)
        fl = ((1-pt)**self.g) * ce
        return fl.mean() if self.reduction=='mean' else fl

class SmoothedFocal(nn.Module):
    def __init__(self, eps=0.10, gamma=1.5, alpha=0.70, weight=None):
        super().__init__(); self.a=alpha; self.w = weight
        self.lsce = LabelSmoothingCE(eps)
        self.focal= FocalLoss(gamma)
    def forward(self, logits, targets):
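        if self.w is not None:
            # Class weights enter through the CE term of the focal loss, so compute it by hand.
            # Note: pt is derived from the weighted CE, so it only approximates the true-class probability.
            ce = F.cross_entropy(logits, targets, reduction='none', weight=self.w.to(logits.device))
            pt = torch.exp(-ce); fl = ((1-pt)**self.focal.g) * ce  # use the configured gamma
            ls = self.lsce(logits, targets)
            return self.a*ls + (1-self.a)*fl.mean()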
        return self.a*self.lsce(logits, targets) + (1-self.a)*self.focal(logits, targets)
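
To make the composite loss easier to check by hand, here is a per-sample sketch in plain Python of the unweighted path: label-smoothed cross-entropy, the focal term, and their blend. The function names and example logits are illustrative; the classes above remain the reference for batched tensors.

```python
import math

def log_softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

def label_smoothing_ce(logits, target, eps=0.10):
    n = len(logits)
    logp = log_softmax(logits)
    # Target distribution: 1 - eps on the true class, eps / (n - 1) on every other class
    return -sum(((1.0 - eps) if i == target else eps / (n - 1)) * lp
                for i, lp in enumerate(logp))

def focal(logits, target, gamma=1.5):
    ce = -log_softmax(logits)[target]
    pt = math.exp(-ce)                # predicted probability of the true class
    return (1.0 - pt) ** gamma * ce   # confident (easy) samples are down-weighted

def smoothed_focal(logits, target, eps=0.10, gamma=1.5, alpha=0.70):
    return (alpha * label_smoothing_ce(logits, target, eps)
            + (1.0 - alpha) * focal(logits, target, gamma))

logits = [2.0, 0.5, -1.0]   # hypothetical logits for a 3-class problem
print(smoothed_focal(logits, target=0))
```

Two quick sanity checks follow from the definitions: with `eps=0` the smoothed term reduces to ordinary cross-entropy, and with `gamma=0` the focal term does too.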


#Cell 15 — MixUp & CutMix Utilities + Mixed Criterion

In [15]:
def mixup_data(x, y, alpha=0.2):
    if alpha <= 0.0:
        return x, y, 1.0, None
    lam = np.random.beta(alpha, alpha)
    index = torch.randperm(x.size(0), device=x.device)
    mixed_x = lam * x + (1 - lam) * x[index, :]
    y_a, y_b = y, y[index]
    return mixed_x, (y_a, y_b), lam, index

def cutmix_data(x, y, alpha=1.0, min_lam=0.3, max_lam=0.7):
    if alpha <= 0.0:
        return x, y, 1.0, None
    lam = np.random.beta(alpha, alpha)
    lam = float(max(min_lam, min(max_lam, lam)))
    B, C, H, W = x.size()
    index = torch.randperm(B, device=x.device)
    cut_w = int(W * math.sqrt(1 - lam))
    cut_h = int(H * math.sqrt(1 - lam))
    cx = np.random.randint(W)
    cy = np.random.randint(H)
    x1 = np.clip(cx - cut_w // 2, 0, W)
    x2 = np.clip(cx + cut_w // 2, 0, W)
    y1 = np.clip(cy - cut_h // 2, 0, H)
    y2 = np.clip(cy + cut_h // 2, 0, H)
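    x = x.clone()  # keep the caller's batch intact instead of pasting in place
    x[:, :, y1:y2, x1:x2] = x[index, :, y1:y2, x1:x2]
    # Recompute lam from the clipped box so it matches the pixels actually replaced
    lam = 1 - ((x2 - x1) * (y2 - y1) / (W * H))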
    y_a, y_b = y, y[index]
    return x, (y_a, y_b), lam, index

def mixed_criterion(criterion, logits, targets_mix, lam):
    if isinstance(targets_mix, tuple):
        y_a, y_b = targets_mix
        return lam * criterion(logits, y_a) + (1 - lam) * criterion(logits, y_b)
    else:
        return criterion(logits, targets_mix)
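
`cutmix_data` recomputes `lam` after clipping because a box centered near the border loses part of its area, so the sampled `lam` no longer matches the fraction of pixels actually replaced. The box arithmetic is sketched below in plain Python. `cutmix_box` is a hypothetical helper that takes the center as an argument instead of sampling it.

```python
import math

def cutmix_box(W, H, lam, cx, cy):
    """Return the clipped box (x1, y1, x2, y2) and the lam implied by its area."""
    cut_w = int(W * math.sqrt(1.0 - lam))
    cut_h = int(H * math.sqrt(1.0 - lam))
    x1 = min(max(cx - cut_w // 2, 0), W)
    x2 = min(max(cx + cut_w // 2, 0), W)
    y1 = min(max(cy - cut_h // 2, 0), H)
    y2 = min(max(cy + cut_h // 2, 0), H)
    lam_adj = 1.0 - (x2 - x1) * (y2 - y1) / (W * H)   # share kept from the original image
    return (x1, y1, x2, y2), lam_adj

print(cutmix_box(96, 96, 0.5, cx=48, cy=48))   # box fully inside the image
print(cutmix_box(96, 96, 0.5, cx=0, cy=0))     # box clipped at the top-left corner
```

For the same sampled `lam`, the corner case keeps a much larger share of the original image than the centered case, so the label mix has to follow the adjusted value.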


#Cell 16 — EMA (Exponential Moving Average) for Weights

In [16]:
class EMA:
    def __init__(self, model: nn.Module, decay: float = 0.999):
        self.decay = float(decay)
        self.shadow = {}
        self.backup = {}
        for name, param in model.named_parameters():
            if param.requires_grad:
                self.shadow[name] = param.data.clone()

    def update(self, model: nn.Module):
        for name, param in model.named_parameters():
            if not param.requires_grad:
                continue
            self.shadow[name] = (1.0 - self.decay) * param.data + self.decay * self.shadow[name]

    def apply_shadow(self, model: nn.Module):
        self.backup = {}
        for name, param in model.named_parameters():
            if not param.requires_grad:
                continue
            self.backup[name] = param.data.clone()
            param.data = self.shadow[name].clone()

    def restore(self, model: nn.Module):
        for name, param in model.named_parameters():
            if not param.requires_grad:
                continue
            param.data = self.backup[name].clone()
        self.backup = {}
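
The update rule in `EMA.update` is easiest to reason about on a single scalar. The sketch below applies it to a float and estimates how many steps the average effectively spans. The helper names are illustrative, and the steps-per-epoch figure assumes one EMA update per optimizer step with the training set size and batch size shown earlier in this notebook.

```python
def ema_update(shadow, value, decay):
    """One EMA step: keep `decay` of the old average, blend in (1 - decay) of the new value."""
    return decay * shadow + (1.0 - decay) * value

def ema_horizon_steps(decay):
    """Rough number of recent steps that dominate the average."""
    return 1.0 / (1.0 - decay)

shadow = 0.0
for _ in range(3):
    shadow = ema_update(shadow, 1.0, decay=0.9)
print(round(shadow, 6))            # equals 1 - 0.9**3 up to rounding

steps_per_epoch = -(-28709 // 192)  # ceil(28709 / 192) without importing math
print(steps_per_epoch, round(ema_horizon_steps(0.9995)))
```

Under those assumptions, a decay of 0.9995 averages over roughly 2000 steps, on the order of 13 epochs, so EMA weights lag the raw weights noticeably early in training.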


#Cell 17 — Base Training/Evaluation Mixin

In [17]:
class TrainingMixin:
    def training_step(self, batch, criterion):
        x, y = batch
        x, y = x.to(self.device), y.to(self.device)
        logits = self(x)
        loss = criterion(logits, y)
        return loss

    @torch.no_grad()
    def validation_step(self, batch, criterion):
        x, y = batch
        x, y = x.to(self.device), y.to(self.device)
        logits = self(x)
        loss = criterion(logits, y)
        acc = accuracy(logits, y)
        return loss.detach(), acc.detach()

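    @torch.no_grad()
    def evaluate_loader(self, loader, criterion):
        self.eval()
        loss_sum, acc_sum, n = 0.0, 0.0, 0
        for batch in loader:
            l, a = self.validation_step(batch, criterion)
            # Weight each batch by its size so a short final batch is not over-counted
            # (assumes the criterion returns a per-batch mean).
            bs = batch[1].size(0)
            loss_sum += l.item() * bs; acc_sum += a.item() * bs; n += bs
        return loss_sum / max(1, n), acc_sum / max(1, n)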


#Cell 18 — Training Hyperparameters & Global Knobs

In [18]:
# === Squeezed HP (late-improvement friendly) ===
HP = {
    "EPOCHS": 70,          # give the cosine tail more room
    "LR": 3e-4,            # good starting LR for AdamW fine-tune
    "WD": 5e-5,            # was 1e-4; slightly lower to reduce underfit in the tail
    "WARMUP_EPOCHS": 4,    # keep warmup
    "LR_MIN": 5e-5,        # raise tail floor so updates don’t die out (you were still learning at ~3e-5)
    "PATIENCE": 18,        # was 8; your run was still improving at ep40 → avoid premature stop
    "EMA_DECAY": 0.9995,   # a touch slower EMA helps late stability
    "MIXUP_ALPHA": 0.3,
    "CUTMIX_ALPHA": 1.0,
    "AUG_RAMP_EPOCHS": 0.25, # ramp a bit faster; we’ll taper later anyway
}
print("[Stage1] HP snapshot:", HP)

# --- Late-phase tweaks (outside HP; keep your existing fit(...) call) ---
# Use these variables if your training loop supports them; otherwise just apply the logic in code.
P_MIX, P_CUT = 0.30, 0.30           # keep some mixing early
TAPER_START_FRAC, TAPER_END_FRAC = 0.60, 0.90   # linearly taper mix/aug to ~0 by last 10%
LABEL_SMOOTH_EPS = 0.05             # feed into your loss if supported (CrossEntropy w/ label smoothing)

# Example (conceptual) inside your epoch loop:
# frac = epoch / HP["EPOCHS"]
# p_mix_now = P_MIX if frac < TAPER_START_FRAC else max(0.0, P_MIX * (1 - (frac - TAPER_START_FRAC) / (TAPER_END_FRAC - TAPER_START_FRAC)))
# p_cut_now = P_CUT if frac < TAPER_START_FRAC else max(0.0, P_CUT * (1 - (frac - TAPER_START_FRAC) / (TAPER_END_FRAC - TAPER_START_FRAC)))


[Stage1] HP snapshot: {'EPOCHS': 70, 'LR': 0.0003, 'WD': 5e-05, 'WARMUP_EPOCHS': 4, 'LR_MIN': 5e-05, 'PATIENCE': 18, 'EMA_DECAY': 0.9995, 'MIXUP_ALPHA': 0.3, 'CUTMIX_ALPHA': 1.0, 'AUG_RAMP_EPOCHS': 0.25}
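
The commented taper formula in this cell can be wrapped in a small function and checked in isolation. This is a sketch: `tapered_prob` is a hypothetical name, and its defaults mirror `P_MIX`, `TAPER_START_FRAC`, and `TAPER_END_FRAC` above.

```python
def tapered_prob(epoch, total_epochs, p0=0.30, start=0.60, end=0.90):
    """Hold p0 until `start` (fraction of training), then fall linearly to 0 at `end`."""
    frac = epoch / total_epochs
    if frac < start:
        return p0
    t = (frac - start) / (end - start)   # 0 at start of taper, 1 at end, >1 afterwards
    return max(0.0, p0 * (1.0 - t))

for ep in (0, 42, 52, 63, 69):
    print(ep, round(tapered_prob(ep, 70), 3))
```

The same function can serve both `p_mix_now` and `p_cut_now`, since the two commented expressions differ only in the starting probability.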


#Cell 19 — CBAM Block (Attention Module)

In [19]:
class CBAM(nn.Module):
    def __init__(self, ch, r=8):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(ch, max(1, ch//r), 1, bias=True), nn.ReLU(inplace=True),
            nn.Conv2d(max(1, ch//r), ch, 1, bias=True)
        )
        self.spatial = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3, bias=False),
            nn.Sigmoid()
        )
        self.sigmoid = nn.Sigmoid()
    def forward(self, x):
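        # Note: the pooled descriptors are summed before the shared MLP; the original CBAM
        # formulation applies the MLP to each descriptor and sums the two outputs.
        ca = F.adaptive_avg_pool2d(x, 1) + F.adaptive_max_pool2d(x, 1)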
        ca = self.sigmoid(self.mlp(ca))
        x = x * ca
        ms = torch.cat([x.mean(1, keepdim=True), x.max(1, keepdim=True)[0]], dim=1)
        sa = self.spatial(ms)
        return x * sa


#Cell 20 — Optional Sobel Stem (for Grayscale Edges)

In [20]:
class SobelStem(nn.Module):
    def __init__(self):
        super().__init__()
        kernel_x = torch.tensor([[1,0,-1],[2,0,-2],[1,0,-1]], dtype=torch.float32)
        kernel_y = torch.tensor([[1,2,1],[0,0,0],[-1,-2,-1]], dtype=torch.float32)
        self.register_buffer('kx', kernel_x.view(1,1,3,3))
        self.register_buffer('ky', kernel_y.view(1,1,3,3))
    def forward(self, x):
        gx = F.conv2d(x, self.kx, padding=1)
        gy = F.conv2d(x, self.ky, padding=1)
        g = torch.sqrt(gx**2 + gy**2 + 1e-6)
        return torch.cat([x, g], dim=1)  # [B,2,H,W]


#Cell 21 — HybridEffNet Model Definition

In [21]:
# HybridEffNet: EfficientNet-B0 features with a Sobel input layer and optional CBAM.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import efficientnet_b0, EfficientNet_B0_Weights

# Pull optional knobs from CONFIG with safe defaults
CLASSIFIER_DROPOUT = float(CONFIG.get("CLASSIFIER_DROPOUT", 0.30))
USE_CBAM = bool(CONFIG.get("USE_CBAM", True))  # keep True to match prior runs

class SobelLayer(nn.Module):
    def __init__(self):
        super().__init__()
        kx = torch.tensor([[1,0,-1],[2,0,-2],[1,0,-1]], dtype=torch.float32)
        ky = torch.tensor([[1,2,1],[0,0,0],[-1,-2,-1]], dtype=torch.float32)
        w  = torch.stack([kx, ky]).unsqueeze(1)  # (2,1,3,3)
        self.register_buffer('w', w)

    def forward(self, x):            # x:[B,1,H,W]
        edges = F.conv2d(x, self.w, padding=1)   # [B,2,H,W]
        return torch.cat([x, edges], dim=1)      # [B,3,H,W]

class HybridEffNet(nn.Module, TrainingMixin):
    def __init__(self, num_classes=7, classifier_dropout=CLASSIFIER_DROPOUT, use_cbam=USE_CBAM):
        super().__init__()
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

        base = efficientnet_b0(weights=EfficientNet_B0_Weights.DEFAULT)

        # Sobel expands 1→3 channels so we can keep EfficientNet stem unchanged
        self.sobel    = SobelLayer()
        self.features = base.features
        self.pool     = nn.AdaptiveAvgPool2d(1)

        # Optional CBAM on the final feature map (requires Cell 19)
        self.cbam = CBAM(1280) if use_cbam else None

        in_features = 1280  # EfficientNet-B0 penultimate dim
        self.bn   = nn.BatchNorm1d(in_features)
        self.drop = nn.Dropout(p=classifier_dropout)
        self.head = nn.Linear(in_features, num_classes)

        self.to(self.device)

    def forward(self, x1):           # x1 in [-1,1], shape [B,1,H,W]
        x3 = self.sobel(x1)          # [B,3,H,W]
        f  = self.features(x3)       # [B,1280,h,w]
        if self.cbam is not None:
            f = self.cbam(f)
        f  = self.pool(f).flatten(1) # [B,1280]
        f  = self.bn(f)
        f  = self.drop(f)
        return self.head(f)
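
The `[B,1280,h,w]` comment in `forward` leaves `h` and `w` open. A rough way to estimate them without running the model is to walk through the layers that halve the resolution. The sketch below assumes torchvision's standard EfficientNet-B0 layout (a stride-2 3×3 stem followed by four stride-2 stages with kernel sizes 3, 5, 3, 5, each padded with `(kernel - 1) // 2`) and the 96×96 input size named in the notebook outline. The function names are hypothetical.

```python
def conv_out(n, kernel, stride):
    """Output length of a convolution padded with (kernel - 1) // 2."""
    pad = (kernel - 1) // 2
    return (n + 2 * pad - kernel) // stride + 1

def effnet_b0_feature_size(n):
    """Spatial size after the stride-2 layers of EfficientNet-B0 (assumed layout)."""
    # (kernel, stride) of the layers that change resolution; stride-1 layers keep it
    downsampling = [(3, 2), (3, 2), (5, 2), (3, 2), (5, 2)]
    for kernel, stride in downsampling:
        n = conv_out(n, kernel, stride)
    return n

for size in (48, 96, 224):
    print(size, "->", effnet_b0_feature_size(size))
```

Under these assumptions a 96×96 input ends up as a 3×3 feature map, so the 7×7 spatial convolution inside CBAM (padding 3) would see mostly padded zeros at this resolution.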


#Cell 22 — Optimizer, Warmup-Cosine Scheduler, EarlyStopping

In [22]:
import math  # used by WarmupCosine below

def make_adamw(params, lr, wd):
    return torch.optim.AdamW(params, lr=lr, weight_decay=wd)

class WarmupCosine:
    def __init__(self, optimizer, warmup_epochs, max_epochs, lr_min=1e-6, lr_max=None):
        self.opt = optimizer
        self.warmup = max(1, int(warmup_epochs))
        self.maxe = int(max_epochs)
        self.t = 0
        self.lr_min = lr_min
        self.lr_max = lr_max if lr_max is not None else max(g['lr'] for g in optimizer.param_groups)
    def step(self):
        self.t += 1
        if self.t <= self.warmup:
            lr = self.lr_min + (self.lr_max - self.lr_min) * (self.t / self.warmup)
        else:
            tt = (self.t - self.warmup) / max(1, (self.maxe - self.warmup))
            lr = self.lr_min + 0.5*(self.lr_max - self.lr_min)*(1 + math.cos(math.pi*tt))
        for g in self.opt.param_groups:
            g['lr'] = lr
        return lr

class EarlyStopping:
    def __init__(self, patience=8, min_delta=1e-4):
        self.patience = int(patience)
        self.min_delta = float(min_delta)
        self.best = float('inf')
        self.bad = 0
    def step(self, val_loss):
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad = 0
            return False
        self.bad += 1
        return self.bad >= self.patience
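
The schedule produced by `WarmupCosine` can be inspected without an optimizer. The sketch below restates the same formula as a plain function and prints a few values. The hyperparameters (4 warmup epochs, 70 epochs, `lr_min=5e-5`, `lr_max=3e-4`) are assumptions chosen for illustration; the real `HP` dict is defined elsewhere in the notebook, although these values appear consistent with the learning rates printed in the Cell 25 training log.

```python
import math

def warmup_cosine_lr(t, warmup, max_epochs, lr_min, lr_max):
    """Learning rate returned by the t-th call to WarmupCosine.step() (t starts at 1)."""
    if t <= warmup:
        # linear ramp from lr_min towards lr_max
        return lr_min + (lr_max - lr_min) * (t / warmup)
    # cosine decay from lr_max back down to lr_min
    tt = (t - warmup) / max(1, max_epochs - warmup)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * tt))

# Hypothetical hyperparameters, for illustration only
WARMUP, EPOCHS, LR_MIN, LR_MAX = 4, 70, 5e-5, 3e-4

schedule = [warmup_cosine_lr(t, WARMUP, EPOCHS, LR_MIN, LR_MAX)
            for t in range(1, EPOCHS + 1)]
for t in (1, 2, 4, 37, 70):
    print(f"step {t:2d}: lr={schedule[t - 1]:.2e}")
```

Because `fit_with_aug()` calls `sched.step()` at the end of each epoch, the value for step `t` is the rate that epoch `t + 1` trains with.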


#Cell 23 — fit_with_aug(): Training Loop with Aug/Mix/EMA

In [23]:
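# === Cell 23 — fit_with_aug(): advanced aug + late-phase taper + AMP + EMA ===
import random          # used for the MixUp/CutMix coin flip
import numpy as np     # used to average validation metrics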
def fit_with_aug(model: nn.Module, train_dl, valid_dl, hp, config):
    device = model.device

    # Loss with class weights (computed earlier)
    weight = CLASS_WEIGHTS.to(device)
    criterion = SmoothedFocal(eps=0.10, gamma=1.5, alpha=0.70, weight=weight)

    # Optimizer / Scheduler / EarlyStop / EMA / AMP
    opt   = make_adamw(model.parameters(), lr=hp["LR"], wd=hp["WD"])
    sched = WarmupCosine(opt, warmup_epochs=hp["WARMUP_EPOCHS"],
                         max_epochs=hp["EPOCHS"], lr_min=hp["LR_MIN"])
    stopper = EarlyStopping(patience=hp["PATIENCE"], min_delta=1e-4)
    ema   = EMA(model, decay=hp["EMA_DECAY"]) if config["USE_EMA"] else None
    scaler = torch.cuda.amp.GradScaler(enabled=torch.cuda.is_available())

    # Curriculum ramp for augmentation strength
    total_epochs  = int(hp["EPOCHS"])
    # NOTE: despite its name, AUG_RAMP_EPOCHS is a fraction of total epochs (default 0.30)
    aug_ramp_frac = float(hp.get("AUG_RAMP_EPOCHS", 0.30))
    ramp_epochs   = max(1, int(aug_ramp_frac * total_epochs))

    # Feature flags for late-phase controls
    cap_late   = bool(config.get("AUG_CAP_LATE", True))
    taper_late = bool(config.get("TAPER_MIX_LATE", True))

    history = []
    best_val = float("inf")  # track best val_loss for reference/logging

    for epoch in range(1, total_epochs + 1):
        model.train()
        epoch_loss_sum = 0.0
        num_seen = 0

        # ----- Build augmentation for this epoch -----
        if config["USE_AUG"]:
            # curriculum strength in [0.2, 0.8]
            s = 0.2 + 0.6 * min(1.0, epoch / ramp_epochs)
            # optional cap in the final 30% to better match val distribution
            if cap_late and epoch >= int(0.7 * total_epochs):
                s = min(s, 0.6)
            augment = FER_AUG_FACTORY(s)
        else:
            augment = None

        # ----- Late-phase taper for mixing (MixUp/CutMix) -----
        mixup_alpha  = float(hp["MIXUP_ALPHA"])
        cutmix_alpha = float(hp["CUTMIX_ALPHA"])
        use_cutmix   = bool(config["USE_CUTMIX"])

        if taper_late and epoch >= int(0.5 * total_epochs):
            mixup_alpha  = max(0.1, mixup_alpha * 0.5)   # softer mixing
            cutmix_alpha = max(0.5, cutmix_alpha * 0.5) # halve alpha, floored at 0.5
        if taper_late and epoch >= int(0.7 * total_epochs):
            use_cutmix = False  # sharpen decision boundaries late

        # ===================== Train loop =====================
        for xb, yb in train_dl:
            xb, yb = xb.to(device, non_blocking=True), yb.to(device, non_blocking=True)

            # Apply augmentation or deterministic normalization
            if augment is not None:
                # advanced policy returns tensors already normalized to [-1,1]
                xb = torch.stack([augment(x) for x in xb])  # [-1,1]
            else:
                xb = ((xb / 255.0) - 0.5) * 2.0

            opt.zero_grad(set_to_none=True)

            with torch.autocast(device_type="cuda", dtype=torch.float16, enabled=torch.cuda.is_available()):
                if config["USE_MIXUP"] or use_cutmix:
                    if use_cutmix and random.random() < 0.5:
                        xb, targets_mix, lam, _ = cutmix_data(
                            xb, yb, alpha=cutmix_alpha, min_lam=0.3, max_lam=0.7
                        )
                    else:
                        xb, targets_mix, lam, _ = mixup_data(xb, yb, alpha=mixup_alpha)
                    logits = model(xb)
                    loss   = mixed_criterion(criterion, logits, targets_mix, lam)
                else:
                    logits = model(xb)
                    loss   = criterion(logits, yb)

            scaler.scale(loss).backward()
            if torch.cuda.is_available():
                scaler.unscale_(opt)
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
            scaler.step(opt)
            scaler.update()

            if ema is not None:
                ema.update(model)

            bs = xb.size(0)
            epoch_loss_sum += loss.item() * bs
            num_seen += bs

        # ===================== Validation =====================
        @torch.no_grad()
        def _eval(loader):
            model.eval()
            losses, accs = [], []
            for xb, yb in loader:
                xb, yb = xb.to(device, non_blocking=True), yb.to(device, non_blocking=True)
                xb = ((xb / 255.0) - 0.5) * 2.0  # eval is augmentation-free
                logits = model(xb)
                l = criterion(logits, yb)
                a = accuracy(logits, yb)
                losses.append(l.item()); accs.append(a.item())
            # Mean of per-batch means: a smaller final batch is slightly over-weighted
            return float(np.mean(losses)), float(np.mean(accs))

        val_loss, val_acc = _eval(valid_dl)

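        # Scheduler steps once per epoch, after validation. lr_now is therefore the
        # rate the *next* epoch will use; epoch 1 runs at the optimizer's initial LR.
        lr_now = sched.step()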

        # Epoch bookkeeping
        train_loss = epoch_loss_sum / max(1, num_seen)
        history.append({
            "epoch": epoch,
            "train_loss": train_loss,
            "val_loss": val_loss,
            "val_acc": val_acc,
            "lr": lr_now
        })

        # Logging with best marker on val_loss (early stop follows this)
        is_best = ""
        if val_loss < best_val - 1e-6:
            best_val = val_loss
            is_best = " *best*"
        print(f"[Epoch {epoch:03d}/{total_epochs}] "
              f"train_loss={train_loss:.4f}  val_loss={val_loss:.4f}  "
              f"val_acc={val_acc:.4f}  lr={lr_now:.2e}{is_best}")

        # Early stopping on validation loss
        if stopper.step(val_loss):
            print("[EarlyStopping] Patience exhausted; stopping training.")
            break

    return history, ema
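
The per-epoch decisions in `fit_with_aug()` (augmentation strength, mixing alphas, whether CutMix is still active) depend only on the epoch index, so they can be previewed before training. The sketch below copies that logic into a standalone function. The name `epoch_plan` and the base values `mixup_alpha=0.4` and `cutmix_alpha=1.0` are hypothetical; 70 epochs matches the logged run.

```python
def epoch_plan(epoch, total_epochs, ramp_frac=0.30, mixup_alpha=0.4, cutmix_alpha=1.0,
               use_cutmix=True, cap_late=True, taper_late=True):
    """Return (aug_strength, mixup_alpha, cutmix_alpha, use_cutmix) for one epoch."""
    ramp_epochs = max(1, int(ramp_frac * total_epochs))
    # curriculum: strength ramps linearly up to 0.8 over the first ramp_epochs
    s = 0.2 + 0.6 * min(1.0, epoch / ramp_epochs)
    # optional cap in the final 30% of training
    if cap_late and epoch >= int(0.7 * total_epochs):
        s = min(s, 0.6)
    # second half: halve both alphas (with floors)
    if taper_late and epoch >= int(0.5 * total_epochs):
        mixup_alpha = max(0.1, mixup_alpha * 0.5)
        cutmix_alpha = max(0.5, cutmix_alpha * 0.5)
    # final 30%: CutMix off
    if taper_late and epoch >= int(0.7 * total_epochs):
        use_cutmix = False
    return s, mixup_alpha, cutmix_alpha, use_cutmix

for epoch in (1, 10, 30, 40, 60):
    s, ma, ca, cm = epoch_plan(epoch, total_epochs=70)
    print(f"epoch {epoch:2d}: strength={s:.2f} mixup={ma:.2f} cutmix={ca:.2f} use_cutmix={cm}")
```

Printing this table for the actual `HP` and `CONFIG` values is a cheap way to confirm that the late-phase switches fire at the intended epochs.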


#Cell 24 — Integration Debug: One Forward/Backward Probe

In [24]:
# Build the model exactly as we intend to train/evaluate it.
# Assumes HybridEffNet(num_classes=7, classifier_dropout=0.30, use_cbam=True) from Cell 21.

model = HybridEffNet(num_classes=7, classifier_dropout=0.30, use_cbam=True)
model.train()

# Quick param count banner
total_params = sum(p.numel() for p in model.parameters()) / 1e6
print(f"[Probe] model params ≈ {total_params:.2f}M, device={model.device}")

# One mini-batch probe (sanity check on shapes, loss, and gradients)
xb, yb = next(iter(train_dl))
xb, yb = xb.to(model.device), yb.to(model.device)

# Deterministic normalization for the probe (no augmentation here)
xb = ((xb / 255.0) - 0.5) * 2.0

# Mixed precision probe if CUDA is available
model.zero_grad(set_to_none=True)
with torch.autocast(device_type='cuda', dtype=torch.float16, enabled=torch.cuda.is_available()):
    logits = model(xb)
    loss   = F.cross_entropy(logits, yb)

loss.backward()
head_grad_norm = model.head.weight.grad.norm().item()
print(f"[Probe] logits={tuple(logits.shape)}, loss={loss.item():.4f}, grad_head_norm={head_grad_norm:.3f}")


Downloading: "https://download.pytorch.org/models/efficientnet_b0_rwightman-7f5810bc.pth" to /root/.cache/torch/hub/checkpoints/efficientnet_b0_rwightman-7f5810bc.pth



[Probe] model params ≈ 4.43M, device=cuda
[Probe] logits=(192, 7), loss=2.1656, grad_head_norm=3.032


#Cell 25 — Launch Training (FER-only Path)

In [25]:
# Trains with the current switches and HP (keeps your dict name "HP").
# Uses EMA if CONFIG["USE_EMA"] is True; history and ema_obj are returned
# from fit_with_aug() and can be reused by evaluation cells.

if CONFIG["RUN_FER"] and not CONFIG["DRY_RUN"]:
    print("[Stage3] Starting training…")
    history, ema_obj = fit_with_aug(model, train_dl, valid_dl, HP, CONFIG)
else:
    print("[Stage3] Skipping training due to DRY_RUN or RUN_FER=False.")


[Stage3] Starting training…
[Epoch 001/70] train_loss=1.4763  val_loss=1.2285  val_acc=0.4506  lr=1.12e-04 *best*
[Epoch 002/70] train_loss=1.3703  val_loss=1.1638  val_acc=0.4867  lr=1.75e-04 *best*
[Epoch 003/70] train_loss=1.3286  val_loss=1.1130  val_acc=0.5229  lr=2.37e-04 *best*
[Epoch 004/70] train_loss=1.3095  val_loss=1.0740  val_acc=0.5462  lr=3.00e-04 *best*
[Epoch 005/70] train_loss=1.2607  val_loss=1.0314  val_acc=0.5633  lr=3.00e-04 *best*
[Epoch 006/70] train_loss=1.2619  val_loss=1.0262  val_acc=0.5782  lr=2.99e-04 *best*
[Epoch 007/70] train_loss=1.2251  val_loss=0.9901  val_acc=0.5967  lr=2.99e-04 *best*
[Epoch 008/70] train_loss=1.2309  val_loss=0.9807  val_acc=0.6066  lr=2.98e-04 *best*
[Epoch 009/70] train_loss=1.1971  val_loss=0.9548  val_acc=0.6149  lr=2.96e-04 *best*
[Epoch 010/70] train_loss=1.1863  val_loss=0.9519  val_acc=0.6227  lr=2.95e-04 *best*
[Epoch 011/70] train_loss=1.1606  val_loss=0.9327  val_acc=0.6333  lr=2.93e-04 *best*
[Epoch 012/70] train_loss=
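
To see how the early-stopping counter behaves on this run, the sketch below replays the `val_loss` values of epochs 1 to 11 from the log above through a copy of the `EarlyStopping` helper from Cell 22. The patience of 8 is the class default and may differ from `HP["PATIENCE"]`; the plateau values at the end are made up purely to show when the stop signal fires.

```python
class EarlyStopping:
    """Copy of the notebook's helper so the sketch runs on its own."""
    def __init__(self, patience=8, min_delta=1e-4):
        self.patience = int(patience)
        self.min_delta = float(min_delta)
        self.best = float("inf")
        self.bad = 0
    def step(self, val_loss):
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad = 0
            return False
        self.bad += 1
        return self.bad >= self.patience

# val_loss for epochs 1-11, copied from the training log
logged = [1.2285, 1.1638, 1.1130, 1.0740, 1.0314, 1.0262,
          0.9901, 0.9807, 0.9548, 0.9519, 0.9327]

stopper = EarlyStopping(patience=8)   # patience value is an assumption
flags = [stopper.step(v) for v in logged]
print("stopped during logged epochs:", any(flags),
      "| best:", stopper.best, "| bad:", stopper.bad)

# Hypothetical plateau (made-up values) to show when the stop signal fires
plateau = [0.9400] * 8
stop_at = next(i for i, v in enumerate(plateau, start=1) if stopper.step(v))
print("stop signal after", stop_at, "non-improving epochs")
```

Every logged epoch improves on the previous best by more than `min_delta`, so the counter stays at zero throughout the visible part of the log.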

In [36]:
# A) Does history say we ever reached ~0.70 on VAL?
best_hist_val = max([h.get('val_acc', float('-inf')) for h in history]) if 'history' in globals() else None
print("history_best_val =", best_hist_val)

# B) Are we holding the best checkpoint weights?
#    If you saved to SAVE_BEST_PATH earlier, load it now.
from pathlib import Path
if 'SAVE_BEST_PATH' in globals() and Path(SAVE_BEST_PATH).exists():
    ckpt = torch.load(SAVE_BEST_PATH, map_location='cpu')
    model.load_state_dict(ckpt['model_state'])
    # load_state_dict copies weights in place, so the model stays on its current device
    model.to(model.device)
    print("Reloaded model from", SAVE_BEST_PATH)
else:
    print("WARNING: best checkpoint not found. Make sure you didn't re-init the model.")


history_best_val = 0.7004484057426452


In [34]:
# Should print ~0.69–0.70 if everything else is consistent
val_acc_now = eval_top1(model, valid_dl)
print("val_acc_now =", val_acc_now)


RuntimeError: DataLoader worker (pid(s) 1223, 1224, 1225, 1226, 1227, 1228) exited unexpectedly

#25B

In [27]:
#25B

In [28]:
%pip install fvcore



===== Cell 28 — FLOPs (auto‑install fvcore) =====

===== Cell 29 — Efficiency report =====