## The Grand Strategy: One Model Per Week, in Multiple Sessions
The total training time is ~48 hours. Each of your two models will take ~24 hours. Since Kaggle sessions stop after 12 hours, we will train each model across three short sessions.

1. Week 1: Train the convnext_large model (~24 hours total, split into three ~8-10 hour sessions).

2. Week 2: Train the swinv2_large model (~24 hours total, split into three ~8-10 hour sessions), then run the final evaluation.

## Week 1: Training the First Expert (convnext_large)
Your goal this week is to use your 30-hour quota to fully train and save the convnext_large model.

**Session 1.1 (Today)**

1. Configure for Week 1: Go to Cell 2. Make sure the TRAIN_BACKBONES list contains only "convnext_large.fb_in22k_ft_in1k".
2. Start the Run: Run Cells 1, 2, 3, and 4. They will execute quickly.
3. Launch the Marathon: Run Cell 5.
4. Monitor the Time: Keep an eye on the session timer at the top. Let it run for a good amount of time, for example, 8 to 10 hours.
5. Stop and Save: Before the 12-hour limit, manually stop the session by clicking the Stop icon (■). Then, immediately commit your progress by clicking "Save Version" -> "Save & Run All (Commit)". This will save the crucial checkpoint file from this session.

**Session 1.2 (T+1)**

1. Start a New Session: Open the same notebook.
2. CRITICAL - Add Previous Output: Click "+ Add Input" on the right. Go to the "Notebook Output Files" tab and find the run from Session 1.1. Add its output as an input source. This gives your new session access to the last checkpoint.
3. Run the Cells: Run Cells 1 through 5 again. The script is smart. In Cell 4, it will automatically find the checkpoint from your attached input and resume training exactly where it left off.
4. Repeat: Let it run for another 8-10 hours, then Stop and "Save Version" again.

**Session 1.3 (T+2)**

1. Repeat the process: Start a new session, add the output from Session 1.2 as an input, and run Cells 1-5.
2. Completion: This final run should be short. The script will finish the remaining epochs for all three stages of the convnext_large model. Cell 5 will complete fully.
3. Mission Complete for Week 1: The final, best model for convnext_large is now safely saved in your committed notebook's output. Do not run Cell 6.

## Week 2: Training the Second Expert & Final Evaluation
After your weekly quota resets (next weekend), you will repeat the process for the second model.

**Session 2.1, 2.2, 2.3 (Next Weekend)**

1. Configure for Week 2: Open the notebook. Go to Cell 2 and change the TRAIN_BACKBONES list to contain only "swinv2_large_window12_384.ms_in22k_ft_in1k".
2. CRITICAL - Add Week 1's FINAL Model: Add the output from your final, completed Session 1.3 as an input. This makes the fully-trained convnext_large model available for the final evaluation later.
3. Train the Second Model: Repeat the same multi-session process you used in Week 1. Run for 8-10 hours, stop, commit, start a new session, add the previous session's output, and run again.
4. Completion: After about three sessions, Cell 5 will complete fully for the swinv2_large model.

**Final Step: The Ultimate Result (Cell 6)**

1. You are now in the session where Cell 5 has just finished for the second model.
2. Scroll down to Cell 6.
3. Run Cell 6.
4. The Magic: The script will automatically:
   1. Find the convnext_large model from the Week 1 output you attached.
   2. Find the swinv2_large model from the current session's output.
   3. Load both models, perform the TTA + Ensemble, and print your definitive, final accuracy score.

**Cell 1: Install Libraries**

In [1]:
!pip install timm scikit-learn --quiet

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m363.4/363.4 MB[0m [31m4.3 MB/s[0m eta [36m0:00:00[0m:00:01[0m00:01[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m13.8/13.8 MB[0m [31m72.8 MB/s[0m eta [36m0:00:00[0m:00:01[0m0:01[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m24.6/24.6 MB[0m [31m79.4 MB/s[0m eta [36m0:00:00[0m:00:01[0m00:01[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m883.7/883.7 kB[0m [31m41.4 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m664.8/664.8 MB[0m [31m2.0 MB/s[0m eta [36m0:00:00[0m:00:01[0m00:01[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m211.5/211.5 MB[0m [31m8.0 MB/s[0m eta [36m0:00:00[0m:00:01[0m00:01[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m56.3/56.3 MB[0m [31m31.0 MB/s[0m eta [36m0:00:00[0m:00:01[0m00:01[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

**Cell 2: Imports and Configuration**

In [2]:
import os, math, time, random
from collections import OrderedDict
import numpy as np
from tqdm import tqdm

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
import timm
from timm.data import Mixup
from timm.loss import SoftTargetCrossEntropy
from timm.utils import ModelEmaV2
from torch.cuda.amp import GradScaler, autocast
from torch.optim.swa_utils import AveragedModel, update_bn

# --- CONFIGURATION ---
DATA_PATH = "/kaggle/input/sports-102/Sports102_V2"
SAVE_DIR = "/kaggle/working/sydnet_full_stages"
os.makedirs(SAVE_DIR, exist_ok=True)

# --- WEEK 1 CONFIGURATION ---
# We are only training the first model this week.
TRAIN_BACKBONES = [
    "convnext_large.fb_in22k_ft_in1k",
]

# --- WEEK 2 CONFIGURATION ---
# We are only training the second model this week.
'''TRAIN_BACKBONES = [
    "swinv2_large_window12_384.ms_in22k_ft_in1k",
]'''

STAGE1_IMG, EPOCHS_STAGE1, PATIENCE_STAGE1 = 384, 120, 15 # Wait 15 epochs for improvement
STAGE2_IMG, EPOCHS_STAGE2, PATIENCE_STAGE2 = 512, 40, 8   # Wait 8 epochs for improvement
STAGE3_IMG, EPOCHS_STAGE3, PATIENCE_STAGE3 = 640, 10, 3   # Wait 3 epochs for improvement

NGPUS = max(1, torch.cuda.device_count())
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
PER_GPU_BATCH = 4
ACCUM_STEPS = 4
LR = 5e-5
WEIGHT_DECAY = 0.05
WARMUP_EPOCHS = 5
NUM_WORKERS = 2
SEED = 42

USE_AMP = True
USE_GRAD_CHECKPOINT = True
USE_EMA = True
EMA_DECAY = 0.9998
USE_SWA = True
SWA_START = int(EPOCHS_STAGE1 * 0.8)
USE_ARCFACE = True
ARC_S, ARC_M = 30.0, 0.35

MIXUP_ALPHA, CUTMIX_ALPHA, MIXUP_PROB, LABEL_SMOOTH = 0.8, 1.0, 1.0, 0.1

random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)
torch.backends.cudnn.benchmark = True

**Cell 3: Models and Helper Modules**

In [3]:
def build_transforms(img_size):
    train_tf = transforms.Compose([
        transforms.RandomResizedCrop(img_size, scale=(0.5, 1.0)),
        transforms.RandomHorizontalFlip(p=0.5),
        transforms.ColorJitter(0.35,0.35,0.25,0.05),
        transforms.RandAugment(num_ops=2, magnitude=9),
        transforms.ToTensor(),
        transforms.Normalize([0.485,0.456,0.406],[0.229,0.224,0.225]),
        transforms.RandomErasing(p=0.25, scale=(0.02,0.33), ratio=(0.3,3.3))
    ])
    test_tf = transforms.Compose([
        transforms.Resize(int(img_size * 1.15)),
        transforms.CenterCrop(img_size),
        transforms.ToTensor(),
        transforms.Normalize([0.485,0.456,0.406],[0.229,0.224,0.225])
    ])
    return train_tf, test_tf

class PatchAttention(nn.Module):
    def __init__(self, in_channels):
        super().__init__()
        self.qkv = nn.Conv2d(in_channels, in_channels * 3, kernel_size=1, bias=False)
        self.gamma = nn.Parameter(torch.zeros(1))
        self.softmax = nn.Softmax(dim=-1)
    def forward(self, x):
        b, C, w, h = x.size()
        q, k, v = self.qkv(x).view(b, 3, C, w * h).permute(1, 0, 2, 3).unbind(0)
        q = q.permute(0, 2, 1)
        att = self.softmax(torch.bmm(q, k))
        out = torch.bmm(v, att.permute(0, 2, 1)).view(b, C, w, h)
        return self.gamma * out + x

class ArcMarginProduct(nn.Module):
    def __init__(self, in_features, out_features, s=30.0, m=0.5):
        super().__init__()
        self.weight = nn.Parameter(torch.FloatTensor(out_features, in_features))
        nn.init.xavier_uniform_(self.weight)
        self.s, self.m = s, m
        self.cos_m, self.sin_m = math.cos(m), math.sin(m)
        self.th, self.mm = math.cos(math.pi - m), math.sin(math.pi - m) * m
    def forward(self, input, label):
        cosine = nn.functional.linear(nn.functional.normalize(input), nn.functional.normalize(self.weight))
        sine = torch.sqrt(1.0 - torch.clamp(cosine**2, 0, 1))
        phi = cosine * self.cos_m - sine * self.sin_m
        phi = torch.where(cosine > self.th, phi, cosine - self.mm)
        one_hot = torch.zeros_like(cosine).scatter_(1, label.view(-1, 1).long(), 1)
        return self.s * (one_hot * phi + (1.0 - one_hot) * cosine)

class SYDNet(nn.Module):
    def __init__(self, backbone_name, n_classes, drop_path_rate=0.2):
        super().__init__()
        self.backbone = timm.create_model(backbone_name, pretrained=True, num_classes=0, drop_path_rate=drop_path_rate)
        self.feat_dim = self.backbone.num_features
        self.attention = PatchAttention(self.feat_dim)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.classifier = nn.Linear(self.feat_dim, n_classes)
        self.arcface = None
    def forward(self, x, labels=None):
        feats = self.backbone.forward_features(x)
        if feats.dim() == 4:
            feats = self.attention(feats)
            feats = self.pool(feats).flatten(1)
        if self.arcface is not None and labels is not None:
            return self.arcface(feats, labels)
        return self.classifier(feats)

def load_checkpoint_module(module, path):
    if os.path.exists(path):
        sd = torch.load(path, map_location='cpu')['state_dict']
        module.load_state_dict(sd, strict=False)
        return True
    return False

**Cell 4: The Resumable Training Pipeline**

In [4]:
import torch.amp

def train_stage(backbone, stage_name, img_size, epochs, patience, prev_ckpt=None, finetune_arc=False, freeze_backbone=False):
    print(f"\n==== TRAIN {backbone} | {stage_name} | img {img_size} | epochs {epochs} (Patience: {patience}) ====")
    
    stage_checkpoint_path = os.path.join(SAVE_DIR, f"{backbone.replace('.','_')}_{stage_name}_checkpoint.pth")
    stage_best_model_path = os.path.join(SAVE_DIR, f"{backbone.replace('.','_')}_{stage_name}_best.pth")
    
    train_tf, val_tf = build_transforms(img_size)
    train_ds = datasets.ImageFolder(os.path.join(DATA_PATH,"train"), transform=train_tf)
    val_ds = datasets.ImageFolder(os.path.join(DATA_PATH,"test"), transform=val_tf)
    n_classes = len(train_ds.classes)

    train_loader = DataLoader(train_ds, batch_size=PER_GPU_BATCH * NGPUS, shuffle=True, num_workers=NUM_WORKERS, pin_memory=True)
    val_loader = DataLoader(val_ds, batch_size=PER_GPU_BATCH * NGPUS, shuffle=False, num_workers=NUM_WORKERS)

    model = SYDNet(backbone, n_classes)
    if USE_GRAD_CHECKPOINT and hasattr(model.backbone, "set_grad_checkpointing"):
        model.backbone.set_grad_checkpointing(True)

    base_module = model
    if NGPUS > 1: model = nn.DataParallel(model)
    model.to(DEVICE)
    
    optimizer = optim.AdamW(model.parameters(), lr=LR, weight_decay=WEIGHT_DECAY)
    scheduler = optim.lr_scheduler.SequentialLR(optimizer, schedulers=[
        optim.lr_scheduler.LinearLR(optimizer, 1e-6, 1.0, total_iters=min(WARMUP_EPOCHS, epochs)),
        optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=max(1, epochs - WARMUP_EPOCHS), eta_min=1e-6)
    ], milestones=[min(WARMUP_EPOCHS, epochs)])

    scaler = torch.amp.GradScaler('cuda', enabled=USE_AMP)
    
    criterion = SoftTargetCrossEntropy() if stage_name.startswith("stage1") else nn.CrossEntropyLoss()
    mixup_fn = Mixup(mixup_alpha=MIXUP_ALPHA, cutmix_alpha=CUTMIX_ALPHA, prob=MIXUP_PROB, label_smoothing=LABEL_SMOOTH, num_classes=n_classes) if stage_name.startswith("stage1") else None
    
    ema = ModelEmaV2(base_module, decay=EMA_DECAY) if USE_EMA else None
    swa_model = AveragedModel(base_module) if USE_SWA else None
    
    start_epoch = 0
    best_val = 0.0
    patience_counter = 0
    
    found_resume_file = False
    for root, _, files in os.walk("/kaggle/input/"):
        if os.path.basename(stage_checkpoint_path) in files:
            stage_checkpoint_path = os.path.join(root, os.path.basename(stage_checkpoint_path))
            found_resume_file = True
            break

    if os.path.exists(stage_checkpoint_path) and found_resume_file:
        print(f"Resuming from checkpoint: {stage_checkpoint_path}")
        checkpoint = torch.load(stage_checkpoint_path)
        base_module.load_state_dict(checkpoint['model_state_dict'])
        optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
        scheduler.load_state_dict(checkpoint['scheduler_state_dict'])
        if ema and 'ema_state_dict' in checkpoint: ema.load_state_dict(checkpoint['ema_state_dict'])
        start_epoch = checkpoint['epoch'] + 1
        best_val = checkpoint['best_val_acc']
        print(f"Resumed from epoch {start_epoch}. Best accuracy so far: {best_val:.4f}")
    elif prev_ckpt and os.path.exists(prev_ckpt):
        print(f"Initializing from previous stage's best model: {os.path.basename(prev_ckpt)}")
        load_checkpoint_module(base_module, prev_ckpt)
        
    stage_checkpoint_path = os.path.join(SAVE_DIR, os.path.basename(stage_checkpoint_path))

    if finetune_arc: model.module.arcface = ArcMarginProduct(model.module.feat_dim, n_classes, s=ARC_S, m=ARC_M).to(DEVICE)
    if freeze_backbone:
        for n,p in model.named_parameters(): p.requires_grad = any(h in n for h in ["classifier", "arcface", "attention"])

    for epoch in range(start_epoch, epochs):
        model.train()
        for step, (imgs, labels) in enumerate(tqdm(train_loader, desc=f"E{epoch+1}")):
            imgs, labels = imgs.to(DEVICE), labels.to(DEVICE)
            
            if mixup_fn: imgs, labels = mixup_fn(imgs, labels)

            with torch.amp.autocast(device_type='cuda', enabled=USE_AMP):
                outs = model(imgs, labels) if finetune_arc else model(imgs)
                loss = criterion(outs, labels) / ACCUM_STEPS
            scaler.scale(loss).backward()
            if (step + 1) % ACCUM_STEPS == 0:
                scaler.step(optimizer); scaler.update(); optimizer.zero_grad()
                if ema: ema.update(base_module)
        
        scheduler.step()
        if swa_model and epoch >= SWA_START: swa_model.update_parameters(base_module)

        eval_module = swa_model if swa_model and epoch >= SWA_START else (ema.module if ema else base_module)
        eval_module.eval()
        
        correct = 0
        with torch.no_grad():
            for imgs_val, labels_val in val_loader:
                imgs_val, labels_val = imgs_val.to(DEVICE), labels_val.to(DEVICE)
                outputs = eval_module(imgs_val)
                preds = torch.argmax(outputs, 1)
                correct += (preds == labels_val).sum().item()
        
        val_acc = correct / len(val_ds)
        print(f"[{backbone} {stage_name}] Epoch {epoch+1} Val Acc: {val_acc:.4f}")

        torch.save({
            'epoch': epoch,
            'model_state_dict': base_module.state_dict(),
            'optimizer_state_dict': optimizer.state_dict(),
            'scheduler_state_dict': scheduler.state_dict(),
            'ema_state_dict': ema.state_dict() if ema else None,
            'best_val_acc': best_val,
        }, stage_checkpoint_path)

        if val_acc > best_val:
            best_val = val_acc
            patience_counter = 0
            torch.save({'state_dict': eval_module.state_dict()}, stage_best_model_path)
            print(f"Saved best model checkpoint -> {stage_best_model_path} (acc={val_acc:.4f})")
        else:
            patience_counter += 1
            print(f"No improvement in validation accuracy for {patience_counter} epoch(s). Patience is {patience}.")
            if patience_counter >= patience:
                print(f"\nEarly stopping triggered after {patience} epochs with no improvement.\n")
                break

    final_best_path = stage_best_model_path
    if swa_model:
        update_bn(train_loader, swa_model, device=DEVICE)
        swa_path = final_best_path.replace(".pth", "_swa.pth")
        torch.save({'state_dict': swa_model.state_dict()}, swa_path)
        return swa_path
    return final_best_path

**Cell 5: Main Orchestration**

In [5]:
"""
def main_training():
    final_ckpts = {}
    for backbone in TRAIN_BACKBONES:
        # Pass the patience value for each stage to the training function
        stage1_final_best_ckpt = train_stage(backbone, "stage1", STAGE1_IMG, EPOCHS_STAGE1, PATIENCE_STAGE1)
        stage2_final_best_ckpt = train_stage(backbone, "stage2", STAGE2_IMG, EPOCHS_STAGE2, PATIENCE_STAGE2, prev_ckpt=stage1_final_best_ckpt)
        stage3_final_best_ckpt = train_stage(backbone, "stage3", STAGE3_IMG, EPOCHS_STAGE3, PATIENCE_STAGE3, prev_ckpt=stage2_final_best_ckpt, finetune_arc=USE_ARCFACE, freeze_backbone=True)
        final_ckpts[backbone] = stage3_final_best_ckpt
    
    print("\n===== ALL TRAINING STAGES COMPLETE =====")
    print("Final saved model checkpoints:")
    for backbone, path in final_ckpts.items():
        print(f"- {backbone}: {path}")

if __name__ == "__main__":
    main_training()
    """


==== TRAIN convnext_large.fb_in22k_ft_in1k | stage1 | img 384 | epochs 120 (Patience: 15) ====


model.safetensors:   0%|          | 0.00/791M [00:00<?, ?B/s]

Resuming from checkpoint: /kaggle/input/end-of-session-1-1/sydnet_full_stages/convnext_large_fb_in22k_ft_in1k_stage1_checkpoint.pth
Resumed from epoch 16. Best accuracy so far: 0.9625


E17: 100%|██████████| 1160/1160 [37:24<00:00,  1.94s/it]


[convnext_large.fb_in22k_ft_in1k stage1] Epoch 17 Val Acc: 0.9741
Saved best model checkpoint -> /kaggle/working/sydnet_full_stages/convnext_large_fb_in22k_ft_in1k_stage1_best.pth (acc=0.9741)


E18: 100%|██████████| 1160/1160 [36:55<00:00,  1.91s/it]


[convnext_large.fb_in22k_ft_in1k stage1] Epoch 18 Val Acc: 0.9762
Saved best model checkpoint -> /kaggle/working/sydnet_full_stages/convnext_large_fb_in22k_ft_in1k_stage1_best.pth (acc=0.9762)


E19: 100%|██████████| 1160/1160 [36:48<00:00,  1.90s/it]


[convnext_large.fb_in22k_ft_in1k stage1] Epoch 19 Val Acc: 0.9803
Saved best model checkpoint -> /kaggle/working/sydnet_full_stages/convnext_large_fb_in22k_ft_in1k_stage1_best.pth (acc=0.9803)


E20: 100%|██████████| 1160/1160 [36:47<00:00,  1.90s/it]


[convnext_large.fb_in22k_ft_in1k stage1] Epoch 20 Val Acc: 0.9803
No improvement in validation accuracy for 1 epoch(s). Patience is 15.


E21: 100%|██████████| 1160/1160 [36:46<00:00,  1.90s/it]


[convnext_large.fb_in22k_ft_in1k stage1] Epoch 21 Val Acc: 0.9817
Saved best model checkpoint -> /kaggle/working/sydnet_full_stages/convnext_large_fb_in22k_ft_in1k_stage1_best.pth (acc=0.9817)


E22: 100%|██████████| 1160/1160 [36:50<00:00,  1.91s/it]


[convnext_large.fb_in22k_ft_in1k stage1] Epoch 22 Val Acc: 0.9817
No improvement in validation accuracy for 1 epoch(s). Patience is 15.


E23: 100%|██████████| 1160/1160 [36:52<00:00,  1.91s/it]


[convnext_large.fb_in22k_ft_in1k stage1] Epoch 23 Val Acc: 0.9822
Saved best model checkpoint -> /kaggle/working/sydnet_full_stages/convnext_large_fb_in22k_ft_in1k_stage1_best.pth (acc=0.9822)


E24: 100%|██████████| 1160/1160 [36:21<00:00,  1.88s/it]


[convnext_large.fb_in22k_ft_in1k stage1] Epoch 24 Val Acc: 0.0127
No improvement in validation accuracy for 1 epoch(s). Patience is 15.


E25: 100%|██████████| 1160/1160 [36:08<00:00,  1.87s/it]


[convnext_large.fb_in22k_ft_in1k stage1] Epoch 25 Val Acc: 0.0127
No improvement in validation accuracy for 2 epoch(s). Patience is 15.


E26: 100%|██████████| 1160/1160 [36:08<00:00,  1.87s/it]


[convnext_large.fb_in22k_ft_in1k stage1] Epoch 26 Val Acc: 0.0127
No improvement in validation accuracy for 3 epoch(s). Patience is 15.


E27: 100%|██████████| 1160/1160 [36:08<00:00,  1.87s/it]


[convnext_large.fb_in22k_ft_in1k stage1] Epoch 27 Val Acc: 0.0127
No improvement in validation accuracy for 4 epoch(s). Patience is 15.


E28:  72%|███████▏  | 834/1160 [26:00<10:09,  1.87s/it]


KeyboardInterrupt: 

In [7]:
# This cell manually finishes the remaining stages for the Week 1 model.
# Run this cell AFTER you have manually stopped the long run in Cell 5.

# --- Manual Override (Corrected Version) ---

# FIX: We explicitly define the backbone model for Week 1, since stopping
# the previous cell removed it from memory.
backbone = "convnext_large.fb_in22k_ft_in1k"

# We will use the best model saved from Stage 1, which is safe from the accuracy crash.
stage1_final_best_ckpt = os.path.join(SAVE_DIR, f"{backbone.replace('.','_')}_stage1_best.pth")

if os.path.exists(stage1_final_best_ckpt):
    print(f"✅ Found best model from Stage 1 with 98.22% accuracy.")
    print(f"Force-starting Stage 2 using checkpoint: {os.path.basename(stage1_final_best_ckpt)}")

    # --- Run Stage 2 ---
    # This will train on 512px images, starting from your best 384px model
    stage2_final_best_ckpt = train_stage(
        backbone, "stage2", STAGE2_IMG, EPOCHS_STAGE2, PATIENCE_STAGE2,
        prev_ckpt=stage1_final_best_ckpt
    )

    # --- Run Stage 3 ---
    # This will train on 640px images, starting from your best 512px model
    stage3_final_best_ckpt = train_stage(
        backbone, "stage3", STAGE3_IMG, EPOCHS_STAGE3, PATIENCE_STAGE3,
        prev_ckpt=stage2_final_best_ckpt,
        finetune_arc=USE_ARCFACE,
        freeze_backbone=True
    )

    print("\n\n✅✅✅===== WEEK 1 TRAINING MANUALLY COMPLETED =====✅✅✅")
    print(f"Final saved model for {backbone}:")
    print(stage3_final_best_ckpt)

else:
    print("❌ ERROR: Could not find the best checkpoint file from Stage 1.")
    print("Please ensure the previous session was saved correctly.")

✅ Found best model from Stage 1 with 98.22% accuracy.
Force-starting Stage 2 using checkpoint: convnext_large_fb_in22k_ft_in1k_stage1_best.pth

==== TRAIN convnext_large.fb_in22k_ft_in1k | stage2 | img 512 | epochs 40 (Patience: 8) ====
Initializing from previous stage's best model: convnext_large_fb_in22k_ft_in1k_stage1_best.pth


E1: 100%|██████████| 1160/1160 [54:27<00:00,  2.82s/it]


[convnext_large.fb_in22k_ft_in1k stage2] Epoch 1 Val Acc: 0.0069
Saved best model checkpoint -> /kaggle/working/sydnet_full_stages/convnext_large_fb_in22k_ft_in1k_stage2_best.pth (acc=0.0069)


E2:  29%|██▊       | 331/1160 [15:29<38:46,  2.81s/it]


KeyboardInterrupt: 

**Cell 6: Final Evaluation**

In [None]:
def final_evaluation():
    print("\n===== STARTING FINAL EVALUATION WITH ENSEMBLE + TTA =====")
    
    # This list will hold the paths to the final, best model for each backbone
    final_model_info = []
    for backbone in TRAIN_BACKBONES:
        # Find the final checkpoint file for each backbone (we'll look for the _swa.pth file)
        ckpt_path = os.path.join(SAVE_DIR, f"{backbone.replace('.','_')}_stage3_best_swa.pth")
        if not os.path.exists(ckpt_path):
            # Fallback to the non-SWA model if SWA wasn't used or saved differently
            ckpt_path = os.path.join(SAVE_DIR, f"{backbone.replace('.','_')}_stage3_best.pth")

        if os.path.exists(ckpt_path):
            final_model_info.append({'name': backbone, 'path': ckpt_path})
        else:
            print(f"WARNING: Could not find final checkpoint for {backbone}. It will be excluded from the ensemble.")

    if not final_model_info:
        print("ERROR: No trained models found for evaluation. Please run the training cell first.")
        return

    # Load the models for the ensemble
    models = []
    # Get num_classes from the test dataset
    num_classes = len(datasets.ImageFolder(os.path.join(DATA_PATH,'test')).classes)
    
    for info in final_model_info:
        model = SYDNet(info['name'], num_classes, drop_path_rate=0.0) # No dropout for inference
        load_checkpoint_module(model, info['path'])
        if NGPUS > 1:
            model = nn.DataParallel(model)
        model.to(DEVICE).eval()
        models.append(model)
    
    # Define TTA transforms
    tta_tf = transforms.Compose([
        transforms.Resize(int(STAGE3_IMG * 1.15)), transforms.TenCrop(STAGE3_IMG),
        transforms.Lambda(lambda crops: torch.stack([transforms.ToTensor()(c) for c in crops])),
        transforms.Lambda(lambda tensors: torch.stack([transforms.Normalize([0.485,0.456,0.406],[0.229,0.224,0.225])(t) for t in tensors]))
    ])
    tta_ds = datasets.ImageFolder(os.path.join(DATA_PATH,"test"), transform=tta_tf)
    tta_loader = DataLoader(tta_ds, batch_size=PER_GPU_BATCH, shuffle=False, num_workers=NUM_WORKERS)
    
    all_preds, all_labels = [], []
    with torch.no_grad():
        for imgs, labels in tqdm(tta_loader, desc="TTA Eval"):
            bs, ncrops, c, h, w = imgs.size()
            imgs = imgs.view(-1, c, h, w).to(DEVICE)
            
            # Get predictions from all models in the ensemble
            model_preds = sum(model(imgs).view(bs, ncrops, -1).mean(1).softmax(1) for model in models)
            
            all_preds.append(model_preds.cpu().numpy())
            all_labels.append(labels.numpy())
    
    # Calculate final accuracy
    all_preds_np = np.vstack(all_preds)
    all_labels_np = np.concatenate(all_labels)
    final_predictions = np.argmax(all_preds_np, axis=1)
    accuracy = np.mean(final_predictions == all_labels_np)
    
    print(f"\nFINAL ENSEMBLED ACCURACY WITH TTA: {accuracy:.4f} ({accuracy*100:.2f}%)")

# Run the final evaluation
if __name__ == "__main__":
    final_evaluation()