# 🧬 PanDerm Tri-Modal Fusion - Training on Google Colab A100

This notebook trains the **Tri-Modal PanDerm Fusion** model for skin lesion classification using the MILK10k dataset on Google Colab with an A100 GPU.

## 📋 Overview

- **Model**: Tri-Modal PanDerm Fusion (DermLIP ViT-B/16 dual backbone)
- **Architecture**:
  - Dual DermLIP encoders (clinical + dermoscopic)
  - MONET concept embedding tokens
  - Tri-Modal Cross-Attention Transformer (TMCT) fusion blocks with **Stochastic Depth**
  - Global context pooling with auxiliary heads
- **Dataset**: MILK10k (4,192 train + 1,048 validation)
- **Task**: Multi-label classification (11 diagnosis categories)
- **Loss**: Compound Loss (Focal + Soft F1) with **Label Smoothing**
- **Training**: Mixed precision (bf16) + Layer-wise LR decay + **EMA**
- **GPU**: Optimized for A100 40GB
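
The compound loss named in the list above can be sketched without any framework. This is a minimal illustration, not the notebook's `train_panderm` implementation: the focal exponent `gamma=2.0`, the choice to smooth only the focal targets, and all function names are assumptions, while the 0.4/0.6 weights and the 0.1 smoothing are the values quoted in this notebook's v2 feature list.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def smooth_labels(targets, smoothing=0.1):
    """Binary label smoothing: 1 -> 1 - s/2, 0 -> s/2."""
    return [t * (1.0 - smoothing) + 0.5 * smoothing for t in targets]


def focal_loss(logits, targets, gamma=2.0):
    """Mean binary focal loss over the classes of one sample."""
    total = 0.0
    for z, t in zip(logits, targets):
        p = min(max(sigmoid(z), 1e-7), 1.0 - 1e-7)
        p_t = p * t + (1.0 - p) * (1.0 - t)  # probability assigned to the target
        bce = -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
        total += (1.0 - p_t) ** gamma * bce  # down-weight easy examples
    return total / len(logits)


def soft_f1_loss(batch_logits, batch_targets, eps=1e-7):
    """1 - macro soft F1, with probabilities in place of hard predictions."""
    num_classes = len(batch_logits[0])
    f1_sum = 0.0
    for c in range(num_classes):
        tp = fp = fn = 0.0
        for logits, targets in zip(batch_logits, batch_targets):
            p, t = sigmoid(logits[c]), targets[c]
            tp += p * t
            fp += p * (1.0 - t)
            fn += (1.0 - p) * t
        f1_sum += 2.0 * tp / (2.0 * tp + fp + fn + eps)
    return 1.0 - f1_sum / num_classes


def compound_loss(batch_logits, batch_targets,
                  focal_weight=0.4, soft_f1_weight=0.6, smoothing=0.1):
    # Assumption: smoothing is applied to the focal term only.
    smoothed = [smooth_labels(t, smoothing) for t in batch_targets]
    focal = sum(focal_loss(z, t) for z, t in zip(batch_logits, smoothed))
    focal /= len(batch_logits)
    return focal_weight * focal + soft_f1_weight * soft_f1_loss(batch_logits, batch_targets)
```

The soft F1 term is computed per class over the whole batch, so rare classes contribute as much as common ones, which is the usual motivation for pairing it with a per-sample focal term.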

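## ✨ Enhanced Training Features (v2)

- **Regularization**: Dropout 0.3, DropPath 0.1, Attention Dropout 0.1
- **Augmentation**: Mixup (α=0.8), CutMix (α=1.0), Enhanced CoarseDropout
- **EMA**: Exponential Moving Average with warmup for better generalization
- **Scheduler**: CosineAnnealingWarmRestarts (T_0=10, T_mult=2) with 5-epoch warmup
- **Loss**: Label smoothing 0.1, adjusted weights (focal=0.4, soft_f1=0.6)
- **Patience**: Early stopping after 20 epochs without improvement

**Note**: The configuration printed in Section 5 (Load Configuration) of this run reports `dropout: 0.1`, `scheduler: cosine_warmup`, `warmup_epochs: 3`, and `early_stopping_patience: 12`, which differ from the v2 values listed here. Check `src/config.py` to confirm which settings are actually in effect.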
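
To make "EMA with warmup" concrete, here is a small framework-free sketch. The warmup rule `min(max_decay, (1 + step) / (10 + step))` is a commonly used schedule and is an assumption here, as are the function names; only the `0.9999` ceiling corresponds to the `ema_decay` value in this notebook's configuration.

```python
def ema_decay_at(step, max_decay=0.9999):
    """Effective decay at a given update step: small early on, capped at max_decay."""
    return min(max_decay, (1 + step) / (10 + step))


def ema_update(ema_params, model_params, step, max_decay=0.9999):
    """Return the new EMA parameters after one optimizer step."""
    d = ema_decay_at(step, max_decay)
    return [d * e + (1.0 - d) * p for e, p in zip(ema_params, model_params)]
```

With this rule the averaged weights track the model closely during the first steps and only later settle into the slow average, which avoids dragging along the randomly initialized values.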

## 🚀 Before Running

1. **Set Runtime to GPU**: Runtime → Change runtime type → GPU (A100 recommended)
2. **Upload to Google Drive** (under `MyDrive/MILK10k_Project/`, the `DRIVE_ROOT` used below):
   - `preprocessed_data/` folder (train_data.csv, val_data.csv, class_weights.json)
   - `src/` folder (all Python source files)
   - `dataset/MILK10k_Training_Input/` folder (dataset images)
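
A quick pre-flight check can confirm the Drive layout before any training cell runs. The sketch below is illustrative: the helper name `missing_items` is hypothetical, and the relative paths are the ones the setup cells in this notebook read from `DRIVE_ROOT`.

```python
from pathlib import Path

# Relative paths that the setup cells expect under DRIVE_ROOT.
REQUIRED = [
    "src",
    "preprocessed_data/train_data.csv",
    "preprocessed_data/val_data.csv",
    "preprocessed_data/class_weights.json",
    "dataset/MILK10k_Training_Input",
]


def missing_items(drive_root):
    """Return the required paths that do not exist under drive_root."""
    root = Path(drive_root)
    return [rel for rel in REQUIRED if not (root / rel).exists()]


# Lists every required item that is absent (all of them if Drive is not mounted yet).
for rel in missing_items("/content/drive/MyDrive/MILK10k_Project"):
    print(f"Missing: {rel}")
```

Running it right after mounting Drive surfaces upload mistakes in seconds rather than partway through the setup cells.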

---


## 1️⃣ Set Up Google Colab Environment

Check GPU availability and system specifications.


In [1]:
# Check GPU availability
!nvidia-smi

# Check CUDA version
!nvcc --version

# Check disk space
!df -h | grep -E 'Filesystem|/content'

# Check RAM
!free -h


Wed Dec 10 12:55:58 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA A100-SXM4-40GB          Off |   00000000:00:04.0 Off |                    0 |
| N/A   33C    P0             52W /  400W |       0MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                

## 2️⃣ Mount Google Drive

Mount Google Drive to access datasets and save results.


In [2]:
from google.colab import drive
drive.mount('/content/drive')

# Set up project paths (adjust these to match your Google Drive structure)
import os
DRIVE_ROOT = '/content/drive/MyDrive/MILK10k_Project'
os.makedirs(DRIVE_ROOT, exist_ok=True)

print("✅ Google Drive mounted!")
print(f"📁 Project root: {DRIVE_ROOT}")


Mounted at /content/drive
✅ Google Drive mounted!
📁 Project root: /content/drive/MyDrive/MILK10k_Project


## 3️⃣ Install Dependencies

Install timm, Albumentations, TensorBoard, scikit-learn, and OpenCLIP (for DermLIP), then verify the PyTorch and CUDA installation.


In [3]:
# Install required packages for PanDerm
!pip install -q timm albumentations tensorboard scikit-learn
!pip install -q open_clip_torch  # For DermLIP encoder

# Verify installations
import torch
import torchvision
import timm
import albumentations as A

print(f"✅ PyTorch: {torch.__version__}")
print(f"✅ TorchVision: {torchvision.__version__}")
print(f"✅ Timm: {timm.__version__}")
print(f"✅ Albumentations: {A.__version__}")

# Check OpenCLIP
try:
    import open_clip
    print(f"✅ OpenCLIP: {open_clip.__version__}")
except ImportError:
    print("⚠️ OpenCLIP not installed. Will use timm fallback.")

print(f"✅ CUDA Available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"✅ GPU: {torch.cuda.get_device_name(0)}")
    print(f"✅ CUDA Version: {torch.version.cuda}")
    print(f"✅ GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")


✅ PyTorch: 2.9.0+cu126
✅ TorchVision: 0.24.0+cu126
✅ Timm: 1.0.22
✅ Albumentations: 2.0.8
✅ OpenCLIP: 3.2.0
✅ CUDA Available: True
✅ GPU: NVIDIA A100-SXM4-40GB
✅ CUDA Version: 12.6
✅ GPU Memor

## 4️⃣ Set Up Project Files

Copy the source code and preprocessed data from Google Drive to the Colab workspace, and symlink the image dataset instead of copying it.


In [4]:
import shutil
from pathlib import Path

# Create working directory
WORK_DIR = '/content/MILK10k_PanDerm'
os.makedirs(WORK_DIR, exist_ok=True)
%cd {WORK_DIR}

# Copy source code from Drive
SRC_DRIVE = f'{DRIVE_ROOT}/src'
if os.path.exists(SRC_DRIVE):
    if os.path.exists(f'{WORK_DIR}/src'):
        shutil.rmtree(f'{WORK_DIR}/src')
    shutil.copytree(SRC_DRIVE, f'{WORK_DIR}/src')
    print("✅ Copied src/ from Google Drive")
else:
    print("⚠️ src/ not found in Google Drive. Please upload it first!")
    print(f"   Expected path: {SRC_DRIVE}")

# Copy preprocessed data
PREPROCESSED_DRIVE = f'{DRIVE_ROOT}/preprocessed_data'
if os.path.exists(PREPROCESSED_DRIVE):
    if os.path.exists(f'{WORK_DIR}/preprocessed_data'):
        shutil.rmtree(f'{WORK_DIR}/preprocessed_data')
    shutil.copytree(PREPROCESSED_DRIVE, f'{WORK_DIR}/preprocessed_data')
    print("✅ Copied preprocessed_data/ from Google Drive")
else:
    print("⚠️ preprocessed_data/ not found. Please upload it first!")
    print(f"   Expected path: {PREPROCESSED_DRIVE}")

# Use dataset directly from Google Drive (symlink to avoid copying large files)
DATASET_DRIVE = f'{DRIVE_ROOT}/dataset/MILK10k_Training_Input'
if os.path.exists(DATASET_DRIVE):
    os.makedirs(f'{WORK_DIR}/dataset', exist_ok=True)
    symlink_path = f'{WORK_DIR}/dataset/MILK10k_Training_Input'
    # Remove any stale link, file, or directory before creating the symlink
    if os.path.islink(symlink_path) or os.path.isfile(symlink_path):
        os.remove(symlink_path)
    elif os.path.isdir(symlink_path):
        shutil.rmtree(symlink_path)
    os.symlink(DATASET_DRIVE, symlink_path)
    print("✅ Linked dataset from Google Drive (no copy needed)")
else:
    print(f"⚠️ Dataset not found at: {DATASET_DRIVE}")
    print("   Please upload MILK10k_Training_Input/ to your Google Drive!")

# Create necessary directories
os.makedirs('models', exist_ok=True)
os.makedirs('logs', exist_ok=True)
os.makedirs('results', exist_ok=True)

print(f"\n📁 Working directory: {WORK_DIR}")
print("📂 Contents:")
!ls -la
print(f"\n📸 Dataset path: {WORK_DIR}/dataset/MILK10k_Training_Input")


/content/MILK10k_PanDerm
✅ Copied src/ from Google Drive
✅ Copied preprocessed_data/ from Google Drive
✅ Linked dataset from Google Drive (no copy needed)

📁 Working directory: /content/MILK10k_PanDerm
📂 Contents:
total 32
drwxr-xr-x 8 root root 4096 Dec 10 12:57 .
drwxr-xr-x 1 root root 4096 Dec 10 12:56 ..
drwxr-xr-x 2 root root 4096 Dec 10 12:57 dataset
drwxr-xr-x 2 root root 4096 Dec 10 12:57 logs
drwxr-xr-x 2 root root 4096 Dec 10 12:57 models
drwx------ 2 root root 4096 Dec  3 08:28 preprocessed_data
drwxr-xr-x 2 root root 4096 Dec 10 12:57 results
drwx------ 3 root root 4096 Dec 10 09:41 src

📸 Dataset path: /content/MILK10k_PanDerm/dataset/MILK10k_Training_Input


## 5️⃣ Load Configuration

Import PanDerm configuration and verify settings.


In [5]:
import sys
sys.path.insert(0, f'{WORK_DIR}/src')

# Import PanDerm specific modules
from config import (
    MODEL_CONFIG_PANDERM, IMAGE_CONFIG_PANDERM, TRAIN_CONFIG_PANDERM,
    LOSS_CONFIG_PANDERM, DIAGNOSIS_CATEGORIES, MONET_FEATURES
)
from utils import set_seed, get_device, count_parameters, save_checkpoint

# Display PanDerm configuration
print("=" * 60)
print("PANDERM TRAINING CONFIGURATION")
print("=" * 60)

print("\n🧬 Model Config:")
for key, value in MODEL_CONFIG_PANDERM.items():
    print(f"  {key}: {value}")

print("\n🎯 Training Config:")
for key, value in TRAIN_CONFIG_PANDERM.items():
    print(f"  {key}: {value}")

print("\n🖼️ Image Config:")
for key, value in IMAGE_CONFIG_PANDERM.items():
    print(f"  {key}: {value}")

print("\n⚖️ Loss Config:")
for key, value in LOSS_CONFIG_PANDERM.items():
    print(f"  {key}: {value}")

print(f"\n📂 Diagnosis Categories ({len(DIAGNOSIS_CATEGORIES)}):")
for cat in DIAGNOSIS_CATEGORIES:
    print(f"  - {cat}")

print(f"\n🔬 MONET Features ({len(MONET_FEATURES)}):")
for feat in MONET_FEATURES:
    print(f"  - {feat}")


PANDERM TRAINING CONFIGURATION

🧬 Model Config:
  model_name: redlessone/DermLIP_PanDerm-base-w-PubMed-256
  embed_dim: 768
  num_heads: 8
  num_classes: 11
  dropout: 0.1
  freeze_clinical: 6
  freeze_dermoscopic: 4
  num_concept_tokens: 11
  concept_hidden_dim: 256
  tmct_num_layers: 2
  mlp_ratio: 4.0
  use_auxiliary_heads: True
  aux_loss_weight: 0.3

🎯 Training Config:
  batch_size: 32
  num_epochs: 60
  gradient_accumulation: 2
  base_lr: 0.0001
  backbone_lr_decay: 0.9
  min_lr: 1e-07
  weight_decay: 0.05
  scheduler: cosine_warmup
  warmup_epochs: 3
  early_stopping_patience: 12
  gradient_clip: 1.0
  mixed_precision: bf16
  modality_dropout: 0.2
  concept_dropout: 0.1
  random_seed: 42
  num_workers: 8
  save_every: 5
  checkpoint_dir: /content/MILK10k_PanDerm/models
  log_dir: /content/MILK10k_PanDerm/logs
  use_ema: True
  ema_decay: 0.9999

🖼️ Image Config:
  image_size: 224
  normalize_mean: [0.48145466, 0.4578275, 0.40821073]
  normalize_std: [0.26862954, 0.261

## 6️⃣ Load Dataset

Load preprocessed training and validation data with MONET features.


In [6]:
import pandas as pd
import json
import numpy as np

# Load preprocessed data
print("Loading preprocessed data...")
train_df = pd.read_csv('preprocessed_data/train_data.csv')
val_df = pd.read_csv('preprocessed_data/val_data.csv')

print(f"✅ Training samples: {len(train_df):,}")
print(f"✅ Validation samples: {len(val_df):,}")

# Check for MONET features
monet_cols = [col for col in train_df.columns if 'MONET_' in col]
print(f"\n🔬 MONET columns found: {len(monet_cols)}")
for col in monet_cols[:7]:  # Show first 7
    print(f"  - {col}")

# Load class weights
try:
    with open('preprocessed_data/class_weights.json', 'r') as f:
        class_weights = json.load(f)
    print("\n⚖️ Class Weights loaded")
except FileNotFoundError:
    print("\n⚠️ class_weights.json not found, will compute from data")
    class_weights = None

# Count positive samples per class (used for the class-balanced loss)
print("\n📈 Label Distribution (Training):")
samples_per_class = []
for cat in DIAGNOSIS_CATEGORIES:
    if cat in train_df.columns:
        count = int(train_df[cat].sum())
        samples_per_class.append(count)
        pct = (count / len(train_df)) * 100
        print(f"  {cat}: {count:,} ({pct:.2f}%)")
    else:
        # Label column missing: fall back to a placeholder count and say so
        print(f"  ⚠️ {cat}: column not found, using placeholder count of 100")
        samples_per_class.append(100)

print("\n📊 Sample columns:")
print(train_df.columns.tolist()[:20])


Loading preprocessed data...
✅ Training samples: 4,192
✅ Validation samples: 1,048

🔬 MONET columns found: 14
  - clinical_MONET_ulceration_crust
  - clinical_MONET_hair
  - clinical_MONET_vasculature_vessels
  - clinical_MONET_erythema
  - clinical_MONET_pigmented
  - clinical_MONET_gel_water_drop_fluid_dermoscopy_liquid
  - clinical_MONET_skin_markings_pen_ink_purple_pen

⚖️ Class Weights loaded

📈 Label Distribution (Training):
  AKIEC: 242 (5.77%)
  BCC: 2,018 (48.14%)
  BEN_OTH: 35 (0.83%)
  BKL: 435 (10.38%)
  DF: 42 (1.00%)
  INF: 40 (0.95%)
  MAL_OTH: 7 (0.17%)
  MEL: 360 (8.59%)
  NV: 597 (14.24%)
  SCCKA: 378 (9.02%)
  VASC: 38 (0.91%)

📊 Sample columns:
['lesion_id', 'AKIEC', 'BCC', 'BEN_OTH', 'BKL', 'DF', 'INF', 'MAL_OTH', 'MEL', 'NV', 'SCCKA', 'VASC', 'age_approx', 'sex', 'skin_tone_class', 'site', 'clinical_isic_id', 'clinical_MONET_ulceration_crust', 'clinical_MONET_hair', 'clinical_MONET_vasculature_vessels']
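
The label distribution above is heavily skewed (BCC has 2,018 samples, MAL_OTH only 7), which is why per-class counts are collected for a class-balanced loss. One common way to turn counts into weights is the "effective number of samples" scheme, sketched below. Whether `PanDermTrainer` uses this exact formula is not shown in this notebook, so treat the function name and `beta=0.999` as assumptions; the counts are copied from the output above.

```python
# Training-set counts as printed in the label distribution.
counts = {
    "AKIEC": 242, "BCC": 2018, "BEN_OTH": 35, "BKL": 435, "DF": 42, "INF": 40,
    "MAL_OTH": 7, "MEL": 360, "NV": 597, "SCCKA": 378, "VASC": 38,
}


def class_balanced_weights(samples_per_class, beta=0.999):
    """Weights proportional to 1 / effective number, normalized to sum to the class count."""
    effective = [(1.0 - beta ** n) / (1.0 - beta) for n in samples_per_class]
    raw = [1.0 / e for e in effective]
    scale = len(raw) / sum(raw)
    return [w * scale for w in raw]


weights = class_balanced_weights(list(counts.values()))
for name, weight in zip(counts, weights):
    print(f"{name}: {weight:.3f}")
```

Compared with plain inverse-frequency weighting, this scheme saturates for large classes, so the weight gap between BCC and MAL_OTH is smaller than the raw 2,018-to-7 count ratio would suggest.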


### Fix Image Paths for Colab

The preprocessed CSV files contain Windows absolute paths. We need to fix them for Colab's Linux environment.


In [7]:
import os
from pathlib import Path
import re

def fix_image_paths(df, dataset_root):
    """
    Fix Windows absolute paths to work with Colab's dataset location.

    Extracts only the relative path (lesion_id/image.jpg) and reconstructs
    with the correct dataset root path.
    """
    df = df.copy()

    for col in ['clinical_image_path', 'dermoscopic_image_path']:
        if col in df.columns:
            df[col] = df[col].apply(lambda x: extract_relative_path(str(x), dataset_root))

    return df

def extract_relative_path(path_str, dataset_root):
    """Extract lesion_id/image.jpg from any path format (Windows or Linux)."""
    # Use regex to extract the lesion_id and image filename
    match = re.search(r'(IL_\d+)[/\\](ISIC_\d+\.jpg)', path_str)

    if match:
        lesion_id = match.group(1)
        image_file = match.group(2)
        return os.path.join(dataset_root, 'MILK10k_Training_Input', lesion_id, image_file)
    else:
        # Fallback: extract last 2 parts
        parts = re.split(r'[/\\]', path_str)
        parts = [p for p in parts if p]
        if len(parts) >= 2:
            return os.path.join(dataset_root, 'MILK10k_Training_Input', parts[-2], parts[-1])
        else:
            raise ValueError(f"Cannot extract lesion_id and image from path: {path_str}")

# Dataset root in Colab
DATASET_ROOT = f'{WORK_DIR}/dataset'

print("Fixing image paths for Colab environment...")
print(f"Dataset root: {DATASET_ROOT}")

train_df = fix_image_paths(train_df, DATASET_ROOT)
val_df = fix_image_paths(val_df, DATASET_ROOT)

print("\n✅ Image paths updated!")
print("\n📸 Example corrected paths:")
print(f"  Clinical: {train_df['clinical_image_path'].iloc[0]}")
print(f"  Dermoscopic: {train_df['dermoscopic_image_path'].iloc[0]}")

# Save corrected CSV files
print("\n💾 Saving corrected CSV files...")
train_df.to_csv('preprocessed_data/train_data.csv', index=False)
val_df.to_csv('preprocessed_data/val_data.csv', index=False)
print("✅ Saved corrected CSVs to preprocessed_data/")

# Also save to Google Drive for persistence
os.makedirs(f'{DRIVE_ROOT}/preprocessed_data', exist_ok=True)
train_df.to_csv(f'{DRIVE_ROOT}/preprocessed_data/train_data_colab.csv', index=False)
val_df.to_csv(f'{DRIVE_ROOT}/preprocessed_data/val_data_colab.csv', index=False)
print("✅ Saved corrected CSVs to Google Drive")

# Verify paths exist
sample_clinical = train_df['clinical_image_path'].iloc[0]
sample_dermoscopic = train_df['dermoscopic_image_path'].iloc[0]

print("\n🔍 Verifying image files...")
if os.path.exists(sample_clinical):
    print("✅ Sample clinical image exists!")
else:
    print(f"⚠️ WARNING: Clinical image not found at: {sample_clinical}")
    print("\n🔧 Debugging:")
    symlink_path = f'{WORK_DIR}/dataset/MILK10k_Training_Input'
    if os.path.islink(symlink_path):
        print(f"  - Dataset symlink target: {os.readlink(symlink_path)}")
    else:
        print(f"  - Directory exists: {os.path.exists(symlink_path)}")

if os.path.exists(sample_dermoscopic):
    print("✅ Sample dermoscopic image exists!")
else:
    print(f"⚠️ WARNING: Dermoscopic image not found at: {sample_dermoscopic}")


Fixing image paths for Colab environment...
Dataset root: /content/MILK10k_PanDerm/dataset

✅ Image paths updated!

📸 Example corrected paths:
  Clinical: /content/MILK10k_PanDerm/dataset/MILK10k_Training_Input/IL_8583674/ISIC_8570261.jpg
  Dermoscopic: /content/MILK10k_PanDerm/dataset/MILK10k_Training_Input/IL_8583674/ISIC_7454892.jpg

💾 Saving corrected CSV files...
✅ Saved corrected CSVs to preprocessed_data/
✅ Saved corrected CSVs to Google Drive

🔍 Verifying image files...
✅ Sample clinical image exists!
✅ Sample dermoscopic image exists!
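
To see the path rewrite in isolation, the following standalone sketch applies the same regular expression to a single Windows-style path. The `C:\Users\someone\...` prefix is a made-up example and `to_colab_path` is a hypothetical name; the lesion and image IDs are taken from the output above. `posixpath.join` is used so the result is the same on any operating system.

```python
import posixpath
import re

# Same pattern as in the path-fixing cell: lesion folder, separator, image file.
PATTERN = re.compile(r'(IL_\d+)[/\\](ISIC_\d+\.jpg)')


def to_colab_path(path_str, dataset_root):
    """Rebuild a Linux path from any path that ends in IL_<id>/ISIC_<id>.jpg."""
    match = PATTERN.search(path_str)
    if match is None:
        raise ValueError(f"Cannot extract lesion_id and image from path: {path_str}")
    return posixpath.join(dataset_root, 'MILK10k_Training_Input',
                          match.group(1), match.group(2))


# Hypothetical Windows path of the kind stored in the preprocessed CSVs.
windows_path = r'C:\Users\someone\MILK10k\MILK10k_Training_Input\IL_8583674\ISIC_8570261.jpg'
print(to_colab_path(windows_path, '/content/MILK10k_PanDerm/dataset'))
```

Because the pattern accepts both `/` and `\` as the separator, running the fix a second time on already-corrected paths leaves them unchanged.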


## 7️⃣ Create PanDerm DataLoaders

Create training and validation dataloaders with PanDerm-specific transforms and MONET features.


In [8]:
from train_panderm import PanDermDataset, get_panderm_transforms, get_panderm_dataloaders

# Settings for an A100 40GB GPU with the PanDerm dual backbone
# PanDerm uses 224x224 images (ViT native resolution)
BATCH_SIZE = 48  # Overrides the config default of 32; lower it if you run out of GPU memory
NUM_WORKERS = 8  # A100 instances have more CPU cores
IMAGE_SIZE = IMAGE_CONFIG_PANDERM['image_size']  # 224

print(f"Creating PanDerm dataloaders...")
print(f"  Image size: {IMAGE_SIZE}")
print(f"  Batch size: {BATCH_SIZE}")
print(f"  Num workers: {NUM_WORKERS}")

# Create dataloaders
train_loader, val_loader = get_panderm_dataloaders(
    train_df,
    val_df,
    batch_size=BATCH_SIZE,
    num_workers=NUM_WORKERS,
    image_size=IMAGE_SIZE
)

print(f"\n✅ Train DataLoader: {len(train_loader)} batches")
print(f"✅ Val DataLoader: {len(val_loader)} batches")

# Test dataloader - PanDerm expects (clinical_img, dermoscopic_img, monet_scores, metadata, labels)
print("\n🧪 Testing PanDerm dataloader...")
for batch in train_loader:
    clinical_img, dermoscopic_img, monet_scores, metadata, labels = batch
    print(f"  Clinical image shape: {clinical_img.shape}")
    print(f"  Dermoscopic image shape: {dermoscopic_img.shape}")
    print(f"  MONET scores shape: {monet_scores.shape}")
    print(f"  Metadata shape: {metadata.shape}")
    print(f"  Labels shape: {labels.shape}")
    break

print("\n✅ DataLoader test successful!")


Creating PanDerm dataloaders...
  Image size: 224
  Batch size: 48
  Num workers: 8

✅ Train DataLoader: 87 batches
✅ Val DataLoader: 22 batches

🧪 Testing PanDerm dataloader...


  Clinical image shape: torch.Size([48, 3, 224, 224])
  Dermoscopic image shape: torch.Size([48, 3, 224, 224])
  MONET scores shape: torch.Size([48, 7])
  Metadata shape: torch.Size([48, 11])
  Labels shape: torch.Size([48, 11])

✅ DataLoader test successful!


## 8️⃣ Create PanDerm Model

Initialize the **Tri-Modal PanDerm Fusion** model with:
- Dual DermLIP ViT-B/16 encoders for clinical and dermoscopic images
- MONET concept embedding for interpretable features
- TMCT (Tri-Modal Cross-attention Transformer) fusion blocks
- Global context pooling with auxiliary heads for deep supervision


In [9]:
import torch
from models_panderm import create_panderm_model, get_layer_wise_lr_params

# Get device
device = get_device()

# Create PanDerm model
print("Creating Tri-Modal PanDerm Fusion model...")
model = create_panderm_model(
    model_name=MODEL_CONFIG_PANDERM['model_name'],
    embed_dim=MODEL_CONFIG_PANDERM['embed_dim'],
    num_heads=MODEL_CONFIG_PANDERM['num_heads'],
    num_classes=MODEL_CONFIG_PANDERM['num_classes'],
    dropout=MODEL_CONFIG_PANDERM['dropout'],
    freeze_clinical=MODEL_CONFIG_PANDERM['freeze_clinical'],
    freeze_dermoscopic=MODEL_CONFIG_PANDERM['freeze_dermoscopic'],
    num_concept_tokens=MODEL_CONFIG_PANDERM['num_concept_tokens'],
    tmct_num_layers=MODEL_CONFIG_PANDERM.get('tmct_num_layers', 2),
    use_auxiliary_heads=MODEL_CONFIG_PANDERM['use_auxiliary_heads'],
    pretrained=True
)

model = model.to(device)

# Count parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
frozen_params = total_params - trainable_params

print("\n✅ Model: Tri-Modal PanDerm Fusion")
print(f"✅ Backbone: {MODEL_CONFIG_PANDERM['model_name']}")
print(f"✅ Embed dim: {MODEL_CONFIG_PANDERM['embed_dim']}")
print(f"✅ TMCT layers: {MODEL_CONFIG_PANDERM.get('tmct_num_layers', 2)}")
print(f"✅ Auxiliary heads: {MODEL_CONFIG_PANDERM['use_auxiliary_heads']}")
print(f"✅ Device: {device}")
print("\n📊 Parameters:")
print(f"  Total: {total_params:,}")
print(f"  Trainable: {trainable_params:,}")
print(f"  Frozen: {frozen_params:,}")
print(f"  Model size: {total_params * 4 / 1024 / 1024:.1f} MB (FP32)")


Using GPU: NVIDIA A100-SXM4-40GB
Creating Tri-Modal PanDerm Fusion model...


✅ Model: Tri-Modal PanDerm Fusion
✅ Backbone: redlessone/DermLIP_PanDerm-base-w-PubMed-256
✅ Embed dim: 768
✅ TMCT layers: 2
✅ Auxiliary heads: True
✅ Device: cuda

📊 Parameters:
  Total: 203,130,593
  Trainable: 130,768,097
  Frozen: 72,362,496
  Model size: 774.9 MB (FP32)


### Test Forward Pass

Verify the model works correctly with a sample batch.


In [None]:
# Test forward pass with a real batch
print("Testing forward pass...")
model.eval()

for batch in train_loader:
    clinical_img, dermoscopic_img, monet_scores, metadata, labels = batch
    clinical_img = clinical_img.to(device)
    dermoscopic_img = dermoscopic_img.to(device)
    monet_scores = monet_scores.to(device)
    metadata = metadata.to(device)

    with torch.no_grad():
        # bf16 matches the mixed-precision mode set in TRAIN_CONFIG_PANDERM
        with torch.amp.autocast('cuda', dtype=torch.bfloat16):
            outputs = model(clinical_img, dermoscopic_img, monet_scores, metadata)

    if isinstance(outputs, dict):
        print("✅ Output (dict):")
        for k, v in outputs.items():
            print(f"   {k}: {v.shape}")
    else:
        print(f"✅ Output shape: {outputs.shape}")
    break

# Memory estimation (forward pass only; backward pass and optimizer state add more)
if torch.cuda.is_available():
    torch.cuda.reset_peak_memory_stats()
    model.train()
    with torch.amp.autocast('cuda', dtype=torch.bfloat16):
        outputs = model(clinical_img, dermoscopic_img, monet_scores, metadata)
    peak_memory = torch.cuda.max_memory_allocated() / 1024**3  # GiB
    print(f"\n💾 Peak GPU memory (batch={BATCH_SIZE}): {peak_memory:.2f} GiB")

    # Rough upper bound, assuming memory scales linearly with batch size
    gpu_memory = torch.cuda.get_device_properties(0).total_memory / 1024**3  # GiB
    estimated_max_batch = int(BATCH_SIZE * (gpu_memory * 0.8) / peak_memory)
    print(f"📊 Estimated max batch size: {estimated_max_batch}")
    del outputs  # release the autograd graph held by the training-mode outputs

print("\n✅ Forward pass test completed!")


## 9️⃣ Initialize Training Components

Set up the PanDerm trainer with:
- Compound Loss (Focal + Soft F1) with class balancing
- Layer-wise learning rate decay
- Learning rate scheduler with warmup (the `scheduler` entry in `TRAIN_CONFIG_PANDERM`)
- Mixed precision training


In [None]:
from train_panderm import PanDermTrainer

# Set random seed for reproducibility
set_seed(TRAIN_CONFIG_PANDERM['random_seed'])

# Update checkpoint and log directories to save in Google Drive
CHECKPOINT_DIR = f'{DRIVE_ROOT}/models/panderm'
LOG_DIR = f'{DRIVE_ROOT}/logs/panderm'

# Create directories
os.makedirs(CHECKPOINT_DIR, exist_ok=True)
os.makedirs(LOG_DIR, exist_ok=True)

# Override config paths
TRAIN_CONFIG_PANDERM['checkpoint_dir'] = CHECKPOINT_DIR
TRAIN_CONFIG_PANDERM['log_dir'] = LOG_DIR

print(f"Creating PanDerm trainer...")
print(f"  Checkpoint dir: {CHECKPOINT_DIR}")
print(f"  Log dir: {LOG_DIR}")
print(f"  Base LR: {TRAIN_CONFIG_PANDERM['base_lr']}")
print(f"  Backbone LR decay: {TRAIN_CONFIG_PANDERM['backbone_lr_decay']}")
print(f"  Gradient accumulation: {TRAIN_CONFIG_PANDERM.get('gradient_accumulation', 1)}")
print(f"  Mixed precision: {TRAIN_CONFIG_PANDERM.get('mixed_precision', 'fp16')}")

# Create trainer
trainer = PanDermTrainer(
    model=model,
    train_loader=train_loader,
    val_loader=val_loader,
    samples_per_class=samples_per_class,
    device=device
)

print("\n✅ PanDerm Trainer initialized successfully!")
print("\n📈 Training setup:")
print(f"  Epochs: {TRAIN_CONFIG_PANDERM['num_epochs']}")
print(f"  Early stopping patience: {TRAIN_CONFIG_PANDERM['early_stopping_patience']}")
print(f"  Warmup epochs: {TRAIN_CONFIG_PANDERM['warmup_epochs']}")
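
As a worked example of what gradient accumulation means for this run, the numbers printed earlier (batch size 48, `gradient_accumulation: 2`, 87 training batches) give the effective batch size and the optimizer steps per epoch. Whether the trainer also steps on a trailing partial group of batches is not visible from this notebook, so the rounding below is an assumption.

```python
import math

batch_size = 48            # BATCH_SIZE set when the dataloaders were created
gradient_accumulation = 2  # value shown in the printed training config
train_batches = 87         # len(train_loader) reported by the dataloader cell

# Gradients from this many samples are summed before each optimizer step.
effective_batch_size = batch_size * gradient_accumulation

# Assumption: a leftover partial group still triggers an optimizer step.
optimizer_steps_per_epoch = math.ceil(train_batches / gradient_accumulation)

print(f"Effective batch size: {effective_batch_size}")
print(f"Optimizer steps per epoch: {optimizer_steps_per_epoch}")
```

Keep the effective batch size in mind when changing `BATCH_SIZE`: it, rather than the per-step batch size, determines the gradient noise level the learning rate was tuned for.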


### View Layer-wise Learning Rates

PanDerm uses layer-wise learning rate decay for better fine-tuning.
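
The idea behind layer-wise decay is that layers closer to the input receive geometrically smaller learning rates than the task head. The sketch below illustrates that rule only. It is not the `get_layer_wise_lr_params` implementation, whose grouping is not shown here: the 12-block depth (typical for ViT-B) and the group names are assumptions, while `base_lr=1e-4` and `decay_rate=0.9` match the printed config.

```python
def layer_wise_lrs(base_lr=1e-4, decay_rate=0.9, num_layers=12):
    """Map each group to its learning rate: head at base_lr, earlier blocks decayed."""
    lrs = {"head": base_lr}
    for layer in reversed(range(num_layers)):
        depth_from_head = num_layers - layer  # 1 for the last block, num_layers for the first
        lrs[f"block_{layer}"] = base_lr * decay_rate ** depth_from_head
    return lrs


for name, lr in layer_wise_lrs().items():
    print(f"{name}: lr={lr:.2e}")
```

With these values the first block trains at roughly 28% of the head's learning rate (0.9 to the 12th power), which protects pretrained low-level features while the new fusion layers adapt quickly.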


In [None]:
# Show layer-wise learning rates
print("Layer-wise Learning Rates:")
print("=" * 50)
lr_params = get_layer_wise_lr_params(
    model,
    base_lr=TRAIN_CONFIG_PANDERM['base_lr'],
    decay_rate=TRAIN_CONFIG_PANDERM['backbone_lr_decay']
)

for group in lr_params:
    num_params = sum(p.numel() for p in group['params'])
    print(f"  {group['name']}: lr={group['lr']:.2e}, params={num_params:,}")

print(f"\nTotal parameter groups: {len(lr_params)}")


## 🔟 Load TensorBoard (Optional)

Load the TensorBoard extension to monitor training in real time.


In [None]:
# Load TensorBoard extension
%load_ext tensorboard

# Start TensorBoard (will update during training)
%tensorboard --logdir {LOG_DIR}

print("✅ TensorBoard loaded! View metrics above during training.")


## 1️⃣1️⃣ Start Training 🚀

**⚠️ IMPORTANT**: Training takes several hours; the exact time depends on your GPU.

Expected training time:
- **A100 40GB**: ~3-5 hours for 60 epochs
- **V100 32GB**: ~6-10 hours
- **T4 16GB**: May need to reduce batch size to 8-16

The training will:
- Save best model (regular) to Google Drive automatically
- Save best EMA model separately (`panderm_best_ema.pth`)
- Save checkpoints every 5 epochs with EMA state
- Stop early if no improvement for **20 epochs** (increased patience)
- Use gradient accumulation for effective larger batches
- Apply **Mixup/CutMix** augmentation (50% probability)
- Track both regular and **EMA validation metrics**
- Use **CosineAnnealingWarmRestarts** scheduler with 5-epoch warmup
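
The shape of a warm-restart schedule with warmup can be previewed with plain arithmetic. The sketch below follows the cosine formula used by PyTorch's `CosineAnnealingWarmRestarts` (with `T_0=10`, `T_mult=2`) and puts a linear 5-epoch warmup in front of it. How the trainer actually combines warmup and restarts is an assumption, as is the function name; `base_lr` and `min_lr` are the values from the printed config.

```python
import math


def lr_at_epoch(epoch, base_lr=1e-4, min_lr=1e-7, warmup_epochs=5, t_0=10, t_mult=2):
    """Learning rate for a 0-based epoch: linear warmup, then cosine cycles with restarts."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    t = epoch - warmup_epochs  # epochs since the end of warmup
    cycle_len = t_0
    while t >= cycle_len:      # walk forward to the current cycle
        t -= cycle_len
        cycle_len *= t_mult
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * t / cycle_len))


for epoch in (0, 4, 5, 14, 15, 34, 35):
    print(f"epoch {epoch:2d}: lr={lr_at_epoch(epoch):.2e}")
```

Under these assumptions the restarts land at epochs 15 and 35, so a 60-epoch run ends partway through the third, 40-epoch cycle.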


In [None]:
# Start training
print("🚀 Starting PanDerm training...")
print("⚠️ This will take several hours. Don't close the browser tab!")
print("=" * 60)

# Train the model
history = trainer.train()

print("\n" + "=" * 60)
print("🎉 PANDERM TRAINING COMPLETED!")
print("=" * 60)


## 1️⃣2️⃣ View Training Results

Analyze training history and visualize performance.


In [None]:
import matplotlib.pyplot as plt

# Load training history
history_path = f'{CHECKPOINT_DIR}/panderm_history.csv'
if os.path.exists(history_path):
    history_df = pd.read_csv(history_path)
else:
    # Use in-memory history if file not found
    history_df = pd.DataFrame(history)

print("=" * 60)
print("PANDERM TRAINING SUMMARY (with EMA)")
print("=" * 60)
print(f"\nTotal epochs: {len(history_df)}")
print(f"Best Macro F1 (regular): {history_df['val_f1_macro'].max():.4f}")
if 'val_f1_macro_ema' in history_df.columns:
    print(f"Best Macro F1 (EMA):     {history_df['val_f1_macro_ema'].max():.4f}")
    best_model = "EMA" if history_df['val_f1_macro_ema'].max() > history_df['val_f1_macro'].max() else "Regular"
    print(f"Recommended model: {best_model}")
print(f"Final Train Loss: {history_df['train_loss'].iloc[-1]:.4f}")
print(f"Final Val Loss: {history_df['val_loss'].iloc[-1]:.4f}")

# Plot training curves
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Loss curves
axes[0, 0].plot(history_df['train_loss'], label='Train Loss', linewidth=2)
axes[0, 0].plot(history_df['val_loss'], label='Val Loss', linewidth=2)
axes[0, 0].set_xlabel('Epoch')
axes[0, 0].set_ylabel('Loss')
axes[0, 0].set_title('Training and Validation Loss (PanDerm)')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)

# F1 scores (including EMA if available)
axes[0, 1].plot(history_df['val_f1_macro'], label='Macro F1', linewidth=2, color='green')
if 'val_f1_macro_ema' in history_df.columns:
    axes[0, 1].plot(history_df['val_f1_macro_ema'], label='EMA F1', linewidth=2, color='blue', linestyle='--')
    best_f1 = max(history_df['val_f1_macro'].max(), history_df['val_f1_macro_ema'].max())
else:
    best_f1 = history_df['val_f1_macro'].max()
axes[0, 1].axhline(y=best_f1, color='r', linestyle=':', label=f'Best: {best_f1:.4f}')
axes[0, 1].set_xlabel('Epoch')
axes[0, 1].set_ylabel('F1 Score')
axes[0, 1].set_title('Validation Macro F1 Score (Regular vs EMA)')
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3)

# Learning rate (CosineAnnealingWarmRestarts)
axes[1, 0].plot(history_df['learning_rate'], linewidth=2, color='purple')
axes[1, 0].set_xlabel('Epoch')
axes[1, 0].set_ylabel('Learning Rate')
axes[1, 0].set_title('Learning Rate Schedule (CosineWarmRestarts)')
axes[1, 0].set_yscale('log')
axes[1, 0].grid(True, alpha=0.3)

# Loss comparison at best epoch
best_epoch = history_df['val_f1_macro'].idxmax()
axes[1, 1].bar(['Train Loss', 'Val Loss'],
               [history_df.loc[best_epoch, 'train_loss'],
                history_df.loc[best_epoch, 'val_loss']],
               color=['blue', 'orange'])
axes[1, 1].set_title(f'Loss at Best Epoch ({best_epoch+1})')
axes[1, 1].set_ylabel('Loss')
axes[1, 1].grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.savefig(f'{DRIVE_ROOT}/panderm_training_curves.png', dpi=300, bbox_inches='tight')
plt.show()

print(f"\n✅ Training curves saved to: {DRIVE_ROOT}/panderm_training_curves.png")


## 1️⃣3️⃣ Save Model Info

Document training results for team reference.


In [None]:
from datetime import datetime

# Create model info file
best_epoch = history_df['val_f1_macro'].idxmax()
best_macro_f1 = history_df['val_f1_macro'].max()

model_info = f"""# PanDerm Training Results

**Date**: {datetime.now().strftime('%Y-%m-%d %H:%M')}
**GPU**: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'CPU'}
**Total Epochs**: {len(history_df)}
**Best Epoch**: {best_epoch + 1}
**Best Macro F1**: {best_macro_f1:.4f}

## Model Configuration

- **Architecture**: Tri-Modal PanDerm Fusion
- **Backbone**: {MODEL_CONFIG_PANDERM['model_name']}
- **Embed Dim**: {MODEL_CONFIG_PANDERM['embed_dim']}
- **TMCT Layers**: {MODEL_CONFIG_PANDERM.get('tmct_num_layers', 2)}
- **Concept Tokens**: {MODEL_CONFIG_PANDERM['num_concept_tokens']}
- **Auxiliary Heads**: {MODEL_CONFIG_PANDERM['use_auxiliary_heads']}
- **Image Size**: {IMAGE_CONFIG_PANDERM['image_size']}
- **Batch Size**: {BATCH_SIZE}

## Training Configuration

- **Base LR**: {TRAIN_CONFIG_PANDERM['base_lr']}
- **LR Decay**: {TRAIN_CONFIG_PANDERM['backbone_lr_decay']}
- **Gradient Accumulation**: {TRAIN_CONFIG_PANDERM.get('gradient_accumulation', 1)}
- **Weight Decay**: {TRAIN_CONFIG_PANDERM['weight_decay']}

## Loss Configuration

- **Type**: Compound (Focal + Soft F1)
- **Focal Weight**: {LOSS_CONFIG_PANDERM['focal_weight']}
- **Soft F1 Weight**: {LOSS_CONFIG_PANDERM['soft_f1_weight']}
- **Aux Loss Weight**: {LOSS_CONFIG_PANDERM.get('aux_loss_weight', MODEL_CONFIG_PANDERM['aux_loss_weight'])}

## Files

- Best Model: `{CHECKPOINT_DIR}/panderm_best.pth`
- Training History: `{CHECKPOINT_DIR}/panderm_history.csv`
- Training Curves: `{DRIVE_ROOT}/panderm_training_curves.png`

## Next Steps

1. Download panderm_best.pth from Google Drive
2. Run inference on test set using `src/generate_submission_panderm.py`
3. Consider ensemble with EfficientNet-B3 model
4. Share results with team
"""

# Save model info
info_path = f'{DRIVE_ROOT}/PANDERM_MODEL_INFO.md'
with open(info_path, 'w') as f:
    f.write(model_info)

print(model_info)
print(f"\n✅ Model info saved to: {info_path}")


## 1️⃣4️⃣ Download Trained Model

Download the trained model and results to your local machine.


In [None]:
from google.colab import files

# Option 1: Download directly (may be slow for large files)
print("Downloading files...")
print("⚠️ This may take a while for large model files")

# Download best model (regular)
try:
    files.download(f'{CHECKPOINT_DIR}/panderm_best.pth')
    print("✅ Downloaded: panderm_best.pth (regular model)")
except Exception as e:
    print(f"⚠️ Could not download model: {e}")
    print(f"📁 Access it in Google Drive: {CHECKPOINT_DIR}/panderm_best.pth")

# Download best EMA model
try:
    files.download(f'{CHECKPOINT_DIR}/panderm_best_ema.pth')
    print("✅ Downloaded: panderm_best_ema.pth (EMA model)")
except Exception as e:
    print(f"⚠️ Could not download EMA model: {e}")
    print(f"📁 Access it in Google Drive: {CHECKPOINT_DIR}/panderm_best_ema.pth")

# Download training history
try:
    files.download(f'{CHECKPOINT_DIR}/panderm_history.csv')
    print("✅ Downloaded: panderm_history.csv")
except Exception as e:
    print(f"⚠️ Could not download history: {e}")

# Download training curves
try:
    files.download(f'{DRIVE_ROOT}/panderm_training_curves.png')
    print("✅ Downloaded: panderm_training_curves.png")
except Exception as e:
    print(f"⚠️ Could not download curves: {e}")

print("\n" + "=" * 60)
print("📦 All files are also saved in Google Drive:")
print(f"  📁 {DRIVE_ROOT}/")
print(f"  📁 {CHECKPOINT_DIR}/")
print("  📄 panderm_best.pth (regular model)")
print("  📄 panderm_best_ema.pth (EMA model - often better)")
print("=" * 60)
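
The cell above downloads each file separately, which triggers one browser download per file. One possible alternative for the small artifacts (history CSV, curves PNG) is to bundle them into a single archive first. The sketch below uses only the standard library; `bundle_artifacts` is a hypothetical helper name, not something defined elsewhere in this notebook.

```python
import zipfile
from pathlib import Path


def bundle_artifacts(paths, archive_path):
    """Zip the files that exist; return (archive path, list of missing files)."""
    missing = []
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for p in map(Path, paths):
            if p.is_file():
                # Store each file flat, without its directory prefix.
                zf.write(p, arcname=p.name)
            else:
                missing.append(str(p))
    return str(archive_path), missing
```

In the notebook this could be called with `f'{CHECKPOINT_DIR}/panderm_history.csv'` and `f'{DRIVE_ROOT}/panderm_training_curves.png'`, followed by a single `files.download(...)` on the returned archive path. Model weight files usually shrink very little under compression, so bundling the `.pth` files is unlikely to speed up their download.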


## 🔄 Resume Training (if interrupted)

If your training was interrupted, you can resume from a checkpoint.


In [None]:
# Uncomment and run this cell to resume training from a checkpoint

# import glob
# import os
#
# # Find the most recently written checkpoint
# checkpoints = glob.glob(f'{CHECKPOINT_DIR}/checkpoint_epoch_*.pth')
# if checkpoints:
#     # getmtime = last write time; getctime is metadata-change time on Linux
#     latest_checkpoint = max(checkpoints, key=os.path.getmtime)
#     print(f"Loading checkpoint: {latest_checkpoint}")
#
#     # Restore model and optimizer state
#     checkpoint = torch.load(latest_checkpoint, map_location=device)
#     model.load_state_dict(checkpoint['model_state_dict'])
#     trainer.optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
#     # If the checkpoint also stores scheduler or EMA state, restore it here too
#     start_epoch = checkpoint['epoch'] + 1  # assumes 'epoch' is the last completed epoch
#
#     print(f"Resuming from epoch {start_epoch}")
#     # Continue training...
# else:
#     print("No checkpoint found. Starting fresh training.")
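
File timestamps on a mounted Drive folder may not always reflect training order, for example after files are copied or re-synced. A possibly more robust way to pick the checkpoint is to parse the epoch number from the filename. The sketch below assumes the `checkpoint_epoch_<N>.pth` naming implied by the glob pattern in the cell above; the exact filename format and the helper name `latest_checkpoint` are assumptions.

```python
import re
from pathlib import Path

# Assumed filename pattern: checkpoint_epoch_<N>.pth
_EPOCH_RE = re.compile(r"checkpoint_epoch_(\d+)\.pth$")


def latest_checkpoint(paths):
    """Return the path with the highest epoch number, or None if none match."""
    best_path, best_epoch = None, -1
    for p in paths:
        m = _EPOCH_RE.search(Path(p).name)
        if m and int(m.group(1)) > best_epoch:
            best_path, best_epoch = p, int(m.group(1))
    return best_path
```

Comparing the parsed integers rather than the raw strings means `checkpoint_epoch_10.pth` ranks above `checkpoint_epoch_9.pth`, which a plain string sort would get wrong.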


---

## 🎉 Training Complete!

### What to do next:

1. **Download the trained model** from Google Drive:
   - `panderm_best.pth` (regular model)
   - `panderm_best_ema.pth` (EMA model - often better generalization)
2. **Choose the best model**: Check the training summary to see whether the regular or the EMA model reached the higher validation macro F1
3. **Run inference** on test set using `src/generate_submission_panderm.py`
4. **Consider ensemble** with EfficientNet-B3 and XGBoost for better performance
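
As a rough illustration of the ensemble step, one simple option is a weighted average of the per-class probabilities produced by each model. This is only a sketch of that averaging idea, not the project's ensemble code; `ensemble_probs` and the example probabilities are hypothetical.

```python
def ensemble_probs(model_probs, weights=None):
    """Weighted average of per-class probabilities from several models.

    model_probs: list of lists, one list of class probabilities per model.
    weights: optional per-model weights; defaults to equal weighting.
    """
    n_models = len(model_probs)
    if weights is None:
        weights = [1.0] * n_models
    if len(weights) != n_models:
        raise ValueError("need one weight per model")
    total = sum(weights)
    n_classes = len(model_probs[0])
    return [
        sum(w * probs[c] for w, probs in zip(weights, model_probs)) / total
        for c in range(n_classes)
    ]


# Hypothetical probabilities for one lesion over three classes.
panderm = [0.80, 0.10, 0.40]
effnet = [0.60, 0.30, 0.20]
averaged = ensemble_probs([panderm, effnet])
```

Because the task is multi-label, each class probability is averaged independently and thresholded afterwards; the weights could be tuned on the validation set.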

### Enhanced Training Features (v2):

| Feature | Setting | Purpose |
|---------|---------|---------|
| Dropout | 0.3 | Stronger regularization |
| DropPath | 0.1 | Stochastic depth in TMCT |
| Mixup | α=0.8 | Data augmentation |
| CutMix | α=1.0 | Data augmentation |
| EMA | decay=0.9999 | Smoother, better generalization |
| Scheduler | CosineAnnealingWarmRestarts | Escape local minima |
| Label Smoothing | 0.1 | Prevent overconfidence |
| Early Stopping | 20 epochs | Allow more recovery time |
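
To make the EMA row more concrete, here is a minimal sketch of an exponential moving average update with a warmup on the decay. It uses plain Python floats so it runs anywhere; in the notebook the same update would apply to model tensors. The warmup formula `(1 + step) / (10 + step)` is one common choice and an assumption here; the trainer's actual schedule is not shown in this section.

```python
def ema_decay_at(step, max_decay=0.9999):
    """Warmup schedule: small decay early on, capped at max_decay later."""
    return min(max_decay, (1 + step) / (10 + step))


def ema_update(ema_params, model_params, step, max_decay=0.9999):
    """Return the new EMA parameters after one optimizer step."""
    d = ema_decay_at(step, max_decay)
    return [d * e + (1.0 - d) * p for e, p in zip(ema_params, model_params)]
```

With a fixed decay of 0.9999 the average would barely move during the first few thousand steps. The warmup lets the EMA weights track the model closely at the start and only later settle into slow averaging.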

### Tips for PanDerm Training:

- **A100 40GB**: Use batch_size=32-48, gradient_accumulation=2 for effective batch of 64-96
- **V100 32GB**: Use batch_size=24-32, gradient_accumulation=2-3
- **T4 16GB**: Use batch_size=8-12, gradient_accumulation=4-6
- **Memory**: PanDerm uses dual ViT backbones, so it's more memory-intensive than single-backbone models
- **EMA Model**: Often generalizes better than the regular model - always check both!
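
The effective batch sizes in these tips are the per-step batch size multiplied by the number of gradient accumulation steps. A small sketch of that arithmetic follows; both helper names are hypothetical.

```python
import math


def effective_batch_size(batch_size, gradient_accumulation):
    """Number of samples contributing to each optimizer step."""
    return batch_size * gradient_accumulation


def accumulation_steps(batch_size, target_effective_batch):
    """Smallest number of accumulation steps that reaches the target."""
    if batch_size <= 0 or target_effective_batch <= 0:
        raise ValueError("batch sizes must be positive")
    return math.ceil(target_effective_batch / batch_size)
```

For example, `effective_batch_size(32, 2)` and `effective_batch_size(48, 2)` reproduce the 64 and 96 quoted for the A100, and `accumulation_steps(12, 64)` gives the 6 steps at the upper end of the T4 suggestion.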

### Model Architecture Notes:

- **Dual DermLIP Encoders**: Separate encoders for clinical and dermoscopic images allow specialized feature extraction
- **MONET Concept Embedding**: Interpretable concept tokens from MONET probability scores
- **TMCT Fusion with DropPath**: Cross-attention between visual features and concept tokens with stochastic depth
- **Auxiliary Heads**: Deep supervision helps train the backbone more effectively
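
To illustrate how deep supervision can enter the objective, the sketch below combines a compound loss (focal + soft F1) on the main head with the same loss on auxiliary heads, scaled by an auxiliary weight. It is written with plain Python lists for readability and is not the notebook's loss implementation. The focal and soft F1 weights follow the v2 settings (0.4 and 0.6). The `gamma` value, the `aux_loss_weight` default, the averaging over auxiliary heads and the omission of label smoothing are assumptions made for this example.

```python
import math


def focal_bce(probs, targets, gamma=2.0, eps=1e-7):
    """Mean binary focal loss over all (sample, class) entries."""
    total, count = 0.0, 0
    for p_row, y_row in zip(probs, targets):
        for p, y in zip(p_row, y_row):
            p_t = p if y == 1 else 1.0 - p        # probability of the true label
            p_t = min(max(p_t, eps), 1.0 - eps)   # avoid log(0)
            total += -((1.0 - p_t) ** gamma) * math.log(p_t)
            count += 1
    return total / count


def soft_f1_loss(probs, targets, eps=1e-7):
    """1 - macro soft F1, using probabilities instead of hard predictions."""
    n_classes = len(probs[0])
    f1_sum = 0.0
    for c in range(n_classes):
        tp = sum(p[c] * y[c] for p, y in zip(probs, targets))
        fp = sum(p[c] * (1 - y[c]) for p, y in zip(probs, targets))
        fn = sum((1 - p[c]) * y[c] for p, y in zip(probs, targets))
        f1_sum += 2 * tp / (2 * tp + fp + fn + eps)
    return 1.0 - f1_sum / n_classes


def total_loss(main_probs, aux_probs_list, targets,
               focal_weight=0.4, soft_f1_weight=0.6, aux_loss_weight=0.3):
    """Compound loss on the main head plus a weighted mean over auxiliary heads."""
    def compound(probs):
        return (focal_weight * focal_bce(probs, targets)
                + soft_f1_weight * soft_f1_loss(probs, targets))

    loss = compound(main_probs)
    if aux_probs_list:
        aux = sum(compound(a) for a in aux_probs_list) / len(aux_probs_list)
        loss += aux_loss_weight * aux  # aux_loss_weight=0.3 is a placeholder value
    return loss
```

The auxiliary terms give each backbone its own gradient signal, while the main head still dominates the objective as long as `aux_loss_weight` stays below 1.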

### Expected Improvements (v2 vs v1):

- **Reduced overfitting**: Train/Val gap should decrease from ~36% to <15%
- **Better F1**: Expected improvement from 0.54 to 0.58-0.62
- **More stable training**: F1 fluctuation should decrease with EMA
- **Longer training**: Model can train longer before overfitting

---

**Happy Training! 🚀**
