### ChagaSight: Vision Transformer (Baseline Training)

Baseline ViT training on 2D ECG contour images  
Datasets: PTB-XL (negatives), SaMi-Trop (positives), CODE-15 (soft labels)

Baseline configuration:
- Configurable subset via `subset_frac` (1% for pipeline verification; the recorded outputs below use 10%)
- No data augmentation
- AMP enabled
- Strict data integrity checks


In [1]:
# =========================
# CELL 1 (Code) - Setup, device, paths, seed
# =========================

import time, random
from pathlib import Path
from datetime import datetime

import numpy as np
import pandas as pd
import torch

start_time = time.time()

# -------------------------
# Reproducibility (baseline-safe)
# -------------------------
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)

# Deterministic baseline (slower but reproducible)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

# -------------------------
# Device
# -------------------------
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Device:", device)
if device.type == "cuda":
    print("GPU:", torch.cuda.get_device_name(0))
    print("VRAM (GB):", torch.cuda.get_device_properties(0).total_memory / 1e9)

# -------------------------
# Project root detection (VS Code safe)
# -------------------------
def find_project_root(start: Path) -> Path:
    for p in [start] + list(start.parents):
        if (p / "data").exists():
            return p
    return start

PROJECT_ROOT = find_project_root(Path.cwd())
DATA_DIR = PROJECT_ROOT / "data" / "processed"

# -------------------------
# Experiment folder (one folder per run)
# -------------------------
EXP_NAME = "vit_baseline_1pct"
RUN_ID = datetime.now().strftime("%Y%m%d_%H%M%S")
EXP_DIR = PROJECT_ROOT / "experiments" / EXP_NAME / RUN_ID
EXP_DIR.mkdir(parents=True, exist_ok=True)

print("PROJECT_ROOT:", PROJECT_ROOT)
print("DATA_DIR:", DATA_DIR)
print("Experiment directory:", EXP_DIR)

print(f"⏱ Cell 1 time: {time.time() - start_time:.2f}s")




Device: cuda
GPU: NVIDIA GeForce RTX 3050 6GB Laptop GPU
VRAM (GB): 6.441926656
PROJECT_ROOT: d:\IIT\L6\FYP\ChagaSight
DATA_DIR: d:\IIT\L6\FYP\ChagaSight\data\processed
Experiment directory: d:\IIT\L6\FYP\ChagaSight\experiments\vit_baseline_1pct\20251226_175406
⏱ Cell 1 time: 0.04s


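#### Cell 1: What this does
- Seeds Python, NumPy, and PyTorch (CPU and CUDA) and switches cuDNN to deterministic mode for reproducibility.
- Detects the GPU and prints its name and total VRAM.
- Finds the project root by walking up from the working directory until a `data/` folder appears, so it works in VS Code even when the notebook lives in `/notebooks`.
- Creates a unique experiment run folder under `experiments/<EXP_NAME>/<RUN_ID>/`.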

#### Future improvements
- Run multiple seeds (e.g., 5 runs) and report mean ± std AUROC.
- Log CUDA + PyTorch versions for stronger reproducibility.
- For speed-focused runs (not baseline), enable `torch.backends.cudnn.benchmark = True`.
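
The version-logging idea above could be sketched as follows. This is a minimal example and not part of the original notebook: the helper name `collect_env_info` and the package list are assumptions, and it reads installed versions through the standard-library `importlib.metadata`, so it also runs where PyTorch is absent.

```python
import json
import platform
from importlib import metadata


def collect_env_info(packages=("torch", "numpy", "pandas", "scikit-learn")):
    """Return Python, OS, and package versions as a JSON-serialisable dict.

    Hypothetical helper: packages that are not installed are recorded as None.
    """
    info = {
        "python": platform.python_version(),
        "platform": platform.platform(),
    }
    for name in packages:
        try:
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            info[name] = None
    return info


env_info = collect_env_info()
print(json.dumps(env_info, indent=2))
```

In the notebook, the same dict could be merged into `config` before `config.json` is written, together with `torch.version.cuda` for the CUDA build.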


In [None]:
# =========================
# Cell 2 - Metadata loading + integrity filtering + subset + splits
# =========================
import time
from sklearn.model_selection import train_test_split
from pathlib import Path

start_time = time.time()

# -------------------------
# Datasets included
# -------------------------
datasets = ["ptbxl", "sami_trop", "code15"]
dfs = []

for ds in datasets:
    csv_path = DATA_DIR / "metadata" / f"{ds}_metadata.csv"
    if not csv_path.exists():
        raise FileNotFoundError(f"Missing metadata CSV: {csv_path}")

    df = pd.read_csv(csv_path)
    df["dataset"] = ds
    dfs.append(df)

# Combine all datasets
df_all = pd.concat(dfs, ignore_index=True)
print("Total metadata rows (raw):", len(df_all))

# -------------------------
# HARD integrity filter (relative-path safe)
# -------------------------
def img_exists(p):
    return (PROJECT_ROOT / Path(p)).exists()

exists_mask = df_all["img_path"].apply(img_exists)
missing_count = (~exists_mask).sum()

if missing_count > 0:
    print(f"⚠️ Dropping {missing_count} rows with missing image files")
    print(df_all.loc[~exists_mask, ["dataset", "img_path"]].head())

df_all = df_all.loc[exists_mask].reset_index(drop=True)
print("Rows after integrity filter:", len(df_all))

# -------------------------
# Subset control
# -------------------------
# IMPORTANT:
#   - Use 0.01 for smoke tests
#   - Use 1.0 for full training
subset_frac = 1.0

if subset_frac < 1.0:
    df_all = df_all.sample(frac=subset_frac, random_state=SEED).reset_index(drop=True)

print(f"Subset records ({subset_frac*100:.0f}%):", len(df_all))

# -------------------------
# Binary label ONLY for stratification / metrics
# (Model still uses soft labels)
# -------------------------
df_all["label_bin"] = (df_all["label"] > 0.5).astype(int)

# -------------------------
# Train / Val / Test split (80 / 10 / 10)
# -------------------------
train_df, temp_df = train_test_split(
    df_all,
    test_size=0.2,
    stratify=df_all["label_bin"],
    random_state=SEED
)

val_df, test_df = train_test_split(
    temp_df,
    test_size=0.5,
    stratify=temp_df["label_bin"],
    random_state=SEED
)

print(f"Train: {len(train_df)} | Val: {len(val_df)} | Test: {len(test_df)}")
print(f"⏱ Cell 2 time: {time.time() - start_time:.2f}s")


Total metadata rows (raw): 63228
Rows after integrity filter: 63228
Subset records (10%): 6323
Train: 5058 | Val: 632 | Test: 633
⏱ Cell 2 time: 4.14s


#### Cell 2: What this does
- Loads metadata CSVs for PTB-XL, SaMi-Trop, and CODE-15.
- Drops any rows where the `.npy` image file is missing (prevents DataLoader crashes).
- Optionally samples a fraction of the data via `subset_frac` (1% for smoke tests); the recorded output above reflects a 10% subset (6,323 of 63,228 rows).
- Creates stratified 80/10/10 train/val/test splits. The binary label (`label > 0.5`) is used only for stratification and metrics; training keeps the soft labels.
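
The sizes in the recorded output can be reproduced by hand. The sketch below assumes that `DataFrame.sample(frac=...)` rounds `frac * n` and that `train_test_split` assigns `ceil(test_size * n)` rows to the second partition when `test_size` is a float; the helper name `split_sizes` is ours.

```python
import math


def split_sizes(n_rows, subset_frac=0.10, test_size=0.2, holdout_frac=0.5):
    """Row counts after subsampling and two successive splits."""
    n_subset = round(subset_frac * n_rows)       # DataFrame.sample(frac=...)
    n_temp = math.ceil(test_size * n_subset)     # first split: 20% held out
    n_train = n_subset - n_temp
    n_test = math.ceil(holdout_frac * n_temp)    # second split: half of the hold-out
    n_val = n_temp - n_test
    return n_subset, n_train, n_val, n_test


print(split_sizes(63228))  # (6323, 5058, 632, 633)
```

The 317 training batches reported by Cell 3 follow the same way: `ceil(5058 / 16) = 317`.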

#### Future improvements
- Scale from `subset_frac=0.01` → `0.10` → `1.0` once stable.
- Consider dataset-aware splits (e.g., holding out one dataset for domain generalisation testing).
- If patient IDs exist, enforce patient-wise splitting to avoid leakage.
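
If a patient identifier is available, a group-wise split keeps every recording from one patient in a single partition. The sketch below is pure Python and assumes a hypothetical `patient_id` value per record; nothing shown in this notebook confirms that the metadata contains such a field. It does not stratify by label, so class balance would need to be checked afterwards.

```python
import random


def patient_wise_split(patient_ids, val_frac=0.1, test_frac=0.1, seed=42):
    """Assign each record to 'train', 'val', or 'test' by patient, not by record.

    patient_ids: one (hypothetical) patient identifier per record.
    Returns a list of partition names aligned with patient_ids.
    """
    unique_ids = sorted(set(patient_ids))
    random.Random(seed).shuffle(unique_ids)

    n_val = int(round(val_frac * len(unique_ids)))
    n_test = int(round(test_frac * len(unique_ids)))
    val_ids = set(unique_ids[:n_val])
    test_ids = set(unique_ids[n_val:n_val + n_test])

    def partition(pid):
        if pid in val_ids:
            return "val"
        if pid in test_ids:
            return "test"
        return "train"

    return [partition(pid) for pid in patient_ids]


# Toy data: 20 patients with 3 recordings each
ids = [f"p{i:02d}" for i in range(20) for _ in range(3)]
parts = patient_wise_split(ids)

# Record which partition each patient landed in
patient_partition = {}
for pid, part in zip(ids, parts):
    patient_partition.setdefault(pid, set()).add(part)

print({name: parts.count(name) for name in ("train", "val", "test")})
```

scikit-learn's `GroupShuffleSplit` and `StratifiedGroupKFold` give the same guarantee with array-based APIs.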


In [None]:
# =========================
# Cell 3 - Dataset + DataLoaders (FINAL, research-correct)
# =========================
import time
from pathlib import Path
from torch.utils.data import Dataset, DataLoader

start_time = time.time()

class ECGImageDataset(Dataset):
    """
    Dataset for 2D ECG image embeddings.

    Labels are SOFT labels used for weak supervision:
    - PTB-XL: 0.0 (definite negative)
    - SaMi-Trop: 1.0 (definite positive)
    - CODE-15: soft uncertainty labels (e.g., 0.2 / 0.8)
    """

    def __init__(self, df):
        self.df = df.reset_index(drop=True)

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        img_path = PROJECT_ROOT / Path(row["img_path"])

        if not img_path.exists():
            raise FileNotFoundError(f"Missing image file: {img_path}")

        img = np.load(img_path).astype(np.float32)

        # Strict research safety
        if img.shape != (3, 24, 2048):
            raise ValueError(f"Invalid image shape {img.shape} at {img_path}")

        # 🔴 CRITICAL FIX: normalize for ViT stability
        img = img / 255.0  # scale to [0,1]

        img = torch.from_numpy(img)

        # Soft label preserved
        label = torch.tensor(row["label"], dtype=torch.float32)

        return img, label


batch_size = 16  # RTX 3050 6GB safe

train_ds = ECGImageDataset(train_df)
val_ds   = ECGImageDataset(val_df)
test_ds  = ECGImageDataset(test_df)

# Sampler logic
if subset_frac < 0.1:
    print("⚠️ Oversampling enabled (debug subset)")
    from torch.utils.data import WeightedRandomSampler
    weights = train_df["label"].apply(lambda x: 10.0 if x > 0.7 else 1.0).values
    sampler = WeightedRandomSampler(weights, len(weights), replacement=True)

    train_loader = DataLoader(
        train_ds,
        batch_size=batch_size,
        sampler=sampler,
        num_workers=0,
        pin_memory=True
    )
else:
    print("✅ Full dataset training (natural distribution)")
    train_loader = DataLoader(
        train_ds,
        batch_size=batch_size,
        shuffle=True,
        num_workers=0,
        pin_memory=True
    )

val_loader = DataLoader(
    val_ds,
    batch_size=batch_size,
    shuffle=False,
    num_workers=0,
    pin_memory=True
)

test_loader = DataLoader(
    test_ds,
    batch_size=batch_size,
    shuffle=False,
    num_workers=0,
    pin_memory=True
)

# Sanity check
x_batch, y_batch = next(iter(train_loader))
print("✓ Batch shape:", x_batch.shape)
print("✓ Image range:", x_batch.min().item(), x_batch.max().item())
print("✓ Sample labels:", y_batch[:10].tolist())
print(f"⏱ Cell 3 time: {time.time() - start_time:.2f}s")


✅ Full-dataset training (natural class distribution)
✓ Batch image shape : torch.Size([16, 3, 24, 2048])
✓ Sample labels    : [0.0, 0.20000000298023224, 0.0, 0.0, 0.20000000298023224, 0.20000000298023224, 0.0, 0.0, 0.20000000298023224, 0.20000000298023224]
✓ Image range      : [0.0, 255.0]
Train samples : 5058
Train batches : 317
⏱ Cell 3 time: 0.26s


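### Why image normalisation is required
Vision Transformers are sensitive to input scale: attention scores are dot
products of token embeddings, so large raw pixel values inflate them and can
saturate the softmax. `ECGImageDataset` therefore divides each image by 255 to
map it to [0,1], which stabilises optimisation and keeps inputs comparable
across datasets. The recorded output above still reports an image range of
[0.0, 255.0], which suggests it was captured before this step was added;
re-running Cell 3 should confirm the new range.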
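
The scale sensitivity can be illustrated with a single row of dot-product attention. This is a generic pure-Python illustration, not a trace of this model: the vectors are made up, and the ViT defined in Cell 4 also applies LayerNorm before attention, which dampens the effect inside the blocks.

```python
import math


def attention_weights(query, keys):
    """Softmax over scaled dot products, as in one row of self-attention."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up token vectors with values in [0, 1]
query = [0.2, 0.9, 0.4, 0.7]
keys = [[0.1, 0.8, 0.3, 0.6], [0.9, 0.2, 0.5, 0.1], [0.4, 0.4, 0.4, 0.4]]

unit_scale = attention_weights(query, keys)

# Same tokens expressed in [0, 255]: every dot product grows by 255**2
scale = 255.0
raw_scale = attention_weights(
    [q * scale for q in query],
    [[k * scale for k in key] for key in keys],
)

print("max weight, [0, 1] inputs  :", round(max(unit_scale), 3))
print("max weight, [0, 255] inputs:", round(max(raw_scale), 3))
```

With [0, 1] inputs the weights stay spread across the three tokens; at [0, 255] the softmax collapses onto one token, which leaves almost no gradient for the others.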

### Why soft labels are used
CODE-15 annotations reflect uncertainty rather than binary truth.
Soft-label training enables weak supervision and avoids forcing noisy
labels into hard categories, aligning with the referenced ECG foundation
model literature.
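
For a soft target y and predicted probability p, binary cross-entropy is -[y log p + (1 - y) log(1 - p)], which is minimised at p = y rather than at 0 or 1. The pure-Python sketch below checks this numerically for a target of 0.2. It mirrors the formula behind `nn.BCEWithLogitsLoss` but is not the notebook's training code, and the function names are ours.

```python
import math


def soft_bce(logit, target):
    """Binary cross-entropy between sigmoid(logit) and a soft target in [0, 1]."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


def logit_of(p):
    """Inverse sigmoid, used only to pick evaluation points."""
    return math.log(p / (1.0 - p))


target = 0.2  # e.g. a record with a soft label of 0.2
for p in (0.01, 0.2, 0.5, 0.99):
    print(f"p={p:.2f}  loss={soft_bce(logit_of(p), target):.4f}")
```

The loss floor for a 0.2 target is the entropy of the label itself (about 0.50), so the training loss on soft labels does not approach zero even for a perfectly calibrated model.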


In [4]:
# =========================
# Cell 4 - ViT model + forward sanity + peak memory (clean)
# =========================
import time
import torch
import torch.nn as nn

start_time = time.time()

if device.type == "cuda":
    torch.cuda.reset_peak_memory_stats()

class PatchEmbedding(nn.Module):
    def __init__(self, patch_size=16, in_ch=3, embed_dim=768):
        super().__init__()
        self.patch_size = patch_size
        self.proj = nn.Conv2d(in_ch, embed_dim, kernel_size=patch_size, stride=patch_size)
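        # Conv2d without padding floors the grid: for 24 x 2048 inputs and 16 x 16
        # patches this is 1 x 128 = 128 patches; the bottom 8 pixel rows are not covered.
        self.num_patches = (24 // patch_size) * (2048 // patch_size)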

    def forward(self, x):
        x = self.proj(x)                  # (B, E, H', W')
        x = x.flatten(2).transpose(1, 2)  # (B, N, E)
        return x

class TransformerBlock(nn.Module):
    def __init__(self, embed_dim=768, heads=12, mlp_ratio=4.0, dropout=0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, heads, dropout=dropout, batch_first=True)
        self.norm2 = nn.LayerNorm(embed_dim)

        mlp_dim = int(embed_dim * mlp_ratio)
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, mlp_dim),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(mlp_dim, embed_dim),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        # Normalise once and reuse the tensor for query, key, and value
        h = self.norm1(x)
        y, _ = self.attn(h, h, h, need_weights=False)
        x = x + y
        x = x + self.mlp(self.norm2(x))
        return x

class ViTClassifier(nn.Module):
    def __init__(self, patch_size=16, embed_dim=768, depth=12, heads=12, mlp_ratio=4.0, dropout=0.1):
        super().__init__()
        self.patch_embed = PatchEmbedding(patch_size=patch_size, in_ch=3, embed_dim=embed_dim)
        num_patches = self.patch_embed.num_patches

        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))
        self.pos_drop = nn.Dropout(dropout)

        self.blocks = nn.ModuleList([
            TransformerBlock(embed_dim=embed_dim, heads=heads, mlp_ratio=mlp_ratio, dropout=dropout)
            for _ in range(depth)
        ])

        self.norm = nn.LayerNorm(embed_dim)
        self.head = nn.Linear(embed_dim, 1)

    def forward(self, x):
        B = x.shape[0]
        x = self.patch_embed(x)

        cls_tokens = self.cls_token.expand(B, -1, -1)
        x = torch.cat((cls_tokens, x), dim=1)

        x = x + self.pos_embed
        x = self.pos_drop(x)

        for blk in self.blocks:
            x = blk(x)

        x = self.norm(x[:, 0])
        return self.head(x).squeeze(-1)

# Instantiate
model = ViTClassifier().to(device)

num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"ViT trainable parameters: {num_params:,}")

# Forward sanity
model.eval()
with torch.no_grad():
    logits = model(x_batch.to(device))

print("✓ Forward OK | logits shape:", logits.shape)

if device.type == "cuda":
    peak_mem = torch.cuda.max_memory_allocated() / 1e9
    print(f"✓ Peak GPU memory used (GB): {peak_mem:.2f}")

print(f"⏱ Cell 4 time: {time.time() - start_time:.2f}s")


ViT trainable parameters: 85,747,201
✓ Forward OK | logits shape: torch.Size([16])
✓ Peak GPU memory used (GB): 0.44
⏱ Cell 4 time: 1.65s


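#### Cell 4: What this does
- Defines a ViT-B/16-like classifier from scratch (12 blocks, 768-dim embeddings, 12 heads; 85,747,201 trainable parameters).
- With 16 x 16 patches on a 24 x 2048 input, the patch embedding yields a 1 x 128 grid (128 tokens plus the class token); the bottom 8 pixel rows fall outside every patch.
- Confirms that the forward pass works on a real batch.
- Prints peak GPU memory for the inference-only forward pass as a sanity check.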
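
The token count can be checked with the convolution output-size formula, floor((size - kernel) / stride) + 1 per axis. The sketch below applies it to the 24 x 2048 input with 16 x 16 patches and no padding, as in `PatchEmbedding`; the helper name `patch_grid` is an assumption.

```python
def patch_grid(height, width, patch_size):
    """Patch rows, patch columns, and uncovered pixel rows/columns for an unpadded strided conv."""
    rows = (height - patch_size) // patch_size + 1
    cols = (width - patch_size) // patch_size + 1
    unused_rows = height - rows * patch_size
    unused_cols = width - cols * patch_size
    return rows, cols, unused_rows, unused_cols


rows, cols, unused_rows, unused_cols = patch_grid(24, 2048, 16)
num_tokens = rows * cols + 1  # +1 for the class token
print(rows, cols, num_tokens, unused_rows, unused_cols)  # 1 128 129 8 0
```

A patch height that divides 24 would cover every row; for example, `patch_grid(24, 2048, 8)` gives a 3 x 256 grid, at the cost of a much longer token sequence.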

#### Future improvements
- Try smaller models for ablation (e.g., depth=8, embed_dim=512).
- Add regularisation tuning (dropout, stochastic depth) for full-scale training.
- Consider pretraining (foundation model) before supervised fine-tuning.
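
Parameter counts for ablation candidates can be estimated without building the model. The sketch below counts the weights of the architecture defined in Cell 4 layer by layer and reproduces the 85,747,201 figure printed above for the default configuration. The function name `vit_param_count` is ours.

```python
def vit_param_count(embed_dim=768, depth=12, mlp_ratio=4.0,
                    patch_size=16, in_ch=3, num_patches=128):
    """Trainable parameters of the Cell 4 ViTClassifier, counted analytically."""
    mlp_dim = int(embed_dim * mlp_ratio)

    patch_embed = in_ch * patch_size * patch_size * embed_dim + embed_dim
    cls_and_pos = embed_dim + (num_patches + 1) * embed_dim

    attn = 3 * embed_dim * embed_dim + 3 * embed_dim   # packed Q, K, V projection
    attn += embed_dim * embed_dim + embed_dim          # output projection
    mlp = embed_dim * mlp_dim + mlp_dim + mlp_dim * embed_dim + embed_dim
    norms = 2 * (2 * embed_dim)                        # two LayerNorms per block
    block = attn + mlp + norms

    final_norm = 2 * embed_dim
    head = embed_dim + 1
    return patch_embed + cls_and_pos + depth * block + final_norm + head


baseline = vit_param_count()
smaller = vit_param_count(embed_dim=512, depth=8)  # ablation candidate from the list above
print(f"{baseline:,}")  # 85,747,201
print(f"{smaller:,}")
```

The number of attention heads does not appear in the count: heads split the same projection matrices, so changing `heads` alters compute patterns but not parameters.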


In [None]:
# =========================
# Cell 5 - Training loop FIXED (AMP warnings removed + avg loss + full logging)
# =========================
import time, json
import torch.nn as nn
import torch.optim as optim
from sklearn.metrics import roc_auc_score
from tqdm.auto import tqdm

start_time = time.time()

# -------------------------
# Training configuration
# -------------------------
num_epochs = 5
learning_rate = 3e-4
weight_decay = 0.05

criterion = nn.BCEWithLogitsLoss()
optimizer = optim.AdamW(model.parameters(), lr=learning_rate, weight_decay=weight_decay)
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_epochs)

use_amp = (device.type == "cuda")

# NEW AMP API (removes FutureWarning)
scaler = torch.amp.GradScaler("cuda", enabled=use_amp)

best_val_auc = 0.0
best_model_path = EXP_DIR / "model_best.pth"
history = []

print("Starting training...")
print("AMP enabled:", use_amp)

# Save full config
config = {
    "experiment_name": EXP_NAME,
    "run_id": RUN_ID,
    "seed": SEED,
    "datasets": ["ptbxl", "sami_trop", "code15"],
    "subset_frac": subset_frac,
    "train/val/test_sizes": {"train": len(train_df), "val": len(val_df), "test": len(test_df)},
    "batch_size": batch_size,
    "epochs": num_epochs,
    "learning_rate": learning_rate,
    "weight_decay": weight_decay,
    "optimizer": "AdamW",
    "scheduler": "CosineAnnealingLR",
    "amp": use_amp,
    "input_shape": [3, 24, 2048],
    "vit": {"patch_size": 16, "embed_dim": 768, "depth": 12, "heads": 12, "mlp_ratio": 4.0, "dropout": 0.1},
}
if device.type == "cuda":
    config["gpu_name"] = torch.cuda.get_device_name(0)
    config["vram_gb"] = float(torch.cuda.get_device_properties(0).total_memory / 1e9)

with open(EXP_DIR / "config.json", "w") as f:
    json.dump(config, f, indent=2)

# -------------------------
# Epoch loop
# -------------------------
for epoch in range(num_epochs):
    epoch_start = time.time()

    # ---- Train ----
    model.train()
    running_loss = 0.0
    n_batches = 0

    train_bar = tqdm(train_loader, desc=f"Epoch {epoch+1}/{num_epochs} [Train]", leave=False)

    for imgs, labels in train_bar:
        imgs = imgs.to(device, non_blocking=True)
        labels = labels.to(device, non_blocking=True)

        optimizer.zero_grad(set_to_none=True)

        with torch.amp.autocast("cuda", enabled=use_amp):
            logits = model(imgs)
            loss = criterion(logits, labels)

        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()

        running_loss += loss.item()
        n_batches += 1

        train_bar.set_postfix(loss=f"{loss.item():.4f}")

    train_loss = running_loss / max(1, n_batches)

    # ---- Val ----
    model.eval()
    val_preds, val_trues = [], []

    with torch.no_grad():
        for imgs, labels in tqdm(val_loader, desc=f"Epoch {epoch+1}/{num_epochs} [Val]", leave=False):
            imgs = imgs.to(device, non_blocking=True)
            probs = torch.sigmoid(model(imgs)).cpu().numpy()
            val_preds.extend(probs)
            val_trues.extend(labels.numpy())

    val_trues = np.asarray(val_trues)
    val_preds = np.asarray(val_preds)

    # Binary labels for metric ONLY
    val_trues_bin = (val_trues > 0.5).astype(int)
    val_auc = roc_auc_score(val_trues_bin, val_preds)

    # Save best model
    improved = ""
    if val_auc > best_val_auc:
        best_val_auc = val_auc
        torch.save(model.state_dict(), best_model_path)
        improved = "✅"

    # Capture the LR used during this epoch before the scheduler advances it
    epoch_lr = optimizer.param_groups[0]["lr"]
    scheduler.step()

    epoch_time = time.time() - epoch_start

    history.append({
        "epoch": epoch + 1,
        "train_loss_avg": float(train_loss),
        "val_auc": float(val_auc),
        "epoch_time_sec": float(epoch_time),
        "lr": float(epoch_lr),
    })

    print(
        f"Epoch {epoch+1:02d} | "
        f"loss(avg)={train_loss:.4f} | "
        f"val AUROC={val_auc:.4f} {improved} | "
        f"time={epoch_time:.1f}s"
    )

# Save metrics.csv
pd.DataFrame(history).to_csv(EXP_DIR / "metrics.csv", index=False)

print("\nTraining complete.")
print("Best validation AUROC:", best_val_auc)
print("Saved best model to:", best_model_path)
print("Saved metrics to:", EXP_DIR / "metrics.csv")
print("Saved config to:", EXP_DIR / "config.json")
print(f"⏱ Cell 5 total time: {time.time() - start_time:.2f}s")


Starting training...
AMP enabled: True


Epoch 01 | loss(avg)=0.4588 | val AUROC=0.6563 ✅ | time=147.2s

Training complete.
Best validation AUROC: 0.6563146997929605
Saved best model to: d:\IIT\L6\FYP\ChagaSight\experiments\vit_baseline_1pct\20251226_175406\model_best.pth
Saved metrics to: d:\IIT\L6\FYP\ChagaSight\experiments\vit_baseline_1pct\20251226_175406\metrics.csv
Saved config to: d:\IIT\L6\FYP\ChagaSight\experiments\vit_baseline_1pct\20251226_175406\config.json
⏱ Cell 5 total time: 149.29s


#### Cell 5: What this does
- Trains the ViT model for a fixed number of epochs.
- Uses AMP (mixed precision) to reduce VRAM usage and improve speed on the RTX 3050.
- Computes validation AUROC each epoch (binary threshold only for the metric, not for training).
- Saves a fully reproducible experiment record:
  - `config.json` (model + hyperparameters + data sizes + GPU info)
  - `metrics.csv` (epoch-by-epoch loss, AUROC, timing, LR)
  - `model_best.pth` (best checkpoint selected by validation AUROC)
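
The `lr` column follows the cosine schedule. Assuming the default `eta_min = 0` and one `scheduler.step()` per epoch, `CosineAnnealingLR` matches the closed form below; the function name `cosine_lr` is ours.

```python
import math


def cosine_lr(step, base_lr=3e-4, t_max=5, eta_min=0.0):
    """Closed-form cosine annealing: learning rate after `step` scheduler steps."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + math.cos(math.pi * step / t_max))


# LR in effect while training epochs 1..5 (the scheduler steps after each epoch)
schedule = [cosine_lr(step) for step in range(5)]
for epoch, lr in enumerate(schedule, start=1):
    print(f"epoch {epoch}: lr = {lr:.2e}")
```

With `T_max = num_epochs = 5`, the final epoch trains at roughly a tenth of the base rate, and the rate reaches zero only after the last step.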

#### Future improvements
- Increase epochs for full-scale training (10–50).
- Add early stopping based on AUROC plateau.
- Replace oversampling with class-weighted loss or focal loss (ablation).
- Add AUPRC and threshold metrics for clinical relevance.
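
Early stopping on an AUROC plateau needs only a small amount of state. The class below is a sketch with assumed names (`EarlyStopping`, `patience`, `min_delta`); it is not part of the notebook and would be called once per epoch with the validation AUROC.

```python
class EarlyStopping:
    """Stop when the monitored score has not improved for `patience` epochs."""

    def __init__(self, patience=5, min_delta=1e-4):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, score):
        """Record one epoch's score; return True when training should stop."""
        if score > self.best + self.min_delta:
            self.best = score
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Made-up AUROC trace: improves for three epochs, then plateaus
stopper = EarlyStopping(patience=2)
stopped_at = None
for epoch, auc in enumerate([0.62, 0.66, 0.70, 0.70, 0.69, 0.71], start=1):
    if stopper.step(auc):
        stopped_at = epoch
        break

print("stopped at epoch", stopped_at, "| best AUROC", stopper.best)
```

In this trace the loop stops before the late improvement at epoch 6, which is the usual trade-off: a larger `patience` tolerates noisy validation curves at the cost of extra epochs.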


In [6]:
# =========================
# Cell 6 - Test evaluation + save test_results.json (full)
# =========================
import time, json
from sklearn.metrics import roc_auc_score
from tqdm.auto import tqdm

start_time = time.time()

# Load best model from this run folder
best_model_path = EXP_DIR / "model_best.pth"
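if not best_model_path.exists():
    raise FileNotFoundError(f"Best model checkpoint not found: {best_model_path}")

# weights_only=True restricts unpickling to tensors and plain containers
model.load_state_dict(torch.load(best_model_path, map_location=device, weights_only=True))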
model.eval()

print("Loaded best model from:", best_model_path)

test_preds, test_trues = [], []

with torch.no_grad():
    for imgs, labels in tqdm(test_loader, desc="Test evaluation"):
        imgs = imgs.to(device, non_blocking=True)
        probs = torch.sigmoid(model(imgs)).cpu().numpy()
        test_preds.extend(probs)
        test_trues.extend(labels.numpy())

test_preds = np.asarray(test_preds)
test_trues = np.asarray(test_trues)

# Binary labels ONLY for metric
test_trues_bin = (test_trues > 0.5).astype(int)
test_auc = roc_auc_score(test_trues_bin, test_preds)

test_time = time.time() - start_time

print("\n=== TEST RESULTS ===")
print(f"Test AUROC : {test_auc:.4f}")
print(f"⏱ Cell 6 time: {test_time:.2f}s")

# Save test results
test_results = {
    "test_auc": float(test_auc),
    "num_test_samples": int(len(test_trues)),
    "evaluation_time_sec": float(test_time),
}
with open(EXP_DIR / "test_results.json", "w") as f:
    json.dump(test_results, f, indent=2)

print("Saved test results to:", EXP_DIR / "test_results.json")


Loaded best model from: d:\IIT\L6\FYP\ChagaSight\experiments\vit_baseline_1pct\20251226_175406\model_best.pth



=== TEST RESULTS ===
Test AUROC : 0.6843
⏱ Cell 6 time: 11.66s
Saved test results to: d:\IIT\L6\FYP\ChagaSight\experiments\vit_baseline_1pct\20251226_175406\test_results.json


#### Cell 6: What this does
- Loads the best checkpoint from this run folder.
- Evaluates on the held-out test split.
- Saves `test_results.json` so results are permanently stored.

#### Future improvements
- Report confidence intervals via bootstrapping (important for medical AI).
- Add subgroup evaluation by dataset (PTB-XL vs SaMi-Trop vs CODE-15).
- Add threshold-based metrics (sensitivity, specificity) for clinical interpretation.
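
A percentile bootstrap is one way to attach a confidence interval to the test AUROC. The sketch below is pure Python with a rank-based AUROC so that it runs without scikit-learn; the function names, the 1,000 resamples, and the toy scores are all assumptions for illustration.

```python
import random


def auroc(labels, scores):
    """AUROC as the probability that a positive outranks a negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auroc_ci(labels, scores, n_boot=1000, alpha=0.05, seed=42):
    """Percentile bootstrap interval; resamples that lack one class are skipped."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:
            stats.append(auroc(ys, [scores[i] for i in idx]))
    stats.sort()
    lo = stats[int((alpha / 2) * len(stats))]
    hi = stats[min(len(stats) - 1, int((1 - alpha / 2) * len(stats)))]
    return lo, hi


# Toy data: positives score higher on average than negatives
rng = random.Random(0)
labels = [1] * 30 + [0] * 120
scores = [rng.gauss(0.6, 0.2) if y else rng.gauss(0.4, 0.2) for y in labels]

point = auroc(labels, scores)
low, high = bootstrap_auroc_ci(labels, scores)
print(f"AUROC = {point:.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```

In the notebook, `test_trues_bin` and `test_preds` from Cell 6 would take the place of the toy arrays. If several recordings come from the same patient, the resampling unit should be the patient rather than the recording.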
