# **General Description**

---


# Datasets Used:
- PathMNIST: a large 9-class pathology image dataset of 28x28 RGB (3-channel) images.
- ChestMNIST: a grayscale chest X-ray dataset with 14 possible disease labels. Each image can have multiple labels (diseases).
# Pre-Processing Steps:
- Resized the images to 224x224.
- Converted to 3 channels where needed.
- Normalized with mean = 0.5 and std = 0.5.
- Loaded in batches of 256 for better training speed (training was very slow otherwise).
- I trained on subsets of the datasets (10k for PathMNIST, 8k for ChestMNIST) so that runs finish in a reasonable time frame (I have Google Colab GPU usage limits and need them for other projects).
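
To make the normalization step concrete: `ToTensor` scales pixel values to [0, 1], and normalizing with mean 0.5 and std 0.5 then maps them to [-1, 1]. The sketch below reproduces only that arithmetic in plain Python; the helper name `normalize_pixel` is illustrative and not part of torchvision.

```python
def normalize_pixel(value_0_255, mean=0.5, std=0.5):
    """Mimic ToTensor (scale to [0, 1]) followed by Normalize(mean, std)."""
    scaled = value_0_255 / 255.0
    return (scaled - mean) / std

# Darkest, mid-gray, and brightest pixel values
for raw in (0, 128, 255):
    print(raw, "->", round(normalize_pixel(raw), 4))
```

The darkest pixel lands at -1 and the brightest at +1, so inputs are roughly centered around zero before they reach the pretrained backbones.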

# Other Info

- I used accuracy and macro F1 score as my performance metrics.
- I ran ResNet-18, VGG-16, ViT-Base, and DINO ViT-Small.
- I trained only the model heads (the pretrained backbones were frozen).
- I used 1 epoch per model because of GPU and time limits.
- The optimizer was Adam with a learning rate of 1e-3.
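
For reference, here is a small framework-free sketch of the two metrics. It is meant to mirror what `accuracy_score` and `f1_score(..., average="macro")` compute, with classes that are never predicted correctly contributing an F1 of 0; the function names and toy labels are made up for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted class equals the true class."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every class seen in labels or predictions."""
    classes = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / len(f1s)

# Toy example: class 1 is never predicted
y_true = [0, 0, 0, 1, 2]
y_pred = [0, 0, 0, 0, 2]
print("accuracy:", accuracy(y_true, y_pred))
print("macro F1:", round(macro_f1(y_true, y_pred), 4))
```

Because macro F1 weights every class equally, a model can score well on accuracy while scoring poorly on macro F1 if it ignores rare classes.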


# Imports, configuration, and dataset metadata


In [None]:
!pip install -q medmnist timm torchinfo

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Subset
from torchvision import transforms
import timm
import medmnist
from medmnist import INFO

import numpy as np
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix
from sklearn.linear_model import LogisticRegression
import matplotlib.pyplot as plt
import scipy.stats as st
import time

print("CUDA available:", torch.cuda.is_available())
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hyperparameters tuned for speed
BATCH_SIZE = 256
EPOCHS = 1           # increase to 2 if you have time
NUM_WORKERS = 2

DATASET_1 = "pathmnist"
DATASET_2 = "chestmnist"

info1 = INFO[DATASET_1]
info2 = INFO[DATASET_2]

DataClass1 = getattr(medmnist, info1['python_class'])
DataClass2 = getattr(medmnist, info2['python_class'])


CUDA available: True


##  MedMNIST dataset loading and preprocessing (PathMNIST + ChestMNIST)

In [None]:
# 224x224 + simple normalization for all pretrained models
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5], std=[0.5])
])

# ----- Dataset 1: PathMNIST -----
train1 = DataClass1(split='train', transform=transform, download=True)
test1  = DataClass1(split='test',  transform=transform, download=True)

# Optional: use a subset for faster training; set to None for full dataset
SUBSET_TRAIN1 = 10000   # e.g., 10k samples; change to None for full

if SUBSET_TRAIN1 is not None:
    idx1 = np.random.choice(len(train1), SUBSET_TRAIN1, replace=False)
    train1_sub = Subset(train1, idx1)
else:
    train1_sub = train1

train_loader1 = DataLoader(
    train1_sub,
    batch_size=BATCH_SIZE,
    shuffle=True,
    num_workers=NUM_WORKERS,
    pin_memory=True
)

test_loader1 = DataLoader(
    test1,
    batch_size=BATCH_SIZE,
    shuffle=False,
    num_workers=NUM_WORKERS,
    pin_memory=True
)

# ----- Dataset 2: ChestMNIST -----
train2 = DataClass2(split='train', transform=transform, download=True)
test2  = DataClass2(split='test',  transform=transform, download=True)

SUBSET_TRAIN2 = 8000    # keep this smaller; change to None for full

if SUBSET_TRAIN2 is not None:
    idx2 = np.random.choice(len(train2), SUBSET_TRAIN2, replace=False)
    train2_sub = Subset(train2, idx2)
else:
    train2_sub = train2

train_loader2 = DataLoader(
    train2_sub,
    batch_size=BATCH_SIZE,
    shuffle=True,
    num_workers=NUM_WORKERS,
    pin_memory=True
)

test_loader2 = DataLoader(
    test2,
    batch_size=BATCH_SIZE,
    shuffle=False,
    num_workers=NUM_WORKERS,
    pin_memory=True
)


100%|██████████| 206M/206M [02:57<00:00, 1.16MB/s]
100%|██████████| 82.8M/82.8M [00:22<00:00, 3.75MB/s]


# Training and evaluation utilities

In [None]:
# ----- Determine class count safely for dataset 1 -----
num_classes1 = (
    info1.get("num_classes") or
    info1.get("n_classes") or
    len(info1.get("label", [])) or
    len(info1.get("labels", []))
)

if not num_classes1:
    try:
        num_classes1 = train1.num_classes
    except AttributeError:
        num_classes1 = len(np.unique(train1.labels))

print("Dataset 1 (PathMNIST) classes:", num_classes1)

# For dataset 2
num_classes2 = (
    info2.get("num_classes") or
    info2.get("n_classes") or
    len(info2.get("label", [])) or
    len(info2.get("labels", []))
)

if not num_classes2:
    try:
        num_classes2 = train2.num_classes
    except AttributeError:
        num_classes2 = len(np.unique(train2.labels))

print("Dataset 2 (ChestMNIST) classes:", num_classes2)


def train_model(model, loader, epochs=EPOCHS):
    model.to(device)
    params = filter(lambda p: p.requires_grad, model.parameters())
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(params, lr=1e-3)

    for ep in range(epochs):
        model.train()
        running_loss = 0.0
        for x, y in loader:
            x = x.to(device, non_blocking=True)
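            # Flatten (B, 1) or (B,) labels to (B,); unlike squeeze(), this stays 1-D for a batch of one
            y = y.reshape(-1).long().to(device, non_blocking=True)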

            optimizer.zero_grad()
            pred = model(x)
            loss = criterion(pred, y)
            loss.backward()
            optimizer.step()
            running_loss += loss.item() * x.size(0)

        epoch_loss = running_loss / len(loader.dataset)
        print(f"Epoch {ep+1}/{epochs} - loss: {epoch_loss:.4f}")

    return model


def evaluate(model, loader):
    model.to(device)
    model.eval()
    all_preds, all_labels = [], []

    with torch.no_grad():
        for x, y in loader:
            x = x.to(device, non_blocking=True)
            pred = model(x)
            all_preds.extend(pred.cpu().argmax(1).numpy())
            # reshape(-1) keeps a 1-D array even if the last batch holds a single sample
            all_labels.extend(y.reshape(-1).numpy())

    all_preds = np.array(all_preds)
    all_labels = np.array(all_labels)

    acc = accuracy_score(all_labels, all_preds)
    f1  = f1_score(all_labels, all_preds, average="macro")
    cm  = confusion_matrix(all_labels, all_preds)
    return acc, f1, cm, all_preds


Dataset 1 (PathMNIST) classes: 9
Dataset 2 (ChestMNIST) classes: 14


# Pre-trained model builders (ResNet18, VGG16, ViT-Base, DINO-ViT)

In [None]:
def build_resnet(num_classes):
    model = timm.create_model("resnet18", pretrained=True)
    for name, param in model.named_parameters():
        if "fc" not in name:
            param.requires_grad = False
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model

def build_vgg(num_classes):
    # Let timm build the correct classifier
    model = timm.create_model("vgg16", pretrained=True, num_classes=num_classes)

    # Freeze backbone layers
    for name, param in model.named_parameters():
        # timm VGG uses "head" for the classifier module
        if "head" not in name:
            param.requires_grad = False

    return model



def build_vit(num_classes):
    model = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=num_classes)
    for name, param in model.named_parameters():
        if "head" not in name:
            param.requires_grad = False
    return model

def build_dino(num_classes):
    model = timm.create_model("vit_small_patch16_224.dino", pretrained=True)

    # Freeze all layers except head
    for name, param in model.named_parameters():
        if "head" not in name:
            param.requires_grad = False

    # Replace head properly using model.num_features
    model.head = nn.Linear(model.num_features, num_classes)

    return model



# Fine-tuning individual models on PathMNIST

In [None]:
# ===== Dataset 1: PathMNIST =====
print("=== DATASET 1: PathMNIST ===")

# ResNet
print("\n=== Training ResNet-18 (PathMNIST) ===")
start = time.time()
resnet1 = build_resnet(num_classes1)
resnet1 = train_model(resnet1, train_loader1)
acc_res1, f1_res1, cm_res1, preds_res1 = evaluate(resnet1, test_loader1)
print(f"ResNet-18 done in {time.time()-start:.2f} s")
print("Accuracy:", acc_res1, "| F1:", f1_res1)

# VGG-16
print("\n=== Training VGG-16 (PathMNIST) ===")
start = time.time()
vgg1 = build_vgg(num_classes1)
vgg1 = train_model(vgg1, train_loader1)
acc_vgg1, f1_vgg1, cm_vgg1, preds_vgg1 = evaluate(vgg1, test_loader1)
print(f"VGG-16 done in {time.time()-start:.2f} s")
print("Accuracy:", acc_vgg1, "| F1:", f1_vgg1)

# ViT
print("\n=== Training ViT-Base (PathMNIST) ===")
start = time.time()
vit1 = build_vit(num_classes1)
vit1 = train_model(vit1, train_loader1)
acc_vit1, f1_vit1, cm_vit1, preds_vit1 = evaluate(vit1, test_loader1)
print(f"ViT-Base done in {time.time()-start:.2f} s")
print("Accuracy:", acc_vit1, "| F1:", f1_vit1)

# DINO
print("\n=== Training DINO ViT-Small (PathMNIST) ===")
start = time.time()
dino1 = build_dino(num_classes1)
dino1 = train_model(dino1, train_loader1)
acc_dino1, f1_dino1, cm_dino1, preds_dino1 = evaluate(dino1, test_loader1)
print(f"DINO done in {time.time()-start:.2f} s")
print("Accuracy:", acc_dino1, "| F1:", f1_dino1)


=== DATASET 1: PathMNIST ===

=== Training ResNet-18 (PathMNIST) ===
Epoch 1/1 - loss: 1.8817
ResNet-18 done in 36.28 s
Accuracy: 0.5916434540389972 | F1: 0.46720830040490086

=== Training VGG-16 (PathMNIST) ===
Epoch 1/1 - loss: 1.0577
VGG-16 done in 117.81 s
Accuracy: 0.8064066852367688 | F1: 0.7468896132710968

=== Training ViT-Base (PathMNIST) ===
Epoch 1/1 - loss: 1.0193
ViT-Base done in 211.44 s
Accuracy: 0.7987465181058496 | F1: 0.7377623114521275

=== Training DINO ViT-Small (PathMNIST) ===
Epoch 1/1 - loss: 1.8099
DINO done in 61.25 s
Accuracy: 0.8231197771587744 | F1: 0.7493981812862142


# Ensemble learning on PathMNIST (majority vote, weighted soft-voting, stacking)

In [None]:
labels1 = test1.labels.squeeze()

# Majority voting
all_votes1 = np.vstack([preds_res1, preds_vgg1, preds_vit1, preds_dino1])
maj_preds1 = st.mode(all_votes1, axis=0, keepdims=True).mode.squeeze()

acc_maj1 = accuracy_score(labels1, maj_preds1)
f1_maj1  = f1_score(labels1, maj_preds1, average="macro")
cm_maj1  = confusion_matrix(labels1, maj_preds1)

# Weighted averaging (soft voting)
def get_soft(model, loader):
    model.eval()
    model.to(device)
    soft = []
    with torch.no_grad():
        for x, _ in loader:
            x = x.to(device, non_blocking=True)
            probs = nn.Softmax(dim=1)(model(x)).cpu().numpy()
            soft.append(probs)
    return np.vstack(soft)

soft_res1  = get_soft(resnet1, test_loader1)
soft_vgg1  = get_soft(vgg1, test_loader1)
soft_vit1  = get_soft(vit1, test_loader1)
soft_dino1 = get_soft(dino1, test_loader1)

weights1 = np.array([acc_res1, acc_vgg1, acc_vit1, acc_dino1])
weights1 = weights1 / weights1.sum()

weighted_soft1 = (
    weights1[0]*soft_res1 +
    weights1[1]*soft_vgg1 +
    weights1[2]*soft_vit1 +
    weights1[3]*soft_dino1
)

weighted_preds1 = weighted_soft1.argmax(axis=1)

acc_w1 = accuracy_score(labels1, weighted_preds1)
f1_w1  = f1_score(labels1, weighted_preds1, average="macro")
cm_w1  = confusion_matrix(labels1, weighted_preds1)

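# Stacking
# Note: the meta-learner below is fit and scored on the same test-set predictions,
# and its inputs are hard class indices treated as numeric features.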
stack_X1 = np.vstack([
    preds_res1,
    preds_vgg1,
    preds_vit1,
    preds_dino1
]).T

stack_y1 = labels1

meta1 = LogisticRegression(max_iter=500)
meta1.fit(stack_X1, stack_y1)
stack_preds1 = meta1.predict(stack_X1)

acc_s1 = accuracy_score(stack_y1, stack_preds1)
f1_s1  = f1_score(stack_y1, stack_preds1, average="macro")
cm_s1  = confusion_matrix(stack_y1, stack_preds1)


# Re-training DINO ViT-Small on PathMNIST

In [None]:
print("\n=== Training DINO ViT-Small ===")
start = time.time()

# Build DINO model properly
dino = timm.create_model("vit_small_patch16_224.dino", pretrained=True)

# Freeze everything except head
for name, param in dino.named_parameters():
    if "head" not in name:
        param.requires_grad = False

# Replace head classifier
dino.head = nn.Linear(dino.num_features, num_classes1)

print("DINO: starting training...")
dino = train_model(dino, train_loader1)
print("DINO: evaluating...")
acc_dino1, f1_dino1, cm_dino1, preds_dino1 = evaluate(dino, test_loader1)

end = time.time()
print(f"DINO done in {end-start:.2f} seconds")
print("Accuracy:", acc_dino1, "| F1:", f1_dino1)



=== Training DINO ViT-Small ===
DINO: starting training...
Epoch 1/1 - loss: 1.2776
DINO: evaluating...
DINO done in 65.42 seconds
Accuracy: 0.834958217270195 | F1: 0.7762318564602073


# ChestMNIST preprocessing + ResNet18 training

In [None]:
print("\n=== CHESTMNIST PREPROCESSING ===")

# 1) Reload ChestMNIST with correct transform (grayscale -> 3-channel RGB)
transform_chest = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.Grayscale(num_output_channels=3),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5],
                         std=[0.5, 0.5, 0.5]),
])

train2 = DataClass2(split='train', transform=transform_chest, download=True)
test2  = DataClass2(split='test',  transform=transform_chest, download=True)

# 2) Convert multi-label (14-d) ChestMNIST labels -> single class index
#    (images with no positive label map to 0, the same index as the first disease label)
def chest_single_label(dataset):
    labels = []
    for y in dataset.labels:
        y = y.astype(int)
        if y.sum() == 0:
            labels.append(0)
        else:
            labels.append(np.argmax(y))
    dataset.labels = np.array(labels)
    return dataset

train2 = chest_single_label(train2)
test2  = chest_single_label(test2)

# 3) DataLoaders for ChestMNIST (built on the full reloaded splits; SUBSET_TRAIN2 is not re-applied here)
train_loader2 = DataLoader(
    train2,
    batch_size=BATCH_SIZE,
    shuffle=True,
    num_workers=NUM_WORKERS,
    pin_memory=True
)

test_loader2 = DataLoader(
    test2,
    batch_size=BATCH_SIZE,
    shuffle=False,
    num_workers=NUM_WORKERS,
    pin_memory=True
)

# 4) Derive number of classes from processed labels
num_classes2 = len(np.unique(train2.labels))
print("ChestMNIST num_classes (after collapsing):", num_classes2)

# 5) Reuse ResNet builder (if not already defined exactly like this above)
def build_resnet(num_classes):
    model = timm.create_model("resnet18", pretrained=True)
    # freeze backbone
    for name, param in model.named_parameters():
        if "fc" not in name:
            param.requires_grad = False
    # new classification head
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model

print("\n=== TRAINING CHESTMNIST RESNET-18 ===")
start = time.time()
resnet2 = build_resnet(num_classes2)
resnet2 = train_model(resnet2, train_loader2)
acc_res2, f1_res2, cm_res2, preds_res2 = evaluate(resnet2, test_loader2)
print(f"ResNet-18 (ChestMNIST) done in {time.time()-start:.2f} s")
print("Accuracy:", acc_res2, "| F1:", f1_res2)



=== CHESTMNIST PREPROCESSING ===
ChestMNIST num_classes (after collapsing): 14

=== TRAINING CHESTMNIST RESNET-18 ===
Epoch 1/1 - loss: 1.3987
ResNet-18 (ChestMNIST) done in 182.95 s
Accuracy: 0.6395934560691838 | F1: 0.05572753013007492


## Final results summary + confusion matrices

In [None]:

print("\n================ FINAL RESULTS ================")

print("\n=== PATHMNIST INDIVIDUAL MODELS ===")
print("ResNet-18 :", acc_res1,  f1_res1)
print("VGG-16    :", acc_vgg1,  f1_vgg1)
print("ViT-Base  :", acc_vit1,  f1_vit1)
print("DINO ViT  :", acc_dino1, f1_dino1)

print("\n=== PATHMNIST ENSEMBLES ===")
print("Majority Voting :", acc_maj1, f1_maj1)
print("Weighted Avg    :", acc_w1,  f1_w1)
print("Stacking        :", acc_s1,  f1_s1)

print("\n=== CHESTMNIST (ResNet-18 only) ===")
print("ResNet-18 :", acc_res2, f1_res2)

print("\n================ END METRICS ================\n")


# --- Confusion matrix plotting helper ---
def plot_confusion(cm, class_names, title):
    plt.figure(figsize=(4, 4))
    plt.imshow(cm, interpolation="nearest", cmap="Blues")
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(class_names))
    plt.xticks(tick_marks, class_names, rotation=45, ha="right")
    plt.yticks(tick_marks, class_names)
    plt.xlabel("Predicted")
    plt.ylabel("True")
    plt.tight_layout()
    plt.show()







=== PATHMNIST INDIVIDUAL MODELS ===
ResNet-18 : 0.5916434540389972 0.46720830040490086
VGG-16    : 0.8064066852367688 0.7468896132710968
ViT-Base  : 0.7987465181058496 0.7377623114521275
DINO ViT  : 0.834958217270195 0.7762318564602073

=== PATHMNIST ENSEMBLES ===
Majority Voting : 0.8445682451253482 0.7771834821561808
Weighted Avg    : 0.8682451253481894 0.8169716029008476
Stacking        : 0.700974930362117 0.5683378180649454

=== CHESTMNIST (ResNet-18 only) ===
ResNet-18 : 0.6395934560691838 0.05572753013007492




# Training and Evaluation Results

## PathMNIST — Individual Models

| Model        | Accuracy | F1 Score |
|--------------|----------|----------|
| ResNet-18    | 0.592 | 0.467 |
| VGG-16       | 0.806 | 0.747 |
| ViT-Base     | 0.799 | 0.738 |
| DINO ViT     | 0.835 | 0.776 |

## PathMNIST — Ensemble Models

| Ensemble Method      | Accuracy | F1 Score |
|----------------------|----------|----------|
| Majority Voting      | 0.845 | 0.777 |
| Weighted Averaging   | **0.868** | **0.817** |
| Stacking             | 0.701 | 0.568 |

Weighted averaging performed the best overall.
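
A toy sketch of why weighted soft voting can beat majority voting: two models lean weakly toward one class while a third is confident in another. The probabilities and accuracies below are invented for illustration and are not the notebook's outputs; the weighting rule (accuracy divided by the sum of accuracies) follows the ensemble cell.

```python
from collections import Counter

def majority_vote(prob_rows):
    """Hard voting: each model votes for its own argmax class."""
    votes = [max(range(len(row)), key=row.__getitem__) for row in prob_rows]
    return Counter(votes).most_common(1)[0][0]

def weighted_soft_vote(prob_rows, accuracies):
    """Soft voting: average class probabilities with accuracy-proportional weights."""
    total = sum(accuracies)
    weights = [a / total for a in accuracies]
    num_classes = len(prob_rows[0])
    combined = [
        sum(w * row[c] for w, row in zip(weights, prob_rows))
        for c in range(num_classes)
    ]
    best = max(range(num_classes), key=combined.__getitem__)
    return best, combined

# One sample, three hypothetical models, three classes
probs = [
    [0.40, 0.35, 0.25],  # weak model, barely prefers class 0
    [0.45, 0.40, 0.15],  # barely prefers class 0
    [0.05, 0.90, 0.05],  # confident in class 1
]
accs = [0.6, 0.8, 0.8]   # made-up accuracies used as weights

hard_pred = majority_vote(probs)
soft_pred, combined = weighted_soft_vote(probs, accs)
print("majority vote:", hard_pred)
print("weighted soft vote:", soft_pred)
```

Hard voting discards how confident each model is, while soft voting keeps that information, which is one plausible reason the weighted average came out ahead here.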

---

## ChestMNIST — ResNet-18 Only

| Model     | Accuracy | F1 Score |
|-----------|----------|----------|
| ResNet-18 | 0.640 | 0.056 |

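ChestMNIST is much harder here because it is a multi-label dataset and heavily imbalanced. A macro F1 of 0.056 next to an accuracy of 0.640 is consistent with the model predicting a single majority class for nearly every image.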
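
One way to read the 0.056 macro F1: it is the score a classifier would get by assigning every test image to class 0. The check below is only arithmetic on the reported numbers. It assumes all 14 classes appear in the test labels, and the constant-prediction reading is an inference from the metrics, not something the notebook verified directly.

```python
reported_acc = 0.6395934560691838       # reported ChestMNIST test accuracy
reported_macro_f1 = 0.05572753013007492  # reported ChestMNIST macro F1
num_classes = 14

# If every image is predicted as class 0:
#   precision for class 0 = share of class-0 images = accuracy, recall for class 0 = 1
#   every other class has F1 = 0
f1_class0 = 2 * reported_acc * 1.0 / (reported_acc + 1.0)
constant_macro_f1 = f1_class0 / num_classes

print("F1 for class 0:", round(f1_class0, 4))
print("macro F1 of an all-class-0 predictor:", constant_macro_f1)
print("difference from reported:", abs(constant_macro_f1 - reported_macro_f1))
```

The two values agree to several decimal places, which suggests the head trained for one epoch learned little beyond the class prior.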


# Conclusion:
## Comparison to MedMNIST Benchmarks

The MedMNIST website reports AUC and ACC values for many models.

## PathMNIST Benchmark Accuracy (ACC column)
Most high-performing models reach around 0.90–0.91 accuracy.

## My Best Model
- Weighted Ensemble: 0.868 accuracy

### Interpretation (simple terms)
My best model (0.868) is about 0.03 to 0.04 below the highest reported accuracy.
This is expected because:
- I trained only one epoch
- I used subset training instead of the full dataset
- Colab limits GPU time

Given these constraints, the performance is very close to benchmark-level models.

## ChestMNIST Benchmark Accuracy
Official accuracy is usually ~0.94.

## My Result
- 0.64 accuracy

This is lower because ChestMNIST is originally multi-label. Collapsing labels into a single class loses important information. More training epochs would also help improve performance.
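
To illustrate the information loss, the sketch below mirrors the collapsing rule from `chest_single_label` in plain Python (the function name `collapse_multilabel` and the example vectors are made up). Different multi-hot label vectors end up with the same class index.

```python
def collapse_multilabel(y):
    """0 if no label is set, otherwise the index of the first positive label
    (the same result np.argmax gives on a 0/1 vector)."""
    if sum(y) == 0:
        return 0
    return y.index(max(y))

def multi_hot(positives, size=14):
    """Build a 14-d 0/1 label vector with ones at the given positions."""
    return [1 if i in positives else 0 for i in range(size)]

no_finding = multi_hot(set())
only_label_0 = multi_hot({0})
only_label_2 = multi_hot({2})
labels_2_and_7 = multi_hot({2, 7})

for name, y in [("no finding", no_finding), ("label 0", only_label_0),
                ("label 2", only_label_2), ("labels 2 and 7", labels_2_and_7)]:
    print(name, "->", collapse_multilabel(y))
```

Images with no finding and images with only label 0 both become class 0, and any second disease on an image is dropped, so the single-label task is not the one the benchmark measures.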

# Individual models
- ResNet-18 had the lowest accuracy. It's a smaller network, which may explain why it did not pick up as many useful features as the others.
- VGG-16 and ViT-Base did well compared to ResNet-18; they are bigger models, which tend to pick up more detail from the images.
- DINO ViT did the best overall, which might be because its self-supervised pretrained features transfer well, so even with light fine-tuning it picks up the important details better.
- Overall, larger models and models with stronger pretrained features did better here.

# Ensemble methods:
- Majority voting gave a small improvement compared to the single models.
- Weighted averaging gave the best results overall, since it made the better models count more in the final predictions, combining the strengths of the models.
- Stacking did worse overall than the other ensembles. That makes sense, since stacking tends to work best when a separate meta-model is trained on held-out predictions and evaluated on data it hasn't seen, which isn't our setup here. The meta-model also only saw hard class indices as numeric inputs, which limits what it can learn.
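
A possible way to set up stacking more fairly is sketched below, under two assumptions that differ from the notebook: hard predictions are one-hot encoded instead of being passed as raw class indices, and the samples are split so the meta-model is fit on one part and scored on the other. The helper names and toy predictions are hypothetical, and no meta-model is trained in this sketch.

```python
import random

def one_hot_features(preds_per_model, num_classes):
    """Turn hard class predictions from several models into one-hot meta-features."""
    n_samples = len(preds_per_model[0])
    rows = []
    for i in range(n_samples):
        row = []
        for preds in preds_per_model:
            vec = [0] * num_classes
            vec[preds[i]] = 1
            row.extend(vec)
        rows.append(row)
    return rows

def split_indices(n_samples, fit_fraction=0.5, seed=0):
    """Shuffle sample indices and split them into meta-fit and meta-eval parts."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    cut = int(n_samples * fit_fraction)
    return idx[:cut], idx[cut:]

# Toy predictions from two hypothetical base models on four samples, three classes
preds_a = [0, 2, 1, 1]
preds_b = [0, 1, 1, 2]

X = one_hot_features([preds_a, preds_b], num_classes=3)
fit_idx, eval_idx = split_indices(len(X))
print("meta-features for sample 0:", X[0])
print("fit on:", sorted(fit_idx), "| evaluate on:", sorted(eval_idx))
```

A meta-classifier such as `LogisticRegression` could then be fit on the rows in `fit_idx` and scored only on the rows in `eval_idx`. Feeding the base models' softmax probabilities instead of one-hot votes would be another reasonable variant.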



# Challenges and Future Improvements

## Challenges
- Google Colab timeouts limited the number of training epochs.
- Some models trained slowly, especially ViT and VGG.
- ChestMNIST is a multi-label dataset, which doesn’t fit well with single-label training.
- Reducing ChestMNIST to a single label hurts performance.

## Future Improvements
To match or beat the official leaderboard:

1. Train for more epochs  
2. Unfreeze more layers for deeper fine-tuning  
3. Use the full datasets, not subsets  
4. Treat ChestMNIST as a multi-label problem (sigmoid + BCE loss)  
5. Use more data augmentation  
6. Try stronger models (ResNet-50, EfficientNet, Swin Transformer)
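
For item 4, the multi-label setup can be sketched without any framework: each of the 14 outputs gets its own sigmoid, the loss is binary cross-entropy averaged over labels, and predictions come from thresholding each label independently. The logits and targets below are made up (and shortened to four labels); in PyTorch the numerically stable equivalent of this loss is `nn.BCEWithLogitsLoss`.

```python
import math

def sigmoid(z):
    """Map a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over independent labels (naive, not overflow-safe)."""
    losses = []
    for z, t in zip(logits, targets):
        p = sigmoid(z)
        losses.append(-(t * math.log(p) + (1 - t) * math.log(1 - p)))
    return sum(losses) / len(losses)

def predict_labels(logits, threshold=0.5):
    """Each label is decided on its own, so several can be positive at once."""
    return [int(sigmoid(z) >= threshold) for z in logits]

# One image, four hypothetical disease logits, two diseases present
logits = [2.0, -1.0, 0.5, -3.0]
targets = [1, 0, 1, 0]

loss = bce_with_logits(logits, targets)
predicted = predict_labels(logits)
print("loss:", round(loss, 4))
print("predicted labels:", predicted)
```

Unlike softmax with cross-entropy, nothing forces the label probabilities to sum to one, so an image with two diseases can be scored correctly on both.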

Even with constraints, the ensemble method clearly improved accuracy and showed why combining models is effective.
