# Claris: Histopathology Cancer Detector

This project was developed for Expo-Science, hosted by Hydro-Quebec.

## Introduction

Many healthcare professionals are under great pressure given the current state of Quebec's healthcare system. A severe shortage of personnel is a prominent issue, and it has made the cancer diagnosis process less efficient. Detecting cancer in images is not always easy. This project aims to develop a deep learning model that detects cancer in histopathologic images, in order to ease the detection process and relieve pressure on the healthcare system.

---

## Dataset

https://www.kaggle.com/competitions/histopathologic-cancer-detection/data 

This dataset of histopathologic scans of lymph node sections is a modified version of the PCAM (PatchCamelyon) dataset.

> The PatchCamelyon benchmark (PCAM) consists of 327,680 color images (96 x 96px) extracted from histopathologic scans of lymph node sections. Each image is annotated with a binary label indicating presence of metastatic tissue.
>
> Fundamental machine learning advancements are predominantly evaluated on straight-forward natural-image classification datasets and medical imaging is becoming one of the major applications of ML and thus deserves a spot on the list of go-to ML datasets. Both to challenge future work, and to steer developments into directions that are beneficial for this domain.


## Model

The model used in this project is DenseNet-121, a convolutional neural network (CNN) architecture chosen for its accuracy. It is trained on the dataset with a class-weighted cross-entropy loss and classifies each image into one of two categories: benign or malignant.

## Implementation

The model is implemented in a Jupyter Notebook using the PyTorch library. It is trained on Kaggle's dual T4 GPUs with the Adam optimizer.
  
## Accuracy
The model achieved a peak validation accuracy of about 97.6%, reached at epoch 8 of a 10-epoch run.

![alt text](image.png)

## Sources
Thanks to many Kaggle competitions, datasets and notebooks for inspiration.

### 1) Imports and configuration

In [13]:
# Import useful libraries
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from tqdm import tqdm
#import cv2

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader, random_split
from torchvision import transforms, models

from PIL import Image
from sklearn.model_selection import train_test_split


#### Setting up the environment
Using Kaggle's NVIDIA GPUs when available.

In [14]:
# Device -> Nvidia GPU
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu') # Use GPU cuda if available
print('Using device:', DEVICE)

# Config
IMG_SIZE = 96
BATCH_SIZE = 64
NUM_WORKERS = 2


Using device: cuda


### 2) Data transformations
Diversify the training data by augmenting the images with random horizontal and vertical flips and random rotations of up to 20 degrees.

In [15]:
# Train data transformations
#  - Augmented to increase diversity of the dataset
train_tfms = transforms.Compose([
    transforms.Resize((IMG_SIZE, IMG_SIZE)),            # Ensure consistent size
    transforms.RandomHorizontalFlip(),                  # Random flips (augmentation)
    transforms.RandomVerticalFlip(),                    # Random flips  (augmentation)
    transforms.RandomRotation(20),                      # Random rotations (augmentation)
    transforms.ToTensor(),                              # Convert to tensor
    transforms.Normalize(                               # Standardize pixel values
        mean=[0.485, 0.456, 0.406],  # ImageNet mean
        std=[0.229, 0.224, 0.225]    # ImageNet std
    )
])

# Validation data transformations (no augmentation or randomness)
val_tfms = transforms.Compose([
    transforms.Resize((IMG_SIZE, IMG_SIZE)),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225]
    )
])
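
To make the `Normalize` step concrete: once `ToTensor()` has scaled pixel values to the range [0, 1], each channel is standardized as `(value - mean) / std`. The sketch below reproduces that arithmetic in plain Python for a single pixel. The helper name `normalize_pixel` is made up for illustration and is not part of torchvision.

```python
# ImageNet channel statistics, as used in the transforms above
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]


def normalize_pixel(rgb):
    """Standardize one RGB pixel whose channels are already scaled to [0, 1]."""
    return [
        (value - mean) / std
        for value, mean, std in zip(rgb, IMAGENET_MEAN, IMAGENET_STD)
    ]


# A pure white pixel ends up a little over two standard deviations above the mean
white = normalize_pixel([1.0, 1.0, 1.0])

# A pixel equal to the channel means maps to zero in every channel
mean_pixel = normalize_pixel(IMAGENET_MEAN)

print(white)
print(mean_pixel)
```

Reusing the ImageNet statistics keeps the inputs in the range the pre-trained weights were fitted on, which is presumably why they are used here instead of statistics computed on the histopathology images.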

### 3) Data loading helper class


In [16]:
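# Note: this class reuses (and shadows) the name of the imported
# torch.utils.data.Dataset base class it inherits from
class Dataset(Dataset):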
    """
    Dataset for Kaggle Histopathologic Cancer Detection (based on PCAM)
    Uses .tif images + train_labels.csv
    """

    def __init__(self, img_dir, csv_path, transform=None):
        self.img_dir = img_dir
        self.transform = transform
    
        self.df = pd.read_csv(csv_path) # train_labels.csv

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        img_id = self.df.iloc[idx, 0]          # image id
        label  = int(self.df.iloc[idx, 1])     # 0 = no tumor tissue, 1 = tumor tissue

        img_path = os.path.join(self.img_dir, img_id + ".tif")
        image = Image.open(img_path).convert("RGB")

        if self.transform:
            image = self.transform(image)

        return image, label


### 4) Load PCam dataset


In [17]:
# Kaggle directory (PCam)
BASE_DIR = "/kaggle/input/competitions/histopathologic-cancer-detection/"

TRAIN_DIR = BASE_DIR + "train"
CSV_PATH = BASE_DIR + "train_labels.csv"

# Read the labels used for the train / validation split
df = pd.read_csv(CSV_PATH)

# Split into train and validation datasets
train_df, val_df = train_test_split(
    df,
    test_size=0.2,
    stratify=df["label"],
    random_state=42
)

train_df.to_csv("train_split.csv", index=False)
val_df.to_csv("val_split.csv", index=False)
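
The `stratify=df["label"]` argument keeps the ratio of positive to negative labels the same in both splits. The sketch below illustrates the idea in plain Python by splitting each class separately at the same ratio. It is a simplified stand-in, not scikit-learn's actual implementation, and both `stratified_split` and the toy label list are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_idx, val_idx) so that each class keeps the same split ratio."""
    rng = random.Random(seed)

    # Group sample indices by class label
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_val = round(len(indices) * test_fraction)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)


# Toy labels: 60 negatives and 40 positives (illustrative only)
labels = [0] * 60 + [1] * 40
train_idx, val_idx = stratified_split(labels)

train_pos = sum(labels[i] for i in train_idx)
val_pos = sum(labels[i] for i in val_idx)
print(len(train_idx), train_pos, len(val_idx), val_pos)
```

Both splits keep 40% positives in this toy case, which is the property the stratified split is meant to guarantee for the real labels.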


#### Data loaders


In [18]:

BATCH_SIZE = 128  # overrides the value of 64 set in the config cell

train_dataset = Dataset(
    TRAIN_DIR, "train_split.csv", train_tfms
)

val_dataset = Dataset(
    TRAIN_DIR, "val_split.csv", val_tfms
)

train_loader = DataLoader(
    train_dataset,
    batch_size=BATCH_SIZE,
    shuffle=True,
    pin_memory=True, # faster GPU data transfer
    persistent_workers=True, # keep warm state for >1 epoch
    num_workers=4
)

val_loader = DataLoader(
    val_dataset,
    batch_size=BATCH_SIZE,
    shuffle=False,
    pin_memory=True,
    persistent_workers=True,
    num_workers=4
)
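
As a quick sanity check on the loader sizes, the number of batches per epoch follows from the split ratio and the batch size. The sketch below works through that arithmetic. The figure of 220,025 labeled images is an assumption about the size of `train_labels.csv` and is not stated in this document.

```python
import math

NUM_LABELED_IMAGES = 220_025  # assumption: number of rows in train_labels.csv
BATCH_SIZE = 128

# test_size=0.2 sends one image in five to the validation split
num_val = NUM_LABELED_IMAGES // 5
num_train = NUM_LABELED_IMAGES - num_val

# DataLoader keeps the last partial batch by default (drop_last=False),
# so the batch count is rounded up
train_batches = math.ceil(num_train / BATCH_SIZE)
val_batches = math.ceil(num_val / BATCH_SIZE)

print(num_train, num_val, train_batches, val_batches)
```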

### 5) Loading DenseNet-121 model
The model has been pre-trained on ImageNet, so it already knows how to detect edges, textures, patterns, and colors. Only the final classifier layer is replaced for this task.


In [19]:
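# Load DenseNet-121 with ImageNet weights from torchvision
# (the `weights` argument replaces the deprecated `pretrained=True`)
model = models.densenet121(weights=models.DenseNet121_Weights.IMAGENET1K_V1)

# Replace the classifier layer
# DenseNet-121 outputs 1024 features → we need 2 classes
model.classifier = nn.Linear(1024, 2)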

# Wrap for multi-GPU
if torch.cuda.device_count() > 1:
    print(f"Using {torch.cuda.device_count()} GPUs!")
    model = torch.nn.DataParallel(model) # parallel computing
    
model = model.to(DEVICE)


Using 2 GPUs!
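
One practical consequence of the `DataParallel` wrapper: the original network is stored under the attribute `module`, so every key in the wrapped model's `state_dict()` gains a `module.` prefix. Loading such a checkpoint into an unwrapped model needs the prefix removed first. The sketch below shows one way to do that on a plain dictionary. The helper `strip_module_prefix` is hypothetical, and the string values stand in for real tensors.

```python
from collections import OrderedDict


def strip_module_prefix(state_dict, prefix="module."):
    """Return a copy of state_dict whose keys no longer start with prefix."""
    cleaned = OrderedDict()
    for key, value in state_dict.items():
        new_key = key[len(prefix):] if key.startswith(prefix) else key
        cleaned[new_key] = value
    return cleaned


# Stand-in for a checkpoint saved from a DataParallel-wrapped model
wrapped = OrderedDict([
    ("module.features.conv0.weight", "tensor-0"),
    ("module.classifier.bias", "tensor-1"),
])

unwrapped = strip_module_prefix(wrapped)
print(list(unwrapped))
```

An alternative that avoids the prefix altogether is to save `model.module.state_dict()` whenever the model is wrapped.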




#### Loss function and optimizer

In [20]:
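# Class imbalance handling: errors on the positive class (label 1) cost 1.5x more
class_weights = torch.tensor([1.0, 1.5]).to(DEVICE)

# Two-class classification with weighted cross-entropy
criterion = nn.CrossEntropyLoss(weight=class_weights)

# Adam optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# Gradient scaler for mixed precision: scales the loss so that small
# half-precision gradients do not underflow to 0
scaler = torch.amp.GradScaler()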
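
To show what the class weights do, the sketch below recomputes a weighted cross-entropy by hand in plain Python. With a `weight` tensor and the default mean reduction, `nn.CrossEntropyLoss` divides the weighted sum of per-sample losses by the sum of the weights of the target classes, and the helper follows that rule. The function name and the toy logits are illustrative assumptions.

```python
import math


def weighted_cross_entropy(logits, labels, class_weights):
    """Weighted-mean cross-entropy over a batch of raw class scores."""
    weighted_sum = 0.0
    weight_total = 0.0
    for row, label in zip(logits, labels):
        # Numerically stable log of the softmax normalizer
        max_logit = max(row)
        log_norm = max_logit + math.log(sum(math.exp(v - max_logit) for v in row))
        # Negative log-likelihood of the true class
        nll = log_norm - row[label]
        weighted_sum += class_weights[label] * nll
        weight_total += class_weights[label]
    return weighted_sum / weight_total


# Two toy samples: a confident correct negative and an undecided positive
logits = [[2.0, 0.0], [0.0, 0.0]]
labels = [0, 1]

loss = weighted_cross_entropy(logits, labels, class_weights=[1.0, 1.5])
unweighted = weighted_cross_entropy(logits, labels, class_weights=[1.0, 1.0])
print(loss, unweighted)
```

The weighted loss is higher than the unweighted one here because the poorly classified sample belongs to the up-weighted positive class.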

### 6) Training the model!

In [21]:
def train_one_epoch(model, loader):
    model.train()  # training mode
    running_loss = 0.0

    # Wrap the loader that is actually iterated so the progress bar advances
    loop = tqdm(loader, total=len(loader))

    for step, (images, labels) in enumerate(loop, start=1):
        # Move data to device (GPU)
        images = images.to(DEVICE, non_blocking=True)
        labels = labels.to(DEVICE, non_blocking=True)

        # Reset gradients
        optimizer.zero_grad()

        # Mixed precision: run the forward pass and the loss under autocast
        with torch.amp.autocast("cuda"):
            # Forward pass (prediction results)
            outputs = model(images)
            # Compute loss (compare predictions with the true labels)
            loss = criterion(outputs, labels)

        # Backpropagation on the scaled loss
        # (how should each weight change to make the loss smaller?)
        scaler.scale(loss).backward()

        # Unscale the gradients, update the weights, then adjust the scale factor
        scaler.step(optimizer)
        scaler.update()

        running_loss += loss.item()

        # Show the mean loss over the batches seen so far
        loop.set_postfix(loss=running_loss / step)

    return running_loss / len(loader)
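
The reason for scaling the loss before `backward()` can be shown with half-precision rounding alone. In the sketch below, a very small value rounds to zero in float16, while the same value multiplied by 65536 (2**16, the default initial scale of `GradScaler`) survives and can be divided back afterwards. The gradient value is hypothetical, and the `struct` round-trip only mimics float16 storage; it is not how PyTorch handles gradients internally.

```python
import struct


def to_float16(value):
    """Round a Python float to IEEE 754 half precision and back."""
    return struct.unpack("e", struct.pack("e", value))[0]


SCALE = 65536.0        # 2**16, default initial scale of torch.amp.GradScaler
tiny_gradient = 1e-8   # hypothetical gradient, below the float16 range

# Stored directly in half precision, the value is lost
unscaled = to_float16(tiny_gradient)

# Scaled first, it fits in half precision and can be unscaled afterwards
scaled = to_float16(tiny_gradient * SCALE)
recovered = scaled / SCALE

print(unscaled, scaled, recovered)
```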


#### Validation

In [22]:
def validate(model, loader):
    model.eval()  # evaluation mode
    correct = 0
    total = 0

    with torch.no_grad():
        for images, labels in tqdm(loader, desc="Validating", leave=False): # Progress bar
            images = images.to(DEVICE)
            labels = labels.to(DEVICE)

            outputs = model(images)
            preds = torch.argmax(outputs, dim=1)

            correct += (preds == labels).sum().item()
            total += labels.size(0)

    return correct / total
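
Accuracy alone can hide how the errors are distributed, which matters for a screening task where a missed tumor and a false alarm have different costs. As a possible complement to `validate`, the sketch below derives sensitivity and specificity from lists of predictions and labels. The helper name and the toy values are illustrative assumptions, not results from this project.

```python
def sensitivity_specificity(preds, labels):
    """Return (sensitivity, specificity) for binary predictions (1 = positive)."""
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    tn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 0)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    sensitivity = tp / (tp + fn)  # share of true positives that were detected
    specificity = tn / (tn + fp)  # share of true negatives that were cleared
    return sensitivity, specificity


# Toy predictions and labels (illustrative only)
preds = [1, 0, 1, 1, 0, 0, 1, 0]
labels = [1, 0, 0, 1, 1, 0, 1, 0]

sensitivity, specificity = sensitivity_specificity(preds, labels)
accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
print(sensitivity, specificity, accuracy)
```

In the notebook, the same counts could be accumulated per batch from `preds` and `labels` inside the validation loop.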


### 7) Running training and validation


In [23]:
EPOCHS = 10  # increase later

for epoch in range(EPOCHS):
    train_loss = train_one_epoch(model, train_loader)
    val_acc = validate(model, val_loader)

    print(f"Epoch [{epoch+1}/{EPOCHS}] "
          f"Train Loss: {train_loss:.4f} | "
          f"Val Acc: {val_acc*100:.2f}%")
    
    # Save the latest weights (the file is overwritten every epoch)
    torch.save(model.state_dict(), "pcam_model.pth")


Epoch [1/10] Train Loss: 0.2099 | Val Acc: 95.17%

Epoch [2/10] Train Loss: 0.1420 | Val Acc: 96.01%

Epoch [3/10] Train Loss: 0.1227 | Val Acc: 96.84%

Epoch [4/10] Train Loss: 0.1056 | Val Acc: 97.03%

Epoch [5/10] Train Loss: 0.0958 | Val Acc: 96.91%

Epoch [6/10] Train Loss: 0.0872 | Val Acc: 97.33%

Epoch [7/10] Train Loss: 0.0795 | Val Acc: 97.05%

Epoch [8/10] Train Loss: 0.0734 | Val Acc: 97.61%

Epoch [9/10] Train Loss: 0.0682 | Val Acc: 97.53%

Epoch [10/10] Train Loss: 0.0637 | Val Acc: 97.42%
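
Validation accuracy in the log above does not improve monotonically, so the weights from the final epoch are not necessarily the best ones. One possible refinement is to save a checkpoint only when validation accuracy improves. The sketch below replays the logged accuracies to show which epochs would trigger a save. The variable names are illustrative, and the actual `torch.save` call is left as a comment.

```python
# Validation accuracies (%) copied from the training log above
val_acc_history = [95.17, 96.01, 96.84, 97.03, 96.91, 97.33, 97.05, 97.61, 97.53, 97.42]

best_acc = float("-inf")
best_epoch = None
checkpoint_epochs = []

for epoch, acc in enumerate(val_acc_history, start=1):
    if acc > best_acc:
        best_acc, best_epoch = acc, epoch
        # In the notebook, torch.save(model.state_dict(), ...) would go here
        checkpoint_epochs.append(epoch)

print(best_epoch, best_acc, checkpoint_epochs)
```

With these values the best checkpoint comes from epoch 8, and epochs 5, 7, 9, and 10 would not overwrite it.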
