# **CNN Histopathologic Cancer Detection Mini‑Project**

This notebook follows the Week 3 mini‑project, the [Kaggle Histopathologic Cancer Detection competition](https://www.kaggle.com/c/histopathologic-cancer-detection).

## **Problem Description & Data Overview**

The goal is to accurately detect metastatic cancer in histopathologic images of lymph node sections. This classification task plays a critical role in cancer diagnostics and patient prognosis.

Improving accuracy and automation in cancer detection reduces the burden on pathologists and supports faster, more consistent diagnoses. The goal is to build a CNN that classifies image tiles as cancerous or not, with EfficientNetB0 and transfer learning as the target architecture.


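We tackle a **binary image‑classification** task: detect metastatic cancer in 96 × 96‑pixel histopathology image patches. A positive label means the central 32 × 32‑pixel region of the patch contains at least one pixel of tumor tissue.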

| Item | Details |
|------|---------|
| **Input** | RGB `.tif` images (`train/` & `test/` folders) |
| **Labels** | `train_labels.csv` (`id`, `label` — 1 = cancer, 0 = benign) |
| **Metric** | Area Under the ROC Curve (AUC) |
| **Goal** | Build a CNN that achieves a reasonable AUC on Kaggle and document the full ML workflow. |


In [None]:
import os, glob, random, math, json, warnings, itertools
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf
from PIL import Image
from tensorflow import keras
from tensorflow.keras import layers
from sklearn.metrics import roc_auc_score, roc_curve, confusion_matrix, classification_report

warnings.filterwarnings('ignore')
plt.style.use('ggplot')

DATA_DIR = "/kaggle/input/histopathologic-cancer-detection"
TRAIN_IMG_DIR = f"{DATA_DIR}/train"
TEST_IMG_DIR  = f"{DATA_DIR}/test"
LABELS_CSV    = f"{DATA_DIR}/train_labels.csv"

assert os.path.exists(TRAIN_IMG_DIR), "Check DATA_DIR paths!"

gpus = tf.config.list_physical_devices('GPU')
if gpus:
    print("GPUs detected:", gpus)
    for gpu in gpus:
        tf.config.experimental.set_memory_growth(gpu, True)
else:
    print("⚠️  No GPU found — training will run on CPU.")

## **Exploratory Data Analysis (EDA)**

We analyze the dataset's structure, including image samples and class distribution.
The class distribution is imbalanced, with a higher number of non-cancerous tiles.


In [None]:
labels_df = pd.read_csv(LABELS_CSV)
print(labels_df.head())
print(labels_df.label.value_counts())

# Class distribution plot
sns.countplot(data=labels_df, x='label')
plt.title('Class distribution (0 = benign, 1 = cancer)')
plt.show()

#### **NOTE**
The imbalance suggests that class weights or data augmentation may help. The sample tiles plotted below show that cancerous regions are hard to tell apart from benign tissue by eye.


In [None]:
# show 6 sample images per class
def show_samples(df, n=6, cancer=0):
    ids = df[df.label==cancer].sample(n)['id'].values
    plt.figure(figsize=(12,2))
    for i, img_id in enumerate(ids):
        img_path = os.path.join(TRAIN_IMG_DIR, f"{img_id}.tif")
        img = tf.keras.utils.load_img(img_path)
        plt.subplot(1,n,i+1)
        plt.imshow(img)
        plt.axis('off')
        plt.title(f'label={cancer}')
    plt.show()

show_samples(labels_df, n=6, cancer=0)
show_samples(labels_df, n=6, cancer=1)

## **Data Preparation**

When preparing the data, there were a few things I needed to take note of:

- Image decoding using PIL within a TensorFlow pipeline.
- Resize all images to 96x96.
- Normalize pixel values.
- Augment training data with flips, zooms, and rotations.
- Split the dataset into training and validation sets.

The image size and batch size are kept as top-level constants so they are easy to experiment with.
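
The augmentation step in the list above does not appear in the loading code that follows. A minimal sketch of how it could be added, assuming images are float32 tensors of shape `(96, 96, 3)` scaled to `[0, 1]`, as `decode_image` produces. The `augment` helper is hypothetical, and zoom is left out for brevity:

```python
import tensorflow as tf

def augment(img, label):
    """Randomly flip and rotate one square image tile; the label is unchanged."""
    static_shape = img.shape
    img = tf.image.random_flip_left_right(img)
    img = tf.image.random_flip_up_down(img)
    # Rotate by a random multiple of 90 degrees; tissue has no canonical orientation.
    k = tf.random.uniform([], minval=0, maxval=4, dtype=tf.int32)
    img = tf.image.rot90(img, k=k)
    # rot90 with a tensor-valued k drops the static shape in graph mode,
    # so restore it (valid because the tiles are square).
    img.set_shape(static_shape)
    return img, label

# Toy check on a random tile standing in for a decoded image.
demo_img = tf.random.uniform([96, 96, 3], dtype=tf.float32)
aug_img, aug_label = augment(demo_img, tf.constant(1))
print(aug_img.shape, int(aug_label))
```

In the pipeline this would be applied to the training split only, for example `train_ds.map(augment, num_parallel_calls=AUTO)` before batching, so that validation images stay untouched.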

In [None]:
IMG_SIZE   = 96
BATCH_SIZE = 64
AUTO       = tf.data.AUTOTUNE

def _pil_load_resize(path):
    path = path.numpy().decode("utf-8")     # EagerTensor → bytes → str
    img  = Image.open(path)
    img  = img.resize((IMG_SIZE, IMG_SIZE), Image.BILINEAR)
    return np.asarray(img, np.float32) / 255.0

def decode_image(filename, label=None):
    img = tf.py_function(_pil_load_resize, [filename], Tout=tf.float32)
    img.set_shape([IMG_SIZE, IMG_SIZE, 3])  # static shape for TF
    return (img, label) if label is not None else img

filepaths = [os.path.join(TRAIN_IMG_DIR, f"{i}.tif") for i in labels_df.id]
labels    = labels_df.label.values

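ds_full = (
    tf.data.Dataset
      .from_tensor_slices((filepaths, labels))
      # Shuffle once in a fixed order so take()/skip() below select the same,
      # non-overlapping examples on every pass (no train/validation leakage).
      .shuffle(len(filepaths), seed=42, reshuffle_each_iteration=False)
)

val_size = int(len(labels) * 0.2)
# Decode after splitting; only the training split is reshuffled each epoch.
val_ds = (ds_full.take(val_size)
          .map(decode_image, num_parallel_calls=AUTO)
          .batch(BATCH_SIZE).prefetch(AUTO))
train_ds = (ds_full.skip(val_size)
            .shuffle(2048)
            .map(decode_image, num_parallel_calls=AUTO)
            .batch(BATCH_SIZE).prefetch(AUTO))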

print(train_ds, val_ds)
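
Because the classes are imbalanced, a stratified split keeps the class ratio the same in both subsets. The following is a sketch using scikit-learn's `train_test_split` on the file list before any `tf.data` pipeline is built. It is an alternative to `take`/`skip`, not what the cell above does, and the toy `paths` and `y` arrays are stand-ins for `filepaths` and `labels`:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for `filepaths` and `labels` (arbitrary imbalanced labels).
paths = np.array([f"img_{i}.tif" for i in range(1000)])
y = np.array([0] * 600 + [1] * 400)

train_paths, val_paths, y_train, y_val = train_test_split(
    paths, y, test_size=0.2, stratify=y, random_state=42
)

# The two subsets are disjoint and keep the original class ratio.
print(len(train_paths), len(val_paths))
print(y_train.mean(), y_val.mean())
```

Each pair of arrays can then be passed to `tf.data.Dataset.from_tensor_slices` separately, which rules out any overlap between training and validation files.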

## **Baseline CNN Architecture**

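I judged the following model and configuration sufficient to reach good accuracy:

The cell below builds a small CNN trained from scratch as the baseline: three convolutional layers with 32, 64, and 128 filters, global average pooling, a dense head with 128 units and dropout, and a final sigmoid output layer. An EfficientNetB0 base pretrained on ImageNet is the transfer-learning alternative to this convolutional stack.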

In [None]:
def build_cnn():
    model = keras.Sequential([
        layers.Conv2D(32, 3, activation='relu', input_shape=(IMG_SIZE, IMG_SIZE, 3)),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation='relu'),
        layers.MaxPooling2D(),
        layers.Conv2D(128, 3, activation='relu'),
        layers.GlobalAveragePooling2D(),
        layers.Dense(128, activation='relu'),
        layers.Dropout(0.5),
        layers.Dense(1, activation='sigmoid')
    ])
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=1e-4),
        loss='binary_crossentropy',
        metrics=['accuracy', keras.metrics.AUC(name='auc')]
    )
    return model

model = build_cnn()
model.summary()
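
For comparison with the baseline, here is a sketch of the EfficientNetB0 transfer-learning variant that the text describes. It assumes the same 96 × 96 inputs scaled to `[0, 1]`; the `build_efficientnet` helper and its `weights` parameter are hypothetical names, and the head mirrors the baseline's dense layers:

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_efficientnet(img_size=96, weights="imagenet"):
    """Frozen EfficientNetB0 base plus a small dense head (hypothetical helper)."""
    base = keras.applications.EfficientNetB0(
        include_top=False, weights=weights, input_shape=(img_size, img_size, 3)
    )
    base.trainable = False  # freeze pretrained layers for the first training phase

    inputs = keras.Input(shape=(img_size, img_size, 3))
    # Keras EfficientNet normalizes internally and expects pixels in [0, 255],
    # so undo the /255 scaling applied in the data pipeline.
    x = layers.Rescaling(255.0)(inputs)
    x = base(x, training=False)  # keep BatchNorm layers in inference mode
    x = layers.GlobalAveragePooling2D()(x)
    x = layers.Dense(128, activation="relu")(x)
    x = layers.Dropout(0.5)(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)

    model = keras.Model(inputs, outputs)
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=1e-4),
        loss="binary_crossentropy",
        metrics=["accuracy", keras.metrics.AUC(name="auc")],
    )
    return model, base
```

Calling `build_efficientnet()` downloads the ImageNet weights on first use; `weights=None` gives a randomly initialized base, which is enough to check the architecture offline. For fine-tuning, one would set `base.trainable = True` and recompile with a lower learning rate.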

## **Training**

Let's get into some training!

In [None]:
EPOCHS = 20
callbacks = [
    keras.callbacks.EarlyStopping(patience=5, restore_best_weights=True),
    keras.callbacks.ModelCheckpoint('best_cnn.keras', save_best_only=True)
]

history = model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=EPOCHS,
    callbacks=callbacks
)

The model is trained with binary cross-entropy as the loss function and AUC as a tracked performance metric. With an EfficientNet base, the pretrained layers would initially be frozen so that the newly added dense layers can adapt without disrupting the pretrained weights.

To limit overfitting, early stopping (with a patience of 5 epochs) is used alongside model checkpointing to save the best-performing model. Class weights can be passed to `fit` to mitigate the class imbalance in the dataset; the training cell above does not apply them.
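
A minimal sketch of computing class weights from the label counts, using the same "balanced" heuristic as scikit-learn, `n_samples / (n_classes * class_count)`. The `balanced_class_weights` helper and the toy labels are illustrative only:

```python
import numpy as np

def balanced_class_weights(y):
    """Weight each class by n_samples / (n_classes * class_count)."""
    classes, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(classes) * counts)
    return {int(c): float(w) for c, w in zip(classes, weights)}

# Toy labels: 6 benign, 4 cancer. The rarer class gets the larger weight.
toy_labels = np.array([0] * 6 + [1] * 4)
class_weight = balanced_class_weights(toy_labels)
print(class_weight)
```

With the real labels, the resulting dictionary could be passed as `model.fit(..., class_weight=class_weight)`, so that errors on the minority class contribute more to the loss.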

In [None]:
# Plot training curves
def plot_history(hist):
    for metric in ['accuracy', 'loss', 'auc']:
        plt.figure()
        plt.plot(hist.history[metric], label=f'train_{metric}')
        plt.plot(hist.history[f'val_{metric}'], label=f'val_{metric}')
        plt.title(metric)
        plt.legend()
        plt.show()

plot_history(history)

## **Evaluation & Analysis**

In [None]:
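# Evaluate on validation set: collect labels and predictions in a single pass
# so they stay aligned even if the dataset order changes between iterations.
val_labels, val_preds = [], []
for x, y in val_ds:
    val_preds.append(model.predict_on_batch(x).ravel())
    val_labels.append(y.numpy())
val_preds  = np.concatenate(val_preds)
val_labels = np.concatenate(val_labels)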
val_auc = roc_auc_score(val_labels, val_preds)
print(f'Validation AUC: {val_auc:.4f}')

# ROC curve
fpr, tpr, _ = roc_curve(val_labels, val_preds)
plt.figure()
plt.plot(fpr, tpr, label=f'AUC={val_auc:.3f}')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.title('ROC Curve')
plt.show()


The model achieved a high AUC score on the validation set, indicating strong discriminative ability between cancerous and non-cancerous image tiles. The ROC curve demonstrates that the model performs well across different classification thresholds. Additional insights can be gathered through the confusion matrix and classification report, which provide a more detailed performance breakdown.
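
The confusion matrix and classification report mentioned above need hard class predictions, so the probabilities must be thresholded first. Here is a sketch on toy data, where `y_true` and `y_prob` stand in for `val_labels` and `val_preds` and the 0.5 cut-off is an assumption:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, classification_report

# Toy stand-ins for `val_labels` and `val_preds`.
y_true = np.array([0, 0, 0, 1, 1, 1, 0, 1])
y_prob = np.array([0.1, 0.4, 0.7, 0.8, 0.3, 0.9, 0.2, 0.6])

THRESHOLD = 0.5  # assumed cut-off; tune it to trade sensitivity for specificity
y_pred = (y_prob >= THRESHOLD).astype(int)

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()
print(cm)
print(classification_report(y_true, y_pred, target_names=["benign", "cancer"]))
```

Unlike AUC, both outputs depend on the chosen threshold. In a diagnostic setting, a lower threshold that reduces false negatives may be preferable even at the cost of more false positives.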

## **Generate Test Predictions & Submission**

In [None]:
# create tf.data for test images
test_files = sorted(glob.glob(os.path.join(TEST_IMG_DIR, '*.tif')))
test_ids   = [os.path.basename(x).split('.')[0] for x in test_files]

test_ds = (tf.data.Dataset.from_tensor_slices(test_files)
           .map(lambda x: decode_image(x), num_parallel_calls=AUTO)
           .batch(BATCH_SIZE))

test_preds = model.predict(test_ds).ravel()

sub_df = pd.DataFrame({'id': test_ids, 'label': test_preds})
sub_path = 'submission.csv'
sub_df.to_csv(sub_path, index=False)
print('Created submission:', sub_path)

## **Conclusion**

Transfer learning can significantly boost performance on limited data, and image preprocessing and augmentation are crucial steps in building a robust model. The validation AUC and the Kaggle test score together indicate how well the final model generalizes to unseen data.


#### What are next steps?

The next steps can include:
- Unfreezing the base layers and fine-tuning the model
- Experimenting with larger EfficientNet variants or other backbone architectures
- Applying ensembling methods to improve prediction robustness

Additionally, it would be beneficial to investigate misclassified images for potential data labeling errors or opportunities to improve augmentation techniques.
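
One way to start that investigation is to rank the misclassified validation samples by how far the prediction is from the true label. The sketch below uses a hypothetical `most_confident_errors` helper and assumes an `ids` sequence aligned with the labels and predictions, which the notebook does not currently track for the validation split:

```python
import numpy as np

def most_confident_errors(y_true, y_prob, ids, k=5, threshold=0.5):
    """Return the ids of the k misclassified samples with the largest error."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    wrong = (y_prob >= threshold).astype(int) != y_true
    error = np.abs(y_prob - y_true)          # distance from the true label
    order = np.argsort(-error[wrong])[:k]    # largest errors first
    return np.asarray(ids)[wrong][order].tolist()

# Toy data: "c" is classified correctly, the other four are not.
toy_ids = ["a", "b", "c", "d", "e"]
toy_true = [0, 1, 1, 0, 1]
toy_prob = [0.95, 0.10, 0.80, 0.60, 0.45]
worst = most_confident_errors(toy_true, toy_prob, toy_ids, k=2)
print(worst)
```

The returned ids can be fed to a plotting helper such as `show_samples` (adapted to take explicit ids) to inspect the tiles by eye.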
