# 🧬 LC25000 Cancer Classification - Full ML Pipeline

This notebook runs the **entire machine learning pipeline** for the LC25000 cancer image classification project, using the modular scripts in your repository.

**Steps:**

0. Project Directory Setup
1. Environment Setup
2. Data Download & Extraction
3. Dataset Splitting
4. Model Summary
5. Training
6. Plotting Training Metrics
7. Animated Training Curves
8. Evaluation on Test Set
9. Grad-CAM Visualization
10. Visualize Predictions as an Image Grid
11. Visualize Misclassifications
12. Grad-CAM Grid (side-by-side)

> **Tip:** Run the cells in order, one at a time. Later steps read files written by earlier ones, and each cell shows its output directly below it.


## 0. Project Directory Setup

This cell creates all the necessary folders for the project (`data`, `outputs`, `results`, etc.) so that all scripts and outputs will work without errors.

**Why?**
- Ensures all scripts can save and load files without directory errors.
- Makes it easy for new users to get started.

**Folders created:**
- `data/`: For raw and processed datasets
- `notebooks/`: For Jupyter notebooks
- `outputs/`: For generated plots and Grad-CAM images
- `results/`: For metrics and result files
- `sample_images/`: (Optional) For example images
- `saved_models/`: For trained model files
- `scripts/`: For all Python scripts

**Run this cell before any other steps.**
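(The code below is wrapped in triple quotes so it does not run by default. Remove the two `"""` lines to enable it.)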

In [None]:
"""
import os

folders = [
    "data",
    "notebooks",
    "outputs",
    "results",
    "sample_images",
    "saved_models",
    "scripts"
]

# Create each folder if it doesn't exist
for folder in folders:
    os.makedirs(folder, exist_ok=True)

print("✅ Project directory structure created.")
"""

## 1. Environment Setup

Install all required dependencies for the project.

- If using a CPU, the default requirements are sufficient.
- For M1/M2 Mac GPU support, see the commented cell below.

**Why?**
- Ensures all Python packages are available for scripts and notebooks.
- Avoids import errors and missing package issues.

**Run this cell only once per environment.**

**Troubleshooting:**
- If you see errors about missing packages, re-run this cell.
- If you use a new environment, run this cell again.
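
If an import error appears later, it can help to check which modules are actually importable before re-running the install cell. The snippet below is a minimal sketch using only the standard library; the `required` list is a hypothetical example and should be adjusted to match the packages in `requirements.txt`.

```python
import importlib.util


def missing_modules(module_names):
    """Return the top-level module names that cannot be imported here."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


# Hypothetical list; adjust it to the packages named in requirements.txt.
required = ["torch", "torchvision", "matplotlib"]
missing = missing_modules(required)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All listed modules are importable.")
```

An empty result only means the modules can be found, not that their versions match the requirements file.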


In [6]:
!pip install -r requirements.txt



In [35]:
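# For M1/M2 Mac GPU support (uncomment if needed)
# Note: the index URL below is PyTorch's CPU wheel index; on Apple Silicon the standard macOS wheels are expected to include the MPS backend.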
# !pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cpu

## 2. Data Download & Extraction

Download and extract the LC25000 dataset.

- The dataset is large (~1.8GB).
- **Skip this step if you already have the data in `data/Lung_and_Colon_Cancer/`.**

**Why?**
- Provides the raw images needed for training and evaluation.

**Tip:** If you are running on Colab or a cloud environment, you may need to re-download each session.

**Files affected:**
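- Downloads `lung-and-colon-cancer-histopathological-images.zip` (the LC25000 archive) and extracts it under `data/`. Later steps expect the images in `data/Lung_and_Colon_Cancer/`.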

### 2.1 How to Download Datasets from Kaggle Using the Kaggle API

To download datasets from Kaggle programmatically, follow these steps:

🔑 Obtain Your Kaggle API Token

1. Sign in to your Kaggle account at https://www.kaggle.com.
2. Click your profile picture in the top-right corner and select Account.
3. Scroll down to the API section and click Create New API Token.
4. A file named `kaggle.json` will be downloaded to your computer. It contains your Kaggle username and API key.

⚠️ Keep your kaggle.json file secure. Do not share it or upload it to public repositories.

### 2.2 📁 Upload kaggle.json to Your Notebook Environment

In your notebook environment (e.g., Google Colab):

1. Use the file upload feature to upload `kaggle.json` to your current working directory.
2. Run the following code to set up the Kaggle API credentials:


In [12]:
# Set your working directory (replace the placeholder path with your project location)
import os
os.chdir('/.../.../cancer_clasification_lc25000')
print("Current working directory:", os.getcwd())

Current working directory: /Users/amiraynede/cancer_clasification_lc25000


In [13]:
import os
import shutil
    
# Install the Kaggle API client
!pip install kaggle --quiet

# Create the .kaggle directory
os.makedirs(os.path.expanduser("~/.kaggle"), exist_ok=True)

# Move kaggle.json into ~/.kaggle (skipped if an earlier run already moved it)
if os.path.exists("kaggle.json"):
    shutil.move("kaggle.json", os.path.expanduser("~/.kaggle/kaggle.json"))

# Set the permissions of the kaggle.json file
os.chmod(os.path.expanduser("~/.kaggle/kaggle.json"), 0o600)
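
Before calling the Kaggle CLI, it may be worth confirming that the token file is in place and well formed. The sketch below assumes the file holds a JSON object with `username` and `key` fields, as described in section 2.1; `check_kaggle_token` is a hypothetical helper, not part of the Kaggle client.

```python
import json
import os
import stat


def check_kaggle_token(path):
    """Return a list of problems found with the token file (empty if none)."""
    if not os.path.isfile(path):
        return [f"{path} does not exist"]
    try:
        with open(path, "r", encoding="utf-8") as f:
            token = json.load(f)
    except json.JSONDecodeError:
        return [f"{path} is not valid JSON"]
    if not isinstance(token, dict):
        return [f"{path} does not contain a JSON object"]
    problems = [f"missing field: {field}" for field in ("username", "key") if not token.get(field)]
    # On POSIX systems, flag files that group or other users can access.
    if os.name == "posix" and stat.S_IMODE(os.stat(path).st_mode) & 0o077:
        problems.append("file is accessible to other users; expected mode 600")
    return problems


print(check_kaggle_token(os.path.expanduser("~/.kaggle/kaggle.json")))
```

An empty list means the file looks usable; it does not verify that the key is still valid on Kaggle's side.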

### 2.3 📦 Download and Unzip the Dataset

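The command below downloads the "Lung and Colon Cancer Histopathological Images" dataset, whose identifier is `andrewmvd/lung-and-colon-cancer-histopathological-images`. To download a different dataset, replace that identifier with the dataset's own `dataset-owner/dataset-name`.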

In [14]:
# Download the dataset
!kaggle datasets download -d andrewmvd/lung-and-colon-cancer-histopathological-images

# Unzip the downloaded dataset
!unzip -q lung-and-colon-cancer-histopathological-images.zip -d data/

Dataset URL: https://www.kaggle.com/datasets/andrewmvd/lung-and-colon-cancer-histopathological-images
License(s): CC-BY-SA-4.0
^C
unzip:  cannot find or open lung-and-colon-cancer-histopathological-images.zip, lung-and-colon-cancer-histopathological-images.zip.zip or lung-and-colon-cancer-histopathological-images.zip.ZIP.
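
The output above shows what happens when the download is interrupted: `unzip` then fails because the archive is missing. One way to guard against this is to check the archive in Python before extracting. This is a minimal sketch using the standard `zipfile` module; `extract_dataset` is a hypothetical helper and is not part of the project scripts.

```python
import os
import zipfile


def extract_dataset(zip_path, dest_dir):
    """Extract zip_path into dest_dir and return the number of archive entries."""
    if not os.path.isfile(zip_path):
        raise FileNotFoundError(f"{zip_path} not found; the download may not have finished")
    if not zipfile.is_zipfile(zip_path):
        raise ValueError(f"{zip_path} is not a complete zip archive")
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
        return len(archive.namelist())


# Example usage (uncomment once the download has completed):
# extract_dataset("lung-and-colon-cancer-histopathological-images.zip", "data")
```

Raising an explicit error makes a partial or missing download obvious, instead of leaving an empty `data/` folder for later steps to trip over.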


## 3. Dataset Splitting

Split the raw dataset into train/val/test folders using the provided script.

- **Input:** `data/Lung_and_Colon_Cancer/`
- **Output:** `data/lc25000_split/` (with `train/`, `val/`, `test/` subfolders)

**Why?**
- Ensures reproducible splits for training, validation, and testing.
- Makes it easy to compare results across runs.

**Files affected:**
- Reads from `data/Lung_and_Colon_Cancer/`
- Creates `data/lc25000_split/train/`, `val/`, `test/`

In [None]:
!python scripts/split_dataset.py

✅ Split class 'lung_aca': 5000 images
✅ Split class 'colon_n': 5000 images
✅ Split class 'colon_aca': 5000 images
✅ Split class 'lung_n': 5000 images
✅ Split class 'lung_scc': 5000 images
✅ Dataset split completed!
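
For readers curious how a reproducible split can work, here is a minimal sketch. It is not the code in `scripts/split_dataset.py`, which is not shown in this notebook; the 70/15/15 ratios and the seed are assumptions made for illustration.

```python
import random


def split_filenames(filenames, train_frac=0.7, val_frac=0.15, seed=42):
    """Shuffle filenames deterministically and return (train, val, test) lists."""
    ordered = sorted(filenames)            # fix the starting order before shuffling
    random.Random(seed).shuffle(ordered)   # local RNG, so global random state is untouched
    n_train = int(len(ordered) * train_frac)
    n_val = int(len(ordered) * val_frac)
    return ordered[:n_train], ordered[n_train:n_train + n_val], ordered[n_train + n_val:]


# Illustrative file names following the dataset's naming pattern.
names = [f"lungaca{i}.jpeg" for i in range(1, 101)]
train, val, test = split_filenames(names)
print(len(train), len(val), len(test))
```

Sorting before shuffling with a fixed seed means the same files land in the same split on every run, regardless of the order in which the file system lists them.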


## 4. Model Summary

This step prints a summary of your model architecture, including each layer, output shape, and parameter count.

**Why?**
- Useful for verifying your model structure before training.
- Helps you understand the complexity and size of your model.

**How?**
- Uses `torchinfo.summary` to display a Keras-like summary table.

In [None]:
!python -m scripts.model_summary --num_classes 5 --input_size 1 3 224 224

Layer (type:depth-idx)                   Input Shape               Output Shape              Param #                   Kernel Shape              Mult-Adds
ResNet                                   [1, 3, 224, 224]          [1, 5]                    --                        --                        --
├─Conv2d: 1-1                            [1, 3, 224, 224]          [1, 64, 112, 112]         (9,408)                   [7, 7]                    118,013,952
├─BatchNorm2d: 1-2                       [1, 64, 112, 112]         [1, 64, 112, 112]         (128)                     --                        128
├─ReLU: 1-3                              [1, 64, 112, 112]         [1, 64, 112, 112]         --                        --                        --
├─MaxPool2d: 1-4                         [1, 64, 112, 112]         [1, 64, 56, 56]           --                        3                         --
├─Sequential: 1-5                        [1, 64, 56, 56]           [1, 64, 56, 56]           --
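
The numbers in the first rows of the summary can be checked by hand. The sketch below reproduces the output shape, parameter count, and mult-adds of the first convolution, assuming the standard torchvision ResNet settings for that layer (stride 2, padding 3, no bias), which the table itself does not list.

```python
def conv2d_stats(in_ch, out_ch, kernel, in_size, stride, padding):
    """Return (output_size, parameter_count, mult_adds) for a bias-free square Conv2d."""
    out_size = (in_size + 2 * padding - kernel) // stride + 1
    params = out_ch * in_ch * kernel * kernel
    mult_adds = params * out_size * out_size  # each weight is applied once per output position
    return out_size, params, mult_adds


# First ResNet convolution: 3 -> 64 channels, 7x7 kernel, 224x224 input.
print(conv2d_stats(3, 64, kernel=7, in_size=224, stride=2, padding=3))  # → (112, 9408, 118013952)

# BatchNorm2d over 64 channels learns one scale and one shift per channel.
print(64 * 2)  # → 128
```

These values match the `Conv2d: 1-1` and `BatchNorm2d: 1-2` rows above.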

## 5. Training

Train the ResNet18 model on the LC25000 dataset.

- This may take a while depending on your hardware.
- The best model will be saved to `saved_models/` with a timestamp.
- Training metrics and plots will be saved to `results/`.

**Why?**
- This is where your model learns to classify cancer images.

**Tips:**
- You can adjust the number of epochs in `scripts/train.py`.
- Monitor your system resources if running locally.

In [None]:
!python -m scripts.train

Epoch 1/10
100%|█████████████████████████████████████████| 625/625 [03:54<00:00,  2.66it/s]
Train Loss: 0.2394, Train Acc: 0.9285
Val Loss:   0.0858, Val Acc:   0.9744
✅ Saved best model to saved_models/resnet18_lc25000_20250710_022956.pth
Epoch 2/10
100%|█████████████████████████████████████████| 625/625 [03:52<00:00,  2.69it/s]
Train Loss: 0.1147, Train Acc: 0.9610
Val Loss:   0.0582, Val Acc:   0.9836
✅ Saved best model to saved_models/resnet18_lc25000_20250710_022956.pth
Epoch 3/10
100%|█████████████████████████████████████████| 625/625 [03:53<00:00,  2.67it/s]
Train Loss: 0.0938, Train Acc: 0.9670
Val Loss:   0.0498, Val Acc:   0.9868
✅ Saved best model to saved_models/resnet18_lc25000_20250710_022956.pth
Epoch 4/10
100%|█████████████████████████████████████████| 625/625 [03:52<00:00,  2.68it/s]
Train Loss: 0.0893, Train Acc: 0.9678
Val Loss:   0.0462, Val Acc:   0.9848
Epoch 5/10
100%|█████████████████████████████████████████| 625/625 [03:52<00:00,  2.69it/s]
Train Loss: 0.0782, 
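
In the log above, the model is saved after epochs 1 to 3 but not after epoch 4, where validation accuracy dropped even though validation loss improved. That pattern is consistent with checkpointing on the best validation accuracy. The sketch below illustrates that rule; it is an assumption about `scripts/train.py`, which is not shown here.

```python
def epochs_that_save(val_accuracies):
    """Return the 1-based epochs at which validation accuracy sets a new best."""
    best = float("-inf")
    saved = []
    for epoch, acc in enumerate(val_accuracies, start=1):
        if acc > best:
            best = acc
            saved.append(epoch)
    return saved


# Validation accuracies from the first four epochs of the log above.
print(epochs_that_save([0.9744, 0.9836, 0.9868, 0.9848]))  # → [1, 2, 3]
```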

## 6. Plotting Training Metrics

Generate loss and accuracy plots from the latest training metrics JSON.

- Input: `results/training_metrics_<timestamp>.json`
- Output: `outputs/` (plots)

**Why?**
- Visualizes how your model is learning over time.
- Helps you spot overfitting or underfitting.

**Tip:** Update the path below to match your latest metrics file if needed.

In [None]:
# Replace with your latest metrics file if needed
!python -m scripts.plot results/training_metrics_20250707_152206.json

Figure(1400x600)
✅ Saved combined loss/accuracy plot to outputs/training_metrics_20250707_152206_combined.png
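
Alongside the plot, a quick numeric summary can help when looking for overfitting. The sketch below assumes the metrics JSON holds `train_losses` and `val_losses` lists, the key names used by the animation cell in this notebook; `summarize_metrics` is a hypothetical helper and the example numbers are illustrative, not results from this project.

```python
def summarize_metrics(metrics):
    """Return the best validation epoch and the final train/val loss gap."""
    val_losses = metrics["val_losses"]
    train_losses = metrics["train_losses"]
    best_epoch = min(range(len(val_losses)), key=val_losses.__getitem__) + 1
    return {
        "best_val_epoch": best_epoch,
        "epochs_since_best": len(val_losses) - best_epoch,
        "final_loss_gap": val_losses[-1] - train_losses[-1],
    }


# Illustrative numbers only.
example = {"train_losses": [0.9, 0.5, 0.3, 0.2], "val_losses": [0.8, 0.6, 0.65, 0.7]}
print(summarize_metrics(example))
```

A validation loss that stopped improving several epochs ago while the gap to the training loss keeps growing is a common sign of overfitting.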


## 7. Animated Training Curves

This cell animates the training and validation loss and accuracy curves over epochs, using the latest training metrics file in the `results/` folder.

**Why?**
- Visually shows how your model improves during training, epoch by epoch.
- Makes it easy to spot sudden changes or plateaus in learning.

**How to use:**
- **Script:** run `python -m scripts.animate_training_curves` in a terminal. The animation opens in a separate window (best for local use).
- **Notebook cell:** run the code cell below to render the animation inline in the notebook output (recommended for Jupyter/Colab).

**Tip:** The notebook cell is best for interactive exploration. The script is useful for automated runs or when working outside a notebook.

In [9]:
#!python -m scripts.animate_training_curves

In [8]:
import glob
import os
import json
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation
from IPython.display import HTML

# Find the latest training_metrics_*.json file
metrics_files = glob.glob("results/training_metrics_*.json")
if not metrics_files:
    raise FileNotFoundError("No training_metrics_*.json files found in results/")
metrics_path = max(metrics_files, key=os.path.getctime)
print(f"Using metrics file: {metrics_path}")

with open(metrics_path, "r") as f:
    metrics = json.load(f)

train_losses = metrics["train_losses"]
val_losses = metrics["val_losses"]
train_accuracies = metrics["train_accuracies"]
val_accuracies = metrics["val_accuracies"]
epochs = range(1, len(train_losses) + 1)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

def animate(i):
    ax1.clear()
    ax2.clear()
    ax1.plot(epochs[:i+1], train_losses[:i+1], 'b-', label='Train Loss')
    ax1.plot(epochs[:i+1], val_losses[:i+1], 'r-', label='Val Loss')
    ax1.set_title("Loss Over Epochs")
    ax1.set_xlabel("Epoch")
    ax1.set_ylabel("Loss")
    ax1.legend()
    ax1.set_xlim(1, len(epochs))
    ax1.set_ylim(0, max(max(train_losses), max(val_losses)) * 1.1)

    ax2.plot(epochs[:i+1], train_accuracies[:i+1], 'b-', label='Train Acc')
    ax2.plot(epochs[:i+1], val_accuracies[:i+1], 'r-', label='Val Acc')
    ax2.set_title("Accuracy Over Epochs")
    ax2.set_xlabel("Epoch")
    ax2.set_ylabel("Accuracy")
    ax2.legend()
    ax2.set_xlim(1, len(epochs))
    ax2.set_ylim(0, 1.05)

ani = FuncAnimation(fig, animate, frames=len(epochs), interval=200, repeat=False)
plt.close(fig)
HTML(ani.to_jshtml())

Using metrics file: results/training_metrics_20250709_182126.json


## 8. Evaluation on Test Set

In this step, the model predicts labels for all test images, calculates metrics, and saves results for further analysis.

- **Classification report**: Printed and saved to `outputs/classification_report.txt`
- **Confusion matrix**: Saved as `outputs/confusion_matrix.png`
- **Raw predictions**: Saved as `outputs/test_predictions.csv`

**The CSV file contains, for each test image:**
- `filename`: path to the image
- `true_label`: actual class
- `predicted_label`: model's prediction

**Example:**

filename,true_label,predicted_label

.../lungaca1356.jpeg,lung_aca,lung_aca

.../lungaca2966.jpeg,lung_aca,lung_scc


You can use this file to find misclassified images, analyze errors, or do further analysis in pandas/Excel.
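
As a starting point, misclassified rows can be pulled out with the standard `csv` module. This sketch relies only on the three column names listed above; it reads from an in-memory sample built from the example rows so that it runs without the real file.

```python
import csv
import io


def misclassified_rows(csv_file):
    """Return the rows whose predicted label differs from the true label."""
    reader = csv.DictReader(csv_file)
    return [row for row in reader if row["true_label"] != row["predicted_label"]]


# In-memory sample built from the example rows above.
sample = io.StringIO(
    "filename,true_label,predicted_label\n"
    ".../lungaca1356.jpeg,lung_aca,lung_aca\n"
    ".../lungaca2966.jpeg,lung_aca,lung_scc\n"
)
for row in misclassified_rows(sample):
    print(row["filename"], row["true_label"], "->", row["predicted_label"])
```

To run it on the real file, pass `open("outputs/test_predictions.csv", newline="")` instead of the sample.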

In [None]:
!python -m scripts.evaluate_on_test

📂 Using latest model: saved_models/resnet18_lc25000_20250709_182126.pth

📊 Classification Report:

              precision    recall  f1-score   support

   colon_aca       1.00      1.00      1.00        73
     colon_n       1.00      1.00      1.00        85
    lung_aca       1.00      0.98      0.99        63
      lung_n       1.00      1.00      1.00        76
    lung_scc       0.99      1.00      0.99        78

    accuracy                           1.00       375
   macro avg       1.00      1.00      1.00       375
weighted avg       1.00      1.00      1.00       375

✅ Saved classification report to outputs/classification_report.txt
✅ Saved raw predictions to outputs/test_predictions.csv
Figure(800x600)
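
To make the precision and recall columns concrete, the sketch below computes both from `(true, predicted)` label pairs. The pair counts are hypothetical: they were chosen to match the supports in the report (63 `lung_aca`, 78 `lung_scc`) with a single `lung_aca` image predicted as `lung_scc`, one scenario that reproduces the rounded values shown. The real counts come from `outputs/test_predictions.csv`.

```python
def precision_recall(pairs, label):
    """Compute (precision, recall) for label from (true, predicted) pairs."""
    tp = sum(1 for t, p in pairs if t == label and p == label)
    predicted = sum(1 for _, p in pairs if p == label)
    actual = sum(1 for t, _ in pairs if t == label)
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall


# Hypothetical counts: 62 correct lung_aca, 1 lung_aca predicted as lung_scc, 78 correct lung_scc.
pairs = (
    [("lung_aca", "lung_aca")] * 62
    + [("lung_aca", "lung_scc")]
    + [("lung_scc", "lung_scc")] * 78
)
for label in ("lung_aca", "lung_scc"):
    precision, recall = precision_recall(pairs, label)
    print(label, round(precision, 2), round(recall, 2))
```

With these counts, `lung_aca` recall is 62/63 and `lung_scc` precision is 78/79, which round to 0.98 and 0.99.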


## 9. Grad-CAM Visualization

Visualize model attention using Grad-CAM on a sample test image.

- Update the image path below to any image from your test set.
- Uses the latest model by default.
- Output is saved to `outputs/` as a heatmap overlay.

**Why?**
- Helps you interpret what parts of the image the model is focusing on for its prediction.


In [25]:
# Example image from test set
!python -m scripts.gradcam --image_path data/lc25000_split/test/lung_aca/lungaca1356.jpeg

📂 Using latest model: saved_models/resnet18_lc25000_20250709_182126.pth
✅ Forward hook triggered
output.requires_grad: False
🎯 Target class index: 2
📉 Loss shape: torch.Size([1])
✅ Backward hook triggered
Heatmap min: 0.21601054072380066, max: 0.9999918341636658
✅ Saved Grad-CAM heatmap to outputs/lungaca1356_gradcam.jpg
✅ Saved Grad-CAM heatmap to outputs/lungaca1356_gradcam.jpg
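
The core Grad-CAM computation is small enough to sketch in plain Python: each channel of the target layer is weighted by the spatial mean of its gradient, the weighted maps are summed, negative values are clipped, and the result is scaled. This illustrates the general technique on nested lists; how `scripts/gradcam.py` implements and normalizes it is not shown in this notebook, so the scaling step here is an assumption.

```python
def grad_cam(activations, gradients):
    """Compute a Grad-CAM map from per-channel activations and gradients.

    Both arguments are lists of channels; each channel is a 2D list (H x W).
    """
    height, width = len(activations[0]), len(activations[0][0])
    # Channel weight = spatial mean of that channel's gradient.
    weights = [sum(map(sum, grad)) / (height * width) for grad in gradients]
    # Weighted sum of the activation maps, followed by ReLU.
    cam = [
        [max(0.0, sum(w * act[y][x] for w, act in zip(weights, activations))) for x in range(width)]
        for y in range(height)
    ]
    peak = max(map(max, cam))
    # Scale so the largest value is 1; an all-zero map is returned unchanged.
    return [[v / peak for v in row] for row in cam] if peak > 0 else cam


# Tiny example: two channels of size 2x2; only the first channel receives gradient.
activations = [[[1, 2], [3, 4]], [[4, 3], [2, 1]]]
gradients = [[[1, 1], [1, 1]], [[0, 0], [0, 0]]]
print(grad_cam(activations, gradients))  # → [[0.25, 0.5], [0.75, 1.0]]
```

In practice the activations and gradients come from forward and backward hooks on a convolutional layer, and the map is resized to the input image before being overlaid.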


## 10. Visualize Predictions as an Image Grid

This step displays a random sample of test images in a grid, showing their true and predicted labels.

- **Green title:** The model's prediction matches the true label (correct).
- **Red title:** The model's prediction does not match the true label (misclassification).

**Why?**
- Provides a quick, visual overview of how well your model is performing on individual images.
- Helps you spot patterns in correct and incorrect predictions.

**How to use:**
- You can adjust the number of images and columns using the `--n_images` and `--cols` arguments in the code cell.
- The images are sampled randomly from the test set predictions.

**Tip:** Use this visualization to get a sense of which classes or image types are easiest or hardest for your model.

In [None]:
!python -m scripts.visualize_predictions --csv_path outputs/test_predictions.csv --n_images 9 --cols 3 --output_path outputs/prediction_grid.png

✅ Saved grid plot to outputs/prediction_grid.png
Figure(1200x1200)
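
The grid logic described above can be sketched without any plotting: sample rows, compute the number of grid rows from `--n_images` and `--cols`, and choose a title color per image. `plan_grid` and its `seed` argument are hypothetical and only illustrate the idea; the actual script may sample and lay out images differently.

```python
import math
import random


def plan_grid(rows, n_images, cols, seed=0):
    """Pick up to n_images prediction rows and assign grid positions and title colors."""
    chosen = random.Random(seed).sample(rows, min(n_images, len(rows)))
    n_rows = math.ceil(len(chosen) / cols)
    cells = [
        {
            "row": i // cols,
            "col": i % cols,
            "filename": r["filename"],
            "color": "green" if r["true_label"] == r["predicted_label"] else "red",
        }
        for i, r in enumerate(chosen)
    ]
    return n_rows, cells


# Illustrative rows in the same shape as outputs/test_predictions.csv.
rows = [
    {"filename": f"img{i}.jpeg", "true_label": "lung_aca",
     "predicted_label": "lung_aca" if i % 4 else "lung_scc"}
    for i in range(12)
]
n_rows, cells = plan_grid(rows, n_images=9, cols=3)
print(n_rows, [c["color"] for c in cells])
```

With 9 images and 3 columns this yields a 3 by 3 grid, which matches the square figure size reported above.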


## 11. Visualize Misclassifications

This cell displays a grid of misclassified test images, with true and predicted labels in red. The grid is also saved as `outputs/misclassified_grid.png`.

**Why?**
- Focuses your attention on the images where the model made mistakes.
- Useful for error analysis: you can look for patterns in the misclassifications (e.g., certain classes being confused, poor image quality, etc.).
- Helps guide future improvements to your model or data.

**How to use:**
- You can adjust the number of images and columns using the `--n_images` and `--cols` arguments in the code cell.
- Only images where the prediction does not match the true label are shown.

**Tip:** After running this cell, review the misclassified images to see if there are common features or issues that could be addressed in future model iterations.

In [29]:
!python -m scripts.visualize_misclassifications --csv_path outputs/test_predictions.csv --n_images 9 --cols 3 --output_path outputs/misclassified_grid.png

✅ Saved misclassification grid to outputs/misclassified_grid.png
Figure(1200x400)
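
To look for classes that are confused with each other, it can help to count each `(true_label, predicted_label)` pair among the errors. This is a minimal sketch over rows shaped like those in `outputs/test_predictions.csv`; the example rows are illustrative, not real results.

```python
from collections import Counter


def confusion_pairs(rows):
    """Count (true_label, predicted_label) pairs among misclassified rows."""
    return Counter(
        (r["true_label"], r["predicted_label"])
        for r in rows
        if r["true_label"] != r["predicted_label"]
    )


# Illustrative rows only.
rows = [
    {"true_label": "lung_aca", "predicted_label": "lung_scc"},
    {"true_label": "lung_aca", "predicted_label": "lung_scc"},
    {"true_label": "lung_scc", "predicted_label": "lung_aca"},
    {"true_label": "colon_n", "predicted_label": "colon_n"},
]
for (true_label, predicted_label), count in confusion_pairs(rows).most_common():
    print(f"{true_label} -> {predicted_label}: {count}")
```

Pairs with high counts point to the class boundaries worth inspecting first in the grid.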


## 12. Visualize Grad-CAM Heatmaps in a Grid

This cell displays a grid of test images (optionally only misclassified ones) with their Grad-CAM heatmaps side by side.  
The grid is also saved as `outputs/gradcam_grid.png`.

**Why?**
- Lets you compare the original image and the model's attention side by side.
- Useful for qualitative analysis and presentations.

In [30]:
!python -m scripts.visualize_gradcam_grid --csv_path outputs/test_predictions.csv --model_path saved_models/resnet18_lc25000_20250707_152206.pth --n_images 4 --cols 2 --only_misclassified --output_path outputs/gradcam_grid.png

✅ Forward hook triggered
output.requires_grad: False
🎯 Target class index: 4
📉 Loss shape: torch.Size([1])
✅ Backward hook triggered
Heatmap min: 0.26249441504478455, max: 0.9999836683273315
✅ Saved Grad-CAM heatmap to outputs/lungaca4634_gradcam.jpg
✅ Saved Grad-CAM heatmap to outputs/lungaca4634_gradcam.jpg
✅ Saved Grad-CAM grid to outputs/gradcam_grid.png
Figure(1600x400)
