# Medical Image Analysis Project: Pneumonia Detection

## Overview
In this capstone project, you will apply the deep learning techniques learned in this course to a real-world medical imaging problem: **Pneumonia Detection from Chest X-Rays**.

Specifically, you will work with the **PneumoniaMNIST** dataset, a binary classification subset of the MedMNIST collection. Your goal is to build a robust classifier that distinguishes between 'Normal' and 'Pneumonia' cases.

Unlike previous tutorials where models were prescribed, **you interpret the data and choose the model architecture** best suited for the task. You will be evaluated not just on accuracy, but also on your design choices, the rigor of your evaluation, and the interpretability of your results.

## 1. Setup and Data Loading
First, we install and load the necessary libraries. We rely on `medmnist` for data retrieval and `torch` for modeling.

In [None]:
# !pip install medmnist # Uncomment if running in Colab

import torch
import torch.nn as nn
import torch.optim as optim
import torchvision.transforms as transforms
from torch.utils.data import DataLoader
import medmnist
from medmnist import INFO, Evaluator
import matplotlib.pyplot as plt
import numpy as np

print(f"MedMNIST v{medmnist.__version__} @ {medmnist.__file__}")

### Data Understanding
We use **PneumoniaMNIST**, which contains cropped chest X-ray images. The task is binary classification: **0 (Normal)** vs **1 (Pneumonia)**.

In [None]:
data_flag = 'pneumoniamnist'
download = True

info = INFO[data_flag]
task = info['task']
n_channels = info['n_channels']
n_classes = len(info['label'])

DataClass = getattr(medmnist, info['python_class'])

# Basic transform for visualization/baseline
data_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[.5], std=[.5])
])

train_dataset = DataClass(split='train', transform=data_transform, download=download)
val_dataset = DataClass(split='val', transform=data_transform, download=download)
test_dataset = DataClass(split='test', transform=data_transform, download=download)

print(f"Train samples: {len(train_dataset)}")
print(f"Val samples: {len(val_dataset)}")
print(f"Test samples: {len(test_dataset)}")

In [None]:
# Visualization of samples
train_dataset.montage(length=5)

### Class Distribution Analysis
Before proceeding, check whether the classes are balanced. A model trained on an imbalanced dataset can reach high accuracy by favoring the majority class while missing most minority-class cases. Check the counts below.

In [None]:
# Analyze Class Distribution
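def plot_class_distribution(dataset, title='Class Distribution'):
    # Read the labels straight from the dataset; iterating over it would run
    # the image transform on every sample only to discard the image.
    targets = np.asarray(dataset.labels).squeeze()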
    unique, counts = np.unique(targets, return_counts=True)
    plt.bar(unique, counts)
    plt.xticks(unique, ['Normal (0)', 'Pneumonia (1)'])
    plt.title(title)
    plt.ylabel('Count')
    plt.show()
    print(f"Counts: {dict(zip(['Normal', 'Pneumonia'], counts.tolist()))}")

plot_class_distribution(train_dataset, 'Train Set Distribution')
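
If the counts turn out to be imbalanced, one common mitigation is to weight the loss by inverse class frequency. The sketch below is a minimal illustration, not a required part of the project: the helper name `inverse_frequency_weights` and the toy label list are assumptions made for this example, and with the real data you would pass `train_dataset.labels` instead.

```python
import numpy as np

def inverse_frequency_weights(labels, n_classes=2):
    """Return per-class counts and weights n_samples / (n_classes * count).

    Hypothetical helper for illustration. Assumes every class appears at
    least once; a class with zero samples would cause a division by zero.
    """
    labels = np.asarray(labels).ravel().astype(int)
    counts = np.bincount(labels, minlength=n_classes)
    weights = labels.size / (n_classes * counts)
    return counts, weights

# Toy labels (not real dataset statistics): one Normal, three Pneumonia.
example_labels = [0, 1, 1, 1]
counts, weights = inverse_frequency_weights(example_labels)
print(counts, weights)  # the rarer class receives the larger weight
```

The resulting weights can be converted with `torch.tensor(weights, dtype=torch.float32)` and passed as the `weight` argument of `nn.CrossEntropyLoss`. Whether weighting actually helps on this dataset is something to verify on the validation set.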


## 2. Project Requirements

Complete the following tasks, which together structure your project. Document your process clearly in markdown cells alongside your code.

### 1. Define the Problem Statement
- Clearly state the medical problem you are solving (Pneumonia Detection).
- Explain why this is important and what the clinical impact of an automated solution could be.
- Describe the dataset characteristics (size, class balance, image type).

### 2. Choose an Architecture / Approach
- Select a modeling approach. You may choose:
    - One of the architectures explored in `Building_Models.ipynb` (e.g., Simple CNN, Radiomics-based RF, Contrastive Learning).
    - An alternative architecture (e.g., ResNet, DenseNet, Vision Transformer) if you wish to explore further.
- **Justify your choice**: Why is this model suitable for this specific task and dataset?

### 3. Hyperparameter Tuning
- Experiment with key hyperparameters to optimize performance.
- Consider tuning: Learning rate, Batch size, Number of epochs, Optimizer type (Adam vs SGD), Dropout rate, etc.
- Document your tuning process and the final set of hyperparameters selected.
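
One simple way to organize the tuning described above is an exhaustive grid search that records a validation score per configuration. The sketch below only shows the bookkeeping: `train_and_validate` is a hypothetical placeholder that returns a dummy number so the code runs, and the grid values are arbitrary examples rather than recommended settings.

```python
from itertools import product

# Example grid; the specific values are illustrative assumptions.
search_space = {"lr": [1e-2, 1e-3], "batch_size": [32, 128]}

def train_and_validate(lr, batch_size):
    """Hypothetical placeholder: train a model with this configuration and
    return a validation score (for example, validation AUC)."""
    # Dummy deterministic score so the sketch executes; it is NOT a real
    # result. Replace the body with your own training and validation code.
    return 1.0 - abs(lr - 1e-3) - batch_size * 1e-4

results = []
for lr, batch_size in product(search_space["lr"], search_space["batch_size"]):
    score = train_and_validate(lr, batch_size)
    results.append({"lr": lr, "batch_size": batch_size, "val_score": score})

# Select on the validation score only; the test set stays untouched here.
best = max(results, key=lambda r: r["val_score"])
print(best)
```

Keeping every entry of `results` (not just `best`) makes it easy to turn the search into the table or plot that documents your tuning process.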

### 4. Training Analysis
- Implement a training loop that tracks performance on both Training and Validation sets.
- **Produce a Training Loss Curve**: Plot training and validation loss over epochs to diagnose overfitting or underfitting.
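
A minimal sketch of such a loop is shown below, under loud assumptions: it uses random tensors shaped like PneumoniaMNIST inputs (1x28x28) instead of the real data, and a single linear layer as a stand-in model, purely so the example is self-contained. The names `make_loader`, `run_epoch`, and `history` are illustrative choices, not part of any library.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

def make_loader(n_samples, shuffle):
    # Synthetic stand-in data (assumption): random images and random labels.
    x = torch.randn(n_samples, 1, 28, 28)
    y = torch.randint(0, 2, (n_samples,))
    return DataLoader(TensorDataset(x, y), batch_size=16, shuffle=shuffle)

train_loader = make_loader(64, shuffle=True)
val_loader = make_loader(32, shuffle=False)

# Placeholder model; swap in the architecture you chose.
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 2))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def run_epoch(loader, train):
    """Run one pass over `loader` and return the mean per-sample loss."""
    model.train(train)  # enables/disables dropout and batch-norm updates
    total_loss, n_seen = 0.0, 0
    with torch.set_grad_enabled(train):
        for x, y in loader:
            # With MedMNIST, labels arrive with shape (N, 1); use
            # y = y.squeeze(1).long() before computing the loss.
            loss = criterion(model(x), y)
            if train:
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            total_loss += loss.item() * x.size(0)
            n_seen += x.size(0)
    return total_loss / n_seen

history = {"train_loss": [], "val_loss": []}
for epoch in range(3):
    history["train_loss"].append(run_epoch(train_loader, train=True))
    history["val_loss"].append(run_epoch(val_loader, train=False))
```

Plotting `history["train_loss"]` and `history["val_loss"]` against the epoch index with `plt.plot` gives the required curve. A validation loss that rises while the training loss keeps falling is the usual sign of overfitting.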

### 5. Evaluation and Metrics
- Evaluate your best model on the **Test Set**.
- **Produce an ROC Curve**: Plot the Receiver Operating Characteristic curve.
- Report key metrics:
    - **AUC (Area Under the ROC Curve)**
    - **F1-Score**
    - **Sensitivity (Recall)** and **Specificity**
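
The metrics listed above can be computed from ground-truth labels and predicted scores with NumPy alone. The sketch below is one possible implementation under stated assumptions: the function names are hypothetical, the four-sample input is a toy example rather than model output, and both classes are assumed to be present. Library routines such as those in scikit-learn are an equally valid choice.

```python
import numpy as np

def roc_curve_points(y_true, y_score):
    """Return (fpr, tpr) arrays obtained by sweeping the decision threshold."""
    y_true = np.asarray(y_true).ravel().astype(int)
    y_score = np.asarray(y_score, dtype=float).ravel()
    order = np.argsort(-y_score, kind="stable")  # highest score first
    y_true, y_score = y_true[order], y_score[order]
    # Keep one point per distinct score so tied scores move together.
    last_of_tie = np.r_[np.where(np.diff(y_score) != 0)[0], y_true.size - 1]
    tps = np.cumsum(y_true)[last_of_tie]
    fps = (last_of_tie + 1) - tps
    tpr = np.r_[0.0, tps / tps[-1]]
    fpr = np.r_[0.0, fps / fps[-1]]
    return fpr, tpr

def auc_trapezoid(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

def threshold_metrics(y_true, y_score, threshold=0.5):
    """Sensitivity, specificity, and F1 at a fixed decision threshold."""
    y_true = np.asarray(y_true).ravel().astype(int)
    y_pred = (np.asarray(y_score, dtype=float).ravel() >= threshold).astype(int)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    tn = int(np.sum((y_pred == 0) & (y_true == 0)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "f1": 2 * tp / (2 * tp + fp + fn),
    }

# Toy example (not model output): label 1 = Pneumonia, scores = P(class 1).
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
fpr, tpr = roc_curve_points(y_true, y_score)
auc = auc_trapezoid(fpr, tpr)
metrics = threshold_metrics(y_true, y_score)
```

For the ROC plot, `plt.plot(fpr, tpr)` together with a diagonal reference line is usually sufficient. Note that sensitivity, specificity, and F1 depend on the chosen threshold, whereas AUC does not, so state the threshold you used when reporting them.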

### 6. Extensions, Limitations, and Implications
- **Extensions**: How could this model be improved further? (e.g., more data, ensemble methods, external validation).
- **Limitations**: What are the current weaknesses of your solution? (e.g., class imbalance handling, robustness to noise, generalization).
- **Implications**: Discuss the ethical and practical implications of deploying this AI model in a real clinical setting (e.g., bias, explainability, doctor-AI collaboration).

## 3. Student Workspace
Implement your solution below.

In [None]:
# Your code starts here
# Good luck!