
# Capstone 3 – Thoracic X-Ray Project  
## Step 5: Modeling with DenseNet121 (ImageNet Preprocessing)

This notebook implements **Step 5: Modeling & Evaluation** for the thoracic X-ray project using a **DenseNet121** backbone and transfer learning.

It assumes you have already run the **preprocessing notebook** and produced:

- `preprocessed/train_metadata.csv`
- `preprocessed/val_metadata.csv`
- `preprocessed/test_metadata.csv`

Each of these files must contain:

- One binary (0/1) column per disease label (multi-hot targets: an image can have several findings at once)
- An `image_path` column pointing to the local image file for each X-ray

In this version, pixel scaling uses the Keras DenseNet `preprocess_input` function, so inputs receive the same normalization as the ImageNet data the pretrained weights were trained on.
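To make the scaling concrete, here is a minimal NumPy sketch of the arithmetic that the Keras DenseNet `preprocess_input` is documented to perform: scale pixels to [0, 1], then standardize each channel with the ImageNet mean and standard deviation. The function name `densenet_style_preprocess` is hypothetical and is for illustration only; the notebook itself keeps using the Keras function.

```python
import numpy as np

# ImageNet per-channel statistics (RGB order) on a [0, 1] pixel scale.
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def densenet_style_preprocess(images):
    """Approximate DenseNet preprocessing for RGB images with values in [0, 255]."""
    x = np.asarray(images, dtype=np.float32) / 255.0  # scale to [0, 1]
    return (x - IMAGENET_MEAN) / IMAGENET_STD          # standardize each channel

# A mid-gray 2x2 RGB image: every pixel maps to the same three channel values.
demo = np.full((2, 2, 3), 127.5, dtype=np.float32)
out = densenet_style_preprocess(demo)
print(out[0, 0])
```

Because the mean and standard deviation differ per channel, a grayscale X-ray loaded as RGB ends up with three slightly different channels after preprocessing.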


## 1. Imports

In [None]:

import pandas as pd
import numpy as np
from pathlib import Path

from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.applications import DenseNet121
from tensorflow.keras.applications.densenet import preprocess_input
from tensorflow.keras import layers, models
import tensorflow as tf

from sklearn.metrics import roc_auc_score

print("TensorFlow version:", tf.__version__)


## 2. Configuration and Paths

In [None]:

BASE_DIR = Path(r"C:\Springboard\Data Science at Scale\23.5 Capstone 3 - Project Proposals\Thoracic")

PREPROCESSED_DIR = BASE_DIR / "preprocessed"

TRAIN_META_PATH = PREPROCESSED_DIR / "train_metadata.csv"
VAL_META_PATH   = PREPROCESSED_DIR / "val_metadata.csv"
TEST_META_PATH  = PREPROCESSED_DIR / "test_metadata.csv"

print("Train metadata path:", TRAIN_META_PATH, "| Exists?", TRAIN_META_PATH.exists())
print("Val   metadata path:", VAL_META_PATH,   "| Exists?", VAL_META_PATH.exists())
print("Test  metadata path:", TEST_META_PATH,  "| Exists?", TEST_META_PATH.exists())


## 3. Load Preprocessed Metadata

In [None]:

train_df = pd.read_csv(TRAIN_META_PATH)
val_df   = pd.read_csv(VAL_META_PATH)
test_df  = pd.read_csv(TEST_META_PATH)

print("Train shape:", train_df.shape)
print("Val shape:  ", val_df.shape)
print("Test shape: ", test_df.shape)

train_df.head()


## 4. Identify Label Columns and Image Paths

In [None]:

DISEASES = [
    "Atelectasis", "Cardiomegaly", "Effusion", "Infiltration",
    "Mass", "Nodule", "Pneumonia", "Pneumothorax",
    "Consolidation", "Edema", "Emphysema", "Fibrosis",
    "Pleural_Thickening", "Hernia", "No Finding"
]

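# Keep only the expected labels that are actually present in the metadata.
label_cols = [d for d in DISEASES if d in train_df.columns]
if not label_cols:
    raise ValueError("None of the expected disease columns were found in train_df.")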

print("Label columns used for training (multi-label targets):")
print(label_cols)
print("Number of labels:", len(label_cols))

IMAGE_COL = "image_path"
if IMAGE_COL not in train_df.columns:
    raise KeyError(f"Expected '{IMAGE_COL}' column in train_df for image paths.")

train_df[[IMAGE_COL] + label_cols].head()
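The list comprehension above silently skips any expected disease that is missing from the metadata. As a hedged sketch, a small helper can report which labels were found and which were not. The name `check_metadata_columns` and the demo column names are hypothetical and not part of the original notebook.

```python
def check_metadata_columns(columns, expected_labels, image_col="image_path"):
    """Return (present_labels, missing_labels); raise if the image column is absent."""
    columns = set(columns)
    if image_col not in columns:
        raise KeyError(f"Expected '{image_col}' column for image paths.")
    present = [name for name in expected_labels if name in columns]
    missing = [name for name in expected_labels if name not in columns]
    if not present:
        raise ValueError("None of the expected label columns were found.")
    return present, missing

# Hypothetical column names standing in for train_df.columns.
demo_columns = ["image_path", "Atelectasis", "Effusion", "patient_id"]
present, missing = check_metadata_columns(
    demo_columns, ["Atelectasis", "Effusion", "Hernia"]
)
print("present:", present)
print("missing:", missing)
```

In the notebook this could be called as `check_metadata_columns(train_df.columns, DISEASES, IMAGE_COL)`, and repeated for `val_df` and `test_df` to confirm that all three splits expose the same label columns.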


## 5. Image Data Generators (DenseNet121 Preprocessing)

In [None]:

IMG_HEIGHT = 224
IMG_WIDTH  = 224
BATCH_SIZE = 32

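# Training generator: mild geometric augmentation plus DenseNet preprocessing.
train_datagen = ImageDataGenerator(
    preprocessing_function=preprocess_input,
    rotation_range=5,
    width_shift_range=0.05,
    height_shift_range=0.05,
    zoom_range=0.05,
    horizontal_flip=True  # note: mirrors left/right anatomy
)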

val_test_datagen = ImageDataGenerator(
    preprocessing_function=preprocess_input
)

train_gen = train_datagen.flow_from_dataframe(
    dataframe=train_df,
    x_col=IMAGE_COL,
    y_col=label_cols,
    target_size=(IMG_HEIGHT, IMG_WIDTH),
    color_mode="rgb",
    class_mode="raw",
    batch_size=BATCH_SIZE,
    shuffle=True
)

val_gen = val_test_datagen.flow_from_dataframe(
    dataframe=val_df,
    x_col=IMAGE_COL,
    y_col=label_cols,
    target_size=(IMG_HEIGHT, IMG_WIDTH),
    color_mode="rgb",
    class_mode="raw",
    batch_size=BATCH_SIZE,
    shuffle=False
)

test_gen = val_test_datagen.flow_from_dataframe(
    dataframe=test_df,
    x_col=IMAGE_COL,
    y_col=label_cols,
    target_size=(IMG_HEIGHT, IMG_WIDTH),
    color_mode="rgb",
    class_mode="raw",
    batch_size=BATCH_SIZE,
    shuffle=False
)
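By default, `flow_from_dataframe` validates filenames and ignores rows whose image file cannot be found, so a generator can end up with fewer samples than its dataframe. The sketch below, using only the standard library, lists paths that do not exist on disk. The helper name `find_missing_images` and the demo file names are hypothetical.

```python
from pathlib import Path
import tempfile

def find_missing_images(paths):
    """Return the subset of `paths` that do not point to an existing file."""
    return [str(p) for p in paths if not Path(p).is_file()]

# Demo with one real (empty) file and one path that was never created.
with tempfile.TemporaryDirectory() as tmp:
    existing = Path(tmp) / "xray_0001.png"
    existing.write_bytes(b"")
    absent = Path(tmp) / "xray_0002.png"
    missing_demo = find_missing_images([existing, absent])

print(len(missing_demo))
```

Running `find_missing_images(test_df[IMAGE_COL])` before building the generators would show whether any rows are likely to be dropped.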


## 6. Build DenseNet121 Models (Frozen vs Fine-Tuned)

In [None]:

input_shape = (IMG_HEIGHT, IMG_WIDTH, 3)
num_labels  = len(label_cols)

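def build_densenet_model(input_shape, num_labels, train_base=False):
    """Build a DenseNet121 multi-label classifier; `train_base` unfreezes the backbone."""
    base_model = DenseNet121(
        include_top=False,
        weights="imagenet",
        input_shape=input_shape
    )
    base_model.trainable = train_base

    inputs = layers.Input(shape=input_shape)
    # training=False keeps BatchNorm layers in inference mode (moving statistics
    # are used and not updated), even when the backbone weights are trainable.
    x = base_model(inputs, training=False)
    x = layers.GlobalAveragePooling2D()(x)
    x = layers.Dense(256, activation="relu")(x)
    x = layers.Dropout(0.5)(x)
    # Sigmoid gives an independent probability per label (multi-label output).
    outputs = layers.Dense(num_labels, activation="sigmoid")(x)

    model = models.Model(inputs=inputs, outputs=outputs)
    return model

def compile_model(model, lr=1e-4):
    """Compile the model in place with Adam and binary cross-entropy."""
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=lr),
        loss="binary_crossentropy",
        metrics=[
            tf.keras.metrics.BinaryAccuracy(name="binary_accuracy"),
            # With default settings, AUC pools the predictions for all labels
            # into a single ROC curve rather than averaging per label.
            tf.keras.metrics.AUC(name="auc")
        ]
    )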

model_a = build_densenet_model(input_shape, num_labels, train_base=False)
compile_model(model_a, lr=1e-4)
model_a.summary()
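The output layer pairs a sigmoid activation with binary cross-entropy because each label is treated as its own yes/no problem. The NumPy sketch below illustrates that idea on made-up logits. The helper `multilabel_bce` is hypothetical and only approximates what the Keras loss computes (clipping and a mean over labels and samples).

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

def multilabel_bce(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy over all samples and labels."""
    y_true = np.asarray(y_true, dtype=np.float64)
    # Clip probabilities so log() never sees exactly 0 or 1.
    p = np.clip(np.asarray(y_prob, dtype=np.float64), eps, 1.0 - eps)
    per_entry = -(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))
    return per_entry.mean()

logits = np.array([[2.0, -1.0, 0.0]])   # one image, three labels (made up)
probs = sigmoid(logits)
targets = np.array([[1, 0, 0]])
loss_demo = multilabel_bce(targets, probs)
print(probs.sum())   # need not equal 1, unlike softmax outputs
print(loss_demo)
```

Since the probabilities are independent, several findings can be predicted for the same X-ray, which matches the multi-label targets used here.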


## 7. Train Model A – DenseNet Frozen (Feature Extractor)

In [None]:

EPOCHS_A = 3

history_a = model_a.fit(
    train_gen,
    validation_data=val_gen,
    epochs=EPOCHS_A
)


## 8. Train Model B – DenseNet Fine-Tuned

In [None]:

model_b = build_densenet_model(input_shape, num_labels, train_base=True)
compile_model(model_b, lr=1e-5)

EPOCHS_B = 3

history_b = model_b.fit(
    train_gen,
    validation_data=val_gen,
    epochs=EPOCHS_B
)


## 9. Model Selection Based on Validation AUC

In [None]:

def best_val_auc(history):
    vals = history.history.get("val_auc", [])
    return max(vals) if len(vals) > 0 else 0.0

val_auc_a = best_val_auc(history_a)
val_auc_b = best_val_auc(history_b)

print(f"Model A (DenseNet frozen)    - Best Val AUC: {val_auc_a:.4f}")
print(f"Model B (DenseNet fine-tuned) - Best Val AUC: {val_auc_b:.4f}")

if val_auc_b > val_auc_a:
    best_model = model_b
    best_model_name = "DenseNet fine-tuned (Model B)"
else:
    best_model = model_a
    best_model_name = "DenseNet frozen (Model A)"

print("\nSelected best model:", best_model_name)
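Note that `best_val_auc` compares the highest validation AUC reached in any epoch, while `model_a` and `model_b` hold the weights from their final epoch. The sketch below shows one way to see which epoch produced the best score. The helper `best_epoch` and the demo values are hypothetical.

```python
def best_epoch(history_dict, metric="val_auc"):
    """Return (1-based epoch, value) for the highest `metric`, or None if absent."""
    values = history_dict.get(metric, [])
    if not values:
        return None
    idx = max(range(len(values)), key=values.__getitem__)
    return idx + 1, values[idx]

demo_history = {"val_auc": [0.71, 0.78, 0.75]}  # made-up values
print(best_epoch(demo_history))  # (2, 0.78)
```

Calling `best_epoch(history_a.history)` and `best_epoch(history_b.history)` shows whether the best score came from the last epoch. If it did not, Keras callbacks such as `ModelCheckpoint(save_best_only=True)` or `EarlyStopping(restore_best_weights=True)` are one option for keeping the best-epoch weights.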


## 10. Evaluation on Test Set (Global Metrics)

In [None]:

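# return_dict=True pairs each metric name with its value directly, which avoids
# relying on metrics_names (its contents differ between Keras versions).
test_results = best_model.evaluate(test_gen, verbose=1, return_dict=True)
for name, value in test_results.items():
    print(f"{name}: {value:.4f}")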


## 11. Per-Label AUROC for Test Set

In [None]:

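test_gen.reset()
y_pred = best_model.predict(test_gen, verbose=1)
# Take targets from the generator so rows stay aligned with y_pred even if
# flow_from_dataframe dropped rows with invalid image paths.
y_true = np.asarray(test_gen.labels)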

per_label_auc = {}
for i, label in enumerate(label_cols):
    try:
        auc = roc_auc_score(y_true[:, i], y_pred[:, i])
        per_label_auc[label] = auc
    except ValueError:
        per_label_auc[label] = np.nan

per_label_auc_df = (
    pd.DataFrame.from_dict(per_label_auc, orient="index", columns=["test_AUROC"])
    .sort_values("test_AUROC", ascending=False)
)
print(per_label_auc_df)

macro_auc = per_label_auc_df["test_AUROC"].mean(skipna=True)
print("\nMacro-average AUROC across labels (excluding NaN):", macro_auc)
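For intuition about the per-label scores, AUROC can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes that directly with NumPy on a tiny made-up example. The function `pairwise_auroc` is hypothetical; it compares every positive with every negative, so it is meant for illustration and not as a replacement for `roc_auc_score` on the full test set.

```python
import numpy as np

def pairwise_auroc(y_true, y_score):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly."""
    y_true = np.asarray(y_true).astype(bool)
    y_score = np.asarray(y_score, dtype=np.float64)
    pos, neg = y_score[y_true], y_score[~y_true]
    if pos.size == 0 or neg.size == 0:
        return float("nan")  # undefined when only one class is present
    diff = pos[:, None] - neg[None, :]               # every positive vs. every negative
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()  # ties count as half
    return float(wins / (pos.size * neg.size))

auc_demo = pairwise_auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc_demo)  # 0.75
```

The NaN branch corresponds to the `ValueError` case handled in the loop above: a label with only positives or only negatives in the test set has no defined AUROC.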


## 12. Save Best Model

In [None]:

models_dir = BASE_DIR / "models"
models_dir.mkdir(parents=True, exist_ok=True)

best_model_path = models_dir / "best_densenet_model.h5"
best_model.save(best_model_path)
print("Saved best model to:", best_model_path)



## 13. Summary

In this notebook, we:

- Loaded preprocessed train/validation/test metadata with local `image_path` columns.  
- Built Keras `ImageDataGenerator` pipelines for multi-label image classification using **DenseNet121's `preprocess_input`** for pixel scaling aligned with ImageNet pretraining.  
- Defined two DenseNet121-based models:
  - **Model A:** DenseNet with frozen backbone (feature extractor).  
  - **Model B:** DenseNet with fine-tuned backbone.  
- Trained both models for three epochs each and selected the one with the higher best-epoch validation AUC.  
- Evaluated the selected model on the test set, reporting global metrics and **per-label AUROC**.  
- Saved the best-performing DenseNet model to disk for reuse.
