# Notebook 3: Model Testing — Cutaneous Leishmaniasis Ulcer Classification

## Purpose
Evaluate the trained binary classification model on **unseen** CL ulcer test images.

- **Class 0 (Sensitive):** CL ulcers showing healing / good treatment response
- **Class 1 (Poor):** CL ulcers showing poor treatment response

## Prerequisites
1. Run `preprocessing.ipynb` to preprocess training images.
2. Run `model_training.ipynb` to train and save `model.h5`.
3. Upload `model.h5` and a **test dataset ZIP** to this notebook.

## Evaluation Metrics
- Test Accuracy
- Confusion Matrix
- Precision, Recall, F1-Score
- Visual predictions with confidence scores
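
As a rough illustration of how the metrics listed above relate to one another, the sketch below derives accuracy, precision, recall, and F1 from the four cells of a binary confusion matrix, treating Class 1 (Poor) as the positive class. The helper name `binary_metrics` and the counts are made up for illustration; they are not results from this model.

```python
def binary_metrics(tn, fp, fn, tp):
    """Derive accuracy, precision, recall and F1 from confusion-matrix counts.

    Class 1 (Poor) is treated as the positive class. Ratios with a zero
    denominator are reported as 0.0.
    """
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts, for illustration only
metrics = binary_metrics(tn=6, fp=4, fn=2, tp=8)
for name, value in metrics.items():
    print(f"{name:>9}: {value:.4f}")
```

Recall for the Poor class is often the number to watch here, since a missed poor-response ulcer (a false negative) is counted against recall but not against precision.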

---
**⚠️ Clinical Constraint:** Test images MUST be cutaneous leishmaniasis ulcer wounds ONLY.  
Test images must be **unseen** — NOT used during training.

## 1. Import Libraries

In [None]:
import os
import shutil
import zipfile
import numpy as np
import matplotlib.pyplot as plt
import cv2

import tensorflow as tf
from tensorflow import keras

from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    classification_report,
    precision_score,
    recall_score,
    f1_score,
    ConfusionMatrixDisplay
)

# Google Colab file upload utility
try:
    from google.colab import files
    IN_COLAB = True
except ImportError:
    IN_COLAB = False
    print("Not running in Google Colab. Manual upload will be skipped.")

# Reproducibility
np.random.seed(42)
tf.random.set_seed(42)

print(f"TensorFlow version: {tf.__version__}")
print("All libraries imported successfully.")

## 2. Upload Model and Test Data (Manual Upload)

**Two separate uploads are required:**

### Upload 1: Trained Model
- Upload `model.h5` from Notebook 2

### Upload 2: Test Dataset
- Upload a ZIP file containing:
```
test/
  ├── sensitive/   ← Unseen CL ulcers (healing / good response)
  │     ├── img001.jpg
  │     └── ...
  └── poor/        ← Unseen CL ulcers (poor treatment response)
        ├── img001.jpg
        └── ...
```
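
Before uploading, it may help to confirm that the ZIP really has the layout shown above. The following is a minimal sketch using only the standard library; the helper name `class_folders_in_zip` and the demo archive are hypothetical and are not part of this notebook's pipeline.

```python
import os
import tempfile
import zipfile
from pathlib import PurePosixPath


def class_folders_in_zip(zip_path, required=("sensitive", "poor")):
    """Return the required class folders that contain at least one file."""
    found = set()
    with zipfile.ZipFile(zip_path, "r") as zf:
        for name in zf.namelist():
            if name.endswith("/"):
                continue  # directory entry, not a file
            parts = PurePosixPath(name).parts
            if "__MACOSX" in parts:
                continue  # macOS metadata, ignored
            # The immediate parent folder of a file is taken as its class folder
            if len(parts) >= 2 and parts[-2] in required:
                found.add(parts[-2])
    return found


# Self-contained demo: build a throwaway archive with placeholder bytes
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = os.path.join(tmp, "demo_test.zip")
    with zipfile.ZipFile(demo_zip, "w") as zf:
        zf.writestr("test/sensitive/img001.jpg", b"placeholder")
        zf.writestr("test/poor/img001.jpg", b"placeholder")
    demo_found = class_folders_in_zip(demo_zip)

missing = {"sensitive", "poor"} - demo_found
print("Missing class folders:", sorted(missing))
```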

**⚠️ CLINICAL REMINDER:**  
Test images must be **unseen** cutaneous leishmaniasis ulcer images NOT used during training.
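
One way to partially check the "unseen" requirement is to compare file hashes between the training and test folders. The sketch below assumes the training images are still available in a local folder; the function names and the demo folders are hypothetical. It only catches byte-identical copies, so resized, cropped, or re-encoded duplicates would not be detected.

```python
import hashlib
import os
import tempfile


def file_hashes(root):
    """Map SHA-256 digest -> path (relative to root) for every file under root."""
    hashes = {}
    for dirpath, _, filenames in os.walk(root):
        for fname in filenames:
            path = os.path.join(dirpath, fname)
            with open(path, "rb") as fh:
                digest = hashlib.sha256(fh.read()).hexdigest()
            hashes[digest] = os.path.relpath(path, root)
    return hashes


def find_leaked_files(train_root, test_root):
    """Return (test_path, train_path) pairs whose file contents are identical."""
    train = file_hashes(train_root)
    test = file_hashes(test_root)
    return sorted((test[d], train[d]) for d in test.keys() & train.keys())


# Self-contained demo with throwaway folders and placeholder bytes
with tempfile.TemporaryDirectory() as tmp:
    train_root = os.path.join(tmp, "train")
    test_root = os.path.join(tmp, "test")
    os.makedirs(os.path.join(train_root, "poor"))
    os.makedirs(os.path.join(test_root, "poor"))
    contents = {
        os.path.join(train_root, "poor", "a.jpg"): b"AAA",
        os.path.join(train_root, "poor", "b.jpg"): b"BBB",
        os.path.join(test_root, "poor", "x.jpg"): b"AAA",  # copy of a.jpg
        os.path.join(test_root, "poor", "y.jpg"): b"CCC",
    }
    for path, data in contents.items():
        with open(path, "wb") as fh:
            fh.write(data)
    leaks = find_leaked_files(train_root, test_root)

print(f"Byte-identical train/test pairs: {leaks}")
```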

In [None]:
# ============================================================
# UPLOAD 1: TRAINED MODEL (model.h5)
# ============================================================

MODEL_PATH = 'model.h5'

if IN_COLAB:
    print("="*50)
    print("  STEP 1: Upload model.h5")
    print("="*50)
    print("Select the model.h5 file from Notebook 2.\n")

    uploaded_model = files.upload()

    # Verify that a model file was uploaded and point MODEL_PATH at it,
    # since the uploaded file may not be named exactly 'model.h5'
    h5_files = [f for f in uploaded_model.keys() if f.lower().endswith('.h5')]
    if not h5_files:
        print("⚠️  No .h5 file detected in upload. Please upload model.h5")
    else:
        MODEL_PATH = h5_files[0]
        print(f"\n✅ Model file uploaded: {MODEL_PATH}")
else:
    print("Not in Colab. Ensure 'model.h5' exists in the working directory.")

In [None]:
# ============================================================
# UPLOAD 2: TEST DATASET (test.zip)
# ============================================================

TEST_DIR = 'test'
CLASSES = ['sensitive', 'poor']

if IN_COLAB:
    print("="*50)
    print("  STEP 2: Upload test dataset ZIP")
    print("="*50)
    print("Select your test ZIP containing:")
    print("  test/sensitive/  and  test/poor/\n")

    uploaded_test = files.upload()

    for filename in uploaded_test.keys():
        if filename.lower().endswith('.zip'):
            print(f"\nExtracting '{filename}'...")
            with zipfile.ZipFile(filename, 'r') as zip_ref:
                zip_ref.extractall('.')
            print("Extraction complete.")
        else:
            print(f"⚠️  '{filename}' is not a ZIP file.")
else:
    print("Not in Colab. Ensure 'test/' folder exists in the working directory.")

# --------------------------------------------------
# AUTO-DETECT TEST DIRECTORY
# --------------------------------------------------
def find_data_dir(expected_name, required_subdirs):
    """Find directory containing required subdirectories after extraction."""
    if os.path.isdir(expected_name):
        if all(os.path.isdir(os.path.join(expected_name, s)) for s in required_subdirs):
            return expected_name

    if all(os.path.isdir(s) for s in required_subdirs):
        os.makedirs(expected_name, exist_ok=True)
        for s in required_subdirs:
            dest = os.path.join(expected_name, s)
            if not os.path.exists(dest):
                shutil.move(s, dest)
        return expected_name

    for root, dirs, _ in os.walk('.'):
        dirs[:] = [d for d in dirs if not d.startswith('.') and d != '__MACOSX']
        if all(s in dirs for s in required_subdirs) and root != '.':
            return root

    raise FileNotFoundError(
        f"Could not find directory with subdirectories {required_subdirs}.\n"
        f"Check your ZIP structure."
    )

TEST_DIR = find_data_dir(TEST_DIR, CLASSES)
print(f"\n✅ Using test directory: '{TEST_DIR}/'")

## 3. Load Trained Model

In [None]:
# ============================================================
# LOAD TRAINED MODEL
# ============================================================

if not os.path.exists(MODEL_PATH):
    raise FileNotFoundError(
        f"Model file '{MODEL_PATH}' not found.\n"
        f"Please upload model.h5 from Notebook 2."
    )

model = keras.models.load_model(MODEL_PATH)

print(f"✅ Model loaded: {MODEL_PATH}")
print(f"   Input shape:  {model.input_shape}")
print(f"   Output shape: {model.output_shape}")

## 4. Preprocess and Load Test Images

Apply the **same preprocessing pipeline** as in Notebook 1:
1. Resize to 224×224
2. BGR (as loaded by OpenCV) → CIE LAB → Extract L (lightness) channel
3. CLAHE contrast enhancement
4. Median filtering
5. Normalize to [0, 1]
6. Replicate to 3 channels (for MobileNetV2)

**Critical:** Preprocessing must match training exactly.
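
A quick sanity check on the preprocessed batch can catch some mismatches before prediction. The sketch below is one possible check, assuming the output format described in the steps above (224×224, three identical channels, `float32` values in [0, 1]); `check_preprocessed_batch` is a hypothetical helper and the demo uses synthetic arrays rather than real images. It cannot verify that the CLAHE or median-filter settings match those used in training.

```python
import numpy as np


def check_preprocessed_batch(batch, img_size=224):
    """Return a list of problems found in a preprocessed image batch."""
    problems = []
    if batch.ndim != 4 or batch.shape[1:] != (img_size, img_size, 3):
        problems.append(f"unexpected shape {batch.shape}")
        return problems
    if batch.size == 0:
        problems.append("batch is empty")
        return problems
    if batch.dtype != np.float32:
        problems.append(f"unexpected dtype {batch.dtype}")
    if batch.min() < 0.0 or batch.max() > 1.0:
        problems.append("values fall outside [0, 1]")
    # The L channel is replicated, so all three channels should be equal
    if not (np.array_equal(batch[..., 0], batch[..., 1])
            and np.array_equal(batch[..., 0], batch[..., 2])):
        problems.append("channels differ; expected a replicated single channel")
    return problems


# Demo on synthetic data standing in for real preprocessed images
rng = np.random.default_rng(0)
gray = rng.random((2, 224, 224), dtype=np.float32)
good_batch = np.stack([gray, gray, gray], axis=-1)
bad_batch = good_batch * 255.0  # simulates a forgotten normalization step

print("good batch:", check_preprocessed_batch(good_batch))
print("bad batch: ", check_preprocessed_batch(bad_batch))
```

In this notebook the same check could be applied to `X_test` once it has been built.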

In [None]:
# ============================================================
# PREPROCESSING FUNCTION
# Same pipeline as preprocessing.ipynb; it MUST match exactly!
# ============================================================

IMG_SIZE = 224
CLAHE_CLIP_LIMIT = 2.0
CLAHE_TILE_SIZE = (8, 8)
MEDIAN_KERNEL_SIZE = 5


def preprocess_test_image(image_path):
    """
    Apply the same preprocessing pipeline used during training.

    Pipeline:
      1. Load & resize to 224x224
      2. BGR → CIE LAB → extract L channel
      3. CLAHE enhancement
      4. Median filtering
      5. Normalize to [0, 1]
      6. Replicate to 3 channels (MobileNetV2)

    Returns:
        (processed_3ch, original_rgb) or (None, None) on failure.
    """
    img_bgr = cv2.imread(image_path)

    if img_bgr is None:
        print(f"  ⚠️  Could not load: {os.path.basename(image_path)}")
        return None, None

    # Resize
    img_bgr = cv2.resize(img_bgr, (IMG_SIZE, IMG_SIZE), interpolation=cv2.INTER_AREA)
    original_rgb = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2RGB)

    # BGR → LAB → L channel
    img_lab = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2LAB)
    l_channel = img_lab[:, :, 0]

    # CLAHE
    clahe = cv2.createCLAHE(clipLimit=CLAHE_CLIP_LIMIT, tileGridSize=CLAHE_TILE_SIZE)
    l_clahe = clahe.apply(l_channel)

    # Median filtering
    l_filtered = cv2.medianBlur(l_clahe, MEDIAN_KERNEL_SIZE)

    # Normalize to [0, 1]
    processed = l_filtered.astype(np.float32) / 255.0

    # Replicate to 3 channels for MobileNetV2
    processed_3ch = np.stack([processed, processed, processed], axis=-1)

    return processed_3ch, original_rgb


print("✅ Preprocessing function defined.")

In [None]:
# ============================================================
# LOAD AND PREPROCESS ALL TEST IMAGES
# ============================================================

VALID_EXTENSIONS = {'.jpg', '.jpeg', '.png', '.bmp', '.tif', '.tiff'}
IGNORE_FILES = {'.ds_store', 'thumbs.db', 'desktop.ini'}

# Class mapping (MUST match training)
CLASS_MAP = {
    'sensitive': 0,
    'poor': 1
}
CLASS_NAMES = ['Sensitive', 'Poor']

test_images = []
test_labels = []
test_originals = []   # For visualization
test_filenames = []

for class_name, label in CLASS_MAP.items():
    class_dir = os.path.join(TEST_DIR, class_name)

    if not os.path.isdir(class_dir):
        print(f"⚠️  Test class directory '{class_dir}' not found — skipping.")
        continue

    image_files = sorted([
        f for f in os.listdir(class_dir)
        if os.path.splitext(f)[1].lower() in VALID_EXTENSIONS
        and not f.startswith('.')
        and f.lower() not in IGNORE_FILES
    ])

    print(f"Loading test class '{class_name}' (label={label}): {len(image_files)} images")

    for fname in image_files:
        img_path = os.path.join(class_dir, fname)
        processed, original = preprocess_test_image(img_path)

        if processed is not None:
            test_images.append(processed)
            test_labels.append(label)
            test_originals.append(original)
            test_filenames.append(fname)

# Validate that we have test images
if len(test_images) == 0:
    raise ValueError(
        "No test images were loaded!\n"
        "Check that your test/ directory contains valid images "
        "in 'sensitive/' and/or 'poor/' subdirectories."
    )

X_test = np.array(test_images)
y_test = np.array(test_labels)

print(f"\nTotal test images: {len(X_test)}")
print(f"  Sensitive (0): {np.sum(y_test == 0)}")
print(f"  Poor (1):      {np.sum(y_test == 1)}")
print(f"  Shape: {X_test.shape}")

# Clinical warning
print("\n⚠️  CLINICAL REMINDER: Ensure ALL test images are cutaneous")
print("   leishmaniasis ulcer wounds ONLY (no burns, diabetic ulcers, etc.).")

## 5. Generate Predictions
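
The prediction cell in this section turns the model's single probability output into a class label and a confidence score. The minimal sketch below shows the same arithmetic on made-up probabilities (they are not model outputs), assuming the output is the probability of Class 1 (Poor).

```python
import numpy as np

# Hypothetical probabilities of class 1 (Poor), for illustration only
probs = np.array([0.10, 0.45, 0.50, 0.93])

# A probability at or above the threshold is labelled class 1
threshold = 0.5
labels = (probs >= threshold).astype(int)

# Confidence is the probability assigned to whichever class was predicted
conf = np.where(labels == 1, probs, 1.0 - probs)

for p, y, c in zip(probs, labels, conf):
    print(f"p(poor)={p:.2f} -> class {y}, confidence {c:.2f}")
```

With a 0.5 threshold, confidence defined this way never drops below 50%, so values close to 50% indicate a borderline prediction.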

In [None]:
# ============================================================
# PREDICT ON TEST SET
# ============================================================

# Model output: probability of class 1 (poor-response)
y_pred_probs = model.predict(X_test, verbose=1).flatten()

# Binary predictions (threshold = 0.5)
y_pred = (y_pred_probs >= 0.5).astype(int)

# Confidence = model's certainty in its prediction
#   class 1 predicted → confidence = probability
#   class 0 predicted → confidence = 1 - probability
confidences = np.where(y_pred == 1, y_pred_probs, 1.0 - y_pred_probs)

print(f"\nPredictions generated for {len(y_pred)} test images.")
print(f"  Predicted Sensitive: {np.sum(y_pred == 0)}")
print(f"  Predicted Poor:      {np.sum(y_pred == 1)}")

## 6. Evaluation Metrics

In [None]:
# ============================================================
# TEST ACCURACY
# ============================================================

test_accuracy = accuracy_score(y_test, y_pred)

print("=" * 55)
print("   CUTANEOUS LEISHMANIASIS ULCER CLASSIFICATION")
print("              TEST SET EVALUATION")
print("=" * 55)
print(f"")
print(f"  Test Accuracy:      {test_accuracy:.4f} ({test_accuracy*100:.1f}%)")
print(f"  Total test images:  {len(y_test)}")
print(f"  Correct:            {int(np.sum(y_pred == y_test))}")
print(f"  Incorrect:          {int(np.sum(y_pred != y_test))}")

In [None]:
# ============================================================
# PRECISION, RECALL, F1-SCORE
# ============================================================

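# Check if both classes are present in test labels and predictions
has_both_true = len(np.unique(y_test)) == 2
has_both_pred = len(np.unique(y_pred)) == 2

if not has_both_true:
    print("⚠️  WARNING: Test set contains only ONE class.")
    print("   Precision/Recall/F1 may not be meaningful.")

if not has_both_pred:
    print("⚠️  WARNING: The model predicted only ONE class for all test images.")
    print("   Precision for the never-predicted class is undefined and is reported as 0.")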

precision = precision_score(y_test, y_pred, zero_division=0)
recall = recall_score(y_test, y_pred, zero_division=0)
f1 = f1_score(y_test, y_pred, zero_division=0)

print(f"\n{'='*55}")
print(f"           CLASSIFICATION METRICS")
print(f"{'='*55}")
print(f"")
print(f"  Precision (Poor-response): {precision:.4f}")
print(f"  Recall (Poor-response):    {recall:.4f}")
print(f"  F1-Score (Poor-response):  {f1:.4f}")

# Full report
print(f"\n{'-'*55}")
print(f"  Detailed Classification Report:")
print(f"{'-'*55}")

# Determine which labels are present for the report
present_labels = sorted(set(y_test.tolist() + y_pred.tolist()))
present_names = [CLASS_NAMES[i] for i in present_labels]

print(classification_report(
    y_test, y_pred,
    labels=present_labels,
    target_names=present_names,
    zero_division=0
))

In [None]:
# ============================================================
# CONFUSION MATRIX
# ============================================================

cm = confusion_matrix(y_test, y_pred, labels=[0, 1])

fig, ax = plt.subplots(figsize=(7, 6))
disp = ConfusionMatrixDisplay(
    confusion_matrix=cm,
    display_labels=CLASS_NAMES
)
disp.plot(ax=ax, cmap='Blues', values_format='d')
ax.set_title('Confusion Matrix — CL Ulcer Classification\n(Test Set)',
             fontsize=13, fontweight='bold')
ax.set_xlabel('Predicted Label', fontsize=11)
ax.set_ylabel('True Label', fontsize=11)
plt.tight_layout()
plt.show()

print(f"Confusion Matrix Breakdown:")
print(f"  True Sensitive → Predicted Sensitive: {cm[0][0]}")
print(f"  True Sensitive → Predicted Poor:      {cm[0][1]}")
print(f"  True Poor      → Predicted Sensitive: {cm[1][0]}")
print(f"  True Poor      → Predicted Poor:      {cm[1][1]}")

## 7. Visual Predictions

Display 5 test CL ulcer images with true labels, predicted labels, and confidence scores.

In [None]:
# ============================================================
# DISPLAY 5 TEST IMAGES WITH PREDICTIONS
# ============================================================

num_display = min(5, len(test_originals))

if num_display == 0:
    print("⚠️  No test images to display.")
else:
    # Select a mix of correct and incorrect predictions
    correct_idx = np.where(y_pred == y_test)[0]
    incorrect_idx = np.where(y_pred != y_test)[0]

    display_indices = []

    if len(incorrect_idx) > 0 and len(correct_idx) > 0:
        # Mix: aim for up to 2 incorrect + rest correct; if there are too few
        # correct predictions, top up with additional incorrect ones
        n_cor = min(num_display - min(2, len(incorrect_idx)), len(correct_idx))
        n_inc = min(num_display - n_cor, len(incorrect_idx))
        display_indices += list(np.random.choice(incorrect_idx, n_inc, replace=False))
        display_indices += list(np.random.choice(correct_idx, n_cor, replace=False))
    elif len(correct_idx) > 0:
        n = min(num_display, len(correct_idx))
        display_indices = list(np.random.choice(correct_idx, n, replace=False))
    elif len(incorrect_idx) > 0:
        n = min(num_display, len(incorrect_idx))
        display_indices = list(np.random.choice(incorrect_idx, n, replace=False))
    else:
        display_indices = list(range(num_display))

    # Trim to exactly num_display
    display_indices = display_indices[:num_display]
    actual_display = len(display_indices)

    fig, axes = plt.subplots(1, actual_display, figsize=(4 * actual_display, 5))

    # Handle single-image case (matplotlib returns a single axes, not array)
    if actual_display == 1:
        axes = [axes]

    for idx, ax in zip(display_indices, axes):
        ax.imshow(test_originals[idx])

        true_label = CLASS_NAMES[y_test[idx]]
        pred_label = CLASS_NAMES[y_pred[idx]]
        conf = confidences[idx]
        is_correct = (y_pred[idx] == y_test[idx])

        color = 'green' if is_correct else 'red'
        symbol = '✓' if is_correct else '✗'

        ax.set_title(
            f"True: {true_label}\n"
            f"Pred: {pred_label} {symbol}\n"
            f"Confidence: {conf:.1%}",
            fontsize=10, color=color, fontweight='bold'
        )
        ax.axis('off')

    plt.suptitle('CL Ulcer Test Predictions — Sample Images',
                 fontsize=14, fontweight='bold', y=1.02)
    plt.tight_layout()
    plt.show()

## 8. Final Summary

In [None]:
# ============================================================
# FINAL SUMMARY
# ============================================================

print("=" * 60)
print("  CUTANEOUS LEISHMANIASIS ULCER CLASSIFICATION — SUMMARY")
print("=" * 60)
print(f"")
print(f"  Model:              MobileNetV2 (Transfer Learning)")
print(f"  Task:               Binary Classification")
print(f"  Classes:            Sensitive (healing) vs Poor (poor-response)")
print(f"")
print(f"  Test Images:        {len(y_test)}")
print(f"    Sensitive:        {int(np.sum(y_test == 0))}")
print(f"    Poor:             {int(np.sum(y_test == 1))}")
print(f"")
print(f"  Test Accuracy:      {test_accuracy:.4f} ({test_accuracy*100:.1f}%)")
print(f"  Precision (Poor):   {precision:.4f}")
print(f"  Recall (Poor):      {recall:.4f}")
print(f"  F1-Score (Poor):    {f1:.4f}")
print(f"")
print(f"  Correct:            {int(np.sum(y_pred == y_test))}/{len(y_test)}")
print(f"  Incorrect:          {int(np.sum(y_pred != y_test))}/{len(y_test)}")
print(f"")
print("=" * 60)
print("  PIPELINE COMPLETE")
print("=" * 60)
print("")
print("Clinical Notes:")
print("  • Accuracy around 60% is not unusual for models trained on small CL datasets.")
print("  • Performance improves with more labeled CL ulcer training data.")
print("  • Always validate findings with a dermatologist.")
print("  • This model is for RESEARCH PURPOSES ONLY — not for diagnosis.")