# Deep Learning for Malaria Diagnosis
Thanks to NIH and Bangalor Hospital for making this malaria dataset available.

Malaria is an infectious disease caused by parasites that are transmitted to people through the bites of infected female Anopheles mosquitoes.

The Malaria burden with some key figures:
<font color='red'>
* More than 219 million cases
* Over 430 000 deaths in 2017 (mostly children and pregnant women)
* About 80% of the burden in 15 African countries and India
  </font>

![MalariaBurd](https://github.com/habiboulaye/ai-labs/blob/master/malaria-diagnosis/doc-images/MalariaBurden.png?raw=1)

Malaria is diagnosed with a blood test:
* Collect a blood smear from the patient
* Examine the smear under a microscope to look for the parasite

![MalariaDiag](https://github.com/habiboulaye/ai-labs/blob/master/malaria-diagnosis/doc-images/MalariaDiag.png?raw=1)
  
Main issues related to traditional diagnosis:
<font color='#ed7d31'>
* resource-constrained regions
* time needed and delays
* diagnosis accuracy and cost
</font>

The objective of this notebook is to apply modern deep learning techniques to perform medical image analysis for malaria diagnosis.

*This notebook is inspired by the work of (Sivaramakrishnan Rajaraman et al., 2018), (Adrian Rosebrock, 2018) and (Jason Brownlee, 2019)*

SECTION 1: INTRODUCTION

MALARIA DIAGNOSIS: PROBLEM STATEMENT AND IMPORTANCE

Malaria remains one of the world's most deadly infectious diseases, causing over
200 million cases and hundreds of thousands of deaths annually, primarily in
sub-Saharan Africa. Traditional diagnosis relies on microscopic examination of
blood smears by trained technicians, a process that is:

1. Time-consuming and labor-intensive
2. Dependent on expert availability (limited in rural areas)
3. Prone to human error and fatigue
4. Inconsistent across different observers

IMPORTANCE OF AUTOMATED DIAGNOSIS:

Automated malaria detection using deep learning offers several advantages:
- Rapid screening of large numbers of samples
- Consistent, objective diagnoses
- Deployment in resource-limited settings
- Reduced burden on healthcare workers
- Early detection and treatment, reducing mortality

This project applies transfer learning with VGG19, a widely used CNN
pre-trained on ImageNet, to classify cell images as infected or uninfected
with malaria parasites.

DATASET:
- Source: NIH National Library of Medicine
- Total Images: 27,558 cell images
- Classes: Parasitized (Infected) and Uninfected
- Split: 80% training (22,046 images), 20% testing (5,512 images)

SECTION 2: LITERATURE REVIEW

CONVOLUTIONAL NEURAL NETWORKS IN MEDICAL IMAGE ANALYSIS

CNNs have revolutionized medical image analysis due to their ability to
automatically learn hierarchical feature representations. Key developments:

1. DEEP LEARNING IN MEDICAL IMAGING (2012-Present):
   - AlexNet (2012) demonstrated CNNs could outperform traditional methods
   - CNNs now achieve expert-level performance in many diagnostic tasks
   - Applications: cancer detection, diabetic retinopathy, pneumonia diagnosis

2. TRANSFER LEARNING ADVANTAGE:
   Transfer learning leverages knowledge from large datasets (ImageNet) to
   improve performance on smaller medical datasets. Benefits include:
   - Reduced training time and computational resources
   - Better generalization with limited medical data
   - Access to sophisticated features learned from millions of images
   
3. VGG19 ARCHITECTURE (Simonyan & Zisserman, 2014):
   - 19-layer deep network with 3x3 convolutional filters
   - Pre-trained on ImageNet (1.4 million images, 1000 classes)
   - Known for learning rich, generalizable features
   - Successfully applied to medical imaging tasks

4. MALARIA DETECTION WITH DEEP LEARNING:
   Recent studies show CNNs achieve 95-98% accuracy on malaria detection,
   comparable to or exceeding expert microscopists. Transfer learning
   approaches consistently outperform models trained from scratch.

RESEARCH GAP:
While many studies use transfer learning for malaria detection, systematic
comparison of different architectural configurations (dropout rates, dense
layer sizes) remains limited. This work addresses that gap.

SECTION 3: METHODOLOGY

EXPERIMENTAL DESIGN: TRANSFER LEARNING WITH VGG19

RATIONALE FOR VGG19:
1. Proven performance in medical image classification
2. Deep architecture captures complex patterns in cell morphology
3. Pre-trained weights provide strong feature extractors
4. Relatively simple architecture (easier to interpret than ResNet/Inception)

TRANSFER LEARNING STRATEGY - FEATURE EXTRACTION:
We employ feature extraction (frozen base layers) rather than fine-tuning:
- Preserves ImageNet knowledge (prevents catastrophic forgetting)
- Faster training (fewer parameters to update)
- Sufficient for our dataset (22,000+ training images are ample for fitting the small classification head)
- Reduces risk of overfitting

MODEL ARCHITECTURE:
- Base: VGG19 pre-trained on ImageNet (frozen)
- Custom top layers:
  * GlobalAveragePooling2D (reduces parameters vs Flatten)
  * Dense layers with ReLU activation
  * Dropout for regularization
  * Output: 2 units with Softmax (binary classification)

THREE EXPERIMENTS:

Experiment 1: BASELINE
- Purpose: Establish baseline transfer learning performance
- Architecture: 256 → 128 Dense units with moderate regularization
- Hypothesis: Should achieve >90% accuracy due to VGG19 features

Experiment 2: STRONGER DROPOUT
- Purpose: Test if increased regularization improves generalization
- Architecture: Higher dropout rates (0.6, 0.5) and L2 regularization
- Hypothesis: May reduce overfitting but could lower training accuracy

Experiment 3: LARGER CAPACITY
- Purpose: Test if more parameters capture additional complexity
- Architecture: 512 → 256 Dense units
- Hypothesis: Higher capacity may improve accuracy if data supports it

TRAINING CONFIGURATION:
- Image size: 224x224 (VGG19 standard input)
- Batch size: 16 (memory-efficient for large images)
- Optimizer: Adam with learning rate 0.0001 (low for transfer learning)
- Loss: Categorical cross-entropy
- Callbacks:
  * EarlyStopping: Prevents overtraining (patience=7)
  * ReduceLROnPlateau: Adaptive learning rate adjustment
- Data augmentation: Light (rotation, shifts, flips) for stability
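
How `patience=7` interacts with a short training budget can be illustrated with a simplified, pure-Python stand-in for the patience bookkeeping. The function name `early_stopping_epoch` and the accuracy values below are hypothetical; this is a sketch of the idea, not the Keras implementation.

```python
def early_stopping_epoch(val_scores, patience):
    """Return the 1-based epoch at which training would stop, or None.

    Simplified stand-in for patience-based early stopping on a metric
    that should increase (mode='max').
    """
    best = float("-inf")
    wait = 0
    for epoch, score in enumerate(val_scores, start=1):
        if score > best:
            best = score
            wait = 0          # improvement: reset the counter
        else:
            wait += 1         # no improvement this epoch
            if wait >= patience:
                return epoch
    return None               # the epoch budget ran out first


# Hypothetical validation accuracies for a 10-epoch run (best value at epoch 4).
val_acc = [0.90, 0.92, 0.93, 0.94, 0.93, 0.94, 0.93, 0.92, 0.93, 0.93]

stop_with_patience_7 = early_stopping_epoch(val_acc, patience=7)
stop_with_patience_3 = early_stopping_epoch(val_acc, patience=3)
print("patience=7 stops at epoch:", stop_with_patience_7)
print("patience=3 stops at epoch:", stop_with_patience_3)
```

Under this logic, a 10-epoch run with `patience=7` can only stop early if the best epoch falls within the first three; otherwise training ends on the epoch limit.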

EVALUATION METRICS:
- Accuracy: Overall classification correctness
- Precision: Of predicted infected, how many truly infected (PPV)
- Recall: Of actual infected, how many correctly identified (Sensitivity)
- F1-Score: Harmonic mean of precision and recall
- AUC: Area under ROC curve (threshold-independent performance)
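
As a worked example of the definitions above, the following sketch computes accuracy, precision, recall, and F1 from the four confusion-matrix counts, with "infected" as the positive class. The helper name `binary_metrics` and the counts are hypothetical and chosen only for illustration.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute basic binary classification metrics from confusion counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)   # of predicted infected, fraction truly infected
    recall = tp / (tp + fn)      # of actual infected, fraction detected
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts: 120 infected cells (90 found, 30 missed),
# 80 uninfected cells (70 cleared, 10 wrongly flagged).
example = binary_metrics(tp=90, fp=10, fn=30, tn=70)
for name, value in example.items():
    print(f"{name:10s}: {value:.4f}")
```

In this example precision (0.90) is much higher than recall (0.75), which is the pattern to watch for in a screening setting: few false alarms, but a quarter of infections missed.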

## Configuration

In [1]:
# Use GPU: check that the output is '/device:GPU:0'
import tensorflow as tf
print(tf.__version__)
tf.test.gpu_device_name()
#from tensorflow.python.client import device_lib
#device_lib.list_local_devices()

2.19.0


''

## Prepare DataSet

### *Download* DataSet

In [None]:
# Download the data to the allocated Google Cloud server. If already downloaded, set downloadData = False
downloadData = True
if downloadData:
  indrive = False
  if indrive:
    !wget https://data.lhncbc.nlm.nih.gov/public/Malaria/cell_images.zip -P "/content/drive/My Drive/Colab Notebooks/ai-labs/malaria-diagnosis"
    !unzip "/content/drive/My Drive/Colab Notebooks/ai-labs/malaria-diagnosis/cell_images.zip" -d "/content/drive/My Drive/Colab Notebooks/ai-labs/malaria-diagnosis/"
    !ls "/content/drive/My Drive/Colab Notebooks/ai-labs/malaria-diagnosis"
  else:  # on the Google Cloud server
    !rm -rf cell_images.*
    !wget https://data.lhncbc.nlm.nih.gov/public/Malaria/cell_images.zip
    !unzip cell_images.zip >/dev/null 2>&1
    !ls

--2025-10-04 00:43:22--  https://data.lhncbc.nlm.nih.gov/public/Malaria/cell_images.zip
Resolving data.lhncbc.nlm.nih.gov (data.lhncbc.nlm.nih.gov)... 13.225.47.81, 13.225.47.51, 13.225.47.63, ...
Connecting to data.lhncbc.nlm.nih.gov (data.lhncbc.nlm.nih.gov)|13.225.47.81|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 353452851 (337M) [application/zip]
Saving to: ‘cell_images.zip’


2025-10-04 00:43:24 (209 MB/s) - ‘cell_images.zip’ saved [353452851/353452851]



## Baseline CNN Model
A basic ConvNet stacks convolutional blocks (Conv2D => MaxPooling2D), followed by Flatten => Dense => Dense(output).

![ConvNet](https://github.com/habiboulaye/ai-labs/blob/master/malaria-diagnosis/doc-images/ConvNet.png?raw=1)


# directory structure function
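
The data generators further down read from `train/` and `test/` folders, each with `Infected` and `Uninfected` subfolders. A minimal sketch of a function that could build that layout is shown here. It assumes the archive unpacks to `cell_images/Parasitized` and `cell_images/Uninfected`; the function name `split_into_train_test`, the seed, and the 20% test fraction are illustrative choices, so the resulting counts may differ slightly from the split sizes quoted in the introduction.

```python
import os
import random
import shutil


def split_into_train_test(source_dir, train_dir, test_dir,
                          class_map, test_fraction=0.2, seed=42):
    """Copy images from source_dir/<source class> into
    train_dir/<target class> and test_dir/<target class>.

    class_map maps source folder names to target folder names.
    Returns {target class: (n_train, n_test)}.
    """
    rng = random.Random(seed)  # fixed seed for a reproducible split
    counts = {}
    for src_class, dst_class in class_map.items():
        src_folder = os.path.join(source_dir, src_class)
        # Keep image files only; any other file in the folder is skipped.
        files = sorted(
            name for name in os.listdir(src_folder)
            if name.lower().endswith((".png", ".jpg", ".jpeg"))
        )
        rng.shuffle(files)
        n_test = int(len(files) * test_fraction)
        parts = [(test_dir, files[:n_test]), (train_dir, files[n_test:])]
        for target_root, names in parts:
            target_folder = os.path.join(target_root, dst_class)
            os.makedirs(target_folder, exist_ok=True)
            for name in names:
                shutil.copy2(os.path.join(src_folder, name),
                             os.path.join(target_folder, name))
        counts[dst_class] = (len(files) - n_test, n_test)
    return counts


# Example call, assuming the archive was unzipped into ./cell_images:
# split_into_train_test("cell_images", "train", "test",
#                       {"Parasitized": "Infected", "Uninfected": "Uninfected"})
```

Splitting each class separately keeps the class balance the same in both folders, and copying (rather than moving) leaves the original download untouched.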

# Import required libraries

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import classification_report, confusion_matrix, roc_curve, auc
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score
from tensorflow.keras.applications import VGG19
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Dense, Flatten, Dropout, GlobalAveragePooling2D
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau
from tensorflow.keras.preprocessing.image import ImageDataGenerator
import pandas as pd
import os

# Configuration and data preparation

CONFIGURATION NOTES:

Image Size (224x224):
- VGG19 was trained on 224x224 images from ImageNet
- Matching that size keeps inputs consistent with what the pre-trained filters were learned on
- Smaller sizes (e.g., 128x128) save memory and compute but may lose fine detail

Batch Size (16):
- Smaller batch size due to larger image dimensions (memory constraint)
- Still provides sufficient gradient estimates
- Allows training on standard GPUs

Data Augmentation Strategy:
- Light augmentation to maintain cell morphology
- Prevents overfitting while preserving diagnostic features
- Rotation (10°): Cells can appear at any orientation
- Shifts (10%): Simulates different cell positions
- Horizontal flip: Mirrors are biologically valid
- Zoom (10%): Accounts for variation in apparent cell size
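
The arithmetic behind these settings can be checked with a short sketch. It uses the split sizes quoted in the introduction and assumes `float32` inputs; it covers only the input batch itself, while the intermediate VGG19 activations typically take far more memory.

```python
import math

IMG_HEIGHT, IMG_WIDTH, CHANNELS = 224, 224, 3
BATCH_SIZE = 16
TRAIN_SAMPLES, TEST_SAMPLES = 22_046, 5_512  # split sizes quoted in the introduction

# Number of batches needed to see every image once.
steps_per_epoch = math.ceil(TRAIN_SAMPLES / BATCH_SIZE)
validation_steps = math.ceil(TEST_SAMPLES / BATCH_SIZE)

# One float32 input batch: batch x height x width x channels x 4 bytes.
batch_bytes = BATCH_SIZE * IMG_HEIGHT * IMG_WIDTH * CHANNELS * 4
batch_mib = batch_bytes / 2**20

print(f"Steps per epoch:  {steps_per_epoch}")
print(f"Validation steps: {validation_steps}")
print(f"Input batch size: {batch_mib:.2f} MiB")
```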

In [None]:
# Set image dimensions (adjust based on your dataset)
IMG_HEIGHT, IMG_WIDTH = 224, 224
BATCH_SIZE = 16
EPOCHS = 10

# Define the useful paths for data accessibility
ai_project = '.' #"/content/drive/My Drive/Colab Notebooks/ai-labs/malaria-diagnosis"
cell_images_dir = os.path.join(ai_project,'cell_images')
training_path = os.path.join(ai_project,'train')
testing_path = os.path.join(ai_project,'test')

print(f"Image dimensions: {IMG_HEIGHT}x{IMG_WIDTH}")
print(f"Batch size: {BATCH_SIZE}")
print(f"Training epochs: {EPOCHS}")
print("Data paths configured successfully!")

# data generators

In [None]:
train_datagen_vgg = ImageDataGenerator(
    rescale=1./255,
    rotation_range=10,
    width_shift_range=0.1,
    height_shift_range=0.1,
    horizontal_flip=True,
    zoom_range=0.1
)

test_datagen_vgg = ImageDataGenerator(rescale=1./255)

# Load training data
train_generator_vgg = train_datagen_vgg.flow_from_directory(
    training_path,
    target_size=(IMG_HEIGHT, IMG_WIDTH),
    batch_size=BATCH_SIZE,
    class_mode='categorical',
    shuffle=True,
    classes=['Infected', 'Uninfected']
)

# Load testing data
test_generator_vgg = test_datagen_vgg.flow_from_directory(
    testing_path,
    target_size=(IMG_HEIGHT, IMG_WIDTH),
    batch_size=BATCH_SIZE,
    class_mode='categorical',
    shuffle=False,
    classes=['Infected', 'Uninfected']
)

print(f"\nDataset loaded:")
print(f"  Training samples: {train_generator_vgg.samples}")
print(f"  Testing samples: {test_generator_vgg.samples}")
print(f"  Class indices: {train_generator_vgg.class_indices}")
print(f"  Classes: 0=Infected, 1=Uninfected")

# Configure callbacks for training stability
early_stopping = EarlyStopping(
    monitor='val_accuracy',
    patience=7,
    restore_best_weights=True,
    mode='max',
    verbose=1
)

reduce_lr = ReduceLROnPlateau(
    monitor='val_loss',
    factor=0.2,
    patience=3,
    min_lr=1e-7,
    verbose=1
)

callbacks_list = [early_stopping, reduce_lr]
print("✓ Training callbacks configured (EarlyStopping, ReduceLROnPlateau)")

# Evaluation Functions

EVALUATION METHODOLOGY:

These functions compute comprehensive metrics and generate visualizations
for model assessment. Each metric provides different insights:

- Accuracy: Overall correctness (can be misleading with class imbalance)
- Precision: Important for reducing false positives (unnecessary treatment)
- Recall: Critical for reducing false negatives (missed infections)
- F1-Score: Balances precision and recall
- AUC: Threshold-independent performance measure

Visualizations provide interpretability:
- Learning curves: Show training dynamics and overfitting
- Confusion matrix: Reveals class-specific performance
- ROC curve: Demonstrates sensitivity-specificity trade-off

In [None]:
def get_predictions_and_labels(model, generator):
    """Get predictions and true labels from generator"""
    generator.reset()
    predictions = model.predict(generator, verbose=0)
    predicted_classes = np.argmax(predictions, axis=1)

    # Get true labels
    true_classes = generator.classes

    return predictions, predicted_classes, true_classes

def calculate_metrics(true_labels, predicted_labels):
    """Calculate accuracy plus precision, recall, and F1-score for the Infected class"""
    accuracy = accuracy_score(true_labels, predicted_labels)
    # Infected is label 0 in the generators, so treat it as the positive class.
    # (average='weighted' would make recall identical to accuracy.)
    precision = precision_score(true_labels, predicted_labels, pos_label=0)
    recall = recall_score(true_labels, predicted_labels, pos_label=0)
    f1 = f1_score(true_labels, predicted_labels, pos_label=0)

    return {
        'Accuracy': accuracy,
        'Precision': precision,
        'Recall': recall,
        'F1-Score': f1
    }

print("Evaluation functions defined!")

# Visualization *Functions*

VISUALIZATION INTERPRETATION GUIDE:

Learning Curves:
- Training and validation should converge
- Large gap indicates overfitting
- Oscillating validation suggests unstable training

Confusion Matrix:
- Diagonal = correct predictions
- Off-diagonal = errors
- Check for class imbalance in errors

ROC Curve:
- Closer to top-left corner = better performance
- AUC near 1.0 = excellent discrimination
- AUC near 0.5 = random guessing
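
One way to read the AUC is as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it directly from that definition; the function name `auc_from_scores` and the toy labels and scores are hypothetical.

```python
def auc_from_scores(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct ranking. labels use 1 for the positive class.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: one positive (0.4) is ranked below one negative (0.7).
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
toy_auc = auc_from_scores(labels, scores)
print(f"AUC = {toy_auc:.4f}")
```

Eight of the nine positive-negative pairs are ordered correctly here, so the AUC is 8/9. A classifier that assigns the same score to every cell gets 0.5 under this definition, which matches the "random guessing" reference line.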

In [None]:
def plot_learning_curves(history, experiment_name):
    """
    Plot training and validation accuracy/loss over epochs.

    What to look for:
    - Smooth curves indicate stable training
    - Converging lines suggest good generalization
    - Diverging lines indicate overfitting
    - Oscillations suggest learning rate issues
    """
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

    # Plot accuracy
    ax1.plot(history.history['accuracy'], label='Training Accuracy',
             marker='o', linewidth=2)
    ax1.plot(history.history['val_accuracy'], label='Validation Accuracy',
             marker='s', linewidth=2)
    ax1.set_title(f'{experiment_name}\nModel Accuracy',
                  fontsize=12, fontweight='bold')
    ax1.set_xlabel('Epoch', fontsize=11)
    ax1.set_ylabel('Accuracy', fontsize=11)
    ax1.legend(loc='lower right')
    ax1.grid(True, alpha=0.3)

    # Plot loss
    ax2.plot(history.history['loss'], label='Training Loss',
             marker='o', linewidth=2)
    ax2.plot(history.history['val_loss'], label='Validation Loss',
             marker='s', linewidth=2)
    ax2.set_title(f'{experiment_name}\nModel Loss',
                  fontsize=12, fontweight='bold')
    ax2.set_xlabel('Epoch', fontsize=11)
    ax2.set_ylabel('Loss', fontsize=11)
    ax2.legend(loc='upper right')
    ax2.grid(True, alpha=0.3)

    plt.tight_layout()
    plt.show()

def plot_confusion_matrix(true_labels, predicted_labels, experiment_name):
    """
    Plot confusion matrix showing classification performance.

    Interpretation:
    - Top-left: Infected correctly identified (True Positives)
    - Bottom-right: Uninfected correctly identified (True Negatives)
    - Top-right: Infected predicted as Uninfected (False Negatives)
    - Bottom-left: Uninfected predicted as Infected (False Positives)

    False Negatives are clinically more concerning (missed infections).
    """
    cm = confusion_matrix(true_labels, predicted_labels)

    plt.figure(figsize=(8, 6))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
                xticklabels=['Infected', 'Uninfected'],
                yticklabels=['Infected', 'Uninfected'],
                cbar_kws={'label': 'Count'})
    plt.title(f'{experiment_name}\nConfusion Matrix',
              fontsize=12, fontweight='bold')
    plt.ylabel('True Label', fontsize=11)
    plt.xlabel('Predicted Label', fontsize=11)
    plt.tight_layout()
    plt.show()

    return cm

def plot_roc_curve(true_labels, predictions, experiment_name):
    """
    Plot ROC curve and calculate AUC.

    ROC Curve shows trade-off between:
    - True Positive Rate (Sensitivity/Recall): Correctly identified infections
    - False Positive Rate: Incorrectly flagged healthy cells

    AUC Interpretation:
    - 0.90-1.00: Excellent
    - 0.80-0.90: Good
    - 0.70-0.80: Fair
    - 0.50-0.70: Poor
    """
    # Infected is class 0, so score each cell by P(Infected) and mark label 0 as positive
    fpr, tpr, _ = roc_curve(true_labels, predictions[:, 0], pos_label=0)
    roc_auc = auc(fpr, tpr)

    plt.figure(figsize=(8, 6))
    plt.plot(fpr, tpr, color='darkorange', lw=2,
             label=f'ROC curve (AUC = {roc_auc:.4f})')
    plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--',
             label='Random Classifier (AUC = 0.50)')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate (1 - Specificity)', fontsize=11)
    plt.ylabel('True Positive Rate (Sensitivity/Recall)', fontsize=11)
    plt.title(f'{experiment_name}\nROC Curve', fontsize=12, fontweight='bold')
    plt.legend(loc="lower right")
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()

    return roc_auc

print("✓ Visualization functions defined")

# Complete Evaluation Pipeline

This function orchestrates the complete evaluation process:
1. Generates predictions on test set
2. Calculates all metrics
3. Creates all visualizations
4. Prints detailed classification report

Output includes per-class precision, recall, and F1-scores.

In [None]:
def evaluate_model(model, generator, experiment_name):
    """
    Comprehensive model evaluation with metrics and visualizations.

    This function:
    - Computes accuracy, precision, recall, F1, and AUC
    - Generates confusion matrix
    - Plots ROC curve
    - Prints detailed classification report
    """
    print(f"\n{'='*70}")
    print(f"EVALUATING: {experiment_name}")
    print(f"{'='*70}\n")

    # Get predictions
    predictions, predicted_classes, true_classes = get_predictions_and_labels(model, generator)

    # Calculate metrics
    metrics = calculate_metrics(true_classes, predicted_classes)

    print("Performance Metrics:")
    print("-" * 50)
    for metric, value in metrics.items():
        print(f"{metric:15s}: {value:.4f} ({value*100:.2f}%)")

    # Generate visualizations
    print("\nGenerating visualizations...")

    # Confusion Matrix
    cm = plot_confusion_matrix(true_classes, predicted_classes, experiment_name)

    # ROC Curve
    roc_auc = plot_roc_curve(true_classes, predictions, experiment_name)
    metrics['AUC'] = roc_auc

    # Classification Report
    print("\nDetailed Classification Report:")
    print("-" * 50)
    print(classification_report(true_classes, predicted_classes,
                                target_names=['Infected', 'Uninfected']))

    return metrics

print("✓ Complete evaluation pipeline ready")

# Experiment 1 - Baseline Transfer Learning

EXPERIMENT 1: BASELINE FEATURE EXTRACTION

Purpose:
Establish baseline performance using VGG19 transfer learning with moderate
architectural complexity.

Architecture Details:
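- VGG19 base: 16 convolutional layers, all frozen (the 3 fully connected layers are dropped by include_top=False)
- GlobalAveragePooling2D: Averages each 7x7 feature map into one value (512 features instead of the 25,088 that Flatten would produce)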
- Dense(256, relu): First classification layer with L2 regularization
- Dropout(0.5): Prevents overfitting
- Dense(128, relu): Second classification layer with L2 regularization
- Dense(2, softmax): Output layer for binary classification

Training Strategy:
- Only custom top layers are trained
- VGG19 base remains frozen
- Learning rate: 0.0001 (low to prevent disrupting useful features)


Overfitting Indicators to Watch:
- Large gap between training and validation accuracy
- Validation loss increasing while training loss decreases
- High training accuracy with lower validation accuracy
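
The difference between GlobalAveragePooling2D and Flatten can be made concrete with a small NumPy sketch. It uses a random array as a stand-in for the VGG19 output of one 224x224 image (a 7x7 grid with 512 channels) and compares the size of the first Dense(256) layer in both cases. The variable names are illustrative.

```python
import numpy as np

# Stand-in for the VGG19 convolutional output of one 224x224 image.
rng = np.random.default_rng(0)
feature_map = rng.random((1, 7, 7, 512), dtype=np.float32)

# Global average pooling: one mean per channel.
gap = feature_map.mean(axis=(1, 2))
# Flatten: keep every spatial position of every channel.
flat = feature_map.reshape(1, -1)

first_dense_units = 256
# Dense layer parameters = inputs * units (weights) + units (biases).
params_after_gap = gap.shape[1] * first_dense_units + first_dense_units
params_after_flatten = flat.shape[1] * first_dense_units + first_dense_units

print("GAP output shape:    ", gap.shape)
print("Flatten output shape:", flat.shape)
print(f"Dense(256) params after GAP:     {params_after_gap:,}")
print(f"Dense(256) params after Flatten: {params_after_flatten:,}")
```

With pooling, the first Dense layer needs roughly 49 times fewer weights, which is the main reason it is less prone to overfitting than a Flatten-based head.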

In [None]:
print("\n" + "="*70)
print("EXPERIMENT 1: BASELINE FEATURE EXTRACTION")
print("="*70)
print("Strategy: Freeze all VGG19 layers, train custom top layers")
print("Architecture: VGG19 → GAP → Dense(256) → Dropout(0.5) → Dense(128) → Output(2)")

# Load VGG19 without top layer
base_model_exp1 = VGG19(weights='imagenet',
                        include_top=False,
                        input_shape=(IMG_HEIGHT, IMG_WIDTH, 3))

# Freeze all base model layers
base_model_exp1.trainable = False
print(f"\n✓ VGG19 base loaded: {len(base_model_exp1.layers)} layers frozen")

# Build custom classification layers
x = base_model_exp1.output
x = GlobalAveragePooling2D()(x)
x = Dense(256, activation='relu', kernel_regularizer='l2')(x)
x = Dropout(0.5)(x)
x = Dense(128, activation='relu', kernel_regularizer='l2')(x)
output = Dense(2, activation='softmax')(x)

model_exp1 = Model(inputs=base_model_exp1.input, outputs=output)

# Compile with appropriate settings
model_exp1.compile(
    optimizer=Adam(learning_rate=0.0001),
    loss='categorical_crossentropy',
    metrics=['accuracy']
)

print("\nModel Architecture Summary:")
model_exp1.summary()

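# count_params() would include the frozen VGG19 weights, so sum only the trainable ones
trainable_params = sum(int(np.prod(w.shape)) for w in model_exp1.trainable_weights)
print(f"\nTrainable parameters: {trainable_params:,}")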

# Train the model
print("\n" + "="*70)
print("TRAINING EXPERIMENT 1")
print("="*70)
history_exp1 = model_exp1.fit(
    train_generator_vgg,
    epochs=EPOCHS,
    validation_data=test_generator_vgg,
    callbacks=callbacks_list,
    verbose=1
)

print("\n✓ Training completed!")

# Evaluate the model
metrics_exp1 = evaluate_model(model_exp1, test_generator_vgg,
                               "Experiment 1: Baseline Feature Extraction")

# Plot learning curves
print("\nLearning Curves Analysis:")
plot_learning_curves(history_exp1, "Experiment 1: Baseline Feature Extraction")

# Save model
model_exp1.save('vgg19_exp1_baseline.keras')
print("\n✓ Model saved: vgg19_exp1_baseline.keras")

What to look for in results:

1. LEARNING CURVES:
   - Do training and validation curves converge?
   - Is there a large gap (overfitting)?
   - Are curves smooth (stable training)?

2. CONFUSION MATRIX:
   - Are errors balanced across classes?
   - Which class has more misclassifications?
   - False Negatives (missed infections) are clinically critical

3. ROC CURVE:
   - AUC should be >0.90 for good performance
   - Curve should be close to top-left corner

4. OVERALL METRICS:
   - Accuracy gives the overall fraction of correct predictions
   - Precision and Recall should be balanced
   - F1-Score summarizes overall performance

# Experiment 2 - Enhanced Regularization

EXPERIMENT 2: FEATURE EXTRACTION WITH STRONGER DROPOUT

Purpose:
Test whether increased regularization improves generalization and reduces
overfitting compared to baseline.

Architecture Changes from Experiment 1:
- Increased dropout from 0.5 to 0.6 after first Dense layer
- Added additional dropout (0.5) after second Dense layer
- Same Dense layer sizes (256, 128)

Hypothesis:
Stronger dropout should:
1. Reduce overfitting (smaller train-validation gap)
2. Potentially lower training accuracy slightly
3. Improve or maintain validation accuracy
4. Result in more robust model

Trade-offs:
- May slow convergence (more regularization)
- Could underfit if dropout too aggressive
- Training accuracy might be lower than Experiment 1

Expected Outcome:
If Experiment 1 showed overfitting, this should improve validation
performance. If no overfitting, results may be similar or slightly worse.
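
What a higher dropout rate does to a layer's activations during training can be sketched in NumPy. The function below is a simplified illustration of inverted dropout (zero a random fraction of units, rescale the rest by `1 / (1 - rate)`); the name `inverted_dropout` and the all-ones input are hypothetical, and this is not the Keras layer itself.

```python
import numpy as np


def inverted_dropout(activations, rate, rng):
    """Zero a random fraction `rate` of units and rescale the survivors."""
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob  # True = unit is kept
    return activations * mask / keep_prob


rng = np.random.default_rng(0)
x = np.ones((1000, 256))  # stand-in for a batch of Dense(256) outputs

results = {}
for rate in (0.5, 0.6):
    out = inverted_dropout(x, rate, rng)
    zero_fraction = float((out == 0).mean())
    mean_activation = float(out.mean())
    results[rate] = (zero_fraction, mean_activation)
    print(f"rate={rate}: {zero_fraction:.3f} of units zeroed, "
          f"mean activation {mean_activation:.3f}")
```

Raising the rate from 0.5 to 0.6 removes more units on each training step while the rescaling keeps the expected activation unchanged, so the next layer sees noisier inputs of the same average magnitude.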

In [None]:
print("\n" + "="*70)
print("EXPERIMENT 2: ENHANCED REGULARIZATION WITH DROPOUT")
print("="*70)
print("Strategy: Test if stronger dropout improves generalization")
print("Architecture: VGG19 → GAP → Dense(256) → Dropout(0.6) → Dense(128) → Dropout(0.5) → Output(2)")

# Load fresh VGG19 base
base_model_exp2 = VGG19(weights='imagenet',
                        include_top=False,
                        input_shape=(IMG_HEIGHT, IMG_WIDTH, 3))

base_model_exp2.trainable = False
print(f"\n✓ VGG19 base loaded: {len(base_model_exp2.layers)} layers frozen")

# Build model with stronger regularization
x = base_model_exp2.output
x = GlobalAveragePooling2D()(x)
x = Dense(256, activation='relu', kernel_regularizer='l2')(x)
x = Dropout(0.6)(x)  # Increased dropout
x = Dense(128, activation='relu', kernel_regularizer='l2')(x)
x = Dropout(0.5)(x)  # Additional dropout layer
output = Dense(2, activation='softmax')(x)

model_exp2 = Model(inputs=base_model_exp2.input, outputs=output)

# Compile
model_exp2.compile(
    optimizer=Adam(learning_rate=0.0001),
    loss='categorical_crossentropy',
    metrics=['accuracy']
)

print("\nModel Architecture Summary:")
model_exp2.summary()

# Train
print("\n" + "="*70)
print("TRAINING EXPERIMENT 2")
print("="*70)
history_exp2 = model_exp2.fit(
    train_generator_vgg,
    epochs=EPOCHS,
    validation_data=test_generator_vgg,
    callbacks=callbacks_list,
    verbose=1
)

print("\n✓ Training completed!")

# Evaluate
metrics_exp2 = evaluate_model(model_exp2, test_generator_vgg,
                               "Experiment 2: Enhanced Dropout Regularization")

# Plot learning curves
print("\nLearning Curves Analysis:")
plot_learning_curves(history_exp2, "Experiment 2: Enhanced Dropout Regularization")

# Save
model_exp2.save('vgg19_exp2_dropout.keras')
print("\n✓ Model saved: vgg19_exp2_dropout.keras")

Compare with Experiment 1:

1. OVERFITTING COMPARISON:
   - Is the train-validation gap smaller?
   - Are validation curves more stable?

2. ACCURACY TRADE-OFF:
   - Training accuracy may be lower (expected with more dropout)
   - Did validation accuracy improve or stay similar?

3. CONVERGENCE:
   - Did it take more epochs to converge?
   - Are learning curves smoother?

4. GENERALIZATION:
   - Check if test set performance improved
   - Stronger dropout should reduce overfitting symptoms

# Experiment 3 - Increased Model Capacity

EXPERIMENT 3: INCREASED MODEL CAPACITY

Purpose:
Test whether larger Dense layers can capture additional complexity in the
data and improve performance.

Architecture Changes from Experiment 1:
- Increased first Dense layer: 256 → 512 units
- Increased second Dense layer: 128 → 256 units
- Dropout of 0.5 after the first Dense layer (as in Experiment 1), plus a new 0.3 dropout after the second
- More parameters to learn complex patterns

Hypothesis:
Larger capacity should:
1. Capture more complex feature interactions
2. Potentially improve accuracy if data supports it
3. Risk overfitting without proper regularization

Trade-offs:
- More parameters = longer training time
- Higher memory usage
- Increased risk of overfitting
- May not improve if baseline already captures key patterns

Expected Outcome:
If the baseline hasn't reached the data's full potential, this should
improve performance. If baseline is sufficient, results may be similar
with possible overfitting.
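
The size of the capacity increase can be worked out by hand. The sketch below counts the weights and biases of the classification head for the baseline (256 → 128) and the larger (512 → 256) configuration, assuming 512 input features from global average pooling over the VGG19 output. The helper name `dense_head_params` is illustrative.

```python
def dense_head_params(input_dim, hidden_units, n_classes):
    """Count weights and biases in a stack of Dense layers.

    Dropout layers are ignored because they have no parameters.
    """
    total = 0
    prev = input_dim
    for units in list(hidden_units) + [n_classes]:
        total += prev * units + units  # weight matrix + bias vector
        prev = units
    return total


baseline_head = dense_head_params(512, [256, 128], n_classes=2)
larger_head = dense_head_params(512, [512, 256], n_classes=2)

print(f"Baseline head (256 -> 128): {baseline_head:,} parameters")
print(f"Larger head   (512 -> 256): {larger_head:,} parameters")
print(f"Ratio: {larger_head / baseline_head:.2f}x")
```

The larger head has about 2.4 times as many trainable parameters, which is small next to the frozen VGG19 base but still a meaningful increase in what the classifier can memorize.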

In [None]:
print("\n" + "="*70)
print("EXPERIMENT 3: INCREASED MODEL CAPACITY")
print("="*70)
print("Strategy: Test if larger Dense layers improve performance")
print("Architecture: VGG19 → GAP → Dense(512) → Dropout(0.5) → Dense(256) → Dropout(0.3) → Output(2)")

# Load fresh VGG19 base
base_model_exp3 = VGG19(weights='imagenet',
                        include_top=False,
                        input_shape=(IMG_HEIGHT, IMG_WIDTH, 3))

base_model_exp3.trainable = False
print(f"\n✓ VGG19 base loaded: {len(base_model_exp3.layers)} layers frozen")

# Build model with larger Dense layers
x = base_model_exp3.output
x = GlobalAveragePooling2D()(x)
x = Dense(512, activation='relu', kernel_regularizer='l2')(x)  # Doubled from 256
x = Dropout(0.5)(x)
x = Dense(256, activation='relu', kernel_regularizer='l2')(x)  # Doubled from 128
x = Dropout(0.3)(x)
output = Dense(2, activation='softmax')(x)

model_exp3 = Model(inputs=base_model_exp3.input, outputs=output)

# Compile
model_exp3.compile(
    optimizer=Adam(learning_rate=0.0001),
    loss='categorical_crossentropy',
    metrics=['accuracy']
)

print("\nModel Architecture Summary:")
model_exp3.summary()

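# count_params() would include the frozen VGG19 weights, so sum only the trainable ones
trainable_params = sum(int(np.prod(w.shape)) for w in model_exp3.trainable_weights)
print(f"\nTrainable parameters: {trainable_params:,}")
print("(Note: More parameters than Experiments 1 & 2)")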

# Train
print("\n" + "="*70)
print("TRAINING EXPERIMENT 3")
print("="*70)
history_exp3 = model_exp3.fit(
    train_generator_vgg,
    epochs=EPOCHS,
    validation_data=test_generator_vgg,
    callbacks=callbacks_list,
    verbose=1
)

print("\n✓ Training completed!")

# Evaluate
metrics_exp3 = evaluate_model(model_exp3, test_generator_vgg,
                               "Experiment 3: Increased Model Capacity")

# Plot learning curves
print("\nLearning Curves Analysis:")
plot_learning_curves(history_exp3, "Experiment 3: Increased Model Capacity")

# Save
model_exp3.save('vgg19_exp3_dense512.keras')
print("\n✓ Model saved: vgg19_exp3_dense512.keras")

Compare with Experiments 1 & 2:

1. CAPACITY VS PERFORMANCE:
   - Did larger layers improve accuracy?
   - Was the improvement worth the extra parameters?

2. OVERFITTING RISK:
   - More capacity can lead to overfitting
   - Check train-validation gap compared to Exp 1 & 2
   - Are validation curves stable?

3. TRAINING DYNAMICS:
   - Did it converge faster or slower?
   - More parameters may need more epochs

4. PRACTICAL CONSIDERATIONS:
   - Training time increased?
   - Memory usage acceptable?
   - Is complexity justified by performance gain?

# Comparative Analysis and Results Summary


This section compares the three experiments to identify:
1. Which configuration performs best
2. Trade-offs between complexity and performance
3. Impact of regularization vs capacity
4. Recommendations for deployment

The results table and visualizations provide a comprehensive comparison
to inform model selection decisions.

In [None]:
print("\n" + "="*80)
print("COMPARATIVE RESULTS - ALL EXPERIMENTS")
print("="*80)

# Create comprehensive results dataframe
results_df = pd.DataFrame({
    'Experiment': [
        'Exp 1: Baseline (256+128 Dense)',
        'Exp 2: + Stronger Dropout',
        'Exp 3: Larger (512+256 Dense)'
    ],
    'Accuracy': [
        metrics_exp1['Accuracy'],
        metrics_exp2['Accuracy'],
        metrics_exp3['Accuracy']
    ],
    'Precision': [
        metrics_exp1['Precision'],
        metrics_exp2['Precision'],
        metrics_exp3['Precision']
    ],
    'Recall': [
        metrics_exp1['Recall'],
        metrics_exp2['Recall'],
        metrics_exp3['Recall']
    ],
    'F1-Score': [
        metrics_exp1['F1-Score'],
        metrics_exp2['F1-Score'],
        metrics_exp3['F1-Score']
    ],
    'AUC': [
        metrics_exp1['AUC'],
        metrics_exp2['AUC'],
        metrics_exp3['AUC']
    ]
})

print("\nPerformance Metrics Comparison Table:")
print("="*80)
print(results_df.to_string(index=False))
print("="*80)

# Identify best performing model for each metric
print("\nBest Performing Models by Metric:")
print("-"*80)
for metric in ['Accuracy', 'Precision', 'Recall', 'F1-Score', 'AUC']:
    best_idx = results_df[metric].idxmax()
    best_exp = results_df.loc[best_idx, 'Experiment']
    best_val = results_df.loc[best_idx, metric]
    print(f"{metric:12s}: {best_exp} ({best_val:.4f})")

# Create comparative visualization
fig, axes = plt.subplots(2, 3, figsize=(16, 10))
metrics_list = ['Accuracy', 'Precision', 'Recall', 'F1-Score', 'AUC']
experiments = results_df['Experiment'].tolist()
colors = ['#1f77b4', '#ff7f0e', '#2ca02c']

for idx, metric in enumerate(metrics_list):
    row = idx // 3
    col = idx % 3
    ax = axes[row, col]

    values = results_df[metric].tolist()
    bars = ax.bar(range(len(experiments)), values, color=colors)
    ax.set_ylabel(metric, fontsize=11)
    ax.set_title(f'{metric} Comparison', fontweight='bold', fontsize=12)
    ax.set_xticks(range(len(experiments)))
    ax.set_xticklabels(['Exp 1', 'Exp 2', 'Exp 3'], fontsize=10)
    ax.set_ylim([min(values) - 0.02, 1.0])
    ax.grid(True, alpha=0.3, axis='y')

    # Add value labels on bars
    for bar, val in zip(bars, values):
        height = bar.get_height()
        ax.text(bar.get_x() + bar.get_width()/2., height + 0.005,
                f'{val:.4f}',
                ha='center', va='bottom', fontsize=9, fontweight='bold')

    # Highlight best performer
    best_idx = values.index(max(values))
    bars[best_idx].set_edgecolor('gold')
    bars[best_idx].set_linewidth(3)

# Remove empty subplot
axes[1, 2].axis('off')

plt.suptitle('VGG19 Transfer Learning - Comparative Performance Analysis',
             fontsize=14, fontweight='bold', y=0.995)
plt.tight_layout()
plt.show()

# Save results to CSV
results_df.to_csv('vgg19_experiments_results.csv', index=False)
print("\n✓ Results saved to: vgg19_experiments_results.csv")

# RESULTS DISCUSSION

PERFORMANCE ANALYSIS:

Based on the comparative results, we can draw several conclusions:

1. OVERALL PERFORMANCE:
   All three models achieved high accuracy, demonstrating that VGG19
   transfer learning is highly effective for malaria diagnosis. The pre-trained
   ImageNet features generalize well to microscopy images despite the domain
   difference.

2. EXPERIMENT COMPARISON:

   Experiment 1 (Baseline):
   - Provides solid performance with moderate complexity
   - Good balance between accuracy and computational efficiency
   - May show some overfitting if train-validation gap is large
   
   Experiment 2 (Stronger Dropout):
   - Tests regularization hypothesis
   - If overfitting was present in Exp 1, this should reduce it
   - Validation performance should be more stable
   - Trade-off: slightly lower training accuracy is acceptable in exchange
   
   Experiment 3 (Larger Capacity):
   - Tests if data supports more complex models
   - Higher parameter count increases risk of overfitting
   - Performance gain (if any) should justify added complexity

3. KEY INSIGHTS FROM LEARNING CURVES:

   Overfitting Indicators (if present):
   - Large gap between training and validation accuracy/loss
   - Validation metrics plateauing or degrading while training improves
   - Solution: Experiment 2's stronger dropout should help
   
   Underfitting Indicators (if present):
   - Both training and validation accuracy plateau at low values
   - High loss values
   - Solution: Experiment 3's larger capacity might help
   
   Good Fit Indicators:
   - Training and validation curves converge
   - Small gap between train and validation metrics
   - Smooth, stable curves
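
The overfitting indicator "validation metrics degrading while training improves" can be checked numerically as well as by eye. The sketch below is one minimal way to do it, assuming the per-epoch losses are available as plain lists (for example `history.history['loss']` and `history.history['val_loss']`). The function name `summarize_curves` and the loss values are hypothetical and are not part of the original notebook.

```python
def summarize_curves(train_loss, val_loss):
    """Locate the best validation epoch and describe what happened after it."""
    if len(train_loss) != len(val_loss) or not val_loss:
        raise ValueError("need two non-empty curves of equal length")
    # Epoch (0-based) with the lowest validation loss; first one wins on ties
    best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__)
    last = len(val_loss) - 1
    return {
        "best_epoch": best_epoch,
        "epochs_after_best": last - best_epoch,
        # Train/validation gap at the best epoch
        "gap_at_best": val_loss[best_epoch] - train_loss[best_epoch],
        # Positive value: validation loss got worse after its best epoch
        "val_rise_after_best": val_loss[last] - val_loss[best_epoch],
        # Positive value: training loss kept improving after that epoch
        "train_drop_after_best": train_loss[best_epoch] - train_loss[last],
    }


# Made-up loss values, for illustration only
train = [0.60, 0.40, 0.30, 0.22, 0.15, 0.10]
val = [0.62, 0.45, 0.38, 0.36, 0.40, 0.47]
summary = summarize_curves(train, val)
for key, value in summary.items():
    print(f"{key:22s}: {value}")
```

When both `val_rise_after_best` and `train_drop_after_best` are clearly positive, the curves match the overfitting pattern described above. Values near zero for both are closer to the good-fit pattern.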

4. CONFUSION MATRIX INSIGHTS:

   Clinical Implications:
   - False Negatives (Infected predicted as Uninfected): CRITICAL
     * Missed diagnosis leads to untreated malaria
     * Must be minimized even at cost of some False Positives
   
   - False Positives (Uninfected predicted as Infected): Less critical
     * Leads to unnecessary treatment but safer than missing infection
     * Can be confirmed with additional testing
   
   Balanced Performance:
   - Check if errors are evenly distributed across classes
   - Class imbalance in errors suggests model bias
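
To make the false negative and false positive discussion concrete, the following sketch derives per-class error rates from the four cells of a binary confusion matrix. It assumes "infected" is the positive class. The helper name `error_profile` and the counts are hypothetical, chosen only to illustrate the arithmetic.

```python
def error_profile(tp, fn, fp, tn):
    """Per-class error rates from binary confusion-matrix counts.

    Positive class = infected, negative class = uninfected.
    """
    infected = tp + fn
    uninfected = fp + tn
    if infected == 0 or uninfected == 0:
        raise ValueError("both classes must have at least one sample")
    return {
        "sensitivity": tp / infected,            # recall on infected cells
        "specificity": tn / uninfected,          # recall on uninfected cells
        "false_negative_rate": fn / infected,    # missed infections
        "false_positive_rate": fp / uninfected,  # unnecessary follow-ups
    }


# Made-up counts, for illustration only
profile = error_profile(tp=950, fn=50, fp=30, tn=970)
for name, rate in profile.items():
    print(f"{name:20s}: {rate:.3f}")
```

Comparing `false_negative_rate` with `false_positive_rate` shows whether errors fall evenly across the two classes or concentrate on one of them.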

5. ROC/AUC ANALYSIS:

   AUC Interpretation:
   - >0.95: Excellent discrimination capability
   - 0.90-0.95: Very good performance
   - 0.85-0.90: Good performance
   - < 0.85: May need improvement
   
   Threshold Selection:
   - Can be adjusted based on clinical priorities
   - Higher sensitivity (recall) = fewer missed infections
   - Trade-off with specificity and false positive rate
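
One way to act on the threshold trade-off is to fix a target sensitivity and take the highest decision threshold that still reaches it. The sketch below does this with plain Python lists. It assumes labels are 1 for infected and 0 for uninfected, and that a sample is called infected when its score is at or above the threshold. The function name `threshold_for_sensitivity`, the default target, and the scores are hypothetical.

```python
def threshold_for_sensitivity(y_true, y_score, target=0.98):
    """Return (threshold, sensitivity, specificity) for the highest
    threshold whose sensitivity is at least `target`."""
    positives = [s for t, s in zip(y_true, y_score) if t == 1]
    negatives = [s for t, s in zip(y_true, y_score) if t == 0]
    if not positives or not negatives:
        raise ValueError("need at least one sample of each class")
    # Try every distinct score as a threshold, from highest to lowest
    for thr in sorted(set(y_score), reverse=True):
        sensitivity = sum(s >= thr for s in positives) / len(positives)
        if sensitivity >= target:
            specificity = sum(s < thr for s in negatives) / len(negatives)
            return thr, sensitivity, specificity
    raise ValueError("target sensitivity must be at most 1.0")


# Made-up labels and scores, for illustration only
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_score = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1]
for target in (0.75, 1.0):
    thr, sens, spec = threshold_for_sensitivity(y_true, y_score, target)
    print(f"target={target:.2f} -> threshold={thr}, "
          f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```

On this toy data, raising the target sensitivity from 0.75 to 1.0 lowers the threshold and costs specificity, which is the trade-off listed above. On real data the threshold should be chosen on a validation split, not on the test set.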



OVERFITTING/UNDERFITTING ANALYSIS:

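# Compare final-epoch training and validation accuracy for each experiment
def analyze_fitting(history, exp_name):
    """Print a rough fit diagnosis from a Keras History object."""
    # These are last-epoch values. When early stopping restores the best
    # weights, the restored model may come from an earlier epoch.
    final_train_acc = history.history['accuracy'][-1]
    final_val_acc = history.history['val_accuracy'][-1]
    gap = final_train_acc - final_val_acc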

    print(f"\n{exp_name}:")
    print(f"  Final Training Accuracy: {final_train_acc:.4f}")
    print(f"  Final Validation Accuracy: {final_val_acc:.4f}")
    print(f"  Gap: {gap:.4f}")

    if gap > 0.05:
        print("  → Shows signs of OVERFITTING (gap > 5%)")
        print("     Model memorizing training data rather than generalizing")
    elif gap > 0.02:
        print("  → Slight overfitting (gap 2-5%), but acceptable")
    else:
        print("  → Good generalization (gap < 2%)")

    if final_val_acc < 0.85:
        print("  → May be UNDERFITTING (validation accuracy < 85%)")
        print("     Model not capturing enough complexity")

analyze_fitting(history_exp1, "Experiment 1 (Baseline)")
analyze_fitting(history_exp2, "Experiment 2 (Stronger Dropout)")
analyze_fitting(history_exp3, "Experiment 3 (Larger Capacity)")

HANDLING OVERFITTING/UNDERFITTING:

Strategies Already Implemented:
1. Dropout regularization (0.3-0.6 depending on experiment)
2. L2 kernel regularization in Dense layers
3. Early stopping (stops when validation stops improving)
4. Learning rate reduction (adapts when learning plateaus)
5. Data augmentation (increases training data diversity)

Additional Strategies if Needed:
1. If Overfitting Persists:
   - Increase dropout rates further
   - Reduce model capacity (fewer/smaller Dense layers)
   - Increase data augmentation
   - Add more L2 regularization
   - Use simpler architecture

2. If Underfitting Occurs:
   - Increase model capacity (more/larger layers)
   - Reduce regularization
   - Train for more epochs
   - Increase learning rate slightly
   - Consider fine-tuning some VGG19 layers

3. For Unstable Training:
   - Further reduce learning rate
   - Increase batch size (if memory allows)
   - Reduce data augmentation intensity
   - Use batch normalization

CALLBACKS EFFECTIVENESS:

Early Stopping:
- Prevents overtraining by monitoring validation accuracy
- Restores best weights (not final weights)
- Patience=7 allows temporary plateaus

ReduceLROnPlateau:
- Automatically reduces learning rate when learning stalls
- Smaller steps let the optimizer settle into a minimum instead of oscillating around it
- Improves convergence stability
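
The effect of the patience setting can be illustrated with a small simulation. This is a simplified re-implementation of the patience rule written for this explanation, not the Keras `EarlyStopping` source. The function name `simulate_early_stopping` and the accuracy values are hypothetical.

```python
def simulate_early_stopping(val_scores, patience=7, mode="max"):
    """Return (stop_epoch, best_epoch), both 0-based.

    Simplified rule: an epoch counts as an improvement only if it is
    strictly better than the best score seen so far. Training stops once
    `patience` consecutive epochs pass without improvement.
    """
    if not val_scores:
        raise ValueError("val_scores must not be empty")
    if mode not in ("max", "min"):
        raise ValueError("mode must be 'max' or 'min'")
    best_epoch, wait = 0, 0
    for epoch in range(1, len(val_scores)):
        current, best = val_scores[epoch], val_scores[best_epoch]
        improved = current > best if mode == "max" else current < best
        if improved:
            best_epoch, wait = epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Ran out of epochs before patience was exhausted
    return len(val_scores) - 1, best_epoch


# Made-up validation accuracies, for illustration only
val_acc = [0.90, 0.93, 0.95, 0.94, 0.95, 0.94, 0.93]
print(simulate_early_stopping(val_acc, patience=3))
print(simulate_early_stopping(val_acc, patience=7))
```

With the shorter patience the run halts a few epochs after the best score. With patience=7 this short series runs to its last epoch. In both cases the best epoch is the one whose weights would be restored.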

# MODEL SELECTION RECOMMENDATIONS

print("\n" + "="*80)
print("MODEL SELECTION RECOMMENDATIONS")
print("="*80)

best_accuracy_idx = results_df['Accuracy'].idxmax()
best_model = results_df.loc[best_accuracy_idx, 'Experiment']
best_accuracy = results_df.loc[best_accuracy_idx, 'Accuracy']

print(f"""
RECOMMENDED MODEL: {best_model}
Validation Accuracy: {best_accuracy:.4f} ({best_accuracy*100:.2f}%)

Selection Criteria:
1. Highest validation accuracy
2. Balanced precision and recall
3. Stable training (from learning curves)
4. Practical deployment considerations
""")
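
The cell above ranks models by accuracy alone, while the criteria also mention balanced precision and recall. A possible way to combine them, sketched here under assumptions, is to require a minimum recall first and then pick the best F1-Score among the models that pass. The function name `select_model`, the 0.95 recall floor, and the metric values are all hypothetical.

```python
def select_model(results, min_recall=0.95):
    """Pick the highest-F1 experiment among those meeting a recall floor.

    `results` is a list of dicts with 'Experiment', 'Recall' and
    'F1-Score' keys. If no experiment meets the floor, fall back to the
    one with the highest recall.
    """
    if not results:
        raise ValueError("results must not be empty")
    eligible = [r for r in results if r["Recall"] >= min_recall]
    if eligible:
        return max(eligible, key=lambda r: r["F1-Score"])
    return max(results, key=lambda r: r["Recall"])


# Made-up metrics, for illustration only
results = [
    {"Experiment": "A (hypothetical)", "Recall": 0.94, "F1-Score": 0.960},
    {"Experiment": "B (hypothetical)", "Recall": 0.97, "F1-Score": 0.955},
    {"Experiment": "C (hypothetical)", "Recall": 0.96, "F1-Score": 0.958},
]
chosen = select_model(results)
print(f"Selected: {chosen['Experiment']}")
```

With the notebook's own table, the input could be built with `results_df.to_dict('records')`. The recall floor is a clinical judgment and should be set with domain experts.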

# LIMITATIONS


- Single data source (NIH dataset)
- May not represent all clinical scenarios
- No species-level classification (only infected/uninfected)

BROADER IMPACT:

Success in automated malaria diagnosis could:
- Reduce diagnostic delays in resource-limited settings
- Improve screening efficiency in high-burden areas
- Enable large-scale surveillance programs
- Serve as template for other microscopy-based diagnoses
- Reduce burden on healthcare workers

However, deployment must also address:
- Equitable access to technology
- Training requirements for end users
- Maintenance and quality control
- Integration with existing healthcare systems