# Deep Learning for Malaria Diagnosis
This notebook is inspired by the work of (Sivaramakrishnan Rajaraman et al., 2018) and (Jason Brownlee, 2019). Acknowledgments go to the NIH and Bangalor Hospital, who made this malaria dataset available.

Malaria is an infectious disease caused by parasites that are transmitted to people through the bites of infected female Anopheles mosquitoes.

The malaria burden in key figures:
<font color='red'>
* More than 219 million cases
* Over 430 000 deaths in 2017 (mostly children and pregnant women)
* 80% of the burden in 15 African countries and India
  </font>

![MalariaBurd](https://github.com/habiboulaye/ai-labs/blob/master/malaria-diagnosis/doc-images/MalariaBurden.png?raw=1)

Malaria is diagnosed with a blood test:
* Collect a blood smear from the patient
* Examine the smear under a microscope to look for the parasite

![MalariaDiag](https://github.com/habiboulaye/ai-labs/blob/master/malaria-diagnosis/doc-images/MalariaDiag.png?raw=1)
  
Main issues related to traditional diagnosis:
<font color='#ed7d31'>
* resource-constrained regions
* time needed and delays
* diagnosis accuracy and cost
</font>

The objective of this notebook is to apply modern deep learning techniques to perform medical image analysis for malaria diagnosis.

*This notebook is inspired by the work of (Sivaramakrishnan Rajaraman et al., 2018), (Adrian Rosebrock, 2018) and (Jason Brownlee, 2019)*

## Configuration

In [None]:
import os

# List directories matching pattern using glob
import glob
for path in glob.glob(r"C:\Users\ADVANCED TECH\Google*"):
	print(path)


In [None]:
# Check that the local project folder exists
import os
drive_path = r"C:\Users\ADVANCED TECH\Downloads\Google Drive\Colab Notebooks\Projects\malaria-diagnosis"

if os.path.exists(drive_path):
    print(os.listdir(drive_path))
else:
    # An f-string prints the path itself; {drive_path} alone builds a one-element set
    print(f"Path not found: {drive_path}")


Path not found: C:\Users\ADVANCED TECH\Downloads\Google Drive\Colab Notebooks\Projects\malaria-diagnosis


In [None]:
# Local Windows environment - no drive mounting needed
print("Running on local Windows environment")



In [None]:
# Local Windows environment setup
# Note: TensorFlow should be installed via: pip install tensorflow

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import numpy as np

print("TensorFlow version:", tf.__version__)

# Check for GPU availability (query the device list once and reuse it)
gpus = tf.config.list_physical_devices('GPU')
print("GPU Available: ", gpus)
if gpus:
    print("GPU device name:", tf.test.gpu_device_name())
else:
    print("Running on CPU")

2.20.0
''

## Populating namespaces

In [None]:
# Importing basic libraries
import os
import random
import shutil
from matplotlib import pyplot
from matplotlib.image import imread
%matplotlib inline

# Importing the Keras libraries and packages
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

In [None]:
# Define the useful paths for data accessibility
ai_project = r"C:\Users\ADVANCED TECH\Downloads\Google Drive\Colab Notebooks\Projects\malaria-diagnosis" #"/content/drive/My Drive/Colab Notebooks (1)/ai-labs/malaria-diagnosis"
cell_images_dir = os.path.join(ai_project,'cell_images')
training_path = os.path.join(ai_project,'train')
testing_path = os.path.join(ai_project,'test')
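
The paths above are hard-coded for one Windows machine. As a minimal sketch of a more portable setup, the same folders could be resolved with `pathlib` and an optional override. The environment variable `MALARIA_PROJECT_DIR` and the helper `resolve_project_paths` are hypothetical names introduced for illustration; they are not part of the original notebook.

```python
import os
from pathlib import Path


def resolve_project_paths(default_root):
    """Return the project folders as Path objects.

    MALARIA_PROJECT_DIR is a hypothetical override used for illustration;
    the original notebook does not define it.
    """
    root = Path(os.environ.get("MALARIA_PROJECT_DIR", default_root))
    return {
        "project": root,
        "cell_images": root / "cell_images",
        "train": root / "train",
        "test": root / "test",
    }


# Assumed default location; adjust to wherever the dataset actually lives.
paths = resolve_project_paths(Path.home() / "malaria-diagnosis")
for name, path in paths.items():
    print(f"{name:12s} {path}")
```

With this layout, moving the project to another machine only requires setting the override once instead of editing every cell.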

## Prepare Dataset

### Download Dataset

In [None]:
import subprocess
import urllib.request
import zipfile
import os

# Download the data locally. If already downloaded, set downloadData = False
downloadData = False  # Data already downloaded
if downloadData:
    # Local Windows environment
    data_url = "https://data.lhncbc.nlm.nih.gov/public/Malaria/cell_images.zip"
    local_zip_path = os.path.join(ai_project, "cell_images.zip")
    
    print(f"Downloading data to: {local_zip_path}")
    
    # Create directory if it doesn't exist
    os.makedirs(ai_project, exist_ok=True)
    
    # Download using urllib
    urllib.request.urlretrieve(data_url, local_zip_path)
    
    # Extract the zip file
    with zipfile.ZipFile(local_zip_path, 'r') as zip_ref:
        zip_ref.extractall(ai_project)
    
    # Clean up zip file
    os.remove(local_zip_path)
    
    # List files in the directory
    print("Files in project directory:")
    print(os.listdir(ai_project))
else:
    print("Data download skipped. Dataset already available.")

Downloading data to: C:\Users\ADVANCED TECH\Downloads\Google Drive\Colab Notebooks\Projects\malaria-diagnosis\cell_images.zip
Files in project directory:
['cell_images']
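
The cell above relies on a manual `downloadData` flag to avoid repeating the download. One possible alternative, sketched here under the assumption that a successful extraction always leaves a `cell_images` folder behind, is to make the extraction step skip itself when that folder already exists. The helper name `extract_if_missing` is hypothetical, and the demo uses a throwaway archive rather than the real dataset.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_if_missing(zip_path, target_dir, expected_subdir="cell_images"):
    """Extract zip_path into target_dir unless expected_subdir already exists.

    Returns True when an extraction happened, False when it was skipped.
    """
    target_dir = Path(target_dir)
    if (target_dir / expected_subdir).is_dir():
        return False
    target_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path, "r") as archive:
        archive.extractall(target_dir)
    return True


# Demo on a tiny stand-in archive (not the real cell_images.zip).
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    archive_path = tmp / "demo.zip"
    with zipfile.ZipFile(archive_path, "w") as archive:
        archive.writestr("cell_images/Parasitized/a.png", b"fake")
    print(extract_if_missing(archive_path, tmp / "project"))  # first call extracts
    print(extract_if_missing(archive_path, tmp / "project"))  # second call skips
```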


## Baseline CNN Model
Define a basic ConvNet built from convolutional blocks (Conv2D => MaxPooling2D), followed by Flatten => Dense => Dense (output).

![ConvNet](https://github.com/habiboulaye/ai-labs/blob/master/malaria-diagnosis/doc-images/ConvNet.png?raw=1)
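
To make the Conv2D => MaxPooling2D => Flatten pipeline concrete, the sketch below traces tensor shapes through it in plain Python. It assumes an 84x84x3 input, `padding="same"` convolutions, default 2x2 pooling, and 16 then 32 filters, which are the values used in the training cell of this notebook. The helper names are illustrative only.

```python
def conv2d_same(shape, filters):
    """Conv2D with padding='same' and stride 1 keeps height and width."""
    h, w, _ = shape
    return (h, w, filters)


def max_pool_2x2(shape):
    """A 2x2 max pool with stride 2 halves height and width (rounding down)."""
    h, w, c = shape
    return (h // 2, w // 2, c)


def trace_shapes(input_shape, conv_filters):
    """Return (layer name, output shape) pairs up to and including Flatten."""
    trace = [("input", input_shape)]
    shape = input_shape
    for i, filters in enumerate(conv_filters, start=1):
        shape = conv2d_same(shape, filters)
        trace.append((f"conv{i}", shape))
        shape = max_pool_2x2(shape)
        trace.append((f"pool{i}", shape))
    h, w, c = shape
    trace.append(("flatten", (h * w * c,)))
    return trace


for name, shape in trace_shapes((84, 84, 3), [16, 32]):
    print(f"{name:8s} {shape}")
```

Under these assumptions the Flatten layer hands a vector of 14112 values to the first Dense layer.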


### Training

In [8]:
# === Redoing training using TensorFlow's image_dataset_from_directory (tf.data) ===
# Single self-contained cell: locate data -> load -> split -> build -> train -> evaluate

# 1) Setup
import os, json, random, datetime
from pathlib import Path
import tensorflow as tf
from tensorflow.keras import layers, models, callbacks, optimizers

print("TF version:", tf.__version__)
SEED = 42
AUTOTUNE = tf.data.AUTOTUNE
IMG_SIZE = (84, 84)
BATCH = 4

# 2) Locate the dataset folder robustly
EXPECTED_CLASSES = {"Parasitized", "Uninfected"}

# Hints to search (add any others you like)
HINT_ROOTS = [
    Path(r"C:\Users\ADVANCED TECH\Downloads\Google Drive\Colab Notebooks\Projects\malaria-diagnosis"),
    Path(r"C:\Users\ADVANCED TECH\Downloads\Google Drive\Colab Notebooks\ai-labs\malaria-diagnosis"),
    Path(r"C:\Users\ADVANCED TECH\Google Drive"),
    Path(r"C:\Users\ADVANCED TECH\Downloads"),
    Path.home() / "Downloads",
    Path(r"G:\My Drive"),  # Google Drive for Desktop default
]

def find_cell_images(roots):
    for base in roots:
        if not base.exists():
            continue
        # Try an exact child first
        direct = base / "cell_images"
        candidates = [direct] if direct.exists() else list(base.rglob("cell_images"))
        for c in candidates:
            if c.is_dir():
                subdirs = {d.name for d in c.iterdir() if d.is_dir()}
                if EXPECTED_CLASSES.issubset(subdirs):
                    return c.resolve()
    return None

DATA_DIR = find_cell_images(HINT_ROOTS)
if DATA_DIR is None:
    raise FileNotFoundError(
        "Could not find a folder named 'cell_images' containing 'Parasitized' and 'Uninfected' "
        "under any of these roots:\n" + "\n".join(str(p) for p in HINT_ROOTS)
    )

print(f"Using DATA_DIR: {DATA_DIR}")
print(f"DATA_DIR exists: {DATA_DIR.exists()}")

# Experiment output directory
EXP_DIR = Path(r"C:\Users\ADVANCED TECH\Downloads\Group1-Malaria-Diagnosis-CNN\experiments")
EXP_DIR.mkdir(parents=True, exist_ok=True)

# 3) Load datasets with image_dataset_from_directory
train_ds = tf.keras.utils.image_dataset_from_directory(
    DATA_DIR,
    labels="inferred",
    label_mode="binary",
    validation_split=0.30,   # 70% train, 30% remainder
    subset="training",
    seed=SEED,
    image_size=IMG_SIZE,
    batch_size=BATCH,
    shuffle=True
)

valtest_ds = tf.keras.utils.image_dataset_from_directory(
    DATA_DIR,
    labels="inferred",
    label_mode="binary",
    validation_split=0.30,
    subset="validation",
    seed=SEED,
    image_size=IMG_SIZE,
    batch_size=BATCH,
    shuffle=True
)

class_names = train_ds.class_names
print("Classes:", class_names)

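# Split the 30% remainder evenly into 15% val + 15% test.
# Caveat: with shuffle=True the source dataset may be reshuffled on each pass,
# so a few images near the take/skip boundary can land in both subsets or in neither.
num_batches = valtest_ds.cardinality().numpy()
assert num_batches > 0, "No batches in val/test split."
val_batches = num_batches // 2
ds_val  = valtest_ds.take(val_batches)
ds_test = valtest_ds.skip(val_batches)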

# Optional performance tuning
def prepare(ds, cache=True, shuffle=False):
    if cache:
        ds = ds.cache()
    if shuffle:
        ds = ds.shuffle(1000, seed=SEED, reshuffle_each_iteration=True)
    return ds.prefetch(AUTOTUNE)

train_ds = prepare(train_ds, cache=False, shuffle=True)  # avoid caching large set if RAM is tight
ds_val   = prepare(ds_val)
ds_test  = prepare(ds_test)

# 4) Build baseline CNN
def build_baseline(input_shape=(84,84,3)):
    inputs = layers.Input(shape=input_shape)
    x = layers.Rescaling(1./255)(inputs)
    x = layers.Conv2D(16, 3, padding="same", activation="relu")(x)
    x = layers.MaxPooling2D()(x)
    x = layers.Conv2D(32, 3, padding="same", activation="relu")(x)
    x = layers.MaxPooling2D()(x)
    x = layers.Flatten()(x)
    x = layers.Dense(64, activation="relu")(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)
    return models.Model(inputs, outputs)

model = build_baseline(input_shape=(IMG_SIZE[0], IMG_SIZE[1], 3))
model.compile(
    optimizer=optimizers.Adam(),
    loss="binary_crossentropy",
    metrics=["accuracy", tf.keras.metrics.AUC(name="auc")]
)
model.summary()

# 5) Train
exp_name = f"E0_tfdata_baseline_adam_lr1e-3_b{BATCH}"
out_dir = EXP_DIR / exp_name
out_dir.mkdir(parents=True, exist_ok=True)

ckpt = callbacks.ModelCheckpoint(
    filepath=str(out_dir / "best.weights.h5"),
    monitor="val_accuracy",
    save_best_only=True,
    save_weights_only=True,
    mode="max",
    verbose=1
)
early = callbacks.EarlyStopping(
    monitor="val_accuracy",
    patience=3,
    mode="max",
    restore_best_weights=True,
    verbose=1
)

history = model.fit(
    train_ds,
    validation_data=ds_val,
    epochs=8,
    callbacks=[ckpt, early],
    verbose=2
)

# 6) Evaluate on test
test_loss, test_acc, test_auc = model.evaluate(ds_test, verbose=0)
print(f"\nTest — Loss: {test_loss:.4f} | Acc: {test_acc:.4f} | AUC: {test_auc:.4f}")

# Save minimal metadata
with open(out_dir/"config.json", "w") as f:
    json.dump({
        "img_size": IMG_SIZE, "batch": BATCH, "epochs": 8,
        "optimizer": "adam", "lr": 1e-3,
        "loader": "image_dataset_from_directory", "split": "70/15/15 via 70/30 then take/skip",
        "data_dir": str(DATA_DIR)
    }, f, indent=2)
with open(out_dir/"test_results.json", "w") as f:
    json.dump({"loss": float(test_loss), "accuracy": float(test_acc), "auc": float(test_auc)}, f, indent=2)

print(f"Saved weights & logs to: {out_dir}")


TF version: 2.20.0
Using DATA_DIR: C:\Users\ADVANCED TECH\Downloads\Google Drive\Colab Notebooks\Projects\malaria-diagnosis\cell_images
DATA_DIR exists: True
Found 27558 files belonging to 2 classes.
Using 19291 files for training.
Found 27558 files belonging to 2 classes.
Using 8267 files for validation.
Classes: ['Parasitized', 'Uninfected']


Epoch 1/8

Epoch 1: val_accuracy improved from None to 0.88480, saving model to C:\Users\ADVANCED TECH\Downloads\Group1-Malaria-Diagnosis-CNN\experiments\E0_tfdata_baseline_adam_lr1e-3_b4\best.weights.h5
4823/4823 - 633s - 131ms/step - accuracy: 0.8409 - auc: 0.9244 - loss: 0.3561 - val_accuracy: 0.8848 - val_auc: 0.9717 - val_loss: 0.2796
Epoch 2/8

Epoch 2: val_accuracy improved from 0.88480 to 0.92280, saving model to C:\Users\ADVANCED TECH\Downloads\Group1-Malaria-Diagnosis-CNN\experiments\E0_tfdata_baseline_adam_lr1e-3_b4\best.weights.h5
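
The counts in the log above (19291 training files, 8267 held-out files, 4823 steps per epoch) follow from the split arithmetic. The sketch below reproduces them in plain Python, assuming the held-out count is `int(validation_split * n_files)`, which is consistent with the printed numbers. The function name `split_sizes` is hypothetical.

```python
import math


def split_sizes(n_files, validation_split, batch_size):
    """Reproduce the 70/15/15 split arithmetic used in the training cell."""
    n_holdout = int(validation_split * n_files)   # assumption: truncation, as the log suggests
    n_train = n_files - n_holdout
    train_batches = math.ceil(n_train / batch_size)
    holdout_batches = math.ceil(n_holdout / batch_size)
    val_batches = holdout_batches // 2            # ds_val = valtest_ds.take(val_batches)
    test_batches = holdout_batches - val_batches  # ds_test = valtest_ds.skip(val_batches)
    n_val = val_batches * batch_size              # take() receives full batches only
    n_test = n_holdout - n_val                    # the final batch may be partial
    return {
        "n_train": n_train,
        "n_holdout": n_holdout,
        "train_batches": train_batches,
        "val_batches": val_batches,
        "test_batches": test_batches,
        "n_val": n_val,
        "n_test": n_test,
    }


sizes = split_sizes(n_files=27558, validation_split=0.30, batch_size=4)
for key, value in sizes.items():
    print(f"{key:14s} {value}")
```

With a batch size of 4, the validation and test subsets come out at 4132 and 4135 images, each very close to 15% of the 27558 files.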

In [12]:
# 4) Build baseline CNN (with in-model rescaling; no augmentation for this baseline)
# Note: this version uses 32 and 64 filters, versus 16 and 32 in the training cell above.
def build_baseline(input_shape=(84,84,3)):
    inputs = layers.Input(shape=input_shape)

    x = layers.Rescaling(1./255)(inputs)

    x = layers.Conv2D(32, 3, padding="same", activation="relu")(x)
    x = layers.MaxPooling2D()(x)
    x = layers.Conv2D(64, 3, padding="same", activation="relu")(x)
    x = layers.MaxPooling2D()(x)
    x = layers.Flatten()(x)
    x = layers.Dense(64, activation="relu")(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)
    return models.Model(inputs, outputs)

model = build_baseline(input_shape=(IMG_SIZE[0], IMG_SIZE[1], 3))
model.compile(
    optimizer=optimizers.Adam(),
    loss="binary_crossentropy",
    metrics=["accuracy", tf.keras.metrics.AUC(name="auc")]
)
model.summary()
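
The training runs in this notebook pair `ModelCheckpoint` with `EarlyStopping(monitor="val_accuracy", patience=3, mode="max", restore_best_weights=True)`. The following is a simplified simulation of that stopping rule in plain Python, not the Keras implementation, and the accuracy values are hypothetical rather than results from this notebook.

```python
def simulate_early_stopping(val_scores, patience=3):
    """Return (stopped_epoch, best_epoch, best_score) with 1-based epochs.

    stopped_epoch is None when training runs through every epoch.
    A score counts as an improvement only if it is strictly higher than the best so far.
    """
    best_score = float("-inf")
    best_epoch = None
    wait = 0
    for epoch, score in enumerate(val_scores, start=1):
        if score > best_score:
            best_score, best_epoch, wait = score, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch, best_score
    return None, best_epoch, best_score


# Hypothetical validation accuracies over 8 epochs.
scores = [0.88, 0.92, 0.93, 0.925, 0.93, 0.929, 0.94, 0.95]
print(simulate_early_stopping(scores, patience=3))  # -> (6, 3, 0.93)
```

In this made-up run, training stops after epoch 6 and the weights from epoch 3 would be restored, even though later epochs would have scored higher. That is the trade-off of a short patience with only 8 epochs.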

## Incremental Experiments to Improve CNN Accuracy

Here are seven incremental experiments to improve the baseline CNN model's accuracy. Each experiment builds upon the previous one.

### Experiment 1: Increase Model Capacity (More Filters)

**Name:** E1_MoreFilters

**Change:** Increase the number of filters in the convolutional layers.

**Reasoning:** More filters allow the model to learn a richer set of features from the images.
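
As a rough illustration of what "more filters" costs, the sketch below counts trainable parameters for two filter configurations. It assumes an 84x84x3 input, 3x3 kernels with `padding="same"`, a 2x2 max pool after each convolution, a 64-unit Dense layer, and one sigmoid output, as in the baseline cells of this notebook. The helper names are illustrative.

```python
def conv_params(kernel, in_channels, filters):
    """Weights plus one bias per filter."""
    return kernel * kernel * in_channels * filters + filters


def dense_params(in_units, out_units):
    """Weights plus one bias per output unit."""
    return in_units * out_units + out_units


def count_params(conv_filters, dense_units, input_shape=(84, 84, 3), kernel=3):
    h, w, channels = input_shape
    total = 0
    for filters in conv_filters:
        total += conv_params(kernel, channels, filters)
        channels = filters
        h, w = h // 2, w // 2              # 2x2 max pooling after each conv
    flat = h * w * channels                # size of the Flatten output
    total += dense_params(flat, dense_units)
    total += dense_params(dense_units, 1)  # single sigmoid output unit
    return total


print(count_params([16, 32], 64))  # -> 908385
print(count_params([32, 64], 64))  # -> 1825857
```

Doubling the filter counts roughly doubles the total, and in both cases nearly all parameters sit in the first Dense layer, because the flattened feature map is large.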

**Code Snippet:**

In [13]:
def build_E2_AddConvLayer(input_shape=(84,84,3)):
    inputs = layers.Input(shape=input_shape)
    x = layers.Rescaling(1./255)(inputs)
    x = layers.Conv2D(64, 3, padding="same", activation="relu")(x)
    x = layers.MaxPooling2D()(x)
    x = layers.Conv2D(128, 3, padding="same", activation="relu")(x)
    x = layers.MaxPooling2D()(x)
    x = layers.Conv2D(256, 3, padding="same", activation="relu")(x) # Added conv layer
    x = layers.MaxPooling2D()(x)
    x = layers.Flatten()(x)
    x = layers.Dense(128, activation="relu")(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)
    return models.Model(inputs, outputs)

model_E2 = build_E2_AddConvLayer(input_shape=(IMG_SIZE[0], IMG_SIZE[1], 3))
model_E2.compile(optimizer=optimizers.Adam(), loss="binary_crossentropy", metrics=["accuracy", tf.keras.metrics.AUC(name="auc")])

# Train model E2
exp_name_E2 = "E2_AddConvLayer"
out_dir_E2 = EXP_DIR/exp_name_E2
out_dir_E2.mkdir(parents=True, exist_ok=True)

ckpt_E2 = callbacks.ModelCheckpoint(
    filepath=str(out_dir_E2/"best.weights.h5"),
    monitor="val_accuracy",
    save_best_only=True,
    save_weights_only=True,
    mode="max",
    verbose=1
)
early_E2 = callbacks.EarlyStopping(
    monitor="val_accuracy",
    patience=3,
    mode="max",
    restore_best_weights=True,
    verbose=1
)

history_E2 = model_E2.fit(
    train_ds,
    validation_data=ds_val,
    epochs=8,
    callbacks=[ckpt_E2, early_E2],
    verbose=2
)

# Evaluate on test
test_loss_E2, test_acc_E2, test_auc_E2 = model_E2.evaluate(ds_test, verbose=0)
print(f"\nTest (E2) — Loss: {test_loss_E2:.4f} | Acc: {test_acc_E2:.4f} | AUC: {test_auc_E2:.4f}")

# Save minimal metadata
with open(out_dir_E2/"config.json", "w") as f:
    json.dump({
        "img_size": IMG_SIZE, "batch": BATCH, "epochs": 8,
        "optimizer": "adam", "lr": 1e-3, "dropout": "none", "layers": "added conv",
        "loader": "image_dataset_from_directory", "split": "70/15/15 via 70/30 then take/skip"
    }, f, indent=2)
with open(out_dir_E2/"test_results.json", "w") as f:
    json.dump({"loss": float(test_loss_E2), "accuracy": float(test_acc_E2), "auc": float(test_auc_E2)}, f, indent=2)

print(f"Saved weights & logs to: {out_dir_E2}")

Epoch 1/8

Epoch 1: val_accuracy improved from None to 0.95039, saving model to C:\Users\ADVANCED TECH\Downloads\Group1-Malaria-Diagnosis-CNN\experiments\E2_AddConvLayer\best.weights.h5
4823/4823 - 871s - 180ms/step - accuracy: 0.7253 - auc: 0.8372 - loss: 0.4681 - val_accuracy: 0.9504 - val_auc: 0.9836 - val_loss: 0.1730
Epoch 2/8

Epoch 2: val_accuracy improved from 0.95039 to 0.95668, saving model to C:\Users\ADVANCED TECH\Downloads\Group1-Malaria-Diagnosis-CNN\experiments\E2_AddConvLayer\best.weights.h5
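
The next cell adds a `Dropout(0.5)` layer after the Dense layer. As a minimal sketch of the idea in plain Python (inverted dropout, not the Keras layer itself): during training each activation is zeroed with probability `rate` and the survivors are scaled by `1 / (1 - rate)`, while at inference the values pass through unchanged.

```python
import random


def dropout(values, rate, training, rng):
    """Inverted dropout on a list of activations."""
    if not training or rate == 0.0:
        return list(values)                # inference: identity
    keep_scale = 1.0 / (1.0 - rate)        # keeps the expected value unchanged
    return [v * keep_scale if rng.random() >= rate else 0.0 for v in values]


rng = random.Random(42)
activations = [1.0] * 10
print(dropout(activations, rate=0.5, training=True, rng=rng))   # mix of 0.0 and 2.0
print(dropout(activations, rate=0.5, training=False, rng=rng))  # unchanged
```

Because a random half of the units is silenced at each step, the network cannot rely on any single unit, which is why dropout is commonly used to reduce overfitting.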

In [None]:
def build_E3_Dropout(input_shape=(84,84,3)):
    inputs = layers.Input(shape=input_shape)
    x = layers.Rescaling(1./255)(inputs)
    x = layers.Conv2D(64, 3, padding="same", activation="relu")(x)
    x = layers.MaxPooling2D()(x)
    x = layers.Conv2D(128, 3, padding="same", activation="relu")(x)
    x = layers.MaxPooling2D()(x)
    x = layers.Conv2D(256, 3, padding="same", activation="relu")(x)
    x = layers.MaxPooling2D()(x)
    x = layers.Flatten()(x)
    x = layers.Dense(128, activation="relu")(x)
    x = layers.Dropout(0.5)(x) # Added dropout
    outputs = layers.Dense(1, activation="sigmoid")(x)
    return models.Model(inputs, outputs)

model_E3 = build_E3_Dropout(input_shape=(IMG_SIZE[0], IMG_SIZE[1], 3))
model_E3.compile(optimizer=optimizers.Adam(), loss="binary_crossentropy", metrics=["accuracy", tf.keras.metrics.AUC(name="auc")])

# Train model E3
exp_name_E3 = "E3_Dropout"
out_dir_E3 = EXP_DIR/exp_name_E3
out_dir_E3.mkdir(parents=True, exist_ok=True)

ckpt_E3 = callbacks.ModelCheckpoint(
    filepath=str(out_dir_E3/"best.weights.h5"),
    monitor="val_accuracy",
    save_best_only=True,
    save_weights_only=True,
    mode="max",
    verbose=1
)
early_E3 = callbacks.EarlyStopping(
    monitor="val_accuracy",
    patience=3,
    mode="max",
    restore_best_weights=True,
    verbose=1
)

history_E3 = model_E3.fit(
    train_ds,
    validation_data=ds_val,
    epochs=8,
    callbacks=[ckpt_E3, early_E3],
    verbose=2
)

# Evaluate on test
test_loss_E3, test_acc_E3, test_auc_E3 = model_E3.evaluate(ds_test, verbose=0)
print(f"\nTest (E3) — Loss: {test_loss_E3:.4f} | Acc: {test_acc_E3:.4f} | AUC: {test_auc_E3:.4f}")

# Save minimal metadata
with open(out_dir_E3/"config.json", "w") as f:
    json.dump({
        "img_size": IMG_SIZE, "batch": BATCH, "epochs": 8,
        "optimizer": "adam", "lr": 1e-3, "dropout": "added", "layers": "added conv",
        "loader": "image_dataset_from_directory", "split": "70/15/15 via 70/30 then take/skip"
    }, f, indent=2)
with open(out_dir_E3/"test_results.json", "w") as f:
    json.dump({"loss": float(test_loss_E3), "accuracy": float(test_acc_E3), "auc": float(test_auc_E3)}, f, indent=2)

print(f"Saved weights & logs to: {out_dir_E3}")

Epoch 1/8

Epoch 1: val_accuracy improved from None to 0.92667, saving model to C:\Users\ADVANCED TECH\Downloads\Group1-Malaria-Diagnosis-CNN\experiments\E3_Dropout\best.weights.h5
4823/4823 - 805s - 167ms/step - accuracy: 0.6734 - auc: 0.7727 - loss: 0.5387 - val_accuracy: 0.9267 - val_auc: 0.9675 - val_loss: 0.2299
Epoch 2/8

Epoch 2: val_accuracy improved from 0.92667 to 0.95305, saving model to C:\Users\ADVANCED TECH\Downloads\Group1-Malaria-Diagnosis-CNN\experiments\E3_Dropout\best.weights.h5
4823