<a href="https://colab.research.google.com/github/Blueorchid1711/malaria/blob/main/malaria.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("iarunava/cell-images-for-detecting-malaria")

print("Path to dataset files:", path)

Using Colab cache for faster access to the 'cell-images-for-detecting-malaria' dataset.
Path to dataset files: /kaggle/input/cell-images-for-detecting-malaria


In [None]:
final_val_accuracy = history.history['val_accuracy'][-1]
print(f"Final Validation Accuracy: {final_val_accuracy:.4f}")

Final Validation Accuracy: 0.8884
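
The value printed above is simply the last entry of `history.history['val_accuracy']`, which is not necessarily the best epoch. As a minimal sketch, the snippet below uses a plain dictionary shaped like `history.history`, filled with the six validation accuracies that are visible in this notebook's training log (the log is cut off after epoch 6, so later epochs are left out). The names `logged`, `best_index`, and `best_epoch` are illustrative and not part of the notebook.

```python
# Stand-in for history.history, using the val_accuracy values logged for epochs 1-6.
logged = {"val_accuracy": [0.9018, 0.8938, 0.8820, 0.8984, 0.8677, 0.9002]}

val_acc = logged["val_accuracy"]

# Index of the highest validation accuracy among the logged epochs.
best_index = max(range(len(val_acc)), key=val_acc.__getitem__)
best_epoch = best_index + 1  # Keras numbers epochs from 1
best_val_accuracy = val_acc[best_index]

print(f"Best logged epoch: {best_epoch} (val_accuracy={best_val_accuracy:.4f})")
```

With a real `History` object the same lines would work on `history.history` directly.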


# Task
Build and evaluate an image classification model for malaria detection using the dataset located at `/kaggle/input/cell-images-for-detecting-malaria`.

## Prepare Data for Training

### Subtask:
Set up data generators for training and validation. This includes applying data augmentation techniques to the training data and resizing images to the input size expected by MobileNetV2.


**Reasoning**:
I need to define constants and import necessary libraries before setting up the data generators. This includes specifying the image dimensions, batch size, and the path to the dataset. These constants will be used in subsequent steps for configuring the data generators.



In [None]:
import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator
import os

# Define constants
IMAGE_WIDTH = 224
IMAGE_HEIGHT = 224
IMAGE_SIZE = (IMAGE_HEIGHT, IMAGE_WIDTH)  # Keras target_size expects (height, width)
BATCH_SIZE = 32
DATA_DIR = '/kaggle/input/cell-images-for-detecting-malaria/cell_images'

print(f"Image size: {IMAGE_SIZE}")
print(f"Batch size: {BATCH_SIZE}")
print(f"Dataset directory: {DATA_DIR}")

Image size: (224, 224)
Batch size: 32
Dataset directory: /kaggle/input/cell-images-for-detecting-malaria/cell_images


**Reasoning**:
Now that the constants are defined, I will create the ImageDataGenerator for the training set, incorporating data augmentation techniques and setting a validation split for later use. This generator will handle the preprocessing and augmentation of training images.



In [None]:
train_datagen = ImageDataGenerator(
    rescale=1./255,
    rotation_range=20,
    width_shift_range=0.2,
    height_shift_range=0.2,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True,
    fill_mode='nearest',
    validation_split=0.2 # 20% for validation
)

print("Training data generator created with augmentation and validation split.")

Training data generator created with augmentation and validation split.


**Reasoning**:
Following the instructions, I need to create a separate `ImageDataGenerator` for the validation set. This generator will only rescale the images and use the same validation split, but without any data augmentation, to ensure the validation data represents the real-world distribution.



In [None]:
validation_datagen = ImageDataGenerator(
    rescale=1./255,
    validation_split=0.2 # Same split as training for consistency
)

print("Validation data generator created without augmentation, only rescaling.")

Validation data generator created without augmentation, only rescaling.
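
Both generators scale pixels to [0, 1] through `rescale=1./255`. The ImageNet weights of MobileNetV2 were trained on inputs scaled to [-1, 1], which is the mapping that `tf.keras.applications.mobilenet_v2.preprocess_input` applies. The sketch below only compares the two mappings in plain Python; the helper names are hypothetical, and whether switching would change the accuracy in this notebook has not been tested here.

```python
def rescale_0_1(pixel):
    """Mapping applied by rescale=1./255: [0, 255] -> [0, 1]."""
    return pixel / 255.0


def scale_minus1_1(pixel):
    """Mapping applied by MobileNetV2's preprocess_input: [0, 255] -> [-1, 1]."""
    return pixel / 127.5 - 1.0


# Compare both mappings at the ends and the middle of the 8-bit range.
for pixel in (0, 127.5, 255):
    print(pixel, rescale_0_1(pixel), scale_minus1_1(pixel))
```

With `ImageDataGenerator`, the second mapping could be selected by passing `preprocessing_function=preprocess_input` instead of `rescale`.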


**Reasoning**:
Now that both `ImageDataGenerator` instances are set up, I will generate the training data using `flow_from_directory` on the `train_datagen`. This will load images from the specified `DATA_DIR`, apply augmentation, resize them, and organize them into batches for training.



In [None]:
train_generator = train_datagen.flow_from_directory(
    DATA_DIR,
    target_size=IMAGE_SIZE,
    batch_size=BATCH_SIZE,
    color_mode='rgb',
    class_mode='binary',
    subset='training'
)

print(f"Found {train_generator.samples} training images belonging to {train_generator.num_classes} classes.")

FileNotFoundError: [Errno 2] No such file or directory: '/kaggle/input/cell-images-for-detecting-malaria/cell_images'

**Reasoning**:
The `FileNotFoundError` indicates that the `DATA_DIR` path as currently defined (`/kaggle/input/cell-images-for-detecting-malaria/cell_images`) does not exist. The `path` variable, which was obtained from `kagglehub.dataset_download`, points to `/kaggle/input/cell-images-for-detecting-malaria`. It is highly probable that the class subdirectories (`Parasitized` and `Uninfected`) are directly within this `path` directory, rather than nested under an additional `cell_images` folder. Therefore, `DATA_DIR` should be set directly to the `path` variable.
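
Rather than guessing where the class folders live, the directory tree can be searched for them. The following is a minimal sketch, assuming the class folders are named `Parasitized` and `Uninfected`. `find_class_root` is a hypothetical helper, and the temporary directory tree only imitates the nested layout that the directory listing later in this notebook reveals.

```python
import os
import tempfile


def find_class_root(root, class_names=("Parasitized", "Uninfected")):
    """Return the first directory whose subdirectories are exactly the class folders."""
    wanted = set(class_names)
    for current_dir, subdirs, _files in os.walk(root):
        subdirs.sort()  # make the traversal order deterministic
        if set(subdirs) == wanted:
            return current_dir
    return None


with tempfile.TemporaryDirectory() as tmp:
    # Imitation of the layout: class folders appear twice, once per nesting level.
    for parts in (
        ("cell_images", "Parasitized"),
        ("cell_images", "Uninfected"),
        ("cell_images", "cell_images", "Parasitized"),
        ("cell_images", "cell_images", "Uninfected"),
    ):
        os.makedirs(os.path.join(tmp, *parts))

    found = find_class_root(tmp)
    relative = os.path.relpath(found, tmp)

print(relative)
```

Requiring an exact match on the subdirectory names skips a folder that holds the two classes plus an extra wrapper directory, which would otherwise be counted as a third class.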



In [None]:
import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator
import os

# Define constants
IMAGE_WIDTH = 224
IMAGE_HEIGHT = 224
IMAGE_SIZE = (IMAGE_HEIGHT, IMAGE_WIDTH)  # Keras target_size expects (height, width)
BATCH_SIZE = 32
# Corrected DATA_DIR to use the 'path' variable from the dataset download
# Assuming 'path' points to the directory containing 'Parasitized' and 'Uninfected' subdirectories
DATA_DIR = path

print(f"Image size: {IMAGE_SIZE}")
print(f"Batch size: {BATCH_SIZE}")
print(f"Dataset directory: {DATA_DIR}")

NameError: name 'path' is not defined

**Reasoning**:
The `NameError` indicates that the variable `path` was not defined when `DATA_DIR = path` was executed. This happened because the cell defining `path` (from `kagglehub.dataset_download`) was not executed before or its value was lost. To fix this, I need to re-execute the `kagglehub.dataset_download` to ensure `path` is properly defined before it is used to set `DATA_DIR`.



In [None]:
import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator
import os
import kagglehub

# Download latest version to ensure 'path' is defined
path = kagglehub.dataset_download("iarunava/cell-images-for-detecting-malaria")

# Define constants
IMAGE_WIDTH = 224
IMAGE_HEIGHT = 224
IMAGE_SIZE = (IMAGE_HEIGHT, IMAGE_WIDTH)  # Keras target_size expects (height, width)
BATCH_SIZE = 32
# Corrected DATA_DIR to use the 'path' variable from the dataset download
# Assuming 'path' points to the directory containing 'Parasitized' and 'Uninfected' subdirectories
DATA_DIR = path

print(f"Image size: {IMAGE_SIZE}")
print(f"Batch size: {BATCH_SIZE}")
print(f"Dataset directory: {DATA_DIR}")

Downloading from https://www.kaggle.com/api/v1/datasets/download/iarunava/cell-images-for-detecting-malaria?dataset_version_number=1...


100%|██████████| 675M/675M [00:05<00:00, 119MB/s] 

Extracting files...





Image size: (224, 224)
Batch size: 32
Dataset directory: /root/.cache/kagglehub/datasets/iarunava/cell-images-for-detecting-malaria/versions/1


**Reasoning**:
Now that the `DATA_DIR` has been correctly defined and `path` is accessible, I will re-attempt to generate the training data using `flow_from_directory`. This should now correctly locate the image directories and create the `train_generator`.



In [None]:
train_generator = train_datagen.flow_from_directory(
    DATA_DIR,
    target_size=IMAGE_SIZE,
    batch_size=BATCH_SIZE,
    color_mode='rgb',
    class_mode='binary',
    subset='training'
)

print(f"Found {train_generator.samples} training images belonging to {train_generator.num_classes} classes.")

Found 44093 images belonging to 1 classes.
Found 44093 training images belonging to 1 classes.


**Reasoning**:
The previous output indicates that the `train_generator` found 44093 images but reported '1 classes'. This is incorrect for a binary classification problem (Parasitized/Uninfected) which should have 2 classes. The issue is likely due to the `DATA_DIR` pointing to the root of the dataset which contains an extra directory `cell_images` instead of directly having the `Parasitized` and `Uninfected` subdirectories. I need to correct `DATA_DIR` to point to the `cell_images` subdirectory within the `path` downloaded from KaggleHub.
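
The class count follows from how `flow_from_directory` infers labels: when no `classes` argument is given, every immediate subdirectory of the target directory becomes one class, and indices are assigned in alphanumeric order. The sketch below imitates that rule with the standard library only. `infer_classes` is a hypothetical helper rather than a Keras function, and the temporary layout is an assumption used for illustration.

```python
import os
import tempfile


def infer_classes(directory):
    """Imitate the rule: one class per immediate subdirectory, sorted by name."""
    names = sorted(
        entry
        for entry in os.listdir(directory)
        if os.path.isdir(os.path.join(directory, entry))
    )
    return {name: index for index, name in enumerate(names)}


with tempfile.TemporaryDirectory() as tmp:
    # Assumed layout: the download root holds a single wrapper folder.
    os.makedirs(os.path.join(tmp, "cell_images", "Parasitized"))
    os.makedirs(os.path.join(tmp, "cell_images", "Uninfected"))

    root_classes = infer_classes(tmp)
    inner_classes = infer_classes(os.path.join(tmp, "cell_images"))

print(root_classes)
print(inner_classes)
```

On a real generator the resulting mapping is available as `train_generator.class_indices`. With alphanumeric ordering and `class_mode='binary'`, `Parasitized` maps to label 0 and `Uninfected` to label 1.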



In [None]:
import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator
import os
import kagglehub

# Re-download to ensure 'path' is defined and up-to-date
path = kagglehub.dataset_download("iarunava/cell-images-for-detecting-malaria")

# Define constants
IMAGE_WIDTH = 224
IMAGE_HEIGHT = 224
IMAGE_SIZE = (IMAGE_HEIGHT, IMAGE_WIDTH)  # Keras target_size expects (height, width)
BATCH_SIZE = 32
# Corrected DATA_DIR to point to the 'cell_images' subdirectory within the downloaded dataset
DATA_DIR = os.path.join(path, 'cell_images')

print(f"Image size: {IMAGE_SIZE}")
print(f"Batch size: {BATCH_SIZE}")
print(f"Dataset directory: {DATA_DIR}")

Image size: (224, 224)
Batch size: 32
Dataset directory: /root/.cache/kagglehub/datasets/iarunava/cell-images-for-detecting-malaria/versions/1/cell_images


**Reasoning**:
Now that the `DATA_DIR` has been correctly updated to include the `cell_images` subdirectory, I need to re-create the `train_generator` to ensure it correctly identifies both classes (Parasitized and Uninfected) and the appropriate number of images for training.



In [None]:
train_generator = train_datagen.flow_from_directory(
    DATA_DIR,
    target_size=IMAGE_SIZE,
    batch_size=BATCH_SIZE,
    color_mode='rgb',
    class_mode='binary',
    subset='training'
)

print(f"Found {train_generator.samples} training images belonging to {train_generator.num_classes} classes.")

Found 44095 images belonging to 3 classes.
Found 44095 training images belonging to 3 classes.


**Reasoning**:
The `train_generator` now reports 3 classes instead of the expected 2 for binary classification. Since `flow_from_directory` treats every immediate subdirectory of `DATA_DIR` as a class, there must be an extra folder next to the two class folders. I need to inspect the contents of `DATA_DIR` to see which subdirectories are present.



In [None]:
import os

# List the contents of the DATA_DIR to check for unexpected subdirectories
print(f"Contents of DATA_DIR ({DATA_DIR}):")
for item in os.listdir(DATA_DIR):
    if os.path.isdir(os.path.join(DATA_DIR, item)):
        print(f" - {item}/")
    else:
        print(f" - {item}")

Contents of DATA_DIR (/root/.cache/kagglehub/datasets/iarunava/cell-images-for-detecting-malaria/versions/1/cell_images):
 - cell_images/
 - Uninfected/
 - Parasitized/


**Reasoning**:
The previous output showed an unexpected nested `cell_images` subdirectory within the current `DATA_DIR`, causing `flow_from_directory` to detect 3 classes. I need to update `DATA_DIR` one more time to point to the correct subdirectory that directly contains 'Parasitized' and 'Uninfected' folders to ensure only two classes are identified.



In [None]:
import os

# Corrected DATA_DIR to point to the actual directory containing the class folders
# The structure appears to be: path/cell_images/cell_images/Parasitized and path/cell_images/cell_images/Uninfected
DATA_DIR = os.path.join(path, 'cell_images', 'cell_images')

print(f"Corrected Dataset directory: {DATA_DIR}")

Corrected Dataset directory: /root/.cache/kagglehub/datasets/iarunava/cell-images-for-detecting-malaria/versions/1/cell_images/cell_images


**Reasoning**:
Now that the `DATA_DIR` has been correctly updated to point to the innermost `cell_images` subdirectory, I need to re-create the `train_generator` to ensure it correctly identifies the two class folders ('Parasitized' and 'Uninfected') and the appropriate number of images for training.



In [None]:
train_generator = train_datagen.flow_from_directory(
    DATA_DIR,
    target_size=IMAGE_SIZE,
    batch_size=BATCH_SIZE,
    color_mode='rgb',
    class_mode='binary',
    subset='training'
)

print(f"Found {train_generator.samples} training images belonging to {train_generator.num_classes} classes.")

Found 22048 images belonging to 2 classes.
Found 22048 training images belonging to 2 classes.


**Reasoning**:
Now that the training data generator is correctly set up, I will generate the validation data using `flow_from_directory` on the `validation_datagen`. This will load images from the specified `DATA_DIR`, apply rescaling, resize them, and organize them into batches for validation.



In [None]:
validation_generator = validation_datagen.flow_from_directory(
    DATA_DIR,
    target_size=IMAGE_SIZE,
    batch_size=BATCH_SIZE,
    color_mode='rgb',
    class_mode='binary',
    subset='validation'
)

print(f"Found {validation_generator.samples} validation images belonging to {validation_generator.num_classes} classes.")

Found 5510 images belonging to 2 classes.
Found 5510 validation images belonging to 2 classes.
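
The two counts (22048 training, 5510 validation) can be reproduced by hand. As far as I understand the Keras directory iterator, the split is applied per class folder: the first `int(validation_split * n)` files go to the validation subset and the rest to training. The sketch below assumes the 27558 images (22048 + 5510) are divided evenly between the two classes; that balance is an assumption inferred from the totals, not something the notebook prints.

```python
def split_counts(files_per_class, validation_split):
    """Return (training, validation) counts for one class folder."""
    n_val = int(validation_split * files_per_class)  # truncated, not rounded
    return files_per_class - n_val, n_val


IMAGES_PER_CLASS = 13779  # assumption: (22048 + 5510) / 2 images per class
NUM_CLASSES = 2

train_per_class, val_per_class = split_counts(IMAGES_PER_CLASS, 0.2)
total_train = NUM_CLASSES * train_per_class
total_val = NUM_CLASSES * val_per_class

print(f"training: {total_train}, validation: {total_val}")
```

Because the slice depends only on the sorted file list and the split fraction, two separate `ImageDataGenerator` objects with the same `validation_split` select non-overlapping subsets.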


## Build MobileNetV2 Model with Transfer Learning

### Subtask:
Load the pre-trained MobileNetV2 model as a base and add a custom classification head (dense layers) on top. Freeze the base model layers initially and compile the complete model for training.


**Reasoning**:
I need to import the necessary Keras layers and the MobileNetV2 model to start building the deep learning model as specified in the instructions.



In [None]:
import tensorflow as tf
from tensorflow.keras.applications import MobileNetV2
from tensorflow.keras.layers import Flatten, Dense, BatchNormalization, Dropout
from tensorflow.keras.models import Model

print("Keras MobileNetV2, layers, and Model imported successfully.")

Keras MobileNetV2, layers, and Model imported successfully.


**Reasoning**:
Now that the necessary libraries are imported, I will load the pre-trained MobileNetV2 model, remove its top classification layer, and freeze its layers to prepare it as the base for transfer learning.



In [None]:
base_model = MobileNetV2(input_shape=IMAGE_SIZE + (3,), include_top=False, weights='imagenet')

# Freeze the base model layers
for layer in base_model.layers:
    layer.trainable = False

print("MobileNetV2 base model loaded and its layers frozen.")

Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/mobilenet_v2/mobilenet_v2_weights_tf_dim_ordering_tf_kernels_1.0_224_no_top.h5
9406464/9406464 ━━━━━━━━━━━━━━━━━━━━ 0s 0us/step
MobileNetV2 base model loaded and its layers frozen.


**Reasoning**:
With the base model loaded and frozen, I will now create the custom classification head by adding a Flatten layer, a Dense layer with ReLU activation, a BatchNormalization layer, a Dropout layer, and a final Dense layer with sigmoid activation for binary classification.



In [None]:
x = base_model.output
x = Flatten()(x)
x = Dense(128, activation='relu')(x)
x = BatchNormalization()(x)
x = Dropout(0.5)(x)
predictions = Dense(1, activation='sigmoid')(x)

print("Custom classification head created.")

Custom classification head created.
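
Most of the trainable weights in this head sit in the first `Dense` layer, because `Flatten` turns the whole feature map into one long vector. The arithmetic below is a rough sketch, assuming the MobileNetV2 base (224x224 input, `include_top=False`) outputs a 7x7x1280 feature map. The comparison with a pooled 1280-vector, as `GlobalAveragePooling2D` would produce, is shown only for scale and is not what this notebook uses.

```python
FEATURE_MAP = (7, 7, 1280)  # assumed base-model output shape (height, width, channels)
DENSE_UNITS = 128


def dense_params(inputs, units):
    """Weights plus biases of a fully connected layer."""
    return inputs * units + units


# Flatten keeps every spatial position, so the Dense layer sees 7 * 7 * 1280 inputs.
flatten_features = FEATURE_MAP[0] * FEATURE_MAP[1] * FEATURE_MAP[2]
flatten_head = dense_params(flatten_features, DENSE_UNITS)

# Averaging over the spatial positions would leave one value per channel.
pooled_head = dense_params(FEATURE_MAP[2], DENSE_UNITS)

print(f"Flatten -> Dense(128): {flatten_head:,} parameters")
print(f"Pooled  -> Dense(128): {pooled_head:,} parameters")
```

Calling `model.summary()` on the assembled model is the direct way to confirm the actual counts.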


**Reasoning**:
Now that the base model and the custom classification head are defined, I will combine them into a complete model using the Keras `Model` class, specifying the base model's input and the custom head's output.



In [None]:
model = Model(inputs=base_model.input, outputs=predictions)

print("Complete model combining base model and custom head created.")

Complete model combining base model and custom head created.


**Reasoning**:
Now that the complete model is assembled, I will compile it using the Adam optimizer with a specified learning rate, binary crossentropy as the loss function, and accuracy as the metric, as required for binary classification tasks.



In [None]:
from tensorflow.keras.optimizers import Adam

model.compile(optimizer=Adam(learning_rate=0.001), loss='binary_crossentropy', metrics=['accuracy'])

print("Model compiled successfully with Adam optimizer, binary_crossentropy loss, and accuracy metric.")

Model compiled successfully with Adam optimizer, binary_crossentropy loss, and accuracy metric.


## Train the Image Classification Model

### Subtask:
Train the configured MobileNetV2 model using the prepared data generators for 8 epochs, as requested. Because the base model is frozen, only the custom classification head is updated during this run.


**Reasoning**:
Before training the model, I need to calculate the `steps_per_epoch` using the total number of training samples and the `BATCH_SIZE`. This ensures that each epoch processes the entire training dataset correctly.



In [None]:
import math

steps_per_epoch = math.ceil(train_generator.samples / BATCH_SIZE)

print(f"Calculated steps per epoch: {steps_per_epoch}")

Calculated steps per epoch: 689



**Reasoning**:
Now I will calculate the `validation_steps` using the total number of validation samples and the `BATCH_SIZE`. This is necessary for the model's `.fit()` method to correctly iterate through the validation data.



In [None]:
validation_steps = math.ceil(validation_generator.samples / BATCH_SIZE)

print(f"Calculated validation steps: {validation_steps}")

Calculated validation steps: 173


**Reasoning**:
Now that both `steps_per_epoch` and `validation_steps` are calculated, I will train the model using the `.fit()` method with the prepared data generators and the specified number of epochs.



In [None]:
history = model.fit(
    train_generator,
    steps_per_epoch=steps_per_epoch,
    epochs=8,
    validation_data=validation_generator,
    validation_steps=validation_steps
)

print("Model training completed for 8 epochs.")



Epoch 1/8
689/689 ━━━━━━━━━━━━━━━━━━━━ 316s 430ms/step - accuracy: 0.8589 - loss: 0.3477 - val_accuracy: 0.9018 - val_loss: 0.2583
Epoch 2/8
689/689 ━━━━━━━━━━━━━━━━━━━━ 275s 399ms/step - accuracy: 0.8989 - loss: 0.2699 - val_accuracy: 0.8938 - val_loss: 0.2763
Epoch 3/8
689/689 ━━━━━━━━━━━━━━━━━━━━ 275s 399ms/step - accuracy: 0.9050 - loss: 0.2514 - val_accuracy: 0.8820 - val_loss: 0.3113
Epoch 4/8
689/689 ━━━━━━━━━━━━━━━━━━━━ 271s 393ms/step - accuracy: 0.9066 - loss: 0.2444 - val_accuracy: 0.8984 - val_loss: 0.2690
Epoch 5/8
689/689 ━━━━━━━━━━━━━━━━━━━━ 269s 391ms/step - accuracy: 0.9155 - loss: 0.2302 - val_accuracy: 0.8677 - val_loss: 0.3532
Epoch 6/8
689/689 ━━━━━━━━━━━━━━━━━━━━ 268s 390ms/step - accuracy: 0.9115 - loss: 0.2398 - val_accuracy: 0.9002 - val_loss: 0.2598
Epoch 7/8
