### Model and Library Imports

In this section, we import the libraries and modules needed to build our deep learning models. We use Keras to construct the neural network architectures, including the pretrained models VGG16, EfficientNetB3, ResNet50, and MobileNetV2. We also import utilities for loading and rescaling images, and the early stopping callback, which halts training once validation loss stops improving and so limits overfitting.

The following libraries are imported:
- **TensorFlow/Keras**: For model building and training.
- **NumPy**: For numerical operations.
- **Matplotlib**: For visualizing results.
- **Pandas**: For data manipulation.
- **OpenCV**: For image processing.
- **os and glob**: For building paths and collecting files.

We also suppress warnings to keep the output clean.


In [None]:
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.applications import VGG16, EfficientNetB3, ResNet50, MobileNetV2
from tensorflow.keras.models import load_model


import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import cv2
import os
import glob

import warnings
warnings.filterwarnings('ignore')

# Dataset Exploration

### Dataset Paths

In this section, we define the paths to the dataset containing cell images used for detecting malaria. The dataset is organized into two main directories: one for parasitized cells and another for uninfected cells. We use the `os.path.join` method to create full paths for each category, ensuring compatibility across different operating systems.

- **DATA_DIR**: The main directory where the cell images are stored.
- **PARASITIZED_DIR**: Path to the directory containing images of parasitized cells.
- **UNINFECTED_DIR**: Path to the directory containing images of uninfected cells.


In [None]:
# paths
DATA_DIR = '/kaggle/input/cell-images-for-detecting-malaria/cell_images/cell_images'

PARASITIZED_DIR = os.path.join(DATA_DIR, 'Parasitized')
UNINFECTED_DIR = os.path.join(DATA_DIR, 'Uninfected')


### Dataset Size Overview

In this section, we print the number of images in each category of the dataset: parasitized and uninfected cells. This provides an initial overview of the dataset size, which is crucial for understanding the distribution of data and planning the model training process.

- The number of **parasitized images** is displayed.
- The number of **uninfected images** is displayed.


In [None]:
print(f'Parasitized images number = {len(os.listdir(PARASITIZED_DIR))}')
print(f'Uninfected images number = {len(os.listdir(UNINFECTED_DIR))}')

### Dataset Size Output

Both directories contain the same number of entries, which is important for balanced training.

- **Parasitized images number**: 13,780
- **Uninfected images number**: 13,780

Note that `os.listdir` counts every file in a directory, not only images, so these figures can include non-image files. The balance between the two classes helps ensure that the model does not become biased toward one class during training.


### Image Paths Collection

In this section, we gather the file paths for all images in the dataset. We use the `glob` library to retrieve the paths for both parasitized and uninfected images. The paths are stored in separate lists, which are then combined into a single list for further processing.

- **parasitized_images_paths**: A list of file paths for all parasitized cell images.
- **uninfected_images_paths**: A list of file paths for all uninfected cell images.
- **all_images_paths**: A combined list containing the paths of both parasitized and uninfected images, facilitating streamlined data handling in subsequent steps.


In [None]:
parasitized_images_paths = glob.glob(os.path.join(PARASITIZED_DIR, '*'))
uninfected_images_paths = glob.glob(os.path.join(UNINFECTED_DIR, '*'))

all_images_paths = parasitized_images_paths + uninfected_images_paths

### Image Extensions Count

In this section, we analyze the file extensions of the images in the dataset. We create a dictionary to count the occurrences of each file extension among the collected image paths.

- **extensions**: A dictionary where the keys are file extensions and the values are the counts of how many images have that extension.
- The loop iterates through each image path, extracts the file extension, and updates the count in the dictionary accordingly.

This helps ensure that we are aware of the types of image files present in the dataset, which can be useful for preprocessing steps.


In [None]:
extensions = {}

for image_path in all_images_paths:
    ext = image_path.split('.')[-1] # get the extension of a path such as image.png -> png
    
    if ext in extensions: 
        extensions[ext] += 1
    else:
        extensions[ext] = 1
extensions

The resulting counts are:

- **PNG files**: 27,558
- **DB files**: 2

Almost all files in the dataset are PNG images. The two `.db` files are not images and must be skipped when loading. Knowing which file types are present informs the preprocessing and loading methods we use for the model.
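
The same tally can be written more compactly with `collections.Counter` and `os.path.splitext`. The sketch below runs on a few made-up file names (assumptions for illustration, not files from the dataset) and also shows one way to keep only the PNG paths.

```python
import os
from collections import Counter

# Hypothetical paths for illustration only; in the notebook this would be all_images_paths.
example_paths = [
    'Parasitized/cell_1.png',
    'Parasitized/cell_2.png',
    'Parasitized/Thumbs.db',
    'Uninfected/cell_3.png',
]

# splitext returns (root, '.ext'); strip the dot and lowercase for consistent keys.
extension_counts = Counter(
    os.path.splitext(path)[1].lstrip('.').lower() for path in example_paths
)
print(extension_counts)

# Keep only PNG files for later processing.
png_paths = [p for p in example_paths if p.lower().endswith('.png')]
print(len(png_paths))
```

Unlike splitting on `'.'`, `os.path.splitext` returns an empty extension for files that have none, instead of returning the whole path.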


### Image Dimensions Extraction Function

In this section, we define a function to retrieve the dimensions of images stored in a specified directory. The function processes only PNG images and collects their widths and heights.

- **Function**: `get_images_dimensions(images_dir)`
  - **Input**: Directory path containing images.
  - **Output**: A dictionary with two lists:
    - **widths**: List of widths for all processed images.
    - **heights**: List of heights for all processed images.

The function uses OpenCV to read each image and extract its dimensions, which are important for understanding the input size for the model and ensuring consistency in preprocessing.


In [None]:
def get_images_dimensions(images_dir):
    
    images_paths = glob.glob(os.path.join(images_dir, '*'))

    dimensions = {'widths': [],
                   'heights': []}
    
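    for image_path in images_paths:
        if image_path.split('.')[-1] == 'png':
            image = cv2.imread(image_path)
            if image is None:  # skip files OpenCV cannot read
                continue
            h, w, _ = image.shape  # OpenCV arrays are (height, width, channels)
            dimensions['widths'].append(w)
            dimensions['heights'].append(h)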
            
    return dimensions

In [None]:
parasitized_dimensions = get_images_dimensions(PARASITIZED_DIR)

In [None]:
uninfected_dimensions = get_images_dimensions(UNINFECTED_DIR)
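
When reading dimensions from an OpenCV image, note that the array shape is ordered (height, width, channels), so the first entry is the number of rows. A minimal check, using a synthetic array in place of a real image (the array and its size are assumptions for illustration):

```python
import numpy as np

# Stand-in for an image loaded by cv2.imread: 100 pixels tall, 150 pixels wide, 3 channels.
fake_image = np.zeros((100, 150, 3), dtype=np.uint8)

# Unpack in (height, width, channels) order.
height, width, channels = fake_image.shape
print(f'width={width}, height={height}, channels={channels}')
```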

### Visualization of Image Dimensions

In this section, we create scatter plots to visualize the widths and heights of parasitized and uninfected images. 

- The left plot displays the dimensions of **parasitized images**.
- The right plot shows the dimensions of **uninfected images**.

This visualization helps us understand the distribution of image sizes in the dataset, which is essential for ensuring consistency during preprocessing and model training. Variability in dimensions can affect how data generators handle images.


In [None]:
fig, axes = plt.subplots(ncols=2, figsize=(10, 6))

axes[0].scatter(parasitized_dimensions['widths'], parasitized_dimensions['heights'], alpha=0.4, label='parasitized')
axes[0].set_xlabel('images widths')
axes[0].set_ylabel('images heights')
axes[0].legend()

axes[1].scatter(uninfected_dimensions['widths'], uninfected_dimensions['heights'], color='red',alpha=0.4, label='uninfected')
axes[1].set_xlabel('images widths')
axes[1].set_ylabel('images heights')
axes[1].legend()

fig.suptitle('images widths VS heights')
plt.tight_layout()


### Input Shape for Model

Since the network needs a fixed input size, we resize all images to a common shape of **(128, 128, 3)** (height, width, color channels) for model training.



### Sample Images Visualization

In this section, we display sample images from both the parasitized and uninfected cell datasets.

- The left plot shows a **parasitized cell**.
- The right plot shows an **uninfected cell**.

These visualizations help provide a clearer understanding of the dataset, allowing for a qualitative assessment of the image quality and characteristics before model training.


In [None]:
fig, axes = plt.subplots(ncols=2, figsize=(10, 6))

image = cv2.imread(parasitized_images_paths[0])
image_rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
axes[0].set_title('parasitized cell')
axes[0].imshow(image_rgb)  # show the RGB-converted image, not the BGR original
axes[0].axis('off')


image = cv2.imread(uninfected_images_paths[0])
image_rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
axes[1].set_title('uninfected cell')
axes[1].imshow(image_rgb)  # show the RGB-converted image, not the BGR original
axes[1].axis('off')
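
OpenCV loads images in BGR channel order, while Matplotlib's `imshow` interprets arrays as RGB, which is why the conversion step matters. For a 3-channel image, the conversion amounts to reversing the last axis, as this small NumPy sketch on a synthetic two-pixel image (an assumption for illustration) suggests:

```python
import numpy as np

# Synthetic 1x2 image in BGR order: first pixel pure blue, second pixel pure red.
bgr = np.array([[[255, 0, 0], [0, 0, 255]]], dtype=np.uint8)

# Reversing the channel axis converts BGR to RGB.
rgb = bgr[..., ::-1]

print(rgb[0, 0])  # the blue pixel, now in RGB order
print(rgb[0, 1])  # the red pixel, now in RGB order
```

Displaying the unconverted array would swap the red and blue channels and give the cells a misleading tint.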


### Pixel Value Analysis and Normalization Check

In this section, we analyze the pixel values of sample images from both the parasitized and uninfected datasets by checking their maximum and minimum values:

- The **maximum pixel value** represents the brightest pixel in the image.
- The **minimum pixel value** represents the darkest pixel.

By examining these values, we can determine if the data needs normalization. If the pixel values range from 0 to 255, normalization (e.g., scaling between 0 and 1) may be required to optimize model performance.


In [None]:
print(f' max pixel value of parasitized image {cv2.imread(parasitized_images_paths[0]).max()}')
print(f' min pixel value of parasitized image {cv2.imread(parasitized_images_paths[0]).min()}')
print(f' max pixel value of uninfected image {cv2.imread(uninfected_images_paths[0]).max()}')
print(f' min pixel value of uninfected image {cv2.imread(uninfected_images_paths[0]).min()}')


The pixel values are 8-bit integers on the [0, 255] scale, even when a particular image does not span the full range. We therefore rescale them to [0, 1] by dividing by 255, which gives the model consistently scaled inputs and improves training efficiency.
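
A minimal sketch of this rescaling on a small synthetic array (the pixel values are illustrative, not taken from the dataset):

```python
import numpy as np

# Synthetic uint8 "image" with illustrative values.
pixels = np.array([[0, 64], [128, 255]], dtype=np.uint8)

# Convert to float first, then divide by 255 to map [0, 255] onto [0, 1].
scaled = pixels.astype(np.float32) / 255.0

print(scaled.min(), scaled.max())
```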


# Data Rescaling and Validation Split

In this section, we use Keras's `ImageDataGenerator` for rescaling the pixel values and splitting the dataset for training and validation.

- **Rescaling**: The pixel values are normalized between 0 and 1 using `rescale=1/255.0`.
- **Validation Split**: 20% of the data is reserved for validation using `validation_split=0.2`.

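This generator only normalizes the inputs and splits the data; no data augmentation is applied. The validation subset is stored in `test_data` and serves as the test set in the rest of the notebook.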


In [None]:
data_generator = ImageDataGenerator(rescale=1/255.0, validation_split = 0.2)

In [None]:
# create train data
train_data = data_generator.flow_from_directory(
    directory=DATA_DIR,
    target_size=(128, 128),
    class_mode='binary',
    subset='training'
)

**Note**:  
If the output reports 3 classes instead of 2 (for example, `Found 22048 images belonging to 3 classes`), check that `DATA_DIR` is set to `/kaggle/input/cell-images-for-detecting-malaria/cell_images/cell_images` rather than `/kaggle/input/cell-images-for-detecting-malaria/cell_images`. Unzipping can create an additional nested directory with the same name, which is then treated as a third class, so make sure the path points to the inner directory.

In [None]:
test_data = data_generator.flow_from_directory(
    directory=DATA_DIR,
    target_size=(128, 128),
    class_mode='binary',
    subset='validation'
)
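
As a rough check on the image counts the generators report, the sketch below reproduces the split arithmetic under two assumptions: that each class holds half of the 27,558 PNG files counted earlier, and that the split is taken per class with the validation share truncated to an integer. Both are assumptions about the data and the library's behavior rather than something this notebook verifies.

```python
# Assumed number of PNG images per class (27,558 PNG files split evenly over 2 classes).
images_per_class = 27558 // 2
validation_split = 0.2
num_classes = 2

# Assumption: the validation share is computed per class and truncated to an integer.
val_per_class = int(validation_split * images_per_class)
train_per_class = images_per_class - val_per_class

print(f'training images:   {train_per_class * num_classes}')
print(f'validation images: {val_per_class * num_classes}')
```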

# Convolutional Neural Network (CNN) Model

In this section, we define a simple CNN architecture for binary classification of parasitized and uninfected cells.

- **Conv2D Layers**: 
  - The model starts with two convolutional layers with 32 and 64 filters, each using a `3x3` kernel and ReLU activation.
  - Both layers apply **same padding** to retain image dimensions.
  
- **MaxPooling2D**: After each convolution, max-pooling with a `2x2` window is applied to reduce spatial dimensions.

- **Flatten Layer**: Converts the 2D output into a 1D vector.

- **Dense Layers**: 
  - Two fully connected layers with 128 units and ReLU activation.
  - A **Dropout** layer with a rate of 0.5 to prevent overfitting.
  - A final hidden dense layer with 64 units and ReLU activation.

- **Output Layer**: A single output neuron with a sigmoid activation for binary classification (parasitized or uninfected).

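This model processes the normalized `(128, 128, 3)` image inputs and outputs a single probability. With the class indices used here (`Parasitized` = 0, `Uninfected` = 1), values close to 1 indicate an uninfected cell and values close to 0 a parasitized one.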


In [None]:
cnn_model = Sequential([
    Conv2D(filters=32, kernel_size=3, padding='same', activation='relu',
          input_shape=[128, 128, 3]),
    MaxPooling2D(pool_size=(2,2)),
    Conv2D(filters=64, kernel_size=3, padding='same', activation='relu'),
    MaxPooling2D(pool_size=(2,2)),
    
    Flatten(),
    Dense(128, activation='relu'),
    Dense(128, activation='relu'),
    Dropout(0.5),
    Dense(64, activation='relu'),
    Dense(1, activation='sigmoid')
])

In [None]:
cnn_model.summary()
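
The parameter counts that `summary()` reports can be reproduced by hand. The sketch below walks through the layer shapes described above in plain Python; the helper names are made up for this illustration.

```python
def conv_params(kernel, in_channels, filters):
    # One kernel x kernel x in_channels weight block per filter, plus one bias per filter.
    return kernel * kernel * in_channels * filters + filters


def dense_params(inputs, units):
    # One weight per input-unit pair, plus one bias per unit.
    return inputs * units + units


height, width, channels = 128, 128, 3
total = 0

# Two stages of Conv2D ('same' padding keeps height/width) and MaxPooling2D (halves them).
for filters in (32, 64):
    total += conv_params(3, channels, filters)
    channels = filters
    height, width = height // 2, width // 2

flat = height * width * channels  # size of the Flatten output

# Dense head: 128 -> 128 -> 64 -> 1 (Dropout has no parameters).
inputs = flat
for units in (128, 128, 64, 1):
    total += dense_params(inputs, units)
    inputs = units

print(f'flatten size = {flat}, total parameters = {total}')
```

Almost all parameters sit in the first dense layer, which connects the 65,536 flattened features to 128 units.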

In [None]:
cnn_model.compile(optimizer='adam',
             loss='binary_crossentropy',
             metrics=['accuracy'])

In [None]:

early_stopping = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)

In [None]:
cnn_history = cnn_model.fit(train_data, epochs=20, callbacks=[early_stopping], validation_data=test_data)
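
The stopping rule behind `EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)` can be sketched in plain Python. This is a simplified illustration, not the Keras implementation; the function name and the loss values are made up.

```python
def early_stopping_epoch(val_losses, patience=5):
    """Return (epoch at which training stops, epoch with the best loss)."""
    best_loss = float('inf')
    best_epoch = None
    wait = 0  # epochs since the last improvement

    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch  # stop; weights would be restored from best_epoch

    return len(val_losses) - 1, best_epoch  # ran through all epochs without stopping


# Hypothetical validation losses, one per epoch.
losses = [0.40, 0.30, 0.25, 0.26, 0.27, 0.26, 0.28, 0.29, 0.20]
stopped, best = early_stopping_epoch(losses, patience=5)
print(f'stopped after epoch index {stopped}, best epoch index {best}')
```

In this made-up run, training stops before reaching the last (lowest) loss value, which shows the trade-off that `patience` controls.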

In [None]:
cnn_model.evaluate(train_data)

In [None]:
cnn_model.evaluate(test_data)

### Model Training History Plot

In this section, we define the `plot_history` function to visualize the training and validation accuracy and loss over epochs.


These plots are essential for identifying any signs of overfitting or underfitting during the training process.


In [None]:
def plot_history(history):
    fig, axes = plt.subplots(ncols=2, figsize=(10, 4))
    axes[0].plot(history.history['accuracy'], label='train acc')
    axes[0].plot(history.history['val_accuracy'], label='test acc')

    axes[0].set_xlabel('number of epochs')
    axes[0].set_ylabel('accuracy')
    axes[0].set_title('train vs test accuracy')
    axes[0].legend()
    
    axes[1].plot(history.history['loss'], label='train loss')
    axes[1].plot(history.history['val_loss'], label='test loss')

    axes[1].set_xlabel('number of epochs')
    axes[1].set_ylabel('loss')
    axes[1].set_title('train vs test loss')
    axes[1].legend()
    
    plt.show()
    
    
    

In [None]:
plot_history(cnn_history)

# Fine-Tuning a Pretrained Model


In this section, we define the `finetune_basemodel` function to fine-tune a base model (e.g., VGG16, ResNet) pretrained on the ImageNet dataset for our specific task of binary classification.

- **Base Model**: 
  - The function takes a `base_model` as input (such as VGG16 or ResNet50) and loads pretrained weights from ImageNet.
  - The top layers of the base model are excluded (`include_top=False`), and its layers are frozen (non-trainable), so only the newly added classification layers are trained.

- **Custom Layers**: 
  - After the base model, a `Flatten` layer is added, followed by a dense layer with 128 units and ReLU activation.
  - A **Dropout** layer is used to reduce overfitting, followed by a single output neuron with a sigmoid activation for binary classification.

- **Compilation and Training**: 
  - The model is compiled using the Adam optimizer and binary cross-entropy loss, suitable for binary classification tasks.
  - **Early Stopping** is applied to prevent overfitting by monitoring validation loss with a patience of 5 epochs.

- **Performance Evaluation**: After training, the model evaluates both training and test datasets to print the accuracy results. The function also calls `plot_history` to visualize the accuracy and loss trends.



In [None]:
def finetune_basemodel(base_model, input_shape=(128, 128, 3)):
    base_model = base_model(weights='imagenet', include_top=False, input_shape=input_shape)

    for layer in base_model.layers:
        layer.trainable = False
    model = Sequential()
    model.add(base_model)
    model.add(Flatten())
    model.add(Dense(128, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(1, activation='sigmoid'))
    
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    
    early_stopping = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
    
    history = model.fit(train_data, epochs=20, callbacks=[early_stopping], validation_data=test_data)
    
    # evaluate() returns [loss, accuracy]; index 1 is the accuracy
    print(f'Train Accuracy = {model.evaluate(train_data)[1]}')
    print(f'Test Accuracy = {model.evaluate(test_data)[1]}')
    
    plot_history(history)
    
    return model, history

### Fine-Tuning VGG16 Model
- **VGG16 Overview**: VGG16 is characterized by its deep architecture, consisting of 16 layers with learnable weights. It primarily uses small convolutional filters (3x3) and is known for its simplicity and effectiveness in extracting features from images.

- **Output**: The trained VGG16 model and its training history are stored in the variables `vgg_model` and `vgg_history`, respectively.

This approach reuses the convolutional features that VGG16 learned on ImageNet as a fixed feature extractor for our classification task.


In [None]:
vgg_model, vgg_history = finetune_basemodel(VGG16)

### Fine-Tuning EfficientNetB3 Model

- **EfficientNetB3 Overview**: EfficientNetB3 is part of the EfficientNet family, designed to balance model depth, width, and resolution. It uses a combination of depthwise separable convolutions and squeeze-and-excitation blocks, making it highly efficient for various image classification tasks.


In [None]:
efficientnetb3_model, efficientnetb3_history = finetune_basemodel(EfficientNetB3)

### Fine-Tuning ResNet50 Model

- **ResNet50 Overview**: ResNet50 consists of 50 layers and introduces skip connections, which help mitigate the vanishing gradient problem in deep networks. This architecture allows for training very deep networks while maintaining high performance, making it particularly effective for various image classification tasks.


In [None]:
resnet50_model, resnet50_history = finetune_basemodel(ResNet50)

### Fine-Tuning MobileNetV2 Model

In this section, we fine-tune the **MobileNetV2** architecture, designed for efficient performance on mobile and edge devices.

- **MobileNetV2 Overview**: MobileNetV2 uses an inverted residual structure and depthwise separable convolutions, optimizing both speed and accuracy. Its lightweight design makes it suitable for real-time applications and environments with limited computational resources, making it a popular choice for mobile image classification tasks.


In [None]:
mobilenetv2_model, mobilenetv2_history = finetune_basemodel(MobileNetV2)

# Model Accuracies Comparison



In this section, we evaluate the performance of the different models on both the training and test datasets and compile the results into a DataFrame for comparison.

- **Models Evaluated**: The models included are:
  - Basic CNN
  - VGG16
  - EfficientNetB3
  - ResNet50
  - MobileNetV2

- **Accuracy Calculation**: The train and test accuracies are obtained by evaluating each model on the training and test data. These results are stored in a pandas DataFrame for easy comparison.

- **Sorting**: The DataFrame is sorted by test accuracy to facilitate the comparison of model performance.

This analysis provides a clear overview of which model performs best in terms of accuracy on the test dataset.


In [None]:
models_accuracies_df = pd.DataFrame({
    'Model Name': ['basic_cnn', 'vgg16', 'efficientnetb3', 'resnet50', 'mobilenetv2'],
    'Test Accuracy': [cnn_model.evaluate(test_data)[1], 
                     vgg_model.evaluate(test_data)[1],
                     efficientnetb3_model.evaluate(test_data)[1],
                     resnet50_model.evaluate(test_data)[1],
                     mobilenetv2_model.evaluate(test_data)[1]
                    ],
    'Train Accuracy': [cnn_model.evaluate(train_data)[1], 
                     vgg_model.evaluate(train_data)[1],
                     efficientnetb3_model.evaluate(train_data)[1],
                     resnet50_model.evaluate(train_data)[1],
                     mobilenetv2_model.evaluate(train_data)[1]
                    ],
})

models_accuracies_df

In [None]:
models_accuracies_df = models_accuracies_df.sort_values(by='Test Accuracy', ascending=False)
models_accuracies_df

# Model Selection

MobileNetV2 and VGG16 achieve almost the same accuracy, so either is a reasonable choice. For now, I select the VGG16 model.

In [None]:
best_model = vgg_model

In [None]:
best_model.save('malaria-classification-model-using-vgg16.h5')

# Model Prediction

### Model prediction of a single image

In [None]:
model = load_model('/kaggle/working/malaria-classification-model-using-vgg16.h5')

In [None]:
train_data.class_indices

In [None]:
label_name = ['Parasitized', 'Uninfected']

In [None]:
def get_prediction(image_path, true_label, input_size=(128, 128)):
    image = cv2.imread(image_path) 
    
    image_rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

    image = cv2.resize(image_rgb, input_size)  # use RGB, matching the channel order seen in training
    image = np.expand_dims(image, axis=0)
    image = image / 255.0
    predicted_label = int(np.round(model.predict(image))[0][0])

    
    plt.imshow(image_rgb)
    plt.title(f'predicted is {label_name[predicted_label]}, true is {true_label}')
    plt.axis('off')

In [None]:
get_prediction(parasitized_images_paths[5], 'Parasitized')
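
The prediction step turns the sigmoid output into a class index. A minimal sketch of that mapping with an explicit 0.5 threshold, using made-up probabilities (the values are assumptions, not model outputs):

```python
import numpy as np

label_name = ['Parasitized', 'Uninfected']

# Made-up sigmoid outputs for three images (shape: batch x 1), for illustration only.
probabilities = np.array([[0.03], [0.50], [0.97]])

# Explicit threshold: probabilities of 0.5 or more map to class index 1.
predicted_indices = (probabilities[:, 0] >= 0.5).astype(int)
predicted_names = [label_name[i] for i in predicted_indices]
print(predicted_names)
```

An explicit threshold makes the cut-off visible and easy to change. `np.round` behaves the same except at exactly 0.5, which it rounds down to 0.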

# Future Work
For future improvements, we could explore the following:
- Implementing data augmentation techniques to improve model generalization.
- Fine-tuning hyperparameters for better performance.
- Experimenting with other advanced architectures such as DenseNet or transformers.
