<a href="https://colab.research.google.com/github/AndyMDH/pneumonia_detection_cnn/blob/main/main.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CSCK506 End of Module: Pneumonia Detection through Convolutional Neural Network (CNN)



## Table of Contents
1. [Introduction](#introduction)
2. [Exploratory Data Analysis](#exploratory-data-analysis)
3. [Data Preprocessing](#data-preprocessing)
4. [References](#references)

---
## Introduction

Pneumonia is a potentially life-threatening infection of one or both lungs. It is frequently caused by bacteria, notably *Streptococcus pneumoniae*. Citing the World Health Organization (WHO), Varshni et al. (2019) report that pneumonia is responsible for one in three deaths in India. Medical practitioners often rely on chest X-ray scans to diagnose pneumonia and to distinguish between its bacterial and viral types.

This Jupyter notebook explores automated pneumonia detection using Convolutional Neural Networks (CNNs). Specifically, it trains a CNN model to differentiate between healthy lung scans and those showing pneumonia. The dataset is the Chest X-Ray Images (Pneumonia) collection hosted on Kaggle, which provides chest X-ray images categorised as pneumonia-positive or normal.


**This task involves, but is not limited to:**

a. CNN Model Development:

- Write code to train a CNN model using the provided dataset.
- Objective: Achieve optimal performance in distinguishing between healthy and pneumonia-infected lung images.

    - **Key considerations:**
      - Define CNN architecture, including convolution-pooling blocks.
      - Fine-tune parameters like strides, padding, and activation functions for accuracy.
      - Implement strategies to prevent overfitting and ensure model generalization.

b. Training and Evaluation:

- Train the CNN model using the provided training dataset.
- Fine-tune hyperparameters using validation data to enhance performance.
- Evaluate the model's accuracy using a separate test dataset to validate pneumonia detection in chest X-ray images.
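
The effect of strides and padding on a convolution-pooling block can be worked out by hand before any model is built. The sketch below is illustrative only: `conv_output_size` is a hypothetical helper that is not part of this notebook. It applies the standard output-size formula `floor((n + 2p - k) / s) + 1` for a square input of side `n`, kernel size `k`, padding `p`, and stride `s`. The 28-pixel input is an assumption, chosen to match the reshape used in the preprocessing cell.

```python
def conv_output_size(n: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Spatial output size of a convolution or pooling layer on an n x n input."""
    return (n + 2 * padding - kernel) // stride + 1


# 3x3 convolution, stride 1, no padding ("valid"): 28 -> 26
after_conv = conv_output_size(28, kernel=3)

# 2x2 max pooling with stride 2: 26 -> 13
after_pool = conv_output_size(after_conv, kernel=2, stride=2)

# 3x3 convolution with one pixel of zero padding keeps the size at stride 1
same_conv = conv_output_size(28, kernel=3, padding=1)

print(after_conv, after_pool, same_conv)  # 26 13 28
```

Tracing sizes this way shows how quickly stacked blocks shrink the feature maps, which limits how many convolution-pooling blocks a small input can support.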

This Jupyter Notebook was collaboratively prepared by:

- Minh-Dat Andy Ho Huu
- Santiago Fernandez Blanco
- Ismael Saumtally
- Chi Chuen Wan
- Chui Yi Wong

### Import Dependencies

In [20]:
# Standard library imports
import itertools
import logging
import os
import re
import unicodedata
import urllib.request
from collections import defaultdict
from typing import List, Optional, Set, Tuple
from zipfile import ZipFile

# Related third party imports
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation, Conv2D, MaxPooling2D, Flatten, Dropout, BatchNormalization
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping
from sklearn.metrics import precision_recall_curve, roc_curve, accuracy_score, confusion_matrix, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Warnings configuration
import warnings

We also define a constant that records whether to use the GPU (through CUDA) or the CPU. If you don't have a GPU, set this to `False`. Later cells can consult this flag when deciding where computation should run; note that TensorFlow itself uses an available GPU by default.

In [21]:
USE_CUDA = True
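
Instead of hard-coding the flag, it could be derived from what TensorFlow actually detects. The following is a minimal sketch under that assumption; `gpu_available` is a hypothetical helper, not part of the original notebook, and it relies on `tf.config.list_physical_devices`, which is available in TensorFlow 2.1 and later.

```python
def gpu_available() -> bool:
    """Return True if TensorFlow can see at least one GPU."""
    try:
        import tensorflow as tf
    except ImportError:
        # TensorFlow is not installed, so no GPU can be used through it
        return False
    return len(tf.config.list_physical_devices('GPU')) > 0


# Hypothetical alternative to setting the flag by hand
USE_CUDA = gpu_available()
print(f'GPU available: {USE_CUDA}')
```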

In [22]:
warnings.filterwarnings(action='ignore', category=DeprecationWarning)
warnings.filterwarnings(action='ignore', category=FutureWarning)

### Download Pneumonia Dataset

The dataset can be downloaded from Kaggle: [Chest X-Ray Images (Pneumonia)](https://www.kaggle.com/datasets/paultimothymooney/chest-xray-pneumonia?resource=download)

In [23]:
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

### Load Dataset from Google Drive

In [24]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Define your dataset directory and file path
DATASET_DIR = '/content/drive/My Drive/path/to/dataset'
DATASET_FILE = DATASET_DIR + '/your_dataset_file.ext'

# Check if the dataset file exists
if os.path.exists(DATASET_FILE):
    print(f'{DATASET_FILE} already exists')
else:
    print(f'The dataset file is not found at {DATASET_FILE}. Please make sure it is in the correct location.')


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
The dataset file is not found at /content/drive/My Drive/path/to/dataset/your_dataset_file.ext. Please make sure it is in the correct location.


**Alternatively**, to run this notebook on your local machine, uncomment the code block below. It extracts the X-ray files from a manually downloaded Kaggle archive, or prints download instructions if the archive is missing.

In [25]:
# def download_file(url, destination):
#     try:
#         urllib.request.urlretrieve(url, destination)
#         logger.info(f'Downloaded file from {url} to {destination}')
#     except Exception as e:
#         logger.error(f'Error downloading file: {e}')

# def extract_zip(zip_path, extract_path):
#     try:
#         with ZipFile(zip_path, 'r') as zip_ref:
#             zip_ref.extractall(extract_path)
#         logger.info(f'Extracted {zip_path} to {extract_path}')
#     except Exception as e:
#         logger.error(f'Error extracting zip file: {e}')

# def create_directory(directory):
#     if not os.path.exists(directory):
#         os.makedirs(directory)
#         logger.info(f'Created directory: {directory}')

# DATASET_NAME = 'chest_x_ray'
# DATASET_URL = 'https://www.kaggle.com/datasets/paultimothymooney/chest-xray-pneumonia?resource=download'
# DATASET_DIR = os.path.join(DATASET_NAME)
# DATASET_ZIP = os.path.join(DATASET_DIR, 'archive.zip')
# EXTRACTED_DIR = os.path.join(DATASET_DIR, 'chest_xray')  # assumed top-level folder inside archive.zip

# # Extract the archive once; otherwise explain how to obtain it
# if os.path.exists(EXTRACTED_DIR):
#     print(f'{DATASET_NAME} already exists')
# elif os.path.exists(DATASET_ZIP):
#     extract_zip(DATASET_ZIP, DATASET_DIR)
#     os.remove(DATASET_ZIP)
#     print(f'{DATASET_NAME} extracted')
# else:
#     create_directory(DATASET_DIR)  # make sure the target folder exists for the manual download
#     print(f'To obtain the "{DATASET_NAME}" dataset, please follow these steps:')
#     print(f'1. Manually download the chest X-ray dataset from: {DATASET_URL}')
#     print(f'2. Place the downloaded "archive.zip" file in the "{DATASET_DIR}" folder.')
#     print('3. Rerun this cell after placing the archive in the correct location.')


To obtain the "chest_x_ray" dataset, please follow these steps:
1. Manually download the chest X-ray dataset from: https://www.kaggle.com/datasets/paultimothymooney/chest-xray-pneumonia?resource=download
2. Place the downloaded "archive.zip" file in the "chest_x_ray" folder.
3. Rerun this cell after placing the archive in the correct location.


In [27]:
data_directory = "/content/drive/MyDrive/Liverpool/CSCK506 Deep Learning/End of Module/archive/chest_xray/chest_xray"
train_images = data_directory + "/train"
val_images = data_directory + "/val"
test_images = data_directory + "/test"

# labels
labels = ['NORMAL', 'PNEUMONIA']
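
Each split directory is expected to hold one sub-folder per class, named after the entries in `labels`. As a minimal sketch under that assumption, the hypothetical helper `list_labelled_files` below walks such a layout and pairs every image path with its class index. The demonstration uses a throwaway directory with made-up file names rather than the real dataset, and the accepted file extensions are also an assumption.

```python
import os
import tempfile
from collections import Counter


def list_labelled_files(split_dir, class_names):
    """Return (file_path, class_index) pairs for one split directory."""
    samples = []
    for index, name in enumerate(class_names):
        class_dir = os.path.join(split_dir, name)
        for file_name in sorted(os.listdir(class_dir)):
            # Skip anything that is not an image (e.g. hidden system files)
            if file_name.lower().endswith(('.jpeg', '.jpg', '.png')):
                samples.append((os.path.join(class_dir, file_name), index))
    return samples


# Demonstration on a temporary directory with made-up file names
with tempfile.TemporaryDirectory() as demo_dir:
    for name, count in (('NORMAL', 2), ('PNEUMONIA', 3)):
        os.makedirs(os.path.join(demo_dir, name))
        for i in range(count):
            open(os.path.join(demo_dir, name, f'scan_{i}.jpeg'), 'w').close()
    demo_samples = list_labelled_files(demo_dir, ['NORMAL', 'PNEUMONIA'])

# Count how many samples fall into each class index
per_class = Counter(label for _, label in demo_samples)
print(per_class)
```

Counting samples per class index in this way also gives a quick view of any class imbalance within a split.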

## Exploratory Data Analysis

In [29]:
import seaborn as sns  # required for the bar plot below

# Function to count the image files in a split directory
def count_images(directory):
    # Each split holds one sub-folder per class (see `labels`), so count the
    # image files inside those sub-folders rather than the sub-folders themselves.
    return sum(
        1
        for label in labels
        for file_name in os.listdir(os.path.join(directory, label))
        if file_name.lower().endswith(('.jpeg', '.jpg', '.png'))
    )

# Count the number of images in each directory
train_count = count_images(train_images)
val_count = count_images(val_images)
test_count = count_images(test_images)

# Print the counts
print("Number of training images:", train_count)
print("Number of validation images:", val_count)
print("Number of test images:", test_count)

# Visualize the distribution of images across datasets
plt.figure(figsize=(10, 6))
sns.barplot(x=["Train", "Validation", "Test"], y=[train_count, val_count, test_count])
plt.title("Distribution of Images Across Datasets")
plt.ylabel("Number of Images")
plt.show()

In [None]:
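# Check dataset dimensions
# Assumes the images and labels have already been loaded into NumPy arrays under
# these names; the train_images and test_images variables defined earlier hold
# directory paths only, which have no .shape attribute.
print("Training images shape:", train_images.shape)
print("Training labels shape:", train_labels.shape)
print("Test images shape:", test_images.shape)
print("Test labels shape:", test_labels.shape)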

In [None]:
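# Visualise sample images
# Assumes train_images is an array of images and train_labels holds integer
# class indices (0 = NORMAL, 1 = PNEUMONIA) into the `labels` list defined earlier.
plt.figure(figsize=(10, 10))
for i in range(25):
    plt.subplot(5, 5, i + 1)
    plt.xticks([])
    plt.yticks([])
    plt.grid(False)
    plt.imshow(train_images[i], cmap=plt.cm.binary)
    plt.xlabel(labels[train_labels[i]])
plt.show()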

---
## Data Preprocessing


In [None]:
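from tensorflow.keras.preprocessing.image import ImageDataGenerator  # required by both generator helpers

def create_augmented_data_generator(train_images, train_labels):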
    # Call the ImageDataGenerator class for training data augmentation
    train_datagen = ImageDataGenerator(
        rotation_range=20,
        width_shift_range=0.1,
        height_shift_range=0.1,
        horizontal_flip=True,
        fill_mode='nearest',
        rescale=1./255.  # Rescale training data
    )

    # Pass in arguments to the flow method for training data
    train_data_generator = train_datagen.flow(
        x=train_images.reshape(-1, 28, 28, 1),  # Reshape images to fit NN input shape
        y=train_labels,
        batch_size=32
    )

    return train_data_generator

def create_test_data_generator(test_images, test_labels):
    # Call the ImageDataGenerator class to rescale images for test data (without augmentation)
    test_datagen = ImageDataGenerator(rescale=1./255.)

    # Pass in arguments to the flow method for test data
    test_data_generator = test_datagen.flow(
        x=test_images.reshape(-1, 28, 28, 1),  # Reshape images to fit NN input shape
        y=test_labels,
        batch_size=32,
        shuffle=False  # Keep the original order so predictions line up with test_labels
    )

    return test_data_generator
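
Two of the transforms configured above are simple enough to trace by hand: `rescale=1./255.` multiplies every pixel by 1/255, mapping 8-bit intensities into the range [0, 1], and `horizontal_flip=True` randomly mirrors images left to right. The plain-Python sketch below illustrates that arithmetic on a made-up 2x3 image. The helpers `rescale` and `flip_horizontal` are hypothetical and are not how Keras implements these steps internally.

```python
def rescale(image, factor=1.0 / 255.0):
    """Multiply every pixel by `factor`."""
    return [[pixel * factor for pixel in row] for row in image]


def flip_horizontal(image):
    """Mirror the image left to right by reversing each row."""
    return [row[::-1] for row in image]


# Made-up 2x3 grayscale image with 8-bit intensities
image = [
    [0, 51, 255],
    [102, 204, 153],
]

scaled = rescale(image)           # every value now lies in [0, 1]
flipped = flip_horizontal(image)  # the first and last columns swap places

print(scaled[0])
print(flipped[0])  # [255, 51, 0]
```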

---
### References:

Varshni, D., Thakral, K., Agarwal, L., Nijhawan, R. and Mittal, A. (2019). Pneumonia Detection Using CNN based Feature Extraction. [online] IEEE Xplore. doi:https://doi.org/10.1109/ICECCT.2019.8869364.