# X-Ray Classification of Pneumonia using a Neural Network

#### Authors: Mitch Allison, Jordan Mang

## Project Overview:

Pneumonia affects nearly one out of every fifteen people worldwide annually. For most people, contracting pneumonia isn't life-threatening, but of the nearly half a billion people who contract it, 2.5 million die from this treatable illness. Seniors and children under 5 years old are most at risk, so for them it is important to identify the illness in its early stages, before it has time to develop into a life-threatening condition.


## Business Problem:

St. Jude's Children's Hospital recognizes that pneumonia is not a leading cause of death among children in America, but that worldwide a child dies of this treatable disease every 43 seconds. Most of these deaths occur in Sub-Saharan African countries such as the Congo, Guinea, and the Central African Republic.

By partnering with Europa, a company that specializes in handheld X-ray machines, St. Jude's plans to send simple laptops and handheld X-ray machines to countries with high pneumonia-related mortality rates.

Due to a lack of medical staff who can properly identify pneumonia, St. Jude's has asked us to build a machine learning model that can quickly and accurately identify pneumonia from X-ray images.

## Data Understanding Part 1:

Our model was trained on nearly 6,000 chest X-rays obtained from [Kaggle.com](https://www.kaggle.com/datasets/paultimothymooney/chest-xray-pneumonia). The data was originally gathered by the Guangzhou Women and Children’s Medical Center and consists of X-rays of children one to five years of age.

These photos have been pre-cleaned and sorted by class (healthy/pneumonia).

## Dummy Model:

For our first model, we created a dummy model that predicts the majority class (pneumonia) for all input photos. (More info can be found in the Dummy_model notebook.)

We also created a function that loads the pictures into the notebook at a resized resolution, which we reuse for the more complex models later on.

In [1]:
# Dummy model imports
from matplotlib import pyplot as plt

import numpy as np

from keras.preprocessing.image import ImageDataGenerator

from sklearn.metrics import confusion_matrix
from sklearn.dummy import DummyClassifier
from sklearn import metrics

In [2]:
def Get_Photo_Data(location, num_photos):
    '''
    Returns photos from data folder(resized, grayscaled) and binary class.
    
    '''
    datagen = ImageDataGenerator(rescale=1./255)
    
    data = datagen.flow_from_directory(
        location,
        target_size=(150, 150),
        batch_size=num_photos,
        color_mode='grayscale',
        class_mode='binary'
    )
    
    return data

In [3]:
#getting images and labels for models
train_photos = Get_Photo_Data('./data/archive/chest_xray/train/', 5216)
test_photos = Get_Photo_Data('./data/archive/chest_xray/test/', 624)
val_photos = Get_Photo_Data('./data/archive/chest_xray/val/', 16)

# unpack images and labels for CM/dummy model
# (batch_size equals the folder size, so one call to next() returns every image)
train_data, train_labels = next(train_photos)
test_data, test_labels = next(test_photos)
val_data, val_labels = next(val_photos)

# create DummyModel on most frequent class
dummy_model = DummyClassifier(strategy='most_frequent')
dummy_model.fit(train_data, train_labels)

Found 5216 images belonging to 2 classes.
Found 624 images belonging to 2 classes.
Found 16 images belonging to 2 classes.


DummyClassifier(strategy='most_frequent')

In [4]:
# creating predictions to evaluate model
y_pred = dummy_model.predict(test_data)

# getting metrics for model
acc = dummy_model.score(test_data, test_labels)
rec = metrics.recall_score(test_labels,y_pred)
pre = metrics.precision_score(test_labels,y_pred)

print(f"Dummy Model accuracy: {acc}")
print(f"Dummy Model recall: {rec}")
print(f"Dummy Model precision: {pre}")

Dummy Model accuracy: 0.625
Dummy Model recall: 1.0
Dummy Model precision: 0.625


## Dummy Model Performance Metrics:

#### NOTE: All model performance metrics are on the test set of unseen data.

![DM_CM](./graphs/CM_DUMMY.jpg)

We can see that our dummy model has an accuracy of 62.5%, as this model predicts the majority class only.
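The dummy model's metrics follow directly from the class counts in the test set. The sketch below reproduces them from a confusion matrix, assuming the 624 test images contain 390 pneumonia and 234 normal cases (counts inferred from the reported 0.625 accuracy, not read from the data folder). The helper name `binary_metrics` is ours and is for illustration only.

```python
def binary_metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Assumed test-set composition (inferred from 0.625 * 624 = 390).
n_pneumonia, n_normal = 390, 234

# A majority-class model labels every image as pneumonia, so every
# pneumonia image is a true positive and every normal image a false positive.
dummy_accuracy, dummy_precision, dummy_recall = binary_metrics(
    tp=n_pneumonia, fp=n_normal, fn=0, tn=0
)

print(f"accuracy={dummy_accuracy}, precision={dummy_precision}, recall={dummy_recall}")
```

Because the model never predicts the negative class, accuracy and precision both collapse to the share of pneumonia images, while recall is trivially perfect.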

The next step is to create a first simple model (FSM) using a neural network.

## FSM:

For our FSM, we created the simplest possible neural network. We loaded in the data without using the function.

The FSM neural network detailed below had the following layers:
1. Input: Conv2D layer taking in a 2D color photo of shape 256x256 (activation=relu)
2. Pooling layer for Conv2D layer
3. Flatten layer
4. Hidden dense layer (activation=relu)
5. Output layer (activation=sigmoid)
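To get a feel for the size of this network, the sketch below counts its trainable parameters by hand. It assumes the settings used in the FSM code in this notebook: 16 filters with a 3x3 kernel and `padding='same'`, 2x2 max pooling, and 16 hidden units. The helper functions are illustrative, not part of Keras.

```python
def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    """Kernel weights plus one bias per filter."""
    return kernel_h * kernel_w * in_channels * filters + filters


def dense_params(in_features, units):
    """One weight per input-unit pair plus one bias per unit."""
    return in_features * units + units


height, width, channels = 256, 256, 3
filters, hidden_units = 16, 16

conv_params = conv2d_params(3, 3, channels, filters)

# padding='same' with stride 1 keeps the 256x256 size;
# 2x2 max pooling then halves each spatial dimension.
pooled_h, pooled_w = height // 2, width // 2
flat_features = pooled_h * pooled_w * filters

hidden_params = dense_params(flat_features, hidden_units)
output_params = dense_params(hidden_units, 1)
total_params = conv_params + hidden_params + output_params

print(f"conv: {conv_params}, hidden dense: {hidden_params}, "
      f"output: {output_params}, total: {total_params}")
```

Almost all of the parameters sit in the first dense layer, because flattening a 128x128x16 feature map yields 262,144 inputs.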

We fit the model with a batch size of 30, due to the file size, and ran 10 epochs.

Our FSM code is included below, but it is commented out for the following reasons:
1. Loading the 1.6 GB of photos multiple times can cause memory issues and takes time
2. Fitting the model takes some time

In [5]:
# FSM imports
import numpy as np
from matplotlib import pyplot as plt
from sklearn.metrics import classification_report,confusion_matrix

import tensorflow as tf
from tensorflow.keras import Sequential, datasets, layers, models, optimizers, regularizers
from tensorflow.keras.metrics import Precision, Recall, BinaryAccuracy

from tensorflow.keras.layers import Conv2D, MaxPooling2D, Dense, Flatten, Dropout

import cv2

import os

In [6]:
# Get photos for model training
# train = tf.keras.utils.image_dataset_from_directory('./data/archive/chest_xray/chest_xray/train')
# test = tf.keras.utils.image_dataset_from_directory('./data/archive/chest_xray/chest_xray/test')
# val = tf.keras.utils.image_dataset_from_directory('./data/archive/chest_xray/chest_xray/val')

In [7]:
# instantiate model
model = Sequential()

# Add layers to model
# input layer
model.add(Conv2D(16, (3,3), 1, activation='relu', padding = 'same', input_shape=(256,256,3)))

# add pooling layer(takes max from input window)
model.add(layers.MaxPooling2D((2,2)))

# flattens 2d to 1d
model.add(layers.Flatten())

# add dense layer
model.add(layers.Dense(16, activation='relu'))

# add output layer
model.add(layers.Dense(1, activation='sigmoid'))

# compile model with adam for binary model
model.compile('adam', loss=tf.losses.BinaryCrossentropy(), metrics=['accuracy'])

In [8]:
# fit model
# history = model.fit(train,
#                batch_size=30,
#                epochs=10,
#                validation_data=(val))

## FSM Performance Metrics:

![FSM_LOSS](./graphs/LOSS_FSM.jpg)

![FSM_ACCURACY](./graphs/ACCURACY_FSM.jpg)

Our FSM is an improvement over the dummy model, with a precision of 71%, accuracy of 74%, and recall of 98%. It's expected that recall will drop a bit as the model will likely not catch 100% of positive cases unless it is overfit.

We can see from the curves on the graphs above that running additional epochs is unlikely to improve model performance.
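One way to make that judgement less dependent on eyeballing the curves is a patience check on the validation loss, similar in spirit to the Keras `EarlyStopping` callback. The sketch below is plain Python, and the loss values are hypothetical, not taken from our training history.

```python
def epochs_since_improvement(val_losses, min_delta=0.0):
    """Count epochs elapsed since the validation loss last improved
    by more than min_delta."""
    best = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best = loss
            best_epoch = epoch
    return len(val_losses) - 1 - best_epoch


# Hypothetical validation losses for 10 epochs (illustration only).
hypothetical_val_loss = [0.62, 0.48, 0.41, 0.39, 0.40, 0.39, 0.41, 0.40, 0.42, 0.41]

patience = 3
stalled_for = epochs_since_improvement(hypothetical_val_loss, min_delta=0.01)
should_stop = stalled_for >= patience

print(f"epochs without improvement: {stalled_for}, stop training: {should_stop}")
```

With a real run, the list would come from `history.history['val_loss']`.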

We can also see that our model is only marginally better than the dummy model, so further iteration is needed to improve it.

## Model Iteration:

After creating our FSM, we continued to iterate to create more accurate models. The code for these models can be found in the model notebooks numbered #2 - #6. We changed only a couple of things per iteration so that everything else stayed constant. If we saw an improvement, we kept the change and noted why it was kept.

2. Nodes in initial layer increased to 32.
3. 
    A. Added the function to load photos back in. (kept: simplifies the load-in process)
    B. Changed input shape to (150, 150, 1). (kept: model speed was much faster)
    C. Input layer nodes changed to 64. (kept: model metrics marginally improved)
4. 
    A. Decreased batch size by half (30 to 15).
    B. Increased epochs (10 to 20).
5. Added a 2nd convolutional stack (Conv2D + MaxPooling2D) after the 1st convolutional layer. (kept: model metrics marginally improved)
6. Added 0.15 dropout to the dense layer. (kept: model metrics marginally improved)
7. Combined and re-split train/test/validate to get more suitable numbers for each. (kept: achieved final model performance metrics)

![MODEL_ITERATION](./graphs/MODEL_ITERATION.jpg)

For most models tested during the iteration process, performance improved only slightly, or improved on one metric at the expense of another (e.g. trading off 0.005 accuracy for 0.005 loss). It wasn't until we changed the dataset that the model reached the performance we were looking for, without becoming overly complex.

## Data Understanding Part 2:

The original set has a class imbalance of 1 (Normal) : 2.7 (Pneumonia). The provided data was pre-split into train/test/validate sets, but with only 16 images in the validation set.

When we combined the images and re-split them into train/test/validation sets at a ratio of .65/.2/.15, our model performance improved dramatically and reached the metrics we were looking for. After combining the dataset manually, you can use the following function to split the data (note that `splitfolders.ratio` takes its ratios in train/validation/test order).

This code is commented out so as not to split the data again locally.

In [9]:
# This tool will need to be pip installed if not installed already
# !pip install split-folders

# Import
# import splitfolders

# splitfolders.ratio("./data/combined/", output="combined_ttv", seed=42, ratio=(.65, .15, .2), group_prefix=None, move=True)

## Final Model Code:

We've put the code required here for the final model, but have commented out code where required to save time and memory.

The full final model code can be found in the FinalModel notebook.

In [10]:
# Imports
import numpy as np
from matplotlib import pyplot as plt
from sklearn.metrics import classification_report,confusion_matrix
from sklearn import metrics

import tensorflow as tf
from tensorflow.keras import Sequential, datasets, layers, models, optimizers, regularizers
from tensorflow.keras.metrics import Precision, Recall, BinaryAccuracy

from tensorflow.keras.layers import Conv2D, MaxPooling2D, Dense, Flatten, Dropout
from keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.models import load_model

import cv2

import os

In [11]:
def Get_Photo_Data(location, num_photos):
    '''
    Returns photos from data folder(resized, grayscaled) and binary class.
    
    '''
    datagen = ImageDataGenerator(rescale=1./255)
    
    data = datagen.flow_from_directory(
        location,
        target_size=(150, 150),
        batch_size=num_photos,
        color_mode='grayscale',
        class_mode='binary'
    )
    
    return data

In [12]:
#getting images and labels for models
# train_photos = Get_Photo_Data('./combined_ttv/train/', 3805)
# test_photos = Get_Photo_Data('./combined_ttv/test/', 1174)
# val_photos = Get_Photo_Data('./combined_ttv/val/', 877)

# # unpack images and labels for CM/dummy model
# train_data, train_labels = next(train_photos)
# test_data, test_labels = next(test_photos)
# val_data, val_labels = next(val_photos)
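As a quick sanity check on the re-split, the image counts passed to `Get_Photo_Data` above can be compared against the intended .65/.2/.15 ratio. The sketch below uses only the counts quoted in this notebook. The variable names are ours.

```python
# Image counts per split, as passed to Get_Photo_Data for the combined dataset.
split_counts = {"train": 3805, "test": 1174, "val": 877}

# Image counts of the original pre-split Kaggle folders (train, test, val).
original_total = 5216 + 624 + 16

total_images = sum(split_counts.values())
fractions = {name: count / total_images for name, count in split_counts.items()}

for name, fraction in fractions.items():
    print(f"{name}: {split_counts[name]} images ({fraction:.3f} of {total_images})")

print("same total as original split:", total_images == original_total)
```

The validation set grows from 16 images to 877, which makes the validation metrics far less noisy from epoch to epoch.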

In [13]:
model = Sequential()


# input layer
model.add(Conv2D(64, (3,3), 1, activation='relu', padding = 'same', input_shape=(150,150,1)))

# add pooling layer(takes max from input window)
model.add(layers.MaxPooling2D((2,2)))

# add 2nd Conv2D layer
model.add(Conv2D(32, (3,3), 1, activation='relu', padding = 'same'))

# add pooling layer(takes max from above Conv2D layer)
model.add(layers.MaxPooling2D((2,2)))

# flattens 2d to 1d
model.add(layers.Flatten())

# add dense layer
model.add(layers.Dense(16, activation='relu'))

# add dropout layer
model.add(Dropout(0.15))

# add output layer
model.add(layers.Dense(1, activation='sigmoid'))

# compile model with adam for binary model
model.compile('adam', loss=tf.losses.BinaryCrossentropy(), metrics=['accuracy',
                                                                    Precision(name='precision'),
                                                                    Recall(name='recall')])

In [14]:
# fit model
# history = model.fit(train_data,
#                train_labels,
#                batch_size=30,
#                epochs=10,
#                validation_data=(val_data, val_labels))

## Final Model Performance Metrics:

![FM_LOSS](./graphs/LOSS_FINAL.jpg)

![FM_ACCURACY](./graphs/ACCURACY_FINAL.jpg)

Our final model achieves metrics that we consider satisfactory given the model's complexity and the time available for the project. We have saved this model in the model folder for outside use.

- Accuracy: 0.95
- Precision: 0.96
- Recall: 0.98

Of these metrics, the most important to us is recall. Given the use case, we are more comfortable with additional false positives than with false negatives: it is safer to flag a healthy patient as having pneumonia than to miss a real case.
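If recall ever needed to be pushed higher, one option (not something we did in this project) would be to lower the decision threshold applied to the sigmoid output, accepting more false positives in return. The sketch below illustrates the trade-off with hypothetical scores and labels. Nothing here is real model output.

```python
def precision_recall_at(threshold, scores, labels):
    """Precision and recall when scores >= threshold are labelled positive."""
    predictions = [1 if score >= threshold else 0 for score in scores]
    tp = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(predictions, labels) if p == 0 and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical sigmoid outputs and true labels (1 = pneumonia).
scores = [0.95, 0.80, 0.65, 0.45, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 1, 1, 0, 1, 0, 0]

default_precision, default_recall = precision_recall_at(0.5, scores, labels)
lowered_precision, lowered_recall = precision_recall_at(0.25, scores, labels)

print(f"threshold 0.50: precision={default_precision:.2f}, recall={default_recall:.2f}")
print(f"threshold 0.25: precision={lowered_precision:.2f}, recall={lowered_recall:.2f}")
```

In this toy example, lowering the threshold catches every positive case but lets one healthy patient through as a false positive.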

## Next Steps:

1. Continue to iterate for better model performance: With additional time, we could continue to iterate on the model to see whether its metrics can be improved.
2. Augment image data: Data is king with neural networks, and although we have more than a gigabyte of images to train the neural network on, more data is always better. By augmenting the given photos (applying shake, rotate, inversion, etc.), we could increase our dataset fourfold.
3. Deploy model with hand-held X-ray machines: We could deploy this model alongside hand-held X-ray machines so that those in need can have pneumonia identified from an X-ray without trained medical staff on site.
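As a rough illustration of the augmentation idea in step 2, the sketch below builds three variants of each image with NumPy, giving four times as many images as the input batch. The specific transforms (horizontal flip, small horizontal shift, intensity inversion) are our assumptions for the example, and the batch is random data standing in for real X-rays. In practice, the augmentation options of Keras's `ImageDataGenerator` (such as `rotation_range`, `width_shift_range`, and `horizontal_flip`) may be a better fit.

```python
import numpy as np


def augment_batch(images, shift_pixels=5):
    """Return the originals followed by three simple variants of each image.

    images: array of shape (n, height, width, 1) with values in [0, 1].
    """
    flipped = images[:, :, ::-1, :]                  # mirror left-right
    shifted = np.roll(images, shift_pixels, axis=2)  # horizontal shift (wraps around the edge)
    inverted = 1.0 - images                          # intensity inversion
    return np.concatenate([images, flipped, shifted, inverted], axis=0)


# Random stand-in for a batch of 150x150 grayscale X-rays (not real data).
rng = np.random.default_rng(0)
fake_batch = rng.random((8, 150, 150, 1))

augmented = augment_batch(fake_batch)
print(fake_batch.shape, "->", augmented.shape)
```

The labels would need to be repeated in the same order, for example with `np.tile(labels, 4)`, so that each variant keeps the class of its source image.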