# Pneumonia Classification

Pneumonia has a mortality rate of between 5% and 10%, making the infection extremely dangerous, especially in poorer nations with limited health care infrastructure.

A machine learning model could identify pneumonia quickly, allowing treatment to be administered sooner and saving lives.

Because the data consists of images, convolutional neural networks (CNNs) appear to be the most effective method for classifying it.

## Basic Problem Analysis
This is a supervised binary classification problem with two classes: pneumonia or healthy.

The data is in the form of images, which are fed into the model; the model then returns the predicted class probability.
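
As a minimal sketch of how a returned probability becomes a class label: a single sigmoid output can be read as the probability of the positive class and compared against a threshold. The 0.5 threshold, the `to_label` helper, and the label order (healthy = 0, pneumonia = 1) are assumptions made for illustration, not values taken from the notebook.

```python
# Hypothetical helper: turn a sigmoid probability into a class label.
# Assumes label 1 means pneumonia and label 0 means healthy.
CLASS_NAMES = {0: "healthy", 1: "pneumonia"}


def to_label(probability, threshold=0.5):
    """Return the class name for a single predicted probability."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    return CLASS_NAMES[int(probability >= threshold)]


# Illustrative probabilities, not real model outputs.
for p in (0.08, 0.5, 0.93):
    print(f"p(pneumonia) = {p:.2f} -> {to_label(p)}")
```

Lowering the threshold trades precision for recall, which may be preferable when a missed pneumonia case is costlier than a false alarm.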

Dataset: https://www.kaggle.com/paultimothymooney/chest-xray-pneumonia

In [None]:
! pip install kaggle
! mkdir ~/.kaggle
! cp kaggle.json ~/.kaggle/
! chmod 600 ~/.kaggle/kaggle.json
! kaggle datasets download paultimothymooney/chest-xray-pneumonia
! unzip chest-xray-pneumonia.zip

In [14]:
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
from tensorflow.keras.preprocessing.image import ImageDataGenerator

In [15]:
%load_ext tensorboard

The tensorboard extension is already loaded. To reload it, use:
  %reload_ext tensorboard


In [16]:
!rm -rf ./logs/

In [17]:
BATCH_SIZE = 32
IMAGE_SIZE = (150, 150)
train_dir = '/content/chest_xray/train'
valid_dir = '/content/chest_xray/val'
test_dir = '/content/chest_xray/test'

# Data Augmentation

Because the dataset is small, augmenting the data is necessary to avoid overfitting.

The transformations decided on are:
* Rescale pixel values to between 0 and 1
* Rotation
* Shear
* Zoom
* Width shift
* Height shift
* Horizontal flip
* Fill newly created pixels with the nearest pixel value
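
To make a few of these transformations concrete, here is a small NumPy sketch on a toy 2x3 "image". It only mirrors the idea behind rescaling, horizontal flipping, and a width shift with nearest-pixel fill; it is not how `ImageDataGenerator` is implemented, and the helper names are made up for illustration.

```python
import numpy as np


def rescale(image):
    """Map 8-bit pixel values from [0, 255] to [0, 1]."""
    return image.astype(np.float32) / 255.0


def horizontal_flip(image):
    """Mirror the image left to right (reverse the column axis)."""
    return image[:, ::-1]


def shift_right_nearest(image, pixels):
    """Shift columns right, filling the gap by repeating the edge column."""
    if pixels == 0:
        return image.copy()
    padded = np.pad(image, ((0, 0), (pixels, 0)), mode="edge")
    return padded[:, : image.shape[1]]


# Toy grayscale image with made-up pixel values.
toy = np.array([[0, 51, 255],
                [102, 153, 204]], dtype=np.uint8)

print(rescale(toy))
print(horizontal_flip(toy))
print(shift_right_nearest(toy, 1))
```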

In [18]:
train_datagen = ImageDataGenerator(rescale = 1./255,
                                   rotation_range = 40,
                                   shear_range = 0.2,
                                   zoom_range = 0.2,
                                   width_shift_range = 0.2,
                                   height_shift_range = 0.2,
                                   horizontal_flip = True,
                                   fill_mode = 'nearest')

datagen = ImageDataGenerator(rescale=1./255)

train_data = train_datagen.flow_from_directory(train_dir, class_mode='binary', batch_size=BATCH_SIZE, shuffle=True, target_size=IMAGE_SIZE)
valid_data = datagen.flow_from_directory(valid_dir, class_mode='binary', batch_size=BATCH_SIZE, shuffle=True, target_size=IMAGE_SIZE)
test_data = datagen.flow_from_directory(test_dir, class_mode='binary', batch_size=BATCH_SIZE, target_size=IMAGE_SIZE)

Found 5216 images belonging to 2 classes.
Found 16 images belonging to 2 classes.
Found 624 images belonging to 2 classes.
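
The counts above determine how many batches each generator yields per epoch. A quick check of that arithmetic, using the image counts printed above and the `BATCH_SIZE` of 32 (the `batches_per_epoch` helper is introduced here for illustration only):

```python
import math

BATCH_SIZE = 32


def batches_per_epoch(num_images, batch_size=BATCH_SIZE):
    """Batches needed to see every image once (the last batch may be partial)."""
    return math.ceil(num_images / batch_size)


for name, count in {"train": 5216, "val": 16, "test": 624}.items():
    print(f"{name}: {count} images -> {batches_per_epoch(count)} batches")
```

With only 16 validation images, the whole validation split fits in a single partial batch, so metrics computed on it are likely to be noisy.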


In [19]:
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Conv2D, MaxPool2D, Dense, Dropout, Flatten, BatchNormalization
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.losses import BinaryCrossentropy
import datetime

In [24]:
# Use the backend bundled with TensorFlow, matching the other tensorflow.keras imports
from tensorflow.keras import backend as K

def recall_m(y_true, y_pred):
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    possible_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
    recall = true_positives / (possible_positives + K.epsilon())
    return recall

def precision_m(y_true, y_pred):
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))
    precision = true_positives / (predicted_positives + K.epsilon())
    return precision

def f1_m(y_true, y_pred):
    precision = precision_m(y_true, y_pred)
    recall = recall_m(y_true, y_pred)
    return 2*((precision*recall)/(precision+recall+K.epsilon()))
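
For intuition, the same quantities can be worked out by hand on a handful of labels. The sketch below is plain Python rather than Keras backend code; the labels and probabilities are made-up illustration values, predictions are rounded to 0 or 1 as in the functions above, and `EPSILON` is set to 1e-7 to stand in for `K.epsilon()`.

```python
EPSILON = 1e-7  # stand-in for K.epsilon(); avoids division by zero


def precision_recall_f1(y_true, y_prob):
    """Compute precision, recall and F1 for binary labels and probabilities."""
    y_pred = [round(p) for p in y_prob]  # round probabilities to 0 or 1
    true_pos = sum(t * p for t, p in zip(y_true, y_pred))
    predicted_pos = sum(y_pred)
    possible_pos = sum(y_true)
    precision = true_pos / (predicted_pos + EPSILON)
    recall = true_pos / (possible_pos + EPSILON)
    f1 = 2 * precision * recall / (precision + recall + EPSILON)
    return precision, recall, f1


# Made-up example: 4 pneumonia cases (1) and 4 healthy cases (0).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_prob = [0.9, 0.8, 0.7, 0.2, 0.6, 0.7, 0.3, 0.4]

precision, recall, f1 = precision_recall_f1(y_true, y_prob)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Here 3 of the 5 predicted positives are correct (precision 0.60) and 3 of the 4 actual positives are found (recall 0.75). Because Keras averages function-based metrics over batches, the values reported during training are batch-wise approximations rather than exact dataset-level scores.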

# Callbacks

Early stopping and model checkpointing make it possible to keep the best weights of the model and to stop training before overfitting degrades the model's performance.

TensorBoard is used for easy visualisation of the model's scores and outputs.
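
The interplay of early stopping and checkpointing can be sketched in plain Python. This is a simplified illustration of patience-based stopping with best-epoch tracking, not the Keras implementation; the `simulate_early_stopping` helper and the loss values are hypothetical.

```python
def simulate_early_stopping(val_losses, patience=5):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses.

    Epochs are 1-indexed. Training stops once the loss has failed to
    improve for `patience` consecutive epochs, or when the losses run out.
    """
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # New best: this is the point where a checkpoint would be saved.
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch


# Hypothetical validation losses: improvement stalls after epoch 3.
losses = [0.60, 0.45, 0.40, 0.42, 0.41, 0.43, 0.44, 0.46, 0.47]
stop_epoch, best_epoch = simulate_early_stopping(losses, patience=5)
print(f"stopped after epoch {stop_epoch}, best weights from epoch {best_epoch}")
```

In this toy run training stops after epoch 8, while the weights worth keeping come from epoch 3, which is why the two mechanisms are used together.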

In [20]:
# Callbacks
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint, TensorBoard

# Timestamped log directory so each run shows up separately in TensorBoard
log_dir = "logs/fit/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")

tensorboard_callback = TensorBoard(log_dir=log_dir, histogram_freq=1)

callbacks = [
    # Stop once val_loss has not improved for 5 consecutive epochs
    EarlyStopping(monitor='val_loss', patience=5),
    # Save the model only when val_loss improves
    ModelCheckpoint(filepath='best_model.h5', monitor='val_loss', save_best_only=True),
    tensorboard_callback,
]


# Models

In [21]:
model_1 = Sequential([

    Conv2D(64, 3, padding='same', activation='relu'),
    Conv2D(64, 3, padding='same', activation='relu'),
    MaxPool2D(2),

    Conv2D(128, 3, padding='same', activation='relu'),
    Conv2D(128, 3, padding='same', activation='relu'),
    MaxPool2D(2),

    Conv2D(256, 3, padding='same', activation='relu'),
    Conv2D(256, 3, padding='same', activation='relu'),
    MaxPool2D(2),

    Conv2D(512, 3, padding='same', activation='relu'),
    Conv2D(512, 3, padding='same', activation='relu'),
    MaxPool2D(2),

    Flatten(),
    Dense(256, activation='relu'),
    Dense(128, activation='relu'),
    Dense(1, activation='sigmoid')
])

model_1.compile(
    optimizer=Adam(),
    loss=BinaryCrossentropy(),
    metrics=['accuracy', recall_m, precision_m, f1_m]
)
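
As a rough check on the size of this network, the spatial dimensions can be traced by hand. The sketch assumes 150x150 inputs (the `IMAGE_SIZE` set earlier), that `padding='same'` convolutions with stride 1 keep height and width unchanged, and that each `MaxPool2D(2)` halves them, rounding down. The helper name is made up for illustration.

```python
def trace_model_1(input_size=150, filters=(64, 128, 256, 512)):
    """Trace the feature-map side length through model_1's conv blocks."""
    size = input_size
    for block_filters in filters:
        # 'same'-padded 3x3 convolutions leave the side length unchanged;
        # MaxPool2D(2) halves it, rounding down.
        size //= 2
        print(f"after block with {block_filters} filters: "
              f"{size}x{size}x{block_filters}")
    flattened = size * size * filters[-1]
    return size, flattened


size, flattened = trace_model_1()
# Weights plus biases of the first Dense(256) layer after Flatten().
dense_256_params = flattened * 256 + 256
print(f"flattened features: {flattened}")
print(f"Dense(256) parameters: {dense_256_params:,}")
```

Under these assumptions the first dense layer alone holds over ten million parameters, which is worth keeping in mind given the small dataset.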


In [None]:
# Note: the test set is passed as validation data here; valid_data (16 images) is not used
model_1.fit(train_data, epochs=25, validation_data=test_data, callbacks=callbacks)

In [None]:
model_1.evaluate(test_data)

In [None]:
%tensorboard --logdir logs/fit

In [None]:
model_2 = Sequential([

    Conv2D(32, (3, 3), activation='relu', input_shape=(150, 150, 3)),
    MaxPool2D((2, 2)),
    
    Conv2D(32, (3, 3), activation='relu'),
    MaxPool2D((2, 2)),

    Flatten(),
    Dense(128, activation='relu'),
    Dense(64, activation='relu'),

    Dense(1, activation='sigmoid')
])
model_2.compile(
    optimizer=Adam(),
    loss=BinaryCrossentropy(),
    metrics=['accuracy', recall_m, precision_m, f1_m]
)
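
A parameter count for this smaller model can be worked out from the layer definitions. The sketch assumes Keras defaults that are not spelled out in the cell: `padding='valid'` and stride 1 for `Conv2D`, so each 3x3 convolution trims 2 pixels from height and width, and pooling that rounds down. The helper functions are illustrative.

```python
def conv_params(kernel, in_channels, filters):
    """Weights plus biases of a Conv2D layer with a square kernel."""
    return kernel * kernel * in_channels * filters + filters


def dense_params(inputs, units):
    """Weights plus biases of a Dense layer."""
    return inputs * units + units


size, channels, total = 150, 3, 0
for filters in (32, 32):
    total += conv_params(3, channels, filters)
    size = (size - 2) // 2  # 3x3 'valid' conv, then MaxPool2D((2, 2))
    channels = filters

print(f"feature map before Flatten: {size}x{size}x{channels}")

features = size * size * channels  # output length of Flatten()
for units in (128, 64, 1):
    total += dense_params(features, units)
    features = units

print(f"total parameters: {total:,}")
```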

In [None]:
model_2.fit(train_data, epochs=25, validation_data=test_data, callbacks=callbacks)

In [None]:
model_2.evaluate(test_data)

In [None]:
%tensorboard --logdir logs/fit

In [27]:
model_3 = Sequential([
    Conv2D(16, 3, padding='same', input_shape=(150, 150, 3), activation='relu'),
    Conv2D(16, 3, padding='same', activation='relu'),
    MaxPool2D(2),

    Conv2D(32, 3, padding='same', activation='relu'),
    Conv2D(32, 3, padding='same', activation='relu'),
    MaxPool2D(2),

    Conv2D(64, 3, padding='same', activation='relu'),
    Conv2D(64, 3, padding='same', activation='relu'),
    MaxPool2D(2),

    Conv2D(128, 3, padding='same', activation='relu'),
    Conv2D(128, 3, padding='same', activation='relu'),
    MaxPool2D(2),
    Dropout(.2),

    Conv2D(256, 3, padding='same', activation='relu'),
    Conv2D(256, 3, padding='same', activation='relu'),
    MaxPool2D(2),
    Dropout(.2),

    Flatten(),
    Dense(512, activation='relu'),
    Dense(128, activation='relu'),
    Dense(64, activation='relu'),

    Dense(1, activation='sigmoid')
])
model_3.compile(
    optimizer=Adam(),
    loss=BinaryCrossentropy(),
    metrics=['accuracy', recall_m, precision_m, f1_m]
)


In [None]:
model_3.fit(train_data, epochs=25, validation_data=test_data, callbacks=callbacks)

In [None]:
model_3.evaluate(test_data)

In [None]:
%tensorboard --logdir logs/fit

# Transfer Learning

The CNN models above do not reach an acceptable accuracy score.

A transfer learning model has the potential to achieve better metrics.

The VGG16 model pretrained on ImageNet will be used, as it was trained to classify 1000 classes with high accuracy.

In [None]:
from tensorflow.keras.applications.vgg16 import VGG16

base = VGG16(weights='imagenet', include_top=False, input_shape=(150, 150, 3))
base.trainable = False

In [None]:
model_4 = Sequential([
    base,
    Flatten(),
    Dense(256, activation='relu'),
    Dense(128, activation='relu'),
    Dense(1, activation='sigmoid')
])

model_4.compile(
    optimizer=Adam(),
    loss=BinaryCrossentropy(),
    metrics=['accuracy', recall_m, precision_m, f1_m]
)
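
To see what freezing the base means in terms of model size, the frozen and trainable parameter counts can be tallied by hand. The sketch below assumes the standard VGG16 convolutional layout (thirteen 'same'-padded 3x3 convolutions in five blocks, each block followed by 2x2 max pooling) and 150x150x3 inputs; it does not load Keras, and the variable names are illustrative.

```python
# Filters per convolution in each of VGG16's five blocks (assumed standard layout).
VGG16_BLOCKS = [(64, 64), (128, 128), (256, 256, 256),
                (512, 512, 512), (512, 512, 512)]

size, channels, frozen = 150, 3, 0
for block in VGG16_BLOCKS:
    for filters in block:
        frozen += 3 * 3 * channels * filters + filters  # weights + biases
        channels = filters
    size //= 2  # 2x2 max pooling at the end of each block

print(f"base output: {size}x{size}x{channels}")

features = size * size * channels  # what Flatten() passes to the dense head
trainable = 0
for units in (256, 128, 1):
    trainable += features * units + units
    features = units

print(f"frozen parameters: {frozen:,}")
print(f"trainable parameters: {trainable:,}")
```

Under these assumptions only the dense head, roughly 2.1 million of about 16.8 million parameters, is updated during training.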

In [None]:
model_4.fit(train_data, epochs=25, validation_data=test_data, callbacks=callbacks)

# Conclusion 

The transfer learning model achieved the best accuracy, around 90%, which is not good enough when the decision has an impact on someone's medical treatment. Therefore, before the model could potentially be used in the real world, its accuracy needs to improve.

## Improvements

The models are held back by the small dataset, so the most effective method to achieve better model metrics would be to gather more data.

The models would also benefit from hyperparameter tuning, such as finding the ideal learning rate.
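
One simple way to look for a good learning rate is a coarse sweep over log-spaced values, training briefly with each and keeping the one with the lowest validation loss. The sketch below shows only the search loop. `train_and_validate` is a hypothetical stand-in: in practice it would build and fit a model with `Adam(learning_rate=lr)` and return its validation loss, but here it is a toy function so the example runs on its own.

```python
import math


def log_spaced(low_exp, high_exp):
    """Candidate learning rates 10**low_exp ... 10**high_exp, one per decade."""
    return [10.0 ** e for e in range(low_exp, high_exp + 1)]


def train_and_validate(lr):
    """Hypothetical stand-in for a short training run.

    Returns a toy 'validation loss' that is smallest near lr = 1e-3.
    Replace with real training code to use this search in practice.
    """
    return (math.log10(lr) + 3) ** 2 + 0.3


def sweep(candidates):
    """Return (best_lr, best_loss) over the candidate learning rates."""
    results = {lr: train_and_validate(lr) for lr in candidates}
    best_lr = min(results, key=results.get)
    return best_lr, results[best_lr]


best_lr, best_loss = sweep(log_spaced(-5, -1))
print(f"best learning rate: {best_lr:g} (toy validation loss {best_loss:.2f})")
```

A finer sweep around the best decade, or a scheduler that lowers the rate when validation loss plateaus, would be natural follow-ups.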
