# Build and Train a Model
In this notebook we preprocess the data to get it ready for a model, extend a pre-trained model from Keras, train the model on
the x-ray images in our dataset, and evaluate model performance. The goal is to classify pneumonia with a
reasonable level of accuracy. A good baseline accuracy for this task is ?; we hope to come close to this baseline
and potentially outperform it.

TODO: Find the baseline accuracy from the article

# Import Necessary Packages

In [1]:
import numpy as np
import pandas as pd
import os
from glob import glob
%matplotlib inline
import matplotlib.pyplot as plt
from itertools import chain
from random import sample
from sklearn import model_selection, metrics
import tensorflow as tf
from skimage import io
from keras.preprocessing.image import ImageDataGenerator

## Setup the Data
In this section we set up the data to get it ready for our model. I parse the `Finding Labels` column to create binary
classification labels for each type of disease for every observation. I also create a `path` column for each `Image Index`
so I can get to the location of the actual images.

In [None]:
# Read in the Data
all_xray_df = pd.read_csv('/data/Data_Entry_2017.csv')

# Convert `Finding Labels` to binary labels for each disease
all_labels = np.unique(list(chain(*all_xray_df['Finding Labels'].map(lambda x: x.split('|')).tolist())))
all_labels = [x for x in all_labels if len(x)>0]
for label in all_labels:
    if len(label)>1:
        all_xray_df[label] = all_xray_df['Finding Labels'].map(lambda diseases: 1.0 if label in diseases else 0)

# Create 'path' column for each 'Image Index'
all_image_paths = {os.path.basename(x): x for x in
                   glob(os.path.join('/data','images*', '*', '*.png'))}
all_xray_df['path'] = all_xray_df['Image Index'].map(all_image_paths.get)

# Check the results
print(f'Images Found: {len(all_image_paths)}, Total Observations: {all_xray_df.shape[0]}')
all_xray_df.sample(3)
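
As a small illustration of the label parsing above, the sketch below splits a `Finding Labels` string on `|` and builds one binary value per disease. The three sample rows and the helper name `to_binary_labels` are made up for illustration. It also tests membership against the split list rather than the raw string, which is an assumption about the intended behavior (the cell above uses a substring check).

```python
# Toy rows standing in for the 'Finding Labels' column (illustrative values only).
finding_labels = ["Pneumonia|Infiltration", "No Finding", "Effusion"]

def to_binary_labels(rows):
    """Return (sorted label names, one {label: 0.0/1.0} dict per row)."""
    split_rows = [row.split("|") for row in rows]
    # Collect every distinct, non-empty label across all rows.
    all_labels = sorted({label for parts in split_rows for label in parts if label})
    # Exact-match membership against the split list, not the raw string.
    encoded = [{label: float(label in parts) for label in all_labels}
               for parts in split_rows]
    return all_labels, encoded

labels, encoded = to_binary_labels(finding_labels)
print(labels)
print(encoded[0]["Pneumonia"], encoded[1]["Pneumonia"])
```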



## Create Training and Validation Sets
In this section I create the training and validation sets for the model. I am using a 90/10 split of the data, with
90% of the positive pneumonia cases in the training set and 10% of the positive pneumonia cases in the validation set.
In addition, the training set is undersampled so that its proportion of positive to negative pneumonia cases is 50/50.

In [2]:
def create_splits(df, validation_percent):
    # create original train / validation split
    train_data, val_data = model_selection.train_test_split(df, test_size=validation_percent, stratify=df['Pneumonia'])

    # Update training set to be 50/50 split
    pos_indexes = train_data[train_data.Pneumonia == 1].index.tolist()
    neg_indexes = train_data[train_data.Pneumonia == 0].index.tolist()
    neg_sample = sample(neg_indexes, len(pos_indexes))
    train_data = train_data.loc[pos_indexes+neg_sample]

    # Check splits
    print(f'Total Pneumonia Cases: {df[df.Pneumonia==1].shape[0]} \
    {(1-validation_percent)*100}% Pneumonia Cases: {int(df[df.Pneumonia==1].shape[0]*(1-validation_percent))} \
    {validation_percent*100}% Pneumonia Cases: {int(df[df.Pneumonia==1].shape[0]*validation_percent)}')
    print(f'Pneumonia Cases in Training set: {train_data[train_data.Pneumonia==1].shape[0]} \
    Pneumonia Cases in Validation set: {val_data[val_data.Pneumonia==1].shape[0]}')

    print(f'Train Set Size: {train_data.shape[0]}, \
    Pos %: {train_data[train_data.Pneumonia==1].shape[0] / train_data.shape[0] *100}, \
    Neg %: {train_data[train_data.Pneumonia==0].shape[0] / train_data.shape[0] *100}')

    print(f'Original Set Size: {df.shape[0]} \
    Pos %: {df[df.Pneumonia==1].shape[0] / df.shape[0] *100} \
    Neg %: {df[df.Pneumonia==0].shape[0] / df.shape[0] *100}')

    print(f'Validation Set Size: {val_data.shape[0]}, \
    Pos %: {val_data[val_data.Pneumonia == 1].shape[0] / val_data.shape[0] *100}, \
    Neg %: {val_data[val_data.Pneumonia == 0].shape[0] / val_data.shape[0] *100}')

    return train_data, val_data

train_df, val_df = create_splits(all_xray_df, 0.1)
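
The 50/50 balancing step can be seen in isolation on a toy list of labels. This is a minimal sketch, not the notebook's function: the helper name `balance_by_undersampling`, the fixed seed, and the toy labels are assumptions added for illustration.

```python
from random import Random

def balance_by_undersampling(labels, seed=0):
    """Return indices of all positives plus an equal number of sampled negatives."""
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    rng = Random(seed)  # fixed seed so the sample is repeatable
    # Sample without replacement, so no negative index is picked twice.
    return pos + rng.sample(neg, len(pos))

toy_labels = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
balanced = balance_by_undersampling(toy_labels)
# Number of selected rows, and how many of them are positive.
print(len(balanced), sum(toy_labels[i] for i in balanced))
```

Seeding the sampler is a design choice for reproducibility; `create_splits` above draws a different negative sample on every run.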



## Implement Image Augmentation and Data Generators
In this section I set up image augmentation to generate a more diverse training set, and create training and validation
set generators to use for model training and testing.

In [None]:
IMG_SIZE = (224, 224)

# Create Image Data Generator for training set
train_idg = ImageDataGenerator(rescale=1.0/255.0,
                         horizontal_flip = True,
                         vertical_flip = False,
                         rotation_range = 0.5,
                         shear_range = 0.1,
                         zoom_range = 0.15)

train_generator = train_idg.flow_from_dataframe(dataframe=train_df, directory=None,
                                          x_col='path', y_col='Pneumonia',
                                          class_mode = 'binary', target_size=IMG_SIZE, batch_size=24)

# Create a generator for the validation set (no augmentation, only rescaling)
validation_idg = ImageDataGenerator(rescale=1.0/255.0)
validation_generator = validation_idg.flow_from_dataframe(dataframe=val_df, directory=None,
                                                      x_col='path', y_col='Pneumonia',
                                                      class_mode='binary', target_size=IMG_SIZE, batch_size=11212)
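
To make two of the generator settings concrete, the sketch below applies a `1/255` rescale and a left-to-right flip to a tiny made-up array using plain NumPy. It only illustrates the arithmetic; it is not how `ImageDataGenerator` is implemented, and the 2x3 "image" is an invented example.

```python
import numpy as np

# A fake 2x3 single-channel "image" with 8-bit pixel values (illustrative only).
image = np.array([[0, 51, 102],
                  [153, 204, 255]], dtype=np.uint8)

# Same arithmetic as rescale=1.0/255.0: map [0, 255] onto [0, 1].
rescaled = image.astype("float32") / 255.0

# Mirror the columns, which is what a horizontal flip does to each row.
flipped = rescaled[:, ::-1]

print(rescaled.min(), rescaled.max())
print(flipped[0])
```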

In [None]:
# Check a random batch of training and validation data to make sure everything looks okay
# Training Examples
t_x, t_y = next(train_generator)
fig, m_axs = plt.subplots(4, 4, figsize = (16, 16))
for (c_x, c_y, c_ax) in zip(t_x, t_y, m_axs.flatten()):
    c_ax.imshow(c_x[:,:,0], cmap = 'bone')
    if c_y == 1: 
        c_ax.set_title('Pneumonia')
    else:
        c_ax.set_title('No Pneumonia')
    c_ax.axis('off')

# Validation Examples
val_x, val_y = next(validation_generator)
fig, m_axs = plt.subplots(2, 2, figsize = (16, 16))
for (c_x, c_y, c_ax) in zip(val_x, val_y, m_axs.flatten()):
    c_ax.imshow(c_x[:,:,0], cmap = 'bone')
    if c_y == 1:
        c_ax.set_title('Pneumonia')
    else:
        c_ax.set_title('No Pneumonia')
    c_ax.axis('off')


## Extending a Pre-trained Model
In this section I use a pre-trained model from Keras called VGG16 to attempt to solve this classification problem.
TODO: Add more details after training (Potentially try other model architectures)

In [None]:
# Take a look at the pre-trained VGG16 Architecture
vgg16 = tf.keras.applications.VGG16(include_top=True, weights='imagenet')
vgg16.summary()

In [None]:
# Setup our own model that is an extension of VGG16
def setup_model(model, input_size, output_size: int, num_active_layers: int, lr):
    # Remove the final layer and replace it with our own classification layer
    new_model = tf.keras.Sequential()
    new_model.add(tf.keras.layers.Input(shape=input_size, dtype='float32'))
    for layer in model.layers[:-1]:
        new_model.add(layer)
    new_model.add(tf.keras.layers.Dense(output_size,activation='sigmoid'))

    # Freeze all but num_active_layers layers
    for layer_num, layer in enumerate(new_model.layers):
        if layer_num < len(new_model.layers) - num_active_layers:
            layer.trainable = False
        else:
            layer.trainable = True

    # Setup optimizer, loss function, and metrics
    optimizer = tf.keras.optimizers.Adam(learning_rate=lr)
    loss = 'binary_crossentropy'
    metrics = ['binary_accuracy']
    new_model.compile(optimizer=optimizer, loss=loss, metrics=metrics)

    return new_model

new_vgg_model = setup_model(vgg16, (224,224,3), 1, 7, 1e-4)
new_vgg_model.summary()
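
The freezing rule in `setup_model` can be checked without loading any weights. The sketch below reproduces only the index arithmetic; the helper name `trainable_flags` and the layer count of 10 are hypothetical and chosen for readability.

```python
def trainable_flags(num_layers, num_active_layers):
    """Mirror the freezing rule: only the last num_active_layers are trainable."""
    first_trainable = num_layers - num_active_layers
    return [index >= first_trainable for index in range(num_layers)]

# With 10 layers and 3 active layers, indices 7, 8 and 9 stay trainable.
flags = trainable_flags(num_layers=10, num_active_layers=3)
print(flags)
```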


## Setup Model Callbacks
Below is some helper code that will allow us to add callbacks to our model. This will save the 'best' version of the model
by comparing it to previous epochs of training. The 'patience' parameter tells the code how long to wait without seeing
any improvements to the chosen metric before quitting.

In [None]:
weight_path = "{}_my_model.best.hdf5".format('xray_class')

checkpoint = tf.keras.callbacks.ModelCheckpoint(weight_path,
                                                monitor= 'val_binary_accuracy',
                                                verbose=1,
                                                save_best_only=True,
                                                mode= 'max',
                                                save_weights_only = True)

early = tf.keras.callbacks.EarlyStopping(monitor= 'val_binary_accuracy', mode= 'max', patience=15)

callbacks_list = [checkpoint, early]
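
The idea behind the 'patience' parameter can be sketched in a few lines of plain Python. This is a simplified model of the behavior described above, not the Keras implementation (it ignores options such as a minimum improvement); the function name `early_stop_epoch` and the accuracy values are invented for illustration.

```python
def early_stop_epoch(metric_per_epoch, patience):
    """Return (stop_epoch, best_epoch) for a metric that should be maximised.

    Epochs are 0-indexed; stop_epoch is None if training runs to the end.
    """
    best_value = float("-inf")
    best_epoch = None
    epochs_without_improvement = 0
    for epoch, value in enumerate(metric_per_epoch):
        if value > best_value:
            # New best: this is the epoch a checkpoint would keep.
            best_value, best_epoch = value, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    return None, best_epoch

val_accuracy = [0.60, 0.65, 0.64, 0.63, 0.66, 0.65, 0.65, 0.64]
print(early_stop_epoch(val_accuracy, patience=3))  # (7, 4)
```

With `patience=3` the run stops at epoch 7, three epochs after the best value at epoch 4, so the saved 'best' weights come from epoch 4 rather than from the last epoch.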

## Train the Model
Everything is already set up, so all we have to do is fit the model and look at the results below.

In [None]:
# `fit` accepts generators directly; `fit_generator` is deprecated in TensorFlow 2.x
vgg_history = new_vgg_model.fit(train_generator, validation_data = (val_x,val_y), epochs = 100, callbacks = callbacks_list)

## Evaluate Performance of Model
#TODO Make note about how this is not that helpful in evaluating the model

In [None]:
# Plot Model History
def plot_history(model_history):
    N = len(model_history.history['loss'])
    plt.style.use('ggplot')
    plt.figure()
    plt.plot(np.arange(0,N), model_history.history['loss'], label='train loss')
    plt.plot(np.arange(0,N), model_history.history['val_loss'], label='validation loss')
    plt.plot(np.arange(0,N), model_history.history['binary_accuracy'], label='train accuracy')
    plt.plot(np.arange(0,N), model_history.history['val_binary_accuracy'], label='validation accuracy')
    plt.title("Model Training Results")
    plt.xlabel("Epoch #")
    plt.ylabel("Loss / Accuracy")
    plt.legend(loc="upper left")
    plt.show()

plot_history(vgg_history)



## Look at Performance Statistics

In [1]:
# Get Model Predictions
new_vgg_model.load_weights(weight_path)
pred_Y = new_vgg_model.predict(val_x, batch_size = 11212, verbose = 1)

def plot_roc_curve(model_name, true_y, pred_y):
    fig, ax = plt.subplots(1,1, figsize=(9,9))
    fpr, tpr, thresholds = metrics.roc_curve(true_y, pred_y)
    ax.plot(fpr, tpr, label=f'{model_name} AUC: {metrics.auc(fpr,tpr)}')
    ax.legend()
    ax.set_xlabel("False Positive Rate")
    ax.set_ylabel("True Positive Rate")
    return


def plot_precision_recall_curve(model_name, true_y, pred_y):
    fig, ax = plt.subplots(1,1, figsize=(9,9))
    precision, recall, thresholds = metrics.precision_recall_curve(true_y,pred_y)
    ax.plot(recall, precision, label=f'{model_name} AP Score: {metrics.average_precision_score(true_y,pred_y)}')
    plt.legend()
    ax.set_xlabel("Recall")
    ax.set_ylabel("Precision")
    return

def calculate_f1_scores(precision, recall):
    # Guard against division by zero when precision and recall are both 0
    return [(2*p*r)/(p+r) if (p+r) > 0 else 0.0 for p,r in zip(precision,recall)]

def plot_f1_score(model_name, true_y, pred_y):
    """ f1 = 2*(precision*recall) / (precision + recall)"""
    fig, ax = plt.subplots(1,1,figsize=(9,9))
    precision, recall, thresholds = metrics.precision_recall_curve(true_y,pred_y)
    # precision and recall have one more element than thresholds, so drop the last point
    ax.plot(thresholds, precision[:-1], label=f'{model_name} Precision')
    ax.plot(thresholds, recall[:-1], label=f'{model_name} Recall')
    ax.plot(thresholds, calculate_f1_scores(precision[:-1], recall[:-1]), label=f'{model_name} F1 Score')
    plt.legend()
    ax.set_xlabel("Threshold Values")
    return




## Decide on classification threshold
Once you feel you are done training, you'll need to decide on the classification threshold that optimizes your model's
performance for a given metric (e.g. accuracy, F1, precision; you decide).
If this is a preliminary screening, we want high recall so that we do not miss anyone who potentially has the disease.
If the follow-up exam is very expensive, we want high precision because we do not want to waste patients' money.
If missing the follow-up exam could lead to death, we want high recall.
Maximize the F1 score if you want to balance precision and recall.
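
As one possible way to pick a threshold, the sketch below scans a few candidate thresholds and keeps the one with the highest F1 score, computed as `2*TP / (2*TP + FP + FN)`. It assumes F1 is the metric of interest; the helper name `best_f1_threshold`, the candidate thresholds, and the toy labels and scores are all made up for illustration and are not results from this model.

```python
import numpy as np

def best_f1_threshold(true_y, pred_y, thresholds):
    """Return (threshold, f1) with the highest F1 among the candidates."""
    true_y = np.asarray(true_y)
    pred_y = np.asarray(pred_y)
    best = (None, -1.0)
    for threshold in thresholds:
        predicted_positive = pred_y >= threshold
        tp = np.sum(predicted_positive & (true_y == 1))
        fp = np.sum(predicted_positive & (true_y == 0))
        fn = np.sum(~predicted_positive & (true_y == 1))
        denominator = 2 * tp + fp + fn
        # F1 is defined as 0 here when there are no positives at all.
        f1 = 2 * tp / denominator if denominator else 0.0
        if f1 > best[1]:
            best = (float(threshold), float(f1))
    return best

toy_true = [0, 0, 1, 0, 1, 1]
toy_pred = [0.1, 0.4, 0.35, 0.8, 0.7, 0.9]
threshold, f1 = best_f1_threshold(toy_true, toy_pred, thresholds=[0.3, 0.5, 0.75])
print(threshold, f1)  # 0.3 0.75
```

For a screening use case, the same loop could instead keep the highest threshold whose recall stays above a chosen floor.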


In [None]:
#TODO
## Find the threshold that optimizes your model's performance,
## and use that threshold to make binary classifications. Make sure you take all your metrics into consideration.

In [None]:
## Let's look at some examples of predicted v. true with our best model: 

# Todo

# fig, m_axs = plt.subplots(10, 10, figsize = (16, 16))
# i = 0
# for (c_x, c_y, c_ax) in zip(val_x[0:100], val_y[0:100], m_axs.flatten()):
#     c_ax.imshow(c_x[:,:,0], cmap = 'bone')
#     if c_y == 1: 
#         if pred_Y[i] > YOUR_THRESHOLD:
#             c_ax.set_title('1, 1')
#         else:
#             c_ax.set_title('1, 0')
#     else:
#         if pred_Y[i] > YOUR_THRESHOLD: 
#             c_ax.set_title('0, 1')
#         else:
#             c_ax.set_title('0, 0')
#     c_ax.axis('off')
#     i=i+1


## Save model architecture to a .json:


In [None]:
model_json = new_vgg_model.to_json()
with open("my_model.json", "w") as json_file:
    json_file.write(model_json)