# Testing the Model
In this notebook, the **unseen** test data is used to evaluate the final performance, and therefore the generalisability, of the model.
<br>A confusion matrix is generated to show which samples were correctly or incorrectly classified, along with the overall accuracy and recall of the model.
<br><br>It is important to choose the metric used to evaluate the final performance carefully. For an imbalanced dataset, accuracy can be misleading. For example:
<br>Suppose a credit card dataset contains 99% genuine transactions and only 1% fraudulent ones, and we wish to detect the fraudulent cases. A model can reach 99% accuracy simply by classifying every transaction as genuine. Without a confusion matrix we would not see that it identifies all the genuine cases (99%) but misses every fraudulent case (1%), which makes the model useless for its purpose.
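
The credit card example can be reproduced in a few lines. The sketch below uses made-up labels (990 genuine, 10 fraudulent) and a hypothetical "model" that always predicts genuine; none of these values come from the X-ray data used in this notebook.

```python
import numpy as np

# Hypothetical labels: 0 = genuine, 1 = fraudulent (1% of 1000 transactions).
y_true = np.array([0] * 990 + [1] * 10)

# A trivial "model" that labels every transaction as genuine.
y_pred = np.zeros_like(y_true)

# Fraction of all transactions labelled correctly.
accuracy = np.mean(y_pred == y_true)

# Recall on the fraudulent class: detected frauds / all frauds.
recall = np.sum((y_pred == 1) & (y_true == 1)) / np.sum(y_true == 1)

print(f"Accuracy: {accuracy:.2%}")  # 99.00%
print(f"Recall:   {recall:.2%}")    # 0.00%
```

The accuracy looks excellent, yet the recall shows that not a single fraudulent transaction was found.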

In [2]:
%load_ext autoreload
%autoreload 2
import os
import data_prep as dp
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf
import keras
from keras.preprocessing.image import ImageDataGenerator
from keras import backend as K
from keras.models import Sequential, load_model
from keras.layers import Dense, Conv2D, Flatten, MaxPool2D
from keras.layers import SeparableConv2D, BatchNormalization, Dropout
from keras.applications.vgg16 import VGG16
from keras.optimizers import Adam,SGD,Adagrad,Adadelta,RMSprop
from keras.utils import to_categorical

Using TensorFlow backend.


In [3]:
# Set the environment seed for Python
os.environ['PYTHONHASHSEED'] = '0'

seed=101

# Set seed for Numpy
np.random.seed(seed)

### Load in Data

In [4]:
imagetype = '.jpeg'
directory = 'Docs/all_xrays/test/'
subfolders = ['normal','virus','bacteria']
class_labels = [0,1,1] # Change for binary or multi-class models

test_data,test_labels = dp.image_data_and_labels(imagetype, directory, subfolders, class_labels)


imagetype = '.jpeg'
directory = 'Docs/all_xrays/train/'
subfolders = ['normal','virus','bacteria']
class_labels = [0,1,1] # Change for binary or multi-class models

train_data,train_labels = dp.image_data_and_labels(imagetype, directory, subfolders, class_labels)


In [5]:
print("Total number of test examples: ", test_data.shape)
print("Total number of test labels:", test_labels.shape)
print("Total number of train examples: ", train_data.shape)
print("Total number of train labels:", train_labels.shape)

Total number of test examples:  (1175, 224, 224, 3)
Total number of test labels: (1175, 2)
Total number of train examples:  (4625, 224, 224, 3)
Total number of train labels: (4625, 2)
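
Because the labels are one-hot encoded (shape `(n, 2)`), the class balance can be checked by taking the argmax of each row and counting. This is a minimal sketch that uses a small stand-in array instead of the real `test_labels`; the helper name `class_counts` is hypothetical and is not part of `data_prep`.

```python
import numpy as np

def class_counts(one_hot_labels):
    """Return the number of samples per class for one-hot encoded labels."""
    class_ids = np.argmax(one_hot_labels, axis=-1)
    return np.bincount(class_ids, minlength=one_hot_labels.shape[-1])

# Stand-in labels: 3 samples of class 0 and 5 samples of class 1.
demo_labels = np.eye(2, dtype=int)[[0, 0, 0, 1, 1, 1, 1, 1]]
counts = class_counts(demo_labels)
print(counts)  # [3 5]
```

Calling `class_counts(test_labels)` on the real data would show how imbalanced the test set is before any accuracy figure is interpreted.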


### Reproduce Model Architecture
Due to an error in this version of Keras, we cannot directly load the model weights. There is currently no fix, but there are several workarounds. Inconveniently, the one used here requires knowing the model architecture: first rebuild the architecture, then 'train' the model for 0 epochs (to initialise the weights at some value), and finally load in the saved weights and test the model. This is what is done below.

In [6]:
model_predict = Sequential()

# Reuse the first four layers of VGG16 (layers[0] to layers[3]) from a single VGG16 instance.
vgg = VGG16(include_top=False, input_shape=(224,224,3))
for layer in vgg.layers[:4]:
    model_predict.add(layer)

model_predict.add(Conv2D(filters=64, kernel_size=(3,3), padding="same", activation="relu", name='Conv1_1'))
model_predict.add(Conv2D(filters=64, kernel_size=(3,3), padding="same", activation="relu", name='Conv1_2'))
model_predict.add(MaxPool2D(pool_size=(2,2),strides=(2,2), name='Pool1'))

model_predict.add(Conv2D(filters=128, kernel_size=(3,3), padding="same", activation="relu", name='Conv2_1'))
model_predict.add(Conv2D(filters=128, kernel_size=(3,3), padding="same", activation="relu", name='Conv2_2'))
model_predict.add(MaxPool2D(pool_size=(2,2),strides=(2,2), name='Pool2'))

model_predict.add(Conv2D(filters=128, kernel_size=(3,3), padding="same", activation="relu", name='Conv3_1'))
model_predict.add(Conv2D(filters=128, kernel_size=(3,3), padding="same", activation="relu", name='Conv3_2'))
model_predict.add(MaxPool2D(pool_size=(2,2),strides=(2,2), name='Pool3'))

model_predict.add(Conv2D(filters=256, kernel_size=(3,3), padding="same", activation="relu", name='Conv4_1'))
model_predict.add(Conv2D(filters=256, kernel_size=(3,3), padding="same", activation="relu", name='Conv4_2'))
model_predict.add(MaxPool2D(pool_size=(2,2),strides=(2,2), name='Pool4'))

model_predict.add(Conv2D(filters=256, kernel_size=(3,3), padding="same", activation="relu", name='Conv5_1'))
model_predict.add(BatchNormalization())
model_predict.add(Conv2D(filters=256, kernel_size=(3,3), padding="same", activation="relu", name='Conv5_2'))
model_predict.add(BatchNormalization())
model_predict.add(MaxPool2D(pool_size=(2,2),strides=(2,2), name='Pool5'))

model_predict.add(Conv2D(filters=512, kernel_size=(3,3), padding="same", activation="relu", name='Conv6_1'))
model_predict.add(BatchNormalization())
model_predict.add(Conv2D(filters=512, kernel_size=(3,3), padding="same", activation="relu", name='Conv6_2'))
model_predict.add(BatchNormalization())
model_predict.add(MaxPool2D(pool_size=(2,2),strides=(2,2), name='Pool6'))

model_predict.add(Flatten(name="Flatten"))
model_predict.add(Dense(units=1024,activation="relu", name='Dense1'))
model_predict.add(Dense(units=512,activation="relu", name='Dense2'))
model_predict.add(Dense(units=2, activation="softmax", name='Result'))

for layer in model_predict.layers[:3]:
    layer.trainable=False

### Compile and Train for 0 Epochs just to Initialise Layers

In [7]:
from keras.optimizers import Adam
from keras.callbacks import ModelCheckpoint, EarlyStopping

opt = Adam(lr=0.005)
checkpoint = ModelCheckpoint("baseline.h5", monitor='val_acc', verbose=1, save_best_only=True, save_weights_only=False, mode='auto', period=1)
early = EarlyStopping(monitor='val_loss', min_delta=0, patience=10, verbose=1, mode='auto')

model_predict.compile(optimizer=opt, loss=keras.losses.categorical_crossentropy, metrics=['accuracy'])
model_predict.fit(x=train_data,y=train_labels,epochs=0,callbacks=[checkpoint,early],class_weight={0:2.7,1:1}) #Change for binary/multiclass


<keras.callbacks.History at 0x1c37b7d710>

### Load in Weights and Test Model

In [8]:
model_predict.load_weights('Best model/Model_4_2class_datagen_OneOf.h5')

from sklearn.metrics import confusion_matrix

# Predicted labels - argmax will select the largest value (the highest probability) and the corresponding label.
preds = model_predict.predict(test_data, batch_size = 16)
preds = np.argmax(preds, axis = -1)

# Original labels 
orig_test_labels = np.argmax(test_labels, axis = -1)

# Generate a confusion matrix
cm = confusion_matrix(orig_test_labels, preds)
cm

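# Confusion Matrix Layout (scikit-learn: rows = true labels, columns = predicted labels)
# Class 0 = normal (-), class 1 = virus/bacteria (+)
#
#                 Predicted
#                -    +
# True        -  tn   fp
#             +  fn   tp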

array([[283,  35],
       [ 15, 842]])
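
scikit-learn's `confusion_matrix(y_true, y_pred)` puts the true labels on the rows and the predicted labels on the columns, with the classes in sorted order. The toy example below rebuilds that layout with NumPy only, so the position of each count is easy to verify by hand. The labels are made up and the helper `confusion_counts` is hypothetical.

```python
import numpy as np

def confusion_counts(y_true, y_pred, n_classes=2):
    """Count samples per (true, predicted) pair: rows = true, columns = predicted."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

# Made-up labels: 0 = negative, 1 = positive.
y_true = [0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 0]

cm_demo = confusion_counts(y_true, y_pred)

# For a binary problem, flattening row by row yields tn, fp, fn, tp.
tn, fp, fn, tp = cm_demo.ravel()
print(cm_demo)
print(tn, fp, fn, tp)  # 2 1 1 3
```

With this layout the top-left cell holds the true negatives, so the positive class has to be read from the second row and second column.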

## Evaluation Metrics
In this section the model is evaluated using a **Confusion Matrix**. It summarises the classifier's predictions as four counts, from which the evaluation metrics are calculated:
* **TP** - **True Positives** (Number of Positive cases correctly predicted)
* **FP** - **False Positives** (Number of Negative cases predicted as Positive)
* **TN** - **True Negatives** (Number of Negative cases correctly predicted)
* **FN** - **False Negatives** (Number of Positive cases predicted as Negative)

These are then used in the metric calculations below:

* **TPR - True Positive Rate** = $ \frac{tp}{tp \; + \; fn} $

* **FPR - False Positive Rate** = $ \frac{fp}{fp \; + \; tn} $

* **PPV - Precision** = $ \frac{tp}{tp \; + \; fp} $

* **SPC - Specificity** = $ \frac{tn}{tn \; + \; fp} $

* **ACC - Accuracy** = $ \frac{tp \; + \; tn}{tp \; + \; fp \; + \; tn \; + \; fn} $

* **F1** = $ \frac{2 \; * \; tpr \; * \; ppv}{tpr \; + \; ppv} $
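
As a worked illustration, the metrics listed above can be computed from the four counts with a small helper. This is a sketch under stated assumptions: the function name `binary_metrics` is hypothetical, the counts are invented purely to keep the arithmetic easy to follow, and every denominator is assumed to be non-zero.

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute common binary-classification metrics as fractions in [0, 1].

    Assumes every denominator is non-zero.
    """
    tpr = tp / (tp + fn)                   # recall / sensitivity
    fpr = fp / (fp + tn)                   # fall-out
    ppv = tp / (tp + fp)                   # precision
    spc = tn / (tn + fp)                   # specificity
    acc = (tp + tn) / (tp + fp + tn + fn)  # accuracy
    f1 = 2 * ppv * tpr / (ppv + tpr)       # harmonic mean of precision and recall
    return {"tpr": tpr, "fpr": fpr, "ppv": ppv, "spc": spc, "acc": acc, "f1": f1}

# Hypothetical counts for an imbalanced problem (100 positives, 900 negatives).
metrics = binary_metrics(tp=80, fp=20, tn=880, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts the accuracy is 0.96 while recall, precision and F1 are all 0.80, which again shows how accuracy can flatter a model on imbalanced data.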




In [82]:
# Calculate TPR, FPR. Beware of accuracy in imbalanced datasets - we should use ROC curve.
# scikit-learn's confusion matrix has true labels on the rows and predicted labels on the columns,
# with class 0 (normal, negative) first and class 1 (virus/bacteria, positive) second.
tn, fp, fn, tp = cm.ravel()


tpr = round(tp * 100 / (tp + fn), 2)
fpr = round(fp * 100 / (fp + tn), 2)
ppv = round(tp * 100 / (tp + fp), 2)
spc = round(tn * 100 / (tn + fp), 2)
acc = round((tp + tn) * 100 / (tp+fp+tn+fn), 2)
f1 = round(2 * (tpr * ppv) / (tpr + ppv), 2)

print(f'''Confusion Matrix:\n\n                Predicted \n                -   + \nTrue        - {cm[0]}\n            + {cm[1]}\n
\nThe metrics for this model are:
\nTPR: {tpr}% Sick Patients Predicted Sick (Recall)
\nFPR: {fpr}% Healthy Patients Predicted Sick (Fall-out)
\nPrecision: {ppv}% True Sick Patients / Sick Predictions
\nSpecificity: {spc}% Healthy Patients Predicted Healthy
\nAccuracy: {acc}% Overall Correct Predictions
\nF1: {f1} Harmonic Mean of Precision and Recall - Our final model evaluation metric''')



Confusion Matrix:

                Predicted 
                -   + 
True        - [283  35]
            + [ 15 842]


The metrics for this model are:

TPR: 98.25% Sick Patients Predicted Sick (Recall)

FPR: 11.01% Healthy Patients Predicted Sick (Fall-out)

Precision: 96.01% True Sick Patients / Sick Predictions

Specificity: 88.99% Healthy Patients Predicted Healthy

Accuracy: 95.74% Overall Correct Predictions

F1: 97.12 Harmonic Mean of Precision and Recall - Our final model evaluation metric
