# Lab Six - Convolutional Network Architectures
Amory Weinzierl, Fidelia Nawar, and Hayden Center

In this lab, you will select a prediction task to perform on your dataset, evaluate a deep learning architecture and tune hyper-parameters. If any part of the assignment is not clear, ask the instructor to clarify. 

This report is worth 10% of the final grade. Please upload a report (<b>one per team</b>) with all code used, visualizations, and text in a rendered Jupyter notebook. Any visualizations that cannot be embedded in the notebook, please provide screenshots of the output. The results should be reproducible using your report. Please carefully describe every assumption and every step in your report.

<b>Dataset Selection</b>

Select a dataset identically to lab two (images). That is, the dataset must be image data. In terms of generalization performance, it is helpful to have a large dataset of identically sized images. It is fine to perform binary classification or multi-class classification.

## Preparation (3 pts)

- [<b>1.5 points</b>] Choose and explain what metric(s) you will use to evaluate your algorithm’s performance. You should give a <b>detailed argument for why this (these) metric(s) are appropriate on your data. That is, why is the metric appropriate</b> for the task (e.g., in terms of the business case for the task). Please note: rarely is accuracy the best evaluation metric to use. Think deeply about an appropriate measure of performance.
- [<b>1.5 points</b>] Choose the method you will use for dividing your data into training and testing (i.e., are you using Stratified 10-fold cross validation? Shuffle splits? Why?). <b>Explain why your chosen method is appropriate or use more than one method as appropriate</b>. Convince me that your cross validation method is a realistic mirroring of how an algorithm would be used in practice. 

In [None]:
# Importing packages and reading in dataset
import numpy as np
import pandas as pd
import tensorflow as tf
import tensorflow.keras as keras

print('Pandas:', pd.__version__)
print('Numpy:',  np.__version__)
print('Tensorflow:', tf.__version__)
print('Keras:',  keras.__version__)

In [None]:
%%time

#source: https://www.geeksforgeeks.org/how-to-convert-images-to-numpy-array/
from PIL import Image

#source: https://stackoverflow.com/questions/10377998/how-can-i-iterate-over-files-in-a-given-directory
from pathlib import Path

#directory name
paths = {
    "TRAIN": './Coronahack-Chest-XRay-Dataset/train/',
    "TEST":  './Coronahack-Chest-XRay-Dataset/test/'    
}
metadata = pd.read_csv('Chest_xray_Corona_Metadata.csv')

h, w = 64, 64

tf.random.set_seed(2)
np.random.seed(0) # using this to help make results reproducible

images = metadata[["X_ray_image_name", "Dataset_type"]]
X_data = []
y_data = metadata["Label"]
for idx, img in images.iterrows():
    name = img["X_ray_image_name"]
    path = img["Dataset_type"]
    # PIL's resize takes (width, height)
    img_arr = np.asarray(Image.open(paths[path] + name).convert('L').resize((w, h)))
    X_data.append(img_arr)

In [None]:
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
_X = np.expand_dims(np.array(X_data), axis=-1)/255 - 0.5
_y = le.fit_transform(np.array(y_data))

print(_X.shape, _y.shape)

In [None]:
import matplotlib.pyplot as plt

display_imgs = np.concatenate((_X[0:9], _X[-9:]))
labels = np.concatenate((y_data[0:9], y_data[-9:]))
def plot_gallery(images, titles, n_row=3, n_col=3):
    plt.figure(figsize=(n_col * n_col, 6 * n_row))
    plt.subplots_adjust(bottom=0, left=.01, right=.99, top=.90, hspace=.35)
    #normal scans tended towards front
    for i in range(n_row * n_col):
        plt.subplot(n_row * 2, n_col, i + 1)
        plt.imshow(images[i], cmap=plt.cm.gray)
        plt.title(titles[i], size=12)
        plt.xticks(())
        plt.yticks(())
    # pneumonia scans sit toward the back of the dataset, so the second half
    # of `images` holds the scans pulled from the end for demonstration purposes
    for j in range(n_row * n_col):
        k = n_row * n_col + j  # index into the second half of `images`
        plt.subplot(n_row * 2, n_col, k + 1)
        plt.imshow(images[k], cmap=plt.cm.gray)
        plt.title(titles[k], size=12)
        plt.xticks(())
        plt.yticks(())
        
plot_gallery(display_imgs, labels)

#### Evaluation Metric

The primary evaluation metrics we are using for our model are recall and precision. Recall measures the percentage of positive cases that were identified correctly, and precision measures the percentage of positive predictions that were correct.

These metrics emphasize correct positive identifications, which suits our problem because we want to minimize the number of undetected pneumonia cases. Recall is the more important of the two, since maximizing recall minimizes the false negative rate. A low false negative rate matters here because diagnosing a lung as "Normal" when it in fact has pneumonia is detrimental and possibly fatal to the patient. By the same token, healthy lungs should not be misclassified as pneumonia, because that would create unnecessary issues for a healthy patient. For these reasons we chose recall and precision, specifically the native Keras implementations of both, to evaluate our CNN solution.
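
To make the two definitions concrete, the following minimal sketch computes precision and recall directly from confusion-matrix counts. The counts and the helper name `precision_recall` are made up for illustration and are not results from our models.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical counts: 90 pneumonia scans caught, 10 missed, 30 healthy scans flagged.
precision, recall = precision_recall(tp=90, fp=30, fn=10)
print(f"Precision: {precision:.2f}")
print(f"Recall:    {recall:.2f}")
```

In this made-up case the model misses 10 of 100 pneumonia scans (recall) while a quarter of its pneumonia predictions are wrong (precision), which is the trade-off we want to track.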

#### Dividing Data

We are using stratified 10-fold cross validation to split the data into training and test sets. We chose this method because almost 3/4 of the lungs in our dataset are labeled as having pneumonia, whereas only 1/4 are labeled as healthy. With a plain random split/shuffle, the class proportions could differ between the training and testing folds, which would make the evaluation on the testing data less reliable. Stratified 10-fold cross validation keeps the class ratio roughly constant in every fold, which helps us build a more effective model and estimate how well it generalizes. Each CNN is fit once per fold, so every sample is used for testing exactly once. This is a realistic measure of real-world use of the algorithm because a single small test set gives a high-variance estimate; averaging over k different partitions reduces this variance, so the performance estimate is less sensitive to how the data happens to be partitioned. We also chose 10 folds because this value has been shown empirically to yield test error rate estimates that suffer neither from excessively high bias nor from very high variance.

Additionally, we will be using an 80/20 split. The 80% portion is used for cross validation and then to train our final models, and the 20% portion is held out for statistical comparisons of final performance. This is needed because cross validation does not produce a final trained model and is only useful for comparing our evaluation metrics.
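
As a quick illustration of why stratification matters for an imbalanced dataset, the sketch below runs scikit-learn's `StratifiedKFold` on synthetic labels with roughly our 3:1 class balance. The arrays `X_demo` and `y_demo` are placeholders invented for this example, not our x-ray data.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic labels with roughly our class balance: 300 pneumonia (1), 100 normal (0).
y_demo = np.array([1] * 300 + [0] * 100)
X_demo = np.zeros((len(y_demo), 1))  # features are irrelevant for the split itself

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=1234)
fold_ratios = []
for train_idx, test_idx in skf.split(X_demo, y_demo):
    # Fraction of pneumonia labels in this test fold
    fold_ratios.append(y_demo[test_idx].mean())

print("Pneumonia fraction per test fold:", np.round(fold_ratios, 3))
```

Every test fold keeps the same pneumonia fraction as the full label vector, which is the property we rely on when comparing precision and recall across folds.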

In [None]:
from sklearn.model_selection import train_test_split

X, X_final, y, y_final = train_test_split(_X, _y, test_size=0.2, stratify=_y)

## Modeling (6 pts)

- [<b>1.5 points</b>]  Setup the training to use data expansion in Keras. Explain why the chosen data expansion techniques are appropriate for your dataset. 
- [<b>2 points</b>] Create a convolutional neural network to use on your data using Keras. Investigate at least two different convolutional network architectures (and investigate changing some parameters of each architecture--at minimum have two variations of each network for a total of four models trained). Use the method of train/test splitting and evaluation metric that you argued for at the beginning of the lab. Visualize the performance of the training and validation sets per iteration (use the "history" parameter of Keras).
- [<b>1.5 points</b>] Visualize the final results of the CNNs and interpret the performance. Use proper statistics as appropriate, especially for comparing models. 
- [<b>1 points</b>] Compare the performance of your convolutional network to a standard multi-layer perceptron (MLP) using the receiver operating characteristic and area under the curve. Use proper statistical comparison techniques.  

We are using Keras's built-in ImageDataGenerator for our data expansion. In resizing all of our images to 64x64, many of the images were already stretched and squashed in different directions, so perturbing them further with small random shifts will hopefully remove any hidden biases that the different original image sizes may have created. Additionally, since all of the x-rays are more or less similarly oriented, we can add a slight rotational adjustment. However, since the images should all be consistently oriented horizontally (because the heart is always located to one side of the body) and vertically (all of the images have the patient's neck and shoulders at the top of the image), it would not be useful to flip the images.

In [None]:
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    rotation_range=1,
    width_shift_range=0.05,
    height_shift_range=0.05)

datagen.fit(X)
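
For a sense of scale, the small sketch below converts the fractional shift ranges into pixels. It assumes the 64x64 input size and the 0.05 ranges from the cells above; Keras treats a shift range below 1 as a fraction of the image width or height. The variable names are only for illustration.

```python
# Illustrative arithmetic only; values mirror the cells above.
h, w = 64, 64
width_shift_range = 0.05
height_shift_range = 0.05

# A fractional range below 1 is scaled by the image size to get a pixel offset.
max_dx = width_shift_range * w
max_dy = height_shift_range * h

print(f"Max horizontal shift: +/- {max_dx:.1f} px")
print(f"Max vertical shift:   +/- {max_dy:.1f} px")
```

So each augmented image is moved by at most about three pixels in either direction, a small perturbation relative to the 64-pixel frame.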

In [None]:
from tensorflow.keras.layers       import Conv2D, MaxPooling2D, GlobalAveragePooling2D, BatchNormalization
from tensorflow.keras.layers       import Dense, Dropout, Flatten, Activation
from tensorflow.keras.models       import Model, Sequential
from tensorflow.keras.callbacks    import EarlyStopping
from tensorflow.keras.utils        import plot_model
from tensorflow.keras.regularizers import l2
from sklearn.model_selection import StratifiedKFold
from sklearn.utils import class_weight

loss = 'binary_crossentropy'
optimizer = 'rmsprop'
metrics = [keras.metrics.Precision(), keras.metrics.Recall()]
batch_size = 128
epochs = 5
verbose = 1
n_splits = 10
kf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=1234)
scores = []

In [None]:
def plot_histories(histories, title):
    fig, subplots = plt.subplots(2,3,figsize=(15,n_splits*4))
    fold_names = ["Fold " + str(fold) for fold in range(n_splits)]
    for fold_no, history in enumerate(histories):
        keys = list(history.history.keys())
        
        subplots[0,0].plot(history.history[keys[0]], label=fold_no)
        subplots[0,0].set_title('Binary Crossentropy')

        subplots[0,1].plot(history.history[keys[1]], label=fold_no)
        subplots[0,1].set_title('Precision')
        subplots[0,1].set_ylim(0.4, 1.1)

        subplots[0,2].plot(history.history[keys[2]], label=fold_no)
        subplots[0,2].set_title('Recall')
        subplots[0,2].set_ylim(0.4, 1.1)
        
        subplots[1,0].plot(history.history[keys[3]], label=fold_no)
        subplots[1,0].set_title('Validation Binary Crossentropy')

        subplots[1,1].plot(history.history[keys[4]], label=fold_no)
        subplots[1,1].set_title('Validation Precision')
        subplots[1,1].set_ylim(0.4, 1.1)

        subplots[1,2].plot(history.history[keys[5]], label=fold_no)
        subplots[1,2].set_title('Validation Recall')
        subplots[1,2].set_ylim(0.4, 1.1)
    handles, labels = subplots[1,2].get_legend_handles_labels()
    fig.suptitle(title, fontsize=16)
    fig.legend(handles, labels, title="Fold #")

### Model 1 - Basic Architecture

In [None]:
def build_basic_model(kernel_size, metrics):
    reg = l2(0.00001)
    cnn = Sequential()

    cnn.add(Conv2D(filters=32,
                kernel_size=kernel_size,
                kernel_regularizer=reg,
                padding='same',
                activation='relu'))
    cnn.add(Conv2D(filters=32,
                kernel_size=kernel_size,
                kernel_regularizer=reg,
                padding='same',
                activation='relu'))
    cnn.add(MaxPooling2D(pool_size=(2, 2)))

    cnn.add(Conv2D(filters=64,
                kernel_size=kernel_size,
                kernel_regularizer=reg,
                padding='same',
                activation='relu'))
    cnn.add(Conv2D(filters=64,
                kernel_size=kernel_size,
                kernel_regularizer=reg,
                padding='same',
                activation='relu'))
    cnn.add(MaxPooling2D(pool_size=(2, 2)))

    cnn.add(Dropout(0.25))
    cnn.add(Flatten())
    cnn.add(Dense(128, activation='relu',
                kernel_regularizer=reg))
    cnn.add(Dropout(0.5))
    cnn.add(Dense(1, activation='sigmoid',
                kernel_regularizer=reg))

    cnn.compile(loss=loss,
                optimizer=optimizer,
                metrics=metrics)
    
    return cnn
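
The following sketch traces the tensor shape through the basic architecture, assuming the 64x64 single-channel inputs used in this notebook. Because every convolution uses `padding='same'`, only the two 2x2 max-pooling layers change the spatial size. The helper name `pool_output` is hypothetical.

```python
def pool_output(size, pool=2):
    """Spatial size after one max pool with stride equal to the pool size."""
    return size // pool


h, w = 64, 64          # input resolution used in this notebook
filters_last = 64      # filters in the second convolutional block

for _ in range(2):     # two conv-conv-pool blocks; 'same' convs keep h and w unchanged
    h, w = pool_output(h), pool_output(w)

flat_features = h * w * filters_last
dense_params = flat_features * 128 + 128  # weights plus biases of Dense(128)

print("Feature map before Flatten:", (h, w, filters_last))
print("Flattened features:", flat_features)
print("Parameters in Dense(128):", dense_params)
```

Most of the model's parameters sit in the first dense layer, which is one reason dropout is applied around it.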

In [None]:
def basic_model(kernel_size, metrics):
    print("Basic Architecture")
    print("Kernel Size:", kernel_size,'\n')

    fold_no = 0
    histories = []
    eval_scores = []
    for train_index, test_index in kf.split(X, y):
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]

        cnn = build_basic_model(kernel_size, metrics)

        print('Fold',fold_no)
        print('')
        
        history = cnn.fit(datagen.flow(X_train, y_train, batch_size=batch_size), 
                          steps_per_epoch=int(len(X_train)/batch_size),
                          epochs=epochs, verbose=verbose,
                          validation_data=(X_test, y_test))

        print('')
        scores = cnn.evaluate(X_test, y_test, verbose=verbose)
        print('-' * 110)

        histories.append(history)
        eval_scores.append(scores)

        fold_no += 1

    eval_scores = np.array(eval_scores)
    print("Average Performance")
    print(f"Precision:  {round(np.mean(eval_scores[:,1]), 5)}")
    print(f"Recall:     {round(np.mean(eval_scores[:,2]), 5)}")
    
    return histories, eval_scores

#### Variation 1

In [None]:
%%time

histories1, eval_scores1 = basic_model(3, metrics)
scores.append(('basic', 3, eval_scores1))

In [None]:
title1 = "CNN - Kernel Size: 3"
plot_histories(histories1, title1)

#### Variation 2

In [None]:
%%time

histories2, eval_scores2 = basic_model(5, metrics)
scores.append(('basic', 5, eval_scores2))

In [None]:
title2 = "CNN - Kernel Size: 5"
plot_histories(histories2, title2)

### Model 2 - Network in Network Architecture

In [None]:
def build_nin_model(kernel_size, metrics):
    reg = l2(0.00001)
    cnn = Sequential()

    cnn.add(Conv2D(filters=32,
                kernel_size=kernel_size,
                kernel_regularizer=reg,
                padding='same'))
    cnn.add(Activation('relu'))
    cnn.add(Conv2D(filters=32,
                kernel_size=(1,1),
                kernel_regularizer=reg,
                padding='same'))
    cnn.add(Activation('relu'))
    cnn.add(Conv2D(filters=32,
                kernel_size=(1,1),
                kernel_regularizer=reg,
                padding='same'))
    cnn.add(Activation('relu'))
    cnn.add(MaxPooling2D(pool_size=(2, 2)))
    cnn.add(Dropout(0.2))

    cnn.add(Conv2D(filters=64,
                kernel_size=kernel_size,
                kernel_regularizer=reg,
                padding='same'))
    cnn.add(Activation('relu'))
    cnn.add(Conv2D(filters=64,
                kernel_size=(1,1),
                kernel_regularizer=reg,
                padding='same'))
    cnn.add(Activation('relu'))
    cnn.add(Conv2D(filters=64,
                kernel_size=(1,1),
                kernel_regularizer=reg,
                padding='same'))
    cnn.add(Activation('relu'))
    cnn.add(MaxPooling2D(pool_size=(2, 2)))
    cnn.add(Dropout(0.2))

    cnn.add(Conv2D(filters=64,
                kernel_size=kernel_size,
                kernel_regularizer=reg,
                padding='same'))
    cnn.add(Activation('relu'))
    cnn.add(Conv2D(filters=64,
                kernel_size=(1,1),
                kernel_regularizer=reg,
                padding='same'))
    cnn.add(Activation('relu'))
    cnn.add(Conv2D(filters=1,
                kernel_size=(1,1),
                kernel_regularizer=reg,
                padding='same'))
    cnn.add(Activation('relu'))

    cnn.add(Flatten())
    cnn.add(Dense(1, activation='sigmoid',
                kernel_regularizer=reg))

    cnn.compile(loss=loss,
                optimizer=optimizer,
                metrics=metrics)
    
    return cnn
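
To show why the 1x1 convolutions in this architecture are cheap, the sketch below counts trainable parameters using the standard formula for a 2D convolution with bias (kernel height x kernel width x input channels x output channels, plus one bias per output channel). The helper name `conv2d_params` is hypothetical.

```python
def conv2d_params(kernel, c_in, c_out):
    """Trainable parameters of a square-kernel Conv2D layer with bias."""
    return kernel * kernel * c_in * c_out + c_out


spatial = conv2d_params(kernel=3, c_in=32, c_out=32)    # a 3x3 convolution
pointwise = conv2d_params(kernel=1, c_in=32, c_out=32)  # a 1x1 convolution

print("3x3 conv, 32 -> 32 channels:", spatial)
print("1x1 conv, 32 -> 32 channels:", pointwise)
```

The 1x1 layers add extra non-linear mixing across channels at roughly a ninth of the cost of another 3x3 layer.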

In [None]:
# Architecture based on https://www.kaggle.com/bingdiaoxiaomao/network-in-network-nin-with-keras

def nin_model(kernel_size, metrics):
    print("NiN Architecture")
    print("Kernel Size:", kernel_size,'\n')

    fold_no = 0
    histories = []
    eval_scores = []
    for train_index, test_index in kf.split(X, y):
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]

        cnn = build_nin_model(kernel_size, metrics)

        print('Fold',fold_no)
        print('')
        
        history = cnn.fit(datagen.flow(X_train, y_train, batch_size=batch_size), 
                          steps_per_epoch=int(len(X_train)/batch_size),
                          epochs=epochs, verbose=verbose,
                          validation_data=(X_test, y_test))

        print('')
        scores = cnn.evaluate(X_test,y_test, verbose=verbose)
        print('-' * 110)
        
        histories.append(history)
        eval_scores.append(scores)

        fold_no += 1

    eval_scores = np.array(eval_scores)
    print("Average Performance")
    print(f"Precision:  {round(np.mean(eval_scores[:,1]), 5)}")
    print(f"Recall:     {round(np.mean(eval_scores[:,2]), 5)}")
    
    return histories, eval_scores

#### Variation 1

In [None]:
%%time

histories3, eval_scores3 = nin_model(3, metrics)
scores.append(('nin', 3, eval_scores3))

In [None]:
title3 = "NiN - Kernel Size: 3"
plot_histories(histories3, title3)

#### Variation 2

In [None]:
%%time

histories4, eval_scores4 = nin_model(5, metrics)
scores.append(('nin', 5, eval_scores4))

In [None]:
title4 = "NiN - Kernel Size: 5"
plot_histories(histories4, title4)

### Comparing Models

In [None]:
def plot_histories(histories, title):
    fig, subplots = plt.subplots(2,3,figsize=(15,n_splits*4))
    for fold_no, history in enumerate(histories):
        keys = list(history.history.keys())
        
        subplots[0,0].plot(history.history[keys[0]], label=fold_no)
        subplots[0,0].set_title('Binary Crossentropy')

        subplots[0,1].plot(history.history[keys[1]], label=fold_no)
        subplots[0,1].set_title('Precision')
        subplots[0,1].set_ylim(0.4, 1.1)

        subplots[0,2].plot(history.history[keys[2]], label=fold_no)
        subplots[0,2].set_title('Recall')
        subplots[0,2].set_ylim(0.4, 1.1)
        
        subplots[1,0].plot(history.history[keys[3]], label=fold_no)
        subplots[1,0].set_title('Validation Binary Crossentropy')

        subplots[1,1].plot(history.history[keys[4]], label=fold_no)
        subplots[1,1].set_title('Validation Precision')
        subplots[1,1].set_ylim(0.4, 1.1)

        subplots[1,2].plot(history.history[keys[5]], label=fold_no)
        subplots[1,2].set_title('Validation Recall')
        subplots[1,2].set_ylim(0.4, 1.1)
    handles, labels = subplots[1,2].get_legend_handles_labels()
    fig.suptitle(title, fontsize=16)
    fig.legend(handles, labels, title="Fold #")
    plt.show()

In [None]:
def plot_histories2(histories, titles):
    for hists, title in zip(histories, titles):
        plot_histories(hists, title)

In [None]:
plot_histories2([histories1, histories2, histories3, histories4], [title1, title2, title3, title4])

### Explanation of Comparison of Models

## McNemar Test for Models

In [None]:
from statsmodels.stats.contingency_tables import mcnemar



### Explanation of McNemar Test Results for Models

## Implementation of MLP

In [None]:
def build_mlp(metrics):
    reg = l2(0.00001)
    
    mlp = Sequential()
    mlp.add( Dropout(0.25))
    mlp.add( Flatten() )
    # The input size is inferred from the flattened image, so no input_dim is needed
    mlp.add( Dense(units=100, activation='relu', kernel_regularizer= reg) )
    mlp.add( Dropout(0.5))
    mlp.add( Dense(units=50, activation='relu', kernel_regularizer= reg) )
    mlp.add( Dense(units=50, activation='relu', kernel_regularizer= reg) )
    mlp.add( Dense(1) )
    mlp.add( Activation('sigmoid') )

    mlp.compile(loss='binary_crossentropy', optimizer='adam', metrics=metrics)
    
    return mlp

In [None]:
def mlp(metrics):
    print("MLP")
    
    fold_no = 0
    histories = []
    eval_scores = []
    for train_index, test_index in kf.split(X, y):
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]

        mlp = build_mlp(metrics)

        print('Fold',fold_no)
        print('')
        
        history = mlp.fit(X_train, y_train, batch_size=batch_size, epochs=epochs*10, shuffle=True, verbose=1,
                          validation_data=(X_test,y_test))

        print('')
        scores = mlp.evaluate(X_test,y_test, verbose=verbose)
        print('-' * 110)
        
        histories.append(history)
        eval_scores.append(scores)

        fold_no += 1

    eval_scores = np.array(eval_scores)
    print("Average Performance")
    print(f"Precision:  {round(np.mean(eval_scores[:,1]), 5)}")
    print(f"Recall:     {round(np.mean(eval_scores[:,2]), 5)}")
    
    return histories, eval_scores

In [None]:
%%time

histories5, eval_scores5 = mlp(metrics)
scores.append(('mlp', 0, eval_scores5))

In [None]:
title5 = "MLP"
plot_histories(histories5, title5)

## Comparison of Models vs MLP

In [None]:
plot_histories2([histories1, histories2, histories3, histories4, histories5], [title1, title2, title3, title4, title5])

### Explanation of Comparison of Models vs MLP

## McNemar Test for Models vs MLP

In [None]:
# from sklearn import metrics as mt

# mlp = build_mlp(.0001, metrics)
# y_hat = mlp.predict(X_final)
# #print(y_hat, y_final)
# #print(np.count_nonzero(np.round(y_hat)))
# #print(len(y_hat))

# yhat = np.round(y_hat)

# #print(mt.confusion_matrix(y_final,yhat))
# print(mt.classification_report(y_final,yhat))

# mlp.evaluate(X_final, y_final, verbose=verbose)

from sklearn.metrics import confusion_matrix

bm1 = build_basic_model(3, metrics)
bm1.fit(datagen.flow(X, y, batch_size=batch_size), 
                     steps_per_epoch=int(len(X)/batch_size),
                     epochs=epochs, verbose=verbose)
# Threshold the sigmoid outputs at 0.5 to get hard 0/1 class predictions
bm1_results = (bm1.predict(X_final) > 0.5).astype(int)
bm1_confusion = confusion_matrix(y_final, bm1_results.flatten()).ravel()

bm2 = build_basic_model(5, metrics)
bm2.fit(datagen.flow(X, y, batch_size=batch_size), 
                     steps_per_epoch=int(len(X)/batch_size),
                     epochs=epochs, verbose=verbose)
# Threshold the sigmoid outputs at 0.5 to get hard 0/1 class predictions
bm2_results = (bm2.predict(X_final) > 0.5).astype(int)
bm2_confusion = confusion_matrix(y_final, bm2_results.flatten()).ravel()

nin1 = build_nin_model(3, metrics)
nin1.fit(datagen.flow(X, y, batch_size=batch_size), 
                     steps_per_epoch=int(len(X)/batch_size),
                     epochs=epochs, verbose=verbose)
# Threshold the sigmoid outputs at 0.5 to get hard 0/1 class predictions
nin1_results = (nin1.predict(X_final) > 0.5).astype(int)
nin1_confusion = confusion_matrix(y_final, nin1_results.flatten()).ravel()

nin2 = build_nin_model(5, metrics)
nin2.fit(datagen.flow(X, y, batch_size=batch_size), 
                     steps_per_epoch=int(len(X)/batch_size),
                     epochs=epochs, verbose=verbose)
# Threshold the sigmoid outputs at 0.5 to get hard 0/1 class predictions
nin2_results = (nin2.predict(X_final) > 0.5).astype(int)
nin2_confusion = confusion_matrix(y_final, nin2_results.flatten()).ravel()

mlp = build_mlp(metrics)
mlp.fit(datagen.flow(X, y, batch_size=batch_size), 
                     steps_per_epoch=int(len(X)/batch_size),
                     epochs=epochs, verbose=verbose)
# Threshold the sigmoid outputs at 0.5 to get hard 0/1 class predictions
mlp_results = (mlp.predict(X_final) > 0.5).astype(int)
mlp_confusion = confusion_matrix(y_final, mlp_results.flatten()).ravel()

In [None]:
# Named confusion_table so it does not shadow sklearn's confusion_matrix function
confusion_table = [
    ['', 'bm1', 'bm2', 'nin1', 'nin2', 'mlp'],
    ['True Negative',  bm1_confusion[0], bm2_confusion[0], nin1_confusion[0], nin2_confusion[0], mlp_confusion[0]],
    ['False Positive', bm1_confusion[1], bm2_confusion[1], nin1_confusion[1], nin2_confusion[1], mlp_confusion[1]],
    ['False Negative', bm1_confusion[2], bm2_confusion[2], nin1_confusion[2], nin2_confusion[2], mlp_confusion[2]],
    ['True Positive',  bm1_confusion[3], bm2_confusion[3], nin1_confusion[3], nin2_confusion[3], mlp_confusion[3]],
                   ]
#https://stackoverflow.com/questions/13214809/pretty-print-2d-python-list
s = [[str(e) for e in row] for row in confusion_table]
lens = [max(map(len, col)) for col in zip(*s)]
fmt = '\t'.join('{{:{}}}'.format(x) for x in lens)
table = [fmt.format(*row) for row in s]
print ('\n'.join(table))

In [None]:
# Create pairwise contingency tables for the McNemar tests
model_names = ['bm1', 'bm2', 'nin1', 'nin2', 'mlp']
model_preds = [bm1_results, bm2_results, nin1_results, nin2_results, mlp_results]

# correct[i] is a boolean vector: True where model i matched the label in y_final
correct = [np.round(np.asarray(p)).flatten() == y_final for p in model_preds]

def make_table(cell):
    # Header row of model names, then one labelled row per model
    rows = [[''] + model_names]
    for i, name in enumerate(model_names):
        rows.append([name] + [cell(correct[i], correct[j]) for j in range(len(model_names))])
    return rows

# Row model vs. column model, counted over the held-out 20% split
both_correct   = make_table(lambda a, b: int(np.sum(a & b)))    # both models right
first_correct  = make_table(lambda a, b: int(np.sum(a & ~b)))   # row model right, column model wrong
both_incorrect = make_table(lambda a, b: int(np.sum(~a & ~b)))  # both models wrong
print('both correct')
s = [[str(e) for e in row] for row in both_correct]
lens = [max(map(len, col)) for col in zip(*s)]
fmt = '\t'.join('{{:{}}}'.format(x) for x in lens)
table = [fmt.format(*row) for row in s]
print ('\n'.join(table))

print('\nboth incorrect')
s = [[str(e) for e in row] for row in both_incorrect]
lens = [max(map(len, col)) for col in zip(*s)]
fmt = '\t'.join('{{:{}}}'.format(x) for x in lens)
table = [fmt.format(*row) for row in s]
print ('\n'.join(table))

print('\nfirst correct')
s = [[str(e) for e in row] for row in first_correct]
lens = [max(map(len, col)) for col in zip(*s)]
fmt = '\t'.join('{{:{}}}'.format(x) for x in lens)
table = [fmt.format(*row) for row in s]
print ('\n'.join(table))

In [None]:
from itertools import combinations

# McNemar chi-squared statistic for every pair of models.
# b = first model right / second wrong, c = second model right / first wrong.
names = first_correct[0][1:]
for i, j in combinations(range(1, len(names) + 1), 2):
    b = first_correct[i][j]
    c = first_correct[j][i]
    print(names[i - 1], 'and', names[j - 1])
    if b + c == 0:
        # The two models never disagree, so the statistic is undefined
        print(names[i - 1], 'and', names[j - 1], 'are too similar \n')
    else:
        print(((b - c) ** 2) / (b + c))
        print()
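
The statistic `(b-c)**2 / (b+c)` computed above follows a chi-squared distribution with one degree of freedom under the null hypothesis that the two models have the same error rate, so values above 3.841 are significant at the 0.05 level. The sketch below turns the statistic into a p-value using only the standard library. The counts `b` and `c` are hypothetical and the helper names are ours, not part of any library.

```python
import math


def mcnemar_statistic(b, c):
    """McNemar chi-squared statistic (no continuity correction) from the two disagreement counts."""
    if b + c == 0:
        raise ValueError("models never disagree; the statistic is undefined")
    return (b - c) ** 2 / (b + c)


def chi2_sf_1dof(stat):
    """P(X > stat) for a chi-squared variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))


# Hypothetical disagreement counts between two models on the same test set.
b, c = 15, 5
stat = mcnemar_statistic(b, c)
p_value = chi2_sf_1dof(stat)

print(f"statistic = {stat:.3f}, p-value = {p_value:.4f}")
print("significant at 0.05" if p_value < 0.05 else "not significant at 0.05")
```

The `mcnemar` function imported from statsmodels earlier in the notebook is an alternative once the contingency tables are populated.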

In [None]:
from sklearn.metrics import roc_curve, auc

def roc(model):
    # Split the held-out 20% into 4 stratified folds and draw one ROC curve per fold
    kfold = StratifiedKFold(n_splits=4).split(X_final, y_final)

    mean_fpr = np.linspace(0, 1, 100)

    for i, (train, test) in enumerate(kfold):
        # Predicted pneumonia probabilities for this fold
        probas = model.predict(X_final[test]).ravel()

        fpr, tpr, thresholds = roc_curve(y_final[test], probas)
        # Interpolate onto a common FPR grid so the folds are comparable
        fold_tpr = np.interp(mean_fpr, fpr, tpr)
        fold_tpr[0] = 0.0
        roc_auc = auc(fpr, tpr)

        plt.plot(mean_fpr, fold_tpr, '--', lw=1, label='ROC fold %d (area = %0.2f)' % (i+1, roc_auc))

    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.legend(loc='best')
    plt.grid()
    plt.show()

In [None]:
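roc(mlp)
# No ResNet was trained in this notebook, so compare against the four CNNs fit above
for cnn_model in (bm1, bm2, nin1, nin2):
    roc(cnn_model)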
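
As an aid to reading the areas reported in the legends, the sketch below computes AUC from its pairwise interpretation: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. This is an equivalent formulation for illustration, not the `roc_curve`/`auc` routine used above. The labels, scores, and the helper name `pairwise_auc` are made up.

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = pneumonia) and predicted probabilities.
y_demo = [0, 0, 1, 1]
score_demo = [0.1, 0.4, 0.35, 0.8]
auc_demo = pairwise_auc(y_demo, score_demo)
print("AUC:", auc_demo)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, independent of the 0.5 decision threshold used for the confusion matrices.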

### Explanation of McNemar Test Results for Models vs MLP

## Exceptional Work (1 pt)

- You have free rein to provide additional analyses. 
- One idea (<b>required for 7000 level students</b>): Use transfer learning to pre-train the weights of your initial layers of your CNN. Compare the performance when using transfer learning to training without transfer learning (i.e., compare to your best model from above) in terms of classification performance. 