<a href="https://colab.research.google.com/github/KeshavGulati/Flexbox-ch-04/blob/master/performance_metrics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Measuring Model Performance in Python
## Philip Mathew (AI Lead)

First, let's quickly train a model on a dataset. We'll be using MNIST for this notebook, and I'm just going to use a simple random forest classifier from scikit-learn. Don't sweat these details much; they're not the focus of this notebook.

What is important, however, are the modifications I'm making. MNIST is a fairly balanced dataset, but I'm going to make the training data imbalanced in order to demonstrate the effects of imbalance on accuracy. To do so, I'll make the number of training examples taper off as the label value increases: class 0 keeps up to 5,000 examples, while class 9 keeps only 5.
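
To get a feel for how steep that taper is, here's a minimal sketch of the per-class caps. It assumes the same cap of 5,000 and floor of 5 used in the next cell, and it replaces the ```np.linspace``` call with plain Python fractions, which should give the same counts. The name ```kept``` is just for illustration.

```python
# Proportion of examples kept per class: 1.0 for class 0 down to 0.0 for class 9.
# (9 - i) / 9 stands in for reversed np.linspace(0, 1, num=10).
class_prop = [(9 - i) / 9 for i in range(10)]

# Cap each class at 5000 * proportion, but never keep fewer than 5 examples.
kept = [max(5, int(5000 * p)) for p in class_prop]

for label, n in enumerate(kept):
    print(f"class {label}: {n} training examples")

print("total:", sum(kept))
```

Under those assumptions class 8 ends up with 555 examples and class 9 with just 5, so the last two digits are badly underrepresented.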

In [None]:
import random

import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train.reshape(x_train.shape[0], 784), x_test.reshape(x_test.shape[0], 784)

class_prop = list(reversed(np.linspace(0, 1, num=10)))

tmp = []
labels = []
for i in range(10):
  # For each class, reduce the total number of examples from that class to
  # fit the desired proportions (keeping at least 5 examples per class)
  class_vecs = x_train[y_train == i]
  class_vecs = class_vecs[:max(5, int(5000 * class_prop[i]))]
  for vec in class_vecs:
    tmp.append(vec)
    labels.append(i)

# Shuffle the training images and their labels together so each pair stays aligned
c = list(zip(tmp, labels))
random.shuffle(c)
a, b = zip(*c)
x_train = np.asarray(a)
y_train = np.asarray(b)

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 10))

classes = list(range(10))
freqs1, freqs2 = np.bincount(y_train), np.bincount(y_test)
freqs1 = [freqs1[i] if i < len(freqs1) else 0 for i in classes]
freqs2 = [freqs2[i] if i < len(freqs2) else 0 for i in classes]

ax1.bar(classes, freqs1)
ax1.set_xticks(classes)
ax1.set_title('Distribution of Training Data')

ax2.bar(classes, freqs2)
ax2.set_xticks(classes)
ax2.set_title('Distribution of Test Data');

In [None]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100, max_depth=28)

In [None]:
model.fit(x_train, y_train)

In [None]:
y_pred = model.predict(x_test)

Now, the most widely used library for measuring performance is naturally ```sklearn```, specifically the ```sklearn.metrics``` module. I encourage all of you to read through the functions in there to understand the various metrics used to measure models.

In practice, however, there's one specific function that's often used: ```sklearn.metrics.classification_report()```. In short, it computes the easy per-class metrics ([precision, recall](https://en.wikipedia.org/wiki/Precision_and_recall), F1 score, and support) along with overall accuracy and the macro and weighted averages. It does not compute AUC; for that you'd use a separate function such as ```sklearn.metrics.roc_auc_score()```.
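
Before running it on the real model, here's a small sketch of what the report contains. The labels below are made up purely for illustration (class 2 is never predicted correctly), and ```output_dict=True``` is used so the numbers can be read programmatically rather than just printed.

```python
from sklearn.metrics import classification_report

# Toy labels, invented for illustration: every class-2 example is misclassified.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 1, 1, 1, 0, 0, 1]

# zero_division=0 avoids a warning for class 2, which is never predicted.
report = classification_report(y_true, y_pred, output_dict=True, zero_division=0)

# Per-class rows are keyed by the label converted to a string.
for label in ("0", "1", "2"):
    row = report[label]
    print(f"class {label}: precision={row['precision']:.2f} "
          f"recall={row['recall']:.2f} f1={row['f1-score']:.2f}")

print(f"accuracy:         {report['accuracy']:.2f}")
print(f"macro avg recall: {report['macro avg']['recall']:.2f}")
```

The overall accuracy here is 0.70, which looks tolerable on its own, while the recall for class 2 is 0. That gap is the kind of thing to look for in the real report.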

In [None]:
from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred))

Now, as we can see, this model has an accuracy of roughly 85% (note: your number will almost certainly differ, but it shouldn't be more than about 5 percentage points off), which is honestly atrocious as far as MNIST goes. In general, however, it's not a bad score. With that said, let's draw up the confusion matrix.

In [None]:
import seaborn as sns
from sklearn.metrics import confusion_matrix

def plot_confmat(true_labels, pred_labels):
    """
    Plots a confusion matrix from given data
    """
    fig2, ax = plt.subplots(1, 1, num=2, figsize=(10, 10))

    cm = confusion_matrix(true_labels, pred_labels)
    cm_norm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]  # normalize the confusion matrix
    for pair in np.argwhere(np.isnan(cm_norm)):
        cm_norm[pair[0]][pair[1]] = 0

    annot = np.zeros_like(cm, dtype=object)
    for i in range(annot.shape[0]):  # Creates an annotation array for the heatmap
        for j in range(annot.shape[1]):
            annot[i][j] = f'{cm[i][j]}\n{round(cm_norm[i][j] * 100, ndigits=3)}%'

    ax = sns.heatmap(cm_norm, annot=annot, fmt='', cbar=True, cmap=plt.cm.magma, vmin=0, ax=ax) # plot the confusion matrix

    ax.set(xlabel='Predicted Label', ylabel='Actual Label')

    fig2.tight_layout()

plot_confmat(y_test, y_pred)

(sidenote: feel free to steal this function, I'll take hot chocolate as royalties)

Now the confusion matrix demonstrates the problem. As we can see, the model has downright terrible accuracy on classes 8 and 9, the two classes with the fewest training examples. To get a better model, we'd need to find a way to fix this. Notice that we would not have seen this if we had taken the overall accuracy score at face value.
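
If you'd rather have numbers than a heatmap, the same information can be pulled straight out of the confusion matrix. This is a minimal sketch on a made-up 3-class matrix (not the MNIST results above), assuming rows are actual labels and columns are predicted labels, which is the layout ```confusion_matrix``` returns.

```python
import numpy as np

# Hypothetical confusion matrix: rows = actual label, columns = predicted label.
cm = np.array([
    [98,  1,  1],
    [ 2, 97,  1],
    [30, 25, 45],
])

# Overall accuracy: correct predictions (the diagonal) over all predictions.
accuracy = np.trace(cm) / cm.sum()

# Per-class recall: correct predictions for a class over its actual count.
recall = np.diag(cm) / cm.sum(axis=1)

worst_class = int(np.argmin(recall))

print(f"overall accuracy: {accuracy:.2f}")
for label, r in enumerate(recall):
    print(f"class {label} recall: {r:.2f}")
print(f"weakest class: {worst_class}")
```

In this toy case the overall accuracy is 0.80, yet class 2 is recognised less than half the time. The single headline number hides that; the per-class breakdown doesn't.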