In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Review of Model Evaluation

This topic covers things like:
- Confusion Matrix
- True/False Positives/Negatives
- Accuracy
- Recall
- Error
- Specificity
- Precision
- F1-score

## Let's start with the Confusion Matrix
This is a way to quantify the performance of a classification model. From it we can derive metrics such as accuracy, specificity, precision, F1-score and others.

A Confusion Matrix looks like this:

In [2]:
# Predicted     0   1
# Actual:    0 TN  FP
# Actual:    1 FN  TP 

Here we have two axes: the predicted values (what the classification model predicts) and the actual values. Notice that in this case we only have two classes, 0 and 1, so our confusion matrix is 2x2.

**Q for teacher:** Can classification models predict more than 2 states, and if so, will we have an nxn confusion matrix, where n is the # of states?
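On the question above: multi-class classifiers do exist, and with n classes each (actual, predicted) pair falls into one of n x n cells. Here is a minimal sketch of that counting in plain Python. The function name `multiclass_confusion_matrix` and the three-class labels are made up for illustration and are not part of the diabetes example.

```python
def multiclass_confusion_matrix(actual, predicted, n_classes):
    """Count (actual, predicted) pairs into an n_classes x n_classes grid."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for a, p in zip(actual, predicted):
        matrix[a][p] += 1  # row = actual class, column = predicted class
    return matrix

# Made-up labels for a three-class problem (classes 0, 1, 2)
actual_labels = [0, 0, 1, 1, 2, 2, 2]
predicted_labels = [0, 1, 1, 1, 2, 0, 2]

three_class_matrix = multiclass_confusion_matrix(actual_labels, predicted_labels, 3)
for row in three_class_matrix:
    print(row)
```

The diagonal still holds the correct predictions, and everything off the diagonal is a misclassification.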

Let's test this out by using logistic regression on a diabetes dataset

In [3]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

df = pd.read_csv('diabetes.csv')

feature_cols = ['Pregnancies', 'Insulin', 'BMI', 'Age']
feature_df = df[feature_cols]

X = feature_df.to_numpy()
y = df['Outcome'].to_numpy()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.25, random_state=0)

log_reg = LogisticRegression(solver="lbfgs")
log_reg.fit(X_train, y_train)

y_pred = log_reg.predict(X_test)

We can generate our own confusion matrix:

In [4]:
def generate_confusion_matrix(actual_list, predicted_list):
    # Layout: rows = actual class, columns = predicted class
    # [[TN, FP],
    #  [FN, TP]]
    out = [[0., 0.], [0., 0.]]

    assert len(actual_list) == len(predicted_list)

    for actual, predicted in zip(actual_list, predicted_list):
        if actual == predicted:  # Correct prediction
            if actual == 0:  # True Negatives
                out[0][0] += 1
            else:  # True Positives
                out[1][1] += 1
        else:  # Wrong prediction
            if predicted == 1:  # False Positives
                out[0][1] += 1
            else:  # False Negatives
                out[1][0] += 1

    return np.array(out)

In [5]:
generate_confusion_matrix(y_test, y_pred)

array([[114.,  16.],
       [ 46.,  16.]])

Or we can use an sklearn module:

In [6]:
from sklearn.metrics import confusion_matrix

c_matrix = confusion_matrix(y_test, y_pred)
c_matrix

array([[114,  16],
       [ 46,  16]])

**What does this tell us?**
Well, we can get 4 values from this:
- True Negatives - [0][0] - Predicted was 0 and so was actual
- True Positives - [1][1] - Predicted was 1 and so was actual
- False Negatives - [1][0] - Predicted was 0 but actual was 1 (These are very bad in a medical setting like this one: a real positive case is missed!)
- False Positives - [0][1] - Predicted was 1 but actual was 0

From these values we can calculate all sorts of things. Let's go through them!
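As a quick sketch, the four cells can be unpacked by position. The numbers below are copied from the confusion matrix output above, and the variable names (`tn`, `fp`, `fn`, `tp`) are just shorthand chosen for this example.

```python
# Confusion matrix values copied from the output above
matrix = [[114, 16],
          [46, 16]]

# Row = actual class, column = predicted class
(tn, fp), (fn, tp) = matrix

print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
```

With the numpy array returned by sklearn, `c_matrix.ravel()` yields the values in the same order (TN, FP, FN, TP).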

### Accuracy
**Overall, how often is our classifier correct?**  
> (TruePositives + TrueNegatives) / sum(ALL)

In [7]:
def get_accuracy(confusion_matrix):
    total = confusion_matrix.sum()  # sum over all four cells
    tp_plus_tn = confusion_matrix[1][1] + confusion_matrix[0][0]
    return tp_plus_tn/total

get_accuracy(c_matrix)

0.6770833333333334

### Error
**Overall, how often is our classifier incorrect?**  
> (FalsePositives + FalseNegatives) / sum(ALL)

In [8]:
def get_error(confusion_matrix):
    total = confusion_matrix.sum()  # sum over all four cells
    fp_plus_fn = confusion_matrix[0][1] + confusion_matrix[1][0]
    return fp_plus_fn/total

get_error(c_matrix)

0.3229166666666667
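Accuracy and error split the same total between them, so they always add up to 1. A small arithmetic check, using the four counts from the confusion matrix above (the variable names are just for this sketch):

```python
# Counts copied from the confusion matrix above
tn, fp, fn, tp = 114, 16, 46, 16
total = tn + fp + fn + tp

accuracy = (tp + tn) / total  # correct predictions over everything
error = (fp + fn) / total     # wrong predictions over everything

print(accuracy, error, accuracy + error)
```

So in practice only one of the two needs to be computed: `error = 1 - accuracy`.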

### Recall (Want it to be very high!)
**When the actual value is positive, how often is the prediction correct?**  
> TruePositives / (TruePositives + FalseNegatives)

In [9]:
def get_recall(confusion_matrix):
    # Row 1 holds every actual positive: TP + FN
    tp_plus_fn = confusion_matrix[1][1] + confusion_matrix[1][0]
    tp = confusion_matrix[1][1]
    return tp/tp_plus_fn

get_recall(c_matrix)

0.25806451612903225

### Precision
**When a positive value is predicted, how often is the prediction correct?**  
> TruePositives / (TruePositives + FalsePositives)

In [10]:
def get_precision(confusion_matrix):
    tp_plus_fp = confusion_matrix[1][1] + confusion_matrix[0][1]
    tp = confusion_matrix[1][1]
    return tp/tp_plus_fp

get_precision(c_matrix)

0.5

### Specificity
**When the actual value is negative, how often is the prediction correct?**  
> TrueNegatives / (TrueNegatives + FalsePositives)

In [11]:
def get_specificity(confusion_matrix):
    tn_plus_fp = confusion_matrix[0][0] + confusion_matrix[0][1]
    tn = confusion_matrix[0][0]
    return tn/tn_plus_fp

get_specificity(c_matrix)

0.8769230769230769
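Recall and specificity are the same calculation applied to different rows: each divides a diagonal cell by its row total. A rough sketch of that view, with a hypothetical helper `per_class_rates` that is not part of the notebook:

```python
def per_class_rates(matrix):
    """For each row (actual class), the fraction that was predicted correctly."""
    return [row[i] / sum(row) for i, row in enumerate(matrix)]

# Confusion matrix values copied from the output above
matrix = [[114, 16],
          [46, 16]]

specificity, recall = per_class_rates(matrix)
print(f"specificity={specificity:.4f}, recall={recall:.4f}")
```

Row 0 (actual negatives) gives specificity and row 1 (actual positives) gives recall.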

### F1-score
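**The harmonic mean of precision and recall, useful when we care about both at once**  
> 2 * (Precision * Recall) / (Precision + Recall)  

The higher the F1 score, the better. It ignores true negatives, though, so a low F1 score does not always mean the model is useless; we have to take it case by case. If there are no true positives, recall is 0 and the formula breaks down (precision + recall is 0 or undefined), so that case needs special handling.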

In [12]:
def get_f1_score(confusion_matrix):
    # Compute each metric once and reuse it
    precision = get_precision(confusion_matrix)
    recall = get_recall(confusion_matrix)
    return 2 * precision * recall / (precision + recall)

get_f1_score(c_matrix)

0.3404255319148936
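When a model produces no true positives, precision + recall is 0 and the F1 formula divides by zero. A minimal sketch of one way to guard against that, using the equivalent form F1 = 2TP / (2TP + FP + FN). The name `safe_f1` and the choice to return 0.0 in the degenerate case are assumptions made for this example.

```python
def safe_f1(tp, fp, fn):
    """F1 from raw counts; returns 0.0 when the denominator is zero."""
    denominator = 2 * tp + fp + fn
    if denominator == 0:
        return 0.0  # assumed convention: no positives anywhere scores 0
    return 2 * tp / denominator

# Counts copied from the confusion matrix above
print(safe_f1(tp=16, fp=16, fn=46))

# Degenerate case: the model never finds a positive
print(safe_f1(tp=0, fp=0, fn=62))
```

The first call should reproduce the value from `get_f1_score`, since both formulas are algebraically the same.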

## We can also evaluate models using sklearn
We can use k-fold cross validation to obtain values like recall/F1-score for our logistic regression model. (Not to be confused with k-means, which is a clustering algorithm.)

### What is k-fold cross validation? 
It is a technique used to validate models by splitting the input data into k parts (folds). One at a time, each fold is used as the test data while the remaining 1-1/k of the data is used as the training data. This gives an output of k f1-scores or recalls (or whatever other metric we want), and we can take the mean to get the average performance of our model.
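The splitting step can be sketched in a few lines of plain Python. This is only an illustration of the idea with a made-up helper `k_fold_indices`: it uses contiguous folds with no shuffling or stratification, which may differ from what sklearn does internally.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

folds = list(k_fold_indices(n_samples=10, k=5))
for train, test in folds:
    print(f"train={train} test={test}")
```

Every sample ends up in the test set exactly once, and each model is trained on the other 1-1/k of the data.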

When doing logistic regression, we can also apply weights to the outcome classes. For example in the case below, we have more 0 outcomes than 1:

In [13]:
df = pd.read_csv('diabetes.csv')

feature_cols = ['Pregnancies', 'Insulin', 'BMI', 'Age']
feature_df = df[feature_cols]

X = feature_df.to_numpy()
y = df['Outcome']

print(f'Outcome value counts: \n{y.value_counts()}')

Outcome value counts: 
0    500
1    268
Name: Outcome, dtype: int64


Hence, here is how we can instantiate our logistic regression model:

In [16]:
log_reg = LogisticRegression(class_weight={1: 500/268}, solver='lbfgs')

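#### Note: the `1` is the class label (Outcome = 1). `class_weight` maps class label to weight, so positives get weight 500/268 and class 0 keeps the default weight of 1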
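The `class_weight` dictionary maps a class label to its weight. sklearn also accepts `class_weight='balanced'`, which weights each class by n_samples / (n_classes * class_count). The sketch below reproduces that formula by hand to compare it with the 500/268 ratio used above; the helper name `balanced_class_weights` is made up for this example.

```python
from collections import Counter

def balanced_class_weights(labels):
    """n_samples / (n_classes * class_count) for each class label."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}

# Same class counts as the Outcome column above: 500 zeros, 268 ones
labels = [0] * 500 + [1] * 268
weights = balanced_class_weights(labels)

print(weights)
print(weights[1] / weights[0])  # relative weight of class 1 versus class 0
```

The relative weight of class 1 versus class 0 comes out to 500/268, the same ratio as the hand-written dictionary, so the two approaches should behave similarly up to an overall scaling of the weights.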

We can then, right away, get some metrics by passing that logistic regression model to sklearn's cross_val_score function.  
In this case we are going to be using **5-fold** cross validation

In [17]:
from sklearn.model_selection import cross_val_score 

all_accuracies = cross_val_score(estimator=log_reg, X=X, y=y, cv=5, scoring='accuracy')
mean_accuracy = all_accuracies.mean()

print(f'All Accuracies: {all_accuracies}')
print(f'Mean Accuracy: {mean_accuracy}')

All Accuracies: [0.64935065 0.65584416 0.64935065 0.70588235 0.65359477]
Mean Accuracy: 0.6628045157456922


In [18]:
all_f1_scores = cross_val_score(estimator=log_reg, X=X, y=y, cv=5, scoring='f1')
mean_f1_score = all_f1_scores.mean()

print(f'All f1 scores: {all_f1_scores}')
print(f'Mean f1 score: {mean_f1_score}')

All f1 scores: [0.578125   0.55462185 0.54237288 0.64       0.576     ]
Mean f1 score: 0.5782239460190857


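**Notice** that the f1 score here is higher than what we got before. A likely reason is that the earlier model was fit **without** class weights, while this one up-weights the positive class (the earlier score also came from a single train/test split rather than 5 folds)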

### Note for the future:
When choosing between classification models, say logistic regression and an SVM, we should choose the one with the least variance in its cross validation scores. If both are very similar, choose the one that gives the highest mean accuracy!  

Also, when doing k-fold cross validation, k is usually chosen as 5
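Comparing models this way comes down to looking at the mean and the spread of the fold scores. A small sketch: the logistic regression accuracies are copied from the output above, while the SVM scores are invented purely for illustration and the helper `summarize` is hypothetical.

```python
from statistics import mean, pstdev

# Logistic regression accuracies copied from the cross validation output above
log_reg_scores = [0.64935065, 0.65584416, 0.64935065, 0.70588235, 0.65359477]
# Hypothetical SVM fold scores, invented for illustration only
svm_scores = [0.60, 0.72, 0.65, 0.70, 0.63]

def summarize(scores):
    """Return (mean, standard deviation) of the fold scores."""
    return mean(scores), pstdev(scores)

for name, scores in [("log_reg", log_reg_scores), ("svm (hypothetical)", svm_scores)]:
    avg, spread = summarize(scores)
    print(f"{name}: mean={avg:.4f} std={spread:.4f}")
```

With these made-up numbers the two means are close, so the rule above would favor the model whose scores vary less from fold to fold.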

## Choosing the best possible hyper-parameters for an SVM (polynomial/rbf)
We can use what is called grid search, which tries every combination of candidate C and gamma values with cross validation and picks the best one for us. 

In [22]:
from sklearn import svm
from sklearn.model_selection import GridSearchCV

def svc_param_selection(X, y, nfolds, kernel='rbf'):
    # gamma only affects the 'rbf', 'poly' and 'sigmoid' kernels; a linear kernel ignores it
    Cs = [0.001, 0.01, 0.1, 1, 10]
    gammas = [0.001, 0.01, 0.1, 1]
    param_grid = {'C': Cs, 'gamma': gammas}
    grid_search = GridSearchCV(svm.SVC(kernel=kernel), param_grid, cv=nfolds)
    grid_search.fit(X, y)
    return grid_search.best_params_
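Conceptually, grid search is just a loop over every parameter combination that keeps the best-scoring one. Below is a rough sketch of that loop in plain Python. It is not how `GridSearchCV` is implemented; `grid_search` and `toy_score` are hypothetical names, and `toy_score` is a stand-in for "mean cross-validated score of an SVM with these parameters".

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Try every parameter combination and return the best one with its score."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical scoring function that simply peaks at C=1, gamma=0.01
def toy_score(C, gamma):
    return -abs(C - 1) - abs(gamma - 0.01)

param_grid = {"C": [0.001, 0.01, 0.1, 1, 10], "gamma": [0.001, 0.01, 0.1, 1]}
best_params, best_score = grid_search(param_grid, toy_score)
print(best_params)
```

With 5 values of C and 4 values of gamma this evaluates 20 combinations, and with 5-fold cross validation each one would be fit 5 times, which is why grid search gets slow on larger grids.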

Let's say we want to find the best C and gamma for an SVM on the same diabetes data

In [23]:
y = np.array(y)

In [None]:
svc_param_selection(X,y,5)