In [None]:
!git clone https://github.com/Hotsnown/seminaire-bordeaux-2022.git seminaire &> /dev/null
%pip install nbautoeval &> /dev/null
from evaluation.jour2.listes.listes import exo_create_list, exo_add_list, exo_lenght, exo_get_item, exo_is_empty, exo_less_than_5, exo_first_last

## Evaluating a classification model

### Agenda
* What is the purpose of model evaluation, and what are some common evaluation procedures?
* How is classification accuracy used, and what are its limitations?
* How does a confusion matrix describe the performance of a classifier?
* What metrics can be computed from a confusion matrix?
* How can you adjust classifier performance by changing the classification threshold?
* What is the purpose of an ROC curve?
* How does Area Under the Curve (AUC) differ from classification accuracy?

### Review of model evaluation
* Need a way to choose between models: different model types, tuning parameters, and features
* Use a model evaluation procedure to estimate how well a model will generalize to out-of-sample data
* Requires a model evaluation metric to quantify the model performance

### Model evaluation procedures
1. Training and testing on the same data
    * Rewards overly complex models that "overfit" the training data and won't necessarily generalize
2. Train/test split
    * Split the dataset into two pieces, so that the model can be trained and tested on different data
    * Better estimate of out-of-sample performance, but still a "high variance" estimate
    * Useful due to its speed, simplicity, and flexibility
3. K-fold cross-validation
    * Systematically create "K" train/test splits and average the results together
    * Even better estimate of out-of-sample performance
    * Runs "K" times slower than train/test split
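
To make the K-fold idea concrete, here is a minimal sketch of how "K" train/test splits can be generated from row indices. The helper `k_fold_indices` is a hypothetical name written for this illustration, not a library function, and it uses contiguous folds without shuffling; library implementations typically offer shuffling and stratification on top of this.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds (illustrative helper)."""
    # spread the remainder over the first folds so fold sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

# 10 observations split into 3 folds
folds = list(k_fold_indices(n_samples=10, k=3))
for train_idx, test_idx in folds:
    print('train:', train_idx, 'test:', test_idx)
```

Each observation lands in the testing set exactly once, and the model would be fit "K" times, which is where the extra running time comes from.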

### Model evaluation metrics
* Regression problems: Mean Absolute Error, Mean Squared Error, Root Mean Squared Error
* Classification problems: Classification accuracy
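
As a quick reminder of what the three regression metrics measure, the sketch below computes them by hand on made-up numbers (the values in `y_true` and `y_pred` are assumptions chosen for easy arithmetic, not data from this notebook).

```python
import math

# made-up true and predicted values for a regression problem
y_true = [100, 50, 30, 20]
y_pred = [90, 50, 50, 30]

errors = [t - p for t, p in zip(y_true, y_pred)]

mae = sum(abs(e) for e in errors) / len(errors)   # Mean Absolute Error
mse = sum(e ** 2 for e in errors) / len(errors)   # Mean Squared Error
rmse = math.sqrt(mse)                             # Root Mean Squared Error

print(mae, mse, rmse)
```

MSE and RMSE penalize the single large error (20) more heavily than MAE does, which is why RMSE comes out larger than MAE here.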



### Classification accuracy

Pima Indian Diabetes dataset from the UCI Machine Learning Repository

In [None]:
# read the data into a Pandas DataFrame
import pandas as pd
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data'
col_names = ['pregnant', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree', 'age', 'label']
pima = pd.read_csv(url, header=None, names=col_names)

In [None]:
# print the first 5 rows of data
pima.head()

Question: Can we predict the diabetes status of a patient given their health measurements?

In [None]:
# define X and y
feature_cols = ['pregnant', 'insulin', 'bmi', 'age']
X = pima[feature_cols]
y = pima.label

In [None]:
# split X and y into training and testing sets
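# train_test_split lives in sklearn.model_selection (the old sklearn.cross_validation module was removed in scikit-learn 0.20)
from sklearn.model_selection import train_test_split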
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

In [None]:
# train a logistic regression model on the training set
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X_train, y_train)

In [None]:
# make class predictions for the testing set
y_pred_class = logreg.predict(X_test)

Classification accuracy: percentage of correct predictions

In [None]:
# calculate accuracy
from sklearn import metrics
print(metrics.accuracy_score(y_test, y_pred_class))

Null accuracy: accuracy that could be achieved by always predicting the most frequent class

In [None]:
# examine the class distribution of the testing set (using a Pandas Series method)
y_test.value_counts()

In [None]:
# calculate the percentage of ones
y_test.mean()

In [None]:
# calculate the percentage of zeros
1 - y_test.mean()

In [None]:
# calculate null accuracy (for binary classification problems coded as 0/1)
max(y_test.mean(), 1 - y_test.mean())

In [None]:
# calculate null accuracy (for multi-class classification problems)
y_test.value_counts().head(1) / len(y_test)
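
The same idea can be checked without Pandas. The sketch below uses a made-up list of labels (an assumption for illustration) and shows that a "model" that always predicts the most frequent class scores exactly the null accuracy.

```python
from collections import Counter

# made-up response values: 7 zeros and 3 ones
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]

# most frequent class and how often it occurs
most_common_label, count = Counter(y_true).most_common(1)[0]
null_accuracy = count / len(y_true)

# a baseline that always predicts the most frequent class
y_always = [most_common_label] * len(y_true)
baseline_accuracy = sum(t == p for t, p in zip(y_true, y_always)) / len(y_true)

print(null_accuracy, baseline_accuracy)
```

Any classifier worth keeping should beat this baseline on the testing set.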

Comparing the true and predicted response values

In [None]:
# print the first 25 true and predicted responses
from __future__ import print_function
print('True:', y_test.values[0:25])
print('Pred:', y_pred_class[0:25])

Conclusion:

* Classification accuracy is the easiest classification metric to understand
* But, it does not tell you the underlying distribution of response values
* And, it does not tell you what "types" of errors your classifier is making
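
A small made-up example illustrates the last point: the two sets of predictions below (assumed values, not output from the model above) have identical accuracy, yet one only misses positives and the other only raises false alarms.

```python
# made-up true labels and two hypothetical sets of predictions
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
pred_a = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]  # misses two positives
pred_b = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]  # raises two false alarms

def accuracy(y_true, y_pred):
    """Proportion of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def count_errors(y_true, y_pred):
    """Return (false alarms, missed positives) for 0/1 labels."""
    false_alarms = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    missed_positives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return false_alarms, missed_positives

print(accuracy(y_true, pred_a), count_errors(y_true, pred_a))
print(accuracy(y_true, pred_b), count_errors(y_true, pred_b))
```

Accuracy alone cannot tell these two classifiers apart, which motivates the confusion matrix.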

### Confusion matrix

Table that describes the performance of a classification model

In [None]:
# IMPORTANT: first argument is true values, second argument is predicted values
print(metrics.confusion_matrix(y_test, y_pred_class))

* Every observation in the testing set is represented in exactly one box
* It's a 2x2 matrix because there are 2 response classes
* The format shown here is not universal
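
To see how the boxes are filled, here is a from-scratch sketch on made-up labels. The helper `confusion_matrix_2x2` is a hypothetical name for this illustration; it follows the layout scikit-learn uses, with rows for the actual class and columns for the predicted class.

```python
def confusion_matrix_2x2(y_true, y_pred):
    """Count 0/1 outcomes into a 2x2 table: rows = actual, columns = predicted."""
    matrix = [[0, 0], [0, 0]]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1
    return matrix

# made-up true and predicted labels
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 0, 0, 1, 0, 1, 0, 1, 1]

matrix = confusion_matrix_2x2(y_true, y_pred)
for row in matrix:
    print(row)
```

The four counts add up to the number of observations, since each observation falls into exactly one box.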

Basic terminology

* True Positives (TP): we correctly predicted that they do have diabetes
* True Negatives (TN): we correctly predicted that they don't have diabetes
* False Positives (FP): we incorrectly predicted that they do have diabetes (a "Type I error")
* False Negatives (FN): we incorrectly predicted that they don't have diabetes (a "Type II error")

In [None]:
# print the first 25 true and predicted responses
print('True:', y_test.values[0:25])
print('Pred:', y_pred_class[0:25])

In [None]:
# save confusion matrix and slice into four pieces
confusion = metrics.confusion_matrix(y_test, y_pred_class)
TP = confusion[1, 1]
TN = confusion[0, 0]
FP = confusion[0, 1]
FN = confusion[1, 0]

### Metrics computed from a confusion matrix

Classification Accuracy: Overall, how often is the classifier correct?

In [None]:
print((TP + TN) / float(TP + TN + FP + FN))
print(metrics.accuracy_score(y_test, y_pred_class))

Classification Error: Overall, how often is the classifier incorrect?

Also known as "Misclassification Rate"

In [None]:
print((FP + FN) / float(TP + TN + FP + FN))
print(1 - metrics.accuracy_score(y_test, y_pred_class))

Sensitivity: When the actual value is positive, how often is the prediction correct?

* How "sensitive" is the classifier to detecting positive instances?
* Also known as "True Positive Rate" or "Recall"

In [None]:
print(TP / float(TP + FN))
print(metrics.recall_score(y_test, y_pred_class))

Specificity: When the actual value is negative, how often is the prediction correct?

How "specific" (or "selective") is the classifier in predicting positive instances?

In [None]:
print(TN / float(TN + FP))

False Positive Rate: When the actual value is negative, how often is the prediction incorrect?

In [None]:
print(FP / float(TN + FP))
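
Specificity and False Positive Rate share the same denominator (all actual negatives), so they always sum to 1. A short check with made-up counts (the values below are assumptions, not the counts from this notebook):

```python
import math

# made-up counts for the actual-negative row of a confusion matrix
tn_demo, fp_demo = 45, 5

specificity = tn_demo / (tn_demo + fp_demo)
false_positive_rate = fp_demo / (tn_demo + fp_demo)

print(specificity, false_positive_rate)
# the two rates are complements of each other
print(math.isclose(specificity + false_positive_rate, 1.0))
```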

Precision: When a positive value is predicted, how often is the prediction correct?

How "precise" is the classifier when predicting positive instances?

In [None]:
print(TP / float(TP + FP))
print(metrics.precision_score(y_test, y_pred_class))

Many other metrics can be computed: F1 score, Matthews correlation coefficient, etc.
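
As a sketch of how two of these are defined, the block below computes the F1 score and the Matthews correlation coefficient from the four confusion matrix counts. The counts are made up for illustration, and the lowercase names are chosen to avoid overwriting the notebook's `TP`, `TN`, `FP`, and `FN`.

```python
import math

# made-up confusion matrix counts
tp, tn, fp, fn = 40, 45, 5, 10

precision = tp / (tp + fp)
recall = tp / (tp + fn)

# F1 score: harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

# Matthews correlation coefficient: ranges from -1 to +1 and uses all four counts
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)

print(f1, mcc)
```

F1 ignores true negatives entirely, while the Matthews correlation coefficient takes all four boxes into account, so the two can rank classifiers differently on imbalanced data.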

Conclusion:

* Confusion matrix gives you a more complete picture of how your classifier is performing
* Also allows you to compute various classification metrics, and these metrics can guide your model selection

Which metrics should you focus on?

* Choice of metric depends on your business objective
* Spam filter (positive class is "spam"): Optimize for precision or specificity because false negatives (spam goes to the inbox) are more acceptable than false positives (non-spam is caught by the spam filter)
* Fraudulent transaction detector (positive class is "fraud"): Optimize for sensitivity because false positives (normal transactions that are flagged as possible fraud) are more acceptable than false negatives (fraudulent transactions that are not detected)
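
The sketch below shows how the objective changes which model wins. Both candidate models and their counts are hypothetical, invented only to illustrate the trade-off: one is cautious about predicting the positive class, the other predicts it liberally.

```python
# hypothetical confusion matrix counts for two candidate models
candidates = {
    'model_a': {'tp': 30, 'fp': 2, 'fn': 20, 'tn': 948},   # cautious
    'model_b': {'tp': 45, 'fp': 25, 'fn': 5, 'tn': 925},   # liberal
}

def precision(c):
    """When a positive value is predicted, how often is the prediction correct?"""
    return c['tp'] / (c['tp'] + c['fp'])

def sensitivity(c):
    """When the actual value is positive, how often is the prediction correct?"""
    return c['tp'] / (c['tp'] + c['fn'])

# spam filter: false positives are costly, so rank by precision
best_for_spam = max(candidates, key=lambda name: precision(candidates[name]))

# fraud detector: false negatives are costly, so rank by sensitivity
best_for_fraud = max(candidates, key=lambda name: sensitivity(candidates[name]))

print('spam filter:', best_for_spam)
print('fraud detector:', best_for_fraud)
```

With these made-up counts the cautious model is preferred for the spam filter and the liberal model for the fraud detector, even though neither is better on every metric.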