# Evaluation Metrics
Ocha data science study group https://github.com/Fadouagh/data-ocha.git
Author: Fadoua Ghourabi (fadouaghourabi@gmail.com)

Date: July 4, 2019

In [70]:
import numpy as np
from sklearn.model_selection import train_test_split

## Accuracy, R$^2$

It is common sense to evaluate the performance of a machine learning model. So far in this study group, we used two evaluation metrics:
- accuracy of a classification model is the fraction of correctly classified samples
$$\text{accuracy} = \frac{\text{number of correctly classified samples}}{\text{total number of samples}}$$
- score R$^2$ of a regression model is defined as follows
$$R^2 = 1 - \frac{\sum_{i=1}^{m}(y^{(i)} - \hat{y}^{(i)})^2}{\sum_{i=1}^{m}(y^{(i)}-\overline{y})^2},$$ where $m$ is the number of samples (observations), $y^{(i)}$ is the target value of sample $i$, $\hat{y}^{(i)}$ is the predicted value for sample $i$ and $\overline{y}$ is the mean of the target values.
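
The two definitions above can be written out directly with NumPy. This is a minimal sketch rather than the study group's code: the helper names `accuracy` and `r2` and the toy arrays are made up for illustration.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean(y_true == y_pred)

def r2(y_true, y_pred):
    """1 - (sum of squared residuals) / (sum of squared deviations from the mean)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot

# Toy data, invented for illustration only
acc = accuracy([1, 0, 1, 1], [1, 0, 0, 1])               # 3 of 4 labels match
score = r2([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 5.0])   # one prediction is off by 1
print(acc, score)
```

For the toy data, three of four labels match, and the single regression error of 1 is compared against a total squared deviation of 5 around the mean.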

However, accuracy and $R^2$ are not the only metrics. In some situations, they might not be appropriate. It is important **to select an evaluation metric that suits your application**.

### Example 1: critical application

Suppose we are asked to build a classification model for early detection of cancer. If the result is negative ("no cancer"), the patient is assumed healthy. If the result is positive ("possibly cancer"), the patient undergoes additional diagnostic tests. Suppose our model achieves a high accuracy of 98%. Because the application is critical, we cannot be satisfied with the 98% score alone. We need to ask what the consequences of the model's errors are. If a healthy person is classified as "possibly cancer", an incorrect positive prediction or **false positive**, the person goes through expensive medical tests (and possibly unnecessary distress). If a sick patient is classified as "no cancer", an incorrect negative prediction or **false negative**, a serious health issue could go undetected. Beyond high accuracy, the model should clearly avoid false negatives as much as possible.

### Example 2: application with imbalanced dataset

Suppose we are given a classification problem where one class is much more frequent than the other. For instance, 99% of the data belongs to class A and 1% to class B. Even if we make a dummy classifier that returns class A all the time, we will get a high accuracy of 99%!
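
The 99% figure is easy to reproduce. The sketch below uses made-up labels (990 samples of class A and 10 of class B) and a "classifier" that ignores its input. It is only meant to illustrate the arithmetic.

```python
import numpy as np

# Hypothetical labels: 990 samples of class "A" and 10 of class "B"
y_true = np.array(["A"] * 990 + ["B"] * 10)

# A dummy classifier that always answers "A", whatever the input
y_pred = np.full(y_true.shape, "A")

accuracy = np.mean(y_true == y_pred)
found_b = np.sum((y_pred == "B") & (y_true == "B"))
print("accuracy:", accuracy)
print("class B samples found:", found_b)
```

The accuracy is high even though not a single sample of class B is ever detected.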

In [71]:
from sklearn.datasets import load_digits

In [72]:
digits = load_digits()

In [73]:
print(digits.DESCR)

.. _digits_dataset:

Optical recognition of handwritten digits dataset
--------------------------------------------------

**Data Set Characteristics:**

    :Number of Instances: 5620
    :Number of Attributes: 64
    :Attribute Information: 8x8 image of integer pixels in the range 0..16.
    :Missing Attribute Values: None
    :Creator: E. Alpaydin (alpaydin '@' boun.edu.tr)
    :Date: July; 1998

This is a copy of the test set of the UCI ML hand-written digits datasets
https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits

The data set contains images of hand-written digits: 10 classes where
each class refers to a digit.

Preprocessing programs made available by NIST were used to extract
normalized bitmaps of handwritten digits from a preprinted form. From a
total of 43 people, 30 contributed to the training set and different 13
to the test set. 32x32 bitmaps are divided into nonoverlapping blocks of
4x4 and the number of on pixels are counted in each blo

In [74]:
digits.target

array([0, 1, 2, ..., 8, 9, 8])

In [75]:
np.sum(digits.target == 9)/digits.target.shape[0]*100

10.01669449081803

Goal: We make a model to classify the digits into "nine" and "not nine". First, we transform the data for binary classification. We change the target values 9 to True and the rest to False.

In [77]:
y = digits.target == 9

In [78]:
y

array([False, False, False, ..., False,  True, False])

In [79]:
X_train, X_test, y_train, y_test = train_test_split(digits.data, y, random_state = 0)

In [80]:
from sklearn.dummy import DummyClassifier

In [81]:
dummy_majority = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
print("Train score {}".format(dummy_majority.score(X_train, y_train)))
print("Test score {}".format(dummy_majority.score(X_test, y_test)))

Train score 0.9012620638455828
Test score 0.8955555555555555


In [82]:
from sklearn.tree import DecisionTreeClassifier

In [83]:
tree = DecisionTreeClassifier(max_depth = 2).fit(X_train, y_train)
print("Train score {}".format(tree.score(X_train, y_train)))
print("Test score {}".format(tree.score(X_test, y_test)))

Train score 0.9383815887156645
Test score 0.9177777777777778


In [84]:
from sklearn.linear_model import LogisticRegression

In [85]:
log_reg = LogisticRegression().fit(X_train, y_train)
print("Train score {}".format(log_reg.score(X_train, y_train)))
print("Test score {}".format(log_reg.score(X_test, y_test)))

Train score 0.9948032665181886
Test score 0.9755555555555555




In [86]:
from sklearn.neighbors import KNeighborsClassifier

In [87]:
knn = KNeighborsClassifier().fit(X_train, y_train)
print("Train score {}".format(knn.score(X_train, y_train)))
print("Test score {}".format(knn.score(X_test, y_test)))

Train score 0.9955456570155902
Test score 0.9955555555555555


**Question.** Which model among the dummy classifier, decision tree, logistic regression and KNN would you select for digit recognition?

### Confusion matrix

A confusion matrix for binary classification is a 2x2 array where the rows are the true classes in $y$ and the columns are the predicted classes in $\hat{y}$. Each cell holds the number of samples with that combination of true and predicted class.

|           ${ }$    | predicted "not nine" | predicted "nine" |
|          ---  |--- | --- |
| true "not nine"| TN | FP |
| true "nine"    | FN | TP |

Where:
- TN: true negative
- TP: true positive 
- FN: false negative
- FP: false positive
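
To make the four cells concrete, here is a minimal sketch that counts them by hand for boolean labels, using the same layout as the table above. The function `binary_confusion` and the six labels are hypothetical and exist only for this illustration.

```python
import numpy as np

def binary_confusion(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for boolean labels (True = positive class)."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tn = np.sum(~y_true & ~y_pred)   # truly negative, predicted negative
    fp = np.sum(~y_true & y_pred)    # truly negative, predicted positive
    fn = np.sum(y_true & ~y_pred)    # truly positive, predicted negative
    tp = np.sum(y_true & y_pred)     # truly positive, predicted positive
    return np.array([[tn, fp], [fn, tp]])

# Made-up labels for illustration: True means "nine"
y_true = [False, False, False, True, True, False]
y_pred = [False, True, False, True, False, False]
cm = binary_confusion(y_true, y_pred)
print(cm)
```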

In [88]:
from sklearn.metrics import confusion_matrix

In [89]:
pred_dummy = dummy_majority.predict(X_test)
pred_tree = tree.predict(X_test)
pred_log_reg = log_reg.predict(X_test)
pred_knn = knn.predict(X_test)

In [93]:
conf_dummy = confusion_matrix(y_test, pred_dummy)
conf_tree = confusion_matrix(y_test, pred_tree)
conf_log_reg = confusion_matrix(y_test, pred_log_reg)
conf_knn = confusion_matrix(y_test, pred_knn)

In [92]:
print("Confusion matrix of dummy classifier: \n{}".format(conf_dummy))
print("Confusion matrix of decision tree classifier: \n{}".format(conf_tree))
print("Confusion matrix of logistic regression classifier: \n{}".format(conf_log_reg))
print("Confusion matrix of KNN classifier: \n{}".format(conf_knn))

Confusion matrix of dummy classifier: 
[[403   0]
 [ 47   0]]
Confusion matrix of decision tree classifier: 
[[390  13]
 [ 24  23]]
Confusion matrix of logistic regression classifier: 
[[399   4]
 [  7  40]]
Confusion matrix of KNN classifier: 
[[402   1]
 [  1  46]]


As expected, the dummy classifier always predicts the most frequent class, "not nine". The decision tree gives better results than the dummy classifier. Logistic regression is clearly better than the decision tree because it produces fewer FP and FN (4 and 7 versus 13 and 24). But the champion is KNN, with a single FP and a single FN.

**Question.** Redefine the accuracy using TP, TN, FN and FP.
Accuracy = $\frac{\text{number of correctly classified samples}}{\text{total number of samples}} = \frac{TN + TP}{TP + TN + FP + FN}$
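
As a quick check of this formula, the sketch below recomputes the accuracy of each model from the confusion matrices printed earlier. The numbers are copied from that output; the dictionary names are only illustrative.

```python
import numpy as np

# Confusion matrices copied from the output above, laid out as [[TN, FP], [FN, TP]]
matrices = {
    "dummy": np.array([[403, 0], [47, 0]]),
    "decision tree": np.array([[390, 13], [24, 23]]),
    "logistic regression": np.array([[399, 4], [7, 40]]),
    "KNN": np.array([[402, 1], [1, 46]]),
}

accuracies = {}
for name, cm in matrices.items():
    # Correct predictions (TN and TP) sit on the diagonal
    accuracies[name] = np.trace(cm) / cm.sum()
    print("{}: {:.4f}".format(name, accuracies[name]))
```

The values agree with the test scores reported by `score` for the four models.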

### Precision

Precision measures how many of the samples predicted as positive are actually positive.
$$\text{Precision} = \frac{TP}{TP + FP}$$

Precision is a suitable metric for applications where the number of false positives should be negligible. Example: a bank uses a model to predict whether an applicant can afford a large loan. A false positive, that is, an applicant wrongly predicted to be able to afford the loan, could lead to complicated legal procedures for both the bank and the applicant.
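
The formula can be applied directly to the TP and FP counts read from the confusion matrices printed earlier. The helper `precision` below is a hypothetical sketch. Returning 0.0 when no sample is predicted positive (the 0/0 case of the dummy classifier) is a convention assumed here, not something the formula itself dictates.

```python
def precision(tp, fp):
    """TP / (TP + FP), with 0.0 returned when nothing was predicted positive."""
    predicted_positive = tp + fp
    if predicted_positive == 0:
        # 0/0 is undefined; 0.0 is a convention chosen for this sketch
        return 0.0
    return tp / predicted_positive

# (TP, FP) pairs read from the confusion matrices printed earlier
counts = {
    "dummy": (0, 0),
    "decision tree": (23, 13),
    "logistic regression": (40, 4),
    "KNN": (46, 1),
}

precisions = {name: precision(tp, fp) for name, (tp, fp) in counts.items()}
for name, value in precisions.items():
    print("{}: {:.4f}".format(name, value))
```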

In [94]:
from sklearn.metrics import precision_score

In [95]:
prec_dummy = precision_score(y_test, pred_dummy)
prec_tree = precision_score(y_test, pred_tree)
prec_log_reg = precision_score(y_test, pred_log_reg)
prec_knn = precision_score(y_test, pred_knn)
print("Precision of dummy classifier: \n{}".format(prec_dummy))
print("Precision of decision tree classifier: \n{}".format(prec_tree))
print("Precision of logistic regression classifier: \n{}".format(prec_log_reg))
print("Precision of KNN classifier: \n{}".format(prec_knn))

Precision of dummy classifier: 
0.0
Precision of decision tree classifier: 
0.6388888888888888
Precision of logistic regression classifier: 
0.9090909090909091
Precision of KNN classifier: 
0.9787234042553191



### Recall

Recall measures the fraction of actual positive samples that are correctly predicted as positive.
$$\text{Recall} = \frac{TP}{TP + FN}$$
Recall is useful when we want to limit the number of false negatives, as in the earlier cancer diagnosis example.
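
To see why recall matters in the cancer example, consider two hypothetical screening models with the same accuracy. All counts below are invented for illustration. The point is only that accuracy cannot tell the two models apart while recall can.

```python
def accuracy(tn, fp, fn, tp):
    """(TN + TP) divided by the total number of samples."""
    return (tn + tp) / (tn + fp + fn + tp)

def recall(fn, tp):
    """TP / (TP + FN): share of the actual positives that the model finds."""
    return tp / (tp + fn)

# Hypothetical results on 100 patients, 10 of whom are sick.
# Counts are given as (TN, FP, FN, TP) and are invented for illustration.
model_a = (90, 0, 5, 5)   # no false alarm, but misses half of the sick patients
model_b = (86, 4, 1, 9)   # a few false alarms, misses one sick patient

for name, (tn, fp, fn, tp) in [("A", model_a), ("B", model_b)]:
    print(name, "accuracy:", accuracy(tn, fp, fn, tp), "recall:", recall(fn, tp))
```

Both models reach 95% accuracy, but model A leaves five sick patients undetected while model B leaves one.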

In [96]:
from sklearn.metrics import recall_score

In [98]:
recall_dummy = recall_score(y_test, pred_dummy)
recall_tree = recall_score(y_test, pred_tree)
recall_log_reg = recall_score(y_test, pred_log_reg)
recall_knn = recall_score(y_test, pred_knn)
print("Recall of dummy classifier: \n{}".format(recall_dummy))
print("Recall of decision tree classifier: \n{}".format(recall_tree))
print("Recall of logistic regression classifier: \n{}".format(recall_log_reg))
print("Recall of KNN classifier: \n{}".format(recall_knn))

Recall of dummy classifier: 
0.0
Recall of decision tree classifier: 
0.48936170212765956
Recall of logistic regression classifier: 
0.851063829787234
Recall of KNN classifier: 
0.9787234042553191


### f-score or f$_1$-score

To summarize precision and recall, we can use another metric known as f-score.
$$F = 2 \cdot \frac{\text{precision}\times\text{recall}}{\text{precision}+\text{recall}}$$

f-score combines two metrics, precision and recall, in one number. However, it is harder to explain and interpret compared to accuracy.
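
The formula is the harmonic mean of precision and recall, so the result is pulled toward the smaller of the two values. The sketch below illustrates this. The helper `f1` is hypothetical, and returning 0.0 when both inputs are zero is an assumed convention.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Logistic regression counts from the confusion matrix: TP = 40, FP = 4, FN = 7
f1_log_reg = f1(40 / (40 + 4), 40 / (40 + 7))
print("logistic regression: {:.4f}".format(f1_log_reg))

# Made-up extreme case: perfect precision but very low recall
lopsided = f1(1.0, 0.1)
arithmetic_mean = (1.0 + 0.1) / 2
print("f-score: {:.4f}, arithmetic mean: {:.4f}".format(lopsided, arithmetic_mean))
```

In the lopsided case the arithmetic mean is 0.55, while the f-score stays below 0.2, close to the weaker of the two metrics.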

In [None]:
from sklearn.metrics import f1_score

In [99]:
f1_dummy = f1_score(y_test, pred_dummy)
f1_tree = f1_score(y_test, pred_tree)
f1_log_reg = f1_score(y_test, pred_log_reg)
f1_knn = f1_score(y_test, pred_knn)
print("F1-score of dummy classifier: \n{}".format(f1_dummy))
print("F1-score of decision tree classifier: \n{}".format(f1_tree))
print("F1-score of logistic regression classifier: \n{}".format(f1_log_reg))
print("F1-score of KNN classifier: \n{}".format(f1_knn))

F1-score of dummy classifier: 
0.0
F1-score of decision tree classifier: 
0.5542168674698795
F1-score of logistic regression classifier: 
0.8791208791208791
F1-score of KNN classifier: 
0.9787234042553191



In [100]:
from sklearn.metrics import classification_report

We can compute all three metrics for each class using the function ``classification_report``. The support is the number of occurrences of each class in the target $y$. ``macro avg`` is the unweighted mean of the per-class values in each column (precision, recall and f-score). ``weighted avg`` is the mean of the per-class values weighted by the support.

In [101]:
print(classification_report(y_test, pred_log_reg))

              precision    recall  f1-score   support

       False       0.98      0.99      0.99       403
        True       0.91      0.85      0.88        47

    accuracy                           0.98       450
   macro avg       0.95      0.92      0.93       450
weighted avg       0.98      0.98      0.98       450
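
The two averaged f1-score entries of this report can be reproduced by hand. The sketch below assumes the logistic regression confusion matrix printed earlier, `[[399, 4], [7, 40]]`, and uses the identity $F = \frac{2TP}{2TP + FP + FN}$, treating each class in turn as the positive one. The variable names are only illustrative.

```python
import numpy as np

# Per-class f1-scores from the confusion matrix [[399, 4], [7, 40]].
# Class False as "positive": TP = 399, FP = 7, FN = 4.
# Class True as "positive":  TP = 40,  FP = 4, FN = 7.
f1_per_class = np.array([
    2 * 399 / (2 * 399 + 7 + 4),
    2 * 40 / (2 * 40 + 4 + 7),
])
support = np.array([403, 47])   # number of true samples in each class

macro_f1 = f1_per_class.mean()                            # every class counts equally
weighted_f1 = np.average(f1_per_class, weights=support)   # classes weighted by support
print("macro avg f1: {:.2f}".format(macro_f1))
print("weighted avg f1: {:.2f}".format(weighted_f1))
```

Because "not nine" has a much larger support, the weighted average stays close to that class's f-score, whereas the macro average is pulled down by the weaker "nine" class.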



### Metrics for multiclass classification