# Evaluation Metrics of Classifiers
<div class="alert alert-block alert-info">
<b>Content:</b> In this notebook, we play with different evaluation metrics of classifiers.

* For that purpose, we consider toy examples, where we have 10 instances with their correct and their "predicted" label.
* We run different metrics and analyses on these samples and discuss the output.
</div>


In [None]:
import sklearn.metrics as ms
import matplotlib.pyplot as plt
import numpy as np

In [None]:
bin_y =      [0,0,0,1,1,1,1,1,1,1]
bin_y_pred = [0,1,1,1,1,1,1,0,0,0]

In [None]:
multi_y =      [0,0,0,1,1,1,2,2,2,2]
multi_y_pred = [0,0,1,1,1,1,2,2,0,1]

In [None]:
def plot_confusion(task='both'):
    """Plot the confusion matrix of the binary task ('bin'), the multiclass task ('multi'), or both side by side ('both')."""
    if task == 'bin':
        ms.ConfusionMatrixDisplay.from_predictions(bin_y, bin_y_pred)
    elif task == 'multi':
        ms.ConfusionMatrixDisplay.from_predictions(multi_y, multi_y_pred)
    elif task == 'both':
        # one subplot per task: binary on the left, multiclass on the right
        fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(15,6))
        ms.ConfusionMatrixDisplay.from_predictions(bin_y, bin_y_pred, ax=axes[0])
        ms.ConfusionMatrixDisplay.from_predictions(multi_y, multi_y_pred, ax=axes[1])
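
The counts shown in these plots can also be read as plain arrays, which is handy for the hand calculations in the following sections. A minimal sketch, assuming the toy labels from above (repeated here so that the snippet runs on its own); `cm_bin` and `cm_multi` are illustrative names that do not appear elsewhere in the notebook:

```python
import sklearn.metrics as ms

# toy labels repeated from the cells above
bin_y =      [0,0,0,1,1,1,1,1,1,1]
bin_y_pred = [0,1,1,1,1,1,1,0,0,0]
multi_y =      [0,0,0,1,1,1,2,2,2,2]
multi_y_pred = [0,0,1,1,1,1,2,2,0,1]

# rows = true class, columns = predicted class
cm_bin = ms.confusion_matrix(bin_y, bin_y_pred)
cm_multi = ms.confusion_matrix(multi_y, multi_y_pred)
print(cm_bin)
print(cm_multi)
```

The diagonal holds the correctly classified instances per class; each row sums to the size of the true class and each column to the number of instances predicted as that class.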

# Accuracy

$$acc=\frac{\text{\#correctly classified instances}}{\text{\#instances}}$$

* summarize quality in a single score
* intuitive formula
* ignores class imbalance and differences in the cost of misclassification between classes
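
The last point can be made concrete with a small, made-up sample. The sketch below assumes a hypothetical data set with 95 negatives and 5 positives (not one of the toy examples of this notebook) and a "classifier" that always predicts the majority class:

```python
import sklearn.metrics as ms

# hypothetical imbalanced sample: 95 negatives, 5 positives
imb_y = [0]*95 + [1]*5
# a "classifier" that always predicts the majority class
imb_y_pred = [0]*100

acc = ms.accuracy_score(imb_y, imb_y_pred)
# number of actual positives that were predicted as positive
found_positives = sum(1 for t, p in zip(imb_y, imb_y_pred) if t == 1 and p == 1)
print(acc, found_positives)
```

The accuracy looks excellent although not a single positive instance is detected.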


In [None]:
plot_confusion()

In [None]:
ms.accuracy_score(bin_y, bin_y_pred), \
ms.accuracy_score(multi_y, multi_y_pred)

In [None]:
(1+4)/10, (2+3+2)/10

# Balanced Accuracy

$$bacc=\frac{1}{k}\sum_{i=1}^k{\frac{\text{\#correctly classified instances of class } C_i}{|C_i|}}$$
* average accuracy over all classes
* each class has the same weight
* average of recall per class

In [None]:
plot_confusion()

In [None]:
ms.balanced_accuracy_score(bin_y, bin_y_pred), \
ms.balanced_accuracy_score(multi_y, multi_y_pred)

In [None]:
( 1/(1+2) + 4/(3+4) ) / 2    ,    ( 2/(2+1+0) + 3/(0+3+0) + 2/(1+1+2) ) / 3
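
The view of balanced accuracy as the average of the per-class recalls can be checked directly. A minimal sketch, assuming the multiclass toy labels from above (repeated for self-containment); `recall_per_class` and `bacc_manual` are illustrative names:

```python
import numpy as np
import sklearn.metrics as ms

multi_y =      [0,0,0,1,1,1,2,2,2,2]
multi_y_pred = [0,0,1,1,1,1,2,2,0,1]

# average=None returns one recall value per class instead of a single score
recall_per_class = ms.recall_score(multi_y, multi_y_pred, average=None)
bacc_manual = np.mean(recall_per_class)
bacc = ms.balanced_accuracy_score(multi_y, multi_y_pred)
print(recall_per_class, bacc_manual, bacc)
```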

# Precision and Recall ... and F1
* focus usually on the detection of one class (the positive class) among many elements of the other
* e.g. spam: find the few good mails among the massive amounts of spam
* recommenders: find the few items a user is interested in within the vast set of available products

## Precision

$$precision(C_i)=\frac{\text{\#correctly classified instances of class }C_i}{\text{\#instances classified as } C_i}$$

* $0\le precision(C_i)\le 1$
* in binary classification: $\frac{tp}{(tp+fp)}$
* How many of the elements identified as class $C_i$ are actually $C_i$?

## Recall (Sensitivity)
$$recall(C_i)=\frac{\text{\#correctly classified instances of class }C_i}{\text{\#instances of class } C_i}$$
* $0\le recall(C_i)\le 1$
* in binary classification: $\frac{tp}{(tp+fn)}$
* How many of the elements of class $C_i$ have been recognized as $C_i$?
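
For the binary case, the four counts behind both formulas can be unpacked from the confusion matrix. A minimal sketch, assuming the binary toy labels from above (repeated for self-containment); `precision_manual` and `recall_manual` are illustrative names:

```python
import sklearn.metrics as ms

bin_y =      [0,0,0,1,1,1,1,1,1,1]
bin_y_pred = [0,1,1,1,1,1,1,0,0,0]

# for labels 0/1 the flattened matrix is ordered tn, fp, fn, tp
tn, fp, fn, tp = ms.confusion_matrix(bin_y, bin_y_pred).ravel()

precision_manual = tp / (tp + fp)  # share of predicted positives that are correct
recall_manual = tp / (tp + fn)     # share of actual positives that are found
print(tn, fp, fn, tp, precision_manual, recall_manual)
```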

In [None]:
plot_confusion('bin')

In [None]:
precision=ms.precision_score(bin_y, bin_y_pred)
precision, 4/(2+4)

In [None]:
recall=ms.recall_score(bin_y, bin_y_pred)
recall, 4/(3+4)

### Problems
1. Precision alone is not a useful metric.
2. Recall alone is not a useful metric.

Why? Examples:
* If we classify everything as class $1$, the recall of class $1$ is $1$, but its precision only equals the share of class $1$ in the data, which can be very low.
* If we classify only one element of class $1$ as class $1$ and everything else as class $0$, the precision of class $1$ is $1$, but its recall is very low.

A classification is good only when both metrics are high. This motivates the F1 measure.
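
Both degenerate cases are easy to reproduce. The sketch below assumes a hypothetical sample with 10 positives among 100 instances (not one of the toy examples of this notebook); all variable names are illustrative:

```python
import sklearn.metrics as ms

# hypothetical sample: 10 positives followed by 90 negatives
y = [1]*10 + [0]*90

# case 1: everything is classified as class 1
pred_all_pos = [1]*100
# case 2: a single positive is found, everything else is classified as class 0
pred_one_pos = [1] + [0]*99

p_all = ms.precision_score(y, pred_all_pos)
r_all = ms.recall_score(y, pred_all_pos)
p_one = ms.precision_score(y, pred_one_pos)
r_one = ms.recall_score(y, pred_one_pos)
print(p_all, r_all)  # perfect recall, poor precision
print(p_one, r_one)  # perfect precision, poor recall
```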

## F1 measure
$$f1(C_i)=\frac{2}{\frac{1}{precision(C_i)} + \frac{1}{recall(C_i)}}$$
* $0\le f1(C_i)\le 1$
* f1 is high only when both recall and precision are high
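
The formula is the harmonic mean of precision and recall, which is pulled towards the smaller of the two values. The sketch below compares it with the arithmetic mean for a made-up pair of values; `f1_from` is a hypothetical helper, and returning 0 when either input is 0 is an assumption made here to avoid a division by zero:

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (assumed to be 0 if either is 0)."""
    if precision == 0 or recall == 0:
        return 0.0
    return 2 / (1/precision + 1/recall)

# made-up values: perfect precision, poor recall
p, r = 1.0, 0.1

arithmetic = (p + r) / 2
harmonic = f1_from(p, r)
# algebraically equivalent form of the same formula
harmonic_alt = 2 * p * r / (p + r)
print(arithmetic, harmonic, harmonic_alt)
```

The arithmetic mean would still suggest a mediocre classifier, while the harmonic mean stays close to the poor recall.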

In [None]:
ms.f1_score(bin_y, bin_y_pred), (2/(1/precision + 1/recall))

# Multiclass Precision and Recall ... and F1
* the binary measures focus on one class against another
* there are several ways to transfer this idea to multiclass problems

## Macro Averages
* compute precision/recall per class
* average the precision/recall scores over all classes
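
These two steps can be written out explicitly by first requesting the per-class scores. A minimal sketch, assuming the multiclass toy labels from above (repeated for self-containment); `precision_per_class` and `macro_manual` are illustrative names:

```python
import sklearn.metrics as ms

multi_y =      [0,0,0,1,1,1,2,2,2,2]
multi_y_pred = [0,0,1,1,1,1,2,2,0,1]

# step 1: one precision value per class (average=None disables averaging)
precision_per_class = ms.precision_score(multi_y, multi_y_pred, average=None)
# step 2: unweighted mean over the classes
macro_manual = precision_per_class.mean()

macro = ms.precision_score(multi_y, multi_y_pred, average='macro')
print(precision_per_class, macro_manual, macro)
```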

In [None]:
plot_confusion('multi')

In [None]:
precision=ms.precision_score(multi_y, multi_y_pred, average='macro')
precision, 1/3*(2/(2+0+1) + 3/(3+1+1) + 2/(2+0+0))

In [None]:
recall=ms.recall_score(multi_y, multi_y_pred, average='macro')
recall, 1/3*(2/(2+1+0) + 3/(0+3+0) + 2/(1+1+2))

In [None]:
ms.f1_score(multi_y, multi_y_pred, average='macro'), \
1/3 * ( 2/(1/(2/(2+0+1))+1/(2/(2+1+0))) + 2/(1/(3/(3+1+1))+1/(3/(0+3+0))) + 2/(1/(2/(2+0+0))+1/(2/(1+1+2))) ) 

## Micro Average
* create one overall ("global") binary task by pooling the one-vs-rest confusion matrices of all classes

In [None]:
multi_y, multi_y_pred

In [None]:
# one-vs-rest view per class: 1 where the (true or predicted) label equals that class
multi_y_0      = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
multi_y_pred_0 = [1, 1, 0, 0, 0, 0, 0, 0, 1, 0]
multi_y_1      = [0, 0, 0, 1, 1, 1, 0, 0, 0, 0]
multi_y_pred_1 = [0, 0, 1, 1, 1, 1, 0, 0, 0, 1]
multi_y_2      = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
multi_y_pred_2 = [0, 0, 0, 0, 0, 0, 1, 1, 0, 0]
# pool the three binary tasks into one long binary task
multi_y_pooled = np.concatenate([multi_y_0, multi_y_1, multi_y_2])
multi_y_pred_pooled = np.concatenate([multi_y_pred_0, multi_y_pred_1, multi_y_pred_2])
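
The hand-written one-vs-rest vectors can also be derived programmatically. A sketch, assuming `label_binarize` from `sklearn.preprocessing` is an acceptable substitute for typing the vectors by hand; `y_onehot`, `pred_onehot`, `y_pooled` and `pred_pooled` are illustrative names:

```python
import sklearn.metrics as ms
from sklearn.preprocessing import label_binarize

multi_y =      [0,0,0,1,1,1,2,2,2,2]
multi_y_pred = [0,0,1,1,1,1,2,2,0,1]
classes = [0, 1, 2]

# one column per class: entry is 1 where the label equals that class
y_onehot = label_binarize(multi_y, classes=classes)
pred_onehot = label_binarize(multi_y_pred, classes=classes)

# transpose before flattening to get the order class 0, class 1, class 2
y_pooled = y_onehot.T.ravel()
pred_pooled = pred_onehot.T.ravel()

pooled_precision = ms.precision_score(y_pooled, pred_pooled)
micro_precision = ms.precision_score(multi_y, multi_y_pred, average='micro')
print(pooled_precision, micro_precision)
```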

In [None]:
precision=ms.precision_score(multi_y, multi_y_pred, average='micro')
precision, ms.precision_score(multi_y_pooled, multi_y_pred_pooled)

In [None]:
recall=ms.recall_score(multi_y, multi_y_pred, average='micro')
recall, ms.recall_score(multi_y_pooled, multi_y_pred_pooled)

In [None]:
ms.f1_score(multi_y, multi_y_pred, average='micro')
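
In a single-label multiclass task such as this one, every misclassified instance is a false positive for the predicted class and a false negative for the true class. The pooled task therefore has as many false positives as false negatives, and micro precision, micro recall and micro F1 all equal the accuracy. A minimal check, assuming the multiclass toy labels from above (repeated for self-containment):

```python
import sklearn.metrics as ms

multi_y =      [0,0,0,1,1,1,2,2,2,2]
multi_y_pred = [0,0,1,1,1,1,2,2,0,1]

micro_p = ms.precision_score(multi_y, multi_y_pred, average='micro')
micro_r = ms.recall_score(multi_y, multi_y_pred, average='micro')
micro_f1 = ms.f1_score(multi_y, multi_y_pred, average='micro')
acc = ms.accuracy_score(multi_y, multi_y_pred)
print(micro_p, micro_r, micro_f1, acc)
```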

# ROC (Receiver Operating Characteristic)
* for binary predictions with probabilities
* rewards classifiers that give high probabilities to actual positives and low probabilities to actual negatives
* contains the true positive rate (recall) and the false positive rate (false alarms) for different probability levels
* level $p$ means: use $p$ as threshold, classify every instance with a probability greater than or equal to $p$ as $1$, and compute both rates on the result
* starts at (0,0), ends at (1,1)
* the curve steps up for each correctly classified member of the positive class and moves to the right for each misclassified actual negative

In [None]:
bin_y

In [None]:
# probabilities for the positive class (1)
bin_proba        =[0.1, 0.55, 0.6, 0.9, 0.7, 0.7, 0.6, 0.4, 0.4, 0.2]

bin_proba_random =[0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5]

bin_proba_best   =[0.1, 0.1, 0.2, 0.9, 0.8, 0.7, 0.6, 0.9, 0.9, 0.7]


In [None]:
roc=ms.roc_curve(bin_y, bin_proba)
roc

The output is a tuple of three arrays: (false positive rate, true positive rate, thresholds).
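
A single point of the curve can be reproduced by hand. The sketch below assumes the threshold $p=0.6$ (chosen only for illustration) and the toy labels and probabilities from above (repeated for self-containment); `drop_intermediate=False` asks `roc_curve` to keep every threshold so that the point can be looked up:

```python
import numpy as np
import sklearn.metrics as ms

bin_y = np.array([0,0,0,1,1,1,1,1,1,1])
bin_proba = np.array([0.1, 0.55, 0.6, 0.9, 0.7, 0.7, 0.6, 0.4, 0.4, 0.2])

p = 0.6  # threshold chosen for illustration
pred = (bin_proba >= p).astype(int)

tp = np.sum((pred == 1) & (bin_y == 1))
fn = np.sum((pred == 0) & (bin_y == 1))
fp = np.sum((pred == 1) & (bin_y == 0))
tn = np.sum((pred == 0) & (bin_y == 0))

tpr = tp / (tp + fn)  # true positive rate (recall)
fpr = fp / (fp + tn)  # false positive rate (false alarms)

# look up the same threshold in the output of roc_curve
fpr_all, tpr_all, thresholds = ms.roc_curve(bin_y, bin_proba, drop_intermediate=False)
idx = np.where(np.isclose(thresholds, p))[0][0]
print((fpr, tpr), (fpr_all[idx], tpr_all[idx]))
```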

In [None]:
roc_random=ms.roc_curve(bin_y, bin_proba_random)
roc_random

In [None]:
roc_best=ms.roc_curve(bin_y, bin_proba_best)
roc_best

In [None]:
plt.plot(roc[0], roc[1], c='black', label='bin_proba')
plt.xlabel('fp rate') # fp/(fp+tn) (false alarm rate)
plt.ylabel('tp rate') # tp/(tp+fn) (recall)
plt.plot(roc_random[0], roc_random[1], c='red', label='bin_proba_random')
plt.plot(roc_best[0], roc_best[1], c='blue', label='bin_proba_best')
plt.legend()
#plt.plot([0,1], [0,1], c='green') # diagonal of random guessing

## Area under the ROC Curve
* Integral under the ROC Curve
* higher is better
* should be higher than 0.5 (= random guessing)
* 1 means each positive instance has a higher probability than each negative instance

In [None]:
ms.roc_auc_score(bin_y, bin_proba_best)

In [None]:
ms.roc_auc_score(bin_y, bin_proba)

In [None]:
ms.roc_auc_score(bin_y, bin_proba_random)
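
The AUC can also be read as the share of (positive, negative) pairs in which the positive instance gets the higher probability, with ties counted as half. A minimal sketch of this pairwise view, assuming the toy labels and probabilities from above (repeated for self-containment); `auc_pairwise` is an illustrative name:

```python
import numpy as np
import sklearn.metrics as ms

bin_y = np.array([0,0,0,1,1,1,1,1,1,1])
bin_proba = np.array([0.1, 0.55, 0.6, 0.9, 0.7, 0.7, 0.6, 0.4, 0.4, 0.2])

pos = bin_proba[bin_y == 1]  # probabilities of the actual positives
neg = bin_proba[bin_y == 0]  # probabilities of the actual negatives

# compare every positive with every negative (7 x 3 = 21 pairs)
diff = pos[:, None] - neg[None, :]
# positive ranked higher counts 1, a tie counts 0.5
auc_pairwise = (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

auc = ms.roc_auc_score(bin_y, bin_proba)
print(auc_pairwise, auc)
```

This also explains the value for `bin_proba_random`: with constant probabilities every pair is a tie, which gives 0.5.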

<div class="alert alert-block alert-info">
<b>Takeaways:</b> 

* There is a large variety of evaluation metrics for classification.
* Confusion matrices allow a class-wise investigation.
</div>

<div class="alert alert-block alert-success">
<b>Play with:</b> 
    
* a different fictitious toy example
* apply various measures to the classifiers from previous notebooks to assess their performance from different perspectives
</div>