
# ML/Python Course Material
### [Mohamad Dia](http://mohamaddia.me)
Feb 11 2020

# T2: F-1 score

### Learning Objectives

* Understand why the accuracy metric is not always enough to evaluate the performance on a real-world example (corona virus detection).
<br>
* Learn how to compute the recall, precision, F-1 score and understand the intuition behind these metrics.
<br>
* Learn how to use scikit-learn for performance evaluation via a coding demo on the breast cancer dataset.



### Prerequisites

* Basic knowledge of machine learning (binary classification, logistic regression, SVM, ...).
<br>
* Familiarity with Python and the scikit-learn library.


### Resources:
* Performance Evaluation in Machine Learning: The Good, The Bad, The Ugly and The Way Forward, P. Flach, AAAI 2019.
<br>
* The Relationship Between Precision-Recall and ROC Curves, J. Davis and M. Goadrich, ICML 2006.
<br>
* The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets, T. Saito and M. Rehmsmeier, PLoS ONE 2015.


#### 1. Performance Evaluation
Evaluating the performance of a machine learning algorithm is an essential task. Consider a binary classification problem where you have trained a classifier on the training dataset and then run predictions on the test dataset. The main questions that arise are: how good is your model? How should you assess its test performance? And which metric should you use?

A very common and intuitive approach is to measure the accuracy of the predictions, i.e. the ratio of correctly predicted instances to the total number of instances in the test dataset. For example, assume you are doing binary classification of images of cats vs. dogs. After training, you test your model on 100 images. Your classifier outputs 55 cats (50 of them correct) and 45 dogs (43 of them correct). Hence, the accuracy of your classifier is $(50+43)/100 = 93\%$ and the error rate is $7\%$. In this particular example, accuracy looks like a meaningful metric for assessment. If the accuracy is very high (above $90\%$ for example), we say that our model is performing well. Otherwise, we say that it is performing poorly. However, accuracy is not always a good metric, and it can be very misleading, as we will see soon. In some problems, an accuracy as high as $99\%$ does not necessarily mean that our model is performing well. Hence, we need to define other metrics to assess the performance.
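
The arithmetic of the cats vs. dogs example can be reproduced in a few lines. This is only a minimal sketch; the variable names are chosen here for illustration.

```python
# Counts from the cats vs. dogs example (variable names are illustrative)
n_test = 100         # number of test images
correct_cats = 50    # images predicted "cat" that really are cats
correct_dogs = 43    # images predicted "dog" that really are dogs

# accuracy = correctly predicted instances / total number of instances
accuracy = (correct_cats + correct_dogs) / n_test
error_rate = 1 - accuracy

print(f"accuracy = {accuracy:.0%}, error rate = {error_rate:.0%}")
# prints: accuracy = 93%, error rate = 7%
```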

#### 2. Accuracy is not Always a Good Metric!

By examining a real-world classification example, we will see that accuracy is not always enough to assess the performance, especially when dealing with an **imbalanced** dataset or when trying to avoid a particular type of error.

<u>Example: binary classification of a rare fatal disease (imbalanced dataset and low misclassification rate of the disease is required):</u>
<br>
Assume we are doing binary classification based on some medical data of the patients in order to predict whether a patient has the "corona virus" (positive) or not (negative). The corona virus is a rare disease. Hence, we expect our datasets (both training and testing) to be highly imbalanced, with many more negative instances than positive ones. After training three different classifiers, we test them on 1000 instances, of which 990 are negative and only 10 are positive. The classifiers give the following predictions:

* Classifier 1 (SVM): predicts 950 negative (among them 945 are correct predictions) and 50 positive (among them 5 are correct predictions). The accuracy is $(945 + 5)/1000 = 95\%$.
* Classifier 2 (logistic regression): predicts 958 negative (among them 949 are correct) and 42 positive (among them 1 is correct). The accuracy is $(949+1)/1000 = 95\%$.
* Classifier 3 (deterministic): always predicts negative without looking at the data. Thus, it outputs 1000 negative (among them 990 are correct). The accuracy is $990/1000 = 99\%$.

Let's try to assess the performance of these three classifiers. Classifiers 1 and 2 have the same accuracy, but it is clear that they perform differently. Both classifiers misclassified 50 patients. However, the types of errors are different, and they have different consequences in such an application. Classifier 1 detected half of the corona virus cases (5 out of 10), while classifier 2 detected only one. The risk of failing to detect the fatal corona virus in a sick patient is much higher than the risk of asking a healthy patient to undergo additional diagnosis. Hence, classifier 1 looks safer for such an application. This difference is not reflected in the accuracy metric. Another very important shortcoming of the accuracy metric appears when we look at classifier 3. This "dumb" classifier does literally nothing: it always predicts the negative class, whatever the input is. Yet classifier 3 attains a much better accuracy than the first two. You can now conclude that the accuracy measure is not enough to distinguish between the first two classifiers (a problem that would arise even if the dataset were balanced). Moreover, it gives a misleading impression of the performance of the third classifier.
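
The comparison above can be tabulated with a short script. It is a sketch that uses only the counts stated in the example; the dictionary layout and the helper name `accuracy_of` are illustrative choices, not part of any library.

```python
# Test set from the example: 990 negative and 10 positive patients
N_TOTAL = 1000
N_SICK = 10

# (correctly predicted negatives, correctly predicted positives) per classifier
classifiers = {
    "Classifier 1 (SVM)": (945, 5),
    "Classifier 2 (logistic regression)": (949, 1),
    "Classifier 3 (always negative)": (990, 0),
}

def accuracy_of(correct_neg, correct_pos, n_total=N_TOTAL):
    """Accuracy = all correct predictions / all instances."""
    return (correct_neg + correct_pos) / n_total

accuracies = {}
for name, (correct_neg, correct_pos) in classifiers.items():
    accuracies[name] = accuracy_of(correct_neg, correct_pos)
    print(f"{name}: accuracy = {accuracies[name]:.0%}, "
          f"detected {correct_pos} of {N_SICK} sick patients")
```

The accuracy column ranks the "dumb" classifier first, while the count of detected patients ranks it last, which is exactly the mismatch described above.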

<!---<u>Example 2: Binary classification of fatal disease (low misclassification rate of disease is required):</u>
<br>
We will illustrate here another shortcoming of the accuracy metric that arises even for balanced datasets. Assume we are doing binary classification based on some medical data of the patients in order to predict whether the patient has the "corona virus" (positive) or not (negative). Assume further that our dataset is somehow balanced. After training two different classifiers, we test them on 1000 instances among them 500 are positive. The classifiers give the following predictions:

* Classifier 1 (SVM): predicts 600 positive (among them 500 are correct) and 400 negative (all of them are correct). The accuracy is $(500 + 400)/1000 = 90\%$.
* Classifier 2 (logistic regression): predicts 400 positive (all of them are correct) and 600 negative (among them 500 are correct). The accuracy is $(400+500)/1000 = 90\%$.

Classifiers 1 and 2 have the same accuracy but their performances are clearly different. Both classifiers misclassified 100 patients. However, the types of errors are different, and they have different consequences in such an application. The cost of failing to detect the fatal corona virus in a sick patient is much higher than the cost of sending a healthy patient for more tests. Hence, classifier 1 looks safer for such an application. This difference is not reflected in the accuracy metric.--->

#### 3. Performance Metrics: Recall, Precision, and F-1 Score

We will now define three additional metrics that allow a better performance evaluation and capture characteristics not reflected in the accuracy metric.

<img src="balanced.png"/>

In binary classification, assume we have access to the ground truth labels: the positive instances (P) and the negative ones (N). After training our preferred classifier on the training dataset and applying it to the test dataset, our classifier splits the test dataset into four events, as shown in the conceptual example in the figure above (for both balanced and imbalanced datasets): true positive (TP), true negative (TN), false positive (FP), and false negative (FN). Notice that our familiar accuracy metric is nothing but $(TP+TN)/(P+N)$. We will define two additional metrics now:

* Recall (or sensitivity or true positive rate "TPR"): $recall = TP/P = TP/(TP+FN)$.
* Precision (or positive predictive value "PPV"): $precision = TP/(TP+FP)$.
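
To make the four events and the two definitions concrete, here is a minimal pure-Python sketch. The helper names (`confusion_counts`, `recall`, `precision`) and the toy labels are hypothetical and chosen for illustration; the functions do not guard against zero denominators.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = positive, 0 = negative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn

def recall(tp, fn):
    """Fraction of the actual positives that were detected: TP / (TP + FN)."""
    return tp / (tp + fn)

def precision(tp, fp):
    """Fraction of the predicted positives that are correct: TP / (TP + FP)."""
    return tp / (tp + fp)

# Toy labels (made up for this illustration): 3 positives, 5 negatives
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print("TP =", tp, "TN =", tn, "FP =", fp, "FN =", fn)
print("recall =", recall(tp, fn), "precision =", precision(tp, fp))
```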

In fact, what we want is to make both FN and FP as small as possible, using metrics that have low sensitivity to class imbalance when it exists (i.e. metrics that do not include TN). Recall and precision provide this for us! The recall reflects the FN (fewer missed positives means higher recall) and the precision reflects the FP (fewer false alarms means higher precision). Hence, we want both the recall and the precision to be as high as possible. What if we merge these two metrics into one?

The F-1 score combines the recall and precision into a single number: it is the harmonic mean of recall and precision. Therefore, F-1 takes both false positives and false negatives into account, and it is high only when both recall and precision are high:

$$F1 = 2 (recall \times precision) / (recall + precision)$$
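
As a short derivation (assuming $TP > 0$ so that all ratios are well defined), substituting the definitions of recall and precision shows that F-1 depends only on TP, FP and FN, and never on TN:

$$F1 = \frac{2 \cdot \frac{TP}{TP+FN} \cdot \frac{TP}{TP+FP}}{\frac{TP}{TP+FN} + \frac{TP}{TP+FP}} = \frac{2\,TP^2}{TP\,(TP+FP) + TP\,(TP+FN)} = \frac{2\,TP}{2\,TP + FP + FN}$$

The harmonic mean is also pulled toward the smaller of the two values: with recall $= 0.5$ and precision $= 0.1$, the arithmetic mean would be $0.3$, whereas $F1 = 2 (0.5 \times 0.1)/(0.5 + 0.1) \approx 0.167$.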

Let's go back to our corona virus example and evaluate the performance with the new metrics:
* Classifier 1: accuracy 95%, recall 50%, precision 10%, F-1 score 0.167.
* Classifier 2: accuracy 95%, recall 10%, precision 2.4% (1 correct out of 42 predicted positives), F-1 score 0.038.
* Classifier 3: accuracy 99%, recall 0%, precision 0% (strictly speaking undefined, since it predicts no positives; taken as 0 by convention), F-1 score 0.

You can see that classifier 1 is the preferred model since it has the highest F-1 score, although it does not have the highest accuracy. In this example classifier 1 is better than classifier 2 in both recall and precision, so the choice between them is easy. In general, however, one classifier may be better in recall while another is better in precision, which makes a direct comparison hard. The F-1 score combines the two metrics into a single score that helps us decide.
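
The metrics of the three classifiers can be recomputed from the four counts with the sketch below. The counts are derived from the example (e.g. classifier 1 predicts 50 positive of which 5 are correct, so TP = 5 and FP = 45; it predicts 950 negative of which 945 are correct, so TN = 945 and FN = 5). The helper `metrics` is hypothetical, and returning 0 when a denominator is zero is an assumption made here for classifier 3.

```python
def metrics(tp, fp, fn, tn):
    """Return (accuracy, recall, precision, F-1) from the four counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    # Precision is 0/0 when nothing is predicted positive;
    # it is reported as 0 here by convention (an assumption of this sketch).
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    if recall + precision > 0:
        f1 = 2 * recall * precision / (recall + precision)
    else:
        f1 = 0.0
    return accuracy, recall, precision, f1

# (TP, FP, FN, TN) derived from the corona virus example
counts = {
    "Classifier 1": (5, 45, 5, 945),
    "Classifier 2": (1, 41, 9, 949),
    "Classifier 3": (0, 0, 10, 990),
}

results = {name: metrics(*c) for name, c in counts.items()}
for name, (acc, rec, prec, f1) in results.items():
    print(f"{name}: accuracy {acc:.1%}, recall {rec:.1%}, "
          f"precision {prec:.1%}, F-1 {f1:.3f}")
```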

#### 4. Scikit-learn Demo (Breast Cancer Dataset)

In this demo, we will illustrate how to evaluate the performance of a binary classifier on the "breast cancer" dataset.

In [1]:
# import the necessary modules from scikit-learn
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score, recall_score, precision_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

# load the dataset
X, y = load_breast_cancer(return_X_y=True)

# split the data for training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [2]:
# train a linear binary classifier using SVM
model = LinearSVC(random_state=15).fit(X_train, y_train)

In [3]:
# predict the labels of the test dataset
predictions = model.predict(X_test)

We will now evaluate the performance of our model. First, we want to compute the four events shown in the figure above (i.e. TN, FP, FN, TP). Scikit-learn provides them via the confusion matrix:

In [4]:
# compute the confusion matrix
conf_mat = confusion_matrix(y_test, predictions)
print("confusion matrix: \n", conf_mat)

confusion matrix: 
 [[67  0]
 [32 89]]


In [5]:
# print the four events
tn, fp, fn, tp = conf_mat.ravel()
print("TN = ", tn, "FP = ", fp, "FN = ", fn, "TP = ", tp)

TN =  67 FP =  0 FN =  32 TP =  89


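We will now compute the accuracy, recall, precision, and F-1 score. Note that these scikit-learn functions treat label 1 as the positive class by default, and in this dataset label 1 corresponds to "benign" tumors: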

In [6]:
acc = accuracy_score(y_test, predictions)
recall = recall_score(y_test, predictions)
precision = precision_score(y_test, predictions)
f1 = f1_score(y_test, predictions)

In [7]:
print(" Accuracy = ", acc*100, "%\n", "Recall = ", recall*100, "%\n", "Precision = ", precision*100, "%\n",
      "F-1 score = ", f1)

 Accuracy =  82.97872340425532 %
 Recall =  73.55371900826447 %
 Precision =  100.0 %
 F-1 score =  0.8476190476190476
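
As a cross-check, the same values can be recomputed by hand from the four counts of the confusion matrix printed above. This sketch hard-codes those counts instead of re-running the model, and uses only the formulas from Section 3.

```python
# Counts taken from the confusion matrix printed above
tn, fp, fn, tp = 67, 0, 32, 89

accuracy = (tp + tn) / (tp + tn + fp + fn)   # 156 / 188
recall = tp / (tp + fn)                      # 89 / 121
precision = tp / (tp + fp)                   # 89 / 89
f1 = 2 * recall * precision / (recall + precision)

print(f"Accuracy  = {accuracy:.4f}")
print(f"Recall    = {recall:.4f}")
print(f"Precision = {precision:.4f}")
print(f"F-1 score = {f1:.4f}")
```

The model produces no false positives (precision of 100%), but it misses 32 of the 121 positive instances, which is what pulls both the recall and the F-1 score down.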
