# Evaluating Classification Models

Since we are dealing with classification models, where the output is a class/label, we will use the most common
evaluation metrics for classification:
* Precision
* Recall
* F1-Score
* Accuracy
* Confusion matrix

The **confusion matrix** is a table that compares the predicted labels to the actual ones.
In our case we are dealing with binary classification, so it will be a 2x2 matrix.

<img src="https://cdn.prod.website-files.com/660ef16a9e0687d9cc27474a/662c42677529a0f4e97e4f96_644aea65cefe35380f198a5a_class_guide_cm08.png" width="500px">

source: https://www.evidentlyai.com/classification-metrics/confusion-matrix

The above example shows how the table is structured: predicted labels run along the horizontal axis, and actual labels along the vertical axis.
* The **True Positives (TP)** are where the model predicted the label as positive and the actual was positive.
* The **False Positives (FP)** are where the model predicted the label as positive, but the actual was negative.
* The **False Negatives (FN)** are where the model predicted the label as negative, but it was actually positive.
* The **True Negatives (TN)** are where the model predicted the label as negative, and it was actually negative.

We need these values from the confusion matrix in order to calculate **precision, recall, f1-score, and accuracy.**

An easier way to remember is like this:
* TP - Predicted positive, actually positive.
* FP - Predicted positive, actually negative.
* TN - Predicted negative, actually negative.
* FN - Predicted negative, actually positive.
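
The four definitions above can be checked by counting them directly. The following is a minimal sketch, assuming labels encoded as 1 for positive and 0 for negative; the `actual` and `predicted` arrays are made-up values for illustration only.

```python
import numpy as np

# Hypothetical labels for illustration: 1 = positive, 0 = negative.
actual    = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
predicted = np.array([1, 1, 1, 0, 1, 1, 0, 0, 0, 0])

# Each count is the number of samples matching one (predicted, actual) pair.
tp = int(np.sum((predicted == 1) & (actual == 1)))  # predicted positive, actually positive
fp = int(np.sum((predicted == 1) & (actual == 0)))  # predicted positive, actually negative
tn = int(np.sum((predicted == 0) & (actual == 0)))  # predicted negative, actually negative
fn = int(np.sum((predicted == 0) & (actual == 1)))  # predicted negative, actually positive

print(f"TP={tp}, FP={fp}, TN={tn}, FN={fn}")
```

Every sample falls into exactly one of the four groups, so the counts should always add up to the number of samples.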


## 1. Getting the confusion matrix

The easiest way is to use the `confusion_matrix` function from sklearn.

```python
from sklearn.metrics import confusion_matrix
import numpy as np

# confusion_matrix takes the actual labels and the predicted labels.
# Let's use some random labels from numpy to test.

np.random.seed(42)
random_actual = np.random.randint(0, 2, size=200)
random_predicted = np.random.randint(0, 2, size=200)

conf = confusion_matrix(y_true=random_actual,
                        y_pred=random_predicted)
conf
```

```text
array([[51, 49],
       [48, 52]])
```

sklearn's `confusion_matrix` function sorts the labels in ascending order and lays them out left to right and top to bottom, with actual labels as rows and predicted labels as columns.

Because 0 (negative) is sorted before 1 (positive), the matrix is read slightly differently from the image above.

This is how the output is arranged:
```text
                    Predicted
                    Negative(0) Positive(1)

Actual   Negative(0)     TN          FP
         Positive(1)     FN          TP

```
Source: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html
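
To work with the layout above in code, the 2x2 matrix can be flattened into its four counts. The sketch below uses made-up labels (not the random ones from the cell above) and also shows the `labels` parameter of `confusion_matrix`, which can be used to put the positive class first if that ordering is preferred.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels for illustration: 1 = positive, 0 = negative.
actual    = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
predicted = np.array([1, 1, 1, 0, 1, 1, 0, 0, 0, 0])

# Default ordering (labels sorted ascending): [[TN, FP], [FN, TP]].
# ravel() flattens the matrix row by row, giving TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(actual, predicted).ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")

# Passing labels=[1, 0] lists the positive class first: [[TP, FN], [FP, TN]].
conf_pos_first = confusion_matrix(actual, predicted, labels=[1, 0])
print(conf_pos_first)
```

Unpacking with `ravel()` only works as shown for the binary case, where the matrix has exactly four cells.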

## 2. Calculating metrics

Now that we have the confusion matrix, we can calculate precision, recall, F1-score, and accuracy.

Formulas:

```markdown
Metrics for positive labels:
    Precision = TP / (TP+FP)
    Recall = TP / (TP+FN)

Metrics for negative labels:
    Precision = TN / (TN+FN)
    Recall = TN / (TN+FP)

F1-score = (2 * Precision * Recall) / (Precision + Recall)

Accuracy = (TP + TN) / (TP + TN + FP + FN)
```

You can get more resources here: https://medium.com/analytics-vidhya/confusion-matrix-accuracy-precision-recall-f1-score-ade299cf63cd
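
As a worked example, the formulas can be applied by hand to the confusion matrix printed in section 1 (`[[51, 49], [48, 52]]`, read as `[[TN, FP], [FN, TP]]`). This is only a sketch of the arithmetic; it does not guard against a zero denominator, which would need handling when a class is never predicted or never occurs.

```python
# Counts taken from the confusion matrix printed in section 1:
# [[TN, FP], [FN, TP]] = [[51, 49], [48, 52]]
tn, fp, fn, tp = 51, 49, 48, 52

# Metrics for the positive label.
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = (2 * precision * recall) / (precision + recall)

# Metrics for the negative label.
precision_neg = tn / (tn + fn)
recall_neg = tn / (tn + fp)

# Accuracy does not depend on which label is treated as positive.
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-score: {f1:.4f}")
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision (negative): {precision_neg:.4f}")
print(f"Recall (negative): {recall_neg:.4f}")
```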


```markdown
**[Note]**
By default, when doing binary classification, sklearn calculates the precision, recall, and f1-score for the positive label only (average='binary', pos_label=1).
These functions also work with 3 or more labels, but in that case the `average` parameter must be set explicitly (for example 'macro' or 'weighted') to get an averaged precision, recall, and f1_score.
For the purpose of this demonstration we will calculate these metrics only for the positive label.
```
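
As a side note, the sketch below shows how the `pos_label` and `average` parameters of `precision_score` change what is returned (`recall_score` and `f1_score` accept the same parameters). All label arrays here are made up for illustration.

```python
import numpy as np
from sklearn.metrics import precision_score

# Hypothetical binary labels: 1 = positive, 0 = negative.
actual    = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
predicted = np.array([1, 1, 1, 0, 1, 1, 0, 0, 0, 0])

p_pos = precision_score(actual, predicted)                # positive label only (default)
p_neg = precision_score(actual, predicted, pos_label=0)   # treat 0 as the label of interest
p_each = precision_score(actual, predicted, average=None) # one value per label, in sorted order
print(p_pos, p_neg, p_each)

# Hypothetical multiclass labels with three classes.
multi_actual = np.array([0, 0, 1, 1, 2, 2])
multi_pred   = np.array([0, 1, 1, 1, 2, 0])

# The default average='binary' is rejected for multiclass targets.
raised = False
try:
    precision_score(multi_actual, multi_pred)
except ValueError as err:
    raised = True
    print("Default average fails on multiclass:", err)

# An explicit average is needed instead.
p_macro = precision_score(multi_actual, multi_pred, average="macro")
print(f"Macro precision: {p_macro:.4f}")
```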

```python
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score

precision = precision_score(y_true=random_actual, y_pred=random_predicted)
recall = recall_score(y_true=random_actual, y_pred=random_predicted)
f1 = f1_score(y_true=random_actual, y_pred=random_predicted)
acc = accuracy_score(y_true=random_actual, y_pred=random_predicted)

print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-score: {f1:.4f}")
print(f"Accuracy: {acc:.4f}")
```

```text
Precision: 0.5149
Recall: 0.5200
F1-score: 0.5174
Accuracy: 0.5150
```


Another function that neatly displays these metrics is `classification_report`.
It calculates all the metrics for each individual class/label, and also calculates their averages.

* Macro avg - Unweighted mean of the per-class metrics, so every class counts equally.
* Weighted avg - Mean of the per-class metrics weighted by support (the number of actual samples in each class).
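
The difference between the two averages only becomes visible when the classes are imbalanced. The sketch below uses made-up labels with 8 negatives and 2 positives, reads the per-class recall from `classification_report(..., output_dict=True)`, and recomputes both averages by hand.

```python
import numpy as np
from sklearn.metrics import classification_report

# Hypothetical imbalanced labels: 8 negatives (0) and 2 positives (1).
actual    = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
predicted = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 0])

# output_dict=True returns the report as a nested dict instead of a string.
report = classification_report(actual, predicted, output_dict=True)

recall_0, support_0 = report["0"]["recall"], report["0"]["support"]
recall_1, support_1 = report["1"]["recall"], report["1"]["support"]

# Macro avg: plain mean over classes.
macro_recall = (recall_0 + recall_1) / 2

# Weighted avg: each class contributes in proportion to its support.
weighted_recall = (recall_0 * support_0 + recall_1 * support_1) / (support_0 + support_1)

print(f"Macro recall: {macro_recall:.3f}")
print(f"Weighted recall: {weighted_recall:.3f}")
```

Here the weighted average sits closer to the recall of the majority class, while the macro average treats both classes equally.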

```python
from sklearn.metrics import classification_report

# Here we can see the precision, recall and f1_score for each individual label.
print("Binary Classification Report")
report_binary = classification_report(y_true=random_actual, y_pred=random_predicted)
print(report_binary)

# Multiclass
np.random.seed(42)
multi_actual = np.random.randint(1, 5, size=200)
multi_pred = np.random.randint(1, 5, size=200)
print("-" * 50)
print("Multiclass Classification Report")
report_multi = classification_report(y_true=multi_actual, y_pred=multi_pred)
print(report_multi)
```

```text
Binary Classification Report
              precision    recall  f1-score   support

           0       0.52      0.51      0.51       100
           1       0.51      0.52      0.52       100

    accuracy                           0.52       200
   macro avg       0.52      0.52      0.51       200
weighted avg       0.52      0.52      0.51       200

--------------------------------------------------
Multiclass Classification Report
              precision    recall  f1-score   support

           1       0.24      0.26      0.25        46
           2       0.23      0.20      0.21        46
           3       0.34      0.31      0.33        54
           4       0.25      0.28      0.26        54

    accuracy                           0.27       200
   macro avg       0.26      0.26      0.26       200
weighted avg       0.27      0.27      0.26       200
```



## 3. Evaluation example with csv dataset