# About
This notebook contains the code from my Medium article "Evaluating Multi-label Classifiers", published in Towards Data Science, Medium's top data science publication.

You can find the post [here](https://towardsdatascience.com/evaluating-multi-label-classifiers-a31be83da6ea).

### Importing Python libraries
Apart from the usual visualization libraries (`matplotlib` and `seaborn`) and the numerical computation library `numpy`, we'll use several of sklearn's functions and classes:
- `make_multilabel_classification`: to generate multilabel data
- `multilabel_confusion_matrix`: to generate confusion matrices
- `classification_report`: to generate a classification report
- `MultiLabelBinarizer`: to binarize the labels in our training data
- `precision_score`, `recall_score` and `f1_score`: to compute these metrics

In [1]:
import numpy as np
from sklearn.datasets import make_multilabel_classification
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import multilabel_confusion_matrix, classification_report
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import precision_score, recall_score, f1_score

# Generating our data
We'll use `make_multilabel_classification` to generate a thousand datapoints with 2 features and 3 classes. We'll leave the other parameters at their default values.

In [2]:
X, y = make_multilabel_classification(n_samples=1000, n_features=2, n_classes=3)

In [3]:
X.shape, y.shape

((1000, 2), (1000, 3))

In [4]:
X

array([[47., 11.],
       [29., 13.],
       [23., 34.],
       ...,
       [19., 41.],
       [43., 13.],
       [19., 20.]])

In [5]:
y

array([[0, 0, 1],
       [0, 1, 1],
       [1, 1, 0],
       ...,
       [1, 1, 0],
       [0, 0, 1],
       [1, 1, 1]])

### Note
Since `make_multilabel_classification` directly returns binarized output, we don't have to binarize it ourselves before modelling. But just to show how it's done, we'll use the `MultiLabelBinarizer`.

## The MultiLabelBinarizer
Sklearn's `MultiLabelBinarizer` binarizes labels in multi-label classification problems. Each bit of the resulting binary string represents the presence or absence of one label.

In our case, we have three classes A, B and C. So post encoding, we'll have binary strings 3 bits long.
As an example, a data point with labels (A, B) will be encoded as [1, 1, 0], since A, B are present but C isn't.
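
To make the encoding concrete, here is a minimal pure-Python sketch of the same idea. The helper name `binarize` is hypothetical and not part of sklearn; it only illustrates what the binarized output means and is not how the library implements it.

```python
def binarize(label_sets, classes):
    """Encode each collection of labels as a list of 0/1 flags, one per class."""
    return [[1 if c in labels else 0 for c in classes] for labels in label_sets]


classes = ['A', 'B', 'C']

# A data point labelled (A, B): A and B are present, C is not.
encoded = binarize([('A', 'B')], classes)
print(encoded)  # [[1, 1, 0]]
```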

In [6]:
multilabel_binarizer = MultiLabelBinarizer(classes=['A', 'B', 'C'])

We'll need to fit our binarizer first. Since we don't have any data with actual class labels, we'll pass it some mock data.

In [7]:
# fit expects an iterable of label collections, so the mock sample is wrapped in a list
multilabel_binarizer.fit([['A', 'B', 'C']])

MultiLabelBinarizer(classes=['A', 'B', 'C'])

Let's test it. We'll use the same example as we did in the introduction to this section.
Let's encode the labels (A, B).

In [8]:
multilabel_binarizer.transform([['A', 'B']])

array([[1, 1, 0]])

We get the expected result.

### Note
In any machine learning approach, the next step after preprocessing would be to train a model. We won't be doing that here, since the purpose of this article is to learn how metrics are calculated, and not modeling itself. So we'll assume that we trained some model and we'll evaluate on a test set with three points.

# Evaluation
Since the focus of my article is more on low-level aspects of these metrics, let's take three datapoints as our test set. We'll randomly create these points and also randomly decide some predictions.


Let's say we have the following three expected labels for our test set. We'll store them in a variable called `y_expected`.


In [9]:
y_expected = [
    ['A', 'C'],
    ['C'],
    ['A', 'B', 'C']
]

Let's use our binarizer on this to convert it to a more useful format.

In [10]:
y_expected = multilabel_binarizer.transform(y_expected)
print(y_expected)

[[1 0 1]
 [0 0 1]
 [1 1 1]]


Similarly, let's assume that for the same test set, our model made the following predictions. They're stored in `y_pred`.

In [11]:
y_pred = [
    ['A', 'B'],
    ['C'],
    ['B', 'C']
]

We'll binarize this too.

In [12]:
y_pred = multilabel_binarizer.transform(y_pred)
print(y_pred)

[[1 1 0]
 [0 0 1]
 [0 1 1]]


## Generating confusion matrices
The `multilabel_confusion_matrix` function generates one 2x2 confusion matrix per class and returns them stacked in a single array of shape `(n_classes, 2, 2)`.

In [13]:
matrix = multilabel_confusion_matrix(y_expected, y_pred)

### Confusion Matrix for Class A
The expected output, laid out as `[[TN, FP], [FN, TP]]`, is
```
[[1, 0],
 [1, 1]]
```

In [14]:
confusion_matrix_A = matrix[0]
print(confusion_matrix_A)

[[1 0]
 [1 1]]


This is consistent with our expected matrix.
Similar computations can be done for the other two classes -- B and C.

In [15]:
confusion_matrix_B = matrix[1]
print(confusion_matrix_B)

[[1 1]
 [0 1]]


In [16]:
confusion_matrix_C = matrix[2]
print(confusion_matrix_C)

[[0 0]
 [1 2]]
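
As a cross-check, the three matrices above can be reproduced by counting by hand. The sketch below uses plain Python lists copied from the printed binarized labels; `confusion_for_class`, `expected` and `predicted` are illustrative names chosen here, not sklearn functions or notebook variables.

```python
# Binarized labels copied from the outputs above; columns are classes A, B, C.
expected = [[1, 0, 1], [0, 0, 1], [1, 1, 1]]
predicted = [[1, 1, 0], [0, 0, 1], [0, 1, 1]]


def confusion_for_class(expected, predicted, k):
    """Return [[TN, FP], [FN, TP]] for the class stored in column k."""
    pairs = [(t_row[k], p_row[k]) for t_row, p_row in zip(expected, predicted)]
    tn = pairs.count((0, 0))  # label absent and not predicted
    fp = pairs.count((0, 1))  # label absent but predicted
    fn = pairs.count((1, 0))  # label present but missed
    tp = pairs.count((1, 1))  # label present and predicted
    return [[tn, fp], [fn, tp]]


for name, k in zip("ABC", range(3)):
    print(name, confusion_for_class(expected, predicted, k))
# A [[1, 0], [1, 1]]
# B [[1, 1], [0, 1]]
# C [[0, 0], [1, 2]]
```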


## Computing Precision, Recall and F1-score for Class A
In the article, we manually calculated these metrics for class A. Let's use sklearn to see if our results are consistent with it.

### Precision for class A
```
manually computed: 1.0
```

When the `average` parameter is set to `None`, sklearn's `precision_score` function returns an array with one precision score per class.

In [17]:
precision = precision_score(y_expected, y_pred, average=None)

Precision for class A would be the first element of this array.

In [18]:
precision_A = precision[0]
print(precision_A)

1.0


The value returned by sklearn matches the result of our manual calculation.
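
For reference, the manual figure can be recomputed in a few lines. This is a small sketch using only the class A column of the binarized labels; the variable names are illustrative.

```python
# Column 0 (class A) of the binarized expected and predicted labels.
expected_A = [1, 0, 1]
predicted_A = [1, 0, 0]

# True positives: A was predicted and A is actually present.
tp = sum(1 for t, p in zip(expected_A, predicted_A) if t == 1 and p == 1)
# False positives: A was predicted but A is absent.
fp = sum(1 for t, p in zip(expected_A, predicted_A) if t == 0 and p == 1)

# Precision = TP / (TP + FP)
precision_A_manual = tp / (tp + fp)
print(precision_A_manual)  # 1.0
```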

### Recall for class A
```
manually computed: 0.5
```

When the `average` parameter is set to `None`, sklearn's `recall_score` function returns an array with one recall score per class.

In [19]:
recall = recall_score(y_expected, y_pred, average=None)

Recall for class A would be the first element of this array.

In [20]:
recall_A = recall[0]
print(recall_A)

0.5


The value returned by sklearn matches the result of our manual calculation.
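
The same hand count works for recall, swapping false positives for false negatives. Again a sketch with illustrative variable names:

```python
# Column 0 (class A) of the binarized expected and predicted labels.
expected_A = [1, 0, 1]
predicted_A = [1, 0, 0]

# True positives: A is present and was predicted.
tp = sum(1 for t, p in zip(expected_A, predicted_A) if t == 1 and p == 1)
# False negatives: A is present but was not predicted.
fn = sum(1 for t, p in zip(expected_A, predicted_A) if t == 1 and p == 0)

# Recall = TP / (TP + FN)
recall_A_manual = tp / (tp + fn)
print(recall_A_manual)  # 0.5
```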

### F1-Score for class A
```
manually computed: 0.667
```

Sklearn's `f1_score` function works the same way: with `average=None`, it returns one F1-score per class.

In [21]:
f1_scores = f1_score(y_expected, y_pred, average=None)

F1-score for class A would be the first element of this array.

In [22]:
f1_score_A = f1_scores[0]
print(round(f1_score_A, 3))

0.667


The value returned by sklearn matches the result of our manual calculation.
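
The F1-score is the harmonic mean of precision and recall, so the manual value follows directly from the two class A figures. A short sketch, with illustrative variable names:

```python
# Precision and recall for class A, as computed earlier.
precision_A_manual = 1.0
recall_A_manual = 0.5

# F1 = 2 * P * R / (P + R), the harmonic mean of precision and recall.
f1_A_manual = 2 * precision_A_manual * recall_A_manual / (precision_A_manual + recall_A_manual)
print(round(f1_A_manual, 3))  # 0.667
```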

### Summary of Scores for our classes
Just like we did for class A, we can get the scores for B and C.

In [23]:
print("Precision, Recall and F1-score for class B: {0:.3f} {1:.3f} {2:.3f}".format(precision[1], recall[1], f1_scores[1]))
print("Precision, Recall and F1-score for class C: {0:.3f} {1:.3f} {2:.3f}".format(precision[2], recall[2], f1_scores[2]))

Precision, Recall and F1-score for class B: 0.500 1.000 0.667
Precision, Recall and F1-score for class C: 1.000 0.667 0.800


## Aggregate metrics
For precision, recall and F1-score, we can compute aggregate metrics:
- macro average
- micro average
- weighted average
- samples average

To check the correctness of our manual calculations of the aggregate values for precision, we'll use sklearn to calculate the same metric and compare the two.

### Macro Average for Precision
This is computed by passing `average="macro"` to sklearn's `precision_score` function.
```
expected: 0.833
```

In [24]:
precision_score(y_expected, y_pred, average='macro')

0.8333333333333334

This matches our calculated result. We'll do the same for the other aggregates.
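
The macro average is just the unweighted mean of the per-class scores, which is easy to verify by hand. A sketch using the three per-class precision values reported earlier:

```python
# Per-class precision for A, B and C.
per_class_precision = [1.0, 0.5, 1.0]

# Macro average: every class counts equally, regardless of its support.
macro_precision = sum(per_class_precision) / len(per_class_precision)
print(round(macro_precision, 3))  # 0.833
```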

### Micro Average for Precision
This is computed by passing `average="micro"` to sklearn's `precision_score` function.
```
expected: 0.8
```

In [25]:
precision_score(y_expected, y_pred, average='micro')

0.8

This also matches our calculated result.
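
The micro average pools the true positives and false positives of all classes before dividing. A sketch, with the counts read off the per-class confusion matrices (the dictionary layout is just an illustrative choice):

```python
# (TP, FP) for each class, taken from the confusion matrices.
counts = {"A": (1, 0), "B": (1, 1), "C": (2, 0)}

total_tp = sum(tp for tp, _ in counts.values())  # 4
total_fp = sum(fp for _, fp in counts.values())  # 1

# Micro precision = pooled TP / (pooled TP + pooled FP)
micro_precision = total_tp / (total_tp + total_fp)
print(micro_precision)  # 0.8
```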

### Weighted Average for Precision
This is computed by passing `average="weighted"` to sklearn's `precision_score` function.
```
expected: 0.9166
```

In [26]:
precision_score(y_expected, y_pred, average='weighted')

0.9166666666666666

This also matches our calculated result. Finally, we'll compute the samples average for precision.
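
For reference, the weighted figure can be reproduced by weighting each class's precision by its support, i.e. the number of test points that truly carry that label. A sketch with illustrative variable names:

```python
# Per-class precision and support (number of true instances) for A, B and C.
per_class_precision = [1.0, 0.5, 1.0]
support = [2, 1, 3]

# Weighted average: each class contributes in proportion to its support.
weighted_precision = sum(p * s for p, s in zip(per_class_precision, support)) / sum(support)
print(round(weighted_precision, 4))  # 0.9167
```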

### Samples Average for Precision
This is computed by passing `average="samples"` to sklearn's `precision_score` function.
```
expected: 0.833
```

In [27]:
precision_score(y_expected, y_pred, average='samples')

0.8333333333333334

All of these are consistent with our calculations.
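
Unlike the other aggregates, the samples average is computed row by row: precision is calculated for each test point and then averaged. The sketch below uses plain lists copied from the binarized labels. `sample_precision` is a hypothetical helper, and returning 0.0 for a sample with no predicted labels is an assumption made here to avoid dividing by zero.

```python
# Binarized labels; each row is one test point, columns are classes A, B, C.
expected = [[1, 0, 1], [0, 0, 1], [1, 1, 1]]
predicted = [[1, 1, 0], [0, 0, 1], [0, 1, 1]]


def sample_precision(t_row, p_row):
    """Fraction of the labels predicted for one sample that are correct."""
    n_predicted = sum(p_row)
    if n_predicted == 0:
        return 0.0  # assumption: a sample with no predicted labels scores 0
    n_correct = sum(1 for t, p in zip(t_row, p_row) if t == 1 and p == 1)
    return n_correct / n_predicted


per_sample = [sample_precision(t, p) for t, p in zip(expected, predicted)]
print(per_sample)  # [0.5, 1.0, 1.0]

samples_precision = sum(per_sample) / len(per_sample)
print(round(samples_precision, 3))  # 0.833
```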

## The Classification Report
Using sklearn's `classification_report` function, we can quickly print out the entire classification report of our classifier on the 3 test data points. This report provides scores for:
- precision, recall and f1-score for individual classes
- support for individual classes
- aggregates over these metrics -- micro, macro, weighted and samples average
- total support

In [28]:
print(classification_report(y_expected, y_pred, output_dict=False, target_names=['class A', 'class B', 'class C']))

              precision    recall  f1-score   support

     class A       1.00      0.50      0.67         2
     class B       0.50      1.00      0.67         1
     class C       1.00      0.67      0.80         3

   micro avg       0.80      0.67      0.73         6
   macro avg       0.83      0.72      0.71         6
weighted avg       0.92      0.67      0.73         6
 samples avg       0.83      0.72      0.77         6
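
The support column in the report is the number of test points that truly carry each label, and the total support is their sum. A quick sketch of that count, using the binarized expected labels as a plain list:

```python
# Binarized expected labels; columns are classes A, B, C.
expected = [[1, 0, 1], [0, 0, 1], [1, 1, 1]]

# Support per class = column sums of the expected label matrix.
support = [sum(column) for column in zip(*expected)]
print(support, sum(support))  # [2, 1, 3] 6
```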

