# Performance Measures I

In this notebook, we will implement **performance measures** for evaluating and comparing classifiers in machine learning. 

At the start, we will implement a function for computing *confusion matrices*.

It serves as a basis for computing the subsequent performance measures with a multi- or single-class focus.

Finally, we will compare the implemented performance measures using a simple exemplary classification task.

### **Table of Contents**
1. [Confusion Matrix](#confusion-matrix)
2. [Performance Measures with a Multi-class Focus](#multi-class)
3. [Performance Measures with a Single-class Focus](#single-class)
4. [Comparison of Performance Measures](#comparison)

In [None]:
%load_ext autoreload
%autoreload 2

import numpy as np
import math
import matplotlib.pyplot as plt

### **1. Confusion Matrix** <a class="anchor" id="confusion-matrix"></a>

The confusion matrix $\mathbf{C}_\mathcal{T}(h) \in \mathbb{N}^{|\mathcal{Y}| \times |\mathcal{Y}|}$ is a table or matrix that is commonly used to evaluate the performance of a
classifier $h: \mathcal{X} \rightarrow \mathcal{Y}$. It summarizes the predictions made by the classifier $h$ on a test set $\mathcal{T} \subset \mathcal{X} \times \mathcal{Y}$. The (unnormalized) entries of the confusion matrix are defined as:

BEGIN SOLUTION

$$
C_{\mathcal{T}}^{ij}(h) = \sum_{(\mathbf{x}, y) \in \mathcal{T}} \delta\left(y=i \wedge h(\mathbf{x})=j\right).
$$

END SOLUTION

There exist other variants of a confusion matrix, where the entries of the confusion matrix are normalized row-wise, column-wise, or by the total sum of entries. We implement the function [`confusion_matrix`](../e2ml/evaluation/_performance_measures.py) in the [`e2ml.evaluation`](../e2ml/evaluation) subpackage.
Once the implementation has been completed, we check its validity for simple examples.
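
To make the definition and the three normalization variants concrete, the following is a minimal NumPy sketch. The helper name `confusion_matrix_sketch` is hypothetical and not part of `e2ml`; it assumes integer labels in $\{0, \dots, |\mathcal{Y}|-1\}$, performs no input validation, and assumes that every row or column used for normalization has a non-zero sum.

```python
import numpy as np


def confusion_matrix_sketch(y_true, y_pred, n_classes=None, normalize=None):
    """Hypothetical helper: rows index true labels, columns index predictions."""
    y_true = np.asarray(y_true, dtype=int)
    y_pred = np.asarray(y_pred, dtype=int)
    if n_classes is None:
        # Assumption: labels are 0, ..., n_classes - 1.
        n_classes = int(max(y_true.max(), y_pred.max())) + 1
    C = np.zeros((n_classes, n_classes), dtype=float)
    # Count each (true label, predicted label) pair once.
    np.add.at(C, (y_true, y_pred), 1)
    if normalize == "true":
        # Row-wise normalization: each row sums to one.
        C = C / C.sum(axis=1, keepdims=True)
    elif normalize == "pred":
        # Column-wise normalization: each column sums to one.
        C = C / C.sum(axis=0, keepdims=True)
    elif normalize == "all":
        # Normalization by the total number of samples.
        C = C / C.sum()
    return C


C_counts = confusion_matrix_sketch([0, 0, 1, 1], [1, 0, 1, 0])
C_all = confusion_matrix_sketch([0, 0, 1, 1], [1, 0, 1, 0], normalize="all")
print(C_counts)
print(C_all)
```

A complete implementation would additionally reject negative labels, non-integer labels, and arrays of unequal length, as the checks in the next cell require.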

In [None]:
from e2ml.evaluation import confusion_matrix

# Check ranges of class labels.
y_1 = [0, -1, 2]
y_2 = [0, 1, 2]
check = False
try:
    confusion_matrix(y_true=y_1, y_pred=y_2)
except ValueError:
    check = True
assert check, 'There must be a ValueError because of invalid values.'
check = False
try:
    confusion_matrix(y_true=y_2, y_pred=y_1)
except ValueError:
    check = True
assert check, 'There must be a ValueError because of invalid values.'

# Check type of class labels.
y_1 = ["hello", "new", "test"]
y_2 = [0, 1, 2]
check = False
try:
    confusion_matrix(y_true=y_1, y_pred=y_2)
except ValueError:
    check = True
assert check, 'There must be a ValueError because of invalid value types.'
check = False
try:
    confusion_matrix(y_true=y_2, y_pred=y_1)
except ValueError:
    check = True
assert check, 'There must be a ValueError because of invalid value types.'

# Check unequal array lengths.
y_1 = [0, 1, 2, 3]
y_2 = [0, 1, 2]
check = False
try:
    confusion_matrix(y_true=y_1, y_pred=y_2)
except ValueError:
    check = True
assert check, 'There must be a ValueError because of unequal array lengths.'
check = False
try:
    confusion_matrix(y_true=y_2, y_pred=y_1)
except ValueError:
    check = True
assert check, 'There must be a ValueError because of unequal array lengths.'


# Test correct computation for various simple examples.
# BEGIN SOLUTION
y_true = [0, 0, 1, 1]
y_pred = [1, 0, 1, 0]
C_true = [[1, 1], [1, 1]]
C = confusion_matrix(y_true=y_true, y_pred=y_pred)
np.testing.assert_array_equal(C_true, C)
C_true = [[1, 1, 0], [1, 1, 0], [0, 0, 0]]
C = confusion_matrix(y_true=y_true, y_pred=y_pred, n_classes=3)
np.testing.assert_array_equal(C_true, C)
C_true = np.full((2, 2), fill_value=0.25)
C = confusion_matrix(y_true=y_true, y_pred=y_pred, n_classes=2, normalize="all")
np.testing.assert_array_equal(C_true, C)
C_true = np.full((2, 2), fill_value=0.5)
C = confusion_matrix(y_true=y_true, y_pred=y_pred, n_classes=2, normalize="true")
np.testing.assert_array_equal(C_true, C)
C = confusion_matrix(y_true=y_true, y_pred=y_pred, n_classes=2, normalize="pred")
np.testing.assert_array_equal(C_true, C)
# END SOLUTION

### **2. Performance Measures with a Multi-class Focus** <a class="anchor" id="multi-class"></a>

The accuracy $\mathrm{ACC}_\mathcal{T}(h) \in [0, 1]$ of a classifier $h$ on a test set $\mathcal{T}$ is one of the best-known performance measures. It can be computed as the complement of the empirical risk $R_\mathcal{T}(h)$ or from the confusion matrix $\mathbf{C}_{\mathcal{T}}(h)$ as follows:

BEGIN SOLUTION

$$
\mathrm{ACC}_{\mathcal{T}}(h) = 1 - R_\mathcal{T}(h) = \frac{\sum_{y \in \mathcal{Y}}C^{yy}_{\mathcal{T}}(h)}{\sum_{i \in \mathcal{Y}}\sum_{j \in \mathcal{Y}}C^{i j}_{\mathcal{T}}(h)}.
$$

END SOLUTION

We implement the function [`accuracy`](../e2ml/evaluation/_performance_measures.py) in the [`e2ml.evaluation`](../e2ml/evaluation) subpackage.
Once the implementation has been completed, we check its validity for simple examples.
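
As a small worked example of the confusion-matrix form of the accuracy, the sketch below divides the sum of the diagonal entries (correct predictions) by the sum of all entries (all predictions). The matrix is written out by hand for `y_true = [0, 0, 1, 1]` and `y_pred = [1, 0, 1, 0]`; the variable names are illustrative only.

```python
import numpy as np

# Unnormalized confusion matrix for y_true = [0, 0, 1, 1] and y_pred = [1, 0, 1, 0]
# (rows: true labels, columns: predicted labels).
C = np.array([[1, 1],
              [1, 1]])

# Correct predictions lie on the diagonal; the sum of all entries is the test set size.
acc_from_C = np.trace(C) / C.sum()

# Equivalent computation directly from the labels.
y_true = np.array([0, 0, 1, 1])
y_pred = np.array([1, 0, 1, 0])
acc_from_labels = np.mean(y_true == y_pred)

print(acc_from_C, acc_from_labels)  # 0.5 0.5
```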

In [None]:
from e2ml.evaluation import accuracy

# Check unequal array lengths.
y_1 = [0, 1, 2, 3]
y_2 = [0, 1, 2]
check = False
try:
    accuracy(y_true=y_1, y_pred=y_2)
except ValueError:
    check = True
assert check, 'There must be a ValueError because of unequal array lengths.'
check = False
try:
    accuracy(y_true=y_2, y_pred=y_1)
except ValueError:
    check = True
assert check, 'There must be a ValueError because of unequal array lengths.'


# Test correct computation for various simple examples.
# BEGIN SOLUTION
y_true = [0, 0, 1, 1]
y_pred = [1, 0, 1, 0]
acc_true = 0.5
assert acc_true == accuracy(y_true=y_true, y_pred=y_pred)
y_true = ["hello", "test", "test", "hello"]
y_pred = ["hello", "test", "test", "test"]
acc_true = 0.75
assert acc_true == accuracy(y_true=y_true, y_pred=y_pred)
# END SOLUTION

#### **Question:**
2. (a) What are the limitations of accuracy as a performance measure?

   BEGIN SOLUTION
   
   Limitations of accuracy are:
   - it conveys no information about the performance on individual classes, although their importance may differ,
   - it conveys little meaningful information when the class distribution is skewed,
   - it cannot account for unequal misclassification costs, because errors on all classes count equally.
   
   END SOLUTION
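
The second limitation can be illustrated with a small, purely hypothetical example: on a test set with 95 samples of class 0 and 5 samples of class 1, a classifier that always predicts the majority class reaches a high accuracy although it never detects the minority class.

```python
import numpy as np

# Hypothetical imbalanced test set: 95 samples of class 0, 5 samples of class 1.
y_true = np.array([0] * 95 + [1] * 5)
# A trivial classifier that always predicts the majority class 0.
y_pred = np.zeros_like(y_true)

# Accuracy: fraction of correctly classified samples.
acc = np.mean(y_true == y_pred)
# Recall of the minority class: fraction of class-1 samples predicted as class 1.
recall_minority = np.mean(y_pred[y_true == 1] == 1)

print(acc, recall_minority)  # 0.95 0.0
```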

Cohen's $\kappa$ gives a more realistic estimate of classifier effectiveness than accuracy: it measures the proportion of labels that the classifier gets right over and above chance agreement. We can compute this performance measure according to:

BEGIN SOLUTION

$$
\kappa_\mathcal{T}(h) = \frac{p^o_\mathcal{T}(h) - p^e_\mathcal{T}(h)}{1 - p^e_\mathcal{T}(h)}, 
$$

where $p^o_\mathcal{T}(h)$ is the relative observed agreement between the classifier's predictions and the true labels, while $p^e_\mathcal{T}(h)$ is the hypothetical probability of chance agreement. The latter is estimated from the observed data as the probability that the predictions and the true labels coincide if both were assigned at random according to their observed class frequencies.

END SOLUTION

We implement the function [`cohen_kappa`](../e2ml/evaluation/_performance_measures.py) in the [`e2ml.evaluation`](../e2ml/evaluation) subpackage. Once the implementation has been completed, we check its validity for simple examples.
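
The following sketch shows one way to obtain both agreement terms from an unnormalized confusion matrix: the observed agreement is the fraction of samples on the diagonal, and the chance agreement sums, over all classes, the product of the relative row sum (true label frequency) and the relative column sum (prediction frequency). The matrix is written out by hand for `y_true = [1, 1, 1, 1, 0]` and `y_pred = [1, 1, 1, 1, 1]`; the variable names are illustrative, and the case of a chance agreement equal to one, where $\kappa$ is undefined, is not handled.

```python
import numpy as np

# Confusion matrix for y_true = [1, 1, 1, 1, 0] and y_pred = [1, 1, 1, 1, 1]
# (rows: true labels, columns: predicted labels).
C = np.array([[0, 1],
              [0, 4]], dtype=float)
n = C.sum()

# Observed agreement: fraction of samples on the diagonal.
p_o = np.trace(C) / n
# Chance agreement: per class, frequency of the true label (row sum) times
# frequency of the prediction (column sum), summed over all classes.
p_e = np.sum(C.sum(axis=1) * C.sum(axis=0)) / n**2

# Assumption: p_e < 1, otherwise the denominator would be zero.
kappa = (p_o - p_e) / (1 - p_e)
print(p_o, p_e, kappa)
```

Here the classifier reaches an accuracy of 0.8 merely by always predicting class 1, and the chance agreement is equally high, so $\kappa$ is zero.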

In [None]:
from e2ml.evaluation import cohen_kappa

# Test correct computation for various simple examples.
# BEGIN SOLUTION
y_true = [0, 0, 1, 1]
y_pred = [1, 0, 1, 0]
kappa_true = 0.0
assert kappa_true == cohen_kappa(y_true=y_true, y_pred=y_pred)
y_true = [1, 1, 1, 1, 0]
y_pred = [1, 1, 1, 1, 1]
kappa_true = 0.0
assert kappa_true == cohen_kappa(y_true=y_true, y_pred=y_pred)
# END SOLUTION

### **3. Performance Measures with a Single-class Focus** <a class="anchor" id="single-class"></a>

The F measure combines precision and recall into a single score by computing their weighted harmonic mean. For any $\alpha \in \mathbb{R}_{>0}$, the F measure can be given as:

BEGIN SOLUTION

$$
F^{\alpha}_{\mathcal{T}}(h) = \frac{(1+\alpha)\left(\mathrm{Prec}_\mathcal{T}(h) \cdot \mathrm{Rec}_\mathcal{T}(h)\right)}{\alpha \cdot \mathrm{Prec}_\mathcal{T}(h) + \mathrm{Rec}_{\mathcal{T}}(h)}.
$$

END SOLUTION

The F1 measure, or balanced F measure, weights the recall and precision of the classifier evenly via $\alpha=1$. The macro F1 measure is an extension toward multi-class problems. Its idea is to compute the F1 score for each class and then take the arithmetic mean of these scores.

We implement the function [`macro_f1_measure`](../e2ml/evaluation/_performance_measures.py) in the [`e2ml.evaluation`](../e2ml/evaluation) subpackage. Once the implementation has been completed, we check its validity for simple examples.
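
A possible way to compute the macro F1 measure from an unnormalized confusion matrix is sketched below. It uses the fact that, for $\alpha=1$, the F measure of a class simplifies to $2\,\mathrm{TP} / (2\,\mathrm{TP} + \mathrm{FP} + \mathrm{FN})$. The matrix is written out by hand for a four-class example, and the convention of assigning an F1 score of zero to a class without any true or predicted samples is an assumption of this sketch, not necessarily the behavior of `macro_f1_measure`.

```python
import numpy as np

# Confusion matrix (rows: true labels, columns: predicted labels) for
# y_true = [1, 1, 1, 1, 0, 2, 2, 2, 2, 3] and
# y_pred = [1, 1, 1, 1, 0, 2, 2, 2, 2, 0].
C = np.array([[1, 0, 0, 0],
              [0, 4, 0, 0],
              [0, 0, 4, 0],
              [1, 0, 0, 0]], dtype=float)

tp = np.diag(C)          # true positives per class
fp = C.sum(axis=0) - tp  # predicted as the class but belonging to another class
fn = C.sum(axis=1) - tp  # belonging to the class but predicted as another class

# F1 = 2 * Prec * Rec / (Prec + Rec) simplifies to 2 * TP / (2 * TP + FP + FN).
# Assumption: classes with a zero denominator get an F1 score of zero.
denom = 2 * tp + fp + fn
f1_per_class = np.divide(2 * tp, denom, out=np.zeros_like(denom), where=denom > 0)

# Macro F1: unweighted arithmetic mean of the class-wise F1 scores.
macro_f1 = f1_per_class.mean()
print(f1_per_class, macro_f1)
```

Class 3 is never predicted, so its F1 score is zero and pulls the macro average down to about 2/3, although 9 of 10 samples are classified correctly.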

In [None]:
from e2ml.evaluation import macro_f1_measure

# Test correct computation for various simple examples.
# BEGIN SOLUTION
y_true = [0, 0, 1, 1]
y_pred = [1, 0, 1, 0]
macro_f1_true = 0.5
assert macro_f1_true == macro_f1_measure(y_true=y_true, y_pred=y_pred)
y_true = [1, 1, 1, 1, 0, 2, 2, 2, 2, 3]
y_pred = [1, 1, 1, 1, 0, 2, 2, 2, 2, 0]
macro_f1_true = 2./3.
assert np.isclose(macro_f1_true, macro_f1_measure(y_true=y_true, y_pred=y_pred))
# END SOLUTION

### **4. Comparison of Performance Measures** <a class="anchor" id="comparison"></a>
In the following, we perform an exemplary evaluation study to compare the performance measures accuracy, Cohen's kappa, and macro F1. To this end, we fit a logistic regression model on a synthetic data set and compute the corresponding measures.

In [None]:
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# Generate classification dataset.
X, y = make_blobs(n_samples=[20, 50, 400], random_state=0)

# Visualize the dataset.
plt.scatter(X[:, 0], X[:, 1], c=y) # <-- SOLUTION
plt.show() # <-- SOLUTION

# Split the dataset into 80% training data and 20% test data.
# BEGIN SOLUTION
indices = np.arange(len(X))
size = int(0.8 * len(indices))
# Draw 80% of the indices for training; the remaining 20% form the test set.
train = np.random.RandomState(0).choice(indices, replace=False, size=size)
test = np.setdiff1d(indices, train)
# END SOLUTION

# Fit a logistic regression model on the training data.
lr = LogisticRegression(max_iter=2000, random_state=0) # <- SOLUTION
lr.fit(X[train], y[train]) # <- SOLUTION

# Evaluate and print the three performance measures on the training and test set.
# BEGIN SOLUTION
y_pred = lr.predict(X)
print(f"Training accuracy: {accuracy(y[train], y_pred[train])}")
print(f"Test accuracy: {accuracy(y[test], y_pred[test])}")
print(f"Training Cohen's kappa: {cohen_kappa(y[train], y_pred[train])}")
print(f"Test Cohen's kappa: {cohen_kappa(y[test], y_pred[test])}")
print(f"Training macro F1 measure: {macro_f1_measure(y[train], y_pred[train])}")
print(f"Test macro F1 measure: {macro_f1_measure(y[test], y_pred[test])}")
# END SOLUTION