Supported by sklearn: https://scikit-learn.org/stable/modules/model_evaluation.html
https://scikit-learn.org/stable/modules/model_evaluation.html#classification-metrics

# Classification Labels

Binary classification

- precision_recall_curve(y_true, probas_pred)
- roc_curve(y_true, y_score)

Binary + multiclass classification

- balanced_accuracy_score(y_true, y_pred)
- cohen_kappa_score(y1, y2)
- confusion_matrix(y_true, y_pred)
- hinge_loss(y_true, pred_decision)
- matthews_corrcoef(y_true, y_pred)
- roc_auc_score(y_true, y_score)

Binary + multiclass + multilabel classification

- accuracy_score(y_true, y_pred)
- classification_report(y_true, y_pred)
- f1_score(y_true, y_pred)
- fbeta_score(y_true, y_pred)
- hamming_loss(y_true, y_pred)
- jaccard_score(y_true, y_pred)
- log_loss(y_true, y_pred)
- multilabel_confusion_matrix(y_true, y_pred)
- precision_recall_fscore_support(y_true, y_pred)
- precision_score(y_true, y_pred)
- recall_score(y_true, y_pred)
- roc_auc_score(y_true, y_score)
- zero_one_loss(y_true, y_pred)

Binary + multilabel classification

- average_precision_score(y_true, y_score)


# Confusion Matrix

![](img/tptnfpfn.png)

# Confusion Matrix - Extended

![](img/tptnfpfn-extended.png)


# Support

Support is the number of actual occurrences of the class in the test data set. Imbalanced support in the training data may indicate the need for stratified sampling or rebalancing.
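As a minimal sketch, support can be counted directly from the true labels. The label values below are made up for illustration.

```python
from collections import Counter

# Hypothetical true labels for a small, imbalanced test set
y_true = ["cat", "cat", "cat", "cat", "dog", "dog", "bird"]

# Support is the count of actual occurrences of each class
support = Counter(y_true)

for label, count in support.most_common():
    print(f"{label}: support={count} ({count / len(y_true):.0%} of the test set)")
```

A class with very low support (here `bird`) will have noisy per-class metrics, since a single example can swing its score.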

# Precision

What proportion of predicted positives is truly positive? In other words, of all the examples predicted to be in the positive class, how many actually were positive? In the case of multi-class classification, this is calculated for each class (this class / not this class):

$\frac{TP}{TP + FP}$
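A rough sketch of the formula in plain Python, applied per class (this class / not this class). The helper name `precision` and the labels are hypothetical; sklearn's `precision_score` from the list above is what you would normally call.

```python
def precision(y_true, y_pred, positive):
    """TP / (TP + FP), treating `positive` as 'this class' and the rest as 'not this class'."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t != positive)
    # Returning 0.0 when the class is never predicted is a convention chosen for this sketch
    return tp / (tp + fp) if (tp + fp) else 0.0


# Hypothetical multi-class labels
y_true = ["a", "a", "a", "b", "b", "c"]
y_pred = ["a", "a", "b", "b", "c", "c"]

per_class_precision = {cls: precision(y_true, y_pred, cls) for cls in sorted(set(y_true))}
print(per_class_precision)
```

Class `a` gets a precision of 1.0 because every example predicted as `a` really is `a`, even though one true `a` was missed.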

# Recall

aka Sensitivity

What proportion of actual positives are predicted positive? In other words, of all the examples that actually belong to the positive class, how many were correctly predicted? In the case of multi-class classification, this is calculated for each class (this class / not this class):

$\frac{TP}{TP + FN}$
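The same idea as a small sketch for recall. The helper name `recall` and the labels are hypothetical; sklearn's `recall_score` is the usual choice in practice.

```python
def recall(y_true, y_pred, positive):
    """TP / (TP + FN), treating `positive` as 'this class' and the rest as 'not this class'."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    # Returning 0.0 when the class never occurs is a convention chosen for this sketch
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical multi-class labels
y_true = ["a", "a", "a", "b", "b", "c"]
y_pred = ["a", "a", "b", "b", "c", "c"]

per_class_recall = {cls: recall(y_true, y_pred, cls) for cls in sorted(set(y_true))}
print(per_class_recall)
```

Here class `c` has perfect recall (its only example was found), while class `a` loses recall for the one example that was predicted as `b`.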


# Specificity

Number of examples correctly predicted to be negative out of all actual negatives (TN + FP)

$\frac{TN}{TN + FP}$


# Type I Error

aka False Positive Rate

Number of examples falsely predicted to be positive out of all actual negatives (FP + TN)

$\frac{FP}{FP + TN}$


# Type II Error

aka False Negative Rate

Number of examples falsely predicted to be negative out of all actual positives (FN + TP)

$\frac{FN}{FN + TP}$
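Specificity, the false positive rate, and the false negative rate all come from the same four confusion-matrix counts. The sketch below tallies them by hand for binary labels; `confusion_counts` is a hypothetical helper and the data is made up.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, TN, FP, FN) for binary labels."""
    tp = tn = fp = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive:
            if t == positive:
                tp += 1
            else:
                fp += 1
        else:
            if t == positive:
                fn += 1
            else:
                tn += 1
    return tp, tn, fp, fn


# Hypothetical binary labels
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0, 0, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)

specificity = tn / (tn + fp)          # correct negatives out of all actual negatives
false_positive_rate = fp / (fp + tn)  # Type I error
false_negative_rate = fn / (fn + tp)  # Type II error

print(tp, tn, fp, fn)
print(specificity, false_positive_rate, false_negative_rate)
```

Note that specificity and the false positive rate share a denominator, so they always sum to 1; the same holds for recall and the false negative rate.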

# Accuracy

What proportion of all examples were correctly predicted?

$\frac{TP + TN}{TP + TN + FP + FN}$ 

# Balanced Accuracy

Binary classification, balanced accuracy:

$\frac{1}{2} ( \frac{TP}{TP + FN} + \frac{TN}{TN + FP} )$

To extend to multiclass classification,

$\frac{1}{\text{n classes}} \sum \frac{TP}{TP + FN}$

over each class, where each term is $\frac{\text{correctly classified}}{\text{total true in class}}$, i.e. the recall of that class.
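To see why balanced accuracy matters, here is a small sketch comparing it with plain accuracy for a classifier that always predicts the majority class. The helper names and data are hypothetical; sklearn's `balanced_accuracy_score` is the library equivalent.

```python
def accuracy(y_true, y_pred):
    """Fraction of examples whose prediction matches the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Arithmetic mean of per-class recall (correctly classified / total true in class)."""
    recalls = []
    for cls in set(y_true):
        total = sum(1 for t in y_true if t == cls)
        correct = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        recalls.append(correct / total)
    return sum(recalls) / len(recalls)


# Hypothetical imbalanced data: 9 negatives, 1 positive, model always says "negative"
y_true = [0] * 9 + [1]
y_pred = [0] * 10

print(accuracy(y_true, y_pred))           # looks strong
print(balanced_accuracy(y_true, y_pred))  # no better than guessing
```

Accuracy rewards the model for the nine easy negatives, while balanced accuracy averages a recall of 1.0 on the negative class with a recall of 0.0 on the positive class.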

# (Per-Class) F1-Score

F1-score uses the harmonic mean instead of the arithmetic mean:

$2 \times \frac{precision \times recall}{precision + recall}$

It always lies between precision and recall, but it is pulled toward the lower of the two, so a low value in either one is penalized.
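A quick sketch of how the harmonic mean behaves compared with the arithmetic mean. The function name `f1` and the input values are illustrative only.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # convention chosen for this sketch
    return 2 * precision * recall / (precision + recall)


# Hypothetical, very lopsided precision and recall
precision, recall = 0.9, 0.1

arithmetic_mean = (precision + recall) / 2
harmonic_mean = f1(precision, recall)

print(arithmetic_mean)  # about 0.5
print(harmonic_mean)    # about 0.18, much closer to the weaker metric
```

When precision and recall are equal, both means agree; the further apart they are, the more F1 drops below the arithmetic mean.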


# (Combined) Macro-F1, Macro-Precision, Macro-Recall

Arithmetic mean of per-class f1-scores, precision, and recall.

# (Combined) Weighted-F1, Weighted-Precision, Weighted-Recall

Weighted average of per-class f1-scores, precision, and recall by the number of examples in each class.
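The difference between the macro and weighted averages is easiest to see with numbers. The per-class F1 scores and supports below are invented for illustration; the same averaging applies to precision and recall.

```python
# Hypothetical per-class F1 scores and supports (number of true examples per class)
per_class_f1 = {"a": 0.8, "b": 0.5, "c": 0.2}
support = {"a": 80, "b": 15, "c": 5}

# Macro: every class counts equally
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)

# Weighted: every class counts in proportion to its support
total = sum(support.values())
weighted_f1 = sum(per_class_f1[cls] * support[cls] for cls in per_class_f1) / total

print(macro_f1)     # about 0.5
print(weighted_f1)  # about 0.725
```

The weighted score is higher here because the large class `a` also has the best F1; the macro score exposes the poor performance on the small classes.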

# (Combined) Micro-F1, Micro-Precision, Micro-Recall

Calculate micro-precision and micro-recall by summing TP, FP, and FN across all classes before computing each metric. In single-label multi-class classification, every misclassified example counts once as a false positive (for the predicted class) and once as a false negative (for the true class), so the summed FP equals the summed FN even though they differ for each class individually. This makes precision == recall, which makes the f1-score the same value. The summed TP is the number of correctly classified examples and TP + FP is the total number of examples, so the general accuracy formula above gives the same value too. So:

precision = recall = f1-score = accuracy

Great article that covers this: https://towardsdatascience.com/multi-class-metrics-made-simple-part-ii-the-f1-score-ebe8b2c2ca1
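The equality can be checked numerically with a short sketch. The labels are hypothetical, and this assumes single-label multi-class data (each example has exactly one true class and one predicted class).

```python
# Hypothetical single-label multi-class data
y_true = ["a", "a", "a", "b", "b", "c"]
y_pred = ["a", "a", "b", "b", "c", "c"]

pairs = list(zip(y_true, y_pred))
tp = fp = fn = 0
for cls in sorted(set(y_true) | set(y_pred)):
    # Sum the one-vs-rest counts over every class
    tp += sum(1 for t, p in pairs if t == cls and p == cls)
    fp += sum(1 for t, p in pairs if t != cls and p == cls)
    fn += sum(1 for t, p in pairs if t == cls and p != cls)

micro_precision = tp / (tp + fp)
micro_recall = tp / (tp + fn)
micro_f1 = 2 * micro_precision * micro_recall / (micro_precision + micro_recall)
accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)

print(tp, fp, fn)
print(micro_precision, micro_recall, micro_f1, accuracy)
```

Each of the two misclassified examples shows up once in the FP total and once in the FN total, which is why all four numbers agree.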


# F1-Score Caution

As the eminent statistician David Hand explained, “the relative importance assigned to precision and recall should be an aspect of the problem”. Classifying a sick person as healthy has a different cost from classifying a healthy person as sick, and this should be reflected in the way weights and costs are used to select the best classifier for the specific problem you are trying to solve. The standard F1-scores do not take any of the domain knowledge into account.

# Matthews Correlation Coefficient (MCC)

https://en.wikipedia.org/wiki/Matthews_correlation_coefficient

MCC, originally devised for binary classification on unbalanced classes, has been extended to evaluate multiclass classifiers by computing the correlation coefficient between the observed and predicted classifications. A coefficient of +1 represents a perfect prediction, 0 is similar to a random prediction and −1 indicates an inverse prediction.

$\frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$

MCC is more informative than F1 score and accuracy in evaluating binary classification problems because it takes into account the relative sizes of all four confusion matrix categories (TP, TN, FP, FN). This is especially helpful when dealing with a highly imbalanced dataset, since your algorithm may be predicting most/all examples as the dominant class. By considering the proportion of each category of the confusion matrix in its formula, its score is high only if your classifier is doing well on both the negative and positive examples.
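As a sketch of that claim, the counts below describe a hypothetical dataset where the positive class dominates and the classifier handles the few negatives badly. The helper `mcc` follows the binary formula above; sklearn's `matthews_corrcoef` is the library version.

```python
from math import sqrt


def mcc(tp, tn, fp, fn):
    """Binary Matthews correlation coefficient from the four confusion-matrix counts."""
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Returning 0.0 when a row or column of the matrix is empty is a convention for this sketch
    return (tp * tn - fp * fn) / denominator if denominator else 0.0


# Hypothetical counts: 95 actual positives, 5 actual negatives
tp, tn, fp, fn = 90, 1, 4, 5

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f}  f1={f1:.3f}  mcc={mcc(tp, tn, fp, fn):.3f}")
```

Accuracy and F1 both look excellent because they are driven by the large positive class, while MCC stays low because only 1 of the 5 negatives was classified correctly.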


# Kappa Score, aka Cohen's Kappa Coefficient

This score measures the degree of agreement between two evaluators (for example, two professors rating the same admissions applications), but in data science we are looking at the agreement between the true and predicted values. The kappa score considers how much better the agreement is than chance agreement. Thus, in addition to Agree, the observed proportion of agreement, the kappa formula also uses the expected proportion of chance agreements; let’s call this number ChanceAgree.

$KappaScore = \frac{Agree-ChanceAgree}{1-ChanceAgree} $

Note that the numerator calculates the difference between Agree and ChanceAgree. If Agree=1, we have perfect agreement. In this case, the kappa score is 1, regardless of ChanceAgree. In contrast, if Agree=ChanceAgree, kappa is 0, signifying that the professors’ agreement is by chance. If Agree is smaller than ChanceAgree, the kappa score is negative, denoting that the degree of agreement is lower than chance agreement.

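$Agree = \frac{TP}{N}$, where TP is summed across all classes (the number of examples whose true and predicted labels match) and N is the total number of examples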

To calculate the probability of a particular class for each evaluator, calculate the following for each class:

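$Prob_{true}(ThisClass) = \frac{N^{true}_{ThisClass}}{N}, $
$Prob_{pred}(ThisClass) = \frac{N^{pred}_{ThisClass}}{N}$

where $N^{true}_{ThisClass}$ is the number of examples whose true label is this class and $N^{pred}_{ThisClass}$ is the number of examples predicted as this class.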

To calculate ChanceAgree for a particular class:

$ChanceAgree(ThisClass) = Prob_{true}(ThisClass) \times Prob_{pred}(ThisClass)$

Summing up the above probability for each class, we get the probability that agreement on any of the classes happened by chance:

$ChanceAgree = ChanceAgree(ThisClass)+ChanceAgree(AnotherClass)+ChanceAgree(YetAnotherClass)$


This is one of the best metrics for evaluating multi-class classifiers on imbalanced datasets.

The traditional metrics from the classification report are biased towards the majority class and assume an identical distribution of the actual and predicted classes.

```python
from sklearn.metrics import cohen_kappa_score
```
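The Agree and ChanceAgree steps above can be sketched by hand as follows. `kappa_score` is a hypothetical helper written for this note, not sklearn's implementation, and the labels are made up.

```python
from collections import Counter


def kappa_score(y_true, y_pred):
    """(Agree - ChanceAgree) / (1 - ChanceAgree), with true labels and predictions as the two evaluators."""
    n = len(y_true)
    agree = sum(1 for t, p in zip(y_true, y_pred) if t == p) / n

    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Sum over classes of Prob_true(class) * Prob_pred(class)
    chance_agree = sum(
        (true_counts[cls] / n) * (pred_counts[cls] / n)
        for cls in set(y_true) | set(y_pred)
    )
    # Note: this sketch does not guard against ChanceAgree == 1 (a single class everywhere)
    return (agree - chance_agree) / (1 - chance_agree)


# Hypothetical binary labels
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0, 0, 0]
print(kappa_score(y_true, y_pred))  # Agree = 0.7, ChanceAgree = 0.5

# Always predicting the majority class: high accuracy, but no agreement beyond chance
print(kappa_score([0] * 9 + [1], [0] * 10))
```

The second call shows why kappa is useful on imbalanced data: the predictions match 90% of the labels, yet all of that agreement is what chance alone would produce.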

# Cross-Entropy / Log Loss

https://towardsdatascience.com/understanding-binary-cross-entropy-log-loss-a-visual-explanation-a3ac6025181a

Log loss is usually used when there are just two possible outcomes that can be either 0 or 1. Cross entropy is usually used when there are three or more possible outcomes.

Cross-entropy measures the extent to which the predicted probabilities match the given data, and is useful for probabilistic classifiers such as Naïve Bayes. It is a more generic form of the logarithmic loss function, which was derived from neural network architecture, and is used to quantify the cost of inaccurate predictions. The classifier with the lowest log loss is preferred.

(Most associated with Log Loss)

$- \sum^N_i \sum^M_j y_{ij} \times \ln p_{ij}$


Example on how to calculate:

1. Make a one-hot encoding of the true labels, e.g. $y_1$=A=[1,0,0]
2. Use your model to predict probabilities, e.g. $p_1$=[0.7,0.2,0.1]
3. Take the element-wise product of the label and the log of the probabilities, then sum over the classes: $y_1 \cdot \ln p_1 = \ln(0.7)$
4. Do the same for all the other samples and take the negative of the sum
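The four steps can be sketched in plain Python as below. The function name `cross_entropy` and the second sample are invented for illustration. The sketch returns the sum, matching the formula above; dividing by the number of samples gives the mean, which (as far as I know) is what sklearn's `log_loss` reports by default.

```python
from math import log


def cross_entropy(y_onehot, probs):
    """Negative sum over samples and classes of y_ij * ln(p_ij)."""
    total = 0.0
    for y_i, p_i in zip(y_onehot, probs):
        # Only the true class (y_ij == 1) contributes; skipping zeros also avoids ln(0)
        total += sum(y_ij * log(p_ij) for y_ij, p_ij in zip(y_i, p_i) if y_ij)
    return -total


# Step 1: one-hot true labels (sample 1 is class A, sample 2 is class B)
y_onehot = [[1, 0, 0], [0, 1, 0]]
# Step 2: hypothetical predicted probabilities for each sample
probs = [[0.7, 0.2, 0.1], [0.3, 0.6, 0.1]]

# Steps 3 and 4: -(ln(0.7) + ln(0.6))
loss = cross_entropy(y_onehot, probs)
print(loss)
print(loss / len(y_onehot))  # mean loss per sample
```

Only the probability assigned to the true class matters, so a confident wrong prediction (a small probability on the true class) is penalized heavily.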

https://jamesmccaffrey.wordpress.com/2016/09/25/log-loss-and-cross-entropy-are-almost-the-same/

The equation I've had more success with (more associated with Cross-Entropy):

$-(y_t \log(y_p) + (1 - y_t) \log(1 - y_p))$

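```python
# import libraries
from math import log


def logloss(y_true, y_pred):
    """Mean binary log loss; y_pred holds the predicted probability of the positive class."""
    return -(1 / len(y_true)) * sum(
        [(t * log(p)) + ((1 - t) * log(1 - p)) for t, p in zip(y_true, y_pred)]
    )


# Three separate binary samples with true labels 1, 0, 0 (not a single one-hot vector)
logloss([1, 0, 0], [0.7, 0.2, 0.1])
```

```
0.22839300363692283
```

```python
from sklearn.metrics import log_loss

log_loss([1, 0, 0], [0.7, 0.2, 0.1])
```

```
0.22839300363692283
```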

# ROC-AUC Score

AUC provides an aggregate measure of performance across all possible classification thresholds. One way of interpreting AUC is as the probability that the model ranks a random positive example more highly than a random negative example.

AUC helps us to compare one ROC curve to another by calculating the area under the curve. Generally, the more area it captures the better.


---

Some important characteristics of ROC-AUC are:

- The value ranges from 0 to 1. A model whose predictions are 100% wrong has an AUC of 0.0; one whose predictions are 100% correct has an AUC of 1.0. A random classifier has an AUC of about 0.5.
- ROC-AUC score is independent of the threshold set for classification because it only considers the rank of each prediction and not its absolute value. The same is not true for the F1 score, which needs a threshold when the model outputs probabilities.

AUC is desirable for the following two reasons:

- AUC is **scale-invariant**. It measures how well predictions are ranked, rather than their absolute values.
- AUC is **classification-threshold-invariant**. It measures the quality of the model's predictions irrespective of what classification threshold is chosen.

However, both these reasons come with caveats, which may limit the usefulness of AUC in certain use cases:

- **Scale invariance is not always desirable.** For example, sometimes we really do need well calibrated probability outputs, and AUC won’t tell us about that.
- **Classification-threshold invariance is not always desirable.** In cases where there are wide disparities in the cost of false negatives vs. false positives, it may be critical to minimize one type of classification error. For example, when doing email spam detection, you likely want to prioritize minimizing false positives (even if that results in a significant increase of false negatives). AUC isn't a useful metric for this type of optimization.

---

The probabilistic interpretation of ROC-AUC score is that if you randomly choose a positive case and a negative case, the probability that the positive case outranks the negative case according to the classifier is given by the AUC.

Mathematically, it is the area under the curve of sensitivity (TPR) plotted against FPR (1 - specificity). Ideally, we would like to have high sensitivity and high specificity, but in real-world scenarios there is always a tradeoff between the two.
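The probabilistic interpretation can be turned into a brute-force sketch: compare every positive example with every negative example and count how often the positive one scores higher. `auc_by_ranking` is a hypothetical helper, and counting ties as half a win is a common convention assumed here. For real work, `roc_auc_score` is the function to use.

```python
def auc_by_ranking(y_true, y_score):
    """Fraction of (positive, negative) pairs in which the positive example scores higher."""
    positives = [s for t, s in zip(y_true, y_score) if t == 1]
    negatives = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0
            elif pos_score == neg_score:
                wins += 0.5  # ties count as half (assumed convention)
    return wins / (len(positives) * len(negatives))


# Hypothetical labels and scores
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
print(auc_by_ranking(y_true, y_score))  # 3 of the 4 pairs are ranked correctly

# Scale invariance: squaring the scores keeps their order, so the AUC is unchanged
squared = [s ** 2 for s in y_score]
print(auc_by_ranking(y_true, squared))
```

The second call illustrates why AUC says nothing about calibration: any order-preserving transformation of the scores gives the same value.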


# Binary Classification

# Precision-Recall Curve

The precision-recall curve shows the tradeoff between precision and recall for different thresholds. A high area under the curve represents both high recall and high precision. This is more commonly used with binary classification but can be used in multi-label classification as well.

![](img/precision-recall.png)
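A minimal sketch of how the points on such a curve could be computed: use each distinct score as a threshold and record precision and recall there. The helper name and data are hypothetical, and this assumes at least one positive example. `precision_recall_curve` from the list above does this in sklearn.

```python
def precision_recall_points(y_true, y_score):
    """Return (threshold, precision, recall) for each distinct score used as a threshold."""
    points = []
    for threshold in sorted(set(y_score)):
        y_pred = [1 if s >= threshold else 0 for s in y_score]
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
        # tp + fp >= 1 because the example sitting at the threshold is always predicted positive
        points.append((threshold, tp / (tp + fp), tp / (tp + fn)))
    return points


# Hypothetical labels and scores
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

points = precision_recall_points(y_true, y_score)
for threshold, precision, recall in points:
    print(f"threshold={threshold:.2f}  precision={precision:.2f}  recall={recall:.2f}")
```

Raising the threshold generally trades recall for precision, although precision does not have to improve monotonically, as the dip at threshold 0.40 shows.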

# ROC Curve

Start predicting every example as positive and calculate a point. Then increase the threshold until one sample would be predicted as negative. Then two. And so on until all examples are predicted to be negative. Plot each of those points.

Plot a point at (False Positive Rate, True Positive Rate).

Another way to interpret a point (0, 0.75) is that the model is correctly classifying 75% of positive samples and 100% of negative samples.

Depending on how many false positives or false negatives you're willing to accept in application, an optimal point can be selected.

You can replace the False Positive Rate with Precision (which gives the precision-recall curve) when you have a highly imbalanced dataset where there are mostly negative examples.

---

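The point that is farthest above and to the left of the slope = 1 line is the threshold with the largest gap between the true positive rate and the false positive rate, so it keeps the proportion of examples incorrectly classified as positive (false positives) low relative to the true positives gained.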

![](img/roc.png)
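The threshold sweep described in this section can be sketched as follows, with each point stored as (threshold, FPR, TPR). `roc_points` is a hypothetical helper and the data is made up. Choosing the point with the largest TPR minus FPR is one possible selection rule, not the only one. sklearn's `roc_curve` returns the same kind of information.

```python
def roc_points(y_true, y_score):
    """Return (threshold, FPR, TPR) tuples, from 'everything positive' to 'everything negative'."""
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = len(y_true) - n_pos
    # Each distinct score is a threshold; +inf adds the final "all negative" point
    thresholds = sorted(set(y_score)) + [float("inf")]
    points = []
    for threshold in thresholds:
        tp = sum(1 for t, s in zip(y_true, y_score) if s >= threshold and t == 1)
        fp = sum(1 for t, s in zip(y_true, y_score) if s >= threshold and t == 0)
        points.append((threshold, fp / n_neg, tp / n_pos))
    return points


# Hypothetical labels and scores
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

points = roc_points(y_true, y_score)
for threshold, fpr, tpr in points:
    print(f"threshold={threshold}  FPR={fpr:.2f}  TPR={tpr:.2f}")

# One possible "optimal" point: the largest vertical distance above the diagonal
best = max(points, key=lambda point: point[2] - point[1])
print("best threshold by TPR - FPR:", best[0])
```

If false positives are much more costly than false negatives in your application, you would instead pick a point further to the left, accepting a lower true positive rate.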