In [1]:
# Imports
import tensorflow as tf
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense


# [Metrics](https://keras.io/api/metrics/)

A metric is a function that is used to judge the performance of your model.

Metric functions are similar to loss functions, except that the results from evaluating a metric are not used when training the model. Note that you may use any loss function as a metric.

The metrics available in the Keras library are:
1. Accuracy metrics
    1. Accuracy class
    2. BinaryAccuracy class
    3. CategoricalAccuracy class
    4. TopKCategoricalAccuracy class
    5. SparseTopKCategoricalAccuracy class
2. Probabilistic metrics
    1. BinaryCrossentropy class
    2. CategoricalCrossentropy class
    3. SparseCategoricalCrossentropy class
    4. KLDivergence class
    5. Poisson class
3. Regression metrics
    1. MeanSquaredError class
    2. RootMeanSquaredError class
    3. MeanAbsoluteError class
    4. MeanAbsolutePercentageError class
    5. MeanSquaredLogarithmicError class
    6. CosineSimilarity class
    7. LogCoshError class
4. Classification metrics based on True/False positives & negatives
    1. AUC class
    2. Precision class
    3. Recall class
    4. TruePositives class
    5. TrueNegatives class
    6. FalsePositives class
    7. FalseNegatives class
    8. PrecisionAtRecall class
    9. SensitivityAtSpecificity class
    10. SpecificityAtSensitivity class
5. Image segmentation metrics
    1. MeanIoU class
6. Hinge metrics for "maximum-margin" classification
    1. Hinge class
    2. SquaredHinge class
    3. CategoricalHinge class




***
## Introduction

Choosing the right metric is crucial when evaluating machine learning (ML) models, and many metrics have been proposed for different applications. In some applications, a single metric may not give you the whole picture of the problem you are solving, and you may want to use a subset of the metrics. We will discuss a few of them here, but remember that other metrics exist too.

***


***
## Difference Between Metric and Loss Function

It is also worth mentioning that a metric is different from a loss function. Loss functions measure model performance and are used to train a machine learning model (using some kind of optimization), so they are usually **differentiable** in the model’s parameters. Metrics, on the other hand, are used to monitor and measure the performance of a model (during training and testing) and do not need to be differentiable. However, if the performance metric for a task is differentiable, it can be used both as a loss function (perhaps with some regularization added to it) and as a metric; MSE is one example.

***

## Confusion Matrix 

**Remember that the confusion matrix is not a metric, but it is an important concept to learn.**

One of the key concepts in classification performance is the confusion matrix, also known as the error matrix, which is a tabular visualization of the model predictions versus the ground-truth labels. Each row of the confusion matrix represents the instances in a predicted class, and each column represents the instances in an actual class.

Let’s go through this with an example. Let’s assume we are building a binary classifier to separate cat images from non-cat images, and that our test set has 1100 images (1000 non-cat images and 100 cat images), with the confusion matrix below.

![ConfusionMatrix](image6.png)


- **Out of 100 cat images**, the model has predicted 90 correctly and has misclassified 10. If we refer to the “cat” class as positive and the non-cat class as negative, then the 90 samples predicted as cat are true positives, and the 10 samples predicted as non-cat are false negatives.
- **Out of 1000 non-cat images**, the model has classified 940 correctly and misclassified 60. The 940 correctly classified samples are referred to as true negatives, and those 60 are referred to as false positives.

As we can see, the diagonal elements of this matrix denote the correct predictions for the different classes, while the off-diagonal elements denote the samples that are misclassified.
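The four counts above can be written down directly. The snippet below is a plain-Python sketch (the variable names are illustrative and are not part of Keras) that stores them and checks that they add up to the test-set sizes quoted in the example.

```python
# Counts read off the cat / non-cat example ("cat" is the positive class).
tp = 90    # cat images predicted as cat
fn = 10    # cat images predicted as non-cat
tn = 940   # non-cat images predicted as non-cat
fp = 60    # non-cat images predicted as cat

actual_cat = tp + fn          # all images that really are cats
actual_non_cat = tn + fp      # all images that really are not cats
total = actual_cat + actual_non_cat

correct = tp + tn             # diagonal cells of the confusion matrix
wrong = fp + fn               # off-diagonal cells

print("actual cat / non-cat:", actual_cat, actual_non_cat)
print("total:", total, "correct:", correct, "wrong:", wrong)
```

Keeping the four counts as named variables makes every metric in the following sections a one-line expression.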


***
## Classification metrics based on True/False positives & negatives

### Classification Accuracy

Classification Accuracy is measured using the relationship

\begin{equation}
Accuracy = \frac{Number\ of\ Correct\ Predictions}{Total\ Number\ of\ Predictions}
\end{equation}

In Keras, `tf.keras.metrics.Accuracy(name="accuracy", dtype=None)` can be used to calculate it.
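As a quick sanity check of the formula, the following plain-Python sketch computes the accuracy of the cat / non-cat example from its confusion-matrix counts. It does not call Keras; the variable names are only illustrative.

```python
# Confusion-matrix counts from the cat / non-cat example.
tp, fn, tn, fp = 90, 10, 940, 60

correct_predictions = tp + tn
total_predictions = tp + fn + tn + fp

accuracy = correct_predictions / total_predictions
print(f"accuracy = {accuracy:.4f}")  # 1030 correct out of 1100
```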

***



### Precision class

There are many cases in which classification accuracy is not a good indicator of your model performance. 

One of these scenarios is when your class distribution is imbalanced (one class is more frequent than others). In this case, even if you predict all samples as the most frequent class you would get a high accuracy rate, which does not make sense at all (because your model is not learning anything, and is just predicting everything as the top class). 

For example, in our cat vs. non-cat classification above, if the model predicts all samples as non-cat, its accuracy would be 1000/1100 = 90.9%. Therefore, we need to look at class-specific performance metrics too. **Precision** is one such metric, defined as:

\begin{equation}
Precision= \frac{True\_Positive}{True\_Positive+ False\_Positive}
\end{equation}

The precision of Cat and Non-Cat class in the above example can be calculated as:

\begin{equation}
Precision\_cat= \frac{samples \ correctly\  predicted \ cat}{samples\ predicted\ as\ cat} = \frac{90}{90+60} = 60\% 
\end{equation}

\begin{equation}
Precision\_NonCat= \frac{940}{950}= 98.9\%
\end{equation}

As we can see, the model has much higher precision in predicting non-cat samples than cats. This is not surprising, as the model has seen more examples of non-cat images during training, making it better at classifying that class.
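The two precision values can be reproduced with a few lines of plain Python. This is only a sketch of the arithmetic, not the Keras implementation; for the non-cat class, the roles of "positive" and "negative" are swapped, so its true positives are the 940 correctly classified non-cat images.

```python
# Confusion-matrix counts with "cat" as the positive class.
tp, fn, tn, fp = 90, 10, 940, 60

# Precision for the cat class: correct cat predictions / all cat predictions.
precision_cat = tp / (tp + fp)

# Precision for the non-cat class: correct non-cat predictions / all non-cat predictions.
precision_non_cat = tn / (tn + fn)

print(f"precision (cat)     = {precision_cat:.3f}")
print(f"precision (non-cat) = {precision_non_cat:.3f}")
```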


#### In Code
In keras, we can use `tf.keras.metrics.Precision(thresholds=None, top_k=None, class_id=None, name=None, dtype=None)`. Let's see a few examples:



**Example 1**

In [2]:
m = tf.keras.metrics.Precision()
m.update_state([0, 1, 1, 1], [1, 0, 1, 1])
m.result().numpy()


0.6666667

**Example 2:**

In [3]:
m = tf.keras.metrics.Precision()
m.update_state([0, 1, 1, 1], [1, 0, 1, 1], sample_weight=[0, 0, 1, 0])
m.result().numpy()

1.0

**Example 3:**

In [4]:
m = tf.keras.metrics.Precision(top_k=2)
m.update_state([0, 0, 1, 1], [1, 1, 1, 1])
m.result().numpy()

0.0

**Example 4:**
With `top_k=4`, it will calculate precision over $y\_true[:4]$ and $y\_pred[:4]$


In [5]:
m = tf.keras.metrics.Precision(top_k=4)
m.update_state([0, 0, 1, 1], [1, 1, 1, 1])
m.result().numpy()

0.5

**Usage with compile() API:**

In [6]:
model = Sequential(name = 'Model') 
model.add(Dense(units = 1,  activation='sigmoid', input_shape=(2,),name='Layer_1'))

model.compile(optimizer='sgd',
              loss='mse',
              metrics=[tf.keras.metrics.Precision()])

***
### Recall class

Recall is another important metric, defined as the fraction of samples from a class that are correctly predicted by the model:

\begin{equation}
Recall= \frac{True\_Positive}{True\_Positive + False\_Negative}
\end{equation}

Therefore, for our example above, the recall rate of cat and non-cat classes can be found as:

\begin{equation}
Recall\_cat= \frac{90}{100}= 90\%\\
Recall\_NonCat= \frac{940}{1000}= 94\%
\end{equation}
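The recall figures can be checked the same way. The sketch below uses plain Python; the helper `recall_from_labels` is a hypothetical name introduced here for illustration, and it assumes hard 0/1 predictions.

```python
# Confusion-matrix counts with "cat" as the positive class.
tp, fn, tn, fp = 90, 10, 940, 60

recall_cat = tp / (tp + fn)        # 90 of the 100 cat images are found
recall_non_cat = tn / (tn + fp)    # 940 of the 1000 non-cat images are found


def recall_from_labels(y_true, y_pred):
    """Recall of the positive class (label 1) for hard 0/1 predictions."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    false_neg = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return true_pos / (true_pos + false_neg)


print(recall_cat, recall_non_cat)
print(recall_from_labels([0, 1, 1, 1], [1, 0, 1, 1]))  # 2 of 3 positives found
```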



#### In Code

In keras, we can use `tf.keras.metrics.Recall(
    thresholds=None, top_k=None, class_id=None, name=None, dtype=None
)`. Let's see a few examples:

**Example 1:**

In [7]:
m = tf.keras.metrics.Recall()
m.update_state([0, 1, 1, 1], [1, 0, 1, 1])
m.result().numpy()

0.6666667

**Example 2:**

In [8]:
m = tf.keras.metrics.Recall()
m.update_state([0, 1, 1, 1], [1, 0, 1, 1], sample_weight=[0, 0, 1, 0])
m.result().numpy()

1.0

**Example 3: Usage with compile() API:**

In [9]:
model.compile(optimizer='sgd',
              loss='mse',
              metrics=[tf.keras.metrics.Recall()])

***

Likewise, we can compute

1. **TruePositives** class using
`tf.keras.metrics.TruePositives(thresholds=None, name=None, dtype=None)`
2. **TrueNegatives** class using 
`tf.keras.metrics.TrueNegatives(thresholds=None, name=None, dtype=None)`
3. **FalsePositives** class using
`tf.keras.metrics.FalsePositives(thresholds=None, name=None, dtype=None)`
4. **FalseNegatives** class using
`tf.keras.metrics.FalseNegatives(thresholds=None, name=None, dtype=None)`
5. **PrecisionAtRecall** class using
`tf.keras.metrics.PrecisionAtRecall(recall, num_thresholds=200, class_id=None, name=None, dtype=None)`

 It computes the best precision where recall is >= the specified value.
 This metric creates four local variables, **true_positives, true_negatives, false_positives and false_negatives** that are used to compute the precision at the given recall. The threshold for the given recall value is computed and used to evaluate the corresponding precision.
***

### F1 Score
Depending on application, you may want to give higher priority to recall or precision. But there are many applications in which both recall and precision are important. Therefore, it is natural to think of a way to combine these two into a single metric. One popular metric which combines precision and recall is called F1-score, which is the harmonic mean of precision and recall defined as:

\begin{equation}
F1\text{-}score= \frac{2*Precision*Recall}{Precision+Recall}
\end{equation}

So for our cat vs. non-cat example with the confusion matrix shown earlier, the F1-score of the cat class can be calculated as:

\begin{equation}
F1\_cat= \frac{2*0.6*0.9}{0.6+0.9}= 72\%
\end{equation}
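The harmonic-mean formula is easy to check numerically. The helper below is a minimal plain-Python sketch (the name `f1_score` is chosen here for illustration and is not a Keras function); it returns 0 when both inputs are 0 to avoid a division by zero.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall of the cat class from the example above.
f1_cat = f1_score(0.6, 0.9)
print(f"F1 (cat) = {f1_cat:.2f}")
```

Because it is a harmonic mean, the F1-score is pulled toward the smaller of the two inputs, so a model cannot score well by maximizing only one of them.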

***

### Sensitivity and Specificity
Sensitivity and specificity are two other popular metrics mostly used in medical and biology related fields, and are defined as:

\begin{equation}
Sensitivity = Recall= \frac{TP}{TP+FN} \\
Specificity = True\ Negative\ Rate= \frac{TN}{TN+FP}
\end{equation}

#### (In Code) SensitivityAtSpecificity class

**SensitivityAtSpecificity** class can be computed using 
`tf.keras.metrics.SensitivityAtSpecificity(specificity, num_thresholds=200, class_id=None, name=None, dtype=None)`

It computes best sensitivity where specificity is >= specified value: the sensitivity at a given specificity.

**Sensitivity** measures the proportion of actual positives that are correctly identified as such, $\frac{tp}{tp + fn}$. **Specificity** measures the proportion of actual negatives that are correctly identified as such, $\frac{tn}{tn + fp}$.

This metric creates four local variables, **true_positives, true_negatives, false_positives and false_negatives** that are used to compute the sensitivity at the given specificity. The threshold for the given specificity value is computed and used to evaluate the corresponding sensitivity.
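To make the idea of "best sensitivity at a given specificity" concrete, here is a plain-Python sketch of a threshold sweep. It is an illustration under simplifying assumptions (evenly spaced thresholds in [0, 1], a prediction counted as positive when its score is strictly above the threshold, both classes present), and the function name is hypothetical; the actual Keras implementation may differ in details.

```python
def sensitivity_at_specificity(y_true, y_score, min_specificity, num_thresholds=200):
    """Best sensitivity over all thresholds whose specificity meets the target."""
    best = 0.0
    for i in range(num_thresholds):
        threshold = i / (num_thresholds - 1)
        tp = sum(1 for t, s in zip(y_true, y_score) if t == 1 and s > threshold)
        fn = sum(1 for t, s in zip(y_true, y_score) if t == 1 and s <= threshold)
        tn = sum(1 for t, s in zip(y_true, y_score) if t == 0 and s <= threshold)
        fp = sum(1 for t, s in zip(y_true, y_score) if t == 0 and s > threshold)
        sensitivity = tp / (tp + fn)
        specificity = tn / (tn + fp)
        if specificity >= min_specificity:
            best = max(best, sensitivity)
    return best


result = sensitivity_at_specificity([0, 0, 0, 1, 1], [0, 0.3, 0.8, 0.3, 0.8], 0.6)
print(result)
```

On this small data set, any threshold between 0.3 and 0.8 gives a specificity of 2/3 and a sensitivity of 1/2, which is the best sensitivity that still satisfies the 0.6 requirement.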



**Example 1:**

In [10]:
m = tf.keras.metrics.SensitivityAtSpecificity(0.6)
m.update_state([0, 0, 0, 1, 1], [0, 0.3, 0.8, 0.3, 0.8])
m.result().numpy()

0.5

**Example 2:**

In [11]:
m = tf.keras.metrics.SensitivityAtSpecificity(0.5)
m.update_state([0, 0, 0, 1, 1], [0, 0.3, 0.8, 0.3, 0.8], sample_weight=[1, 1, 2, 2, 1])
m.result().numpy()

0.333333

**Example 3: Usage with compile() API**

In [12]:
model.compile(
    optimizer='sgd',
    loss='mse',
    metrics=[tf.keras.metrics.SensitivityAtSpecificity(0.5)])

***
Similarly, **SpecificityAtSensitivity** class can be computed using
`tf.keras.metrics.SpecificityAtSensitivity(sensitivity, num_thresholds=200, class_id=None, name=None, dtype=None)`

It computes best specificity where sensitivity is >= specified value.

***

### ROC Curve

The receiver operating characteristic (ROC) curve is a plot that shows the performance of a binary classifier as a function of its cut-off threshold. It essentially shows the true positive rate (TPR) against the false positive rate (FPR) for various threshold values. 

**Explanation**

Many of the classification models are probabilistic; that is, they predict the probability of a sample being a cat. They then compare that output probability with some cut-off threshold and if it is larger than the threshold they predict its label as cat, otherwise as non-cat. As an example, your model may predict the below probabilities for 4 sample images: $[0.45, 0.6, 0.7, 0.3]$. Then depending on the threshold values below, you will get different labels:

\begin{equation}
\begin{aligned}
\text{cut-off} = 0.5 &: \ \text{predicted labels} = [0,1,1,0] \ \text{(default threshold)} \\
\text{cut-off} = 0.2 &: \ \text{predicted labels} = [1,1,1,1] \\
\text{cut-off} = 0.8 &: \ \text{predicted labels} = [0,0,0,0]
\end{aligned}
\end{equation}
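The three cases above can be reproduced with a short plain-Python sketch. The helper name `predict_labels` is illustrative only, and it assumes that a sample is labeled positive when its probability is strictly greater than the cut-off.

```python
probabilities = [0.45, 0.6, 0.7, 0.3]


def predict_labels(probs, cutoff):
    """Turn probabilities into 0/1 labels using a cut-off threshold."""
    return [1 if p > cutoff else 0 for p in probs]


for cutoff in (0.5, 0.2, 0.8):
    print(f"cut-off = {cutoff}: predicted labels = {predict_labels(probabilities, cutoff)}")
```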

As you can see, by varying the threshold value we get completely different labels. And as you can imagine, each of these scenarios would result in different precision and recall (as well as TPR and FPR) rates.
The ROC curve essentially finds the TPR and FPR for various threshold values and plots the TPR against the FPR. A sample ROC curve is shown in the following figure.

![ROC Curve](image7.png)

As we can see from this example, the lower the cut-off threshold on the positive class, the more samples are predicted as positive, i.e. a higher true positive rate (recall) and also a higher false positive rate (corresponding to the right side of this curve). Therefore, there is a trade-off between how high the recall can be and how much we want to bound the error (FPR).

The ROC curve is a popular way to look at overall model performance and to pick a good cut-off threshold for the model.
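Each point on a ROC curve is one (FPR, TPR) pair computed at one threshold. The plain-Python sketch below computes three such points for the four probabilities used earlier. The ground-truth labels `y_true` are an assumption made up for this illustration (the text does not give them), and `roc_point` is a hypothetical helper name.

```python
probabilities = [0.45, 0.6, 0.7, 0.3]
y_true = [0, 1, 1, 0]  # assumed labels, chosen only for illustration


def roc_point(labels, probs, cutoff):
    """Return (FPR, TPR) for one cut-off threshold."""
    tp = fp = fn = tn = 0
    for label, p in zip(labels, probs):
        predicted = 1 if p > cutoff else 0
        if label == 1 and predicted == 1:
            tp += 1
        elif label == 1 and predicted == 0:
            fn += 1
        elif label == 0 and predicted == 1:
            fp += 1
        else:
            tn += 1
    return fp / (fp + tn), tp / (tp + fn)


for cutoff in (0.2, 0.5, 0.8):
    fpr, tpr = roc_point(y_true, probabilities, cutoff)
    print(f"cut-off = {cutoff}: FPR = {fpr}, TPR = {tpr}")
```

With these assumed labels, the lowest cut-off lands in the top-right corner of the ROC plot (everything predicted positive) and the highest cut-off lands in the bottom-left corner (nothing predicted positive).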

***

### AUC

The area under the curve (AUC), is an aggregated measure of performance of a binary classifier on all possible threshold values (and therefore it is threshold invariant).

AUC calculates the area under the ROC curve, and therefore it is between 0 and 1. One way of interpreting AUC is as the probability that the model ranks a random positive example more highly than a random negative example.

![AUC](image8.png)
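The ranking interpretation of AUC can be checked by brute force: compare every positive example with every negative example and count how often the positive one gets the higher score. The sketch below does this in plain Python on a four-sample toy data set; `ranking_auc` is a hypothetical helper name, and counting ties as half a win is a common convention assumed here.

```python
def ranking_auc(y_true, y_score):
    """Fraction of (positive, negative) pairs in which the positive is ranked higher."""
    positives = [s for t, s in zip(y_true, y_score) if t == 1]
    negatives = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5  # ties count as half a win (assumed convention)
    return wins / (len(positives) * len(negatives))


auc = ranking_auc([0, 0, 1, 1], [0, 0.5, 0.3, 0.9])
print(auc)  # 3 of the 4 positive/negative pairs are ordered correctly
```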


At a high level, the higher the AUC of a model, the better it is. But sometimes a threshold-independent measure is not what you want; e.g. you may care about your model's recall and require it to be higher than 99% (while keeping a reasonable precision or FPR). In that case, you may want to tune your model threshold so that it meets your minimum requirement on those metrics (and you may not care even if your model's AUC is not too high).
Therefore, to decide how to evaluate your classification model, you need a good understanding of the business/problem requirements and the impact of low recall vs. low precision, and then decide which metric to optimize for.

From a practical standpoint, a classification model that outputs probabilities is preferred over one that outputs a single label, as it provides the flexibility of tuning the threshold to meet your minimum recall/precision requirements. Not all models provide such probabilistic outputs, though; e.g. an SVM does not provide a simple probability as an output (it provides a margin that can be used to tune the decision, but this is not as straightforward and interpretable as having output probabilities).

***

#### In Code

**Example 1:**

In [13]:
m = tf.keras.metrics.AUC(num_thresholds=3)
m.update_state([0, 0, 1, 1], [0, 0.5, 0.3, 0.9])
# threshold values are [0 - 1e-7, 0.5, 1 + 1e-7]  
# tp = [2, 1, 0], fp = [2, 0, 0], fn = [0, 1, 2], tn = [0, 2, 2]  
# tp_rate = recall = [1, 0.5, 0], fp_rate = [1, 0, 0]  
# auc = ((((1+0.5)/2)*(1-0)) + (((0.5+0)/2)*(0-0))) = 0.75  
m.result().numpy()

0.75
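The arithmetic in the comments of Example 1 can be replayed step by step without Keras. The following plain-Python sketch uses the same three thresholds and the trapezoidal rule; it is meant as a check of the hand calculation and is not the library's implementation.

```python
y_true = [0, 0, 1, 1]
y_pred = [0, 0.5, 0.3, 0.9]
thresholds = [0 - 1e-7, 0.5, 1 + 1e-7]

num_pos = y_true.count(1)
num_neg = y_true.count(0)

tpr, fpr = [], []
for th in thresholds:
    # A sample is predicted positive when its score is strictly above the threshold.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p > th)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p > th)
    tpr.append(tp / num_pos)
    fpr.append(fp / num_neg)

# Trapezoidal rule: width along the FPR axis times the average TPR height.
auc = sum(
    (fpr[i] - fpr[i + 1]) * (tpr[i] + tpr[i + 1]) / 2
    for i in range(len(thresholds) - 1)
)
print(tpr, fpr, auc)
```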

***
We can find different parameters as follows:
- `m.thresholds` will return the threshold values.
- `m.true_positives` will return the TP values.
- and so on
***

**Example 2:**

In [14]:
m = tf.keras.metrics.AUC(num_thresholds=3)
m.update_state([0, 0, 1, 1], [0, 0.5, 0.3, 0.9],
               sample_weight=[1, 0, 0, 1])
m.result().numpy()

1.0

**Example 3 (Usage with compile() API):**

In [15]:
# Reports the AUC of a model outputting a probability.
model.compile(optimizer='sgd',
              loss=tf.keras.losses.BinaryCrossentropy(),
              metrics=[tf.keras.metrics.AUC()])



# Reports the AUC of a model outputting a logit.
model = Sequential()
model.add(Dense(1, input_shape = (2, )))  # no activation, so the layer outputs a logit
model.compile(optimizer='sgd',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=[tf.keras.metrics.AUC(from_logits=True)])  # the metric must also be told it receives logits


***
***
## Regression Related Metrics

Regression models are another family of machine learning and statistical models, used to predict continuous target values. They have a wide range of applications, from house price prediction, e-commerce pricing systems, weather forecasting, and stock market prediction to image super-resolution, feature learning via auto-encoders, and image compression.

Models such as linear regression, random forests, XGBoost, convolutional neural networks, and recurrent neural networks are among the most popular regression models.

Metrics used to evaluate these models should be able to work on a set of continuous values (with infinite cardinality), and are therefore slightly different from classification metrics.
***

### MeanSquaredError class

“Mean squared error” is perhaps the most popular metric used for regression problems. It essentially finds the average squared error between the predicted and actual values, and in keras, we can use 

`tf.keras.metrics.MeanSquaredError(name="mean_squared_error", dtype=None)` 

to compute MSE.

Let’s assume we have a regression model that predicts the price of houses in the Seattle area (denoted by $\hat{y}_i$), and let’s say for each house we also have the actual price the house was sold for (denoted by $y_i$). Then the MSE can be calculated as:

\begin{equation}
MSE = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2
\end{equation}

Sometimes people use **RMSE** to have a metric on the same scale as the target values; it is simply the **square root of MSE**.
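As a small numerical check of the two formulas, the plain-Python sketch below computes MSE and RMSE for four hand-made target/prediction pairs. The helper names are illustrative and are not Keras functions.

```python
import math


def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def root_mean_squared_error(y_true, y_pred):
    """Square root of the MSE, expressed in the same units as the targets."""
    return math.sqrt(mean_squared_error(y_true, y_pred))


y_true = [0, 1, 0, 0]
y_pred = [1, 1, 0, 0]

mse = mean_squared_error(y_true, y_pred)        # one error of size 1 among 4 samples
rmse = root_mean_squared_error(y_true, y_pred)
print(mse, rmse)
```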

***

**Example 1**


In [16]:
m = tf.keras.metrics.MeanSquaredError()
m.update_state([[0, 1], [0, 0]], [[1, 1], [0, 0]])
m.result().numpy()

0.25

**Example 2**

In [17]:
m = tf.keras.metrics.MeanSquaredError()
m.update_state([[0, 1], [0, 0]], [[1, 1], [0, 0]],
               sample_weight=[1, 0])
m.result().numpy()

0.5

**Example 3 (Usage with compile() API):**

In [18]:
model = Sequential()
model.add(Dense(1, activation = 'linear', input_shape = (2,)))
model.compile(
    optimizer='sgd',
    loss='mse',
    metrics=[tf.keras.metrics.MeanSquaredError()])

***
Similarly, the root mean squared error (RMSE) can be calculated in Keras using 

`tf.keras.metrics.RootMeanSquaredError(name="root_mean_squared_error", dtype=None)`.

***

### MeanAbsoluteError class

Mean absolute error (or mean absolute deviation) is another metric, which finds the average absolute distance between the predicted and target values. MAE is defined as:

\begin{equation}
MAE = \frac{1}{N}\sum_{i=1}^{N}\left|y_i - \hat{y}_i\right|
\end{equation}

**MAE is known to be more robust to outliers than MSE.** The main reason is that MSE squares the errors, so outliers (which usually have larger errors than other samples) get more weight, dominate the final error, and have a larger impact on the model parameters.
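The effect of a single outlier on the two metrics can be seen with a toy example. The numbers below are made up for illustration only: every prediction is off by 1, and then one target is replaced by an outlier so that its error becomes 21.

```python
def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


y_pred = [11, 11, 12, 12]
y_clean = [10, 12, 11, 13]      # every prediction is off by exactly 1
y_outlier = [10, 12, 11, 33]    # last target replaced by an outlier

print("clean:   MAE =", mae(y_clean, y_pred), " MSE =", mse(y_clean, y_pred))
print("outlier: MAE =", mae(y_outlier, y_pred), " MSE =", mse(y_outlier, y_pred))
```

In this toy case the single outlier multiplies the MAE by 6 but the MSE by 111, which is the sense in which MAE is the more robust of the two.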

It is also worth mentioning that there is a nice maximum likelihood estimation (MLE) interpretation behind the MSE and MAE metrics. If we assume a linear dependence between features and targets, then minimizing MSE and MAE corresponds to the MLE of the model parameters under Gaussian and Laplace distributions for the model errors, respectively.

In Keras, we can use

`tf.keras.metrics.MeanAbsoluteError(name="mean_absolute_error", dtype=None)`

to compute the MAE.
***

#### In Code

**Example 1:**

In [19]:
m = tf.keras.metrics.MeanAbsoluteError()
m.update_state([[0, 1], [0, 0]], [[1, 1], [0, 0]])
m.result().numpy()

0.25

**Example 2:**

In [20]:
m = tf.keras.metrics.MeanAbsoluteError()
m.update_state([[0, 1], [0, 0]], [[1, 1], [0, 0]],
               sample_weight=[1, 0])
m.result().numpy()

0.5

**Example 3 (Usage with compile() API):**

In [None]:
model = Sequential()
model.add(Dense(1, activation = 'linear', input_shape = (2,)))
model.compile(
    optimizer='sgd',
    loss='mse',
    metrics=[tf.keras.metrics.MeanAbsoluteError()])

***
### [MeanSquaredLogarithmicError](https://peltarion.com/knowledge-center/documentation/modeling-view/build-an-ai-model/loss-functions/mean-squared-logarithmic-error-(msle)) class

The MSLE is defined as 

\begin{equation}
L(y,\hat{y}) = \frac{1}{N}\sum_{i=1}^{N}\left(\log(y_i+1)-\log(\hat{y}_i+1)\right)^2
\end{equation}

In Keras, we can use
 
`tf.keras.metrics.MeanSquaredLogarithmicError(name="mean_squared_logarithmic_error", dtype=None)`.

**Use MSLE when doing regression, believing that your target, conditioned on the input, is normally distributed, and you don’t want large errors to be significantly more penalized than small ones, in those cases where the range of the target value is large.**
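The reason MSLE suits targets with a large range is that it depends on the ratio between prediction and target rather than on their absolute difference. The plain-Python sketch below illustrates this with two made-up predictions that are both 50% too high; the helper names are illustrative only.

```python
import math


def squared_error(y, y_hat):
    """Squared error for a single sample."""
    return (y - y_hat) ** 2


def squared_log_error(y, y_hat):
    """Squared difference of log(1 + value), the per-sample term of MSLE."""
    return (math.log1p(y) - math.log1p(y_hat)) ** 2


small = (100, 150)        # small target, prediction 50% too high
large = (10000, 15000)    # large target, prediction 50% too high

se_small, se_large = squared_error(*small), squared_error(*large)
sle_small, sle_large = squared_log_error(*small), squared_log_error(*large)

print("squared error:    ", se_small, se_large)
print("squared log error:", round(sle_small, 4), round(sle_large, 4))
```

The squared error of the large target is 10,000 times that of the small one, while the two squared log errors are almost identical.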


## Accuracy Metrics


### Accuracy class

The accuracy can be computed using

`tf.keras.metrics.Accuracy(name="accuracy", dtype=None)` 

and it is defined as

\begin{equation}
Accuracy = \frac{Total \ Correct\ Predictions}{Total\ Number\ of\ Predictions}
\end{equation}

**Example 1:**

In [None]:
m = tf.keras.metrics.Accuracy()
m.update_state([[1], [2], [3], [4]], [[0], [2], [3], [4]])
m.result().numpy()

**Example 2 (Usage with compile() API):**

In [None]:
model.compile(optimizer='sgd',
              loss='mse',
              metrics=[tf.keras.metrics.Accuracy()])

### BinaryAccuracy class

The binary accuracy can be computed as follows:

`tf.keras.metrics.BinaryAccuracy(name="binary_accuracy", dtype=None, threshold=0.5)`

Calculates how often predictions match binary labels.

This metric creates two local variables, total and count that are used to compute the frequency with which y_pred matches y_true. This frequency is ultimately returned as binary accuracy: an idempotent operation that simply divides total by count.
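The `total` / `count` bookkeeping described above can be sketched in plain Python. This is an illustration of the idea, not the Keras source: it assumes a prediction counts as class 1 when it is strictly greater than the threshold, and the function name is hypothetical.

```python
def binary_accuracy(y_true, y_prob, threshold=0.5, sample_weight=None):
    """Weighted fraction of thresholded predictions that match the binary labels."""
    if sample_weight is None:
        sample_weight = [1.0] * len(y_true)
    total = 0.0   # weighted number of matches
    count = 0.0   # sum of the weights
    for label, prob, weight in zip(y_true, y_prob, sample_weight):
        predicted = 1 if prob > threshold else 0
        total += weight * (1.0 if predicted == label else 0.0)
        count += weight
    return total / count


labels = [1, 1, 0, 0]
probs = [0.98, 1.0, 0.0, 0.6]

unweighted = binary_accuracy(labels, probs)
weighted = binary_accuracy(labels, probs, sample_weight=[1, 0, 0, 1])
print(unweighted, weighted)
```

Only the last sample is misclassified (0.6 is above the threshold but the label is 0), so three of four predictions match; with the weights shown, only the first and last samples count, and one of those two matches.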

**Example 1:**

In [None]:
m = tf.keras.metrics.BinaryAccuracy()
m.update_state([[1], [1], [0], [0]], [[0.98], [1], [0], [0.6]])
m.result().numpy()

**Example 2:**

In [None]:
m = tf.keras.metrics.BinaryAccuracy()
m.update_state([[1], [1], [0], [0]], [[0.98], [1], [0], [0.6]],
               sample_weight=[1, 0, 0, 1])
m.result().numpy()

**Example 3 (Usage with compile() API):**

In [None]:
model.compile(optimizer='sgd',
              loss='mse',
              metrics=[tf.keras.metrics.BinaryAccuracy()])

***
Similarly, we can compute

1. **CategoricalAccuracy** using 
`tf.keras.metrics.CategoricalAccuracy(name="categorical_accuracy", dtype=None)`

2. **TopKCategoricalAccuracy** using
`tf.keras.metrics.TopKCategoricalAccuracy(k=5,name="top_k_categorical_accuracy", dtype=None)`

3. **SparseTopKCategoricalAccuracy** using
`tf.keras.metrics.SparseTopKCategoricalAccuracy(k=5, name="sparse_top_k_categorical_accuracy", dtype=None)`
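To illustrate what the top-K variants measure, here is a plain-Python sketch of top-K accuracy with integer ("sparse") labels: a prediction counts as correct when the true class is among the K highest-scoring classes. The function name and the toy scores are assumptions for illustration, and tie handling may differ from Keras.

```python
def sparse_top_k_accuracy(y_true, y_pred, k=5):
    """Fraction of samples whose true class index is among the k top-scoring classes."""
    hits = 0
    for label, scores in zip(y_true, y_pred):
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(y_true)


y_true = [2, 1]                      # integer class labels
y_pred = [[0.1, 0.9, 0.8],           # class 1 scores highest, class 2 second
          [0.05, 0.95, 0.0]]         # class 1 scores highest

top1 = sparse_top_k_accuracy(y_true, y_pred, k=1)
top2 = sparse_top_k_accuracy(y_true, y_pred, k=2)
print(top1, top2)
```

The non-sparse variants follow the same idea with one-hot labels, where the true class is the index of the 1 in each label vector.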


***

## Probabilistic metrics
### BinaryCrossentropy class

BinaryCrossentropy can be computed using

`tf.keras.metrics.BinaryCrossentropy(name="binary_crossentropy", dtype=None, from_logits=False, label_smoothing=0)`

which computes the crossentropy metric between the labels and predictions.

**Example 1:**

In [None]:
m = tf.keras.metrics.BinaryCrossentropy()
m.update_state([[0, 1], [0, 0]], [[0.6, 0.4], [0.4, 0.6]])
m.result().numpy()
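The value produced by this example can be cross-checked by hand. The plain-Python sketch below applies the binary cross-entropy formula, $-[y\log(p) + (1-y)\log(1-p)]$, to every element and averages the results. It omits the small clipping constant that Keras applies to the predictions, so it assumes all probabilities are strictly between 0 and 1.

```python
import math


def binary_crossentropy(y_true, y_pred):
    """Mean binary cross-entropy over all elements of two nested lists."""
    losses = []
    for true_row, pred_row in zip(y_true, y_pred):
        for y, p in zip(true_row, pred_row):
            losses.append(-(y * math.log(p) + (1 - y) * math.log(1 - p)))
    return sum(losses) / len(losses)


bce = binary_crossentropy([[0, 1], [0, 0]], [[0.6, 0.4], [0.4, 0.6]])
print(f"{bce:.4f}")
```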

**Example 2 (Usage with compile() API):**

In [None]:
model.compile(
    optimizer='sgd',
    loss='mse',
    metrics=[tf.keras.metrics.BinaryCrossentropy()])

*** 
Similarly, we can compute

1. **CategoricalCrossentropy** using

`tf.keras.metrics.CategoricalCrossentropy(name="categorical_crossentropy", dtype=None, from_logits=False, label_smoothing=0)`

2. **SparseCategoricalCrossentropy** using

`tf.keras.metrics.SparseCategoricalCrossentropy(name="sparse_categorical_crossentropy", dtype=None, from_logits=False, axis=-1)`

3. **KLDivergence** using

`tf.keras.metrics.KLDivergence(name="kullback_leibler_divergence", dtype=None)`

4. **Poisson** using
`tf.keras.metrics.Poisson(name="poisson", dtype=None)`

***


**There are other metrics as well, you can learn more about them by visiting the references given at the end of this tutorial.**

# References

1. [Metrics](https://keras.io/api/metrics/)
2. [20 Popular Machine Learning Metrics.](https://towardsdatascience.com/20-popular-machine-learning-metrics-part-1-classification-regression-evaluation-metrics-1ca3e282a2ce)
3. [Mean Squared Logarithmic Error](https://peltarion.com/knowledge-center/documentation/modeling-view/build-an-ai-model/loss-functions/mean-squared-logarithmic-error-(msle))
4. [Accuracy Metrics](https://towardsdatascience.com/keras-accuracy-metrics-8572eb479ec7)