# **🎯 Evaluation Metrics for Regression and Classification**

BY: **Muhammad Hassan Saboor**

# **👋 Introduction**

Evaluation metrics are essential tools for assessing the performance of machine learning models. They help us understand how well a model predicts outcomes, quantify its errors, and identify its strengths and weaknesses. Different types of metrics are suited for different tasks, such as `classification` and `regression`. Choosing the right metric ensures that the model aligns with the problem's requirements and objectives.

This notebook provides a comprehensive guide to the most commonly used evaluation metrics in machine learning. For each metric, you'll find:

- A simple, one-line definition.
- The mathematical formula.
- A practical example for better understanding.
  
By the end of this notebook, you'll have a clear understanding of how to evaluate models effectively, interpret results, and select the appropriate metric for your specific machine learning task.

> ## First, what are **True Positive**, **True Negative**, **False Positive**, and **False Negative**?

> These are the building blocks of evaluation metrics for binary classification. They arise from comparing predicted labels with actual (ground truth) labels.

---

**1. True Positive (TP):**

**Definition** Those cases where the model correctly predicts the positive class.

**Example** The model predicts "Disease" (positive), and the person actually has the disease.

**2. True Negative (TN):**

**Definition** Those cases where the model correctly predicts the negative class.

**Example** The model predicts "No Disease" (negative), and the person actually does not have the disease.

**3. False Positive (FP):**

**Definition** Those cases where the model predicts the positive class incorrectly.

**Example** The model predicts "Disease" (positive), but the person does not have the disease (false alarm).

**4. False Negative (FN):**

**Definition** Those cases where the model predicts the negative class incorrectly.

**Example** The model predicts "No Disease" (negative), but the person actually has the disease (missed detection).

### **Confusion Matrix**

| Actual \ Predicted | Positive (1) | Negative (0) |
|--------------------|--------------|--------------|
| **Positive (1)**   | True Positive (TP) | False Negative (FN) |
| **Negative (0)**   | False Positive (FP) | True Negative (TN) |
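
To make these four outcomes concrete, here is a minimal sketch that tallies them from two label lists. The function name `confusion_counts` and the sample labels are made up for illustration, and the sketch assumes labels are encoded as 1 (positive) and 0 (negative).

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = positive, 0 = negative)."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1  # correctly predicted positive
        elif actual == 0 and predicted == 0:
            tn += 1  # correctly predicted negative
        elif actual == 0 and predicted == 1:
            fp += 1  # false alarm
        else:
            fn += 1  # missed detection (actual 1, predicted 0)
    return tp, tn, fp, fn


# Hypothetical labels, invented for this illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]
tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(tp, tn, fp, fn)
```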

---

# **Classification Metrics**
---

## 1. **Accuracy**

**Definition** 

Accuracy is the percentage of correct predictions made by the model.

**Formula** 

$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$

**Example**

If a model predicts 80 out of 100 test samples correctly (e.g., 50 True Positives + 30 True Negatives), then:

$Accuracy = \frac{50 + 30}{100} = 0.8$ (80%)
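
The same calculation as a short Python sketch. The helper name `accuracy` is illustrative, and splitting the 20 errors into 10 False Positives and 10 False Negatives is an assumption; the example above only fixes TP and TN.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("accuracy is undefined with no samples")
    return (tp + tn) / total


# fp=10 and fn=10 are assumed values; only their sum (20) follows from the example.
acc = accuracy(tp=50, tn=30, fp=10, fn=10)
print(f"{acc:.2f}")
```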

## 2. **Precision**

**Definition** 

Precision is the percentage of correctly predicted positive cases out of all cases predicted as positive.

**Formula** 

$Precision = \frac{TP}{TP + FP}$

**Example**

If a model predicts 30 positive cases, out of which 25 are correct (True Positives) and 5 are incorrect (False Positives), then:

$Precision = \frac{25}{25 + 5} \approx 0.833$ (83.3%)
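
As a quick sketch, the same arithmetic in Python. `precision` is a hypothetical helper, and returning 0.0 when the model predicts no positives at all is only one possible convention, since the ratio is undefined in that case.

```python
def precision(tp, fp):
    """Share of predicted positives that are truly positive."""
    predicted_positive = tp + fp
    if predicted_positive == 0:
        return 0.0  # assumed convention: the ratio is undefined here
    return tp / predicted_positive


prec = precision(tp=25, fp=5)
print(round(prec, 3))
```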

## 3. **Recall**

**Definition** 

Recall is the percentage of correctly predicted positive cases out of all actual positive cases.

**Formula** 

$Recall = \frac{TP}{TP + FN}$

**Example**

If there are 40 actual positive cases in the data, and the model correctly identifies 30 of them (True Positives) but misses 10 (False Negatives), then:

$Recall = \frac{30}{30 + 10} = 0.75$ (75%)
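
A matching sketch for recall, using the numbers from the example. The helper name `recall` and the choice to return 0.0 when there are no actual positives are assumptions for illustration.

```python
def recall(tp, fn):
    """Share of actual positives that the model finds."""
    actual_positive = tp + fn
    if actual_positive == 0:
        return 0.0  # assumed convention: the ratio is undefined here
    return tp / actual_positive


rec = recall(tp=30, fn=10)
print(rec)
```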

## 4. **F1-score**

**Definition** 

F1-Score is the harmonic mean of Precision and Recall, balancing their trade-off.

**Formula** 

$F1score = 2 \cdot \frac{Precision \times Recall}{Precision + Recall}$

**Example**

If a model has a Precision of 80% (0.8) and Recall of 70% (0.7), then:

$F1score = 2 \cdot \frac{0.8 \times 0.7}{0.8 + 0.7} \approx 0.7467$ (74.67%)
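
The sketch below reproduces this example and also compares the result with the plain average of the two values, which shows how the harmonic mean leans toward the lower one. The helper name `f1_score` is illustrative, and the zero-handling is an assumed convention.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # assumed convention when both are zero
    return 2 * precision * recall / (precision + recall)


f1 = f1_score(0.8, 0.7)
arithmetic_mean = (0.8 + 0.7) / 2
print(round(f1, 4), arithmetic_mean)
```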

## 5. **ROC (Receiver Operating Characteristic)**

**Definition** 

ROC is a curve that shows the trade-off between True Positive Rate (Recall) and False Positive Rate at different classification thresholds.

**Formula (True Positive Rate (TPR) or Recall)** 

$TPR = \frac{TP}{TP + FN}$

**Formula (False Positive Rate (FPR))** 

$FPR = \frac{FP}{FP + TN}$

**Example**

For different thresholds, we calculate TPR and FPR and plot them on the ROC curve. A model with a better performance has an ROC curve closer to the top-left corner (high TPR, low FPR).
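
A minimal sketch of how the points on a ROC curve could be computed. The function `roc_points`, the scores, and the thresholds are all invented for illustration; it assumes a sample is predicted positive when its score is at or above the threshold, and that both classes appear in `y_true`.

```python
def roc_points(y_true, scores, thresholds):
    """Return one (FPR, TPR) pair per threshold."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = []
    for t in thresholds:
        # Predict positive when the score is at or above the threshold.
        tp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


# Hypothetical labels and model scores.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
points = roc_points(y_true, scores, thresholds=[0.9, 0.5, 0.3, 0.0])
for fpr, tpr in points:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Lowering the threshold moves the point from the bottom-left corner (nothing predicted positive) toward the top-right corner (everything predicted positive).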

## 6. **AUC (Area Under the Curve)**

**Definition** 

AUC is the area under the ROC curve. It equals the probability that the model assigns a higher score to a randomly chosen positive instance than to a randomly chosen negative instance.

**Formula**

- The AUC value is the area under the ROC curve, typically computed using numerical integration methods.
- **AUC ranges from 0 to 1, where:**
  
    - 0.5 indicates no discriminative power (random guessing).
    - 1 indicates perfect classification.


**Example**

If a model's ROC curve shows an AUC of 0.85, it means that there is an 85% chance that the model will score a randomly chosen positive instance higher than a randomly chosen negative instance.
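
The ranking interpretation can be sketched directly: compare every positive with every negative and count how often the positive gets the higher score. This is an illustrative brute-force version (the name `auc_by_ranking` and the data are made up), not how libraries typically compute it; ties are counted as half a win, which is an assumed convention.

```python
def auc_by_ranking(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))


# Hypothetical labels and model scores.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = auc_by_ranking(y_true, scores)
print(auc)
```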

## 7. **Log Loss (Logarithmic Loss or Cross-Entropy Loss)**

**Definition** 

Log Loss measures the accuracy of a classifier by comparing the predicted probabilities to the actual class labels. It penalizes wrong predictions, especially when the model is confident but incorrect.

**Formula** 

$$
\text{Log Loss} = - \frac{1}{N} \sum_{i=1}^{N} \left( y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right)
$$


**Where**
  
  - $N$ is the number of instances.
  - $y_i$ is the true label (0 or 1).
  - $p_i$ is the predicted probability of the positive class.

**Example**

- If the model predicts a probability of 0.9 for a positive class and the actual label is 1, the Log Loss will be:
  
$$
\text{Log Loss} = -\log(0.9) \approx 0.105
$$


- If the model instead predicts a probability of 0.1 for that same instance (true label still 1), the Log Loss is much higher:


$$
\text{Log Loss} = -\log(0.1) \approx 2.303
$$
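
The two cases above can be checked with a small sketch. The function name `log_loss` and the clipping constant `eps` are assumptions for illustration; the natural logarithm is used, which matches the values in the example.

```python
import math


def log_loss(y_true, probs, eps=1e-15):
    """Average binary cross-entropy using the natural logarithm."""
    total = 0.0
    for y, p in zip(y_true, probs):
        p = min(max(p, eps), 1 - eps)  # clip so log(0) never occurs
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)


confident_right = log_loss([1], [0.9])
confident_wrong = log_loss([1], [0.1])
print(round(confident_right, 3), round(confident_wrong, 3))
```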

 

## 8. **Specificity (True Negative Rate)**

**Definition** 

Specificity is the percentage of correctly predicted negative cases out of all actual negative cases. It measures the model's ability to identify negative class instances.

**Formula** 

$Specificity = \frac{TN}{TN + FP}$

**Example**

If there are 50 actual negative cases, and the model correctly identifies 40 of them as negative (True Negatives) but misclassifies 10 as positive (False Positives), then:

$Specificity = \frac{40}{40 + 10} = 0.8$ (80%)
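
A short sketch of the same calculation, which also shows that the False Positive Rate used for the ROC curve is simply one minus specificity. The helper name `specificity` is illustrative.

```python
def specificity(tn, fp):
    """Share of actual negatives that are correctly identified."""
    actual_negative = tn + fp
    if actual_negative == 0:
        raise ValueError("specificity is undefined with no actual negatives")
    return tn / actual_negative


spec = specificity(tn=40, fp=10)
fpr = 1 - spec  # False Positive Rate = FP / (FP + TN)
print(spec, round(fpr, 2))
```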

## 9. **MCC (Matthews Correlation Coefficient)**

**Definition** 

MCC is a metric that measures the quality of binary classifications by considering all four confusion matrix elements (TP, TN, FP, FN). It provides a balanced measure even in imbalanced datasets.

**Formula** 

$$
MCC = \frac{TP * TN - FP * FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}
$$

**Example**

- If the confusion matrix values are:
  - TP = 50, TN = 30, FP = 10, FN = 10
- The MCC would be:
  
$$
MCC = \frac{50 \times 30 - 10 \times 10}{\sqrt{(50 + 10)(50 + 10)(30 + 10)(30 + 10)}} = \frac{1400}{2400} \approx 0.5833
$$
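
The example can be reproduced with a few lines of Python. The helper name `mcc` is illustrative, and returning 0.0 when the denominator is zero is an assumed convention.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from the four confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # assumed convention when a row or column total is zero
    return (tp * tn - fp * fn) / denom


value = mcc(tp=50, tn=30, fp=10, fn=10)
print(round(value, 4))
```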



## 10. **Confusion Matrix**

**Definition**

A confusion matrix is a table used to evaluate the performance of a classification model by comparing its predicted labels with the actual labels. It shows how many instances were correctly or incorrectly classified into each category.

**Formula**

A confusion matrix for binary classification contains four key elements:

| Actual \ Predicted | Positive (1) | Negative (0) |
|--------------------|--------------|--------------|
| Positive (1)       | True Positive (TP)  | False Negative (FN) |
| Negative (0)       | False Positive (FP) | True Negative (TN) |

- **True Positives (TP):** Correctly predicted positive cases.
- **True Negatives (TN):** Correctly predicted negative cases.
- **False Positives (FP):** Incorrectly predicted positive cases (Type I error).
- **False Negatives (FN):** Incorrectly predicted negative cases (Type II error).

**Example**

For a binary classifier with TP = 50, TN = 30, FP = 10, FN = 10, the confusion matrix would look like this:

| Actual \ Predicted | Positive (1) | Negative (0) |
|--------------------|--------------|--------------|
| Positive (1)       | 50           | 10           |
| Negative (0)       | 10           | 30           |


# **Regression Metrics**

## 1. **MAE (Mean Absolute Error)**

**Definition**

MAE is the average of the absolute differences between predicted values and actual values. It gives an idea of how far off the predictions are, on average.

**Formula**

$$
MAE = \frac{1}{N} \sum_{i=1}^{N} \left| y_i - \hat{y}_i \right|
$$

**Where**

- $N$ is the number of data points.
- $y_i$ is the actual value.
- $\hat{y}_i$ is the predicted value.

**Example**

If you have a set of predictions [3,5,7] and actual values [4,5,8], then:

$$
MAE = \frac{1}{3} \left( |3 - 4| + |5 - 5| + |7 - 8| \right) = \frac{1}{3} \left( 1 + 0 + 1 \right) \approx 0.67
$$
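
A minimal sketch of this calculation, using the same predictions and actual values. The function name `mean_absolute_error` is chosen here for illustration.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(y_true, y_pred)) / len(y_true)


mae = mean_absolute_error(y_true=[4, 5, 8], y_pred=[3, 5, 7])
print(round(mae, 2))
```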




## 2. **MSE (Mean Squared Error)**

**Definition**

MSE is the average of the squared differences between predicted values and actual values. It gives a higher penalty to larger errors due to the squaring of differences.

**Formula**

$$
MSE = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2
$$

**Where**

- $N$ is the number of data points.
- $y_i$ is the actual value.
- $\hat{y}_i$ is the predicted value.
  
**Example**

If you have a set of predictions [3,5,7] and actual values [4,5,8], then:

$$
MSE = \frac{1}{3} \left( (3 - 4)^2 + (5 - 5)^2 + (7 - 8)^2 \right) = \frac{1}{3} \left( 1 + 0 + 1 \right) \approx 0.67
$$
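
The sketch below repeats this example and then adds a second, made-up prediction set with one large miss, to show how squaring inflates the penalty for big errors. The function name and the second data set are assumptions for illustration.

```python
def mean_squared_error(y_true, y_pred):
    """Average squared difference between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(y_true, y_pred)) / len(y_true)


actual = [4, 5, 8]
mse = mean_squared_error(actual, [3, 5, 7])          # errors: 1, 0, 1
mse_outlier = mean_squared_error(actual, [3, 5, 4])  # errors: 1, 0, 4 (hypothetical)
print(round(mse, 2), round(mse_outlier, 2))
```

In the second case the single error of 4 contributes 16 to the sum, so it dominates the result.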


## 3. **RMSE (Root Mean Squared Error)**

**Definition**

RMSE is the square root of the average of the squared differences between predicted values and actual values. It provides a measure of error in the same units as the data, making it easier to interpret than MSE.

**Formula**

$$
RMSE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2}
$$

**Where**

- $N$ is the number of data points.
- $y_i$ is the actual value.
- $\hat{y}_i$ is the predicted value.

**Example**

If you have a set of predictions [3,5,7] and actual values [4,5,8], then:

$$
RMSE = \sqrt{\frac{1}{3} \left( (3 - 4)^2 + (5 - 5)^2 + (7 - 8)^2 \right)} = \sqrt{\frac{1}{3} \left( 1 + 0 + 1 \right)} = \sqrt{\frac{2}{3}} \approx 0.82
$$
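
As a sketch, RMSE is just the square root of the MSE computed on the same data. The function name is illustrative.

```python
import math


def root_mean_squared_error(y_true, y_pred):
    """Square root of the mean squared error, in the units of the target."""
    mse = sum((a - p) ** 2 for a, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)


rmse = root_mean_squared_error(y_true=[4, 5, 8], y_pred=[3, 5, 7])
print(round(rmse, 2))
```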


## 4. **MAPE (Mean Absolute Percentage Error)**

**Definition**

MAPE is the average of the absolute percentage differences between predicted values and actual values. It expresses the error as a percentage, making it easy to understand in relative terms.

**Formula**

$$
MAPE = \frac{1}{N} \sum_{i=1}^{N} \left| \frac{y_i - \hat{y}_i}{y_i} \right| \times 100
$$

**Where**

- $N$ is the number of data points.
- $y_i$ is the actual value.
- $\hat{y}_i$ is the predicted value.

**Example**

If you have a set of predictions [3,5,7] and actual values [4,5,8], then:

$$
MAPE = \frac{1}{3} \left( \frac{|3 - 4|}{4} + \frac{|5 - 5|}{5} + \frac{|7 - 8|}{8} \right) \times 100 = \frac{1}{3} \left( 0.25 + 0 + 0.125 \right) \times 100 = 12.5\%
$$
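
A short sketch of the same calculation. Because each error is divided by the actual value, MAPE is undefined when an actual value is zero; raising an error in that case is a choice made for this illustration, and the function name is hypothetical.

```python
def mean_absolute_percentage_error(y_true, y_pred):
    """Average absolute error relative to the actual value, in percent."""
    if any(a == 0 for a in y_true):
        raise ValueError("MAPE is undefined when an actual value is zero")
    ratios = (abs((a - p) / a) for a, p in zip(y_true, y_pred))
    return sum(ratios) / len(y_true) * 100


mape = mean_absolute_percentage_error(y_true=[4, 5, 8], y_pred=[3, 5, 7])
print(f"{mape:.1f}%")
```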


## 5. **R-squared (Coefficient of Determination)**

**Definition**

R-squared measures the proportion of the variance in the dependent variable that is predictable from the independent variables. It indicates how well the model's predictions match the actual data.

**Formula**

$$
R^2 = 1 - \frac{\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{N} (y_i - \bar{y})^2}
$$

**Where**

- $N$ is the number of data points.
- $y_i$ is the actual value.
- $\hat{y}_i$ is the predicted value.
- $\bar{y}$ is the mean of the actual values.

**Example**

If the total sum of squares of the actual values around their mean is 100, and the residual sum of squares is 25, then:

$$
R^2 = 1 - \frac{25}{100} = 0.75
$$

This means 75% of the variance in the data is explained by the model.
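
A minimal sketch of R-squared computed as one minus the residual sum of squares divided by the total sum of squares. The function name `r_squared` is illustrative, and the data reuses the predictions [3,5,7] and actual values [4,5,8] from the earlier examples rather than the sums quoted above.

```python
def r_squared(y_true, y_pred):
    """1 - (residual sum of squares) / (total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((a - p) ** 2 for a, p in zip(y_true, y_pred))
    ss_tot = sum((a - mean_y) ** 2 for a in y_true)
    if ss_tot == 0:
        raise ValueError("R-squared is undefined when all actual values are equal")
    return 1 - ss_res / ss_tot


r2 = r_squared(y_true=[4, 5, 8], y_pred=[3, 5, 7])
r2_from_sums = 1 - 25 / 100  # residual SS = 25, total SS = 100
print(round(r2, 3), r2_from_sums)
```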

## 6. **Adjusted R-squared**

**Definition**

Adjusted R-squared is a modified version of R-squared that adjusts for the number of predictors in the model. It penalizes the addition of irrelevant predictors and is more useful when comparing models with different numbers of features.

**Formula**

$$
\text{Adjusted } R^2 = 1 - \left( \frac{(1 - R^2) \times (N - 1)}{(N - p -1)} \right) 
$$

**Where**

- $R^2$ is the R-squared value.
- $N$ is the number of data points.
- $p$ is the number of predictors (features) in the model.

**Example**

If $R^2$ = 0.80, $N$ = 100, and $p$ = 5, then:

$$
\text{Adjusted } R^2 = 1 - \left( \frac{(1 - 0.80) \times (100 - 1)}{(100 - 5 - 1)} \right)  = 1 - \left( \frac{0.20 \times 99}{94} \right) \approx 0.789
$$
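
The arithmetic can be verified with a small sketch. The function name `adjusted_r_squared` is illustrative, and the guard on `n - p - 1` is an added assumption to avoid dividing by zero or a negative number.

```python
def adjusted_r_squared(r2, n, p):
    """Adjust R-squared for p predictors fitted on n data points."""
    if n - p - 1 <= 0:
        raise ValueError("need more data points than predictors plus one")
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


adj = adjusted_r_squared(r2=0.80, n=100, p=5)
print(round(adj, 3))
```

With the same $R^2$, adding more predictors lowers the adjusted value, which is the penalty described in the definition.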



## 7. **Huber Loss**

**Definition**

Huber Loss is a combination of Mean Squared Error (MSE) and Mean Absolute Error (MAE). It is quadratic for small errors and linear for large ones, so it is less sensitive to outliers than MSE while staying smooth around zero error, unlike MAE. It is commonly used in regression tasks where the data contains outliers.

**Formula**

$$
L_{\delta}(y, \hat{y}) =
\begin{cases}
\frac{1}{2} (y - \hat{y})^2 & \text{for } |y - \hat{y}| \leq \delta \\
\delta |y - \hat{y}| - \frac{1}{2} \delta^2 & \text{for } |y - \hat{y}| > \delta
\end{cases}
$$

**Where**

- $y$ is the actual value.
- $\hat{y}$ is the predicted value.
- $\delta$ is a hyperparameter that determines the threshold for switching between squared loss and absolute loss.

**Example**

For $\delta = 1$, actual value $y = 5$, and predicted value $\hat{y} = 7$:

Since $|5 - 7| = 2 > 1$, the loss would be calculated as:

$$
L_{1}(5, 7) = 1 \times |5 - 7| - \frac{1}{2} \times 1^2 = 2 - 0.5 = 1.5
$$
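
A minimal sketch of the piecewise Huber Loss for a single prediction. The function name `huber_loss` and the second call (an error smaller than the threshold) are additions for illustration, so that both branches are exercised.

```python
def huber_loss(y, y_hat, delta=1.0):
    """Huber loss for one prediction: quadratic near zero, linear beyond delta."""
    error = abs(y - y_hat)
    if error <= delta:
        return 0.5 * error ** 2               # squared-loss branch
    return delta * error - 0.5 * delta ** 2   # absolute-loss branch


large_error = huber_loss(5, 7, delta=1.0)    # |5 - 7| = 2 > 1
small_error = huber_loss(5, 5.5, delta=1.0)  # |5 - 5.5| = 0.5 <= 1 (hypothetical case)
print(large_error, small_error)
```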




## 8. **MBD (Mean Bias Deviation)**

**Definition**

MBD is a metric that measures the average signed difference between the predicted values and actual values. It indicates whether the model tends to overestimate or underestimate the actual values. Positive MBD values suggest overestimation, while negative values suggest underestimation.

**Formula**

$$
\text{MBD} = \frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)
$$

**Where**

- $N$ is the number of data points.
- $\hat{y}_i$ is the predicted value.
- ${y}_i$ is the actual value.

**Example**

If you have a set of predictions [3,5,7] and actual values [4,5,8], then:

$$
\text{MBD} = \frac{1}{3} \left( (3 - 4) + (5 - 5) + (7 - 8) \right) = \frac{1}{3} (-1 + 0 - 1) \approx -0.67
$$
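
A short sketch of the same calculation. Unlike MAE, the differences keep their sign, so errors in opposite directions can cancel out. The function name `mean_bias_deviation` is illustrative.

```python
def mean_bias_deviation(y_true, y_pred):
    """Average signed difference (predicted minus actual)."""
    return sum(p - a for a, p in zip(y_true, y_pred)) / len(y_true)


mbd = mean_bias_deviation(y_true=[4, 5, 8], y_pred=[3, 5, 7])
print(round(mbd, 2))
```

The negative result indicates that these predictions sit below the actual values on average.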


# **Thank You!**

Thank you for exploring this notebook on evaluation metrics for machine learning! 🌟 Your interest in understanding these concepts is a step forward in building better, more reliable models. I hope this guide has provided clarity and helped you grasp the importance and usage of various metrics for both `regression` and `classification` tasks.

Feel free to reach out with feedback, suggestions, or any questions.

💬 Keep experimenting, keep learning, and keep building amazing projects! 🚀

Happy coding! 🖥️✨