# Model Evaluation

One reason we evaluate machine learning models is to get feedback from metrics, make improvements, and keep iterating until the model reaches a desirable accuracy. Evaluation metrics explain the performance of a model. An important property of an evaluation metric is its capacity to discriminate among model results. Model evaluation is important for selecting a model that gives high accuracy on out-of-sample data. It is crucial to check the accuracy of the model before computing predicted values.

## Types of Predictive Models

A predictive model is either a regression model or a classification model, and the evaluation metrics used for each are different.

In classification problems, two types of algorithms are used, distinguished by the kind of output they produce.

1. Class output: Algorithms such as SVM and KNN create a class output. For instance, in a binary classification problem, the outputs will be either 0 or 1. There are now algorithms that can convert these class outputs to probabilities, but they are not well accepted by the statistics community.

2. Probability output: Algorithms such as Logistic Regression, Random Forest, Gradient Boosting and AdaBoost give probability outputs. Converting a probability output into a class output is just a matter of choosing a threshold probability.
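As a minimal sketch of the thresholding step, the snippet below converts probability outputs into class outputs. The probability values and the 0.5 cut-off are illustrative assumptions, not values from any particular model.

```python
# Hypothetical predicted probabilities for the positive class (illustrative values)
probabilities = [0.10, 0.45, 0.50, 0.72, 0.93]
threshold = 0.5  # assumed cut-off; in practice it is tuned for the problem at hand

# Label an observation 1 when its probability reaches the threshold, else 0
class_outputs = [1 if p >= threshold else 0 for p in probabilities]
print(class_outputs)  # [0, 0, 1, 1, 1]
```

Raising the threshold makes the model more conservative about predicting the positive class, and lowering it does the opposite.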

In regression problems, there are no such inconsistencies in the output. The output is always continuous in nature and requires no further treatment.

## Confusion Matrix

A confusion matrix is an N x N matrix, where N is the number of classes being predicted. For example, with N = 2 it is a 2 x 2 matrix. A few definitions are worth remembering for a confusion matrix:

1. Accuracy: The proportion of the total number of predictions that were correct.
2. Positive Predictive Value or Precision: The proportion of predicted positive cases that are actually positive.
3. Negative Predictive Value: The proportion of predicted negative cases that are actually negative.
4. Sensitivity or Recall: The proportion of actual positive cases which are correctly identified.
5. Specificity: The proportion of actual negative cases which are correctly identified.
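The five definitions above can be sketched directly from the four cells of a 2 x 2 confusion matrix. The counts below (`tp`, `fp`, `fn`, `tn`) are made-up values chosen only for illustration.

```python
# Hypothetical counts from a binary confusion matrix (illustrative values)
tp, fp = 40, 10   # predicted positive: correct / incorrect
fn, tn = 5, 45    # predicted negative: incorrect / correct

total = tp + fp + fn + tn

accuracy = (tp + tn) / total         # all correct predictions over all predictions
precision = tp / (tp + fp)           # positive predictive value
negative_predictive_value = tn / (tn + fn)
recall = tp / (tp + fn)              # sensitivity
specificity = tn / (tn + fp)

print(accuracy, precision, negative_predictive_value, recall, specificity)
```

Note that precision and negative predictive value divide by the number of predictions of a class, while sensitivity and specificity divide by the number of actual members of a class.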

![confusion-matrix.png](attachment:confusion-matrix.png)

Reference: https://glassboxmedicine.com/2019/02/17/measuring-performance-the-confusion-matrix/

### Calculating Sensitivity and Specificity

When more than two prediction classes exist, these metrics must be computed for each class. Sensitivity is calculated as follows:

![Sensitivity%20and%20specificity.png](attachment:Sensitivity%20and%20specificity.png)

Specificity measures the ability of a model to predict that an observation does not belong to a particular category. It requires knowing how the model performs on the observations that truly belong to every category other than the one under consideration.

![sensitivity%20and%20specificity%20for%203%20rows.png](attachment:sensitivity%20and%20specificity%20for%203%20rows.png)

Reference: https://towardsdatascience.com/evaluating-categorical-models-ii-sensitivity-and-specificity-e181e573cff8
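A rough sketch of the per-class calculation for three classes follows. The matrix values are invented for illustration, and the layout (rows are true classes, columns are predicted classes) is an assumption of this example.

```python
# Hypothetical 3-class confusion matrix: rows = true class, columns = predicted class
cm = [
    [50, 3, 2],
    [4, 40, 6],
    [1, 5, 44],
]

n = len(cm)
total = sum(sum(row) for row in cm)

sensitivity, specificity = [], []
for k in range(n):
    tp = cm[k][k]                                # class k predicted as class k
    fn = sum(cm[k]) - tp                         # class k predicted as something else
    fp = sum(cm[i][k] for i in range(n)) - tp    # other classes predicted as class k
    tn = total - tp - fn - fp                    # everything not involving class k
    sensitivity.append(tp / (tp + fn))
    specificity.append(tn / (tn + fp))

for k in range(n):
    print(f"class {k}: sensitivity={sensitivity[k]:.3f}, specificity={specificity[k]:.3f}")
```

Each class is treated in turn as the "positive" class, with all remaining classes pooled together as "negative".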

### Accuracy

Accuracy is the most frequently used metric for evaluating a model, but it is not a clear indicator of performance. It is especially misleading when the classes are imbalanced.

![Accuracy.png](attachment:Accuracy.png)
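To illustrate why accuracy can mislead on imbalanced classes, here is a small sketch with made-up labels: 95 negatives, 5 positives, and a hypothetical "model" that always predicts the negative class.

```python
# Illustrative imbalanced dataset: 95 negatives (0) and 5 positives (1)
y_true = [0] * 95 + [1] * 5
# A trivial predictor that ignores its input and always outputs the majority class
y_pred = [0] * 100

correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
accuracy = correct / len(y_true)

# Recall on the positive class: how many of the actual positives were found
true_positives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
recall = true_positives / sum(y_true)

print(accuracy)  # 0.95
print(recall)    # 0.0
```

The accuracy looks high even though the predictor never identifies a single positive case.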

### Precision

The confusion matrix can also be used to calculate precision. Precision, along with the true positive rate (also known as "recall"), is needed to calculate the area under the precision-recall curve (AUPRC), another performance metric.

![precision.png](attachment:precision.png)

Reference: https://glassboxmedicine.com/2019/02/17/measuring-performance-the-confusion-matrix/

### Recall

This is the proportion of observations that truly belong to the positive class that the model predicts as positive. It tells us the model's ability to correctly identify an observation that belongs to the positive class. The formula for recall is as follows:

![Recall.png](attachment:Recall.png)
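The difference between precision and recall is easy to mix up, so here is a small sketch that computes both from the same predictions. The labels are illustrative only.

```python
# Illustrative labels: 4 actual positives, 6 actual negatives
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # positives found
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # positives missed

precision = tp / (tp + fp)  # of everything predicted positive, how much is right
recall = tp / (tp + fn)     # of everything actually positive, how much was found

print(precision, recall)
```

Precision is divided by the number of positive predictions, whereas recall is divided by the number of actual positives.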

## F1 Score

When a balance between precision and recall is needed, the F1-Score can be used. It is the harmonic mean of the precision and recall values for a classification problem. The formula for F1-Score is as follows:

![F1-Score.png](attachment:F1-Score.png)

Why take a harmonic mean and not an arithmetic mean? The reason is that the harmonic mean punishes extreme values. Take, as an extreme illustrative case, a binary classification model with Precision = 0 and Recall = 1. With the arithmetic mean, the score would be 0.5, even though such a model is useless: it effectively ignores the input and always predicts one of the classes. With the harmonic mean, the score is 0, which reflects the model's real quality.
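The comparison between the two means can be checked with a few lines of code. The helper name `f1_from` is hypothetical, and returning 0 when both inputs are 0 is a convention assumed here to avoid dividing by zero.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0, by convention)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# The extreme case from the text
precision, recall = 0.0, 1.0
arithmetic_mean = (precision + recall) / 2
f1 = f1_from(precision, recall)

print(arithmetic_mean)  # 0.5
print(f1)               # 0.0

# When precision and recall agree, both means give the same value
balanced_f1 = f1_from(0.5, 0.5)
print(balanced_f1)      # 0.5
```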

## Evaluating Multiclass classifier predictions

With accuracy ruled out as an evaluation metric, precision, recall and F1 scores are the usual choices. Python libraries offer parameter options for aggregating the per-class evaluation values by averaging them. The three main options are:

1. _macro: The mean of the metric scores for each class in the dataset is calculated, weighting each class equally.
2. _weighted: The mean of the metric scores for each class is calculated, weighting each class in direct proportion to its size in the dataset.
3. _micro: The metric is calculated globally over every observation in the dataset, by counting the total true positives, false negatives and false positives across all classes.
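As a sketch of how the three options differ, the example below uses the `average` parameter of scikit-learn's `f1_score` on a small, made-up three-class problem. The labels are illustrative, and the choice of F1 (rather than precision or recall) is arbitrary.

```python
from sklearn.metrics import f1_score

# Illustrative imbalanced 3-class labels: six of class 0, three of class 1, one of class 2
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 2]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2]

macro = f1_score(y_true, y_pred, average="macro")        # every class counts equally
weighted = f1_score(y_true, y_pred, average="weighted")  # classes weighted by their size
micro = f1_score(y_true, y_pred, average="micro")        # pooled over all observations

print(f"macro={macro:.3f}  weighted={weighted:.3f}  micro={micro:.3f}")
```

For a single-label multiclass problem like this one, the micro-averaged F1 equals plain accuracy, which is one reason macro or weighted averaging is often preferred when classes are imbalanced.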

## Visualizing a Classifier's Performance

A confusion matrix, also known as an error matrix, is the most popular way to visualize a classifier's performance. It has a high level of interpretability. It is a simple tabular format that is often rendered as a heat map. Each column of a confusion matrix shows the predicted classes, and each row shows the true (or actual) classes.

There are three key facts to know about the confusion matrix:
1. A perfect confusion matrix has non-zero values only along the main diagonal (top left to bottom right) and zeroes (0) everywhere else.
2. A confusion matrix shows not only where the model has failed, but also how: which classes were mistaken for which.
3. A confusion matrix works for any number of classes. If the dataset has 50 classes, it just means the visualised matrix is very large. This affects neither the model's performance nor the validity of the confusion matrix.

![cm.png](attachment:cm.png)

Reference: https://www.analyticsvidhya.com/blog/2021/05/machine-learning-model-evaluation/
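A minimal sketch of building such a matrix with scikit-learn's `confusion_matrix` is shown below. The animal labels and predictions are invented for illustration, and the label order passed in `labels` is an arbitrary choice.

```python
from sklearn.metrics import confusion_matrix

# Illustrative true and predicted labels
y_true = ["cat", "cat", "dog", "dog", "dog", "bird"]
y_pred = ["cat", "dog", "dog", "dog", "bird", "bird"]
labels = ["bird", "cat", "dog"]  # fixes the row/column order

# Rows are the true classes, columns are the predicted classes
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)

# Correct predictions sit on the main diagonal; everything else is an error
correct = cm.trace()
errors = cm.sum() - correct
print(correct, errors)
```

Reading across the "dog" row shows how the true dogs were classified, so an off-diagonal entry pinpoints which class they were confused with.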

## Evaluating a Regression Model's Performance

A well-known evaluation metric is MSE, which stands for Mean Squared Error. It is one of the most widely used metrics for evaluating regression models. The MSE formula is as follows:

![Mean-Bias-Error-1-i2tutorials-2.jpg](attachment:Mean-Bias-Error-1-i2tutorials-2.jpg)

Where:
1. n represents the number of observations in the dataset.
2. yi is the true value of the target we are trying to predict for observation i.
3. ŷi is the model's predicted value for yi.

Mean squared error is calculated by averaging the squared differences between the predicted and true values. The higher the MSE, the greater the squared error present in the model and hence the worse the quality of its predictions. Squaring the errors has two advantages:

1. Firstly, squaring constrains all error values to be non-negative.
2. Secondly, it means the metric penalizes a few large errors more than it penalizes many small errors.

![residual.png](attachment:residual.png)

Reference: https://www.analyticsvidhya.com/blog/2021/05/machine-learning-model-evaluation/
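The effect of squaring can be seen in a small sketch that compares two sets of predictions with the same total absolute error. The helper `mse` and all the values are illustrative assumptions.

```python
def mse(actual, predicted):
    """Mean of the squared differences between true and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

actual = [10, 10, 10, 10]
many_small_errors = [11, 9, 11, 9]   # four errors of size 1
one_large_error = [10, 10, 10, 14]   # a single error of size 4

mse_small = mse(actual, many_small_errors)
mse_large = mse(actual, one_large_error)

print(mse_small)  # 1.0
print(mse_large)  # 4.0
```

Both prediction sets are off by 4 units in total, yet the single large error produces an MSE four times higher.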

Coding Example on MSE:

In [None]:
from sklearn.metrics import mean_squared_error

actual_values = [3, -0.5, 2, 7]
predicted_values = [2.5, 0.0, 2, 8]

mean_squared_error(actual_values, predicted_values)

## R Squared (coefficient of determination)

As RMSE decreases, the model's performance improves. But these values alone are not intuitive. In a classification problem, if the model has an accuracy of 0.8, we can gauge how good it is against a random model, which has an accuracy of 0.5 on a balanced binary problem, so the random model can be treated as a benchmark. With the RMSE metric, there is no such benchmark to compare against. R-Squared fills this gap by comparing the model with a baseline that always predicts the mean.

The formula for R-Squared is:

![R-Squared.png](attachment:R-Squared.png)

![MSE%28model%29.png](attachment:MSE%28model%29.png)

1. MSE(model): Mean Squared Error of the predictions against the actual values
2. MSE(baseline): Mean Squared Error of the mean prediction against the actual values

R-Squared has an intuitive scale and does not depend on the units of y.

However, R-Squared on its own gives no information about the size of the prediction error in the units of y.
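The two MSE terms can be sketched by hand, assuming the usual form R-Squared = 1 - MSE(model) / MSE(baseline), where the baseline always predicts the mean of the actual values. The numbers are illustrative.

```python
actual = [3, -0.5, 2, 7]
predicted = [2.5, 0.0, 2, 8]
n = len(actual)

# Baseline: always predict the mean of the actual values
mean_actual = sum(actual) / n

mse_model = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n
mse_baseline = sum((a - mean_actual) ** 2 for a in actual) / n

r_squared = 1 - mse_model / mse_baseline

print(mse_model)     # 0.375
print(mse_baseline)  # 7.296875
print(round(r_squared, 4))
```

A value close to 1 means the model's squared error is small relative to the baseline, 0 means it is no better than predicting the mean, and a negative value means it is worse.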

Coding Example on R-Squared:

In [None]:
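import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# X must be 2-D (n_samples, n_features) and y needs the same number of samples
X = np.random.randn(100, 1)
y = np.random.randn(100) # y has nothing to do with X whatsoever

# cross-validated R-squared, one score per fold;
# expect values near or below zero because X carries no signal about y
scores = cross_val_score(LinearRegression(), X, y, scoring='r2')
scores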

## Mean Absolute Error (MAE)

MAE is the average of the absolute differences between the actual (true) values and the predicted values. Absolute difference means that the sign of an error is ignored. MAE takes the average of this error over every sample in a dataset and gives that as the output.

![Mean%20Absolute%20Error.jpg](attachment:Mean%20Absolute%20Error.jpg)

This can be implemented using sklearn's mean_absolute_error function:

Coding Example on MAE:

In [None]:
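from sklearn.metrics import mean_absolute_error

# mycity_model is assumed to be an already-fitted regression model;
# X holds its input features and y the true home prices
predicted_home_prices = mycity_model.predict(X)
mean_absolute_error(y, predicted_home_prices)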
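For a self-contained view of what the function computes, here is a small sketch of MAE by hand. The home prices below are made-up values, not output from any real model.

```python
# Hypothetical true and predicted home prices (illustrative values)
actual_prices = [200_000, 150_000, 320_000]
predicted_prices = [210_000, 140_000, 300_000]

# Absolute error per sample: the sign of each difference is dropped
absolute_errors = [abs(a - p) for a, p in zip(actual_prices, predicted_prices)]

# MAE is the plain average of those absolute errors
mae = sum(absolute_errors) / len(absolute_errors)

print(absolute_errors)  # [10000, 10000, 20000]
print(mae)
```

The result is in the same units as the target, so here it reads as the typical size of a pricing error.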

## Root Mean Squared Error (RMSE)

RMSE is a popular evaluation metric for regression problems. It assumes that the errors are unbiased and follow a normal distribution. Here are some key points to consider about RMSE:

1. The square root brings the metric back to the units of the target, so large deviations are shown on an interpretable scale.
2. The "squared" aspect of this metric prevents positive and negative error values from cancelling each other out. In other words, the metric shows the plausible magnitude of the error term.
3. It avoids the use of absolute error values, which are highly undesirable in mathematical calculations.
4. With more samples, reconstructing the error distribution using RMSE is considered more reliable.
5. RMSE is strongly influenced by outliers. Before using this metric, make sure you remove outliers from your dataset.
6. Compared to mean absolute error, RMSE gives greater weight to large errors and punishes them more.

![rmse.png](attachment:rmse.png)

N is the total number of observations.

Reference: https://www.analyticsvidhya.com/blog/2019/08/11-important-model-evaluation-error-metrics/

Coding Example on RMSE:

In [None]:
from sklearn.metrics import mean_squared_error
from math import sqrt

actual_values = [3, -0.5, 2, 7]
predicted_values = [2.5, 0.0, 2, 8]

# compute the mean squared error and keep the result
mse = mean_squared_error(actual_values, predicted_values)
# taking root of mean squared error
rmse = sqrt(mse)
rmse

## Which Metric Should Determine the Performance of a Machine Learning Model?

MAE: In comparison to MSE, it is less sensitive to outliers, as it does not punish large errors disproportionately. It is generally used for measuring performance on continuous variable data. It gives a linear score in which all individual differences are weighted equally in the average. The smaller the value, the better the performance of the model.

MSE: One of the most frequently used metrics, but it is least useful when the dataset contains a lot of noise, because a single bad prediction can dominate the score. It is most useful when large errors from outliers or unexpected values (too high or too low) are especially costly and need to be penalized heavily.

RMSE: In RMSE, the errors are squared before they are averaged. This means that RMSE gives greater weight to larger errors, so it is more useful when large errors occur and are particularly undesirable for the model's performance. It avoids taking the absolute value of the error, which is helpful in many mathematical calculations. The lower the value, the better the performance of the model.
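The different sensitivity of the three metrics to a single large error can be sketched as follows. The helper `regression_errors` and all the values are illustrative assumptions.

```python
from math import sqrt

def regression_errors(actual, predicted):
    """Return (MAE, MSE, RMSE) for two equal-length sequences."""
    diffs = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(d) for d in diffs) / len(diffs)
    mse = sum(d ** 2 for d in diffs) / len(diffs)
    return mae, mse, sqrt(mse)

actual = [10, 12, 14, 16, 18]
clean_predictions = [11, 11, 15, 15, 19]    # every error has size 1
outlier_predictions = [11, 11, 15, 15, 28]  # last error has size 10

mae_clean, mse_clean, rmse_clean = regression_errors(actual, clean_predictions)
mae_out, mse_out, rmse_out = regression_errors(actual, outlier_predictions)

print(mae_clean, mse_clean, rmse_clean)  # 1.0 1.0 1.0
print(mae_out, mse_out, round(rmse_out, 3))
```

With one bad prediction, MAE rises from 1.0 to 2.8 while MSE jumps from 1.0 to 20.8, and RMSE lands in between on the scale of the original units.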