# Regression Metrics

## 1. R-squared 

### Residual Sum of Squares

To understand the concepts clearly, we are going to take up a simple regression problem. Here, we are trying to predict the ‘Marks Obtained’ based on the amount of ‘Time Spent Studying’.
We can plot a simple regression graph to visualize this data.

![image.png](attachment:image.png)

The yellow dots represent the data points and the blue line is our predicted regression line. As you can see, our regression model does not perfectly predict all the data points. So how do we evaluate the predictions from the regression line using the data? Well, we could start by determining the residual values for the data points.

**Residual for a point in the data is the difference between the actual value and the value predicted by our linear regression model.**
$$ \text{Residual} = \text{actual} - \text{predicted} = y - \hat y $$

![image-2.png](attachment:image-2.png)

A residual plot tells us whether the regression model is the right fit for the data. Linear regression assumes that the residuals show no systematic trend, so a visible pattern in the residual plot suggests the model is missing some structure in the data.

Using the residual values, we can determine the sum of the squared residuals, also known as the Residual Sum of Squares or RSS.

$$ RSS = \sum_{i=1}^n (y_i - \hat y_i)^2 $$

The lower the RSS, the better the model's predictions. In other words, the line of best fit is the one that minimizes RSS. But there is a flaw in this: RSS is a scale-variant statistic. Since RSS is the sum of the squared differences between the actual and predicted values, its value depends on the scale of the target variable. For example, expressing the target in centimeters instead of meters multiplies RSS by 10,000.
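
The residual and RSS calculations can be sketched with NumPy. The study hours and marks below are made-up values chosen only for illustration, and `np.polyfit` stands in for whatever regression routine produced the blue line in the figure.

```python
import numpy as np

# Hypothetical data: hours spent studying (x) and marks obtained (y).
hours = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
marks = np.array([35.0, 45.0, 50.0, 65.0, 70.0])

# Least-squares straight line: marks ~ slope * hours + intercept.
slope, intercept = np.polyfit(hours, marks, deg=1)
predicted = slope * hours + intercept

residuals = marks - predicted      # y - y_hat for every point
rss = np.sum(residuals ** 2)       # residual sum of squares

# Rescale the target by 10 and refit: RSS grows by a factor of 10**2.
marks_scaled = marks * 10
slope_s, intercept_s = np.polyfit(hours, marks_scaled, deg=1)
rss_scaled = np.sum((marks_scaled - (slope_s * hours + intercept_s)) ** 2)

print("RSS:", rss)
print("RSS with target scaled by 10:", rss_scaled)
```

The fit itself is equally good in both cases, yet the second RSS is 100 times larger, which is why RSS alone is hard to compare across problems.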

### Total Sum of Squares

The total variation in the target variable is the sum of the squared differences between the actual values and their mean.

$$ TSS = \sum_{i=1}^n (y_i - \bar y)^2 $$

TSS, or Total Sum of Squares, gives the total variation in Y. It is closely related to the variance of Y: the variance is the average of the squared differences between the actual values and their mean, while TSS is the total of those squared differences.
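
As a quick check of the relationship between TSS and variance, here is a small sketch using the same kind of made-up marks data. It assumes the population variance (dividing by n), which is what `np.var` computes by default.

```python
import numpy as np

# Hypothetical marks, for illustration only.
marks = np.array([35.0, 45.0, 50.0, 65.0, 70.0])

tss = np.sum((marks - marks.mean()) ** 2)   # total sum of squares
variance = np.var(marks)                    # average squared deviation (divides by n)

print("TSS:", tss)
print("n * variance:", len(marks) * variance)
```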

### R-Squared

Now, if TSS gives us the total variation in Y, and RSS gives us the variation in Y not explained by X, then TSS-RSS gives us the variation in Y that is explained by our model! We can simply divide this value by TSS to get the proportion of variation in Y that is explained by the model. And this is our R-squared statistic!

<mark>
R-squared = (TSS-RSS)/TSS <br>
          = Explained variation/ Total variation <br>
          = 1 – Unexplained variation/ Total variation <br>
</mark>

So R-squared gives the degree of variability in the target variable that is explained by the model or the independent variables. If this value is 0.7, then it means that the independent variables explain 70% of the variation in the target variable.
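
The formula translates directly into code. The helper `r_squared` below is a hypothetical name introduced for this sketch, and the data are made up; scikit-learn's `r2_score` computes the same quantity if you prefer a library call.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R-squared = (TSS - RSS) / TSS."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rss = np.sum((y_true - y_pred) ** 2)
    tss = np.sum((y_true - y_true.mean()) ** 2)
    return (tss - rss) / tss

# Hypothetical marks and the predictions of a fitted line (9 * hours + 26).
marks = np.array([35.0, 45.0, 50.0, 65.0, 70.0])
line_predictions = np.array([35.0, 44.0, 53.0, 62.0, 71.0])

# A baseline that ignores the input and always predicts the mean.
mean_predictions = np.full_like(marks, marks.mean())

r2_line = r_squared(marks, line_predictions)
r2_mean = r_squared(marks, mean_predictions)

print("R-squared of the fitted line:", r2_line)
print("R-squared of the mean baseline:", r2_mean)
```

The mean baseline explains none of the variation, so its R-squared is 0, while the fitted line explains most of it.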

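For a least-squares linear regression with an intercept, evaluated on the data it was fitted to, R-squared lies between 0 and 1. A higher R-squared value indicates that the model explains more of the variability, and vice versa. On new data, or for a model that fits worse than simply predicting the mean, R-squared can be negative.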

If we had a really low RSS value, it would mean that the regression line was very close to the actual points. This means the independent variables explain the majority of variation in the target variable. In such a case, we would have a really high R-squared value.

![image-3.png](attachment:image-3.png)

On the contrary, if we had a really high RSS value, it would mean that the regression line was far away from the actual points. Thus, independent variables fail to explain the majority of variation in the target variable. This would give us a really low R-squared value.

![image-4.png](attachment:image-4.png)

So, this explains why the R-squared value gives us the proportion of variation in the target variable that is explained by the independent variables.

### Problems with R-squared statistic

The R-squared statistic isn’t perfect. In fact, it suffers from a major flaw. **Its value never decreases no matter the number of variables we add to our regression model. That is, even if we are adding redundant variables to the data, the value of R-squared does not decrease. It either remains the same or increases with the addition of new independent variables.** This clearly does not make sense because some of the independent variables might not be useful in determining the target variable. Adjusted R-squared deals with this issue.
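
This behaviour is easy to reproduce. The sketch below uses synthetic data and a hypothetical helper `fit_r2` that fits ordinary least squares with `np.linalg.lstsq`; the second feature is pure random noise with no relationship to the target.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x = rng.uniform(0, 10, size=n)
y = 5.0 * x + 20.0 + rng.normal(0, 5, size=n)   # synthetic target
noise_feature = rng.normal(size=n)               # unrelated to y by construction

def fit_r2(features, y):
    """Fit least squares with an intercept and return the training R-squared."""
    X = np.column_stack([np.ones(len(y))] + list(features))
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ coef) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    return 1 - rss / tss

r2_one = fit_r2([x], y)
r2_two = fit_r2([x, noise_feature], y)

print("R-squared with x only:", r2_one)
print("R-squared with x and a noise feature:", r2_two)
```

The second value is never lower than the first, even though the extra feature carries no information about the target.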

## Adjusted R-squared statistic
The Adjusted R-squared takes into account the number of independent variables used for predicting the target variable. In doing so, we can determine whether adding new variables to the model actually increases the model fit.
Let’s have a look at the formula for adjusted R-squared to better understand its working.

![image-5.png](attachment:image-5.png)

Here,
- n represents the number of data points in our dataset,
- k represents the number of independent variables, and
- R² represents the R-squared value determined by the model.

So, if R-squared does not increase significantly on the addition of a new independent variable, then the value of Adjusted R-squared will actually decrease.

![image-6.png](attachment:image-6.png)

On the other hand, if on adding the new independent variable we see a significant increase in R-squared value, then the Adjusted R-squared value will also increase.

![image-7.png](attachment:image-7.png)
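
Both cases can be checked numerically. The sketch below assumes the standard definition, Adjusted R² = 1 - (1 - R²)(n - 1)/(n - k - 1), with n and k as described above; the function name `adjusted_r2` and the R-squared values are made up for illustration.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared for n data points and k independent variables."""
    if n - k - 1 <= 0:
        raise ValueError("need more data points than independent variables plus one")
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

n = 50
base = adjusted_r2(0.800, n, k=1)          # model with one variable

# Adding a variable that barely raises R-squared lowers the adjusted value.
tiny_gain = adjusted_r2(0.801, n, k=2)

# Adding a variable that raises R-squared substantially increases it.
large_gain = adjusted_r2(0.850, n, k=2)

print("one variable:", base)
print("second variable, tiny gain:", tiny_gain)
print("second variable, large gain:", large_gain)
```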

## 2) Mean Absolute Error(MAE)

MAE is a very simple metric: it is the average of the absolute differences between the actual and predicted values.<br>
To see how it works, suppose you have input and output data and fit a Linear Regression model, which draws a best-fit line through the data.<br>
For each observation, the error is the difference between the actual value and the predicted value, and its absolute value is the absolute error.<br>
To get the MAE, sum the absolute errors and divide by the total number of observations. Because MAE is a loss, we aim to make it as small as possible.

![image.png](attachment:image.png)
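
The calculation can be written out by hand with NumPy. The values below are made up for illustration; `mean_absolute_error` from scikit-learn should give the same number for the same inputs.

```python
import numpy as np

# Hypothetical actual and predicted values.
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

absolute_errors = np.abs(y_true - y_pred)   # |y - y_hat| per observation
mae = absolute_errors.mean()                # sum of absolute errors / n

print("MAE:", mae)
```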

### Advantages of MAE
- The MAE you get is in the same unit as the output variable.
- It is more robust to outliers than MSE and RMSE, because the errors are not squared.

### Disadvantages of MAE
- MAE is not differentiable at zero error, so gradient-based optimizers such as gradient descent cannot be applied to it as directly as to a smooth loss.

MSE, the next metric, overcomes this disadvantage.

In [None]:
# y_test and y_pred are assumed to hold the actual and predicted target values.
from sklearn.metrics import mean_absolute_error
print("MAE", mean_absolute_error(y_test, y_pred))

## 3) Mean Squared Error(MSE)
MSE is one of the most widely used metrics and differs only slightly from mean absolute error: it is the average of the squared differences between the actual and predicted values.<br>
So, where MAE takes the absolute difference, MSE takes the squared difference.<br>
What does MSE actually represent? It represents the average squared distance between the actual and predicted values. Squaring prevents positive and negative errors from cancelling each other out.<br>

![image.png](attachment:image.png)

### Advantages of MSE
- MSE is differentiable everywhere, so you can easily use it as a loss function.

### Disadvantages of MSE
- The value of MSE is in squared units of the output. For example, if the output variable is in meters (m), then the MSE is in square meters.
- Because errors are squared, outliers in the dataset are penalized heavily and inflate the MSE. In short, it is not robust to outliers, which was an advantage of MAE.
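
To see the effect of an outlier, the sketch below computes MAE and MSE on made-up data, first without and then with a single extreme value. The helper names `mae` and `mse` are introduced here for illustration only.

```python
import numpy as np

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

# Hypothetical predictions that are each off by exactly 1.
y_pred = np.array([11.0, 11.0, 12.0, 12.0, 13.0])
y_clean = np.array([10.0, 12.0, 11.0, 13.0, 12.0])

# Same data, but the last actual value is an outlier (32 instead of 12).
y_outlier = np.array([10.0, 12.0, 11.0, 13.0, 32.0])

print("clean   MAE:", mae(y_clean, y_pred), " MSE:", mse(y_clean, y_pred))
print("outlier MAE:", mae(y_outlier, y_pred), " MSE:", mse(y_outlier, y_pred))
```

One outlier raises the MAE from 1 to 4.6 but raises the MSE from 1 to 73, which shows how much more weight squaring gives to large errors.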

In [None]:
from sklearn.metrics import mean_squared_error
print("MSE",mean_squared_error(y_test,y_pred))

## 4) Root Mean Squared Error(RMSE)

As the name suggests, RMSE is simply the square root of the mean squared error.

![image.png](attachment:image.png)

### Advantages of RMSE
- The output value you get is in the same unit as the required output variable which makes interpretation of loss easy.

### Disadvantages of RMSE
- It is less robust to outliers than MAE.


RMSE is one of the most commonly used evaluation metrics for regression, and it is a frequent choice when working with deep learning techniques.
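
A short sketch of how RMSE relates to MSE and MAE, using made-up values that include one outlier. Taking the square root brings the error back to the units of the target, and RMSE is never smaller than MAE on the same predictions.

```python
import numpy as np

# Hypothetical actual and predicted values; the last actual value is an outlier.
y_true = np.array([10.0, 12.0, 11.0, 13.0, 32.0])
y_pred = np.array([11.0, 11.0, 12.0, 12.0, 13.0])

errors = y_true - y_pred
mse = np.mean(errors ** 2)        # squared units of the target
rmse = np.sqrt(mse)               # same units as the target
mae = np.mean(np.abs(errors))     # same units as the target

print("MSE:", mse)
print("RMSE:", rmse)
print("MAE:", mae)
```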

In [None]:
# RMSE is the square root of MSE, so apply NumPy's square root to the MSE.
import numpy as np
print("RMSE", np.sqrt(mean_squared_error(y_test, y_pred)))

## 5) Root Mean Squared Log Error(RMSLE)

RMSLE applies a logarithm to the actual and predicted values before computing the RMSE, which compresses the scale of the error. The metric is very helpful when the target has not been scaled and varies over a large range, because the error then depends on the relative difference between actual and predicted values rather than on their absolute size.<br>
Typically 1 is added to both values before taking the log so that zeros can be handled, and the result is the RMSLE.<br>

It is a simple metric that is used by many datasets hosted for Machine Learning competitions.

In [None]:
# RMSLE is the square root of the mean squared log error (inputs must be non-negative).
import numpy as np
from sklearn.metrics import mean_squared_log_error
print("RMSLE", np.sqrt(mean_squared_log_error(y_test, y_pred)))
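
RMSLE is usually defined as the RMSE computed on log(1 + y) and log(1 + ŷ), which is also what scikit-learn's `mean_squared_log_error` is based on. The sketch below uses that definition with hypothetical helper names and made-up values to show that a 10% over-prediction costs about the same at very different scales.

```python
import numpy as np

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def rmsle(y_true, y_pred):
    """RMSE on log(1 + value); assumes non-negative inputs."""
    return np.sqrt(np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2))

# The same 10% over-prediction at two different scales (hypothetical values).
small_true, small_pred = np.array([100.0]), np.array([110.0])
large_true, large_pred = np.array([1000.0]), np.array([1100.0])

print("RMSE :", rmse(small_true, small_pred), rmse(large_true, large_pred))
print("RMSLE:", rmsle(small_true, small_pred), rmsle(large_true, large_pred))
```

RMSE is ten times larger for the second pair, while the two RMSLE values are nearly identical.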

# Classification Metrics

This is what a confusion matrix looks like:

![image.png](attachment:image.png)
									
From the confusion matrix, we can derive some important metrics that were not discussed in the previous article. Let’s talk about them here.

## Classification Accuracy
Classification Accuracy is what we usually mean when we use the term accuracy. It is the ratio of the number of correct predictions to the total number of input samples.

![image.png](attachment:image.png)

Or Accuracy = (TP + TN)/(TP+TN+FP+FN)

It works well only if there is a roughly equal number of samples in each class.<br>
For example, suppose 98% of the samples in our training set belong to class A and 2% to class B. Then our model can easily get 98% training accuracy by simply predicting that every training sample belongs to class A.<br>
The real problem arises when the cost of misclassifying the minority class is very high. If we deal with a rare but fatal disease, the cost of failing to diagnose a sick person is much higher than the cost of sending a healthy person for more tests.<br>

So if you have imbalanced data, you should consider other evaluation metrics, such as the F1 score, to evaluate your model.
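
The 98% / 2% example can be reproduced in a few lines. The labels below are synthetic, with 0 standing for class A and 1 for class B, and the "model" is simply a constant prediction.

```python
import numpy as np

# 98 samples of class A (0) and 2 samples of class B (1).
y_true = np.array([0] * 98 + [1] * 2)

# A "classifier" that always predicts class A.
y_pred = np.zeros_like(y_true)

accuracy = np.mean(y_true == y_pred)

# Share of actual class B samples that were found.
recall_b = np.sum((y_true == 1) & (y_pred == 1)) / np.sum(y_true == 1)

print("Accuracy:", accuracy)
print("Recall for class B:", recall_b)
```

The accuracy looks excellent, yet not a single class B sample is detected.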

## Sensitivity / True Positive Rate / Recall

![image.png](attachment:image.png)

Sensitivity tells us what proportion of the positive class got correctly classified.<br>
A simple example would be to determine what proportion of the actual sick people were correctly detected by the model.

## False Negative Rate

![image.png](attachment:image.png)

False Negative Rate (FNR) tells us what proportion of the positive class got incorrectly classified by the classifier.<br>
A higher TPR and a lower FNR are desirable since we want to correctly classify the positive class. Note that FNR = 1 - TPR.

## Specificity / True Negative Rate

![image.png](attachment:image.png)

Specificity tells us what proportion of the negative class got correctly classified.<br>
Taking the same example as in Sensitivity, Specificity would mean determining the proportion of healthy people who were correctly identified by the model.

## False Positive Rate

![image.png](attachment:image.png)

FPR tells us what proportion of the negative class got incorrectly classified by the classifier.<br>
A higher TNR and a lower FPR are desirable since we want to correctly classify the negative class. Note that FPR = 1 - TNR.<br>
Out of these metrics, Sensitivity and Specificity are perhaps the most important and we will see later on how these are used to build an evaluation metric. But before that, let’s understand why the probability of prediction is better than predicting the target class directly.<br>
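
The four rates defined in the sections above can all be computed from the confusion matrix counts. The labels below are made up for illustration, with 1 as the positive class (sick) and 0 as the negative class (healthy).

```python
import numpy as np

# Hypothetical actual and predicted labels.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 0, 0, 0, 0, 1, 1])

tp = np.sum((y_true == 1) & (y_pred == 1))   # sick, predicted sick
fn = np.sum((y_true == 1) & (y_pred == 0))   # sick, predicted healthy
tn = np.sum((y_true == 0) & (y_pred == 0))   # healthy, predicted healthy
fp = np.sum((y_true == 0) & (y_pred == 1))   # healthy, predicted sick

tpr = tp / (tp + fn)   # sensitivity / recall
fnr = fn / (tp + fn)   # false negative rate
tnr = tn / (tn + fp)   # specificity
fpr = fp / (tn + fp)   # false positive rate

print("TPR:", tpr, "FNR:", fnr)
print("TNR:", tnr, "FPR:", fpr)
```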

## AUC-ROC curve

The Receiver Operating Characteristic (ROC) curve is an evaluation tool for binary classification problems. It is a probability curve that plots the TPR against the FPR at various threshold values and essentially separates the ‘signal’ from the ‘noise’. The Area Under the Curve (AUC) measures the ability of a classifier to distinguish between classes and is used as a summary of the ROC curve.<br>
The higher the AUC, the better the performance of the model at distinguishing between the positive and negative classes.<br>

![image.png](attachment:image.png)

When AUC = 1, the classifier is able to perfectly distinguish between all the Positive and the Negative class points. If, however, the AUC had been 0, then the classifier would be predicting all Negatives as Positives, and all Positives as Negatives.

![image-2.png](attachment:image-2.png)

When $0.5<AUC<1$, there is a high chance that the classifier will be able to distinguish the positive class values from the negative class values. This is because the classifier detects more True Positives and True Negatives than False Negatives and False Positives.

![image-3.png](attachment:image-3.png)

When AUC=0.5, the classifier is not able to distinguish between Positive and Negative class points. This means the classifier is predicting either a random class or a constant class for all the data points.<br>
So, the higher the AUC value for a classifier, the better its ability to distinguish between positive and negative classes.
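
One way to make these cases concrete is to use the fact that AUC equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as half. The helper `auc_pairwise` below is a hypothetical name for a sketch of that pairwise calculation; scikit-learn's `roc_auc_score` is the usual library call. The scores are made up.

```python
import numpy as np

def auc_pairwise(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]          # every positive vs every negative
    wins = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)
    return wins / (len(pos) * len(neg))

y_true = np.array([0, 0, 1, 1])

auc_good = auc_pairwise(y_true, [0.1, 0.4, 0.35, 0.8])      # mostly correct ranking
auc_perfect = auc_pairwise(y_true, [0.1, 0.2, 0.8, 0.9])    # positives always higher
auc_inverted = auc_pairwise(y_true, [0.9, 0.8, 0.2, 0.1])   # negatives always higher
auc_constant = auc_pairwise(y_true, [0.5, 0.5, 0.5, 0.5])   # same score for everyone

print(auc_good, auc_perfect, auc_inverted, auc_constant)
```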

## Precision

It is the number of correct positive results divided by the number of positive results predicted by the classifier.

![image.png](attachment:image.png)

## Recall

It is the number of correct positive results divided by the number of all relevant samples (all samples that should have been identified as positive).

![image.png](attachment:image.png)

## F1 Score

F1 Score is the Harmonic Mean of precision and recall. The range for F1 Score is [0, 1]. It tells you how precise your classifier is (how many of the instances it labels positive really are positive), as well as how robust it is (it does not miss a significant number of positive instances).<br>
High precision with low recall gives you a classifier that is extremely accurate on the instances it flags but misses a large number of instances that are difficult to classify. The greater the F1 Score, the better the performance of our model. <br>

Mathematically, it can be expressed as :

![image.png](attachment:image.png)

F1 Score tries to find the balance between precision and recall. It is often used instead of accuracy when your data is imbalanced.
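
Precision, recall and F1 can be computed directly from confusion matrix counts. The function name `precision_recall_f1` and the counts below are made up for this sketch, and the function assumes at least one predicted positive and one actual positive so that no division by zero occurs.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from true/false positive and false negative counts."""
    precision = tp / (tp + fp)          # correct positives / predicted positives
    recall = tp / (tp + fn)             # correct positives / actual positives
    f1 = 2 * precision * recall / (precision + recall)   # harmonic mean
    return precision, recall, f1

# A reasonably balanced classifier (hypothetical counts).
balanced = precision_recall_f1(tp=3, fp=2, fn=1)

# A classifier that is always right when it predicts positive
# but finds only 1 of the 10 actual positives.
lopsided = precision_recall_f1(tp=1, fp=0, fn=9)

print("balanced (precision, recall, F1):", balanced)
print("lopsided (precision, recall, F1):", lopsided)
```

In the second case the arithmetic mean of precision and recall would be 0.55, but the harmonic mean pulls F1 down to about 0.18, reflecting the poor recall.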