In [1]:
import numpy as np
import pandas as pd

np.random.seed is simply a function that sets the seed of NumPy's pseudo-random number generator. The seed is the starting input from which NumPy generates pseudo-random numbers, so fixing it makes the random processes below reproducible.

np.random.random() samples from [0.0, 1.0),
np.random.uniform(low, high) samples from [low, high)
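
As a small illustration of what seeding does, the sketch below re-seeds the generator with the same value and checks that the draws repeat. The variable names (`first`, `second`, `third`) are chosen here for illustration only.

```python
import numpy as np

# Re-seeding with the same value reproduces exactly the same draws.
np.random.seed(50)
first = np.random.uniform(size=3)

np.random.seed(50)
second = np.random.uniform(size=3)

same_draws = bool(np.array_equal(first, second))
print(same_draws)  # True

# With the default low=0.0 and high=1.0, every draw falls in [0, 1).
in_range = bool(np.all((first >= 0.0) & (first < 1.0)))
print(in_range)  # True

# A different seed starts a different pseudo-random sequence.
np.random.seed(51)
third = np.random.uniform(size=3)
print(third)
```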

In [3]:
np.random.seed(50)
p = np.random.uniform(size=10) # prediction
y = np.random.uniform(size=10) # real value

# copy of y with one extreme value, to study the effect of an outlier
y_outlier = y.copy()
y_outlier[0] = 1000

In [5]:
p # this array is from 0 to 1 (uniform)

array([0.49460165, 0.2280831 , 0.25547392, 0.39632991, 0.3773151 ,
       0.99657423, 0.4081972 , 0.77189399, 0.76053669, 0.31000935])

In [7]:
y # this array is from 0 to 1 (uniform)

array([0.3465412 , 0.35176482, 0.14546686, 0.97266468, 0.90917844,
       0.5599571 , 0.31359075, 0.88820004, 0.67457307, 0.39108745])

In [4]:
y_outlier # the first element (1000) is far outside the 0-to-1 range of the other values,
# so 1000 is an outlier

array([1.00000000e+03, 3.51764817e-01, 1.45466856e-01, 9.72664685e-01,
       9.09178438e-01, 5.59957104e-01, 3.13590746e-01, 8.88200038e-01,
       6.74573066e-01, 3.91087448e-01])

### Regression metrics help us understand how a model performs under normal conditions and in situations where we have outliers

# R(MSE) (Root Mean Squared Error)

RMSE is the square root of the MSE; the MSE itself omits the root. The mean squared error (MSE) tells you how close a regression line is to a set of points. It does this by taking the distances from the points to the regression line (these distances are the “errors”), squaring them, and averaging the result. **The squaring removes any negative signs.** It also **gives more weight to larger differences**. It is called the mean squared error because you are finding the **average of a set of squared errors**. **The lower the MSE, the better the forecast**.

MSE (Mean Squared Error)

<img src="https://d1zx6djv3kb1v7.cloudfront.net/wp-content/media/2019/11/Differences-between-MSE-and-RMSE-1-i2tutorials.jpg" width="50%">

The mean squared error is the average of the squared residuals (the differences between the actual and the predicted values). The squaring removes any negative signs. Taking the root converts the result back into a unit we can interpret. (Example: dollars squared has no intuitive meaning, but dollars do, so we take the root to turn dollars² back into dollars.)

- The root puts the result back in the original unit of measure.
- Both MSE and RMSE are very common because they give a large weight to bad predictions. For example, if the residual (actual value minus prediction) of an umbrella price is 2, the penalty applied to the model is 2² = 4; if the residual is 7, the penalty is 7² = 49. The larger the residual, the disproportionately larger the penalty.
- They are strongly affected by outliers, because outliers produce very large errors.
- The constant prediction that minimizes the mean squared error is the mean of the target.
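
The points above can be sketched with plain NumPy. This is a minimal illustration on hypothetical toy data (residuals of 2, 7 and 0), not the notebook's `y` and `p`; the helper names `mse` and `rmse` are assumptions made for this example.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean of the squared residuals."""
    residuals = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(residuals ** 2))

def rmse(y_true, y_pred):
    """Square root of the MSE, expressed in the unit of the target."""
    return float(np.sqrt(mse(y_true, y_pred)))

# Hypothetical toy data: the residuals are 2, 7 and 0.
y_true = np.array([12.0, 17.0, 10.0])
y_pred = np.array([10.0, 10.0, 10.0])

toy_mse = mse(y_true, y_pred)    # (4 + 49 + 0) / 3
toy_rmse = rmse(y_true, y_pred)
print(toy_mse, toy_rmse)

# Among constant predictions, the mean of y_true gives the lowest MSE.
candidates = np.linspace(y_true.min(), y_true.max(), 701)
losses = [mse(y_true, np.full_like(y_true, c)) for c in candidates]
best_constant = float(candidates[np.argmin(losses)])
print(best_constant, y_true.mean())
```

The grid search is only there to make the last bullet visible; in practice the minimizer is obtained directly as the mean.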

<img src="https://miro.medium.com/max/966/1*lqDsPkfXPGen32Uem1PTNg.png" width="50%"/>
<img src="https://miro.medium.com/max/605/1*NupCRCord7W0whzSOttnxQ.png" width="50%"/>


In [8]:
from sklearn.metrics import mean_squared_error

In [14]:
# the parameters are, in order: y_true (actual values) and y_pred (predictions).
mse = mean_squared_error(y, p)
mse

0.08914363950320478

In [15]:
# the outlier is placed in y (the actual values), because predictions normally do not contain many outliers.
mse_outlier = mean_squared_error(y_outlier, p)
mse_outlier

99901.19108542126

In [19]:
# RMSE
print("Without outliers (RMSE): {} \nWith outliers (RMSE): {}".format(np.sqrt(mse), np.sqrt(mse_outlier)))

# only one element of the array was changed, and that single value alone inflated the metric.
# always check whether the error is being driven by outliers; this is why it is important to detect and remove them

Without outliers (RMSE): 0.29856932110182516 
With outliers (RMSE): 316.071496793718


# R(MSLE) (Root Mean Squared Logarithmic Error)

It is the Root Mean Squared Error of the log-transformed predicted and log-transformed actual values. RMSLE adds 1 to both actual and predicted values before taking the natural logarithm to avoid taking the natural log of possible 0 (zero) values.

- This error is useful because it approximates the percentage error: it works on ratios rather than on absolute differences.
- It is continuous, which makes it easier to minimize, and mathematically it cannot be negative.
- Neither the actual values nor the predictions can be negative; scikit-learn's mean_squared_log_error raises an error in that case.
- R(MSLE) is concerned with relative errors.
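
A minimal NumPy sketch of the definition above, assuming non-negative inputs. The helper name `rmsle` is hypothetical; it applies `np.log1p` (the log of 1 + x) to both arrays before computing an ordinary RMSE.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """RMSE computed on log(1 + value); assumes non-negative inputs."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if np.any(y_true < 0) or np.any(y_pred < 0):
        raise ValueError("this sketch only accepts non-negative values")
    log_diff = np.log1p(y_pred) - np.log1p(y_true)
    return float(np.sqrt(np.mean(log_diff ** 2)))

# The same absolute error of 1 weighs less when the values are larger.
small = rmsle([11.0], [10.0])
large = rmsle([111.0], [110.0])
print(small, large)
```

Because the metric depends on the ratio (1 + prediction) / (1 + actual), the error on the pair 110 and 111 comes out roughly ten times smaller than on the pair 10 and 11.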


<img src="https://miro.medium.com/max/1200/0*AUzyQ1rc6mpQVYfn" width='50%'/>

In [20]:
# example to understand this relative error
p_ = 10
y_ = 11
p_ - y_

-1

In [21]:
# the values are larger, but the absolute difference is still the same
p_ = 110
y_ = 111
p_ - y_

-1

In [25]:
# with the log, the log is applied to each value before the difference is calculated
p_ = np.log(10)
y_ = np.log(11)

p_ - y_

-0.09531017980432477

In [26]:
# with larger values, the same absolute gap of 1 gives a much smaller log difference
p_ = np.log(110)
y_ = np.log(111)

p_ - y_

-0.009049835519917337

In [30]:
# Instead of np.log, the metric uses log(1 + x), as np.log1p() does

from sklearn.metrics import mean_squared_log_error

msle = mean_squared_log_error(y, p)
msle

0.03304106424381391

In [31]:
# the outlier is placed in y (the actual values), because predictions normally do not contain many outliers.
msle_outlier = mean_squared_log_error(y_outlier, p)
msle_outlier

4.26592112775019

In [32]:
# RMSLE
print("Without outliers (RMSLE): {} \nWith outliers (RMSLE): {}".format(np.sqrt(msle), np.sqrt(msle_outlier)))

Without outliers (RMSLE): 0.1817720117174641 
With outliers (RMSLE): 2.0654106438551607


# MAE (Mean Absolute Error)

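- It is less sensitive to outliers than R(MSE): it only removes the sign of each error and does not square it, so large errors are not penalized disproportionately. An outlier still shifts the mean, though.
- When we do not care much about outliers, we can use this metric to check how the model behaves on the bulk of the data.
- If MAE is 0, the model's predictions are perfect.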
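
To make the comparison with RMSE concrete, here is a small sketch on hypothetical toy data (not the notebook's arrays). The helpers `mae` and `rmse` and the arrays `y_clean` and `y_outlier_toy` are names assumed for this example.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean of the absolute residuals."""
    return float(np.mean(np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))))

def rmse(y_true, y_pred):
    """Square root of the mean squared residual."""
    return float(np.sqrt(np.mean((np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)) ** 2)))

y_pred = np.array([1.5, 2.5, 2.5, 3.5, 4.0])
y_clean = np.array([1.0, 2.0, 3.0, 4.0, 4.5])          # every error is 0.5
y_outlier_toy = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # last error is 96

mae_clean, rmse_clean = mae(y_clean, y_pred), rmse(y_clean, y_pred)
mae_out, rmse_out = mae(y_outlier_toy, y_pred), rmse(y_outlier_toy, y_pred)

print("clean   MAE={:.3f} RMSE={:.3f}".format(mae_clean, rmse_clean))
print("outlier MAE={:.3f} RMSE={:.3f}".format(mae_out, rmse_out))
```

Both metrics grow when the outlier is introduced, but RMSE grows more because the single large error is squared before averaging.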

<img src="https://i.imgur.com/BmBC8VW.jpg" width='50%'/>

In [34]:
from sklearn.metrics import mean_absolute_error

mae = mean_absolute_error(y, p)
mae_outlier = mean_absolute_error(y_outlier, p)

print("Without outliers (MAE): {} \nWith outliers (MAE): {}".format(mae, mae_outlier))

# the effect of the outlier on this error is smaller than on the R(MSE)

Without outliers (MAE): 0.23045186806726878 
With outliers (MAE): 100.16618565941592


# MedAE (Median Absolute Error)

- A single outlier does not affect this metric, because it takes the median of the absolute errors instead of their mean.
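
A minimal sketch of the computation on hypothetical toy data: the metric is simply the median of the absolute residuals. The helper name `medae` and the toy arrays are assumptions made for this example.

```python
import numpy as np

def medae(y_true, y_pred):
    """Median of the absolute residuals."""
    abs_errors = np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))
    return float(np.median(abs_errors))

y_pred = np.array([1.5, 2.5, 2.5, 3.5, 4.0])
y_clean = np.array([1.0, 2.0, 3.0, 4.0, 4.5])          # every error is 0.5
y_outlier_toy = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # last error is 96

medae_clean = medae(y_clean, y_pred)
medae_out = medae(y_outlier_toy, y_pred)
print(medae_clean, medae_out)  # 0.5 0.5
```

The median only resists outliers while they make up fewer than half of the errors; beyond that point it moves as well.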

In [35]:
from sklearn.metrics import median_absolute_error

mad = median_absolute_error(y, p)
mad_outlier = median_absolute_error(y_outlier, p)

print("Without outliers (MedAE): {} \nWith outliers (MedAE): {}".format(mad, mad_outlier))

Without outliers (MedAE): 0.11999387786570043
With outliers (MedAE): 0.11999387786570043


# MAPE (Mean Absolute Percentage Error)

- The mean absolute percentage error (MAPE) measures how accurate a forecasting system is, expressing the error as a percentage of the actual values.
- MAPE is one of the most common measures of forecast error. It works best when the data have no extreme values, and it is undefined when an actual value is zero.
- MAPE is easy to interpret and explain. For example, a MAPE of 5% indicates that, on average, the predictions differ from the actual values by 5%.
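
The caveat about zeros can be made explicit in code. The sketch below is one possible way to guard against it; `mape_checked` is a hypothetical helper name, and raising an error on zero targets is a design choice assumed here, not something the notebook prescribes.

```python
import numpy as np

def mape_checked(y_true, y_pred):
    """MAPE as a fraction; rejects zero targets, where it is undefined."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if np.any(y_true == 0):
        raise ValueError("MAPE is undefined when an actual value is zero")
    return float(np.mean(np.abs((y_true - y_pred) / y_true)))

# Errors of 10 on 100 and 20 on 200 are both 10% of the actual value.
toy_mape = mape_checked([100.0, 200.0], [110.0, 180.0])
print("MAPE: {:.1%}".format(toy_mape))  # MAPE: 10.0%
```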

<img src="https://lindevs.com/wp-content/uploads/2020/11/formula_to_calculate_mape.png"/>

In [36]:
def mape(y, p):
    return np.mean(np.abs((y - p) / y))

mape_ = mape(y, p)
mape_outlier = mape(y_outlier, p)


print("Without outliers (MAPE): {} \nWith outliers (MAPE): {}".format(mape_, mape_outlier))

Without outliers (MAPE): 0.4259730428408323
With outliers (MAPE): 0.4831983774118244


# R2 (R-Squared)

R^2 tries to answer the question: how much of the variation in the output (y) is explained by the input (x)?
It measures the performance of the model relative to a baseline, the simplest regression model, which always predicts the mean of y. The squared error of that baseline is the total sum of squares (TSS), the squared error of the model is the residual sum of squares (RSS), and R^2 = 1 - RSS / TSS.

For a least-squares fit with an intercept, evaluated on its own training data, the value lies between 0 and 1.0. In general it can be negative, meaning that the model is worse than the mean model.

- If it is small, the regression is not doing a good job of capturing the trend in the data.
- If it is large, it is doing a good job of describing the relationship between x and y.

A low R-squared is generally a bad sign for predictive models. However, in some cases a good model may show a small value.

R-squared is a statistical measure of how close the data are to the fitted regression line. It is also known as the coefficient of determination, or the coefficient of multiple determination for multiple regression.
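
A minimal sketch of the usual definition, R^2 = 1 - RSS / TSS, on hypothetical toy data. The helper name `r2_manual` is an assumption for this example, and it presumes the actual values are not all identical (otherwise TSS is zero).

```python
import numpy as np

def r2_manual(y_true, y_pred):
    """1 - RSS / TSS, where the baseline always predicts the mean of y_true."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rss = np.sum((y_true - y_pred) ** 2)         # squared error of the model
    tss = np.sum((y_true - y_true.mean()) ** 2)  # squared error of the mean baseline
    return float(1.0 - rss / tss)

y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

r2_perfect = r2_manual(y_true, y_true)                             # predictions equal the data
r2_mean = r2_manual(y_true, np.full_like(y_true, y_true.mean()))   # baseline itself
r2_reversed = r2_manual(y_true, y_true[::-1])                      # worse than the baseline

print(r2_perfect, r2_mean, r2_reversed)  # 1.0 0.0 -3.0
```

The three cases show the reference points: 1.0 for a perfect model, 0.0 for a model no better than predicting the mean, and a negative value for a model that is worse than the mean.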

<img src="https://miro.medium.com/max/2812/1*_HbrAW-tMRBli6ASD5Bttw.png"/>

A good explanation is available on <a href="https://blog.minitab.com/en/adventures-in-statistics-2/regression-analysis-how-do-i-interpret-r-squared-and-assess-the-goodness-of-fit">the Minitab blog</a>.

Plotting fitted values by observed values graphically illustrates different R-squared values for regression models.

<img src="https://blog.minitab.com/hubfs/Imported_Blog_Media/fittedxobserved.gif"/> 

The regression model on the left accounts for 38.0% of the variance while the one on the right accounts for 87.4%.  If a model could explain 100% of the variance, the fitted values would always equal the observed values and, therefore, all the data points would fall on the fitted regression line (minitab, 2013). 

Limitations: 

- R-squared cannot determine whether the coefficient estimates and predictions are biased, which is why you must assess the residual plots.

- R-squared does not indicate whether a regression model is adequate. You can have a low R-squared value for a good model, or a high R-squared value for a model that does not fit the data!

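- The sample R-squared is a biased estimator of the population value: it tends to be too high, and it never decreases when more predictors are added to a least-squares model.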

Cases when a low r^2 is ok:

- Any field that attempts to predict human behavior, such as psychology, typically has R-squared values lower than 50%. Humans are simply harder to predict than, say, physical processes (minitab, 2013).

Henseler et al. (2009) proposed a rule of thumb for acceptable R2: values of 0.75, 0.50, and 0.25 are described as substantial, moderate, and weak, respectively. (Henseler, J., Ringle, C., and Sinkovics, R. (2009). "The use of partial least squares path modeling in international marketing." Advances in International Marketing (AIM), 20, 277-320)

In [38]:
from sklearn.metrics import r2_score
r2 = r2_score(y, p)
r2_outlier = r2_score(y_outlier, p)

print("Without outliers (R2): {} \nWith outliers (R2): {}".format(r2, r2_outlier))

# The best possible score is 1.0, and it can be negative (because the model can be arbitrarily worse than the mean baseline).

Without outliers (R2): -0.16855661234825958
With outliers (R2): -0.11129774756083188


# Adjusted R-Square