# Expedition to Data Science and Machine Learning
## Module 4: Machine Learning with Python
### Lecture 3: Supervised Learning: Linear Regression and Regression accuracy metrics

Instructor: Md Shahidullah Kawsar
<br>Data Scientist, IDARE, Houston, TX, USA

#### Objectives:
- Supervised Learning: Linear Regression
- Accuracy metrics for regression problems
- Mean Absolute Error (MAE)
- Mean Absolute Percentage Error (MAPE)
- Mean Squared Error (MSE)
- Root Mean Squared Error (RMSE)
- R-squared or coefficient of determination
- Prediction result evaluation

#### References:
[1] Accuracy metrics in sklearn: https://scikit-learn.org/stable/modules/model_evaluation.html
<br>[2] Mean Absolute Error (MAE): https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_absolute_error.html#sklearn.metrics.mean_absolute_error
<br>[3] Mean Squared Error (MSE) and Root Mean Squared Error (RMSE): https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html#sklearn.metrics.mean_squared_error
<br>[4] R-squared or coefficient of determination: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html#sklearn.metrics.r2_score
<br>[5] MAE, MSE, RMSE, Coefficient of Determination, Adjusted R Squared — Which Metric is Better? https://medium.com/analytics-vidhya/mae-mse-rmse-coefficient-of-determination-adjusted-r-squared-which-metric-is-better-cd0326a5697e

#### Import required Libraries

In [10]:
import pandas as pd
import numpy as np

#### Accuracy metrics in Supervised Learning: Regression

In [11]:
actual_value = [1,2,3,4,5,6,7,8,9,10]
predicted_value = [1,3,4,5,6,5,6,5,8,9]

**Mean Absolute Error (MAE)** is the average of the absolute differences between the actual and predicted values in the dataset. It measures the average magnitude of the residuals, ignoring their sign.
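To see why the absolute value matters, the minimal sketch below compares the plain mean of the residuals with the MAE on the same toy data as above. The variable names (`residuals`, `mean_residual`, `mae`) are chosen here for illustration only.

```python
import numpy as np

# Same toy data as in the cell above.
actual = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
predicted = np.array([1, 3, 4, 5, 6, 5, 6, 5, 8, 9])

residuals = actual - predicted

# Signed residuals partly cancel each other out ...
mean_residual = residuals.mean()
# ... while absolute residuals cannot cancel.
mae = np.abs(residuals).mean()

print("mean residual =", round(mean_residual, 2))
print("MAE =", round(mae, 2))
```

On this data the mean residual is about 0.3 while the MAE is 1.1, so averaging signed residuals would understate the typical error.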

**Mean Squared Error (MSE)** is the average of the squared differences between the actual and predicted values in the dataset. Because the errors are squared, large errors are penalized more heavily. When the residuals have zero mean, MSE equals their variance.
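The link between MSE and the variance of the residuals can be checked numerically. The sketch below assumes the population variance (`ddof=0`, NumPy's default) and uses illustrative variable names; it shows that MSE equals the residual variance plus the squared mean residual.

```python
import numpy as np

# Same toy data as in the cell above.
actual = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
predicted = np.array([1, 3, 4, 5, 6, 5, 6, 5, 8, 9])

residuals = actual - predicted

mse = np.mean(residuals ** 2)
variance = np.var(residuals)          # population variance (ddof=0)
bias_squared = residuals.mean() ** 2  # squared mean residual

print("MSE =", round(mse, 2))
print("variance + bias^2 =", round(variance + bias_squared, 2))
```

Here the residuals do not average to exactly zero, so the MSE (1.7) is slightly larger than the residual variance (about 1.61).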

**Root Mean Squared Error (RMSE)** is the square root of the Mean Squared Error, which puts it back in the same units as the target variable. When the residuals have zero mean, it equals their standard deviation.
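Because RMSE squares the errors before averaging, a single large error moves it more than it moves MAE. The sketch below illustrates this with a hypothetical variant of the predictions (`predicted_small`, not part of the lecture data) in which the one error of size 3 is reduced to size 1. The helper functions `mae` and `rmse` are defined here for illustration.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error."""
    return np.mean(np.abs(y_true - y_pred))

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Same toy data as in the cell above.
actual = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
predicted = np.array([1, 3, 4, 5, 6, 5, 6, 5, 8, 9])

# Hypothetical variant: the large error at index 7 (actual 8, predicted 5)
# is replaced by an error of size 1 (predicted 7).
predicted_small = predicted.copy()
predicted_small[7] = 7

for name, pred in [("without large error", predicted_small),
                   ("with large error", predicted)]:
    print(name, "-> MAE =", round(mae(actual, pred), 2),
          "RMSE =", round(rmse(actual, pred), 2))
```

Adding the single large error raises MAE by 0.2 but raises RMSE by roughly 0.36, which is why RMSE is often preferred when large errors are especially costly.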

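**Coefficient of determination, or R-squared,** represents the proportion of the variance in the dependent variable that is explained by the model. It is a scale-free score: irrespective of whether the values are small or large, R-squared is at most one, and it can be negative when the model fits worse than always predicting the mean of the actual values.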
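The range of R-squared can be illustrated with three hypothetical sets of predictions for the same actual values: a perfect fit, a constant prediction equal to the mean, and a deliberately bad prediction (the actual values in reverse order). The helper `r_squared` below is a minimal sketch written for this example, not a library function.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """1 - (residual sum of squares) / (total sum of squares)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot

actual = np.arange(1, 11)

perfect = actual.copy()                            # every prediction is exact
mean_only = np.full(actual.shape, actual.mean())   # always predict the mean
reversed_pred = actual[::-1]                       # hypothetical bad model

print("perfect   :", r_squared(actual, perfect))
print("mean only :", r_squared(actual, mean_only))
print("reversed  :", r_squared(actual, reversed_pred))
```

The perfect fit scores 1, the mean-only baseline scores 0, and the reversed predictions score -3, showing that the score has an upper bound of 1 but no lower bound.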

In [24]:
df = pd.DataFrame({"actual":actual_value,
                   "predicted": predicted_value})

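# residual (error) per row, its absolute value, and its square
df["dif"] = df["actual"] - df["predicted"]
df["abs_error"] = np.abs(df["dif"])
df["squared_error"] = df["dif"]**2

# deviation of each actual value from the mean of the actual values (used for R-squared)
df["actual_subtract_mean"] = df["actual"] - df["actual"].mean()
df["squared_actual_subtract_mean"] = df["actual_subtract_mean"]**2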

display(df)

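| | actual | predicted | dif | abs_error | squared_error | actual_subtract_mean | squared_actual_subtract_mean |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 0 | 0 | 0 | -4.5 | 20.25 |
| 1 | 2 | 3 | -1 | 1 | 1 | -3.5 | 12.25 |
| 2 | 3 | 4 | -1 | 1 | 1 | -2.5 | 6.25 |
| 3 | 4 | 5 | -1 | 1 | 1 | -1.5 | 2.25 |
| 4 | 5 | 6 | -1 | 1 | 1 | -0.5 | 0.25 |
| 5 | 6 | 5 | 1 | 1 | 1 | 0.5 | 0.25 |
| 6 | 7 | 6 | 1 | 1 | 1 | 1.5 | 2.25 |
| 7 | 8 | 5 | 3 | 3 | 9 | 2.5 | 6.25 |
| 8 | 9 | 8 | 1 | 1 | 1 | 3.5 | 12.25 |
| 9 | 10 | 9 | 1 | 1 | 1 | 4.5 | 20.25 |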


In [23]:
df["actual"].mean()

5.5

In [27]:
# mean absolute error: lower is better
MAE = df["abs_error"].mean()
print("MAE = ", MAE)

# MAPE: Mean Absolute Percentage Error: lower is better (undefined if any actual value is 0)
MAPE = np.round(np.mean(df["abs_error"]/df["actual"])*100, 2)
print("MAPE = ", MAPE)

# mean squared error: lower is better
MSE = df["squared_error"].mean()
print("MSE = ", MSE)

# root mean squared error: lower is better
RMSE = np.round(np.sqrt(MSE), 2)
print("RMSE = ", RMSE)

# coefficient of determination == r_squared: greater is better. Max = 1, no lower bound (can be negative)
r_squared = np.round(1- df["squared_error"].sum()/df["squared_actual_subtract_mean"].sum(), 2)
print("r_squared = ", r_squared)

MAE =  1.1
MAPE =  21.79
MSE =  1.7
RMSE =  1.3
r_squared =  0.79
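The hand-computed values above can be cross-checked against the scikit-learn metric functions listed in the references. The sketch below assumes scikit-learn 0.24 or later (for `mean_absolute_percentage_error`). Note that scikit-learn returns MAPE as a fraction, so it is multiplied by 100 here, and RMSE is obtained by taking the square root of the MSE.

```python
import numpy as np
from sklearn.metrics import (
    mean_absolute_error,
    mean_absolute_percentage_error,
    mean_squared_error,
    r2_score,
)

# Same toy data as in the cells above.
actual_value = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
predicted_value = [1, 3, 4, 5, 6, 5, 6, 5, 8, 9]

mae = mean_absolute_error(actual_value, predicted_value)
# sklearn returns a fraction (e.g. 0.2179), not a percentage
mape = mean_absolute_percentage_error(actual_value, predicted_value) * 100
mse = mean_squared_error(actual_value, predicted_value)
rmse = np.sqrt(mse)
r2 = r2_score(actual_value, predicted_value)

print("MAE = ", np.round(mae, 2))
print("MAPE = ", np.round(mape, 2))
print("MSE = ", np.round(mse, 2))
print("RMSE = ", np.round(rmse, 2))
print("r_squared = ", np.round(r2, 2))
```

The rounded results should agree with the manual calculation; any mismatch would point to a mistake in one of the intermediate columns of the DataFrame.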
