### Evaluating a Regression Model 

Model evaluation metrics documentation: see the scikit-learn user guide, section 3.3.4 (regression metrics).

The metrics covered here:

1. **R^2 (R²: R-squared) or coefficient of determination**
2. **Mean absolute error (MAE)**
3. **Mean squared error (MSE)**


#### 1. R^2 (R²: R-squared) or coefficient of determination

- `What R-squared does:` Compares your model's predictions to the mean of the targets. Values can range from negative infinity (a very poor model) to 1. 
For example, if all your model does is predict the mean of the targets, its R^2 value would be 0. And if your model perfectly predicts a range of numbers, its R^2 value would be 1. 
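
To make the range of values concrete, here is a minimal sketch that computes R^2 by hand as 1 minus the ratio of the model's squared error to the squared error of always predicting the mean. The helper name `r2_manual` and the toy arrays are made up for illustration; they are not part of scikit-learn or of the housing data used in these notes.

```python
import numpy as np

def r2_manual(y_true, y_pred):
    """Hypothetical helper: R^2 = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # squared error of the predictions
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # squared error of predicting the mean
    return 1.0 - ss_res / ss_tot

# Toy targets (assumed values, for illustration only)
y_true = np.array([1.0, 2.0, 3.0, 4.0])

r2_mean = r2_manual(y_true, np.full(len(y_true), y_true.mean()))  # always predict the mean
r2_perfect = r2_manual(y_true, y_true)                            # perfect predictions
r2_bad = r2_manual(y_true, y_true[::-1])                          # worse than the mean

print(r2_mean, r2_perfect, r2_bad)
```

The third case shows how a model that does worse than simply predicting the mean ends up with a negative score.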

In [1]:
# Imports
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_california_housing

In [3]:
# Creating housing dataset
housing = fetch_california_housing()
housing_df = pd.DataFrame(housing["data"], columns=housing["feature_names"])
housing_df["target"] = housing["target"] # Now we have the complete dataset
housing_df.head()


| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | target |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 | 3.521 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 | 3.413 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 | 3.422 |


In [4]:
from sklearn.ensemble import RandomForestRegressor

np.random.seed(42)

X = housing_df.drop("target", axis=1)
y = housing_df["target"]

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2)

model = RandomForestRegressor()
model.fit(X_train, y_train)




In [5]:
# For regressors, score() returns R^2 on the given data
model.score(X_test, y_test)

0.8066196804802649

In [6]:
housing_df.head()

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | target |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 | 3.521 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 | 3.413 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 | 3.422 |


In [7]:
y_test.mean()

2.0550030959302323

In [8]:
from sklearn.metrics import r2_score

# Fill an array with y_test mean
y_test_mean = np.full(len(y_test), y_test.mean())
y_test_mean[:10]

array([2.0550031, 2.0550031, 2.0550031, 2.0550031, 2.0550031, 2.0550031,
       2.0550031, 2.0550031, 2.0550031, 2.0550031])

In [9]:
r2_score(y_true=y_test,
         y_pred=y_test_mean)

0.0

In [10]:
r2_score(y_true=y_test,
         y_pred=y_test) # A model that predicts perfectly scores 1.0

1.0

#### 2. Mean absolute error (MAE)

- `MAE` is the average of the absolute differences between predictions and actual values
- It gives you an idea of how wrong your model's predictions are
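
Written as a formula, the definition above looks like the following. The symbols are assumptions introduced here for illustration: $n$ is the number of samples, $y_i$ the actual value and $\hat{y}_i$ the predicted value for sample $i$.

```latex
\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|
```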

In [11]:
# MAE 
from sklearn.metrics import mean_absolute_error

y_preds = model.predict(X_test)
mae = mean_absolute_error(y_test,y_preds)
mae

0.3265721842781009

In [16]:
# Creating a dataframe to visualize MAE
df = pd.DataFrame(data={"actual values": y_test,
                        "predicted values": y_preds})
df["differences"] = df["predicted values"] - df["actual values"]
df["absolute differences"] = df["differences"].abs()

df.head(10)

| | actual values | predicted values | differences | absolute differences |
|---|---|---|---|---|
| 20046 | 0.477 | 0.49384 | 0.01684 | 0.01684 |
| 3024 | 0.458 | 0.75494 | 0.29694 | 0.29694 |
| 15663 | 5.00001 | 4.928596 | -0.071414 | 0.071414 |
| 20484 | 2.186 | 2.54029 | 0.35429 | 0.35429 |
| 9814 | 2.78 | 2.33176 | -0.44824 | 0.44824 |
| 13311 | 1.587 | 1.65497 | 0.06797 | 0.06797 |
| 7113 | 1.982 | 2.34323 | 0.36123 | 0.36123 |
| 7668 | 1.575 | 1.66182 | 0.08682 | 0.08682 |
| 18246 | 3.4 | 2.47489 | -0.92511 | 0.92511 |
| 5723 | 4.466 | 4.834478 | 0.368478 | 0.368478 |


In [17]:
df["absolute differences"].mean() # Now we have the MAE through manual calculation

0.3265721842781009

In [12]:
y_preds

array([0.49384  , 0.75494  , 4.9285964, ..., 4.8363785, 0.71782  ,
       1.67901  ])

In [13]:
y_test

20046    0.47700
3024     0.45800
15663    5.00001
20484    2.18600
9814     2.78000
          ...   
15362    2.63300
16623    2.66800
18086    5.00001
2144     0.72300
3665     1.51500
Name: target, Length: 4128, dtype: float64

#### 3. Mean squared error (MSE)

- `MSE` is the mean of the squared differences between actual and predicted values
- Because each error is squared, large errors count disproportionately more than small ones
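
As a formula, this definition can be written as follows. The symbols are assumptions introduced here for illustration: $n$ is the number of samples, $y_i$ the actual value and $\hat{y}_i$ the predicted value for sample $i$.

```latex
\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
```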

In [18]:
# Mean Squared Error
from sklearn.metrics import mean_squared_error

y_preds = model.predict(X_test)

mse = mean_squared_error(y_test, y_preds)
mse

0.2534073069137548

In [19]:
df["squared differences"] = np.square(df["differences"])
df.head()

| | actual values | predicted values | differences | absolute differences | squared differences |
|---|---|---|---|---|---|
| 20046 | 0.477 | 0.49384 | 0.01684 | 0.01684 | 0.000284 |
| 3024 | 0.458 | 0.75494 | 0.29694 | 0.29694 | 0.088173 |
| 15663 | 5.00001 | 4.928596 | -0.071414 | 0.071414 | 0.0051 |
| 20484 | 2.186 | 2.54029 | 0.35429 | 0.35429 | 0.125521 |
| 9814 | 2.78 | 2.33176 | -0.44824 | 0.44824 | 0.200919 |


In [29]:
# Calculating MSE manually
squared = np.square(df["differences"])
squared.mean()

0.2534073069137548

In [30]:
df_large_error = df.copy()
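# Use a single .iloc call; chained indexing (.iloc[0]["..."] = ...) may assign to a copy
df_large_error.iloc[0, df_large_error.columns.get_loc("squared differences")] = 16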

df_large_error.head()


| | actual values | predicted values | differences | absolute differences | squared differences |
|---|---|---|---|---|---|
| 20046 | 0.477 | 0.49384 | 0.01684 | 0.01684 | 16.0 |
| 3024 | 0.458 | 0.75494 | 0.29694 | 0.29694 | 0.088173 |
| 15663 | 5.00001 | 4.928596 | -0.071414 | 0.071414 | 0.0051 |
| 20484 | 2.186 | 2.54029 | 0.35429 | 0.35429 | 0.125521 |
| 9814 | 2.78 | 2.33176 | -0.44824 | 0.44824 | 0.200919 |


In [31]:
# Calculate MSE with large error
df_large_error["squared differences"].mean()

0.25728320720794084

In [38]:
df_2large_error = df_large_error.copy()
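# Set only the squared differences of rows 1-99 to 20 (leaves the other columns untouched)
df_2large_error.iloc[1:100, df_2large_error.columns.get_loc("squared differences")] = 20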
df_2large_error.head()
df_2large_error["squared differences"].mean()

0.7333540351264799
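
The experiment above changes squared errors directly. As a complementary sketch, the snippet below starts from a handful of made-up raw errors (not taken from the housing model) and replaces one of them with a large miss, then compares how much MAE and MSE each grow. The helper names `mae_of` and `mse_of` are hypothetical.

```python
import numpy as np

def mae_of(errors):
    """Hypothetical helper: mean of the absolute errors."""
    return np.mean(np.abs(errors))

def mse_of(errors):
    """Hypothetical helper: mean of the squared errors."""
    return np.mean(np.square(errors))

# Made-up prediction errors (predicted - actual), for illustration only
errors = np.array([0.1, -0.2, 0.3, -0.1, 0.2])

# Same errors, but the first prediction is off by 4.0
errors_outlier = errors.copy()
errors_outlier[0] = 4.0

mae_growth = mae_of(errors_outlier) / mae_of(errors)
mse_growth = mse_of(errors_outlier) / mse_of(errors)

print(f"MAE: {mae_of(errors):.3f} -> {mae_of(errors_outlier):.3f} (x{mae_growth:.1f})")
print(f"MSE: {mse_of(errors):.3f} -> {mse_of(errors_outlier):.3f} (x{mse_growth:.1f})")
```

With these toy numbers, MSE grows by a much larger factor than MAE, which is the practical difference between the two metrics when a few predictions are badly off.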

#### Which Regression Metric Should We Use?

**Get a screenshot from video 138 New Evaluating A Regression Model (MSE) minute :8:00 / check chapter 3.3 / study Document after video**