In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.linear_model import LinearRegression

In [15]:
data = pd.read_csv('/content/sample_data/california_housing_train.csv')

In [16]:
data.shape

(17000, 9)

In [17]:
X = data.drop('median_house_value', axis=1)  # Features
y = data['median_house_value']  # Target variable

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True)

model = LinearRegression()

model.fit(X_train, y_train)

In [26]:
y_pred = model.predict(X_test)

In [21]:
data.head()

|   | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
|---|---|---|---|---|---|---|---|---|---|
| 0 | -114.31 | 34.19 | 15.0 | 5612.0 | 1283.0 | 1015.0 | 472.0 | 1.4936 | 66900.0 |
| 1 | -114.47 | 34.4 | 19.0 | 7650.0 | 1901.0 | 1129.0 | 463.0 | 1.82 | 80100.0 |
| 2 | -114.56 | 33.69 | 17.0 | 720.0 | 174.0 | 333.0 | 117.0 | 1.6509 | 85700.0 |
| 3 | -114.57 | 33.64 | 14.0 | 1501.0 | 337.0 | 515.0 | 226.0 | 3.1917 | 73400.0 |
| 4 | -114.57 | 33.57 | 20.0 | 1454.0 | 326.0 | 624.0 | 262.0 | 1.925 | 65500.0 |


In [25]:
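## MEAN BASELINE MODEL

# Predict the overall mean of the target for every row.
# Note: this baseline is computed and scored on the full dataset (y),
# whereas the regression model is scored on the test split only.
mean_target = y.mean()
y_pred_base = [mean_target] * len(y)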
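
For a like-for-like comparison with the model's test-set scores, one option is to take the mean from the training targets only and score it on the test targets. The following is a minimal sketch of that idea with NumPy; `mean_baseline_predictions` and the toy arrays are hypothetical stand-ins for `y_train` / `y_test`, not part of the notebook above.

```python
import numpy as np

def mean_baseline_predictions(y_train, n_test):
    """Predict the training-set mean for every test row (hypothetical helper)."""
    return np.full(n_test, np.mean(y_train), dtype=float)

# Toy targets standing in for y_train / y_test (assumed values, illustration only)
y_train_toy = np.array([100.0, 200.0, 300.0, 400.0])
y_test_toy = np.array([150.0, 350.0])

# The baseline never sees the test targets when computing its constant prediction
y_pred_base_toy = mean_baseline_predictions(y_train_toy, len(y_test_toy))
mae_base_toy = np.mean(np.abs(y_test_toy - y_pred_base_toy))

print(y_pred_base_toy)  # constant prediction of 250.0 for both rows
print(mae_base_toy)     # 100.0
```

Scored this way, the baseline and the model are evaluated on the same rows, so their metrics can be compared directly.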

## Mean Absolute Error

Calculated as the average of the absolute differences between predicted and actual values.

Points to Remember:
1. MAE is expressed in the same units as the target variable
2. Less sensitive to outliers than MSE because it doesn't square the errors
3. Easy to understand and interpret
4. Can be used directly as a loss function, although it is not differentiable at zero error, so gradient-based optimizers rely on subgradients or smooth approximations

Don't use when:
1. Extreme outliers are present: MAE is less sensitive to outliers than MSE, but extreme outliers can still influence it
2. Large errors should be penalized heavily: MAE weights each error in proportion to its size, so one large error counts the same as several small errors of equal total magnitude
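
The definition above can be sketched by hand with NumPy. `mae_manual` and the toy values are hypothetical and only illustrate the formula; in the notebook itself, `mean_absolute_error` from scikit-learn does this work.

```python
import numpy as np

def mae_manual(y_true, y_hat):
    """Mean of the absolute differences between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    return np.mean(np.abs(y_true - y_hat))

# Toy values (assumed, for illustration only)
y_true_toy = [3.0, 5.0, 2.0, 7.0]
y_hat_toy = [2.0, 5.0, 4.0, 8.0]

# Absolute errors are 1, 0, 2, 1, so the MAE is 4 / 4 = 1.0
print(mae_manual(y_true_toy, y_hat_toy))
```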

In [28]:
mae_baseline = mean_absolute_error(y, y_pred_base)
print(f"Mean Absolute Error (Baseline): {mae_baseline}")

Mean Absolute Error (Baseline): 91645.59140892734


In [29]:
mae = mean_absolute_error(y_test, y_pred)
print("Mean Absolute Error: ", mae)

Mean Absolute Error:  50384.96988178577


## Mean Squared Error

Average of the squared differences between predicted and actual values.

Points to Remember:
1. MSE squares errors, making it more sensitive to large errors/outliers
2. Larger errors contribute disproportionately to MSE due to the squaring effect
3. MSE is differentiable everywhere, making it suitable for gradient-based optimization

Use:
1. As a loss function during model training in many algorithms
2. For comparison of different algorithms on the same dataset
3. In optimization algorithms for model parameter tuning, thanks to its differentiability

Don't use when:
1. The data contains outliers that should not dominate the score
2. The size of the error needs to be read in the target's own units, e.g. financial forecasting, where understanding the scale of errors is important for deciding next steps. MSE is in squared units, so RMSE or MAE is easier to interpret
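
To make the outlier sensitivity concrete, here is a small sketch comparing MSE and MAE on two toy error patterns. The helper names and numbers are hypothetical and chosen only so the arithmetic is easy to follow.

```python
import numpy as np

def mse_manual(y_true, y_hat):
    """Mean of the squared differences between actual and predicted values."""
    err = np.asarray(y_true, dtype=float) - np.asarray(y_hat, dtype=float)
    return np.mean(err ** 2)

def mae_manual(y_true, y_hat):
    """Mean of the absolute differences, for comparison."""
    err = np.asarray(y_true, dtype=float) - np.asarray(y_hat, dtype=float)
    return np.mean(np.abs(err))

# Actual values are all zero, so the predictions equal the errors (toy setup)
y_true_toy = np.zeros(4)
pred_clean = np.array([1.0, 1.0, 1.0, 1.0])     # four small errors
pred_outlier = np.array([1.0, 1.0, 1.0, 10.0])  # one error ten times larger

print(mae_manual(y_true_toy, pred_clean), mse_manual(y_true_toy, pred_clean))      # 1.0 1.0
print(mae_manual(y_true_toy, pred_outlier), mse_manual(y_true_toy, pred_outlier))  # 3.25 25.75
```

A single large error raises the MAE from 1.0 to 3.25 but raises the MSE from 1.0 to 25.75, which is the squaring effect described in the list above.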


In [30]:
mse_baseline = mean_squared_error(y, y_pred_base)
print(f"Mean Squared Error (Baseline): {mse_baseline}")

Mean Squared Error (Baseline): 13451442293.56867


In [31]:
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error: ", mse)

Mean Squared Error:  4959203201.344264


## Root Mean Squared Error (RMSE)

Square root of the MSE, i.e. the square root of the average squared difference between predicted and actual values.

Points to remember:
1. RMSE is in the same units as the target variable, making it easier to interpret than MSE

Uses:
1. Frequently used to compare the performance of different models
2. Minimizing RMSE is equivalent to minimizing MSE, so it can guide model training and validation

Don't use when:
1. The data contains outliers that should not dominate the score
2. All errors should carry weight proportional to their size; like MSE, RMSE penalizes large errors disproportionately
3. The residuals are far from normally distributed (see the note below), since a few extreme residuals can then dominate the value
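
As a rough illustration of how RMSE relates to MAE, the sketch below uses two toy error patterns with the same MAE. The helpers and values are hypothetical; the point is only that RMSE equals MAE when all errors have the same size and exceeds it when the errors are uneven.

```python
import numpy as np

def rmse_manual(y_true, y_hat):
    """Square root of the mean squared error."""
    err = np.asarray(y_true, dtype=float) - np.asarray(y_hat, dtype=float)
    return np.sqrt(np.mean(err ** 2))

def mae_manual(y_true, y_hat):
    """Mean absolute error, for comparison."""
    err = np.asarray(y_true, dtype=float) - np.asarray(y_hat, dtype=float)
    return np.mean(np.abs(err))

# Actual values are all zero, so the predictions equal the errors (toy setup)
y_true_toy = np.zeros(4)
even_errors = np.array([2.0, 2.0, 2.0, 2.0])    # same size everywhere
one_big_error = np.array([0.0, 0.0, 0.0, 8.0])  # same total, concentrated in one row

print(mae_manual(y_true_toy, even_errors), rmse_manual(y_true_toy, even_errors))      # 2.0 2.0
print(mae_manual(y_true_toy, one_big_error), rmse_manual(y_true_toy, one_big_error))  # 2.0 4.0
```

The gap between RMSE and MAE is therefore a quick hint about how unevenly the errors are spread.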

In [32]:
print(f"Root Mean Squared Error (Baseline): {np.sqrt(mse_baseline)}")

Root Mean Squared Error (Baseline): 115980.35304985354


In [33]:
print("Root Mean Squared Error: ", np.sqrt(mse))

Root Mean Squared Error:  70421.61032910469


**NOTE:**

The assumption of normally distributed residuals (errors) underlies the standard inference for many statistical models, including linear regression (for example, confidence intervals and hypothesis tests on coefficients). It means the residuals should be symmetrically distributed around zero, with most values close to zero and fewer values further away (a bell-shaped curve).

When the residuals deviate clearly from a normal distribution, metrics that square the errors, such as the Root Mean Squared Error (RMSE), can be dominated by a few extreme residuals and become less representative of the typical error.

Non-normal residuals can also indicate that the model is not capturing certain patterns or relationships present in the data.

In such cases, alternative evaluation metrics such as Mean Absolute Error (MAE) or quantile-based metrics may be more robust, as they are less sensitive to heavy tails in the residuals.

Diagnostics such as Q-Q plots, Shapiro-Wilk tests, or histograms of residuals can help assess the normality assumption and guide the choice of evaluation metric.
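
The idea behind a Q-Q plot can be sketched without plotting: sort the standardized residuals, pair them with the matching normal quantiles, and check how close the pairs lie to a straight line. The sketch below uses NumPy and the standard library's `statistics.NormalDist`; `qq_points`, `qq_correlation`, and the synthetic residuals are hypothetical, and the correlation is an informal summary rather than a formal test such as Shapiro-Wilk.

```python
import numpy as np
from statistics import NormalDist

def qq_points(residuals):
    """Return (theoretical, sample) quantile pairs for a normal Q-Q plot."""
    r = np.sort(np.asarray(residuals, dtype=float))
    n = len(r)
    # Plotting positions (i - 0.5) / n, one simple and common choice
    probs = (np.arange(1, n + 1) - 0.5) / n
    theoretical = np.array([NormalDist().inv_cdf(p) for p in probs])
    # Standardize the sample so both axes share a scale
    sample = (r - r.mean()) / r.std(ddof=1)
    return theoretical, sample

def qq_correlation(residuals):
    """Correlation of the Q-Q points; values near 1 suggest roughly normal residuals."""
    theoretical, sample = qq_points(residuals)
    return np.corrcoef(theoretical, sample)[0, 1]

# Synthetic residuals (assumed, for illustration): one normal set, one skewed set
rng = np.random.default_rng(0)
normal_resid = rng.normal(0.0, 1.0, 500)
skewed_resid = rng.exponential(1.0, 500) - 1.0

print("normal:", qq_correlation(normal_resid))
print("skewed:", qq_correlation(skewed_resid))
```

With the notebook's own data, the input would be `y_test - y_pred`; plotting the two arrays from `qq_points` against each other with `plt.scatter` gives the usual Q-Q picture.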

## R Squared

Measures how well the independent variables explain the variability of the dependent variable.

Points to Remember
1. Range: at most 1 (higher is better). It lies between 0 and 1 for ordinary least squares with an intercept scored on its training data, but it can be negative on held-out data when the model predicts worse than the mean
2. Interpreted as the proportion of variation in the dependent variable that is accounted for by the independent variables
3. It's essential to compare R2 with a baseline model (often a model that predicts only the mean of the dependent variable) to determine the improvement achieved by the regression model. A mean-only baseline scores R2 = 0 on the data its mean was computed from.
    
    **An R2 significantly higher than the R2 of the baseline model indicates that your regression model is adding value by capturing patterns and relationships in the data beyond what a basic average-based model can achieve**

4. On the training data, R2 never decreases when more predictors are added, even if they aren't truly contributing to the model, which can be misleading

Adjusted R-squared addresses this by penalizing for the number of predictors.

Don't use when:
1. Predicting accurate values is more critical than explaining variance; R2 says nothing about the size of errors in the target's units
2. The relationship is nonlinear and the model is linear: a low R2 may then reflect the wrong model form rather than a weak relationship, so R2 alone might not give an accurate picture
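
The behaviour described above follows from the usual definition, R2 = 1 - SS_res / SS_tot. Here is a minimal sketch with NumPy; `r2_manual` and the toy values are hypothetical and only mirror what `r2_score` computes for a single target.

```python
import numpy as np

def r2_manual(y_true, y_hat):
    """R2 = 1 - (residual sum of squares) / (total sum of squares around the mean)."""
    y_true = np.asarray(y_true, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    ss_res = np.sum((y_true - y_hat) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Toy targets (assumed values, for illustration only)
y_toy = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

print(r2_manual(y_toy, y_toy))                       # perfect predictions: 1.0
print(r2_manual(y_toy, np.full(5, y_toy.mean())))    # mean-only baseline: 0.0
print(r2_manual(y_toy, [5.0, 4.0, 3.0, 2.0, 1.0]))   # worse than the mean: -3.0
```

The last line shows why the score is not bounded below by 0: predictions that are worse than the mean make SS_res larger than SS_tot.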



In [34]:
r2_baseline = r2_score(y, y_pred_base)
print(f"R2 Score (Baseline): {r2_baseline}")

R2 Score (Baseline): 0.0


In [35]:
r2 = r2_score(y_test, y_pred)
print("R2 Score: ", r2)

R2 Score:  0.6244564323513384


In [38]:
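## Adjusted R2

n = len(y_test)          # number of observations scored
k = len(X_test.columns)  # number of predictors

# Penalize R2 for the number of predictors relative to the sample size
adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
print("Adjusted R2 Score: ", adjusted_r2)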

Adjusted R2 Score:  0.6235704551938068
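
The same formula can be wrapped in a small helper to see how the penalty grows with the number of predictors. In this sketch, `adjusted_r2_score` is a hypothetical function, and `n = 3400` and `k = 8` are assumptions inferred from the `(17000, 9)` shape and `test_size=0.2` above, not values printed by the notebook.

```python
def adjusted_r2_score(r2, n, k):
    """Adjusted R2 for n observations and k predictors (requires n > k + 1)."""
    if n <= k + 1:
        raise ValueError("n must be greater than k + 1")
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

r2_reported = 0.6244564323513384  # R2 printed above
n_assumed = 3400                  # assumed: 20% of 17000 rows
k_assumed = 8                     # assumed: 9 columns minus the target

print(adjusted_r2_score(r2_reported, n_assumed, k_assumed))

# With the same R2 and sample size, more predictors mean a larger penalty
print(adjusted_r2_score(r2_reported, n_assumed, 100))
```

With thousands of rows and only a handful of predictors, the adjustment is small, which matches the close agreement between the two scores reported above.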
