California Houses price Regression model

Including needed libraries


In [138]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso, Ridge, RidgeCV, LassoCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import StandardScaler 
from sklearn.model_selection import GridSearchCV


Reading data from the csv file and printing first 5 rows 


In [139]:
data = pd.read_csv('California_Houses.csv', header=0)
print(data.head())

   Median_House_Value  Median_Income  Median_Age  Tot_Rooms  Tot_Bedrooms  \
0            452600.0         8.3252          41        880           129   
1            358500.0         8.3014          21       7099          1106   
2            352100.0         7.2574          52       1467           190   
3            341300.0         5.6431          52       1274           235   
4            342200.0         3.8462          52       1627           280   

   Population  Households  Latitude  Longitude  Distance_to_coast  \
0         322         126     37.88    -122.23        9263.040773   
1        2401        1138     37.86    -122.22       10225.733072   
2         496         177     37.85    -122.24        8259.085109   
3         558         219     37.85    -122.25        7768.086571   
4         565         259     37.85    -122.25        7768.086571   

   Distance_to_LA  Distance_to_SanDiego  Distance_to_SanJose  \
0   556529.158342         735501.806984         67432.5170

Splitting data to features and target then dvide them into training set, validation set and test set 70%, 15%, 15% respectively 


In [140]:
features = data.drop('Median_House_Value', axis=1)
target = data['Median_House_Value']
features_train, features_vald_test, target_train, target_vald_test = train_test_split(features, target, test_size=0.3, random_state=30)
features_vald, features_test, target_vald, target_test = train_test_split(features_vald_test, target_vald_test, test_size=0.5, random_state=30)
print(f'Training set size = {features_train.shape[0]}')
print(f'Validation set size = {features_vald.shape[0]}')
print(f'Test set size = {features_test.shape[0]}')

Training set size = 14448
Validation set size = 3096
Test set size = 3096


Normalizing data due to range in values


In [141]:
scaler = StandardScaler()
features_train = scaler.fit_transform(features_train)
features_vald = scaler.transform(features_vald)
features_test = scaler.transform(features_test)

creating linear regression model and predict the output of validation and test sets

In [142]:
model = LinearRegression()
model.fit(features_train, target_train)
predictions_val = model.predict(features_vald)
predictions_test = model.predict(features_test)

Calculating MSE and MAE for the validation set outputs 

In [143]:
mae_val = mean_absolute_error(target_vald, predictions_val)
mse_val = mean_squared_error(target_vald, predictions_val)

Calculating MSE and MAE for the test set outputs 

In [144]:
mae_test = mean_absolute_error(target_test, predictions_test)
mse_test = mean_squared_error(target_test, predictions_test)

Printing the values 


In [145]:
print(f'Validation MAE = {mae_val}')
print(f'Test MAE = {mae_test}')
print(f'Validation MSE = {mse_val}')
print(f'Test MSE = {mse_test}')
print(f'Mean Target Value: {target.mean()}')  # Print mean target value for reference
print(f'MAE as percentage of mean target: {mae_val / target.mean() * 100:.2f}%')  # Relative error


Validation MAE = 52383.55863950708
Test MAE = 50322.28933544367
Validation MSE = 5448495356.8047
Test MSE = 4932147504.842557
Mean Target Value: 206855.81690891474
MAE as percentage of mean target: 25.32%


Creating Lasso model using LassoCV for performing Lasso regression with built-in cross-validation to select the optimal regularization parameter (alpha)

In [146]:
lasso_cv = LassoCV(alphas=[0.1, 0.25, 0.5, 0.75, 1.0], cv= 5, max_iter=5000)
lasso_cv.fit(features_train, target_train)
print("Best alpha for Lasso:", lasso_cv.alpha_)


Best alpha for Lasso: 0.25


predict the outputs of the test and validation set


In [147]:
predictions_val = lasso_cv.predict(features_vald)
predictions_test = lasso_cv.predict(features_test)

Calculation MSE and MAE for test and validation set 


In [148]:
mae_val = mean_absolute_error(target_vald, predictions_val)
mse_val = mean_squared_error(target_vald, predictions_val)
mae_test = mean_absolute_error(target_test, predictions_test)
mse_test = mean_squared_error(target_test, predictions_test)
print("lasso Regression validation MSE:", mse_val)
print("Lasso Regression validation MAE:", mae_val)
print("lasso Regression test MSE:", mse_test)
print("Lasso Regression test MAE:", mae_test)
print(f'Mean Target Value: {target.mean()}')  # Print mean target value for reference
print(f'MAE as percentage of mean target: {mae_test / target.mean() * 100:.2f}%')  # Relative error

lasso Regression validation MSE: 5448461381.833978
Lasso Regression validation MAE: 52383.731782164235
lasso Regression test MSE: 4932101067.664037
Lasso Regression test MAE: 50322.27441944892
Mean Target Value: 206855.81690891474
MAE as percentage of mean target: 24.33%


Error deacresed due to regularization of the model


Creating Ridge
 model and choosing the best alpha using cross validation

In [149]:
ridge_cv = RidgeCV(alphas=[0.1, 0.25, 0.5, 0.75, 1.0], cv= 5)
ridge_cv.fit(features_train, target_train)
print("Best alpha for Ridge:", ridge_cv.alpha_)

Best alpha for Ridge: 0.75


predict the outputs of the test and validation set

In [125]:
predictions_val = lasso_cv.predict(features_vald)
predictions_test = lasso_cv.predict(features_test)


Calculation MSE and MAE for test and validation set for Ridge Regression


In [126]:
mae_val = mean_absolute_error(target_vald, predictions_val)
mse_val = mean_squared_error(target_vald, predictions_val)
mae_test = mean_absolute_error(target_test, predictions_test)
mse_test = mean_squared_error(target_test, predictions_test)
print("Ridge Regression validation MSE:", mse_val)
print("Ridge Regression validation MAE:", mae_val)
print("Ridge Regression test MSE:", mse_test)
print("Ridge Regression test MAE:", mae_test)
print(f'Mean Target Value: {target.mean()}')  # Print mean target value for reference
print(f'MAE as percentage of mean target: {mae_test / target.mean() * 100:.2f}%')  # Relative error

Ridge Regression validation MSE: 5448461381.833978
Ridge Regression validation MAE: 52383.731782164235
Ridge Regression test MSE: 4932101067.664037
Ridge Regression test MAE: 50322.27441944892
Mean Target Value: 206855.81690891474
MAE as percentage of mean target: 24.33%


Error for ridge is the same as the error of lasso regression (both are smaller than linear regression) due to their ability to handle issues like overfitting and multicollinearity using regularization (lasso uses L1 regularization and ridge uses L2 regularization)