## Linear Regression for Predicting Median House Value

This notebook fits a Linear Regression model, along with Ridge and Lasso variants, to predict the median house value of a block from the features in the `California_Houses.csv` file. Linear Regression is a statistical method that models the relationship between a dependent variable and one or more independent variables by fitting a linear equation to the observed data.
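
As a small illustration of what "fitting a linear equation" means, the sketch below recovers a slope and an intercept from toy data with ordinary least squares via `np.linalg.lstsq`. The data and variable names are made up for illustration and are not taken from `California_Houses.csv`.

```python
import numpy as np

# Toy data generated from y = 2x + 1 (an assumption for illustration only).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

# Design matrix with a column of ones so the fit includes an intercept.
A = np.column_stack([x, np.ones_like(x)])

# Ordinary least squares: find w that minimizes ||A @ w - y||^2.
w, *_ = np.linalg.lstsq(A, y, rcond=None)
slope, intercept = w
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```

With several features, the design matrix simply gains one column per feature, and the fit returns one coefficient per column.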

The dataset contains features such as the total number of rooms, population, median income, and distances to the coast and to major cities, all of which can influence the median house value. By training the models on this data, we aim to understand how these features relate to the median house value and to predict it for new data points.

## Includes
The cell below imports the libraries needed to train and test the models, and defines the `evaluation` function, which reports the mean absolute error (MAE) and root mean squared error (RMSE) of a model on a given dataset.

In [79]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import accuracy_score, precision_score, mean_absolute_error, root_mean_squared_error
import math

def evaluation(model, x, y):
  """Print and return the MAE and RMSE of `model` on features `x` and targets `y`."""
  y_pred = model.predict(x)
  mae = mean_absolute_error(y, y_pred)
  rmse = root_mean_squared_error(y, y_pred)
  print(f"Model: {type(model).__name__}")
  print(f"MAE: {mae:.2f}")
  print(f"RMSE: {rmse:.2f}")
  print("/n")
  return mae, rmse
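
To make the two metrics concrete, here is a minimal sketch that computes MAE and RMSE directly from their definitions using NumPy only. The helper name `mae_rmse` and the three-point example are hypothetical and are not part of the notebook.

```python
import numpy as np

def mae_rmse(y_true, y_pred):
    """Return (MAE, RMSE) computed directly from their definitions."""
    err = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    mae = float(np.mean(np.abs(err)))         # mean of absolute errors
    rmse = float(np.sqrt(np.mean(err ** 2)))  # square root of mean squared error
    return mae, rmse

# Errors are -1, 0 and 2, so MAE = 3/3 and RMSE = sqrt(5/3).
mae, rmse = mae_rmse([3.0, 5.0, 2.0], [2.0, 5.0, 4.0])
print(f"MAE: {mae:.4f}, RMSE: {rmse:.4f}")
```

RMSE is never smaller than MAE, and the gap grows when a few predictions are far off, because squaring weights large errors more heavily.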

## Reading the data
The cell below reads the data from the `California_Houses.csv` file and displays the first five rows.

In [80]:
data = pd.read_csv("California_Houses.csv")
data.head()

| | Median_House_Value | Median_Income | Median_Age | Tot_Rooms | Tot_Bedrooms | Population | Households | Latitude | Longitude | Distance_to_coast | Distance_to_LA | Distance_to_SanDiego | Distance_to_SanJose | Distance_to_SanFrancisco |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 452600.0 | 8.3252 | 41 | 880 | 129 | 322 | 126 | 37.88 | -122.23 | 9263.040773 | 556529.158342 | 735501.806984 | 67432.517001 | 21250.213767 |
| 1 | 358500.0 | 8.3014 | 21 | 7099 | 1106 | 2401 | 1138 | 37.86 | -122.22 | 10225.733072 | 554279.850069 | 733236.88436 | 65049.908574 | 20880.6004 |
| 2 | 352100.0 | 7.2574 | 52 | 1467 | 190 | 496 | 177 | 37.85 | -122.24 | 8259.085109 | 554610.717069 | 733525.682937 | 64867.289833 | 18811.48745 |
| 3 | 341300.0 | 5.6431 | 52 | 1274 | 235 | 558 | 219 | 37.85 | -122.25 | 7768.086571 | 555194.266086 | 734095.290744 | 65287.138412 | 18031.047568 |
| 4 | 342200.0 | 3.8462 | 52 | 1627 | 280 | 565 | 259 | 37.85 | -122.25 | 7768.086571 | 555194.266086 | 734095.290744 | 65287.138412 | 18031.047568 |


## Split
The cell below splits the dataset into training, validation and testing sets. The split is 70% training data, 15% validation data and 15% testing data, which is verified in the output.

In [81]:
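# Hold out 15% of the data for testing.
train_val_data, test_data = train_test_split(data, test_size=0.15, random_state=42)

# 0.15 / 0.85 ≈ 0.1765, so the validation set is also about 15% of the full dataset.
train_data, val_data = train_test_split(train_val_data, test_size=0.1765, random_state=42)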
tr = len(train_data) / len(data) * 100
val = len(val_data) / len(data) * 100
te = len(test_data) / len(data) * 100

print(f"Training Set: {train_data.shape}" + f" percentage of Training set data from the total data is {tr}")
print(f"Validation Set: {val_data.shape}" + f" percentage of the Validation set is {val}")
print(f"Testing Set: {test_data.shape}" + f" percentage of the Testing set is {te}")


Training Set: (14447, 14) percentage of Training set data from the total data is 69.99515503875969
Validation Set: (3097, 14) percentage of the Validation set is 15.00484496124031
Testing Set: (3096, 14) percentage of the Testing set is 15.0
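
The second split uses `test_size=0.1765` because it is applied to the 85% of rows that remain after the test set is removed. The sketch below reproduces that arithmetic with plain Python. The helper `second_split_fraction` is hypothetical, and the sketch assumes the held-out row count is rounded up, which is consistent with the sizes printed above.

```python
import math

def second_split_fraction(test_frac, val_frac):
    """Fraction to hold out of the remainder so that validation is val_frac of the whole."""
    return val_frac / (1.0 - test_frac)

n_total = 14447 + 3097 + 3096  # total rows, summed from the shapes printed above

frac = second_split_fraction(0.15, 0.15)
print(f"second-stage fraction: {frac:.4f}")

# Assumption: the held-out count is the ceiling of fraction * rows.
n_test = math.ceil(0.15 * n_total)
n_val = math.ceil(0.1765 * (n_total - n_test))
n_train = n_total - n_test - n_val
print(n_train, n_val, n_test)
```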


## Initialization and training
The cells below initialize and train the models. The first cell separates the training features from the training targets.

In [82]:
X_train = train_data.drop(columns=["Median_House_Value"])
y_train = train_data["Median_House_Value"]

#### Linear regression

In [83]:
linear_model = LinearRegression()
linear_model.fit(X_train, y_train)

print("Model Coefficients:", linear_model.coef_)
print("Intercept:", linear_model.intercept_)

Model Coefficients: [ 3.94587505e+04  9.05710831e+02 -6.24067991e+00  1.05569157e+02
 -3.92895416e+01  4.80644972e+01 -4.02079229e+04 -2.83509872e+04
 -2.45679018e-01 -1.38162140e-01  1.99685373e-01  1.42125749e-01
 -1.16994946e-01]
Intercept: -1974252.9772424346


#### Lasso regression

In [84]:
Lasso_model = Lasso(alpha=0.1, max_iter=10000)
Lasso_model.fit(X_train, y_train)

print("Lasso Model Coefficients: " , Lasso_model.coef_)
print("Intercpt: ", Lasso_model.intercept_)


Lasso Model Coefficients:  [ 3.94587103e+04  9.05713532e+02 -6.24064539e+00  1.05569037e+02
 -3.92895913e+01  4.80645445e+01 -4.02049410e+04 -2.83501529e+04
 -2.45697837e-01 -1.38161147e-01  1.99666874e-01  1.42117848e-01
 -1.16990934e-01]
Intercpt:  -1974250.2939722675


#### Ridge regression

In [85]:
Ridge_model = Ridge(alpha=0.1, max_iter=10000)
Ridge_model.fit(X_train, y_train)

print("Ridge Model Coefficients: ", Ridge_model.coef_)
print("Intercept: ", Ridge_model.intercept_)

Ridge Model Coefficients:  [ 3.94586404e+04  9.05719887e+02 -6.24058214e+00  1.05569302e+02
 -3.92896613e+01  4.80640722e+01 -4.01993129e+04 -2.83496543e+04
 -2.45724764e-01 -1.38158230e-01  1.99625188e-01  1.42104594e-01
 -1.16982833e-01]
Intercept:  -1974372.7064609916
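
The Lasso and Ridge coefficients above are almost identical to the plain Linear Regression ones. One plausible reason is that `alpha=0.1` is very small relative to the scale of the unscaled features and target. The notebook imports `StandardScaler` but does not use it, so the following is only a sketch, on synthetic data, of how features could be standardized before a regularized model. The data, the `alpha=1.0` value and the name `scaled_ridge` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Two synthetic features on very different scales (illustrative only).
X = np.column_stack([rng.normal(0, 1, 200), rng.normal(0, 10000, 200)])
y = 3.0 * X[:, 0] + 0.0005 * X[:, 1] + rng.normal(0, 0.1, 200)

# The scaler is fit inside the pipeline, so it only sees the data passed to fit().
scaled_ridge = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
scaled_ridge.fit(X, y)

# Coefficients are now expressed per standard deviation of each feature.
coefs = scaled_ridge.named_steps["ridge"].coef_
print("Coefficients on standardized features:", coefs)
```

Because the penalty acts on coefficient size, putting features on a common scale makes a single `alpha` affect them comparably.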


## Validation
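The following cell evaluates the three models on the validation set. Its last three lines then call `fit` on the validation data. In scikit-learn, `fit` discards the previously learned coefficients, so after this cell each model is trained on the validation set alone rather than on the training set.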

In [86]:
X_val = val_data.drop(columns=["Median_House_Value"])
Y_val = val_data["Median_House_Value"]

evaluation(linear_model, X_val, Y_val)
evaluation(Ridge_model, X_val, Y_val)
evaluation(Lasso_model, X_val, Y_val)

linear_model.fit(X_val, Y_val)
Lasso_model.fit(X_val, Y_val)
Ridge_model.fit(X_val, Y_val)

Model: LinearRegression
MAE: 49819.74
RMSE: 67845.55
/n
Model: Ridge
MAE: 49819.83
RMSE: 67845.61
/n
Model: Lasso
MAE: 49819.77
RMSE: 67845.57
/n
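
A common way to use a validation set for tuning is to train one model per candidate hyperparameter value on the training set and keep the value with the lowest validation error. The sketch below shows this for the Ridge `alpha` on synthetic data. The candidate grid, the data and all variable names are hypothetical, and this is not the procedure the notebook itself runs.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic regression data (illustrative only).
X = rng.normal(size=(300, 5))
y = X @ np.array([1.5, -2.0, 0.0, 0.5, 3.0]) + rng.normal(scale=0.5, size=300)

X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=42)

candidate_alphas = [0.01, 0.1, 1.0, 10.0, 100.0]  # hypothetical grid
val_rmse = {}
for alpha in candidate_alphas:
    # Train on the training split only; score on the validation split.
    model = Ridge(alpha=alpha).fit(X_tr, y_tr)
    val_rmse[alpha] = float(np.sqrt(mean_squared_error(y_va, model.predict(X_va))))

best_alpha = min(val_rmse, key=val_rmse.get)
print(f"Best alpha on validation set: {best_alpha}")
```

The test set stays untouched during this loop, so the final test score remains an unbiased estimate for the chosen `alpha`.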


## Testing
The following cell evaluates the models, as refit in the validation cell, on the held-out test set.

In [87]:
X_test = test_data.drop(columns=["Median_House_Value"])
Y_test = test_data["Median_House_Value"]

evaluation(linear_model, X_test, Y_test)
evaluation(Lasso_model, X_test, Y_test)
evaluation(Ridge_model, X_test, Y_test)

Model: LinearRegression
MAE: 50873.03
RMSE: 69859.60
/n
Model: Lasso
MAE: 50873.03
RMSE: 69859.58
/n
Model: Ridge
MAE: 50872.87
RMSE: 69859.28
/n


(50872.8666519924, 69859.27991657387)