In [75]:
import pandas as pd

In [76]:
df = pd.read_csv("/content/california_housing.csv")

In [77]:
df

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,ocean_proximity,median_house_value
0,-122.23,37.88,41,880,129.0,322,126,8.3252,NEAR BAY,452600
1,-122.22,37.86,21,7099,1106.0,2401,1138,8.3014,NEAR BAY,358500
2,-122.24,37.85,52,1467,190.0,496,177,7.2574,NEAR BAY,352100
3,-122.25,37.85,52,1274,235.0,558,219,5.6431,NEAR BAY,341300
4,-122.25,37.85,52,1627,280.0,565,259,3.8462,NEAR BAY,342200
...,...,...,...,...,...,...,...,...,...,...
20635,-121.09,39.48,25,1665,374.0,845,330,1.5603,INLAND,78100
20636,-121.21,39.49,18,697,150.0,356,114,2.5568,INLAND,77100
20637,-121.22,39.43,17,2254,485.0,1007,433,1.7000,INLAND,92300
20638,-121.32,39.43,18,1860,409.0,741,349,1.8672,INLAND,84700


In [78]:
df.shape

(20640, 10)

In [79]:
df.isnull().sum()  # count missing values per column

Unnamed: 0,0
longitude,0
latitude,0
housing_median_age,0
total_rooms,0
total_bedrooms,207
population,0
households,0
median_income,0
ocean_proximity,0
median_house_value,0


The total_bedrooms column has 207 null values, so the next step is to fill them in.

In [80]:
# Replace nulls with the column median; assigning back avoids the chained inplace call that pandas 3.0 will not support
df['total_bedrooms'] = df['total_bedrooms'].fillna(df['total_bedrooms'].median())


In [81]:
df.isnull().sum()

Unnamed: 0,0
longitude,0
latitude,0
housing_median_age,0
total_rooms,0
total_bedrooms,0
population,0
households,0
median_income,0
ocean_proximity,0
median_house_value,0


Median was selected because the total_bedrooms feature is skewed and may contain outliers. The median is robust to extreme values, represents the central tendency better than the mean, and prevents distortion of the data distribution while retaining all records for model training.
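To illustrate why the median is a more robust fill value than the mean, the sketch below uses a small made-up Series with one extreme value. The data and the name `rooms` are illustrative assumptions, not taken from the housing file.

```python
import numpy as np
import pandas as pd

# Toy column with one missing value and one extreme value (illustrative data only).
rooms = pd.Series([2.0, 3.0, 4.0, np.nan, 100.0])

mean_value = rooms.mean()      # pulled upward by the single outlier
median_value = rooms.median()  # stays close to the typical values

# Assign the result back instead of using inplace=True on a column selection.
filled = rooms.fillna(median_value)

print(f"mean={mean_value}, median={median_value}")
print(filled.tolist())
```

Filling with the mean here would insert a value far above three of the four observed entries, while the median stays among them.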

In [82]:
from sklearn.preprocessing import OneHotEncoder, StandardScaler

In [83]:
X = df.drop("median_house_value", axis=1)
y = df["median_house_value"]

In [84]:
categorical_features = ["ocean_proximity"]
numerical_features = X.drop(columns=categorical_features).columns  # separate the categorical and numerical columns before encoding

In [85]:
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(drop="first", sparse_output=False)
encoded_cat = encoder.fit_transform(X[categorical_features])  # one-hot encode the categorical column

In [86]:
encoded_df = pd.DataFrame(encoded_cat, columns=encoder.get_feature_names_out(categorical_features))

In [87]:
X_num = X[numerical_features].reset_index(drop=True)
X_cat = encoded_df.reset_index(drop=True)

X_final = pd.concat([X_num, X_cat], axis=1)
# combine the numerical columns with the encoded categorical columns

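The ocean_proximity column holds categorical values, so it is encoded with OneHotEncoder and the encoded columns are added to the dataset. With drop="first", the first category is dropped and acts as the baseline: rows in that category have 0 in all four encoded columns.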

In [88]:
X_final

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,ocean_proximity_INLAND,ocean_proximity_ISLAND,ocean_proximity_NEAR BAY,ocean_proximity_NEAR OCEAN
0,-122.23,37.88,41,880,129.0,322,126,8.3252,0.0,0.0,1.0,0.0
1,-122.22,37.86,21,7099,1106.0,2401,1138,8.3014,0.0,0.0,1.0,0.0
2,-122.24,37.85,52,1467,190.0,496,177,7.2574,0.0,0.0,1.0,0.0
3,-122.25,37.85,52,1274,235.0,558,219,5.6431,0.0,0.0,1.0,0.0
4,-122.25,37.85,52,1627,280.0,565,259,3.8462,0.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...
20635,-121.09,39.48,25,1665,374.0,845,330,1.5603,1.0,0.0,0.0,0.0
20636,-121.21,39.49,18,697,150.0,356,114,2.5568,1.0,0.0,0.0,0.0
20637,-121.22,39.43,17,2254,485.0,1007,433,1.7000,1.0,0.0,0.0,0.0
20638,-121.32,39.43,18,1860,409.0,741,349,1.8672,1.0,0.0,0.0,0.0


In [89]:
X_final.shape

(20640, 12)

In [90]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_final)  # standardize every feature to zero mean and unit variance

In [91]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)  # 80/20 train/test split

The data is scaled with StandardScaler and then split into training and test sets.

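StandardScaler is chosen because the features in the California Housing dataset have very different scales (for example, median_income is in single digits while total_rooms runs into the thousands). Standardization puts every feature on a comparable scale with zero mean and unit variance. This matters most for Support Vector Regressor, whose kernel depends on distances between samples; for Linear Regression it mainly helps numerical stability and makes coefficients comparable, and the tree-based models are largely unaffected by it.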
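In the cells above the scaler is fitted on all rows before the split, so test-set statistics influence the scaling. A stricter variant, sketched here on synthetic data (the `X_demo` and `y_demo` arrays are stand-ins, not the housing data), splits first and fits the scaler on the training rows only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Synthetic stand-in for X_final / y: two features on very different scales.
X_demo = np.column_stack([rng.normal(35, 2, 500), rng.normal(2500, 1000, 500)])
y_demo = rng.normal(200_000, 50_000, 500)

# Split first, then fit the scaler on the training rows only.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # learns mean and std from the training data
X_te_scaled = scaler.transform(X_te)      # reuses the training mean and std

print(X_tr_scaled.mean(axis=0).round(3), X_tr_scaled.std(axis=0).round(3))
```

On a dataset of this size the difference in scores is probably small, but the pattern keeps the test set untouched until evaluation.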

Linear Regression

In [92]:
from sklearn.linear_model import LinearRegression

lr = LinearRegression()
lr.fit(X_train, y_train)
lr_pred = lr.predict(X_test)

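Linear Regression is a supervised learning algorithm that predicts a continuous value as a weighted sum of the input features plus an intercept. Here it uses all 12 features, so this is multiple linear regression rather than the single-feature case.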

It is suitable because it is easy to interpret and quick to train.
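As a minimal sketch of what the fit computes, the least-squares weights and intercept can be recovered with plain NumPy. The three features and the true coefficients below are made up for illustration; scikit-learn's LinearRegression solves the same least-squares problem.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))   # three illustrative features
true_w = np.array([2.0, -1.0, 0.5])
y_demo = X_demo @ true_w + 4.0       # noise-free target with intercept 4.0

# Append a column of ones so the intercept is estimated together with the weights.
A = np.column_stack([X_demo, np.ones(len(X_demo))])
solution, *_ = np.linalg.lstsq(A, y_demo, rcond=None)
weights, intercept = solution[:-1], solution[-1]

print(weights.round(3), round(float(intercept), 3))
```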

Decision Tree Regressor

In [93]:
from sklearn.tree import DecisionTreeRegressor

dt = DecisionTreeRegressor(random_state=42)
dt.fit(X_train, y_train)
dt_pred = dt.predict(X_test)

Decision Tree Regressor works by repeatedly splitting the dataset into smaller subsets based on feature values that minimize prediction error, creating a tree-like structure of decision rules, and the final prediction is obtained by averaging the target values within the terminal leaf nodes.

It is suitable because it can handle non-linearity: some features do not have a linear relationship with the output.
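The splitting rule described above can be sketched for a single feature. The helper `best_split` is a hypothetical illustration of the squared-error criterion, not scikit-learn's implementation, and the data is made up.

```python
import numpy as np

def best_split(x, y):
    """Return (threshold, sse) of the single split on x that minimizes squared error."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    best_threshold, best_sse = None, np.inf
    for i in range(1, len(x)):
        if x[i] == x[i - 1]:
            continue  # no threshold can separate identical values
        left, right = y[:i], y[i:]
        # Each side predicts its own mean, so the error is the spread around that mean.
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_threshold, best_sse = (x[i - 1] + x[i]) / 2, sse
    return best_threshold, best_sse

x_demo = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y_demo = np.array([1.0, 1.0, 1.0, 5.0, 5.0, 5.0])
threshold, sse = best_split(x_demo, y_demo)
print(threshold, sse)
```

A full tree repeats this search over every feature and then recurses into each side until a stopping rule such as max_depth is reached.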

Random Forest Regressor

In [94]:
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(random_state=42)
rf.fit(X_train, y_train)
rf_pred = rf.predict(X_test)


Random Forest Regressor works by constructing a large number of decision trees using random subsets of the training data and features and then combining their predictions by averaging, which reduces variance and improves generalization compared to a single decision tree.

It is suitable because it usually achieves higher accuracy than a single tree and reduces overfitting.
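The averaging idea can be sketched by hand with bootstrapped trees. This is a simplified illustration on synthetic data, not the internals of RandomForestRegressor; the tree count of 25 and the data-generating function are arbitrary assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(400, 2))
y_demo = np.sin(X_demo[:, 0]) + 0.5 * X_demo[:, 1] + rng.normal(0, 0.3, 400)
X_tr, y_tr, X_te, y_te = X_demo[:300], y_demo[:300], X_demo[300:], y_demo[300:]

per_tree_preds = []
for seed in range(25):
    idx = rng.integers(0, len(X_tr), size=len(X_tr))  # bootstrap sample of rows
    # max_features=1 makes each split consider one randomly chosen feature.
    tree = DecisionTreeRegressor(max_features=1, random_state=seed)
    tree.fit(X_tr[idx], y_tr[idx])
    per_tree_preds.append(tree.predict(X_te))

per_tree_preds = np.array(per_tree_preds)
forest_pred = per_tree_preds.mean(axis=0)  # average the trees' predictions

single_tree_mse = ((per_tree_preds - y_te) ** 2).mean(axis=1).mean()
forest_mse = ((forest_pred - y_te) ** 2).mean()
print(single_tree_mse, forest_mse)
```

The averaged prediction can never have a higher squared error than the average error of the individual trees, which is the variance reduction the paragraph above refers to.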


Gradient Boosting Regressor

In [95]:
from sklearn.ensemble import GradientBoostingRegressor

gbr = GradientBoostingRegressor(random_state=42)
gbr.fit(X_train, y_train)
gbr_pred = gbr.predict(X_test)

Gradient Boosting Regressor builds decision trees sequentially. Each new tree is fitted to the errors of the current ensemble (for squared-error loss these are the residuals, which equal the negative gradient of the loss), and its prediction, scaled by a learning rate, is added to the ensemble, gradually improving prediction accuracy.

It can capture complex patterns and non-linear relationships.
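The sequential fitting can be sketched in a few lines for squared-error loss. This is a simplified illustration on synthetic data, not GradientBoostingRegressor itself; the 50 rounds, depth of 2 and learning rate of 0.1 are arbitrary assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(300, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(0, 0.1, 300)

learning_rate = 0.1
prediction = np.full_like(y_demo, y_demo.mean())  # start from the mean of the target
mse_history = [((y_demo - prediction) ** 2).mean()]

for _ in range(50):
    # Residuals are the negative gradient of squared-error loss (up to a constant factor).
    residuals = y_demo - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X_demo, residuals)
    prediction += learning_rate * tree.predict(X_demo)  # small corrective step
    mse_history.append(((y_demo - prediction) ** 2).mean())

print(mse_history[0], mse_history[-1])
```

A smaller learning rate makes each step more cautious and usually needs more trees, which is why learning_rate and n_estimators are typically tuned together.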

Support Vector Regressor

In [96]:
from sklearn.svm import SVR

svr = SVR(kernel='rbf')
svr.fit(X_train, y_train)
svr_pred = svr.predict(X_test)

Support Vector Regressor works by finding a function that fits the data within a specified error margin while minimizing model complexity, and by using kernel functions it can transform the input space to handle non-linear relationships effectively.

It remains effective in high-dimensional feature spaces and can model non-linear relationships.
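The "specified error margin" is the epsilon tube: errors smaller than epsilon cost nothing, and larger errors are penalized linearly. The helper below is a hypothetical NumPy sketch of that loss with made-up values, not a call into scikit-learn.

```python
import numpy as np

def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Per-sample SVR loss: zero inside the epsilon tube, linear outside it."""
    return np.maximum(0.0, np.abs(y_true - y_pred) - epsilon)

y_true = np.array([1.0, 1.0, 1.0])
y_pred = np.array([1.05, 1.3, 0.5])
losses = epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1)
print(losses)
```

The first prediction falls inside the tube and contributes no loss; the other two are charged only for the part of the error that exceeds epsilon.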

**Performance Metrics**

In [97]:
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

models = {"Linear Regression": lr_pred, "Decision Tree": dt_pred, "Random Forest": rf_pred, "Gradient Boosting": gbr_pred, "SVR": svr_pred}


In [98]:
results = []

for model, preds in models.items():
    results.append([
        model,
        mean_squared_error(y_test, preds),
        mean_absolute_error(y_test, preds),
        r2_score(y_test, preds),
    ])

results

[['Linear Regression',
  4908476721.156616,
  50670.73824097191,
  0.6254240620553606],
 ['Decision Tree', 4870669661.1838665, 44214.67223837209, 0.6283091964371093],
 ['Random Forest', 2404389528.630863, 31642.75753633721, 0.8165160977560968],
 ['Gradient Boosting',
  3123095111.877028,
  38248.031950045464,
  0.7616701988665029],
 ['SVR', 13655934685.60698, 86961.27698172479, -0.04211241775363761]]

In [99]:
results_df = pd.DataFrame(results, columns=["Model", "MSE", "MAE", "R2"])

results_df  # tabulate the metrics for easy comparison


Unnamed: 0,Model,MSE,MAE,R2
0,Linear Regression,4908477000.0,50670.738241,0.625424
1,Decision Tree,4870670000.0,44214.672238,0.628309
2,Random Forest,2404390000.0,31642.757536,0.816516
3,Gradient Boosting,3123095000.0,38248.03195,0.76167
4,SVR,13655930000.0,86961.276982,-0.042112


From the table above, the best model is Random Forest Regressor: it has the lowest mean squared error and mean absolute error and the highest R2 score (0.817).

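The worst model is Support Vector Regressor, which has by far the highest mean squared error and mean absolute error and a negative R2 score (-0.042), meaning it does slightly worse than always predicting the mean of the test targets. A likely reason is that SVR's default C=1.0 is very small relative to a target measured in hundreds of thousands of dollars.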
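The three metrics, and the meaning of a negative R2, can be checked by hand. The helper `regression_metrics` and the four-value example are illustrative assumptions; RMSE is added because it expresses the error in the same units as the target.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute MSE, RMSE, MAE and R2 with plain NumPy."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    errors = y_true - y_pred
    mse = np.mean(errors ** 2)
    ss_res = np.sum(errors ** 2)                    # error of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # error of always predicting the mean
    return {"MSE": mse, "RMSE": np.sqrt(mse),
            "MAE": np.mean(np.abs(errors)), "R2": 1 - ss_res / ss_tot}

y_true = np.array([100.0, 200.0, 300.0, 400.0])
mean_only = np.full(4, y_true.mean())             # always predict the mean
far_off = np.array([400.0, 300.0, 200.0, 100.0])  # worse than predicting the mean

print(regression_metrics(y_true, mean_only))
print(regression_metrics(y_true, far_off))
```

Predicting the mean gives an R2 of exactly 0, so any model with a negative R2 is doing worse than that trivial baseline.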

**Cross-Validation and Hyperparameter Tuning**

In [105]:
from sklearn.model_selection import cross_val_score
models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(random_state=42),
    "SVR": SVR()
}

In [106]:
cv_results = {}

for name, model in models.items():
    scores = cross_val_score(model, X_scaled, y, cv=5, scoring="r2")
    cv_results[name] = scores.mean()

cv_results

{'Linear Regression': np.float64(0.5596297261642371),
 'Decision Tree': np.float64(0.17756344840429783),
 'Random Forest': np.float64(0.5073647401536752),
 'Gradient Boosting': np.float64(0.5883651553000089),
 'SVR': np.float64(-0.1253744551046536)}
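These cross-validated scores are much lower than the hold-out scores above. One possible explanation, stated here as an assumption rather than something verified on the data, is fold ordering: with cv=5 a regressor is evaluated on contiguous, unshuffled folds, whereas train_test_split shuffles by default, and the rows appear to be ordered geographically (NEAR BAY at the top, INLAND at the bottom). The sketch below reproduces the effect on synthetic sorted data.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic data whose rows are sorted by the feature, mimicking an ordered file.
X_demo = np.linspace(0, 10, 200).reshape(-1, 1)
y_demo = X_demo[:, 0] ** 2

model = DecisionTreeRegressor(random_state=42)

# cv=5 on a regressor uses contiguous, unshuffled folds.
ordered_scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="r2")

# Shuffling spreads every region of the data across all folds.
shuffled_cv = KFold(n_splits=5, shuffle=True, random_state=42)
shuffled_scores = cross_val_score(model, X_demo, y_demo, cv=shuffled_cv, scoring="r2")

print(ordered_scores.mean(), shuffled_scores.mean())
```

Passing a shuffled KFold object as cv to cross_val_score would make the cross-validated scores comparable with the shuffled hold-out split, at the cost of no longer testing how well the models extrapolate to unseen regions.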

In [108]:
from sklearn.model_selection import GridSearchCV

param_grid_lr = {"fit_intercept": [True, False], "positive": [True, False]}

grid_lr = GridSearchCV(LinearRegression(), param_grid_lr, cv=5, scoring="r2")

grid_lr.fit(X_train, y_train)
best_lr = grid_lr.best_estimator_


In [109]:
param_grid_dt = {
    "max_depth": [5, 10, 20, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 5]
}

grid_dt = GridSearchCV(DecisionTreeRegressor(random_state=42), param_grid_dt, cv=5, scoring="r2")

grid_dt.fit(X_train, y_train)
best_dt = grid_dt.best_estimator_


In [110]:
param_grid_rf = {
    "n_estimators": [100, 200],
    "max_depth": [None, 10, 20],
    "min_samples_split": [2, 5]
}

grid_rf = GridSearchCV(RandomForestRegressor(random_state=42), param_grid_rf, cv=5, scoring="r2", n_jobs=-1)

grid_rf.fit(X_train, y_train)
best_rf = grid_rf.best_estimator_


In [111]:
param_grid_gbr = {
    "n_estimators": [100, 200],
    "learning_rate": [0.05, 0.1],
    "max_depth": [3, 5]
}

grid_gbr = GridSearchCV(GradientBoostingRegressor(random_state=42), param_grid_gbr, cv=5, scoring="r2")

grid_gbr.fit(X_train, y_train)
best_gbr = grid_gbr.best_estimator_


In [118]:
from sklearn.model_selection import RandomizedSearchCV
import numpy as np
param_dist_svr = {
    "C": np.logspace(-1, 2, 5),
    "epsilon": [0.01, 0.1, 0.2],
    "gamma": ["scale", "auto"]
}

rand_svr = RandomizedSearchCV(SVR(kernel="rbf"), param_dist_svr, n_iter=10, cv=5, scoring="r2", random_state=42)

rand_svr.fit(X_train, y_train)
best_svr = rand_svr.best_estimator_
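The searches above keep only best_estimator_, but a fitted search object also records which settings won and how each candidate scored. The sketch below shows those attributes on a small synthetic problem; the data and the one-parameter grid are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(300, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(0, 0.3, 300)

param_grid = {"max_depth": [2, 5, None]}
grid = GridSearchCV(DecisionTreeRegressor(random_state=42), param_grid, cv=5, scoring="r2")
grid.fit(X_demo, y_demo)

print(grid.best_params_)                    # hyperparameters with the best mean CV score
print(grid.best_score_)                     # that mean cross-validated R2
print(grid.cv_results_["mean_test_score"])  # one mean score per candidate
```

Printing best_params_ for each of grid_dt, grid_rf, grid_gbr and rand_svr would show which values were actually selected, which the notebook does not currently report.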


In [121]:
final_models = {"Linear Regression": best_lr, "Decision Tree": best_dt, "Random Forest": best_rf, "Gradient Boosting": best_gbr, "SVR": best_svr}

In [122]:
def evaluate_model(y_true, y_pred):
    return {
        "MSE": mean_squared_error(y_true, y_pred),
        "MAE": mean_absolute_error(y_true, y_pred),
        "R2": r2_score(y_true, y_pred)
    }

In [123]:
results = []

for name, model in final_models.items():
    model.fit(X_train, y_train)  # optional: best_estimator_ is already refit on X_train (refit=True is the default)
    y_pred = model.predict(X_test)
    metrics = evaluate_model(y_test, y_pred)
    metrics["Model"] = name
    results.append(metrics)

Hyperparameters are settings chosen before training that control how a model learns; hyperparameter tuning searches for the values that give the best validation performance.

Hyperparameter optimization was performed using GridSearchCV and RandomizedSearchCV to improve model generalization and reduce overfitting. For the Decision Tree and Random Forest, max_depth and min_samples_split (and, for the Decision Tree, min_samples_leaf) control model complexity and prevent overly deep trees, leading to more stable predictions.

For Random Forest and Gradient Boosting, n_estimators sets the number of trees, and in Gradient Boosting learning_rate scales each tree's contribution and so trades bias against variance. On the test set, tuning raised Gradient Boosting's R2 from 0.762 to 0.825, while Random Forest changed only slightly (0.817 to 0.818).

For SVR, C, gamma, and epsilon govern regularization strength, kernel width, and error tolerance. Tuning lifted its R2 from -0.042 to 0.336, the largest gain of any model, yet it remained the weakest model and is computationally expensive on a dataset of this size. The Decision Tree also improved markedly (0.628 to 0.724) and Linear Regression was unchanged. After tuning, the two ensemble models, Random Forest and Gradient Boosting, gave the best test-set results.
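One option the notebook does not try for SVR is scaling the target as well as the features, since C and epsilon are expressed in the units of y. The sketch below is an assumption about what might help, shown on synthetic data with a price-like scale; it uses scikit-learn's TransformedTargetRegressor to standardize y during fitting and convert predictions back.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.metrics import r2_score
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(300, 1))
# Target on a housing-price-like scale (synthetic, for illustration only).
y_demo = 200_000 + 80_000 * np.sin(X_demo[:, 0]) + rng.normal(0, 5_000, 300)
X_tr, X_te, y_tr, y_te = X_demo[:240], X_demo[240:], y_demo[:240], y_demo[240:]

# Default SVR fitted directly on the raw target.
raw_svr = SVR(kernel="rbf").fit(X_tr, y_tr)

# Standardize y before fitting and map predictions back to the original units.
scaled_svr = TransformedTargetRegressor(regressor=SVR(kernel="rbf"), transformer=StandardScaler())
scaled_svr.fit(X_tr, y_tr)

r2_raw = r2_score(y_te, raw_svr.predict(X_te))
r2_scaled = r2_score(y_te, scaled_svr.predict(X_te))
print(r2_raw, r2_scaled)
```

Whether this closes the gap to the ensemble models on the housing data is untested here; widening the search range for C would be another way to address the same scale mismatch.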

In [124]:
results_df = pd.DataFrame(results)
results_df = results_df[["Model", "MSE", "MAE", "R2"]]
results_df

Unnamed: 0,Model,MSE,MAE,R2
0,Linear Regression,4908477000.0,50670.738241,0.625424
1,Decision Tree,3618843000.0,40086.199205,0.723839
2,Random Forest,2386450000.0,31495.690854,0.817885
3,Gradient Boosting,2298815000.0,32062.239272,0.824573
4,SVR,8698806000.0,64030.206074,0.336176


From the table, Gradient Boosting is the best algorithm for this dataset after tuning: it has the lowest mean squared error and the highest R2 score (0.825), although Random Forest has a slightly lower mean absolute error (31,496 versus 32,062).