In [158]:
from sklearn.datasets import fetch_california_housing
import pandas as pd

In [159]:
housing = fetch_california_housing()

In [160]:
X = pd.DataFrame(housing.data, columns=housing.feature_names)
y = pd.Series(housing.target, name="MedHouseValue")

In [161]:
df = pd.concat([X, y], axis=1)
df.head()

,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,MedHouseValue
0,8.3252,41.0,6.984127,1.02381,322.0,2.555556,37.88,-122.23,4.526
1,8.3014,21.0,6.238137,0.97188,2401.0,2.109842,37.86,-122.22,3.585
2,7.2574,52.0,8.288136,1.073446,496.0,2.80226,37.85,-122.24,3.521
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25,3.413
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25,3.422


In [162]:
df.shape

(20640, 9)

In [163]:
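from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

def evaluate_model(y_true, y_pred):
    """Return MSE, MAE, and R2 for one set of predictions."""
    return {
        "MSE": mean_squared_error(y_true, y_pred),
        "MAE": mean_absolute_error(y_true, y_pred),
        "R2": r2_score(y_true, y_pred)
    }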
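
For reference, the three metrics returned by evaluate_model can be reproduced by hand. The following is a minimal sketch using only plain Python; the helper names and the toy values are illustrative assumptions, not part of the notebook.

```python
def mse(y_true, y_pred):
    """Mean squared error: average of squared differences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error: average of absolute differences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """R-squared: 1 minus residual sum of squares over total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical toy values, chosen only to make the arithmetic easy to follow
y_true_toy = [1.0, 2.0, 3.0, 4.0]
y_pred_toy = [1.5, 2.0, 2.5, 4.0]

toy_metrics = {
    "MSE": mse(y_true_toy, y_pred_toy),
    "MAE": mae(y_true_toy, y_pred_toy),
    "R2": r2(y_true_toy, y_pred_toy),
}
print(toy_metrics)
```

MSE penalizes large errors more heavily than MAE because the errors are squared, while R2 compares the model against always predicting the mean of the target.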

In [164]:
df.isna().sum()

MedInc,0
HouseAge,0
AveRooms,0
AveBedrms,0
Population,0
AveOccup,0
Latitude,0
Longitude,0
MedHouseValue,0


In [165]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [166]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [167]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

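In the preprocessing steps, the dataset is first converted to a pandas DataFrame. Its shape, (20640, 9), is checked to get a sense of the size of the data, and the null-value count shows that no column contains missing values. The data is then split 80/20 into training and test sets, and the features are standardized with StandardScaler, which is fitted on the training set only and then applied unchanged to the test set. Standardization gives every feature zero mean and unit variance, so that scale-sensitive models such as SVR are not dominated by features with large numeric ranges such as Population.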
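
The scaling step can be illustrated without scikit-learn. The sketch below standardizes a single feature by hand, estimating the mean and standard deviation from the training values only and reusing them for the test values, which mirrors the fit_transform / transform pattern above. The function names and toy numbers are assumptions made for illustration.

```python
from statistics import fmean, pstdev


def fit_standardizer(column):
    """Estimate mean and population standard deviation from training values."""
    return fmean(column), pstdev(column)


def standardize(column, mean, std):
    """Apply z = (x - mean) / std using previously fitted statistics."""
    return [(x - mean) / std for x in column]


# Hypothetical toy values standing in for one feature column
train_feature = [2.0, 4.0, 6.0, 8.0]
test_feature = [5.0, 10.0]

mean, std = fit_standardizer(train_feature)          # statistics come from train only
train_scaled = standardize(train_feature, mean, std)
test_scaled = standardize(test_feature, mean, std)   # test reuses the train statistics

print(train_scaled)
print(test_scaled)
```

Fitting the statistics on the training set alone keeps information about the test set from leaking into preprocessing.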

In [168]:
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

In [169]:
from sklearn.linear_model import LinearRegression

lr = LinearRegression()
lr.fit(X_train_scaled, y_train)
y_pred_lr = lr.predict(X_test_scaled)

In [170]:
Lr_evaluate=evaluate_model(y_test, y_pred_lr)
print("Linear Regression",Lr_evaluate)

Linear Regression {'MSE': 0.5558915986952442, 'MAE': 0.5332001304956565, 'R2': 0.575787706032451}


Linear Regression models the target as a linear function of the input features: the predicted median house value is a weighted sum of all features plus a bias (intercept) term. The model learns these weights by minimizing the mean squared error between predicted and actual values.

Linear regression is suitable because some features, such as median income and average number of rooms, show a roughly linear relationship with house values, and because it serves as a simple, interpretable baseline against which more complex regression models can be compared.
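
The weighted-sum form described above can be sketched for the single-feature case, where the least-squares weight and bias have a simple closed form. This is an illustrative sketch with made-up numbers, not the solver that LinearRegression uses internally.

```python
def fit_simple_linear(x, y):
    """Least-squares fit of y = w * x + b for a single feature."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    covariance = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    variance = sum((xi - x_mean) ** 2 for xi in x)
    w = covariance / variance
    b = y_mean - w * x_mean
    return w, b


def predict_linear(features, weights, bias):
    """General form: weighted sum of all features plus a bias term."""
    return sum(wi * fi for wi, fi in zip(weights, features)) + bias


# Hypothetical toy data: one feature and a roughly linear target
x_toy = [1.0, 2.0, 3.0, 4.0]
y_toy = [3.1, 4.9, 7.2, 8.8]

w, b = fit_simple_linear(x_toy, y_toy)
prediction = predict_linear([5.0], [w], b)
print(w, b, prediction)
```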

In [171]:
from sklearn.tree import DecisionTreeRegressor

dt = DecisionTreeRegressor(random_state=42)
dt.fit(X_train, y_train)
y_pred_dt = dt.predict(X_test)

In [172]:
Dt_evaluate=evaluate_model(y_test,y_pred_dt)
print("Decision Tree Regression",Dt_evaluate)

Decision Tree Regression {'MSE': 0.495235205629094, 'MAE': 0.45467918846899225, 'R2': 0.622075845135081}


Decision Tree Regressor works by repeatedly splitting the dataset into smaller subsets on feature thresholds chosen to minimize prediction error (squared error by default), creating a tree of decision rules. The prediction for a new sample is the average of the training target values in the leaf node it reaches.

This model is suitable for the California Housing dataset because housing prices depend on complex, non-linear interactions between variables such as location, income, and population. Decision trees can capture these interactions without feature scaling, which is why the tree is trained on the unscaled X_train.
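
A single split of the kind described above can be sketched as follows: every threshold between neighbouring feature values is tried, and the one with the lowest total squared error is kept. The helper names and toy data are assumptions for illustration; a real tree repeats this search recursively over all features.

```python
def sum_squared_error(values):
    """Squared error of predicting the mean for every value in a leaf."""
    leaf_mean = sum(values) / len(values)
    return sum((v - leaf_mean) ** 2 for v in values)


def best_split(x, y):
    """Return (threshold, left_mean, right_mean) minimizing total squared error."""
    unique_x = sorted(set(x))
    best = None
    for low, high in zip(unique_x, unique_x[1:]):
        threshold = (low + high) / 2
        left = [yi for xi, yi in zip(x, y) if xi <= threshold]
        right = [yi for xi, yi in zip(x, y) if xi > threshold]
        cost = sum_squared_error(left) + sum_squared_error(right)
        if best is None or cost < best[0]:
            best = (cost, threshold, sum(left) / len(left), sum(right) / len(right))
    return best[1:]


# Hypothetical toy data: low-income blocks are cheap, high-income blocks expensive
income_toy = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]
value_toy = [1.0, 1.2, 1.1, 3.0, 3.2, 3.1]

threshold, left_value, right_value = best_split(income_toy, value_toy)
print(threshold, left_value, right_value)
```

Each leaf predicts the mean of its training targets, so a sample with income below the threshold gets left_value and any other sample gets right_value.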

In [173]:
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)

In [174]:
Rf_evaluate=evaluate_model(y_test,y_pred_rf)
print("Random Forest Regression",Rf_evaluate)

Random Forest Regression {'MSE': 0.2553684927247781, 'MAE': 0.32754256845930246, 'R2': 0.8051230593157366}


Random Forest Regressor works by training many decision trees, each on a bootstrap sample of the training data and, optionally, on a random subset of features at each split (scikit-learn's regressor considers all features by default). Their predictions are averaged, which reduces variance and improves generalization compared to a single decision tree.

A random forest can model highly non-linear relationships, is fairly robust to noise and outliers, and tends to perform well on structured tabular data with multiple interacting features.
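
The bootstrap-and-average idea can be sketched with depth-1 trees (stumps) as base learners. This is a simplified illustration under stated assumptions: one feature, no feature subsampling, and hypothetical helper names and data. It is not how RandomForestRegressor is implemented.

```python
import random


def fit_stump(x, y):
    """Fit a depth-1 regression tree and return it as a predict function."""
    overall_mean = sum(y) / len(y)
    unique_x = sorted(set(x))
    best = None
    for low, high in zip(unique_x, unique_x[1:]):
        thr = (low + high) / 2
        left = [yi for xi, yi in zip(x, y) if xi <= thr]
        right = [yi for xi, yi in zip(x, y) if xi > thr]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        cost = sum((v - lm) ** 2 for v in left) + sum((v - rm) ** 2 for v in right)
        if best is None or cost < best[0]:
            best = (cost, thr, lm, rm)
    if best is None:  # bootstrap sample had a single distinct x value
        return lambda q: overall_mean
    _, thr, lm, rm = best
    return lambda q: lm if q <= thr else rm


def fit_bagged_stumps(x, y, n_trees, seed):
    """Train each stump on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(x)
    stumps = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        stumps.append(fit_stump([x[i] for i in idx], [y[i] for i in idx]))
    return stumps


def predict_bagged(stumps, q):
    """Average the predictions of all stumps."""
    return sum(stump(q) for stump in stumps) / len(stumps)


# Hypothetical toy data
x_toy = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]
y_toy = [1.0, 1.1, 1.2, 3.0, 3.1, 3.2]

stumps = fit_bagged_stumps(x_toy, y_toy, n_trees=25, seed=42)
pred_low = predict_bagged(stumps, 1.0)
pred_high = predict_bagged(stumps, 8.0)
print(pred_low, pred_high)
```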

In [175]:
from sklearn.ensemble import GradientBoostingRegressor

gb = GradientBoostingRegressor(random_state=42)
gb.fit(X_train, y_train)
y_pred_gb = gb.predict(X_test)

In [176]:
Gb_evaluate=evaluate_model(y_test,y_pred_gb)
print("Gradient Boosting Regression",Gb_evaluate)

Gradient Boosting Regression {'MSE': 0.2939973248643864, 'MAE': 0.3716425690425596, 'R2': 0.7756446042829697}


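Gradient Boosting Regressor works by building shallow decision trees sequentially. Each new tree is fitted to the negative gradient of the loss with respect to the current predictions (for squared error, simply the residuals), and its output is added to the ensemble scaled by a learning rate, so the prediction error shrinks step by step.

It can learn complex patterns and subtle feature interactions, and often achieves high predictive performance on housing price prediction tasks when properly tuned. The model here uses the default hyperparameters, so tuning may improve on the reported scores.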
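
The sequential error-correction described above can be sketched for squared-error loss, where the negative gradient is the residual. The sketch below uses stumps on a single feature; the function names, toy data, and learning rate are assumptions for illustration and do not reflect the internals of GradientBoostingRegressor.

```python
def fit_stump(x, targets):
    """Fit a depth-1 regression tree and return it as a predict function."""
    unique_x = sorted(set(x))
    best = None
    for low, high in zip(unique_x, unique_x[1:]):
        thr = (low + high) / 2
        left = [t for xi, t in zip(x, targets) if xi <= thr]
        right = [t for xi, t in zip(x, targets) if xi > thr]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        cost = sum((v - lm) ** 2 for v in left) + sum((v - rm) ** 2 for v in right)
        if best is None or cost < best[0]:
            best = (cost, thr, lm, rm)
    _, thr, lm, rm = best
    return lambda q: lm if q <= thr else rm


def fit_boosting(x, y, n_rounds, learning_rate):
    """Start from the mean, then repeatedly fit a stump to the residuals."""
    base = sum(y) / len(y)
    preds = [base] * len(y)
    stumps = []
    history = [sum((yi - pi) ** 2 for yi, pi in zip(y, preds)) / len(y)]
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, preds)]  # negative gradient
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        preds = [pi + learning_rate * stump(xi) for xi, pi in zip(x, preds)]
        history.append(sum((yi - pi) ** 2 for yi, pi in zip(y, preds)) / len(y))
    return base, stumps, history


def predict_boosted(base, stumps, learning_rate, q):
    """Sum the base value and all scaled stump corrections."""
    return base + learning_rate * sum(stump(q) for stump in stumps)


# Hypothetical toy data with at least two distinct x values
x_toy = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]
y_toy = [1.0, 1.1, 1.2, 3.0, 3.1, 3.2]

base, stumps, history = fit_boosting(x_toy, y_toy, n_rounds=20, learning_rate=0.5)
print(history[0], history[-1])
```

The history list records the training MSE before any tree and after each round, which makes the gradual improvement visible.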

In [177]:
from sklearn.svm import SVR

svr = SVR(kernel='rbf')
svr.fit(X_train_scaled, y_train)
y_pred_svr = svr.predict(X_test_scaled)

In [178]:
Sv_evaluate=evaluate_model(y_test,y_pred_svr)
print("Support Vector Regression",Sv_evaluate)

Support Vector Regression {'MSE': 0.357004031933865, 'MAE': 0.39859907695205365, 'R2': 0.7275628923016773}


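Support Vector Regressor works by finding a function that fits the data within a specified error margin (epsilon) while minimizing model complexity: errors smaller than epsilon are ignored, and larger errors are penalized linearly. By using kernel functions such as the RBF kernel chosen here, it can implicitly transform the input space to handle non-linear relationships.

Support Vector Regressor performs well on scaled numerical data, which is why it is trained on X_train_scaled, and can capture non-linear trends in housing prices. Because its loss grows linearly rather than quadratically with the error, it is less sensitive to outliers than squared-error methods.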
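
The two ingredients mentioned above, the epsilon-insensitive loss and the RBF kernel, can be written out directly. This is a sketch of the formulas only, with illustrative function names and values; it does not solve the SVR optimization problem.

```python
import math


def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Zero inside the epsilon tube, linear in the error outside it."""
    return max(0.0, abs(y_true - y_pred) - epsilon)


def rbf_kernel(a, b, gamma):
    """RBF similarity: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * squared_distance)


# Errors inside the tube cost nothing; errors outside grow linearly
small_error_loss = epsilon_insensitive_loss(3.0, 3.05, epsilon=0.1)
large_error_loss = epsilon_insensitive_loss(3.0, 5.0, epsilon=0.1)
squared_large_error = (3.0 - 5.0) ** 2  # squared loss for comparison

# Identical points have similarity 1; distant points approach 0
same_point = rbf_kernel([0.5, -1.0], [0.5, -1.0], gamma=0.5)
far_point = rbf_kernel([0.5, -1.0], [4.0, 3.0], gamma=0.5)

print(small_error_loss, large_error_loss, squared_large_error)
print(same_point, far_point)
```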

**Performance Analysis**

In [180]:
Performance= [Lr_evaluate,Dt_evaluate,Rf_evaluate,Gb_evaluate,Sv_evaluate]
for i in Performance:
  print(i)

{'MSE': 0.5558915986952442, 'MAE': 0.5332001304956565, 'R2': 0.575787706032451}
{'MSE': 0.495235205629094, 'MAE': 0.45467918846899225, 'R2': 0.622075845135081}
{'MSE': 0.2553684927247781, 'MAE': 0.32754256845930246, 'R2': 0.8051230593157366}
{'MSE': 0.2939973248643864, 'MAE': 0.3716425690425596, 'R2': 0.7756446042829697}
{'MSE': 0.357004031933865, 'MAE': 0.39859907695205365, 'R2': 0.7275628923016773}


Linear Regression has the highest Mean Squared Error (0.556) and Random Forest the lowest (0.255).

Linear Regression also has the highest Mean Absolute Error (0.533) and Random Forest the lowest (0.328).

Random Forest has the highest R-squared (0.805) and Linear Regression the lowest (0.576).

A better model has lower MSE and MAE and a higher R-squared. All three metrics give the same ranking: Random Forest, Gradient Boosting, Support Vector Regression, Decision Tree, Linear Regression.

Thus,

Best Performance----> Random Forest
Worst Performance----> Linear Regression
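
The comparison can also be done programmatically. The sketch below copies the metric values printed above into a dictionary keyed by model name (the dictionary and variable names are assumptions for illustration) and sorts the models by each metric. It also reports the root of the MSE, which is in the same units as the target (hundreds of thousands of dollars).

```python
import math

# Metric values copied from the evaluation output above
results = {
    "Linear Regression": {"MSE": 0.5558915986952442, "MAE": 0.5332001304956565, "R2": 0.575787706032451},
    "Decision Tree": {"MSE": 0.495235205629094, "MAE": 0.45467918846899225, "R2": 0.622075845135081},
    "Random Forest": {"MSE": 0.2553684927247781, "MAE": 0.32754256845930246, "R2": 0.8051230593157366},
    "Gradient Boosting": {"MSE": 0.2939973248643864, "MAE": 0.3716425690425596, "R2": 0.7756446042829697},
    "Support Vector Regression": {"MSE": 0.357004031933865, "MAE": 0.39859907695205365, "R2": 0.7275628923016773},
}

# Lower is better for MSE and MAE; higher is better for R2
by_mse = sorted(results, key=lambda name: results[name]["MSE"])
by_mae = sorted(results, key=lambda name: results[name]["MAE"])
by_r2 = sorted(results, key=lambda name: results[name]["R2"], reverse=True)

# RMSE is in target units, which makes the error easier to interpret
rmse = {name: math.sqrt(scores["MSE"]) for name, scores in results.items()}

for rank, name in enumerate(by_mse, start=1):
    print(rank, name, round(rmse[name], 3))
```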