In [3]:
# 1. Loading and Preprocessing

In [4]:
from sklearn.datasets import fetch_california_housing
import pandas as pd
from sklearn.preprocessing import StandardScaler
import numpy as np

In [5]:
# Load the dataset
data = fetch_california_housing()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['MedHouseVal'] = data.target

In [6]:
# Check for missing values
missing = df.isnull().sum()

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   MedInc       20640 non-null  float64
 1   HouseAge     20640 non-null  float64
 2   AveRooms     20640 non-null  float64
 3   AveBedrms    20640 non-null  float64
 4   Population   20640 non-null  float64
 5   AveOccup     20640 non-null  float64
 6   Latitude     20640 non-null  float64
 7   Longitude    20640 non-null  float64
 8   MedHouseVal  20640 non-null  float64
dtypes: float64(9)
memory usage: 1.4 MB


In [8]:
df.head()

,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,MedHouseVal
0,8.3252,41.0,6.984127,1.02381,322.0,2.555556,37.88,-122.23,4.526
1,8.3014,21.0,6.238137,0.97188,2401.0,2.109842,37.86,-122.22,3.585
2,7.2574,52.0,8.288136,1.073446,496.0,2.80226,37.85,-122.24,3.521
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25,3.413
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25,3.422


In [9]:
print(df.describe())

             MedInc      HouseAge      AveRooms     AveBedrms    Population  \
count  20640.000000  20640.000000  20640.000000  20640.000000  20640.000000   
mean       3.870671     28.639486      5.429000      1.096675   1425.476744   
std        1.899822     12.585558      2.474173      0.473911   1132.462122   
min        0.499900      1.000000      0.846154      0.333333      3.000000   
25%        2.563400     18.000000      4.440716      1.006079    787.000000   
50%        3.534800     29.000000      5.229129      1.048780   1166.000000   
75%        4.743250     37.000000      6.052381      1.099526   1725.000000   
max       15.000100     52.000000    141.909091     34.066667  35682.000000   

           AveOccup      Latitude     Longitude   MedHouseVal  
count  20640.000000  20640.000000  20640.000000  20640.000000  
mean       3.070655     35.631861   -119.569704      2.068558  
std       10.386050      2.135952      2.003532      1.153956  
min        0.692308     32.54000

In [10]:
X = df.drop('MedHouseVal', axis=1)
y = df['MedHouseVal']

# Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

Converting the dataset to a pandas DataFrame makes it easier to handle, inspect, and manipulate.

It’s important to verify data quality before training models. The df.info() output shows 20640 non-null entries in every column, so no imputation is needed here; the check still guards against silent gaps if the data source changes.
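The check above stores the per-column counts without displaying them. A minimal sketch of how the counts could be inspected, using a hypothetical toy frame (`toy`) with one planted NaN rather than the housing data:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df; the NaN is planted on purpose.
toy = pd.DataFrame({
    "MedInc": [8.3, np.nan, 7.2],
    "HouseAge": [41.0, 21.0, 52.0],
})

missing = toy.isnull().sum()       # per-column count of missing cells
print(missing[missing > 0])        # show only the columns that have gaps

# Single flag that is True if any cell in the frame is missing.
has_missing = bool(toy.isnull().values.any())
```

On the housing DataFrame the same expressions should report zero missing cells in every column, matching the df.info() output.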

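Scale-sensitive models such as SVR, whose kernel is computed from distances between samples, perform better when features are on a similar scale. Tree-based models (Decision Tree, Random Forest, Gradient Boosting) split on per-feature thresholds and are largely unaffected by scaling.

Standardization rescales each feature to zero mean and unit variance. This improves convergence in optimization and keeps large-magnitude features such as Population from dominating small-magnitude ones such as AveBedrms.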
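The cells in this notebook fit the scaler on the full feature matrix before splitting. A common refinement, shown here only as a sketch on synthetic data (`X_demo` and `y_demo` are made-up stand-ins), is to fit the scaler on the training rows and reuse those statistics for the test rows, so no test-set information leaks into preprocessing:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for X: two features on very different scales.
X_demo = np.column_stack([rng.normal(3.9, 1.9, 500), rng.normal(1400, 1100, 500)])
y_demo = rng.normal(size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

scaler = StandardScaler().fit(X_tr)   # statistics come from the training rows only
X_tr_s = scaler.transform(X_tr)
X_te_s = scaler.transform(X_te)       # test rows reuse the training mean and std

# StandardScaler computes z = (x - mean) / std with the population std (ddof=0).
manual = (X_tr - X_tr.mean(axis=0)) / X_tr.std(axis=0)
```

The `manual` array should match `X_tr_s`, which makes the transformation explicit. The scaled test rows will generally not have exactly zero mean, and that is expected.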

In [11]:
# 2. Regression Algorithm Implementation

In [12]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR

In [13]:
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

In [14]:
# A. Linear Regression
lr = LinearRegression()
lr.fit(X_train, y_train)

Linear Regression assumes a linear relationship between the input features and the target variable. It fits a straight line (or hyperplane in multidimensional space) to minimize the sum of squared residuals (errors).

Suitability:
This model serves as a good baseline. It's efficient and interpretable but may underperform on complex, non-linear data like housing prices, which can depend on interactions between features.
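To make "minimize the sum of squared residuals" concrete, the sketch below solves the least-squares problem directly with NumPy and compares the result with LinearRegression. The data and the names `X_demo`, `y_demo`, and `true_w` are synthetic assumptions, not part of the housing workflow:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
true_w = np.array([1.5, -2.0, 0.5])
y_demo = X_demo @ true_w + 4.0 + rng.normal(scale=0.1, size=200)

# Append a column of ones so the intercept is estimated with the weights.
A = np.column_stack([X_demo, np.ones(len(X_demo))])
coef, *_ = np.linalg.lstsq(A, y_demo, rcond=None)

model = LinearRegression().fit(X_demo, y_demo)
print(coef[:3], model.coef_)        # weights from both routes
print(coef[3], model.intercept_)    # intercept from both routes
```

Both routes should give the same coefficients up to floating-point error, since each solves the same ordinary least-squares problem.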

In [15]:
# B. Decision Tree Regressor
dt = DecisionTreeRegressor(random_state=42)
dt.fit(X_train, y_train)

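A decision tree splits the dataset into branches using feature thresholds that reduce variance in the target variable. It continues splitting until it reaches a stopping condition (e.g., max depth or minimum samples per leaf). With scikit-learn's defaults, as used here, no depth limit is set, so the tree keeps growing until its leaves are pure or too small to split.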

Suitability:
Captures non-linear relationships and interactions between features. However, it is prone to overfitting, especially if not properly pruned or regularized.
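The overfitting risk can be seen by comparing an unconstrained tree with a depth-limited one. This is a sketch on synthetic noisy data; the settings `max_depth=4` and `min_samples_leaf=10` are illustrative choices, not tuned values:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(400, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.3, size=400)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)

# Default settings: grow until every leaf is pure.
full = DecisionTreeRegressor(random_state=42).fit(X_tr, y_tr)
# Regularized tree: capped depth and a minimum leaf size.
pruned = DecisionTreeRegressor(max_depth=4, min_samples_leaf=10, random_state=42).fit(X_tr, y_tr)

for name, tree in [("unconstrained", full), ("depth-limited", pruned)]:
    # score() returns R^2; a large train/test gap signals overfitting.
    print(name, tree.get_depth(), round(tree.score(X_tr, y_tr), 3), round(tree.score(X_te, y_te), 3))
```

The unconstrained tree memorizes the training noise (training R² of essentially 1), so the gap between its training and test scores is the quantity to watch.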

In [16]:
# C. Random Forest Regressor
rf = RandomForestRegressor(random_state=42)
rf.fit(X_train, y_train)

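Random Forest is an ensemble of multiple decision trees. Each tree is trained on a bootstrap sample of the rows (bagging), and a random subset of features can be considered at each split. Predictions are averaged across all trees to reduce variance. Note that scikit-learn's RandomForestRegressor considers all features at each split by default, so with the settings used here the randomness comes from the bootstrap samples.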

Suitability:
More robust and accurate than a single decision tree. Handles non-linearity well and is less prone to overfitting due to averaging.
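The averaging step can be checked directly: a fitted forest exposes its trees through `estimators_`, and the forest's prediction is the mean of the per-tree predictions. A small sketch on synthetic data (`X_demo`, `y_demo`, and `n_estimators=25` are illustrative assumptions):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
y_demo = X_demo[:, 0] * X_demo[:, 1] + rng.normal(scale=0.1, size=300)

forest = RandomForestRegressor(n_estimators=25, random_state=42).fit(X_demo, y_demo)

# Each fitted tree lives in forest.estimators_; stack their predictions row-wise.
per_tree = np.stack([tree.predict(X_demo[:5]) for tree in forest.estimators_])
averaged = per_tree.mean(axis=0)

print(averaged)
print(forest.predict(X_demo[:5]))
```

Individual rows of `per_tree` vary noticeably from tree to tree; the mean is what smooths that variance out.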

In [17]:
# D. Gradient Boosting Regressor
gbr = GradientBoostingRegressor(random_state=42)
gbr.fit(X_train, y_train)

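Gradient Boosting builds trees sequentially, where each new tree tries to correct the errors made by the ensemble so far. Each tree is fitted to the negative gradient of the loss (for squared error, the current residuals), and its output is added with a small learning rate, which amounts to gradient descent in function space.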

Suitability:
Highly effective for structured/tabular data. Performs well with moderate-sized datasets and complex feature interactions. Often achieves top performance in regression tasks.
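The sequential error-correction idea can be written out by hand for squared-error loss. The loop below is a simplified sketch, not the internals of GradientBoostingRegressor; the learning rate, tree depth, and number of rounds are arbitrary illustrative values:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(300, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=300)

learning_rate = 0.1
pred = np.full_like(y_demo, y_demo.mean())   # stage 0: predict the mean
train_mse = [np.mean((y_demo - pred) ** 2)]

for _ in range(50):
    residual = y_demo - pred                 # negative gradient of 0.5 * squared error
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_demo, residual)
    pred += learning_rate * tree.predict(X_demo)   # small step toward the residuals
    train_mse.append(np.mean((y_demo - pred) ** 2))

print(train_mse[0], train_mse[-1])
```

Training error shrinks with every round, which is also why boosting needs a cap on rounds or a validation set to avoid overfitting.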

In [18]:
# E. Support Vector Regressor
svr = SVR()
svr.fit(X_train, y_train)

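SVR tries to find a function that fits the data within a specified margin of tolerance (epsilon): errors smaller than epsilon are ignored, and only points outside this tube are penalized. It uses kernel functions to model non-linear relationships and keeps the function as flat as possible, which plays the role of margin maximization in classification.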

Suitability:
Works well with small to medium datasets and clean, scaled data. However, its training time grows more than quadratically with the number of samples, so it becomes slow on larger datasets like this one, and it is sensitive to scaling and parameter tuning.
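The "margin of tolerance" corresponds to the epsilon-insensitive loss. A minimal NumPy sketch of that loss follows; the helper name `epsilon_insensitive_loss` is made up for illustration, and `epsilon=0.1` mirrors scikit-learn's SVR default:

```python
import numpy as np

def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Zero inside the epsilon tube, linear in the excess error outside it."""
    return np.maximum(0.0, np.abs(y_true - y_pred) - epsilon)

y_true = np.array([1.00, 1.00, 1.00, 1.00])
y_pred = np.array([1.05, 0.92, 1.30, 0.50])

# The first two predictions fall inside the tube and cost nothing;
# the last two are charged only for the part of the error beyond epsilon.
losses = epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1)
print(losses)
```

Because errors are measured in the units of the target and distances in the units of the features, this loss is one reason SVR is sensitive to scaling.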

In [19]:
# 3. Model Evaluation and Comparison

def evaluate_model(model, name):
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mae = mean_absolute_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    return {'Model': name, 'MSE': mse, 'MAE': mae, 'R2': r2}
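For reference, the three metrics returned by evaluate_model can be reproduced from their definitions. The arrays below are small made-up values chosen so the arithmetic is easy to follow:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, 1.5, 2.0, 4.5])
y_hat = np.array([2.5, 1.0, 2.0, 5.0])

err = y_true - y_hat
mse_manual = np.mean(err ** 2)          # mean of squared errors
mae_manual = np.mean(np.abs(err))       # mean of absolute errors
# R^2 = 1 - (residual sum of squares) / (total sum of squares around the mean)
r2_manual = 1.0 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(mse_manual, mean_squared_error(y_true, y_hat))
print(mae_manual, mean_absolute_error(y_true, y_hat))
print(r2_manual, r2_score(y_true, y_hat))
```

MSE penalizes large errors more heavily than MAE, while R² compares the model against always predicting the mean, so the three can rank models differently.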

In [20]:
results = []
models = [(lr, 'Linear Regression'), (dt, 'Decision Tree'), (rf, 'Random Forest'),
          (gbr, 'Gradient Boosting'), (svr, 'SVR')]

for model, name in models:
    results.append(evaluate_model(model, name))

results_df = pd.DataFrame(results)
results_df.sort_values('R2', ascending=False)

,Model,MSE,MAE,R2
2,Random Forest,0.255498,0.327613,0.805024
3,Gradient Boosting,0.293999,0.37165,0.775643
4,SVR,0.355198,0.397763,0.728941
1,Decision Tree,0.494272,0.453784,0.622811
0,Linear Regression,0.555892,0.5332,0.575788


Best-Performing Algorithm: Random Forest Regressor
Highest R² Score (0.805): Indicates the model explains the most variance in the target variable.
Lowest MSE (0.255) and MAE (0.328): Shows it makes the most accurate predictions on average.

Worst-Performing Algorithm: Linear Regression
Lowest R² Score (0.576): Indicates it explains the least variance in the target.
Highest MSE (0.556) and MAE (0.533): Larger prediction errors than the other models.
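Reading the best and worst models off the table can also be done programmatically, which avoids transcription slips. The sketch below rebuilds the results table from the values printed above and uses idxmax and idxmin to pick the rows:

```python
import pandas as pd

# Metrics copied from the results table above.
results_df = pd.DataFrame({
    "Model": ["Linear Regression", "Decision Tree", "Random Forest", "Gradient Boosting", "SVR"],
    "MSE": [0.555892, 0.494272, 0.255498, 0.293999, 0.355198],
    "MAE": [0.533200, 0.453784, 0.327613, 0.371650, 0.397763],
    "R2": [0.575788, 0.622811, 0.805024, 0.775643, 0.728941],
})

best = results_df.loc[results_df["R2"].idxmax(), "Model"]       # highest R^2
worst = results_df.loc[results_df["R2"].idxmin(), "Model"]      # lowest R^2
lowest_mse = results_df.loc[results_df["MSE"].idxmin(), "Model"]
print(best, worst, lowest_mse)
```

In the notebook itself the same three lines can be run on the existing results_df without rebuilding it.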