# 1.Loading and Preprocessing

1.Loading the dataset

In [1]:
from sklearn.datasets import fetch_california_housing
import pandas as pd

In [3]:
# To load dataset
data=fetch_california_housing()
data

{'data': array([[   8.3252    ,   41.        ,    6.98412698, ...,    2.55555556,
           37.88      , -122.23      ],
        [   8.3014    ,   21.        ,    6.23813708, ...,    2.10984183,
           37.86      , -122.22      ],
        [   7.2574    ,   52.        ,    8.28813559, ...,    2.80225989,
           37.85      , -122.24      ],
        ...,
        [   1.7       ,   17.        ,    5.20554273, ...,    2.3256351 ,
           39.43      , -121.22      ],
        [   1.8672    ,   18.        ,    5.32951289, ...,    2.12320917,
           39.43      , -121.32      ],
        [   2.3886    ,   16.        ,    5.25471698, ...,    2.61698113,
           39.37      , -121.24      ]]),
 'target': array([4.526, 3.585, 3.521, ..., 0.923, 0.847, 0.894]),
 'frame': None,
 'target_names': ['MedHouseVal'],
 'feature_names': ['MedInc',
  'HouseAge',
  'AveRooms',
  'AveBedrms',
  'Population',
  'AveOccup',
  'Latitude',
  'Longitude'],
 'DESCR': '.. _california_housing_dataset:\n

2.Convert to a Pandas DataFrame

In [5]:
# Convert to DataFrame
df = pd.DataFrame(data.data, columns=data.feature_names)

# Adding the target variable to the DataFrame
df['MedHouseValue']=data.target
df

,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,MedHouseValue
0,8.3252,41.0,6.984127,1.023810,322.0,2.555556,37.88,-122.23,4.526
1,8.3014,21.0,6.238137,0.971880,2401.0,2.109842,37.86,-122.22,3.585
2,7.2574,52.0,8.288136,1.073446,496.0,2.802260,37.85,-122.24,3.521
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25,3.413
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25,3.422
...,...,...,...,...,...,...,...,...,...
20635,1.5603,25.0,5.045455,1.133333,845.0,2.560606,39.48,-121.09,0.781
20636,2.5568,18.0,6.114035,1.315789,356.0,3.122807,39.49,-121.21,0.771
20637,1.7000,17.0,5.205543,1.120092,1007.0,2.325635,39.43,-121.22,0.923
20638,1.8672,18.0,5.329513,1.171920,741.0,2.123209,39.43,-121.32,0.847


3.Handle Missing Values & Feature Scaling (Standardization)

In [7]:
# To check for missing values
print("Missing values in the dataset:")
print(df.isnull().sum())

Missing values in the dataset:
MedInc           0
HouseAge         0
AveRooms         0
AveBedrms        0
Population       0
AveOccup         0
Latitude         0
Longitude        0
MedHouseValue    0
dtype: int64
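The check above found no gaps, so nothing had to be handled here. As a rough sketch of what the handling step could look like if gaps had been found, the example below uses a small made-up frame. The values and the choice of median imputation are assumptions for illustration, not part of this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing HouseAge value (not the real dataset)
toy = pd.DataFrame({
    "MedInc": [8.3, 7.2, 5.6, 3.8],
    "HouseAge": [41.0, 21.0, np.nan, 52.0],
})

# Same check as in the notebook: count missing values per column
missing_counts = toy.isnull().sum()
print(missing_counts)

# One common option: fill each gap with the median of its own column
filled = toy.fillna(toy.median())
print(filled)
```

Median imputation is only one possibility; dropping the affected rows or using a model-based imputer may fit better depending on how much data is missing.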


In [9]:
# Feature Scaling using StandardScaler
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X = df.drop('MedHouseValue', axis=1)
y = df['MedHouseValue']
X_scaled = scaler.fit_transform(X)
X_scaled_df = pd.DataFrame(X_scaled, columns=X.columns)
X_scaled_df

,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude
0,2.344766,0.982143,0.628559,-0.153758,-0.974429,-0.049597,1.052548,-1.327835
1,2.332238,-0.607019,0.327041,-0.263336,0.861439,-0.092512,1.043185,-1.322844
2,1.782699,1.856182,1.155620,-0.049016,-0.820777,-0.025843,1.038503,-1.332827
3,0.932968,1.856182,0.156966,-0.049833,-0.766028,-0.050329,1.038503,-1.337818
4,-0.012881,1.856182,0.344711,-0.032906,-0.759847,-0.085616,1.038503,-1.337818
...,...,...,...,...,...,...,...,...
20635,-1.216128,-0.289187,-0.155023,0.077354,-0.512592,-0.049110,1.801647,-0.758826
20636,-0.691593,-0.845393,0.276881,0.462365,-0.944405,0.005021,1.806329,-0.818722
20637,-1.142593,-0.924851,-0.090318,0.049414,-0.369537,-0.071735,1.778237,-0.823713
20638,-1.054583,-0.845393,-0.040211,0.158778,-0.604429,-0.091225,1.778237,-0.873626


4.Explanation of the preprocessing steps

I loaded the data with fetch_california_housing() from sklearn.datasets. It is a real-world dataset that describes California districts through eight numeric features, with the median house value as the target.

I then converted the dataset to a pandas DataFrame, which simplifies data analysis, visualization, and preprocessing tasks such as checking for null values or applying transformations.

Next, I checked for missing values using df.isnull().sum(). This dataset contains no missing values, but the check is still crucial in any real-world scenario: missing values can lead to inaccurate model training or runtime errors, so detecting and handling them protects data quality and reliability.

Finally, I applied standardization using StandardScaler from sklearn.preprocessing so that every feature has a mean of 0 and a standard deviation of 1. This step matters because the features are on very different scales: Population runs into the hundreds or thousands, while AveOccup is around 2 to 3 in the rows shown above. Standardization keeps large-valued features from dominating scale-sensitive models such as SVR and helps gradient-based optimizers converge; tree-based models are largely unaffected by it.
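To make the standardization step concrete, here is a minimal hand-rolled sketch of the formula z = (x - mean) / std, using the population standard deviation as StandardScaler does. The helper names and the toy numbers are hypothetical. Unlike the notebook, which fits the scaler on the full dataset before splitting, this sketch learns the statistics from the training portion only and reuses them for the test portion, which is the stricter way to keep test information out of preprocessing.

```python
import math

def fit_standardizer(values):
    """Return the mean and population standard deviation (ddof=0) of a column."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(variance)

def standardize(values, mean, std):
    """Apply z = (x - mean) / std to every value."""
    return [(v - mean) / std for v in values]

# Hypothetical single feature, already split into train and test portions
train = [2.0, 4.0, 6.0, 8.0]
test = [10.0]

# Statistics are learned from the training portion only ...
mean, std = fit_standardizer(train)

# ... and then reused unchanged for both portions
train_scaled = standardize(train, mean, std)
test_scaled = standardize(test, mean, std)

print(train_scaled)
print(test_scaled)
```

After scaling, the training values have mean 0 and standard deviation 1; the test value is expressed in the same units but is not forced to follow those statistics.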

# 2.Regression Algorithm Implementation

In [11]:
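import numpy as np
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Hold out 20% of the rows for testing; random_state makes the split reproducible.
# Note: X_scaled was standardized on the full dataset before this split.
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

def evaluate_model(model, X_test, y_test):
    """Print the RMSE and R² of a fitted model on the test set."""
    y_pred = model.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    r2 = r2_score(y_test, y_pred)
    print(f"RMSE: {rmse:.4f}")
    print(f"R² Score: {r2:.4f}")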
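For reference, the two metrics printed by evaluate_model can be written out by hand. The sketch below is a plain-Python illustration on made-up numbers, not a replacement for the scikit-learn functions; the function names rmse and r2 are chosen here for the example.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the average squared residual."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def r2(y_true, y_pred):
    """R² = 1 - SS_res / SS_tot, the share of target variance explained."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Hypothetical targets and predictions
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 4.0]

print(f"RMSE: {rmse(y_true, y_pred):.4f}")
print(f"R² Score: {r2(y_true, y_pred):.4f}")
```

RMSE is in the units of the target, while R² is unitless and compares the model against always predicting the mean.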

1.Linear Regression

In [13]:
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(X_train, y_train)
print("Linear Regression:")
evaluate_model(lr, X_test, y_test)

Linear Regression:
RMSE: 0.7456
R² Score: 0.5758


2.Decision Tree Regressor

In [15]:
from sklearn.tree import DecisionTreeRegressor
dt = DecisionTreeRegressor(random_state=42)
dt.fit(X_train, y_train)
print("Decision Tree Regressor:")
evaluate_model(dt, X_test, y_test)

Decision Tree Regressor:
RMSE: 0.7030
R² Score: 0.6228


3.Random Forest Regressor

In [17]:
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
print("Random Forest Regressor:")
evaluate_model(rf, X_test, y_test)

Random Forest Regressor:
RMSE: 0.5055
R² Score: 0.8050


4.Gradient Boosting Regressor

In [19]:
from sklearn.ensemble import GradientBoostingRegressor
gbr = GradientBoostingRegressor(n_estimators=100, random_state=42)
gbr.fit(X_train, y_train)
print("Gradient Boosting Regressor:")
evaluate_model(gbr, X_test, y_test)

Gradient Boosting Regressor:
RMSE: 0.5422
R² Score: 0.7756


5.Support Vector Regressor (SVR)

In [21]:
from sklearn.svm import SVR
svr = SVR(kernel='rbf')  # RBF kernel; C and epsilon are left at their defaults
svr.fit(X_train, y_train)
print("Support Vector Regressor:")
evaluate_model(svr, X_test, y_test)

Support Vector Regressor:
RMSE: 0.5960
R² Score: 0.7289


6.Explanation and Suitability

1.Linear Regression

Linear Regression models the relationship between the independent variables (features) and the target (house value) by fitting a linear function, a straight line for one feature and a hyperplane for several, that minimizes the squared difference between actual and predicted values.

Linear Regression is suitable because it serves as a good baseline model for continuous numerical targets. Some features, most notably median income, show a roughly linear trend with house value, which makes linear regression an interpretable starting point.
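As a small illustration of the least-squares idea, the sketch below fits a line to a single made-up feature using the closed-form slope and intercept. The function name fit_line and the data are hypothetical; the notebook's LinearRegression does the equivalent over all eight features at once.

```python
def fit_line(x, y):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # slope = covariance(x, y) / variance(x); the 1/n factors cancel
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data that roughly follows y = 2x + 1
x = [1.0, 2.0, 3.0, 4.0]
y = [3.1, 4.9, 7.2, 8.8]

slope, intercept = fit_line(x, y)
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```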

2.Decision Tree Regressor

A Decision Tree splits the data into branches based on feature values to predict the target. Each leaf node holds a predicted value, the mean target of the training samples that reach it.

Decision Tree Regressor is suitable because it can model non-linear relationships and is easy to interpret. It is useful for capturing complex patterns that linear models might miss.
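To show what "splitting on feature values" means in practice, the sketch below searches one made-up feature for the threshold that minimizes the summed squared error of the two resulting groups. This is a simplified illustration of a single split, under the assumption of a squared-error criterion; the helper names are hypothetical, and a real tree repeats the search over all features and recursively within each branch.

```python
def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(x, y):
    """Return (threshold, cost) of the best single split on one feature."""
    pairs = sorted(zip(x, y))
    best_threshold, best_cost = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no threshold can separate identical feature values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [target for _, target in pairs[:i]]
        right = [target for _, target in pairs[i:]]
        cost = sse(left) + sse(right)
        if cost < best_cost:
            best_threshold, best_cost = threshold, cost
    return best_threshold, best_cost

# Hypothetical feature with two clearly separated groups
x = [1, 2, 3, 10, 11, 12]
y = [1.0, 1.1, 0.9, 5.0, 5.2, 4.8]

threshold, cost = best_split(x, y)
print(threshold, round(cost, 4))
```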

3.Random Forest Regressor

A Random Forest is an ensemble of Decision Trees trained on random subsets of data and features. Each tree gives a prediction, and the final prediction is the average of all tree outputs to reduce variance.

It reduces overfitting compared to a single Decision Tree and performs well on structured/tabular datasets like housing data. It can handle complex relationships and usually provides high accuracy.
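The averaging idea can be sketched with very small trees. The example below trains one-split trees ("stumps") on bootstrap samples of a made-up one-feature dataset and averages their predictions. It is a simplified illustration with hypothetical helper names: with a single feature there is no random feature subsetting, which a real Random Forest also applies.

```python
import random

def fit_stump(x, y):
    """Fit a one-split regression tree: returns (threshold, left_mean, right_mean)."""
    pairs = sorted(zip(x, y))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue
        left = [t for _, t in pairs[:i]]
        right = [t for _, t in pairs[i:]]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        cost = (sum((t - left_mean) ** 2 for t in left)
                + sum((t - right_mean) ** 2 for t in right))
        if best is None or cost < best[0]:
            threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best = (cost, threshold, left_mean, right_mean)
    if best is None:
        # Every sampled x was identical: fall back to predicting the mean
        mean = sum(y) / len(y)
        return (float("inf"), mean, mean)
    return best[1:]

def predict_stump(stump, value):
    threshold, left_mean, right_mean = stump
    return left_mean if value <= threshold else right_mean

def fit_forest(x, y, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample (rows drawn with replacement)."""
    rng = random.Random(seed)
    n = len(x)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(fit_stump([x[i] for i in idx], [y[i] for i in idx]))
    return forest

def predict_forest(forest, value):
    """The ensemble prediction is the average of the individual tree outputs."""
    return sum(predict_stump(s, value) for s in forest) / len(forest)

# Hypothetical data: low targets for small x, high targets for large x
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]

forest = fit_forest(x, y)
low = predict_forest(forest, 2)
high = predict_forest(forest, 7)
print(low, high)
```

Each stump sees a slightly different sample, so individual predictions vary; averaging them smooths out that variation.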

4.Gradient Boosting Regressor

Gradient Boosting builds trees sequentially, where each new tree is trained to correct the errors of the ensemble built so far. It performs gradient descent on the loss function: each tree is fit to the negative gradient of the loss, which for squared error is simply the current residuals.

It is highly effective at modeling complex, non-linear relationships. It is especially good at improving prediction accuracy on tabular datasets like California Housing, whose features include derived per-household averages such as AveRooms and AveOccup.
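The sequential error-correcting loop can be sketched in a few lines. The example below assumes a squared-error loss, so each round fits a one-split tree to the current residuals and adds a damped version of it to the running prediction. The data, the learning rate of 0.5, the 20 rounds, and the helper names are all hypothetical choices for the demo, not the settings used by GradientBoostingRegressor above.

```python
def fit_stump(x, targets):
    """One-split regression tree on a single feature (assumes distinct x values)."""
    pairs = sorted(zip(x, targets))
    best = None
    for i in range(1, len(pairs)):
        left = [t for _, t in pairs[:i]]
        right = [t for _, t in pairs[i:]]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        cost = (sum((t - left_mean) ** 2 for t in left)
                + sum((t - right_mean) ** 2 for t in right))
        if best is None or cost < best[0]:
            threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best = (cost, threshold, left_mean, right_mean)
    return best[1:]

def predict_stump(stump, value):
    threshold, left_mean, right_mean = stump
    return left_mean if value <= threshold else right_mean

def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical one-feature dataset
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.0, 1.3, 2.1, 2.8, 4.2, 5.1, 6.3, 7.0]
learning_rate = 0.5  # hypothetical value chosen for the demo

# Start from a constant prediction: the mean of the targets
predictions = [sum(y) / len(y)] * len(y)
initial_mse = mse(y, predictions)

for _ in range(20):
    # For squared error, the negative gradient is the residual y - prediction
    residuals = [t - p for t, p in zip(y, predictions)]
    stump = fit_stump(x, residuals)
    # Add a damped copy of the new tree to the running prediction
    predictions = [p + learning_rate * predict_stump(stump, v)
                   for p, v in zip(predictions, x)]

final_mse = mse(y, predictions)
print(initial_mse, final_mse)
```

The training error shrinks round by round because every new stump targets what the current ensemble still gets wrong.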

5.Support Vector Regressor (SVR)

SVR fits a function that keeps as many points as possible within a tolerance margin (epsilon) of the prediction; errors inside the margin are not penalized. Kernels such as RBF let it model non-linear data.

SVR is good for datasets with complex, non-linear relationships. The RBF kernel helps it adapt to the varied nature of housing data.
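The two ingredients mentioned here, the error threshold and the RBF kernel, can each be written as a one-line formula. The sketch below is a plain-Python illustration with hypothetical function names. The epsilon of 0.1 matches scikit-learn's default for SVR, while gamma=0.5 is an arbitrary value assumed for the demo.

```python
import math

def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Errors smaller than epsilon cost nothing; larger ones cost the excess."""
    return max(0.0, abs(y_true - y_pred) - epsilon)

def rbf_kernel(a, b, gamma=0.5):
    """Similarity exp(-gamma * ||a - b||^2): 1 for identical points, near 0 for distant ones."""
    squared_distance = sum((p - q) ** 2 for p, q in zip(a, b))
    return math.exp(-gamma * squared_distance)

# A prediction inside the tolerance band is not penalized
print(epsilon_insensitive_loss(3.0, 3.05))

# A prediction outside the band is penalized by the amount it exceeds epsilon
print(epsilon_insensitive_loss(3.0, 3.5))

# Kernel similarity falls off with distance
print(rbf_kernel([0.0, 0.0], [1.0, 0.0]))
print(rbf_kernel([0.0, 0.0], [3.0, 4.0]))
```

Because the kernel depends on distances between feature vectors, SVR is sensitive to feature scale, which is one reason the standardization step matters for this model.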

# 3.Model Evaluation and Comparison

In [23]:
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

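# Redefine evaluate_model (replacing the printing version from Section 2)
# so that it returns MSE, MAE, and R² instead of printing RMSE and R²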
def evaluate_model(model, X_test, y_test):
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mae = mean_absolute_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    return mse, mae, r2

In [27]:
results = {}
models = {
    "Linear Regression": lr,
    "Decision Tree": dt,
    "Random Forest": rf,
    "Gradient Boosting": gbr,
    "SVR": svr
}

for name, model in models.items():
    mse, mae, r2 = evaluate_model(model, X_test, y_test)
    results[name] = {
        "MSE": mse,
        "MAE": mae,
        "R2 Score": r2
    }

# Converting to DataFrame
results_df = pd.DataFrame(results).T
results_df = results_df.sort_values(by="R2 Score", ascending=False)
results_df

Model,MSE,MAE,R2 Score
Random Forest,0.255498,0.327613,0.805024
Gradient Boosting,0.293999,0.37165,0.775643
SVR,0.355198,0.397763,0.728941
Decision Tree,0.494272,0.453784,0.622811
Linear Regression,0.555892,0.5332,0.575788
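The MSE column here and the RMSE values printed in Section 2 describe the same errors, since RMSE is the square root of MSE. As a quick consistency check, the sketch below takes the MSE values from the table and the RMSE values from the earlier cell outputs and confirms they agree to four decimals; the dictionary names are chosen for this example.

```python
import math

# MSE values copied from the comparison table
mse_by_model = {
    "Random Forest": 0.255498,
    "Gradient Boosting": 0.293999,
    "SVR": 0.355198,
    "Decision Tree": 0.494272,
    "Linear Regression": 0.555892,
}

# RMSE values copied from the cell outputs in Section 2
reported_rmse = {
    "Random Forest": 0.5055,
    "Gradient Boosting": 0.5422,
    "SVR": 0.5960,
    "Decision Tree": 0.7030,
    "Linear Regression": 0.7456,
}

for name, mse in mse_by_model.items():
    rmse = math.sqrt(mse)
    matches = abs(rmse - reported_rmse[name]) < 1e-4
    print(f"{name}: sqrt(MSE) = {rmse:.4f}, reported RMSE = {reported_rmse[name]:.4f}, match = {matches}")
```

Since the target is the median house value in the dataset's own units, RMSE is the easier of the two to read as a typical prediction error.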


Best Model: Random Forest Regressor

Justification: Random Forest Regressor achieved the lowest MSE (0.2555) and MAE (0.3276), indicating its predictions were closest to the actual house values. It also had the highest R² score (0.8050), meaning it explained the most variance in the test set. Random Forest is an ensemble method that reduces overfitting and captures complex, non-linear relationships, which suits structured datasets like California Housing.

Worst Model: Linear Regression

Reasoning: It had the highest error values (MSE 0.5559, MAE 0.5332) and the lowest R² score (0.5758) among all models. Linear Regression assumes a linear relationship between the features and the target, which is too simplistic for this dataset. It cannot capture non-linear effects of, or interactions between, variables like median income, population, and housing age.