1. Loading and Preprocessing

In [5]:
from sklearn.datasets import fetch_california_housing
import pandas as pd

data = fetch_california_housing()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['MedHouseVal'] = data.target

In [7]:
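# Count missing values in every column
print(df.isnull().sum())

from sklearn.preprocessing import StandardScaler

# Separate the features from the target
X = df.drop('MedHouseVal', axis=1)
y = df['MedHouseVal']

# Standardize each feature to zero mean and unit variance.
# Note: the scaler is fitted on the full dataset, before any train/test split.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)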

MedInc         0
HouseAge       0
AveRooms       0
AveBedrms      0
Population     0
AveOccup       0
Latitude       0
Longitude      0
MedHouseVal    0
dtype: int64
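
To make the scaling step concrete, the sketch below standardizes a single column by hand using only the standard library. It assumes the population standard deviation, which is what `StandardScaler` uses. The helper names (`fit_standardizer`, `standardize`) and the toy values are hypothetical. It also shows a common variant in which the statistics are fitted on the training rows only and then reused for the test rows; the notebook itself fits on the full dataset.

```python
from statistics import fmean, pstdev


def fit_standardizer(column):
    """Return (mean, std) of one feature column, using the population std."""
    mean = fmean(column)
    std = pstdev(column)
    # Guard against constant columns so we never divide by zero.
    return mean, (std if std > 0 else 1.0)


def standardize(column, mean, std):
    """Map each value to its z-score using previously fitted statistics."""
    return [(x - mean) / std for x in column]


# Hypothetical toy values standing in for one feature column.
train_col = [2.0, 4.0, 6.0, 8.0]
test_col = [5.0, 10.0]

# Fit on the training values only, then apply the same statistics to both splits.
mean, std = fit_standardizer(train_col)
train_scaled = standardize(train_col, mean, std)
test_scaled = standardize(test_col, mean, std)

print(train_scaled)
print(test_scaled)
```

With scikit-learn, the same variant would amount to calling `fit_transform` on the training features and `transform` on the test features.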


2. Regression Algorithm Implementation

Importing the models and metrics, then splitting the data

In [9]:
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Hold out 20% of the rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

Model training and evaluation function

In [15]:
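def evaluate_model(model, name):
    """Fit the model on the training split, then print and return its test-set metrics."""
    # Relies on the global X_train, X_test, y_train and y_test defined earlier
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # Lower MSE and MAE are better; R2 closer to 1 is better
    mse = mean_squared_error(y_test, y_pred)
    mae = mean_absolute_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    print(f"{name}:\n  MSE: {mse:.4f}\n  MAE: {mae:.4f}\n  R2 Score: {r2:.4f}\n")
    return name, mse, mae, r2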
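
For reference, the three metrics reported by `evaluate_model` can be written out by hand. The sketch below uses plain Python and the standard definitions of MSE, MAE, and R2. The function name `regression_metrics` and the toy values are hypothetical and are not part of the notebook.

```python
def regression_metrics(y_true, y_pred):
    """Compute MSE, MAE and R2 from two equal-length lists of numbers."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    # Mean squared error: average of the squared residuals.
    mse = sum(e * e for e in errors) / n
    # Mean absolute error: average of the absolute residuals.
    mae = sum(abs(e) for e in errors) / n

    # R2 compares the residual sum of squares with the total sum of
    # squares around the mean of the true values.
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot
    return mse, mae, r2


# Hypothetical toy targets and predictions.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 4.0]

mse, mae, r2 = regression_metrics(y_true, y_pred)
print(f"MSE: {mse:.4f}  MAE: {mae:.4f}  R2: {r2:.4f}")
```

MSE penalizes large errors more heavily than MAE because the residuals are squared, which is why the two can rank models differently.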

Training each model

In [17]:
results = []

results.append(evaluate_model(LinearRegression(), "Linear Regression"))
results.append(evaluate_model(DecisionTreeRegressor(random_state=42), "Decision Tree"))
results.append(evaluate_model(RandomForestRegressor(random_state=42), "Random Forest"))
results.append(evaluate_model(GradientBoostingRegressor(random_state=42), "Gradient Boosting"))
results.append(evaluate_model(SVR(), "Support Vector Regressor"))

Linear Regression:
  MSE: 0.5559
  MAE: 0.5332
  R2 Score: 0.5758

Decision Tree:
  MSE: 0.4943
  MAE: 0.4538
  R2 Score: 0.6228

Random Forest:
  MSE: 0.2555
  MAE: 0.3276
  R2 Score: 0.8050

Gradient Boosting:
  MSE: 0.2940
  MAE: 0.3717
  R2 Score: 0.7756

Support Vector Regressor:
  MSE: 0.3552
  MAE: 0.3978
  R2 Score: 0.7289



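Algorithm descriptions

1. Linear Regression: Fits a weighted sum of the features to the target by minimizing squared error; it cannot capture non-linear effects without engineered features.
2. Decision Tree: Recursively splits the data on feature thresholds, choosing each split to minimize the squared error within the resulting groups.
3. Random Forest: Averages many decision trees trained on bootstrap samples of the data, which reduces variance compared with a single tree.
4. Gradient Boosting: Builds shallow trees sequentially, each one fitted to the residual errors of the ensemble built so far.
5. Support Vector Regressor: Fits a function that tolerates errors inside a margin (epsilon) around the targets and penalizes only errors outside it; `SVR()` uses an RBF kernel by default, so it can model non-linear relationships but is sensitive to feature scaling.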
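
To illustrate the threshold-splitting idea behind the tree-based models, here is a minimal sketch of the search for one split on one feature, scored by the total squared error of the two resulting groups. It is a simplified illustration and not scikit-learn's implementation. The function names (`sse`, `best_split`) and the toy data are hypothetical.

```python
def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(x, y):
    """Return (threshold, total_sse) for the best single split on feature x."""
    pairs = sorted(zip(x, y))
    # Baseline: no split at all, one group containing every target value.
    best = (None, sse(y))
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate equal feature values
        # Candidate threshold: midpoint between two neighbouring values.
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [target for _, target in pairs[:i]]
        right = [target for _, target in pairs[i:]]
        total = sse(left) + sse(right)
        if total < best[1]:
            best = (threshold, total)
    return best


# Hypothetical toy data with an obvious jump between x = 3 and x = 10.
x = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
y = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]

threshold, error = best_split(x, y)
print(threshold, round(error, 4))
```

A full decision tree repeats this search over every feature and then recurses into each group. Random Forest and Gradient Boosting combine many such trees, by averaging and by sequential residual fitting respectively.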

3. Model Evaluation and Comparison

In [19]:
import pandas as pd

result_df = pd.DataFrame(results, columns=["Model", "MSE", "MAE", "R2"])
print(result_df.sort_values(by="R2", ascending=False))

                      Model       MSE       MAE        R2
2             Random Forest  0.255498  0.327613  0.805024
3         Gradient Boosting  0.293999  0.371650  0.775643
4  Support Vector Regressor  0.355198  0.397763  0.728941
1             Decision Tree  0.494272  0.453784  0.622811
0         Linear Regression  0.555892  0.533200  0.575788


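BEST PERFORMING MODEL:
Random Forest (R2 0.8050, MSE 0.2555), followed by Gradient Boosting (R2 0.7756). Both are tree ensembles, which is consistent with their ability to capture non-linear relationships.
WORST PERFORMING MODEL:
Linear Regression (R2 0.5758, MSE 0.5559), which cannot model non-linear relationships between the features and the target. The Support Vector Regressor ranked third (R2 0.7289), ahead of the single Decision Tree (R2 0.6228).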
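
The best and worst models can also be read off the scores programmatically rather than by eye. The sketch below copies the values from the comparison table into a plain list of tuples and ranks them by R2. The variable names `best`, `worst`, and `ranking` are illustrative.

```python
# Scores copied from the comparison table: (model, MSE, MAE, R2).
results = [
    ("Linear Regression", 0.555892, 0.533200, 0.575788),
    ("Decision Tree", 0.494272, 0.453784, 0.622811),
    ("Random Forest", 0.255498, 0.327613, 0.805024),
    ("Gradient Boosting", 0.293999, 0.371650, 0.775643),
    ("Support Vector Regressor", 0.355198, 0.397763, 0.728941),
]

# Rank by R2 (index 3); higher is better.
best = max(results, key=lambda row: row[3])
worst = min(results, key=lambda row: row[3])
ranking = [name for name, *_ in sorted(results, key=lambda row: row[3], reverse=True)]

print(f"Best:  {best[0]} (R2 = {best[3]:.4f})")
print(f"Worst: {worst[0]} (R2 = {worst[3]:.4f})")
print("Ranking:", ", ".join(ranking))
```

For these scores, ranking by MSE gives the same order as ranking by R2, since both are computed on the same test set.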