## Loading and Preprocessing

In [14]:
# 📦 Import libraries
from sklearn.datasets import fetch_california_housing
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# 📥 Load the California Housing dataset
housing = fetch_california_housing()

# Convert to pandas DataFrame
df = pd.DataFrame(housing.data, columns=housing.feature_names)

# Add target variable (Median House Value)
df['median_house_value'] = housing.target

# Handle missing values (None in this dataset)
# The dataset doesn't have missing values, but you can fill or drop if required
df.fillna(df.mean(), inplace=True)

# Standardizing features (feature scaling)
scaler = StandardScaler()
X = df.drop('median_house_value', axis=1)
y = df['median_house_value']

# Scaling the features
X_scaled = scaler.fit_transform(X)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

print("First 5 rows of scaled features:\n", X_scaled[:5])


First 5 rows of scaled features:
 [[ 2.34476576  0.98214266  0.62855945 -0.15375759 -0.9744286  -0.04959654
   1.05254828 -1.32783522]
 [ 2.33223796 -0.60701891  0.32704136 -0.26333577  0.86143887 -0.09251223
   1.04318455 -1.32284391]
 [ 1.7826994   1.85618152  1.15562047 -0.04901636 -0.82077735 -0.02584253
   1.03850269 -1.33282653]
 [ 0.93296751  1.85618152  0.15696608 -0.04983292 -0.76602806 -0.0503293
   1.03850269 -1.33781784]
 [-0.012881    1.85618152  0.3447108  -0.03290586 -0.75984669 -0.08561576
   1.03850269 -1.33781784]]


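Loading Dataset: We use fetch_california_housing to load the data.

Convert to DataFrame: A pandas DataFrame makes the data easier to manipulate and explore.

Missing Values: This dataset has no missing values, so the fillna call is a no-op here. It is included to show where mean imputation would go.

Feature Scaling: We use StandardScaler to standardize each feature to zero mean and unit variance, so that features measured on different scales contribute comparably to scale-sensitive models such as SVR. Note that the scaler is fit on the full dataset before the train-test split, so test-set statistics influence the scaling; fitting it on the training set only would avoid this leakage.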
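
As a rough illustration of what standardization does, the sketch below computes z-scores for a single feature using only the Python standard library. The values and the helper names (`fit_standardizer`, `standardize`) are made up for this example and are not part of the notebook. The mean and standard deviation are learned from the training values only and then reused on the test values, which is one way to keep test-set statistics out of the scaling.

```python
from statistics import fmean, pstdev

def fit_standardizer(column):
    """Learn (mean, std) from a training column."""
    mean = fmean(column)
    std = pstdev(column)  # population std (ddof=0), the convention StandardScaler uses
    if std == 0:
        std = 1.0  # constant column: avoid division by zero
    return mean, std

def standardize(column, mean, std):
    """Apply z = (x - mean) / std with previously learned statistics."""
    return [(x - mean) / std for x in column]

# Hypothetical values for one feature, already split into train and test.
train_values = [2.0, 4.0, 6.0, 8.0]
test_values = [5.0, 10.0]

mean, std = fit_standardizer(train_values)          # statistics come from train only
train_scaled = standardize(train_values, mean, std)
test_scaled = standardize(test_values, mean, std)   # test reuses the train statistics

print(train_scaled)
print(test_scaled)
```

With scikit-learn, the same idea corresponds to calling `fit_transform` on the training split and `transform` on the test split.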

## Regression Algorithm Implementation

## Linear Regression

In [15]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Linear Regression Model
lr = LinearRegression()
lr.fit(X_train, y_train)

# Predictions
y_pred_lr = lr.predict(X_test)

# Evaluation Metrics
mse_lr = mean_squared_error(y_test, y_pred_lr)
mae_lr = mean_absolute_error(y_test, y_pred_lr)
r2_lr = r2_score(y_test, y_pred_lr)

print("Linear Regression MSE:", mse_lr)
print("Linear Regression MAE:", mae_lr)
print("Linear Regression R2:", r2_lr)


Linear Regression MSE: 0.5558915986952441
Linear Regression MAE: 0.5332001304956565
Linear Regression R2: 0.575787706032451


Linear Regression fits a linear equation to model the relationship between the features and the target.

It is simple and works well when that relationship is approximately linear, which makes it a useful baseline for the other models.
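
To make "fits a linear equation" concrete, here is a minimal from-scratch sketch of ordinary least squares for a single feature. It is not how scikit-learn implements LinearRegression, and the toy data and the `fit_line` helper are assumptions made for the example.

```python
from statistics import fmean

def fit_line(xs, ys):
    """Ordinary least squares for one feature; returns (slope, intercept)."""
    x_mean, y_mean = fmean(xs), fmean(ys)
    # Slope is the covariance of x and y divided by the variance of x.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept

# Toy data that lies exactly on the line y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

With several features the same principle applies, but the coefficients are solved jointly rather than one at a time.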

## Decision Tree Regressor

In [16]:
from sklearn.tree import DecisionTreeRegressor

# Decision Tree Regressor Model
dt = DecisionTreeRegressor(random_state=42)
dt.fit(X_train, y_train)

# Predictions
y_pred_dt = dt.predict(X_test)

# Evaluation Metrics
mse_dt = mean_squared_error(y_test, y_pred_dt)
mae_dt = mean_absolute_error(y_test, y_pred_dt)
r2_dt = r2_score(y_test, y_pred_dt)

print("Decision Tree MSE:", mse_dt)
print("Decision Tree MAE:", mae_dt)
print("Decision Tree R2:", r2_dt)


Decision Tree MSE: 0.4942716777366763
Decision Tree MAE: 0.4537843265503876
Decision Tree R2: 0.6228111330554302


Decision Trees split the data into segments based on feature values and predict a constant value within each segment.

They are suitable for capturing non-linear relationships.
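
The splitting idea can be sketched with a single split on one feature (a "stump"). The code below tries every midpoint between neighbouring feature values and keeps the threshold that minimizes the total squared error of the two segments. The data and helper names are hypothetical, and a real tree repeats this search recursively over all features.

```python
from statistics import fmean

def sse(values):
    """Sum of squared errors around the mean of the values."""
    m = fmean(values)
    return sum((v - m) ** 2 for v in values)

def best_split(xs, ys):
    """Return (threshold, left_mean, right_mean) minimizing total SSE.

    Assumes xs contains at least two distinct values.
    """
    best = None
    candidates = sorted(set(xs))
    for lo, hi in zip(candidates, candidates[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        cost = sse(left) + sse(right)
        if best is None or cost < best[0]:
            best = (cost, threshold, fmean(left), fmean(right))
    return best[1:]

# Toy data with two clearly separated groups.
xs = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
ys = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]

threshold, left_mean, right_mean = best_split(xs, ys)
print(threshold, left_mean, right_mean)
```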

## Random Forest Regressor

In [17]:
from sklearn.ensemble import RandomForestRegressor

# Random Forest Regressor Model
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Predictions
y_pred_rf = rf.predict(X_test)

# Evaluation Metrics
mse_rf = mean_squared_error(y_test, y_pred_rf)
mae_rf = mean_absolute_error(y_test, y_pred_rf)
r2_rf = r2_score(y_test, y_pred_rf)

print("Random Forest MSE:", mse_rf)
print("Random Forest MAE:", mae_rf)
print("Random Forest R2:", r2_rf)


Random Forest MSE: 0.25549776668540763
Random Forest MAE: 0.32761306601259704
Random Forest R2: 0.805024407701793


Random Forest is an ensemble of decision trees trained on bootstrap samples of the data. Averaging their predictions makes it more robust and less prone to overfitting than a single tree.

It handles complex datasets well and works for non-linear data.
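
The ensemble idea can be illustrated with bagging: fit many small trees on bootstrap samples and average their predictions. The sketch below uses one-split stumps on a single feature and is a simplification, not the scikit-learn algorithm. Random forests typically also randomize which features each split may use, which is omitted here. All names and data are assumptions for the example.

```python
import random
from statistics import fmean

def fit_stump(xs, ys):
    """Fit a one-split regression tree; returns a predict function."""
    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        lm, rm = fmean(left), fmean(right)
        cost = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or cost < best[0]:
            best = (cost, t, lm, rm)
    if best is None:
        # The sample had a single distinct x value: fall back to the mean.
        mean = fmean(ys)
        return lambda x: mean
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def fit_bagged_stumps(xs, ys, n_estimators=25, seed=42):
    """Train stumps on bootstrap samples and average their predictions."""
    rng = random.Random(seed)
    n = len(xs)
    stumps = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        stumps.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: fmean([stump(x) for stump in stumps])

xs = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
ys = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]

predict = fit_bagged_stumps(xs, ys)
print(predict(2.0), predict(11.0))
```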

## Gradient Boosting Regressor

In [18]:
from sklearn.ensemble import GradientBoostingRegressor

# Gradient Boosting Regressor Model
gb = GradientBoostingRegressor(random_state=42)
gb.fit(X_train, y_train)

# Predictions
y_pred_gb = gb.predict(X_test)

# Evaluation Metrics
mse_gb = mean_squared_error(y_test, y_pred_gb)
mae_gb = mean_absolute_error(y_test, y_pred_gb)
r2_gb = r2_score(y_test, y_pred_gb)

print("Gradient Boosting MSE:", mse_gb)
print("Gradient Boosting MAE:", mae_gb)
print("Gradient Boosting R2:", r2_gb)


Gradient Boosting MSE: 0.29399901242474274
Gradient Boosting MAE: 0.37165044848436773
Gradient Boosting R2: 0.7756433164710084


Gradient Boosting builds trees sequentially, with each new tree fit to the errors of the ensemble so far, which improves its predictive power.

It is suitable for complex data and tends to perform well on large datasets.
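
A minimal sketch of the error-correcting loop, assuming squared-error loss and one-split stumps on a single feature: each round fits a stump to the current residuals and adds a shrunken copy of it to the running prediction. The helper names, toy data, and the learning rate of 0.5 are choices made for the illustration and do not reflect the notebook's GradientBoostingRegressor settings.

```python
from statistics import fmean

def fit_stump(xs, targets):
    """Fit a one-split regression tree; returns a predict function."""
    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(xs, targets) if x <= t]
        right = [y for x, y in zip(xs, targets) if x > t]
        lm, rm = fmean(left), fmean(right)
        cost = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or cost < best[0]:
            best = (cost, t, lm, rm)
    if best is None:
        mean = fmean(targets)
        return lambda x: mean
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Boosting for squared error: each stump is fit to the residuals."""
    base = fmean(ys)                  # start from the mean prediction
    preds = [base] * len(xs)
    stumps, history = [], []          # history holds training MSE per round
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
        history.append(fmean([(y - p) ** 2 for y, p in zip(ys, preds)]))

    def predict(x):
        return base + learning_rate * sum(s(x) for s in stumps)

    return predict, history

# Toy non-linear data: y = x squared.
xs = [float(i) for i in range(8)]
ys = [x ** 2 for x in xs]

predict, history = boost(xs, ys)
print(history[0], history[-1])
```

The training error recorded in `history` shrinks from round to round, which is the "correcting previous errors" behaviour described above.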

## Support Vector Regressor (SVR)

In [19]:
from sklearn.svm import SVR

# Support Vector Regressor Model
svr = SVR(kernel='rbf')
svr.fit(X_train, y_train)

# Predictions
y_pred_svr = svr.predict(X_test)

# Evaluation Metrics
mse_svr = mean_squared_error(y_test, y_pred_svr)
mae_svr = mean_absolute_error(y_test, y_pred_svr)
r2_svr = r2_score(y_test, y_pred_svr)

print("SVR MSE:", mse_svr)
print("SVR MAE:", mae_svr)
print("SVR R2:", r2_svr)


SVR MSE: 0.3551984619989429
SVR MAE: 0.397763096343787
SVR R2: 0.7289407597956454


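SVR fits a function that keeps as many training points as possible within a margin (epsilon) of its predictions and penalizes only the points that fall outside that margin. With the RBF kernel used here, the fitted function can be non-linear.

It is good at capturing non-linear relationships but may be slow for large datasets, because training time grows more than quadratically with the number of samples.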
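
The error that SVR tries to keep small is usually described by the epsilon-insensitive loss: deviations smaller than epsilon cost nothing, and larger ones are penalized linearly. The sketch below shows only this loss on made-up numbers, not the full optimization that SVR performs. The function name is hypothetical, and epsilon=0.1 matches scikit-learn's default for SVR.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Average epsilon-insensitive loss: errors inside the tube cost nothing."""
    losses = [max(0.0, abs(t - p) - epsilon) for t, p in zip(y_true, y_pred)]
    return sum(losses) / len(losses)

# Made-up targets and predictions.
y_true = [1.0, 2.0, 3.0]
y_pred = [1.05, 2.5, 3.0]

# Only the second prediction is off by more than epsilon.
loss = epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1)
print(loss)
```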

## Model Evaluation and Comparison

In [20]:
# Example to compare the models (assuming previous code blocks)
models = {
    'Linear Regression': (mse_lr, mae_lr, r2_lr),
    'Decision Tree': (mse_dt, mae_dt, r2_dt),
    'Random Forest': (mse_rf, mae_rf, r2_rf),
    'Gradient Boosting': (mse_gb, mae_gb, r2_gb),
    'SVR': (mse_svr, mae_svr, r2_svr),
}

for model_name, (mse, mae, r2) in models.items():
    print(f"{model_name} -> MSE: {mse}, MAE: {mae}, R²: {r2}")


Linear Regression -> MSE: 0.5558915986952441, MAE: 0.5332001304956565, R²: 0.575787706032451
Decision Tree -> MSE: 0.4942716777366763, MAE: 0.4537843265503876, R²: 0.6228111330554302
Random Forest -> MSE: 0.25549776668540763, MAE: 0.32761306601259704, R²: 0.805024407701793
Gradient Boosting -> MSE: 0.29399901242474274, MAE: 0.37165044848436773, R²: 0.7756433164710084
SVR -> MSE: 0.3551984619989429, MAE: 0.397763096343787, R²: 0.7289407597956454


Mean Squared Error (MSE): Measures the average squared difference between the predicted and actual values. Because errors are squared, large errors are penalized more heavily. Lower values indicate better performance.

Mean Absolute Error (MAE): Measures the average absolute difference between the actual and predicted values, in the same units as the target. Again, lower values indicate better performance.

R² Score: Indicates the proportion of the variance in the target variable that the model explains. Higher values are better (closer to 1).
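
For reference, the three metrics can be computed by hand as in the sketch below. The function names and the four data points are invented for the example; the notebook itself uses the scikit-learn implementations.

```python
from statistics import fmean

def mse(y_true, y_pred):
    """Mean of squared differences."""
    return fmean([(t - p) ** 2 for t, p in zip(y_true, y_pred)])

def mae(y_true, y_pred):
    """Mean of absolute differences."""
    return fmean([abs(t - p) for t, p in zip(y_true, y_pred)])

def r2(y_true, y_pred):
    """1 minus the ratio of residual to total sum of squares."""
    mean = fmean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Made-up values: two predictions are off by 1, two are exact.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]

print(mse(y_true, y_pred))  # 0.5
print(mae(y_true, y_pred))  # 0.5
print(r2(y_true, y_pred))   # 0.9
```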

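### Compare Results
The best-performing model has the lowest MSE, the lowest MAE, and the highest R². Here that is Random Forest (MSE 0.255, MAE 0.328, R² 0.805), followed by Gradient Boosting and SVR.

The worst-performing model has the highest MSE, the highest MAE, and the lowest R². Here that is Linear Regression (MSE 0.556, MAE 0.533, R² 0.576). Because the target is expressed in units of $100,000, these MAE values correspond to average errors of roughly $33,000 for Random Forest and $53,000 for Linear Regression.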
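
One way to apply this rule programmatically is to sort the models by a chosen metric. The sketch below hard-codes the results reported above, rounded to four decimals, so that it runs on its own. In the notebook, the existing `models` dictionary could be sorted the same way.

```python
# (MSE, MAE, R2) per model, rounded from the output above.
results = {
    "Linear Regression": (0.5559, 0.5332, 0.5758),
    "Decision Tree": (0.4943, 0.4538, 0.6228),
    "Random Forest": (0.2555, 0.3276, 0.8050),
    "Gradient Boosting": (0.2940, 0.3717, 0.7756),
    "SVR": (0.3552, 0.3978, 0.7289),
}

# Sort by MSE, lowest (best) first.
ranked = sorted(results.items(), key=lambda item: item[1][0])

for name, (mse_value, mae_value, r2_value) in ranked:
    print(f"{name:<18} MSE={mse_value:.4f}  MAE={mae_value:.4f}  R2={r2_value:.4f}")

best, worst = ranked[0][0], ranked[-1][0]
print("Best:", best)
print("Worst:", worst)
```

In this case all three metrics give the same ordering, so sorting by MAE or by R² (descending) would produce the same ranking.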