# **1. Loading and Preprocessing**

In [6]:
# to import the necessary libraries:
from sklearn.datasets import fetch_california_housing
import pandas as pd
from sklearn.preprocessing import StandardScaler
import numpy as np

In [8]:
# to load the dataset:
california_data = fetch_california_housing()

# to convert to a DataFrame:
data = pd.DataFrame(california_data.data, columns=california_data.feature_names)
data['Target'] = california_data.target  # Add target column

# to display the first few rows of the dataset:
data.head()

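|   | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | Target |
|---|--------|----------|----------|-----------|------------|----------|----------|-----------|--------|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 | 3.521 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 | 3.413 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 | 3.422 |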


In [10]:
# to check missing values:
print("Missing values before handling:")
print(data.isnull().sum())

# to handle missing values (if any):
# In this dataset, there are no missing values, but if there were:
data.fillna(data.mean(), inplace=True)

print("Missing values after handling:")
print(data.isnull().sum())

Missing values before handling:
MedInc        0
HouseAge      0
AveRooms      0
AveBedrms     0
Population    0
AveOccup      0
Latitude      0
Longitude     0
Target        0
dtype: int64
Missing values after handling:
MedInc        0
HouseAge      0
AveRooms      0
AveBedrms     0
Population    0
AveOccup      0
Latitude      0
Longitude     0
Target        0
dtype: int64


In [12]:
# to initialize the scaler:
scaler = StandardScaler()

# to scale the features (excluding the target):
scaled_features = scaler.fit_transform(data.iloc[:, :-1])

# to convert scaled features back to a DataFrame:
scaled_data = pd.DataFrame(scaled_features, columns=california_data.feature_names)
scaled_data['Target'] = data['Target']

# to display the first few rows of the scaled dataset:
scaled_data.head()

|   | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | Target |
|---|--------|----------|----------|-----------|------------|----------|----------|-----------|--------|
| 0 | 2.344766 | 0.982143 | 0.628559 | -0.153758 | -0.974429 | -0.049597 | 1.052548 | -1.327835 | 4.526 |
| 1 | 2.332238 | -0.607019 | 0.327041 | -0.263336 | 0.861439 | -0.092512 | 1.043185 | -1.322844 | 3.585 |
| 2 | 1.782699 | 1.856182 | 1.15562 | -0.049016 | -0.820777 | -0.025843 | 1.038503 | -1.332827 | 3.521 |
| 3 | 0.932968 | 1.856182 | 0.156966 | -0.049833 | -0.766028 | -0.050329 | 1.038503 | -1.337818 | 3.413 |
| 4 | -0.012881 | 1.856182 | 0.344711 | -0.032906 | -0.759847 | -0.085616 | 1.038503 | -1.337818 | 3.422 |


**Handling Missing Values:**

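Checked for missing values in the dataset; none were found (every column reports 0). Because missing values can cause errors during training, mean imputation is still applied as a safeguard: any missing entry would be replaced with the mean of its column. This keeps every row and preserves each column's mean, although it would shrink the column's variance if many values had to be imputed.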
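
To make the effect of mean imputation concrete, here is a minimal, dependency-free sketch. The `impute_mean` helper and the `house_age` values are hypothetical and only illustrate the idea behind `fillna(data.mean())`; they are not taken from the dataset.

```python
import math
from statistics import fmean, pvariance

def impute_mean(column):
    """Replace NaN entries with the mean of the observed (non-NaN) values."""
    observed = [v for v in column if not math.isnan(v)]
    fill = fmean(observed)
    return [fill if math.isnan(v) else v for v in column]

# Hypothetical column with two missing entries (illustrative values only).
house_age = [41.0, 21.0, float("nan"), 52.0, float("nan"), 52.0]
observed = [v for v in house_age if not math.isnan(v)]
filled = impute_mean(house_age)

# The mean is unchanged, but the variance drops because the filled
# entries sit exactly at the mean.
print("mean before:", fmean(observed), "after:", fmean(filled))
print("variance before:", pvariance(observed), "after:", pvariance(filled))
```

The sketch shows why mean imputation is usually considered safe only when few values are missing: the more entries are filled, the more the column's spread is understated.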

**Feature Scaling:**

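Applied standardization using StandardScaler so that every feature has a mean of 0 and a standard deviation of 1. This matters for scale-sensitive models such as SVR (and, more generally, regularized linear models and neural networks), where features with large numeric ranges would otherwise dominate; tree-based models are largely insensitive to feature scale. Note that the scaler here is fit on the full dataset, before the train/test split.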
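
The scaler in this notebook is fit before the data is split. A common variation, shown here only as a sketch and not as what the cells above do, is to learn the mean and standard deviation from the training rows alone and reuse them for the test rows. The helper names and the income values below are hypothetical.

```python
from statistics import fmean, pstdev

def fit_standardizer(train_column):
    """Return (mean, std) learned from the training values only."""
    # pstdev is the population standard deviation (divides by n),
    # which is the convention StandardScaler uses.
    return fmean(train_column), pstdev(train_column)

def standardize(column, mean, std):
    """Apply z = (x - mean) / std with previously learned statistics."""
    return [(x - mean) / std for x in column]

# Hypothetical median-income values, split into train and test by hand.
train = [8.3, 7.3, 5.6, 3.8, 2.0]
test = [4.0, 9.1]

mean, std = fit_standardizer(train)
train_z = standardize(train, mean, std)
test_z = standardize(test, mean, std)  # test rows never influence mean/std

print("train mean/std after scaling:", fmean(train_z), pstdev(train_z))
print("test values on the training scale:", test_z)
```

With this ordering the test set stays unseen until evaluation, so the reported metrics are not influenced by test-set statistics.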

# **2. Regression Algorithm Implementation**

In [20]:
# to import the libraries for regression algorithms:
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
import numpy as np

In [22]:
# to split the dataset into features and target:
X = scaled_data.iloc[:, :-1]
y = scaled_data['Target']

# to split into training and testing sets:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [33]:
!pip install --upgrade scikit-learn



In [35]:
# Linear Regression:
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)

# Predictions:
lr_predictions = lr_model.predict(X_test)

# Evaluation:
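# RMSE is the square root of MSE. np.sqrt works across scikit-learn versions;
# the squared=False argument was deprecated in scikit-learn 1.4.
lr_rmse = np.sqrt(mean_squared_error(y_test, lr_predictions))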
lr_r2 = r2_score(y_test, lr_predictions)

print("Linear Regression RMSE:", lr_rmse)
print("Linear Regression R2 Score:", lr_r2)

Linear Regression RMSE: 0.7455813830127763
Linear Regression R2 Score: 0.575787706032451




In [38]:
# Decision Tree Regressor:
dt_model = DecisionTreeRegressor(random_state=42)
dt_model.fit(X_train, y_train)

# Predictions:
dt_predictions = dt_model.predict(X_test)

# Evaluation:
dt_rmse = np.sqrt(mean_squared_error(y_test, dt_predictions))
dt_r2 = r2_score(y_test, dt_predictions)

print("Decision Tree RMSE:", dt_rmse)
print("Decision Tree R2 Score:", dt_r2)

Decision Tree RMSE: 0.7030445773467542
Decision Tree R2 Score: 0.6228111330554302




In [44]:
# Random Forest Regressor:
rf_model = RandomForestRegressor(random_state=42, n_estimators=100)
rf_model.fit(X_train, y_train)

# Predictions:
rf_predictions = rf_model.predict(X_test)

# Evaluation:
rf_rmse = np.sqrt(mean_squared_error(y_test, rf_predictions))
rf_r2 = r2_score(y_test, rf_predictions)

print("Random Forest RMSE:", rf_rmse)
print("Random Forest R2 Score:", rf_r2)

Random Forest RMSE: 0.5054678690929896
Random Forest R2 Score: 0.805024407701793




In [46]:
# Gradient Boosting Regressor:
gb_model = GradientBoostingRegressor(random_state=42)
gb_model.fit(X_train, y_train)

# Predictions:
gb_predictions = gb_model.predict(X_test)

# Evaluation:
gb_rmse = np.sqrt(mean_squared_error(y_test, gb_predictions))
gb_r2 = r2_score(y_test, gb_predictions)

print("Gradient Boosting RMSE:", gb_rmse)
print("Gradient Boosting R2 Score:", gb_r2)

Gradient Boosting RMSE: 0.5422167577867202
Gradient Boosting R2 Score: 0.7756433164710084




In [48]:
# Support Vector Regressor:
svr_model = SVR(kernel='rbf')
svr_model.fit(X_train, y_train)

# Predictions:
svr_predictions = svr_model.predict(X_test)

# Evaluation:
svr_rmse = np.sqrt(mean_squared_error(y_test, svr_predictions))
svr_r2 = r2_score(y_test, svr_predictions)

print("SVR RMSE:", svr_rmse)
print("SVR R2 Score:", svr_r2)

SVR RMSE: 0.595985286730253
SVR R2 Score: 0.7289407597956462




In [50]:
results = {
    "Model": ["Linear Regression", "Decision Tree", "Random Forest", "Gradient Boosting", "SVR"],
    "RMSE": [lr_rmse, dt_rmse, rf_rmse, gb_rmse, svr_rmse],
    "R2 Score": [lr_r2, dt_r2, rf_r2, gb_r2, svr_r2]
}

results_df = pd.DataFrame(results)
print(results_df)

               Model      RMSE  R2 Score
0  Linear Regression  0.745581  0.575788
1      Decision Tree  0.703045  0.622811
2      Random Forest  0.505468  0.805024
3  Gradient Boosting  0.542217  0.775643
4                SVR  0.595985  0.728941


**1. Linear Regression:**
Linear Regression assumes a linear relationship between the input features (independent variables) and the target (dependent variable). The model minimizes the sum of squared residuals to find the best-fit line (a hyperplane when there are several features, as here).

**Why it is suitable:**

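*   **Interpretability:** Each fitted coefficient shows the direction and size of a feature's effect on the target; because the features are standardized here, the coefficients are directly comparable.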
*   **Low Complexity:** It works well with datasets where features have a linear relationship with the target.
*   **Limitations:** It may underperform if relationships are non-linear.
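
The least-squares idea described above can be sketched for a single feature without any libraries. The `fit_line` and `sse` helpers and the toy data are hypothetical; `LinearRegression` solves the same problem for many features at once.

```python
from statistics import fmean

def fit_line(x, y):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    x_bar, y_bar = fmean(x), fmean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

def sse(x, y, slope, intercept):
    """Sum of squared residuals for the line y = slope * x + intercept."""
    return sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))

# Hypothetical toy data (illustrative values only).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1]

slope, intercept = fit_line(x, y)
best = sse(x, y, slope, intercept)

# Nudging either parameter away from the fitted value increases the error.
print("slope:", slope, "intercept:", intercept, "SSE:", best)
print("SSE with slope + 0.1:", sse(x, y, slope + 0.1, intercept))
print("SSE with intercept + 0.1:", sse(x, y, slope, intercept + 0.1))
```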

**2. Decision Tree Regressor:**
Decision Tree Regressor splits the dataset into smaller subsets based on feature thresholds. It creates a tree structure where leaf nodes represent the average value of the target for that subset.

**Why it is suitable:**

*   **Non-linearity:** Handles non-linear relationships better than Linear Regression.
*   **Feature Importance:** Automatically identifies the most important features.
*   **Limitations:** Can overfit, especially if not pruned or regularized.
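
A single split of a regression tree can be sketched as follows: try each candidate threshold, keep the one with the lowest total squared error, and predict the mean target on each side. This is a simplified illustration with hypothetical helper names and data, not how `DecisionTreeRegressor` is implemented internally.

```python
from statistics import fmean

def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    m = fmean(values)
    return sum((v - m) ** 2 for v in values)

def best_split(x, y):
    """Return (threshold, left_mean, right_mean) minimizing total SSE.

    Candidate thresholds are midpoints between consecutive distinct x values,
    so x is assumed to contain at least two distinct values.
    """
    xs = sorted(set(x))
    best = None
    for lo, hi in zip(xs, xs[1:]):
        t = (lo + hi) / 2
        left = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi > t]
        err = sse(left) + sse(right)
        if best is None or err < best[0]:
            best = (err, t, fmean(left), fmean(right))
    return best[1:]

# Hypothetical data with two clearly separated groups.
x = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]
y = [1.0, 1.2, 0.8, 3.0, 3.2, 2.8]

threshold, left_mean, right_mean = best_split(x, y)

def predict(v):
    """Each leaf predicts the mean target of the training rows it contains."""
    return left_mean if v <= threshold else right_mean

print("threshold:", threshold)
print("prediction at x=2:", predict(2.0), "and at x=7:", predict(7.0))
```

A full tree repeats this search recursively inside each side of the split, which is also why an unconstrained tree can keep splitting until it memorizes the training data.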

**3. Random Forest Regressor:**
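Random Forest is an ensemble method that combines multiple Decision Trees, each trained on a bootstrap sample of the rows (drawn with replacement); a random subset of the features can also be considered at each split. The final prediction is the average of the predictions from all trees. Note that with scikit-learn's default settings for `RandomForestRegressor`, every split considers all features, so the randomness comes mainly from the row sampling.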

**Why it is suitable:**

*   **Robustness:** Reduces overfitting seen in Decision Trees.
*   **Accuracy:** Performs well on datasets with complex relationships and interactions.
*   **Feature Selection:** Identifies key features automatically.
*   **Limitations:** Computationally expensive for large datasets.
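
The bootstrap-and-average idea can be sketched with one-split trees. This is a deliberately reduced illustration: it uses a single feature, so it omits feature sampling, and the helper names, data, and `n_trees=25` are assumptions made for the example.

```python
import random
from statistics import fmean

def fit_stump(x, y):
    """Fit a one-split regression tree and return its predict function."""
    best = None
    xs = sorted(set(x))
    for lo, hi in zip(xs, xs[1:]):
        t = (lo + hi) / 2
        left = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi > t]
        lm, rm = fmean(left), fmean(right)
        err = sum((v - lm) ** 2 for v in left) + sum((v - rm) ** 2 for v in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    if best is None:  # every sampled x was identical: fall back to the mean
        m = fmean(y)
        return lambda v: m
    _, t, lm, rm = best
    return lambda v: lm if v <= t else rm

def fit_bagged_stumps(x, y, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample and average their predictions."""
    rng = random.Random(seed)
    n = len(x)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # rows drawn with replacement
        trees.append(fit_stump([x[i] for i in idx], [y[i] for i in idx]))
    return lambda v: fmean([tree(v) for tree in trees])

# Hypothetical data with two clearly separated groups.
x = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]
y = [1.0, 1.2, 0.8, 3.0, 3.2, 2.8]

forest = fit_bagged_stumps(x, y)
print("ensemble prediction at x=2:", forest(2.0))
print("ensemble prediction at x=7:", forest(7.0))
```

Because each tree sees a slightly different sample, their individual errors partly cancel when averaged, which is the mechanism behind the reduced overfitting noted above.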

**4. Gradient Boosting Regressor:**
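Gradient Boosting builds models sequentially, where each new model focuses on correcting the errors of the previous ones. It minimizes a loss function (e.g., MSE) by gradient descent in function space: each new tree is fit to the negative gradient of the loss, which for squared error is simply the current residuals.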

**Why it is suitable:**

*   **High Accuracy:** Works well for complex datasets with non-linear relationships.
*   **Customizable:** Can optimize for specific loss functions.
*   **Limitations:** Sensitive to hyperparameters and may overfit if not tuned.
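
The sequential error-correcting loop can be sketched with one-split trees and squared-error loss. The helper names, the `learning_rate=0.5` value, and the toy data are assumptions for illustration; `GradientBoostingRegressor` uses deeper trees and additional options.

```python
from statistics import fmean

def fit_stump(x, r):
    """Least-squares one-split tree on targets r; returns a predict function.

    Assumes x contains at least two distinct values.
    """
    best = None
    xs = sorted(set(x))
    for lo, hi in zip(xs, xs[1:]):
        t = (lo + hi) / 2
        left = [ri for xi, ri in zip(x, r) if xi <= t]
        right = [ri for xi, ri in zip(x, r) if xi > t]
        lm, rm = fmean(left), fmean(right)
        err = sum((v - lm) ** 2 for v in left) + sum((v - rm) ** 2 for v in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda v: lm if v <= t else rm

def fit_boosted(x, y, n_rounds, learning_rate=0.5):
    """Start from the mean, then repeatedly fit a stump to the residuals."""
    base = fmean(y)
    pred = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        # For squared error, the residuals are the negative gradient of the loss.
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        pred = [pi + learning_rate * stump(xi) for xi, pi in zip(x, pred)]
    return lambda v: base + learning_rate * sum(s(v) for s in stumps)

def mse(y, p):
    return fmean([(yi - pi) ** 2 for yi, pi in zip(y, p)])

# Hypothetical non-linear data: y = x squared.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [xi ** 2 for xi in x]

for rounds in (0, 1, 20):
    model = fit_boosted(x, y, rounds)
    print(rounds, "rounds -> training MSE", mse(y, [model(v) for v in x]))
```

The training error falls with every round, which is also why the number of rounds and the learning rate need tuning: left unchecked, the ensemble eventually fits noise.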

**5. Support Vector Regressor (SVR):**
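SVR uses a margin-based approach to predict the target values: it fits a function so that as many training points as possible lie within a "tube" of half-width epsilon around its predictions, and only points outside the tube contribute to the loss. It uses kernels (e.g., linear, polynomial, RBF) to capture non-linear relationships.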

**Why it is suitable:**

*   **Non-linear Relationships:** Works well with kernels like RBF to handle complex relationships.
*   **Small Datasets:** Effective on smaller datasets where computational cost is manageable.
*   **Limitations:** Computationally expensive for large datasets; sensitive to hyperparameters like C and epsilon.
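
The "tube" corresponds to the epsilon-insensitive loss, sketched below. The function name and sample values are hypothetical; `epsilon=0.1` matches the default of scikit-learn's `SVR`.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Average epsilon-insensitive loss: errors inside the tube cost nothing."""
    losses = [max(0.0, abs(t - p) - epsilon) for t, p in zip(y_true, y_pred)]
    return sum(losses) / len(losses)

# Hypothetical targets and predictions (illustrative values only).
y_true = [2.0, 3.0, 4.0]
y_pred = [2.05, 3.5, 3.0]

# Absolute errors are about 0.05, 0.5 and 1.0. The first lies inside the
# tube and is ignored; the others are charged only for the part beyond epsilon.
loss = epsilon_insensitive_loss(y_true, y_pred)
print("mean epsilon-insensitive loss:", loss)
```

A larger epsilon widens the tube and ignores more small errors, while C controls how heavily the remaining errors are penalized, which is why both hyperparameters matter for tuning.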

# **3. Model Evaluation and Comparison**

In [78]:
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import pandas as pd  # to create a comparison table

In [82]:
# Linear Regression Evaluation:
lr_mse = mean_squared_error(y_test, lr_predictions)
lr_mae = mean_absolute_error(y_test, lr_predictions)
lr_r2 = r2_score(y_test, lr_predictions)

print("Linear Regression Metrics:")
print("MSE:", lr_mse)
print("MAE:", lr_mae)
print("R²:", lr_r2)

Linear Regression Metrics:
MSE: 0.5558915986952442
MAE: 0.5332001304956565
R²: 0.575787706032451


In [84]:
# Decision Tree Regressor Evaluation:
dt_mse = mean_squared_error(y_test, dt_predictions)
dt_mae = mean_absolute_error(y_test, dt_predictions)
dt_r2 = r2_score(y_test, dt_predictions)

print("Decision Tree Regressor Metrics:")
print("MSE:", dt_mse)
print("MAE:", dt_mae)
print("R²:", dt_r2)

Decision Tree Regressor Metrics:
MSE: 0.4942716777366763
MAE: 0.4537843265503876
R²: 0.6228111330554302


In [86]:
# Random Forest Regressor Evaluation:
rf_mse = mean_squared_error(y_test, rf_predictions)
rf_mae = mean_absolute_error(y_test, rf_predictions)
rf_r2 = r2_score(y_test, rf_predictions)

print("Random Forest Regressor Metrics:")
print("MSE:", rf_mse)
print("MAE:", rf_mae)
print("R²:", rf_r2)

Random Forest Regressor Metrics:
MSE: 0.25549776668540763
MAE: 0.32761306601259704
R²: 0.805024407701793


In [88]:
# Gradient Boosting Regressor Evaluation:
gb_mse = mean_squared_error(y_test, gb_predictions)
gb_mae = mean_absolute_error(y_test, gb_predictions)
gb_r2 = r2_score(y_test, gb_predictions)

print("Gradient Boosting Regressor Metrics:")
print("MSE:", gb_mse)
print("MAE:", gb_mae)
print("R²:", gb_r2)

Gradient Boosting Regressor Metrics:
MSE: 0.29399901242474274
MAE: 0.37165044848436773
R²: 0.7756433164710084


In [90]:
# Support Vector Regressor Evaluation:
svr_mse = mean_squared_error(y_test, svr_predictions)
svr_mae = mean_absolute_error(y_test, svr_predictions)
svr_r2 = r2_score(y_test, svr_predictions)

print("Support Vector Regressor Metrics:")
print("MSE:", svr_mse)
print("MAE:", svr_mae)
print("R²:", svr_r2)

Support Vector Regressor Metrics:
MSE: 0.3551984619989419
MAE: 0.3977630963437859
R²: 0.7289407597956462


In [92]:
# to create a DataFrame to compare results:
results = {
    "Model": ["Linear Regression", "Decision Tree", "Random Forest", "Gradient Boosting", "SVR"],
    "MSE": [lr_mse, dt_mse, rf_mse, gb_mse, svr_mse],
    "MAE": [lr_mae, dt_mae, rf_mae, gb_mae, svr_mae],
    "R²": [lr_r2, dt_r2, rf_r2, gb_r2, svr_r2],
}

comparison_df = pd.DataFrame(results)
print(comparison_df)

               Model       MSE       MAE        R²
0  Linear Regression  0.555892  0.533200  0.575788
1      Decision Tree  0.494272  0.453784  0.622811
2      Random Forest  0.255498  0.327613  0.805024
3  Gradient Boosting  0.293999  0.371650  0.775643
4                SVR  0.355198  0.397763  0.728941


## **Analysis:**
**Best-Performing Algorithm:**

**Model:** Random Forest Regressor 

**Justification:**
*   It has the lowest MSE (0.255498), indicating the smallest average squared error.
*   It also has the lowest MAE (0.327613), indicating the smallest average absolute error.
*   It achieved the highest R² score (0.805024), meaning it explains approximately 80.50% of the variance in the target variable.
*   Random Forest works well because it combines the predictions of multiple decision trees to improve accuracy and reduce overfitting, making it a robust choice for this dataset.
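
For reference, the three metrics used in this comparison can be computed directly from their definitions. The functions and the small `y_true`/`y_pred` lists below are a hypothetical sketch; the final check uses the Linear Regression MSE and RMSE values printed earlier in this notebook.

```python
import math

def mse(y_true, y_pred):
    """Mean squared error: average of squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    """Mean absolute error: average of absolute residuals."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Hypothetical targets and predictions (illustrative values only).
y_true = [3.0, 2.0, 4.0, 5.0]
y_pred = [2.5, 2.0, 4.5, 4.0]

print("MSE:", mse(y_true, y_pred))
print("RMSE:", math.sqrt(mse(y_true, y_pred)))
print("MAE:", mae(y_true, y_pred))
print("R2:", r2(y_true, y_pred))

# RMSE is the square root of MSE, so the two result tables in this notebook
# should agree. Values below are the reported Linear Regression metrics.
reported_mse = 0.5558915986952442
reported_rmse = 0.7455813830127763
print("tables consistent:", math.isclose(math.sqrt(reported_mse), reported_rmse, rel_tol=1e-9))
```

MSE penalizes large errors more heavily than MAE, while R² compares the model against always predicting the mean, so agreement across all three gives a more reliable ranking than any single metric.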

**Worst-Performing Algorithm:**

**Model:** Linear Regression

**Reasoning:**

*   It has the highest MSE (0.555892) and highest MAE (0.533200), indicating the largest errors in predictions.
*   It has the lowest R² score (0.575788), meaning it explains only 57.58% of the variance in the target variable.
*   Linear Regression assumes a linear relationship between features and the target variable, which may not capture the complexity and nonlinearity in the California housing dataset.

**Conclusion:**

**Best Algorithm: Random Forest Regressor** (justification: superior metrics, robust to overfitting, handles nonlinearity well).

**Worst Algorithm: Linear Regression** (reasoning: poor performance due to its linearity assumption, which is not well-suited for complex datasets).