1>> Loading and Preprocessing (2 marks):
>> Load the California Housing dataset using the fetch_california_housing function from sklearn.
>> Convert the dataset into a pandas DataFrame for easier handling.
>> Handle missing values (if any) and perform necessary feature scaling (e.g., standardization).
>> Explain the preprocessing steps you performed and justify why they are necessary for this dataset.


In [3]:
# important libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.datasets import fetch_california_housing

>> Loading the data to df

In [4]:
# loading the dataset
data = fetch_california_housing()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['Target'] = data.target

df.head()

,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,Target
0,8.3252,41.0,6.984127,1.02381,322.0,2.555556,37.88,-122.23,4.526
1,8.3014,21.0,6.238137,0.97188,2401.0,2.109842,37.86,-122.22,3.585
2,7.2574,52.0,8.288136,1.073446,496.0,2.80226,37.85,-122.24,3.521
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25,3.413
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25,3.422


>> Handle missing values (if any) and perform necessary feature scaling (e.g., standardization).


In [7]:
df.isnull().sum()

MedInc        0
HouseAge      0
AveRooms      0
AveBedrms     0
Population    0
AveOccup      0
Latitude      0
Longitude     0
Target        0
dtype: int64

In [11]:
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

# Handle missing values. The isnull() check above found none, so mean
# imputation leaves the data unchanged; it is kept only as a safeguard.
features = df.drop('Target', axis=1)
imputer = SimpleImputer(strategy='mean')
df_imputed = pd.DataFrame(imputer.fit_transform(features), columns=features.columns)
df_imputed['Target'] = df['Target']

# Scaling (standardization): rescale each feature to mean 0 and standard deviation 1.
# The target column is left in its original units.
scaler = StandardScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df_imputed.drop('Target', axis=1)), columns=features.columns)
df_scaled['Target'] = df_imputed['Target']

# Output the processed data
print("\nData after missing value handling and scaling:")
print(df_scaled.head())



Data after missing value handling and scaling:
     MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  \
0  2.344766  0.982143  0.628559  -0.153758   -0.974429 -0.049597  1.052548   
1  2.332238 -0.607019  0.327041  -0.263336    0.861439 -0.092512  1.043185   
2  1.782699  1.856182  1.155620  -0.049016   -0.820777 -0.025843  1.038503   
3  0.932968  1.856182  0.156966  -0.049833   -0.766028 -0.050329  1.038503   
4 -0.012881  1.856182  0.344711  -0.032906   -0.759847 -0.085616  1.038503   

   Longitude  Target  
0  -1.327835   4.526  
1  -1.322844   3.585  
2  -1.332827   3.521  
3  -1.337818   3.413  
4  -1.337818   3.422  


#Explain the preprocessing steps you performed and justify why they are necessary for this dataset.
   Checking for Missing Values>>
   Before any other processing step, it is important to check whether the dataset contains missing (NaN) values. Missing values can arise from incomplete data collection or errors during data entry. Many machine learning estimators, linear regression among them, cannot handle NaN input directly and will fail or give unreliable results if it is left in place. For this dataset, df.isnull().sum() reports zero missing values in every column, so the mean imputation step changes nothing and serves only as a safeguard.
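
   Because the California Housing data has no missing values, the imputer's effect is not visible in the output above. The following is a minimal sketch of what mean imputation would do if gaps were present. The toy frame and its values are hypothetical and are not taken from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame with one missing entry per column
toy = pd.DataFrame({
    "MedInc": [8.0, np.nan, 4.0],
    "HouseAge": [41.0, 21.0, np.nan],
})

# Count missing entries before imputation
missing_before = int(toy.isnull().sum().sum())

# Replace each NaN with the mean of the observed values in its column
imputer = SimpleImputer(strategy="mean")
toy_filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

missing_after = int(toy_filled.isnull().sum().sum())
print("missing before:", missing_before, "after:", missing_after)
print(toy_filled)
```

   The missing MedInc entry becomes the mean of 8.0 and 4.0, and the missing HouseAge entry becomes the mean of 41.0 and 21.0.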
   
   Feature Scaling (Standardization)>>
   Standardization transforms each feature so that it has a mean of 0 and a standard deviation of 1. The features in this dataset sit on very different scales: Population is in the hundreds or thousands, while AveBedrms is close to 1. Models that rely on distances, gradient-based optimization, or coefficient penalties (for example SVR, Ridge, and Lasso) are sensitive to these differences, because features with larger numeric ranges can dominate the fit or slow convergence. Standardization puts all features on a comparable scale so that none dominates purely because of its units. Tree-based models such as Decision Trees and Random Forests split on thresholds and are largely unaffected by scaling.
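
   One common refinement, not used in the cell above, is to fit the scaler on the training split only and reuse its statistics on the test split, so that no test information leaks into preprocessing. The sketch below illustrates this on synthetic data. The two columns are made-up stand-ins for an income-like and a population-like feature, not the real dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: two features on very different scales (assumed values)
X = np.column_stack([
    rng.normal(loc=3.9, scale=1.9, size=500),        # income-like feature
    rng.normal(loc=1400.0, scale=1100.0, size=500),  # population-like feature
])

X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from the training split only
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics

train_means = X_train_scaled.mean(axis=0)
train_stds = X_train_scaled.std(axis=0)
print("train means:", train_means.round(6))
print("train stds: ", train_stds.round(6))
print("test means: ", X_test_scaled.mean(axis=0).round(3))
```

   The training columns come out with mean 0 and standard deviation 1 by construction. The test columns are only approximately standardized, which is expected.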
   



2>> Regression Algorithm Implementation (5 marks):
 Implement the following regression algorithms:


Linear Regression
Decision Tree Regressor
Random Forest Regressor
Gradient Boosting Regressor
Support Vector Regressor (SVR)
 For each algorithm:
Provide a brief explanation of how it works.
Explain why it might be suitable for this dataset.


In [22]:
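# Split data into 80% train and 20% test.
# Note: this uses the unscaled df rather than df_scaled, so the models
# below are trained on the raw feature values.
X_train, X_test, y_train, y_test = train_test_split(df.drop(columns=['Target']), df['Target'], test_size=0.2, random_state=42)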

In [24]:
# Models

models = {
    'Linear Regression': LinearRegression(),
    'Ridge Regression': Ridge(),
    'Lasso Regression': Lasso(),
    'Decision Tree': DecisionTreeRegressor(),
    'Random Forest': RandomForestRegressor()
}

In [26]:
# Evaluating the models to find the best model

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results[name] = {
        'MAE': mean_absolute_error(y_test, y_pred),
        'MSE': mean_squared_error(y_test, y_pred),
        'RMSE': np.sqrt(mean_squared_error(y_test, y_pred)),
        'R2 Score': r2_score(y_test, y_pred)
    }

In [28]:
# the results

results_df = pd.DataFrame(results).T
print(results_df)

                        MAE       MSE      RMSE  R2 Score
Linear Regression  0.533200  0.555892  0.745581  0.575788
Ridge Regression   0.533204  0.555803  0.745522  0.575855
Lasso Regression   0.761578  0.938034  0.968521  0.284167
Decision Tree      0.454557  0.507787  0.712592  0.612497
Random Forest      0.326337  0.251897  0.501893  0.807772
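
The metrics in this table can be checked by hand on a small example. The sketch below uses four made-up target and prediction values (not taken from the dataset) and compares the manual formulas with the scikit-learn functions used above.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical targets and predictions, chosen so the arithmetic is easy to follow
y_true = np.array([3.0, 2.0, 4.0, 1.0])
y_hat = np.array([2.5, 2.0, 5.0, 1.5])

errors = y_true - y_hat                 # [0.5, 0.0, -1.0, -0.5]

mae = np.mean(np.abs(errors))           # mean absolute error
mse = np.mean(errors ** 2)              # mean squared error
rmse = np.sqrt(mse)                     # same units as the target

ss_res = np.sum(errors ** 2)                     # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1 - ss_res / ss_tot                         # fraction of variance explained

print(mae, mse, rmse, r2)
print(mean_absolute_error(y_true, y_hat),
      mean_squared_error(y_true, y_hat),
      r2_score(y_true, y_hat))
```

Lower MAE, MSE, and RMSE are better, while R2 is better the closer it gets to 1.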


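Linear Regression>> fits a weighted sum of the features by minimizing squared error. It is a fast, interpretable baseline and suits this dataset to the extent that the relationship between the features (such as MedInc) and house value is approximately linear.

Decision Trees>> split the feature space recursively into regions and predict the mean target value in each region. They capture non-linear effects and feature interactions (for example between Latitude and Longitude) but can overfit without limits on depth or leaf size.

Random Forest>> averages the predictions of many trees, each trained on a bootstrap sample of the data, which reduces the overfitting of a single tree.

Gradient Boosting>> builds an ensemble sequentially, with each new tree fitted to the errors of the current ensemble, and can model complex non-linear relationships.

Support Vector Regression>> fits a function that tolerates errors within a margin and uses kernels to model non-linear relationships. It needs scaled features and careful tuning of its parameters, and it is computationally expensive on a dataset of this size (about 20,000 rows).

Note: Gradient Boosting and SVR were not fitted in the cells above, so the results table has no scores for them; Ridge and Lasso were run instead.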
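
Gradient Boosting and SVR are described here but do not appear in the models dictionary above. The following is a minimal sketch of how the two could be fitted with the same loop pattern. It runs on synthetic data from make_friedman1 so that it finishes quickly, so its scores say nothing about the California Housing results. The SVR settings (kernel="rbf", C=10.0) and the extra_models and extra_results names are illustrative assumptions.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic non-linear regression problem standing in for the real data
X, y = make_friedman1(n_samples=600, noise=1.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

extra_models = {
    "Gradient Boosting": GradientBoostingRegressor(random_state=42),
    # SVR is scale-sensitive, so the scaler is bundled into a pipeline
    # and fitted on the training split only
    "SVR": make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0)),
}

extra_results = {}
for name, model in extra_models.items():
    model.fit(X_tr, y_tr)
    extra_results[name] = r2_score(y_te, model.predict(X_te))

print(extra_results)
```

To apply this to the notebook, the two entries would be added to the models dictionary and run through the existing evaluation loop. SVR may take noticeably longer than the other models on the full dataset.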

3>> Model Evaluation and Comparison (2 marks):

Evaluate the performance of each algorithm using the following metrics:
Mean Squared Error (MSE)
Mean Absolute Error (MAE)
R-squared Score (R²)
Compare the results of all models and identify:


In [None]:
# Hyperparameter tuning for the best model
param_grid = {'n_estimators': [50, 100, 200], 'max_depth': [None, 10, 20]}
grid_search = GridSearchCV(RandomForestRegressor(), param_grid, cv=5, scoring='r2')
grid_search.fit(X_train, y_train)

In [31]:
# the best parameters and model score

y_pred_best = grid_search.best_estimator_.predict(X_test)
print("Best parameters:", grid_search.best_params_)
print("Best R2 Score:", r2_score(y_test, y_pred_best))

Best parameters: {'max_depth': None, 'n_estimators': 200}
Best R2 Score: 0.8063924056740992
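
The value printed above is the R2 on the held-out test set. GridSearchCV also stores the mean cross-validated score of the winning configuration in best_score_, which is the quantity the search actually optimized. The sketch below shows both side by side on synthetic data. The smaller grid, cv=3, and make_friedman1 data are assumptions chosen to keep the run short.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the housing data
X, y = make_friedman1(n_samples=300, noise=1.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# Reduced grid (assumed values) so the search runs in a few seconds
small_grid = {"n_estimators": [20, 50], "max_depth": [None, 5]}
search = GridSearchCV(RandomForestRegressor(random_state=42),
                      small_grid, cv=3, scoring="r2")
search.fit(X_tr, y_tr)

cv_r2 = search.best_score_  # mean R2 across the CV folds (training data only)
test_r2 = r2_score(y_te, search.best_estimator_.predict(X_te))  # held-out R2
n_candidates = len(search.cv_results_["params"])

print("Best parameters:", search.best_params_)
print("CV R2:", cv_r2, "Test R2:", test_r2, "Candidates:", n_candidates)
```

Reporting both numbers makes it clear which score drove the parameter choice and which one measures generalization.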


The best-performing algorithm with justification.
The worst-performing algorithm with reasoning.

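Random Forest is the best-performing model (R2 = 0.808, MAE = 0.326, MSE = 0.252). It captures non-linear relationships and feature interactions, and averaging many trees reduces the overfitting seen in the single Decision Tree (R2 = 0.612). Hyperparameter tuning did not improve on the default settings: the tuned model reaches a test R2 of 0.806.

Lasso Regression is the worst-performing model in the results table (R2 = 0.284, MSE = 0.938), well below plain Linear Regression (R2 = 0.576). With the default regularization strength (alpha = 1.0), the L1 penalty shrinks the coefficients heavily, which likely causes underfitting. Among the unregularized models, Linear Regression is the weakest because its linearity assumption does not fit this dataset well.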
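
One plausible reason for the low Lasso score in the results table, not verified in this notebook, is that scikit-learn's default alpha=1.0 is a strong penalty relative to the size of the coefficients. The sketch below shows the effect on synthetic data. The coefficients 0.8 and 0.3, the noise level, and the alternative alpha=0.01 are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic standardized features with modest true coefficients (assumed values)
X = rng.normal(size=(500, 3))
y = 0.8 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(scale=0.5, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

strong = Lasso(alpha=1.0).fit(X_tr, y_tr)   # scikit-learn default penalty
weak = Lasso(alpha=0.01).fit(X_tr, y_tr)    # much lighter penalty

r2_strong = r2_score(y_te, strong.predict(X_te))
r2_weak = r2_score(y_te, weak.predict(X_te))

print("alpha=1.0  coefficients:", strong.coef_, "R2:", r2_strong)
print("alpha=0.01 coefficients:", weak.coef_, "R2:", r2_weak)
```

In this toy setting the default penalty drives every coefficient to zero, so the model predicts only the mean. Adding alpha to a hyperparameter search would show whether the same thing happens on the housing data.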