In [1]:
# Regression Assignment 

# Objective: 
# The objective of this assignment is to evaluate your understanding of regression techniques in supervised learning by applying them to a real-world dataset.

# Dataset: 
# Use the California Housing dataset available in the sklearn library. This dataset contains information about various features of houses in California and their respective median prices. 

# Key Components to be Fulfilled: 

# 1. Loading and Preprocessing (2 marks): 
#     Load the California Housing dataset using the fetch_california_housing function from sklearn. 
#     Convert the dataset into a pandas DataFrame for easier handling. 
#     Handle missing values (if any) and perform necessary feature scaling (e.g., standardization). 
#     Explain the preprocessing steps you performed and justify why they are necessary for this dataset.

# 2. Regression Algorithm Implementation (5 marks): 
# Implement the following regression algorithms: 
#     a.Linear Regression 
#     b.Decision Tree Regressor 
#     c.Random Forest Regressor 
#     d.Gradient Boosting Regressor 
#     e.Support Vector Regressor (SVR) 

# For each algorithm: 
#     Provide a brief explanation of how it works. 
#     Explain why it might be suitable for this dataset.
    
# 3. Model Evaluation and Comparison (2 marks): 
# Evaluate the performance of each algorithm using the following metrics: 
#     a.Mean Squared Error (MSE) 
#     b.Mean Absolute Error (MAE) 
#     c.R-squared Score (R²) 

# Compare the results of all models and identify: 
#     The best-performing algorithm with justification. 
#     The worst-performing algorithm with reasoning.
    


# 1. Loading and Preprocessing

In [2]:
import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

In [3]:
# Load the California Housing dataset using the fetch_california_housing function from sklearn. 

In [4]:
california = fetch_california_housing()

In [5]:
# Convert the dataset into a pandas DataFrame for easier handling.
df = pd.DataFrame(california.data, columns=california.feature_names)
df['Target'] = california.target  # Adding the target column (Median House Value)
print(df.head())

   MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  \
0  8.3252      41.0  6.984127   1.023810       322.0  2.555556     37.88   
1  8.3014      21.0  6.238137   0.971880      2401.0  2.109842     37.86   
2  7.2574      52.0  8.288136   1.073446       496.0  2.802260     37.85   
3  5.6431      52.0  5.817352   1.073059       558.0  2.547945     37.85   
4  3.8462      52.0  6.281853   1.081081       565.0  2.181467     37.85   

   Longitude  Target  
0    -122.23   4.526  
1    -122.22   3.585  
2    -122.24   3.521  
3    -122.25   3.413  
4    -122.25   3.422  


In [6]:
# Find missing values
df.isnull().sum()

MedInc        0
HouseAge      0
AveRooms      0
AveBedrms     0
Population    0
AveOccup      0
Latitude      0
Longitude     0
Target        0
dtype: int64

In [7]:
# Number of rows and columns
df.shape

(20640, 9)

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 9 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   MedInc      20640 non-null  float64
 1   HouseAge    20640 non-null  float64
 2   AveRooms    20640 non-null  float64
 3   AveBedrms   20640 non-null  float64
 4   Population  20640 non-null  float64
 5   AveOccup    20640 non-null  float64
 6   Latitude    20640 non-null  float64
 7   Longitude   20640 non-null  float64
 8   Target      20640 non-null  float64
dtypes: float64(9)
memory usage: 1.4 MB


In [9]:
df.columns

Index(['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup',
       'Latitude', 'Longitude', 'Target'],
      dtype='object')

In [10]:
# Find duplicates
df.duplicated().sum()

0

In [11]:
# Checking for Skewness in Features
print("\nSkewness of dataset features:")
print(df.skew())



Skewness of dataset features:
MedInc         1.646657
HouseAge       0.060331
AveRooms      20.697869
AveBedrms     31.316956
Population     4.935858
AveOccup      97.639561
Latitude       0.465953
Longitude     -0.297801
Target         0.977763
dtype: float64


OBSERVATION:
   1. Highly positively skewed features (extreme right skew):
    AveRooms (20.70), AveBedrms (31.32), AveOccup (97.64) and Population (4.94). A few outlier block groups have values far above the rest.
   2. Moderately positively skewed features:
    MedInc (1.65). Most block groups have low to middle incomes, with a smaller number of high-income areas.
   3. Approximately symmetric features:
    HouseAge (0.06), Latitude (0.47) and Longitude (-0.30). The skewness is below 0.5 in absolute value, so no transformation is needed.
   4. Target variable (Target): slight right skew (0.98), meaning most houses are in the lower price range, with fewer expensive properties.
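
To make the skewness numbers above more concrete, here is a minimal sketch of how skewness can be computed and how log(1 + x) reduces it. It uses synthetic log-normal data as a stand-in for a right-skewed feature, and `sample_skewness` is a hypothetical helper that uses the plain moment formula (pandas' `skew()` applies a bias correction, so its values differ slightly).

```python
import numpy as np

def sample_skewness(values):
    """Moment-based skewness: mean of cubed deviations over std cubed (no bias correction)."""
    values = np.asarray(values, dtype=float)
    centered = values - values.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5

# Synthetic right-skewed data (assumption: stands in for a feature such as AveOccup).
rng = np.random.default_rng(0)
raw = rng.lognormal(mean=1.0, sigma=1.0, size=10_000)

skew_before = sample_skewness(raw)
skew_after = sample_skewness(np.log1p(raw))  # log(1 + x) compresses the long right tail

print(f"skewness before log1p: {skew_before:.2f}")
print(f"skewness after  log1p: {skew_after:.2f}")
```

The transformation shrinks large values much more than small ones, which is why the long right tail contracts.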


In [12]:
# Apply log transformation to highly skewed features
skewed= ['AveRooms', 'AveBedrms', 'AveOccup', 'Population', 'MedInc']  # Example selection
df[skewed] = df[skewed].apply(lambda x: np.log1p(x))  # log(1 + x) transformation
print(df.head())


     MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  \
0  2.232720      41.0  2.077455   0.704982    5.777652  1.268511     37.88   
1  2.230165      21.0  1.979364   0.678988    7.784057  1.134572     37.86   
2  2.111110      52.0  2.228738   0.729212    6.208590  1.335596     37.85   
3  1.893579      52.0  1.919471   0.729025    6.326149  1.266369     37.85   
4  1.578195      52.0  1.985385   0.732888    6.338594  1.157342     37.85   

   Longitude  Target  
0    -122.23   4.526  
1    -122.22   3.585  
2    -122.24   3.521  
3    -122.25   3.413  
4    -122.25   3.422  


In [13]:
df.skew()

MedInc        0.226083
HouseAge      0.060331
AveRooms      1.390761
AveBedrms     8.988786
Population   -1.044087
AveOccup      3.879679
Latitude      0.465953
Longitude    -0.297801
Target        0.977763
dtype: float64

In [14]:
# Apply log1p a second time to the features that are still strongly skewed after the first pass
skewed2 = ['AveRooms', 'AveBedrms', 'AveOccup']
df[skewed2] = df[skewed2].apply(lambda x: np.log1p(x))  # log(1 + x) transformation
print(df.head())

     MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  \
0  2.232720      41.0  1.124103   0.533554    5.777652  0.819124     37.88   
1  2.230165      21.0  1.091710   0.518191    7.784057  0.758266     37.86   
2  2.111110      52.0  1.172091   0.547666    6.208590  0.848267     37.85   
3  1.893579      52.0  1.071402   0.547558    6.326149  0.818179     37.85   
4  1.578195      52.0  1.093729   0.549789    6.338594  0.768877     37.85   

   Longitude  Target  
0    -122.23   4.526  
1    -122.22   3.585  
2    -122.24   3.521  
3    -122.25   3.413  
4    -122.25   3.422  


In [15]:
df.skew()

MedInc        0.226083
HouseAge      0.060331
AveRooms      0.335100
AveBedrms     6.436582
Population   -1.044087
AveOccup      0.960154
Latitude      0.465953
Longitude    -0.297801
Target        0.977763
dtype: float64

In [16]:
# Splitting features and target
# Define the features (X) and target (y)
# Note: Latitude is not in this list, so the models below are trained on 7 of the 8 features.
X = df[['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Longitude']]
y = df['Target']


In [18]:
# Split the dataset into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("\nTraining Data (Features):")
print(x_train)
print("\nTesting Data (Features):")
print(x_test)

# Holding out a test set lets us measure performance on unseen data and detect overfitting.


Training Data (Features):
         MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  \
14196  1.449175      33.0  1.027724   0.528480    7.741099  0.934453   
8267   1.571217      49.0  0.993225   0.538504    7.181592  0.696772   
17445  1.640219       4.0  1.062636   0.522168    6.820016  0.839231   
14265  1.079260      36.0  0.959351   0.536439    7.257708  0.958703   
2271   1.516050      43.0  1.093110   0.564232    6.774224  0.785691   
...         ...       ...       ...        ...         ...       ...   
11284  1.997418      35.0  1.086599   0.504154    6.490724  0.873102   
11964  1.398717      33.0  1.119356   0.598591    7.469654  0.951696   
5390   1.369758      36.0  0.958115   0.549405    7.471363  0.902616   
860    1.904969      15.0  1.098896   0.546138    7.483244  0.887910   
15795  1.274105      52.0  0.909141   0.543551    7.870930  0.758093   

       Longitude  
14196    -117.03  
8267     -118.16  
17445    -120.48  
14265    -117.11  
2271     -119

In [20]:
x_train
x_train.shape

(16512, 7)

In [21]:
y_train

14196    1.030
8267     3.821
17445    1.726
14265    0.934
2271     0.965
         ...  
11284    2.292
11964    0.978
5390     2.221
860      2.835
15795    3.250
Name: Target, Length: 16512, dtype: float64

In [23]:
y_train.shape

(16512,)

In [24]:
x_test

         MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Longitude
20046  0.986264      25.0  0.973486   0.533112    7.239215  0.949578    -119.01
3024   1.261666      30.0  1.029013   0.579696    7.356280  0.834150    -119.46
15663  1.499645      52.0  0.957378   0.577746    7.178545  0.619932    -122.44
20484  1.907704      17.0  1.088231   0.532507    7.441907  0.912947    -118.72
9814   1.552868      34.0  1.054564   0.534778    6.969791  0.810076    -121.93
...         ...       ...       ...        ...         ...       ...        ...
15362  1.723659      16.0  1.124838   0.545687    7.209340  0.868581    -117.22
16623  1.315496      28.0  1.086735   0.595490    7.409136  0.791225    -120.83
18086  2.325305      25.0  1.134211   0.510656    7.368970  0.846939    -122.05
2144   1.331046      36.0  1.043384   0.521571    7.113142  0.823194    -119.76


In [25]:
y_test

20046    0.47700
3024     0.45800
15663    5.00001
20484    2.18600
9814     2.78000
          ...   
15362    2.63300
16623    2.66800
18086    5.00001
2144     0.72300
3665     1.51500
Name: Target, Length: 4128, dtype: float64

In [26]:
# Display the first few rows of the log-transformed DataFrame (no scaling has been applied yet)
print(df.head())

     MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  \
0  2.232720      41.0  1.124103   0.533554    5.777652  0.819124     37.88   
1  2.230165      21.0  1.091710   0.518191    7.784057  0.758266     37.86   
2  2.111110      52.0  1.172091   0.547666    6.208590  0.848267     37.85   
3  1.893579      52.0  1.071402   0.547558    6.326149  0.818179     37.85   
4  1.578195      52.0  1.093729   0.549789    6.338594  0.768877     37.85   

   Longitude  Target  
0    -122.23   4.526  
1    -122.22   3.585  
2    -122.24   3.521  
3    -122.25   3.413  
4    -122.25   3.422  


In [None]:
# Feature Scaling

In [28]:
# Standardization rescales each feature to mean 0 and standard deviation 1, making them comparable.
# Fit the scaler on the training features only, then apply the same transformation to both splits.
scaler = StandardScaler()
scaler.fit(x_train)
x_train_scaled = scaler.transform(x_train)
x_test_scaled = scaler.transform(x_test)  # used by every model below
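
As an illustration of why the scaler is fitted on the training split only, the following sketch standardizes two synthetic splits with plain NumPy. The arrays `demo_train` and `demo_test` are hypothetical stand-ins, not the housing data; the point is that the test split reuses the training mean and standard deviation, so no information from the test set leaks into preprocessing.

```python
import numpy as np

# Hypothetical stand-ins for a training and a test split (two features on very different scales).
rng = np.random.default_rng(42)
demo_train = rng.normal(loc=[5.0, 1000.0], scale=[2.0, 300.0], size=(800, 2))
demo_test = rng.normal(loc=[5.0, 1000.0], scale=[2.0, 300.0], size=(200, 2))

# Statistics are estimated on the training split only.
train_mean = demo_train.mean(axis=0)
train_std = demo_train.std(axis=0)  # population std (ddof=0), the same convention StandardScaler uses

demo_train_scaled = (demo_train - train_mean) / train_std
demo_test_scaled = (demo_test - train_mean) / train_std  # reuse the training statistics

print("train mean:", demo_train_scaled.mean(axis=0).round(6))
print("train std: ", demo_train_scaled.std(axis=0).round(6))
print("test mean: ", demo_test_scaled.mean(axis=0).round(3))
```

The scaled training features have mean 0 and standard deviation 1 by construction; the test features are only approximately standardized, which is expected.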

In [29]:
x_train_scaled

array([[-0.19385452,  0.34849025, -0.11718269, ...,  0.97210868,
         1.00261049,  1.27258656],
       [ 0.14566783,  1.61811813, -0.54929563, ...,  0.21062259,
        -1.86974584,  0.70916212],
       [ 0.3376347 , -1.95271028,  0.32009965, ..., -0.28147975,
        -0.14813056, -0.44760309],
       ...,
       [-0.41479424,  0.58654547, -0.98905098, ...,  0.60499949,
         0.61786455,  0.59946887],
       [ 1.07417538, -1.07984112,  0.77426108, ...,  0.62116991,
         0.44014876, -1.18553953],
       [-0.6809043 ,  1.85617335, -1.60245664, ...,  1.14880727,
        -1.12867887, -1.41489815]])

In [30]:
print(len(x_train_scaled[0]))
x_train_scaled.shape
x_train.shape

7


(16512, 7)

In [31]:
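# Standardize the target with a separate scaler so the feature scaler above is not overwritten.
# Note: the models below are trained on the unscaled y_train; y_train_scaled is shown for illustration only.
y_train_df = pd.DataFrame(y_train)  # StandardScaler expects 2-D input, so convert the Series to a DataFrame
y_scaler = StandardScaler()
y_scaler.fit(y_train_df)
y_train_scaled = y_scaler.transform(y_train_df)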

In [32]:
y_train_scaled

array([[-0.90118909],
       [ 1.5127714 ],
       [-0.29921255],
       ...,
       [ 0.12891731],
       [ 0.65997132],
       [ 1.01890847]])

In [33]:
y_test

20046    0.47700
3024     0.45800
15663    5.00001
20484    2.18600
9814     2.78000
          ...   
15362    2.63300
16623    2.66800
18086    5.00001
2144     0.72300
3665     1.51500
Name: Target, Length: 4128, dtype: float64

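OBSERVATION:

* Preprocessing Steps

1. Load Dataset:
The California Housing dataset is loaded and converted into a pandas DataFrame.

2. Check for Missing Values and Duplicates:
Verified that the dataset has no missing values and no duplicate rows, so no imputation or row removal is needed.

3. Check for Skewness:
Identified features with skewness greater than 1 (highly skewed).

4. Apply Log Transformation:
Used the log(1 + x) transformation to reduce skewness; AveRooms, AveBedrms and AveOccup were transformed twice. This limits the influence of extreme values, which matters most for linear regression and SVR.

5. Split Data into Features and Target:
Separated the predictors (X) from the target variable (y).

6. Train-Test Split:
80% training, 20% testing, so that model performance is evaluated on unseen data.

7. Feature Scaling (Standardization):
Standardized the features to mean 0 and standard deviation 1. This is needed for scale-sensitive models such as SVR and has no effect on the splits chosen by tree-based models.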


# 2. Regression Algorithm Implementation 


## a. Linear Regression 

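Linear regression models the relationship between a dependent variable and one or more independent variables by fitting a linear function (a straight line for one feature, a hyperplane for several) that minimizes the sum of squared errors. It serves as a simple, interpretable baseline for this dataset.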
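
As a rough illustration of what the fit does, the sketch below solves an ordinary least squares problem directly with `np.linalg.lstsq` on synthetic data. The coefficients `true_coefs` and `true_intercept` are made up for the example; this is not the housing model, only the same technique on toy inputs.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 500

# Synthetic data generated from a known linear relationship plus small noise (assumed values).
features = rng.normal(size=(n_samples, 2))
true_coefs = np.array([1.5, -2.0])
true_intercept = 0.7
target = features @ true_coefs + true_intercept + rng.normal(scale=0.1, size=n_samples)

# Prepend a column of ones so the intercept is estimated together with the coefficients.
design = np.column_stack([np.ones(n_samples), features])
solution, *_ = np.linalg.lstsq(design, target, rcond=None)
intercept, coefs = solution[0], solution[1:]

print("estimated intercept:", round(float(intercept), 3))
print("estimated coefficients:", coefs.round(3))
```

With low noise and enough samples, the estimates land close to the values used to generate the data.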

In [32]:
# Model training using Linear regression

In [69]:
print("Linear Regression")
model = LinearRegression()
model.fit(x_train_scaled, y_train)
y_pred = model.predict(x_test_scaled)
mse_lr = mean_squared_error(y_test, y_pred)
r2_lr = r2_score(y_test, y_pred)
print("Mean Squared Error (MSE):", mse_lr)
print("R^2 Score:", r2_lr)



Linear Regression
Mean Squared Error (MSE): 0.5718557153138697
R^2 Score: 0.5636051608242703


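An MSE of 0.572 and an R² of 0.564 mean the linear model explains about 56% of the variance in house values on the test set.
In general, a higher MSE means the predictions are further from the actual values, and a negative R² would mean the model is worse than always predicting the mean.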

## b. Decision Tree Regressor

A decision tree recursively splits the data into branches based on feature thresholds and predicts the mean target value of the training samples in each leaf.
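
A minimal sketch of a single split, assuming one feature and the squared-error criterion: every candidate threshold is tried and the one with the lowest total sum of squared errors is kept. `best_split` is a hypothetical helper written for this illustration; a full tree repeats the same search recursively on each side.

```python
import numpy as np

def best_split(feature, target):
    """Return (threshold, sse) of the single split that minimizes the sum of squared errors."""
    order = np.argsort(feature)
    feature, target = feature[order], target[order]
    best_threshold, best_sse = None, np.inf
    for i in range(1, len(feature)):
        if feature[i] == feature[i - 1]:
            continue  # no threshold can separate equal values
        left, right = target[:i], target[i:]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_threshold = (feature[i] + feature[i - 1]) / 2  # midpoint between neighbours
            best_sse = sse
    return best_threshold, best_sse

# Toy data: the target jumps from about 1 to about 3 once the feature passes 5.
toy_feature = np.array([1.0, 2.0, 3.0, 4.0, 6.0, 7.0, 8.0, 9.0])
toy_target = np.array([1.0, 1.1, 0.9, 1.0, 3.0, 3.1, 2.9, 3.0])

threshold, sse = best_split(toy_feature, toy_target)
print("best threshold:", threshold)
print("sum of squared errors:", round(float(sse), 4))
```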

In [70]:
print("Decision Tree Regressor")
decision_tree_model = DecisionTreeRegressor()
decision_tree_model.fit(x_train_scaled, y_train)
y_pred_tree = decision_tree_model.predict(x_test_scaled)
mse_dt = mean_squared_error(y_test, y_pred_tree)
r2_dt = r2_score(y_test, y_pred_tree)
print("Mean Squared Error (MSE):", mse_dt)
print("R^2 Score:", r2_dt)


Decision Tree Regressor
Mean Squared Error (MSE): 0.7629320738954458
R^2 Score: 0.41779086791696884


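The tree has a higher MSE (0.763) and a lower R² (0.418) than linear regression. An unconstrained decision tree (no max_depth or min_samples_leaf limit) tends to overfit the training data, so limiting its depth would likely help.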

## c. Random Forest Regressor

An ensemble method that trains many decision trees on bootstrap samples of the data and averages their predictions, which improves accuracy and reduces overfitting.
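
The averaging idea can be sketched with bootstrap samples and very small trees. This is a simplified illustration on synthetic one-feature data, using one-split "stumps" (`fit_stump` and `predict_stump` are hypothetical helpers); a real random forest also grows deeper trees and samples a random subset of features at each split.

```python
import numpy as np

def fit_stump(x, y):
    """Fit a one-split regression tree on a single feature; returns (threshold, left_mean, right_mean)."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    best = (np.inf, x[0], y.mean(), y.mean())
    for i in range(1, len(x)):
        if x[i] == x[i - 1]:
            continue  # bootstrap samples contain repeated values
        left_mean, right_mean = y[:i].mean(), y[i:].mean()
        sse = ((y[:i] - left_mean) ** 2).sum() + ((y[i:] - right_mean) ** 2).sum()
        if sse < best[0]:
            best = (sse, (x[i] + x[i - 1]) / 2, left_mean, right_mean)
    return best[1:]

def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return np.where(x <= threshold, left_mean, right_mean)

rng = np.random.default_rng(0)
x_demo = rng.uniform(0, 10, size=200)
y_demo = np.sin(x_demo) + rng.normal(scale=0.3, size=200)
x_new = np.linspace(0, 10, 50)
y_new = np.sin(x_new)  # noise-free values used for evaluation

n_trees = 25
all_preds = []
for _ in range(n_trees):
    idx = rng.integers(0, len(x_demo), size=len(x_demo))  # bootstrap: sample rows with replacement
    all_preds.append(predict_stump(fit_stump(x_demo[idx], y_demo[idx]), x_new))
all_preds = np.array(all_preds)

ensemble_pred = all_preds.mean(axis=0)  # the ensemble prediction is the average over trees
individual_mse = ((all_preds - y_new) ** 2).mean(axis=1)
ensemble_mse = ((ensemble_pred - y_new) ** 2).mean()

print("average MSE of single stumps:", round(float(individual_mse.mean()), 4))
print("MSE of the averaged ensemble:", round(float(ensemble_mse), 4))
```

Because squared error is convex, the error of the averaged prediction can never exceed the average error of the individual trees, which is the core reason bagging helps.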

In [72]:
print("Random Forest Regressor")
random_forest_model = RandomForestRegressor(n_estimators=100, random_state=42)
random_forest_model.fit(x_train_scaled, y_train)
y_pred_rf = random_forest_model.predict(x_test_scaled)
mse_rf = mean_squared_error(y_test, y_pred_rf)
r2_rf = r2_score(y_test, y_pred_rf)
print("Mean Squared Error (MSE):", mse_rf)
print("R^2 Score:", r2_rf)


Random Forest Regressor
Mean Squared Error (MSE): 0.3748693407113
R^2 Score: 0.7139295083169175


The Random Forest model performs markedly better than the single Decision Tree and Linear Regression: MSE drops to 0.375 and R² rises to 0.714.
Further hyperparameter tuning might improve the results.

## d. Gradient Boosting Regressor

A boosting algorithm that builds trees sequentially, where each new tree is fitted to the errors (residuals) of the trees before it.
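
The sequential idea can be sketched for squared-error loss as follows: start from the mean, then repeatedly fit a small tree to the current residuals and add a fraction of its prediction. This is a simplified illustration on synthetic data with one-split stumps (`fit_stump`, `predict_stump`, `learning_rate` and `n_rounds` are choices made for the example), not a reproduction of scikit-learn's implementation.

```python
import numpy as np

def fit_stump(x, y):
    """Fit a one-split regression tree on a single feature; returns (threshold, left_mean, right_mean)."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    best = (np.inf, x[0], y.mean(), y.mean())
    for i in range(1, len(x)):
        if x[i] == x[i - 1]:
            continue
        left_mean, right_mean = y[:i].mean(), y[i:].mean()
        sse = ((y[:i] - left_mean) ** 2).sum() + ((y[i:] - right_mean) ** 2).sum()
        if sse < best[0]:
            best = (sse, (x[i] + x[i - 1]) / 2, left_mean, right_mean)
    return best[1:]

def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return np.where(x <= threshold, left_mean, right_mean)

rng = np.random.default_rng(0)
x_demo = rng.uniform(0, 10, size=200)
y_demo = np.sin(x_demo) + rng.normal(scale=0.2, size=200)

learning_rate = 0.1
n_rounds = 50
prediction = np.full_like(y_demo, y_demo.mean())  # round 0: predict the mean everywhere
train_mse = [((y_demo - prediction) ** 2).mean()]

for _ in range(n_rounds):
    residual = y_demo - prediction        # what the current ensemble still gets wrong
    stump = fit_stump(x_demo, residual)   # the next tree is fitted to those residuals
    prediction = prediction + learning_rate * predict_stump(stump, x_demo)
    train_mse.append(((y_demo - prediction) ** 2).mean())

print("training MSE at start:", round(float(train_mse[0]), 4))
print("training MSE at end:  ", round(float(train_mse[-1]), 4))
```

The training error shrinks with every round, which is also why the number of rounds and the learning rate need tuning to avoid overfitting.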

In [57]:
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor


In [77]:
# d. Gradient Boosting Regressor
print("Gradient Boosting Regressor")
gb_model = GradientBoostingRegressor(n_estimators=100, random_state=42)
gb_model.fit(x_train_scaled, y_train)
y_pred_gb = gb_model.predict(x_test_scaled)

# Compute performance metrics
mse_gb = mean_squared_error(y_test, y_pred_gb)
r2_gb = r2_score(y_test, y_pred_gb)
mae_gb = mean_absolute_error(y_test, y_pred_gb)

print("Mean Squared Error (MSE):", mse_gb)
print("R^2 Score:", r2_gb)



Gradient Boosting Regressor
Mean Squared Error (MSE): 0.3875166432044995
R^2 Score: 0.7042780920772506


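The MSE (0.388) is well below that of Linear Regression and the Decision Tree, and slightly above that of Random Forest (0.375). An R² of 0.704 suggests effective learning; tuning n_estimators, learning_rate and max_depth may improve it further while guarding against overfitting.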

## e. Support Vector Regressor (SVR)

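SVR fits a function that keeps as many training points as possible within a margin (epsilon) of the prediction and penalizes only the errors larger than that margin. With a kernel such as RBF (the default), it can model non-linear relationships.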
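
The loss that SVR minimizes can be illustrated in a few lines. The sketch below implements the epsilon-insensitive loss with NumPy on made-up values; `epsilon_insensitive_loss` is a hypothetical helper, and 0.1 is used because it is the default `epsilon` of scikit-learn's `SVR`.

```python
import numpy as np

def epsilon_insensitive_loss(actual, predicted, epsilon=0.1):
    """Errors inside the epsilon tube cost nothing; larger errors cost linearly."""
    return np.maximum(0.0, np.abs(actual - predicted) - epsilon)

toy_actual = np.array([2.0, 2.0, 2.0, 2.0])
toy_predicted = np.array([2.05, 1.9, 2.5, 1.0])

losses = epsilon_insensitive_loss(toy_actual, toy_predicted, epsilon=0.1)
print(losses.round(3))
```

The first two predictions fall within 0.1 of the actual value and contribute (numerically) zero loss, while the last two are penalized by the amount they exceed the tube, 0.4 and 0.9.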

In [62]:
from sklearn.svm import SVR


In [79]:
# e. Support Vector Regressor (SVR)
print("Support Vector Regressor (SVR)")
svr_model = SVR()
svr_model.fit(x_train_scaled, y_train)
y_pred_svr = svr_model.predict(x_test_scaled)

# Compute performance metrics
mse_svr = mean_squared_error(y_test, y_pred_svr)
r2_svr = r2_score(y_test, y_pred_svr)
mae_svr = mean_absolute_error(y_test, y_pred_svr)

print("Mean Squared Error (MSE):", mse_svr)
print("R^2 Score:", r2_svr)



Support Vector Regressor (SVR)
Mean Squared Error (MSE): 0.373989248318908
R^2 Score: 0.7146011248938837


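With default settings, SVR achieves an MSE of 0.374 and an R² of 0.715, the best result of the five models. SVR is sensitive to feature scaling and to its hyperparameters (C, epsilon, kernel), so tuning these may improve performance further.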

# 3. Model Evaluation and Comparison
Evaluate the performance of each algorithm using the following metrics:
* Mean Squared Error (MSE)
* Mean Absolute Error (MAE)
* R-squared Score (R²)

Compare the results of all models and identify:
* The best-performing algorithm with justification.
* The worst-performing algorithm with reasoning.
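
For reference, the three metrics can be computed by hand on a small example. The values in `actual` and `predicted` are made up for illustration; the formulas are the standard definitions that `mean_squared_error`, `mean_absolute_error` and `r2_score` implement.

```python
import numpy as np

actual = np.array([3.0, 1.0, 2.0, 4.0])
predicted = np.array([2.5, 1.0, 2.0, 5.0])

errors = actual - predicted

mse = np.mean(errors ** 2)           # average squared error; penalizes large errors heavily
mae = np.mean(np.abs(errors))        # average absolute error; same units as the target
ss_res = np.sum(errors ** 2)                    # residual sum of squares
ss_tot = np.sum((actual - actual.mean()) ** 2)  # total sum of squares around the mean
r2 = 1 - ss_res / ss_tot             # share of variance explained by the predictions

print("MSE:", mse)   # 0.3125
print("MAE:", mae)   # 0.375
print("R2: ", r2)    # 0.75
```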

In [80]:
from sklearn.metrics import mean_absolute_error

# Compute Mean Absolute Error (MAE) for all models
mae_lr = mean_absolute_error(y_test, y_pred)
mae_dt = mean_absolute_error(y_test, y_pred_tree)
mae_rf = mean_absolute_error(y_test, y_pred_rf)
mae_gb = mean_absolute_error(y_test, y_pred_gb)
mae_svr = mean_absolute_error(y_test, y_pred_svr)

# Store results in a dictionary for comparison
model_results = {
    "Linear Regression": {"MSE": mse_lr, "MAE": mae_lr, "R2": r2_lr},
    "Decision Tree Regressor": {"MSE": mse_dt, "MAE": mae_dt, "R2": r2_dt},
    "Random Forest Regressor": {"MSE": mse_rf, "MAE": mae_rf, "R2": r2_rf},
    "Gradient Boosting Regressor": {"MSE": mse_gb, "MAE": mae_gb, "R2": r2_gb},
    "Support Vector Regressor": {"MSE": mse_svr, "MAE": mae_svr, "R2": r2_svr},
}

# Identify the best and worst models based on MSE
best_model = min(model_results, key=lambda x: model_results[x]["MSE"])
worst_model = max(model_results, key=lambda x: model_results[x]["MSE"])

# Print comparison results
print("\nModel Evaluation and Comparison")
for model, scores in model_results.items():
    print(f"{model}: MSE = {scores['MSE']:.4f}, MAE = {scores['MAE']:.4f}, R² = {scores['R2']:.4f}")

print(f"\nBest Performing Model: {best_model} - Low MSE and High R² indicate better predictive performance.")
print(f"Worst Performing Model: {worst_model} - High MSE and Low R² suggest poor model generalization.")



Model Evaluation and Comparison
Linear Regression: MSE = 0.5719, MAE = 0.5670, R² = 0.5636
Decision Tree Regressor: MSE = 0.7629, MAE = 0.5890, R² = 0.4178
Random Forest Regressor: MSE = 0.3749, MAE = 0.4282, R² = 0.7139
Gradient Boosting Regressor: MSE = 0.3875, MAE = 0.4476, R² = 0.7043
Support Vector Regressor: MSE = 0.3740, MAE = 0.4247, R² = 0.7146

Best Performing Model: Support Vector Regressor - Low MSE and High R² indicate better predictive performance.
Worst Performing Model: Decision Tree Regressor - High MSE and Low R² suggest poor model generalization.


## Observations from Model Evaluation and Comparison
#### * Support Vector Regressor (SVR) is the best-performing model

It has the lowest MSE (0.3740) and highest R² (0.7146), indicating that it makes the most accurate predictions on the test set.
It also has the lowest MAE (0.4247), meaning it has the smallest average absolute prediction error.

#### * Random Forest Regressor performs similarly well

Its MSE (0.3749) and R² (0.7139) are very close to SVR, suggesting strong generalization and accuracy.
Its MAE (0.4282) is slightly higher than SVR's. The gap is small enough that a different train-test split could reverse the order, so Random Forest remains a good model choice.

#### * Gradient Boosting Regressor is slightly behind

Its MSE (0.3875), MAE (0.4476) and R² (0.7043) are close to, but slightly worse than, those of Random Forest and SVR.

#### * Linear Regression shows moderate performance

Its MSE (0.5719), MAE (0.5670) and R² (0.5636) indicate that a purely linear model captures just over half of the variance in house values.

#### * Decision Tree Regressor is the worst-performing model

It has the highest MSE (0.7629) and lowest R² (0.4178), meaning it fails to generalize well.
It also has the highest MAE (0.5890), so its typical prediction error is the largest of the five models.
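
As a quick cross-check of the ranking, the reported test-set numbers can be sorted by each metric. The dictionary below simply copies the values printed in the comparison output; the code only verifies that all three metrics agree on the order and shows how narrow the gap between the top two models is.

```python
# Metrics copied from the printed comparison output (single 80/20 split, random_state=42).
reported = {
    "Linear Regression": {"MSE": 0.5719, "MAE": 0.5670, "R2": 0.5636},
    "Decision Tree Regressor": {"MSE": 0.7629, "MAE": 0.5890, "R2": 0.4178},
    "Random Forest Regressor": {"MSE": 0.3749, "MAE": 0.4282, "R2": 0.7139},
    "Gradient Boosting Regressor": {"MSE": 0.3875, "MAE": 0.4476, "R2": 0.7043},
    "Support Vector Regressor": {"MSE": 0.3740, "MAE": 0.4247, "R2": 0.7146},
}

# Lower is better for MSE and MAE; higher is better for R2.
rank_by_mse = sorted(reported, key=lambda name: reported[name]["MSE"])
rank_by_mae = sorted(reported, key=lambda name: reported[name]["MAE"])
rank_by_r2 = sorted(reported, key=lambda name: reported[name]["R2"], reverse=True)

mse_gap = reported[rank_by_mse[1]]["MSE"] - reported[rank_by_mse[0]]["MSE"]

for position, name in enumerate(rank_by_mse, start=1):
    print(position, name, reported[name])
print("All three metrics agree on the order:", rank_by_mse == rank_by_mae == rank_by_r2)
print(f"MSE gap between the top two models: {mse_gap:.4f}")
```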