# **Regression and Evaluation**

### **1. Data Loading and Preprocessing**

In [1]:
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV 
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

In [2]:
data = fetch_california_housing(as_frame=True)
df = data.frame
df.head()

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | MedHouseVal |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 | 3.521 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 | 3.413 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 | 3.422 |


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   MedInc       20640 non-null  float64
 1   HouseAge     20640 non-null  float64
 2   AveRooms     20640 non-null  float64
 3   AveBedrms    20640 non-null  float64
 4   Population   20640 non-null  float64
 5   AveOccup     20640 non-null  float64
 6   Latitude     20640 non-null  float64
 7   Longitude    20640 non-null  float64
 8   MedHouseVal  20640 non-null  float64
dtypes: float64(9)
memory usage: 1.4 MB


In [4]:
df.describe()

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | MedHouseVal |
|---|---|---|---|---|---|---|---|---|---|
| count | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 |
| mean | 3.870671 | 28.639486 | 5.429 | 1.096675 | 1425.476744 | 3.070655 | 35.631861 | -119.569704 | 2.068558 |
| std | 1.899822 | 12.585558 | 2.474173 | 0.473911 | 1132.462122 | 10.38605 | 2.135952 | 2.003532 | 1.153956 |
| min | 0.4999 | 1.0 | 0.846154 | 0.333333 | 3.0 | 0.692308 | 32.54 | -124.35 | 0.14999 |
| 25% | 2.5634 | 18.0 | 4.440716 | 1.006079 | 787.0 | 2.429741 | 33.93 | -121.8 | 1.196 |
| 50% | 3.5348 | 29.0 | 5.229129 | 1.04878 | 1166.0 | 2.818116 | 34.26 | -118.49 | 1.797 |
| 75% | 4.74325 | 37.0 | 6.052381 | 1.099526 | 1725.0 | 3.282261 | 37.71 | -118.01 | 2.64725 |
| max | 15.0001 | 52.0 | 141.909091 | 34.066667 | 35682.0 | 1243.333333 | 41.95 | -114.31 | 5.00001 |


In [5]:
# Check for missing values
print('Missing values per column:')
df.isnull().sum()

Missing values per column:


MedInc         0
HouseAge       0
AveRooms       0
AveBedrms      0
Population     0
AveOccup       0
Latitude       0
Longitude      0
MedHouseVal    0
dtype: int64

In [6]:
X = df.drop("MedHouseVal", axis=1)
y = df["MedHouseVal"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [7]:
# Pearson correlation of each feature with the target (full dataset, for exploration only)
corrs = X.join(y).corr()['MedHouseVal'].sort_values(ascending=False)
print('\nCorrelation of features with target:')
print(corrs)


Correlation of features with target:
MedHouseVal    1.000000
MedInc         0.688075
AveRooms       0.151948
HouseAge       0.105623
AveOccup      -0.023737
Population    -0.024650
Longitude     -0.045967
AveBedrms     -0.046701
Latitude      -0.144160
Name: MedHouseVal, dtype: float64


**Preprocessing Steps**

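Checked missing values: All nine columns have 20,640 non-null entries, so no imputation is needed.

Separated inputs and output: `X` holds the eight features and `y` holds the target, `MedHouseVal`.

Split into train and test data: 80% of the rows train the models; the held-out 20% (`test_size=0.2`, `random_state=42`) measures performance on unseen data.

Scaled feature values: Feature ranges differ widely (`AveBedrms` is mostly near 1, `Population` reaches the tens of thousands), so scale-sensitive models such as SVR would otherwise be dominated by the large-valued features. Tree-based models are largely unaffected by scaling.

Used StandardScaler: Rescales each feature to zero mean and unit variance.

Used a pipeline: The scaler is fit on training data only, and re-fit inside each cross-validation fold, which avoids data leakage and keeps the steps consistent.

No encoding applied: All features are already numeric (`float64`).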
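
The leakage point above can be illustrated without scikit-learn. The sketch below is a minimal, dependency-free imitation of what `StandardScaler` does inside the pipeline; the helper names `fit_scaler` and `transform` and the toy values are assumptions made for illustration, not part of the notebook. The key detail is that the mean and standard deviation are learned from the training values and then reused, unchanged, on the test values.

```python
from statistics import mean, pstdev

def fit_scaler(column):
    """Learn the mean and population standard deviation of one feature."""
    return mean(column), pstdev(column)

def transform(column, mu, sigma):
    """Apply z = (x - mu) / sigma using previously learned statistics."""
    return [(x - mu) / sigma for x in column]

# Toy values standing in for one feature (hypothetical, not real data).
train = [2.0, 4.0, 6.0, 8.0]
test = [5.0, 10.0]

mu, sigma = fit_scaler(train)               # statistics come from train only
train_scaled = transform(train, mu, sigma)
test_scaled = transform(test, mu, sigma)    # test reuses the train statistics

print("train mean/std:", mu, sigma)
print("scaled test values:", test_scaled)
```

Fitting the scaler on train and test together would let test-set statistics influence the transformed training data, which is the leakage the pipeline is meant to prevent.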

### **2. Regression Algorithm Implementation**

In [8]:
models = {
    "LinearRegression": LinearRegression(),
    "DecisionTree": DecisionTreeRegressor(random_state=42),
    "RandomForest": RandomForestRegressor(random_state=42),
    "GradientBoosting": GradientBoostingRegressor(random_state=42),
    "SVR": SVR()
}

**Descriptions**

LinearRegression: Linear model; fast, interpretable; good baseline if relationships are roughly linear.

DecisionTree: Non-linear, captures interactions; interpretable but prone to overfitting.

RandomForest: Ensemble of trees; reduces overfitting and variance compared to a single tree.

GradientBoosting: Sequential ensemble; often high predictive performance on structured data.

SVR: Kernel-based; can capture complex relations but may be slower and require scaling/tuning.

### **3. Model Evaluation and Comparison**

In [9]:
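def evaluate_model(name, model):
    """Fit a scaler + model pipeline on the training split and score it on the test split.

    `name` is not used inside the function; it is kept so existing calls keep working.
    """
    pipeline = Pipeline([("scaler", StandardScaler()), ("model", model)])
    pipeline.fit(X_train, y_train)  # scaler statistics are learned from X_train only
    y_pred = pipeline.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mae = mean_absolute_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    return mse, mae, r2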

In [10]:
results = []
for name, model in models.items():
    mse, mae, r2 = evaluate_model(name, model)
    results.append([name, mse, mae, r2])
results_df = pd.DataFrame(results,columns=["Model", "MSE", "MAE", "R2 Score"])
results_df

| | Model | MSE | MAE | R2 Score |
|---|---|---|---|---|
| 0 | LinearRegression | 0.555892 | 0.5332 | 0.575788 |
| 1 | DecisionTree | 0.493969 | 0.453904 | 0.623042 |
| 2 | RandomForest | 0.25517 | 0.327425 | 0.805275 |
| 3 | GradientBoosting | 0.293999 | 0.37165 | 0.775643 |
| 4 | SVR | 0.357004 | 0.398599 | 0.727563 |


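**Explanation**

Best-performing algorithm: The Random Forest regressor has the highest R² (0.805) and the lowest MSE (0.255) and MAE (0.327) on the test set. Averaging many trees lets it capture non-linear patterns without the overfitting of a single tree. Because `MedHouseVal` is expressed in units of $100,000, an MAE of 0.327 corresponds to an average error of roughly $32,700.

Worst-performing algorithm: Linear Regression has the lowest R² (0.576) and the highest errors on the test set, which suggests that a purely linear fit misses non-linear structure in the housing data.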
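
For reference, the three metrics compared above can be computed by hand. The sketch below uses made-up numbers and hypothetical helper names (`mse_manual`, `mae_manual`, `r2_manual`) so that it does not clash with the notebook's variables; it is meant to mirror the standard definitions behind `mean_squared_error`, `mean_absolute_error`, and `r2_score`, not to replace them.

```python
def mse_manual(y_true, y_pred):
    """Mean of squared errors; penalizes large errors more heavily."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae_manual(y_true, y_pred):
    """Mean of absolute errors; in the same units as the target."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r2_manual(y_true, y_pred):
    """1 - SS_res / SS_tot: share of target variance explained by the model."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Toy targets and predictions (hypothetical, not taken from the dataset).
toy_true = [1.0, 2.0, 3.0, 4.0]
toy_pred = [1.5, 2.0, 2.5, 5.0]

print("MSE:", mse_manual(toy_true, toy_pred))
print("MAE:", mae_manual(toy_true, toy_pred))
print("R2: ", round(r2_manual(toy_true, toy_pred), 4))
```

An R² of 0 would mean the model does no better than always predicting the mean of the target, which is why it is a convenient scale-free way to rank the five models.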

### **4. Cross-Validation and Hyperparameter Tuning**

In [11]:
for name, model in models.items():
    pipeline = Pipeline([("scaler", StandardScaler()),("model", model)])
    scores = cross_val_score(pipeline, X_train, y_train, cv=5, scoring="r2")
    print(f"{name} CV R2 Mean: {scores.mean():.4f}, Std: {scores.std():.4f}")

LinearRegression CV R2 Mean: 0.6115, Std: 0.0065
DecisionTree CV R2 Mean: 0.6069, Std: 0.0245
RandomForest CV R2 Mean: 0.8041, Std: 0.0055
GradientBoosting CV R2 Mean: 0.7866, Std: 0.0032
SVR CV R2 Mean: 0.7373, Std: 0.0050
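
To make the `cv=5` setting concrete, the sketch below shows how five unshuffled folds partition the row indices, which is the kind of split `cross_val_score` uses by default for a regressor when given an integer. The function name `k_fold_indices` is hypothetical and the example uses 10 rows rather than the real training set.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous, unshuffled folds."""
    # The first (n_samples % k) folds get one extra row so every row is used once.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        val_idx = list(range(start, stop))
        train_idx = [i for i in range(n_samples) if i < start or i >= stop]
        yield train_idx, val_idx
        start = stop

folds = list(k_fold_indices(10, 5))
for train_idx, val_idx in folds:
    print("validate on", val_idx, "| train on", len(train_idx), "rows")
```

Each row is used for validation exactly once, so the mean and standard deviation reported above summarize five separate fit-and-score rounds per model.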


In [12]:
param_grid = {"model__n_estimators": [100, 200], "model__max_depth": [None, 10, 20]}
pipeline = Pipeline([("scaler", StandardScaler()), ("model", RandomForestRegressor(random_state=42))])
grid = GridSearchCV(pipeline, param_grid, cv=5, scoring="r2", n_jobs=-1)
grid.fit(X_train, y_train)
grid.best_params_

{'model__max_depth': 20, 'model__n_estimators': 200}

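**Explanation**

n_estimators: Adding trees stabilizes predictions by averaging over more estimators, at the cost of computation time. The search selected 200 over 100.

max_depth: Limiting tree depth can curb overfitting and help the model generalize to unseen data. Here the search selected a depth of 20 over both 10 and unlimited depth (`None`).

min_samples_split: This parameter is not part of `param_grid` above, so it stayed at its default of 2. Higher values stop small nodes from splitting further, which yields simpler, more robust trees; it is a natural candidate for a wider search.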
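
The cost of the grid search follows directly from the grid size. The sketch below enumerates the same two-parameter grid with `itertools.product` and counts the fits; the variable names (`search_space`, `candidates`) are illustrative assumptions, and the loop only imitates the bookkeeping of `GridSearchCV`, not its fitting. It also shows how the `model__` prefix splits into a pipeline step name and a parameter name.

```python
from itertools import product

# Same grid as in the cell above, copied under a different name.
search_space = {
    "model__n_estimators": [100, 200],
    "model__max_depth": [None, 10, 20],
}

# Every combination of values is one candidate configuration.
names = list(search_space)
candidates = [dict(zip(names, values)) for values in product(*search_space.values())]

n_folds = 5
total_fits = len(candidates) * n_folds  # one fit per candidate per fold

for candidate in candidates:
    print(candidate)
print("candidates:", len(candidates), "| cross-validation fits:", total_fits)

# "<step>__<param>" routes a value to the named pipeline step.
step, param = "model__max_depth".split("__", 1)
print("step:", step, "| parameter:", param)
```

With the default `refit=True`, one additional fit on the whole training set follows the search, which is what `grid.best_estimator_` returns.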

### **5. Selecting the Best Regression Model**

In [13]:
best_model = grid.best_estimator_
y_pred = best_model.predict(X_test)
print("Best Random Forest Performance:")
print("MSE:", mean_squared_error(y_test, y_pred))
print("MAE:", mean_absolute_error(y_test, y_pred))
print("R2:", r2_score(y_test, y_pred))

Best Random Forest Performance:
MSE: 0.2545042828477844
MAE: 0.3271167686567187
R2: 0.8057825556190614


**Insights**

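Best Model: Random Forest Regressor. The tuned model reaches R² = 0.8058, MSE = 0.2545, and MAE = 0.3271 on the test set, the best values among all models evaluated. Tuning improved on the default Random Forest only marginally (R² 0.8058 vs. 0.8053).

Justification: It also has the highest cross-validation R² (mean 0.8041, standard deviation 0.0055), so its test-set result is not an artifact of one particular split.

Why it outperforms others: It averages many decision trees, which reduces the variance of a single tree and captures non-linear patterns in the housing data that a linear model misses.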
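
The variance-reduction argument can be seen in a small simulation. The sketch below is a deliberately idealized toy: each "tree" is a hypothetical predictor that is correct on average but noisy, and the noise is independent between predictors. Trees in a real forest are correlated, so the actual reduction is smaller than shown here. All names and constants are assumptions chosen for illustration.

```python
import random
from statistics import mean, pvariance

random.seed(42)
TRUE_VALUE = 2.0  # hypothetical target value

def noisy_prediction():
    """Stand-in for one high-variance tree: unbiased, with unit-variance noise."""
    return TRUE_VALUE + random.gauss(0, 1)

def ensemble_prediction(n_members):
    """Average the predictions of n_members independent noisy predictors."""
    return mean(noisy_prediction() for _ in range(n_members))

# Repeat each experiment many times to estimate the spread of the predictions.
single = [noisy_prediction() for _ in range(500)]
averaged = [ensemble_prediction(50) for _ in range(500)]

print("variance of a single predictor:", round(pvariance(single), 4))
print("variance of a 50-member average:", round(pvariance(averaged), 4))
```

Both sets of predictions center on the same value; only the spread shrinks, which matches the gap between the Decision Tree and Random Forest scores reported earlier.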