# <u>Assignment 6</u>:
# WEEK 6: Model Evaluation and Hyperparameter Tuning
   Train multiple machine learning models and evaluate their performance using metrics such as accuracy, precision, recall, and F1-score. Implement hyperparameter tuning techniques like GridSearchCV and RandomizedSearchCV to optimize model parameters. Analyze the results to select the best-performing model.

In [5]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Part B: <u>Regression</u> 
---
### Step B1: Load and Explore the Regression Dataset
We are using the **California Housing** dataset from sklearn. It’s a regression problem where the goal is to predict the **median house value** in a district based on features like average income, number of rooms, location, etc.


In [6]:
from sklearn.datasets import fetch_california_housing

In [7]:
data=fetch_california_housing()
housing=pd.DataFrame(data.data, columns=data.feature_names)
housing['target']=data.target

In [8]:
housing.head()

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,target
0,8.3252,41.0,6.984127,1.02381,322.0,2.555556,37.88,-122.23,4.526
1,8.3014,21.0,6.238137,0.97188,2401.0,2.109842,37.86,-122.22,3.585
2,7.2574,52.0,8.288136,1.073446,496.0,2.80226,37.85,-122.24,3.521
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25,3.413
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25,3.422


In [11]:
housing.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 9 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   MedInc      20640 non-null  float64
 1   HouseAge    20640 non-null  float64
 2   AveRooms    20640 non-null  float64
 3   AveBedrms   20640 non-null  float64
 4   Population  20640 non-null  float64
 5   AveOccup    20640 non-null  float64
 6   Latitude    20640 non-null  float64
 7   Longitude   20640 non-null  float64
 8   target      20640 non-null  float64
dtypes: float64(9)
memory usage: 1.4 MB


In [12]:
housing.describe()

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,target
count,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0
mean,3.870671,28.639486,5.429,1.096675,1425.476744,3.070655,35.631861,-119.569704,2.068558
std,1.899822,12.585558,2.474173,0.473911,1132.462122,10.38605,2.135952,2.003532,1.153956
min,0.4999,1.0,0.846154,0.333333,3.0,0.692308,32.54,-124.35,0.14999
25%,2.5634,18.0,4.440716,1.006079,787.0,2.429741,33.93,-121.8,1.196
50%,3.5348,29.0,5.229129,1.04878,1166.0,2.818116,34.26,-118.49,1.797
75%,4.74325,37.0,6.052381,1.099526,1725.0,3.282261,37.71,-118.01,2.64725
max,15.0001,52.0,141.909091,34.066667,35682.0,1243.333333,41.95,-114.31,5.00001


### Step B2: Preprocess the Regression Data
 Since regression models like **Linear Regression** are sensitive to feature scales, we apply **StandardScaler** to normalize the data.

In [13]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

x=housing.drop('target', axis=1)
y=housing['target']

x_train, x_test, y_train, y_test =train_test_split(x,y,test_size=0.2, random_state=42)

scaler=StandardScaler()
x_train_scaled= scaler.fit_transform(x_train)
x_test_scaled= scaler.transform(x_test)

### Step B3: Train and Evaluate Regression Models
**We'll train the following models:**
- Linear Regression
- Decision Tree Regressor
- Random Forest Regressor

**Evaluation Matrics**
- MAE
- MSE
- RMSE
- R2 score

In [17]:
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

In [19]:
models={
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(),
    "Random Forest": RandomForestRegressor()
}

result=[]

for name, model in models.items():
    model.fit(x_train_scaled, y_train)
    y_pred=model.predict(x_test_scaled)
    
    #Evaluation
    mae = mean_absolute_error(y_test,y_pred)
    mse= mean_squared_error(y_test,y_pred)
    rmse = (mean_squared_error(y_test,y_pred))**1/2
    r2=r2_score(y_test,y_pred)
    
    result.append({
        "Model":name,
        "MAE": round(mae, 2),
        "MSE": round(mse,2),
        "RMSE": round(rmse,2),
        "R2 Score": round(r2*100,2),
        
    })

In [20]:
for res in result:
    for key, value in res.items():
        print(f"{key}: {value}")
    print()


Model: Linear Regression
MAE: 0.53
MSE: 0.56
RMSE: 0.28
R2 Score: 57.58

Model: Decision Tree
MAE: 0.46
MSE: 0.5
RMSE: 0.25
R2 Score: 61.59

Model: Random Forest
MAE: 0.33
MSE: 0.26
RMSE: 0.13
R2 Score: 80.46



In [21]:
results = pd.DataFrame(result)
results.set_index('Model', inplace=True)
print(results)

                    MAE   MSE  RMSE  R2 Score
Model                                        
Linear Regression  0.53  0.56  0.28     57.58
Decision Tree      0.46  0.50  0.25     61.59
Random Forest      0.33  0.26  0.13     80.46


---
### Step B4: Hyperparameter Tuning (Classification)
##### We'll tune hyperparameters for
- Decision Tree Regressor
- Random Forest Regressor

##### We will use two techniques to tune our models:
- **GridSearchCV** for `Decision Tree` : which tries all possible combinations of given parameters
- **RandomizedSearchCV** for `Random Forest` : Randomly selects a few combinations from a larger grid
----
### # GridSearchCV
---

In [22]:
#Decision Tree, Using GridSearchCV
from sklearn.model_selection import GridSearchCV

parameters={
    'max_depth': [3, 5, 10, None],
    'min_samples_split': [2, 5, 10],
    'criterion': ['squared_error', 'friedman_mse']
}

In [25]:
gridCV = GridSearchCV(DecisionTreeRegressor(), parameters, cv=5, scoring='r2')
gridCV.fit(x_train_scaled, y_train)

In [26]:
print("Best Parameters (Decision Tree):", gridCV.best_params_)
print("Best Accuracy Score:", round(gridCV.best_score_*100, 2))

Best Parameters (Decision Tree): {'criterion': 'friedman_mse', 'max_depth': 10, 'min_samples_split': 10}
Best Accuracy Score: 69.87


---
### # RandomizedSearchCV
---

In [29]:
#Random Forest Classifier using RandomizedSearchCV
from sklearn.model_selection import RandomizedSearchCV

parameter={
    'n_estimators': [50, 100, 150],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'max_features': ['sqrt', 'log2']
}

In [31]:
randomCV=RandomizedSearchCV(RandomForestRegressor(), parameter, n_iter=10, cv=3, scoring='r2', random_state=42)
randomCV.fit(x_train_scaled,y_train)

In [32]:
print("Best Parameters (Decision Tree):", randomCV.best_params_)
print("Best Accuracy Score:", round(randomCV.best_score_*100, 2))

Best Parameters (Decision Tree): {'n_estimators': 100, 'min_samples_split': 2, 'max_features': 'log2', 'max_depth': None}
Best Accuracy Score: 81.41


### Step B5: Final Comparison & Best Model Selection
---
### Final Comparison – Regression Models

We trained and evaluated multiple regression models on the **California Housing** dataset:

| Model                    | MAE   | MSE   | RMSE  | R² Score |
|-------------------------|-------|-------|-------|----------|
| Linear Regression       | 0.53  | 0.56  | 0.28  | 57.58%   |
| Decision Tree Regressor | 0.46  | 0.50  | 0.25  | 61.59%   |
| Random Forest Regressor | 0.33  | 0.26  | 0.13  | 80.46%   |

We then tuned the models using hyperparameter optimization:

### Tuned Models:
- **GridSearchCV – Decision Tree**  
  - Best Parameters: `{'criterion': 'friedman_mse', 'max_depth': 10, 'min_samples_split': 10}`  
  - Best R² Score (Cross-Validation): **69.87%**

- **RandomizedSearchCV – Random Forest**  
  - Best Parameters: `{'n_estimators': 100, 'min_samples_split': 2, 'max_features': 'log2', 'max_depth': None}`  
  - Best R² Score (Cross-Validation): **81.41%**

### Final Regression Model Selected:
**Tuned Random Forest Regressor** – It provided the highest accuracy with the best cross-validated R² score and lowest error values, making it the best choice for predicting house prices in this task.
