# <u>Assignment 6</u>:
# WEEK 6: Model Evaluation and Hyperparameter Tuning
   Train multiple machine learning models and evaluate their performance using metrics such as accuracy, precision, recall, and F1-score. Implement hyperparameter tuning techniques like GridSearchCV and RandomizedSearchCV to optimize model parameters. Analyze the results to select the best-performing model.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Part A: <u>Classification</u> 
---
### Step A1: Load and Explore the Classification Dataset
We are using the **Breast Cancer** dataset from sklearn, which is a binary classification task.
The goal is to predict whether a tumor is **malignant (0)** or **benign (1)** based on features like radius, texture, and symmetry.

In [2]:
from sklearn.datasets import load_breast_cancer

In [8]:
data=load_breast_cancer()
cancerB=pd.DataFrame(data.data, columns=data.feature_names)
cancerB['target']=data.target

In [9]:
cancerB.head()

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,target
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,0
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,0
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,0
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,0
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,0


In [10]:
cancerB.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 31 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   mean radius              569 non-null    float64
 1   mean texture             569 non-null    float64
 2   mean perimeter           569 non-null    float64
 3   mean area                569 non-null    float64
 4   mean smoothness          569 non-null    float64
 5   mean compactness         569 non-null    float64
 6   mean concavity           569 non-null    float64
 7   mean concave points      569 non-null    float64
 8   mean symmetry            569 non-null    float64
 9   mean fractal dimension   569 non-null    float64
 10  radius error             569 non-null    float64
 11  texture error            569 non-null    float64
 12  perimeter error          569 non-null    float64
 13  area error               569 non-null    float64
 14  smoothness error         5

In [11]:
cancerB['target'].value_counts()

target
1    357
0    212
Name: count, dtype: int64

### NOTE: Target Class Distribution
- 0 = malignant (`count = 212`)
- 1 = benign (`count = 357`)
---
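
Because the two classes are not evenly split, it helps to know what accuracy a trivial classifier would reach before judging any model. The sketch below recomputes the class shares from the dataset; the variable names (`counts`, `shares`, `majority_baseline`) are illustrative and not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

# Reload the labels so the snippet runs on its own
y = load_breast_cancer().target

# Count samples per class: index 0 = malignant, index 1 = benign
counts = np.bincount(y)
shares = counts / counts.sum()

# Accuracy of a model that always predicts the majority class (benign)
majority_baseline = shares.max()

print("Counts per class:", counts)
print("Majority-class baseline accuracy (%):", round(majority_baseline * 100, 2))
```

With 357 benign samples out of 569, always predicting "benign" would already be right about 62.7% of the time, so the accuracies reported later should be read against that floor.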

### Step A2: Preprocess the Classification Data

In [24]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

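# Separate the features from the target column
x = cancerB.drop('target', axis=1)
y = cancerB['target']

# Hold out 20% of the rows for testing; random_state fixes the split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# Fit the scaler on the training data only, then apply it to both splits
scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)
x_test_scaled = scaler.transform(x_test)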
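
The scaler is fitted on the training split only, and the test split is transformed with the training statistics. A quick way to see the effect is sketched below, assuming the same split parameters as above; the upper-case names (`X_train`, `X_test_scaled`, and so on) are chosen for this standalone snippet and are not the notebook's variables.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean/std from train only
X_test_scaled = scaler.transform(X_test)        # reuses the training mean/std

# Every training column ends up with mean ~0 and standard deviation ~1
train_means = X_train_scaled.mean(axis=0)
train_stds = X_train_scaled.std(axis=0)

# Test columns are only approximately centred, because they were
# standardized with statistics learned from the training split
test_means = X_test_scaled.mean(axis=0)

print("Train shape:", X_train_scaled.shape, "Test shape:", X_test_scaled.shape)
print("Train columns standardized:", np.allclose(train_means, 0), np.allclose(train_stds, 1))
print("Largest absolute test-column mean:", np.abs(test_means).max())
```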

### Step A3: Train and Evaluate Classification Models
**We'll train the following models:**
- Logistic Regression
- Decision Tree Classifier
- Random Forest Classifier

**Evaluation Metrics:**
- Accuracy
- Precision
- Recall
- F1-score
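
All four metrics can be derived from the counts in a confusion matrix. The sketch below works through the arithmetic on a small made-up label vector (invented for illustration, not data from this notebook) and checks the hand-computed values against scikit-learn.

```python
from sklearn.metrics import (
    accuracy_score, confusion_matrix, f1_score, precision_score, recall_score
)

# Toy labels, invented for illustration only (1 = positive class)
y_true = [1, 1, 1, 1, 0, 0, 0, 1, 0, 1]
y_hat = [1, 1, 0, 1, 0, 1, 0, 1, 0, 1]

# For binary labels {0, 1}, ravel() yields the counts in this order
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are correct
precision = tp / (tp + fp)                   # share of predicted positives that are correct
recall = tp / (tp + fn)                      # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

# The hand-computed values agree with scikit-learn's functions
assert abs(accuracy - accuracy_score(y_true, y_hat)) < 1e-12
assert abs(precision - precision_score(y_true, y_hat)) < 1e-12
assert abs(recall - recall_score(y_true, y_hat)) < 1e-12
assert abs(f1 - f1_score(y_true, y_hat)) < 1e-12

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```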

In [21]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

In [57]:
models={
    "Logistic Regression": LogisticRegression(),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier()
}

result=[]

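for name, model in models.items():
    # Fit on the scaled training data and predict on the scaled test data
    model.fit(x_train_scaled, y_train)
    y_pred = model.predict(x_test_scaled)

    # Evaluation (precision, recall and F1 use label 1, benign, as the positive class by default)
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)

    # Store the metrics as percentages rounded to two decimals
    result.append({
        "Model": name,
        "Accuracy": round(accuracy * 100, 2),
        "Precision": round(precision * 100, 2),
        "Recall": round(recall * 100, 2),
        "F1 Score": round(f1 * 100, 2)
    })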

In [58]:
for res in result:
    for key, value in res.items():
        print(f"{key}: {value}")
    print()


Model: Logistic Regression
Accuracy: 97.37
Precision: 97.22
Recall: 98.59
F1 Score: 97.9

Model: Decision Tree
Accuracy: 92.98
Precision: 94.37
Recall: 94.37
F1 Score: 94.37

Model: Random Forest
Accuracy: 96.49
Precision: 95.89
Recall: 98.59
F1 Score: 97.22



In [61]:
results = pd.DataFrame(result)
results.set_index('Model', inplace=True)
print(results)

                     Accuracy  Precision  Recall  F1 Score
Model                                                     
Logistic Regression     97.37      97.22   98.59     97.90
Decision Tree           92.98      94.37   94.37     94.37
Random Forest           96.49      95.89   98.59     97.22
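
Note that `precision_score`, `recall_score`, and `f1_score` treat label 1 as the positive class by default, which in this dataset means benign. If the malignant class (0) is the one of interest, the same metrics can be requested for it with `pos_label=0`. The sketch below is one way to do that, assuming the same split and a default `LogisticRegression`; the names `recall_malignant` and `precision_malignant` are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

clf = LogisticRegression().fit(X_train_scaled, y_train)
y_pred = clf.predict(X_test_scaled)

# Default: metrics are computed for label 1 (benign)
recall_benign = recall_score(y_test, y_pred)

# Same metrics with malignant (label 0) treated as the positive class
recall_malignant = recall_score(y_test, y_pred, pos_label=0)
precision_malignant = precision_score(y_test, y_pred, pos_label=0)

print("Recall (benign):   ", round(recall_benign * 100, 2))
print("Recall (malignant):", round(recall_malignant * 100, 2))

# Per-class summary; labels are sorted, so 0 -> malignant, 1 -> benign
print(classification_report(y_test, y_pred, target_names=["malignant", "benign"]))
```

Recall on the malignant class measures how many malignant tumors are caught, which is often the more relevant number in a screening setting.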


### Step A4: Hyperparameter Tuning (Classification)
##### We'll tune hyperparameters for:
- Decision Tree Classifier
- Random Forest Classifier

##### We will use two techniques to tune our models:
- **GridSearchCV** for `Decision Tree`: evaluates every combination of the given parameter values using cross-validation
- **RandomizedSearchCV** for `Random Forest`: evaluates a fixed number (`n_iter`) of randomly sampled combinations from a larger grid
----
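
To make the difference in search cost concrete, the sketch below counts the candidates each technique would evaluate, using the two parameter grids from the tuning cells in this step. `ParameterGrid` and `ParameterSampler` are the scikit-learn helpers that enumerate and sample such grids; the variable names are illustrative.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Grid used for the Decision Tree (GridSearchCV)
tree_grid = {
    'max_depth': [3, 5, 7],
    'criterion': ['gini', 'entropy'],
    'min_samples_split': [2, 4, 6],
}

# Grid used for the Random Forest (RandomizedSearchCV)
forest_grid = {
    'n_estimators': [50, 100, 150],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'max_features': ['sqrt', 'log2'],
}

# Exhaustive search: every combination is a candidate
n_tree = len(ParameterGrid(tree_grid))      # 3 * 2 * 3
n_forest = len(ParameterGrid(forest_grid))  # 3 * 4 * 3 * 2

# Randomized search: only n_iter combinations are drawn from the grid
sampled = list(ParameterSampler(forest_grid, n_iter=10, random_state=42))

print("Decision Tree candidates (grid search):", n_tree)
print("Random Forest combinations available:", n_forest)
print("Random Forest candidates actually tried:", len(sampled))
```

With `cv=5`, the grid search fits 18 × 5 = 90 trees, while the randomized search fits 10 × 5 = 50 forests instead of the 72 × 5 = 360 an exhaustive search over the same grid would need.
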
### GridSearchCV
---

In [68]:
#Decision Tree, Using GridSearchCV
from sklearn.model_selection import GridSearchCV

parameters={
    'max_depth':[3,5,7],
    'criterion': ['gini','entropy'],
    'min_samples_split':[2,4,6]
}

In [69]:
gridCV = GridSearchCV(DecisionTreeClassifier(), parameters, cv=5, scoring='accuracy')
gridCV.fit(x_train_scaled, y_train)

In [75]:
print("Best Parameters (Decision Tree):", gridCV.best_params_)
print("Best Accuracy Score:", round(gridCV.best_score_*100, 2))

Best Parameters (Decision Tree): {'criterion': 'entropy', 'max_depth': 5, 'min_samples_split': 6}
Best Accuracy Score: 94.51
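
The score above is a mean cross-validation accuracy over folds of the training data, so it is not directly comparable with the test-set accuracies in Step A3. One way to put the tuned tree on the same footing is to score `best_estimator_` on the held-out test split. The sketch below also wraps the scaler and the tree in a `Pipeline` so that the scaler is refitted inside each fold; that is a variation on the notebook's approach rather than what the cells above do, and `random_state=42` on the tree is an added assumption for reproducibility, so the numbers may differ from those printed above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scaling happens inside the pipeline, so each CV fold fits its own scaler
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("tree", DecisionTreeClassifier(random_state=42)),  # seed is an assumption
])

# Same grid as above; the "tree__" prefix routes parameters to the tree step
param_grid = {
    "tree__max_depth": [3, 5, 7],
    "tree__criterion": ["gini", "entropy"],
    "tree__min_samples_split": [2, 4, 6],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

# best_estimator_ is refitted on the full training split by default
test_accuracy = accuracy_score(y_test, search.best_estimator_.predict(X_test))

print("Best parameters:", search.best_params_)
print("Mean CV accuracy (%):", round(search.best_score_ * 100, 2))
print("Test accuracy (%):", round(test_accuracy * 100, 2))
```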


---
### RandomizedSearchCV
---

In [71]:
#Random Forest Classifier using RandomizedSearchCV
from sklearn.model_selection import RandomizedSearchCV

parameter={
    'n_estimators': [50, 100, 150],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'max_features': ['sqrt', 'log2']
}

In [72]:
randomCV=RandomizedSearchCV(RandomForestClassifier(), parameter, n_iter=10, cv=5, scoring='accuracy', random_state=42)
randomCV.fit(x_train_scaled,y_train)

In [76]:
print("Best Parameters (Random Forest):", randomCV.best_params_)
print("Best Accuracy Score:", round(randomCV.best_score_*100, 2))

Best Parameters (Random Forest): {'n_estimators': 50, 'min_samples_split': 2, 'max_features': 'sqrt', 'max_depth': 10}
Best Accuracy Score: 96.26
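
`RandomizedSearchCV` also accepts distributions in place of fixed lists, which lets it draw values a hand-written grid would skip. The sketch below uses `scipy.stats.randint` for two of the parameters. The ranges are illustrative assumptions rather than values from this assignment, `n_iter=5` and `cv=3` are chosen only to keep the run short, and the forest is trained on unscaled features because tree-based models do not depend on feature scale.

```python
from scipy.stats import randint
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Lists are sampled uniformly; randint(low, high) draws integers in [low, high)
param_distributions = {
    "n_estimators": randint(50, 151),     # any integer from 50 to 150
    "max_depth": [None, 10, 20, 30],
    "min_samples_split": randint(2, 11),  # any integer from 2 to 10
    "max_features": ["sqrt", "log2"],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),  # seed is an assumption
    param_distributions,
    n_iter=5,
    cv=3,
    scoring="accuracy",
    random_state=42,
)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Mean CV accuracy (%):", round(search.best_score_ * 100, 2))
```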


### Step A5: Final Comparison & Best Model Selection
---
### Final Comparison – Classification Models

We trained and evaluated three classification models on the Breast Cancer dataset:

| Model                | Accuracy | Precision | Recall | F1 Score |
|---------------------|----------|-----------|--------|----------|
| Logistic Regression | 97.37%   | 97.22%    | 98.59% | 97.90%   |
| Decision Tree        | 92.98%   | 94.37%    | 94.37% | 94.37%   |
| Random Forest        | 96.49%   | 95.89%    | 98.59% | 97.22%   |

We also performed **hyperparameter tuning** using:
- **GridSearchCV** on Decision Tree:
  - Best Parameters: `{'criterion': 'entropy', 'max_depth': 5, 'min_samples_split': 6}`
  - Best Accuracy (Cross-Validation): **94.51%**
- **RandomizedSearchCV** on Random Forest:
  - Best Parameters: `{'n_estimators': 50, 'min_samples_split': 2, 'max_features': 'sqrt', 'max_depth': 10}`
  - Best Accuracy (Cross-Validation): **96.26%**

### Final Model Selection:
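While **Logistic Regression** slightly outperformed the others in test accuracy and F1-score, the **Random Forest Classifier** showed:
- Performance very close to Logistic Regression (96.49% vs. 97.37% test accuracy, with an identical recall of 98.59%)
- Higher flexibility and robustness: as an ensemble of trees, it can model non-linear relationships between features and is less sensitive to the quirks of any single tree
- Good results both before tuning (96.49% test accuracy) and after tuning (96.26% mean cross-validation accuracy)

Therefore, we select the tuned **Random Forest Classifier** as the **final classification model**, balancing performance and generalization.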
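
Because the tuned scores are cross-validation averages while the comparison table reports test-set metrics, one way to sanity-check the selection is to score all three model families under the same cross-validation protocol. The sketch below plugs in the tuned hyperparameters reported above; the `StratifiedKFold` setup, `random_state=42`, `max_iter=1000`, and the pipeline around logistic regression are added assumptions, so the numbers will not exactly reproduce those in this notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

candidates = {
    # Logistic regression benefits from scaling, so it gets a pipeline
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    # Tuned hyperparameters as reported by GridSearchCV above
    "Decision Tree (tuned)": DecisionTreeClassifier(
        criterion="entropy", max_depth=5, min_samples_split=6, random_state=42
    ),
    # Tuned hyperparameters as reported by RandomizedSearchCV above
    "Random Forest (tuned)": RandomForestClassifier(
        n_estimators=50, min_samples_split=2, max_features="sqrt",
        max_depth=10, random_state=42
    ),
}

# The same folds are reused for every model so the scores are comparable
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = {
    name: cross_val_score(model, X_train, y_train, cv=cv, scoring="accuracy")
    for name, model in candidates.items()
}

for name, scores in cv_scores.items():
    print(f"{name}: {scores.mean() * 100:.2f} +/- {scores.std() * 100:.2f}")
```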

