[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Nepal-College-of-Information-Technology/AI-Data-Science-Workshop-2024/blob/main/Day%2011%3A%20Model%20Evaluation%20and%20Hyperparameter%20Tuning/Part2_Hyperparameter_Tuning.ipynb)


# Part 2: Hyperparameter Tuning

In this notebook, we will learn how to improve model performance by tuning the **hyperparameters**. Specifically, we will perform **grid search** to find the best hyperparameters for a **RandomForestClassifier**.

---

## Step 1: Import Libraries and Load Dataset

We will use the **breast cancer dataset** from `sklearn.datasets` for this exercise. It is a binary classification problem, and we hold out 20% of the samples as a test set.

In [1]:
# Import necessary libraries
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer

# Load the breast cancer dataset
data = load_breast_cancer()
X = data['data']
y = data['target']

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
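
Before tuning, it can help to check the size of the dataset and how the two classes are balanced, since accuracy is harder to interpret on imbalanced data. The following is a minimal sketch; the variable name `class_counts` is introduced here for illustration only.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X, y = data["data"], data["target"]

# Number of samples (rows) and features (columns)
print(f"Feature matrix shape: {X.shape}")

# Pair each class name with the number of samples carrying that label
class_counts = {
    str(name): int(count)
    for name, count in zip(data["target_names"], np.bincount(y))
}
print(f"Samples per class: {class_counts}")
```

In this dataset, label `0` is `malignant` and label `1` is `benign`, which is the order of the rows in the classification report in Step 4.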

---

## Step 2: Define the Hyperparameter Grid

In **grid search**, we define a list of candidate values for each hyperparameter we want to tune. The search then trains and evaluates the model on every combination of these values and selects the combination with the best cross-validated score.

In this case, we will tune the following hyperparameters for the **RandomForestClassifier**:

1. **n_estimators**: The number of trees in the forest.
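2. **max_depth**: The maximum depth of each tree. `None` lets a tree grow until its leaves are pure or contain fewer than `min_samples_split` samples.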
3. **min_samples_split**: The minimum number of samples required to split an internal node.
4. **min_samples_leaf**: The minimum number of samples required to be at a leaf node.

In [2]:
# Define the hyperparameter grid
param_grid = {
    'n_estimators': [10, 50, 100],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Initialize the RandomForestClassifier
model = RandomForestClassifier(random_state=42)
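
The cost of grid search grows multiplicatively with each hyperparameter added. As a rough sketch, `ParameterGrid` from `sklearn.model_selection` can count the combinations in the grid before any model is trained. The name `cv_folds` is introduced here for illustration and is set to the 5 folds used in Step 3.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    'n_estimators': [10, 50, 100],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Every combination of the listed values is one candidate: 3 * 4 * 3 * 3
n_candidates = len(ParameterGrid(param_grid))
print(f"Candidates: {n_candidates}")  # Candidates: 108

# Each candidate is trained once per cross-validation fold
cv_folds = 5
total_fits = n_candidates * cv_folds
print(f"Total fits: {total_fits}")  # Total fits: 540
```

Adding a fifth hyperparameter with three values would triple the number of fits, so it is worth checking this count before starting a long search.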

---

## Step 3: Perform Grid Search for Hyperparameter Tuning

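We will use **GridSearchCV** to perform grid search with 5-fold cross-validation on the training set. Because no `scoring` argument is passed, each candidate is scored with the classifier's default metric, accuracy. Setting `n_jobs=-1` runs the fits in parallel on all available CPU cores. After the search, the best combination is refit on the full training set.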

In [3]:
# Perform grid search with cross-validation
grid_search = GridSearchCV(model, param_grid, cv=5, n_jobs=-1, verbose=1)
grid_search.fit(X_train, y_train)

# Print the best parameters found by GridSearchCV
print(f"Best Hyperparameters: {grid_search.best_params_}")

Fitting 5 folds for each of 108 candidates, totalling 540 fits
Best Hyperparameters: {'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 100}
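
`best_params_` reports only the winner. To see how close the other candidates came, the per-candidate scores stored in `cv_results_` can be loaded into a DataFrame. The sketch below uses a deliberately small grid (`small_grid`, an assumption made so the example runs quickly) rather than the full grid from Step 2.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Reduced grid for illustration only: 2 * 2 = 4 candidates
small_grid = {"n_estimators": [10, 50], "max_depth": [None, 10]}

search = GridSearchCV(RandomForestClassifier(random_state=42), small_grid, cv=5)
search.fit(X_train, y_train)

# One row per candidate, ordered from best to worst mean CV accuracy
results = pd.DataFrame(search.cv_results_)
columns = ["params", "mean_test_score", "std_test_score", "rank_test_score"]
ranking = results[columns].sort_values("rank_test_score")
print(ranking.to_string(index=False))

# Mean cross-validated accuracy of the top-ranked candidate
print(f"Best mean CV accuracy: {search.best_score_:.4f}")
```

If several candidates score within one standard deviation of each other, the difference between them is probably not meaningful, and a smaller or faster model may be the more practical choice.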


---

## Step 4: Evaluate the Optimized Model

Once the grid search has finished, `best_estimator_` holds the model refit on the full training set with the best hyperparameters. We evaluate it on the test set, which was not used during the search.

In [4]:
# Get the best model from grid search
best_model = grid_search.best_estimator_

# Evaluate the model on the test set
test_accuracy = best_model.score(X_test, y_test)
print(f"Test Accuracy with Optimized Hyperparameters: {test_accuracy:.4f}")

# Get predictions and evaluate other metrics (precision, recall, f1-score)
from sklearn.metrics import classification_report
y_pred_optimized = best_model.predict(X_test)
print("Classification Report for Optimized Model:")
print(classification_report(y_test, y_pred_optimized))

Test Accuracy with Optimized Hyperparameters: 0.9649
Classification Report for Optimized Model:
              precision    recall  f1-score   support

           0       0.98      0.93      0.95        43
           1       0.96      0.99      0.97        71

    accuracy                           0.96       114
   macro avg       0.97      0.96      0.96       114
weighted avg       0.97      0.96      0.96       114
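
To judge whether tuning helped, the result needs a baseline to compare against. The sketch below trains a `RandomForestClassifier` with default hyperparameters on the same split and prints the default values of the four tuned parameters. It assumes a recent scikit-learn release (0.22 or later, where the default `n_estimators` is 100). The names `baseline`, `tuned_names`, and `defaults` are introduced here for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Baseline: same model class and seed, but no hyperparameter tuning
baseline = RandomForestClassifier(random_state=42)
baseline.fit(X_train, y_train)
baseline_accuracy = baseline.score(X_test, y_test)
print(f"Baseline test accuracy: {baseline_accuracy:.4f}")

# Default values of the four hyperparameters tuned in this notebook
tuned_names = ["n_estimators", "max_depth", "min_samples_split", "min_samples_leaf"]
defaults = {name: baseline.get_params()[name] for name in tuned_names}
print(f"Default values: {defaults}")
```

If the printed defaults match the best hyperparameters reported in Step 3, the tuned model and the baseline are the same model, and the grid search has confirmed the defaults rather than improved on them.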



---

## Conclusion

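In this notebook, we performed **hyperparameter tuning** using **GridSearchCV** on a **RandomForestClassifier**. The search evaluated 108 combinations with 5-fold cross-validation, and the selected model reached a test accuracy of 0.9649. The winning combination (`n_estimators=100`, `max_depth=None`, `min_samples_split=2`, `min_samples_leaf=1`) matches scikit-learn's default values for these parameters, so on this dataset the search confirmed the defaults rather than improving on them.

Grid search makes tuning systematic: every combination in the grid is scored with cross-validation, which reduces the risk of choosing hyperparameters that only suit one particular split. It does not guarantee that the model avoids overfitting or underfitting, and it can only find the best combination among the values included in the grid.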