# Model Tuning with Grid Search

This notebook demonstrates hyperparameter tuning with `GridSearchCV` from scikit-learn. The goal is to search systematically for the hyperparameter combination that gives the best cross-validated performance. We use a synthetic dataset generated with `make_classification` (swap in your own data if you have it) and a Random Forest Classifier as the example model. The workflow covers data preprocessing, a baseline model, the grid search itself, evaluation on a held-out test set, and visualization of the search results.

## 1. Import Libraries

Let's start by importing the necessary libraries for data handling, model building, tuning, and visualization.

In [None]:
import pandas as pd
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
import matplotlib.pyplot as plt
import seaborn as sns
import joblib
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(42)

## 2. Load and Prepare Data

We will use a synthetic dataset generated by `make_classification` for demonstration purposes. If you have your own dataset, replace this section with your data loading logic (e.g., loading a CSV file).

In [None]:
# Generate a synthetic dataset for classification
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=42)

# Convert to DataFrame for better handling
X = pd.DataFrame(X, columns=[f'feature_{i}' for i in range(X.shape[1])])
y = pd.Series(y, name='target')

# Display the first few rows of the data
print("First 5 rows of features:")
print(X.head())
print("\nFirst 5 rows of target:")
print(y.head())

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print("\nTraining set shape:", X_train.shape)
print("Testing set shape:", X_test.shape)

## 3. Preprocess the Data

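We scale the features with `StandardScaler`. Tree-based models such as Random Forest are largely insensitive to feature scale, so this step is not expected to change the results here; it is included so that the same workflow carries over to scale-sensitive algorithms (e.g., SVM or logistic regression). The scaler is fit on the training set only and then applied to the test set, which avoids leaking test-set statistics into training.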

In [None]:
# Initialize the scaler
scaler = StandardScaler()

# Fit the scaler on the training data only, then apply the same transform to both sets
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Convert back to DataFrame, keeping the original column names and row index
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X_train.columns, index=X_train.index)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X_test.columns, index=X_test.index)

print("Data scaling completed.")
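
One caveat of scaling the whole training set up front: during cross-validation, each validation fold has already contributed to the scaler's statistics. A common way to avoid this is to wrap the scaler and the model in a `Pipeline`, so the scaler is refit inside every fold. The following is a minimal sketch on a small synthetic dataset; the step names `scaler` and `rf`, the variable names, and the reduced grid are illustrative assumptions, not part of this notebook's workflow.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small illustrative dataset (hypothetical sizes, chosen to run quickly)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# The scaler is now part of the estimator, so it is fit on the training folds only
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("rf", RandomForestClassifier(random_state=42)),
])

# Pipeline parameters are addressed as <step name>__<parameter name>
pipe_grid = {"rf__n_estimators": [25, 50], "rf__max_depth": [5, None]}

pipe_search = GridSearchCV(pipe, pipe_grid, cv=3, scoring="accuracy")
pipe_search.fit(X_tr, y_tr)

print("Best parameters:", pipe_search.best_params_)
print("Test accuracy:", pipe_search.score(X_te, y_te))
```

With this layout the raw (unscaled) features are passed to `fit`, `predict`, and `score`, and the pipeline applies the scaling itself.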

## 4. Baseline Model

Before tuning, let's train a baseline Random Forest model to understand its default performance.

In [None]:
# Initialize the baseline model
baseline_model = RandomForestClassifier(random_state=42)

# Train the model
baseline_model.fit(X_train_scaled, y_train)

# Make predictions
y_pred_baseline = baseline_model.predict(X_test_scaled)

# Evaluate the baseline model
print("Baseline Model Performance:")
print("Accuracy:", accuracy_score(y_test, y_pred_baseline))
print("\nClassification Report:")
print(classification_report(y_test, y_pred_baseline))

## 5. Hyperparameter Tuning with GridSearchCV

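Now we use `GridSearchCV` to search for the best hyperparameters for the Random Forest Classifier. We define a parameter grid, and `GridSearchCV` evaluates every combination in it with 5-fold cross-validation on the training set. With the default `refit=True`, it then retrains the best combination on the full training set.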

In [None]:
# Define the parameter grid for Random Forest
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [10, 20, 30, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'bootstrap': [True, False]
}

# Initialize the model
rf = RandomForestClassifier(random_state=42)

# Initialize GridSearchCV
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, 
                           cv=5, n_jobs=-1, verbose=2, scoring='accuracy')

# Fit GridSearchCV
grid_search.fit(X_train_scaled, y_train)

# Print the best parameters and score
print("Best Parameters:", grid_search.best_params_)
print("Best Cross-Validation Score:", grid_search.best_score_)
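
Because an exhaustive search grows multiplicatively with each added value, it can help to count the number of fits before launching it. The sketch below uses scikit-learn's `ParameterGrid` on a copy of the grid above; the names `grid`, `n_folds`, `n_candidates`, and `n_fits` are illustrative.

```python
from sklearn.model_selection import ParameterGrid

# Copy of the parameter grid used in this section
grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [10, 20, 30, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'bootstrap': [True, False],
}
n_folds = 5  # matches cv=5 in the GridSearchCV call

# ParameterGrid enumerates every combination, so its length is the candidate count
n_candidates = len(ParameterGrid(grid))
n_fits = n_candidates * n_folds
print(f"{n_candidates} candidates x {n_folds} folds = {n_fits} fits")
# -> 216 candidates x 5 folds = 1080 fits
```

The final refit of the best candidate on the full training set adds one more fit on top of this count.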

## 6. Evaluate the Best Model

Using the best parameters from GridSearchCV, we will evaluate the tuned model on the test set.

In [None]:
# Get the best model
best_model = grid_search.best_estimator_

# Make predictions on the test set
y_pred_tuned = best_model.predict(X_test_scaled)

# Evaluate the tuned model
print("Tuned Model Performance:")
print("Accuracy:", accuracy_score(y_test, y_pred_tuned))
print("\nClassification Report:")
print(classification_report(y_test, y_pred_tuned))

# Plot confusion matrix
plt.figure(figsize=(8, 6))
cm = confusion_matrix(y_test, y_pred_tuned)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix for Tuned Model')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

## 7. Visualize Hyperparameter Tuning Results

Let's visualize the results of the grid search to understand how different hyperparameters affected the performance.

In [None]:
# Convert grid search results to DataFrame
results = pd.DataFrame(grid_search.cv_results_)

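# Plot the best mean CV score per n_estimators for each max_depth value.
# Each (max_depth, n_estimators) pair occurs in many grid rows, so aggregate before plotting.
plt.figure(figsize=(10, 6))
for depth in param_grid['max_depth']:
    # `== None` never matches in pandas, so select the unlimited-depth rows with isna()
    depth_col = results['param_max_depth']
    mask = depth_col.isna() if depth is None else depth_col == depth
    temp = results[mask].groupby('param_n_estimators')['mean_test_score'].max()
    plt.plot(temp.index.astype(int), temp.values, marker='o', label=f'max_depth={depth}')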
plt.xlabel('Number of Estimators')
plt.ylabel('Mean CV Score')
plt.title('Grid Search Scores by Number of Estimators and Max Depth')
plt.legend()
plt.grid(True)
plt.show()
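
A table of the top-ranked candidates is a useful complement to the plot, since it shows how close the best configurations are to each other. The sketch below runs a deliberately tiny search on synthetic data so that it is self-contained; the names `demo_search`, `demo_results`, and `top` are illustrative, and the same `sort_values` call can be applied to the `results` DataFrame above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Tiny illustrative search (hypothetical sizes, chosen to run quickly)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
demo_search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    {"n_estimators": [10, 20], "max_depth": [3, None]},
    cv=3,
)
demo_search.fit(X_demo, y_demo)

# cv_results_ stores one row per candidate; rank 1 is the best mean score
demo_results = pd.DataFrame(demo_search.cv_results_)
cols = ["params", "mean_test_score", "std_test_score", "rank_test_score"]
top = demo_results.sort_values("rank_test_score")[cols].head(3)
print(top.to_string(index=False))
```

If the top candidates differ by less than their `std_test_score`, the simpler or cheaper configuration may be a reasonable choice.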

## 8. Save the Best Model

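We save the best model for future use or deployment. Because the model was trained on scaled features, the fitted `scaler` must also be applied to any new data before prediction, so persist it alongside the model if you plan to deploy.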

In [None]:
# Save the best model to a file
joblib.dump(best_model, 'best_random_forest_model.pkl')
print("Best model saved as 'best_random_forest_model.pkl'")
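
To check that a saved model can be restored, load it back with `joblib.load` and compare its predictions with those of the original. The sketch below is a self-contained round trip using a temporary directory; the file name `demo_model.pkl` and the small demo model are assumptions made for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small illustrative model (hypothetical sizes, chosen to run quickly)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
model = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_demo, y_demo)

# Save to a temporary location and load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.pkl")  # hypothetical file name
    joblib.dump(model, path)
    reloaded = joblib.load(path)

# The reloaded model should give identical predictions on the same inputs
same = np.array_equal(model.predict(X_demo), reloaded.predict(X_demo))
print("Reloaded model reproduces predictions:", same)
```

Files written by `joblib.dump` are pickle-based, so only load them from sources you trust, and ideally with the same scikit-learn version that created them.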

## 9. Summary and Next Steps

In this notebook, we performed hyperparameter tuning for a Random Forest Classifier using GridSearchCV. We compared the baseline and tuned models on the held-out test set and visualized the search results. The key outputs to review are:
- The best hyperparameters identified through grid search (`grid_search.best_params_`).
- The change (if any) in test-set performance after tuning, relative to the baseline.

**Next Steps:**
- Experiment with other algorithms (e.g., XGBoost, SVM) or ensemble methods.
- Use RandomizedSearchCV for faster experimentation with larger parameter grids.
- Integrate the tuned model into a production pipeline or further evaluate it with additional metrics.
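
As a starting point for the `RandomizedSearchCV` suggestion, the following sketch samples a fixed number of candidates from distributions instead of enumerating a full grid. The ranges, `n_iter=5`, and the small synthetic dataset are illustrative assumptions chosen so the example runs quickly; they are not tuned recommendations.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Small illustrative dataset (hypothetical sizes)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)

# Lists are sampled uniformly; scipy distributions are sampled via their rvs method
param_distributions = {
    "n_estimators": randint(20, 60),      # integers in [20, 60)
    "max_depth": [5, 10, None],
    "min_samples_split": randint(2, 11),  # integers in [2, 11)
    "min_samples_leaf": randint(1, 5),    # integers in [1, 5)
}

random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=5,          # number of sampled candidates, independent of the search-space size
    cv=3,
    scoring="accuracy",
    random_state=42,   # makes the sampled candidates reproducible
)
random_search.fit(X_demo, y_demo)

print("Best parameters:", random_search.best_params_)
print("Best CV score:", random_search.best_score_)
```

The cost is controlled by `n_iter` rather than by the size of the search space, which is what makes this approach practical for larger parameter ranges.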