# Hyperparameter Tuning with Optuna

In this assignment, you'll learn how to use **Optuna** for automated hyperparameter optimization of boosting models.

## What is Optuna?
Optuna is an automatic hyperparameter optimization framework that:
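- Samples hyperparameters efficiently; the default sampler is the Tree-structured Parzen Estimator (TPE)
- Supports pruning (early stopping) of unpromising trials
- Provides built-in plots of the optimization history and parameter importances
- Is framework-agnostic: the objective is an ordinary Python function, so it works with any ML library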

## Learning Objectives
- Understand how to define search spaces for hyperparameters
- Use Optuna to optimize model performance
- Compare tuned vs untuned models
- Visualize optimization progress

## Setup and Installation

In [None]:
!pip install optuna scikit-learn xgboost lightgbm matplotlib seaborn pandas numpy plotly

## Import Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import optuna
from optuna.visualization import plot_optimization_history, plot_param_importances, plot_slice

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, roc_auc_score
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

optuna.logging.set_verbosity(optuna.logging.WARNING)
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)

## 1. Load and Prepare Dataset

In [None]:
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set size: {X_train.shape[0]}")
print(f"Test set size: {X_test.shape[0]}")

## 2. Baseline Model (No Tuning)

First, let's train a baseline XGBoost model with default parameters to compare against.

In [None]:
baseline_model = XGBClassifier(random_state=42, eval_metric='logloss')
baseline_model.fit(X_train, y_train)
baseline_pred = baseline_model.predict(X_test)
baseline_accuracy = accuracy_score(y_test, baseline_pred)

print(f"Baseline XGBoost Accuracy: {baseline_accuracy:.4f}")

## 3. Define Optuna Objective Function for XGBoost

### Task: Create an objective function for Optuna

**Understanding Optuna:**
- Optuna calls the objective function many times (trials)
- Each trial tests different hyperparameter combinations
- The function should return a metric to optimize (maximize or minimize)

**Key Hyperparameters for XGBoost:**
- `n_estimators`: Number of boosting rounds (50-300)
- `max_depth`: Maximum tree depth (3-10)
- `learning_rate`: Step size shrinkage (0.01-0.3)
- `subsample`: Row sampling ratio (0.6-1.0)
- `colsample_bytree`: Column sampling ratio (0.6-1.0)
- `min_child_weight`: Minimum sum of instance weight (Hessian) required in a child node (1-10)
- `gamma`: Minimum loss reduction required to make a further split (0-5)

**Hints:**
1. Use `trial.suggest_int()` for integer parameters
2. Use `trial.suggest_float()` for continuous parameters
3. Use `cross_val_score()` with 3-5 folds for robust evaluation
4. Return the mean cross-validation score

**TODO:** Complete the objective function below.

In [None]:
def objective_xgboost(trial):
    """
    Objective function for Optuna to optimize XGBoost hyperparameters.
    
    Args:
        trial: Optuna trial object
    
    Returns:
        float: Mean cross-validation accuracy score
    """
    # TODO: Define hyperparameter search space
    # Example:
    # n_estimators = trial.suggest_int('n_estimators', 50, 300)
    # max_depth = trial.suggest_int('max_depth', 3, 10)
    # learning_rate = trial.suggest_float('learning_rate', 0.01, 0.3)
    # gamma = trial.suggest_float('gamma', 0.0, 5.0)
    
    # TODO: Create XGBoost model with suggested parameters
    # model = XGBClassifier(
    #     n_estimators=n_estimators,
    #     max_depth=max_depth,
    #     learning_rate=learning_rate,
    #     gamma=gamma,
    #     random_state=42,
    #     eval_metric='logloss'
    # )
    
    # TODO: Evaluate using cross-validation
    # scores = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy')
    # return scores.mean()
    
    pass

## 4. Run Optuna Optimization for XGBoost

### Task: Create and run an Optuna study

**Hints:**
1. Create a study with `optuna.create_study(direction='maximize')` (we want to maximize accuracy)
2. Use `study.optimize()` to run trials
3. Set `n_trials` to 50-100 (more trials explore the search space more thoroughly but take longer)
4. Access best parameters with `study.best_params`
5. Access best score with `study.best_value`

**TODO:** Complete the optimization process.

In [None]:
# TODO: Create Optuna study
# study_xgb = optuna.create_study(direction='maximize')

# TODO: Run optimization
# study_xgb.optimize(objective_xgboost, n_trials=50)

# TODO: Print best results
# print("\n" + "="*50)
# print("XGBoost Optimization Results")
# print("="*50)
# print(f"Best CV Accuracy: {study_xgb.best_value:.4f}")
# print("\nBest Parameters:")
# for param, value in study_xgb.best_params.items():
#     print(f"  {param}: {value}")

## 5. Train Final Model with Optimized Parameters

**TODO:** Train a final XGBoost model using the best parameters found by Optuna.

In [None]:
# TODO: Create model with best parameters
# optimized_xgb = XGBClassifier(**study_xgb.best_params, random_state=42, eval_metric='logloss')

# TODO: Train on full training set
# optimized_xgb.fit(X_train, y_train)

# TODO: Evaluate on test set
# optimized_pred = optimized_xgb.predict(X_test)
# optimized_accuracy = accuracy_score(y_test, optimized_pred)

# print(f"\nBaseline Accuracy: {baseline_accuracy:.4f}")
# print(f"Optimized Accuracy: {optimized_accuracy:.4f}")
# print(f"Improvement: {(optimized_accuracy - baseline_accuracy):.4f} ({(optimized_accuracy - baseline_accuracy)*100:.2f} percentage points)")

## 6. Visualize Optimization History

**TODO:** Visualize how the optimization progressed over trials.

**Hints:**
- Use `plot_optimization_history(study)` to see how scores improved
- Use `plot_param_importances(study)` to see which parameters matter most
- Use `plot_slice(study)` to see parameter-score relationships

In [None]:
# TODO: Plot optimization history
# fig = plot_optimization_history(study_xgb)
# fig.show()

# TODO: Plot parameter importances
# fig = plot_param_importances(study_xgb)
# fig.show()

# TODO: Plot parameter relationships
# fig = plot_slice(study_xgb)
# fig.show()

## 7. Optimize LightGBM (Bonus Challenge)

### Task: Create an objective function for LightGBM

**Key Hyperparameters for LightGBM:**
- `n_estimators`: Number of boosting rounds (50-300)
- `max_depth`: Maximum tree depth (3-10, -1 for no limit)
- `learning_rate`: Step size (0.01-0.3)
- `num_leaves`: Maximum tree leaves (20-150)
- `subsample`: Row sampling ratio (0.6-1.0); only takes effect when `subsample_freq` is greater than 0
- `colsample_bytree`: Column sampling ratio (0.6-1.0)
- `min_child_samples`: Minimum data in leaf (5-100)
- `reg_alpha`: L1 regularization (0-5)
- `reg_lambda`: L2 regularization (0-5)

**TODO:** Complete the LightGBM objective function.

In [None]:
def objective_lightgbm(trial):
    """
    Objective function for Optuna to optimize LightGBM hyperparameters.
    """
    # TODO: Define search space for LightGBM
    # TODO: Create LightGBM model
    # TODO: Evaluate with cross-validation
    # TODO: Return mean score
    
    pass

**TODO:** Run optimization for LightGBM.

In [None]:
# TODO: Create study for LightGBM
# TODO: Run optimization
# TODO: Print results
# TODO: Train final model
# TODO: Compare with baseline and XGBoost