# XGBoost Hyperparameter-Tuning with Ray Tune

This notebook demonstrates how an XGBoost classifier (using the scikit-learn API) can be tuned with **Ray Tune**.

We will use the Optuna search algorithm and the ASHA scheduler, which aggressively stops poorly performing trials early.

* [Data Loading and Preprocessing](#loading-preprocessing)
* [Model Training and Hyperparameter Optimization](#training-optim)
    - [Step 1: Define the parameter space](#parameter-space)
    - [Step 2: Define the objective function](#objective)
    - [Step 3: Define Search Algorithm and Scheduler](#search-scheduler)
    - [Step 4: Define the Tuner object and run the optimization](#tune)
    - [Step 5: Evaluate the results](#evaluate)

In [None]:
import os
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier
import ray
from ray import tune
from ray.tune.schedulers import ASHAScheduler
from ray.tune.integration.xgboost import TuneReportCallback
from ray.tune.search.optuna import OptunaSearch
import matplotlib.pyplot as plt

<a id="loading-preprocessing"></a>
# Data Loading and Preprocessing

In [None]:
train = pd.read_csv("/kaggle/input/playground-series-s3e5/train.csv", index_col=0)
test = pd.read_csv("/kaggle/input/playground-series-s3e5/test.csv", index_col=0)
sample_submission = pd.read_csv(
    "/kaggle/input/playground-series-s3e5/sample_submission.csv", index_col=0
)

In [None]:
# Original dataset
original = pd.read_csv("/kaggle/input/wine-quality-dataset/WineQT.csv", index_col="Id")
original_red = pd.read_csv(
    "/kaggle/input/red-wine-quality-cortez-et-al-2009/winequality-red.csv"
)
train = pd.concat([train, original, original_red], axis=0).reset_index(drop=True)
train = train[~train.duplicated()]
train.info()

In [None]:
target = "quality"

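X = train.drop(columns=target)
# XGBoost expects class labels starting at 0; the shift by 3 assumes
# the lowest quality score in the data is 3 (it is undone before submission)
y = train[target] - 3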

X_test = test.copy()

# apply standard scaling
scaler = StandardScaler().fit(X)
X = scaler.transform(X)
X_test = scaler.transform(X_test)

y = y.astype("long")
X = X.astype("float32")
X_test = X_test.astype("float32")
print("Train samples: ", X.shape[0])
print("Test samples: ", X_test.shape[0])

<a id="training-optim"></a>
# Model Training and Hyperparameter Optimization

We will now train an `XGBClassifier` and tune its hyperparameters with Ray Tune. The process has five steps:

1. Define the parameter space
2. Define the objective function
3. Define the search algorithm and scheduler
4. Define the Tuner object and run the optimization
5. Evaluate the results

<a id="parameter-space"></a>
## Step 1: Define the parameter space

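We define the parameter space using the sampling functions provided by `tune`. For example:
- `tune.randint(low, high)` samples an integer uniformly from the interval (discrete uniform distribution); `high` is exclusive
- `tune.uniform(low, high)` samples a float uniformly from the interval (continuous uniform distribution)
- `tune.loguniform(low, high)` samples a float uniformly on a log scale, which suits parameters such as the learning rate `eta` that span several orders of magnitude

Entries given as plain values, such as `"objective"` below, are passed unchanged to every trial.

For more information, see https://docs.ray.io/en/latest/tune/api/search_space.html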
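
To make these distributions concrete, here is a minimal plain-Python sketch that mimics them with the standard library. It is only an illustration of what the distributions mean, not Ray Tune's implementation; the helper names `sample_randint`, `sample_uniform` and `sample_loguniform` are hypothetical.

```python
import math
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible


def sample_randint(low, high):
    """Integer drawn uniformly from [low, high); the upper bound is excluded."""
    return rng.randrange(low, high)


def sample_uniform(low, high):
    """Float drawn uniformly between low and high."""
    return rng.uniform(low, high)


def sample_loguniform(low, high):
    """Float whose logarithm is uniform, so every order of magnitude is equally likely."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


# Draw 1000 values for three of the ranges used in this notebook
draws = {
    "max_depth": [sample_randint(2, 9) for _ in range(1000)],
    "subsample": [sample_uniform(0.5, 1.0) for _ in range(1000)],
    "eta": [sample_loguniform(1e-4, 1e-1) for _ in range(1000)],
}

for name, values in draws.items():
    print(name, min(values), max(values))
```

On a log scale, roughly a third of the `eta` draws fall into each of the decades 1e-4 to 1e-3, 1e-3 to 1e-2 and 1e-2 to 1e-1, whereas a plain uniform draw would put about 90% of them into the top decade.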

In [None]:
param_space = {
    "objective": "multi:softprob",
    "tree_method": "hist",
    "early_stopping_rounds": 20,
    "eval_metric": "mlogloss",  # mlogloss is the multi-class negative log-likelihood
    "n_estimators": tune.randint(200, 600),
    "gamma": tune.randint(1, 5),
    "max_depth": tune.randint(2, 9),
    "min_child_weight": tune.randint(1, 5),
    "subsample": tune.uniform(0.5, 1.0),
    "eta": tune.loguniform(1e-4, 1e-1),
    "colsample_bytree": tune.uniform(0.5, 1),
}

<a id="objective"></a>
## Step 2: Define the objective function

This function takes a `config` (one sampled point of the parameter space) and trains a single classifier with it.
During training it reports the validation metric to the Tuner via the `TuneReportCallback`.
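
The metric we report is `mlogloss`, the multi-class log loss: the mean negative log-probability that the model assigns to the true class. As a rough sketch of what this number measures (a hand-rolled version with a hypothetical helper `multiclass_log_loss` and made-up probabilities, not XGBoost's internal code):

```python
import math


def multiclass_log_loss(y_true, probabilities, eps=1e-15):
    """Mean negative log-probability of the true class (lower is better)."""
    total = 0.0
    for label, probs in zip(y_true, probabilities):
        # Clip at eps so a predicted probability of 0 does not produce log(0)
        total -= math.log(max(probs[label], eps))
    return total / len(y_true)


# Three samples, three classes; the probabilities are made up for illustration
y_true = [0, 2, 1]
probabilities = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.2, 0.5, 0.3],
]
loss = multiclass_log_loss(y_true, probabilities)

# A model that always predicts equal probabilities scores log(n_classes)
uniform_loss = multiclass_log_loss(y_true, [[1 / 3] * 3] * 3)
print(loss, uniform_loss)
```

A loss below `log(n_classes)` therefore means the classifier does better than guessing every class with equal probability.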

In [None]:
def objective(config):
    """Objective to be optimized.

    Uses a simple 0.8/0.2 train/validation split and reports the validation
    log loss to Ray Tune via the `TuneReportCallback`.

    Parameters
    ----------
    config: dict
        The config object.
    """
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, stratify=y, shuffle=True, test_size=0.2
    )
    trc = TuneReportCallback({"loss": "validation_0-mlogloss"})
    clf = XGBClassifier(**config, callbacks=[trc]).fit(
        X_train, y_train, eval_set=[(X_val, y_val)], verbose=False
    )

<a id="search-scheduler"></a>
## Step 3: Define Search Algorithm and Scheduler

We will use Optuna's Search Algorithm, combined with the ASHA Scheduler.

Note that with Ray Tune, it is easy to swap out either of them. Ray Tune supports many more search algorithms (see https://docs.ray.io/en/latest/tune/api/suggestion.html).
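
To give an intuition for what `grace_period` and `reduction_factor` do, here is a simplified sketch of successive halving, the idea behind ASHA. It is a synchronous toy version under stated assumptions, not Ray Tune's asynchronous implementation: the helpers `rung_milestones` and `promoted`, the maximum of 100 iterations, and the loss values are all assumptions made for illustration.

```python
def rung_milestones(grace_period, max_t, reduction_factor):
    """Iterations at which trials are compared: grace_period * reduction_factor**k."""
    milestones = []
    t = grace_period
    while t <= max_t:
        milestones.append(t)
        t *= reduction_factor
    return milestones


def promoted(losses, reduction_factor):
    """Losses of the trials that continue past a rung (lower is better).

    Only the best 1/reduction_factor of the trials survive, but at least one.
    """
    keep = max(1, len(losses) // reduction_factor)
    return sorted(losses)[:keep]


# No trial is stopped before iteration 10; comparisons then happen at each rung
milestones = rung_milestones(grace_period=10, max_t=100, reduction_factor=3)

# Made-up validation losses of nine trials at the first rung
rung_losses = [1.10, 0.95, 1.30, 1.02, 0.99, 1.21, 1.15, 0.97, 1.08]
survivors = promoted(rung_losses, reduction_factor=3)

print(milestones)  # [10, 30, 90]
print(survivors)   # [0.95, 0.97, 0.99]
```

With a reduction factor of 3, roughly two thirds of the trials are dropped at every rung, so most of the compute goes to the promising configurations. ASHA makes the same kind of decision asynchronously, comparing each trial with the results recorded at that rung so far instead of waiting for all trials to arrive.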

In [None]:
scheduler = ASHAScheduler(grace_period=10, reduction_factor=3)

search_alg = OptunaSearch()

<a id="tune"></a>
## Step 4: Define the Tuner object and run the optimization

We pass the objective and the parameter space to `tune.Tuner` and specify additional tuning parameters via `tune.TuneConfig`.

In the `TuneConfig` we set the metric to optimize, the number of samples (trials), the scheduler and the search algorithm.

In this example, we use `num_samples=500`, i.e. 500 trials will be executed. This takes approx. 300s (5 min).

In [None]:
tuner = tune.Tuner(
    objective,
    param_space=param_space,
    tune_config=tune.TuneConfig(
        num_samples=500,
        metric="loss",
        mode="min",
        scheduler=scheduler,
        search_alg=search_alg,
    ),
)
results = tuner.fit()

<a id="evaluate"></a>
## Step 5: Evaluate the results of the optimization

In [None]:
fig, ax = plt.subplots(figsize=(15, 10))
for result in results:
    result.metrics_dataframe.plot("training_iteration", "loss", ax=ax, legend=None)

Notice how many poorly performing trials are stopped early by the ASHA scheduler. This makes it possible to search efficiently over a large number of samples.

In [None]:
best_params = results.get_best_result("loss", mode="min").config
best_params

### Finally: Train the classifier on the full data set with the best parameter configuration

In [None]:
clf = XGBClassifier(**best_params).fit(X, y, eval_set=[(X, y)], verbose=50)

## Submit Predictions

In [None]:
sub = sample_submission.copy()
sub[target] = clf.predict(X_test)
# undo the label shift (y - 3) applied before training
sub += 3
sub.to_csv("submission.csv")
sub.head()