# Introduction

In this notebook we let Optuna optimize a few hyperparameters of an XGBoost model that will be used to make a submission to the [Housing Prices Competition for Kaggle Learn Users](https://www.kaggle.com/competitions/home-data-for-ml-course/data).

# Data Preparation

The following code cell checks that the required inputs are available.

In [None]:
# Load helpful packages
import numpy as np
import pandas as pd
import ames_housing_utils as utils

# List data files
utils.list_input()

We can now load the training set and convert it into the `DMatrix` expected by the [XGBoost cross-validation function](https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.cv). We enable support for categorical data.

In [None]:
import xgboost as xgb

# Load and display training set
X_train, y_train = utils.load_train()
display(pd.concat([X_train, y_train], axis=1))

# Convert training set into DMatrix
dtrain = xgb.DMatrix(X_train, label=y_train, enable_categorical=True)

# Optuna Study

First we define the objectives of the optimization:

0. **Mean Absolute Error (MAE)** for accuracy
1. **Standard Error (SE)** of the MAE across the cross-validation folds, for robustness

We'll let Optuna search for the best tree constraints (`max_depth`, `min_child_weight`) and stochasticity (`subsample`, `colsample_bytree`) given a fixed `learning_rate`. We'll rely on early stopping to find the best number of trees.
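
As a rough illustration of the two objectives, the sketch below computes the mean MAE and its standard error from a handful of per-fold scores. The fold values are hypothetical, and the use of the population standard deviation is an assumption about how the fold spread is reported.

```python
import math
import statistics

# Hypothetical per-fold MAE values from a 5-fold cross-validation run.
fold_mae = [15200.0, 14800.0, 16100.0, 15500.0, 14900.0]

# First objective: the mean MAE over the folds.
mean_mae = statistics.fmean(fold_mae)

# Second objective: the standard error of that mean, i.e. the spread
# across folds divided by the square root of the number of folds.
# Assumption: the spread is the population standard deviation.
fold_std = statistics.pstdev(fold_mae)
standard_error = fold_std / math.sqrt(len(fold_mae))

print(f"MAE: {mean_mae:.1f}, SE: {standard_error:.1f}")
```

A low SE means the folds agree with each other, so the mean MAE is less likely to be the product of a lucky split.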

In [None]:
FIXED_PARAMS = {
    "objective": "reg:absoluteerror",
    "tree_method": "hist",  # required for categorical support
    "learning_rate": 0.05
}

def objective(trial):
    params = {
        **FIXED_PARAMS,
        "max_depth": trial.suggest_int("max_depth", 3, 6),
        "min_child_weight": trial.suggest_int("min_child_weight", 1, 20),
        "subsample": trial.suggest_float("subsample", 0.6, 1.0, step=0.01),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.5, 1.0, step=0.01),
    }

    cv_results = xgb.cv(
        params=params,
        dtrain=dtrain,
        num_boost_round=5000,
        nfold=5,
        early_stopping_rounds=50,
        metrics="mae",
        seed=42
    )
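    # With early stopping, cv_results is truncated at the best round,
    # so its row count is the number of boosting rounds to reuse later.
    trial.set_user_attr("num_boost_round", cv_results.shape[0])

    # Objectives: mean CV MAE and its standard error over the 5 folds
    last_round = cv_results.iloc[-1]
    return last_round['test-mae-mean'], last_round['test-mae-std'] / (5**0.5)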
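
To get a feel for the size of the search space defined above, the following sketch counts the distinct configurations implied by the suggested ranges and steps. The helper `grid_points` is a hypothetical name introduced only for this illustration.

```python
# Ranges copied from the objective function above: (low, high, step).
search_space = {
    "max_depth": (3, 6, 1),
    "min_child_weight": (1, 20, 1),
    "subsample": (0.6, 1.0, 0.01),
    "colsample_bytree": (0.5, 1.0, 0.01),
}

def grid_points(low, high, step):
    """Number of values in an inclusive range with a fixed step."""
    return round((high - low) / step) + 1

# Count the values per parameter, then multiply them together.
sizes = {name: grid_points(*bounds) for name, bounds in search_space.items()}
total = 1
for size in sizes.values():
    total *= size

print(sizes)
print(f"Distinct configurations: {total}")
```

A study of a few dozen trials visits only a small fraction of these configurations, which is why a guided search is preferable to an exhaustive grid here.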

Now we can perform the actual study. For a warm start, we enqueue the result of previous tuning efforts.

In [None]:
import optuna

study = optuna.create_study(directions=["minimize", "minimize"])

# Prepare warm start
study.enqueue_trial({
    'max_depth': 5,
    'min_child_weight': 18,
    'subsample': 0.64,
    'colsample_bytree': 0.57
})

# Run the optimization
%time study.optimize(objective, n_trials=50, show_progress_bar=True)

Now we can display a scatter plot with the Pareto front of best trials.

In [None]:
import plotly.io as pio
pio.renderers.default = "iframe"

optuna.visualization.plot_pareto_front(study).show()
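
For readers unfamiliar with the term, a trial is on the Pareto front if no other trial is at least as good in both objectives and strictly better in one. The sketch below applies that definition to a few hypothetical (MAE, SE) pairs; it is a plain-Python illustration, not how Optuna computes the front internally.

```python
# Hypothetical (MAE, SE) pairs; both objectives are minimized.
trials = [
    (15300.0, 210.0),
    (15100.0, 400.0),
    (15400.0, 250.0),
    (15250.0, 180.0),
    (15600.0, 150.0),
]

def dominates(a, b):
    """True if a is no worse than b everywhere and strictly better somewhere."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better

# Keep only the trials that no other trial dominates.
pareto_front = [
    t for t in trials
    if not any(dominates(other, t) for other in trials)
]

print(pareto_front)
```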

Finally, let's select the Pareto-optimal trial with the lowest $MAE + 2 \times SE$ as the best trial and print its relevant properties.

In [None]:
best_trial = min(study.best_trials, key=lambda t: t.values[0] + (2 * t.values[1]))
print(f'Best trial: {best_trial.number}')
print(f'Best score: {best_trial.values[0]:.2f} +/- {best_trial.values[1]:.2f}')

best_params = best_trial.params
print(f'Best params: {best_params}')

best_iteration = best_trial.user_attrs['num_boost_round']
print(f'Best iteration: {best_iteration}')
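
The selection rule above trades accuracy against robustness: a trial with a slightly higher MAE can win if its SE is small enough. A minimal sketch with hypothetical (MAE, SE) pairs, where `penalized_score` is a name introduced only for this example:

```python
# Hypothetical Pareto-optimal (MAE, SE) pairs.
candidates = [(15100.0, 400.0), (15250.0, 180.0), (15600.0, 150.0)]

def penalized_score(values, k=2.0):
    """MAE plus k standard errors; lower is better."""
    mae, se = values
    return mae + k * se

# Pick the candidate with the lowest penalized score.
best = min(candidates, key=penalized_score)

for values in candidates:
    print(values, penalized_score(values))
print(f"Selected: {best}")
```

Here the candidate with the lowest MAE is not selected, because its larger SE pushes its penalized score above that of the middle candidate.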

# Refit and Predict

Now we can refit the XGBoost model to the full training set using the best parameters and the number of boosting rounds found in the study.

In [None]:
model = xgb.train(
    params={**FIXED_PARAMS, **best_params},
    dtrain=dtrain,
    num_boost_round=best_iteration
)

We conclude this notebook by preparing a submission to the competition.

In [None]:
# Define test matrix
X_test, test_ids = utils.load_test()
dtest = xgb.DMatrix(X_test, enable_categorical=True)

# Make and save predictions
preds = model.predict(dtest)
utils.save_preds(test_ids, preds)