# Hyperparameter tuning in XGBoost

### Hyperparameters and techniques to tune your model.

Other topics that you will come across in this tutorial include:
- Tuning XGBoost hyperparameters
- Using a watchlist and early_stopping_rounds with XGBoost's native API
- DMatrices (XGBoost data format)
- Bias-variance trade-off
- Timing in a Jupyter notebook
- Cross-validation
- Using a baseline model
- Mean Absolute Error
- Grid Search
- Saving and loading an XGBoost model

### The native XGBoost API

Although the scikit-learn API of XGBoost (shown in the previous tutorial) is easy to use and fits well in a scikit-learn pipeline, it is sometimes better to use the native API. Advantages include:
- Automatically find the best number of boosting rounds
- Built-in cross validation
- Custom objective functions

### DMatrices
Instead of numpy arrays or pandas DataFrames, XGBoost uses DMatrices. A DMatrix can contain both the features and the target. If you have already loaded your data into numpy arrays X and y, you can create a DMatrix with:

In [None]:
xgb.DMatrix(X, label=y)

### Data/problem
We will solve a regression problem here, but what you will learn is also applicable to classification. Download the dataset and unzip it.
This dataset is composed of 53 features describing a post on Facebook: the number of likes on the page it was posted on, the category of the page, the time and day it was posted, etc.

In [None]:
!pip install xgboost scikit-learn pandas numpy

In [None]:
import pandas as pd
file = "datasets/facebook_comments/Dataset/Training/Features_Variant_1.csv"
df = pd.read_csv(file, header=None)
df.sample(n=5)

In [None]:
print("Dataset has {} entries and {} features".format(*df.shape))

#### Data Split

In [None]:
# First, let's extract the features and the target from our dataset.
X, y = df.loc[:,:52].values, df.loc[:,53].values

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=.1, random_state=42)

### Loading data into DMatrices
As mentioned before, in order to use the native API for XGBoost, we will first need to build DMatrices.

In [None]:
import xgboost as xgb
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

## Building a baseline model

In [None]:
from sklearn.metrics import mean_absolute_error
import numpy as np
# "Learn" the mean from the training data
mean_train = np.mean(y_train)
# Get predictions on the test set
baseline_predictions = np.ones(y_test.shape) * mean_train
# Compute MAE
mae_baseline = mean_absolute_error(y_test, baseline_predictions)
print("Baseline MAE is {:.2f}".format(mae_baseline))
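To make the baseline concrete, here is a small worked example in plain Python on made-up numbers (the values are illustrative only). The "model" predicts the training mean for every test row, and the MAE is the average absolute gap between those predictions and the true values.

```python
# Made-up targets, for illustration only.
y_train_demo = [0.0, 2.0, 4.0, 10.0]
y_test_demo = [1.0, 3.0, 8.0]

# The baseline "learns" a single number: the training mean.
mean_demo = sum(y_train_demo) / len(y_train_demo)

# MAE = mean of |y_true - y_pred| over the test rows.
mae_demo = sum(abs(y - mean_demo) for y in y_test_demo) / len(y_test_demo)

print("Training mean: {:.2f}".format(mean_demo))
print("Baseline MAE: {:.2f}".format(mae_demo))
```

Any model worth keeping should beat this number on the same test rows.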

## Training and Tuning an XGBoost model

In [None]:
params = {
    # Parameters that we are going to tune.
    'max_depth':6,
    'min_child_weight': 1,
    'eta':.3,
    'subsample': 1,
    'colsample_bytree': 1,
    # Other parameters
    'objective':'reg:squarederror',  # 'reg:linear' is the deprecated name for this objective
}

In [None]:
params['eval_metric'] = "mae"
num_boost_round = 999

In order to automatically find the best number of boosting rounds, we need to pass extra parameters on top of the params dictionary, the training DMatrix and num_boost_round:
>evals: a list of pairs (test_dmatrix, name_of_test). Here we will use our dtest DMatrix.

>early_stopping_rounds: The number of rounds without improvements after which we should stop, here we set it to 10.
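The stopping rule described above can be sketched in plain Python. This is an illustration of the rule for a "lower is better" metric, not XGBoost's actual implementation; the `early_stop` helper and the per-round scores are hypothetical.

```python
def early_stop(scores, patience):
    """Return (best_round_index, rounds_run) for a 'lower is better' metric."""
    best_idx = 0
    for i, score in enumerate(scores):
        if score < scores[best_idx]:
            best_idx = i
        elif i - best_idx >= patience:
            # No improvement for `patience` consecutive rounds: stop here.
            return best_idx, i + 1
    return best_idx, len(scores)

# Hypothetical per-round validation MAE values.
val_mae = [5.0, 4.2, 4.0, 4.1, 4.05, 4.3, 3.9]
best_idx, rounds_run = early_stop(val_mae, patience=3)
print("Best round index:", best_idx, "rounds run:", rounds_run)
```

Note that the last score in the list is never evaluated: training stops once the patience runs out, so a later improvement can be missed when the patience is small.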

In [None]:
model = xgb.train(
    params,
    dtrain,
    num_boost_round=num_boost_round,
    evals=[(dtest, "Test")],
    early_stopping_rounds=10
)

print("Best MAE: {:.2f} with {} rounds".format(
                 model.best_score,
                 model.best_iteration+1))

## Using XGBoost’s CV
In order to tune the other hyperparameters, we will use the cv function from XGBoost. It allows us to run cross-validation on our training dataset and returns, for each boosting round, the mean and standard deviation of the MAE across folds.

We need to pass it:

- params: our dictionary of parameters.
- our dtrain matrix.
- num_boost_round: number of boosting rounds. Here we will use a large number again and count on early_stopping_rounds to find the optimal number of rounds before reaching the maximum.
- seed: random seed. It's important to set a seed here, to ensure we are using the same folds for each step so we can properly compare the scores with different parameters.
- nfold: the number of folds to use for cross-validation
- metrics: the metrics to use to evaluate our model, here we use MAE.

As you can see, we don’t need to pass a test dataset here. That is because the cross-validation function splits the training dataset into nfold folds and, in turn, holds out each fold for testing.
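The fold mechanics can be sketched with the standard library alone. The `kfold_indices` helper below is hypothetical, and XGBoost's own fold assignment may differ in detail; the point is only that each row is held out exactly once and that a fixed seed reproduces the same folds.

```python
import random

def kfold_indices(n_rows, nfold, seed):
    """Shuffle row indices and yield (train_idx, test_idx) for each fold."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    # Deal the shuffled indices into nfold groups.
    folds = [idx[i::nfold] for i in range(nfold)]
    for k in range(nfold):
        test_idx = folds[k]
        train_idx = [j for f, fold in enumerate(folds) if f != k for j in fold]
        yield train_idx, test_idx

splits = list(kfold_indices(n_rows=10, nfold=5, seed=42))
for train_idx, test_idx in splits:
    print("held out:", sorted(test_idx), "training rows:", len(train_idx))
```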

In [None]:
# Let’s see what cross-validation score we get with our current parameters:
cv_results = xgb.cv(
    params,
    dtrain,
    num_boost_round=num_boost_round,
    seed=42,
    nfold=5,
    metrics={'mae'},
    early_stopping_rounds=10
)
cv_results

We can get the best MAE score from cv with:

In [None]:
cv_results['test-mae-mean'].min()

### Parameters max_depth and min_child_weight
Those parameters add constraints on the architecture of the trees.

>max_depth is the maximum number of nodes allowed from the root to the farthest leaf of a tree. Deeper trees can model more complex relationships by adding more nodes, but as we go deeper, splits become less relevant and are sometimes only due to noise, causing the model to overfit.

>min_child_weight is the minimum weight (or number of samples if all samples have a weight of 1) required in order to create a new node in the tree. A smaller min_child_weight allows the algorithm to create children that correspond to fewer samples, thus allowing for more complex trees, but again, more likely to overfit.

Thus, those parameters can be used to control the complexity of the trees. It is important to tune them together in order to find a good trade-off between model bias and variance.

Let’s make a list containing all the combinations max_depth/min_child_weight that we want to try.

In [None]:
# You can try wider intervals with a larger step between
# each value and then narrow it down. Here, after several
# iterations, I found that the optimal value was in the
# following ranges.
gridsearch_params = [
    (max_depth, min_child_weight)
    for max_depth in range(9,12)
    for min_child_weight in range(5,8)
]

Let’s run cross validation on each of those pairs. It can take some time…

In [None]:
# Define initial best params and MAE
min_mae = float("Inf")
best_params = None
for max_depth, min_child_weight in gridsearch_params:
    print("CV with max_depth={}, min_child_weight={}".format(
                             max_depth,
                             min_child_weight))
    # Update our parameters
    params['max_depth'] = max_depth
    params['min_child_weight'] = min_child_weight
    # Run CV
    cv_results = xgb.cv(
        params,
        dtrain,
        num_boost_round=num_boost_round,
        seed=42,
        nfold=5,
        metrics={'mae'},
        early_stopping_rounds=10
    )
    # Update best MAE
    mean_mae = cv_results['test-mae-mean'].min()
    # argmin is 0-based, so add 1 to get the number of rounds
    boost_rounds = cv_results['test-mae-mean'].argmin() + 1
    print("\tMAE {} for {} rounds".format(mean_mae, boost_rounds))
    if mean_mae < min_mae:
        min_mae = mean_mae
        best_params = (max_depth,min_child_weight)
print("Best params: {}, {}, MAE: {}".format(best_params[0], best_params[1], min_mae))

We get the best score with a max_depth of 10 and min_child_weight of 6, so let's update our params.

In [None]:
params['max_depth'] = 10
params['min_child_weight'] = 6
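The search loop used here is a plain exhaustive grid search. A minimal, library-free sketch of the same pattern follows. Both `grid_search` and the toy `fake_cv_mae` scoring function are hypothetical stand-ins: in the tutorial the score comes from `xgb.cv`, and the toy function's minimum is placed at the values found above purely for illustration.

```python
from itertools import product

def grid_search(grid, score_fn):
    """Try every combination in `grid`; return (best_params, best_score), lower is better."""
    names = list(grid)
    best_params, best_score = None, float("inf")
    for values in product(*(grid[name] for name in names)):
        candidate = dict(zip(names, values))
        score = score_fn(candidate)
        if score < best_score:
            best_params, best_score = candidate, score
    return best_params, best_score

# Hypothetical stand-in for a cross-validated MAE (not a real measurement).
def fake_cv_mae(p):
    return abs(p["max_depth"] - 10) + abs(p["min_child_weight"] - 6)

best, score = grid_search(
    {"max_depth": range(9, 12), "min_child_weight": range(5, 8)},
    fake_cv_mae,
)
print(best, score)
```

Passing the scoring function as an argument keeps the search logic separate from the model, so the same helper could be reused for other parameter pairs.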

### Parameters subsample and colsample_bytree

Those parameters control the sampling of the dataset that is done at each boosting round.

Instead of using the whole training set every time, we can build a tree on slightly different data at each step, which makes it less likely to overfit to a single sample or feature.
>subsample corresponds to the fraction of observations (the rows) to subsample at each step. By default it is set to 1 meaning that we use all rows.

>colsample_bytree corresponds to the fraction of features (the columns) to use. By default it is set to 1 meaning that we will use all features.
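To illustrate what these fractions mean, the sketch below draws a fresh subset of rows and columns for each boosting round using only the standard library. The `sample_rows_and_columns` helper is hypothetical and the exact sampling mechanism inside XGBoost may differ; only the idea of building each tree on a different slice of the data is shown.

```python
import random

def sample_rows_and_columns(n_rows, n_cols, subsample, colsample_bytree, rng):
    """Pick the rows and columns one tree would be built on (illustrative only)."""
    n_r = max(1, int(round(subsample * n_rows)))
    n_c = max(1, int(round(colsample_bytree * n_cols)))
    rows = sorted(rng.sample(range(n_rows), n_r))
    cols = sorted(rng.sample(range(n_cols), n_c))
    return rows, cols

rng = random.Random(0)
for boosting_round in range(3):
    rows, cols = sample_rows_and_columns(
        10, 5, subsample=0.8, colsample_bytree=0.6, rng=rng
    )
    print(boosting_round, rows, cols)
```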

Let’s see if we can get better results by tuning those parameters together.

In [None]:
gridsearch_params = [
    (subsample, colsample)
    for subsample in [i/10. for i in range(7,11)]
    for colsample in [i/10. for i in range(7,11)]
]

In [None]:
min_mae = float("Inf")
best_params = None
# We start by the largest values and go down to the smallest
for subsample, colsample in reversed(gridsearch_params):
    print("CV with subsample={}, colsample={}".format(
                             subsample,
                             colsample))
    # We update our parameters
    params['subsample'] = subsample
    params['colsample_bytree'] = colsample
    # Run CV
    cv_results = xgb.cv(
        params,
        dtrain,
        num_boost_round=num_boost_round,
        seed=42,
        nfold=5,
        metrics={'mae'},
        early_stopping_rounds=10
    )
    # Update best score
    mean_mae = cv_results['test-mae-mean'].min()
    # argmin is 0-based, so add 1 to get the number of rounds
    boost_rounds = cv_results['test-mae-mean'].argmin() + 1
    print("\tMAE {} for {} rounds".format(mean_mae, boost_rounds))
    if mean_mae < min_mae:
        min_mae = mean_mae
        best_params = (subsample,colsample)
print("Best params: {}, {}, MAE: {}".format(best_params[0], best_params[1], min_mae))

In [None]:
# Again, we update our params dictionary.
params['subsample'] = .8
params['colsample_bytree'] = 1.

### Parameter ETA
The ETA parameter controls the learning rate. It corresponds to the shrinkage of the weights associated with features after each round; in other words, it defines the amount of "correction" we make at each step (remember how each boosting round corrects the errors of the previous one? If not, check our first tutorial here).
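A toy calculation shows why a smaller eta needs more rounds. Assume, purely for illustration, that each round's learner predicts the current residual exactly and that only a fraction eta of that correction is applied; the residual then shrinks by a factor of (1 - eta) per round. The `rounds_to_converge` helper and its default values are hypothetical.

```python
def rounds_to_converge(eta, initial_residual=100.0, tolerance=1.0):
    """Count boosting rounds until the residual drops below `tolerance`.

    Toy setting: each round's learner predicts the current residual exactly,
    and only a fraction `eta` of that correction is applied.
    """
    residual, rounds = initial_residual, 0
    while abs(residual) >= tolerance:
        residual -= eta * residual  # apply a shrunken correction
        rounds += 1
    return rounds

for eta in [0.3, 0.1, 0.01]:
    print("eta={}: {} rounds".format(eta, rounds_to_converge(eta)))
```

Real boosting rounds are noisier than this idealised model, but the trend is the same: dividing eta by some factor multiplies the number of rounds needed by roughly that factor.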


In practice, a lower eta makes our model more robust to overfitting, so usually the lower the learning rate, the better. But with a lower eta we need more boosting rounds, which takes more time to train, sometimes for only marginal improvements. Let's try a few values here, and time them with the notebook command:

In [None]:
%%time
# This can take some time…
import time

min_mae = float("Inf")
best_params = None
for eta in [.3, .2, .1, .05, .01, .005]:
    print("CV with eta={}".format(eta))
    # We update our parameters
    params['eta'] = eta
    # Time each CV run; %%time above reports the total for the cell
    start = time.perf_counter()
    cv_results = xgb.cv(
        params,
        dtrain,
        num_boost_round=num_boost_round,
        seed=42,
        nfold=5,
        metrics=['mae'],
        early_stopping_rounds=10
    )
    print("\tCV took {:.1f} s".format(time.perf_counter() - start))
    # Update best score
    mean_mae = cv_results['test-mae-mean'].min()
    # argmin is 0-based, so add 1 to get the number of rounds
    boost_rounds = cv_results['test-mae-mean'].argmin() + 1
    print("\tMAE {} for {} rounds\n".format(mean_mae, boost_rounds))
    if mean_mae < min_mae:
        min_mae = mean_mae
        best_params = eta
print("Best params: {}, MAE: {}".format(best_params, min_mae))

In [None]:
params['eta'] = .01

## Results
Here is what our final dictionary of parameters looks like:

In [None]:
params
{'colsample_bytree': 1.0,
 'eta': 0.01,
 'eval_metric': 'mae',
 'max_depth': 10,
 'min_child_weight': 6,
 'objective': 'reg:squarederror',
 'subsample': 0.8}

In [None]:
# Let’s train a model with it and see how well it does on our test set!
model = xgb.train(
    params,
    dtrain,
    num_boost_round=num_boost_round,
    evals=[(dtest, "Test")],
    early_stopping_rounds=10
)

In [None]:
print("Best MAE: {:.2f} in {} rounds".format(model.best_score, model.best_iteration+1))

## Saving your model
Although we found the best number of rounds, our model has been trained with more rounds than optimal; before using it for predictions, we should therefore retrain it with the right number of rounds. Since we now know the exact best num_boost_round, we don't need early_stopping_rounds anymore.

In [None]:
num_boost_round = model.best_iteration + 1
best_model = xgb.train(
    params,
    dtrain,
    num_boost_round=num_boost_round,
    evals=[(dtest, "Test")]
)

In [None]:
mean_absolute_error(y_test, best_model.predict(dtest))

Great! If you want to re-use your model on new data in the future, it can be a good idea to save it to a file. Here is how you can do it with XGBoost:

In [None]:
best_model.save_model("my_model.model")

You can then load the model later with:

In [None]:
loaded_model = xgb.Booster()
loaded_model.load_model("my_model.model")
# And use it for predictions.
loaded_model.predict(dtest)