# **Tuning XGBoost: A practical outlook of three popular hyperparameter tuning methods**
### **Important concepts:**

**What is an XGBoost?** 

XGBoost is a highly scalable and distributed machine learning algorithm that leverages gradient boosting and decision trees to address a wide range of problems, including regression, classification, and ranking. XGBoost, which is short for Extreme Gradient Boosting, gets its "extreme" label because it employs advanced regularization techniques to enhance the performance of gradient boosting. XGBoost was developed with the specific goal of making gradient boosting more versatile and regularized across various applications.

**What are hyperparameters?**

Hyperparameters are external model parameter settings whose values cannot be learned from the data during training. These parameters are configured manually before initiating the training process, and they remain unaffected by the training itself.

**What is Hyperparameter Tuning?** 

Hyperparameter tuning involves the quest for the ideal hyperparameter values that can maximize a machine learning algorithm's performance. The primary objectives are to boost a model's predictive precision and potentially optimize its computational efficiency. It's essential to note that the optimal hyperparameters can be subjective, varying from one dataset to another, and it's the responsibility of the model developer to specify these hyperparameters, as they are fundamental settings for all machine learning models.

**Hyperparameters in XGBoost**

In XGboost, there are two main categories of hyperparameters: 

* tree booster hyperparameters
* learning task hyperparameters

Hyperparameters related to tree booster govern the construction and intricacy of the decision trees within the model. Examples of tree booster parameters include: 

* *max_depth:* Maximum depth of a tree increases the model complexity. A model with a very high max_depth will most likely overfits. Default value is 6. 

* *subsample:* Subsample ratio of the training instances. Setting it to 0.5 means XGboost would randomly sample half of the training data prior to growing trees, and this helps prevent overfitting. Subsampling will occur once in every boosting iteration. Default value is 1.

Hyperparameters related to learning task play a pivotal role in shaping both the model's behavior and the entire learning process. Examples of such are: 

* *learning_rate:* Step size shrinkage used in update to prevent overfitting. After each boosting step, we can directly get the weights of new
features, and learning rate shrinks the feature weights to make the boosting process more conservative. Lower values make the model more robust by taking smaller steps. Default value is 0.3

* *alpha:* L1 regularization term on weights. Increasing this value will make model more conservation. *alpha* ranges from 0 to postive infinity. Default value is 0

* *lambda:* L2 regularization term on weights. Increasing this value will make model more conservation. *lambda* ranges from 0 to postive infinity. Default value is 1.

**Methods of Hyperparameter tuning**

In this project, I have explored and compared only three hyperparameter methods. And of course, there are other methods beyond these three. The three includes:

* Grid Search: This is a systematic and automated method for hyperparameter tuning. Think of it as a way of finding the best combination of hyperparameters for your model without having to manually try different values one by one. The time and resources to run every combinations becomes increase as the combination increases. Hence, it is very time consuming and inefficient for a production task. 

* Randomized Search: This is an alternative approach to Grid Search, which exhaustively tries all possible combinations of hyperparameter values. This method performs a random selection/combination of hyperparameters for find the optimal values. The downside to using this method might be the inability to find the optimal hyperparameters. 

* Optuna:  Optuna is a sophisticated technique for hyperparameter tuning, employing a Bayesian optimization approach. It harnesses the power of Bayesian reasoning by calculating probabilities to pinpoint the best hyperparameter values, which efficiently reduces computational overhead by eliminating combinations of parameters that are not contributing to model performance. Optuna stands out for its effectiveness in both sampling potential hyperparameters and pruning out less promising ones, making it a valuable tool for enhancing the efficiency of hyperparameter optimization.  


In [2]:
import pandas as pd
import numpy as np
import random 
random.seed(123)
import time
#--------------------------------------------------------------------------------------------
import xgboost as xgb
from xgboost import XGBClassifier
#---------------------------------------------------------------------------------------------
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.metrics import precision_score, recall_score, f1_score  
from pprint import pprint
#---------------------------------------------------------------------------------------------
import optuna
from optuna.samplers import TPESampler
#----------------------------------------------------------------------------------------------
import warnings
warnings.filterwarnings("ignore") 

# **1. Import data**

For this project, I have used the wholesale customers data, downloaded from kaggle. The dataset refers to clients of a wholesale distributor. It includes the annual spending in monetary units (m.u) on diverse product categories. Each model in this notebook aim to predict the category to which customers belong - Hotel/Restaurant/Cafe OR Retail channel. Hence, this is a classification problem.

Two things to note about the target variable:

* The target variable is unbalanced, that is, one class has more instance than the other
* No sampling technique is used to adjust for class imbalance. The data is used as is. 

In [2]:
data = pd.read_csv('C:/Users/oyeni/Projects/XGboost/Data/Wholesale customers data.csv')
data.head()

Unnamed: 0,Channel,Region,Fresh,Milk,Grocery,Frozen,Detergents_Paper,Delicassen
0,2,3,12669,9656,7561,214,2674,1338
1,2,3,7057,9810,9568,1762,3293,1776
2,2,3,6353,8808,7684,2405,3516,7844
3,1,3,13265,1196,4221,6404,507,1788
4,2,3,22615,5410,7198,3915,1777,5185


# **2. Define independent and dependent variables/Split data into train and test sets**

In [3]:
#create dependent and independent variables 
X = data.drop('Channel', axis = 1)
y = data['Channel']

#convert labels into binary values
y[y==2] = 0
y[y==1] = 1

#split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=0)

print('Training Data Shape', X_train.shape, y_train.shape)
print('Testing Data Shape', X_test.shape, y_test.shape)

Training Data Shape (308, 7) (308,)
Testing Data Shape (132, 7) (132,)


# **3. Define functions for data summary and model evaluation**

In [4]:
def data_statistics(df, response):
    """Returns the data's different statistics.
        Parameters:
              X: independent variables 
              y: dependent variable
    
        Returns: 
              Top 5 rows of data
              Five number summary
              Number of unique elements per column
              The shape of the data
              Data type information
              distribution of dependent variable
              Dependent variable label count
              missing data
    """
    print(f"Top 5 rows of data: \n{df.head(5)}")
    print("---------------------------------------------------------------------------------------------------------------------")
    print(f"Five number summary: \n{df.describe()}")
    print("---------------------------------------------------------------------------------------------------------------------")
    print(f"Number of unique element per column: \n{df.nunique()}")
    print("---------------------------------------------------------------------------------------------------------------------")
    print(f"Shape of the data is : \n{df.shape}")
    print("---------------------------------------------------------------------------------------------------------------------")
    print(df.info())
    print("---------------------------------------------------------------------------------------------------------------------")
    print(f"Distribution of dependent variable: \n{df[response].value_counts()}")
    print("---------------------------------------------------------------------------------------------------------------------")
    print(f"Unique label in dependent variable: {sorted(df[response].unique())}")
    print("---------------------------------------------------------------------------------------------------------------------")
    print(f"How many data are missing: \n{df.isnull().sum()}")


In [5]:
def evaluate_train(model):
    """
    Return predictions, precision, recall and F1_score obtained on train data.
        Parameters: 
              model: the model used
        
        Returns:
              Predictions, precision, recall and F1_score on train data

    """
    pred = model.predict(X_train)
    precision = precision_score(y_train, pred)
    recall = recall_score(y_train, pred)
    F1_score = f1_score(y_train, pred)
    
    return pred, precision, recall, F1_score

In [6]:
def evaluate_test(model):
    """
    Return predictions, precision, recall and F1_score obtained on test data.
        Parameters: 
              model: the model used
        
        Returns:
              Predictions, precision, recall and F1_score on train data

    """
    pred = model.predict(X_test)
    precision = precision_score(y_test, pred)
    recall = recall_score(y_test, pred)
    F1_score = f1_score(y_test, pred)
    
    return pred, precision, recall, F1_score

In [7]:
data_statistics(df=data, response='Channel')

Top 5 rows of data: 
   Channel  Region  Fresh  Milk  Grocery  Frozen  Detergents_Paper  Delicassen
0        0       3  12669  9656     7561     214              2674        1338
1        0       3   7057  9810     9568    1762              3293        1776
2        0       3   6353  8808     7684    2405              3516        7844
3        1       3  13265  1196     4221    6404               507        1788
4        0       3  22615  5410     7198    3915              1777        5185
---------------------------------------------------------------------------------------------------------------------
Five number summary: 
          Channel      Region          Fresh          Milk       Grocery  \
count  440.000000  440.000000     440.000000    440.000000    440.000000   
mean     0.677273    2.543182   12000.297727   5796.265909   7951.277273   
std      0.468052    0.774272   12647.328865   7380.377175   9503.162829   
min      0.000000    1.000000       3.000000     55.000000   

# **4. Tuning XGBoost with Grid Search**

In [8]:
# specify hyperparameters to try
max_depth = [int(x) for x in np.linspace(3,7, num=3)]
learning_rate = [0.1, 0.01, 0.001]
subsample = [float(x) for x in np.linspace(0.5,1.0, num=3)]
n_estimator = [int(x) for x in np.linspace(100,300, num=3)]
alpha = [int(x) for x in np.linspace(3,9, num=3)]
objective = ['binary:logistic']

param = {
    'max_depth': max_depth,
    'learning_rate': learning_rate,
    'subsample': subsample,
    'n_estimator': n_estimator,
    'alpha': alpha,
    'objective': objective
}

In [9]:
#instantiate xgboost classifier
xgb_model = xgb.XGBClassifier()

In [10]:
start_gridsearch = time.time()
#Instantiate GridSearch
grid_search = GridSearchCV(estimator = xgb_model, 
                           param_grid = param, 
                           scoring="f1", 
                           cv=5, 
                           n_jobs=-1, 
                           verbose=1)

#fit training data to the xgboost gridsearch algorithm
grid_search.fit(X_train, y_train)
end_gridsearch = time.time()
grid_time = (time.time() - start_gridsearch)

Fitting 5 folds for each of 243 candidates, totalling 1215 fits


In [11]:
_, precision_gridtrain, recall_gridtrain, f1_gridtrain = evaluate_train(grid_search)
print("Precision = {} \nRecall = {} \nf1 = {}".format(precision_gridtrain, recall_gridtrain, f1_gridtrain))

Precision = 0.9716981132075472 
Recall = 0.9716981132075472 
f1 = 0.9716981132075472


In [12]:
_, precision_gridtest, recall_gridtest, f1_gridtest = evaluate_test(grid_search)
print("Precision = {} \nRecall = {} \nf1 = {}".format(precision_gridtest, recall_gridtest, f1_gridtest))

Precision = 0.8829787234042553 
Recall = 0.9651162790697675 
f1 = 0.9222222222222223


# **5. Tuning XGBoost with Randomized Search**

In [13]:
start_randomsearch = time.time()
#Instantiate RandomizedSearch
random_search = RandomizedSearchCV(estimator = xgb_model, 
                           param_distributions = param, 
                           scoring="f1", 
                           cv=5,
                           random_state=35,
                           n_jobs=-1, 
                           verbose=1)

#fit training data to the xgboost gridsearch algorithm
random_search.fit(X_train, y_train)
end_randomsearch = time.time()
random_time = (time.time() - start_randomsearch)

Fitting 5 folds for each of 10 candidates, totalling 50 fits


In [14]:
#random_search.best_params_

In [15]:
_, precision_randtrain, recall_randtrain, f1_randtrain = evaluate_train(random_search)
print("Precision = {} \nRecall = {} \nf1 = {}".format(precision_randtrain, recall_randtrain, f1_randtrain))

Precision = 0.9619047619047619 
Recall = 0.9528301886792453 
f1 = 0.957345971563981


In [16]:
_, precision_randtest, recall_randtest, f1_randtest = evaluate_test(random_search)
print("Precision = {} \nRecall = {} \nf1 = {}".format(precision_randtest, recall_randtest, f1_randtest))

Precision = 0.8804347826086957 
Recall = 0.9418604651162791 
f1 = 0.9101123595505618


# **6. Tuning XGBoost with Optuna**

In [17]:
# define objective function 
def objective(trial):
    """
    The function returns the mean F1 score 

       Parameters: 
           trial: a process of evaluating an objective function
       
       Returns: f1-score

    """ 
    n_estimators = trial.suggest_int('n_estimators', low=100, high=300, step=50)
    learning_rate = trial.suggest_loguniform('learning_rate', 0.001, 0.1)
    #learning_rate = trial.suggest_float('learning_rate', low=0.001, high=0.1,step=3)
    subsample = trial.suggest_float('subsample', low=0.5, high=1.0, step=3)
    max_depth = trial.suggest_int('max_depth', low=3, high=7, step=2)
    alpha = trial.suggest_int('alpha', low=3, high=9, step=3)
    objective = trial.suggest_categorical('objective', ['binary:logistic'])

    xgb_c = XGBClassifier(n_estimator = n_estimators, 
                              objective = objective,
                              learning_rate = learning_rate,
                              subsample = subsample, 
                              max_depth = max_depth,
                              alpha = alpha)
    
    score = cross_val_score(estimator=xgb_c,
                            X=X_train,   
                            y=y_train,
                            scoring = 'f1',
                            cv=5,
                            n_jobs=-1).mean()
    
    return score

study = optuna.create_study(sampler=TPESampler(), direction='maximize')

time_start = time.time()
optuna.logging.set_verbosity(optuna.logging.WARNING)
study.optimize(objective, n_trials=100)
optuna_time = time.time() - time_start

[I 2023-09-27 16:36:41,440] A new study created in memory with name: no-name-35ffda45-ede3-47e5-9ddf-b43186a5f673


In [18]:
#pprint(study.best_params)

In [19]:
#best hyperparameters
xgb_optuna = XGBClassifier(n_estimator = 250, 
                              objective = 'binary:logistic',
                              learning_rate = 0.08262982765412646,
                              subsample = 0.5, 
                              max_depth = 5,
                              alpha = 3).fit(X_train, y_train)

In [20]:
_, precision_optunatrain, recall_optunatrain, f1_optunatrain = evaluate_train(xgb_optuna)
print("Precision = {} \nRecall = {} \nf1 = {}".format(precision_optunatrain, recall_optunatrain, f1_optunatrain))

Precision = 0.9624413145539906 
Recall = 0.9669811320754716 
f1 = 0.9647058823529412


In [21]:
_, precision_optunatest, recall_optunatest, f1_optunatest = evaluate_test(xgb_optuna)
print("Precision = {} \nRecall = {} \nf1 = {}".format(precision_optunatest, recall_optunatest, f1_optunatest))

Precision = 0.8736842105263158 
Recall = 0.9651162790697675 
f1 = 0.9171270718232045


# **7. Conclusion**

In [22]:
model_values = ['xgboost_grid', grid_time, precision_gridtrain, precision_gridtest, recall_gridtrain, recall_gridtest, f1_gridtrain, f1_gridtest]
columns = ['models', 'Time Elapsed (s)', 'model_precision_train', 'model_precision_test', 'model_recall_train', 'model_recall_test', 'model_f1_train', 'model_f1_test']
grid_results = pd.DataFrame([model_values], columns = columns)

In [23]:
model_values = ['xgboost_random', random_time, precision_randtrain, precision_randtest, recall_randtrain, recall_randtest, f1_randtrain, f1_randtest]
columns = ['models', 'Time Elapsed (s)', 'model_precision_train', 'model_precision_test', 'model_recall_train', 'model_recall_test', 'model_f1_train', 'model_f1_test']
random_results = pd.DataFrame([model_values], columns = columns)

In [24]:
model_values = ['xgboost_optuna', optuna_time, precision_optunatrain, precision_optunatest, recall_optunatrain, recall_optunatest, f1_optunatrain, f1_optunatest]
columns = ['models', 'Time Elapsed (s)', 'model_precision_train', 'model_precision_test', 'model_recall_train', 'model_recall_test', 'model_f1_train', 'model_f1_test']
optuna_results = pd.DataFrame([model_values], columns = columns)

In [25]:
#stack table 
final_results = grid_results.append(random_results).append(optuna_results)
final_results.index = ['Grid Search', 'Randomized Search', 'Optuna']
final_results.sort_values('model_f1_test', ascending = False)

Unnamed: 0,models,Time Elapsed (s),model_precision_train,model_precision_test,model_recall_train,model_recall_test,model_f1_train,model_f1_test
Grid Search,xgboost_grid,14.147263,0.971698,0.882979,0.971698,0.965116,0.971698,0.922222
Optuna,xgboost_optuna,12.232657,0.962441,0.873684,0.966981,0.965116,0.964706,0.917127
Randomized Search,xgboost_random,0.582441,0.961905,0.880435,0.95283,0.94186,0.957346,0.910112


In terms of performance, when evaluating based on the F1-score, it became evident that XGBoost, with both grid search and Optuna, demonstrated the most outstanding performance. However, when we take into account not only model performance but also runtime efficiency, Optuna emerged as the top-performing method.

Nevertheless, if there's a willingness to make a slight compromise on model performance and prioritize runtime efficiency, XGBoost with randomized search outperformed both grid search and Optuna.

To sum it up, the choice of hyperparameter tuning method is heavily contingent on the specific context. For instance, in a production environment where time and resources are at a premium, grid search may prove too resource-intensive due to its exhaustive exploration of hyperparameter combinations. 