# Quick demonstration of hyperopt with XGBoost
## Not a comprehensive guide, but a quick start for aspiring data scientists who want to run before they can walk.
Reading:
* https://xgboost.readthedocs.io/en/stable/
* http://hyperopt.github.io/hyperopt/

Better Guides and Examples:
* https://www.kaggle.com/prashant111/a-guide-on-xgboost-hyperparameters-tuning 


![image.png](attachment:dfa36ecc-d52a-4c2e-aab3-f1e9e8e6309e.png)

# Import Libraries

In [141]:
import pandas as pd
import numpy as np
import xgboost
from xgboost import XGBClassifier
from hyperopt import fmin, tpe, hp, Trials, STATUS_OK
from hyperopt.early_stop import no_progress_loss
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.datasets import fetch_openml
from sklearn.metrics import classification_report
from functools import partial

# Data cleaning, processing, imputing, interactions...
Out of scope for this demo, but don't forget that this is 80-90% of the work.

![image.png](attachment:f58fcfce-ec51-437c-a6f3-3c0bf98e4506.png)

# Get Some Data - Titanic Dataset....

In [75]:
X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)

In [91]:
# Select a few numeric features that the model can use as-is.
titanic = fetch_openml("titanic", version=1, as_frame=True, return_X_y=False)  # fetch once, reuse for data and target
df = titanic["data"][["age", "sibsp", "parch", "pclass"]].copy()
df["target"] = titanic["target"].astype(int)
df.head()

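|   | age | sibsp | parch | pclass | target |
|---|-----|-------|-------|--------|--------|
| 0 | 29.0 | 0.0 | 0.0 | 1.0 | 1 |
| 1 | 0.9167 | 1.0 | 2.0 | 1.0 | 1 |
| 2 | 2.0 | 1.0 | 2.0 | 1.0 | 0 |
| 3 | 30.0 | 1.0 | 2.0 | 1.0 | 0 |
| 4 | 25.0 | 1.0 | 2.0 | 1.0 | 0 |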


# Define Target and Split 

In [92]:
X = df.drop("target",axis=1)
y= df["target"]

In [95]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
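
As a quick sanity check on the split size: to my understanding, scikit-learn rounds a fractional `test_size` up to a whole number of rows and gives the remainder to the training set. The row count of 1,309 used below is an assumption about the OpenML Titanic table (it is not printed anywhere in this notebook), so treat the sketch as illustrative arithmetic only.

```python
import math

n_samples = 1309   # assumed row count of the OpenML "titanic" (version 1) table
test_size = 0.33   # same fraction passed to train_test_split above

n_test = math.ceil(test_size * n_samples)   # fractional test size rounded up
n_train = n_samples - n_test                # the remainder goes to training

print(n_train, n_test)
```

Under that assumption the test set has 432 rows, which agrees with the total support shown in the classification report at the end of the notebook.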

# As a wise man once said.... Meat & Potatoes

Steps:
1: Define the search space for the optimizer. Don't make it so narrow that the result is anticlimactic, nor so wide that you'd rather watch paint dry. Be sensible!
2: Construct the objective function. It can be simple or complex; the world is your oyster.
3: Chuck it into gear and go full throttle ahead.
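
Before bringing in hyperopt, the three steps can be sketched with the standard library alone. This is a toy stand-in under stated assumptions: it samples at random instead of using TPE, and `toy_space`, `toy_objective` and `random_search` are hypothetical names invented for the illustration.

```python
import random
from functools import partial

# Step 1: the search space, here plain lists of candidate values.
toy_space = {"x": list(range(-10, 11)), "y": list(range(-10, 11))}

# Step 2: the objective returns a loss, where smaller is better.
def toy_objective(params, target):
    return (params["x"] - target[0]) ** 2 + (params["y"] - target[1]) ** 2

# Step 3: run the search. Plain random sampling, NOT hyperopt's TPE algorithm.
def random_search(objective, space, max_evals, seed=0):
    rng = random.Random(seed)
    best_params, best_loss = None, float("inf")
    for _ in range(max_evals):
        params = {name: rng.choice(values) for name, values in space.items()}
        loss = objective(params)
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss

# partial() fixes the extra argument so the search only has to pass params.
bound_objective = partial(toy_objective, target=(3, -2))
best_params, best_loss = random_search(bound_objective, toy_space, max_evals=200)
print(best_params, best_loss)
```

The same shape appears in the real cells that follow: a space, an objective bound with `partial`, and a driver that calls it repeatedly.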

![image.png](attachment:5981f0dd-97ae-451d-b857-5559e4b62b39.png)



In [122]:
"""
Define hyperopt search space
"""

search_space = {
    'max_depth' : hp.choice('max_depth', range(10, 50, 1)), # Depths this large are far too much for the Titanic dataset and will overfit, but that's out of scope here.
    'learning_rate' : hp.quniform('learning_rate', 0.01, 0.15, 0.02), # Shrinkage applied to each new tree's contribution, sampled in steps of 0.02.
    'n_estimators' : hp.choice('n_estimators', range(5, 20, 5)), # Number of trees: 5, 10 or 15.
    'min_child_weight': hp.choice('min_child_weight', range(20, 30, 10)) # Minimum sum of instance weights (hessian) needed in a child node. Higher values curb overfitting; you'll have to experiment. Note that this range yields only the value 20.
}
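
To see which values this space can actually produce, the ranges can be expanded with plain Python. The hyperopt documentation describes `hp.quniform` as returning a value like `round(uniform(low, high) / q) * q`; `quniform_sketch` below is a hypothetical re-implementation of that formula for illustration, not hyperopt's own code.

```python
import random

def quniform_sketch(low, high, q, rng):
    # Same formula the hyperopt docs give for hp.quniform.
    return round(rng.uniform(low, high) / q) * q

# Candidate lists behind each hp.choice in the search space.
max_depth_values = list(range(10, 50, 1))
n_estimators_values = list(range(5, 20, 5))
min_child_weight_values = list(range(20, 30, 10))

# Sample the learning rate many times and collect the distinct values seen.
rng = random.Random(0)
learning_rates = sorted(
    {round(quniform_sketch(0.01, 0.15, 0.02, rng), 2) for _ in range(2000)}
)

print(len(max_depth_values), n_estimators_values, min_child_weight_values)
print(learning_rates)
```

Expanding the ranges like this is a cheap way to catch a dimension that is not really being searched, such as `min_child_weight` here.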

In [135]:
def objective(params, X_train, y_train, cv, metric, maximise=True):
    """
        * This is the objective function called by the hyperopt minimisation algorithm.
        * hyperopt always minimises the returned "loss", whereas sklearn scorers follow "Larger is Better". To search for the highest score, the score must therefore be negated.
        * As written, the score is negated when maximise=False and used unchanged when maximise=True, so pass maximise=False for larger-is-better metrics such as precision.
        * This function is called once for each evaluation requested in the fmin() hyperopt function.
        * With cross_val_score, the scoring metric can be custom (see sklearn make_scorer) or chosen from: https://scikit-learn.org/stable/modules/model_evaluation.html
    Args:
        params: Dictionary of hyperparameter values sampled from the search space
        X_train, y_train: Training features and labels
        cv: Number of stratified folds
        metric: sklearn scoring name or scorer
        maximise: If False, the mean score is negated before being returned as the loss

    Returns:
        Dictionary with:
        * loss: mean score from cross_val_score (negated when maximise=False)
        * Metric: metric used for optimisation
        * paramdict: parameters used to achieve the returned loss
        * status: hyperopt status flag
    """

    # use_label_encoder=False silences a deprecation warning on older XGBoost releases.
    clf = XGBClassifier(n_jobs=40, **params, use_label_encoder=False, eval_metric="logloss")

    score = cross_val_score(estimator=clf,
                            X=X_train,
                            y=y_train,
                            scoring=metric,
                            cv=StratifiedKFold(n_splits=cv)).mean()

    if maximise is not True:
        score = -score

    print("Parameters: {}".format(params))
    print("Score {}".format(score))
    return {"loss": score, "Metric": metric, "paramdict": params, "status": STATUS_OK}
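
The sign flip is easy to check on made-up numbers. The precision values below are hypothetical, not results from this notebook; the point is only that a minimiser working on the negated score picks the configuration with the highest score.

```python
# Hypothetical mean CV precision for three made-up configurations.
cv_precision = {"config_a": 0.58, "config_b": 0.61, "config_c": 0.55}

# A minimiser sees the negated score as its loss.
losses = {name: -score for name, score in cv_precision.items()}

picked_by_minimiser = min(losses, key=losses.get)
highest_precision = max(cv_precision, key=cv_precision.get)

print(picked_by_minimiser, highest_precision)
```

Both lookups land on the same configuration, which is why the printed scores in the run below are negative.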

In [136]:
# Full throttle ahead
trials = Trials()
fmin_objective = partial(objective,
                         X_train=X_train,
                         y_train=y_train,
                         cv=4,
                         maximise=False,  # negate the score so that minimising the loss maximises precision
                         metric="precision")

fmin_output = fmin(fn=fmin_objective,
                   space=search_space,
                   algo=tpe.suggest,
                   max_evals=100,
                   trials=trials,
                   early_stop_fn=no_progress_loss(70))  # stop after 70 evaluations without improvement

Parameters: {'learning_rate': 0.14, 'max_depth': 37, 'min_child_weight': 20, 'n_estimators': 5}
Score -0.6124809448000639
Parameters: {'learning_rate': 0.02, 'max_depth': 47, 'min_child_weight': 20, 'n_estimators': 10}
Score -0.5768878524024752
Parameters: {'learning_rate': 0.14, 'max_depth': 27, 'min_child_weight': 20, 'n_estimators': 5}
Score -0.6124809448000639
Parameters: {'learning_rate': 0.06, 'max_depth': 35, 'min_child_weight': 20, 'n_estimators': 10}
Score -0.5782929106120296
Parameters: {'learning_rate': 0.14, 'max
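
The idea behind `no_progress_loss(70)` is a patience counter: stop once the best loss has not improved for a set number of evaluations. The function below is a simplified sketch of that idea with a hypothetical name and made-up losses; it is not hyperopt's implementation.

```python
def evals_before_stop(losses, patience):
    """Return how many evaluations run before a no-progress rule stops the search."""
    best_loss = float("inf")
    since_improvement = 0
    for n_done, loss in enumerate(losses, start=1):
        if loss < best_loss:
            best_loss = loss
            since_improvement = 0      # progress made, reset the counter
        else:
            since_improvement += 1     # no progress on this evaluation
        if since_improvement >= patience:
            return n_done
    return len(losses)                 # patience never ran out

# Made-up losses (negated scores), not taken from the run above.
hypothetical_losses = [-0.58, -0.61, -0.60, -0.61, -0.59, -0.62, -0.60]

print(evals_before_stop(hypothetical_losses, patience=3))
print(evals_before_stop(hypothetical_losses, patience=70))
```

With a patience of 3 this toy search halts before it reaches the best loss in the list, so a small patience trades search quality for speed.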

# Get the Best Params


In [137]:
best = trials.best_trial["result"]["paramdict"]
print(best)

{'learning_rate': 0.14, 'max_depth': 39, 'min_child_weight': 20, 'n_estimators': 15}
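
Reading the parameters back from `paramdict` matters because, for `hp.choice` dimensions, the dictionary returned by `fmin` holds the index of the chosen option rather than its value. hyperopt provides `space_eval` for that conversion; the sketch below does the same lookup by hand, using hypothetical indices rather than the actual `fmin_output` of this run.

```python
# Option lists matching the hp.choice ranges in the search space.
choice_options = {
    "max_depth": list(range(10, 50, 1)),
    "n_estimators": list(range(5, 20, 5)),
    "min_child_weight": list(range(20, 30, 10)),
}

# Hypothetical fmin-style output: indices for hp.choice, a plain value for quniform.
index_style_output = {
    "max_depth": 29,
    "n_estimators": 2,
    "min_child_weight": 0,
    "learning_rate": 0.14,
}

# Look up each index in its option list; pass non-choice values through unchanged.
decoded = {
    name: choice_options[name][value] if name in choice_options else value
    for name, value in index_style_output.items()
}
print(decoded)
```

Passing index-style values straight into `XGBClassifier` would silently train a different model, so the lookup (or the stored `paramdict`) is worth keeping.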


# Fit model with best params 

In [140]:
BestModel = XGBClassifier(**best,use_label_encoder=False,eval_metric="logloss")
BestModel.fit(X_train,y_train)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, eval_metric='logloss',
              gamma=0, gpu_id=-1, importance_type='gain',
              interaction_constraints='', learning_rate=0.14, max_delta_step=0,
              max_depth=39, min_child_weight=20, missing=nan,
              monotone_constraints='()', n_estimators=15, n_jobs=48,
              num_parallel_tree=1, random_state=0, reg_alpha=0, reg_lambda=1,
              scale_pos_weight=1, subsample=1, tree_method='exact',
              use_label_encoder=False, validate_parameters=1, verbosity=None)

In [142]:
preds = BestModel.predict(X_test)

In [144]:
print(classification_report(y_true=y_test,y_pred=preds))

              precision    recall  f1-score   support

           0       0.68      0.90      0.77       254
           1       0.73      0.38      0.50       178

    accuracy                           0.69       432
   macro avg       0.70      0.64      0.64       432
weighted avg       0.70      0.69      0.66       432
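
The report's columns are linked by simple formulas, which can be checked from the rounded figures above. This is a small verification sketch; because it starts from rounded inputs, agreement is only expected to two decimals.

```python
def f1(precision, recall):
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)

# Rounded values copied from the classification report.
report = {
    0: {"precision": 0.68, "recall": 0.90, "support": 254},
    1: {"precision": 0.73, "recall": 0.38, "support": 178},
}

for label, row in report.items():
    print(label, round(f1(row["precision"], row["recall"]), 2))

# Weighted average: each class counts in proportion to its support.
total = sum(row["support"] for row in report.values())
weighted_precision = sum(
    row["precision"] * row["support"] for row in report.values()
) / total
print(total, round(weighted_precision, 2))
```

The low recall for class 1 drags its F1 down to 0.50 even though its precision, the metric the search optimised, is the higher of the two classes.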



# Hope it helps