<center><h1>Hyperparameters optimization XGBoost Classifier</h1></center>

<i><center><h2>Hypertuning, Training best parameters, save model, feature importance</h2></center></i>

<h2>Contents</h2>

- [Introduction](#part_intro)
- [I - Hyperparameters Optimization](#part_1)
    - [1 - Libraries](#part_1_1)
    - [2 - Load Data](#part_1_2)
    - [3 - Stopwords data](#part_1_3)
    - [4 - Preparation train test](#part_1_4)
- [II - Hyperparameters](#part_2)
    - [1 - RandomSearchCV](#part_2_1)
    - [2 - GridSearchCV](#part_2_2)
- [III - Best Model](#part_3)
    - [III - 1. Train model with best parameters](#part_3_1)
    - [III - 2. Save Model](#part_3_2)
    - [III - 3. Load and test to compare metrics](#part_3_3)
- [VI - Feature importance](#part_4)
    - [Part VI - 1](#part_4_1)
    - [Part VI - 2](#part_4_2)
- [Conclusion](#part_conclusion)

<h2><a id="part_intro">Introduction</a></h2>

<h2><a id="part_1">I - Hyperparameters Optimization</a></h2>

In [1]:
# Choose a method for data treatment 
count_vect         = True
tf_idf             = False
tf_idf_ngram       = False
tf_idf_ngram_char  = False
concat_methods     = False # concat data representation
random_search_model= True
grid_search_model  = False
num_gpu            = len(tf.config.experimental.list_physical_devices('GPU'))   # detect the number of gpu
# Name file 
NAME_COUNT_VECT_MODEL         = "count_vect_model.sav"
NAME_TF_IDF_MODEL             = "TF_IDF_model.sav"
NAME_TF_IDF_NGRAM_MODEL       = "TF_IDF_ngram_model.sav"
NAME_TF_IDF_NGRAM_CHAR_MODEL  = "TF_IDF_ngram_chars_model.sav"
NAME_XGB_CLASSIFIER_MODEL     = "XGBoost_classifier.sav"

<h3><a id="part_1_1">I 1 - Libraries</a></h3>

In [2]:
#imported libs
import numpy as np
import pandas as pd
from xgboost import XGBClassifier
import matplotlib.pyplot as plt
from scipy import stats
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
import sys
from sklearn import preprocessing
from sklearn.feature_extraction.text import CountVectorizer
from tensorflow.keras.preprocessing.text import text_to_word_sequence
from tqdm import tqdm
import pickle
import sklearn
# ---- Call tqdm to see progress bar with pandas
tqdm().pandas()

0it [00:00, ?it/s]
  from pandas import Panel


<h3><a id="part_1_2">I 2 - Load Data</a></h3>

In [3]:
train = pd.read_csv("../projet_classification_mails/mails_clean_concat_ref_folders.csv", sep=";")

In [4]:
TEXT   = "mails"
LABEL  = "label"

In [5]:
train.dropna(subset=[TEXT], inplace=True)

<h3><a id="part_1_3">I 3 - Stopwords data</a></h3>

In [6]:
if count_vect:
    stop_word = np.loadtxt("../stopwords/stopwords-fr.txt", dtype=str)

In [7]:
if count_vect:
    def remove_stop_words( x, stop_word):
        '''
        Function to remove a list of words
        @param x : (str) text 
        @param stop_word: (list) list of stopwords to delete 
        @return: (str) new string without stopwords 
        '''
        x_new = text_to_word_sequence(x)    # tokenize text 
        x_ = []
        for i in x_new:
            if i not in stop_word:
                x_.append(i)
        return " ".join(x_)
    
    train.loc[:,TEXT+"_sw"] = train.loc[:,TEXT].progress_apply(lambda x : remove_stop_words(x, stop_word))

100%|██████████| 31382/31382 [03:06<00:00, 168.58it/s]


In [8]:
if train[LABEL].isnull().sum()>0:
    train.dropna(subset=[LABEL], inplace=True)

In [9]:
train.isnull().sum()

mails         0
ref_folder    0
label         0
mails_sw      0
dtype: int64

In [10]:
train[LABEL][train[LABEL]!="annulation"] = "other"

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  train[LABEL][train[LABEL]!="annulation"] = "other"


<h3><a id="part_1_4">I 4 - Preparation train test</a></h3>

In [11]:
# split the dataset into training and validation datasets 
# ML classic 
if count_vect:
    train_x, test_x, y_train, y_test = train_test_split(train[TEXT+"_sw"], train[LABEL], random_state=42, stratify=train[LABEL], test_size=0.2)
else:
    train_x, test_x, y_train, y_test = train_test_split(train[TEXT], train[LABEL], random_state=42, stratify=train[LABEL], test_size=0.2)

# Validation set
train_x, valid_x, y_train, y_valid = train_test_split(train_x, y_train, random_state=42, stratify=y_train, test_size=0.2)
# label encode the target variable 
encoder = preprocessing.LabelEncoder()
train_y = encoder.fit_transform(y_train)
valid_y = encoder.fit_transform(y_valid)
test_y = encoder.fit_transform(y_test)

In [12]:
train_x.shape

(20084,)

<center><a id="part_6_4"><h3>One-Hot encoding (CountVectorizing)</h3></a></center>

In [13]:
%%time
if count_vect:
    # create a count vectorizer object 
    count_vect_ = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}')
    count_vect_.fit(train[TEXT]+"_sw")

    # transform the training and validation data using count vectorizer object
    xtrain =  count_vect_.transform(train_x)
    xvalid =  count_vect_.transform(valid_x)
    xtest =  count_vect_.transform(test_x)
    

CPU times: user 20.4 s, sys: 1.25 s, total: 21.6 s
Wall time: 21.8 s


<center><a id="part_6_5"><h3>TF-IDF</h3></a></center>

In [14]:
%%time
if tf_idf:
    # word level tf-idf
    tfidf_vect = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', max_features=10000)
    tfidf_vect.fit(train[TEXT])
    xtrain =  tfidf_vect.transform(train_x)
    xvalid =  tfidf_vect.transform(valid_x)
    xtest  =  tfidf_vect.transform(test_x)
    print("word level tf-idf done")
    
if tf_idf_ngram:
    # ngram level tf-idf 
    tfidf_vect_ngram_ = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', ngram_range=(2,3), max_features=10000)
    tfidf_vect_ngram_.fit(train[TEXT])
    xtrain =  tfidf_vect_ngram_.transform(train_x)
    xvalid =  tfidf_vect_ngram_.transform(valid_x)
    xtest =  tfidf_vect_ngram_.transform(test_x)
    print("ngram level tf-idf done")
    
if tf_idf_ngram_char:
    # characters level tf-idf
    tfidf_vect_ngram_chars_ = TfidfVectorizer(analyzer='char',  ngram_range=(2,3), max_features=10000) #token_pattern=r'\w{1,}',
    tfidf_vect_ngram_chars_.fit(train[TEXT])
    xtrain =  tfidf_vect_ngram_chars_.transform(train_x) 
    xvalid =  tfidf_vect_ngram_chars_.transform(valid_x) 
    xtest  =  tfidf_vect_ngram_chars_.transform(test_x) 
    print("characters level tf-idf done")

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 11.7 µs


In [15]:
if count_vect:
    # save the model to disk
    filename = NAME_COUNT_VECT_MODEL
    pickle.dump(count_vect_, open(filename, 'wb'))

if tf_idf:
    # save the model to disk
    filename = NAME_TF_IDF_MODEL
    pickle.dump(tfidf_vect, open(filename, 'wb'))
    
if tf_idf_ngram:
    # save the model to disk
    filename = NAME_TF_IDF_NGRAM_MODEL
    pickle.dump(tfidf_ngram_, open(filename, 'wb'))
    
if tf_idf_ngram_char:
    # save the model to disk
    filename = NAME_TF_IDF_NGRAM_CHAR_MODEL
    pickle.dump(tfidf_vect_ngram_chars_, open(filename, 'wb'))

<h2><a id="part_2">II - Hyperparameters</a></h2>

<h3><a id="part_2_1">II 1 - RandomizedSearchCV</a></h3>

In [16]:
from sklearn.metrics import make_scorer

In [17]:
if random_search_model:
    #For classification 

    #Random Search
    if num_gpu>0:
        xgb_pipeline =XGBClassifier(tree_method= 'gpu_hist')
    else:
        xgb_pipeline =XGBClassifier()
        
    params = {
            'learning_rate': [0.03, 0.01, 0.003, 0.001],
            'min_child_weight': [1,3, 5,7, 10],
            'gamma': [0, 0.5, 1, 1.5, 2, 2.5, 5],
            'subsample': [0.6, 0.8, 1.0, 1.2, 1.4],
            'colsample_bytree': [0.6, 0.8, 1.0, 1.2, 1.4],
            'max_depth': [3, 4, 5, 6, 7, 8, 9 ,10],
            'reg_lambda':np.array([0.4, 0.6, 0.8, 1, 1.2, 1.4])}

    fit_params = { 
            'early_stopping_rounds':10,
            'eval_set':[(xvalid, valid_y)]}


    random_search = RandomizedSearchCV(xgb_pipeline, param_distributions=params, n_iter=1000,
                                       scoring="precision", n_jobs=-1,  verbose=3, random_state=42, cv=3 )
    random_search.fit(xtrain,train_y, **fit_params)

Fitting 3 folds for each of 1000 candidates, totalling 3000 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  24 tasks      | elapsed:   45.3s
[Parallel(n_jobs=-1)]: Done 120 tasks      | elapsed:  5.3min
[Parallel(n_jobs=-1)]: Done 280 tasks      | elapsed: 15.3min
[Parallel(n_jobs=-1)]: Done 504 tasks      | elapsed: 26.9min
[Parallel(n_jobs=-1)]: Done 792 tasks      | elapsed: 43.9min
[Parallel(n_jobs=-1)]: Done 1144 tasks      | elapsed: 66.3min
[Parallel(n_jobs=-1)]: Done 1560 tasks      | elapsed: 92.4min
[Parallel(n_jobs=-1)]: Done 2040 tasks      | elapsed: 122.1min
[Parallel(n_jobs=-1)]: Done 2584 tasks      | elapsed: 151.0min
[Parallel(n_jobs=-1)]: Done 3000 out of 3000 | elapsed: 174.3min finished


[0]	validation_0-error:0.11352
Will train until validation_0-error hasn't improved in 10 rounds.
[1]	validation_0-error:0.09759
[2]	validation_0-error:0.09281
[3]	validation_0-error:0.07748
[4]	validation_0-error:0.07170
[5]	validation_0-error:0.07269
[6]	validation_0-error:0.07250
[7]	validation_0-error:0.06632
[8]	validation_0-error:0.06194
[9]	validation_0-error:0.06154
[10]	validation_0-error:0.06134
[11]	validation_0-error:0.06055
[12]	validation_0-error:0.06234
[13]	validation_0-error:0.06114
[14]	validation_0-error:0.05935
[15]	validation_0-error:0.05776
[16]	validation_0-error:0.05736
[17]	validation_0-error:0.05696
[18]	validation_0-error:0.05497
[19]	validation_0-error:0.05716
[20]	validation_0-error:0.05736
[21]	validation_0-error:0.05736
[22]	validation_0-error:0.05796
[23]	validation_0-error:0.05676
[24]	validation_0-error:0.05756
[25]	validation_0-error:0.05835
[26]	validation_0-error:0.05756
[27]	validation_0-error:0.05597
[28]	validation_0-error:0.05756
Stopping. Best i

In [80]:
import sklearn
sklearn.metrics.SCORERS.keys()

dict_keys(['explained_variance', 'r2', 'max_error', 'neg_median_absolute_error', 'neg_mean_absolute_error', 'neg_mean_squared_error', 'neg_mean_squared_log_error', 'neg_root_mean_squared_error', 'neg_mean_poisson_deviance', 'neg_mean_gamma_deviance', 'accuracy', 'roc_auc', 'roc_auc_ovr', 'roc_auc_ovo', 'roc_auc_ovr_weighted', 'roc_auc_ovo_weighted', 'balanced_accuracy', 'average_precision', 'neg_log_loss', 'neg_brier_score', 'adjusted_rand_score', 'homogeneity_score', 'completeness_score', 'v_measure_score', 'mutual_info_score', 'adjusted_mutual_info_score', 'normalized_mutual_info_score', 'fowlkes_mallows_score', 'precision', 'precision_macro', 'precision_micro', 'precision_samples', 'precision_weighted', 'recall', 'recall_macro', 'recall_micro', 'recall_samples', 'recall_weighted', 'f1', 'f1_macro', 'f1_micro', 'f1_samples', 'f1_weighted', 'jaccard', 'jaccard_macro', 'jaccard_micro', 'jaccard_samples', 'jaccard_weighted'])

In [18]:
if random_search_model:
    #Print out best parameters
    print(random_search.best_params_)
    #Print out scores on validation set
    print(random_search.score(xtest,test_y))

{'subsample': 1.0, 'reg_lambda': 0.8, 'min_child_weight': 1, 'max_depth': 10, 'learning_rate': 0.03, 'gamma': 0, 'colsample_bytree': 0.6}
0.9520525451559935


<h3><a id="part_2_2">II 2 - GridSearchCV</a></h3>

In [24]:
random_search.best_params_

{'subsample': 1.0,
 'reg_lambda': 0.8,
 'min_child_weight': 1,
 'max_depth': 10,
 'learning_rate': 0.03,
 'gamma': 0,
 'colsample_bytree': 0.6}

In [28]:
np.array(range(int(10*random_search.best_params_["reg_lambda"])-1, int(10*random_search.best_params_["reg_lambda"])+1, 1))/10

array([0.7, 0.8])

In [30]:
if grid_search_model:
    #Grid Search

    if num_gpu>0:
        xgb_pipeline =XGBClassifier(tree_method= 'gpu_hist')
    else:
        xgb_pipeline =XGBClassifier()

    gbm_param_grid = {
        'learning_rate': np.array(range(int(100*random_search.best_params_["learning_rate"])-1, int(100*random_search.best_params_["learning_rate"])+2, 1))/100,
        'subsample': np.array(range(int(10*random_search.best_params_["subsample"])-1, int(10*random_search.best_params_["subsample"])+2, 1))/10,
        'reg_lambda': np.array(range(int(10*random_search.best_params_["reg_lambda"])-1, int(10*random_search.best_params_["reg_lambda"])+2, 1))/10,
        'max_depth':np.array(range(random_search.best_params_["max_depth"]-1, random_search.best_params_["max_depth"]+3, 1)),
        'colsample_bytree': np.array(range(int(10*random_search.best_params_["colsample_bytree"])-1, int(10*random_search.best_params_["colsample_bytree"])+2, 1))/10,
        'min_child_weight': np.array(range(int(10*random_search.best_params_["min_child_weight"])-1, int(10*random_search.best_params_["min_child_weight"])+2, 1))/10
    }
    #'gamma': np.array(range(int(10*random_search.best_params_["gamma"])-3, int(10*random_search.best_params_["gamma"])+3, 1))/10,

    fit_params = { 
            'early_stopping_rounds':10,
            'eval_set':[(xvalid, valid_y)]}

    grid_search = GridSearchCV(estimator=xgb_pipeline, param_grid=gbm_param_grid, n_jobs=-1, cv=3,
                             scoring='precision', verbose=10 )

    grid_search.fit(xtrain,train_y,**fit_params)

Fitting 3 folds for each of 972 candidates, totalling 2916 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 tasks      | elapsed:  2.5min
[Parallel(n_jobs=-1)]: Done  10 tasks      | elapsed:  4.8min
[Parallel(n_jobs=-1)]: Done  17 tasks      | elapsed:  6.0min
[Parallel(n_jobs=-1)]: Done  24 tasks      | elapsed:  8.5min
[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed: 11.4min
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed: 14.7min
[Parallel(n_jobs=-1)]: Done  53 tasks      | elapsed: 18.5min
[Parallel(n_jobs=-1)]: Done  64 tasks      | elapsed: 23.6min
[Parallel(n_jobs=-1)]: Done  77 tasks      | elapsed: 28.4min
[Parallel(n_jobs=-1)]: Done  90 tasks      | elapsed: 33.8min
[Parallel(n_jobs=-1)]: Done 105 tasks      | elapsed: 38.3min
[Parallel(n_jobs=-1)]: Done 120 tasks      | elapsed: 44.4min
[Parallel(n_jobs=-1)]: Done 137 tasks      | elapsed: 50.5min
[Parallel(n_jobs=-1)]: Done 154 tasks      | elapsed: 56.8min
[Parallel(n_jobs=-1)]: Done 173 tasks      | elapsed: 64

[0]	validation_0-error:0.11472
Will train until validation_0-error hasn't improved in 10 rounds.
[1]	validation_0-error:0.07568
[2]	validation_0-error:0.06871
[3]	validation_0-error:0.06652
[4]	validation_0-error:0.06692
[5]	validation_0-error:0.06433
[6]	validation_0-error:0.06393
[7]	validation_0-error:0.06055
[8]	validation_0-error:0.06373
[9]	validation_0-error:0.06214
[10]	validation_0-error:0.06254
[11]	validation_0-error:0.06174
[12]	validation_0-error:0.06075
[13]	validation_0-error:0.06035
[14]	validation_0-error:0.05875
[15]	validation_0-error:0.05736
[16]	validation_0-error:0.05776
[17]	validation_0-error:0.05716
[18]	validation_0-error:0.05955
[19]	validation_0-error:0.05756
[20]	validation_0-error:0.05975
[21]	validation_0-error:0.05855
[22]	validation_0-error:0.05895
[23]	validation_0-error:0.05975
[24]	validation_0-error:0.05875
[25]	validation_0-error:0.05835
[26]	validation_0-error:0.05776
[27]	validation_0-error:0.05736
Stopping. Best iteration:
[17]	validation_0-erro

In [31]:
if grid_search_model:
    print(grid_search.best_params_)
    print(grid_search.score(xvalid,valid_y))

{'colsample_bytree': 0.5, 'learning_rate': 0.04, 'max_depth': 11, 'min_child_weight': 0.9, 'reg_lambda': 0.7, 'subsample': 0.9}
0.9508729192042225


<h2><a id="part_3">III - Best Model</a></h2>

In [33]:
def results_metrics(y_true, y_pred):
    print(f"Accuracy \t\t: {round(100*sklearn.metrics.accuracy_score(y_true, y_pred),2)}%")
    print(f"Kappa   \t\t: {round(100*sklearn.metrics.cohen_kappa_score(y_true, y_pred),2)}%")
    print(f"f1-score \t\t: {round(100*sklearn.metrics.f1_score(y_true, y_pred),2)}%")
    print(f"Precision \t\t: {round(100*sklearn.metrics.precision_score( y_true, y_pred),2)}%")
    print(f"Recall  \t\t: {round(100*sklearn.metrics.recall_score(y_true, y_pred),2)}%")
    print(f"ROC AUC \t\t: {round(100*sklearn.metrics.roc_auc_score( y_true, y_pred),2)}%")
    print(f"Matthews Corrcoef \t: {round(100*sklearn.metrics.matthews_corrcoef(  y_true, y_pred),2)}%")

<h3><a id="part_3_1">III - 1. Train best parameters</a></h3>

In [32]:
# Show best parameters 
random_search.best_params_

{'subsample': 1.0,
 'reg_lambda': 0.8,
 'min_child_weight': 1,
 'max_depth': 10,
 'learning_rate': 0.03,
 'gamma': 0,
 'colsample_bytree': 0.6}

In [34]:
# Train a new model with best parameters and early stopping
model = XGBClassifier(n_estimators=1000, **random_search.best_params_)
model.fit(xtrain,train_y, eval_metric="logloss",eval_set=[(xvalid, valid_y)],early_stopping_rounds=10, verbose=False)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=0.6, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.03, max_delta_step=0, max_depth=10,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=1000, n_jobs=0, num_parallel_tree=1, random_state=0,
              reg_alpha=0, reg_lambda=0.8, scale_pos_weight=1, subsample=1.0,
              tree_method='exact', validate_parameters=1, verbosity=None)

In [41]:
# See the metrics 
y_pred = model.predict(xtest)
results_metrics(test_y, y_pred)

Accuracy 		: 96.45%
Kappa   		: 92.89%
f1-score 		: 96.44%
Precision 		: 96.46%
Recall  		: 96.43%
ROC AUC 		: 96.45%
Matthews Corrcoef 	: 92.89%


<h3>Comparison default parameters</h3>

In [42]:
model_basic = XGBClassifier(n_estimators=1000,subsample=0.8)
model_basic.fit(xtrain,train_y, eval_metric="logloss",eval_set=[(xvalid, valid_y)],early_stopping_rounds=10, verbose=False)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.300000012, max_delta_step=0, max_depth=6,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=1000, n_jobs=0, num_parallel_tree=1, random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=0.8,
              tree_method='exact', validate_parameters=1, verbosity=None)

In [44]:
y_pred_basic = model_basic.predict(xtest)
results_metrics(test_y, y_pred_basic)

Accuracy 		: 95.91%
Kappa   		: 91.81%
f1-score 		: 95.88%
Precision 		: 96.27%
Recall  		: 95.5%
ROC AUC 		: 95.91%
Matthews Corrcoef 	: 91.81%


<h3><a id="part_3_2">III - 2. Save model</a></h3>

In [45]:
# save the model to disk
filename = NAME_XGB_CLASSIFIER_MODEL
pickle.dump(model, open(filename, 'wb'))

<h3><a id="part_3_3">III - 3. Load and test to compare metrics</a></h3>

In [46]:
%%time
filename = NAME_COUNT_VECT_MODEL
# load the model from disk
preproc_model = pickle.load(open(filename, 'rb'))
valid__x = preproc_model.transform(valid_x)

filename = NAME_XGB_CLASSIFIER_MODEL
loaded_model = pickle.load(open(filename, 'rb'))
y_pred = loaded_model.predict(valid__x)

results_metrics(valid_y, y_pred)

Accuracy 		: 96.42%
Kappa   		: 92.83%
f1-score 		: 96.42%
Precision 		: 96.12%
Recall  		: 96.73%
ROC AUC 		: 96.42%
Matthews Corrcoef 	: 92.83%
CPU times: user 4.58 s, sys: 391 ms, total: 4.97 s
Wall time: 4.3 s


<h2><a id="part_4">VI - Feature Importance</a></h2>

<h3><a id="part_4_1">VI 1 - Part 4 1</a></h3>

<h3><a id="part_4_2">VI 2 - Part 4 2</a></h3>

<h2><a id="part_conclusion">Conclusion</a></h2>