# Hybrinfox - XGBoost Tutorial on Census - 2023-07

Basic data preparation, training with XGBoost and analysis for binary classification (Census).

Each record describes an individual (age, occupation, relationship, sex...). The goal is to predict which individuals earn over $50_000 a year.

The dataset we use is the [Adult Census Income](https://www.kaggle.com/datasets/uciml/adult-census-income) dataset, made up of about 48_000 records from the 1994 US census (48_842 rows in the OpenML version loaded in this notebook).

In [None]:
%load_ext autoreload
%autoreload 2

import warnings
warnings.filterwarnings('ignore')

# Pre-processing of data to feed the XGBoost model

We begin with the required imports for data preparation. 

In [None]:
import sys
sys.path.append("../")

import time

from sklearn.datasets import fetch_openml

In [None]:
# Load the Adult census dataset (OpenML id 1590) for binary classification
data = fetch_openml(data_id=1590, as_frame=True)

X = data.data
# Binary target: 1 if the individual earns over $50K a year, else 0
Y = (data.target == '>50K') * 1

In [None]:
X

Once the data are loaded, split them into train / valid / test samples.
- During training, the XGBoost model fits its trees on the train set, while the valid set is held out to monitor how well the learned patterns generalize (hold-out validation, used here for early stopping)
- The test sample is only used at the end, to assess whether the final model generalizes to unseen individuals
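
As a rough illustration of such a split, here is a minimal standard-library sketch. It is not the implementation of `train_valid_test_split` from `classif_basic`; the function name `three_way_split`, the 60/20/20 proportions and the seed are assumptions made for the example.

```python
import random


def three_way_split(n_rows, valid_frac=0.2, test_frac=0.2, seed=7):
    """Return disjoint (train, valid, test) lists of row indices.

    Hypothetical helper: the fractions and the seed are illustrative defaults.
    """
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # shuffle once, reproducibly

    n_test = int(n_rows * test_frac)
    n_valid = int(n_rows * valid_frac)

    test_idx = indices[:n_test]
    valid_idx = indices[n_test:n_test + n_valid]
    train_idx = indices[n_test + n_valid:]
    return train_idx, valid_idx, test_idx


train_idx, valid_idx, test_idx = three_way_split(n_rows=1000)
print(len(train_idx), len(valid_idx), len(test_idx))
```

With a pandas DataFrame, the index lists could then be used as `X.iloc[train_idx]`. A stratified split, which keeps the share of high earners equal across the three samples, would need extra care.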

In [None]:
from classif_basic.data_preparation import train_valid_test_split

preprocessing_cat_features = "label_encoding"

model_task = "classification"
xgb_eval_metric="auc"

X_train, X_valid, X_train_valid, X_test, Y_train, Y_valid, Y_train_valid, Y_test = train_valid_test_split(
    X=X,
    Y=Y, 
    model_task=model_task,
    preprocessing_cat_features=preprocessing_cat_features)

In [None]:
X_train

# Training

We begin by setting the hyperparameters (which the data scientist needs to fine-tune).

These are the [most frequently tuned hyperparameters](https://www.kaggle.com/code/soheiltehranipour/xgboost-tutorial-classification):

1. **learning_rate**

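Also called eta, it is the shrinkage factor applied to the contribution of each new base learner. A lower value makes the model fit the residual errors more slowly, so more base learners are needed, but it usually generalizes better.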

Typical values: 0.01–0.2
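
To give an intuition of the shrinkage effect, the toy loop below boosts a single prediction under squared error: each round adds `learning_rate` times the current residual. This is a deliberately simplified sketch (one data point, no trees), not XGBoost's algorithm; the helper name `rounds_to_fit` and the tolerance are assumptions.

```python
def rounds_to_fit(target, learning_rate, tolerance=0.01, max_rounds=10_000):
    """Count boosting rounds until the residual drops below `tolerance`.

    Toy setting: one target value, each "learner" predicts the current residual.
    """
    prediction = 0.0
    for n_rounds in range(1, max_rounds + 1):
        residual = target - prediction
        prediction += learning_rate * residual  # shrunken update
        if abs(target - prediction) < tolerance:
            return n_rounds
    return max_rounds


for eta in (0.01, 0.1, 0.2):
    print(f"learning_rate={eta}: {rounds_to_fit(1.0, eta)} rounds")
```

The lower the learning rate, the more rounds are needed, which is why a small `learning_rate` is usually paired with a large `n_estimators`.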

2. **gamma, reg_alpha, reg_lambda**

These 3 parameters control 3 types of regularization (which penalize model complexity) applied by XGBoost: the minimum loss reduction required to create a new split, the L1 penalty on leaf weights, and the L2 penalty on leaf weights, respectively.

Typical values for gamma: 0 - 0.5 but highly dependent on the data 

Typical values for reg_alpha and reg_lambda: 0 - 1 is a good starting point but again, depends on the data
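
The role of these three parameters can be sketched with the closed-form expressions from the XGBoost paper, where `g` and `h` are the sums of the first- and second-order gradients of the loss over the rows in a node. The code below is an illustrative pure-Python sketch: the function names are hypothetical, and handling `reg_alpha` by soft-thresholding is an assumption about the usual formulation rather than a copy of the library's code.

```python
def soft_threshold(g, reg_alpha):
    """Shrink g toward zero by reg_alpha (effect of the L1 penalty)."""
    if g > reg_alpha:
        return g - reg_alpha
    if g < -reg_alpha:
        return g + reg_alpha
    return 0.0


def leaf_weight(g, h, reg_lambda=1.0, reg_alpha=0.0):
    """Optimal leaf weight: larger reg_lambda / reg_alpha shrink it toward 0."""
    return -soft_threshold(g, reg_alpha) / (h + reg_lambda)


def node_score(g, h, reg_lambda, reg_alpha):
    """Quality score of a node before or after a split."""
    return soft_threshold(g, reg_alpha) ** 2 / (h + reg_lambda)


def split_gain(g_left, h_left, g_right, h_right,
               gamma=0.0, reg_lambda=1.0, reg_alpha=0.0):
    """Loss reduction of a split minus gamma; worthwhile only if positive."""
    children = (node_score(g_left, h_left, reg_lambda, reg_alpha)
                + node_score(g_right, h_right, reg_lambda, reg_alpha))
    parent = node_score(g_left + g_right, h_left + h_right, reg_lambda, reg_alpha)
    return 0.5 * (children - parent) - gamma


print(leaf_weight(g=-4.0, h=3.0, reg_lambda=1.0))
print(leaf_weight(g=-4.0, h=3.0, reg_lambda=1.0, reg_alpha=1.0))
print(split_gain(-4.0, 2.0, 4.0, 2.0, gamma=0.2, reg_lambda=0.0))
```

Raising `gamma` rejects more candidate splits, while raising `reg_alpha` or `reg_lambda` pulls the leaf weights toward zero.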

3. **max_depth**

Maximum depth of each tree, i.e. the maximal number of successive splits between the root and a leaf (a tree of depth d has at most 2^d leaves). A high value improves the results on the training set, but can also lead to overfitting and slower computation. Must be a positive integer.

Typical values: 1–10

4. **subsample** 

Fraction of the training rows randomly sampled to grow each tree. If this value is too low, it may lead to underfitting; if it is too high, it may lead to overfitting.

Typical values: 0.5–0.9

5. **colsample_bytree** 

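Fraction of the features randomly sampled to grow each tree. A value close to 1 means almost all features can be used to build each decision tree. It is not set in the configuration used in this notebook, so XGBoost's default (1, i.e. all features) applies.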

Typical values: 0.5–0.9
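
A minimal sketch of what `subsample` and `colsample_bytree` mean, using only the standard library: before growing each tree, a random subset of rows and of feature columns is drawn. The helper `sample_rows_and_columns` and the short feature list are hypothetical, and XGBoost's internal sampling differs in its details.

```python
import random


def sample_rows_and_columns(n_rows, feature_names, subsample, colsample_bytree, rng):
    """Draw the rows and features available to one tree (without replacement)."""
    n_sampled_rows = max(1, int(n_rows * subsample))
    n_sampled_cols = max(1, int(len(feature_names) * colsample_bytree))
    rows = sorted(rng.sample(range(n_rows), n_sampled_rows))
    cols = rng.sample(feature_names, n_sampled_cols)
    return rows, cols


rng = random.Random(7)
features = ["age", "occupation", "relationship", "sex"]
for tree_id in range(3):
    rows, cols = sample_rows_and_columns(10, features, 0.9, 0.5, rng)
    print(f"tree {tree_id}: {len(rows)} rows, features {cols}")
```

Because each tree sees a slightly different view of the data, the trees are less correlated with each other, which is the intended regularizing effect.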

6. **seed**

Random seed. Fixing it ensures that training with the same hyperparameters gives reproducible results: the random sampling of rows and features is the same from one run to the next.

Typical values: any integer, as long as it is fixed

7. **n_estimators**

Maximal number of decision trees (weak learners) that can be aggregated in the XGBoost model. With early stopping, the generation of new trees is interrupted when the metric on the validation set stops improving.
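
The stopping rule can be sketched as follows: keep track of the best validation score seen so far, and stop once it has not improved for a given number of rounds (the role of `early_stopping_rounds`). This is a simplified stand-alone illustration on a made-up list of AUC values, not XGBoost's implementation; `find_early_stop` is a hypothetical helper.

```python
def find_early_stop(valid_scores, patience):
    """Return (best_round, stop_round) for a metric to maximise, e.g. AUC.

    Training stops at the first round where the best score is `patience`
    rounds old; if that never happens, it runs through all rounds.
    """
    best_round, best_score = 0, float("-inf")
    for current_round, score in enumerate(valid_scores):
        if score > best_score:
            best_round, best_score = current_round, score
        elif current_round - best_round >= patience:
            return best_round, current_round
    return best_round, len(valid_scores) - 1


# Made-up validation AUC per boosting round
auc_per_round = [0.80, 0.85, 0.88, 0.90, 0.89, 0.90, 0.895, 0.89, 0.88]
best_round, stop_round = find_early_stop(auc_per_round, patience=3)
print(f"best round: {best_round}, stopped at round: {stop_round}")
```

In this setting `n_estimators` acts as an upper bound: the number of trees actually kept depends on where the validation metric peaks.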

... Let's try it!

In [None]:
learning_rate=0.01

gamma=0.2 
reg_alpha=0.7 
reg_lambda=0.8

max_depth=6

subsample=0.9

seed=7

n_estimators=1000

xgb_classif_params = {
    "learning_rate":learning_rate,
    "gamma":gamma,
    "reg_alpha":reg_alpha,
    "reg_lambda":reg_lambda,
    "max_depth":max_depth,
    "subsample":subsample,
    "seed":seed,
    "n_estimators":n_estimators,
    "objective": "binary:logistic",
    "importance_type": "gain",
    "use_label_encoder": False,
}

In [None]:
import xgboost

t0 = time.time()

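# With xgboost >= 1.6, eval_metric and early_stopping_rounds are set on the estimator
# (passing them to fit() is deprecated and was removed in xgboost 2.0)
model = xgboost.XGBClassifier(
    **xgb_classif_params,
    eval_metric=xgb_eval_metric,
    early_stopping_rounds=100,
)

model.fit(
    X_train,
    Y_train,
    # the last eval_set entry (the valid set) is the one monitored for early stopping
    eval_set=[(X_train, Y_train), (X_valid, Y_valid)],
    verbose=100,  # print the metrics every 100 boosting rounds
)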

t1 = time.time()

print(f"\n Training of the XGBoost model took {round(t1-t0)} seconds")

# Evaluating the statistical performance of the new classifier

Note that this evaluation depends heavily on the performance indicator chosen by the user (e.g. ROC-AUC).

First, we must switch from probabilities to binary classification (here, does the individual earn over $50_000?). We choose a threshold maximising the F1 score (measure of the balance between precision and recall) on the **validation set** to avoid overfitting:

In [None]:
from classif_basic.model import compute_best_fscore

proba_valid = model.predict_proba(X_valid)[:, 1]
proba_train_valid = model.predict_proba(X_train_valid)[:, 1]

# pick the threshold that maximises the F1 score on the validation set
best_threshold, best_fscore = compute_best_fscore(Y_valid, proba_valid)

Y_pred_train_valid = (proba_train_valid >= best_threshold).astype(int)
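
For readers without access to `classif_basic`, the threshold search can be sketched as follows. This is a guess at the general technique (scan candidate thresholds, keep the one with the highest F1), not the actual code of `compute_best_fscore`; the functions `f1_at_threshold` and `best_f1_threshold` and the toy data are hypothetical.

```python
def f1_at_threshold(y_true, proba, threshold):
    """F1 score of the positive class when predicting 1 if proba >= threshold."""
    tp = sum(1 for y, p in zip(y_true, proba) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, proba) if p >= threshold and y == 0)
    fn = sum(1 for y, p in zip(y_true, proba) if p < threshold and y == 1)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_f1_threshold(y_true, proba):
    """Try each predicted probability as a threshold and keep the best F1."""
    candidates = sorted(set(proba))
    scores = [(f1_at_threshold(y_true, proba, t), t) for t in candidates]
    best_fscore, best_threshold = max(scores)
    return best_threshold, best_fscore


# Toy validation labels and predicted probabilities
y_valid_toy = [0, 0, 1, 0, 1, 1]
proba_valid_toy = [0.10, 0.30, 0.35, 0.40, 0.70, 0.90]
threshold, fscore = best_f1_threshold(y_valid_toy, proba_valid_toy)
print(f"best threshold: {threshold}, F1: {fscore:.3f}")
```

The return order mirrors the `best_threshold, best_fscore` unpacking used with `compute_best_fscore`. Choosing the threshold on the validation set, then applying it unchanged to the test set, keeps the test evaluation free of any tuning.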

We first evaluate the XGBoost model on the combined train and valid samples:

In [None]:
from sklearn.metrics import classification_report

report_train_valid = classification_report(y_true=Y_train_valid, y_pred=Y_pred_train_valid)

print(f"Statistical Performance on Train & Valid Samples \n \n {report_train_valid}")

Do the same on the test sample, to check whether the XGBoost model generalizes well to unseen individuals:

In [None]:
proba_test = model.predict_proba(X_test)[:, 1]
Y_pred_test = (proba_test >= best_threshold).astype(int)

report_test = classification_report(y_true=Y_test, y_pred=Y_pred_test)

print(f"Statistical Performance on Test Samples \n \n {report_test}")