## XGBoost Test

**Author:** Shaun Khoo  
**Date:** 14 Oct 2021  
**Context:** We need a suitable benchmark to compare our hierarchical classifier against, and not just another neural network.  
**Objective:** Measure how much a hierarchical multi-class classification model improves on a flat multi-class classification model. We use XGBoost as the primary algorithm because it is sufficiently performant and fast.

## Setting up

We import the required libraries and data.

In [1]:
import os
os.chdir('..')

In [2]:
import pandas as pd
import xgboost as xgb

In [16]:
data = pd.read_csv('Data/Processed/Training/train_full.csv')
SSOC_2020 = pd.read_csv('Data/Processed/Training/train.csv')

## Generating embeddings for the train/test sets

In [30]:
from transformers import AutoTokenizer, AutoModel
import torch

#Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0] #First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')
model = AutoModel.from_pretrained('distilbert-base-uncased')

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_projector.bias', 'vocab_layer_norm.bias', 'vocab_projector.weight']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
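
To see what the attention mask contributes to the pooling step, here is a tiny worked example with made-up numbers. For brevity, this variant of `mean_pooling` takes the token embeddings directly rather than the full model output, and the tensors are illustrative, not real DistilBERT outputs.

```python
import torch

def mean_pooling(token_embeddings, attention_mask):
    # Expand the mask to the embedding shape so padded positions contribute zero
    mask = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    # Sum over the token axis, then divide by the number of real tokens
    return torch.sum(token_embeddings * mask, 1) / torch.clamp(mask.sum(1), min=1e-9)

# One sentence, three token positions, 2-dim embeddings; the last position is padding
tokens = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = torch.tensor([[1, 1, 0]])

pooled = mean_pooling(tokens, mask)
print(pooled)  # tensor([[2., 3.]])
```

The padded position (the row of 100s) is ignored, so the result is the average of the first two token vectors only.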


In [42]:
len(data)

42842

In [56]:
model.eval()
import time

all_embeddings = []

# Process the descriptions in batches of 100
for i in range(0, len(data), 100):
    start = time.time()

    sentences = data['Cleaned_Description'][i:min(i+100, len(data))].tolist()

    encoded_input = tokenizer(text = sentences,
                              max_length = 512,
                              add_special_tokens = True,
                              padding = 'max_length',
                              truncation = True,
                              return_tensors = 'pt')

    # Compute token embeddings
    with torch.no_grad():
        model_output = model(**encoded_input)

    # Perform pooling. In this case, mean pooling.
    sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

    all_embeddings.append(sentence_embeddings)

    print(f'Processed {i+100}/{len(data)}... took {(time.time()-start)/60:.2f} mins\r', end = "")
       

Processed 42900/42842... took 0.20 mins

In [59]:
all_embeddings[428].shape

torch.Size([42, 768])

Adding in extra data

In [134]:
extra_data = pd.read_csv('Data/Processed/Training/train.csv')

In [136]:
model.eval()
import time

extra_embeddings = []

# Process the descriptions in batches of 100
for i in range(0, len(extra_data), 100):
    start = time.time()

    sentences = extra_data['Description'][i:min(i+100, len(extra_data))].tolist()

    encoded_input = tokenizer(text = sentences,
                              max_length = 512,
                              add_special_tokens = True,
                              padding = 'max_length',
                              truncation = True,
                              return_tensors = 'pt')

    # Compute token embeddings
    with torch.no_grad():
        model_output = model(**encoded_input)

    # Perform pooling. In this case, mean pooling.
    sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

    extra_embeddings.append(sentence_embeddings)

    print(f'Processed {min(i+100, len(extra_data))}/{len(extra_data)}... took {(time.time()-start)/60:.2f} mins\r', end = "")
       

Processed 1000/997... took 0.47 mins

In [139]:
all_embeddings.extend(extra_embeddings)

In [140]:
import numpy as np
output = np.concatenate(all_embeddings)

In [142]:
output.shape

(43839, 768)

In [143]:
sentence_embeddings = pd.DataFrame(output)

In [144]:
sentence_embeddings.to_csv('Data/Processed/Training/Sentence_Embeddings.csv')

## Splitting into train/test sets

We need to be careful with the train/test split to ensure that the test set only includes SSOCs that were already in the training set.
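
As a toy illustration of how sampling only from duplicated rows satisfies this constraint: `DataFrame.duplicated` flags every repeat of a value but never its first occurrence, so at least one row of each sampled SSOC always stays in the training set. The frame and codes below are made up; only the `SSOC 2020` column name comes from this notebook.

```python
import pandas as pd

# Toy data with hypothetical codes; only the column name mirrors the notebook
toy = pd.DataFrame({'SSOC 2020': [11110, 11110, 21111, 21111, 21111, 31111]})

# duplicated() marks every repeat of a code, never its first occurrence
candidates = toy[toy.duplicated('SSOC 2020')]

# Sample the test rows from the repeats and keep everything else for training
toy_test = candidates.sample(2, random_state=0)
toy_train = toy.drop(index=toy_test.index)

# Every code in the test set must also appear in the training set
assert set(toy_test['SSOC 2020']) <= set(toy_train['SSOC 2020'])
```

The code `31111` appears only once, so it is never a candidate for the test set.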

In [147]:
extra_data['Cleaned_Description'] = extra_data['Description']

In [149]:
import copy
old_data = copy.deepcopy(data)

In [151]:
data = pd.concat([data, extra_data], axis = 0, ignore_index = True)

In [18]:
len(data)*0.2

8568.4

In [22]:
test_set = data[data.duplicated('SSOC 2020')].sample(8568)

In [152]:
train_set = data.drop(index = test_set.index)

In [154]:
train_set.to_csv('Data/Processed/Training/train-7oct.csv')
test_set.to_csv('Data/Processed/Training/test-7oct.csv')

In [153]:
X_train = sentence_embeddings.loc[train_set.index].reset_index(drop = True)
y_train = train_set['SSOC 2020']
X_test = sentence_embeddings.loc[test_set.index].reset_index(drop = True)
y_test = test_set['SSOC 2020']
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(35271, 768)
(35271,)
(8568, 768)
(8568,)


## Preparing data

Load the mapping from SSOC code to class index for each digit level of the SSOC.

In [84]:
import json
with open("encoding.json", 'r') as infile:
    encoding = json.load(infile)
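
The contents of `encoding.json` are not shown in this notebook. Judging from how it is indexed in the cells that follow, each digit level appears to hold an `ssoc_idx` dictionary from code prefix (as a string) to class index. A minimal sketch of that lookup, using a hypothetical excerpt with invented codes and indices:

```python
import pandas as pd

# Hypothetical excerpt; the real encoding.json may hold other keys and indices
encoding_example = {'SSOC_1D': {'ssoc_idx': {'1': 0, '2': 1, '3': 2}}}

# Made-up 5-digit codes standing in for the 'SSOC 2020' column
codes = pd.Series([11110, 21111, 31111])

# Take the first digit of each code and map it to its class index
labels = codes.astype('str').str.slice(0, 1).map(encoding_example['SSOC_1D']['ssoc_idx'])
print(labels.tolist())  # [0, 1, 2]
```

Any prefix missing from the dictionary would map to `NaN`, so it may be worth checking the mapped labels for missing values before training.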

## Train the model

In [77]:
from xgboost.sklearn import XGBClassifier

In [172]:
xgb_params = {
    'n_estimators': 300,
    'objective': 'multi:softprob',
    'use_label_encoder': False,
    'eval_metric': 'mlogloss',
    'tree_method': 'gpu_hist',
    'max_depth': 10,
    'min_child_weight': 10,
    'gamma': 5
}

**Layer 1**: 1-Digit SSOC

In [156]:
y_train_1D = y_train.astype('str').str.slice(0, 1).map(encoding['SSOC_1D']['ssoc_idx']).tolist()
y_test_1D = y_test.astype('str').str.slice(0, 1).map(encoding['SSOC_1D']['ssoc_idx']).tolist()

In [157]:
%%time
xgb_1D = XGBClassifier(**xgb_params)
xgb_1D.fit(X_train, y_train_1D)

Wall time: 1min 24s


XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, eval_metric='mlogloss',
              gamma=5, gpu_id=0, importance_type='gain',
              interaction_constraints='', learning_rate=0.300000012,
              max_delta_step=0, max_depth=10, min_child_weight=10, missing=nan,
              monotone_constraints='()', n_estimators=100, n_jobs=12,
              num_parallel_tree=1, objective='multi:softprob', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=None, subsample=1,
              tree_method='gpu_hist', use_label_encoder=False,
              validate_parameters=1, verbosity=None)

In [158]:
from sklearn.metrics import classification_report

In [159]:
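# Note: classification_report expects (y_true, y_pred); predictions are passed first here, so the precision and recall columns are swapped.
print(classification_report(xgb_1D.predict(X_train), y_train_1D))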

              precision    recall  f1-score   support

           0       0.86      0.86      0.86      6327
           1       0.95      0.90      0.92     13754
           2       0.80      0.83      0.82      7387
           3       0.83      0.87      0.85      3117
           4       0.86      0.92      0.89      1624
           5       0.68      1.00      0.81        13
           6       0.74      0.96      0.84       698
           7       0.91      0.92      0.92      1402
           8       0.88      0.89      0.88       949

    accuracy                           0.88     35271
   macro avg       0.83      0.90      0.86     35271
weighted avg       0.88      0.88      0.88     35271



In [160]:
print(classification_report(xgb_1D.predict(X_test), y_test_1D))

              precision    recall  f1-score   support

           0       0.59      0.61      0.60      1525
           1       0.81      0.74      0.77      3403
           2       0.51      0.50      0.51      1945
           3       0.55      0.61      0.58       740
           4       0.58      0.63      0.61       372
           6       0.33      0.65      0.44       100
           7       0.73      0.76      0.74       311
           8       0.50      0.61      0.55       172

    accuracy                           0.65      8568
   macro avg       0.58      0.64      0.60      8568
weighted avg       0.66      0.65      0.65      8568
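
When reading these reports, note that the cells in this notebook pass the predictions as the first argument, whereas scikit-learn's metric functions take `(y_true, y_pred)`. With the order reversed, precision and recall trade places and the support column counts predicted rather than actual labels; accuracy and F1 are unaffected. A small check on made-up labels illustrates the swap:

```python
import math
from sklearn.metrics import precision_score, recall_score

# Made-up binary labels for illustration
y_true = [0, 0, 0, 1]
y_pred = [0, 1, 1, 1]

# Documented order: (y_true, y_pred)
p = precision_score(y_true, y_pred)  # 1 true positive out of 3 predicted positives
r = recall_score(y_true, y_pred)     # 1 true positive out of 1 actual positive

# Reversing the arguments exchanges the two metrics
assert math.isclose(precision_score(y_pred, y_true), r)
assert math.isclose(recall_score(y_pred, y_true), p)
```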



**Layer 2**: 2-Digit SSOC

In [168]:
pred_proba_1D_train = pd.DataFrame(xgb_1D.predict_proba(X_train))
pred_proba_1D_train.columns = [f'pred_proba_1D_{col}' for col in pred_proba_1D_train.columns]
pred_proba_1D_test = pd.DataFrame(xgb_1D.predict_proba(X_test))
pred_proba_1D_test.columns = [f'pred_proba_1D_{col}' for col in pred_proba_1D_test.columns]

In [169]:
print(pred_proba_1D_train.shape)
print(pred_proba_1D_test.shape)

(35271, 9)
(8568, 9)


In [163]:
y_train_2D = y_train.astype('str').str.slice(0, 2).map(encoding['SSOC_2D']['ssoc_idx']).tolist()
y_test_2D = y_test.astype('str').str.slice(0, 2).map(encoding['SSOC_2D']['ssoc_idx']).tolist()

In [170]:
X_train_2D = pd.concat([X_train, pred_proba_1D_train], axis = 1)
X_test_2D = pd.concat([X_test, pred_proba_1D_test], axis = 1)
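
This concatenation is what makes the model hierarchical: the layer-1 class probabilities are appended to the embeddings as extra features for the layer-2 classifier. A minimal sketch of that pattern on synthetic data follows. `LogisticRegression` stands in for `XGBClassifier` so that it runs quickly without a GPU, and every array and label here is made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-ins: 200 rows of 8-dim "embeddings", 6 fine classes, 3 coarse classes
X = rng.normal(size=(200, 8))
y_fine = np.arange(200) % 6
y_coarse = y_fine // 2  # each coarse class contains two fine classes

# Layer 1: predict the coarse label from the embeddings alone
layer1 = LogisticRegression(max_iter=1000).fit(X, y_coarse)
proba1 = layer1.predict_proba(X)  # one column per coarse class

# Layer 2: embeddings plus layer-1 probabilities as extra features
X_layer2 = np.hstack([X, proba1])
layer2 = LogisticRegression(max_iter=1000).fit(X_layer2, y_fine)

print(X_layer2.shape)  # (200, 11)
```

One caveat with this setup: the layer-1 probabilities for training rows come from a model that has already seen those rows, so they will tend to be more confident than the probabilities computed for test rows.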

In [173]:
%%time
xgb_2D = XGBClassifier(**xgb_params)
xgb_2D.fit(X_train_2D, y_train_2D)

Wall time: 8min 14s


XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, eval_metric='mlogloss',
              gamma=5, gpu_id=0, importance_type='gain',
              interaction_constraints='', learning_rate=0.300000012,
              max_delta_step=0, max_depth=10, min_child_weight=10, missing=nan,
              monotone_constraints='()', n_estimators=300, n_jobs=12,
              num_parallel_tree=1, objective='multi:softprob', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=None, subsample=1,
              tree_method='gpu_hist', use_label_encoder=False,
              validate_parameters=1, verbosity=None)

In [174]:
print(classification_report(xgb_2D.predict(X_train_2D), y_train_2D))

              precision    recall  f1-score   support

           0       0.63      0.86      0.73       482
           1       0.90      0.84      0.87      2585
           2       0.85      0.85      0.85      2218
           3       0.84      0.88      0.86       990
           4       0.92      0.89      0.91      6070
           5       0.93      0.96      0.95       449
           6       0.91      0.93      0.92       251
           7       0.82      0.86      0.84      2396
           8       0.90      0.87      0.89      3756
           9       0.85      0.94      0.89       296
          10       0.89      0.89      0.89      1673
          11       0.83      0.94      0.88       297
          12       0.92      0.88      0.90      4832
          13       0.76      0.96      0.85       274
          14       0.60      1.00      0.75        55
          15       0.94      0.95      0.94       526
          16       0.00      0.00      0.00         0
          17       0.00    



In [175]:
print(classification_report(xgb_2D.predict(X_test_2D), y_test_2D))


              precision    recall  f1-score   support

           0       0.21      0.29      0.24       119
           1       0.50      0.47      0.49       662
           2       0.32      0.33      0.32       507
           3       0.45      0.44      0.44       254
           4       0.63      0.58      0.60      1531
           5       0.68      0.81      0.74        95
           6       0.43      0.55      0.48        42
           7       0.37      0.34      0.35       621
           8       0.72      0.70      0.71       943
           9       0.31      0.50      0.38        40
          10       0.39      0.43      0.41       359
          11       0.32      0.44      0.37        57
          12       0.46      0.41      0.43      1400
          13       0.19      0.42      0.26        31
          14       0.14      0.67      0.24         3
          15       0.77      0.74      0.75       126
          18       0.44      0.48      0.46       327
          19       0.29    



**Layer 3**: 3-Digit SSOC

In [None]:
pred_proba_2D = pd.DataFrame(xgb_2D.predict_proba(X_train_2D))
pred_proba_2D.columns = [f'pred_proba_2D_{col}' for col in pred_proba_2D.columns]

In [None]:
pred_proba_2D.shape

In [None]:
y_train_3D = y_train.astype('str').str.slice(0, 3).map(encoding['SSOC_3D']['ssoc_idx']).tolist()
y_test_3D = y_test.astype('str').str.slice(0, 3).map(encoding['SSOC_3D']['ssoc_idx']).tolist()

In [None]:
X_train_3D = pd.concat([X_train, pred_proba_2D], axis = 1)

In [None]:
%%time
xgb_3D = XGBClassifier(**xgb_params)
xgb_3D.fit(X_train_3D, y_train_3D)