In [9]:
import numpy as np
import pandas as pd

import pickle
import json

from datetime import date

import catboost
from catboost import CatBoostClassifier

from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

Here we would like to begin training a few different models. First lay out our choice of models and why they were selected and then begin training for later comparison.

First let's read in our pre-processed training data:

In [43]:
X_train_minmax = np.load('data/data_train.npy')
y_train = np.load('data/labels_train.npy',)

#### Model selection

I am going to choose 3 models to train, test and compare on:

1. `Random Forest classifier`
2. `XGBoost`
3. `CatBoost` - I wanted to try this out as I hadn't used it before and saw this problem with lots of categorical variables as a good chance

#### RandomizedSearch

Before doing a proper training run, we will want to optimize the hyperparameters of our different models. So we will do a "randomized" parameter search, which will train a number of models with random different combinations over a range of hyperparameters and will return the hyperparameters which optmize a chosen `score` method best. The search will also apply `cross validation` to help determine the model with best parameters to use.

I have chosen to use RandomizedSearch over GridSearch for time reasons as I didn't have a great understanding of the ball park range of parameters that would lead to the models performing well. So using random search let me explore a larger range of parameters more quickly

## `Random Forest classifier`

In [23]:
RFclf = RandomForestClassifier()

In [91]:
parameters = {'n_estimators': [int(x) for x in np.linspace(start = 100, stop = 1800, num = 100)],
              'max_features': ['auto', 'sqrt'],
              'max_depth': [int(x) for x in np.linspace(30, 90, num = 11)] + [None],
              'min_samples_split': [2, 5],
              'min_samples_leaf': [1, 2]}

In [100]:
random_search_RF = RandomizedSearchCV(estimator = RFclf,
                                      param_distributions = parameters,
                                      cv = 5,
                                      n_iter=100,
                                      scoring = 'roc_auc',
                                      verbose=True, 
                                      n_jobs = -1)

I have chosen the `roc_auc` score as the `score` to optmize as I want to have the best performance possible on a combination of both the positive and negative class.

In [101]:
random_search_RF.fit(X_train_minmax, y_test)

Fitting 5 folds for each of 100 candidates, totalling 500 fits


RandomizedSearchCV(cv=5, estimator=RandomForestClassifier(), n_iter=100,
                   n_jobs=-1,
                   param_distributions={'max_depth': [30, 36, 42, 48, 54, 60,
                                                      66, 72, 78, 84, 90,
                                                      None],
                                        'max_features': ['auto', 'sqrt'],
                                        'min_samples_leaf': [1, 2],
                                        'min_samples_split': [2, 5],
                                        'n_estimators': [100, 117, 134, 151,
                                                         168, 185, 203, 220,
                                                         237, 254, 271, 288,
                                                         306, 323, 340, 357,
                                                         374, 391, 409, 426,
                                                         443, 460, 477, 494,
           

In [27]:
rf_params = rf_random.best_params_
print(rf_params)

{'n_estimators': 1800, 'min_samples_split': 2, 'min_samples_leaf': 2, 'max_features': 'sqrt', 'max_depth': 90, 'bootstrap': True}


Save params:

In [104]:
with open(f'params/RF-{date.today()}.json', 'w') as fp:
    json.dump(rf_params, fp)

Train:

In [44]:
RFclf = RandomForestClassifier(n_estimators=rf_params['n_estimators'],
                               min_samples_split=rf_params['min_samples_split'],
                               min_samples_leaf=rf_params['min_samples_leaf'],
                               max_features=rf_params['max_features'],
                               max_depth=rf_params['max_depth'],
                               bootstrap=rf_params['bootstrap'])

In [45]:
RFclf.fit(X_train_minmax, y_train)

RandomForestClassifier(max_depth=90, max_features='sqrt', min_samples_leaf=2,
                       n_estimators=1800)

Save model:

In [107]:
pickle.dump(RFclf, open(f'models/RF-{date.today()}.pkl', 'wb'))

In [46]:
pickle.dump(RFclf, open(f'models/RF-2023-01-18.pkl', 'wb'))

## `XGBoost`

In [59]:
XGBoost = GradientBoostingClassifier()

There are more parameters that could have been chosen, but only selected a few to reduce complexity and save time. These parameters were chosen to try and reduce overfitting

In [60]:
parameters = {
    'max_depth': range (2, 5, 1),
    'n_estimators': range(60, 220, 40),
    'learning_rate': [0.2, 0.1, 0.01, 0.05],
    'subsample': [0.1, 0.2, 0.5, 0.7, 1.]
}

In [61]:
grid_search = RandomizedSearchCV(
    estimator=XGBoost,
    param_distributions=parameters,
    scoring = 'roc_auc',
    n_iter=100,
    n_jobs = -1,
    cv = 5,
    verbose=True
)

In [63]:
grid_search.fit(X_train_minmax, y_train)

Fitting 5 folds for each of 100 candidates, totalling 500 fits


RandomizedSearchCV(cv=5, estimator=GradientBoostingClassifier(), n_iter=100,
                   n_jobs=-1,
                   param_distributions={'learning_rate': [0.2, 0.1, 0.01, 0.05],
                                        'max_depth': range(2, 5),
                                        'n_estimators': range(60, 220, 40),
                                        'subsample': [0.1, 0.2, 0.5, 0.7, 1.0]},
                   scoring='roc_auc', verbose=True)

In [64]:
XGparams = grid_search.best_params_
print(XGparams)

{'subsample': 1.0, 'n_estimators': 100, 'max_depth': 4, 'learning_rate': 0.2}


Save params:

In [65]:
with open(f'params/XGBoost-{date.today()}.json', 'w') as fp:
    json.dump(XGparams, fp)

In [None]:
with open(f'params/XGBoost-2023-01-18.json', 'w') as fp:
    json.dump(XGparams, fp)

Train:

In [74]:
XGBoost = GradientBoostingClassifier(learning_rate=XGparams['learning_rate'],
                                     max_depth=XGparams['max_depth'], 
                                     n_estimators=XGparams['n_estimators'], 
                                     subsample=XGparams['subsample'])
XGBoost.fit(X_train_minmax, y_train)

GradientBoostingClassifier(learning_rate=0.2, max_depth=4)

Save model:

In [89]:
pickle.dump(XGBoost, open(f'models/XGBoost-{date.today()}.pkl', 'wb'))

## `CatBoost`

Read in our data in categorical form to try train on CatBoost 

In [127]:
X_train_CB = pd.read_csv('data/train_data_cat.csv')
y_train_CB = pd.read_csv('data/train_labels_cat.csv')

In [130]:
cat_features = list(range(0, X_train_CB.shape[1]))

In [116]:
params = {
    'learning_rate': [0.05, 0.07, 0.09],
    'depth': [5, 6, 7],
    'l2_leaf_reg': [1, 3, 5, 7],
    'grow_policy': ['SymmetricTree', 'Depthwise', 'Lossguide']
}

In [118]:
train_pool = catboost.Pool(X_train_CB, label=y_train_CB['wage'], cat_features=cat_features)

In [None]:
CBoost.randomized_search(params, n_iter=5, X=train_pool)

Save params:

In [87]:
CBoost_params = {'depth': 6,
  'l2_leaf_reg': 1,
  'learning_rate': 0.05,
  'grow_policy': 'SymmetricTree'}

with open(f'params/CBoost-2023-01-20.json', 'w') as fp:
    json.dump(CBoost_params, fp)

Train:

In [131]:
CBoost = CatBoostClassifier(depth=CBoost_params['depth'],
                            learning_rate=CBoost_params['learning_rate'],
                            l2_leaf_reg=CBoost_params['l2_leaf_reg'],
                            grow_policy=CBoost_params['grow_policy'])
# Fit 
CBoost.fit(X_train_CB, y_train_CB, cat_features)

0:	learn: 0.6504355	total: 445ms	remaining: 7m 24s
1:	learn: 0.6107876	total: 859ms	remaining: 7m 8s
2:	learn: 0.5739641	total: 1.28s	remaining: 7m 7s
3:	learn: 0.5429564	total: 1.54s	remaining: 6m 23s
4:	learn: 0.5167021	total: 1.83s	remaining: 6m 4s
5:	learn: 0.4934616	total: 2.1s	remaining: 5m 47s
6:	learn: 0.4733071	total: 2.35s	remaining: 5m 32s
7:	learn: 0.4561010	total: 2.56s	remaining: 5m 17s
8:	learn: 0.4419967	total: 2.79s	remaining: 5m 7s
9:	learn: 0.4266074	total: 3.05s	remaining: 5m 2s
10:	learn: 0.4157652	total: 3.34s	remaining: 5m
11:	learn: 0.4059716	total: 3.69s	remaining: 5m 3s
12:	learn: 0.3973174	total: 4s	remaining: 5m 3s
13:	learn: 0.3902136	total: 4.3s	remaining: 5m 3s
14:	learn: 0.3826012	total: 4.61s	remaining: 5m 2s
15:	learn: 0.3764542	total: 4.96s	remaining: 5m 4s
16:	learn: 0.3706479	total: 5.27s	remaining: 5m 4s
17:	learn: 0.3640497	total: 5.54s	remaining: 5m 2s
18:	learn: 0.3595833	total: 5.91s	remaining: 5m 5s
19:	learn: 0.3557479	total: 6.21s	remaining:

158:	learn: 0.2603217	total: 1m 5s	remaining: 5m 48s
159:	learn: 0.2602593	total: 1m 6s	remaining: 5m 48s
160:	learn: 0.2601658	total: 1m 6s	remaining: 5m 48s
161:	learn: 0.2600676	total: 1m 7s	remaining: 5m 48s
162:	learn: 0.2595922	total: 1m 7s	remaining: 5m 48s
163:	learn: 0.2595386	total: 1m 8s	remaining: 5m 47s
164:	learn: 0.2593037	total: 1m 8s	remaining: 5m 47s
165:	learn: 0.2592047	total: 1m 9s	remaining: 5m 47s
166:	learn: 0.2587114	total: 1m 9s	remaining: 5m 46s
167:	learn: 0.2586337	total: 1m 10s	remaining: 5m 47s
168:	learn: 0.2582650	total: 1m 10s	remaining: 5m 46s
169:	learn: 0.2581631	total: 1m 10s	remaining: 5m 46s
170:	learn: 0.2577962	total: 1m 11s	remaining: 5m 46s
171:	learn: 0.2577084	total: 1m 11s	remaining: 5m 46s
172:	learn: 0.2576086	total: 1m 12s	remaining: 5m 45s
173:	learn: 0.2575220	total: 1m 12s	remaining: 5m 46s
174:	learn: 0.2574400	total: 1m 13s	remaining: 5m 45s
175:	learn: 0.2573544	total: 1m 13s	remaining: 5m 45s
176:	learn: 0.2572870	total: 1m 14s	r

311:	learn: 0.2395780	total: 2m 16s	remaining: 5m 1s
312:	learn: 0.2395399	total: 2m 17s	remaining: 5m 1s
313:	learn: 0.2394637	total: 2m 17s	remaining: 5m 1s
314:	learn: 0.2393813	total: 2m 18s	remaining: 5m
315:	learn: 0.2393063	total: 2m 18s	remaining: 5m
316:	learn: 0.2392465	total: 2m 19s	remaining: 4m 59s
317:	learn: 0.2391130	total: 2m 19s	remaining: 4m 59s
318:	learn: 0.2390299	total: 2m 19s	remaining: 4m 58s
319:	learn: 0.2389525	total: 2m 20s	remaining: 4m 58s
320:	learn: 0.2389167	total: 2m 21s	remaining: 4m 58s
321:	learn: 0.2388384	total: 2m 21s	remaining: 4m 58s
322:	learn: 0.2387645	total: 2m 22s	remaining: 4m 57s
323:	learn: 0.2386806	total: 2m 22s	remaining: 4m 57s
324:	learn: 0.2385638	total: 2m 23s	remaining: 4m 57s
325:	learn: 0.2384232	total: 2m 23s	remaining: 4m 56s
326:	learn: 0.2383637	total: 2m 23s	remaining: 4m 56s
327:	learn: 0.2382885	total: 2m 24s	remaining: 4m 55s
328:	learn: 0.2382565	total: 2m 24s	remaining: 4m 55s
329:	learn: 0.2378777	total: 2m 25s	rem

464:	learn: 0.2284006	total: 3m 32s	remaining: 4m 4s
465:	learn: 0.2283490	total: 3m 33s	remaining: 4m 4s
466:	learn: 0.2283259	total: 3m 34s	remaining: 4m 4s
467:	learn: 0.2282774	total: 3m 34s	remaining: 4m 4s
468:	learn: 0.2281566	total: 3m 35s	remaining: 4m 4s
469:	learn: 0.2281352	total: 3m 36s	remaining: 4m 3s
470:	learn: 0.2280543	total: 3m 36s	remaining: 4m 3s
471:	learn: 0.2280196	total: 3m 37s	remaining: 4m 2s
472:	learn: 0.2279727	total: 3m 37s	remaining: 4m 2s
473:	learn: 0.2279501	total: 3m 38s	remaining: 4m 2s
474:	learn: 0.2278437	total: 3m 38s	remaining: 4m 1s
475:	learn: 0.2277899	total: 3m 39s	remaining: 4m 1s
476:	learn: 0.2277474	total: 3m 39s	remaining: 4m 1s
477:	learn: 0.2277175	total: 3m 40s	remaining: 4m 1s
478:	learn: 0.2276713	total: 3m 41s	remaining: 4m
479:	learn: 0.2275698	total: 3m 41s	remaining: 4m
480:	learn: 0.2275219	total: 3m 42s	remaining: 4m
481:	learn: 0.2273681	total: 3m 43s	remaining: 3m 59s
482:	learn: 0.2273001	total: 3m 43s	remaining: 3m 59s


617:	learn: 0.2199584	total: 5m 4s	remaining: 3m 7s
618:	learn: 0.2199098	total: 5m 4s	remaining: 3m 7s
619:	learn: 0.2198937	total: 5m 5s	remaining: 3m 7s
620:	learn: 0.2198570	total: 5m 5s	remaining: 3m 6s
621:	learn: 0.2198238	total: 5m 6s	remaining: 3m 5s
622:	learn: 0.2197721	total: 5m 6s	remaining: 3m 5s
623:	learn: 0.2197267	total: 5m 6s	remaining: 3m 4s
624:	learn: 0.2196776	total: 5m 7s	remaining: 3m 4s
625:	learn: 0.2196073	total: 5m 7s	remaining: 3m 3s
626:	learn: 0.2194458	total: 5m 8s	remaining: 3m 3s
627:	learn: 0.2193230	total: 5m 8s	remaining: 3m 2s
628:	learn: 0.2192544	total: 5m 9s	remaining: 3m 2s
629:	learn: 0.2191163	total: 5m 9s	remaining: 3m 1s
630:	learn: 0.2190887	total: 5m 9s	remaining: 3m 1s
631:	learn: 0.2190193	total: 5m 10s	remaining: 3m
632:	learn: 0.2189853	total: 5m 10s	remaining: 3m
633:	learn: 0.2189540	total: 5m 11s	remaining: 2m 59s
634:	learn: 0.2189110	total: 5m 11s	remaining: 2m 59s
635:	learn: 0.2188781	total: 5m 12s	remaining: 2m 58s
636:	learn

771:	learn: 0.2115736	total: 6m 16s	remaining: 1m 51s
772:	learn: 0.2115607	total: 6m 16s	remaining: 1m 50s
773:	learn: 0.2115288	total: 6m 17s	remaining: 1m 50s
774:	learn: 0.2115201	total: 6m 17s	remaining: 1m 49s
775:	learn: 0.2114731	total: 6m 18s	remaining: 1m 49s
776:	learn: 0.2114404	total: 6m 19s	remaining: 1m 48s
777:	learn: 0.2114129	total: 6m 19s	remaining: 1m 48s
778:	learn: 0.2113927	total: 6m 20s	remaining: 1m 47s
779:	learn: 0.2113676	total: 6m 20s	remaining: 1m 47s
780:	learn: 0.2113462	total: 6m 21s	remaining: 1m 46s
781:	learn: 0.2113181	total: 6m 22s	remaining: 1m 46s
782:	learn: 0.2112869	total: 6m 22s	remaining: 1m 46s
783:	learn: 0.2112612	total: 6m 23s	remaining: 1m 45s
784:	learn: 0.2112375	total: 6m 23s	remaining: 1m 45s
785:	learn: 0.2112305	total: 6m 24s	remaining: 1m 44s
786:	learn: 0.2111773	total: 6m 24s	remaining: 1m 44s
787:	learn: 0.2111529	total: 6m 25s	remaining: 1m 43s
788:	learn: 0.2111343	total: 6m 25s	remaining: 1m 43s
789:	learn: 0.2110690	total:

925:	learn: 0.2068690	total: 7m 26s	remaining: 35.7s
926:	learn: 0.2068363	total: 7m 27s	remaining: 35.2s
927:	learn: 0.2068065	total: 7m 27s	remaining: 34.7s
928:	learn: 0.2067842	total: 7m 28s	remaining: 34.3s
929:	learn: 0.2067655	total: 7m 28s	remaining: 33.8s
930:	learn: 0.2067445	total: 7m 29s	remaining: 33.3s
931:	learn: 0.2067409	total: 7m 29s	remaining: 32.8s
932:	learn: 0.2067089	total: 7m 29s	remaining: 32.3s
933:	learn: 0.2066772	total: 7m 30s	remaining: 31.8s
934:	learn: 0.2066520	total: 7m 30s	remaining: 31.3s
935:	learn: 0.2066233	total: 7m 31s	remaining: 30.9s
936:	learn: 0.2066058	total: 7m 31s	remaining: 30.4s
937:	learn: 0.2065770	total: 7m 32s	remaining: 29.9s
938:	learn: 0.2065525	total: 7m 32s	remaining: 29.4s
939:	learn: 0.2065311	total: 7m 33s	remaining: 28.9s
940:	learn: 0.2064970	total: 7m 33s	remaining: 28.4s
941:	learn: 0.2063850	total: 7m 34s	remaining: 28s
942:	learn: 0.2063669	total: 7m 34s	remaining: 27.5s
943:	learn: 0.2063342	total: 7m 34s	remaining: 2

<catboost.core.CatBoostClassifier at 0x2c71f3e10>

Save model:

In [132]:
pickle.dump(CBoost, open(f'models/CBoost-{date.today()}.pkl', 'wb'))