```
Copyright 2021 IBM Corporation

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
```

# Boosting Machine on Credit Card Fraud Dataset

## Background 

The goal of this learning task is to predict if a credit card transaction is fraudulent or genuine based on a set of anonymized features.

## Source

The raw dataset can be obtained directly from [Kaggle](https://www.kaggle.com/mlg-ulb/creditcardfraud). 

In this example, we download the dataset directly from Kaggle using the Kaggle API. 

For this to work, you must log in to Kaggle and follow [these instructions](https://www.kaggle.com/docs/api) to install your API token on your machine.
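Before running the notebook, it can help to confirm that the token is where the Kaggle client will look for it. The sketch below assumes the conventional location `~/.kaggle/kaggle.json` (or the directory named by the `KAGGLE_CONFIG_DIR` environment variable) and a JSON file with `username` and `key` fields. The helper name `find_kaggle_token` is hypothetical and is not part of the Kaggle package or of this repository.

```python
import json
import os
from pathlib import Path


def find_kaggle_token(home=None):
    """Return the path to kaggle.json if it exists and looks usable, else None."""
    # KAGGLE_CONFIG_DIR, when set, takes precedence over the default ~/.kaggle.
    config_dir = os.environ.get("KAGGLE_CONFIG_DIR")
    base = Path(config_dir) if config_dir else Path(home or Path.home()) / ".kaggle"
    token = base / "kaggle.json"
    if not token.is_file():
        return None
    with token.open() as f:
        creds = json.load(f)
    # Assumption: a usable token carries both a username and a key.
    return token if {"username", "key"} <= creds.keys() else None


print(find_kaggle_token() or "No usable Kaggle API token found.")
```

If the function returns `None`, revisit the token installation steps linked above before running the cells that load the dataset.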

## Goal

The goal of this notebook is to illustrate that Snap ML's boosting machine can match or exceed the accuracy of XGBoost and LightGBM on this task, measured by class-weighted logistic loss on a held-out test set.

## Code

In [1]:
cd ../../

/Users/tpa/Code/snapml-examples/examples


In [2]:
CACHE_DIR='cache-dir'

In [3]:
import numpy as np
import pandas as pd
import time
from datasets import CreditCardFraud
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from snapml import BoostingMachineClassifier as SnapBoostingMachineClassifier
from sklearn.metrics import log_loss, make_scorer
from sklearn.experimental import enable_halving_search_cv
from sklearn.model_selection import HalvingRandomSearchCV, train_test_split, PredefinedSplit
from sklearn.utils import parallel_backend
from sklearn.utils.class_weight import compute_class_weight, compute_sample_weight

In [4]:
dataset = CreditCardFraud(cache_dir=CACHE_DIR)
X_train, X_test, y_train, y_test = dataset.get_train_test_split()

Reading binary CreditCardFraud dataset (cache) from disk.


In [5]:
print("Number of examples: %d" % (X_train.shape[0]))
print("Number of features: %d" % (X_train.shape[1]))
print("Number of classes:  %d" % (len(np.unique(y_train))))

Number of examples: 213605
Number of features: 28
Number of classes:  2


Define the validation set:

In [6]:
train_ind, val_ind = train_test_split(range(0, X_train.shape[0]), test_size=0.3, 
                                      shuffle=True, random_state=42)
tmp = np.zeros(shape=(X_train.shape[0],))
tmp[train_ind] = -1  # -1 marks rows that are never placed in the validation fold
splitter = PredefinedSplit(tmp)
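To see what the splitter produces, here is a minimal sketch on a toy index range of 10 rows (an assumption made for readability; the notebook uses `X_train.shape[0]`). Rows labelled `-1` always stay in the fitting part, and rows labelled `0` form the single validation fold, so the search evaluates each candidate on exactly one fixed split.

```python
import numpy as np
from sklearn.model_selection import PredefinedSplit, train_test_split

n = 10  # toy size, stands in for X_train.shape[0]
train_ind, val_ind = train_test_split(range(n), test_size=0.3,
                                      shuffle=True, random_state=42)

test_fold = np.zeros(n)
test_fold[train_ind] = -1            # -1: excluded from every validation fold
splitter = PredefinedSplit(test_fold)

folds = list(splitter.split())       # one (fit indices, validation indices) pair
fit_idx, val_idx = folds[0]

print("number of splits:", splitter.get_n_splits())
print("validation rows :", sorted(val_idx))
```

Because there is only one fold, the "cross-validation" score reported by the search is simply the score on this fixed validation set.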

Calculate the class weights (to account for imbalance):

In [7]:
class_weights = {
    0: y_train.shape[0]/2.0/np.sum(y_train == 0),
    1: y_train.shape[0]/2.0/np.sum(y_train == 1)
}
print(class_weights)

w_train = compute_sample_weight(class_weights, y_train)

{0: 0.5008652385150725, 1: 289.43766937669375}
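The formula above, `n_samples / (2 * count_of_class)`, is the same heuristic that scikit-learn applies for `class_weight="balanced"` with two classes. The sketch below checks this on toy labels (95 negatives and 5 positives, chosen only for illustration) and shows that each class ends up with the same total weight.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight, compute_sample_weight

y = np.array([0] * 95 + [1] * 5)  # toy imbalanced labels

# Manual weights, written as in the cell above.
manual = {c: y.shape[0] / 2.0 / np.sum(y == c) for c in (0, 1)}

# scikit-learn's "balanced" heuristic: n_samples / (n_classes * count_per_class).
balanced = compute_class_weight("balanced", classes=np.array([0, 1]), y=y)

w = compute_sample_weight(manual, y)
total_per_class = {c: w[y == c].sum() for c in (0, 1)}

print(manual)
print(balanced)
print(total_per_class)  # both classes contribute the same total weight
```

Applied to the printed weights for this dataset, the class-1 weight of about 289.44 corresponds to 213605 / (2 × 289.44) ≈ 369 fraudulent examples in the training set.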


Define our custom scoring function (class-weighted logistic loss):

In [8]:
def weighted_log_loss(y, p):
    w = compute_sample_weight(class_weights, y)
    return log_loss(y, p.astype(np.float64), sample_weight=w)

scorer = make_scorer(weighted_log_loss, greater_is_better=False, needs_proba=True)
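As a sanity check on what this metric computes, the sketch below reproduces a weighted log loss by hand on four toy predictions (values and weights are made up for illustration). With `sample_weight`, `log_loss` returns the weighted average of the per-example losses. Since `greater_is_better=False`, the scorer passed to the search reports the negated loss, so a higher scorer value still means a better model.

```python
import numpy as np
from sklearn.metrics import log_loss

y = np.array([0, 0, 0, 1])
p = np.array([0.1, 0.2, 0.3, 0.6])   # predicted probability of class 1
w = np.array([1.0, 1.0, 1.0, 3.0])   # toy sample weights

# Per-example logistic loss, then a weighted average.
per_example = -(y * np.log(p) + (1 - y) * np.log(1 - p))
manual = np.sum(w * per_example) / np.sum(w)

reference = log_loss(y, p, sample_weight=w)
print(manual, reference)
```

The two values agree, which confirms that rare positives dominate the metric only to the extent that their weights say they should.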

### Hyper-parameter tuning

We will tune all three boosting frameworks using the `HalvingRandomSearchCV` optimizer from scikit-learn. 

We will use the following parameters for the optimization in all cases:

In [9]:
sh_params = {
    'n_candidates': 256,
    'min_resources': 16,
    'max_resources': 1024,
    'factor': 4,
    'scoring': scorer,
    'random_state': 42,
    'n_jobs': 4,
    'cv': splitter,
    'return_train_score': False,
}
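To make these settings concrete, the sketch below works out the successive-halving schedule they imply. It is a simplified model, not scikit-learn's implementation: it assumes that each iteration keeps the best `1/factor` of the candidates, multiplies the resource by `factor`, and stops once the resource would exceed `max_resources`. The function name `halving_schedule` is hypothetical. In this notebook the resource is the number of boosting rounds.

```python
import math


def halving_schedule(n_candidates, min_resources, max_resources, factor):
    """Return a list of (candidates, resources per candidate), one entry per iteration."""
    schedule = []
    candidates, resources = n_candidates, min_resources
    while resources <= max_resources:
        schedule.append((candidates, resources))
        candidates = math.ceil(candidates / factor)  # keep the top 1/factor
        resources *= factor                          # give survivors more rounds
    return schedule


schedule = halving_schedule(n_candidates=256, min_resources=16,
                            max_resources=1024, factor=4)
for i, (c, r) in enumerate(schedule):
    print("iteration %d: %3d candidates x %4d boosting rounds" % (i, c, r))

total_rounds = sum(c * r for c, r in schedule)
print("total boosting rounds:", total_rounds)
```

Under these assumptions, every iteration costs the same number of boosting rounds, and only four candidates are ever trained with the full 1024 rounds.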

In [10]:
df = pd.DataFrame(columns=['t_fit', 'holdout_score'])

#### XGBoost

In [11]:
clf = XGBClassifier(random_state=42, 
                    n_jobs=1,
                    max_bin=256,
                    tree_method='hist',
                    use_label_encoder=False,
                    eval_metric='logloss')

xgb_distributions = {
    "max_depth": range(1, 20),
    "learning_rate": 10 ** np.linspace(-2.5, -1),
    "colsample_bytree": np.linspace(0.5, 1.0),
    "subsample": np.linspace(0.5, 1.0),
    "reg_lambda": 10 ** np.linspace(-2, 2)
}

search = HalvingRandomSearchCV(clf, xgb_distributions, resource='n_estimators', **sh_params)
                        
t0 = time.time()
with parallel_backend("loky"): 
    search.fit(X_train, y_train.astype(np.int32), sample_weight=w_train)
t_fit_xgboost  = time.time()-t0

print("Optimized XGBoost hyper-parameters:")
for k, v in search.best_params_.items():
    print("%30s:" % (k), v)
    
score_xgboost = weighted_log_loss(y_test, search.predict_proba(X_test)[:,1])

res_xgboost = pd.Series({'t_fit': t_fit_xgboost, 'holdout_score': score_xgboost}, name='xgboost')
df = df.append(res_xgboost)
print(df)

Optimized XGBoost hyper-parameters:
                     subsample: 0.6326530612244898
                    reg_lambda: 56.89866029018293
                     max_depth: 1
                 learning_rate: 0.09319395762340775
              colsample_bytree: 0.7142857142857143
                  n_estimators: 1024
              t_fit  holdout_score
xgboost  269.207851        0.26633
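The search spaces are finite grids rather than continuous distributions: `np.linspace` returns 50 points by default, so `10 ** np.linspace(a, b)` yields 50 log-spaced values, and the search samples uniformly from each list. A minimal sketch, using the learning-rate grid and the selected value printed above:

```python
import numpy as np

learning_rates = 10 ** np.linspace(-2.5, -1)   # 50 points by default

# Equivalent formulation with np.logspace.
same_as_logspace = np.allclose(learning_rates, np.logspace(-2.5, -1, num=50))

# The selected XGBoost learning rate should be one of the grid points.
selected = 0.09319395762340775
on_grid = bool(np.isclose(learning_rates, selected).any())

print(learning_rates.min(), learning_rates.max())
print(same_as_logspace, on_grid)
```

Sampling exponents uniformly, rather than the values themselves, spreads candidates evenly across orders of magnitude, which is a common choice for learning rates and regularization strengths.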


#### LightGBM

In [12]:
clf = LGBMClassifier(random_state=42, 
                     n_jobs=1, 
                     max_bin=256)

lgbm_distributions = {
    "num_leaves": 2 ** np.array(range(1, 15)),
    "learning_rate": 10 ** np.linspace(-2.5, -1),
    "colsample_bytree": np.linspace(0.5, 1.0),
    "subsample": np.linspace(0.5, 1.0),
    "reg_lambda": 10 ** np.linspace(-2, 2)
}

search = HalvingRandomSearchCV(clf, lgbm_distributions, resource='n_estimators', **sh_params)
                        
t0 = time.time()
with parallel_backend("loky"): 
    search.fit(X_train, y_train.astype(np.int32), sample_weight=w_train)
t_fit_lightgbm  = time.time()-t0

print("Optimized LightGBM hyper-parameters:")
for k, v in search.best_params_.items():
    print("%30s:" % (k), v)

score_lightgbm = weighted_log_loss(y_test, search.predict_proba(X_test)[:,1])

res_lightgbm = pd.Series({'t_fit': t_fit_lightgbm, 'holdout_score': score_lightgbm}, name='lightgbm')
df = df.append(res_lightgbm)
print(df)



Optimized LightGBM hyper-parameters:
                     subsample: 0.9387755102040816
                    reg_lambda: 0.04498432668969444
                    num_leaves: 4
                 learning_rate: 0.05302611335911987
              colsample_bytree: 0.7857142857142857
                  n_estimators: 1024
               t_fit  holdout_score
xgboost   269.207851       0.266330
lightgbm  213.848340       0.381463


#### SnapBoost

In [13]:
clf = SnapBoostingMachineClassifier(random_state=42, 
                                    n_jobs=1,
                                    hist_nbins=256)

snap_distributions = {
    "max_depth": range(1, 20),
    "tree_select_probability": np.linspace(0.9, 1.0),
    "learning_rate": 10 ** np.linspace(-2.5, -1),
    "colsample_bytree": np.linspace(0.5, 1.0),
    "subsample": np.linspace(0.5, 1.0),
    "lambda_l2": 10 ** np.linspace(-2, 2),
    "regularizer": 10 ** np.linspace(-6, 3),
    "fit_intercept": [False, True],
    "gamma": 10 ** np.linspace(-3, 3),
    "n_components": range(1, 100)   
}

search = HalvingRandomSearchCV(clf, snap_distributions, resource='num_round', **sh_params)
                             
t0 = time.time()
with parallel_backend("loky"): 
    search.fit(X_train, y_train, sample_weight=w_train)
t_fit_snapml = time.time()-t0

print("Optimized SnapBoost hyper-parameters:")
for k, v in search.best_params_.items():
    print("%30s:" % (k), v)
    
score_snapml = weighted_log_loss(y_test, search.predict_proba(X_test)[:,1])

res_snapml = pd.Series({'t_fit': t_fit_snapml, 'holdout_score': score_snapml}, name='snapml')
df = df.append(res_snapml)
print(df)



Optimized SnapBoost hyper-parameters:
       tree_select_probability: 0.9346938775510204
                     subsample: 0.846938775510204
                   regularizer: 6.250551925273976
                  n_components: 58
                     max_depth: 1
                 learning_rate: 0.08094001216083124
                     lambda_l2: 56.89866029018293
                         gamma: 568.9866029018293
                 fit_intercept: True
              colsample_bytree: 0.5408163265306123
                     num_round: 1024
               t_fit  holdout_score
xgboost   269.207851       0.266330
lightgbm  213.848340       0.381463
snapml    267.732386       0.253518


### Calculate Leaderboard

In [14]:
df = df.sort_values(by='holdout_score')
df['rank'] = df['holdout_score'].rank()
df

               t_fit  holdout_score  rank
snapml    267.732386       0.253518   1.0
xgboost   269.207851       0.266330   2.0
lightgbm  213.848340       0.381463   3.0
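To put the gaps in perspective, the sketch below computes the relative reduction in hold-out loss of the best framework against each of the others, using the scores reported above. These numbers come from a single run with a single random seed, so they indicate the size of the gap in this run only and say nothing about its statistical significance.

```python
# Hold-out scores (class-weighted log loss, lower is better) from the leaderboard.
scores = {"snapml": 0.253518, "xgboost": 0.266330, "lightgbm": 0.381463}

best = min(scores, key=scores.get)

# Relative loss reduction of the best framework with respect to each other one.
relative_reduction = {
    name: (score - scores[best]) / score
    for name, score in scores.items()
    if name != best
}

for name, rel in relative_reduction.items():
    print("%s vs %s: %.1f%% lower loss" % (best, name, 100 * rel))
```

In this run, the loss is roughly 5% lower than XGBoost's and roughly a third lower than LightGBM's.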


## Disclaimer

Performance results always depend on the hardware and software environment. 

Information regarding the environment that was used to run this notebook is provided below:

In [15]:
import utils
environment = utils.get_environment()
for k,v in environment.items():
    print("%20s: %s" % (k, v))

            platform: macOS-10.16-x86_64-i386-64bit
           cpu_count: 8
        cpu_freq_min: 2300
        cpu_freq_max: 2300
        total_memory: 32.0
      snapml_version: 1.7.0
     sklearn_version: 0.24.1
     xgboost_version: 1.3.3
    lightgbm_version: 3.1.1


## Record Statistics

Finally, we record the environment and performance statistics for analysis outside of this standalone notebook.

In [16]:
import scrapbook as sb
sb.glue("result", {
    'dataset': dataset.name,
    'n_examples_train': X_train.shape[0],
    'n_examples_test': X_test.shape[0],
    'n_features': X_train.shape[1],
    'n_classes': len(np.unique(y_train)),
    'model': 'BoostingMachineClassifier',
    'score': 'weighted_log_loss',
    't_fit_xgboost': t_fit_xgboost,
    'score_xgboost': score_xgboost,
    't_fit_lightgbm': t_fit_lightgbm,
    'score_lightgbm': score_lightgbm,
    't_fit_snapml': t_fit_snapml,
    'score_snapml': score_snapml,
    **environment,
})