```
Copyright 2022 IBM Corporation

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
```

# Random Forest on Credit Card Fraud Dataset

## Background 

The task is to predict whether a credit card transaction is fraudulent or genuine, based on a set of anonymized features.

## Source

The raw dataset can be obtained directly from [Kaggle: Credit Card Fraud Detection](https://www.kaggle.com/mlg-ulb/creditcardfraud).

In this example, we download the dataset directly from Kaggle using their API. For this to work, you must log in to Kaggle and follow [these instructions](https://www.kaggle.com/docs/api) to install your API token on your machine.
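
Before running the download, it can help to confirm that the API token is in place. The following is a minimal sketch, not part of the notebook: the helper `kaggle_token_status` is hypothetical, and it assumes the token is stored at `~/.kaggle/kaggle.json`, the location described in the Kaggle instructions linked above. It only inspects the file and never contacts Kaggle.

```python
import os
import stat
from pathlib import Path


def kaggle_token_status(token_path=None):
    """Return "missing", "too-permissive" or "ok" for a Kaggle API token file.

    Assumption: the token lives at ~/.kaggle/kaggle.json unless another
    path is passed in. This helper is illustrative, not a Kaggle API call.
    """
    path = Path(token_path) if token_path else Path.home() / ".kaggle" / "kaggle.json"
    if not path.is_file():
        return "missing"
    # On POSIX systems, flag tokens that group or other users can access.
    mode = stat.S_IMODE(path.stat().st_mode)
    if os.name == "posix" and mode & (stat.S_IRWXG | stat.S_IRWXO):
        return "too-permissive"
    return "ok"
```

Calling `kaggle_token_status()` with no argument checks the assumed default location. A result of `"too-permissive"` suggests tightening the file mode (for example with `chmod 600`) before using the token.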

## Goal

The goals of this notebook are to illustrate how to use Snap ML to: 1) import a scikit-learn random forest trained on this dataset into Snap ML, and 2) run inference on the Z AI accelerator using the Snap ML prediction engine.

## Code

In [3]:
cd ../../

/root/snapml-examples/examples


In [4]:
CACHE_DIR='cache-dir'

In [43]:
import numpy as np
import time
from datasets import CreditCardFraud
from sklearn.ensemble import RandomForestClassifier
from snapml import RandomForestClassifier as SnapRandomForestClassifier
from sklearn.metrics import accuracy_score as score
from sklearn2pmml import sklearn2pmml
from sklearn2pmml import PMMLPipeline

In [33]:
dataset = CreditCardFraud(cache_dir=CACHE_DIR)
X_train, X_test, y_train, y_test = dataset.get_train_test_split()

Reading binary CreditCardFraud dataset (cache) from disk.


In [34]:
print("Number of examples: %d" % (X_train.shape[0]))
print("Number of features: %d" % (X_train.shape[1]))
print("Number of classes:  %d" % (len(np.unique(y_train))))

Number of examples: 213605
Number of features: 28
Number of classes:  2


In [53]:
# Create a scikit-learn Random Forest Classifier model
model = RandomForestClassifier(n_estimators = 100, max_depth=4, n_jobs=4, random_state=42)

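# Train a PMML pipeline that uses the scikit-learn model defined above,
# timing the fit so that t_fit_sklearn is available when results are recorded
t0 = time.time()
pipeline = PMMLPipeline([("model", model)]).fit(X_train, y_train)
t_fit_sklearn = time.time() - t0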

# Save the trained PMML pipeline to a file, e.g., "model.pmml"
sklearn2pmml(pipeline, "model.pmml", with_repr=True)

# Create a batch of rows to be scored using the PMML pipeline
test_data_size = 128
np.random.seed(10)
test_data_indices = np.random.choice(X_test.shape[0], test_data_size)

# Evaluate the accuracy of the scikit-learn-based pipeline on the test dataset
t0 = time.time()
score_sklearn = score(y_test[test_data_indices], pipeline.predict(X_test[test_data_indices]))
t_predict_sklearn = time.time() - t0

print("Inference time (sklearn): %6.2f milliseconds" % (1000*t_predict_sklearn))
print("Accuracy score (sklearn): %.4f" % (score_sklearn))

Inference time (sklearn):   9.65 milliseconds
Accuracy score (sklearn): 1.0000
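
The batch above is drawn with `np.random.choice`, which samples with replacement by default, so a row can appear more than once. Fraud datasets are also typically heavily imbalanced, so a 128-row batch may contain very few positive examples, which makes an accuracy of 1.0000 less informative than it looks. The sketch below shows one way to draw unique rows and inspect the class counts of a batch. The helper `sample_batch` and the synthetic labels are assumptions for illustration. They are not taken from the notebook.

```python
import numpy as np


def sample_batch(labels, batch_size, seed=10):
    """Draw unique row indices and count how many sampled rows fall in each class."""
    rng = np.random.default_rng(seed)
    # replace=False guarantees that no row index is drawn twice.
    indices = rng.choice(labels.shape[0], size=batch_size, replace=False)
    class_counts = np.bincount(labels[indices].astype(np.int64), minlength=2)
    return indices, class_counts


# Synthetic stand-in labels (assumption): 1000 rows, 5 of them positive.
demo_labels = np.zeros(1000, dtype=np.int64)
demo_labels[:5] = 1
demo_indices, demo_counts = sample_batch(demo_labels, batch_size=128)
print("rows per class in batch:", demo_counts)
```

In the notebook, the same helper could be called as `sample_batch(y_test, test_data_size)`, assuming `y_test` holds non-negative integer class labels.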


In [54]:
# Create a Snap ML Random Forest Classifier model
snapml_model = SnapRandomForestClassifier()

# Import the scikit-learn model into Snap ML
# Use the "zdnn_tensors" tree format so that the Snap ML prediction engine runs on the Z AI accelerator
snapml_model.import_model("model.pmml", "pmml", tree_format = "zdnn_tensors")

# Set the number of CPU threads used at inference time
snapml_model.set_params(n_jobs=4)

# The current implementation can run inference on test data sets with fewer than 32768 rows
test_data_size = 128
np.random.seed(10)
test_data_indices = np.random.choice(X_test.shape[0], test_data_size)

# Evaluate the accuracy of the scikit-learn model imported into Snap ML
t0 = time.time()
score_snapml = score(y_test[test_data_indices], snapml_model.predict(X_test[test_data_indices]))
t_predict_snapml = time.time() - t0

print("Inference time (snapml): %6.2f milliseconds" % (1000*t_predict_snapml))
print("Accuracy score (snapml): %.4f" % (score_snapml))

Inference time (snapml):   0.99 milliseconds
Accuracy score (snapml): 1.0000
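
The cell above notes that inference is limited to test sets with fewer than 32768 rows. One way to score a larger test set is to split it into consecutive slices that each stay under that limit. The sketch below is a minimal illustration and rests on an assumption: it reads the comment as a limit on each individual `predict` call. The helper `predict_in_chunks` is hypothetical, and a stand-in predictor is used so the code runs without Snap ML.

```python
import numpy as np


def predict_in_chunks(predict_fn, X, max_rows=32767):
    """Call predict_fn on consecutive row slices of at most max_rows rows each."""
    if max_rows < 1:
        raise ValueError("max_rows must be at least 1")
    parts = [predict_fn(X[start:start + max_rows])
             for start in range(0, X.shape[0], max_rows)]
    # Concatenate per-slice predictions back into one array.
    return np.concatenate(parts) if parts else np.empty(0)


# Stand-in predictor (assumption) so the sketch runs without Snap ML.
def demo_predict(batch):
    return (batch[:, 0] > 0).astype(np.int64)


demo_X = np.random.default_rng(0).normal(size=(70000, 3))
demo_out = predict_in_chunks(demo_predict, demo_X)
print(demo_out.shape)
```

In the notebook, `snapml_model.predict` would take the place of `demo_predict`, for example `predict_in_chunks(snapml_model.predict, X_test)`.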


In [57]:
speed_up = t_predict_sklearn/t_predict_snapml
score_diff = (score_snapml - score_sklearn)/score_sklearn
print("Snap ML vs Scikit-Learn Speed-up: %.1f x" % (speed_up))
print("Relative diff. in score: %.4f" % (score_diff))

Snap ML vs Scikit-Learn Speed-up: 9.8 x
Relative diff. in score: 0.0000
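
The speed-up above is computed from a single timed call per library, and timings in the millisecond range can vary noticeably between runs. A more stable estimate can be obtained by repeating each call and taking the median. The sketch below is one possible approach, not the notebook's method. The helper `median_runtime` is hypothetical, and the two workloads are stand-ins for the two `predict` calls.

```python
import statistics
import time


def median_runtime(fn, repeats=20, warmup=3):
    """Return the median wall-clock time of fn() in seconds over several runs."""
    # Untimed warm-up calls reduce the effect of one-off start-up costs.
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - t0)
    return statistics.median(samples)


# Stand-in workloads (assumption); in the notebook these would be the predict calls.
t_slow = median_runtime(lambda: sum(range(200_000)))
t_fast = median_runtime(lambda: sum(range(20_000)))
print("Speed-up: %.1f x" % (t_slow / t_fast))
```

Timing only the `predict` call, rather than `predict` together with the scoring function, would also isolate inference time from the cost of computing the accuracy.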


## Disclaimer

Performance results always depend on the hardware and software environment. 

Information regarding the environment that was used to run this notebook is provided below:

In [58]:
import utils
environment = utils.get_environment()
for k,v in environment.items():
    print("%15s: %s" % (k, v))

       platform: Linux-5.4.0-107-generic-s390x-with-glibc2.31
      cpu_count: 12
   cpu_freq_min: 0.0
   cpu_freq_max: 0.0
   total_memory: 251.740966796875
 snapml_version: 1.9.0
sklearn_version: 0.24.2
xgboost_version: 1.3.3
lightgbm_version: 3.3.2
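
The `utils` module used above is part of the examples repository and is not shown here. For readers without it, a comparable summary can be assembled from the Python standard library. The function below is a hypothetical stand-in, not the implementation of `utils.get_environment()`, and it reports only a subset of the fields listed above.

```python
import os
import platform
import sys


def describe_environment():
    """Collect a few environment facts using only the standard library.

    Hypothetical stand-in: this is not the notebook's utils.get_environment().
    """
    return {
        "platform": platform.platform(),
        "machine": platform.machine(),
        "cpu_count": os.cpu_count(),
        "python_version": sys.version.split()[0],
    }


for key, value in describe_environment().items():
    print("%15s: %s" % (key, value))
```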


## Record Statistics

Finally, we record the environment and performance statistics for analysis outside of this standalone notebook.

In [None]:
import scrapbook as sb
sb.glue("result", {
    'dataset': dataset.name,
    'n_examples_train': X_train.shape[0],
    'n_examples_test': X_test.shape[0],
    'n_features': X_train.shape[1],
    'n_classes': len(np.unique(y_train)),
    'model': type(model).__name__,
    'score': score.__name__,
    't_fit_sklearn': t_fit_sklearn,
    'score_sklearn': score_sklearn,
    't_fit_snapml': 0,
    'score_snapml': score_snapml,
    'score_diff': score_diff,
    'speed_up': speed_up,
    **environment,
})