# Fraud Detection

This notebook shows how to build a fraud detection model with ThirdAI's Universal Deep Transformer (UDT), our all-purpose classifier for tabular datasets. In this demo, we train and evaluate the model on a fraud detection dataset from Kaggle, but you can easily substitute your own dataset.

To run this notebook, you will need a ThirdAI license. If you do not already have one, you can obtain it here: https://www.thirdai.com/try-bolt/

In [None]:
!pip3 install -r requirements.txt
!pip3 install kaggle
!pip3 install thirdai --upgrade

# Dataset Download
Here we use the Kaggle API to download the fraud detection dataset found here: https://www.kaggle.com/datasets/ealaxi/paysim1

Downloading this dataset requires authentication with a Kaggle account. Using the Kaggle API as we do below requires a valid kaggle.json file containing your stored credentials. Visit https://github.com/Kaggle/kaggle-api#api-credentials for more documentation on the Kaggle API.

You may also choose to download the dataset directly from the source, in which case you should provide the path to the dataset in the prep_fraud_dataset() call later on.

In [None]:
import kaggle

kaggle.api.authenticate()

kaggle.api.dataset_download_files('ealaxi/paysim1', path='./fraud_detection', unzip=True)
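
If authentication fails, it can help to confirm that credentials are actually in place before retrying. The following is a minimal sketch, assuming the locations described in the Kaggle API documentation: a kaggle.json file in `~/.kaggle` (or in the directory named by `KAGGLE_CONFIG_DIR`), or the `KAGGLE_USERNAME` and `KAGGLE_KEY` environment variables. The helper names are hypothetical and are not part of the kaggle package, and newer versions of the package may also look in other directories.

```python
import os
from pathlib import Path
from typing import Optional


def find_kaggle_credentials() -> Optional[Path]:
    """Return the path to kaggle.json if it exists, else None.

    Assumption: the file lives in $KAGGLE_CONFIG_DIR if that variable is set,
    otherwise in ~/.kaggle.
    """
    config_dir = os.environ.get("KAGGLE_CONFIG_DIR")
    base = Path(config_dir) if config_dir else Path.home() / ".kaggle"
    candidate = base / "kaggle.json"
    return candidate if candidate.is_file() else None


def has_kaggle_env_credentials() -> bool:
    """Return True if both credential environment variables are non-empty."""
    return bool(os.environ.get("KAGGLE_USERNAME")) and bool(
        os.environ.get("KAGGLE_KEY")
    )


if find_kaggle_credentials() is None and not has_kaggle_env_credentials():
    print("No Kaggle credentials found; download the dataset manually instead.")
```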

We will use the utils module in this repo to prepare the data for training (if you have just copied this notebook and not cloned the entire repo, you will need to copy the utils.py file as well). You can replace this step and the next step with a download method and a UDT initialization step that is specific to your dataset.

In [None]:
from utils import prep_fraud_dataset

dataset_filename = "./fraud_detection/PS_20174392719_1491204439457_log.csv"

train_filename, test_filename, inference_batch = prep_fraud_dataset(dataset_filename)
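
If you are adapting this step to your own dataset, a replacement for the preparation helper might look roughly like the sketch below. This is not the implementation in utils.py, which may do additional processing. It only illustrates the same return shape: a training file, a test file, and a small in-memory batch of rows with the target column removed. The function name `split_dataset`, the output file suffixes, and the default parameters are all assumptions made for illustration.

```python
import csv
import random


def split_dataset(path, target="isFraud", test_fraction=0.2, batch_size=5, seed=42):
    """Shuffle a CSV, write train/test files, and return a label-free batch.

    Note: this loads the whole file into memory, which is fine for a sketch
    but may be too slow or too large for a multi-million-row dataset.
    """
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames
        rows = list(reader)

    # Deterministic shuffle so the split is reproducible.
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_fraction)
    test_rows, train_rows = rows[:n_test], rows[n_test:]

    # Hypothetical output names derived from the input path.
    train_path, test_path = path + ".train.csv", path + ".test.csv"
    for out_path, subset in ((train_path, train_rows), (test_path, test_rows)):
        with open(out_path, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(subset)

    # Rows for inference: same columns as the test set, minus the target.
    inference_batch = [
        {k: v for k, v in row.items() if k != target}
        for row in test_rows[:batch_size]
    ]
    return train_path, test_path, inference_batch
```

Each element of the returned batch is a dictionary mapping column names to string values, with the target column left out because labels are not available at inference time.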

# UDT Initialization
We can now create a UDT model by passing in the types of each column in the dataset and the target column we want to be able to predict.

In [None]:
from thirdai import bolt

model = bolt.UniversalDeepTransformer(
    data_types={
        "step": bolt.types.categorical(),
        "type": bolt.types.categorical(),
        "amount": bolt.types.numerical(range=(0, 10000001)),
        "nameOrig": bolt.types.categorical(),
        "oldbalanceOrg": bolt.types.numerical(range=(0, 59585041)),
        "newbalanceOrig": bolt.types.numerical(range=(0, 49585041)),
        "nameDest": bolt.types.categorical(),
        "oldbalanceDest": bolt.types.numerical(range=(0, 356015890)),
        "newbalanceDest": bolt.types.numerical(range=(0, 356179279)),
        "isFraud": bolt.types.categorical(),
        "isFlaggedFraud": bolt.types.categorical(),
    },
    target="isFraud",
    n_target_classes=2,
)
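
The `range` arguments above appear to bound the values each numerical column takes in this dataset. For your own data, one way to find suitable bounds is a single pass over the CSV, as in the sketch below. The helper `numeric_ranges` is hypothetical, and whether to use the exact minimum and maximum or pad them is a judgment call that depends on the data you expect at inference time.

```python
import csv
import math


def numeric_ranges(path, columns):
    """Return {column: (min, max)} for the given numeric columns of a CSV."""
    lows = {c: math.inf for c in columns}
    highs = {c: -math.inf for c in columns}
    with open(path, newline="") as f:
        # Stream the file row by row so it never has to fit in memory.
        for row in csv.DictReader(f):
            for c in columns:
                value = float(row[c])
                lows[c] = min(lows[c], value)
                highs[c] = max(highs[c], value)
    return {c: (lows[c], highs[c]) for c in columns}
```

For example, `numeric_ranges(dataset_filename, ["amount", "oldbalanceOrg"])` would return a pair of bounds for each of those two columns, which you could then round outward and pass to `bolt.types.numerical(range=...)`.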

# Training
We can now train our UDT model with a single call! Feel free to customize the number of epochs and the learning rate; we have chosen values that give good convergence.

In [None]:
model.train(train_filename, epochs=2, learning_rate=0.001, metrics=["categorical_accuracy"])

# Evaluation
Evaluating the performance of the UDT model also takes just a single call!

In [None]:
model.evaluate(test_filename, metrics=["categorical_accuracy"]);
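
For reference, categorical accuracy is the fraction of rows whose predicted class matches the label. The sketch below computes it in plain Python, along with recall for the fraud class. Recall is included because, if a test set contains very few fraud cases, a model can score high accuracy while still missing most of them. Both function names are hypothetical helpers, not part of the thirdai package, and the labels are assumed to be 0 (not fraud) and 1 (fraud).

```python
def categorical_accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy of an empty set")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of truly positive rows that were predicted positive."""
    preds_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    if not preds_for_positives:
        return float("nan")  # Undefined when there are no positive rows.
    return sum(p == positive for p in preds_for_positives) / len(preds_for_positives)


# A model that always predicts "not fraud" on a mostly legitimate sample.
labels = [0, 0, 0, 1]
predictions = [0, 0, 0, 0]
print(categorical_accuracy(labels, predictions))  # 0.75
print(recall(labels, predictions))  # 0.0
```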

# Saving and Loading
Saving a trained UDT model to disk and loading it back is also straightforward.

In [None]:
save_location = "fraud_detection.model"

# Saving
model.save(save_location)

# Loading
model = bolt.UniversalDeepTransformer.load(save_location)

# Testing Predictions
The evaluate method is great for testing, but it requires labels, which don't exist in a production setting. UDT also provides predict and predict_batch methods, which take a single in-memory row or a batch of rows (without the target column), allowing easy integration into production pipelines.

In [None]:
import numpy as np

print("Inference batch:", inference_batch, "\n")

prediction = model.predict(inference_batch[0])
class_name = model.class_name(np.argmax(prediction))
print("Input:", inference_batch[0], "Prediction:", class_name, "\n")

prediction_batch = model.predict_batch(inference_batch)
class_names = [
    model.class_name(class_id) for class_id in np.argmax(prediction_batch, axis=1)
]
print("Batch Prediction Results")
for input_sample, class_name in zip(inference_batch, class_names):
    print("Input:", input_sample, "Prediction:", class_name)
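
The argmax step above turns per-class scores into a class id: for each row, it picks the index of the largest score. The sketch below shows the same idea in plain Python on made-up scores. The score values and the `class_names` list are hypothetical and stand in for the model's output and for `model.class_name`.

```python
def argmax(scores):
    """Return the index of the largest value (first one wins on ties)."""
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical per-class scores for two rows: [not fraud, fraud].
activations = [[0.9, 0.1], [0.2, 0.8]]
class_names = ["0", "1"]  # Hypothetical mapping from class id to label.

labels = [class_names[argmax(row)] for row in activations]
print(labels)  # ['0', '1']
```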