# Financial Fraud Detection

- The objective of this notebook is to showcase the usage of the [___financial-fraud-training___ container](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/cugraph/containers/financial-fraud-training) and how to deploy the produced trained models on [NVIDIA Dynamo-Triton](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver).
- We use [IBM TabFormer](https://github.com/IBM/TabFormer) as an example dataset; the dataset is preprocessed before model training.

NOTE:
* The preprocessing code is written specifically for the TabFormer dataset and will not work with other datasets.
* Additionally, a familiarity with [Jupyter](https://docs.jupyter.org/en/latest/what_is_jupyter.html) is assumed.

# Environment Setup (Local and Brev)
This notebook is designed to work in both ___Local___ and ___Brev___ environments. The few differences between the two are pointed out where they occur.

### For Local Environment Setup
Please create a Conda environment and add that to the notebook - See the [README](../README.md) file

In [None]:
# default host for local run
HOST = "0.0.0.0"
BREV = False

### For Brev Environment Setup

In [None]:
!pip install -r "./requirements.txt"

In [None]:
BREV = True

In [None]:
# Brev public IP address
if BREV:
    HOST = 'host.docker.internal'
HOST

-----
## Import libraries (both environments)

In [None]:
import os
import sys
import json
import time

----
# Step 1: Get and Prepare the data

## For Local
1. Download the dataset: https://ibm.ent.box.com/v/tabformer-data/folder/130747715605
2. Untar and uncompress the file: `tar -xvzf ./transactions.tgz`
3. Put card_transaction.v1.csv in the `data/TabFormer/raw` folder


## For Brev 
1. Download the dataset: https://ibm.ent.box.com/v/tabformer-data/folder/130747715605
2. In the Jupyter notebook window, use the "File Browser" section to navigate to the data/TabFormer/raw folder
3. Drag-and-drop the "transactions.tgz" file into the folder
    - There is also an "upload" option that displays a file selector
    - Please wait for the upload to finish; it could take a while. You can follow progress in the status indicator at the bottom of the window
4. Now uncompress and untar by running the following command
    - Note: if something goes wrong, you will need to delete the file rather than trying to overwrite it.

In [None]:
# verify that the compressed file was uploaded successfully - the size should be 266M
!ls -lh ../data/TabFormer/raw

In [None]:
# Uncompress/untar the file
!tar xvzf ../data/TabFormer/raw/transactions.tgz -C ../data/TabFormer/raw/

__If__ drag-and-drop is not working, please run the [Download TabFormer](./extra/download-tabformer.ipynb) notebook in the "extra" folder
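
Before extracting, it can help to confirm that the archive actually contains the expected CSV, for example after an interrupted upload. The following is a minimal sketch using Python's standard `tarfile` module; the helper name `archive_contains` is hypothetical and not part of this repository.

```python
import tarfile

def archive_contains(archive_path, member_name):
    """Return True if the gzip-compressed tar archive has a member named member_name."""
    with tarfile.open(archive_path, "r:gz") as archive:
        # Compare on the base name so a leading "./" or folder prefix does not matter
        return any(name.split("/")[-1] == member_name for name in archive.getnames())
```

A call such as `archive_contains("../data/TabFormer/raw/transactions.tgz", "card_transaction.v1.csv")` should return `True` for a complete upload. Listing a gzip-compressed archive requires reading through it, so this can take a little while on the full file.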

## Check data folder structure
The goal is to produce the following structure

```
.
    data
    └── TabFormer
        └── raw
            └── card_transaction.v1.csv
```

In [None]:
# Once the raw data is placed as described above, set the path to the TabFormer directory

# Change this path to point to TabFormer data
data_root_dir = os.path.abspath('../data/TabFormer/') 

# Change this path to the directory where you want to save your model
model_output_dir = os.path.join(data_root_dir, 'trained_models')

# Path to save the trained model
os.makedirs(model_output_dir, exist_ok=True)

### Define python function to print directory tree

In [None]:
def print_tree(directory, prefix=""):
    """Recursively prints the directory tree starting at 'directory'."""
    # Retrieve a sorted list of entries in the directory
    entries = sorted(os.listdir(directory))
    entries_count = len(entries)
    
    for index, entry in enumerate(entries):
        path = os.path.join(directory, entry)
        # Determine the branch connector
        if index == entries_count - 1:
            connector = "└── "
            extension = "    "
        else:
            connector = "├── "
            extension = "│   "
        
        print(prefix + connector + entry)
        
        # If the entry is a directory, recursively print its contents
        if os.path.isdir(path):
            print_tree(path, prefix + extension)

In [None]:
# Check if the raw data has been placed properly
print_tree(data_root_dir)

---
# Step 2: Preprocess the data 
- Import the Python function for preprocessing the TabFormer data
- Call the `preprocess_data` function (from the `preprocess_TabFormer` module) to prepare the data

NOTE: The preprocessing can take a few minutes


In [None]:
# Add the "src" directory to the search path
src_dir = os.path.abspath(os.path.join(os.path.dirname(os.getcwd()), 'src'))
sys.path.insert(0, src_dir)

# should be able to import from "src" folder now
from preprocess_TabFormer import preprocess_data

In [None]:
# Preprocess the data
mask_mapping, feature_mask = preprocess_data(data_root_dir)

# this will output status as it correlates different attributes with target column

In [None]:
# You should now see files under a "gnn" folder and under an "xgb" folder
print_tree(data_root_dir)

-----
# Step 3:  Now train the model using the financial-fraud-training container


## Create training configuration file
NOTE: Training configuration file must conform to schema defined [here](https://docs.nvidia.com/nim/financial-fraud-training/latest/configuration/config-json.html)

__Important: Models and configuration files needed for deployment using NVIDIA Dynamo-Triton will be saved in a model repository under the host folder that is mounted at /trained_models inside the container__
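
A quick structural check before sending a configuration can catch a missing key early. The sketch below only mirrors the top-level keys used in this notebook's configuration; it is not the official schema linked above, and the helper name `missing_config_keys` is hypothetical.

```python
def missing_config_keys(config):
    """Return dotted key paths that this notebook's config uses but `config` lacks."""
    missing = []
    for key in ("data_dir", "output_dir"):
        if key not in config.get("paths", {}):
            missing.append(f"paths.{key}")
    models = config.get("models", [])
    if not models:
        missing.append("models")
    for i, model in enumerate(models):
        for key in ("kind", "gpu", "hyperparameters"):
            if key not in model:
                missing.append(f"models[{i}].{key}")
    return missing

example = {
    "paths": {"data_dir": "/data", "output_dir": "/trained_models"},
    "models": [{"kind": "GraphSAGE_XGBoost", "gpu": "single", "hyperparameters": {}}],
}
print(missing_config_keys(example))  # prints []
```

An empty list means the keys checked here are present; it says nothing about whether the values are valid for the container.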

In [None]:
training_config = {
  "paths": {
    "data_dir": "/data", # Mount dataset root directory under /data in the container
    "output_dir": "/trained_models" # Mount path to save the trained models.
                                    # NOTE: This path is inside the docker container 
  },

  "models": [
    {
      "kind": "GraphSAGE_XGBoost",
      "gpu": "single",
      "hyperparameters": {
        "gnn":{
          "hidden_channels": 16,
          "n_hops": 1,
          "dropout_prob": 0.1,
          "batch_size": 1024,
          "fan_out": 16,
          "num_epochs": 16
        },
        "xgb": {
          "max_depth": 6,
          "learning_rate": 0.2,
          "num_parallel_tree": 3,
          "num_boost_round": 512,
          "gamma": 0.0
        }

      }
    }
  ]
}


#### Save the training configuration as a json file

In [None]:
training_config_file_name = 'training_config.json'

with open(training_config_file_name, 'w') as json_file:
    json.dump(training_config, json_file, indent=4)

## Pull and run the financial_fraud_training container


#### Logging into the NVIDIA Container Registry.

The NVIDIA NGC API Key is a mandatory key that is required to use this blueprint. This is needed to log into the NVIDIA container registry, nvcr.io, and to pull secure container images used in this NVIDIA NIM Blueprint. Refer to [Generating NGC API Keys](https://docs.nvidia.com/ngc/gpu-cloud/ngc-user-guide/index.html#generating-api-key) in the NVIDIA NGC User Guide for more information.
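
To avoid pasting the key into a notebook that may be saved or shared, one option is to read it from an environment variable. This is a minimal sketch; the variable name `NGC_API_KEY` on the host and the helper `read_api_key` are assumptions, not something this blueprint requires.

```python
import os

def read_api_key(env_var="NGC_API_KEY", environ=None):
    """Return the key stored in env_var, or raise if it is unset or empty."""
    environ = os.environ if environ is None else environ
    key = environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before running this notebook")
    return key
```

With the variable exported before Jupyter starts, `API_KEY = read_api_key()` can replace the pasted string.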


In [None]:
API_KEY="PASTE YOUR NGC API KEY HERE"

#### Authenticate with the NVIDIA Container Registry with the following command

In [None]:
!echo "$API_KEY" | docker login nvcr.io --username '$oauthtoken' --password-stdin

#### Pull the container image from the NGC registry

In [None]:
!docker pull nvcr.io/nvidia/cugraph/financial-fraud-training:1.0.1

#### Set container name and ports for running the container

In [None]:
NIM_HTTP_PORT = 8002
NIM_GRPC_PORT = 50051
CONTAINER_NAME = "financial-fraud-training"
gnn_data_dir = os.path.join(data_root_dir, "gnn")

In [None]:
# Stop any running container with the same name
container_ids = !docker ps --filter "name={CONTAINER_NAME}" -q
if len(container_ids) > 0:
    !docker stop {CONTAINER_NAME}

#### Run the container

In [None]:
if BREV:
    host_path_gnn_data = gnn_data_dir.replace('/root/verb-workspace', '/home/ubuntu/workspace')
    host_path_trained_models = model_output_dir.replace('/root/verb-workspace', '/home/ubuntu/workspace')
else:
    host_path_gnn_data = gnn_data_dir
    host_path_trained_models = model_output_dir
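
The Brev path translation is repeated for every mounted folder in this notebook, so it could be factored into a small helper. The sketch below assumes the same two prefixes used above; the function name `to_host_path` is hypothetical.

```python
def to_host_path(path, brev,
                 container_prefix="/root/verb-workspace",
                 host_prefix="/home/ubuntu/workspace"):
    """Map a notebook-side path to the path the Docker daemon sees on the host."""
    if brev and path.startswith(container_prefix):
        return host_prefix + path[len(container_prefix):]
    # Local runs (and paths outside the workspace) are used unchanged
    return path

print(to_host_path("/root/verb-workspace/data/TabFormer/gnn", brev=True))
# prints /home/ubuntu/workspace/data/TabFormer/gnn
```

Unlike `str.replace`, this version rewrites the prefix only when it appears at the start of the path.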


In [None]:
# Options such as -e must come before the image name; anything after it is passed to the container as arguments
!docker run -d -it --rm --name={CONTAINER_NAME} --gpus "device=0" \
    -p {NIM_HTTP_PORT}:{NIM_HTTP_PORT} -e NIM_HTTP_API_PORT={NIM_HTTP_PORT} -p {NIM_GRPC_PORT}:{NIM_GRPC_PORT} \
    -e NIM_DISABLE_MODEL_DOWNLOAD=True -e NIM_GRPC_API_PORT={NIM_GRPC_PORT} -e NGC_API_KEY={API_KEY} \
    -v {host_path_gnn_data}:/data -v {host_path_trained_models}:/trained_models \
    nvcr.io/nvidia/cugraph/financial-fraud-training:1.0.1

In [None]:
time.sleep(5)

#### Finally, initiate model training using the training configuration defined earlier

- Initiate training via the /train endpoint by sending the training configuration as a JSON payload.
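
The same request can also be built from Python with the standard library, which makes it easier to inspect the payload before sending it. This is a minimal sketch: the `/train` path and the JSON content type come from this notebook, while the helper name `build_train_request` is hypothetical and nothing is assumed about the response format.

```python
import json
import urllib.request

def build_train_request(host, port, config):
    """Build (but do not send) a POST request carrying `config` as JSON to /train."""
    body = json.dumps(config).encode("utf-8")
    return urllib.request.Request(
        url=f"http://{host}:{port}/train",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

request = build_train_request("0.0.0.0", 8002, {"paths": {}, "models": []})
print(request.get_method(), request.full_url)  # prints POST http://0.0.0.0:8002/train
```

Passing the returned object to `urllib.request.urlopen` sends it once the container is running.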

In [None]:
!curl -X POST "http://{HOST}:$NIM_HTTP_PORT/train"   -H "Content-Type: application/json"   -d @{training_config_file_name}

In [None]:
container_ids = !docker ps --filter "name={CONTAINER_NAME}" -q
if len(container_ids) > 0:
    !docker stop {CONTAINER_NAME}

#### Make sure that `python_backend_model_repository` has been created with right contents
According to the training configuration file defined earlier, if the training runs successfully, a folder titled `python_backend_model_repository` containing a Python backend model and a configuration file will be created under `{model_output_dir}`. Its contents should look like this:

```sh
python_backend_model_repository/
└── prediction_and_shapley
    ├── 1
    │   ├── embedding_based_xgboost.json
    │   ├── model.py
    │   └── state_dict_gnn_model.pth
    └── config.pbtxt

```
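
The layout can also be checked programmatically. The sketch below tests only what the tree above shows: a `config.pbtxt` next to at least one non-empty, numerically named version folder. The helper name `check_triton_model_dir` is hypothetical.

```python
import os

def check_triton_model_dir(model_dir):
    """Return a list of problems found in a single model directory."""
    problems = []
    if not os.path.isfile(os.path.join(model_dir, "config.pbtxt")):
        problems.append("config.pbtxt is missing")
    # Version folders are subdirectories whose names are integers, such as "1"
    versions = [d for d in os.listdir(model_dir)
                if d.isdigit() and os.path.isdir(os.path.join(model_dir, d))]
    if not versions:
        problems.append("no numeric version directory found")
    for version in versions:
        if not os.listdir(os.path.join(model_dir, version)):
            problems.append(f"version directory {version} is empty")
    return problems
```

For this notebook, the directory to check would be `prediction_and_shapley` inside `python_backend_model_repository`; an empty list means the basic layout is in place.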


In [None]:
print_tree(os.path.join(model_output_dir, 'python_backend_model_repository'))

----
# Step 4:  Serve your python backend model using NVIDIA Dynamo-Triton
__!Important__: Change MODEL_REPO_PATH to point to `{model_output_dir}` / `python_backend_model_repository` if you used a different path in your training configuration file

#### Install NVIDIA Dynamo-Triton Client

In [None]:
!pip install 'tritonclient[all]'

In [None]:
import tritonclient.grpc as triton_grpc
import tritonclient.http as httpclient
from tritonclient import utils as triton_utils


##### Replace HOST with the actual URL where your NVIDIA Dynamo-Triton server is hosted.


In [None]:
HTTP_PORT = 8005
GRPC_PORT = 8006
METRICS_PORT = 8007

### Serve your models with NVIDIA Dynamo-Triton
- Pull the NVIDIA Dynamo-Triton docker image
- Deploy server with models and configuration files (produced by the training container)
- Double check that your `python_backend_model_repository` folder, located under `${model_output_dir}`, has the following structure
```sh
python_backend_model_repository/
└── prediction_and_shapley
    ├── 1
    │   ├── embedding_based_xgboost.json
    │   ├── model.py
    │   └── state_dict_gnn_model.pth
    └── config.pbtxt
```

In [None]:
# NVIDIA Dynamo-Triton image
TRITON_IMAGE = 'nvcr.io/nvidia/tritonserver:25.04-py3'

# Pull the Dynamo-Triton image
!docker pull {TRITON_IMAGE}

# Stop and remove any existing container
container_ids = !docker ps --filter "name=tritonserver" -q
if len(container_ids) > 0:
    !docker stop tritonserver
    !docker rm tritonserver


In [None]:
# Run the container

MODEL_REPO_PATH = os.path.join(model_output_dir, 'python_backend_model_repository')
if BREV:
    HOST_MODEL_REPO_PATH = MODEL_REPO_PATH.replace('/root/verb-workspace', '/home/ubuntu/workspace')
else:
    HOST_MODEL_REPO_PATH = MODEL_REPO_PATH

!docker run --gpus "device=0" -d -p {HTTP_PORT}:{HTTP_PORT} -p {GRPC_PORT}:{GRPC_PORT} \
    -v {HOST_MODEL_REPO_PATH}:/models --name tritonserver {TRITON_IMAGE} tritonserver \
    --model-repository=/models --exit-timeout-secs=6000 --http-port={HTTP_PORT} --grpc-port={GRPC_PORT} \
    --metrics-port={METRICS_PORT}

### URLs for GRPC and HTTP request to the inference server

In [None]:
client_grpc = triton_grpc.InferenceServerClient(url=f'{HOST}:{GRPC_PORT}')
client_http = httpclient.InferenceServerClient(url=f'{HOST}:{HTTP_PORT}')

### Wait for NVIDIA Dynamo-Triton to install packages and come online
**NOTE**: This cell can take a few minutes to execute.
 If the following cell keeps running even after you see `Started HTTPService at {HOST}:{HTTP_PORT}` in the log, you can interrupt the execution of this cell and continue from the next cell.

In [None]:
import subprocess
container_name = "tritonserver"

while True:
    client_grpc = triton_grpc.InferenceServerClient(url=f'{HOST}:{GRPC_PORT}')
    try:
        if client_grpc.is_server_ready():
            break
    except triton_utils.InferenceServerException as e:
        pass
    try:
        # Run the docker logs command with the --tail option
        output = subprocess.check_output(["docker", "logs", "--tail", "10", container_name])
        print(output.decode("utf-8"))
    except subprocess.CalledProcessError as e:
        print("Error retrieving logs:", e)
    time.sleep(10)
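
The loop above keeps polling until the server reports ready, so it never returns if the server fails to start. A bounded variant could look like the following sketch; the helper name `wait_until` and its default timeout are assumptions.

```python
import time

def wait_until(is_ready, timeout_s=600.0, interval_s=10.0,
               sleep=time.sleep, clock=time.monotonic):
    """Poll is_ready() until it returns True or timeout_s elapses; return the outcome."""
    deadline = clock() + timeout_s
    while True:
        try:
            if is_ready():
                return True
        except Exception:
            # Treat any error (for example, connection refused) as "not ready yet"
            pass
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

With the client from this notebook, the predicate could be `lambda: client_grpc.is_server_ready()`; a `False` result is then a cue to inspect `docker logs tritonserver`.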

### Check if NVIDIA Dynamo-Triton is running properly

In [None]:
!docker logs tritonserver

## Prediction without computing Shapley values

### Read preprocessed input transactions to send query to NVIDIA Dynamo-Triton

In [None]:
import os
import pandas as pd
import numpy as np

model_name = "prediction_and_shapley"
test_X_path = os.path.join(gnn_data_dir, "test_gnn", "nodes/node.csv") # already preprocessed data
test_X = pd.read_csv(test_X_path)
X = test_X.values.astype(np.float32)

test_y_path = os.path.join(gnn_data_dir, "test_gnn", "nodes/node_label.csv") # already preprocessed data
test_y = pd.read_csv(test_y_path)
y = test_y.values.astype(np.float32)

test_ei_path = os.path.join(gnn_data_dir, "test_gnn", "edges/node_to_node.csv") 
test_ei = pd.read_csv(test_ei_path)


In [None]:
edge_index = test_ei.values.T.astype(np.int64)
compute_shap = np.array([False], dtype=bool) 


### Evaluate performance for a batch of transactions

In [None]:
from sklearn.metrics import (
    confusion_matrix,
    accuracy_score,
    precision_score,
    recall_score,
    f1_score)
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay


In [None]:
# Decision threshold to flag a transaction as fraud
# Change this value to trade off precision against recall
decision_threshold = 0.5
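
To see how the threshold trades precision against recall, the following sketch computes both on a handful of made-up scores. The data and the helper name `precision_recall_at` are purely illustrative; the comparison uses the same strict `>` rule as the evaluation code in this notebook.

```python
def precision_recall_at(scores, labels, threshold):
    """Return (precision, recall) when scores above `threshold` are flagged as fraud."""
    flagged = [s > threshold for s in scores]
    tp = sum(1 for f, y in zip(flagged, labels) if f and y == 1)
    fp = sum(1 for f, y in zip(flagged, labels) if f and y == 0)
    fn = sum(1 for f, y in zip(flagged, labels) if not f and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Made-up scores and labels, for illustration only
scores = [0.10, 0.40, 0.60, 0.80, 0.95]
labels = [0, 1, 0, 1, 1]
for threshold in (0.3, 0.5, 0.9):
    print(threshold, precision_recall_at(scores, labels, threshold))
```

On this toy data, raising the threshold from 0.3 to 0.9 flags fewer transactions: precision rises while recall falls.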

In [None]:

def compute_score_for_batch(edge_idx, X, y, batch_size, decision_threshold = 0.5, shap=False, feature_mask=None):
    edge_index = edge_idx.T.astype(np.int64)
    compute_shap = np.array([shap], dtype=bool) # Flag telling the server whether to compute Shapley values
    
    with httpclient.InferenceServerClient(f"{HOST}:{HTTP_PORT}") as client:
        input_features = httpclient.InferInput("NODE_FEATURES", X.shape, datatype="FP32")
        input_features.set_data_from_numpy(X)

        input_edge_indices = httpclient.InferInput("EDGE_INDEX", edge_index.shape, datatype="INT64")
        input_edge_indices.set_data_from_numpy(edge_index)

        # The server requires a feature mask even when Shapley values are not requested.
        # In that case it can be a dummy array of ints, but its length must equal the number of features.

        if shap:
            assert X.shape[1] == len(feature_mask)
            feature_mask = np.array(feature_mask).astype(np.int32)
        else:
            feature_mask = np.zeros(X.shape[1]).astype(np.int32)

        input_feature_mask = httpclient.InferInput("FEATURE_MASK", feature_mask.shape, datatype="INT32")
        input_feature_mask.set_data_from_numpy(feature_mask)

        compute_shap_flag = httpclient.InferInput("COMPUTE_SHAP", compute_shap.shape, datatype="BOOL")
        compute_shap_flag.set_data_from_numpy(compute_shap)
        
        outputs = [
            httpclient.InferRequestedOutput("PREDICTION"),
            httpclient.InferRequestedOutput("SHAP_VALUES")
        ]

        # Send query to the server
        response = client.infer(
            model_name,
            inputs=[input_features, input_edge_indices, compute_shap_flag, input_feature_mask],
            request_id=str(1),
            outputs=outputs,
            timeout= 3000
            )
        
    predictions = response.as_numpy('PREDICTION')

    assert y.sum() == y[-batch_size:].sum()
    if not shap:
        y_pred = (predictions > decision_threshold).astype(int)
        
        # Compute evaluation metrics
        accuracy = accuracy_score(y[-batch_size:], y_pred[-batch_size:])
        precision = precision_score(y[-batch_size:], y_pred[-batch_size:], zero_division=0)
        recall = recall_score(y[-batch_size:], y_pred[-batch_size:], zero_division=0)
        f1 = f1_score(y[-batch_size:], y_pred[-batch_size:], zero_division=0)



        classes = ['Non-Fraud', 'Fraud']
        columns = pd.MultiIndex.from_product([["Predicted"], classes])
        index = pd.MultiIndex.from_product([["Actual"], classes])

        conf_mat = confusion_matrix(y[-batch_size:], y_pred[-batch_size:])
        cm_df = pd.DataFrame(conf_mat, index=index, columns=columns)
        print(cm_df)

        # Plot the confusion matrix directly from predictions
        disp = ConfusionMatrixDisplay.from_predictions(
            y[-batch_size:], y_pred[-batch_size:], display_labels=classes)
        disp.ax_.set_title('Confusion Matrix')
        plt.show()

        print("----Summary---")
        print(f"Accuracy: {accuracy:.4f}")
        print(f"Precision: {precision:.4f}")
        print(f"Recall: {recall:.4f}")
        print(f"F1 Score: {f1:.4f}")

    return predictions, response.as_numpy('SHAP_VALUES') if shap else None


#### Sample a batch of transactions from the test data

In [None]:
# NOTE:
# In the preprocessing code, zero-based user node indices come first, 
# followed by merchant node indices, and then transaction node indices.

# Each transaction is represented by four edges,
    # user to transaction,
    # transaction to merchant,
    # transaction to user
    # merchant to transaction

# Each transaction involves three nodes - a user, a transaction and a merchant


NR_TX =  test_ei.shape[0]//4
batch_size = NR_TX

transaction_batch = np.random.choice(NR_TX, size=batch_size, replace=False)
idx_of_edges = transaction_batch.reshape(-1, 1) + np.arange(4)*NR_TX
edges_batch = test_ei.iloc[idx_of_edges.ravel()]
unique_vertices, renumbered_edges =  np.unique(edges_batch.values, return_inverse=True)
eidx = renumbered_edges.reshape(edges_batch.shape)
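
The indexing in the cell above relies on two ideas: the edge table appears to be stored as four blocks of `NR_TX` rows (one block per edge type), and the sampled subgraph is renumbered so that node ids are contiguous. The pure-Python sketch below illustrates both on toy values; the block layout is inferred from the indexing code rather than documented, and the helper names are hypothetical.

```python
def edge_rows_for(transactions, nr_tx, edges_per_tx=4):
    """Rows of the edge table belonging to each transaction, one row per edge type."""
    return [t + k * nr_tx for t in transactions for k in range(edges_per_tx)]

def renumber(edges):
    """Relabel node ids to 0..n-1 in sorted order, like np.unique(..., return_inverse=True)."""
    unique_vertices = sorted({v for edge in edges for v in edge})
    new_id = {old: new for new, old in enumerate(unique_vertices)}
    return unique_vertices, [(new_id[src], new_id[dst]) for src, dst in edges]

# Toy graph: 3 transactions, so the edge table has 4 * 3 = 12 rows
print(edge_rows_for([1], nr_tx=3))  # prints [1, 4, 7, 10]
print(renumber([(5, 42), (42, 17), (42, 5), (17, 42)]))
# prints ([5, 17, 42], [(0, 2), (2, 1), (2, 0), (1, 2)])
```

Because transaction node indices come after user and merchant indices, sorting keeps the transaction nodes at the end of the renumbered list, which is why the evaluation code slices with `[-batch_size:]`.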

In [None]:
assert y[unique_vertices][-batch_size:].sum() == y[unique_vertices].sum()
y[unique_vertices][-batch_size:].sum(), y[unique_vertices].sum()

In [None]:
predictions, _ = compute_score_for_batch(eidx, X[unique_vertices], y[unique_vertices], batch_size=batch_size)

### Compute Shapley values of different features for a transaction
NOTE: Shapley computation is very expensive

In [None]:
NR_TX =  test_ei.shape[0]//4
batch_size = 1 

transaction_batch = np.random.choice(NR_TX, size=batch_size, replace=False)
idx_of_edges = transaction_batch.reshape(-1, 1) + np.arange(4)*NR_TX

edges_batch = test_ei.iloc[idx_of_edges.ravel()]
unique_vertices, renumbered_edges =  np.unique(edges_batch.values, return_inverse=True)


predictions, shap_values = compute_score_for_batch(renumbered_edges.reshape(edges_batch.shape), X[unique_vertices], y[unique_vertices], batch_size=batch_size, shap=True, feature_mask=feature_mask)

In [None]:
feature_to_attribution_map = dict(zip(feature_mask, shap_values[2]))
feature_name_to_id_map = {v:k for k, v in mask_mapping.items()}

#### Shapley values for different features

In [None]:
{feature_name_to_id_map[k]: f"{v:.3f}" for k, v in feature_to_attribution_map.items()}
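
Attributions are often easier to read when sorted by magnitude, since large negative values matter as much as large positive ones. A minimal sketch follows; the feature names and values are made up for illustration, and `rank_by_magnitude` is a hypothetical helper.

```python
def rank_by_magnitude(attributions):
    """Sort (feature, value) pairs by absolute attribution, largest first."""
    return sorted(attributions.items(), key=lambda item: abs(item[1]), reverse=True)

# Made-up attribution values, for illustration only
example = {"amount": -0.42, "hour": 0.05, "merchant_state": 0.31}
print(rank_by_magnitude(example))
# prints [('amount', -0.42), ('merchant_state', 0.31), ('hour', 0.05)]
```

The same function can be applied to a dictionary that maps feature names to numeric Shapley values.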