```
Copyright 2022 IBM Corporation

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
```

# Graph Feature Extraction for Anti-Money Laundering

The Snap ML GraphFeaturePreprocessor is a scikit-learn compatible preprocessor that enables scalable, real-time feature extraction from graph-structured data. It provides utilities for creating and updating in-memory graphs and for extracting new features from these graphs. The goal of this example is to show how to use the API of this preprocessor. As input, we use a synthetic dataset in tabular format where each row represents a financial transaction. Four features are available for each transaction: transaction ID, source account ID, target account ID, and transaction timestamp.

In [2]:
# Import the Graph Feature Preprocessor from Snap ML
from snapml import GraphFeaturePreprocessor

# Import other libraries
import numpy as np
import time
import json
import pandas as pd
from IPython.display import display

pd.options.display.max_columns = None

Here we assume that the user has access to a set of (labeled) transactions with raw features that could be used to train a machine learning (ML) model, e.g., for fraud detection. The user extracts graph features with the Graph Feature Preprocessor and appends them to the raw features of each transaction. The enriched feature set is then used to train an ML model. The main steps of this use case are shown below:

<div> <img src="img/gfp-use-case1.png" width="1000"> </div>


In [15]:
# Path to the file that contains financial transactions, e.g., used for training ML models
train_graph_path = "../datasets/graph-feature-preprocessor/aml_custom_train.txt"

print("Loading the transactions ")
X_train = np.loadtxt(train_graph_path, dtype=np.float64, delimiter=" ", comments="#", usecols=range(4))
print("Input dataset shape: ", X_train.shape)

df = pd.DataFrame(X_train, columns=['transactionID', 'sourceAccountID', 'targetAccountID', 'timestamp'])
display(df)

Loading the transactions 
Input dataset shape:  (12, 4)


Unnamed: 0,transactionID,sourceAccountID,targetAccountID,timestamp
0,0.0,0.0,1.0,0.0
1,0.0,0.0,1.0,0.0
2,0.0,0.0,1.0,0.0
3,1.0,1.0,2.0,1.0
4,2.0,2.0,3.0,2.0
5,3.0,1.0,3.0,3.0
6,4.0,3.0,1.0,4.0
7,4.0,3.0,1.0,4.0
8,4.0,3.0,1.0,4.0
9,5.0,3.0,0.0,5.0
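
The `np.loadtxt` call above reads a space-delimited text file, ignores anything after `#`, and keeps only the first four columns. As a rough, dependency-free illustration of that file format, the sketch below parses the same kind of input in plain Python. The helper name `load_transactions` and the sample lines are hypothetical and are not taken from the dataset.

```python
import io

def load_transactions(lines, num_cols=4, comment="#"):
    """Parse space-delimited rows, keeping the first num_cols fields as floats."""
    rows = []
    for line in lines:
        # Drop everything after the comment marker, then surrounding whitespace.
        line = line.split(comment, 1)[0].strip()
        if not line:
            continue  # skip blank lines and comment-only lines
        fields = line.split(" ")
        if len(fields) < num_cols:
            raise ValueError("expected at least %d fields, got %d" % (num_cols, len(fields)))
        rows.append([float(field) for field in fields[:num_cols]])
    return rows

# Hypothetical file content: a comment line, then rows with one extra column
# that is ignored, mirroring usecols=range(4).
sample = io.StringIO(
    "# transactionID sourceAccountID targetAccountID timestamp extra\n"
    "0 0 1 0 99.5\n"
    "1 1 2 1 12.0\n"
)
rows = load_transactions(sample)
print(rows)  # [[0.0, 0.0, 1.0, 0.0], [1.0, 1.0, 2.0, 1.0]]
```

In the notebook itself, `np.loadtxt` remains the more convenient choice because it returns the `float64` array that the preprocessor consumes.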


In [30]:
# The following dictionary defines the configuration parameters of the Graph Feature Preprocessor

params = {
    "num_threads": 4,             # number of software threads to be used (important for performance)
    "time_window": 16,            # time window used if no pattern was specified
    
    "vertex_stats": True,         # produce vertex statistics
    "vertex_stats_cols": [3],     # produce vertex statistics using the selected input columns
    
    # features: 0:fan,1:deg,2:ratio,3:avg,4:sum,5:min,6:max,7:median,8:var,9:skew,10:kurtosis
    "vertex_stats_feats": [0, 1, 2, 3, 4, 8, 9, 10],  # fan,deg,ratio,avg,sum,var,skew,kurtosis
    
    # fan in/out parameters
    "fan": True,
    "fan_tw": 16,
    "fan_bins": [y+2 for y in range(2)],
    
    # in/out degree parameters
    "degree": True,
    "degree_tw": 16,
    "degree_bins": [y+2 for y in range(2)],
    
    # scatter gather parameters
    "scatter-gather": True,
    "scatter-gather_tw": 16,
    "scatter-gather_bins": [y+2 for y in range(2)],
    
    # temporal cycle parameters
    "temp-cycle": True,
    "temp-cycle_tw": 16,
    "temp-cycle_bins": [y+2 for y in range(2)],
    
    # length-constrained simple cycle parameters
    "lc-cycle": False,
    "lc-cycle_tw": 16,
    "lc-cycle_len": 8,
    "lc-cycle_bins": [y+2 for y in range(2)],
}
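
Each `*_bins` entry above evaluates to `[2, 3]`. Judging from the column names that appear in the enriched output later in this notebook (for example `fan_in_bins_2-3` and `fan_in_bins_3-inf`), these values act as histogram bin edges, with the last bin open-ended. The sketch below illustrates that reading. It is an assumption about the binning semantics, not Snap ML's implementation, and `bin_index` is a hypothetical helper.

```python
from bisect import bisect_right

def bin_index(value, bin_edges):
    """Return the index of the bin that holds value, or None if it is below the first edge.

    Assumption: bin i covers [bin_edges[i], bin_edges[i + 1]) and the last bin
    covers [bin_edges[-1], inf).
    """
    idx = bisect_right(bin_edges, value) - 1
    return idx if idx >= 0 else None

fan_bins = [y + 2 for y in range(2)]  # same expression as in params
print(fan_bins)                       # [2, 3]
print(bin_index(1, fan_bins))         # None (below the first edge)
print(bin_index(2, fan_bins))         # 0    (bin "2-3")
print(bin_index(7, fan_bins))         # 1    (bin "3-inf")
```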

In [31]:
# Create a Graph Feature Preprocessor, set its configuration using the above dictionary and verify it

print("Creating a graph feature preprocessor ")
gp = GraphFeaturePreprocessor()

print("Setting the parameters of the graph feature preprocessor ")
gp.set_params(params)

print("Graph feature preprocessor parameters: ", json.dumps(gp.get_params(), indent=4))

Creating a graph feature preprocessor 
Setting the parameters of the graph feature preprocessor 
Graph feature preprocessor parameters:  {
    "num_threads": 4,
    "time_window": 16,
    "vertex_stats": true,
    "vertex_stats_cols": [
        3
    ],
    "vertex_stats_feats": [
        0,
        1,
        2,
        3,
        4,
        8,
        9,
        10
    ],
    "fan": true,
    "fan_tw": 16,
    "fan_bins": [
        2,
        3
    ],
    "degree": true,
    "degree_tw": 16,
    "degree_bins": [
        2,
        3
    ],
    "scatter-gather": true,
    "scatter-gather_tw": 16,
    "scatter-gather_bins": [
        2,
        3
    ],
    "temp-cycle": true,
    "temp-cycle_tw": 16,
    "temp-cycle_bins": [
        2,
        3
    ],
    "lc-cycle": false,
    "lc-cycle_tw": 16,
    "lc-cycle_len": 8,
    "lc-cycle_bins": [
        2,
        3
    ]
}


In [32]:
print("Enriching the transactions with new graph features ")
print("Raw dataset shape: ", X_train.shape)

# the fit_transform and transform functions are equivalent
# these functions can run on single transactions or on batches of transactions
X_train_enriched = gp.fit_transform(X_train.astype("float64")) 

print("Enriched dataset shape: ", X_train_enriched.shape)

Enriching the transactions with new graph features 
Raw dataset shape:  (12, 4)
Enriched dataset shape:  (12, 48)


We define a helper function to inspect the newly generated graph-based features for a given transaction:

In [33]:
def print_enriched_transaction(transaction, params):
    colnames = []
    
    # add raw features names
    colnames.append("transactionID")
    colnames.append("sourceAccountID")
    colnames.append("targetAccountID")
    colnames.append("timestamp")
    
    # add features names for the graph patterns
    for pattern in ['fan', 'degree', 'scatter-gather', 'temp-cycle', 'lc-cycle']:
        if params.get(pattern):
            bins = params[pattern + '_bins']
            # fan and degree have separate in/out features; the other patterns have one set
            prefixes = [pattern + "_in", pattern + "_out"] if pattern in ['fan', 'degree'] else [pattern]
            for prefix in prefixes:
                # one column per pair of consecutive bin edges, plus an open-ended last bin
                for lo, hi in zip(bins[:-1], bins[1:]):
                    colnames.append(prefix + "_bins_" + str(lo) + "-" + str(hi))
                colnames.append(prefix + "_bins_" + str(bins[-1]) + "-inf")

    vert_feat_names = ["fan","deg","ratio","avg","sum","min","max","median","var","skew","kurtosis"]

    # add features names for the vertex statistics
    for orig in ['source', 'dest']:
        for direction in ['out', 'in']:
            # add fan, deg, and ratio features
            for k in [0, 1, 2]:
                if k in params["vertex_stats_feats"]:
                    feat_name = orig + "_" + vert_feat_names[k] + "_" + direction
                    colnames.append(feat_name)
            for col in params["vertex_stats_cols"]:
                # add avg, sum, min, max, median, var, skew, and kurtosis features
                for k in [3, 4, 5, 6, 7, 8, 9, 10]:
                    if k in params["vertex_stats_feats"]:
                        feat_name = orig + "_" + vert_feat_names[k] + "_col" + str(col) + "_" + direction
                        colnames.append(feat_name)

    df = pd.DataFrame(transaction, columns=colnames)
    display(df)
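
The column-naming logic of this helper also accounts for the enriched shape `(12, 48)` reported earlier. The sketch below counts columns the same way the helper names them. `expected_num_columns` is a hypothetical function written for this notebook, not part of the Snap ML API, and the check of the `vertex_stats` flag is an assumption that the helper itself does not make.

```python
def expected_num_columns(params, num_raw_cols=4):
    """Count output columns using the same rules as the column-naming helper."""
    total = num_raw_cols

    # Graph patterns: one column per bin; fan and degree have in and out variants.
    for pattern in ["fan", "degree", "scatter-gather", "temp-cycle", "lc-cycle"]:
        if params.get(pattern):
            directions = 2 if pattern in ("fan", "degree") else 1
            total += directions * len(params[pattern + "_bins"])

    # Vertex statistics: source and destination accounts, outgoing and incoming
    # edges, i.e. 2 x 2 = 4 groups of features.
    if params.get("vertex_stats"):
        feats = params["vertex_stats_feats"]
        structural = sum(1 for k in feats if k in (0, 1, 2))  # fan, deg, ratio
        per_column = sum(1 for k in feats if k >= 3)          # avg ... kurtosis
        total += 4 * (structural + per_column * len(params["vertex_stats_cols"]))

    return total

# Trimmed copy of the configuration used in this notebook.
params = {
    "vertex_stats": True,
    "vertex_stats_cols": [3],
    "vertex_stats_feats": [0, 1, 2, 3, 4, 8, 9, 10],
    "fan": True, "fan_bins": [2, 3],
    "degree": True, "degree_bins": [2, 3],
    "scatter-gather": True, "scatter-gather_bins": [2, 3],
    "temp-cycle": True, "temp-cycle_bins": [2, 3],
    "lc-cycle": False, "lc-cycle_bins": [2, 3],
}

# 4 raw + (4 fan + 4 degree + 2 scatter-gather + 2 temp-cycle) + 4 * (3 + 5) = 48
print(expected_num_columns(params))  # 48
```

A check like this can catch a mismatch between the configuration and the column names before the `DataFrame` is built.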

In [34]:
print("Enriched transactions: ")
print_enriched_transaction(X_train_enriched, gp.get_params())

Enriched transactions: 


Unnamed: 0,transactionID,sourceAccountID,targetAccountID,timestamp,fan_in_bins_2-3,fan_in_bins_3-inf,fan_out_bins_2-3,fan_out_bins_3-inf,degree_in_bins_2-3,degree_in_bins_3-inf,degree_out_bins_2-3,degree_out_bins_3-inf,scatter-gather_bins_2-3,scatter-gather_bins_3-inf,temp-cycle_bins_2-3,temp-cycle_bins_3-inf,source_fan_out,source_deg_out,source_ratio_out,source_avg_col3_out,source_sum_col3_out,source_var_col3_out,source_skew_col3_out,source_kurtosis_col3_out,source_fan_in,source_deg_in,source_ratio_in,source_avg_col3_in,source_sum_col3_in,source_var_col3_in,source_skew_col3_in,source_kurtosis_col3_in,dest_fan_out,dest_deg_out,dest_ratio_out,dest_avg_col3_out,dest_sum_col3_out,dest_var_col3_out,dest_skew_col3_out,dest_kurtosis_col3_out,dest_fan_in,dest_deg_in,dest_ratio_in,dest_avg_col3_in,dest_sum_col3_in,dest_var_col3_in,dest_skew_col3_in,dest_kurtosis_col3_in
0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,2.0,0.0,0.0,3.0,2.0,2.0,1.0,3.0,6.0,9.0,0.0,1.0,2.0,2.0,1.0,6.0,12.0,1.0,0.0,1.0,2.0,2.0,1.0,2.0,4.0,1.0,0.0,1.0,2.0,2.0,1.0,2.0,4.0,4.0,0.0,1.0
1,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,2.0,0.0,0.0,3.0,2.0,2.0,1.0,3.0,6.0,9.0,0.0,1.0,2.0,2.0,1.0,6.0,12.0,1.0,0.0,1.0,2.0,2.0,1.0,2.0,4.0,1.0,0.0,1.0,2.0,2.0,1.0,2.0,4.0,4.0,0.0,1.0
2,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,2.0,0.0,0.0,3.0,2.0,2.0,1.0,3.0,6.0,9.0,0.0,1.0,2.0,2.0,1.0,6.0,12.0,1.0,0.0,1.0,2.0,2.0,1.0,2.0,4.0,1.0,0.0,1.0,2.0,2.0,1.0,2.0,4.0,4.0,0.0,1.0
3,1.0,1.0,2.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,2.0,0.0,0.0,3.0,2.0,2.0,1.0,2.0,4.0,1.0,0.0,1.0,2.0,2.0,1.0,2.0,4.0,4.0,0.0,1.0,2.0,2.0,1.0,4.5,9.0,6.25,0.0,1.0,2.0,2.0,1.0,3.5,7.0,6.25,0.0,1.0
4,2.0,2.0,3.0,2.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,2.0,0.0,0.0,3.0,2.0,2.0,1.0,4.5,9.0,6.25,0.0,1.0,2.0,2.0,1.0,3.5,7.0,6.25,0.0,1.0,2.0,2.0,1.0,4.5,9.0,0.25,0.0,1.0,2.0,2.0,1.0,2.5,5.0,0.25,0.0,1.0
5,3.0,1.0,3.0,3.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,2.0,0.0,1.0,1.0,2.0,2.0,1.0,2.0,4.0,1.0,0.0,1.0,2.0,2.0,1.0,2.0,4.0,4.0,0.0,1.0,2.0,2.0,1.0,4.5,9.0,0.25,0.0,1.0,2.0,2.0,1.0,2.5,5.0,0.25,0.0,1.0
6,4.0,3.0,1.0,4.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,2.0,0.0,1.0,1.0,2.0,2.0,1.0,4.5,9.0,0.25,0.0,1.0,2.0,2.0,1.0,2.5,5.0,0.25,0.0,1.0,2.0,2.0,1.0,2.0,4.0,1.0,0.0,1.0,2.0,2.0,1.0,2.0,4.0,4.0,0.0,1.0
7,4.0,3.0,1.0,4.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,2.0,0.0,1.0,1.0,2.0,2.0,1.0,4.5,9.0,0.25,0.0,1.0,2.0,2.0,1.0,2.5,5.0,0.25,0.0,1.0,2.0,2.0,1.0,2.0,4.0,1.0,0.0,1.0,2.0,2.0,1.0,2.0,4.0,4.0,0.0,1.0
8,4.0,3.0,1.0,4.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,2.0,0.0,1.0,1.0,2.0,2.0,1.0,4.5,9.0,0.25,0.0,1.0,2.0,2.0,1.0,2.5,5.0,0.25,0.0,1.0,2.0,2.0,1.0,2.0,4.0,1.0,0.0,1.0,2.0,2.0,1.0,2.0,4.0,4.0,0.0,1.0
9,5.0,3.0,0.0,5.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,2.0,0.0,0.0,3.0,2.0,2.0,1.0,4.5,9.0,0.25,0.0,1.0,2.0,2.0,1.0,2.5,5.0,0.25,0.0,1.0,2.0,2.0,1.0,3.0,6.0,9.0,0.0,1.0,2.0,2.0,1.0,6.0,12.0,1.0,0.0,1.0


This newly enriched set of transactions can now be used to train an ML model. Once trained, the model can be used for prediction (e.g., to detect anomalies) on new (unlabeled) transactions. The main steps of this use case are shown below:

<div> <img src="img/gfp-use-case2.png" width="1000"> </div>

In [35]:
# Path to the file that contains financial transactions used for testing
test_transactions_path = "../datasets/graph-feature-preprocessor/aml_custom_test.txt"

print("Loading the test transactions ")
X_test = np.loadtxt(test_transactions_path, dtype=np.float64, delimiter=" ", comments="#", usecols=range(4))
print("Input dataset shape: ", X_test.shape)

df = pd.DataFrame(X_test, columns=['transactionID', 'sourceAccountID', 'targetAccountID', 'timestamp'])
display(df)

Loading the test transactions 
Input dataset shape:  (8, 4)


Unnamed: 0,transactionID,sourceAccountID,targetAccountID,timestamp
0,8.0,8.0,9.0,8.0
1,9.0,9.0,10.0,9.0
2,10.0,10.0,11.0,10.0
3,11.0,9.0,11.0,11.0
4,12.0,11.0,9.0,12.0
5,13.0,11.0,8.0,13.0
6,14.0,8.0,10.0,14.0
7,15.0,10.0,8.0,15.0


In [36]:
print("Creating a graph feature preprocessor ")
gp = GraphFeaturePreprocessor()

print("Setting the parameters of the graph feature preprocessor ")
gp.set_params(params)

print("Creating the graph using the training transactions ")
gp.fit(X_train)  # this step is optional but recommended for capturing deeper graph features

# transform can run on single transactions or on batches of transactions
print("Enriching the test transactions with new graph features ")
X_test_enriched = gp.transform(X_test.astype("float64"))
print_enriched_transaction(X_test_enriched, gp.get_params())

Creating a graph feature preprocessor 
Setting the parameters of the graph feature preprocessor 
Creating the graph using the training transactions 
Enriching the test transactions with new graph features 


Unnamed: 0,transactionID,sourceAccountID,targetAccountID,timestamp,fan_in_bins_2-3,fan_in_bins_3-inf,fan_out_bins_2-3,fan_out_bins_3-inf,degree_in_bins_2-3,degree_in_bins_3-inf,degree_out_bins_2-3,degree_out_bins_3-inf,scatter-gather_bins_2-3,scatter-gather_bins_3-inf,temp-cycle_bins_2-3,temp-cycle_bins_3-inf,source_fan_out,source_deg_out,source_ratio_out,source_avg_col3_out,source_sum_col3_out,source_var_col3_out,source_skew_col3_out,source_kurtosis_col3_out,source_fan_in,source_deg_in,source_ratio_in,source_avg_col3_in,source_sum_col3_in,source_var_col3_in,source_skew_col3_in,source_kurtosis_col3_in,dest_fan_out,dest_deg_out,dest_ratio_out,dest_avg_col3_out,dest_sum_col3_out,dest_var_col3_out,dest_skew_col3_out,dest_kurtosis_col3_out,dest_fan_in,dest_deg_in,dest_ratio_in,dest_avg_col3_in,dest_sum_col3_in,dest_var_col3_in,dest_skew_col3_in,dest_kurtosis_col3_in
0,8.0,8.0,9.0,8.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,2.0,0.0,0.0,3.0,2.0,2.0,1.0,11.0,22.0,9.0,0.0,1.0,2.0,2.0,1.0,14.0,28.0,1.0,0.0,1.0,2.0,2.0,1.0,10.0,20.0,1.0,0.0,1.0,2.0,2.0,1.0,10.0,20.0,4.0,0.0,1.0
1,9.0,9.0,10.0,9.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,2.0,0.0,0.0,3.0,2.0,2.0,1.0,10.0,20.0,1.0,0.0,1.0,2.0,2.0,1.0,10.0,20.0,4.0,0.0,1.0,2.0,2.0,1.0,12.5,25.0,6.25,0.0,1.0,2.0,2.0,1.0,11.5,23.0,6.25,0.0,1.0
2,10.0,10.0,11.0,10.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,2.0,0.0,0.0,3.0,2.0,2.0,1.0,12.5,25.0,6.25,0.0,1.0,2.0,2.0,1.0,11.5,23.0,6.25,0.0,1.0,2.0,2.0,1.0,12.5,25.0,0.25,0.0,1.0,2.0,2.0,1.0,10.5,21.0,0.25,0.0,1.0
3,11.0,9.0,11.0,11.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,2.0,0.0,1.0,1.0,2.0,2.0,1.0,10.0,20.0,1.0,0.0,1.0,2.0,2.0,1.0,10.0,20.0,4.0,0.0,1.0,2.0,2.0,1.0,12.5,25.0,0.25,0.0,1.0,2.0,2.0,1.0,10.5,21.0,0.25,0.0,1.0
4,12.0,11.0,9.0,12.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,2.0,0.0,1.0,1.0,2.0,2.0,1.0,12.5,25.0,0.25,0.0,1.0,2.0,2.0,1.0,10.5,21.0,0.25,0.0,1.0,2.0,2.0,1.0,10.0,20.0,1.0,0.0,1.0,2.0,2.0,1.0,10.0,20.0,4.0,0.0,1.0
5,13.0,11.0,8.0,13.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,2.0,0.0,0.0,3.0,2.0,2.0,1.0,12.5,25.0,0.25,0.0,1.0,2.0,2.0,1.0,10.5,21.0,0.25,0.0,1.0,2.0,2.0,1.0,11.0,22.0,9.0,0.0,1.0,2.0,2.0,1.0,14.0,28.0,1.0,0.0,1.0
6,14.0,8.0,10.0,14.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,2.0,0.0,1.0,1.0,2.0,2.0,1.0,11.0,22.0,9.0,0.0,1.0,2.0,2.0,1.0,14.0,28.0,1.0,0.0,1.0,2.0,2.0,1.0,12.5,25.0,6.25,0.0,1.0,2.0,2.0,1.0,11.5,23.0,6.25,0.0,1.0
7,15.0,10.0,8.0,15.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,2.0,0.0,1.0,1.0,2.0,2.0,1.0,12.5,25.0,6.25,0.0,1.0,2.0,2.0,1.0,11.5,23.0,6.25,0.0,1.0,2.0,2.0,1.0,11.0,22.0,9.0,0.0,1.0,2.0,2.0,1.0,14.0,28.0,1.0,0.0,1.0


Now the enriched transactions can be used as input to the previously trained ML model.
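
The code comments above note that `transform` can run on single transactions or on batches. In a streaming setting, one way to feed transactions in batches is a small generator such as the one sketched below. `iter_batches` is a hypothetical helper, and the stand-in rows only mimic the shape of `X_test`.

```python
def iter_batches(rows, batch_size):
    """Yield consecutive slices of rows, each with at most batch_size entries."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]

# Stand-in data: eight rows of four values, like X_test.
# Slicing a 2-D NumPy array row-wise works the same way.
rows = [[float(i)] * 4 for i in range(8)]
batch_sizes = [len(batch) for batch in iter_batches(rows, 3)]
print(batch_sizes)  # [3, 3, 2]
```

In this notebook, each batch would then be passed to `gp.transform(batch)` in timestamp order. Whether batch-wise results match a single call over all rows is not verified here.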