# Experimental pipeline in action

Before the experimental pipeline starts, a few global parameters need to be defined. We devised a manual 5-fold out-of-time validation by dividing the dataset with a rolling-window approach; `timeframe` specifies which of the resulting timeframes is selected. The `undersampling_rate` is the graph-level undersampling rate, defined as the desired ratio of fraudulent to legitimate transactions (`None` disables undersampling). The `embedding_size` defines the dimension of the embeddings learned by our inductive graph representation learners, and `add_additional_data` is a boolean indicating whether the original transaction features are added to the transaction node embeddings before the downstream classification model is trained and evaluated.

In [1]:
# Global parameters:
timeframe = 4
undersampling_rate = None
embedding_size = 64
add_additional_data = True

### 1. Loading the Credit Card Transaction Data

Load numeric, preprocessed transaction data. 

In [2]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
import pandas as pd 
import numpy as np

df = pd.read_csv("/Users/raf/googledrive/DOC/data/FUCC/ccf_preprocessed.csv", index_col = "Unnamed: 0", parse_dates=[7])

In [3]:
df.head()

| Unnamed: 0 | TX_ID | CARD_PAN_ID | TERM_MIDUID | TERM_MCC | TERM_COUNTRY | TX_AMOUNT | TX_DATETIME | TX_ACCEPTED | TX_FRAUD | COUNTRY_CATEGORY | ... | COUNTRY_EU | COUNTRY_NB | COUNTRY_ROW | COUNTRY_USA | month | day | hour | minute | second | weekday |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | t1 | cF795A68EEDA3B68BFBB543B082EC9607C3B13C2A0190E... | m842904041e+06 | 7399 | BEL | 0.0 | 2013-10-01 01:00:01 | False | False | BE | ... | 0 | 0 | 0 | 0 | 10 | 1 | 1 | 0 | 1 | 1 |
| 1 | t2 | cAC83FDA148FE02DF7A582131ABE4AF1BD7589A13C7521... | m0001740300759911 | 4816 | USA | 7.37 | 2013-10-01 01:00:06 | True | False | USA | ... | 0 | 0 | 0 | 1 | 10 | 1 | 1 | 0 | 6 | 1 |
| 2 | t3 | c1CD10E1CC65E651ECFA881CA89A62DE06CAF75077D643... | m2070050002009250 | 5735 | LUX | 6.25 | 2013-10-01 01:00:08 | True | False | NB | ... | 0 | 1 | 0 | 0 | 10 | 1 | 1 | 0 | 8 | 1 |
| 3 | t4 | c4ECA551366385232A79A5249733E42BB685F0E6237F0A... | m003020378135866130847 | 7523 | CAN | 7.18 | 2013-10-01 01:00:08 | True | False | ROW | ... | 0 | 0 | 1 | 0 | 10 | 1 | 1 | 0 | 8 | 1 |
| 4 | t5 | c74186F480DC2BED64AEC74D265B2B46C2EBA1BFE4F5AC... | m8000009421670050 | 4812 | USA | 154.93 | 2013-10-01 01:00:09 | True | False | USA | ... | 0 | 0 | 0 | 1 | 10 | 1 | 1 | 0 | 9 | 1 |


In [4]:
datecolumn = df.TX_DATETIME

In [5]:
from pandas.arrays import DatetimeArray

# Check that the date column can be interpreted as a pandas DatetimeArray.
if not isinstance(datecolumn, DatetimeArray):
    try:
        DatetimeArray(datecolumn)
    except (TypeError, ValueError):
        print("The provided date column cannot be parsed. Please provide a datetime column in pandas DatetimeArray format.")

In [9]:
from inductiveGRL.timeframes import Timeframes

from datetime import timedelta
tt = Timeframes(DatetimeArray(df.TX_DATETIME), step_size=timedelta(days=1), train_size=timedelta(days=5), test_size=timedelta(days=1))


In [10]:
tt.split()

([array([     0,      1,      2, ..., 503287, 503288, 503289]),
  array([100669, 100670, 100671, ..., 603732, 603733, 603734]),
  array([201593, 201594, 201595, ..., 704306, 704307, 704308]),
  array([302419, 302420, 302421, ..., 805115, 805116, 805117]),
  array([402834, 402835, 402836, ..., 905452, 905453, 905454]),
  array([ 503290,  503291,  503292, ..., 1006112, 1006113, 1006114]),
  array([ 603735,  603736,  603737, ..., 1106601, 1106602, 1106603]),
  array([ 704309,  704310,  704311, ..., 1206908, 1206909, 1206910]),
  array([ 805118,  805119,  805120, ..., 1307189, 1307190, 1307191]),
  array([ 905455,  905456,  905457, ..., 1407616, 1407617, 1407618]),
  array([1006115, 1006116, 1006117, ..., 1508100, 1508101, 1508102]),
  array([1106604, 1106605, 1106606, ..., 1608511, 1608512, 1608513]),
  array([1206911, 1206912, 1206913, ..., 1708823, 1708824, 1708825]),
  array([1307192, 1307193, 1307194, ..., 1809115, 1809116, 1809117]),
  array([1407619, 1407620, 1407621, ..., 1909349, 

In [None]:
type(tt.fold_size)

In [8]:
from pandas.arrays import DatetimeArray


In [None]:
DatetimeArray(df.TX_DATETIME)

In [None]:
dd[0]

In [None]:
da = pd.to_datetime(df.TX_DATETIME).values

In [None]:
(da < dd[5]).nonzero()[0]

In [None]:
dd[5] < da[5]

In [None]:
da[0]

In [None]:
type(df.TX_DATETIME[0])

#### 1.1. Selecting a Timeframe

To improve the robustness of our analysis, we devised a manual 5-fold out-of-time validation by dividing the dataset with a rolling-window approach. The setup has 5 timeframes of 17 days each, rolled forward in steps of 5 days. The last 5 days of each timeframe are used as the inductive set.

In [None]:
from inductiveGRL.timeframes import Timeframes
from datetime import datetime, timedelta

tf = Timeframes(df['TX_DATETIME'],step_size=5, window_size=17)
timeframe_indices = tf.get_timeframe_indices(timeframe)

print('number of days in dataset: ',tf.get_number_of_days())
print('number of timeframes derived from window and step size: ',tf.get_number_of_timeframes())

The `train_data` variable stores the data used to construct the graphs on which the representation learners are trained. The `inductive_data` variable is used to test the inductive performance of our representation learning algorithms. `days_to_hold_out` specifies the number of days held out at the end of the timeframe for the inductive set. When it is set equal to the `step_size` in the previous cell, the inductive sets of the different timeframes do not overlap.

In [None]:
days_to_hold_out = 5
train_data, inductive_data = tf.train_inductive_split(df.loc[timeframe_indices],days_to_hold_out)
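
The rolling-window arithmetic described above can be sketched without the `Timeframes` class. The helper `rolling_timeframes` below is hypothetical (it is not part of `inductiveGRL`), and the 37-day span is an assumption chosen so that a 17-day window with a 5-day step yields exactly five timeframes.

```python
from datetime import date, timedelta

def rolling_timeframes(first_day, n_days, step_size, window_size):
    """Return (start, end) date pairs of every full window; end is exclusive."""
    frames = []
    offset = 0
    while offset + window_size <= n_days:
        start = first_day + timedelta(days=offset)
        frames.append((start, start + timedelta(days=window_size)))
        offset += step_size
    return frames

# Assumed span: 37 days starting on 2013-10-01 (illustrative, not read from the data).
frames = rolling_timeframes(date(2013, 10, 1), n_days=37, step_size=5, window_size=17)

# Hold out the last days of every timeframe as its inductive window.
days_held_out = 5
inductive_windows = [(end - timedelta(days=days_held_out), end) for _, end in frames]

print(len(frames))  # number of timeframes
for (start, end), (ind_start, _) in zip(frames, inductive_windows):
    # train: [start, ind_start), inductive: [ind_start, end)
    print(start, ind_start, end)
```

Because the hold-out length equals the step size in this sketch, each inductive window starts exactly where the previous one ends, so no transaction is evaluated twice.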

#### 1.2. Selecting an Undersampling Rate

In [None]:
print('The distribution of fraud for the train data is:\n', train_data['TX_FRAUD'].value_counts())
print('The distribution of fraud for the inductive data is:\n', inductive_data['TX_FRAUD'].value_counts())

Given the highly imbalanced nature of our dataset, we undersample the train data with the `undersampling_rate` specified above (if any). We also make sure that the indices do not change while undersampling, since they are used as transaction node identifiers in the graph construction step.

In [None]:
from imblearn.under_sampling import RandomUnderSampler

if undersampling_rate is not None:
    print("An undersampling rate of ", undersampling_rate, "is applied.")
    train_data['index'] = train_data.index
    undersample = RandomUnderSampler(sampling_strategy=undersampling_rate)
    X, y = undersample.fit_resample(train_data, train_data['TX_FRAUD'])
    train_data = X.set_index(X['index']).drop('index',axis=1)
    print('The new distribution for the train set is:\n', train_data["TX_FRAUD"].value_counts())
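
To make the meaning of the rate concrete, here is a minimal pandas-only sketch that keeps every fraudulent row and samples legitimate rows until the fraud-to-legitimate ratio matches the rate. The helper `undersample_legitimate` and the toy data are assumptions for illustration; the notebook itself relies on `RandomUnderSampler`.

```python
import pandas as pd

def undersample_legitimate(data, label_col, rate, seed=0):
    """Keep all fraud rows and sample legitimate rows so that fraud/legitimate == rate."""
    fraud = data[data[label_col]]
    legit = data[~data[label_col]]
    n_legit = min(len(legit), int(round(len(fraud) / rate)))
    sampled = legit.sample(n=n_legit, random_state=seed)
    # sort_index restores the original ordering; the index labels are untouched.
    return pd.concat([fraud, sampled]).sort_index()

# Toy data: 2 fraudulent and 18 legitimate transactions with non-trivial index labels.
toy = pd.DataFrame(
    {"TX_AMOUNT": range(20), "TX_FRAUD": [i in (3, 11) for i in range(20)]},
    index=range(100, 120),
)

balanced = undersample_legitimate(toy, "TX_FRAUD", rate=0.5)
print(balanced["TX_FRAUD"].value_counts())
```

Since rows are selected rather than re-created, the surviving transactions keep their original index labels and can still serve as node identifiers.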


### 2. Construct the Credit Card Transaction Network

Nodes, edges and features are passed to the `GraphConstruction` constructor. Note that the client and merchant node data hold a trivial attribute with value 1. This is because we want all relevant transaction data to reside at the transaction nodes, while StellarGraph's current HinSAGE implementation requires all nodes to have features. Note also that a graph is constructed for a specific timeframe and sampling rate.

In [None]:
from inductiveGRL.graphconstruction import GraphConstruction

transaction_node_data = train_data.drop(columns=["CARD_PAN_ID", "TERM_MIDUID", "TX_FRAUD", "TX_DATETIME"])
client_node_data = pd.DataFrame([1]*len(train_data.CARD_PAN_ID.unique())).set_index(train_data.CARD_PAN_ID.unique())
merchant_node_data = pd.DataFrame([1]*len(train_data.TERM_MIDUID.unique())).set_index(train_data.TERM_MIDUID.unique())

nodes = {"client":train_data.CARD_PAN_ID, "merchant":train_data.TERM_MIDUID, "transaction":train_data.index}
edges = [zip(train_data.CARD_PAN_ID, train_data.index),zip(train_data.TERM_MIDUID, train_data.index)]
features = {"transaction": transaction_node_data, 'client': client_node_data, 'merchant': merchant_node_data}

graph = GraphConstruction(nodes, edges, features)
S = graph.get_stellargraph()
print(S.info())
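
The shape of the resulting network can be illustrated on three toy transactions using plain Python containers. This is only a sketch of the tripartite layout (clients and merchants connected through transactions); the identifiers are made up and none of this reflects the internals of `GraphConstruction`.

```python
from collections import defaultdict

# Toy transactions: (transaction id, client id, merchant id); identifiers are made up.
transactions = [(1, "cA", "m1"), (2, "cA", "m2"), (3, "cB", "m1")]

# Two edge lists, as in the cell above: client-transaction and merchant-transaction.
client_edges = [(client, tx) for tx, client, _ in transactions]
merchant_edges = [(merchant, tx) for tx, _, merchant in transactions]

# Undirected adjacency built from both edge lists.
adjacency = defaultdict(set)
for source, target in client_edges + merchant_edges:
    adjacency[source].add(target)
    adjacency[target].add(source)

# Clients and merchants only carry the constant feature 1.
client_features = {client: [1] for client, _ in client_edges}
merchant_features = {merchant: [1] for merchant, _ in merchant_edges}

print(sorted(adjacency[1]))     # neighbours of transaction 1
print(sorted(adjacency["m1"]))  # transactions made at merchant m1
```

Every transaction node has exactly two neighbours, its client and its merchant, so two transactions are related only through a shared client or merchant.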

### 2.1.1. Train GraphSAGE

HinSAGE, a heterogeneous implementation of the GraphSAGE framework, is trained with user-specified hyperparameters.

In [None]:
from inductiveGRL.hinsage import HinSAGE_Representation_Learner

#GraphSAGE parameters
num_samples = [2,32]
embedding_node_type = "transaction"

hinsage = HinSAGE_Representation_Learner(embedding_size, num_samples, embedding_node_type)
trained_hinsage_model, graphsage_train_emb = hinsage.train_hinsage(S, list(train_data.index), train_data['TX_FRAUD'], batch_size=50, epochs=10)
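
In GraphSAGE-style models, a list such as `num_samples` gives the number of neighbours sampled at each hop around a head node. The sketch below is a simplified, homogeneous illustration of that fixed-size sampling; HinSAGE additionally distinguishes node and edge types, which is omitted here, and the helper `sample_neighbourhood` and the toy graph are assumptions rather than StellarGraph code.

```python
import random

def sample_neighbourhood(adjacency, head, num_samples, rng):
    """Sample a fixed number of neighbours per hop, with replacement."""
    layers = [[head]]
    for n in num_samples:
        next_layer = []
        for node in layers[-1]:
            neighbours = sorted(adjacency[node])
            # Sampling with replacement keeps the layer size fixed
            # even when a node has fewer than n neighbours.
            next_layer.extend(rng.choices(neighbours, k=n))
        layers.append(next_layer)
    return layers

# Toy graph: transactions t1..t3 linked to a client and a merchant (made-up ids).
adjacency = {
    "t1": {"cA", "m1"}, "t2": {"cA", "m2"}, "t3": {"cB", "m1"},
    "cA": {"t1", "t2"}, "cB": {"t3"},
    "m1": {"t1", "t3"}, "m2": {"t2"},
}

layers = sample_neighbourhood(adjacency, "t1", num_samples=[2, 3], rng=random.Random(0))
print([len(layer) for layer in layers])
```

The first hop around a transaction reaches its client and merchant, and the second hop reaches other transactions of that client or merchant, which is where the relational signal comes from.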

### 2.1.2. Train FI-GRL

FI-GRL, a fast inductive graph representation learning framework, is trained on the graph constructed above. The algorithm is implemented in MATLAB, so we use `matlab.engine` to call its native implementation. First, we instantiate the FI-GRL class with the intermediate dimension of the matrix between the input graph and the embedding space, together with the dimension of the final embedding space. FI-GRL's train step returns three matrices: `U`, which holds the embeddings, and `sigma` and `v`, which are used in the inductive step to generate embeddings for unseen nodes.

In [None]:
import matlab.engine
eng = matlab.engine.start_matlab()

#FIGRL hyperparameter
intermediate_dim = 400

#Instantiate FI-GRL
figrl = eng.FIGRL(float(intermediate_dim), embedding_size)

#Run train step
edges = matlab.double(graph.get_edgelist())
U, sigma,v = eng.train_step_figrl(figrl, edges, nargout = 3)
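
How factors such as `U`, `sigma` and `v` can be reused for unseen nodes is easiest to see on a plain truncated SVD. The NumPy sketch below is not the FI-GRL algorithm, whose MATLAB implementation is not shown in this notebook; under that simplifying assumption it only illustrates that projecting adjacency rows onto `V` and rescaling by `sigma` reproduces the training embeddings and also yields an embedding for a new row. The function `fold_in` and the toy path graph are hypothetical.

```python
import numpy as np

# Toy adjacency matrix: a path graph on 6 nodes (0-1-2-3-4-5).
A = np.diag(np.ones(5), 1) + np.diag(np.ones(5), -1)

# "Train step": rank-k factorisation A ~ U @ diag(sigma) @ V.T
k = 4
U_full, s_full, Vt_full = np.linalg.svd(A)
U, sigma, V = U_full[:, :k], s_full[:k], Vt_full[:k].T

def fold_in(adjacency_rows, sigma, V):
    """'Inductive step': project adjacency rows onto the learned factors."""
    return adjacency_rows @ V / sigma

# Projecting the training rows reproduces the training embeddings.
print(np.allclose(fold_in(A, sigma, V), U))

# An unseen node connected to nodes 0 and 2 gets an embedding of the same size.
new_row = np.array([[1.0, 0.0, 1.0, 0.0, 0.0, 0.0]])
new_embedding = fold_in(new_row, sigma, V)
print(new_embedding.shape)
```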

We transform the embeddings returned by MATLAB into a pandas DataFrame and select the embeddings of our train nodes. Since FI-GRL assumes homogeneous input graphs, it also generates embeddings for nodes that we are not interested in (clients and merchants). We also correct for an index shift of 1.

In [None]:
figrl_train_emb = pd.DataFrame(U)
figrl_train_emb = figrl_train_emb.set_index(figrl_train_emb.index+1)
figrl_train_emb = figrl_train_emb.loc[train_data.index]
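
The index correction can be checked on a small stand-in matrix. The sketch assumes that node ids start at 1 while the DataFrame rows start at 0, and that nodes 1 to 4 are the transactions; both the matrix and the id assignment are made up for illustration.

```python
import numpy as np
import pandas as pd

# Stand-in for the returned matrix: one row per node, nodes numbered from 1.
U = np.arange(12, dtype=float).reshape(6, 2)

# Assume nodes 1-4 are transactions and nodes 5-6 are clients or merchants.
transaction_ids = pd.Index([1, 2, 3, 4])

embeddings = pd.DataFrame(U)                             # row labels 0..5
embeddings = embeddings.set_index(embeddings.index + 1)  # row labels 1..6 (node ids)
transaction_emb = embeddings.loc[transaction_ids]        # keep transaction rows only

print(transaction_emb)
```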

### 2.2. Inductive Step

We want to keep the original indices after concatenating the train and inductive data, because they represent the transaction node ids. We need to concatenate these dataframes in order to easily construct the new graph.

In [None]:
train_data['index'] = train_data.index
inductive_data['index'] = inductive_data.index
inductive_graph_data = pd.concat((train_data,inductive_data))
inductive_graph_data = inductive_graph_data.set_index(inductive_graph_data['index']).drop("index",axis = 1)
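
As a quick check on toy data, `pd.concat` keeps the row labels of its inputs unless `ignore_index=True` is passed, so the combined frame can be looked up by the original transaction ids. The two small frames below are made up for illustration.

```python
import pandas as pd

train = pd.DataFrame({"TX_AMOUNT": [7.37, 6.25]}, index=[10, 11])
inductive = pd.DataFrame({"TX_AMOUNT": [154.93]}, index=[42])

# Without ignore_index=True, pd.concat keeps the row labels of both inputs.
combined = pd.concat((train, inductive))
print(combined.index.tolist())
```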

For the inductive step, we need to add the new, unseen transactions to the graph. Because the current StellarGraph implementation does not support adding nodes and edges to an existing StellarGraph object, we create a new graph that contains all the nodes of the train graph in addition to the new nodes.

In [None]:
from inductiveGRL.graphconstruction import GraphConstruction

transaction_node_data = inductive_graph_data.drop(columns=["CARD_PAN_ID", "TERM_MIDUID", "TX_FRAUD", "TX_DATETIME"])
client_node_data = pd.DataFrame([1]*len(inductive_graph_data.CARD_PAN_ID.unique())).set_index(inductive_graph_data.CARD_PAN_ID.unique())
merchant_node_data = pd.DataFrame([1]*len(inductive_graph_data.TERM_MIDUID.unique())).set_index(inductive_graph_data.TERM_MIDUID.unique())

nodes = {"client":inductive_graph_data.CARD_PAN_ID, "merchant":inductive_graph_data.TERM_MIDUID, "transaction":inductive_graph_data.index}
edges = [zip(inductive_graph_data.CARD_PAN_ID, inductive_graph_data.index),zip(inductive_graph_data.TERM_MIDUID, inductive_graph_data.index)]
features = {"transaction": transaction_node_data, 'client': client_node_data, 'merchant': merchant_node_data}

graph = GraphConstruction(nodes, edges, features)
S = graph.get_stellargraph()
print(S.info())

### 2.2.1. Inductive Step GraphSAGE

The inductive step applies the previously learned (and optimized) aggregation functions, which are part of the `trained_hinsage_model`. We also pass the new graph `S` and the node identifiers (`inductive_data.index`) to the inductive step.

In [None]:
graphsage_inductive_emb = hinsage.inductive_step_hinsage(S, trained_hinsage_model, inductive_data.index, batch_size=50)

### 2.2.2. Inductive Step FI-GRL 

The inductive step combines the new adjacency matrix with the matrices `sigma` and `v` that were computed during training.

In [None]:
edges = matlab.double(graph.get_edgelist())
figrl_inductive_emb = eng.inductive_step_figrl(figrl, edges, sigma, v)

Similar to the train step, we extract the embeddings from the nodes we are interested in (i.e. the transaction nodes).

In [None]:
figrl_inductive_emb = pd.DataFrame(figrl_inductive_emb)
figrl_inductive_emb = figrl_inductive_emb.set_index(figrl_inductive_emb.index+1)
figrl_inductive_emb = figrl_inductive_emb.loc[inductive_data.index]

### 4. Classification: predictions based on inductive embeddings

Select your preferred classification model.

In [None]:
from xgboost import XGBClassifier
classifier = XGBClassifier(n_estimators=100)

If requested, the original transaction features are added to the generated embeddings. If these features are added, a baseline consisting of only these features (without embeddings) is included to analyze the net impact of embeddings on the predictive performance.

In [None]:
train_labels = train_data['TX_FRAUD']

if add_additional_data is True:
    graphsage_train_emb = pd.merge(graphsage_train_emb, train_data.loc[graphsage_train_emb.index].drop('TX_FRAUD', axis=1).drop('TX_DATETIME', axis=1), left_index=True, right_index=True)
    graphsage_inductive_emb = pd.merge(graphsage_inductive_emb, inductive_data.loc[graphsage_inductive_emb.index].drop('TX_FRAUD', axis=1).drop('TX_DATETIME', axis=1), left_index=True, right_index=True)
    
    figrl_train_emb = pd.merge(figrl_train_emb, train_data.loc[figrl_train_emb.index].drop('TX_FRAUD', axis=1).drop('TX_DATETIME', axis=1), left_index=True, right_index=True)
    figrl_inductive_emb = pd.merge(figrl_inductive_emb, inductive_data.loc[figrl_inductive_emb.index].drop('TX_FRAUD', axis=1).drop('TX_DATETIME', axis=1), left_index=True, right_index=True)
    
    baseline_train = train_data.drop('TX_FRAUD', axis=1).drop('TX_DATETIME', axis=1)
    baseline_inductive = inductive_data.drop('TX_FRAUD', axis=1).drop('TX_DATETIME', axis=1)
    
    classifier.fit(baseline_train, train_labels)
    baseline_predictions = classifier.predict_proba(baseline_inductive)

classifier.fit(graphsage_train_emb, train_labels)
graphsage_predictions = classifier.predict_proba(graphsage_inductive_emb)

classifier.fit(figrl_train_emb, train_labels)
figrl_predictions = classifier.predict_proba(figrl_inductive_emb)
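
The merge used above aligns embeddings and transaction features on their shared index. A minimal sketch on toy frames (the column names `emb_0` and `emb_1` and all values are made up) shows that the row order of the two inputs does not matter:

```python
import pandas as pd

# Toy embeddings (2 dimensions) and toy features, both indexed by transaction id.
emb = pd.DataFrame([[0.1, 0.2], [0.3, 0.4]], index=[10, 11], columns=["emb_0", "emb_1"])
features = pd.DataFrame({"TX_AMOUNT": [7.37, 6.25], "hour": [1, 1]}, index=[11, 10])

# Align on the index: each embedding row is paired with the features of the same id.
combined = pd.merge(emb, features, left_index=True, right_index=True)
print(combined)
```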

### 5. Evaluation

Given the highly imbalanced nature of our dataset, we evaluate the results with precision-recall curves and the lift at 1%.

In [None]:
from inductiveGRL.evaluation import Evaluation
import warnings
warnings.filterwarnings("ignore")

inductive_labels = df.loc[inductive_data.index, 'TX_FRAUD']

figrl_evaluation = Evaluation(figrl_predictions, inductive_labels, "FIGRL+features") 
graphsage_evaluation = Evaluation(graphsage_predictions, inductive_labels, "GraphSAGE+features")

figrl_evaluation.pr_curve()
graphsage_evaluation.pr_curve()

if add_additional_data is True:
    baseline_evaluation = Evaluation(baseline_predictions, inductive_labels, "Baseline")
    baseline_evaluation.pr_curve()

print("FI-GRL: ")
lift_score = figrl_evaluation.lift_score(0.01)
print("GraphSAGE: ")
lift_score = graphsage_evaluation.lift_score(0.01)

if add_additional_data is True:
    print("Baseline: ")
    lift_score = baseline_evaluation.lift_score(0.01)
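
For reference, lift at 1% is commonly computed as the fraud rate among the 1% highest-scored transactions divided by the overall fraud rate. The function `lift_at` below is a hypothetical sketch of that definition on toy scores; it is an assumption that `Evaluation.lift_score` follows the same definition, since its implementation is not shown here.

```python
def lift_at(scores, labels, fraction):
    """Fraud rate among the top `fraction` of scores divided by the overall fraud rate."""
    n_top = max(1, int(round(len(scores) * fraction)))
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top_rate = sum(label for _, label in ranked[:n_top]) / n_top
    base_rate = sum(labels) / len(labels)
    return top_rate / base_rate

# Toy example: 1000 transactions, 10 frauds, 5 of which receive the highest scores.
labels = [True] * 10 + [False] * 990
scores = [0.9] * 5 + [0.1] * 5 + [0.5] * 10 + [0.2] * 980

print(lift_at(scores, labels, 0.01))
```

In this toy case the top 1% holds 10 transactions, 5 of them fraudulent, against a base rate of 1%, which corresponds to a lift of 50. With a real classifier, the scores would be the fraud-class column of the `predict_proba` output.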
