# CCF Demo Pipeline: FIGRL class

This is a demo of the Python implementation of the FIGRL algorithm.

In [1]:
import networkx as nx
import pandas as pd 
import numpy as np

Two global parameters need to be defined for this pipeline: the size of FI-GRL's embeddings ('embedding_size') and the 'intermediate_dimension'.

In [2]:
# Global parameters:
embedding_size = 40
intermediate_dimension = 400

## 1. Loading the demo Dataset

In [3]:
df = pd.read_csv("demo_ccf.csv")
df = df.set_index(df.index+1)

The credit card dataset is split into two parts: the training and inductive datasets. The first 60% of the transactions are used to train the algorithm, and the remaining 40% form the inductive set.

In [4]:
cutoff = round(0.6*len(df))
train_data = df.head(cutoff)
inductive_data = df.tail(len(df)-cutoff)

In [5]:
print('The distribution of fraud for the train data is:\n', train_data['fraud_label'].value_counts())
print('The distribution of fraud for the inductive data is:\n', inductive_data['fraud_label'].value_counts())

The distribution of fraud for the train data is:
 0    482
1    164
Name: fraud_label, dtype: int64
The distribution of fraud for the inductive data is:
 0    327
1    103
Name: fraud_label, dtype: int64
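
The split above relies on row order rather than random sampling. As a minimal sketch on a made-up frame (the `toy` data and its `amount` column are illustrative assumptions, not part of the demo dataset), the same `head`/`tail` logic yields two disjoint, ordered parts:

```python
import pandas as pd

# Hypothetical toy frame standing in for demo_ccf.csv (values are made up).
toy = pd.DataFrame({
    "amount": range(10),
    "fraud_label": [0, 0, 1, 0, 0, 0, 1, 0, 0, 1],
})
toy = toy.set_index(toy.index + 1)  # 1-based transaction ids, as in the demo

# Same split logic as in the notebook: first 60% of rows, then the rest.
cutoff = round(0.6 * len(toy))
train_toy = toy.head(cutoff)
inductive_toy = toy.tail(len(toy) - cutoff)

print(len(train_toy), len(inductive_toy))
```

Because the split follows row order, every inductive transaction id is larger than every training id, which is only meaningful if the rows are already sorted chronologically.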


## 2. Construct the Graph Network

A networkx graph is constructed with transaction, client and merchant nodes, creating a tripartite graph. The FI-GRL framework derives embeddings starting from an adjacency matrix that it constructs from the graph's edge list.

In [6]:
nodes = {"transaction": train_data.index, "client": train_data.client_node, "merchant": train_data.merchant_node}
edges = [zip(train_data.client_node, train_data.index), zip(train_data.merchant_node, train_data.index)]
g_nx = nx.Graph()
for key, values in nodes.items():
    g_nx.add_nodes_from(values, ntype=key)
for edge in edges:
    g_nx.add_edges_from(edge)
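
To make the graph structure concrete, here is a minimal sketch on three made-up transactions. The ids are illustrative assumptions and are chosen so that transaction, client and merchant ids do not collide; a shared id would be merged into a single node. The sketch builds the same kind of graph and derives a dense adjacency matrix with networkx. It does not show how the FIGRL class builds its own matrix internally.

```python
import networkx as nx
import pandas as pd

# Hypothetical toy transactions; index = transaction id (made-up values).
toy = pd.DataFrame(
    {"client_node": [101, 101, 102], "merchant_node": [201, 202, 202]},
    index=[1, 2, 3],
)

g = nx.Graph()
g.add_nodes_from(toy.index, ntype="transaction")
g.add_nodes_from(toy.client_node, ntype="client")
g.add_nodes_from(toy.merchant_node, ntype="merchant")

# Each transaction is linked to its client and to its merchant.
g.add_edges_from(zip(toy.client_node, toy.index))
g.add_edges_from(zip(toy.merchant_node, toy.index))

# Dense adjacency matrix with rows/columns in sorted node order.
node_order = sorted(g.nodes)
adjacency = nx.to_numpy_array(g, nodelist=node_order)
print(adjacency.shape)
```

Every edge has a transaction on one side, so clients and merchants are only connected through the transactions they share.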

## 3. Train FIGRL

FI-GRL, a fast inductive graph representation learning framework, is trained on the graph constructed above. The algorithm is implemented as a Python class. First, we instantiate the FIGRL class with the size of the final dimension (the embedding space) and the intermediate dimension of the matrix that sits between the input graph and the embedding space. FI-GRL's training step returns the embeddings for the training transaction nodes.

Currently there are three versions of FI-GRL in Python: FIGRL_Original, FIGRL_expanding_S, and FIGRL.
The last one is recommended because it is the fastest and least memory-intensive of the three.

Import the desired class:

In [7]:
from FIGRL import FIGRL

In [8]:
figrl = FIGRL(embedding_size, intermediate_dimension)
figrl_train_emb = figrl.fit(g_nx)
figrl_train_emb = figrl_train_emb.loc[train_data.index]
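
To illustrate how the two sizes relate, the sketch below pushes a toy adjacency matrix through a random projection into an intermediate space and then reduces it with a truncated SVD. This two-step flow is an assumption made purely for illustration: it is not the code of the FIGRL class, and the toy sizes and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_nodes = 50
toy_intermediate_dimension = 20   # plays the role of intermediate_dimension
toy_embedding_size = 5            # plays the role of embedding_size

# Random symmetric 0/1 adjacency matrix without self-loops (toy data).
upper = np.triu(rng.random((n_nodes, n_nodes)) < 0.1, k=1)
adjacency = (upper | upper.T).astype(float)

# Step 1 (assumed): project each node's adjacency row to the intermediate space.
projection = rng.standard_normal((n_nodes, toy_intermediate_dimension))
projection /= np.sqrt(toy_intermediate_dimension)
intermediate = adjacency @ projection

# Step 2 (assumed): keep the leading singular directions as the embedding.
u, s, _ = np.linalg.svd(intermediate, full_matrices=False)
embeddings = u[:, :toy_embedding_size] * s[:toy_embedding_size]

print(intermediate.shape, embeddings.shape)
```

In this sketch the intermediate dimension only has to be at least as large as the embedding size; the demo uses 400 and 40.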

## 4. Inductive Step

In [9]:
pd.options.mode.chained_assignment = None

inductive_graph_data = pd.concat((train_data,inductive_data))

A graph containing the inductive data and training data is created.

In [10]:
nodes = {"transaction":inductive_graph_data.index, "client":inductive_graph_data.client_node, "merchant":inductive_graph_data.merchant_node}
edges = [zip(inductive_graph_data.client_node, inductive_graph_data.index),zip(inductive_graph_data.merchant_node, inductive_graph_data.index)]
graph_full = nx.Graph()

for key, values in nodes.items():
            graph_full.add_nodes_from(values, ntype=key)
for edge in edges:
            graph_full.add_edges_from(edge)

The inductive prediction function of FIGRL takes the full graph, the inductive data, a list of the columns holding the neighboring nodes of the inductive data, the maximum id number of all nodes (passed here as the largest merchant id), and finally the index of the inductive data.

In [11]:
figrl_inductive_emb = figrl.predict(
    graph_full,
    inductive_data,
    [inductive_data.client_node, inductive_data.merchant_node],
    max(inductive_graph_data.merchant_node),
    inductive_data.index,
)
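
Before the inductive step it can be useful to check how the inductive transactions attach to the training graph. The sketch below uses made-up ids and hypothetical variable names. It flags inductive transactions whose client or merchant already appeared in training; a transaction with neither shares no neighbor with the training graph.

```python
import pandas as pd

# Hypothetical toy data; index = transaction id (values are made up).
train_toy = pd.DataFrame(
    {"client_node": [101, 101, 102], "merchant_node": [201, 202, 202]},
    index=[1, 2, 3],
)
inductive_toy = pd.DataFrame(
    {"client_node": [101, 103], "merchant_node": [202, 203]},
    index=[4, 5],
)

# Clients and merchants that already exist as nodes in the training graph.
seen_clients = set(train_toy.client_node)
seen_merchants = set(train_toy.merchant_node)

known_client = inductive_toy.client_node.isin(seen_clients)
known_merchant = inductive_toy.merchant_node.isin(seen_merchants)

# True if the transaction shares at least one neighbor with the training graph.
connected = known_client | known_merchant
print(connected)
```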

## 5. Evaluation

In [12]:
from xgboost import XGBClassifier
classifier = XGBClassifier(n_estimators=100)

In [13]:
train_labels = train_data['fraud_label']
add_additional_data = True
if add_additional_data:
    # Combine the FI-GRL embeddings with the original transaction features.
    train_emb = pd.merge(figrl_train_emb, train_data.loc[figrl_train_emb.index].drop('fraud_label', axis=1), left_index=True, right_index=True)
    inductive_emb = pd.merge(figrl_inductive_emb, inductive_data.loc[figrl_inductive_emb.index].drop('fraud_label', axis=1), left_index=True, right_index=True)

    # Baseline: the same classifier trained on the original features only.
    baseline_train = train_data.drop('fraud_label', axis=1)
    baseline_inductive = inductive_data.drop('fraud_label', axis=1)

    classifier.fit(baseline_train, train_labels)
    baseline_predictions = classifier.predict_proba(baseline_inductive)
else:
    # Use the embeddings alone so that train_emb and inductive_emb are always defined.
    train_emb = figrl_train_emb
    inductive_emb = figrl_inductive_emb

classifier.fit(train_emb, train_labels)
predictions = classifier.predict_proba(inductive_emb)


In [14]:
from components.Evaluation import Evaluation
inductive_labels = df.loc[inductive_emb.index, 'fraud_label']

figrl_evaluation = Evaluation(predictions, inductive_labels, "FI-GRL+features") 
figrl_evaluation.pr_curve()

if add_additional_data is True:
    baseline_evaluation = Evaluation(baseline_predictions, inductive_labels, "Baseline")
    baseline_evaluation.pr_curve()

Average precision-recall score for  FI-GRL+features  configuration XGBoost: 0.8246865220
Average precision-recall score for  Baseline  configuration XGBoost: 0.8503629197
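
The scores above come from the project's own `Evaluation` class. As a minimal sketch, an average precision score can also be computed with scikit-learn. This assumes that the second column of `predict_proba` holds the fraud probability, and it is not necessarily how `Evaluation` computes its number. The labels and probabilities below are made up.

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

# Made-up labels and predict_proba-style output (columns: P(class 0), P(class 1)).
y_true = np.array([0, 0, 1, 1])
proba = np.array([
    [0.90, 0.10],
    [0.60, 0.40],
    [0.65, 0.35],
    [0.20, 0.80],
])

# Use the probability of the positive (fraud) class as the ranking score.
fraud_scores = proba[:, 1]

ap = average_precision_score(y_true, fraud_scores)
precision, recall, thresholds = precision_recall_curve(y_true, fraud_scores)

print(f"Average precision: {ap:.4f}")
```

Passing the full two-column output instead of the positive-class column would raise an error or give a misleading score, so selecting the column explicitly is the safer habit.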
