<h1>GNNs for Fraud Detection </h1>
This assessment will be divided into 2 parts:

- In the first part, we'll discover how to import a CSV file and create a graph dataset.

- In the second part, a 2 layer Graph Convolutional Network is created. 

In [2]:
'''
Note: We will train our GNNs on CPU runtime since we have a very small graph and training time should be fairly low, 
you can use GPUs if you wish, but make sure that you install the right DGL version from here- https://www.dgl.ai/pages/start.html
The below code installs DGL for a CPU runtime
'''

!pip install  dgl -f https://data.dgl.ai/wheels/repo.html
!pip install  dglgo -f https://data.dgl.ai/wheels-test/repo.html

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in links: https://data.dgl.ai/wheels/repo.html
Collecting dgl
  Downloading dgl-1.1.0-cp310-cp310-manylinux1_x86_64.whl (5.9 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m5.9/5.9 MB[0m [31m64.6 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: dgl
Successfully installed dgl-1.1.0
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in links: https://data.dgl.ai/wheels-test/repo.html
Collecting dglgo
  Downloading dglgo-0.0.2-py3-none-any.whl (63 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m63.5/63.5 kB[0m [31m5.7 MB/s[0m eta [36m0:00:00[0m
Collecting isort>=5.10.1 (from dglgo)
  Downloading isort-5.12.0-py3-none-any.whl (91 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m91.2/91.2 kB[0m [31m8.7 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting aut

In [3]:
#Don't bother if you get this warning message- "DGL backend not selected or invalid.  Assuming PyTorch for now."
import torch
import dgl
import pandas as pd


DGL backend not selected or invalid.  Assuming PyTorch for now.


Setting the default backend to "pytorch". You can change it in the ~/.dgl/config.json file or export the DGLBACKEND environment variable.  Valid options are: pytorch, mxnet, tensorflow (all lowercase)


<h2>Week 1</h2>

# <h3>Loading Your First Graph Dataset </h3>
 2 csv files that contain an open source fraud detection dataset created by Amazon. 

<h4> Amazon Fraud Detection Dataset </h4>
The Amazon dataset encompasses product evaluations in the Musical Instruments category. Individuals with over 80% helpful votes are identified as benign entities, while those with fewer than 20% helpful votes are considered fraudulent entities. Performing a fraudulent user detection task on the Amazon dataset involves a binary classification process. Each of these users have a 25-dim dense feature representation that is obtained by calculating certain statistical properties of the user's behaviors. Features include properties like entropy of user's ratings, time entropy, sentiment of user's reviews etc. You can learn more about the features from Table 1. in the paper-https://arxiv.org/pdf/2005.10150.pdf.

The nodes in the graph are therefore users on the Amazon e-commerce platform, the nodes also have handcrafted-features. The node information is available in the file below
- node_information.csv: contains node_id as the first column and features 1-25 in the corresponding columns, the last column is the label of the user (benign, fraudulent)

To create a network of interconnected users and generate a graph, we link users who share similarities. The file provided contains connections between users exhibiting the top 5% mutual review text similarities (calculated using TF-IDF) among all users. In other words, users with high textual resemblances are connected, based on the assumption that this structure could reveal insights into the communication patterns among fraudulent users.

- edge_data.csv: contains 2 columns with source and destination node ids indicating an edge between the source and destination columns


In [4]:
#import required packages
import torch
# Deep Graph Library - build with pytorch
import dgl
import pandas as pd


In [5]:
df = pd.read_csv('/content/node_information.csv')
display(df.head(5))

Unnamed: 0,feature_0,feature_1,feature_2,feature_3,feature_4,feature_5,feature_6,feature_7,feature_8,feature_9,...,feature_16,feature_17,feature_18,feature_19,feature_20,feature_21,feature_22,feature_23,feature_24,label
0,1.0,26.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,5.0,5.0,5.0,0.0,0.0,0.0,1.0,13.0,1.0,0
1,4.0,17.0,0.0,1.0,1.0,0.0,2.0,0.0,0.25,0.25,...,5.0,2.0,3.75,0.0,3382.0,1.386294,0.0,45.0,1.0,0
2,2.0,15.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,...,4.0,4.0,4.0,0.0,0.0,0.0,1.0,24.5,1.0,0
3,1.0,21.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,5.0,5.0,5.0,0.0,0.0,0.0,1.0,14.0,1.0,0
4,2.0,18.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,...,5.0,4.0,4.5,0.0,0.0,0.0,1.0,18.5,1.0,0


In [6]:
#Section 1: Data Loading
def load_node_information_from_csv(path: str):
    '''Given a path to the node information csv file, create a tensor of node 
    features and corresponding labels. You can load using the Pandas library
    Args:
        path: path to a csv file
    Returns: 
        a tensor of node features of the shape (num_nodes, num_features) and a tensor of 
        node labels of the shape (num_nodes)
    '''
    df = pd.read_csv('/content/node_information.csv')
    feature_names = set(df.columns) - {"label"}
    node_features = torch.tensor(df[feature_names].values)
    node_labels = torch.tensor(df['label'].values)
    return node_features, node_labels
    

def load_edges_from_csv(path: str):
    '''Given a path to a csv file, create a tuple of tensors, you can use the Pandas library
    Args:
        path: path to a csv file
    Returns: 
        src: a pytorch tensor of source node ids
        dst: a pytorch tensor of destination node ids
    '''
    # make sure that the node ids are in the required type format, ie. int64
    df = pd.read_csv('/content/edge_data.csv')
    src = torch.tensor(df['src'].values)
    dst = torch.tensor(df['dst'].values)
    return (src,dst)
    

def create_graph_from_tensors(src_tensor: torch.Tensor, dst_tensor: torch.Tensor):
    '''Given a tuple of edge tensors (u,v), create a graph such that each element in u is 
    connected to each element in v with a one-to-one mapping
    please refer to: https://docs.dgl.ai/en/1.0.x/generated/dgl.graph.html
    For example: 
    u = th.tensor([1, 2, 3]), 
    v = th.tensor([4, 5, 0]) 
    should create a graph with 6 nodes and 3 edges:
    1 -> 4, 2 -> 5, 3 -> 0
    Args:
        edge_tensors: a tuple of edge tensors
    Returns: 
        a DGL graph
    '''
    g = dgl.graph((src_tensor, dst_tensor))
    return g
    


def add_node_features_and_labels(graph: dgl.DGLGraph, node_features: torch.Tensor, node_labels: torch.Tensor):
    '''Given a graph and a tensor of node features and labels, add the node features and labels to 
    the graph object so as to access them later directly from the graph object. 
    **Name the features and labels as "features" and "labels" respectively**
    please refer to: https://docs.dgl.ai/guide/graph-feature.html?highlight=features
    Args:
        graph: a DGL graph
        node_features: a tensor of node features of type float()
    Returns: 
        a DGL graph with node features with shape (num_nodes, num_features) and labels with shape (num_nodes, 1)
    '''
    #**Name the features and labels as "features" and "labels" respectively**
    graph.ndata['features'] = node_features
    nd_labels = node_labels.reshape(len(node_labels), 1)
    graph.ndata['labels'] = nd_labels
    return graph

In [7]:
# Section 2: data exploration
def get_num_nodes(graph: dgl.DGLGraph):
    '''Given a DGL graph, return the number of nodes
    please refer to: https://docs.dgl.ai/en/0.1.x/api/python/graph.html#querying-graph-structure
    Args:
        graph: a DGL graph
    Returns: 
        the number of nodes in the graph
    '''
    return graph.number_of_nodes()
    


def check_if_edge_exists(graph: dgl.DGLGraph, u: int, v: int):
    '''Given a DGL graph and two nodes u and v, 
    return True if the edge (u,v) exists in the graph, False otherwise
    please refer to: https://docs.dgl.ai/en/0.1.x/api/python/graph.html#querying-graph-structure
    Args:
        graph: a DGL graph
        u: a node
        v: a node
    Returns: 
        True if the edge (u,v) exists in the graph, False otherwise
    '''
    
    return graph.has_edges_between(u, v)


def get_first_hop_neighbors(graph: dgl.DGLGraph, node: int):
    '''Given a DGL graph and a node, return the first hop neighbors of the node
       First hop neighbors are the nodes that are directly connected to the node
       please refer to: https://docs.dgl.ai/en/0.1.x/api/python/graph.html#querying-graph-structure
    Args:
        graph: a DGL graph
        node: a node
    Returns: 
        a list of first hop neighbors of the node
    '''
    
    predecessors = graph.predecessors(node)
    successors = graph.successors(node)

    first_hop = []
    for each in predecessors:
      if graph.has_edges_between(each, node):
        first_hop.append(each)

    for each in successors:
      if graph.has_edges_between(each, node):
        first_hop.append(each)

    return first_hop


def get_second_hop_neighbors(graph: dgl.DGLGraph, node: int):
    '''Given a DGL graph and a node, return the second hop neighbors of the node
       Second hop neighbors are the nodes that are connected to the first hop neighbors of the node
    Args:
        graph: a DGL graph
        node: a node
    Returns: 
        a tensor of second hop neighbors of the node
    '''
    
    second_hop = []
    first_hop = get_first_hop_neighbors(graph, node)
    for fnode in first_hop:
      predecessors = graph.predecessors(fnode)
      successors = graph.successors(fnode)
      
      for each in predecessors:
        if graph.has_edges_between(each, fnode):
          second_hop.append(each)

      for each in successors:
        if graph.has_edges_between(each, fnode):
          second_hop.append(each)

    return second_hop

<h4>Data Sampling</h4>
Graphs possess a relational nature, distinguishing them from datasets like images or text, which maintain a fixed context window. Consequently, when sampling a node for training, it's essential to also sample the neighbors we want to include for aggregation. Graph neural networks learn from both node-specific information (i.e., node features) and structural information (a node's neighborhood). As a result, data batches typically consist of a node's subgraph, including its neighborhood in a particular manner. For example, we can consider a node's first and second hop neighbors as its neighborhood. Alternatively, we could use a fixed number of neighbors (either randomly or through a ranking process) in each hop, commonly referred to as fan-out. So, when we say "sample a node's first-hop neighborhood with fan-out of 5," it means we select a total of 5 neighbors from the node's first hop. In this section, we use DGL's in-built neighbor sampler for obtaining batches of node data. 


In [8]:
def create_data_sampler(fanout_list):
    '''create a DGL data sampler
    Args: layers: the number of hops in the neighborhood that we want to sample
    Returns: 
        a DGL data sampler of type NeighborSampler. 
        This sampler will sample neighborhood as specified by the fanout_list.
        read more about this sampler in the docs: 
        https://docs.dgl.ai/generated/dgl.dataloading.NeighborSampler.html
    '''
    sampler = dgl.dataloading.NeighborSampler(fanout_list)
    return sampler


def create_data_loaders(graph: dgl.DGLGraph, sampler, batch_size: int, train_ids: torch.Tensor, val_ids: torch.Tensor):
    '''Given a DGL graph, a sampler, a batch size, and a train/val ratio, 
    split the graph into training, validation, and test sets
    Use the DGL data loader to create data loaders for the training and validation sets
    reference: https://docs.dgl.ai/generated/dgl.dataloading.DataLoader.html#dgl.dataloading.DataLoader
    Args:
        graph: a DGL graph
        sampler: a DGL data sampler
        batch_size: the size of the batch 
        train_ratio: the ratio of the training set 
        val_ratio: the ratio of the validation set
    Returns: 
        train and validation data loader objects
    '''
    train_dataloader = dgl.dataloading.DataLoader(graph, train_ids, sampler, batch_size=1024)
    val_dataloader = dgl.dataloading.DataLoader(graph, val_ids, sampler, batch_size=1024)
    return train_dataloader, val_dataloader
    


In [10]:
'''
Just make sure that you've set the data_path correctly
'''
src_edges, dst_edges = load_edges_from_csv(f'/content/edge_data.csv')
graph = create_graph_from_tensors(src_edges, dst_edges)
num_nodes = get_num_nodes(graph)
print('Number of nodes in the graph: ', num_nodes)
edge_exists = check_if_edge_exists(graph, 0, 1)
print('Does the edge (0,1) exist in the graph? ', edge_exists)
first_hop_neighbors = get_first_hop_neighbors(graph, 0)
print('First hop neighbors of node 0: ', first_hop_neighbors)
second_hop_neighbors = get_second_hop_neighbors(graph, 0)
print('Second hop neighbors of node 0: ', second_hop_neighbors)
graph_features, labels = load_node_information_from_csv(f'/content/node_information.csv')
graph = add_node_features_and_labels(graph, graph_features, labels)
print('Graph with node features: ', graph)
graph = dgl.add_self_loop(graph) #add self loops to prevent 0 degree nodes (DGL crashes when node-degree=0)


Number of nodes in the graph:  11944
Does the edge (0,1) exist in the graph?  False
First hop neighbors of node 0:  [tensor(2486), tensor(4857), tensor(5009), tensor(5263), tensor(5610), tensor(5640), tensor(5750), tensor(5809), tensor(6733), tensor(6757), tensor(11616), tensor(2486), tensor(4857), tensor(5009), tensor(5263), tensor(5610), tensor(5640), tensor(5750), tensor(5809), tensor(6733), tensor(6757), tensor(11616)]
Second hop neighbors of node 0:  [tensor(0), tensor(2988), tensor(4748), tensor(4857), tensor(4950), tensor(5009), tensor(5263), tensor(5475), tensor(5610), tensor(5640), tensor(6733), tensor(6757), tensor(7429), tensor(10444), tensor(10981), tensor(11616), tensor(0), tensor(2988), tensor(4748), tensor(4857), tensor(4950), tensor(5009), tensor(5263), tensor(5475), tensor(5610), tensor(5640), tensor(6733), tensor(6757), tensor(7429), tensor(10444), tensor(10981), tensor(11616), tensor(0), tensor(199), tensor(613), tensor(1181), tensor(1366), tensor(1375), tensor(1379)

  node_features = torch.tensor(df[feature_names].values)


In [11]:
#driver code for section 3, we create a random list of train and validation ids with a 80:20 split and use these ids to instantiate dataloaders


#create train and val masks
train_mask = torch.zeros(num_nodes, dtype=torch.bool)
val_mask = torch.zeros(num_nodes, dtype=torch.bool)

torch.manual_seed(0)
train_mask[torch.randperm(num_nodes)[:int(0.8*num_nodes)]] = True
val_mask = ~train_mask

#obtain respective ids
train_ids = torch.nonzero(train_mask, as_tuple=True)[0]
val_ids = torch.nonzero(val_mask, as_tuple=True)[0]

#create sampler and data loaders
sampler = create_data_sampler([15,15])
train_loader, val_loader = create_data_loaders(graph, sampler, 100, train_ids, val_ids)

for input_nodes, output_nodes, blocks in train_loader:
    print("Input nodes in the MFG (Message Flow Graph)")
    print(input_nodes)
    print("Output nodes in the MFG (Message Flow Graph)")
    print(output_nodes)
    print("Message Flow Graph used for training")
    print("Layer 1")
    print(blocks[0])
    print("Layer 2")
    print(blocks[1])
    break

Input nodes in the MFG (Message Flow Graph)
tensor([    0,     1,     2,  ...,  8291,  4192, 10921])
Output nodes in the MFG (Message Flow Graph)
tensor([   0,    1,    2,  ..., 1267, 1269, 1271])
Message Flow Graph used for training
Layer 1
Block(num_src_nodes=10011, num_dst_nodes=5850, num_edges=86156)
Layer 2
Block(num_src_nodes=5850, num_dst_nodes=1024, num_edges=14756)






<h2>Construct the model!</h2>

we'll utilize the data we've prepared to construct our very own GCN model, then train and assess it using a validation dataset! 

In [12]:
#section 4 (Model Building)
'''
create your first dgl gcn model with 2 hidden layers
Remember that 2 layer gcn means that we're 
looking at the 1st hop and 2nd hop neighbors of the nodes in the batch
'''

import torch
import torch.nn as nn
import torch.nn.functional as F
from dgl.nn import GraphConv

class GCN(nn.Module):
    def __init__(self, in_feats, hidden_size, num_classes):
        super(GCN, self).__init__()
        '''
        define the first and second layer of the gcn model using dgl's GraphConv module
        read more here: https://docs.dgl.ai/generated/dgl.nn.pytorch.conv.GraphConv.html
        make sure to use the correct in_feats and out_feats for the layers
        '''
        self.conv1 = GraphConv(in_feats, hidden_size)
        self.conv2 = GraphConv(hidden_size, num_classes)
        
    def forward(self, block, inputs):
        '''
        Implement the forward pass of the gcn model based on the layers defined in the __init__ function
        '''
        #remember that you need to pass respective layer information i.e., block[0] for layer 1 and block[1] for layer 2
        h = self.conv1(block[0], inputs)
        h = F.relu(h)
        h = self.conv2(block[1], h)
        return h

In [13]:
#section 5 (write evaluate function, refer to the driver code below for hints)

def evaluate(model, val_loader, criterion):
    '''
    Implement the evaluation function and return the loss and accuracy. 
    The code should be very similar to the train function below, except that you need to compute metrics and not backprop loss

    Args:
        model: GCN Model
        val_loader: validation dataset loader
        criterion: loss criterion 
    Returns: 
        values of loss and accuracy
    '''
    for input_nodes, output_nodes, blocks in val_loader:
      inputs = blocks[0].srcdata['features'].float()
      labels = blocks[1].dstdata['labels']
      logits = model(blocks, inputs)
      # loss = loss_func(logits, labels)
      loss = criterion(logits,labels.squeeze())

      predicted_labels = torch.argmax(logits, dim=1)
      accuracy = (predicted_labels == labels.squeeze()).float().mean()

      return loss, accuracy

In [30]:

#train function, use this as a helper to complete the evaluate function above
def train(model, train_loader, optimizer, criterion):
    model.train()
    for input_nodes, output_nodes, blocks in train_loader:
        inputs = blocks[0].srcdata['features'].float()
        labels = blocks[1].dstdata['labels']
        logits = model(blocks, inputs)
        
        loss = loss_func(logits, labels.squeeze())

        # loss = loss_func(logits, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

#initialize the model, optimizer, and criterion
in_feat_shape = graph.ndata['features'].shape[1]
hidden_size = 16
num_classes = 2
model = GCN(in_feat_shape, hidden_size, num_classes)
loss_func = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters())

#train the model for 50 epochs and evaluate every 5 epochs
loss_values = []
accuracy_values = []
for epoch in range(100):
    print(f'Running Epoch {epoch}')
    train(model, train_loader, optimizer, loss_func)
    if epoch % 5 == 0:
      loss, acc = evaluate(model, val_loader, loss_func)
      loss_values.append(loss)
      accuracy_values.append(acc)
      print('Epoch: {}, Loss: {:.4f}, Accuracy: {:.4f}'.format(epoch, loss, acc))

Running Epoch 0
Epoch: 0, Loss: 130.9405, Accuracy: 0.0264
Running Epoch 1
Running Epoch 2
Running Epoch 3
Running Epoch 4
Running Epoch 5
Epoch: 5, Loss: 1.8562, Accuracy: 0.9668
Running Epoch 6
Running Epoch 7
Running Epoch 8
Running Epoch 9
Running Epoch 10
Epoch: 10, Loss: 0.8625, Accuracy: 0.8516
Running Epoch 11
Running Epoch 12
Running Epoch 13
Running Epoch 14
Running Epoch 15
Epoch: 15, Loss: 0.5284, Accuracy: 0.8848
Running Epoch 16
Running Epoch 17
Running Epoch 18
Running Epoch 19
Running Epoch 20
Epoch: 20, Loss: 0.2641, Accuracy: 0.9307
Running Epoch 21
Running Epoch 22
Running Epoch 23
Running Epoch 24
Running Epoch 25
Epoch: 25, Loss: 0.1961, Accuracy: 0.9482
Running Epoch 26
Running Epoch 27
Running Epoch 28
Running Epoch 29
Running Epoch 30
Epoch: 30, Loss: 0.1658, Accuracy: 0.9580
Running Epoch 31
Running Epoch 32
Running Epoch 33
Running Epoch 34
Running Epoch 35
Epoch: 35, Loss: 0.1897, Accuracy: 0.9463
Running Epoch 36
Running Epoch 37
Running Epoch 38
Running Epo

## Visualize the loss and accuracy curves

In [15]:
!pip install plotly

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [35]:
import plotly.graph_objects as go
import numpy as np

loss_numpy_array = np.array([loss.detach().numpy() for loss in loss_values])
accuracy_numpy_array = np.array([accuracy.detach().numpy() for accuracy in accuracy_values])

# Create traces for loss and accuracy
loss_trace = go.Scatter(x=list(range(len(loss_numpy_array))), y=loss_numpy_array, name='Loss')

# Create layout for loss plot
loss_layout = go.Layout(
    title='Loss over Time',
    xaxis=dict(title='Epochs'),
    yaxis=dict(title='Loss')
)

# Create figure for loss plot
loss_fig = go.Figure(data=[loss_trace], layout=loss_layout)
# Set the size of the figure
loss_fig.update_layout(height=400, width=400)

# Display loss plot
loss_fig.show()

accuracy_trace = go.Scatter(x=list(range(len(accuracy_numpy_array))), y=accuracy_numpy_array, name='Accuracy')

# Create layout
accuracy_layout = go.Layout(
    title='Accuracy over Time',
    xaxis=dict(title='Epochs'),
    yaxis=dict(title='Value')
)

# Combine traces and layout into a Figure object
data = [accuracy_trace]
acc_fig = go.Figure(data=data, layout=accuracy_layout)

# Set the size of the figure
acc_fig.update_layout(height=400, width=400)

# Display the plot
acc_fig.show()