# Introduction 

In this hands-on Python tutorial, we explore the Elliptic Data Set, a graph of transactions collected from the Bitcoin blockchain. Our goal is to prepare this data for training machine learning models with the PyTorch and PyTorch Lightning frameworks.

# The Dataset

Let's dissect the dataset's description:

* Temporal Structure: The dataset is divided into 49 distinct time steps, each spaced roughly two weeks apart. Within every time step, we find a connected group of transactions occurring within a three-hour window.
* Transaction Features: Each transaction is characterized by 94 'local' features. These include its timestamp, input/output counts, fees, volume, and interesting aggregations (e.g., average BTC involved in inputs/outputs).
* Neighborhood Features: An additional 72 'aggregated' features illuminate each transaction's context. We get statistics like the maximum, minimum, standard deviation, and correlation coefficients derived from transactions one hop away.

# Exploratory Data Analysis (EDA)

We'll begin our journey with exploratory data analysis (EDA). Key things to explore:

* Distributions: Examine the distributions of transaction features (fees, volumes, etc.) to spot patterns and potential outliers.
* Correlations: Investigate relationships between transaction features. Which features correlate, and can this insight inform our model design?
* Temporal Trends: Analyze how features change across the 49 time steps. Are there seasonal effects or evolving network behaviors?

# PyTorch Datasets

Our EDA findings will guide how we structure our PyTorch Datasets. Here's where things get exciting:

* Custom Dataset Class: We'll create a custom PyTorch Dataset class to load and preprocess the raw data dynamically during model training.
* Data Transformations: We might apply scaling, normalization, or other essential transformations to make the data more suitable for machine learning.
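
As a minimal sketch of one such transformation, the snippet below applies z-score standardization using only the Python standard library. The helper names `fit_standardizer` and `standardize` and the toy rows are hypothetical and are not part of fintorch. The statistics are fitted on the training rows only and then reused for the test rows, so no information leaks from the test set.

```python
from statistics import mean, pstdev


def fit_standardizer(rows):
    """Return per-column (mean, std) computed from the given rows."""
    columns = list(zip(*rows))
    # Fall back to 1.0 for constant columns to avoid dividing by zero.
    return [(mean(col), pstdev(col) or 1.0) for col in columns]


def standardize(rows, stats):
    """Apply z-score scaling using previously fitted statistics."""
    return [[(value - mu) / sigma for value, (mu, sigma) in zip(row, stats)]
            for row in rows]


# Hypothetical toy rows: [fee, volume, constant column]
train_rows = [[1.0, 10.0, 5.0], [3.0, 30.0, 5.0]]
test_rows = [[2.0, 50.0, 5.0]]

stats = fit_standardizer(train_rows)           # fit on training data only
scaled_train = standardize(train_rows, stats)
scaled_test = standardize(test_rows, stats)    # reuse the training statistics
print(scaled_train, scaled_test)
```

In a real pipeline the same idea would typically be applied to tensors inside the Dataset or DataModule rather than to Python lists.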

# PyTorch Lightning Integration

Finally, we'll leverage PyTorch Lightning to streamline our training process.

* DataModule: A PyTorch Lightning DataModule will encapsulate our Datasets, manage data loading, and handle batching for efficient model training.

# What You'll Build

By the end of this tutorial, you'll have a solid foundation for training machine learning models on this dataset. This foundation sets the stage for exciting applications such as:

* Fraud detection
* Transaction pattern analysis
* Blockchain network behavior prediction

Let's get started!


# Dataset background
The dataset description originates from [Kaggle Elliptic Data Set](https://www.kaggle.com/datasets/ellipticco/elliptic-data-set) and is restated here for convenience. 

## Dataset description
This anonymized data set is a transaction graph collected from the Bitcoin blockchain. A node in the graph represents a transaction, an edge can be viewed as a flow of Bitcoins between one transaction and the other. Each node has 166 features and has been labeled as being created by a "licit", "illicit" or "unknown" entity.

### Nodes and edges

The graph is made of 203,769 nodes and 234,355 edges. Two percent (4,545) of the nodes are labelled class1 (illicit). Twenty-one percent (42,019) are labelled class2 (licit). The remaining transactions are not labelled with regard to licit versus illicit.

### Features

There are 166 features associated with each node. Due to intellectual property issues, we cannot provide an exact description of all the features in the dataset. There is a time step associated to each node, representing a measure of the time when a transaction was broadcasted to the Bitcoin network. The time steps, running from 1 to 49, are evenly spaced with an interval of about two weeks. Each time step contains a single connected component of transactions that appeared on the blockchain within less than three hours between each other; there are no edges connecting the different time steps.

The first 94 features represent local information about the transaction – including the time step described above, number of inputs/outputs, transaction fee, output volume and aggregated figures such as average BTC received (spent) by the inputs/outputs and average number of incoming (outgoing) transactions associated with the inputs/outputs. The remaining 72 features are aggregated features, obtained using transaction information one-hop backward/forward from the center node - giving the maximum, minimum, standard deviation and correlation coefficients of the neighbour transactions for the same information data (number of inputs/outputs, transaction fee, etc.).
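
The exact recipe behind the aggregated features is not disclosed, but the general idea of one-hop aggregation can be sketched as follows. This is an illustration under assumptions, not the dataset authors' method: the function name `one_hop_aggregates`, the toy graph, and the fee values are made up, edges are treated as undirected, and correlation coefficients are left out for brevity.

```python
from collections import defaultdict
from statistics import pstdev


def one_hop_aggregates(edges, feature):
    """Aggregate a per-node feature over each node's one-hop neighbours."""
    neighbours = defaultdict(set)
    for src, dst in edges:
        # Treat edges as undirected: look one hop backward and forward.
        neighbours[src].add(dst)
        neighbours[dst].add(src)

    aggregates = {}
    for node, adjacent in neighbours.items():
        values = [feature[n] for n in adjacent]
        aggregates[node] = {
            "max": max(values),
            "min": min(values),
            "std": pstdev(values),  # population standard deviation
        }
    return aggregates


# Hypothetical toy graph a -> b -> c with a made-up fee per transaction.
edges = [("a", "b"), ("b", "c")]
fee = {"a": 1.0, "b": 2.0, "c": 3.0}
aggregates = one_hop_aggregates(edges, fee)
print(aggregates["b"])
```

Here node `b` sees the fees of `a` and `c`, so its aggregated features summarize its direct neighbourhood rather than the transaction itself.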

### Dataset files

The dataset consists of three files:
* **elliptic_txs_classes.csv:** Each node is labelled as a "licit" (0), "illicit" (1), or "unknown" (2) entity in the class column; the txId column is a unique identifier of the node.  
* **elliptic_txs_edgelist.csv:** A list of node pairs that are connected. The file has two columns, txId1 and txId2. 
* **elliptic_txs_features.csv:** A file with 167 columns: the first column is the transaction id, and the remaining 166 columns are the node features. 

For detailed statistics, please visit the Kaggle Data Explorer of the [Elliptic Data Set](https://www.kaggle.com/datasets/ellipticco/elliptic-data-set). 

# Loading the data
We use the FinTorch.datasets library to load the [Elliptic Data Set](https://www.kaggle.com/datasets/ellipticco/elliptic-data-set). The following code downloads the dataset:

In [1]:
from fintorch.datasets import elliptic

# Load the elliptic dataset
elliptic_dataset = elliptic.EllipticDataset('~/.fintorch_data', force_reload=True)

Processing...
Done!


Let's discuss the code line by line:

1. **Importing:** We import the elliptic module from the fintorch.datasets package. This module provides convenient access to the Elliptic Bitcoin Dataset.

2. **Loading the Dataset:** We create an instance of the elliptic.EllipticDataset class and store it in the elliptic_dataset variable. This downloads the dataset from Kaggle and places it in the ~/.fintorch_data directory. The fintorch framework uses the [Kaggle API](https://github.com/Kaggle/kaggle-api) to download datasets. Make sure you've followed the instructions in the fintorch documentation to set up your Kaggle API credentials for seamless data access.



With the dataset ready, let's examine its structure. 

# Exploration


We first inspect the PyTorch dataset object and then use Polars for some basic exploratory data analysis:

In [2]:
type(elliptic_dataset)

fintorch.datasets.elliptic.EllipticDataset

The dataset contains a single graph, so we access element 0 in the data list:

In [3]:
elliptic_dataset[0]

Data(x=[203769, 167], edge_index=[2, 234355], y=[203769], train_mask=[203769], val_mask=[203769], test_mask=[203769])

We have the following elements in the dataset:
* **x:** 203,769 nodes with 167 feature values each
* **edge_index:** 234,355 pairs of node indices representing the edges between nodes. Note that we transformed the node names into indices. The mapping is stored in *elliptic_dataset.map_id*
* **y:** one class label per node
* **train_mask:** a mask to indicate which nodes are used to train the model
* **val_mask:** a mask to indicate which nodes are used as validation set
* **test_mask:** a mask to indicate which nodes are used as a test set
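
To illustrate how node names can be turned into indices, here is a minimal sketch that builds such a mapping from an edge list. It is not the fintorch implementation: the function `build_edge_index`, the column names in the sample text, and the transaction ids are assumptions made for illustration.

```python
import csv
import io

# Hypothetical edge-list excerpt; the ids below are made up.
raw_csv = "txId1,txId2\n900,700\n700,800\n900,800\n"


def build_edge_index(csv_text):
    """Map transaction ids to contiguous indices and return a 2 x E edge list."""
    id_to_index = {}
    sources, targets = [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        for tx_id in (row["txId1"], row["txId2"]):
            # Assign the next free index the first time an id is seen.
            id_to_index.setdefault(tx_id, len(id_to_index))
        sources.append(id_to_index[row["txId1"]])
        targets.append(id_to_index[row["txId2"]])
    return id_to_index, [sources, targets]


id_to_index, edge_index = build_edge_index(raw_csv)
print(id_to_index)
print(edge_index)
```

The two inner lists have the same layout as the `[2, num_edges]` shape of `edge_index`: the first row holds source indices and the second row holds target indices.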

In addition, we can query some properties of the dataset:

In [4]:
print(f'Number of node features: {elliptic_dataset.num_features}')
print(f'Number of edge features: {elliptic_dataset.num_edge_features}')
print(f'Number of classes: {elliptic_dataset.num_classes}')
print(f'Feature input matrix shape:{elliptic_dataset.x.shape}')
print(f'Edge index feature matrix shape:{elliptic_dataset.edge_index.shape}')
print(f'Label feature matrix shape:{elliptic_dataset.y.shape}')

Number of node features: 167
Number of edge features: 0
Number of classes: 3
Feature input matrix shape:torch.Size([203769, 167])
Edge index feature matrix shape:torch.Size([2, 234355])
Label feature matrix shape:torch.Size([203769])


In [5]:
import polars as pol

# Convert elliptic_dataset.y to a numpy array and then to a polars Series
y_series = pol.Series(elliptic_dataset.y.numpy())

# Calculate the fraction of each value in the distribution
fraction = y_series.value_counts() 
# Normalize the count column in fraction
fraction = fraction.with_columns(count_normalized = fraction['count'] / y_series.shape[0])


# Print the fraction of the value distribution
print(fraction)


shape: (3, 3)
┌─────┬────────┬──────────────────┐
│     ┆ count  ┆ count_normalized │
│ --- ┆ ---    ┆ ---              │
│ i64 ┆ u32    ┆ f64              │
╞═════╪════════╪══════════════════╡
│ 2   ┆ 157205 ┆ 0.771486         │
│ 0   ┆ 42019  ┆ 0.206209         │
│ 1   ┆ 4545   ┆ 0.022305         │
└─────┴────────┴──────────────────┘


The class distribution reveals a severe imbalance, with the "unknown" class making up about 77% of the dataset. This means that a naive model could achieve a misleadingly high accuracy of about 77% simply by always predicting the majority class. It's crucial to be aware of this imbalance when evaluating model performance on this dataset. 
Relying solely on accuracy could lead to the false impression that a model is performing well, when in reality it's merely taking advantage of the skewed distribution. 
To get a true understanding of model performance, consider metrics like precision, recall, and F1-score. 
Additionally, it's important to explore techniques for addressing class imbalance, such as resampling, cost-sensitive learning, or specialized loss functions.
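
To make this concrete, the sketch below scores a baseline that always predicts the majority class, using the label counts from the table above. The helper `baseline_report` is hypothetical and written with the standard library only; in practice a metrics library would usually do this work.

```python
# Label counts from the table above: 0 = licit, 1 = illicit, 2 = unknown.
counts = {0: 42019, 1: 4545, 2: 157205}
total = sum(counts.values())
majority = max(counts, key=counts.get)


def baseline_report(counts, predicted_class):
    """Precision, recall, and F1 per class for a constant prediction."""
    total = sum(counts.values())
    report = {}
    for label, support in counts.items():
        # A constant predictor labels every node as predicted_class.
        true_pos = support if label == predicted_class else 0
        predicted = total if label == predicted_class else 0
        precision = true_pos / predicted if predicted else 0.0
        recall = true_pos / support
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    return report


accuracy = counts[majority] / total
report = baseline_report(counts, majority)
print(f"accuracy: {accuracy:.3f}")
for label, scores in report.items():
    print(label, {name: round(value, 3) for name, value in scores.items()})
```

The baseline reaches an accuracy of about 0.771, yet its recall for the illicit class is zero, which is exactly the failure that accuracy alone hides.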

# Simple model
We now load the Elliptic dataset into a simple graph neural network and, for illustrative purposes only, use accuracy as the evaluation metric. Given the class imbalance we just discussed, accuracy alone is not a reliable measure of performance here. Our main focus is to showcase the loading process, not to achieve optimal performance. 

This code defines a Graph Neural Network (GNN) model for processing data arranged in graphs. It takes node features and information about how the nodes connect as inputs. The model then stacks several layers that combine features from neighboring nodes, similar to how information spreads in a network. Finally, it outputs a new set of features for each node, potentially useful for classification tasks.

In [6]:
from torch import Tensor
import torch.nn as nn
import torch_geometric.nn as geom_nn

import torch.optim as optim


class GNNModel(nn.Module):

    def __init__(self, c_in: int, c_hidden: int, c_out: int, num_layers: int = 5, dp_rate: float = 0.1, **kwargs):
        """
        Initialize the GNN model.

        Args:
            c_in (int): Number of input channels.
            c_hidden (int): Number of hidden channels.
            c_out (int): Number of output channels.
            num_layers (int, optional): Number of GNN layers. Defaults to 5.
            dp_rate (float, optional): Dropout rate. Defaults to 0.1.
            **kwargs: Additional keyword arguments to be passed to the GNN layers.

        Returns:
            None
        """

        super().__init__()
        gnn_layer = geom_nn.GCNConv

        layers = []
        in_channels, out_channels = c_in, c_hidden
        for _ in range(num_layers-1):
            layers += [
                gnn_layer(in_channels=in_channels,
                          out_channels=out_channels,
                          **kwargs),
                nn.ReLU(inplace=True),
                nn.Dropout(dp_rate)
            ]
            in_channels = c_hidden
        layers += [gnn_layer(in_channels=in_channels,
                             out_channels=c_out,
                             **kwargs)]
        self.layers = nn.ModuleList(layers)

    def forward(self, x: Tensor, edge_index: Tensor) -> Tensor:
        """
        Forward pass of the model.

        Args:
            x (Tensor): Input tensor.
            edge_index (Tensor): Edge index tensor.

        Returns:
            Tensor: Output tensor.
        """

        for layer in self.layers:
            if isinstance(layer, geom_nn.MessagePassing):
                # In case of a geom layer, also pass the edge_index list
                x = layer(x, edge_index)
            else:
                x = layer(x)

        return x

The following code builds on the GNN model by turning it into a PyTorch Lightning module for training and evaluation. We define a GNN class that trains the model to predict the class of individual nodes in a graph. During training, we calculate accuracy and loss on a subset of nodes (masks) and update the model weights to minimize the loss. We also track the accuracy on validation and test sets to monitor the model's performance.

In [18]:
import torchmetrics
import pytorch_lightning as pl

class GNN(pl.LightningModule):

    def __init__(self, **model_kwargs):
        super().__init__()
        self.save_hyperparameters()

        self.accuracy = torchmetrics.classification.Accuracy(task="multiclass", num_classes=3)

        self.model = GNNModel(**model_kwargs)
        self.loss_module = nn.CrossEntropyLoss()

    def forward(self, data, mode="train"):
        x, edge_index = data.x, data.edge_index
        x = self.model(x, edge_index)

        # Get the mask 
        if mode == "train":
            mask = data.train_mask
        elif mode == "val":
            mask = data.val_mask
        elif mode == "test":
            mask = data.test_mask
        else:
            raise ValueError(f"Unknown forward mode: {mode}")

        # Calculate the loss for the mask
        loss = self.loss_module(x[mask], data.y[mask].long())
        pred = x[mask].argmax(dim=-1)
        
        return loss, pred, data.y[mask]

    def configure_optimizers(self):
        optimizer = optim.Adam(self.parameters(), lr=0.05)
        return optimizer

    def training_step(self, batch, batch_idx):
        loss, preds, y = self.forward(batch, mode="train")

        # log step metric
        self.accuracy(preds, y)
        self.log('train_acc_step', self.accuracy)

        self.log('train_loss', loss)
        return loss

    def validation_step(self, batch, batch_idx):
        loss, preds, y = self.forward(batch, mode="val")

        # log step metric
        self.accuracy(preds, y)
        self.log('val_acc_step', self.accuracy)

    def test_step(self, batch, batch_idx):
        loss, preds, y = self.forward(batch, mode="test")

        # log step metric
        self.accuracy(preds, y)
        self.log('test_acc_step', self.accuracy)


The following code sets up the training process for our model. It prepares the graph dataset for training using a DataLoader, creates the GNN model, configures a PyTorch Lightning trainer to manage the training process on a GPU, trains the model, tests its performance on a test set, and finally returns the trained GNN model.

In [32]:
from torch_geometric.loader import DataLoader

def train_node_classifier(dataset, **model_kwargs):
    node_data_loader = DataLoader(dataset, batch_size=1)

    # Create a PyTorch Lightning trainer that runs on a single GPU
    trainer = pl.Trainer(accelerator="gpu",
                         devices=1,
                         max_epochs=1000,
                         enable_progress_bar=False) # Disabled: each epoch is a single batch (the whole graph)
    
    # Note: the dimensions are specific for the Elliptic dataset
    model = GNN(c_in=167, c_out=3, **model_kwargs)
    trainer.fit(model, train_dataloaders=node_data_loader, val_dataloaders=node_data_loader)

    # Evaluate the trained model on the test set (test_mask)
    trainer.test(model, node_data_loader, verbose=True)


    return model

Here we call the function to train the model:

In [33]:
node_gnn_model = train_node_classifier(dataset=elliptic_dataset,
                                       c_hidden=256,
                                       num_layers=5,
                                       dp_rate=0.1)

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name        | Type               | Params
---------------------------------------------------
0 | accuracy    | MulticlassAccuracy | 0     
1 | model       | GNNModel           | 241 K 
2 | loss_module | CrossEntropyLoss   | 0     
---------------------------------------------------
241 K     Trainable params
0         Non-trainable params
241 K     Total params
0.965     Total estimated model params size (MB)


`Trainer.fit` stopped: `max_epochs=1000` reached.
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
       Test metric             DataLoader 0
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
      test_acc_step         0.8072921633720398
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────


The training ran on a GPU for the maximum of 1000 epochs and fit a GNN model with about 241 K parameters. On the test data, the model achieved an accuracy of around 80.7%. Note that this accuracy level is misleading! 
Let's check what the model actually predicts with the following code:

In [34]:
import torch
import polars as pol

# Switch to evaluation mode (disables dropout) and skip gradient tracking
node_gnn_model.eval()
with torch.no_grad():
    output = node_gnn_model.model(elliptic_dataset.x, elliptic_dataset.edge_index)

# Predicted class per node: the index of the largest output value
argmax_tensor = torch.argmax(output, dim=1)

# Convert the predictions to a numpy array and then to a polars Series
y_series = pol.Series(argmax_tensor.numpy())

# Calculate the fraction of each value in the distribution
fraction = y_series.value_counts() 
# Normalize the count column in fraction
fraction = fraction.with_columns(count_normalized = fraction['count'] / y_series.shape[0])

# Print the fraction of the value distribution
print(fraction)

shape: (1, 3)
┌─────┬────────┬──────────────────┐
│     ┆ count  ┆ count_normalized │
│ --- ┆ ---    ┆ ---              │
│ i64 ┆ u32    ┆ f64              │
╞═════╪════════╪══════════════════╡
│ 2   ┆ 203769 ┆ 1.0              │
└─────┴────────┴──────────────────┘


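The model predicts class 2 ("unknown") for every node. Its high accuracy therefore only reflects the share of "unknown" labels among the evaluated nodes, not any learned ability to separate licit from illicit transactions.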
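
One possible next step, sketched here under assumptions rather than taken from the tutorial's code, is cost-sensitive learning: drop the "unknown" nodes from the loss and weight the remaining classes inversely to their frequency. The helper `inverse_frequency_weights` is hypothetical; the resulting values could, for example, be converted to a tensor and passed as the `weight` argument of `nn.CrossEntropyLoss`.

```python
# Label counts from the class distribution: 0 = licit, 1 = illicit, 2 = unknown.
counts = {0: 42019, 1: 4545, 2: 157205}


def inverse_frequency_weights(counts, ignore=()):
    """Weight each kept class by total / (num_classes * count)."""
    kept = {label: n for label, n in counts.items() if label not in ignore}
    total = sum(kept.values())
    return {label: total / (len(kept) * n) for label, n in kept.items()}


# Assumption: "unknown" nodes carry no supervision signal, so they are ignored.
weights = inverse_frequency_weights(counts, ignore=(2,))
print({label: round(w, 3) for label, w in weights.items()})
```

With these weights a misclassified illicit node costs roughly nine times as much as a misclassified licit node, which counteracts the pull toward the more frequent class.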