# Federated GNN on Graph Dataset using Inductive Learning

## Introduction to GNN, Tasks, and Federated GNN via Inductive Learning
### GNN
This example shows how to train a classification model with a Graph Neural Network (GNN) using a modern **recipe-based approach**. GNNs are promising in both research and industry, with applications in domains such as social networks, e-commerce, and recommendation systems.
GNNs learn from graph-structured data by combining each node's own features with information aggregated from its neighbors, so they can model relationships that are hard to express in tabular form. They combine local and global information, incorporate structural knowledge, and can be adapted to diverse tasks, heterogeneous data, transfer learning, and large graphs.

### Tasks
In this example, we provide two tasks:
1. **Protein Classification**:
The aim is to classify protein roles based on their cellular functions from gene ontology. We use the PPI ([protein-protein interaction](http://snap.stanford.edu/graphsage/#code)) dataset, which is commonly used in graph-based machine learning, especially in bioinformatics. Each graph represents a specific human tissue: nodes represent proteins and edges represent interactions between them.
2. **Financial Transaction Classification**:
The aim is to classify whether a given transaction is licit or illicit. For this financial application, we use the [Elliptic++](https://github.com/git-disl/EllipticPlusPlus) dataset. It consists of 203k Bitcoin transactions and 822k wallet addresses to enable both the detection of fraudulent transactions and the detection of illicit addresses (actors) in the Bitcoin network by leveraging graph data. For more details, please refer to this [paper](https://arxiv.org/pdf/2306.06108.pdf).


### Federated GNN via Inductive Learning with FedAvgRecipe
Both tasks are node classification tasks. We use the inductive representation learning method [GraphSAGE](https://arxiv.org/pdf/1706.02216.pdf), based on [PyTorch Geometric](https://github.com/pyg-team/pytorch_geometric)'s examples.
[PyTorch Geometric](https://pytorch-geometric.readthedocs.io/en/latest/) is a library built upon PyTorch to easily write and train Graph Neural Networks (GNNs) for a wide range of applications related to structured data.
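
To make the "inductive" property concrete, the following is a minimal, dependency-free sketch of one GraphSAGE-style layer with a mean aggregator. It is not PyG's `SAGEConv` implementation; the function name `sage_mean_layer` and the toy weights are assumptions made for illustration. What it shows is that the weights depend only on the feature dimensions, not on the number of nodes, so the same weights can be applied to a graph that was not seen during training.

```python
from typing import Dict, List

Vector = List[float]
Matrix = List[List[float]]  # shape: [out_dim][in_dim]


def matvec(w: Matrix, x: Vector) -> Vector:
    """Multiply a small dense matrix by a vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def sage_mean_layer(
    features: Dict[int, Vector],
    neighbors: Dict[int, List[int]],
    w_self: Matrix,
    w_neigh: Matrix,
) -> Dict[int, Vector]:
    """Compute h_v = ReLU(W_self x_v + W_neigh * mean of neighbor features)."""
    out = {}
    for v, x_v in features.items():
        nbrs = neighbors.get(v, [])
        if nbrs:
            mean = [
                sum(features[u][i] for u in nbrs) / len(nbrs)
                for i in range(len(x_v))
            ]
        else:
            mean = [0.0] * len(x_v)  # isolated node: nothing to aggregate
        h = [a + b for a, b in zip(matvec(w_self, x_v), matvec(w_neigh, mean))]
        out[v] = [max(0.0, t) for t in h]
    return out


# Hypothetical weights (2 input features -> 2 outputs), standing in for trained ones.
w_self = [[1.0, 0.0], [0.0, 1.0]]
w_neigh = [[0.5, 0.0], [0.0, 0.5]]

# Graph A: a path 0 - 1 - 2.
feats_a = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
nbrs_a = {0: [1], 1: [0, 2], 2: [1]}

# Graph B: a different graph with different node ids and size.
feats_b = {10: [2.0, 2.0], 11: [4.0, 0.0]}
nbrs_b = {10: [11], 11: [10]}

# The same weights are reused on both graphs without any change.
emb_a = sage_mean_layer(feats_a, nbrs_a, w_self, w_neigh)
emb_b = sage_mean_layer(feats_b, nbrs_b, w_self, w_neigh)
print(emb_a[1], emb_b[10])  # [0.5, 1.25] [4.0, 2.0]
```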

For the protein classification task, we use GraphSAGE in an unsupervised manner, following [PyG's unsupervised PPI example](https://github.com/pyg-team/pytorch_geometric/blob/master/examples/graph_sage_unsup_ppi.py).
For the financial transaction classification task, we use it in a supervised manner, training directly on the node labels with a supervised classification loss.
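
The two training modes differ mainly in what the loss is computed on. The sketch below contrasts them on toy numbers using only the standard library. The unsupervised variant scores a node pair by the dot product of the two embeddings and applies binary cross-entropy against edge / non-edge labels, which is our reading of the approach in PyG's unsupervised example. The supervised variant applies a loss to per-node logits; binary cross-entropy is assumed here, and the example's scripts may use a different classification loss. All function names and values are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def bce_with_logits(logits: Sequence[float], targets: Sequence[float]) -> float:
    """Mean binary cross-entropy computed from raw logits (numerically stable)."""
    total = 0.0
    for z, y in zip(logits, targets):
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def unsupervised_link_loss(
    emb: Dict[int, List[float]],
    pos_pairs: List[Tuple[int, int]],
    neg_pairs: List[Tuple[int, int]],
) -> float:
    """Connected pairs should score high, sampled non-edges should score low."""
    logits = [dot(emb[u], emb[v]) for u, v in pos_pairs + neg_pairs]
    targets = [1.0] * len(pos_pairs) + [0.0] * len(neg_pairs)
    return bce_with_logits(logits, targets)


def supervised_node_loss(node_logits: Sequence[float], labels: Sequence[float]) -> float:
    """Each labeled node contributes one classification term."""
    return bce_with_logits(node_logits, labels)


# Toy embeddings: nodes 0 and 1 are similar, node 2 points the other way.
emb = {0: [1.0, 0.0], 1: [1.0, 0.0], 2: [-1.0, 0.0]}

good = unsupervised_link_loss(emb, pos_pairs=[(0, 1)], neg_pairs=[(0, 2)])
bad = unsupervised_link_loss(emb, pos_pairs=[(0, 2)], neg_pairs=[(0, 1)])
sup = supervised_node_loss([2.0, -2.0], [1.0, 0.0])
print(good < bad)  # True: embeddings that agree with the edges give a lower loss
```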

Because we use inductive learning, the locally learned model (a representation encoder or classification network) does not depend on the particular graph it was trained on: its parameters have the same shapes on every client. We can therefore use basic [FedAvg](https://arxiv.org/abs/1602.05629) as the federated learning algorithm.
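
As a rough illustration of the aggregation step, here is a minimal FedAvg sketch in plain Python. It is not NVFlare's aggregator; the function name `fedavg`, the parameter names, and the sample counts are assumptions. Each client's parameters are weighted by its number of training samples, which only works because every client holds parameters of identical shape.

```python
from typing import Dict, List

Params = Dict[str, List[float]]  # flattened parameters keyed by name


def fedavg(client_params: List[Params], num_samples: List[int]) -> Params:
    """Sample-weighted average of client parameters."""
    total = sum(num_samples)
    averaged: Params = {}
    for name in client_params[0]:
        size = len(client_params[0][name])
        averaged[name] = [
            sum(p[name][i] * n for p, n in zip(client_params, num_samples)) / total
            for i in range(size)
        ]
    return averaged


# Hypothetical local updates from two clients with different amounts of data.
client_1 = {"conv1.weight": [1.0, 2.0], "conv1.bias": [0.0]}
client_2 = {"conv1.weight": [3.0, 6.0], "conv1.bias": [4.0]}

global_params = fedavg([client_1, client_2], num_samples=[100, 300])
print(global_params)  # {'conv1.weight': [2.5, 5.0], 'conv1.bias': [3.0]}
```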

This example uses **NVFlare's FedAvgRecipe** to create federated GNN training jobs, providing a cleaner, more programmatic API compared to traditional job templates.

Below we list steps to run this example.

## 1. Set up NVFLARE

Follow the [Getting_Started](https://nvflare.readthedocs.io/en/main/getting_started.html) guide to set up a virtual environment and install NVFLARE.

We also provide a [Notebook](../../nvflare_setup.ipynb) for this setup process. 

Assuming you have already set up the virtual environment, let's first install the required packages.

In [None]:
%pip install -r requirements.txt

This example relies on PyTorch Geometric functions that need extra dependencies. Please refer to the [installation guide](https://pytorch-geometric.readthedocs.io/en/latest/install/installation.html) and install the packages that match your PyTorch version and platform:

In [None]:
%pip install pyg_lib torch_scatter torch_sparse torch_cluster torch_spline_conv -f https://data.pyg.org/whl/torch-2.1.0+cpu.html
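
The wheel index in the command above is tied to one PyTorch version and to CPU builds. The helper below is a small sketch for building the matching index URL for another setup. The URL pattern is inferred from the command above and from the installation guide; `pyg_wheel_index` is a hypothetical helper, and a wheel index may not exist for every version and compute tag, so check the guide before relying on the result.

```python
def pyg_wheel_index(torch_version: str, compute_tag: str = "cpu") -> str:
    """Build a PyG wheel index URL for a PyTorch version and compute tag.

    torch_version may carry a local suffix such as "2.1.0+cu121"
    (as torch.__version__ often does); the suffix is dropped.
    """
    base_version = torch_version.split("+")[0]
    return f"https://data.pyg.org/whl/torch-{base_version}+{compute_tag}.html"


print(pyg_wheel_index("2.1.0"))
# https://data.pyg.org/whl/torch-2.1.0+cpu.html
```

In a notebook, the `torch_version` argument could be taken from `torch.__version__` once PyTorch is installed.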

## 2. Data Preparation
This example uses two datasets: 
- For Protein Classification, the PPI dataset is available from torch_geometric's dataset API.  
- For Financial Transaction Classification, first download the [Elliptic++](https://github.com/git-disl/EllipticPlusPlus) dataset to the `/tmp/nvflare/datasets/elliptic_pp` folder. This example uses the following three files:
    - `txs_classes.csv`: transaction id and its class (licit or illicit)
    - `txs_edgelist.csv`: connections for transaction ids 
    - `txs_features.csv`: transaction id and its features
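
The three files can be joined into a single graph by mapping transaction ids to consecutive node indices. The sketch below shows one way to do this with the standard library on tiny in-memory stand-ins for the files. The column names (`txId`, `class`, `txId1`, `txId2`) and the class coding (1 = illicit, 2 = licit, 3 = unknown) are assumptions; check the headers of the downloaded files. `build_graph` is a hypothetical helper, not the loader used by this example's scripts.

```python
import csv
import io
from typing import List, TextIO, Tuple

# Toy stand-ins for txs_classes.csv, txs_edgelist.csv and txs_features.csv.
classes_csv = "txId,class\n101,2\n102,1\n103,3\n"
edges_csv = "txId1,txId2\n101,102\n102,103\n"
features_csv = "txId,f1,f2\n101,0.1,0.2\n102,0.3,0.4\n103,0.5,0.6\n"


def build_graph(
    classes_file: TextIO, edges_file: TextIO, features_file: TextIO
) -> Tuple[List[List[float]], List[List[int]], List[int]]:
    """Return (node features, edge index as [sources, targets], labels)."""
    feature_rows = list(csv.DictReader(features_file))
    # Map each transaction id to a consecutive node index.
    id_to_idx = {row["txId"]: i for i, row in enumerate(feature_rows)}
    x = [
        [float(value) for key, value in row.items() if key != "txId"]
        for row in feature_rows
    ]

    # Assumed coding: "1" -> illicit (1), "2" -> licit (0), anything else -> unlabeled (-1).
    label_map = {"1": 1, "2": 0}
    y = [-1] * len(feature_rows)
    for row in csv.DictReader(classes_file):
        y[id_to_idx[row["txId"]]] = label_map.get(row["class"], -1)

    sources, targets = [], []
    for row in csv.DictReader(edges_file):
        sources.append(id_to_idx[row["txId1"]])
        targets.append(id_to_idx[row["txId2"]])
    return x, [sources, targets], y


x, edge_index, y = build_graph(
    io.StringIO(classes_csv), io.StringIO(edges_csv), io.StringIO(features_csv)
)
print(edge_index, y)  # [[0, 1], [1, 2]] [0, 1, -1]
```

With the real files, the `io.StringIO` objects would be replaced by files opened from the dataset folder.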

## 3. Local Experiments
For comparison with the federated learning results, we first run local experiments on each client's data and on the whole dataset. We simulate 2 clients with a uniform data split (client_id = 0 means the whole dataset). The six experiments (three per task) will take a while to finish. The default number of epochs is 70.
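
The `client_id` convention can be sketched as follows. This is only an illustration of a uniform split under stated assumptions: `client_indices` is a hypothetical helper, and the way the example's scripts actually partition graphs or nodes is not shown here.

```python
import random
from typing import List


def client_indices(
    num_items: int, num_clients: int, client_id: int, seed: int = 0
) -> List[int]:
    """Return the item indices owned by a client; client_id 0 means everything."""
    indices = list(range(num_items))
    if client_id == 0:
        return indices
    if not 1 <= client_id <= num_clients:
        raise ValueError(f"client_id must be in 0..{num_clients}, got {client_id}")
    # Shuffle with a fixed seed so every client derives the same permutation,
    # then give each client every num_clients-th item.
    random.Random(seed).shuffle(indices)
    return sorted(indices[client_id - 1 :: num_clients])


whole = client_indices(10, num_clients=2, client_id=0)
part_1 = client_indices(10, num_clients=2, client_id=1)
part_2 = client_indices(10, num_clients=2, client_id=2)
print(len(whole), len(part_1), len(part_2))  # 10 5 5
```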

In [None]:
! python3 utils/graphsage_protein_local.py --client_id 0
! python3 utils/graphsage_protein_local.py --client_id 1
! python3 utils/graphsage_protein_local.py --client_id 2 

And for the finance experiments:

In [None]:
! python3 utils/graphsage_finance_local.py --client_id 0
! python3 utils/graphsage_finance_local.py --client_id 1
! python3 utils/graphsage_finance_local.py --client_id 2 

## 4. Create and Run Federated Learning Jobs using FedAvgRecipe
We are using NVFlare's FL simulator to run the FL experiments. We use the recipe-based approach with `job.py`, which provides task-specific job creation functions using **FedAvgRecipe**.

The `job.py` file provides two functions:
- `create_protein_job()`: Creates a job for protein classification
- `create_finance_job()`: Creates a job for financial transaction classification

We set the local epochs to 10 with 7 rounds of FL to match the default 70-epoch training in local experiments.
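
The budget matching is simple arithmetic: each client trains for `num_rounds * epochs_per_round` local epochs in total. The short sketch below lists every schedule that matches a given local budget, in case you want to trade communication rounds against local epochs; `fl_schedules` is a hypothetical helper, not part of `job.py`.

```python
from typing import List, Tuple


def fl_schedules(total_epochs: int) -> List[Tuple[int, int]]:
    """All (num_rounds, epochs_per_round) pairs whose product is total_epochs."""
    return [
        (rounds, total_epochs // rounds)
        for rounds in range(1, total_epochs + 1)
        if total_epochs % rounds == 0
    ]


schedules = fl_schedules(70)
print(schedules)
# [(1, 70), (2, 35), (5, 14), (7, 10), (10, 7), (14, 5), (35, 2), (70, 1)]
```

The setting used in this example corresponds to the pair `(7, 10)`.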

### Protein Classification
First, run the protein classification job:

In [None]:
# Run the protein classification job using job.py
! python3 job.py \
  --task_type protein \
  --client_ids 1 2 \
  --num_rounds 7 \
  --epochs_per_round 10 \
  --data_path /tmp/nvflare/datasets/ppi \
  --workspace_dir /tmp/nvflare/gnn/protein_fl_workspace \
  --job_dir /tmp/nvflare/jobs/gnn_protein \
  --threads 2

### Financial Transaction Classification
Now let's run the financial transaction classification job:

In [None]:
! python3 job.py \
  --task_type finance \
  --client_ids 1 2 \
  --num_rounds 7 \
  --epochs_per_round 10 \
  --data_path /tmp/nvflare/datasets/elliptic_pp \
  --workspace_dir /tmp/nvflare/gnn/finance_fl_workspace \
  --job_dir /tmp/nvflare/jobs/gnn_finance \
  --threads 2

## 5. View Results
The `job.py` script creates, exports, and runs the jobs automatically. Results from both the local and the federated experiments can be visualized in TensorBoard.

In [None]:
%load_ext tensorboard
%tensorboard --logdir /tmp/nvflare/gnn  