# Federated GNN on Graph Dataset using Inductive Learning

## Introduction

In this example, we demonstrate how to train a Graph Neural Network (GNN) for node classification tasks using federated learning with NVIDIA FLARE's **recipe-based approach**. 

GNNs are widely used in research and industry, with applications in social networks, e-commerce, recommendation systems, and other domains. They excel at learning from and modeling complex relationships in graph-structured data.

We provide two federated learning tasks:
1. **Protein Classification**: Classify protein roles based on cellular functions
2. **Financial Transaction Classification**: Detect illicit transactions in Bitcoin networks

## Data

This example supports two datasets for different node classification tasks:

### 1. Protein-Protein Interaction (PPI) Dataset
- **Task**: Classify protein roles based on their cellular functions from gene ontology
- **Source**: [GraphSAGE PPI Dataset](http://snap.stanford.edu/graphsage/#code)
- **Structure**: Multiple graphs, where each graph represents a specific human tissue
- **Nodes**: Proteins
- **Edges**: Interactions between proteins
- **Features**: Gene sets, positional gene sets, immunological signatures, and gene ontology sets
- **Usage**: Commonly used in graph-based machine learning tasks in bioinformatics
- **Download**: The dataset is downloaded automatically through `torch_geometric.datasets.PPI` when the example runs

### 2. Elliptic++ Dataset
- **Task**: Classify whether a given Bitcoin transaction is licit or illicit
- **Source**: [Elliptic++ GitHub](https://github.com/git-disl/EllipticPlusPlus)
- **Structure**: Single large graph representing the Bitcoin transaction network
- **Scale**: 203k Bitcoin transactions and 822k wallet addresses
- **Files Required**:
  - `txs_classes.csv`: Transaction IDs and their class labels (licit/illicit)
  - `txs_edgelist.csv`: Connections between transaction IDs
  - `txs_features.csv`: Transaction IDs and their features
- **Reference**: [Demystifying Fraudulent Transactions and Illicit Nodes in the Bitcoin Network](https://arxiv.org/pdf/2306.06108.pdf)
- **Download**: Download manually from the source into `/tmp/nvflare/datasets/elliptic_pp`

## Model

Both tasks use **GraphSAGE** (Graph SAmple and aggreGatE), an inductive representation learning method for graph-structured data.

- **Framework**: [PyTorch Geometric](https://pytorch-geometric.readthedocs.io/en/latest/)
- **Paper**: [Inductive Representation Learning on Large Graphs](https://arxiv.org/pdf/1706.02216.pdf)
- **Key Feature**: Inductive learning - the model learns to generate embeddings for unseen nodes, making it suitable for federated learning scenarios
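
To make the inductive idea concrete, here is a minimal, framework-free sketch of the mean-aggregation step at the heart of GraphSAGE. It is an illustration only, not the PyTorch Geometric implementation: the learned weight matrices and the nonlinearity are left out, and the function name `mean_aggregate` and the toy graph are hypothetical.

```python
from typing import Dict, List

Vector = List[float]


def mean_aggregate(
    features: Dict[int, Vector], neighbors: Dict[int, List[int]], node: int
) -> Vector:
    """Concatenate a node's own features with the mean of its neighbors' features."""
    own = features[node]
    nbrs = neighbors.get(node, [])
    if nbrs:
        # Element-wise mean over the neighbor feature vectors.
        mean = [sum(features[n][i] for n in nbrs) / len(nbrs) for i in range(len(own))]
    else:
        # Isolated node: fall back to a zero vector of the same width.
        mean = [0.0] * len(own)
    return own + mean


# Toy graph (hypothetical data): node 3 plays the role of a node unseen in training.
features = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0], 3: [4.0, 2.0]}
neighbors = {0: [1, 2], 1: [0], 2: [0], 3: [0, 2]}

seen = mean_aggregate(features, neighbors, 0)
unseen = mean_aggregate(features, neighbors, 3)
```

The same function handles node 3 without any per-node parameters, because the output depends only on features and the local neighborhood. That property is what allows a trained model to embed nodes, or whole graphs, it has never seen.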

### Model Architecture
- **Protein Classification**: 
  - Unsupervised learning approach
  - Based on [PyG's unsupervised PPI example](https://github.com/pyg-team/pytorch_geometric/blob/master/examples/graph_sage_unsup_ppi.py)
  - Multi-layer GraphSAGE with mean aggregation
  
- **Financial Transaction Classification**:
  - Supervised learning approach
  - Custom SAGE model defined in `finance/model.py`
  - Uses node labels with supervised classification loss

Because GraphSAGE is inductive, its learned parameters do not depend on the node set or structure of any particular graph. Every client therefore trains a model with identical parameter shapes, which lets the server use standard FedAvg aggregation.
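
The aggregation step can be sketched in plain Python as a sample-weighted average of client parameters. This is a simplified illustration rather than NVFlare's implementation: the function `fed_avg`, the parameter names, and the sample counts are all hypothetical, and real model weights would be tensors rather than lists.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]


def fed_avg(client_weights: List[Weights], num_samples: List[int]) -> Weights:
    """Average each parameter across clients, weighted by local sample count."""
    total = sum(num_samples)
    averaged: Weights = {}
    for name in client_weights[0]:
        length = len(client_weights[0][name])
        averaged[name] = [
            sum(w[name][i] * n for w, n in zip(client_weights, num_samples)) / total
            for i in range(length)
        ]
    return averaged


# Two hypothetical clients with the same parameter names and shapes.
client_1 = {"conv1.weight": [1.0, 2.0], "conv1.bias": [0.0]}
client_2 = {"conv1.weight": [3.0, 6.0], "conv1.bias": [4.0]}

global_weights = fed_avg([client_1, client_2], num_samples=[100, 300])
```

The element-wise average is only well defined because both clients expose the same parameter names and shapes, even though their local graphs differ.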

## 1. Install Requirements

In [None]:
%pip install -r requirements.txt

This example also needs PyTorch Geometric's optional extension packages. Follow the [installation guide](https://pytorch-geometric.readthedocs.io/en/latest/install/installation.html) and pick the wheel index that matches your installed PyTorch and CUDA versions. The command below targets PyTorch 2.1.0 on CPU:

In [None]:
%pip install pyg_lib torch_scatter torch_sparse torch_cluster torch_spline_conv -f https://data.pyg.org/whl/torch-2.1.0+cpu.html

## 2. Data Preparation
This example uses two datasets:
- For Protein Classification, the PPI dataset is downloaded automatically through `torch_geometric.datasets.PPI`.
- For Financial Transaction Classification, first download the [Elliptic++](https://github.com/git-disl/EllipticPlusPlus) dataset into the `/tmp/nvflare/datasets/elliptic_pp` folder. This example uses the following three files:
    - `txs_classes.csv`: transaction id and its class (licit or illicit)
    - `txs_edgelist.csv`: connections for transaction ids 
    - `txs_features.csv`: transaction id and its features
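
As a rough sketch of how these files fit together, the snippet below maps transaction ids to contiguous node indices and converts the edge list accordingly. The column names (`txId`, `class`, `txId1`, `txId2`), the sample rows, and the helper `load_graph` are assumptions made for illustration; check them against the headers of the downloaded files before relying on them.

```python
import csv
import io

# Tiny in-memory stand-ins for the real files (hypothetical rows and column names).
CLASSES_CSV = "txId,class\n10,2\n11,1\n12,3\n"
EDGELIST_CSV = "txId1,txId2\n10,11\n11,12\n"


def load_graph(classes_file, edgelist_file):
    """Map transaction ids to contiguous node indices and build an edge list."""
    rows = list(csv.DictReader(classes_file))
    node_index = {row["txId"]: i for i, row in enumerate(rows)}
    labels = [int(row["class"]) for row in rows]
    edges = [
        (node_index[row["txId1"]], node_index[row["txId2"]])
        for row in csv.DictReader(edgelist_file)
        # Skip edges whose endpoints have no entry in the classes file.
        if row["txId1"] in node_index and row["txId2"] in node_index
    ]
    return node_index, labels, edges


node_index, labels, edges = load_graph(io.StringIO(CLASSES_CSV), io.StringIO(EDGELIST_CSV))
```

For the real dataset, the `io.StringIO` objects would be replaced by files opened with `open(path, newline="")`. The features file could be joined in the same way, keyed on the transaction id.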

## 3. Local Experiments
To provide baselines for the federated results, we first train locally on each client's data and on the whole dataset. We simulate 2 clients with a uniform data split: `--client_id 1` and `--client_id 2` select the two client partitions, and `--client_id 0` selects the whole dataset. Each run trains for 70 epochs by default, so these experiments take a while to finish.
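
The `client_id` convention can be sketched as follows. This is a hypothetical helper, not the code in `local_train.py`, whose actual splitting logic may differ (for example, it could shuffle before splitting); it only shows how id 0 maps to the whole dataset and ids 1 and 2 map to equal-sized partitions.

```python
from typing import List


def split_for_client(items: List[int], client_id: int, num_clients: int = 2) -> List[int]:
    """Return everything for client_id 0, else one contiguous slice of roughly equal size."""
    if client_id == 0:
        return list(items)
    if not 1 <= client_id <= num_clients:
        raise ValueError(f"client_id must be in 0..{num_clients}")
    size = len(items) // num_clients
    start = (client_id - 1) * size
    # The last client also takes any remainder left by integer division.
    end = len(items) if client_id == num_clients else start + size
    return items[start:end]


graph_ids = list(range(7))  # hypothetical ids of training graphs or nodes
whole = split_for_client(graph_ids, 0)
first = split_for_client(graph_ids, 1)
second = split_for_client(graph_ids, 2)
```

The two client partitions are disjoint and together cover the whole dataset, so the `client_id 0` run serves as the centralized baseline for both.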

### Protein Classification
Navigate to the protein directory and run local experiments:

In [None]:
! cd protein && python local_train.py --client_id 0
! cd protein && python local_train.py --client_id 1
! cd protein && python local_train.py --client_id 2 

### Financial Transaction Classification
Navigate to the finance directory and run local experiments:

In [None]:
! cd finance && python local_train.py --client_id 0
! cd finance && python local_train.py --client_id 1
! cd finance && python local_train.py --client_id 2 

## 4. Create and Run Federated Learning Jobs using FedAvgRecipe
We run the FL experiments in NVFlare's FL simulator, using the recipe-based approach with a separate `job.py` for each task.

Each task directory contains:
- `job.py`: Creates and runs the FL job
- `client.py`: Client-side training logic
- `local_train.py`: Baseline local training

We run 7 FL rounds with 10 local epochs each, so every client trains for 7 × 10 = 70 epochs in total, matching the default 70-epoch budget of the local experiments.
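
The budget arithmetic can be checked with a small sketch. The flag names mirror those passed to `job.py` in the commands of this section, but the parser, its defaults, and the helper `total_local_epochs` are hypothetical and are not taken from the actual `job.py`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser with the training-budget flags used in this section's commands."""
    parser = argparse.ArgumentParser()
    # Defaults here are assumptions for illustration only.
    parser.add_argument("--num_clients", type=int, default=2)
    parser.add_argument("--num_rounds", type=int, default=7)
    parser.add_argument("--epochs_per_round", type=int, default=10)
    return parser


def total_local_epochs(args: argparse.Namespace) -> int:
    """Epochs each client trains in total across all FL rounds."""
    return args.num_rounds * args.epochs_per_round


args = build_parser().parse_args(["--num_rounds", "7", "--epochs_per_round", "10"])
LOCAL_BASELINE_EPOCHS = 70  # default epoch count of the local experiments
assert total_local_epochs(args) == LOCAL_BASELINE_EPOCHS
```

Keeping the product of rounds and local epochs equal to the local baseline makes the TensorBoard curves comparable on the same epoch axis.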

### Protein Classification
Navigate to the protein directory and run the FL job:

In [None]:
! cd protein && python job.py \
  --num_clients 2 \
  --num_rounds 7 \
  --epochs_per_round 10 \
  --data_path /tmp/nvflare/datasets/ppi \
  --threads 2

### Financial Transaction Classification
Navigate to the finance directory and run the FL job:

In [None]:
! cd finance && python job.py \
  --num_clients 2 \
  --num_rounds 7 \
  --epochs_per_round 10 \
  --data_path /tmp/nvflare/datasets/elliptic_pp \
  --threads 2

## 5. View Results
Results from both the local and federated experiments can be visualized in TensorBoard.

### Color Scheme
Local training:
- Black curve: whole dataset
- Green curve: client 1
- Purple curve: client 2

Federated learning: 
- Blue curve: client 1
- Red curve: client 2

In [None]:
%load_ext tensorboard
%tensorboard --logdir /tmp/nvflare/gnn  