# End-to-end credit card fraud detection with Federated XGBoost

This notebook shows how to take an existing tabular credit card dataset, enrich and pre-process it on a single site (as you would with a centralized dataset), and then convert that centralized process into federated ETL steps. It then builds a federated XGBoost job, for which the only component the user needs to define is the XGBoost data loader.

## Install requirements


In [None]:
%pip install -r requirements.txt

## Step 1: Data Preparation 
First, we prepare the data by adding random transactional information to the base creditcard dataset, as shown in the following notebook:

* [prepare data](./notebooks/prepare_data.ipynb)

## Step 2: Feature Analysis

For this stage, we would like to analyze the data, understand the features, and derive (and encode) secondary features that can be more useful for building the model.

Towards this goal, there are two options:
1. **Feature Enrichment**: This process involves adding new features based on the existing data. For example, we can calculate the average transaction amount for each currency and add this as a new feature. 
2. **Feature Encoding**: This process involves encoding the current features and transforming them to embedding space via machine learning models. This model can be either pre-trained, or trained with the candidate dataset.
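As a minimal sketch of the enrichment idea in option 1, the snippet below adds the per-currency average amount as a new column with pandas. The column names `Currency` and `Amount`, the toy values, and the helper name `add_average_amount_by_currency` are assumptions for illustration; the actual enrichment logic lives in the notebooks and scripts linked in Step 2.1.

```python
import pandas as pd

# Toy transactions; column names and values are assumptions for illustration.
df = pd.DataFrame(
    {
        "Currency": ["USD", "USD", "EUR", "EUR", "EUR"],
        "Amount": [10.0, 30.0, 5.0, 10.0, 15.0],
    }
)


def add_average_amount_by_currency(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with the mean Amount of each currency as a new column."""
    enriched = df.copy()
    # transform("mean") broadcasts each group's mean back onto its rows.
    enriched["average_amount_by_currency"] = (
        enriched.groupby("Currency")["Amount"].transform("mean")
    )
    return enriched


enriched = add_average_amount_by_currency(df)
print(enriched)
```

Using `transform` rather than `agg` keeps one row per transaction, so the new feature can be used directly alongside the original columns.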

Because the only two numerical features in the dataset are "Amount" and "Time", we perform feature enrichment first. Optionally, we can also perform feature encoding. In this example, we use a graph neural network (GNN): we train the GNN model in a federated, unsupervised fashion and then use the model to encode the features for all sites. 

### Step 2.1: Rule-based Feature Enrichment

#### Single-site Enrichment and Additional Processing
The feature enrichment step is illustrated in detail using one site as an example: 

* [feature_enrichments with-one-site](./notebooks/feature_enrichment.ipynb)

Similarly, we examine the additional pre-processing step using one site: 

* [pre-processing with one-site](./notebooks/pre_process.ipynb)


#### Federated Job to Perform on All Sites
To run the feature enrichment and pre-processing jobs on every site, as in the steps above, we wrote client-side federated ETL scripts based on the single-site implementations.

* [enrichment script](./nvflare/enrich.py)
* [pre-processing script](./nvflare/pre_process.py) 

Then we define server-side job scripts to trigger and coordinate the client-side scripts on each site: 

* [enrich_job.py](./nvflare/enrich_job.py)
* [pre-processing-job](./nvflare/pre_process_job.py)

An excerpt from the enrichment job script:
```
    # Define the enrich_ctrl workflow and send to server
    enrich_ctrl = ETLController(task_name="enrich")
    job.to(enrich_ctrl, "server", id="enrich")

    # Add clients
    for site_name in site_names:
        executor = ScriptExecutor(task_script_path=task_script_path, task_script_args=task_script_args)
        job.to(executor, site_name, tasks=["enrich"], gpu=0)
```

### (Optional) Step 2.2: GNN-based Feature Encoding
Starting from the raw features, optionally combined with the derived features from **Step 2.1**, we can use machine learning models to encode the features. 
In this example, we use a federated GNN to learn and generate the feature embeddings.

First, we construct a graph based on the transaction data. Each node represents a transaction, and the edges represent the relationships between transactions. We then use the GNN to learn the embeddings of the nodes, which represent the transaction features.
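The text above does not say which relationships become edges, so the following is only a sketch under an assumed rule: two transactions are linked when they share the same sender. The rule, the `(transaction id, sender)` layout, and the helper name `build_edges` are all hypothetical; the linked graph construction notebook defines the real logic.

```python
from collections import defaultdict
from itertools import combinations

# Toy transactions as (transaction id, sender); the layout is hypothetical.
transactions = [(0, "A"), (1, "B"), (2, "A"), (3, "A"), (4, "B")]


def build_edges(transactions):
    """Link every pair of transactions that share a sender (illustrative rule)."""
    by_sender = defaultdict(list)
    for tx_id, sender in transactions:
        by_sender[sender].append(tx_id)

    edges = []
    for tx_ids in by_sender.values():
        # Each unordered pair within a group becomes one undirected edge.
        edges.extend(combinations(sorted(tx_ids), 2))
    return sorted(edges)


edges = build_edges(transactions)
print(edges)
```

Each node is a transaction id and each tuple is an undirected edge; an edge list of this shape is a common input format for GNN libraries.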

#### Single-site operation example: graph construction
The graph construction step is illustrated in detail using one site as an example:

* [graph_construction with one-site](./notebooks/graph_construct.ipynb)

The GNN training and encoding step is illustrated in detail using one site as an example:

* [gnn_training_encoding with one-site](./notebooks/gnn_train_encode.ipynb)

#### Federated Job to Perform on All Sites
To run the graph construction and GNN training and encoding jobs on every site, as with the enrichment and pre-processing steps, we wrote client-side federated ETL scripts based on the single-site implementations.

* [graph_construction script](./nvflare/graph_construct.py)
* [gnn_train_encode script](./nvflare/gnn_train_encode.py)

Similarly, we define server-side job scripts to trigger and coordinate the client-side scripts on each site: 

* [graph_construction_job.py](./nvflare/graph_construct_job.py)
* [gnn_train_encode_job.py](./nvflare/gnn_train_encode_job.py)

The resulting GNN encodings are merged with the normalized data to enrich the feature set.
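A merge of this kind can be sketched with pandas as a key-based join. The key column `tx_id`, the embedding column names `emb_0` and `emb_1`, and the toy values are assumptions for illustration; the actual merge is performed by `./utils/merge_feat.py`.

```python
import pandas as pd

# Hypothetical normalized features and GNN embeddings sharing a "tx_id" key.
normalized = pd.DataFrame(
    {"tx_id": [1, 2, 3], "Amount": [0.1, 0.5, 0.9], "Class": [0, 1, 0]}
)
embeddings = pd.DataFrame(
    {"tx_id": [3, 1, 2], "emb_0": [0.3, 0.1, 0.2], "emb_1": [-0.3, -0.1, -0.2]}
)

# validate="one_to_one" raises if a transaction id is duplicated on either side,
# which guards against silently multiplying rows during the join.
merged = normalized.merge(embeddings, on="tx_id", how="inner", validate="one_to_one")
print(merged)
```

An inner join keeps only transactions that have both normalized features and an embedding, so the merged table contains no missing values introduced by the join.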

## Step 3: Federated XGBoost 

Now that the data is ready, either as enriched and normalized features or as GNN feature embeddings, we can fit XGBoost on it. NVIDIA FLARE already provides the XGBoost Controller and Executor for the job. All we need to provide is the data loader that feeds the data to XGBoost.

To specify the controller and executor, we need to define a Job. You can find the job construction in

* [xgb_job.py](./nvflare/xgb_job.py)
* [xgb_job_embed.py](./nvflare/xgb_job_embed.py)

Below is the main part of the code:

```
    controller = XGBFedController(
        num_rounds=num_rounds,
        training_mode="horizontal",
        xgb_params=xgb_params,
        xgb_options={"early_stopping_rounds": early_stopping_rounds},
    )
    job.to(controller, "server")

    # Add clients
    for site_name in site_names:
        executor = FedXGBHistogramExecutor(data_loader_id="data_loader")
        job.to(executor, site_name, gpu=0)
        data_loader = CreditCardDataLoader(root_dir=root_dir, file_postfix=file_postfix)
        job.to(data_loader, site_name, id="data_loader")
```
> `file_postfix` defaults to "_normalized.csv", so we load the CSV files normalized by the pre-processing step. The files are:
> * train__normalized.csv
> * test__normalized.csv
  

Notice that we use a [```CreditCardDataLoader```](./nvflare/xgb_data_loader.py); this is an XGBDataLoader subclass we defined to load the credit card dataset. 

```
import os
from typing import Optional, Tuple

import pandas as pd
import xgboost as xgb
from xgboost.core import DataSplitMode

from nvflare.app_opt.xgboost.data_loader import XGBDataLoader


class CreditCardDataLoader(XGBDataLoader):
    def __init__(self, root_dir: str, file_postfix: str):
        self.dataset_names = ["train", "test"]
        self.base_file_names = {}
        self.root_dir = root_dir
        self.file_postfix = file_postfix
        for name in self.dataset_names:
            self.base_file_names[name] = name + file_postfix
        self.numerical_columns = [
            "Timestamp",
            "Amount",
            "trans_volume",
            "total_amount",
            "average_amount",
            "hist_trans_volume",
            "hist_total_amount",
            "hist_average_amount",
            "x2_y1",
            "x3_y2",
        ]

    def load_data(self, client_id: str, split_mode: int) -> Tuple[xgb.DMatrix, xgb.DMatrix]:
        data = {}
        for ds_name in self.dataset_names:
            print("\nloading for site = ", client_id, f"{ds_name} dataset \n")
            file_name = os.path.join(self.root_dir, client_id, self.base_file_names[ds_name])
            df = pd.read_csv(file_name)
            data_num = len(df)  # number of rows in this split

            # split to feature and label
            y = df["Class"]
            x = df[self.numerical_columns]
            data[ds_name] = (x, y, data_num)


        # training
        x_train, y_train, total_train_data_num = data["train"]
        data_split_mode = DataSplitMode(split_mode)
        dmat_train = xgb.DMatrix(x_train, label=y_train, data_split_mode=data_split_mode)

        # validation
        x_valid, y_valid, total_valid_data_num = data["test"]
        dmat_valid = xgb.DMatrix(x_valid, label=y_valid, data_split_mode=data_split_mode)

        return dmat_train, dmat_valid
```
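When writing your own loader, it can help to validate the CSV schema before building the `DMatrix` objects. The helper below is a sketch of the feature/label split used in `load_data`, with an added check for missing columns. The function name `split_features_and_label` is hypothetical and not part of NVIDIA FLARE or XGBoost.

```python
from typing import List, Tuple

import pandas as pd


def split_features_and_label(
    df: pd.DataFrame, feature_columns: List[str], label_column: str = "Class"
) -> Tuple[pd.DataFrame, pd.Series]:
    """Split df into (features, label), failing early if a column is missing."""
    missing = [c for c in [*feature_columns, label_column] if c not in df.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return df[feature_columns], df[label_column]


# Toy frame standing in for one site's normalized CSV.
df = pd.DataFrame({"Amount": [1.0, 2.0], "Timestamp": [10, 20], "Class": [0, 1]})
x, y = split_features_and_label(df, ["Timestamp", "Amount"])
print(x.shape, y.shape)
```

Failing with an explicit list of missing columns is usually easier to debug on a remote site than a `KeyError` raised deep inside the training job.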

We are now ready to run all the code.

## Run All the Jobs End-to-end
Here we run each job in sequence. In a real-world use case:

* data preparation is not needed, as you already have the data
* the feature enrichment / encoding scripts need to be defined based on your own technique
* for the XGBoost job, you will need to write your own data loader 

Note: All Sender SICs are considered clients. They are:
* 'ZHSZUS33_Bank_1'
* 'SHSHKHH1_Bank_2'
* 'YXRXGB22_Bank_3'
* 'WPUWDEFF_Bank_4'
* 'YMNYFRPP_Bank_5'
* 'FBSFCHZH_Bank_6'
* 'YSYCESMM_Bank_7'
* 'ZNZZAU3M_Bank_8'
* 'HCBHSGSG_Bank_9'
* 'XITXUS33_Bank_10' 

There are 10 banks in total.

### Prepare Data

In [None]:
! python3 ./utils/prepare_data.py -i ./creditcard.csv -o /tmp/nvflare/xgb/credit_card

### Enrich data

In [None]:
%cd nvflare
! python3 enrich_job.py -c 'ZNZZAU3M_Bank_8' 'SHSHKHH1_Bank_2' 'FBSFCHZH_Bank_6' 'YMNYFRPP_Bank_5' 'WPUWDEFF_Bank_4' 'YXRXGB22_Bank_3' 'XITXUS33_Bank_10' 'YSYCESMM_Bank_7' 'ZHSZUS33_Bank_1' 'HCBHSGSG_Bank_9' -p enrich.py  -a "-i /tmp/nvflare/xgb/credit_card/ -o /tmp/nvflare/xgb/credit_card/"
%cd ..

### Pre-Process Data

In [None]:
%cd nvflare
! python3 pre_process_job.py -c 'YSYCESMM_Bank_7' 'FBSFCHZH_Bank_6' 'YXRXGB22_Bank_3' 'XITXUS33_Bank_10' 'HCBHSGSG_Bank_9' 'YMNYFRPP_Bank_5' 'ZHSZUS33_Bank_1' 'ZNZZAU3M_Bank_8' 'SHSHKHH1_Bank_2' 'WPUWDEFF_Bank_4' -p pre_process.py -a "-i /tmp/nvflare/xgb/credit_card  -o /tmp/nvflare/xgb/credit_card/"
%cd ..

### Construct Graph

In [None]:
%cd nvflare
! python graph_construct_job.py -c 'YSYCESMM_Bank_7' 'FBSFCHZH_Bank_6' 'YXRXGB22_Bank_3' 'XITXUS33_Bank_10' 'HCBHSGSG_Bank_9' 'YMNYFRPP_Bank_5' 'ZHSZUS33_Bank_1' 'ZNZZAU3M_Bank_8' 'SHSHKHH1_Bank_2' 'WPUWDEFF_Bank_4' -p graph_construct.py -a "-i /tmp/nvflare/xgb/credit_card  -o /tmp/nvflare/xgb/credit_card/"
%cd ..

### GNN Training and Encoding

In [None]:
%cd nvflare
! python gnn_train_encode_job.py -c 'YSYCESMM_Bank_7' 'FBSFCHZH_Bank_6' 'YXRXGB22_Bank_3' 'XITXUS33_Bank_10' 'HCBHSGSG_Bank_9' 'YMNYFRPP_Bank_5' 'ZHSZUS33_Bank_1' 'ZNZZAU3M_Bank_8' 'SHSHKHH1_Bank_2' 'WPUWDEFF_Bank_4' -p gnn_train_encode.py -a "-i /tmp/nvflare/xgb/credit_card  -o /tmp/nvflare/xgb/credit_card/"
%cd ..

### GNN Encoding Merge

In [None]:
! python3 ./utils/merge_feat.py

### Run XGBoost Job
#### Without GNN embeddings

In [None]:
%cd nvflare
! python3 xgb_job.py -c 'YSYCESMM_Bank_7' 'FBSFCHZH_Bank_6' 'YXRXGB22_Bank_3' 'XITXUS33_Bank_10' 'HCBHSGSG_Bank_9' 'YMNYFRPP_Bank_5' 'ZHSZUS33_Bank_1' 'ZNZZAU3M_Bank_8' 'SHSHKHH1_Bank_2' 'WPUWDEFF_Bank_4' -i /tmp/nvflare/xgb/credit_card  -w /tmp/nvflare/workspace/xgb/credit_card/
%cd ..

#### With GNN embeddings

In [None]:
%cd nvflare
! python xgb_job_embed.py -c 'YSYCESMM_Bank_7' 'FBSFCHZH_Bank_6' 'YXRXGB22_Bank_3' 'XITXUS33_Bank_10' 'HCBHSGSG_Bank_9' 'YMNYFRPP_Bank_5' 'ZHSZUS33_Bank_1' 'ZNZZAU3M_Bank_8' 'SHSHKHH1_Bank_2' 'WPUWDEFF_Bank_4' -i /tmp/nvflare/xgb/credit_card  -w /tmp/nvflare/workspace/xgb/credit_card_embed
%cd ..

## Prepare Job for POC and Production

With the job running well in the simulator, we are ready to run it in POC mode, which simulates the deployment on localhost, or to deploy it to production. 

All we need is the job definition; we can use the `job.export_job()` method to generate the job configuration and export it to a given directory. For example, in xgb_job.py, we have the following:

```
    if work_dir:
        print("work_dir=", work_dir)
        job.export_job(work_dir)

    if not args.config_only:
        job.simulator_run(work_dir)
```

Let's try this out and then look at the directory. We use the ```tree``` command if you have it; otherwise, simply use ```ls -al```.

In [None]:
%cd nvflare
! python xgb_job.py -co -w /tmp/nvflare/workspace/xgb/credit_card/config -c 'YSYCESMM_Bank_7' 'FBSFCHZH_Bank_6' 'YXRXGB22_Bank_3' 'XITXUS33_Bank_10' 'HCBHSGSG_Bank_9' 'YMNYFRPP_Bank_5' 'ZHSZUS33_Bank_1' 'ZNZZAU3M_Bank_8' 'SHSHKHH1_Bank_2' 'WPUWDEFF_Bank_4'  -i /tmp/nvflare/xgb/credit_card  
%cd ..

In [None]:
! tree /tmp/nvflare/workspace/xgb/credit_card/config
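If `tree` is not installed, a few lines of standard-library Python give a comparable listing. This is a convenience sketch only; the helper name `list_tree` is hypothetical, and the path is the config directory exported in the previous cell.

```python
import os


def list_tree(root: str) -> list:
    """Return the paths of all files under root, relative to root, sorted."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            paths.append(os.path.relpath(full_path, root))
    return sorted(paths)


# Directory written by the exported job configuration in the previous cell.
config_dir = "/tmp/nvflare/workspace/xgb/credit_card/config"
for path in list_tree(config_dir):
    print(path)
```

If the directory does not exist yet, `os.walk` yields nothing and the loop prints no lines, so an empty listing usually means the export step has not been run.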

In [None]:
!cat /tmp/nvflare/workspace/xgb/credit_card/config/xgb_job/meta.json

Now that we have the job definition, you can run it either in POC mode or in a production setup. 

* Set up POC:
```
    nvflare poc prepare -c <list of clients>
    nvflare poc start -ex admin@nvidia.com
```
  
* Submit the job using the NVFLARE console. From a different terminal, start the console:

   ```
   nvflare poc start -p admin@nvidia.com
   ```
   then use the submit job command.

* Use the nvflare job submit command to submit the job.

* Use the NVFLARE API to submit the job.

The process is exactly the same for production. Please see the documentation and tutorial examples at https://nvidia.github.io/NVFlare/


    
    
    
    




