# End-to-end credit card fraud detection with Federated XGBoost

This notebook shows how to take an existing tabular credit card dataset, enrich and pre-process it on a single site (as you would with a centralized dataset), and then convert that centralized process into federated ETL steps. We then construct a federated XGBoost job; the only thing the user needs to define is the XGBoost data loader.

## Install requirements




In [None]:
%pip install -r requirements.txt

## Prepare Data

* [prepare data](./prepare_data.ipynb)


## Feature Enrichment

We can first examine how feature enrichment works using just one site.

* [feature enrichment with one site](./feature_enrichment.ipynb)

To run the same feature enrichment steps on each site, we wrote an enrichment ETL job.

[enrichment script](./enrich.py)

Define a job that triggers the enrichment script on each site:

[enrich_job.py](./enrich_job.py)

```
    # Define the enrich_ctrl workflow and send to server
    enrich_ctrl = ETLController(task_name="enrich")
    job.to(enrich_ctrl, "server", id="enrich")

    # Add clients
    for site_name in site_names:
        executor = ScriptExecutor(task_script_path=task_script_path, task_script_args=task_script_args)
        job.to(executor, site_name, tasks=["enrich"], gpu=0)
```
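
For readers who have not opened `enrich.py`, the kind of per-site aggregation such a script might perform can be sketched as follows. This is a minimal illustration using only the standard library, not the actual enrichment logic: the function name `enrich_rows`, the grouping column `Currency`, and the sample rows are assumptions made for this example. The output feature names `trans_volume`, `total_amount`, and `average_amount` are borrowed from the column list used by the data loader in this notebook.

```python
from collections import defaultdict


def enrich_rows(rows, key="Currency", amount_field="Amount"):
    """Attach per-group aggregate features to each transaction row.

    `key` and `amount_field` are hypothetical column names chosen for
    illustration; the real enrich.py may group on different fields.
    """
    # First pass: accumulate the count and the sum of amounts per group.
    counts = defaultdict(int)
    totals = defaultdict(float)
    for row in rows:
        counts[row[key]] += 1
        totals[row[key]] += float(row[amount_field])

    # Second pass: copy each row and add the aggregates of its group.
    enriched = []
    for row in rows:
        group = row[key]
        new_row = dict(row)
        new_row["trans_volume"] = counts[group]
        new_row["total_amount"] = totals[group]
        new_row["average_amount"] = totals[group] / counts[group]
        enriched.append(new_row)
    return enriched


# Made-up sample rows, only to show the shape of the result.
sample = [
    {"Currency": "USD", "Amount": "10.0"},
    {"Currency": "USD", "Amount": "30.0"},
    {"Currency": "EUR", "Amount": "5.0"},
]
for row in enrich_rows(sample):
    print(row)
```

Because each site runs the script on its own data only, aggregates like these are local to that site and no raw transactions leave it.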




## Pre-Processing 

We first examine the pre-processing steps using only one site (one client).

* [pre-processing with one-site](./pre_process.ipynb)

Based on the one-site steps, we create the pre-processing script

* [pre-processing script](./pre_process.py) 

Then we define the pre-processing job to coordinate pre-processing across all sites

* [pre-processing-job](./pre_process_job.py)

```
    pre_process_ctrl = ETLController(task_name="pre_process")
    job.to(pre_process_ctrl, "server", id="pre_process")

    # Add clients
    for site_name in site_names:
        executor = ScriptExecutor(task_script_path=task_script_path, task_script_args=task_script_args)
        job.to(executor, site_name, tasks=["pre_process"], gpu=0)

```
Similar to the enrichment job, we simply issue a task that triggers the pre-processing script on each site.
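
As a rough idea of what a pre-processing script does, the sketch below shows min-max normalization of a numerical column and integer encoding of a categorical column. It is an assumption-laden illustration, not the contents of `pre_process.py`: the helper names `min_max_normalize` and `encode_categories`, the choice of min-max scaling, and the sample values are all hypothetical.

```python
def min_max_normalize(values):
    """Scale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def encode_categories(values):
    """Map each distinct category to an integer code, in sorted order."""
    codes = {cat: idx for idx, cat in enumerate(sorted(set(values)))}
    return [codes[v] for v in values], codes


# Made-up sample columns.
amounts = [10.0, 30.0, 20.0]
currencies = ["USD", "EUR", "USD"]

print(min_max_normalize(amounts))     # [0.0, 1.0, 0.5]
print(encode_categories(currencies))  # ([1, 0, 1], {'EUR': 0, 'USD': 1})
```

Note that statistics computed this way are per site. Whether each site should normalize with its own minimum and maximum or with shared values is a design decision for your own pre-processing script.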


## Define XGBoost Job 

Now that the data is ready, we can fit it with XGBoost. NVIDIA FLARE already provides an XGBoost controller and executor for this job; all we need to supply is the data loader.
To specify the controller and executor, we need to define a job. The job construction can be found in

* [xgb_job.py](./xgb_job.py). 

Here is the main part of the code

```
    controller = XGBFedController(
        num_rounds=num_rounds,
        training_mode="horizontal",
        xgb_params=xgb_params,
        xgb_options={"early_stopping_rounds": early_stopping_rounds},
    )
    job.to(controller, "server")

    # Add clients
    for site_name in site_names:
        executor = FedXGBHistogramExecutor(data_loader_id="data_loader")
        job.to(executor, site_name, gpu=0)
        data_loader = CreditCardDataLoader(root_dir=root_dir, file_postfix=file_postfix)
        job.to(data_loader, site_name, id="data_loader")
```
> file_postfix
  file_postfix defaults to "_normalized.csv"; we load the CSV files produced by the pre-processing step.
  The files are
  * train_normalized.csv
  * test_normalized.csv
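
To make the naming concrete, the following sketch reproduces how the data loader in this notebook combines `root_dir`, the client ID, the dataset name, and `file_postfix` into a file path. The helper name `resolve_dataset_paths` is hypothetical and exists only for this illustration; the path expression mirrors the `os.path.join(self.root_dir, client_id, ...)` call in `CreditCardDataLoader`.

```python
import os


def resolve_dataset_paths(root_dir, client_id, file_postfix="_normalized.csv"):
    """Return the train/test file paths the data loader would read for one site."""
    return {
        name: os.path.join(root_dir, client_id, name + file_postfix)
        for name in ("train", "test")
    }


paths = resolve_dataset_paths("/tmp/nvflare/xgb/credit_card", "ZHSZUS33_Bank_1")
for name, path in paths.items():
    print(name, "->", path)
```

Each site therefore needs its own sub-directory under `root_dir`, named after the site, containing both files.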
  

Notice that we define a [```CreditCardDataLoader```](./xgb_data_loader.py); this is the XGBDataLoader subclass we wrote to load the credit card dataset.

```


import os
from typing import Tuple

import pandas as pd
import xgboost as xgb
from xgboost.core import DataSplitMode

from nvflare.app_opt.xgboost.data_loader import XGBDataLoader


class CreditCardDataLoader(XGBDataLoader):
    def __init__(self, root_dir: str, file_postfix: str):
        self.dataset_names = ["train", "test"]
        self.base_file_names = {}
        self.root_dir = root_dir
        self.file_postfix = file_postfix
        for name in self.dataset_names:
            self.base_file_names[name] = name + file_postfix
        self.numerical_columns = [
            "Timestamp",
            "Amount",
            "trans_volume",
            "total_amount",
            "average_amount",
            "hist_trans_volume",
            "hist_total_amount",
            "hist_average_amount",
            "x2_y1",
            "x3_y2",
        ]

    def load_data(self, client_id: str, split_mode: int) -> Tuple[xgb.DMatrix, xgb.DMatrix]:
        data = {}
        for ds_name in self.dataset_names:
            print("\nloading for site = ", client_id, f"{ds_name} dataset \n")
            file_name = os.path.join(self.root_dir, client_id, self.base_file_names[ds_name])
            df = pd.read_csv(file_name)
            data_num = len(df)  # number of rows in this dataset

            # split to feature and label
            y = df["Class"]
            x = df[self.numerical_columns]
            data[ds_name] = (x, y, data_num)


        # training
        x_train, y_train, total_train_data_num = data["train"]
        data_split_mode = DataSplitMode(split_mode)
        dmat_train = xgb.DMatrix(x_train, label=y_train, data_split_mode=data_split_mode)

        # validation
        x_valid, y_valid, total_valid_data_num = data["test"]
        dmat_valid = xgb.DMatrix(x_valid, label=y_valid, data_split_mode=data_split_mode)

        return dmat_train, dmat_valid


```
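
Since `load_data` indexes the DataFrame with `df[self.numerical_columns]` and `df["Class"]`, a CSV file that lacks any of those columns fails with a `KeyError` only once the job is running. A small pre-flight check can catch this earlier. The sketch below uses only the standard library; the helper name `missing_columns` and the demo file are assumptions for illustration and are not part of NVIDIA FLARE or of this example's scripts.

```python
import csv
import os
import tempfile


def missing_columns(csv_path, required_columns):
    """Return the required columns that are absent from the CSV header."""
    with open(csv_path, newline="") as f:
        header = next(csv.reader(f), [])  # an empty file yields an empty header
    return [col for col in required_columns if col not in header]


# Demo on a throwaway file with a deliberately incomplete header.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "train_normalized.csv")
    with open(demo_path, "w", newline="") as f:
        csv.writer(f).writerows([["Timestamp", "Amount", "Class"], [0, 1.5, 0]])
    print(missing_columns(demo_path, ["Timestamp", "Amount", "x2_y1", "Class"]))  # ['x2_y1']
```

In practice you would call the check once per site and per dataset with the loader's column list plus `"Class"`.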

We are now ready to run all the code.

## Run all the Jobs
Here we run each job in sequence. For a real-world use case:

* data preparation is not needed, as you already have the data
* the feature enrichment script needs to be defined based on your own enrichment rules
* the pre-processing script needs to be changed to define your own normalization and categorical encoding
* for the XGBoost job, you need to write your own data loader

Note: All Sender SICs are treated as clients. There are 10 banks in total:

* 'ZHSZUS33_Bank_1'
* 'SHSHKHH1_Bank_2'
* 'YXRXGB22_Bank_3'
* 'WPUWDEFF_Bank_4'
* 'YMNYFRPP_Bank_5'
* 'FBSFCHZH_Bank_6'
* 'YSYCESMM_Bank_7'
* 'ZNZZAU3M_Bank_8'
* 'HCBHSGSG_Bank_9'
* 'XITXUS33_Bank_10'

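
The job commands in the following cells each repeat all ten client names. If you prefer not to retype them, one option is to keep the names in a Python list and assemble the command line from it. The sketch below is only a convenience illustration: the helper `build_job_command` is hypothetical, while the script name and the `-c`, `-p`, and `-a` flags are copied from the enrichment command used in this notebook.

```python
import shlex

site_names = [
    "ZHSZUS33_Bank_1",
    "SHSHKHH1_Bank_2",
    "YXRXGB22_Bank_3",
    "WPUWDEFF_Bank_4",
    "YMNYFRPP_Bank_5",
    "FBSFCHZH_Bank_6",
    "YSYCESMM_Bank_7",
    "ZNZZAU3M_Bank_8",
    "HCBHSGSG_Bank_9",
    "XITXUS33_Bank_10",
]


def build_job_command(job_script, clients, extra_args):
    """Assemble the argument list for one of the job scripts."""
    return ["python", job_script, "-c", *clients, *extra_args]


enrich_cmd = build_job_command(
    "enrich_job.py",
    site_names,
    ["-p", "enrich.py", "-a", "-i /tmp/nvflare/xgb/credit_card/ -o /tmp/nvflare/xgb/credit_card/"],
)

# shlex.join quotes arguments that contain spaces, so the result can be pasted into a shell.
print(shlex.join(enrich_cmd))
```

The same list can be passed to `subprocess.run(enrich_cmd, check=True)` if you want to launch the job from Python instead of a shell cell.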
### Prepare Data

In [None]:
! python3 prepare_data.py -i ./creditcard.csv -o /tmp/nvflare/xgb/credit_card

### Enrich data

In [None]:
! python enrich_job.py -c 'ZNZZAU3M_Bank_8' 'SHSHKHH1_Bank_2' 'FBSFCHZH_Bank_6' 'YMNYFRPP_Bank_5' 'WPUWDEFF_Bank_4' 'YXRXGB22_Bank_3' 'XITXUS33_Bank_10' 'YSYCESMM_Bank_7' 'ZHSZUS33_Bank_1' 'HCBHSGSG_Bank_9' -p enrich.py  -a "-i /tmp/nvflare/xgb/credit_card/ -o /tmp/nvflare/xgb/credit_card/"


### Pre-Process Data

In [None]:
! python pre_process_job.py -c 'YSYCESMM_Bank_7' 'FBSFCHZH_Bank_6' 'YXRXGB22_Bank_3' 'XITXUS33_Bank_10' 'HCBHSGSG_Bank_9' 'YMNYFRPP_Bank_5' 'ZHSZUS33_Bank_1' 'ZNZZAU3M_Bank_8' 'SHSHKHH1_Bank_2' 'WPUWDEFF_Bank_4' -p pre_process.py -a "-i /tmp/nvflare/xgb/credit_card  -o /tmp/nvflare/xgb/credit_card/"

### Run XGBoost Job

In [None]:
! python xgb_job.py -c 'YSYCESMM_Bank_7' 'FBSFCHZH_Bank_6' 'YXRXGB22_Bank_3' 'XITXUS33_Bank_10' 'HCBHSGSG_Bank_9' 'YMNYFRPP_Bank_5' 'ZHSZUS33_Bank_1' 'ZNZZAU3M_Bank_8' 'SHSHKHH1_Bank_2' 'WPUWDEFF_Bank_4' -i /tmp/nvflare/xgb/credit_card  -w /tmp/nvflare/workspace/xgb/credit_card/

## Prepare Job for POC and Production

This works well with the job running in the simulator. We are now ready to run in POC mode, which simulates a deployment on localhost, or to deploy to production.

All we need is the job definition. We can use the job.export_job() method to generate the job configuration and export it to a given directory. For example, in xgb_job.py, we have the following:

```
    if work_dir:
        print("work_dir=", work_dir)
        job.export_job(work_dir)

    if not args.config_only:
        job.simulator_run(work_dir)
```

Let's try this out and then look at the directory. Use the ```tree``` command if you have it; otherwise, simply use ```ls -al```.

In [None]:
! python xgb_job.py -co -w /tmp/nvflare/workspace/xgb/credit_card/config -c 'YSYCESMM_Bank_7' 'FBSFCHZH_Bank_6' 'YXRXGB22_Bank_3' 'XITXUS33_Bank_10' 'HCBHSGSG_Bank_9' 'YMNYFRPP_Bank_5' 'ZHSZUS33_Bank_1' 'ZNZZAU3M_Bank_8' 'SHSHKHH1_Bank_2' 'WPUWDEFF_Bank_4'  -i /tmp/nvflare/xgb/credit_card  

In [None]:
! tree /tmp/nvflare/workspace/xgb/credit_card/config
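
If neither `tree` nor a shell is convenient, a similar listing can be produced from Python. This is a minimal sketch built on `os.walk`; the helper name `format_tree` is hypothetical, and the directory is the one passed to `-w` in the export command above.

```python
import os


def format_tree(root_dir):
    """Return an indented listing of root_dir, similar in spirit to `tree`."""
    root_dir = os.path.abspath(root_dir)
    lines = []
    for current_dir, dir_names, file_names in os.walk(root_dir):
        dir_names.sort()  # sorting in place makes the traversal order deterministic
        rel = os.path.relpath(current_dir, root_dir)
        depth = 0 if rel == "." else rel.count(os.sep) + 1
        lines.append("    " * depth + os.path.basename(current_dir) + "/")
        for file_name in sorted(file_names):
            lines.append("    " * (depth + 1) + file_name)
    return "\n".join(lines)


config_dir = "/tmp/nvflare/workspace/xgb/credit_card/config"
if os.path.isdir(config_dir):  # only print if the export step has been run
    print(format_tree(config_dir))
```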

In [None]:
!cat /tmp/nvflare/workspace/xgb/credit_card/config/xgb_job/meta.json

Now that we have the job definition, you can run it either in POC mode or in a production setup.

* set up POC
``` 
    nvflare poc prepare -c <list of clients>
    nvflare poc start -ex admin@nvidia.com  
```
  
* submit the job using the NVFLARE console 
        
    from a different terminal, start the console
   
   ```
   nvflare poc start -p admin@nvidia.com
   ```
   then use the submit job command
    
* use the nvflare job submit command to submit the job

* use the NVFLARE API to submit the job

The process is exactly the same for production. Please see the documentation and tutorial examples at https://nvidia.github.io/NVFlare/

