# Federated Horizontal XGBoost with Tree-based Collaboration 

This tutorial illustrates a federated horizontal xgboost learning on tabular data with bagging collaboration. 

Before do the training, we need to setup NVFLARE

## Setup NVFLARE

Follow [Getting Started](https://nvflare.readthedocs.io/en/main/getting_started.html) to set up a virtual environment and install NVFLARE.

You can also follow this [notebook](https://github.com/NVIDIA/NVFlare/blob/main/examples/nvflare_setup.ipynb) to get set up.

> Make sure you have installed nvflare from **terminal** 


## Install requirements
assuming the current directory is '/examples/hello-world/step-by-step/higgs/xgboost'

In [1]:
!pwd

/Users/ziyuex/NVFlare/nvflare_tab_exp/examples/hello-world/step-by-step/higgs/xgboost


In [2]:
%pip install -r requirements.txt

Collecting pandas (from -r requirements.txt (line 1))
  Downloading pandas-2.1.3-cp310-cp310-macosx_11_0_arm64.whl.metadata (18 kB)
Collecting scikit-learn (from -r requirements.txt (line 2))
  Downloading scikit_learn-1.3.2-cp310-cp310-macosx_12_0_arm64.whl.metadata (11 kB)
Collecting xgboost (from -r requirements.txt (line 3))
  Downloading xgboost-2.0.2-py3-none-macosx_12_0_arm64.whl.metadata (2.0 kB)
Collecting pytz>=2020.1 (from pandas->-r requirements.txt (line 1))
  Downloading pytz-2023.3.post1-py2.py3-none-any.whl.metadata (22 kB)
Collecting tzdata>=2022.1 (from pandas->-r requirements.txt (line 1))
  Downloading tzdata-2023.3-py2.py3-none-any.whl (341 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m341.8/341.8 kB[0m [31m8.6 MB/s[0m eta [36m0:00:00[0m00:01[0m
[?25hCollecting scipy>=1.5.0 (from scikit-learn->-r requirements.txt (line 2))
  Downloading scipy-1.11.4-cp310-cp310-macosx_12_0_arm64.whl.metadata (112 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━


## Prepare data
Please reference [prepare_higgs_data](../prepare_data.ipynb) notebooks. Pay attention to the current location. You need to switch "higgs" directory to run the data split.
    

Now we have our data prepared. we are ready to do the training

### Data Cleaning 

We noticed from time-to-time the Higgs dataset is making small changes which causing job to fail. so we need to do some clean up or skip certain rows. 
For example: certain floating number mistakenly add an alphabetical letter at some point of time. This may have already fixed by UCI. 


## XGBoost
This tutorial uses [XGBoost](https://github.com/dmlc/xgboost), which is an optimized distributed gradient boosting library.

### Federated XGBoost Model
Here we use tree-based collaboration for horizontal federated XGBoost.

Under this setting, individual trees are independently trained on each client's local data without aggregating the global sample gradient histogram information.
Trained trees are collected and passed to the server / other clients for bagging aggregation and further boosting rounds.

The XGBoost Booster api is leveraged to create in-memory Booster objects that persist across rounds to cache predictions from trees added in previous rounds and retain other data structures needed for training.

Let's look at the code see how we convert the local training script to the federated training script.

In [3]:
!pwd

/Users/ziyuex/NVFlare/nvflare_tab_exp/examples/hello-world/step-by-step/higgs/xgboost


In [4]:
!cat code/xgboost_fl.py

# Copyright (c) 2023, NVIDIA CORPORATION.  All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import argparse
import csv
import json
from typing import Dict, List, Tuple

import pandas as pd
import xgboost as xgb
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# (1) import nvflare client API
from nvflare import client as flare
from nvflare.app_opt.xgboost.tree_based.shareable_generat

The code is pretty much like the standard scikit-learn training script of `code/xgboost_local_iter.py`

#### load data

We first load the features from the header file: 
    
```
    site_name = flare.get_site_name()
    feature_data_path = f"{data_root_dir}/{site_name}_header.csv"
    features = load_features(feature_data_path)
    n_features = len(features) -1

    data_path = f"{data_root_dir}/{site_name}.csv"
    data = load_data(data_path=data_path, data_features=features, test_size=test_size, skip_rows=skip_rows)

```

then load the data from the main csv file, then transform the data and split the training and test data based on the test_size provided.  

```
    data = to_dataset_tuple(data)
    dataset = transform_data(data)
    x_train, y_train, train_size = dataset["train"]
    x_test, y_test, test_size = dataset["test"]

```

The part that's specific to Federated Learning is in the following codes

```
# (1) import nvflare client API
from nvflare import client as flare

```
```
# (2) initializes NVFlare client API
    flare.init()

    site_name = flare.get_site_name()
    
```
    
These few lines, import NVFLARE Client API and initialize it, then use the API to find the site_name (such as site-1, site-2 etc.). With the site-name, we can construct the site-specific 
data path such as

```
    feature_data_path = f"{data_root_dir}/{site_name}_header.csv"

    data_path = f"{data_root_dir}/{site_name}.csv"
```

#### Training 

In the standard traditional xgboost, we would train the model such as
```
  model = xgb.train(...) 
```

with federated learning, using FLARE Client API, we need to make a few changes
* 1) we are not only training in local iterations, but also global rounds, we need to keep the program running until we reached to the totoal number of rounds 
  
  ```
      while flare.is_running():
          ... rest of code
  
  ```
  
* 2) Unlike local learning, we have now have more than one clients/sites participating the training. To ensure every site starts with the same model parameters, we use server to broadcase the initial model parameters to every sites at the first round ( current_round = 0). 

* 3) We will need to use FLARE client API to receive global model updates 

```
        # (3) receives FLModel from NVFlare
        input_model = flare.receive()
        global_params = input_model.params
        curr_round = input_model.current_round
```

```
        if curr_round == 0:
            # (4) first round, no global model
            model = xgb.train(
                xgb_params,
                dmat_train,
                num_boost_round=1,
                evals=[(dmat_train, "train"), (dmat_test, "test")],
            )
            config = model.save_config()
        ....
```
* 4) if it is not the first round, we need to use the global model to update the local model before training the next round. 

```
            # (5) update model based on global updates
            model_updates = global_params["model_data"]
            for update in model_updates:
                global_model_as_dict = update_model(global_model_as_dict, json.loads(update))
            loadable_model = bytearray(json.dumps(global_model_as_dict), "utf-8")
            # load model
            model.load_model(loadable_model)
            model.load_config(config)
```

* 5) we then evaluate the global model using the local data

```
            # (6) evaluate model
            auc = evaluate_model(x_test, model, y_test)
```
* 6) finally we do the training 

```
            # train model in two steps
            # first, eval on train and test
            eval_results = model.eval_set(
                evals=[(dmat_train, "train"), (dmat_test, "test")], iteration=model.num_boosted_rounds() - 1
            )
            print(eval_results)
            # second, train for one round
            model.update(dmat_train, model.num_boosted_rounds())
        
```

* 7) we need the new training result (new tree) back to server for aggregation, to do that, we have the following code

```
        # (7) construct trained FL model
        # Extract newly added tree using xgboost_bagging slicing api
        bst_new = model[model.num_boosted_rounds() - 1 : model.num_boosted_rounds()]
        local_model_update = bst_new.save_raw("json")
        params = {"model_data": local_model_update}
        metrics = {"accuracy": auc}
        output_model = flare.FLModel(params=params, metrics=metrics)

        # (8) send model back to NVFlare
        flare.send(output_model)
```

## Prepare Job  

Now, we have the code, we need to prepare job folder with configurations to run in NVFLARE. To do this, we can leveage the job template for scikit learn. First look at the the available job templates

In [5]:
!nvflare config -jt ../../../../../job_templates/

In [7]:
!nvflare job list_templates


The following job templates are available: 

------------------------------------------------------------------------------------------------------------------------
  name                 Description                                                  Controller Type      Client Category     
------------------------------------------------------------------------------------------------------------------------
  cyclic_cc_pt         client-controlled cyclic workflow with PyTorch ClientAPI tra client               client_api          
  cyclic_pt            server-controlled cyclic workflow with PyTorch ClientAPI tra server               client_api          
  psi_csv              private-set intersection for csv data                        server               Executor            
  sag_cross_np         scatter & gather and cross-site validation using numpy       server               client executor     
  sag_cse_pt           scatter & gather workflow and cross-site evaluation with Py

In [8]:
!nvflare job create -j /tmp/nvflare/jobs/xgboost -force -w xgboost_tree \
-sd code \
-f config_fed_client.conf app_script="xgboost_fl.py" app_config="--data_root_dir /tmp/nvflare/dataset/output"


The following are the variables you can change in the template

---------------------------------------------------------------------------------------------------------------------------------------
                                                                                                                                       
  job folder: /tmp/nvflare/jobs/xgboost                                                                                                  
                                                                                                                                       
---------------------------------------------------------------------------------------------------------------------------------------
  file_name                      var_name                       value                               component                          
---------------------------------------------------------------------------------------------------------------------

In [9]:
!cat /tmp/nvflare/jobs/xgboost/app/config/config_fed_client.conf

format_version = 2
app_script = "xgboost_fl.py"
app_config = "--data_root_dir /tmp/nvflare/dataset/output"
executors = [
  {
    tasks = [
      "train"
    ]
    executor {
      path = "nvflare.app_opt.pt.client_api_launcher_executor.ClientAPILauncherExecutor"
      args {
        launcher_id = "launcher"
        pipe_id = "pipe"
        heartbeat_timeout = 60
        params_exchange_format = "raw"
        params_transfer_type = "FULL"
        train_with_evaluation = true
        launch_once = true
      }
    }
  }
]
task_data_filters = []
task_result_filters = []
components = [
  {
    id = "launcher"
    path = "nvflare.app_common.launchers.subprocess_launcher.SubprocessLauncher"
    args {
      script = "python3 custom/{app_script}  {app_config} "
    }
  }
  {
    id = "pipe"
    path = "nvflare.fuel.utils.pipe.file_pipe.FilePipe"
    args {
      mode = "PASSIVE"
      root_path = ""
    }
  }
]


In [10]:
!tree /tmp/nvflare/jobs/xgboost

zsh:1: command not found: tree


>Note 
 we use skip_rows = 0 to skip 1st row. We could skip_rows = [0, 3] to skip first and 4th rows.



## Run job in simulator

We use the simulator to run this job

In [12]:
!nvflare simulator /tmp/nvflare/jobs/xgboost -w /tmp/nvflare/xgboost -n 3 -t 3

2023-11-29 12:57:55,136 - SimulatorRunner - INFO - Create the Simulator Server.
2023-11-29 12:57:55,140 - CoreCell - INFO - server: creating listener on tcp://0:65467
2023-11-29 12:57:55,180 - CoreCell - INFO - server: created backbone external listener for tcp://0:65467
2023-11-29 12:57:55,180 - ConnectorManager - INFO - 99598: Try start_listener Listener resources: {'secure': False, 'host': 'localhost'}
2023-11-29 12:57:55,181 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connector [CH00002 PASSIVE tcp://0:1345] is starting
2023-11-29 12:57:55,686 - CoreCell - INFO - server: created backbone internal listener for tcp://localhost:1345
2023-11-29 12:57:55,687 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connector [CH00001 PASSIVE tcp://0:65467] is starting
2023-11-29 12:57:55,760 - nvflare.fuel.hci.server.hci - INFO - Starting Admin Server localhost on Port 65468
2023-11-29 12:57:55,760 - SimulatorRunner - INFO - Deploy the Apps.
2023-11-29 12:57:55,835 - SimulatorRunner - INFO - Create t

Let's examine the results.

We can notice from the FL training log, at the last round of local training, site-1 reports `site-1: local model AUC: 0.6351`
Now let's run a local training to verify if this number makes sense.

In [None]:
!python3 ./code/sgd_local_iter.py --data_root_dir /tmp/nvflare/dataset/output

In [None]:
!python3 ./code/sgd_local_oneshot.py --data_root_dir /tmp/nvflare/dataset/output

In this experiment, all three clients have relatively large amount data wiht homogeneous distribution, we would expect the three numbers align within reasonable variation range. 

The final result for iterative learning is `ending model AUC: 0.6352`, and one-shot learning is `local model AUC: 0.6355`, as compared with FL's `local model AUC: 0.6351`, the numbers do align.

## We are done !
Congratulations! you have just completed the federated linear model for tabular data. 