# Tree-based Federated Learning for XGBoost on HIGGS Dataset

## Introduction
### Tree-based Collaboration
Under tree-based collaboration, individual trees are independently trained on each client's local data without aggregating the global sample gradient histogram information.
Trained trees are collected and passed to the server / other clients for aggregation. 

### Bagging Aggregation

"Bagging XGBoost" is one way of performing tree-based federated boosting across multiple sites. At each round of tree boosting, all sites start from the same "global model" and boost a number of trees (one tree in the current example) on their local data. The resulting trees are then sent to the server, which applies a bagging aggregation scheme to all submitted trees to update the global model. The updated global model is then distributed to all clients for the next round of boosting.

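The round structure described above can be sketched in plain Python. This is a toy illustration, not NVFLARE's or XGBoost's implementation: one-split regression stumps on a squared-error loss stand in for boosted trees, and the names `fit_stump`, `predict`, `bagging_round`, and `mse` are hypothetical. It also assumes that the server weights each submitted tree by `lr / n_clients`.

```python
from statistics import mean


def fit_stump(xs, residuals):
    """Fit a one-split regression stump to residuals (toy stand-in for one boosted tree)."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not left or not right:
            continue  # a split must leave samples on both sides
        lv, rv = mean(left), mean(right)
        err = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, threshold, lv, rv)
    if best is None:
        # No valid split (all x identical): fall back to a constant prediction
        c = mean(residuals)
        return (float("inf"), c, c)
    _, threshold, lv, rv = best
    return (threshold, lv, rv)


def predict(model, x):
    """Global prediction: weighted sum over every tree collected so far."""
    return sum(w * (lv if x <= t else rv) for w, (t, lv, rv) in model)


def bagging_round(global_model, client_data, lr):
    """One federated round: each client boosts one tree, the server bags them."""
    n_clients = len(client_data)
    new_trees = []
    for xs, ys in client_data:
        # Every client starts from the same global model and uses only local data
        residuals = [y - predict(global_model, x) for x, y in zip(xs, ys)]
        new_trees.append(fit_stump(xs, residuals))
    # Assumption: uniform weighting, each tree contributes lr / n_clients
    return global_model + [(lr / n_clients, tree) for tree in new_trees]


def mse(model, client_data):
    """Mean squared error of the global model over all clients' samples."""
    errors = [(y - predict(model, x)) ** 2 for xs, ys in client_data for x, y in zip(xs, ys)]
    return mean(errors)


# Two hypothetical clients, each holding a fixed (non-random) share of the rows
clients = [
    ([0.0, 1.0, 2.0, 3.0], [0.0, 0.0, 1.0, 1.0]),
    ([0.5, 1.5, 2.5, 3.5], [0.0, 0.0, 1.0, 1.0]),
]
```

Calling `bagging_round([], clients, lr=0.5)` repeatedly, feeding each result back in as the next `global_model`, mimics successive boosting rounds: the model grows by one tree per client per round.
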
This scheme is similar to the [Random Forest mode](https://xgboost.readthedocs.io/en/stable/tutorials/rf.html) of XGBoost, where `num_parallel_tree` trees, rather than a single tree, are boosted in each round based on random row/column subsampling. In the federated learning setting, the row split is fixed by client rather than random, and there is no column subsampling.

In addition to the basic uniform shrinkage setting, where all clients use the same learning rate, we enabled scaled shrinkage across clients based on our research. Scaled shrinkage weights the aggregation according to each client's data size, which is shown to significantly improve the model's performance on non-uniform quantity splits of the HIGGS data.
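
The difference between the two shrinkage settings can be illustrated with a short sketch. The helper names `shrinkage_scales` and `client_learning_rates` are hypothetical, and the exact scaling rule is an assumption: uniform shrinkage is taken to mean every client scales the base learning rate by `1 / n_clients`, while scaled shrinkage uses each client's share of the total sample count.

```python
def shrinkage_scales(client_sizes, mode="uniform"):
    """Return one learning-rate scale per client (the scales sum to 1)."""
    n_clients = len(client_sizes)
    if mode == "uniform":
        # Assumption: every client gets the same scale regardless of data size
        return [1.0 / n_clients] * n_clients
    if mode == "scaled":
        # Assumption: scale is proportional to the client's share of all samples
        total = sum(client_sizes)
        return [size / total for size in client_sizes]
    raise ValueError(f"unknown mode: {mode!r}")


def client_learning_rates(base_lr, client_sizes, mode="uniform"):
    """Per-client learning rate: base learning rate times the client's scale."""
    return [base_lr * scale for scale in shrinkage_scales(client_sizes, mode)]
```

For example, with two clients holding 100 and 300 samples, `shrinkage_scales([100, 300], "scaled")` gives the larger client three times the weight of the smaller one, whereas the uniform mode weights both equally.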

The steps to run this example are listed below.

## 1. Set up NVFLARE

Follow the [Getting_Started](https://nvflare.readthedocs.io/en/main/getting_started.html) guide to set up a virtual environment and install NVFLARE.

We also provide a [Notebook](../../nvflare_setup.ipynb) for this setup process. 

Assuming you have already set up the venv, let's first install the required packages.

In [None]:
%pip install -r requirements.txt

## 2. Data and job configs preparation 
Follow this [Notebook](../data_job_setup.ipynb) to prepare the data and job configs.

## 3. Run simulated xgboost experiment
Now that the data and job configs are ready, we run the experiment using the Simulator.

Here we simulate 5 clients under uniform data split.

**Warning: we suggest running this notebook only on a machine with more than 25 GB of RAM. Otherwise, please try a dataset smaller than HIGGS.**

The Simulator can be run either with a CLI command (please run it in a terminal):

```shell
nvflare simulator ./jobs/higgs_5_bagging_uniform_split_uniform_lr -w /tmp/nvflare/workspaces/xgboost_workspace_5_bagging_uniform_split_uniform_lr -n 5 -t 5
```

or via Simulator API:

In [None]:
from nvflare.private.fed.app.simulator.simulator_runner import SimulatorRunner

# Use the same job folder, workspace, client count, and thread count as the CLI command
simulator = SimulatorRunner(
    job_folder="./jobs/higgs_5_bagging_uniform_split_uniform_lr",
    workspace="/tmp/nvflare/workspaces/xgboost_workspace_5_bagging_uniform_split_uniform_lr",
    n_clients=5,
    threads=5,
)
# Run the simulation and report the status it returns
run_status = simulator.run()
print("Simulator finished with run_status", run_status)

## 4. Result visualization
Model accuracy can be visualized in TensorBoard.

In [None]:
%load_ext tensorboard
%tensorboard --logdir /tmp/nvflare/workspaces/xgboost_workspace_5_bagging_uniform_split_uniform_lr