# Federated XGBoost
Several mechanisms have been proposed for training an XGBoost model in a federated learning setting.
In this section, we illustrate the use of NVFlare to carry out *horizontal* federated learning using two approaches, histogram-based collaboration and tree-based collaboration, and *vertical* federated learning using histogram-based collaboration.

## Horizontal Federated XGBoost
Under the horizontal setting, each participant in the federated learning holds a subset of
the data samples (instances / records), while each sample has all the features.

### Histogram-based Collaboration
The histogram-based collaboration federated XGBoost approach leverages NVFlare's integration of the [federated learning support](https://github.com/dmlc/xgboost/issues/7778) in the XGBoost open-source library,
which allows the existing *distributed* XGBoost training algorithm to operate in a federated manner,
with the federated clients acting as the distinct workers in the distributed XGBoost algorithm.

In distributed XGBoost, individual workers share and aggregate gradient information about their respective portions of the training data,
as required to optimize tree node splitting when building the successive boosted trees.

![hori_hist](./figs/hori_hist.png)

The shared information takes the form of quantile sketches of feature values, together with the corresponding sample gradient and sample Hessian histograms ("Local G/H"), from which the global information ("Global G/H") can be computed.

Under federated histogram-based collaboration, information of precisely the same structure is exchanged among the clients.
The main differences are that the data is partitioned across the workers according to client data ownership, rather than being arbitrarily partitionable, and that all communication goes through an aggregating federated [gRPC](https://grpc.io) server instead of direct client-to-client communication.
Histograms from different clients, in particular, are aggregated in the server and then communicated back to the clients.
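
The aggregation step can be illustrated with a minimal sketch. It is not the XGBoost or NVFlare implementation: the function names, the single-feature histograms, and the `reg_lambda` default are assumptions made for illustration. It sums per-bin gradient/Hessian pairs from two clients and scans the resulting global histogram with the standard second-order split gain.

```python
from typing import List, Tuple

# One (sum_gradient, sum_hessian) pair per feature bin (illustrative layout).
Histogram = List[Tuple[float, float]]


def aggregate_histograms(local_hists: List[Histogram]) -> Histogram:
    """Server side: sum the per-bin G/H pairs submitted by all clients."""
    n_bins = len(local_hists[0])
    return [
        (sum(h[b][0] for h in local_hists), sum(h[b][1] for h in local_hists))
        for b in range(n_bins)
    ]


def best_split(hist: Histogram, reg_lambda: float = 1.0) -> Tuple[int, float]:
    """Return (bin index, gain); bins [0..index] go left, the rest go right."""

    def score(g: float, h: float) -> float:
        return g * g / (h + reg_lambda)

    g_total = sum(g for g, _ in hist)
    h_total = sum(h for _, h in hist)
    best_idx, best_gain = -1, 0.0
    g_left = h_left = 0.0
    for idx in range(len(hist) - 1):
        g_left += hist[idx][0]
        h_left += hist[idx][1]
        gain = 0.5 * (
            score(g_left, h_left)
            + score(g_total - g_left, h_total - h_left)
            - score(g_total, h_total)
        )
        if gain > best_gain:
            best_idx, best_gain = idx, gain
    return best_idx, best_gain


# Two clients report local histograms for the same feature and bin edges.
client_a = [(-2.0, 1.0), (-1.0, 1.0), (0.5, 1.0)]
client_b = [(-1.0, 1.0), (1.5, 1.0), (2.5, 1.0)]
global_hist = aggregate_histograms([client_a, client_b])
split_bin, split_gain = best_split(global_hist)
print(global_hist, split_bin)
```

Because only per-bin sums leave each client, the server never sees individual samples, yet every client ends up evaluating splits on the same global statistics.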

### Tree-based Collaboration
Under tree-based collaboration, individual trees are independently trained on each client's local data without aggregating the global sample gradient histogram information. 
Trained trees are collected and passed to the server / other clients for aggregation and / or further boosting rounds.

Compared with histogram-based collaboration, the major difference is that histogram-based methods exchange the intermediate results of tree boosting, while tree-based methods exchange the final tree model.

Under this setting, we can further distinguish between two types of tree-based collaboration: cyclic and bagging.

#### Cyclic Training
"Cyclic XGBoost" is one way of performing tree-based federated boosting with 
multiple sites: 

![hori_cyclic](./figs/cyclic.png)

At each round of tree boosting, instead of relying on statistics collected from
all clients' data, the boosting relies on only one client's
local data. The resulting tree sequence is then forwarded to the next client for
the next round of boosting. One full "cycle" is complete when all clients have been covered.
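
The control flow of cyclic training can be sketched as below. This is a toy under stated assumptions, not NVFlare code: each "tree" is reduced to a single constant leaf fitted to squared-error residuals, and `cyclic_training`, `boost_one_round`, and the learning rate are hypothetical names and values chosen for illustration.

```python
from typing import List, Sequence


def predict(model: List[float], learning_rate: float) -> float:
    """Prediction of the toy model: shrunken sum of all leaf values so far."""
    return learning_rate * sum(model)


def boost_one_round(
    model: List[float], local_labels: Sequence[float], learning_rate: float
) -> float:
    """Fit one toy 'tree' (a constant leaf) to the residuals on local data only."""
    current = predict(model, learning_rate)
    residuals = [y - current for y in local_labels]
    return sum(residuals) / len(residuals)


def cyclic_training(
    client_labels: Sequence[Sequence[float]],
    num_rounds: int,
    learning_rate: float = 0.5,
) -> List[float]:
    """Round-robin boosting: round r is trained by client r mod n_clients."""
    model: List[float] = []
    for round_idx in range(num_rounds):
        client_idx = round_idx % len(client_labels)
        new_tree = boost_one_round(model, client_labels[client_idx], learning_rate)
        # The full tree sequence, including the new tree, goes to the next client.
        model.append(new_tree)
    return model


# Three clients, two full cycles (six rounds).
model = cyclic_training([[1.0, 1.0], [3.0], [5.0]], num_rounds=6)
print(len(model), model[:2])
```

Each client only ever sees the trees built so far and its own data, which is why no gradient histograms need to be exchanged in this mode.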

#### Bagging Aggregation

"Bagging XGBoost" is another way of performing tree-based federated boosting with multiple sites: 

![hori_bagging](./figs/tree.png)

At each round of tree boosting, all sites start from the same "global model" and boost a number of trees (in the current example, 1 tree) based on their local data. The resulting trees are then sent to the server. A bagging aggregation scheme is applied to all the submitted trees to update the global model, which is then distributed to all clients for the next round of boosting.

This scheme bears some similarity to the [Random Forest mode](https://xgboost.readthedocs.io/en/stable/tutorials/rf.html) of XGBoost, where `num_parallel_tree` trees are boosted in each round based on random row/column subsampling, rather than a single tree. In the federated learning setting, the split is fixed by client data ownership rather than random, and there is no column subsampling.

In addition to the basic uniform shrinkage setting, where all clients have the same learning rate, we enabled scaled shrinkage across clients, based on our research, for weighted aggregation according to each client's data size. This is shown to significantly improve the model's performance on non-uniform quantity splits.

Specifically, the global model is updated by aggregating the trees from all clients as a forest, and the global model is then broadcast back to all clients for local prediction and further training.
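
One way to picture the two shrinkage settings is the sketch below. It is an assumption-laden illustration rather than NVFlare's code: the function names are hypothetical, and dividing the global learning rate evenly in the uniform case is our assumption about how "the same learning rate" is realized. The scaled case weights each client's tree by its share of the total data.

```python
from typing import List, Sequence


def client_learning_rates(
    global_lr: float, client_sizes: Sequence[int], mode: str = "scaled"
) -> List[float]:
    """Per-client shrinkage so the forest's combined step matches global_lr."""
    n_clients = len(client_sizes)
    if mode == "uniform":
        # Assumption: every client gets an equal share of the global rate.
        return [global_lr / n_clients] * n_clients
    if mode == "scaled":
        # Weight each client by its fraction of the total training samples.
        total = sum(client_sizes)
        return [global_lr * size / total for size in client_sizes]
    raise ValueError(f"unknown mode: {mode}")


def forest_contribution(
    leaf_values: Sequence[float], rates: Sequence[float]
) -> float:
    """Prediction update from one round: shrunken sum over the clients' trees."""
    return sum(rate * value for rate, value in zip(rates, leaf_values))


sizes = [100, 300]  # non-uniform quantity split (illustrative numbers)
uniform = client_learning_rates(0.1, sizes, mode="uniform")
scaled = client_learning_rates(0.1, sizes, mode="scaled")
print(uniform, scaled)
print(forest_contribution([1.0, 2.0], scaled))
```

Under the scaled setting, the tree trained on the larger dataset has proportionally more influence on the aggregated update, which is the intuition behind weighting by data size.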

The XGBoost Booster API is leveraged to create in-memory Booster objects that persist across rounds to cache predictions from trees added in previous rounds and retain other data structures needed for training.

## Vertical Federated XGBoost
Under the vertical setting, each participant in the federated learning holds
a subset of the features, while every site has all the overlapping instances.

### Private Set Intersection (PSI)
In this tutorial, we assume that all parties hold the same population but different features. 

In reality, however, not every site will have the same set of data samples (rows), and PSI should first be used to compare encrypted versions of the sites' datasets in order to jointly compute the intersection based on common IDs. To learn more about our PSI protocol implementation, see our [psi example](https://github.com/NVIDIA/NVFlare/tree/main/examples/advanced/psi/README.md).
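
To give a feel for how an intersection can be computed on encrypted IDs, here is a toy Diffie-Hellman-style sketch. It is not NVFlare's PSI implementation and is not secure: the small modulus `P`, the hashing scheme, and all function names are assumptions chosen for illustration, and both parties are simulated in one process.

```python
import hashlib
import math
import secrets
from typing import List, Sequence, Set

# Toy modulus (a Mersenne prime). Far too small and unstructured for real use.
P = 2**127 - 1


def new_secret() -> int:
    """Secret exponent coprime to P - 1, so x -> x**k mod P is a bijection."""
    while True:
        k = secrets.randbelow(P - 3) + 2
        if math.gcd(k, P - 1) == 1:
            return k


def hash_to_group(record_id: str) -> int:
    """Map an ID to an element of [1, P - 1]."""
    digest = hashlib.sha256(record_id.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (P - 1) + 1


def blind(ids: Sequence[str], secret: int) -> List[int]:
    """First pass: H(id) ** secret mod P."""
    return [pow(hash_to_group(i), secret, P) for i in ids]


def reblind(values: Sequence[int], secret: int) -> List[int]:
    """Second pass: raise the other party's blinded values to our secret."""
    return [pow(v, secret, P) for v in values]


def intersect(ids_a: Sequence[str], ids_b: Sequence[str]) -> Set[str]:
    """Party A learns which of its IDs party B also holds."""
    a, b = new_secret(), new_secret()
    a_once = blind(ids_a, a)            # A -> B
    b_once = blind(ids_b, b)            # B -> A
    a_twice = reblind(a_once, b)        # B returns these in A's order
    b_twice = set(reblind(b_once, a))   # A computes these locally
    # Exponentiation commutes, so shared IDs collide after both passes.
    return {rid for rid, v in zip(ids_a, a_twice) if v in b_twice}


print(sorted(intersect(["u1", "u2", "u3"], ["u2", "u3", "u4"])))
```

Neither party sends raw IDs; each only sees values blinded by the other's secret, and matches appear only after both secrets have been applied.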

### Histogram-based Collaboration
Similar to its horizontal counterpart, vertical collaboration proceeds in three stages. First, the active party, which holds the label information, computes the gradients for each sample. Next, the gradients are broadcast to all passive parties, which use them to compute local feature histograms and find their local best splits with the corresponding gain values. Finally, all local best splits are synced to find the global best split, which determines the next split of the tree.

By exchanging gradient and split information among all sites and updating the global model accordingly, the vertical histogram-based method can produce exactly the same model as centralized training.

![vert_hist](./figs/vert_hist.png)

We leverage the [vertical federated learning support](https://github.com/dmlc/xgboost/issues/8424) in the XGBoost open-source library. This allows for the distributed XGBoost algorithm to operate in a federated manner on vertically split data.
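
The vertical flow described in this section can be sketched in a few lines. This is a simplified illustration, not the XGBoost implementation: it uses squared-error gradients and exact thresholds instead of binned histograms, and the function names and feature names are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Split = Tuple[float, str, float]  # (gain, feature name, threshold)


def gradients(labels: Sequence[float], preds: Sequence[float]):
    """Active party: squared-error gradients g = pred - label, Hessians h = 1."""
    return [p - y for p, y in zip(preds, labels)], [1.0] * len(labels)


def local_best_split(
    features: Dict[str, List[float]],
    g: Sequence[float],
    h: Sequence[float],
    reg_lambda: float = 1.0,
) -> Split:
    """Any party: best 'value <= threshold' split over its own columns."""

    def score(gs: float, hs: float) -> float:
        return gs * gs / (hs + reg_lambda)

    g_total, h_total = sum(g), sum(h)
    best: Split = (0.0, "", 0.0)
    for name, column in features.items():
        for threshold in sorted(set(column))[:-1]:
            left = [i for i, v in enumerate(column) if v <= threshold]
            gl = sum(g[i] for i in left)
            hl = sum(h[i] for i in left)
            gain = 0.5 * (
                score(gl, hl)
                + score(g_total - gl, h_total - hl)
                - score(g_total, h_total)
            )
            if gain > best[0]:
                best = (gain, name, threshold)
    return best


def global_best_split(parties: Sequence[Dict[str, List[float]]], g, h) -> Split:
    """Sync step: the split with the highest gain across all parties wins."""
    return max(local_best_split(features, g, h) for features in parties)


# The active party holds the labels; each party holds different columns.
g, h = gradients(labels=[0.0, 0.0, 1.0, 1.0], preds=[0.5] * 4)
party_1 = {"f0": [1.0, 2.0, 3.0, 4.0]}  # separates the two classes
party_2 = {"f1": [1.0, 0.0, 1.0, 0.0]}  # uninformative
winner = global_best_split([party_1, party_2], g, h)
print(winner)
```

Only gradients and candidate splits cross party boundaries here; the raw feature columns stay with their owners.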

## Setup
Install the required packages for data download and training:

In [None]:
%pip install -r requirements.txt

## Data Preparation
### Download and Store Data
To run the examples, we first download the dataset and store it in `/tmp/nvflare/dataset/creditcard.csv` with the following:

In [None]:
import kagglehub
path = kagglehub.dataset_download("mlg-ulb/creditcardfraud")
! mkdir -p /tmp/nvflare/dataset/
! cp {path}/creditcard.csv /tmp/nvflare/dataset/

### Data Split
To prepare data for further experiments, we perform the following steps:
1. Split the dataset into training/validation and testing sets. 
2. Split the training/validation set: 
    * Into "train" and "valid" for baseline centralized training.
    * Into "train" and "valid" for each client under the horizontal setting.
    * Into "train" and "valid" for each client under the vertical setting.

Data splits used in this example can be generated with

In [None]:
! bash prepare_data.sh

This will generate data splits for 3 clients under all experimental settings.

From the printed output, we can see that there are `182276` rows (data samples) in total for training, each with `31` columns (30 features + 1 label).

For vertical splits, site-wise column assignments are: 
- site-1 split cols [0:12]
- site-2 split cols [12:21]
- site-3 split cols [21:31]

For horizontal splits, site-wise row assignments are:
- site-1 split rows [0:60758]
- site-2 split rows [60758:121516]
- site-3 split rows [121516:182276]
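
The row ranges above are consistent with a uniform split in which the last site absorbs the remainder. The sketch below reproduces them under that assumption; `uniform_row_ranges` is a hypothetical helper, and `prepare_data.sh` may compute the split differently.

```python
from typing import List, Tuple


def uniform_row_ranges(total_rows: int, num_sites: int) -> List[Tuple[int, int]]:
    """Half-open [start, end) row ranges; the last site takes any remainder."""
    size = total_rows // num_sites
    ranges = []
    for site in range(num_sites):
        start = site * size
        end = total_rows if site == num_sites - 1 else start + size
        ranges.append((start, end))
    return ranges


ranges = uniform_row_ranges(total_rows=182276, num_sites=3)
print(ranges)
# [(0, 60758), (60758, 121516), (121516, 182276)]
```

Since 182276 is not divisible by 3, site-3 ends up with two more rows than the other two sites.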

> **_NOTE:_** In this section, we have divided the dataset into separate columns for each site,
> assuming that the datasets from different sites have already been joined using Private Set
> Intersection (PSI). In practice, each site initially has its own separate dataset. To
> combine these datasets accurately, PSI is needed to match records with the same ID across
> different sites. 

> **_NOTE:_** The generated data files will be stored in the folder `/tmp/nvflare/dataset/xgb_dataset/`

## Experiments
We first run centralized training to get the baseline performance, then run federated XGBoost training with the NVFlare Simulator via the [JobAPI](https://nvflare.readthedocs.io/en/main/programming_guide/fed_job_api.html).

### Centralized Baseline
For centralized training, we train the XGBoost model on the whole dataset.

Let's first examine the data used for the centralized baseline:

In [None]:
!tree /tmp/nvflare/dataset/xgb_dataset/base_xgb_data

In [None]:
import csv

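def print_first_rows_csv(file_path, num_rows=1):
    """Print the first `num_rows` rows of a CSV file as comma-joined text."""
    # newline='' is the file mode the csv module recommends for reading
    with open(file_path, 'r', newline='') as file:
        csv_reader = csv.reader(file)
        for i, row in enumerate(csv_reader):
            if i >= num_rows:
                break
            print(','.join(row))

# Show the first row of the centralized training file
file_path = '/tmp/nvflare/dataset/xgb_dataset/base_xgb_data/train.csv'
print_first_rows_csv(file_path)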

In [None]:
! python train_base.py 

By default, the results are stored in the folder `/tmp/nvflare/workspace/fedxgb/train_base/`.

### Horizontal Experiments
Let's take a look at the dataset for horizontal experiments.

In [None]:
!tree /tmp/nvflare/dataset/xgb_dataset/horizontal_xgb_data/

The first row of the site-1 data should be identical to the first row of the baseline data:

In [None]:
file_path = '/tmp/nvflare/dataset/xgb_dataset/horizontal_xgb_data/site-1/train.csv'
print_first_rows_csv(file_path)

The following cases will be covered:
- Histogram-based collaboration
- Tree-based collaboration with cyclic training 
- Tree-based collaboration with bagging aggregation

The experiments can be run with:

In [None]:
%%capture
! python xgb_fl_job.py --training_algo histogram --data_split_mode horizontal
! python xgb_fl_job.py --training_algo cyclic --data_split_mode horizontal
! python xgb_fl_job.py --training_algo bagging --data_split_mode horizontal

### Vertical Experiment
Let's take a look at the dataset for vertical experiments:

In [None]:
!tree /tmp/nvflare/dataset/xgb_dataset/vertical_xgb_data/

The first rows of the site-1/2/3 data, combined together, should be identical to the first row of the baseline data:

In [None]:
file_path = '/tmp/nvflare/dataset/xgb_dataset/vertical_xgb_data/site-1/train.csv'
print_first_rows_csv(file_path)
file_path = '/tmp/nvflare/dataset/xgb_dataset/vertical_xgb_data/site-2/train.csv'
print_first_rows_csv(file_path)
file_path = '/tmp/nvflare/dataset/xgb_dataset/vertical_xgb_data/site-3/train.csv'
print_first_rows_csv(file_path)

Histogram-based collaboration will be performed for the vertical setting:

In [None]:
%%capture
! python xgb_fl_job.py --training_algo histogram --data_split_mode vertical

## Results
We can visualize the results via the TensorBoard records:

In [None]:
%load_ext tensorboard
%tensorboard --logdir /tmp/nvflare/workspace/fedxgb/works

For reference, the training curves for the four settings are below:

![training_curves](./figs/training_curves.png)

As shown, for this task, the histogram-based methods, both vertical and horizontal, result in almost identical curves and achieve better results than bagging and cyclic training.
Bagging and cyclic training also converge to the same training accuracy at the end of training.

Also as expected, the vertical histogram-based method achieves the same performance as the baseline training.

Now let's move on to the next section, [Secure Federated XGBoost with Homomorphic Encryption](../10.2_secure_fed_xgboost/secure_fed_xgboost.ipynb), to see how to protect data privacy during histogram-based collaboration with federated learning and encryption.