# Histogram-based Federated Learning for XGBoost on HIGGS Dataset

## Introduction
### Histogram-based Collaboration
The histogram-based collaboration approach to federated XGBoost uses NVFlare's integration of the [federated learning support](https://github.com/dmlc/xgboost/issues/7778) recently added to the XGBoost open-source library,
which allows the existing *distributed* XGBoost training algorithm to operate in a federated manner,
with the federated clients acting as the distinct workers in the distributed XGBoost algorithm.

In distributed XGBoost, the individual workers share and aggregate coarse information about their respective portions of the training data,
as required to optimize tree node splitting when building the successive boosted trees.

The shared information is in the form of quantile sketches of feature values as well as corresponding sample gradient and sample Hessian histograms.

Under federated histogram-based collaboration, precisely the same information is exchanged among the clients.

The main differences are that the data is partitioned across the workers according to client data ownership, rather than being arbitrarily partitionable, and that all communication goes through an aggregating federated [gRPC](https://grpc.io) server instead of direct client-to-client communication.

Histograms from different clients, in particular, are aggregated in the server and then communicated back to the clients.
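The aggregation step can be illustrated with a minimal sketch. This is not XGBoost's or NVFlare's implementation: the function names `local_histograms` and `aggregate`, the bin edges, and the sample values are all hypothetical, and the bin edges are assumed to have already been agreed on (in practice they come from the quantile sketches). Each client sums its sample gradients and Hessians per feature bin, and the server adds the per-client histograms element-wise.

```python
from bisect import bisect_right


def local_histograms(feature_values, gradients, hessians, bin_edges):
    """Sum one client's gradients and Hessians into shared feature bins.

    Illustrative only: bin_edges are assumed to be identical on every client.
    """
    n_bins = len(bin_edges) + 1
    grad_hist = [0.0] * n_bins
    hess_hist = [0.0] * n_bins
    for x, g, h in zip(feature_values, gradients, hessians):
        b = bisect_right(bin_edges, x)  # index of the bin containing x
        grad_hist[b] += g
        hess_hist[b] += h
    return grad_hist, hess_hist


def aggregate(histograms):
    """Element-wise sum of per-client histograms (the server-side step)."""
    return [sum(column) for column in zip(*histograms)]


# Hypothetical shared bin edges and per-client data for a single feature.
bin_edges = [0.5, 1.0, 2.0]
grad_a, hess_a = local_histograms(
    [0.1, 0.7, 1.5], [0.5, -0.25, 1.0], [1.0, 1.0, 1.0], bin_edges
)
grad_b, hess_b = local_histograms(
    [0.3, 2.2], [-0.5, 0.75], [1.0, 1.0], bin_edges
)

# The server aggregates and sends the global histograms back to every client.
global_grad = aggregate([grad_a, grad_b])
global_hess = aggregate([hess_a, hess_b])
print(global_grad, global_hess)
```

Because the global histograms are plain sums, they match what a single worker would compute on the pooled data, which is why the same split-finding logic applies in the federated setting.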

See [README](./README.md) for more information on the histogram-based collaboration.

The steps to run this example are listed below.

## 1. Set up NVFLARE

Follow the [Getting_Started](https://nvflare.readthedocs.io/en/main/getting_started.html) guide to set up a virtual environment and install NVFLARE.

We also provide a [Notebook](../../nvflare_setup.ipynb) for this setup process.

Assuming you have already set up the venv, let's first install the required packages.

In [None]:
%pip install -r requirements.txt

## 2. Data and job configs preparation
Follow this [Notebook](../data_job_setup.ipynb) to prepare the data and job configs.

## 3. Run simulated XGBoost experiment
Now that the data and job configs are ready, we run the experiment using the Simulator.

Here we simulate 5 clients under a uniform data split.
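As a rough illustration of what a uniform split means, the sketch below divides a range of sample indices into near-equal contiguous chunks, one per client. The helper `uniform_split` and the sample count are hypothetical; the actual split for this example is produced by the data preparation notebook, which may differ in detail.

```python
def uniform_split(n_samples, n_clients):
    """Return (start, end) index ranges of near-equal size, one per client.

    Hypothetical helper: any remainder is spread over the first clients.
    """
    base, remainder = divmod(n_samples, n_clients)
    ranges = []
    start = 0
    for i in range(n_clients):
        size = base + (1 if i < remainder else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


# Illustrative sample count, not the size of HIGGS.
site_ranges = uniform_split(1003, 5)
for site_id, (start, end) in enumerate(site_ranges, start=1):
    print(f"site-{site_id}: rows [{start}, {end})")
```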

**Warning: We suggest running this notebook only on a machine with more than 25 GB of RAM. Otherwise, please try a dataset smaller than HIGGS.**

The Simulator can be run either with a CLI command (please run it in a terminal):

```shell
nvflare simulator ./jobs/higgs_5_histogram_uniform_split_uniform_lr -w /tmp/nvflare/workspaces/xgboost_workspace_5_histogram_uniform_split_uniform_lr -n 5 -t 5
```

or via the Simulator API:

In [None]:
from nvflare.private.fed.app.simulator.simulator_runner import SimulatorRunner

# Same settings as the CLI command above: 5 simulated clients on 5 threads.
simulator = SimulatorRunner(
    job_folder="./jobs/higgs_5_histogram_uniform_split_uniform_lr",
    workspace="/tmp/nvflare/workspaces/xgboost_workspace_5_histogram_uniform_split_uniform_lr",
    n_clients=5,
    threads=5,
)
# Run the job and report the status returned by the Simulator.
run_status = simulator.run()
print("Simulator finished with run_status", run_status)