# Federated Random Forest on HIGGS Dataset

## Introduction
### Libraries
This example shows how to use [NVIDIA FLARE](https://nvflare.readthedocs.io/en/2.3/index.html) for tabular data applications.
It illustrates the [Random Forest](https://xgboost.readthedocs.io/en/stable/tutorials/rf.html) functionality of the [XGBoost](https://github.com/dmlc/xgboost) library,
an optimized distributed gradient boosting library that also supports random forests. In this example, we use NVFlare to carry out *horizontal* federated learning with tree-based collaboration, forming a random forest.

### Dataset
This example illustrates a binary classification task based on the [HIGGS dataset](https://archive.ics.uci.edu/dataset/280/higgs).
This dataset contains 11 million instances, each with 28 attributes.
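
As a rough illustration of the file layout, the sketch below parses one CSV row into a label and 28 features. It relies on the UCI description of HIGGS, in which the first column is the class label and the remaining columns are the attributes. The helper name `parse_higgs_row` and the sample values are made up for this sketch.

```python
import csv
import io

NUM_FEATURES = 28  # attribute count stated above


def parse_higgs_row(row):
    """Split one CSV row into (label, features), assuming the label comes first."""
    values = [float(v) for v in row]
    if len(values) != NUM_FEATURES + 1:
        raise ValueError(f"expected {NUM_FEATURES + 1} columns, got {len(values)}")
    return int(values[0]), values[1:]


# Synthetic stand-in for one line of HIGGS.csv (the values are made up).
sample_line = ",".join(["1.0"] + ["0.5"] * NUM_FEATURES)
label, features = parse_higgs_row(next(csv.reader(io.StringIO(sample_line))))
print(label, len(features))
```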

### Horizontal Federated Learning
Under the horizontal setting, each participant (client) joining the federated learning holds a subset of the instances (also called examples or records), and each instance has all the features.
This is in contrast to vertical federated learning, where each client has part of the feature values for each instance.
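
The difference between the two settings can be pictured on a toy table. The following is only a sketch with placeholder values; `horizontal_split` and `vertical_split` are hypothetical helpers, not part of NVFlare.

```python
# Toy table: 6 instances (rows) x 4 features (columns); values are placeholders.
rows = [[f"x{i}{j}" for j in range(4)] for i in range(6)]


def horizontal_split(table, num_clients):
    """Give each client a contiguous block of rows with every column."""
    size = len(table) // num_clients
    return [table[c * size:(c + 1) * size] for c in range(num_clients)]


def vertical_split(table, num_clients):
    """Give each client every row but only a block of columns."""
    width = len(table[0]) // num_clients
    return [[r[c * width:(c + 1) * width] for r in table] for c in range(num_clients)]


h = horizontal_split(rows, 2)
v = vertical_split(rows, 2)
print([(len(part), len(part[0])) for part in h])  # (rows, columns) per client
print([(len(part), len(part[0])) for part in v])
```

In the horizontal case each client sees 3 rows with all 4 columns; in the vertical case each client sees all 6 rows but only 2 columns.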

### Tree-based Collaboration
Under tree-based collaboration, individual trees are independently trained on each client's local data without aggregating the global sample gradient histogram information.
Trained trees are collected and passed to the server / other clients for aggregation. Note that under the random forest setting, only one round of training is performed.
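
The collection step can be sketched with toy "trees". This is a simplified picture under stated assumptions, not NVFlare's aggregator, which works on serialized XGBoost models: each tree here is a one-split stump, the server simply pools all trees, and the forest prediction is the average vote. All names below are hypothetical.

```python
from statistics import fmean


def make_stump(feature_index, threshold):
    """Return a toy one-split 'tree' that outputs 1.0 or 0.0."""
    return lambda x: 1.0 if x[feature_index] > threshold else 0.0


# Hypothetical sub-forests trained independently on three clients.
client_forests = [
    [make_stump(0, 0.2), make_stump(1, 0.5)],
    [make_stump(0, 0.4), make_stump(1, 0.1)],
    [make_stump(0, 0.6), make_stump(1, 0.9)],
]


def pool_forests(forests):
    """Collect every client's trees into one global forest (no retraining)."""
    return [tree for forest in forests for tree in forest]


def predict(forest, x):
    """Average the outputs of all trees in the forest."""
    return fmean(tree(x) for tree in forest)


global_forest = pool_forests(client_forests)
print(len(global_forest), predict(global_forest, [0.5, 0.3]))
```

Because the trees never depend on each other, pooling needs no gradient or histogram exchange between clients.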

### Local Training and Aggregation
Random forest training with multiple clients can be achieved in two steps:

- Local training: each site trains a local sub-forest consisting of a number of trees on its local data, using the `subsample` and `num_parallel_tree` parameters of XGBoost.
- Global aggregation: the server collects all sub-forests from the clients and applies a bagging aggregation scheme to generate the global forest model.

No further training is performed, so `num_boost_round` should be 1 to align with the basic setting of random forest.
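
A minimal sketch of what a local training configuration might look like is given below. Only `subsample`, `num_parallel_tree`, and `num_boost_round` come from the text; the other keys are standard XGBoost parameters chosen here as assumptions, and `local_rf_params` is a hypothetical helper rather than the code used by this example's job.

```python
def local_rf_params(num_parallel_tree, subsample, nthread=16):
    """Build an XGBoost parameter dict for a one-round random forest."""
    return {
        "objective": "binary:logistic",  # assumption: binary classification objective
        "eval_metric": "auc",  # assumption: metric matching the validation step
        "eta": 1.0,  # assumption: no shrinkage, since there is only one round
        "subsample": subsample,  # fraction of local rows sampled
        "num_parallel_tree": num_parallel_tree,  # trees grown in the single round
        "tree_method": "hist",
        "nthread": nthread,
    }


NUM_BOOST_ROUND = 1  # a random forest is a single "round" of parallel trees

params = local_rf_params(num_parallel_tree=20, subsample=0.05)
print(params["num_parallel_tree"], params["subsample"], NUM_BOOST_ROUND)

# With XGBoost installed and a DMatrix `dtrain` built from the local data,
# training would then look like:
#   import xgboost as xgb
#   bst = xgb.train(params, dtrain, num_boost_round=NUM_BOOST_ROUND)
```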

The steps to run this example are listed below.

## 1. Set up NVFLARE

Follow the [Getting_Started](https://nvflare.readthedocs.io/en/2.3/getting_started.html) guide to set up a virtual environment and install NVFLARE.

We also provide a [Notebook](../../nvflare_setup.ipynb) for this setup process.

Assuming you have already set up the venv, let's first install the required packages.

```python
%pip install -r requirements.txt
```

## 2. Data preparation 

### Download and Store Data
To run the examples, we first download the dataset from the HIGGS link above.
By default, we assume the dataset is downloaded, uncompressed, and stored in `DATASET_ROOT/HIGGS.csv`.
Please note that the UCI website may experience occasional downtime.

### Generate Data Split
Since the HIGGS dataset is already randomly ordered,
the data split is specified by contiguous index ranges for each client,
rather than by a vector of random instance indices.
In this example, we choose a uniform data split with 5 clients.
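
The index ranges of a uniform split can be sketched as follows, using the sizes passed to the split script in this section (11,000,000 rows in total, 1,000,000 for validation, 5 sites). The function name, the `site-N` labels, and the choice to place the validation rows first are assumptions of this sketch; check `utils/prepare_data_split.py` for the actual layout.

```python
def uniform_index_ranges(size_total, size_valid, site_num):
    """Return {name: (start, end)} with half-open, contiguous row ranges."""
    ranges = {"valid": (0, size_valid)}  # assumption: validation rows come first
    per_site = (size_total - size_valid) // site_num
    for i in range(site_num):
        start = size_valid + i * per_site
        ranges[f"site-{i + 1}"] = (start, start + per_site)
    return ranges


ranges = uniform_index_ranges(size_total=11_000_000, size_valid=1_000_000, site_num=5)
for name, (start, end) in ranges.items():
    print(name, start, end)
```

Each client ends up with 2,000,000 consecutive rows, so a split is fully described by two integers per site.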

**Please change the DATASET_ROOT to the correct local path.**

```python
# please change this DATASET_ROOT to the correct path containing HIGGS dataset
%env DATASET_ROOT=/data
!python3 utils/prepare_data_split.py \
        --data_path "${DATASET_ROOT}/HIGGS.csv" \
        --site_num 5 \
        --size_total 11000000 \
        --size_valid 1000000 \
        --split_method uniform \
        --out_path "/tmp/nvflare/random_forest/HIGGS/data_splits/5_uniform"
```

## 3. Prepare job configs
We are using NVFlare's FL simulator to run the following experiments. 

```python
%env DATA_SPLIT_ROOT=/tmp/nvflare/random_forest/HIGGS/data_splits
!python3 utils/prepare_job_config.py \
        --site_num 5 \
        --num_local_parallel_tree 20 \
        --local_subsample 0.05 \
        --split_method uniform \
        --lr_mode uniform \
        --nthread 16 \
        --tree_method "hist" \
        --data_split_root "${DATA_SPLIT_ROOT}"
```
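
The job folder used in the next section, `higgs_5_0.05_uniform_split_uniform_lr`, appears to encode these arguments. The sketch below reproduces that pattern with a hypothetical helper; the naming logic inside `utils/prepare_job_config.py` is not shown here, so treat the format string as an inference from the folder name. It also computes the size of the global forest implied by the arguments.

```python
def job_name(site_num, local_subsample, split_method, lr_mode, dataset="higgs"):
    """Compose a folder name following the pattern seen in this example."""
    return f"{dataset}_{site_num}_{local_subsample}_{split_method}_split_{lr_mode}_lr"


name = job_name(site_num=5, local_subsample=0.05, split_method="uniform", lr_mode="uniform")
print(f"./jobs/{name}")

# 5 sites, each training 20 parallel trees, give the global forest size.
total_trees = 5 * 20
print(total_trees)
```

The resulting 100 trees match the `--num_trees 100` argument passed to the validation script in the last step.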

## 4. Run simulated random forest experiment
Now that we have the job configs ready, we run the experiment using the Simulator.

**Warning: We suggest running this notebook only on a machine with more than 25 GB of RAM. Otherwise, please try a dataset smaller than HIGGS.**

The Simulator can be used either through the CLI (please run CLI commands in a terminal):

```shell
nvflare simulator "./jobs/higgs_5_0.05_uniform_split_uniform_lr" -w "/tmp/nvflare/random_forest/workspace_5_0.05_uniform_split_uniform_lr" -n 5 -t 5
```

or via the Simulator API:

```python
from nvflare.private.fed.app.simulator.simulator_runner import SimulatorRunner

# Same job folder, workspace, client count, and thread count as the CLI command above
simulator = SimulatorRunner(
    job_folder="./jobs/higgs_5_0.05_uniform_split_uniform_lr",
    workspace="/tmp/nvflare/random_forest/workspace_5_0.05_uniform_split_uniform_lr",
    n_clients=5,
    threads=5
)
run_status = simulator.run()
print("Simulator finished with run_status", run_status)
```

## 5. Validate the trained model
The trained global random forest model can further be validated.

```python
!python3 utils/model_validation.py \
        --data_path "${DATASET_ROOT}/HIGGS.csv" \
        --model_path "/tmp/nvflare/random_forest/workspace_5_0.05_uniform_split_uniform_lr/simulate_job/app_server/xgboost_model.json" \
        --size_valid 1000000 --num_trees 100
```

The expected AUC is 0.7810306437097397
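
For readers unfamiliar with the metric, AUC can be read as the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one. The sketch below computes it directly from that definition on made-up labels and scores. It is quadratic in the number of instances and meant only as an illustration, not as the routine used by `utils/model_validation.py`.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up labels and scores, only to exercise the function.
labels = [1, 0, 1, 0, 1]
scores = [0.9, 0.4, 0.6, 0.7, 0.8]
print(roc_auc(labels, scores))
```

Here 5 of the 6 positive/negative pairs are ordered correctly, so the result is 5/6.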