# Secure Federated XGBoost with Homomorphic Encryption
This section illustrates how [NVIDIA FLARE](https://nvflare.readthedocs.io/en/main/index.html) enables secure federated [XGBoost](https://github.com/dmlc/xgboost) under both horizontal and vertical collaboration.
The examples are based on a [finance dataset](https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud) to perform fraud detection.

## Secure Federated Training of XGBoost
In the last section, we visited several mechanisms for training an XGBoost model in a federated learning setting, including histogram-based vertical, histogram-based horizontal, and tree-based horizontal methods.

In this example, we further extend the existing histogram-based horizontal and vertical federated learning approaches to support secure federated learning using homomorphic encryption. Depending on the characteristics of the data to be encrypted, we can choose between [CKKS](https://github.com/OpenMined/TenSEAL) and [Paillier](https://github.com/intel/pailliercryptolib_python).

In the following, we illustrate both *histogram-based* *horizontal* and *vertical* federated XGBoost, *with* homomorphic encryption. We leverage the [vertical federated learning with secure features support](https://github.com/dmlc/xgboost/issues/9987) and [horizontal federated learning with secure features support](https://github.com/dmlc/xgboost/issues/10170) in the XGBoost open-source library.

### Secure Vertical Federated Training of XGBoost
For vertical XGBoost, the active party holds the label, which can be considered “the most valuable asset” for the whole process, and should not be accessed by passive parties. Therefore, the active party in this case is the “major contributor” from a model training perspective, with a concern of leaking this information to passive clients. In this case, the security protection is mainly against passive clients over the label information. 

To protect label information in vertical collaboration, at every round of XGBoost, after the active party computes the gradients for each sample, the gradients are encrypted before being sent to the passive parties (Figure 1). Each passive party then accumulates the received encrypted gradients (ciphertexts) according to its own feature distribution. The resulting cumulative histograms are returned to the active party, decrypted, and used by the active party for tree building.

![secure_vert_hist](./figs/secure_vert.png)
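
The flow above can be sketched with a toy Paillier implementation in pure Python. This is a minimal illustration, not the NVFlare plugin code: the key size is far too small to be secure, and the helper names (`encrypt`, `decrypt`, `add_cipher`), the fixed-point `SCALE`, and the sample values are all assumptions made for this example.

```python
import math
import random

# Toy Paillier keys built from two known Mersenne primes.
# Assumption: illustration only; real deployments use much larger moduli.
P, Q = 2**31 - 1, 2**61 - 1
N = P * Q                      # public modulus
N2 = N * N
PHI = (P - 1) * (Q - 1)        # private key
MU = pow(PHI, -1, N)           # private key (modular inverse, Python 3.8+)
SCALE = 10**6                  # hypothetical fixed-point scale for floats


def encrypt(value):
    """Encrypt one float with the public key (generator g = N + 1)."""
    m = round(value * SCALE) % N          # negative values wrap around mod N
    r = random.randrange(1, N)
    while math.gcd(r, N) != 1:
        r = random.randrange(1, N)
    return (1 + m * N) * pow(r, N, N2) % N2


def decrypt(cipher):
    """Decrypt one ciphertext with the private key and undo the scaling."""
    m = (pow(cipher, PHI, N2) - 1) // N * MU % N
    if m > N // 2:                        # map back to a signed value
        m -= N
    return m / SCALE


def add_cipher(c1, c2):
    """Homomorphic addition: multiplying ciphertexts adds the plaintexts."""
    return c1 * c2 % N2


# Active party: per-sample gradients g and hessians h (toy values).
g = [0.5, -0.25, 0.125, -0.75, 0.25, 0.5]
h = [0.25, 0.1875, 0.125, 0.1875, 0.1875, 0.25]
enc_g = [encrypt(x) for x in g]
enc_h = [encrypt(x) for x in h]

# Passive party: sees only ciphertexts plus the histogram bin of each sample
# for one of its own features.
bins = [0, 1, 0, 2, 1, 0]
num_bins = 3
enc_hist_g = [1] * num_bins    # 1 is a trivial encryption of zero
enc_hist_h = [1] * num_bins
for b, cg, ch in zip(bins, enc_g, enc_h):
    enc_hist_g[b] = add_cipher(enc_hist_g[b], cg)
    enc_hist_h[b] = add_cipher(enc_hist_h[b], ch)

# Active party: decrypts the returned histograms and uses them for tree building.
hist_g = [decrypt(c) for c in enc_hist_g]
hist_h = [decrypt(c) for c in enc_hist_h]
print("G histogram:", hist_g)
print("H histogram:", hist_h)
```

The passive party never holds the private key, so it can bucket and sum the gradients without learning their values.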

### Secure Horizontal Federated Training of XGBoost
For horizontal XGBoost, each party has “equal status” (all features and labels for a subset of the population), while the federated server performs aggregation without owning any data. In this case, clients are concerned about leaking information to the server and to each other, so the information to be protected is each client’s local histograms.

To protect the local histograms for horizontal collaboration, the histograms will be encrypted before sending to the federated server for aggregation. The aggregation will then be performed over ciphertexts and the encrypted global histograms will be returned to clients, where they will be decrypted and used for tree building. In this way, the server will have no access to the plaintext histograms, while each client will only learn the global histogram after aggregation, rather than individual local histograms.

![secure_hori_hist](./figs/secure_hori.png)
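
The aggregation over ciphertexts can be sketched as follows. Note the substitution: to keep the sketch dependency-free, a toy Paillier scheme applied element by element stands in for the vector-oriented scheme a real deployment would use. The key size is insecure, and all function names and histogram values are assumptions made for this illustration.

```python
import math
import random

# Toy additively homomorphic scheme (Paillier), used here as a stand-in.
# Assumption: illustration only; keys are far too small to be secure.
P, Q = 2**31 - 1, 2**61 - 1
N, PHI = P * Q, (P - 1) * (Q - 1)
N2, MU = N * N, pow(PHI, -1, N)
SCALE = 10**6                  # hypothetical fixed-point scale for floats


def encrypt_vec(values):
    """Client side: encrypt every histogram entry."""
    out = []
    for v in values:
        r = random.randrange(1, N)
        while math.gcd(r, N) != 1:
            r = random.randrange(1, N)
        m = round(v * SCALE) % N
        out.append((1 + m * N) * pow(r, N, N2) % N2)
    return out


def decrypt_vec(ciphers):
    """Client side: decrypt every entry and undo the fixed-point scaling."""
    out = []
    for c in ciphers:
        m = (pow(c, PHI, N2) - 1) // N * MU % N
        out.append((m - N if m > N // 2 else m) / SCALE)
    return out


def server_aggregate(encrypted_histograms):
    """Server side: element-wise ciphertext addition, no key required."""
    total = encrypted_histograms[0]
    for enc in encrypted_histograms[1:]:
        total = [a * b % N2 for a, b in zip(total, enc)]
    return total


# Three clients, each with a local gradient histogram (toy values).
local_hists = [
    [0.5, -1.25, 2.0, 0.0],
    [1.5, 0.25, -0.5, 0.75],
    [-0.25, 1.0, 0.5, 0.25],
]
encrypted = [encrypt_vec(hist) for hist in local_hists]   # on the clients
enc_global = server_aggregate(encrypted)                  # on the server
global_hist = decrypt_vec(enc_global)                     # back on the clients
print("Global histogram:", global_hist)
```

The server only ever handles ciphertexts, and each client recovers just the summed histogram.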

### Encryption with proper HE schemes
With multiple libraries covering various HE schemes, both with and without GPU support, it is important to choose the most efficient scheme for the specific needs of a particular federated XGBoost setting. Let’s look at one example: assume N=5 participants, M=200K data samples in total, J=30 features in total, and K=256 slots per feature histogram. Depending on the type of federated learning application (vertical or horizontal), we will need different algorithms.

For the vertical application, the encryption target is the individual g/h numbers, and the computation is to add the encrypted numbers according to which histogram slots they fall into. As the number of g/h values is the same as the number of samples, for each boosting round, in theory:

* The total number of encryptions needed will be M * 2 = 400K (g and h), each encrypting a single number.
* The total number of encrypted additions needed will be (M – K) * 2 * J ≈ 12M.

In this case, the optimal scheme choice would be Paillier, because each encryption is performed over a single number. Using schemes that target vectors, such as CKKS, would waste a significant amount of space.

For the horizontal application, on the other hand, the encryption target is the local histograms G/H, and the computation is to add local histograms together to form the global histogram. For each boosting round:

* The total number of encryptions needed will be N * 2 = 10 (G and H), each encrypting a vector of length J * K = 7680.
* The total number of encrypted additions needed will be (N – 1) * 2 = 8.

In this case, the optimal scheme choice would be CKKS, because it can handle a histogram vector (with length 7680, for example) in one shot.
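
As a quick check of the arithmetic, the per-round counts can be computed directly from the formulas in this section. The function and key names below are hypothetical and exist only for this sketch.

```python
def he_workload(n_parties, n_samples, n_features, n_slots):
    """Per-boosting-round HE operation counts, following the formulas above."""
    return {
        # Vertical: encrypt each sample's g and h, then add them into slots.
        "vertical_encryptions": n_samples * 2,
        "vertical_additions": (n_samples - n_slots) * 2 * n_features,
        # Horizontal: encrypt each party's G and H vectors, then sum them.
        "horizontal_encryptions": n_parties * 2,
        "horizontal_vector_length": n_features * n_slots,
        "horizontal_additions": (n_parties - 1) * 2,
    }


workload = he_workload(n_parties=5, n_samples=200_000, n_features=30, n_slots=256)
for name, count in workload.items():
    print(f"{name}: {count:,}")
```

The vertical case involves millions of scalar operations, while the horizontal case involves only a handful of vector operations, which is what drives the choice of scheme.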

We provide encryption solutions for both CPU-only execution and efficient GPU acceleration.

## Setup
Install the required packages for data download and training:

In [None]:
%pip install -r requirements.txt

## Encryption Plugins
Secure XGBoost requires encryption plugins to work. The plugins are distributed with the NVFlare package. If you build NVFlare from source, you need
to build the plugins by following the instructions in this [README](https://github.com/NVIDIA/NVFlare/blob/main/integration/xgboost/encryption_plugins/README.md).

The build process generates two .so files: `libcuda_paillier.so` and `libnvflare.so`. Configure the paths accordingly, following the instructions in the
[XGBoost User Guide](https://nvflare.readthedocs.io/en/main/user_guide/federated_xgboost/secure_xgboost_user_guide.html).

## Data Preparation
We follow the same data preparation process as regular federated XGBoost without secure features.

### Download and Store Data
To run the examples, we use the same data as in the last section. We download the dataset and store it in `/tmp/nvflare/dataset/creditcard.csv` with the following:

In [None]:
import kagglehub
path = kagglehub.dataset_download("mlg-ulb/creditcardfraud")
! mkdir -p /tmp/nvflare/dataset/
! cp {path}/creditcard.csv /tmp/nvflare/dataset/

### Data Split
To prepare data for further experiments, we perform the following steps:
1. Split the dataset into training/validation and testing sets. 
2. Split the training/validation set: 
    * Into "train" and "valid" for baseline centralized training.
    * Into "train" and "valid" for each client under horizontal setting. 
    * Into "train" and "valid" for each client under vertical setting.

Data splits used in this example can be generated with:

In [None]:
! bash prepare_data.sh

This will generate data splits for 3 clients under all experimental settings.
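
To make the two partitioning schemes concrete, here is a minimal sketch of a horizontal (row-wise) and a vertical (column-wise) split on a toy table. It is not the logic in `prepare_data.sh`: the column names, the round-robin row assignment, and the choice of site 0 as the label holder are assumptions made for this illustration.

```python
def split_horizontal(rows, num_sites):
    """Each site gets all columns for a disjoint subset of the rows."""
    return [rows[i::num_sites] for i in range(num_sites)]


def split_vertical(rows, feature_groups, label_site=0):
    """Each site gets every row but only its own feature columns.

    Only the active party (label_site) keeps the label column.
    """
    sites = []
    for site_idx, features in enumerate(feature_groups):
        cols = ["id"] + list(features)
        if site_idx == label_site:
            cols.append("label")
        sites.append([{c: row[c] for c in cols} for row in rows])
    return sites


# Toy table with hypothetical columns f0..f2 and a binary label.
rows = [
    {"id": i, "f0": 0.1 * i, "f1": -0.2 * i, "f2": 10.0 + i, "label": i % 2}
    for i in range(6)
]

horizontal = split_horizontal(rows, num_sites=3)
vertical = split_vertical(rows, feature_groups=[["f0"], ["f1"], ["f2"]])

for idx, site_rows in enumerate(horizontal):
    print(f"horizontal site-{idx + 1}: ids {[r['id'] for r in site_rows]}")
for idx, site_rows in enumerate(vertical):
    print(f"vertical site-{idx + 1}: columns {list(site_rows[0])}")
```

In the vertical case, the shared `id` column plays the role of the record key that would be aligned across sites.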

> **_NOTE:_** In this section, we have divided the dataset into separate columns for each site,
> assuming that the datasets from different sites have already been joined using Private Set
> Intersection (PSI). However, in practice, each site initially has its own separate dataset. To
> combine these datasets accurately, you need to use PSI to match records with the same ID across
> different sites. 

> **_NOTE:_** The generated data files will be stored in the folder `/tmp/nvflare/dataset/xgb_dataset/`

## Run Baseline and Standalone Experiments
First, we run the baseline centralized training and standalone federated XGBoost training for comparison.
In this case, we use the `mock` plugin to simulate the homomorphic encryption process.
For more details regarding federated XGBoost and the interface-plugin design,
please refer to our [documentation](https://nvflare.readthedocs.io/en/main/user_guide/federated_xgboost/secure_xgboost_user_guide.html).

To run all experiments, we provide a script for all settings.

In [None]:
! bash run_training_standalone.sh

This covers baseline centralized training and federated XGBoost running on the same machine
(server and clients run in different processes), with and without secure features.

## Run Federated Experiments with NVFlare
We then run the federated XGBoost training using NVFlare Simulator via [JobAPI](https://nvflare.readthedocs.io/en/main/programming_guide/fed_job_api.html), without and with homomorphic encryption. 

In [None]:
%%capture
! python xgb_fl_job.py --data_root /tmp/nvflare/dataset/xgb_dataset/horizontal_xgb_data --data_split_mode horizontal
! python xgb_fl_job.py --data_root /tmp/nvflare/dataset/xgb_dataset/horizontal_xgb_data --data_split_mode horizontal --secure True
! python xgb_fl_job.py --data_root /tmp/nvflare/dataset/xgb_dataset/vertical_xgb_data --data_split_mode vertical
! python xgb_fl_job.py --data_root /tmp/nvflare/dataset/xgb_dataset/vertical_xgb_data --data_split_mode vertical --secure True

Secure horizontal training needs additional TenSEAL context provisioning:

In [None]:
%%capture
! nvflare provision -p project.yml -w /tmp/nvflare/workspace/fedxgb_secure/train_fl/works/horizontal_secure
! nvflare simulator /tmp/nvflare/workspace/fedxgb_secure/train_fl/jobs/horizontal_secure -w /tmp/nvflare/workspace/fedxgb_secure/train_fl/works/horizontal_secure/example_project/prod_00/site-1 -n 3 -t 3

## Results
Comparing the AUC results with centralized baseline, we have four observations:
1. The performance of the model trained with homomorphic encryption is identical to its counterpart without encryption.
2. Vertical federated learning (both secure and non-secure) has performance identical to the centralized baseline.
3. Horizontal federated learning (both secure and non-secure) performs slightly differently from the centralized baseline. This is because under horizontal FL, the local histogram quantiles are based on the local data distribution, which may not be the same as the global distribution.
4. GPU leads to different results compared to CPU, which is expected as the GPU involves some data conversions.
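
The effect described in the third observation can be illustrated with a small standalone sketch: quantile cut points computed on each site's local data differ from those computed on the pooled data. The two toy sites below are hypothetical, and `statistics.quantiles` is used only as a simple stand-in for XGBoost's own quantile sketching.

```python
import statistics

# Two hypothetical sites whose feature values follow different distributions.
site_a = [float(x) for x in range(0, 50)]       # low-valued samples
site_b = [float(x) for x in range(100, 150)]    # high-valued samples

# Cut points (bin boundaries) computed locally versus on the pooled data.
local_cuts_a = statistics.quantiles(site_a, n=4)
local_cuts_b = statistics.quantiles(site_b, n=4)
global_cuts = statistics.quantiles(site_a + site_b, n=4)

print("site A cuts:", local_cuts_a)
print("site B cuts:", local_cuts_b)
print("global cuts:", global_cuts)
```

Because the bin boundaries differ, the histograms built from them are not the same as those a centralized run would build.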

Below are sample results for CPU training:

The AUC of vertical learning (both secure and non-secure):
```
[0]	eval-auc:0.90515	train-auc:0.92747
[1]	eval-auc:0.90516	train-auc:0.92748
[2]	eval-auc:0.90518	train-auc:0.92749
```
The AUC of horizontal learning (both secure and non-secure):
```
[0]	eval-auc:0.89789	train-auc:0.92732
[1]	eval-auc:0.89791	train-auc:0.92733
[2]	eval-auc:0.89791	train-auc:0.92733
```

Comparing the tree models with centralized baseline, we have the following observations:
1. Vertical federated learning (non-secure) has exactly the same tree model as the centralized baseline.
2. Vertical federated learning (secure) has the same tree structures as the centralized baseline; however, it produces different tree records at different parties because each party holds a different feature subset, as illustrated below.
3. Horizontal federated learning (both secure and non-secure) has different tree models from the centralized baseline.

|     ![Tree Structures](./figs/tree.base.png)      |
|:-------------------------------------------------:|
|                 *Baseline Model*                  |
| ![Tree Structures](./figs/tree.vert.secure.0.png) |
|        *Secure Vertical Model at Party 0*         |
| ![Tree Structures](./figs/tree.vert.secure.1.png) |
|        *Secure Vertical Model at Party 1*         |
| ![Tree Structures](./figs/tree.vert.secure.2.png) |
|        *Secure Vertical Model at Party 2*         |

In this case, we can see that Party 0 holds Features 7 and 10, Party 1 holds Features 14, 17, and 12, and Party 2 holds none of the effective features for this tree. Parties that do not hold a feature will not, and should not, know its split value.

By combining the feature splits at all parties, the tree structures will be identical to the centralized baseline model.

When comparing the training and validation accuracy as well as the model outputs,
experiments conducted with NVFlare produce results that are identical
to those obtained from standalone scripts.

For more information, please refer to the
[secure XGBoost user guide](https://nvflare.readthedocs.io/en/main/user_guide/federated_xgboost/secure_xgboost_user_guide.html).

Now that we have covered federated XGBoost under various settings (histogram-based and tree-based, horizontal and vertical, regular and secure), let's have a [recap](../10.3_recap/recap.ipynb).