# Hetero-SecureBoost Tutorial

In a vertically partitioned data setting, multiple parties have different feature sets for the same common user samples. Federated learning enables these parties to collaboratively train a model without sharing their actual data. The model is trained locally at each party, and only model updates are shared, not the actual data. 
SecureBoost is a specialized tree-boosting framework designed for vertical federated learning. It performs entity alignment under a privacy-preserving protocol and constructs boosting trees across multiple parties using an encryption strategy. It allows for high-quality, lossless model training without needing a trusted third party.

In this tutorial, we will show you how to run a Hetero-SecureBoost task under FATE-2.0 without using a Pipeline. You can refer to this example for local model experimentation, algorithm modification, and testing, although we do not recommend using it directly in a production environment.

## Setup Hetero-Secureboost Step by Step

To run a Hetero-Secureboost task, several steps are needed:
1. Import required libraries and create fate context
2. Prepare tabular data and transform them into fate dataframe
3. guest&host run the python script, fit the Hetero-Secureboost model

### Import Libs and Create Context
We import these lib from later use.

In [1]:
import pandas as pd
from fate.arch.dataframe import PandasReader
import sys
from fate.ml.ensemble.algo.secureboost.hetero.guest import HeteroSecureBoostGuest
from fate.ml.ensemble.algo.secureboost.hetero.host import HeteroSecureBoostHost
from datetime import datetime

  from .autonotebook import tqdm as notebook_tqdm


We will use 'create_ctx' to create fate context. When creating fate context, please make sure that every party's context is initialized with the same unique context_name. Here, we provide 'get_current_datetime_str' in order to get a unique context name
according to the current date and time. We are running a two party vertical federation task, so guest party id and host party id are:

In [8]:
guest = ("guest", "10000")
host = ("host", "9999")

In [6]:
def get_current_datetime_str():
    return datetime.now().strftime("%Y-%m-%d-%H-%M")


def create_ctx(local, context_name):
    from fate.arch import Context
    from fate.arch.computing.standalone import CSession
    from fate.arch.federation.standalone import StandaloneFederation
    import logging

    # prepare log
    logger = logging.getLogger()
    logger.setLevel(logging.INFO)
    console_handler = logging.StreamHandler()
    console_handler.setLevel(logging.INFO)
    formatter = logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s")
    console_handler.setFormatter(formatter)
    logger.addHandler(console_handler)
    # init fate context
    computing = CSession()
    return Context(
        computing=computing, federation=StandaloneFederation(computing, context_name, local, [guest, host])
    )


### Prepare Data

We can read a csv file and transform it into a Fate-DataFrame:

In [12]:
# guest create context
guest_ctx = create_ctx(guest, get_current_datetime_str())
# read csv
df = pd.read_csv('./../../../../examples/data/breast_hetero_guest.csv')
# add sample_id column, sample id & match id are needed in FATE dataframe
df["sample_id"] = [i for i in range(len(df))]
reader = PandasReader(sample_id_name="sample_id", match_id_name="id", label_name="y", dtype="float32") 
data_guest = reader.to_frame(guest_ctx, df)

In [13]:
data_guest

<fate.arch.dataframe._dataframe.DataFrame at 0x7fd2889f17c0>

In [19]:
data_guest.as_pd_df()

Unnamed: 0,sample_id,id,x0,x1,x2,x3,x4,x5,x6,x7,...,x10,x11,x12,x13,x14,x15,x16,x17,x18,x19
0,0,133.0,0.449512,-1.247226,0.413178,0.303781,-0.123848,-0.184227,-0.219076,0.268537,...,-0.337360,-0.728193,-0.442587,-0.272757,-0.608018,-0.577235,-0.501126,0.143371,-0.466431,-0.554102
1,5,274.0,1.080023,1.207830,0.956888,0.978402,-0.555822,-0.645696,-0.399365,-0.038153,...,0.057848,0.392164,-0.050027,0.120414,-0.532348,-0.770613,-0.519694,-0.531097,-0.769127,-0.394858
2,7,76.0,-0.169639,-1.943019,-0.167192,-0.272150,2.329937,0.006804,-0.251467,0.429234,...,0.017786,-0.368046,-0.105966,-0.169129,2.119760,0.162743,-0.672216,-0.577002,0.626908,0.896114
3,9,399.0,-0.660984,-0.472313,-0.688248,-0.634204,-0.390718,-0.796360,-0.756680,-0.839314,...,-0.221505,-0.139439,-0.317344,-0.336122,-0.526014,-0.326291,-0.368166,-1.037840,-0.698901,-0.273818
4,11,246.0,-0.263364,-0.432753,-0.322891,-0.322206,-1.722935,-1.120051,-0.570489,-0.976796,...,-0.874050,0.696974,-0.986625,-0.589142,-0.260004,-0.547055,-0.036596,-1.040273,-0.111671,-0.584362
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
564,499,515.0,-0.791630,-0.158159,-0.791224,-0.749959,0.607733,-0.366730,-0.574758,-0.592724,...,-0.585313,-0.375303,-0.680696,-0.487274,0.512028,-0.506814,-0.361535,-0.117786,0.459820,-0.975096
565,504,357.0,-0.073075,-0.716655,-0.142066,-0.174028,-0.635527,-0.936601,-0.926297,-0.723241,...,-0.544529,0.265160,-0.558919,-0.431169,-0.467679,-0.980254,-0.883294,-0.933377,-0.617779,-0.646017
566,509,377.0,-0.189520,2.075826,-0.250397,-0.263902,-1.508016,-1.081769,-0.955299,-0.973701,...,-0.852755,-0.121295,-0.725744,-0.559439,-0.699688,-0.751610,-0.808558,-1.073364,-0.741279,-0.798452
567,510,234.0,-1.295188,-0.786467,-1.308161,-1.067361,-0.834079,-1.202869,-0.907465,-0.831834,...,-0.685649,-0.701704,-0.817325,-0.609383,1.533069,-0.842710,-0.664258,-0.352504,0.398070,-0.096418


For host side, creating the dataframe is the same as the guest side, except that the label_name is not needed.

In [16]:
# host create context
guest_ctx = create_ctx(host, get_current_datetime_str())
# read csv
df = pd.read_csv('./../../../../examples/data/breast_hetero_guest.csv')
# add sample_id column, sample id & match id are needed in FATE dataframe
df["sample_id"] = [i for i in range(len(df))]
reader = PandasReader(sample_id_name="sample_id", match_id_name="id", dtype="float32") 
data_host = reader.to_frame(guest_ctx, df)

In [18]:
data_host.as_pd_df()

Unnamed: 0,sample_id,id,x0,x1,x2,x3,x4,x5,x6,x7,...,x10,x11,x12,x13,x14,x15,x16,x17,x18,x19
0,0,133.0,0.449512,-1.247226,0.413178,0.303781,-0.123848,-0.184227,-0.219076,0.268537,...,-0.337360,-0.728193,-0.442587,-0.272757,-0.608018,-0.577235,-0.501126,0.143371,-0.466431,-0.554102
1,5,274.0,1.080023,1.207830,0.956888,0.978402,-0.555822,-0.645696,-0.399365,-0.038153,...,0.057848,0.392164,-0.050027,0.120414,-0.532348,-0.770613,-0.519694,-0.531097,-0.769127,-0.394858
2,7,76.0,-0.169639,-1.943019,-0.167192,-0.272150,2.329937,0.006804,-0.251467,0.429234,...,0.017786,-0.368046,-0.105966,-0.169129,2.119760,0.162743,-0.672216,-0.577002,0.626908,0.896114
3,9,399.0,-0.660984,-0.472313,-0.688248,-0.634204,-0.390718,-0.796360,-0.756680,-0.839314,...,-0.221505,-0.139439,-0.317344,-0.336122,-0.526014,-0.326291,-0.368166,-1.037840,-0.698901,-0.273818
4,11,246.0,-0.263364,-0.432753,-0.322891,-0.322206,-1.722935,-1.120051,-0.570489,-0.976796,...,-0.874050,0.696974,-0.986625,-0.589142,-0.260004,-0.547055,-0.036596,-1.040273,-0.111671,-0.584362
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
564,499,515.0,-0.791630,-0.158159,-0.791224,-0.749959,0.607733,-0.366730,-0.574758,-0.592724,...,-0.585313,-0.375303,-0.680696,-0.487274,0.512028,-0.506814,-0.361535,-0.117786,0.459820,-0.975096
565,504,357.0,-0.073075,-0.716655,-0.142066,-0.174028,-0.635527,-0.936601,-0.926297,-0.723241,...,-0.544529,0.265160,-0.558919,-0.431169,-0.467679,-0.980254,-0.883294,-0.933377,-0.617779,-0.646017
566,509,377.0,-0.189520,2.075826,-0.250397,-0.263902,-1.508016,-1.081769,-0.955299,-0.973701,...,-0.852755,-0.121295,-0.725744,-0.559439,-0.699688,-0.751610,-0.808558,-1.073364,-0.741279,-0.798452
567,510,234.0,-1.295188,-0.786467,-1.308161,-1.067361,-0.834079,-1.202869,-0.907465,-0.831834,...,-0.685649,-0.701704,-0.817325,-0.609383,1.533069,-0.842710,-0.664258,-0.352504,0.398070,-0.096418


##

### Run the Hetero-Secureboost Script

Once contexts are prepared and data are loaded, we can initialize Secureboost instances and fit models.Here we show you the complete python script for running a Hetero-Secureboost task. In this example, we will not use PSI (Private Set Intersection) for data intersection; instead, we will train the tree model directly with aligned data.

In [5]:
import pandas as pd
from fate.arch.dataframe import PandasReader
import sys
from fate.ml.ensemble.algo.secureboost.hetero.guest import HeteroSecureBoostGuest
from fate.ml.ensemble.algo.secureboost.hetero.host import HeteroSecureBoostHost
from datetime import datetime

guest = ("guest", "10000")
host = ("host", "9999")

def get_current_datetime_str():
    return datetime.now().strftime("%Y-%m-%d-%H-%M")


def create_ctx(local, context_name):
    from fate.arch import Context
    from fate.arch.computing.standalone import CSession
    from fate.arch.federation.standalone import StandaloneFederation
    import logging

    # prepare log
    logger = logging.getLogger()
    logger.setLevel(logging.INFO)
    console_handler = logging.StreamHandler()
    console_handler.setLevel(logging.INFO)
    formatter = logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s")
    console_handler.setFormatter(formatter)
    logger.addHandler(console_handler)
    # init fate context
    computing = CSession()
    return Context(
        computing=computing, federation=StandaloneFederation(computing, context_name, local, [guest, host])
    )


if __name__ == "__main__":

    party = sys.argv[1]
    max_depth = 2
    num_tree = 2

    if party == "guest":
        ctx = create_ctx(guest, get_current_datetime_str())
        df = pd.read_csv("./../../../../examples/data/breast_hetero_guest.csv")
        df["sample_id"] = [i for i in range(len(df))]

        reader = PandasReader(sample_id_name="sample_id", match_id_name="id", label_name="y", dtype="float32")
        data_guest = reader.to_frame(ctx, df)

        trees = HeteroSecureBoostGuest(num_tree, max_depth=max_depth)
        trees.fit(ctx, data_guest)
        pred = trees.get_train_predict().as_pd_df()

        print(pred)
        from sklearn.metrics import roc_auc_score
        print('auc is {}'.format(roc_auc_score(pred['label'], pred['predict_score'])))

        # save model
        import json
        with open('./guest_tree.json', 'w') as f:
            f.write(json.dumps(trees.get_model(), indent=4))

        # load model
        model_dict = json.load(open('./guest_tree.json'))

    elif party == "host":
        ctx = create_ctx(host, get_current_datetime_str())
        df_host = pd.read_csv("./../../../../examples/data/breast_hetero_host.csv")
        df_host["sample_id"] = [i for i in range(len(df_host))]

        reader_host = PandasReader(sample_id_name="sample_id", match_id_name="id", dtype="float32")
        data_host = reader_host.to_frame(ctx, df_host)

        trees = HeteroSecureBoostHost(num_tree, max_depth=max_depth)
        trees.fit(ctx, data_host)

        # save model
        import json
        with open('./host_tree.json', 'w') as f:
            f.write(json.dumps(trees.get_model(), indent=4))

        # load model
        model_dict = json.load(open('./host_tree.json'))

We save this script to a file named 'run_hetero_sbt.py' and execute it simultaneously in two terminals. The guest party terminal command is:

```
python -i ./run_hetero_sbt.py guest
```

The host party terminal command is:

```
python -i ./run_hetero_sbt.py host
```

We add -i option so that you can check the result of the script in the terminal.

Geust Terminal Outputs:
```
2023-09-13 16:42:47,053 - fate.ml.ensemble.algo.secureboost.hetero.guest - INFO - start to fit a guest tree
2023-09-13 16:42:47,583 - fate.ml.ensemble.learner.decision_tree.hetero.guest - INFO - encrypt kit setup through setter
2023-09-13 16:42:48,214 - fate.ml.ensemble.learner.decision_tree.hetero.guest - INFO - gh are packed
2023-09-13 16:42:50,978 - fate.ml.ensemble.learner.decision_tree.tree_core.decision_tree - INFO - drop leaf samples, new sample count is 569, 0 samples dropped
2023-09-13 16:42:51,067 - fate.ml.ensemble.learner.decision_tree.hetero.guest - INFO - layer 0 done: next layer will split 2 nodes, active samples num 569
2023-09-13 16:42:53,802 - fate.ml.ensemble.learner.decision_tree.tree_core.decision_tree - INFO - drop leaf samples, new sample count is 569, 0 samples dropped
2023-09-13 16:42:53,979 - fate.ml.ensemble.learner.decision_tree.hetero.guest - INFO - layer 1 done: next layer will split 4 nodes, active samples num 569
2023-09-13 16:42:54,769 - fate.ml.ensemble.algo.secureboost.hetero.guest - INFO - fitting guest decision tree 0 done
2023-09-13 16:42:54,769 - fate.ml.ensemble.algo.secureboost.hetero.guest - INFO - start to fit a guest tree
2023-09-13 16:42:55,419 - fate.ml.ensemble.learner.decision_tree.hetero.guest - INFO - encrypt kit setup through setter
2023-09-13 16:42:56,075 - fate.ml.ensemble.learner.decision_tree.hetero.guest - INFO - gh are packed
2023-09-13 16:42:58,780 - fate.ml.ensemble.learner.decision_tree.tree_core.decision_tree - INFO - drop leaf samples, new sample count is 569, 0 samples dropped
2023-09-13 16:42:58,875 - fate.ml.ensemble.learner.decision_tree.hetero.guest - INFO - layer 0 done: next layer will split 2 nodes, active samples num 569
2023-09-13 16:43:01,620 - fate.ml.ensemble.learner.decision_tree.tree_core.decision_tree - INFO - drop leaf samples, new sample count is 569, 0 samples dropped
2023-09-13 16:43:01,779 - fate.ml.ensemble.learner.decision_tree.hetero.guest - INFO - layer 1 done: next layer will split 4 nodes, active samples num 569
2023-09-13 16:43:02,564 - fate.ml.ensemble.algo.secureboost.hetero.guest - INFO - fitting guest decision tree 1 done
    sample_id     id  label  predict_score  predict_result                                     predict_detail
0           0  133.0      1       0.620374               1  "{'0': 0.3796257972717285, '1': 0.620374202728...
1           5  274.0      0       0.288331               0  "{'0': 0.7116693258285522, '1': 0.288330674171...
2           7   76.0      1       0.730982               1  "{'0': 0.26901811361312866, '1': 0.73098188638...
3           9  399.0      1       0.730982               1  "{'0': 0.26901811361312866, '1': 0.73098188638...
4          11  246.0      1       0.730982               1  "{'0': 0.26901811361312866, '1': 0.73098188638...
..        ...    ...    ...            ...             ...                                                ...
564       499  515.0      1       0.730982               1  "{'0': 0.26901811361312866, '1': 0.73098188638...
565       504  357.0      1       0.730982               1  "{'0': 0.26901811361312866, '1': 0.73098188638...
566       509  377.0      1       0.730982               1  "{'0': 0.26901811361312866, '1': 0.73098188638...
567       510  234.0      1       0.730982               1  "{'0': 0.26901811361312866, '1': 0.73098188638...
568       511  341.0      1       0.730982               1  "{'0': 0.26901811361312866, '1': 0.73098188638...

[569 rows x 6 columns]
auc is 0.9778024417314095
```

Host Terminal Outputs:

```
2023-09-13 16:42:45,399 - fate.ml.ensemble.algo.secureboost.hetero.host - INFO - data binning done
2023-09-13 16:42:45,399 - fate.ml.ensemble.algo.secureboost.hetero.host - INFO - start to fit a host tree
2023-09-13 16:42:49,417 - fate.ml.ensemble.learner.decision_tree.hetero.host - INFO - cur layer node num: 1, next layer node num: 2
2023-09-13 16:42:50,792 - fate.ml.ensemble.learner.decision_tree.tree_core.decision_tree - INFO - drop leaf samples, new sample count is 569, 0 samples dropped
2023-09-13 16:42:50,933 - fate.ml.ensemble.learner.decision_tree.hetero.host - INFO - layer 0 done: next layer will split 2 nodes, active samples num 569
2023-09-13 16:42:52,058 - fate.ml.ensemble.learner.decision_tree.hetero.host - INFO - cur layer node num: 2, next layer node num: 4
2023-09-13 16:42:53,503 - fate.ml.ensemble.learner.decision_tree.tree_core.decision_tree - INFO - drop leaf samples, new sample count is 569, 0 samples dropped
2023-09-13 16:42:53,728 - fate.ml.ensemble.learner.decision_tree.hetero.host - INFO - layer 1 done: next layer will split 4 nodes, active samples num 569
2023-09-13 16:42:53,967 - fate.ml.ensemble.algo.secureboost.hetero.host - INFO - fitting host decision tree 0 done
2023-09-13 16:42:53,967 - fate.ml.ensemble.algo.secureboost.hetero.host - INFO - start to fit a host tree
2023-09-13 16:42:57,265 - fate.ml.ensemble.learner.decision_tree.hetero.host - INFO - cur layer node num: 1, next layer node num: 2
2023-09-13 16:42:58,649 - fate.ml.ensemble.learner.decision_tree.tree_core.decision_tree - INFO - drop leaf samples, new sample count is 569, 0 samples dropped
2023-09-13 16:42:58,787 - fate.ml.ensemble.learner.decision_tree.hetero.host - INFO - layer 0 done: next layer will split 2 nodes, active samples num 569
2023-09-13 16:43:00,027 - fate.ml.ensemble.learner.decision_tree.hetero.host - INFO - cur layer node num: 2, next layer node num: 4
2023-09-13 16:43:01,377 - fate.ml.ensemble.learner.decision_tree.tree_core.decision_tree - INFO - drop leaf samples, new sample count is 569, 0 samples dropped
2023-09-13 16:43:01,598 - fate.ml.ensemble.learner.decision_tree.hetero.host - INFO - layer 1 done: next layer will split 4 nodes, active samples num 569
2023-09-13 16:43:01,839 - fate.ml.ensemble.algo.secureboost.hetero.host - INFO - fitting host decision tree 1 done
```