# Completely Random Forest training with encrypted data using FHE

## Introduction

This example shows how to train a completely random forest (CRF) [1] on an encrypted version of the UCI-adult dataset [2-3] using fully homomorphic encryption (FHE).
Our CRF implementation supports binary-valued features only.
Therefore, categorical (non-ordinal) features need to be preprocessed into one-hot vectors (see the explanation in [1], section 2.2). Features that contain numeric values are first split into a fixed number of sub-range buckets, and then each value is mapped to a one-hot ordinal representation according to the bucket it falls into.
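
The two encodings described above can be illustrated with a small plaintext sketch. It is not the `Preprocessor` used later in this notebook: the helper names are hypothetical, the buckets are assumed to be equal-width over a known range, and the ordinal encoding is assumed to be a plain one-hot over the bucket index. The real preprocessing may choose bucket boundaries or the encoding differently.

```python
def one_hot_categorical(value, categories):
    """Return a 0/1 list with a single 1 at the position of `value`."""
    return [1 if value == c else 0 for c in categories]


def bucket_index(value, lo, hi, num_bins):
    """Map a numeric value to one of `num_bins` equal-width buckets over [lo, hi]."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    width = (hi - lo) / num_bins
    idx = int((value - lo) / width)
    # Clamp so that out-of-range values and value == hi land in a valid bucket.
    return min(max(idx, 0), num_bins - 1)


def one_hot_ordinal(value, lo, hi, num_bins):
    """One-hot vector marking the bucket that `value` falls into (assumed encoding)."""
    idx = bucket_index(value, lo, hi, num_bins)
    return [1 if i == idx else 0 for i in range(num_bins)]


# Illustrative values only: a categorical feature with three categories
# and a numeric feature split into 10 buckets over the range [17, 90].
workclass_bits = one_hot_categorical("Private", ["Private", "State-gov", "Self-emp"])
age_bits = one_hot_ordinal(37, lo=17, hi=90, num_bins=10)
```

Each original feature therefore contributes several binary columns, which is the only kind of input the CRF consumes.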

Since release 1.5.5, this demo uses the SEAL backend.

Read the UCI-adult dataset from the file system.

In [None]:
import gc
import math
import pandas as pd
import numpy as np
import os
from pathlib import Path
import warnings
from utils import elapsed_timer, get_used_ram, get_data_sets_dir
warnings.filterwarnings('ignore')

INPUT_DIR = os.path.join(get_data_sets_dir(), 'uci_adult')
train_data = pd.read_csv(os.path.join(INPUT_DIR, "adult.data"), header=None)
X_train = train_data.iloc[:,:-1]
y_train = train_data.iloc[:,-1]

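# skiprows=1 skips the first line of the test file. The regex separator splits on
# both "," and ".", and regex separators require pandas' Python parsing engine.
test_data = pd.read_csv(os.path.join(INPUT_DIR, "adult.test"), header=None, skiprows=1, sep="[,.]", engine="python")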
X_test = test_data.iloc[:,:-2]
y_test = test_data.iloc[:,-2]
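
The test set is read with the regular-expression separator `[,.]` and its label is taken from the second-to-last column. A plausible reason, assuming the labels in `adult.test` end with a period, is that splitting on the period leaves an empty trailing field. The sketch below uses the standard `re` module on a line shaped like those in the test file to show the effect; it illustrates the splitting rule and is not a reproduction of how pandas parses the file.

```python
import re

# A line shaped like those in adult.test (assumption: the label ends with ".").
line = ("25, Private, 226802, 11th, 7, Never-married, Machine-op-inspct, "
        "Own-child, Black, Male, 0, 0, 40, United-States, <=50K.")

# Split on commas and periods, as the sep="[,.]" argument does.
fields = re.split(r"[,.]", line)

features = fields[:-2]       # everything except the label and the empty tail
label = fields[-2].strip()   # the label, without the trailing period
trailing = fields[-1]        # empty string left over after the final "."
```

With this split, the 14 predictors come first, the label is at index `-2`, and the last field is empty, which matches the `iloc[:,:-2]` and `iloc[:,-2]` selections above.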

Initialize an HElayers context with a 128-bit security level, such that each ciphertext packs 4096 plaintext values.

In [None]:
import pyhelayers
# Request a SEAL context
he_context = pyhelayers.HeContext.create(["SEAL_CKKS"])
requirements = pyhelayers.HeConfigRequirement(
    num_slots = 4096,
    multiplication_depth = 3,
    fractional_part_precision = 36,
    integer_part_precision = 17,
    security_level = 128)

he_context.init(requirements)

Preprocess all features into binary features. We read the data in batches whose size equals the number of slots. For best performance, we recommend choosing a batch size that is an integer multiple of the number of ciphertext slots, i.e., `he_context.slot_count()`.
For ordinal features, the number of bins determines the granularity of their binary representation; see [1], section 2.2, for further discussion of data representation.

In [None]:
from misc.crf_utils import Preprocessor
batch_size = he_context.slot_count()
prep = Preprocessor(num_bins = 10, batch_size = batch_size)
cat_predictors, ord_predictors  = prep.preprocess_predictor_descriptions(X_train)
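
To see why the batch size is tied to the slot count, the following sketch computes how a dataset is divided into batches. The function name and the row count are hypothetical and chosen for illustration; the only assumption is that each batch fills at most one ciphertext's worth of slots per feature, so only the last batch may be partially filled.

```python
import math


def batch_sizes(num_rows, num_slots):
    """Return the number of rows in each batch when batches hold `num_slots` rows."""
    if num_rows <= 0 or num_slots <= 0:
        raise ValueError("num_rows and num_slots must be positive")
    num_batches = math.ceil(num_rows / num_slots)
    # All batches are full except possibly the last one.
    sizes = [num_slots] * (num_batches - 1)
    sizes.append(num_rows - num_slots * (num_batches - 1))
    return sizes


# Illustrative numbers: 10,000 rows with 4096 slots per ciphertext.
sizes = batch_sizes(10_000, 4096)
```

Here the last batch uses 1808 of the 4096 slots; a batch size that is not a multiple of the slot count would leave unused slots in more than one batch.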

Set the CRF model hyperparameters: the number of trees, the depth of each tree, and the feature types.

In [None]:
crf = pyhelayers.Crf(he_context)
crf.set_hyper_params(num_trees = 100, depth=3, categorical_predictors = cat_predictors, ordinal_predictors = ord_predictors, seed=42)
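
For intuition about what these hyperparameters control, here is a minimal plaintext sketch of a completely random tree. It is not the `pyhelayers.Crf` implementation: all function names are hypothetical, and the sketch assumes that each internal node tests one randomly chosen binary feature and that training only counts the class labels reaching each leaf.

```python
import random


def build_random_tree(num_features, depth, rng):
    """Pick one random feature index per internal node of a complete binary tree.

    Nodes are stored in heap order: the children of node i are 2*i+1 and 2*i+2.
    """
    return [rng.randrange(num_features) for _ in range(2 ** depth - 1)]


def leaf_index(tree, depth, x):
    """Route a binary feature vector `x` to a leaf and return the leaf number."""
    node = 0
    for _ in range(depth):
        # Go left when the tested bit is 0, right when it is 1.
        node = 2 * node + 1 + x[tree[node]]
    return node - (2 ** depth - 1)


def fit_counts(tree, depth, X, y):
    """Count [negatives, positives] per leaf; splits never depend on the data."""
    counts = [[0, 0] for _ in range(2 ** depth)]
    for x, label in zip(X, y):
        counts[leaf_index(tree, depth, x)][label] += 1
    return counts


rng = random.Random(42)
tree = build_random_tree(num_features=4, depth=2, rng=rng)
X = [[0, 1, 0, 1], [1, 0, 0, 1], [1, 1, 1, 0]]
y = [0, 1, 1]
counts = fit_counts(tree, depth=2, X=X, y=y)
```

Because the tree structure is fixed before any data is seen, more batches can be folded in later by adding their counts, which is what makes batch-by-batch training possible.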

Transform the training data into batches of binary data, encrypt each batch, and use it to homomorphically train the CRF. Note that in a more realistic use case, a data owner would encrypt the dataset and send it to a remote server, which would train the CRF on the encrypted data. For simplicity, however, we show only a single computing entity here.

In [None]:
batch_ind = 0
last_batch = False
num_batches = math.ceil(len(y_train) / batch_size)
with elapsed_timer("fit", len(y_train)):
    while not last_batch:
        batch_ind = batch_ind + 1
        print('fitting batch %d/%d' % (batch_ind, num_batches))
        X_batch_oh, y_batch_oh, last_batch = prep.transform_next_batch(X_train, y_train)
        x_train_enc, y_train_enc = crf.encode_encrypt_input(X_batch_oh, y_batch_oh)
        crf.fit(x_train_enc, y_train_enc)
print("RAM usage:", get_used_ram(), "MB")
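
Under encryption, a server cannot branch on a feature value, so leaf membership has to be expressed arithmetically. One way to do that, shown here in plaintext as an assumption rather than as a description of the HElayers internals, is to write the indicator of a leaf as a product of binary features (or their complements) along the path, and to obtain class counts by summing that indicator over the samples. The helper names are hypothetical.

```python
def leaf_indicator(x, path):
    """Return 1 if `x` satisfies every (feature_index, required_bit) on `path`, else 0.

    Only additions and multiplications are used, so the same formula could in
    principle be evaluated on encrypted 0/1 values.
    """
    indicator = 1
    for j, bit in path:
        indicator *= x[j] if bit == 1 else (1 - x[j])
    return indicator


def leaf_counts(X, y, path):
    """Return (negatives, positives) reaching the leaf described by `path`."""
    pos = sum(leaf_indicator(x, path) * label for x, label in zip(X, y))
    neg = sum(leaf_indicator(x, path) * (1 - label) for x, label in zip(X, y))
    return neg, pos


# Leaf reached when feature 0 is 1 and feature 1 is 0.
path = [(0, 1), (1, 0)]
X = [[1, 0], [1, 1], [0, 1]]
y = [1, 0, 1]
neg, pos = leaf_counts(X, y, path)
```

In this formulation the number of multiplied factors grows with the tree depth, which is one reason the tree depth and the multiplication depth of the context need to be chosen together.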

Delete `X_train` and `y_train` to free up memory

In [None]:
del X_train
del y_train
gc.collect()

Decrypt the CRF model

In [None]:
crf_plaintext = crf.decrypt_decode()

Transform the test data into batches of binary data, run inference on each batch with the plaintext CRF model, and evaluate the model's AUC over the entire test set.


In [None]:
import numpy as np
y_test_bin_all = np.empty([0,1])
y_pred_proba_all = np.empty([0,2])
batch_ind = 0
last_batch = False
while not last_batch:
    batch_ind = batch_ind + 1
    X_test_one_hot, y_test_bin, last_batch = prep.transform_next_batch(X_test, y_test)
    y_pred_proba = crf_plaintext.predict_proba(X_test_one_hot)
    y_test_bin_all = np.concatenate((y_test_bin_all, y_test_bin))
    y_pred_proba_all = np.concatenate((y_pred_proba_all, y_pred_proba))
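
As a rough picture of how class probabilities can be derived from a decrypted forest, the sketch below pools the leaf counts that a sample reaches in each tree and normalizes them. This is an assumption made for illustration: the function name is hypothetical, and `predict_proba` in the library may aggregate the trees differently.

```python
def predict_proba_from_counts(leaf_counts_per_tree):
    """Turn per-tree (negatives, positives) leaf counts into [P(neg), P(pos)].

    `leaf_counts_per_tree` holds, for one sample, the counts stored in the
    leaf that the sample reaches in each tree.
    """
    neg = sum(n for n, _ in leaf_counts_per_tree)
    pos = sum(p for _, p in leaf_counts_per_tree)
    total = neg + pos
    if total == 0:
        # No training sample reached any of these leaves: fall back to 50/50.
        return [0.5, 0.5]
    return [neg / total, pos / total]


# Illustrative counts for one sample in a three-tree forest.
proba = predict_proba_from_counts([(3, 1), (2, 2), (0, 4)])
```

The second entry plays the role of the positive-class score used for the AUC computation.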

In [None]:
from sklearn.metrics import roc_auc_score
auc = roc_auc_score(y_test_bin_all,y_pred_proba_all[:,1])
print(f"AUC metrics: {auc:.2f}")
print("RAM usage:", get_used_ram(), "MB")
if auc < 0.8:
    raise Exception("AUC too small")
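
The AUC checked above can be read as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The pure-Python sketch below computes it directly from that definition; the function name is hypothetical, and the quadratic loop is meant for illustration, not as a replacement for `roc_auc_score`.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Three of the four (positive, negative) pairs are ordered correctly.
example_auc = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

A value of 0.5 corresponds to random ranking, which is why the notebook treats an AUC below 0.8 as a failure.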

Display the first tree of the trained CRF. For each leaf, the counts of the negative and positive classes are shown, followed by the list of conditions on the binary features.

In [None]:
print(str(crf_plaintext).split("\ntree")[0])

## Citations
[1] Aslett, Louis JM, Pedro M. Esperança, and Chris C. Holmes. "Encrypted statistical machine learning: new privacy preserving methods." arXiv preprint arXiv:1508.06845 (2015).

[2] Kohavi, R., Becker, B.: UCI machine learning repository - adult dataset (1996), https://archive.ics.uci.edu/ml/datasets/adult18

[3] Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science. 