# Adults Income Analysis using a fabricated Adults Data dataset

This notebook shows an example of training and running a model that classifies people, each described by a set of attributes, according to whether their income exceeds $50K/yr. 
It is based on a fabricated dataset that was generated based on the <a href="https://archive.ics.uci.edu/ml/datasets/adult">Adults income dataset</a> from the <a href="https://archive.ics.uci.edu">UCI</a> repository. 
The dataset has 14 attributes (6 numerical, 8 categorical) and the target field is an integer representing the income: whether or not the income exceeds $50K/yr.

The demonstration uses a decision tree model for classification.

The dataset attributes are: 

|Attribute|Values|
|---|---|
|Age|Numerical|
|Workclass|Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked|
|fnlwgt|Numerical|
|Education|Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool|
|Education-num|Numerical|
|Marital-status|Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse|
|Occupation|Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces|
|Relationship|Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried|
|Race|White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black|
|Sex|Female, Male|
|Capital-loss|Numerical|
|Capital-gain|Numerical|
|Hours-per-week|Numerical|
|Native-country|United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands|
|Income|>50K , <=50K|
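
As a quick sanity check of the attribute counts quoted above, the table can be written down as two column groups. This is only an illustrative sketch: the lowercase names follow the `column_names` list used in the data loading cell, and the grouping itself is read off the table rather than taken from any library.

```python
# Column groups read off the attribute table (the Income target is excluded).
numerical = ["age", "fnlwgt", "education-num",
             "capital-gain", "capital-loss", "hours-per-week"]
categorical = ["workclass", "education", "marital-status", "occupation",
               "relationship", "race", "sex", "native-country"]

# Expect 6 numerical + 8 categorical = 14 attributes in total.
print(len(numerical), len(categorical), len(numerical) + len(categorical))
```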

In [1]:
import os
import warnings
warnings.filterwarnings("ignore")

##### For reproducibility
from numpy.random import seed
seed_value= 1
os.environ['PYTHONHASHSEED']=str(seed_value)
seed(seed_value)
import numpy as np
import pandas as pd


from sklearn import metrics
from sklearn.model_selection import train_test_split

import h5py

import random
import sklearn_json as skljson
from sklearn.linear_model import LogisticRegression
import sys
from preprocessor import Preprocessor

TASK_NAME = "dt_adults_income"

run_with_gpu = False


### Data loading
Please refer to the dataset <a href="https://archive.ics.uci.edu/ml/datasets/adult">documentation</a> for the complete list of attributes and their description.

In [2]:
column_names = ['age', 'workclass', 'fnlwgt', 'education', 'education-num',
 'marital-status', 'occupation', 'relationship', 'race', 'sex',
  'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'label']

df = pd.read_csv("./datasets/adult.data", names=column_names, header=None, index_col=False, engine='python')

df.head()

| |age|workclass|fnlwgt|education|education-num|marital-status|occupation|relationship|race|sex|capital-gain|capital-loss|hours-per-week|native-country|label|
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|0|39|State-gov|77516|Bachelors|13|Never-married|Adm-clerical|Not-in-family|White|Male|2174|0|40|United-States|<=50K|
|1|50|Self-emp-not-inc|83311|Bachelors|13|Married-civ-spouse|Exec-managerial|Husband|White|Male|0|0|13|United-States|<=50K|
|2|38|Private|215646|HS-grad|9|Divorced|Handlers-cleaners|Not-in-family|White|Male|0|0|40|United-States|<=50K|
|3|53|Private|234721|11th|7|Married-civ-spouse|Handlers-cleaners|Husband|Black|Male|0|0|40|United-States|<=50K|
|4|28|Private|338409|Bachelors|13|Married-civ-spouse|Prof-specialty|Wife|Black|Female|0|0|40|Cuba|<=50K|


### Data preprocessing

We split every row into its target value (y) and predictors (X). 

The income label is mapped to an integer: 0 for `<=50K` and 1 for `>50K`.

In [3]:
X = df.drop(['label'], axis=1)
y = df['label'].str.strip().map({'<=50K': 0, '>50K': 1}).astype('int')
X.head()

| |age|workclass|fnlwgt|education|education-num|marital-status|occupation|relationship|race|sex|capital-gain|capital-loss|hours-per-week|native-country|
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|0|39|State-gov|77516|Bachelors|13|Never-married|Adm-clerical|Not-in-family|White|Male|2174|0|40|United-States|
|1|50|Self-emp-not-inc|83311|Bachelors|13|Married-civ-spouse|Exec-managerial|Husband|White|Male|0|0|13|United-States|
|2|38|Private|215646|HS-grad|9|Divorced|Handlers-cleaners|Not-in-family|White|Male|0|0|40|United-States|
|3|53|Private|234721|11th|7|Married-civ-spouse|Handlers-cleaners|Husband|Black|Male|0|0|40|United-States|
|4|28|Private|338409|Bachelors|13|Married-civ-spouse|Prof-specialty|Wife|Black|Female|0|0|40|Cuba|


### Data preprocessing

We split the dataset into the training (x_train, y_train) and test (x_test, y_test) sets and scale their features. 

We convert the categorical features (listed in the attribute table above) to indicator vectors. 

Subsequently, we split the test set into test and validation sets.
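
To make "indicator vectors" concrete, here is a minimal sketch of one-hot encoding where the categories are learned from the training column only and then reused for the test column. It is not the `Preprocessor` class used in this notebook, whose implementation is not shown here; `fit_categories`, `one_hot`, and the sample values are hypothetical names and data chosen for illustration.

```python
import numpy as np

def fit_categories(values):
    """Return the sorted distinct categories seen in the training column."""
    return sorted(set(values))

def one_hot(values, categories):
    """Encode each value as an indicator vector over `categories`.

    Values that were not seen during fitting map to an all-zero row.
    """
    index = {c: i for i, c in enumerate(categories)}
    out = np.zeros((len(values), len(categories)), dtype=np.float64)
    for row, v in enumerate(values):
        if v in index:
            out[row, index[v]] = 1.0
    return out

train_sex = ["Male", "Female", "Male"]
test_sex = ["Female", "Other"]  # "Other" is a made-up value unseen in training

cats = fit_categories(train_sex)  # categories come from the training data only
print(cats)
print(one_hot(train_sex, cats))
print(one_hot(test_sex, cats))
```

Fitting on the training data and only transforming the test data mirrors the `fit_transform` / `transform` pattern used in the next cell, and keeps information from the test set out of the preprocessing step.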

In [4]:
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=5, stratify=y)

prep = Preprocessor()
x_train = prep.fit_transform(x_train)
x_test = prep.transform(x_test)

x_test, x_val, y_test, y_val = train_test_split(x_test, y_test, test_size=4096, random_state=5, stratify=y_test)
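
The sizes of the three sets follow from the two `test_size` arguments. The sketch below reproduces the arithmetic, assuming that scikit-learn rounds a fractional `test_size` up to a whole number of samples and that an integer `test_size` is an absolute count. The total row count is taken as the sum of the set sizes printed by the save step later in the notebook.

```python
import math

# Total rows: sum of the train, test and validation sizes reported when saving.
n_samples = 26048 + 2417 + 4096

# First split: test_size=0.2 is a fraction of all rows (assumed rounded up).
n_test_initial = math.ceil(0.2 * n_samples)
n_train = n_samples - n_test_initial

# Second split: test_size=4096 is an absolute count that becomes the validation set.
n_val = 4096
n_test = n_test_initial - n_val

print(n_samples, n_train, n_test, n_val)
```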

For later use in HE, we save the different preprocessed datasets.

In [5]:
def save_data_set(x, y, data_type, path, s=''):
    # Create the target directory if it does not exist yet
    os.makedirs(path, exist_ok=True)

    # Features go to x_<data_type>.h5
    x_fname = os.path.join(path, f'x_{data_type}{s}.h5')
    print("Saving x_{} of shape {} in {}".format(data_type, x.shape, x_fname))
    with h5py.File(x_fname, 'w') as xf:
        xf.create_dataset(f'x_{data_type}', data=x)

    # Labels go to a separate file, y_<data_type>.h5
    y_fname = os.path.join(path, f'y_{data_type}{s}.h5')
    print("Saving y_{} of shape {} in {}".format(data_type, y.shape, y_fname))
    with h5py.File(y_fname, 'w') as yf:
        yf.create_dataset(f'y_{data_type}', data=y)

datasets_dir = "datasets/"
model_dir = "model/"

save_data_set(x_test, y_test, data_type='test', path=datasets_dir)
save_data_set(x_train, y_train, data_type='train', path=datasets_dir)
save_data_set(x_val, y_val, data_type='val', path=datasets_dir)

if not os.path.exists(model_dir):
    os.mkdir(model_dir)
prep.save(os.path.join(model_dir, "prep.pickle"))

Saving x_test of shape (2417, 6) in datasets/x_test.h5
Saving y_test of shape (2417,) in datasets/y_test.h5
Saving x_train of shape (26048, 6) in datasets/x_train.h5
Saving y_train of shape (26048,) in datasets/y_train.h5
Saving x_val of shape (4096, 6) in datasets/x_val.h5
Saving y_val of shape (4096,) in datasets/y_val.h5


### Training a Decision Tree Model

In [6]:
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(max_depth=3)
clf.fit(x_train, y_train)

print('DT model ready')

DT model ready


For later use in HE, we save the trained model.

In [7]:
def save_model(model, path):
    if not os.path.exists(path):
        os.mkdir(path)
    fname = os.path.join(path, f"{TASK_NAME}_model.json")
    skljson.to_json(model, fname)
    print("Saved model to ",fname)

save_model(clf, model_dir)


Saved model to  model/dt_adults_income_model.json


### Using the model for classifying cleartext data

In [8]:
y_pred = clf.predict(x_test)

Confusion Matrix - TEST

In [9]:
f,t,thresholds = metrics.roc_curve(y_test, y_pred)
cm = metrics.confusion_matrix(y_test, y_pred)
print(f"AUC Score: {metrics.auc(f,t):.3f}")
print("Classification report:")
print(metrics.classification_report(y_test, y_pred))
print("Confusion Matrix:")
print(cm)

AUC Score: 0.731
Classification report:
              precision    recall  f1-score   support

           0       0.86      0.95      0.90      1835
           1       0.78      0.51      0.61       582

    accuracy                           0.85      2417
   macro avg       0.82      0.73      0.76      2417
weighted avg       0.84      0.85      0.83      2417

Confusion Matrix:
[[1751   84]
 [ 287  295]]
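
The figures in the report can be recomputed from the confusion matrix alone, which also explains the AUC value. Because `roc_curve` is given hard 0/1 predictions here rather than scores, the ROC curve has a single interior point and its trapezoidal area reduces to the mean of the two per-class recalls. A minimal sketch of that arithmetic, with variable names chosen only for illustration:

```python
# Confusion matrix from the output above.
# Rows are the true classes 0/1, columns are the predicted classes 0/1.
cm = [[1751, 84],
      [287, 295]]
tn, fp = cm[0]
fn, tp = cm[1]

precision_1 = tp / (tp + fp)                 # precision for class 1
recall_1 = tp / (tp + fn)                    # true positive rate
recall_0 = tn / (tn + fp)                    # true negative rate
accuracy = (tp + tn) / (tp + tn + fp + fn)

# Area under the ROC curve through (0, 0), (FPR, TPR), (1, 1).
auc_hard = (recall_1 + recall_0) / 2

print(f"precision(1)={precision_1:.2f} recall(1)={recall_1:.2f} "
      f"recall(0)={recall_0:.2f} accuracy={accuracy:.2f} auc={auc_hard:.3f}")
```

An AUC that reflects the ranking quality of the model would instead need continuous scores, for example class probabilities, in place of `y_pred`.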


### Using the model for classifying encrypted data

To run the model over encrypted samples with homomorphic encryption (HE), we first load the pyhelayers package. The model and the relevant datasets are read from the "model/" and "datasets/" directories, where we saved them above.

In [10]:
import pyhelayers

Load the test data and labels from the h5 files

In [11]:
with h5py.File(os.path.join(datasets_dir, "x_test.h5"), "r") as f:
    x_test = np.array(f["x_test"])
with h5py.File(os.path.join(datasets_dir, "y_test.h5"), "r") as f:
    y_test = np.array(f["y_test"])

### Compute the feature ranges


Our implementation requires the user to specify the minimum and maximum values of each feature. Here, we extract this information from the training data and assume it also holds for the test data.

In [12]:
def get_feature_range(col):
    return (col.min(), col.max())

feature_ranges = []
for col in x_train.T:
    feature_ranges.append(get_feature_range(col))
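
The assumption that the training ranges also cover the test data can be checked directly. The sketch below uses small made-up arrays in place of `x_train` and `x_test`, computes the per-feature ranges in vectorized form, and counts test values that fall outside them. Clipping is shown as one possible remedy; it is an illustration, not something this notebook or pyhelayers prescribes.

```python
import numpy as np

# Made-up stand-ins for the preprocessed arrays (rows = samples, columns = features).
x_train_demo = np.array([[0.1, 5.0], [0.4, 9.0], [0.9, 7.0]])
x_test_demo = np.array([[0.5, 4.0], [1.2, 8.0]])

lo = x_train_demo.min(axis=0)  # per-feature minimum over the training rows
hi = x_train_demo.max(axis=0)  # per-feature maximum over the training rows
feature_ranges_demo = list(zip(lo.tolist(), hi.tolist()))

# Flag test values outside the training range, then clip them into it.
outside = (x_test_demo < lo) | (x_test_demo > hi)
x_test_clipped = np.clip(x_test_demo, lo, hi)

print(feature_ranges_demo)
print(int(outside.sum()), "of", outside.size, "test values fall outside the training range")
print(x_test_clipped)
```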

Load a plain model

In [13]:
hyper_params = pyhelayers.PlainModelHyperParams()
hyper_params.feature_ranges = feature_ranges
# hyper_params.grep = 3
# hyper_params.frep = 1

hyper_params.verbose = True

plain_dtree = pyhelayers.PlainModel.create(hyper_params, [os.path.join(model_dir, f"{TASK_NAME}_model.json")])
# plain_dtree.init_from_json_file(os.path.join(model_dir, f"{TASK_NAME}_model.json"))
print("loaded plain model")

The provided parameters are a valid represention of DTreePlain model. Printing each other supported model type and the reason for which the provided parameters are not valid for it.
*** ArimaPlain ***
/data/helayers/src/helayers/ai/arima/ArimaPlain.cpp:28: doInit: Assertion failed: streams.empty()
*** KMeansPlain ***
KMeans initialization: expecting CSV file/stream, .json given
*** XGBoostPlain ***
No subtree exists under the specified key: learner
*** NeuralNetPlain ***
Neural network initialization from a single JSON file must include initializing random weights, using Hyperparameters initRandomWeights flag
*** LogisticRegressionPlain ***
No subtree exists under the specified key: coef_
loaded plain model


Apply automatic optimizations

In [14]:
he_run_req = pyhelayers.HeRunRequirements()

if hasattr(pyhelayers, "HeaanContext"):
    print('Using HEaaN backend')
    he_run_req.set_he_context_options([pyhelayers.HeaanContext()])
else:
    print('Using SEAL backend')
    he_run_req.set_he_context_options([pyhelayers.SealCkksContext()])

Using SEAL backend


In [15]:
profile = pyhelayers.HeModel.compile(plain_dtree, he_run_req)

*** Searching profiles for mode DEFAULT ***
Running 6 simulations . . .
*** Search summary for mode DEFAULT ***
Profiles evaluated: 4
Best profile:
He configuration requirement:
Security level: 128
Integer part precision: 10
Fractional part precision: 41
Number of slots: 16384
Multiplication depth: 19
Bootstrappable: False
Automatic bootstrapping: False
Rotation keys policy: custom, 0 keys required:
[]
HE context name: SEAL_CKKS
Mode: predict
Tile layout: ( 16384 x 1 )
Mode name: DEFAULT
Is model encrypted: true
Using circuit optimization: false
Lazy encoding: false
Handle overflow: false
Base chain index: 19
Estimated model measures:
Required bootstrap operations: 0
Estimated predict CPU time (s): 37.16
Estimated init model CPU time (s): 1.58
Estimated encrypt input CPU time (s): 0.42
Estimated decrypt output CPU time (s): 0.00
Estimated throughput (samples/s): 440.92
Estimated model memory (MB): 83.89
Estimated input memory (MB): 41.94
Estimated output memory (MB): 0.52
Estimated con

Initialize the HE context with the optimized configuration.

In [16]:
he_context = pyhelayers.HeModel.create_context(profile)
if run_with_gpu:
    he_context.set_default_device(pyhelayers.DeviceType.DEVICE_GPU)
else:
    he_context.set_default_device(pyhelayers.DeviceType.DEVICE_CPU)

### Initialize and encrypt the model
We initialize the HE model using the plain model and the HE profile computed above.

In [17]:
dt = plain_dtree.get_empty_he_model(he_context)
dt.encode_encrypt(plain_dtree, profile)
print('FHE model encrypted and initialized')

FHE model encrypted and initialized


We run the encrypted model on a batch of 16 records, taken from the start of the test set. 

In [18]:
batch_size=16
plain_samples = x_test.take(indices=range(0, batch_size), axis=0)
labels = y_test.take(indices=range(0, batch_size), axis=0)
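
The cell above takes only the first batch. To process the whole test set in batches of the same size, a small generator along the following lines could be used. This is a sketch on made-up arrays; `iter_batches` is a hypothetical helper and not part of pyhelayers.

```python
import numpy as np

def iter_batches(x, y, batch_size):
    """Yield consecutive (samples, labels) batches; the last one may be shorter."""
    for start in range(0, len(x), batch_size):
        yield x[start:start + batch_size], y[start:start + batch_size]

x_demo = np.arange(40 * 6, dtype=float).reshape(40, 6)  # 40 made-up rows, 6 features
y_demo = np.zeros(40, dtype=int)

sizes = [len(xb) for xb, _ in iter_batches(x_demo, y_demo, 16)]
print(sizes)

# The first batch is the same as what `take` with range(0, 16) returns.
first_x, _ = next(iter_batches(x_demo, y_demo, 16))
same_as_take = np.array_equal(first_x, x_demo.take(indices=range(0, 16), axis=0))
print(same_as_take)
```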

Encrypt input samples

In [19]:
iop = dt.create_io_processor()
x_test_enc = iop.encode_encrypt_input_for_predict(plain_samples)
print('input data encrypted')

input data encrypted


### Run prediction over the encrypted data
We perform FHE prediction on the 16 encrypted test samples, using the encrypted model. The resulting predictions are encrypted as well; they are decrypted next and compared to the expected labels.

In [20]:
res = dt.predict(x_test_enc)
print('prediction ready')

prediction ready


### Plaintext results

We decrypt the final results and threshold them at 0.5 to obtain 0/1 class predictions.

In [21]:
res_plain = iop.decrypt_decode_output(res)
res_plain = np.where(res_plain > 0.5, 1, 0)

In [22]:
print('\nclassification results')
print('=========================================')
for label, pred in zip(labels, res_plain):
    # Class 1 is income >50K and class 0 is income <=50K, as in the mapping used for y
    print('Label:', ('>50K' if label == 1 else '<=50K'), end=', ')
    print('Prediction:', ('>50K' if pred[0] == 1 else '<=50K'))


classification results
=========================================
Label: <=50K, Prediction: <=50K
Label: <=50K, Prediction: <=50K
Label: <=50K, Prediction: <=50K
Label: <=50K, Prediction: <=50K
Label: <=50K, Prediction: <=50K
Label: >50K, Prediction: <=50K
Label: <=50K, Prediction: <=50K
Label: >50K, Prediction: <=50K
Label: <=50K, Prediction: <=50K
Label: <=50K, Prediction: >50K
Label: <=50K, Prediction: <=50K
Label: <=50K, Prediction: <=50K
Label: <=50K, Prediction: <=50K
Label: <=50K, Prediction: <=50K
Label: <=50K, Prediction: <=50K
Label: <=50K, Prediction: <=50K
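
The printed pairs can be summarized as an agreement count. The sketch below transcribes the 16 label/prediction pairs shown above as 0/1 class indices, where 1 is the >50K class per the mapping used for `y`. The array names are chosen only for this illustration.

```python
import numpy as np

# Transcribed from the printed results: 1 marks the >50K class.
labels_demo = np.array([0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0])
preds_demo = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0])

matches = int((labels_demo == preds_demo).sum())
print(f"{matches} of {len(labels_demo)} predictions match the labels")
```

A batch of 16 samples is too small to judge accuracy. Comparing the decrypted predictions with `clf.predict` on the same samples would show more directly whether the encrypted model reproduces the plain one.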
