# Loan approval analysis using a fabricated German Credit Data dataset

This notebook shows an example of training and running a model that classifies people described by a set of attributes as good or bad credit risks.
It is based on a fabricated dataset that was generated from the <a href="https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data)">German Credit Data dataset</a> in the <a href="https://archive.ics.uci.edu/">UCI</a> repository.
The German Credit Data dataset has 20 attributes (7 numerical, 13 categorical). The target field is an integer, either Good (1) or Bad (2). The dataset's cost matrix makes it five times worse to classify a bad customer as good (cost 5) than to classify a good customer as bad (cost 1).
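
The cost asymmetry described above can be made concrete with a small sketch. The cost values (1 and 5) come from the dataset description; the function name `total_cost` and the example counts are hypothetical and only illustrate the arithmetic.

```python
import numpy as np

# Cost matrix from the dataset description: rows are the true class,
# columns the predicted class, both ordered (good, bad).
COST = np.array([[0, 1],   # true good: predicted good / predicted bad
                 [5, 0]])  # true bad:  predicted good / predicted bad

def total_cost(confusion):
    """Sum of misclassification costs for a 2x2 confusion matrix
    laid out in the same (good, bad) order as COST."""
    return int((np.asarray(confusion) * COST).sum())

# Hypothetical counts, for illustration only.
example = [[90, 10],   # 10 good customers rejected -> cost 10 * 1
           [4, 16]]    # 4 bad customers accepted   -> cost 4 * 5
print(total_cost(example))  # 30
```

Note that the confusion matrix printed later in this notebook uses the remapped labels, ordered 0 (bad) then 1 (good), so its rows and columns would need to be reordered before applying this cost matrix.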

The demonstration uses a logistic regression model for classification.

The estimated memory requirements are: model (140MB), input (7.34MB), output (0.26MB), and context (100MB).

We start by importing the required source packages.

The dataset attributes are:

|Attribute|Description|Values|
|---|---|---|
|checking|Status of existing checking account|<ul><li> A11 : ... < 0 DM</li></ul><ul><li> A12 : 0 <= ... <  200 DM</li></ul><ul><li>A13 :      ... >= 200 DM / salary assignments for at least 1 year</li></ul><ul><li>A14 : no checking account</li></ul>|
|duration|Duration in months|Numerical|
|credit-hist|Credit history|<ul><li>A30 : no credits taken/all credits paid back duly</li></ul><ul><li>A31 : all credits at this bank paid back duly</li></ul><ul><li>A32 : existing credits paid back duly till now</li></ul><ul><li>A33 : delay in paying off in the past</li></ul><ul><li>A34 : critical account/ other credits existing (not at this bank)</li></ul>|
|purpose|Purpose|<ul><li>A40 : car (new)</li></ul><ul><li>A41 : car (used)</li></ul><ul><li>A42 : furniture/equipment</li></ul><ul><li>A43 : radio/television</li></ul><ul><li>A44 : domestic appliances</li></ul><ul><li>A45 : repairs</li></ul><ul><li>A46 : education</li></ul><ul><li>A47 : (vacation - does not exist?)</li></ul><ul><li>A48 : retraining</li></ul><ul><li>A49 : business</li></ul><ul><li>A410 : others</li></ul>|
|credit-amount|Credit amount|Numerical|
|savings-account|Savings account/bonds|<ul><li>A61 : ... < 100 DM</li></ul><ul><li>A62 : 100 <= ... < 500 DM</li></ul><ul><li>A63 : 500 <= ... < 1000 DM</li></ul><ul><li>A64 : ... >= 1000 DM</li></ul><ul><li>A65 : unknown / no savings account</li></ul>|
|employment-duration|Present employment since|<ul><li>A71 : unemployed</li></ul><ul><li>A72 : ... < 1 year</li></ul><ul><li>A73 : 1 <= ... < 4 years</li></ul><ul><li>A74 : 4 <= ... < 7 years</li></ul><ul><li>A75 : ... >= 7 years</li></ul>|
|installment-income-ratio|Installment rate in percentage of disposable income|Numerical|
|marital-gender-status|Personal status and sex|<ul><li>A91 : male : divorced/separated</li></ul><ul><li>A92 : female : divorced/separated/married</li></ul><ul><li>A93 : male : single</li></ul><ul><li>A94 : male : married/widowed</li></ul><ul><li>A95 : female : single</li></ul>|
|debtors-guarantors|Other debtors / guarantors|<ul><li>A101 : none</li></ul><ul><li>A102 : co-applicant</li></ul><ul><li>A103 : guarantor</li></ul>|
|residence-since|Present residence since|Numerical|
|property|Property|<ul><li>A121 : real estate</li></ul><ul><li>A122 : if not A121 : building society savings agreement/life insurance</li></ul><ul><li>A123 : if not A121/A122 : car or other, not in attribute 6</li></ul><ul><li>A124 : unknown / no property</li></ul>|
|age|Age in years| Numerical|
|installment-plans|Other installment plans|<ul><li>A141 : bank</li></ul><ul><li>A142 : stores</li></ul><ul><li>A143 : none</li></ul>|
|housing|Housing|<ul><li>A151 : rent</li></ul><ul><li>A152 : own</li></ul><ul><li>A153 : for free</li></ul>|
|num-existing-credits|Number of existing credits at this bank|Numerical|
|job|Job|<ul><li>A171 : unemployed / unskilled - non-resident</li></ul><ul><li>A172 : unskilled - resident</li></ul><ul><li>A173 : skilled employee / official</li></ul><ul><li>A174 : management / self-employed / highly qualified employee / officer</li></ul>|
|num-liable|Number of people being liable to provide maintenance for|Numerical|
|telephone|Telephone|<ul><li>A191 : none</li></ul><ul><li>A192 : yes, registered under the customer's name</li></ul>|
|foreign-worker|Foreign worker|<ul><li>A201 : yes</li></ul><ul><li>A202 : no</li></ul>|


In [23]:
import os
import warnings
warnings.filterwarnings("ignore")

# For reproducibility
from numpy.random import seed
seed_value = 1
os.environ['PYTHONHASHSEED'] = str(seed_value)
seed(seed_value)

import random
import sys

import h5py
import numpy as np
import pandas as pd
import sklearn_json as skljson
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

from preprocessor import Preprocessor


### Data loading
Please refer to the dataset <a href="https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data)">documentation</a> for the complete list of attributes and their description.

In [24]:
cols = {'checking':str, 
        'duration':np.int64, 
        'credit-hist':str, 
        'purpose':str, 
        'credit-amount':np.int64,
        'savings-account':str, 
        'employment-duration':str, 
        'installment-income-ratio':np.int64,
        'marital-gender-status':str,
        'debtors-guarantors':str, 
        'residence-since':str, 
        'property':str, 
        'age':np.int64,
        'installment-plans':str, 
        'housing':str, 
        'num-existing-credits':np.int64, 
        'job':str,
        'num-liable':np.int64, 
        'telephone':str, 
        'foreign-worker':str, 
        'is_good':np.int64}

df = pd.read_csv('./datasets/loan_approval.generated', sep=" ", index_col=False, names=cols.keys(), header=None, dtype=cols)
df.head()

Unnamed: 0,checking,duration,credit-hist,purpose,credit-amount,savings-account,employment-duration,installment-income-ratio,marital-gender-status,debtors-guarantors,...,property,age,installment-plans,housing,num-existing-credits,job,num-liable,telephone,foreign-worker,is_good
0,A11,17,A32,A43,595,A61,A73,3,A94,A103,...,A121,23,A143,A152,1,A173,1,A191,A201,1
1,A12,16,A34,A43,284,A62,A74,4,A94,A101,...,A121,20,A143,A152,1,A174,1,A191,A201,2
2,A11,5,A32,A43,267,A61,A75,4,A93,A101,...,A123,26,A143,A152,1,A173,1,A192,A201,1
3,A14,5,A34,A41,1194,A61,A75,1,A93,A101,...,A123,62,A143,A152,4,A173,1,A191,A201,1
4,A14,5,A32,A46,924,A61,A73,1,A92,A101,...,A123,63,A142,A152,1,A173,1,A191,A201,2


### Data preprocessing

We split every row into its target value (y) and its predictors (X), and remap the target from the dataset's encoding (1 = good, 2 = bad) to 1 = good, 0 = bad.

In [25]:
X = df.drop(['is_good'], axis=1)
y = df['is_good'].replace([1, 2], [1, 0])
X.head()

Unnamed: 0,checking,duration,credit-hist,purpose,credit-amount,savings-account,employment-duration,installment-income-ratio,marital-gender-status,debtors-guarantors,residence-since,property,age,installment-plans,housing,num-existing-credits,job,num-liable,telephone,foreign-worker
0,A11,17,A32,A43,595,A61,A73,3,A94,A103,4,A121,23,A143,A152,1,A173,1,A191,A201
1,A12,16,A34,A43,284,A62,A74,4,A94,A101,4,A121,20,A143,A152,1,A174,1,A191,A201
2,A11,5,A32,A43,267,A61,A75,4,A93,A101,4,A123,26,A143,A152,1,A173,1,A192,A201
3,A14,5,A34,A41,1194,A61,A75,1,A93,A101,4,A123,62,A143,A152,4,A173,1,A191,A201
4,A14,5,A32,A46,924,A61,A73,1,A92,A101,1,A123,63,A142,A152,1,A173,1,A191,A201


### Data preprocessing

We split the dataset into training (x_train, y_train) and test (x_test, y_test) sets.

The preprocessor, fitted on the training set only, scales the features and converts the categorical features (listed in the attribute table above) to indicator vectors.

Subsequently, we split the test set into test and validation sets; the validation set holds 4096 samples.

In [26]:
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=5, stratify=y)

prep = Preprocessor()
x_train = prep.fit_transform(x_train)
x_test = prep.transform(x_test)

x_test, x_val, y_test, y_val = train_test_split(x_test, y_test, test_size=4096, random_state=5, stratify=y_test)
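
The `Preprocessor` class comes from a local `preprocessor` module that is not shown in this notebook. As a rough idea of what scaling plus indicator-vector encoding can look like, here is a minimal sketch built on scikit-learn's `ColumnTransformer`. The choice of `MinMaxScaler` and `OneHotEncoder`, the toy rows, and the name `sketch_prep` are assumptions for illustration, not the notebook's actual implementation.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Toy rows using two categorical and two numerical attributes (illustrative values).
train = pd.DataFrame({
    "checking": ["A11", "A12", "A14", "A11"],
    "housing": ["A152", "A151", "A152", "A153"],
    "duration": [6, 12, 24, 48],
    "age": [23, 35, 47, 62],
})
test = pd.DataFrame({
    "checking": ["A12"], "housing": ["A152"], "duration": [27], "age": [40],
})

categorical = ["checking", "housing"]
numerical = ["duration", "age"]

sketch_prep = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), categorical),
     ("scale", MinMaxScaler(), numerical)],
    sparse_threshold=0,  # always return a dense array
)

# Fit on the training rows only, then reuse the fitted statistics on the test rows.
train_enc = sketch_prep.fit_transform(train)
test_enc = sketch_prep.transform(test)
print(train_enc.shape, test_enc.shape)
```

Fitting on the training rows only, as the notebook does with `prep.fit_transform` and `prep.transform`, keeps test-set statistics from leaking into the scaling.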

For later use with homomorphic encryption (HE), we save the preprocessed datasets and the fitted preprocessor.

In [27]:
def save_data_set(x, y, data_type, path, s=''):
    # Create the target directory if it does not exist yet
    os.makedirs(path, exist_ok=True)

    # Features go to x_<data_type>.h5
    x_fname = os.path.join(path, f'x_{data_type}{s}.h5')
    print("Saving x_{} of shape {} in {}".format(data_type, x.shape, x_fname))
    with h5py.File(x_fname, 'w') as xf:
        xf.create_dataset('x_{}'.format(data_type), data=x)

    # Labels go to y_<data_type>.h5
    y_fname = os.path.join(path, f'y_{data_type}{s}.h5')
    print("Saving y_{} of shape {} in {}".format(data_type, y.shape, y_fname))
    with h5py.File(y_fname, 'w') as yf:
        yf.create_dataset(f'y_{data_type}', data=y)

datasets_dir = "datasets/"
model_dir = "model/"

save_data_set(x_test, y_test, data_type='test', path=datasets_dir)
save_data_set(x_train, y_train, data_type='train', path=datasets_dir)
save_data_set(x_val, y_val, data_type='val', path=datasets_dir)

# Make sure the model directory exists before saving the preprocessor
os.makedirs(model_dir, exist_ok=True)
prep.save(os.path.join(model_dir, "prep.pickle"))

Saving x_test of shape (15904, 62) in datasets/x_test.h5
Saving y_test of shape (15904,) in datasets/y_test.h5
Saving x_train of shape (80000, 62) in datasets/x_train.h5
Saving y_train of shape (80000,) in datasets/y_train.h5
Saving x_val of shape (4096, 62) in datasets/x_val.h5
Saving y_val of shape (4096,) in datasets/y_val.h5


### Logistic Regression Train

In [28]:
lr = LogisticRegression(C=0.1)
lr.fit(x_train, y_train)

print('LR model ready')

LR model ready


For later use in HE, we save the trained model.

In [29]:
def save_model(model, path):
    if not os.path.exists(path):
        os.mkdir(path)
    fname = os.path.join(path, "lr_loan_approval_model.json")
    skljson.to_json(model, fname)
    print("Saved model to ",fname)

save_model(lr, model_dir)

Saved model to  model/lr_loan_approval_model.json


### Using the model for classifying cleartext data

In [30]:
y_pred = lr.predict(x_test)

Confusion matrix and classification report for the test set

In [31]:
f,t,thresholds = metrics.roc_curve(y_test, y_pred)
cm = metrics.confusion_matrix(y_test, y_pred)
print(f"AUC Score: {metrics.auc(f,t):.3f}")
print("Classification report:")
print(metrics.classification_report(y_test, y_pred))
print("Confusion Matrix:")
print(cm)

AUC Score: 0.765
Classification report:
              precision    recall  f1-score   support

           0       0.73      0.63      0.68      4782
           1       0.85      0.90      0.87     11122

    accuracy                           0.82     15904
   macro avg       0.79      0.76      0.78     15904
weighted avg       0.81      0.82      0.81     15904

Confusion Matrix:
[[ 3013  1769]
 [ 1113 10009]]
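
Because `roc_curve` above receives hard 0/1 predictions rather than scores, the resulting AUC reduces to the mean of the two per-class recalls. The sketch below checks this against the reported confusion matrix and, on synthetic data from `make_classification` (an assumption, not the notebook's dataset), contrasts it with an AUC computed from `predict_proba` scores. The names `X_demo`, `y_demo`, and `clf` are hypothetical.

```python
import numpy as np
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Confusion matrix reported above (rows: true 0/1, columns: predicted 0/1).
cm = np.array([[3013, 1769], [1113, 10009]])
tnr = cm[0, 0] / cm[0].sum()   # recall of class 0
tpr = cm[1, 1] / cm[1].sum()   # recall of class 1
auc_from_cm = (tnr + tpr) / 2
print(f"AUC from hard labels: {auc_from_cm:.3f}")  # 0.765

# Synthetic data, only to compare the two ways of computing AUC.
X_demo, y_demo = make_classification(n_samples=2000, n_features=10, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

hard = clf.predict(X_demo)               # 0/1 labels
soft = clf.predict_proba(X_demo)[:, 1]   # probability of class 1

auc_hard = metrics.roc_auc_score(y_demo, hard)
auc_soft = metrics.roc_auc_score(y_demo, soft)
print(f"hard-label AUC: {auc_hard:.3f}, score-based AUC: {auc_soft:.3f}")
```

A score-based AUC measures ranking quality across all thresholds, whereas the hard-label value only reflects the single 0.5 threshold.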


### Using the model for classifying encrypted data

To run the model over encrypted samples with homomorphic encryption (HE), we first load the pyhelayers package. The model and the relevant datasets are read back from the "model/" and "datasets/" directories, where we saved them above.

In [32]:
import pyhelayers

Load the test data and labels from the h5 files.

In [33]:
with h5py.File(datasets_dir + "x_test.h5") as f:
    x_test = np.array(f["x_test"])
with h5py.File(datasets_dir + "y_test.h5") as f:
    y_test = np.array(f["y_test"])

Load a plain model

In [34]:
lrp = pyhelayers.LogisticRegressionPlain()
lrp.init_from_json_file(model_dir + "lr_loan_approval_model.json")
print("loaded plain model")

loaded plain model


Use a 3rd degree polynomial to approximate the sigmoid activation of the LogisticRegression model

In [35]:
lrp.set_activation(pyhelayers.LRActivation.SIGMOID_POLY_3)
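
Arithmetic HE schemes natively support only additions and multiplications, so the sigmoid has to be replaced by a polynomial. The coefficients behind `SIGMOID_POLY_3` are not shown in this notebook; the sketch below fits its own degree-3 polynomial by least squares over an assumed interval [-8, 8], purely to illustrate the idea.

```python
import numpy as np
from numpy.polynomial import polynomial as P

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Assumed approximation interval; the interval used by the library is not stated here.
x = np.linspace(-8.0, 8.0, 2001)

# Least-squares fit; coefficients are returned in ascending order (c0 + c1*x + c2*x^2 + c3*x^3).
coeffs = P.polyfit(x, sigmoid(x), deg=3)
approx = P.polyval(x, coeffs)
max_err = float(np.max(np.abs(approx - sigmoid(x))))

print("coefficients:", np.round(coeffs, 5))
print("max abs error on the interval:", round(max_err, 3))
```

Since sigmoid(x) - 0.5 is an odd function, the fitted constant term is 0.5 and the quadratic term vanishes on a symmetric interval. Outside the fitted interval the cubic diverges quickly, so the inputs to the activation need to stay within the range the approximation was designed for.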

Apply automatic optimizations

In [36]:
context = pyhelayers.DefaultContext()
optimizer = pyhelayers.HeProfileOptimizer(lrp, context)
optimizer.get_requirements().set_batch_size(16)
profile = optimizer.get_optimized_profile(False)
batch_size = profile.get_batch_size()

To reduce the memory requirements of the context, we reduce the number of rotation keys.

In [37]:
pf1=pyhelayers.PublicFunctions()
pf1.rotate=pyhelayers.RotationSetType.CUSTOM_ROTATIONS
pf1.set_rotation_steps([1,4,16,128])
pf1.conjugate=True
requirements = profile.requirement
requirements.public_functions=pf1
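
Rotations cyclically shift the slots of a ciphertext, and each supported rotation step typically requires its own key, which is why restricting the steps shrinks the context. One common use of rotations is summing all slots with a logarithmic number of rotate-and-add steps. The sketch below emulates that pattern on a cleartext NumPy array with `np.roll`; it is not pyhelayers code, and the function names are hypothetical. Which steps this model actually needs is determined by the configuration above, not by this sketch.

```python
import numpy as np

def rotate(slots, k):
    """Cyclic left rotation by k positions (cleartext stand-in for an HE rotation)."""
    return np.roll(slots, -k)

def rotate_and_sum(slots):
    """Sum all slots using log2(n) rotate-and-add steps; n must be a power of two."""
    n = len(slots)
    assert n & (n - 1) == 0, "slot count must be a power of two"
    acc = np.array(slots, dtype=float)
    step = 1
    while step < n:
        acc = acc + rotate(acc, step)  # uses rotation steps 1, 2, 4, ...
        step *= 2
    return acc

slots = np.arange(8.0)          # 0 + 1 + ... + 7 = 28
print(rotate_and_sum(slots))    # every slot ends up holding the total
```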

Initialize the HE context with the optimized configuration.

In [38]:
context.init(profile.requirement)
print('HE Context ready. Batch size=',batch_size)

HE Context ready. Batch size= 16


Print the size of the HE context (including keys).

In [39]:
evalBuf = context.save_to_buffer()
print('Size',len(evalBuf)/1024/1024,'MB')

Size 15.690414428710938 MB


### Encrypt the model

In [40]:
lr = pyhelayers.LogisticRegression(context)
lr.encode_encrypt(lrp, profile)

Object (detailed printing not implemented yet)

We run the encrypted model on batches of 16 records at a time. Here we take the first batch of the test set.

In [41]:
plain_samples = x_test.take(indices=range(0, batch_size), axis=0)
labels = y_test.take(indices=range(0, batch_size), axis=0)
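
The cell above takes only the first batch. To walk over the whole test set in batches of `batch_size`, a plain NumPy helper along the following lines could be used. The helper name `iter_batches` and the demo arrays are hypothetical; whether a final, smaller batch must be padded before encryption depends on the library and is not covered here.

```python
import numpy as np

def iter_batches(x, y, batch_size):
    """Yield consecutive (samples, labels) slices of at most batch_size rows."""
    for start in range(0, len(x), batch_size):
        stop = start + batch_size
        yield x[start:stop], y[start:stop]

# Demo arrays with the same number of columns (62) as the preprocessed data.
demo_x = np.zeros((40, 62))
demo_y = np.zeros(40)

sizes = [len(bx) for bx, _ in iter_batches(demo_x, demo_y, 16)]
print(sizes)  # [16, 16, 8]
```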

Encrypt input samples

In [42]:
samples = lr.encode_encrypt_input(plain_samples)

Now we perform inference on the 16 samples under encryption.

In [43]:
predictions=lr.predict(samples)

### Plaintext results

Decrypting the final results

In [44]:
plain_predictions = lr.decrypt_decode_output(predictions)

In [45]:
print('\nclassification results')
print('=========================================')
for label,pred in zip(labels,plain_predictions):
    print('Label:',('Good' if label==1 else 'Bad.'),end=', ')
    print('Prediction:',('Bad' if pred[0]<0.5 else 'Good.'))


classification results
Label: Good, Prediction: Good.
Label: Good, Prediction: Bad
Label: Good, Prediction: Good.
Label: Bad., Prediction: Bad
Label: Good, Prediction: Good.
Label: Good, Prediction: Good.
Label: Bad., Prediction: Good.
Label: Good, Prediction: Good.
Label: Good, Prediction: Good.
Label: Good, Prediction: Good.
Label: Good, Prediction: Good.
Label: Good, Prediction: Bad
Label: Good, Prediction: Good.
Label: Good, Prediction: Good.
Label: Bad., Prediction: Bad
Label: Good, Prediction: Bad
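
To summarize the batch, the printed labels and predictions can be turned into an accuracy figure. The sketch below transcribes the 16 label/prediction pairs from the output above (1 = Good, 0 = Bad); the decrypted scores themselves are not shown in the notebook, so only the thresholded outcomes are used. The variable names are hypothetical.

```python
import numpy as np

# Transcribed from the printed results above: 1 = Good, 0 = Bad.
batch_labels = np.array([1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1])
batch_preds  = np.array([1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0])

correct = int((batch_labels == batch_preds).sum())
accuracy = correct / len(batch_labels)
print(f"{correct} of {len(batch_labels)} correct, accuracy = {accuracy:.2f}")  # 12 of 16, 0.75
```

With live data, the same comparison would use `(plain_predictions[:, 0] >= 0.5)` in place of the transcribed predictions, assuming `plain_predictions` is a NumPy array with one score per row as the printing loop suggests.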
