# Logistic regression training over encrypted credit card fraud samples

expected RAM usage: 6.5 GB  
expected runtime: 20 seconds

## Introduction

This example demonstrates how an encrypted logistic regression (LR) model can be trained in an untrusted environment with encrypted data. Predictions are also carried out in the untrusted public environment for validation of the trained model. Prediction results are encrypted and sent back to the data owner to be decrypted in a trusted environment.

The model is trained on the creditcardfraud dataset (https://www.kaggle.com/mlg-ulb/creditcardfraud) [1]-[9].

Since release 1.5.5, this demo uses the SEAL backend.

<br>

## Step 1. Load and prepare the dataset in the trusted environment

Load and prepare the credit card fraud dataset for encryption in a trusted client environment.

In [None]:
import utils 
utils.verify_memory(min_memory_size=10)

load_from_pre_prepared = True
import json
from pathlib import Path
import pandas as pd
import numpy as np
import matplotlib.patches as mpatches

if load_from_pre_prepared:
    INPUT_DIR = Path(utils.get_data_sets_dir()) / 'logistic_regression_training'
else:
    INPUT_DIR = Path('data/logistic_regression_training/')

file = INPUT_DIR / 'processed_creditcard_balanced_sample.csv'
data = pd.read_csv(file, header=0)
labels = (data.iloc[:, -1:]).to_numpy(dtype=np.float128)

colors = ['r','b']
ax = pd.Series(labels.flatten()).value_counts().plot.bar(xlabel="Fraud Cases", ylabel="Frequency", legend=True, color=colors,title="Before training")
ax1 = mpatches.Patch(color='b', label='Not Fraud')
ax2 = mpatches.Patch(color='r', label='Fraud')
ax.legend(handles=[ax1,ax2], loc="lower center")

plain_samples = (data.iloc[:, :-1]).to_numpy(dtype=np.float128) 
batch_size = plain_samples.shape[0]
number_of_features = plain_samples.shape[1]

nRow, nCol = plain_samples.shape
print(f'There are {nRow} rows and {nCol} columns in our dataset.')
# data.info(verbose=True, memory_usage='deep')

<br>

## Step 2. Initialize the model and encrypt the data in a trusted environment
* A LogisticRegression object is initialized
* A requirement configuration is supplied to internally determine the most suitable HE parameters
* The encryption of the data is carried out

#### 2.1. Initialize and encrypt the logistic regression model

We can supply a `PlainModelHyperParams` object and an `HeRunRequirements` object as inputs.

The hyperparameters used are:
* the number of training epochs, set through `hyper_params.fit_hyper_params.number_of_epochs`
* the learning rate for training, set through `hyper_params.fit_hyper_params.learning_rate`
* the number of input features, set through `hyper_params.number_of_features`
* the activation used for training, set through `hyper_params.logistic_regression_activation`; here it is `pyhelayers.LRActivation.SIGMOID_POLY_3`, a degree 3 polynomial approximation of the sigmoid function

In the HE run requirements, we request a SEAL CKKS context, set the batch size, and rely on the default values for the other parameters.
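
To illustrate what a degree 3 polynomial approximation of the sigmoid looks like, the following sketch fits one by least squares with NumPy. The fitting interval `[-4, 4]` and the helper names are assumptions made for illustration; the coefficients that `SIGMOID_POLY_3` uses inside pyhelayers are not stated in this notebook and may differ.

```python
import numpy as np

def sigmoid(x):
    """Standard logistic function."""
    return 1.0 / (1.0 + np.exp(-x))

# Assumption: approximate the sigmoid on [-4, 4]; the interval used by the
# library is not given in this notebook.
grid = np.linspace(-4.0, 4.0, 801)

# Least-squares fit of a degree 3 polynomial. np.polyfit returns the
# coefficients from the highest degree down to the constant term.
c3, c2, c1, c0 = np.polyfit(grid, sigmoid(grid), deg=3)

def sigmoid_poly3(x):
    """Degree 3 polynomial stand-in for the sigmoid (illustrative only)."""
    return c0 + c1 * x + c2 * x**2 + c3 * x**3

max_error = np.max(np.abs(sigmoid_poly3(grid) - sigmoid(grid)))
print(f"coefficients: c0={c0:.4f}, c1={c1:.4f}, c2={c2:.4f}, c3={c3:.4f}")
print(f"max abs error on [-4, 4]: {max_error:.4f}")
```

A polynomial is used because HE schemes such as CKKS evaluate additions and multiplications natively, whereas the exponential in the exact sigmoid is not directly available. Outside the fitting interval the polynomial drifts away from the sigmoid, so the inputs to the activation should stay within it.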

In [None]:
import pyhelayers

hyper_params = pyhelayers.PlainModelHyperParams()
hyper_params.fit_hyper_params.number_of_epochs = 3
hyper_params.fit_hyper_params.learning_rate = 0.1
hyper_params.number_of_features = number_of_features
hyper_params.trainable = True
hyper_params.logistic_regression_activation = pyhelayers.LRActivation.SIGMOID_POLY_3

he_run_req = pyhelayers.HeRunRequirements()
# Request a SEAL context
he_run_req.set_he_context_options([pyhelayers.HeContext.create(["SEAL_CKKS"])])
he_run_req.optimize_for_batch_size(batch_size)

client_lr = pyhelayers.LogisticRegression()
client_lr.encode_encrypt(files=[], he_run_req=he_run_req, hyper_params=hyper_params)
client_context = client_lr.get_created_he_context()

print('logistic regression training initialised')

#### 2.2. Encrypt the data in a trusted environment

The plaintext samples and labels are encrypted:

In [None]:
model_io_encoder = pyhelayers.ModelIoEncoder(client_lr)

encrypted_inputs = pyhelayers.EncryptedData(client_context)
model_io_encoder.encode_encrypt(encrypted_inputs, [plain_samples, labels])
print('training data has been encrypted.')

#### 2.3. Save and send
We save the encrypted model, the context, and the encrypted samples and labels so that they can be sent to the server.

In [None]:
lr_buffer = client_lr.save_to_buffer()
inputs_buffer = encrypted_inputs.save_to_buffer()

# Save the context. Note that this saves all the HE library information, including the 
# public key, allowing the server to perform HE computations.
# The secret key is not saved here, so the server won't be able to decrypt.
# The secret key is never stored unless explicitly requested by the user using the designated 
# method.
context_buffer = client_context.save_to_buffer()

print('Context, model, and samples saved')

<br>

## Step 3. Perform training on a remote server using encrypted data and labels

#### 3.1. Load the model, labels, samples and context on the server

On the server side, we load the previously saved buffers to prepare the server:

In [None]:
server_context = pyhelayers.load_he_context(context_buffer)
server_lr = pyhelayers.load_he_model(server_context, lr_buffer)
server_inputs = pyhelayers.load_encrypted_data(server_context, inputs_buffer)

#### 3.2. Perform the model training in the cloud/server using encrypted data and encrypted labels

We can now train on the encrypted data to obtain encrypted trained weights and bias. This computation does not use the secret key and operates entirely on encrypted values.

**NOTE: the data, the LR model and the results always remain in an encrypted state, even during computation.**

In [None]:
with utils.elapsed_timer('training', batch_size):
    server_lr.fit(server_inputs)

#### 3.3. Send the trained model back

We can now send the trained model back. Note that the server never had access to the secret key, and no values were revealed during the server-side computation.

In [None]:
trained_model_buffer = server_lr.save_to_buffer()
print('Trained model saved.')

<br>

## Step 4. Decrypt the trained model in the trusted environment

The encrypted trained model computed by the server (stored in `trained_model_buffer`) can now be decrypted and decoded on the client:

In [None]:
# Load the encrypted trained model.
client_trained_lr = pyhelayers.load_he_model(client_context, trained_model_buffer)

# Decrypt and decode it into a plaintext model (requires the secret key).
trained_plain = client_trained_lr.decrypt_decode()

print('Trained model loaded and decrypted.')
print(trained_plain)

<br>

## Step 5. Assess the results

Let's assess the results in two ways. First, we run predictions with the decrypted trained model and calculate the precision, recall and F1 score:
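
For reference, precision, recall and F1 can be computed from the labels and the thresholded prediction scores as in the sketch below. The function `precision_recall_f1` is a hypothetical helper written for illustration; it is not the implementation of `utils.assess_results`, whose code is not shown here. The 0.5 threshold matches the one used for `predicted_labels` in this notebook.

```python
import numpy as np

def precision_recall_f1(true_labels, scores, threshold=0.5):
    """Compute precision, recall and F1 for binary labels (1 = fraud)."""
    y_true = np.asarray(true_labels).ravel().astype(int)
    y_pred = (np.asarray(scores).ravel() >= threshold).astype(int)

    tp = int(np.sum((y_pred == 1) & (y_true == 1)))  # fraud correctly flagged
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))  # legitimate flagged as fraud
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))  # fraud that was missed

    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    if precision + recall > 0:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1

# Toy example: three fraud cases and two legitimate ones.
p, r, f1 = precision_recall_f1([1, 0, 1, 1, 0], [0.9, 0.6, 0.4, 0.8, 0.1])
print(f"precision={p:.3f}, recall={r:.3f}, f1={f1:.3f}")
```

In fraud detection, recall measures how many fraud cases are caught, while precision measures how many flagged transactions are truly fraudulent; F1 is their harmonic mean.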

In [None]:
with utils.elapsed_timer('validation', batch_size) as timer:
    plain_predictions = trained_plain.predict([plain_samples])[0]
    
plain_predictions = plain_predictions.reshape(plain_predictions.shape[0], 1)
accuracy = utils.assess_results(labels, plain_predictions)
predicted_labels = [1 if i >= 0.5 else 0 for i in plain_predictions]
colors = ["b", "r"]
ax = pd.Series(predicted_labels).value_counts().plot.bar(xlabel="Fraud Cases", ylabel="Frequency", legend=True, color=colors,title="Validation After Training")
ax1 = mpatches.Patch(color='b', label='Not Fraud')
ax2 = mpatches.Patch(color='r', label='Fraud')
ax.legend(handles=[ax1,ax2])

Next, let's also train in plaintext and compare the model weights and biases. Our plaintext training code is available under the misc folder.
Note that it is an HE-friendly version of an LR training algorithm: the sigmoid activation is approximated by a polynomial. It produces somewhat lower accuracy than the standard algorithm.
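
As a rough picture of what such an HE-friendly training loop does, here is a minimal sketch of batch gradient descent in which the sigmoid is replaced by a degree 3 polynomial. It is not the code from the misc folder: the function names, the least-squares fit on `[-4, 4]`, the default iteration count and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def fit_sigmoid_poly3(half_width=4.0):
    """Least-squares degree 3 fit of the sigmoid on [-half_width, half_width]."""
    grid = np.linspace(-half_width, half_width, 801)
    return np.polyfit(grid, 1.0 / (1.0 + np.exp(-grid)), deg=3)

def train_lr_poly(samples, labels, n_iters=20, learning_rate=0.1):
    """Batch gradient descent with a polynomial in place of the sigmoid."""
    coeffs = fit_sigmoid_poly3()
    n_samples, n_features = samples.shape
    weights = np.zeros(n_features)
    bias = 0.0
    targets = labels.ravel()
    for _ in range(n_iters):
        z = samples @ weights + bias
        # Polynomial activation: only additions and multiplications.
        predictions = np.polyval(coeffs, z)
        error = predictions - targets
        weights -= learning_rate * (samples.T @ error) / n_samples
        bias -= learning_rate * error.mean()
    return weights, bias

# Synthetic, linearly separable toy data (not the credit card dataset).
rng = np.random.default_rng(0)
toy_samples = rng.normal(size=(500, 3))
toy_labels = (toy_samples @ np.array([1.5, -2.0, 0.5]) > 0).astype(float)

toy_weights, toy_bias = train_lr_poly(toy_samples, toy_labels)
toy_scores = toy_samples @ toy_weights + toy_bias
toy_accuracy = np.mean((toy_scores > 0) == (toy_labels == 1))
print(f"training accuracy on toy data: {toy_accuracy:.3f}")
```

Every operation inside the loop is an addition or a multiplication by a value or a known constant, which is what makes the same update rule expressible over encrypted data.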

In [None]:
from misc.logistic_regression_plain import LogisticRegression
new_plain = LogisticRegression(n_iters=hyper_params.fit_hyper_params.number_of_epochs)

with utils.elapsed_timer('training_baseline', batch_size) as timer:
    new_plain.fit(plain_samples, labels)

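enc_trained_weights = trained_plain.get_weights().flatten()
enc_trained_bias = trained_plain.get_bias().flatten()
plain_trained_weights = new_plain.weights.flatten()
plain_trained_bias = new_plain.bias.flatten()

# Euclidean (L2) distance between the parameters trained under encryption
# and those trained in plaintext. Note that this is a norm, not a mean squared error.
weights_diff = np.linalg.norm(enc_trained_weights - plain_trained_weights)
bias_diff = np.linalg.norm(enc_trained_bias - plain_trained_bias)

print('L2 distance between weights:', weights_diff)
print('L2 distance between biases:', bias_diff)

if weights_diff + bias_diff > 1e-3:
    raise Exception("Difference between encrypted and plaintext training is too large")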

In [None]:
print("RAM usage:", utils.get_used_ram(), "MB")

<br>

References:

<sub><sup> 1. Andrea Dal Pozzolo, Olivier Caelen, Reid A. Johnson and Gianluca Bontempi. Calibrating Probability with Undersampling for Unbalanced Classification. In Symposium on Computational Intelligence and Data Mining (CIDM), IEEE, 2015 </sup></sub>

<sub><sup> 2. Dal Pozzolo, Andrea; Caelen, Olivier; Le Borgne, Yann-Aël; Waterschoot, Serge; Bontempi, Gianluca. Learned lessons in credit card fraud detection from a practitioner perspective, Expert systems with applications, 41, 10, 4915-4928, 2014, Pergamon </sup></sub>

<sub><sup> 3. Dal Pozzolo, Andrea; Boracchi, Giacomo; Caelen, Olivier; Alippi, Cesare; Bontempi, Gianluca. Credit card fraud detection: a realistic modeling and a novel learning strategy, IEEE transactions on neural networks and learning systems, 29, 8, 3784-3797, 2018, IEEE </sup></sub>

<sub><sup> 4. Dal Pozzolo, Andrea. Adaptive Machine learning for credit card fraud detection, ULB MLG PhD thesis (supervised by G. Bontempi) </sup></sub>

<sub><sup> 5. Carcillo, Fabrizio; Dal Pozzolo, Andrea; Le Borgne, Yann-Aël; Caelen, Olivier; Mazzer, Yannis; Bontempi, Gianluca. Scarff: a scalable framework for streaming credit card fraud detection with Spark, Information fusion, 41, 182-194, 2018, Elsevier </sup></sub>

<sub><sup> 6. Carcillo, Fabrizio; Le Borgne, Yann-Aël; Caelen, Olivier; Bontempi, Gianluca. Streaming active learning strategies for real-life credit card fraud detection: assessment and visualization, International Journal of Data Science and Analytics, 5, 4, 285-300, 2018, Springer International Publishing </sup></sub>

<sub><sup> 7. Bertrand Lebichot, Yann-Aël Le Borgne, Liyun He, Frederic Oblé, Gianluca Bontempi. Deep-Learning Domain Adaptation Techniques for Credit Cards Fraud Detection, INNSBDDL 2019: Recent Advances in Big Data and Deep Learning, pp 78-88, 2019 </sup></sub>

<sub><sup> 8. Fabrizio Carcillo, Yann-Aël Le Borgne, Olivier Caelen, Frederic Oblé, Gianluca Bontempi. Combining Unsupervised and Supervised Learning in Credit Card Fraud Detection, Information Sciences, 2019 </sup></sub>

<sub><sup> 9. Yann-Aël Le Borgne, Gianluca Bontempi. Machine Learning for Credit Card Fraud Detection - Practical Handbook </sup></sub>
