# Tutorial 1: Training and Evaluation of Logistic Regression on Encrypted Data

Welcome to this first use-case tutorial, in which we show how to use TenSEAL to train and evaluate a logistic regression (LR) model on encrypted data (using homomorphic encryption) for heart disease prediction! If you haven't used TenSEAL before, I suggest going through ['Tutorial 0 - Getting Started'](./Tutorial%200%20-%20Getting%20Started.ipynb) first.


**Disclaimer:** The goal of this tutorial isn't to show how well LR performs on this task, so we will go with whatever accuracy we get. What matters is that training and evaluation on encrypted data give results comparable to those on plain data.


Authors:
- Ayoub Benaissa - Twitter: [@y0uben11](https://twitter.com/y0uben11)

## Setup

All modules are imported here. Run the cell below to make sure everything is installed.

In [1]:
import torch
import tenseal as ts
import pandas as pd
import random
from time import time

We now prepare the training and test data. The dataset was downloaded from Kaggle [here](https://www.kaggle.com/dileep070/heart-disease-prediction-using-logistic-regression). It provides patients' information along with a 10-year risk of future coronary heart disease (CHD) as a label, and the goal is to build a model that predicts this 10-year CHD risk from the patients' information. You can read more about the dataset at the link provided.

In [2]:
torch.random.manual_seed(73)
random.seed(73)


def random_data(m=1024, n=2):
    x_train = torch.randn(m, n)
    x_test = torch.randn(m, n)
    y_train = (x_train[:, 0] >= 0).float().unsqueeze(0).t()
    y_test = (x_test[:, 0] >= 0).float().unsqueeze(0).t()
    return x_train, y_train, x_test, y_test


def split_train_test(x, y, test_ratio=0.3):
    assert len(x) == len(y)
    idxs = list(range(len(x)))
    random.shuffle(idxs)
    delim = int(len(x) * test_ratio)
    test_idxs, train_idxs = idxs[:delim], idxs[delim:]
    return x[train_idxs], y[train_idxs], x[test_idxs], y[test_idxs]


def heart_disease_data():
    data = pd.read_csv("./data/framingham.csv")
    # drop rows with missing values
    data = data.dropna()
    y = torch.tensor(data["TenYearCHD"].values).float().unsqueeze(1)
    # drop label column
    data = data.drop(columns="TenYearCHD")
    # normalize
    data = (data - data.mean()) / data.std()
    x = torch.tensor(data.values).float()
    return split_train_test(x, y)


# x_train, y_train, x_test, y_test = random_data()
x_train, y_train, x_test, y_test = heart_disease_data()
print(f"x_train has shape: {x_train.shape}")
print(f"y_train has shape: {y_train.shape}")
print(f"x_test has shape: {x_test.shape}")
print(f"y_test has shape: {y_test.shape}")

x_train has shape: torch.Size([2560, 15])
y_train has shape: torch.Size([2560, 1])
x_test has shape: torch.Size([1096, 15])
y_test has shape: torch.Size([1096, 1])


## Training a Logistic Regression Model

We are going to use an LR model, which can be viewed as a single-layer neural network with a single node. Let's use PyTorch for this.

In [3]:
class LR(torch.nn.Module):

    def __init__(self, n_features):
        super(LR, self).__init__()
        self.lr = torch.nn.Linear(n_features, 1)
        
    def forward(self, x):
        out = torch.sigmoid(self.lr(x))
        return out

In [4]:
n_features = x_train.shape[1]
model = LR(n_features)
# use gradient descent with a learning_rate=1
optim = torch.optim.SGD(model.parameters(), lr=1)
# use Binary Cross Entropy Loss
criterion = torch.nn.BCELoss()

In [5]:
def train(model, optim, criterion, x, y, epochs=100):
    for e in range(1, epochs + 1):
        optim.zero_grad()
        out = model(x)
        loss = criterion(out, y)
        loss.backward()
        optim.step()
        if e % 10 == 0:
            print(f"Loss at epoch {e}: {loss.data}")
    return model

model = train(model, optim, criterion, x_train, y_train)

Loss at epoch 10: 0.39459195733070374
Loss at epoch 20: 0.3803192377090454
Loss at epoch 30: 0.37820807099342346
Loss at epoch 40: 0.3777090907096863
Loss at epoch 50: 0.3775556683540344
Loss at epoch 60: 0.3774970471858978
Loss at epoch 70: 0.37747031450271606
Loss at epoch 80: 0.37745657563209534
Loss at epoch 90: 0.3774489760398865
Loss at epoch 100: 0.3774445652961731


In [6]:
def accuracy(model, x, y):
    out = model(x)
    correct = torch.abs(y - out) < 0.5
    return correct.float().mean()

plain_accuracy = accuracy(model, x_test, y_test)
print(f"Accuracy on plain test_set: {plain_accuracy}")

Accuracy on plain test_set: 0.8540145754814148


## Encrypted Evaluation

In this part, we focus only on evaluating the logistic regression model, with plain parameters (or, optionally, encrypted parameters), on the encrypted test set. We first create a PyTorch-like LR model that can evaluate encrypted data:

In [7]:
class EncryptedLR:
    
    def __init__(self, torch_lr):
        self.weight = torch_lr.lr.weight.data.tolist()[0]
        self.bias = torch_lr.lr.bias.data.tolist()
        
    def forward(self, enc_x):
        enc_out = enc_x.dot(self.weight) + self.bias
        return enc_out
    
    def __call__(self, *args, **kwargs):
        return self.forward(*args, **kwargs)
        
    ################################################
    ## You can use the functions below to perform ##
    ## the evaluation with an encrypted model     ##
    ################################################
    
    def encrypt(self, context):
        self.weight = ts.ckks_vector(context, self.weight)
        self.bias = ts.ckks_vector(context, self.bias)
        
    def decrypt(self, context):
        self.weight = self.weight.decrypt()
        self.bias = self.bias.decrypt()
        

eelr = EncryptedLR(model)

We now create a TenSEALContext to specify the scheme and the parameters we are going to use. Here we choose small and secure parameters that allow us to perform a single multiplication. That's enough for evaluating an LR model; however, we will see that we need larger parameters when training on encrypted data.

In [8]:
ctx_eval = ts.context(ts.SCHEME_TYPE.CKKS, 4096, 0, [40, 20, 40])
# scale of ciphertext to use
ctx_eval.global_scale = 2 ** 20
# this key is needed for doing dot-product operations
ctx_eval.generate_galois_keys()
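
As a rough sanity check on the parameters above, here is a small plain-Python sketch that does not call TenSEAL at all. `describe_parameters` and `MAX_COEFF_MODULUS_BITS` are hypothetical names introduced for illustration. The table holds the upper bounds on the total coefficient-modulus bit count for 128-bit security as I understand them from Microsoft SEAL's defaults, and "depth = number of primes minus two" is a common rule of thumb for CKKS rather than a guarantee.

```python
# Assumed 128-bit security limits on the total coefficient-modulus bit count,
# keyed by polynomial modulus degree (values as used by Microsoft SEAL's defaults).
MAX_COEFF_MODULUS_BITS = {1024: 27, 2048: 54, 4096: 109, 8192: 218, 16384: 438, 32768: 881}


def describe_parameters(poly_modulus_degree, coeff_mod_bit_sizes):
    """Summarize a CKKS parameter choice (hypothetical helper, not TenSEAL API)."""
    total_bits = sum(coeff_mod_bit_sizes)
    return {
        "total_bits": total_bits,
        "within_128_bit_limit": total_bits <= MAX_COEFF_MODULUS_BITS[poly_modulus_degree],
        # rule of thumb: the first and last primes are not available for rescaling
        "multiplicative_depth": len(coeff_mod_bit_sizes) - 2,
        # CKKS packs up to poly_modulus_degree / 2 values per ciphertext
        "slots": poly_modulus_degree // 2,
    }


summary = describe_parameters(4096, [40, 20, 40])
print(summary)
# {'total_bits': 100, 'within_128_bit_limit': True, 'multiplicative_depth': 1, 'slots': 2048}
```

Under these assumptions, the single 20-bit middle prime matches `global_scale = 2 ** 20` and leaves room for exactly one multiplication, which is all the dot product in the linear layer needs.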

We will encrypt the whole test set before the evaluation:

In [9]:
t_start = time()
enc_x_test = [ts.ckks_vector(ctx_eval, x.tolist()) for x in x_test]
t_end = time()
print(f"Encryption of the test_set took {int(t_end - t_start)} seconds")

Encryption of the test_set took 4 seconds


In [10]:
# (optional) encrypt the model's parameters
# eelr.encrypt(ctx_eval)

As you may have already noticed when we built the EncryptedLR class, we don't compute the sigmoid function on the encrypted output of the linear layer. It simply isn't needed, and computing sigmoid over encrypted data would increase the computation time and require larger encryption parameters. In a client-server scenario, the client would encrypt the data and send it to the server for evaluation. The server can just send back the output of the linear layer; the client then decrypts the result and can guess the label either by transforming the output into a probability using sigmoid, or simply by checking whether it is greater or less than zero.
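
The two client-side options give the same label because sigmoid is monotonic and maps zero to 0.5. A minimal plain-Python check, using made-up values in place of the decrypted linear-layer outputs (`label_from_probability` and `label_from_sign` are hypothetical helper names):

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def label_from_probability(z):
    # option 1: turn the decrypted output into a probability, then threshold at 0.5
    return 1 if sigmoid(z) >= 0.5 else 0


def label_from_sign(z):
    # option 2: skip sigmoid and look only at the sign of the decrypted output
    return 1 if z >= 0 else 0


# illustrative decrypted outputs of the linear layer (not taken from the dataset)
logits = [-3.2, -0.4, 0.0, 0.7, 5.1]
labels = [label_from_sign(z) for z in logits]
assert labels == [label_from_probability(z) for z in logits]
print(labels)  # [0, 0, 1, 1, 1]
```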

In [11]:
def encrypted_evaluation(model, enc_x_test, y_test):
    t_start = time()
    
    correct = 0
    for enc_x, y in zip(enc_x_test, y_test):
        # encrypted evaluation
        enc_out = model(enc_x)
        # plain comparison
        out = enc_out.decrypt()
        out = torch.tensor(out)
        out = torch.sigmoid(out)
        if torch.abs(out - y) < 0.5:
            correct += 1
    
    t_end = time()
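    # count the encrypted samples passed in rather than relying on the global x_test
    n_samples = len(enc_x_test)
    print(f"Evaluated test_set of {n_samples} entries in {int(t_end - t_start)} seconds")
    print(f"Accuracy: {correct}/{n_samples} = {correct / n_samples}")
    return correct / n_samples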
    

encrypted_accuracy = encrypted_evaluation(eelr, enc_x_test, y_test)
print(f"Difference between plain and encrypted accuracies: {plain_accuracy - encrypted_accuracy}")

Evaluated test_set of 1096 entries in 10 seconds
Accuracy: 926/1096 = 0.8448905109489051
Difference between plain and encrypted accuracies: 0.009124040603637695


We saw that evaluating on the encrypted test set barely affects the accuracy (a difference of about 0.009 here). I've even seen examples where the encrypted evaluation performs better.

## Training an Encrypted Logistic Regression Model on Encrypted Data

In this part we will redefine a PyTorch-like model that can both forward encrypted data and backpropagate to update the weights, and thus train the encrypted LR model on encrypted data.

Still to be explained mathematically: the operations involved and why we need a certain depth:
- which loss are we using?
- why did we choose a learning rate of 1?
- the update formula
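
Until those points are written up, here is one possible reading of the update formula as a plain-Python sketch on unencrypted numbers. It assumes the binary cross-entropy loss used earlier in this tutorial, whose gradient with respect to the weights is `(out - y) * x` per sample, averaged over the batch. `gradient_step` and `bce_loss` are hypothetical names; the accumulators mirror `_delta_w`, `_delta_b` and `_count` from the class in the next cell.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weight, bias, x):
    return sigmoid(sum(w * xi for w, xi in zip(weight, x)) + bias)


def bce_loss(weight, bias, xs, ys):
    total = 0.0
    for x, y in zip(xs, ys):
        out = predict(weight, bias, x)
        total -= y * math.log(out) + (1 - y) * math.log(1 - out)
    return total / len(xs)


def gradient_step(weight, bias, xs, ys, learning_rate=1.0):
    """One averaged gradient-descent step for LR with BCE loss."""
    delta_w = [0.0] * len(weight)
    delta_b = 0.0
    count = 0
    for x, y in zip(xs, ys):
        out_minus_y = predict(weight, bias, x) - y
        delta_w = [dw + xi * out_minus_y for dw, xi in zip(delta_w, x)]
        delta_b += out_minus_y
        count += 1
    # descent: subtract the averaged gradient from the parameters
    new_weight = [w - learning_rate * dw / count for w, dw in zip(weight, delta_w)]
    new_bias = bias - learning_rate * delta_b / count
    return new_weight, new_bias


# toy data, made up for illustration
xs = [[1.0], [2.0], [-1.0], [-2.0]]
ys = [1.0, 1.0, 0.0, 0.0]
loss_before = bce_loss([0.0], 0.0, xs, ys)
new_weight, new_bias = gradient_step([0.0], 0.0, xs, ys)
loss_after = bce_loss(new_weight, new_bias, xs, ys)
print(new_weight, new_bias, loss_before > loss_after)
```

With `learning_rate=1.0` the multiplication by the learning rate is a no-op. That may be part of the motivation for the choice, since every multiplication on encrypted data consumes depth, but this is my assumption rather than something the outline states.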

In [12]:
class EncryptedLR:
    
    def __init__(self, torch_lr):
        self.weight = torch_lr.lr.weight.data.tolist()[0]
        self.bias = torch_lr.lr.bias.data.tolist()
        self._delta_w = 0
        self._delta_b = 0
        self._count = 0
        
    def forward(self, enc_x):
        enc_out = enc_x.dot(self.weight) + self.bias
        enc_out = EncryptedLR.sigmoid(enc_out)
        return enc_out
    
    def backward(self, enc_x, enc_out, enc_y):
        # TODO: need one-sized vector to be multiplied with n-sized vector
        out_minus_y = (enc_out - enc_y)
        self._delta_w += enc_x * out_minus_y
        self._delta_b += out_minus_y
        self._count += 1
        
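    def update_parameters(self):
        # TODO: need either division or scalar multiplication
        # gradient descent with learning rate 1: subtract the averaged gradient
        self.weight -= self._delta_w * (1 / self._count)
        self.bias -= self._delta_b * (1 / self._count)
        # reset the accumulators for the next batch
        self._delta_w = 0
        self._delta_b = 0
        self._count = 0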
    
    @staticmethod
    def sigmoid(enc_x):
        # TODO: need scalar operations
        # https://eprint.iacr.org/2018/462.pdf
        # look into https://eprint.iacr.org/2018/254.pdf
        enc_x_3 = enc_x * enc_x * enc_x
        return 0.5 + 0.197 * enc_x + -0.004 * enc_x_3
    
    
    def encrypt(self, context):
        self.weight = ts.ckks_vector(context, self.weight)
        self.bias = ts.ckks_vector(context, self.bias)
        
    def decrypt(self, context):
        self.weight = self.weight.decrypt()
        self.bias = self.bias.decrypt()
        
    def __call__(self, *args, **kwargs):
        return self.forward(*args, **kwargs)
        

eelr = EncryptedLR(model)
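
To get a feel for the degree-3 polynomial used in `EncryptedLR.sigmoid`, the sketch below compares it with the exact sigmoid on plain numbers. The coefficients are the ones from the class above; the interval [-5, 5] is my assumption about where the approximation is meant to be used, and `sigmoid_poly` is a hypothetical name.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_poly(z):
    # same coefficients as EncryptedLR.sigmoid: 0.5 + 0.197 * x - 0.004 * x^3
    return 0.5 + 0.197 * z - 0.004 * z ** 3


# grid over the assumed working interval [-5, 5] in steps of 0.5
points = [-5 + 0.5 * i for i in range(21)]
max_error = max(abs(sigmoid(z) - sigmoid_poly(z)) for z in points)
error_outside = abs(sigmoid(8.0) - sigmoid_poly(8.0))

print(f"max error on [-5, 5]: {max_error:.3f}")
print(f"error at z = 8: {error_outside:.3f}")
```

On this grid the error stays below 0.06, but it grows quickly outside the interval, so the inputs to the approximation (the outputs of the linear layer) need to stay in range during training. Note also that computing `enc_x * enc_x * enc_x` chains two ciphertext multiplications, which is one reason training needs larger parameters than the evaluation context used earlier.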

# Congratulations!!! - Time to Join the Community!

Congratulations on completing this notebook tutorial! If you enjoyed this and would like to join the movement toward privacy-preserving, decentralized ownership of AI and the AI supply chain (data), you can do so in the following ways!

### Star TenSEAL on GitHub

The easiest way to help our community is just by starring the repos! This helps raise awareness of the cool tools we're building.

- [Star TenSEAL](https://github.com/OpenMined/TenSEAL)

### Join our Slack!

The best way to keep up to date on the latest advancements is to join our community! You can do so by filling out the form at [http://slack.openmined.org](http://slack.openmined.org). #lib_tenseal and #code_tenseal are the main channels for the TenSEAL project.

### Join our Team!

Coming soon

### Donate

If you don't have time to contribute to our codebase, but would still like to lend support, you can also become a Backer on our Open Collective. All donations go toward our web hosting and other community expenses such as hackathons and meetups!

[OpenMined's Open Collective Page](https://opencollective.com/openmined)
