# Model.ipynb 
A proof of use for Homomorphic Encryption. 
The code was developed over several weeks of trial and error across different libraries and implementations. I began with sklearn and moved to torch. For FHE I settled on TenSEAL, after earlier iterations with Concrete ML by Zama AI and the Paillier cryptosystem.
Inspired by a model built with TenSEAL.

# Problems
These problems are specific to the TenSEAL implementation, which, together with the CKKS scheme, is what I limited this capstone to. First, the encrypted data is stored as a TenSEAL CKKS vector (ts.ckks_vector). Current releases of most ML libraries have no models that accept data in this encrypted vector form. As a result, some sections had to be written by hand instead of using the libraries' predefined functions.

In [1]:
import torch
import tenseal as ts
import pandas as pd
import psutil
import os
from time import time
import sklearn
from sklearn.model_selection import train_test_split

# System resources
* As this is done in a Jupyter notebook, I'm going to monitor the performance and resource usage of the model.
* This tracking is done using psutil, dtype for RAM, and time.
* Laptop is an HP Victus - Ryzen 5 8645HS, 16 GB 5600 DDR5, 4050

In [2]:
# prints the memory usage of the current process
def print_memory_usage():
    process = psutil.Process(os.getpid())
    print(f"Memory usage: {process.memory_info().rss / 1024 ** 2:.2f} MB")
print("#### Memory usage ####")
print_memory_usage()
print("######################")

#### Memory usage ####
Memory usage: 292.91 MB
######################


# Dataset and Preprocessing
* Uses payment_fraud.csv, handed out during a lab for NCS 490, Intro to AI Security. It holds around 39,000 transactions.
* The paymentMethod column is dropped because it is a list of vendors stored as strings, and the classes are balanced by downsampling to the size of the smaller class.
* Train and test data are created with sklearn's train_test_split function.

In [3]:

# Load the data
# The data is a payment fraud dataset, where the goal is to predict whether a transaction is fraudulent or not
# The dataset is highly imbalanced: balancing by downsampling leaves 1,120 of the roughly 39,000 rows (560 per class)
def Credit_data():
    data = pd.read_csv("payment_fraud.csv")
    # drop some features
    data = data.drop(columns=["paymentMethod"])
    # balance data
    grouped = data.groupby('label')
    data = grouped.apply(lambda x: x.sample(grouped.size().min(), random_state=13).reset_index(drop=True))
    # extract labels
    y = torch.tensor(data["label"].values).float().unsqueeze(1)
    data = data.drop(columns="label")
    # standardize data
    data = (data - data.mean()) / data.std()
    x = torch.tensor(data.values).float()
    return split_train_test(x, y)

# split the data into training and testing sets
def split_train_test(x, y):
    # train_test_split shuffles the rows by default before splitting
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
    return x_train, x_test, y_train, y_test

x_train, x_test, y_train, y_test = Credit_data()

print("############# Data summary #############")
print(f"x_train has shape: {x_train.shape}")
print(f"y_train has shape: {y_train.shape}")
print(f"x_test has shape: {x_test.shape}")
print(f"y_test has shape: {y_test.shape}")
print_memory_usage()
print("#######################################")


############# Data summary #############
x_train has shape: torch.Size([896, 4])
y_train has shape: torch.Size([896, 1])
x_test has shape: torch.Size([224, 4])
y_test has shape: torch.Size([224, 1])
Memory usage: 302.71 MB
#######################################
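
The balancing step above samples every class down to the size of the smallest one. A minimal standard-library sketch of that idea follows. It runs on made-up toy rows rather than payment_fraud.csv, and the helper name balance_by_downsampling and the toy values are hypothetical, not part of the notebook:

```python
import random
from collections import defaultdict


def balance_by_downsampling(rows, label_key="label", seed=13):
    """Return a class-balanced list by sampling each class down to the smallest class size."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[label_key]].append(row)
    smallest = min(len(members) for members in groups.values())
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    balanced = []
    for label in sorted(groups):
        balanced.extend(rng.sample(groups[label], smallest))
    return balanced


# Toy data: 2 "fraud" rows (label 1) among 10 rows in total.
toy_rows = [{"amount": float(i), "label": 1 if i < 2 else 0} for i in range(10)]
balanced_rows = balance_by_downsampling(toy_rows)
label_counts = {label: sum(r["label"] == label for r in balanced_rows) for label in (0, 1)}
print(label_counts)
```

The trade-off is the same as in the notebook: the result is balanced, but most rows of the majority class are thrown away.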




# Non-Encrypted Model
* Standard torch logistic regression model.
* print_memory_usage keeps track of the model's resource usage.

In [4]:
# Defining the Logistic Regression torch NN model.
class NE_LR(torch.nn.Module):
    # n_features is the number of features in the input data    
    def __init__(self, n_features):
        super(NE_LR, self).__init__()
        # the linear layer is the logistic regression model
        # it takes n_features inputs and outputs 1 value
        self.lr = torch.nn.Linear(n_features, 1)
    
    # pass data through the model and apply sigmoid activation
    def forward(self, x):
        output = torch.sigmoid(self.lr(x))
        return output

# Define the model, optimizer, and loss function
# Unencrypted training
n_features = x_train.shape[1]
model = NE_LR(n_features)
# use gradient descent with a learning_rate=1
optim = torch.optim.SGD(model.parameters(), lr=1)
# use Binary Cross Entropy Loss
# BCELoss is the loss function used for binary classification
criterion = torch.nn.BCELoss()

# train the model for 5 epochs
EPOCHS = 5
# creating timing list to store the time taken for each epoch
times = []
def train(model, optim, criterion, x, y, epochs=EPOCHS):
    for e in range(1, epochs + 1):
        start = time()
        # set the gradients to zero
        optim.zero_grad()
        # pass the data through the model
        output = model(x)
        # calculate the loss
        loss = criterion(output, y)
        loss.backward()
        # update the weights
        optim.step()
        end = time()
        # loss is printed at each epoch
        print(f"Loss at epoch {e}: {loss.item():.4f}")
        times.append(end - start)
    return model

# Train the model
model = train(model, optim, criterion, x_train, y_train)
# Calculate the accuracy of the model: a prediction counts as correct when it is within 0.5 of the label
def accuracy(model, x, y):
    out = model(x)
    correct = torch.abs(y - out) < 0.5
    return correct.float().mean()

print("\n############# Non-Encrypted Training #############")
print(f"Average time per epoch: {int(sum(times) / len(times))} seconds")
NE_accuracy = accuracy(model, x_test, y_test)
print(f"Non-Encrypted Accuracy: {NE_accuracy:.4f}")
print_memory_usage()
print("##################################################")



Loss at epoch 1: 1.0242
Loss at epoch 2: 0.7431
Loss at epoch 3: 0.6005
Loss at epoch 4: 0.5368
Loss at epoch 5: 0.5026

############# Non-Encrypted Training #############
Average time per epoch: 0 seconds
Non-Encrypted Accuracy: 0.7411
Memory usage: 345.75 MB
##################################################
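
The "0 seconds" in the output above comes from int() truncating a sub-second average, not from the epochs taking no time. The sketch below reports the same kind of average with decimals. It assumes a stand-in workload instead of the real training step, and the helper name average_seconds is illustrative only:

```python
from time import perf_counter


def average_seconds(durations):
    """Mean of a list of durations in seconds."""
    return sum(durations) / len(durations)


durations = []
for _ in range(5):
    start = perf_counter()
    sum(i * i for i in range(10_000))  # stand-in for one training epoch (hypothetical workload)
    durations.append(perf_counter() - start)

# int() drops everything after the decimal point, so a fast epoch reports as 0.
truncated = int(average_seconds(durations))
# Formatting with decimals keeps the sub-second information.
report = f"Average time per epoch: {average_seconds(durations):.4f} seconds"
print(truncated)
print(report)
```

perf_counter is used here because it is meant for measuring short intervals. The notebook's time() works the same way for the longer encrypted epochs.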


# Defining Encrypted Network
* Requires defining by hand pieces that are normally standard, such as the sigmoid, the forward pass, and the backward pass.
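
To make the hand-written pieces easier to follow, here is a plain-Python sketch of the arithmetic that the EncryptedLR class in the next cell performs in one epoch, without any encryption. It assumes the same degree-3 sigmoid coefficients and the same 0.05 regularization term as the notebook. The function names and the toy data are hypothetical:

```python
def poly_sigmoid(z):
    # Same degree-3 coefficients as EncryptedLR.sigmoid: 0.5 + 0.197*z - 0.004*z^3
    return 0.5 + 0.197 * z - 0.004 * z ** 3


def plain_epoch(weight, bias, xs, ys, decay=0.05):
    """One epoch of forward, backward, and update on plain Python lists."""
    delta_w = [0.0] * len(weight)
    delta_b = 0.0
    for x, y in zip(xs, ys):
        # forward pass: linear combination plus bias, then the polynomial sigmoid
        z = sum(w_i * x_i for w_i, x_i in zip(weight, x)) + bias
        out = poly_sigmoid(z)
        # backward pass: accumulate x * (out - y) for the weights and (out - y) for the bias
        err = out - y
        delta_w = [d + x_i * err for d, x_i in zip(delta_w, x)]
        delta_b += err
    n = len(xs)
    # update: averaged gradient (learning rate 1) plus the small weight decay term
    new_weight = [w - (d / n + decay * w) for w, d in zip(weight, delta_w)]
    new_bias = bias - delta_b / n
    return new_weight, new_bias


# Toy data: two samples with two features each.
toy_x = [[1.0, 0.0], [-1.0, 0.0]]
toy_y = [1.0, 0.0]
new_w, new_b = plain_epoch([0.0, 0.0], 0.0, toy_x, toy_y)
print(new_w, new_b)
```

In the encrypted version every one of these additions and multiplications is carried out on CKKS vectors instead of floats, which is why each step has to be spelled out.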


In [5]:

class EncryptedLR:
    # Encrypted Logistic Regression model    
    def __init__(self, torch_lr):
        # extract the weights and bias from the torch model
        self.weight = torch_lr.lr.weight.data.tolist()[0]
        # extract the bias from the torch model
        self.bias = torch_lr.lr.bias.data.tolist()
        #initialize the gradient accumulators and iterations count
        self._delta_w = 0
        self._delta_b = 0
        self._count = 0
    
    #Forward pass
    def forward(self, enc_x):
        enc_out = enc_x.dot(self.weight) + self.bias
        #Calculates linear combination of input and weight, adds bias
        enc_out = EncryptedLR.sigmoid(enc_out)
        #Applies sigmoid function
        return enc_out
    
    #Backward pass
    #Calculates the gradient of the loss w.r.t the weights and bias
    def backward(self, enc_x, enc_out, enc_y):
        out_minus_y = (enc_out - enc_y)
        #Calculates the difference between the predicted value and the true value
        self._delta_w += enc_x * out_minus_y
        #Calculates the gradient of the loss w.r.t the weights
        self._delta_b += out_minus_y
        #Calculates the gradient of the loss w.r.t the bias
        self._count += 1
        #Increment the iteration count
        
    #Update the weights and bias
    def update_parameters(self):
        # update weights
        # Small regularization term to keep the output of the linear layer in the range of the sigmoid
        self.weight -= self._delta_w * (1 / self._count) + self.weight * 0.05
        self.bias -= self._delta_b * (1 / self._count)
        # reset gradient accumulators and iterations count
        self._delta_w = 0
        self._delta_b = 0
        self._count = 0
        
    #Sigmoid function
    @staticmethod
    def sigmoid(enc_x):
        # this is a degree 3 polynomial approximation of the sigmoid function
        return enc_x.polyval([0.5, 0.197, 0, -0.004])
    
    def plain_accuracy(self, x_test, y_test):
        # Calculates the accuracy of the model on non-encrypted data
        # convert the weights and bias to torch tensors
        w = torch.tensor(self.weight)
        b = torch.tensor(self.bias)
        # pass the data through the linear layer
        out = torch.sigmoid(x_test.matmul(w) + b).reshape(-1, 1)
        # calculate the accuracy
        correct = torch.abs(y_test - out) < 0.5
        return correct.float().mean()    
    
    def encrypt(self, context):
        # Encrypts the weights and bias
        self.weight = ts.ckks_vector(context, self.weight)
        self.bias = ts.ckks_vector(context, self.bias)

    def decrypt(self):
        # Decrypts the weights and bias
        self.weight = self.weight.decrypt()
        self.bias = self.bias.decrypt()
        
    def __call__(self, *args, **kwargs):
        return self.forward(*args, **kwargs)
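
The polynomial in EncryptedLR.sigmoid only tracks the real sigmoid over a limited input range, which is why the update step includes a regularization term to keep the linear output small. The sketch below compares the two functions numerically. The interval [-5, 5] and the probe point 8.0 are my own choices for illustration, not values taken from the notebook:

```python
import math


def sigmoid(z):
    """Exact logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-z))


def poly_sigmoid(z):
    """Degree-3 approximation with the coefficients used in EncryptedLR.sigmoid."""
    return 0.5 + 0.197 * z - 0.004 * z ** 3


# Sample the interval [-5, 5] in steps of 0.5.
points = [-5 + 0.5 * i for i in range(21)]
max_err_inside = max(abs(sigmoid(z) - poly_sigmoid(z)) for z in points)

# Outside that interval the cubic term dominates and the approximation breaks down.
err_outside = abs(sigmoid(8.0) - poly_sigmoid(8.0))

print(f"max error on [-5, 5]: {max_err_inside:.4f}")
print(f"error at z = 8:       {err_outside:.4f}")
```

On the sampled points the error stays at about 0.05 or below, while at z = 8 the polynomial returns a value close to 0 where the true sigmoid is close to 1.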
    

# Performing the Encryption
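* CKKS is used because the model works on real numbers: the weights, the standardized features, and the polynomial that stands in for the sigmoid output function all need approximate real-valued arithmetic.
* It supports addition, multiplication, and rotation on encrypted vectors, and keeps the scale (and with it the noise) under control through rescaling.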

In [6]:
# parameters
# the degree of the polynomial modulus
poly_mod_degree = 8192
# the bit sizes of the primes in the coefficient modulus chain
coeff_mod_bit_sizes = [40, 21, 21, 21, 21, 21, 21, 40]
# create TenSEALContext
enc_training = ts.context(ts.SCHEME_TYPE.CKKS, poly_mod_degree, -1, coeff_mod_bit_sizes)
# set the scale used to encode values
enc_training.global_scale = 2 ** 21
# generate Galois keys, which the encrypted dot product needs for rotations
enc_training.generate_galois_keys()

t_start = time()
enc_x_train = [ts.ckks_vector(enc_training, x.tolist()) for x in x_train]
enc_y_train = [ts.ckks_vector(enc_training, y.tolist()) for y in y_train]
t_end = time()
print("############# Encryption #############")
print(f"Encryption of the training_set took {int(t_end - t_start)} seconds")
print_memory_usage()
print("######################################")


############# Encryption #############
Encryption of the training_set took 12 seconds
Memory usage: 2286.07 MB
######################################
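
A rough sanity check of the parameters above, done with plain arithmetic rather than TenSEAL calls. The 218-bit limit for a polynomial modulus degree of 8192 at 128-bit security is taken from the table that SEAL uses, and reading each inner 21-bit prime as one rescaling level is the usual CKKS interpretation. Treat both as assumptions of this sketch:

```python
poly_mod_degree = 8192
coeff_mod_bit_sizes = [40, 21, 21, 21, 21, 21, 21, 40]

# Assumed upper bounds on the total coefficient modulus size (in bits)
# for 128-bit security, keyed by polynomial modulus degree.
MAX_COEFF_BITS_128 = {4096: 109, 8192: 218, 16384: 438}

total_bits = sum(coeff_mod_bit_sizes)
within_limit = total_bits <= MAX_COEFF_BITS_128[poly_mod_degree]

# The primes between the first and last entries are consumed one per rescale,
# so their count bounds how many multiplications can be chained.
rescale_levels = len(coeff_mod_bit_sizes[1:-1])

# A CKKS ciphertext packs poly_mod_degree / 2 values.
slots = poly_mod_degree // 2

print(f"total modulus bits: {total_bits} (within limit: {within_limit})")
print(f"rescale levels:     {rescale_levels}")
print(f"slots per vector:   {slots}")
```

With 4096 slots per ciphertext and only 4 features per row, each encrypted row fills a small fraction of the space it occupies, which is consistent with the memory jump reported above.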


# Size of Datasets
* Checking size differential of the encrypted and non-encrypted data set.

In [7]:
print("############# Data summary #############")
def print_data_sizes(x_train, enc_x_train, y_train, enc_y_train):
    print(f"Size of x_train: {x_train.numpy().nbytes} bytes")
    print(f"Size of enc_x_train: {sum([len(x.serialize()) for x in enc_x_train])} bytes")
#    print(f"Size of enc_x_train: {sum([len(x) for x in enc_x_train])} bytes")
    print(f"Size of y_train: {y_train.numpy().nbytes} bytes")
    print(f"Size of enc_y_train: {sum([len(y.serialize()) for y in enc_y_train])} bytes")
#    print(f"Size of enc_y_train: {sum([len(y) for y in enc_y_train])} bytes")
print_data_sizes(x_train, enc_x_train, y_train, enc_y_train)
print_memory_usage()
print("#######################################")

############# Data summary #############
Size of x_train: 14336 bytes
Size of enc_x_train: 392529200 bytes
Size of y_train: 3584 bytes
Size of enc_y_train: 392569378 bytes
Memory usage: 2288.16 MB
#######################################
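
The figures printed above can be turned into an expansion factor and an average size per encrypted vector with a few lines of arithmetic. The byte counts and the row count of 896 are copied from the outputs in this notebook. The variable names are my own:

```python
# Values copied from the "Data summary" outputs above.
x_plain_bytes = 14336
x_enc_bytes = 392529200
y_plain_bytes = 3584
y_enc_bytes = 392569378
n_rows = 896

# How many times larger the serialized ciphertexts are than the plain tensors.
expansion_x = x_enc_bytes / x_plain_bytes
expansion_y = y_enc_bytes / y_plain_bytes

# Average serialized size of one encrypted row and one encrypted label.
bytes_per_enc_x = x_enc_bytes / n_rows
bytes_per_enc_y = y_enc_bytes / n_rows

print(f"x_train expansion: {expansion_x:,.0f}x")
print(f"y_train expansion: {expansion_y:,.0f}x")
print(f"bytes per encrypted row:   {bytes_per_enc_x:,.0f}")
print(f"bytes per encrypted label: {bytes_per_enc_y:,.0f}")
```

An encrypted label takes about as many bytes as an encrypted 4-feature row, so the serialized size is driven by the ciphertext itself rather than by how many values are stored in it.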


# Running Encrypted Model


In [8]:
# create the encrypted model
ELR = EncryptedLR(NE_LR(n_features))
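# stored under a new name so the accuracy() function defined earlier is not overwritten
initial_accuracy = ELR.plain_accuracy(x_test, y_test)
print(f"Accuracy at epoch #0 is {initial_accuracy}")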
# train the encrypted model
times = []
for epoch in range(EPOCHS):
    ELR.encrypt(enc_training)
    
    t_start = time()
    for enc_x, enc_y in zip(enc_x_train, enc_y_train):
        # forward pass
        enc_out = ELR.forward(enc_x)
        # backward pass
        ELR.backward(enc_x, enc_out, enc_y)
    ELR.update_parameters()
    t_end = time()
    times.append(t_end - t_start)
    # decrypt the model and calculate the accuracy
    ELR.decrypt()
    EN_accuracy = ELR.plain_accuracy(x_test, y_test)
    print(f"Accuracy at epoch #{epoch + 1} is {EN_accuracy:.4f}")
    # prints memory usage after the 4th epoch (index 3), while training is still in progress
    if epoch == 3:
        print_memory_usage()
    #print(f"Loss at epoch #{epoch + 1} is {(1 - EN_accuracy):.4f}")

print("############# Encrypted Training #############")
print(f"\nAverage time per epoch: {int(sum(times) / len(times))} seconds")
print(f"Accuracy {EN_accuracy:.4f}")
diff_accuracy = NE_accuracy - EN_accuracy
print(f"Difference between plain and encrypted accuracies: {diff_accuracy:.4f}")
print_memory_usage()
print("################################################")


Accuracy at epoch #0 is 0.5669642686843872
Accuracy at epoch #1 is 0.7723
Accuracy at epoch #2 is 0.7634
Accuracy at epoch #3 is 0.7455
Accuracy at epoch #4 is 0.7321
Memory usage: 2304.91 MB
Accuracy at epoch #5 is 0.7366
############# Encrypted Training #############

Average time per epoch: 48 seconds
Accuracy 0.7366
Difference between plain and encrypted accuracies: 0.0045
Memory usage: 2304.91 MB
################################################


# Results
* Non-encrypted Accuracy is 0.7411
* Encrypted Accuracy is 0.7366
* In this run the non-encrypted model was slightly more accurate than the encrypted version, by 0.0045.

# Size
* Extreme size difference: 14,336 bytes versus 392,529,200 bytes for x_train. The encrypted size was measured by serializing each vector, because nbytes does not work with CKKS vectors.

# Time
* The non-encrypted model took less than a second per epoch, with the cell needing 0.1 seconds to compute.
* Data encryption took 13.8 seconds to complete.
* The encrypted model took an average of just under 1 minute per epoch (48 seconds in the logged output), and 4 minutes and 4.6 seconds to complete.

# Resources
* The non-encrypted model used 345.75 MB of memory.
* Encryption used about 2.2 gigabytes of memory to complete.
* Training the encrypted model also used about 2.3 gigabytes of memory.

# Remarks
* About 27,000 times more bytes needed: from 14 kilobytes to 392 megabytes.
* About 2,446 times longer: from 0.1 seconds to 4 minutes and 4.6 seconds.
* While I plan to continue studying in this field, it is good to note that, as of now, there is a better version being worked on and developed. SEAL has been unofficially killed by Microsoft, as there is no longer a team working on the library, which also means TenSEAL is defunct. This model was inspired by one using TenSEAL.