# Homomorphic Encryption (HE) Laboratory Report

First things first: we need to install two libraries and download the repository with the dataset.
- TenSEAL is a library for doing homomorphic encryption operations on tensors, built on top of Microsoft SEAL
- Python Speech Features is needed to extract features from audio

In [None]:
!pip install tenseal
!pip install python_speech_features
!git clone https://github.com/NescobarAlopLop/homomorhpic_lab.git

Collecting tenseal
  Downloading https://files.pythonhosted.org/packages/35/20/a4106c3eff920eccbe040276ed869193fadd8fbbc52307dd6922a453f085/tenseal-0.3.0-cp36-cp36m-manylinux2014_x86_64.whl (4.4MB)
Installing collected packages: tenseal
Successfully installed tenseal-0.3.0
Collecting python_speech_features
  Downloading https://files.pythonhosted.org/packages/ff/d1/94c59e20a2631985fbd2124c45177abaa9e0a4eee8ba8a305aa26fc02a8e/python_speech_features-0.6.tar.gz
Building wheels for collected packages: python-speech-features
  Building wheel for python-speech-features (setup.py) ... done
  Created wheel for python-speech-features: filename=python_speech_features-0.6-cp36-none-any.whl size=5890 sha256=95619cb0ddd080c67ea07f04f38febcb278fb1d088497729ee0100e36478ef5b
  Stored in directory: /root/.cache/pip/wheels/3c/42/7c/f60e9d1b40015cd69b213ad90f7c18a9264cd745b9888134be
Successfully built python-speech-features

Nothing special, just import the packages.

In [None]:
import codecs
import copy
import json
import os
import sys

import numpy as np
import pandas as pd
import python_speech_features as psf
import scipy.io.wavfile as sw
import tenseal as ts
from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

Read the dataset and extract features.

In [None]:
audio_files_directory = '/content/homomorhpic_lab/model_training/training_data'


final_dataset = pd.DataFrame()

number_of_filters = 26
for file_name in os.listdir(audio_files_directory):
    if not os.path.isfile(os.path.join(audio_files_directory, file_name)):
        continue

    rate, signal = sw.read(os.path.join(audio_files_directory, file_name))
    features = psf.base.mfcc(signal=signal, samplerate=rate, preemph=1.1, nfilt=number_of_filters, numcep=17)
    features = psf.base.fbank(
        signal=features,
        samplerate=rate,
    )[1]
    features = psf.base.logfbank(features)
    features_df = pd.DataFrame(features)

    if 'dog' in file_name:
        features_df['label'] = '-1'
    elif 'cat' in file_name:
        features_df['label'] = '1'
    else:
        raise ValueError(f'Unsupported animal class {file_name}')

    final_dataset = pd.concat([final_dataset, features_df], ignore_index=True)


print(f'Dataset shape: {final_dataset.shape}')

# Finalize dataset with the attributes and target
X = final_dataset.iloc[:, 0:-1]
y = final_dataset.iloc[:, -1]

Dataset shape: (270, 27)


To improve training and ease the HE computation, I scale the data and save the scaling parameters (mean and standard deviation) so that queries can be scaled the same way at test time.

In [None]:
# Splitting into test and train
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.24, random_state=1)
train_mean = np.array(X_train.mean())
tran_standard_deviation = np.array(X_train.std())

# Feature Scaling
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_train = pd.DataFrame(X_train)

# Reuse the scaler fitted on the training set; refitting on the test set would leak its statistics
X_test = sc.transform(X_test)
X_test = pd.DataFrame(X_test)
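
One detail worth checking: the saved scaling parameters come from pandas, whose `.std()` uses the sample standard deviation (ddof=1) by default, while `StandardScaler` divides by the population standard deviation (ddof=0). The sketch below uses only the standard library and a made-up column of values (an assumption, not data from this lab) to show that the two scalings differ by a constant factor of sqrt(n / (n - 1)). For a couple of hundred training rows that factor is very close to 1, but it is not exactly 1.

```python
import math
import statistics

# Hypothetical one-column feature sample (not taken from the dataset)
column = [0.5, 1.5, 2.0, 4.0, 7.0]
n = len(column)

mean = statistics.mean(column)
population_std = statistics.pstdev(column)  # ddof=0, what StandardScaler divides by
sample_std = statistics.stdev(column)       # ddof=1, the pandas .std() default

value = column[-1]
scaled_like_scaler = (value - mean) / population_std
scaled_like_pandas = (value - mean) / sample_std

# The two results differ by the constant factor sqrt(n / (n - 1))
ratio = scaled_like_scaler / scaled_like_pandas
expected_ratio = math.sqrt(n / (n - 1))
print(ratio, expected_ratio)
```

If an exact match with the scaler is wanted, one option is to compute the saved standard deviation with ddof=0.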

Create and train SVM model

In [None]:
model = svm.SVC(
  kernel='poly',
  C=20,
  gamma=10,
)
model.fit(X_train, y_train)

SVC(C=20, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma=10, kernel='poly',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)

As every data scientist and good student knows, we only test the quality of a trained model on a part of the data it has never seen.
So here we go:

In [None]:
accuracy_score = model.score(X_test, y_test)
print(f'accuracy_score: {accuracy_score}')

accuracy_score: 0.7538461538461538


Two predictions, just to give a small taste of the amazing results to come:

In [None]:
print(f'this should be {y_train.iloc[0]} a.k.a. dog: {model.predict(np.array(X_train.iloc[0,:]).reshape((1,26)))}')
print(f'this should be {y_train.iloc[1]} a.k.a. cat: {model.predict(np.array(X_train.iloc[1,:]).reshape((1,26)))}')

this should be -1 a.k.a. dog: ['-1']
this should be 1 a.k.a. cat: ['1']


# Homomorphic Encryption example
We are finally here; this is officially the fun part!

We have a trained SVM model, and we have some data, which in our case are recordings of cats meowing and dogs barking.
Now let's assume that for some reason we are unable to tell the two apart, but we also do not want whoever owns the server, or the "cloud", to know which animals we have.

So how does one do it? How can we run inference on a remote server without disclosing our data?

**Homomorphic encryption to the rescue!**

## This code is run on the "imaginary" client side:

First, we generate the SEAL context.

In [None]:
context = ts.context(
    ts.SCHEME_TYPE.CKKS,
    poly_modulus_degree=8192 * 2,
    coeff_mod_bit_sizes=[60, 40, 40, 40, 40, 40, 60]
)
context.generate_galois_keys()
context.global_scale = 2**40
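
A quick sanity check of these parameters, as a rough sketch. The table of maximum total modulus sizes is what I understand to be Microsoft SEAL's default bound for 128-bit security, and "one multiplication level per inner prime" is a rule of thumb rather than an exact statement about TenSEAL internals. The variable names are my own.

```python
# Settings copied from the context cell
poly_modulus_degree = 8192 * 2
coeff_mod_bit_sizes = [60, 40, 40, 40, 40, 40, 60]

# Assumed upper bounds on the total coefficient modulus size (in bits)
# for 128-bit security, per Microsoft SEAL's default parameter table
MAX_TOTAL_BITS_128 = {
    1024: 27, 2048: 54, 4096: 109, 8192: 218, 16384: 438, 32768: 881,
}

total_bits = sum(coeff_mod_bit_sizes)
within_budget = total_bits <= MAX_TOTAL_BITS_128[poly_modulus_degree]

# Rule of thumb: each inner prime allows one rescaling, i.e. one multiplication level
multiplication_levels = len(coeff_mod_bit_sizes) - 2

# CKKS packs poly_modulus_degree / 2 values into one ciphertext,
# so a 26-feature query fits easily into a single vector
slots = poly_modulus_degree // 2

print(total_bits, within_budget, multiplication_levels, slots)
```

The inner primes have 40 bits each, which matches the global scale of 2**40, so every rescaling after a multiplication brings the scale back to roughly its original size.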

Now let's load a test query:

In [None]:
rate, signal = sw.read(os.path.join(audio_files_directory, 'cat_21.wav'))
features = psf.base.mfcc(signal=signal, samplerate=rate, preemph=1.1, nfilt=number_of_filters, numcep=17)
features = psf.base.fbank(features)[1]
features = psf.base.logfbank(features)
query = np.array(features)[0]

Since we've trained our model on scaled data, we need to scale the queries as well:

In [None]:
scaled_query = np.array((query - train_mean) / tran_standard_deviation).reshape((1, 26))

Now we can encrypt the query:

In [None]:
enc_query = ts.ckks_vector(context, scaled_query.tolist()[0])

Here we imagine that we have sent the encrypted query to a remote server. The next cells are run on the "imaginary" server:

## This code is run on the "imaginary" server
To make things easier for the reader, let's extract the required learned parameters from the trained model:

In [None]:
bias = model.intercept_[0]
degree = model.degree
support_vectors = model.support_vectors_
gamma = model.gamma
dual_coefficients = model.dual_coef_

And now the inference part:

In [None]:
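# Inner products of the encrypted query with every support vector, scaled by gamma
kernel = enc_query.matmul(support_vectors.T.tolist()) * gamma
# Cube the result (degree=3, coef0=0) using only ciphertext multiplications
poly_kernel = kernel.square() * kernel

# Weighted sum of the kernel values plus the intercept gives the encrypted decision value
prediction_enc = poly_kernel.dot(dual_coefficients[0].tolist()) + bias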
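
To see why this works with nothing but additions and multiplications, here is a plain-Python mirror of the same three steps on a made-up toy model. The support vectors, coefficients, gamma, bias and query below are hypothetical values chosen for illustration, not the trained model. The last lines compare the step-by-step result with the usual polynomial-kernel decision function for coef0 = 0.

```python
import math

# Hypothetical toy model: two support vectors with three features each
support_vectors = [[0.5, -1.0, 2.0], [1.5, 0.0, 0.5]]
dual_coefficients = [0.8, -0.3]
gamma = 0.1
bias = 0.25
degree = 3
query = [1.0, 2.0, 3.0]


def dot(a, b):
    """Inner product of two equally long lists."""
    return sum(x * y for x, y in zip(a, b))


# Step 1: inner product with each support vector, scaled by gamma
kernel = [gamma * dot(sv, query) for sv in support_vectors]
# Step 2: cube via square-then-multiply, as done on the ciphertext
poly_kernel = [(k * k) * k for k in kernel]
# Step 3: weighted sum of the kernel values plus the intercept
decision_value = dot(poly_kernel, dual_coefficients) + bias

# Reference: sum_i alpha_i * (gamma * <sv_i, x>) ** degree + b
reference = sum(
    alpha * (gamma * dot(sv, query)) ** degree
    for alpha, sv in zip(dual_coefficients, support_vectors)
) + bias

print(decision_value, reference)
```

The cube is hard-coded as a square followed by one more multiplication, so this shape of computation only matches a model trained with degree 3.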

Just to be sure that the server has "no idea" what the encrypted query and the resulting prediction are, let's print them:

In [None]:
print(f'encrypted query: {enc_query.data}')
print(f'encrypted prediction: {prediction_enc.data}')

encrypted query: <_tenseal_cpp.CKKSVector object at 0x7fa9899ccae8>
encrypted prediction: <_tenseal_cpp.CKKSVector object at 0x7fa9899cc7d8>


Back to the client: let's imagine that the encrypted result was transferred back to the client side.
## The following code is run on the client side:

All that's left to do is decrypt the prediction.
For the reader, let's add two more lines that do the inference as it would have been done without encryption, just to compare the results:

In [None]:
prediction = model.dual_coef_.dot(np.power(model.gamma * model.support_vectors_.dot(scaled_query.T), model.degree)) + model.intercept_
print(f'expected prediction value:\t\t{prediction[0]}')
print(f'result prediction from encrypted value:\t{prediction_enc.decrypt()}')

expected prediction value:		[1.77782917]
result prediction from encrypted value:	[1.7778570581410666]


Yes! I know! The result is nothing short of amazing.

We have been able to:
- train an SVM model
- encrypt our query
- run the inference on the encrypted query
- and get a result that matches the unencrypted one to about four decimal places (CKKS is an approximate scheme, so a small error is expected)

From here, all the client needs to do is ask the server what positive and negative values mean (in the case of two-class classification with an SVM).


In [None]:
print(f'expected result using original model without encryption: {model.predict(scaled_query)}')
print(f'result prediction from encrypted value:\t\t\t {np.sign(prediction_enc.decrypt())}')

expected result using original model without encryption: ['1']
result prediction from encrypted value:			 [1.]
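
The sign-to-label step can be sketched as a tiny helper. The function name and the name lookup table are my own (hypothetical). The labels '-1' for dog and '1' for cat come from the dataset cell, and I am assuming the usual two-class SVM convention that a positive decision value means the second class in sorted order.

```python
def label_from_decision_value(value, negative_label='-1', positive_label='1'):
    """Map a decrypted SVM decision value to a class label (hypothetical helper)."""
    return positive_label if value > 0 else negative_label


# Labels as assigned when the dataset was built
animal_names = {'-1': 'dog', '1': 'cat'}

# Decrypted decision value printed by the decryption cell
decrypted = [1.7778570581410666]
label = label_from_decision_value(decrypted[0])
print(label, animal_names[label])
```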


Let's check the MSE and count the matching predictions, comparing HE and unencrypted predictions on every sample in the training set.
The thing to note here is that the MSE is tiny, negligible!
And all of the predictions are the same as the ones made on the unencrypted queries.

In [None]:
correct_results_counter = 0
wrong_results_counter = 0
results = pd.DataFrame(columns=['open_text', 'HE'])

for features_array in X_train.iterrows():
    enc_query = ts.ckks_vector(context, features_array[1].tolist())

    inside_kernel = enc_query.matmul(support_vectors.T.tolist()) * gamma
    kernel_result = inside_kernel.square() * inside_kernel

    prediction_enc = kernel_result.dot(dual_coefficients[0].tolist()) + bias
    
    prediction_decrypted = prediction_enc.decrypt()
    prediction = dual_coefficients.dot(np.power(gamma * support_vectors.dot(features_array[1].T), degree)) + bias
    
    results = pd.concat(
        [
            results,
            pd.DataFrame([{'open_text': prediction, 'HE': prediction_decrypted}]),
        ],
        ignore_index=True
    )

    if np.sign(prediction) == np.sign(prediction_decrypted):
        correct_results_counter += 1
    else:
        wrong_results_counter += 1

print(f'total correct: {correct_results_counter}')
print(f'total wrong: {wrong_results_counter}')
mse = ((results['open_text'] - results['HE'])**2).mean(axis=0)
print(f'MSE: {mse}')

total correct: 205
total wrong: 0
MSE: [1.16172214e-05]
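
The two numbers measure different things: the MSE shows how much noise the approximate CKKS arithmetic adds to the decision values, while the sign agreement is what actually matters for classification. A minimal standard-library sketch of both measures, using a helper name of my own (hypothetical) and only the single pair of values printed earlier in this report:

```python
def compare_predictions(plain, decrypted):
    """Return (mse, sign_agreements) for two equally long lists of decision values."""
    squared_errors = [(p - d) ** 2 for p, d in zip(plain, decrypted)]
    mse = sum(squared_errors) / len(squared_errors)
    # Count the pairs that fall on the same side of the decision boundary
    agreements = sum((p > 0) == (d > 0) for p, d in zip(plain, decrypted))
    return mse, agreements


# The single pair printed by the decryption cell
plain = [1.77782917]
decrypted = [1.7778570581410666]

mse, agreements = compare_predictions(plain, decrypted)
print(mse, agreements)
```

A sample whose decision value lies very close to zero could in principle flip sign under this noise, so a zero mismatch count is an observation about this dataset rather than a guarantee.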


Links:
- [More detailed overview of SVM and kernel functions](https://core.ac.uk/download/pdf/41757043.pdf)
- [Dot product explanation](https://arxiv.org/pdf/2012.13552.pdf)
- [Support Vector Machines chapter from the Python Data Science Handbook](https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/05.07-Support-Vector-Machines.ipynb#scrollTo=PDqscNUNJ7LV)
- [Scikit-learn SVC documentation](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html)
- [SVM: Maximum margin separating hyperplane](https://scikit-learn.org/stable/auto_examples/svm/plot_separating_hyperplane.html#example-svm-plot-separating-hyperplane-py)
- [Python Data Science Handbook](https://github.com/jakevdp/PythonDataScienceHandbook)
- [Using custom kernels with SVM](https://scikit-learn.org/stable/auto_examples/svm/plot_custom_kernel.html)
- [Python Speech Features: a good place to find inspiration for feature extraction options](https://python-speech-features.readthedocs.io/en/latest/)
- [Standard scaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)