## 0. Installing Dependencies

In [None]:
# cd to SyferText/
# And run the below commands
# !conda activate openmined
# !git checkout psi_one_hot
# !pip uninstall syfertext -y
# !python setup.py install

In [None]:
# # Install Git LFS
# !curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash
# !sudo apt-get install git-lfs
# !git lfs install

# # Install syfertext language model
# ! pip install git+git://github.com/Nilanshrajput/syfertext_en_core_web_lg@master

## Vocabulary in Private NLP

During training Language Models we often need to know the size of our vocabulary and represent the words in our vocabulary as one-hot-vectors. This is a trivial task while dealing with non-private and non-remote datasets. 

But things start to get complicated as we begin to deal with private datasets, cause the contents of the datasets are private we can't search for the unique words as before, thus making it a challenging task to represent the words as one-hot vectors.

Fret Not ! SyferText to the rescue !! ;)

In [1]:
import syfertext
import syft as sy
import sys
import torch

from syft.generic.string import String

from syfertext.workers.virtual import VirtualWorker
from syfertext.encdec import encrypt, decrypt

hook = sy.TorchHook(torch)



In [2]:
# Remote workers
me = sy.local_worker
# me = syfertext.local_worker
# me = VirtualWorker(hook, "me")

# Language Object
nlp = syfertext.load("en_core_web_lg", owner=me)

bob = VirtualWorker(hook, "bob")
alice = VirtualWorker(hook, "alice")
carol = VirtualWorker(hook, "carol")
david = VirtualWorker(hook, "david")

In [3]:
# Simulate private datasets
workers = [bob, alice, carol, david]

texts = ["I am learning nlp during lockdown.",
         "I am learning cryptography during quarantine.",
         "I am learning more about privacy.",
         "I am building syfertext during quarantine.",
     ]

# Send text to workers
string_pointers = list()
for text, worker in zip(texts, workers):
    str_ptr = String(text).send(worker)
    string_pointers.append(str_ptr)

# Tokenize the strings and get doc pointers
docs = list()
for str_ptr in string_pointers:
    doc = nlp(str_ptr)
    docs.append(doc)

In [7]:
# Store the docs in corresponding variables
# Will be used later
bob_doc = docs[0]
alice_doc = docs[1]
carol_doc = docs[2]
david_doc = docs[3]

## 1. Private NLP Setup

While dealing with private datasets, we can't bring them to the local machine in their raw format. So instead we can first encrypt them on the remote machines and then bring them to the local machine for comparison. But, there are a few requirements for the encryption scheme that we should follow:

1. **Symmetric Key Encryption**: The keys used by each of the workers to encrypt their tokens should be same, cause only then the similar tokens reisdig on different workers can be compared to each other in their encrypted state.

2. **Determinstic Encryption**: The workers should encrypt their tokens using deterministic encryption, cause adding randomness to the encryption will not allow us to compare similar tokens even when they are encrypted with the same keys.

Hence, to fullfill these conditions we will be using the Diffie-Hellman Key exchange process and AES (Advance Encryption Scheme) in the ECB (Deterministic) Mode.

The PySyft BaseWorker has been extended with the following methods to allow us to perform the key exchange protocol easily.


```
_generate_private_key()
generate_public_key()
generate_secret_key()
```

In [4]:
shared_prime = 997
shared_base = 2

In [5]:
# Generate public keys
bob_public_key = bob.generate_public_key(shared_prime, shared_base)
alice_public_key = alice.generate_public_key(shared_prime, shared_base)

# Generate secret keys
alice.generate_secret_key(shared_prime, bob_public_key)
bob.generate_secret_key(shared_prime, alice_public_key)

Now Alice and Bob both have a **secret key**, which is same for both of them and known only to them. This enables us to get the encrypted tokens and find the distinct tokens among them.

In [8]:
bob_enc_tokens = bob_doc.get_encrypted_tokens_set()
alice_enc_tokens = alice_doc.get_encrypted_tokens_set()

# Print a sample of encrypted tokens
print("Bob's encrypted tokens:", bob_enc_tokens)
print("-"*20)
print("Alice's encrypted tokens:", alice_enc_tokens)

Bob's encrypted tokens: {b'seGjO0wglhH60HMgr0DG9w==', b'X77V3IjyxSuhbSnfY1B0fA==', b'0m4L5pj9M1BICVcWrKHzxg==', b'TXeoW9jn9zY3ZYARRnbehw==', b'XGqf0zLCEd8JeBhulZaZLw==', b'0OD9T1HyHxa6iEe7JR9SQA==', b'j/mFNmLH24omYUhiGfv/Ng=='}
--------------------
Alice's encrypted tokens: {b'X77V3IjyxSuhbSnfY1B0fA==', b'0m4L5pj9M1BICVcWrKHzxg==', b'XGqf0zLCEd8JeBhulZaZLw==', b'0OD9T1HyHxa6iEe7JR9SQA==', b'oN8WcT0dZ42CJUfuLZriiQ==', b'j/mFNmLH24omYUhiGfv/Ng==', b't6jAgx3TTNqLXtiG35Mklg=='}


These sets can be compared to each other and the unique tokens can be determined from them.
Let's write some handy functions to do it.

In [10]:
import random

def shuffle_set(tokens):
    """
    Shuffle the set to avoid leaking any info from
    the relative ordering of tokens.
    """
    tokens = list(tokens)
    return random.shuffle(tokens)

def assign_indices(all_tokens):
    """
    Args:
        all_tokens (Set): Set consisting
            tokens across all workers
    Returns:
        token_to_index (dict): Maps all unique 
            tokens to unique indices
    """
    shuffle_set(all_tokens)
    index = 0
    token_to_index = {}
    for token in all_tokens:
        # assign an index to token
        token_to_index[token] = index
        # increment index
        index += 1
    return token_to_index

def map_to_indices(workers_tokens, token_to_index):
    """
    Args:
        workers_tokens (Iterable): Consists of tokens
            belonging to a specific worker.
        token_to_index (dict): Map of all unique tokens 
            to unique indices
    
    Returns:
        worker_token_to_index (dict): Each token in
            `worker_tokens` is mapped to an index
    """
    worker_token_to_index = {}    
    for token in workers_tokens:
        # map token to index
        worker_token_to_index[token] = token_to_index[token]    
    return worker_token_to_index

In [11]:
def print_dict(token_to_index):
    for enc_token, index in token_to_index.items():
        print("Encrypted Token: ", enc_token, "\t\tIndex: ", index)


all_tokens = bob_enc_tokens.union(alice_enc_tokens)  # Take the union of both sets
VOCAB_SIZE = len(all_tokens)                         # store the vocaulary size

token_to_indices = assign_indices(all_tokens)

print("Vocabulary size: ", VOCAB_SIZE, "\n\n")
print_dict(token_to_indices)

Vocabulary size:  9 


Encrypted Token:  b'seGjO0wglhH60HMgr0DG9w==' 		Index:  0
Encrypted Token:  b'X77V3IjyxSuhbSnfY1B0fA==' 		Index:  1
Encrypted Token:  b'0m4L5pj9M1BICVcWrKHzxg==' 		Index:  2
Encrypted Token:  b'TXeoW9jn9zY3ZYARRnbehw==' 		Index:  3
Encrypted Token:  b'XGqf0zLCEd8JeBhulZaZLw==' 		Index:  4
Encrypted Token:  b'0OD9T1HyHxa6iEe7JR9SQA==' 		Index:  5
Encrypted Token:  b'oN8WcT0dZ42CJUfuLZriiQ==' 		Index:  6
Encrypted Token:  b'j/mFNmLH24omYUhiGfv/Ng==' 		Index:  7
Encrypted Token:  b't6jAgx3TTNqLXtiG35Mklg==' 		Index:  8


In [12]:
# Assign indices to bob's tokens
bob_token_to_index = map_to_indices(bob_enc_tokens, token_to_indices)

# Assign indices to alice's tokens
alice_token_to_index = map_to_indices(alice_enc_tokens, token_to_indices)

print("Bob's Tokens\n")
print_dict(bob_token_to_index)

print("\nAlice's Tokens\n")
print_dict(alice_token_to_index)

Bob's Tokens

Encrypted Token:  b'seGjO0wglhH60HMgr0DG9w==' 		Index:  0
Encrypted Token:  b'X77V3IjyxSuhbSnfY1B0fA==' 		Index:  1
Encrypted Token:  b'0m4L5pj9M1BICVcWrKHzxg==' 		Index:  2
Encrypted Token:  b'TXeoW9jn9zY3ZYARRnbehw==' 		Index:  3
Encrypted Token:  b'XGqf0zLCEd8JeBhulZaZLw==' 		Index:  4
Encrypted Token:  b'0OD9T1HyHxa6iEe7JR9SQA==' 		Index:  5
Encrypted Token:  b'j/mFNmLH24omYUhiGfv/Ng==' 		Index:  7

Alice's Tokens

Encrypted Token:  b'X77V3IjyxSuhbSnfY1B0fA==' 		Index:  1
Encrypted Token:  b'0m4L5pj9M1BICVcWrKHzxg==' 		Index:  2
Encrypted Token:  b'XGqf0zLCEd8JeBhulZaZLw==' 		Index:  4
Encrypted Token:  b'0OD9T1HyHxa6iEe7JR9SQA==' 		Index:  5
Encrypted Token:  b'oN8WcT0dZ42CJUfuLZriiQ==' 		Index:  6
Encrypted Token:  b'j/mFNmLH24omYUhiGfv/Ng==' 		Index:  7
Encrypted Token:  b't6jAgx3TTNqLXtiG35Mklg==' 		Index:  8


In [13]:
# Let's now return the token_to_index to the respective docs
# So that later on we can query them to get index for the tokens
# they contain.
bob_doc.set_indices(bob_token_to_index)
alice_doc.set_indices(alice_token_to_index)

So, we are able to find all the unique words across all workers without leaking any information about:
1. Tokens that the Bob and Alice contain to the local worker
2. Tokens that Bob contains to Alice and vice-versa

But, if there is a huge limitation to this approach. Let's see what it is.

## 2. Man-In-The-Middle Attack

One of the major drawbacks with the **public keys passing through the local worker is that the local worker can perform the man-in-the-middle attack**, which is a well-known limitation of the diffie-hellman key exchange. This allows the local worker to decrypt both Alice and Bob's tokens.

Note: `The key concept of this attack is that the local worker impersonates itself as a data provider. And it establishes trus with both Bob and Alice thus enabling it to decrypt both of their tokens.`.

Let's demonstrate how this can be done. 

In [23]:
# Local worker capable of performing DH key exchange
local_worker = VirtualWorker(hook, "local_worker")

In [24]:
# Generate public keys
bob_public_key = bob.generate_public_key(shared_prime, shared_base)
alice_public_key = alice.generate_public_key(shared_prime, shared_base)

# Local worker generates a public key for itself
local_public_key = local_worker.generate_public_key(shared_prime, shared_base)

In [26]:
""" This is the Key Step. """
# Now instead of sending bob's public key to alice and alice's 
# public key to bob, the local worker can pass both of them it's
# own public key.
alice.generate_secret_key(shared_prime, local_public_key)
bob.generate_secret_key(shared_prime, local_public_key)

# And using bob's public key, local worker
# can generate a secret key which is similar to the one
# with bob.
local_worker.generate_secret_key(shared_prime, bob_public_key)
bob_compromised_key = local_worker.secret_key

# And using alice's public key, local worker
# can generate a secret key which is similar to the one
# with alice.
local_worker.generate_secret_key(shared_prime, alice_public_key)
alice_compromised_key = local_worker.secret_key

print("Bob's compromised secret key   : \n", bob_compromised_key, '\n')
print("Alice's compromised secret key : \n", alice_compromised_key)

Bob's compromised secret key   : 
 b'\xbb\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00' 

Alice's compromised secret key : 
 b'\xa5\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'


Now let us try to bob's and alice's tokens using their compromised secret keys.

In [27]:
# First we will get the encrypted tokens
bob_enc_tokens = bob_doc.get_encrypted_tokens_set()
alice_enc_tokens = alice_doc.get_encrypted_tokens_set()

# Print the encrypted tokens tokens
print("Bob's data:", bob_enc_tokens)
print('-'*10)
print("Alice's data:", alice_enc_tokens)

Bob's data: {b'2BhSqDapHekupVMlx4vY+A==', b'VAzfub/9sK9N0EorhRxulw==', b'c8raqwbr/05WAajos2V8ng==', b'V6hXYW2iiViyodLGKPPLGA==', b'VLtWs9+cLoai+Zw5jWuKRA==', b'FM64R2DFBl3b8U2A2B8r6g==', b'OBlQFCcC9EvYqhwTu/o/QA=='}
----------
Alice's data: {b'xINnBQqlTqxlV9qEvHcM4w==', b'BTpT29wbmPd6uJmV9/UX4Q==', b'qdwz9oEH0ycxn2RkL292PA==', b'Rbopm8tiA4ebUALfZ/9qZA==', b'hC0u3LpAEP0FDDXgstVoNw==', b'aVoM6WN+5pILR1Jm+j19ig==', b'C036A2gK397LdGqIsCxdYA=='}


In [34]:
def decrypt_tokens(enc_tokens, secret_key):
    """ Decrypts a set of enc_tokens using
    passsed in secret_key. """
    dec_tokens = list()
    for token in enc_tokens:
        dec_token = decrypt(token, secret_key).decode("utf-8")
        dec_tokens.append(dec_token)
    return dec_tokens

bob_dec_tokens = decrypt_tokens(bob_enc_tokens, 
                                bob_compromised_key)

alice_dec_tokens = decrypt_tokens(alice_enc_tokens, 
                                  alice_compromised_key)

# Print the decrypted tokens
print("Bob's private data:", bob_dec_tokens)
print('-'*10)
print("Alice's private data:", alice_dec_tokens)

Bob's private data: ['.', 'nlp', 'I', 'am', 'during', 'learning', 'lockdown']
----------
Alice's private data: ['cryptography', 'during', 'learning', 'am', 'I', 'quarantine', '.']


In just a few extra steps the local worker is able to decrypt both Bob's and Alice's encrypted tokens. So we must find a solution to avoid this.

## 3. DHKE on SecureWorker

DHKE: Diffie-Hellman Key Exchange

In general, the local worker is the one who is training a model.
For eg, it can be a company developing a health product and training a Deep Learning model on the data residing on the hospitals servers.
But as mentioned above, if the local worker is malicious, it is capable of decrypting the data present on the hospital's machines.

Thus, in order to avoid the data-breach and protect the privacy of the data belonging to the workers, the **workers can decide to trust a third-party worker** who acts as a neutral party. This worker called as the SecurWorker is trustworthy and **performs the Diffie-Hellman Key Exchange** between all workers.

So, in our hospital example, the SecureWorker can be an **AWS instance**. Moreover, it is required only for a very short instance.

The SecureWorker is capable of performing a secure key-exchange process between two or more workers. Here we will present an example, with **four workers**.

In [44]:
workers = [bob, alice, carol, david]
secure_worker = VirtualWorker(hook, id="james")

# Execute the DH key exchange protocol securely on Secure Worker
secure_worker.execute_dh_key_exchange(shared_prime, shared_base, workers)

##### And that's it!! ;)

Now all the workers have a secret key on their machines, which is same for all of them, and thus enables us the local-worker to perform the **Set Intersection**.

In [47]:
# List of Sets of encryted tokens
enc_token_sets = list()

for doc in docs:
    enc_tokens = doc.get_encrypted_tokens_set()
    enc_token_sets.append(enc_tokens)

In [49]:
all_tokens = set()

for token_set in enc_token_sets:
    # Union with the current set
    all_tokens = all_tokens.union(token_set)

token_to_index = assign_indices(all_tokens)

# Map each doc's tokens to indices
for doc, enc_tokens in zip(docs, enc_token_sets):

    cur_token_to_index = map_to_indices(enc_tokens, token_to_index)
    
    # Return the mapped tokens to doc
    doc.set_indices(cur_token_to_index)

## 4. One-Hot Vectors

Now let's learn how to get one-hot vectors from documents present on multiple workers. 

## 5. Diffference between our approach and PSI


Extensive research work already present in the field of PSI can be referred, but our problem statement is slightly different from PSI
- In PSI, the output (set intersection) is returned to one of the parties. In our case, it would be better if **we don't let any worker know the set intersection**. Thus avoiding leaking any info to the workers about the data present on the other workers. They just receive their tokens with indices assigned to each token. Nothing else.