# Getting Started: Remote Tokenization
----------

One of the main motivations behind the creation of SyferText is to provide the ability to preprocess strings residing on a remote machine. Tokenization is one such important preprocessing step when we create deep learning NLP models. 

SyferText leverages [PySyft](https://github.com/OpenMined/PySyft)'s distributed architecture and its privacy-preserving arsenal of tools to enable 'blind' tokenization of private strings residing on remote workers without revealing their contents.

In this tutorial, you will learn how to use SyferText to tokenize a PySyft `String` residing on a remote private worker.

A private worker is one whose data is considered sensitive and subject to restricted access: we do not have the right to read it. SyferText enables us to 'blindly' tokenize such a string and obtain an encrypted version of its tokens' vector embeddings. With these encrypted embeddings, one could train an encrypted deep learning model using PySyft.

Training an NLP model with encrypted embeddings will be the subject of an upcoming tutorial.

#### Author


- `Alan Aboudib`  -> [@alan_aboudib](https://twitter.com/alan_aboudib) (Twitter)

-----------------------

## 1. `SyferText`'s distributed architecture

-------------------

You have seen in a [previous tutorial](https://bit.ly/2RQ9lwl) that SyferText can be used to tokenize a PySyft string on a local worker. In such a setting, all of the processing components, such as the `Language`, `Tokenizer`, `Doc` and `Token` objects, were created on the same worker.

What you haven't seen yet is that SyferText's `Language` object can also be used to tokenize a PySyft `StringPointer` object, which is a pointer to a PySyft `String` on a remote worker.

When the `Language` object receives a `StringPointer` to tokenize, some changes to its architecture are applied; some of its components are sent to the remote worker where the string lives. Here is how this is done:

1. The `Language` object receives a `StringPointer` pointing to a PySyft `String` to be tokenized.
2. A `Tokenizer` is created and sent to the worker where the `String` is.
3. A `TokenizerPointer` object pointing to the remote `Tokenizer` is created and kept at the local worker.
4. The remote tokenizer tokenizes the string on the remote worker and creates a `Doc` containing all of the `Token` objects.
5. A `DocPointer` object pointing at the remote `Doc` is kept at the local worker.
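
The five steps above can be sketched with a toy model in plain Python. This is only an illustration of the pointer pattern, not SyferText's or PySyft's implementation: the `Worker`, `Pointer`, `Tokenizer` and `tokenize_remote` names are hypothetical, and whitespace splitting stands in for real tokenization.

```python
# Toy model of the five steps. Every class here is hypothetical and only
# mimics the idea of remote objects and local pointers.
import itertools

_ids = itertools.count(1)


class Worker:
    """A machine with its own private object store."""

    def __init__(self, name):
        self.name = name
        self.store = {}

    def register(self, obj):
        """Store an object on this worker and return its ID."""
        obj_id = next(_ids)
        self.store[obj_id] = obj
        return obj_id


class Pointer:
    """A local handle that holds only a remote location and an object ID."""

    def __init__(self, location, obj_id):
        self.location = location
        self.obj_id = obj_id


class Tokenizer:
    def __call__(self, text):
        # Whitespace splitting stands in for real tokenization.
        return text.split()


def tokenize_remote(string_ptr):
    remote = string_ptr.location
    # Steps 2-3: create a tokenizer on the remote worker, keep a pointer to it.
    tokenizer_ptr = Pointer(remote, remote.register(Tokenizer()))
    # Step 4: the remote tokenizer processes the string where it lives.
    tokenizer = remote.store[tokenizer_ptr.obj_id]
    doc = tokenizer(remote.store[string_ptr.obj_id])
    # Step 5: only a pointer to the resulting document comes back.
    return Pointer(remote, remote.register(doc))


bob = Worker("bob")
# Step 1: the caller only holds a pointer to the remote string.
string_ptr = Pointer(bob, bob.register("A string to tokenize"))
doc_ptr = tokenize_remote(string_ptr)
print(doc_ptr.location.name, doc_ptr.obj_id)
```

The pointer returned at the end carries a location and an ID but no tokens, which mirrors why a document pointer alone cannot reveal the remote content.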

Here is an illustration of the steps above.

![SyferText architecture: remote case](art/syfertext_remote.png "SyferText architecture on remote workers")


Notice that the `DocPointer` in the illustration above does not provide access to the individual tokens as a `Doc` object does. This is because the string on the remote worker is considered private: we are not allowed to reveal its tokens or bring them to our local machine. We are only allowed to obtain encrypted versions of them.

Let's now do some coding:

-----------------------

## 2. Tokenizing a remote PySyft `String`

-------------------

Let's first import SyferText. Since SyferText is based on PySyft, we also need to import the latter, as well as PyTorch:

In [1]:
# Hide warnings (nothing to do with SyferText)
import warnings
warnings.filterwarnings('ignore')

In [2]:
import syft as sy
import torch
import syfertext

# Import PySyft's String class
from syft.generic.string import String
from syfertext.local_pipeline import get_test_language_model



We now need to hook PyTorch using PySyft's `TorchHook`:

In [3]:
hook = sy.TorchHook(torch)



This will endow PyTorch with magic powers, privacy-preserving deep learning powers, such as Federated Learning, Differential Privacy, Encrypted Training tools and more. To learn more about PySyft, you can check out its awesome [tutorial notebooks](https://github.com/OpenMined/PySyft/tree/master/examples/tutorials).

Every machine in PySyft is called a worker. Since we are using SyferText to tokenize a string on a remote machine, we need an object representing that machine. So let's now create four workers. We will call the local worker 'me', and the remote workers 'bob' and 'alice'. The fourth worker will act as the crypto provider for SMPC (Secure Multi-Party Computation), the encryption protocol PySyft uses.

In [4]:
# The local worker
me = hook.local_worker
me.is_client_worker = False
# The remote workers
bob = sy.VirtualWorker(hook, id = 'bob')
alice = sy.VirtualWorker(hook, id = 'alice')

# The crypto provider
crypto_provider = sy.VirtualWorker(hook, id = 'crypto_provider')

We are now ready to load the language model. The only language model available for the moment in SyferText is `en_core_web_lg`, which is a model for the English language, simplified from spaCy's language model of the same name. Check out the properties of that model [here](https://spacy.io/models/en#en_core_web_lg).

In [5]:
nlp = get_test_language_model()

type(nlp), nlp.owner

(syfertext.language.Language, <VirtualWorker id:me #objects:3>)

Notice from the cell's output that the `nlp` variable is an object of the `Language` class and, like all PySyft objects, it has an owner, which is a PySyft `VirtualWorker` representing our local machine. This corresponds to the illustration above, where the `Language` object was shown to reside on the local worker.

We are now going to create a PySyft `String` and send it to `bob`. This will be our remote string that we are going to tokenize:

In [6]:
string = String('A string to tokenize')

# Send the string to the remote worker `bob`
string_ptr = string.send(bob)

type(string_ptr)

syft.generic.pointers.string_pointer.StringPointer

Notice that the variable `string_ptr` is a `StringPointer` object returned by the `send()` method of the `String` object. Let's make sure that the string was really sent to `bob`.

In [7]:
bob._objects

{62451856165: 'A string to tokenize'}

Notice that `bob`'s object store now includes the PySyft string we want to tokenize.

You might have noticed that printing out the `_objects` attribute has actually revealed the string text. This of course violates our assumption that the remote string is private. Restricting this kind of access is a major development issue that SyferText will address in the near future.

We now have all we need to tokenize the string. Let's do it:

In [8]:
doc_ptr = nlp(string_ptr)

# Some checks
print(type(doc_ptr))
print(doc_ptr)

<class 'syfertext.pointers.doc_pointer.DocPointer'>
[DocPointer | me:95081303078 -> bob:42849285597]


From the above output, you can see that calling `nlp` with the `StringPointer` as its argument returns a `DocPointer` object that points to a `Doc` with a specified ID residing on `bob`, the remote worker.

Let's check whether `bob` really has a `Doc` object with that ID:

In [9]:
bob._objects

{62451856165: 'A string to tokenize',
 99022402614: SubPipeline[tokenizer],
 4084818996: [StatePointer | bob:4084818996 -> me:syfertext_sentiment:vocab],
 'syfertext_sentiment:vocab': State>None,
 47498235170: [StatePointer | bob:47498235170 -> me:syfertext_sentiment:tokenizer],
 'syfertext_sentiment:tokenizer': State>None,
 42849285597: Doc>None}

It does! `bob` has a `Doc` with the same ID as the one shown in the `DocPointer` printout above. Moreover, the tokenizer (listed as `SubPipeline[tokenizer]`) also resides on `bob`, as shown in the illustration of SyferText's architecture at the beginning of this notebook.

Using that `DocPointer` we can now get an SMPC-encrypted string vector. For the moment, the string vector here is simply the average of the individual token vectors of the string.

In [10]:
doc_vector_enc = doc_ptr.get_encrypted_vector(bob, alice, crypto_provider = crypto_provider)

print(f' Vector size is {doc_vector_enc.shape[0]}')
print(doc_vector_enc)

 Vector size is 300
(Wrapper)>AutogradTensor>FixedPrecisionTensor>[AdditiveSharingTensor]
	-> [PointerTensor | me:32586527613 -> bob:37845127773]
	-> [PointerTensor | me:99935642011 -> alice:77824897258]
	*crypto provider: crypto_provider*


As expected, the returned vector is an SMPC-encrypted PySyft `AdditiveSharingTensor` object. It is split into two shares, one held by `bob` and one by `alice`. Its size is 300 because it is the average of the 4 token vectors, each of size 300.
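
To give an intuition for what "two shares" means, here is a minimal sketch of additive secret sharing with fixed-precision encoding, written in plain Python. It is not PySyft's implementation: the ring size `Q`, the scale `SCALE`, and the helper names are assumptions chosen for illustration, and the crypto provider plays no role in this simplified version (in SMPC protocols it typically supplies correlated randomness for operations such as multiplication).

```python
# Minimal sketch of two-party additive secret sharing over a ring.
# Q, SCALE and all function names are illustrative assumptions.
import secrets

Q = 2 ** 62       # size of the ring in which shares live
SCALE = 10 ** 3   # fixed-precision scale: keep three decimal places


def encode(x):
    """Map a float to a ring element using fixed-precision encoding."""
    return round(x * SCALE) % Q


def decode(v):
    """Map a ring element back to a float; the upper half encodes negatives."""
    if v > Q // 2:
        v -= Q
    return v / SCALE


def share(values):
    """Split each encoded value into two additive shares modulo Q."""
    share_a = [secrets.randbelow(Q) for _ in values]
    share_b = [(encode(x) - a) % Q for x, a in zip(values, share_a)]
    return share_a, share_b


def reconstruct(share_a, share_b):
    """Add the two shares modulo Q and decode the result."""
    return [decode((a + b) % Q) for a, b in zip(share_a, share_b)]


vector = [0.25, -1.5, 3.125]
bob_share, alice_share = share(vector)
print(reconstruct(bob_share, alice_share))
```

Each share on its own is a list of uniformly random ring elements, so neither holder learns anything about the vector; only the sum of both shares recovers it.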



In [11]:
tok_vectors = doc_ptr.get_encrypted_token_vectors(bob, alice, crypto_provider = crypto_provider)
print(f' Vector size is {tok_vectors.shape}')
print(tok_vectors)

 Vector size is torch.Size([4, 300])
(Wrapper)>AutogradTensor>FixedPrecisionTensor>[AdditiveSharingTensor]
	-> [PointerTensor | me:61574219015 -> bob:90636773427]
	-> [PointerTensor | me:11695283936 -> alice:14333865650]
	*crypto provider: crypto_provider*


This time, the returned object is an SMPC-encrypted PySyft `AdditiveSharingTensor` of shape `[4, 300]`: it holds one 300-dimensional vector for each of the 4 tokens in the document that the pointer points to.
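
The relation between the two outputs can be sketched on unencrypted toy data. Assuming, as stated earlier, that the document vector is the average of the token vectors, a `[4, 300]` matrix reduces to a single vector of size 300. The `mean_vector` helper and the toy values are hypothetical, and the sketch says nothing about whether SyferText averages before or after encryption.

```python
# Toy illustration: averaging 4 token vectors of size 300 yields one
# document vector of size 300. Values are arbitrary placeholders.
def mean_vector(token_vectors):
    """Average a list of equal-length vectors column by column."""
    n = len(token_vectors)
    return [sum(column) / n for column in zip(*token_vectors)]


# Four fake token vectors; token i has components i, i + 1, ..., i + 299.
token_vectors = [[float(i + j) for j in range(300)] for i in range(4)]

doc_vector = mean_vector(token_vectors)
print(len(token_vectors), len(token_vectors[0]), len(doc_vector))
```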


### That's it!

You now know how to use SyferText to tokenize a remote string by manipulating a string pointer. You have seen how SyferText's distributed architecture makes that easy by leveraging PySyft which, in turn, handles all the communication logic between workers.

However, since SyferText is still in its development phase, features and methods are constantly being added and modified. We will make sure to update this tutorial whenever such new features are released.

If you have any questions or suggestions, you can DM me on OpenMined's [Slack channel](http://slack.openmined.org/), or contact me directly on my [Twitter page](https://twitter.com/alan_aboudib).