# Getting Started: Local Tokenization
----------

In this tutorial, you will learn a very simple use case of SyferText: tokenizing a Python `str` or a PySyft `String` that resides on a local PySyft worker. No remote workers are involved.

In addition to tokenization, you will also learn how to access the vector embedding of each resulting token.

#### Author


- `Alan Aboudib`  -> [@alan_aboudib](https://twitter.com/alan_aboudib) (Twitter)

-----------------------

## 1. `SyferText`'s local architecture

-------------------

SyferText's architecture is inspired by that of [spaCy](https://spacy.io/). If you are familiar with spaCy, you should feel familiar with the way SyferText works.

However, unlike spaCy, SyferText is designed to leverage [PySyft](https://github.com/OpenMined/PySyft)'s ability to work with remote workers and of course to enforce privacy when designing NLP deep learning models.

In this tutorial, we will focus on the local worker case. Using SyferText to tokenize remote strings is discussed in [another tutorial](https://bit.ly/37VEJ28) that you can check out.

Here is the architecture of SyferText when used for tokenizing strings on the local worker.

![SyferText architecture: local case](art/syfertext_local.png "SyferText architecture on the local worker")


As the figure above shows, tokenization involves a few steps:

1. An object of the `Language` class is instantiated when a language model is loaded by calling the `load()` method.

2. When given a PySyft `String` or a Python `str`, the `Language` object spawns a `Tokenizer` object.

3. The tokenizer breaks that string down into `Token` objects. 

4. The `Doc` object keeps track of those tokens. 
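
The four steps above can be sketched in plain Python. This is only a toy stand-in to make the flow of objects concrete: the class names mirror the figure, but the bodies, the regular expression, and the argument-free `load()` function are assumptions made for illustration, not SyferText's actual implementation.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    """Toy token: its text and whether a space follows it."""
    text: str
    space_after: bool


class Doc:
    """Toy container that keeps track of the tokens (step 4)."""

    def __init__(self, tokens: List[Token]):
        self.tokens = tokens

    def __iter__(self):
        return iter(self.tokens)

    def __len__(self):
        return len(self.tokens)

    def __getitem__(self, index: int) -> Token:
        return self.tokens[index]


class Tokenizer:
    """Toy tokenizer that breaks a string into Token objects (step 3)."""

    # Hypothetical rule: a run of word characters, or one non-space symbol.
    _pattern = re.compile(r"\w+|[^\w\s]")

    def __call__(self, text: str) -> Doc:
        tokens = []
        for match in self._pattern.finditer(text):
            end = match.end()
            space_after = end < len(text) and text[end] == " "
            tokens.append(Token(match.group(), space_after))
        return Doc(tokens)


class Language:
    """Toy language object that spawns a tokenizer when called (step 2)."""

    def __call__(self, text: str) -> Doc:
        tokenizer = Tokenizer()
        return tokenizer(text)


def load() -> Language:
    """Toy counterpart of step 1: loading returns a Language object."""
    return Language()


nlp = load()
doc = nlp("Hello, local worker!")
print([(token.text, token.space_after) for token in doc])
```

The real library does considerably more (ownership by a worker, vectors, hashes), but the calling pattern `doc = nlp(text)` followed by iteration over `doc` is the one used in the rest of this tutorial.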

In the example in the next section, you will see which attributes these `Token` objects expose.

-----------------------

## 2. Tokenizing a Python `str` object

-------------------

Let's first import SyferText. Since SyferText is based on PySyft, we also need to import the latter, as well as PyTorch:

In [1]:
# Hide warnings (nothing to do with SyferText)
import warnings
warnings.filterwarnings('ignore')

In [2]:
import syft as sy
import torch
import syfertext




We now need to hook PyTorch using the `TorchHook` provided by PySyft:

In [3]:
hook = sy.TorchHook(torch)



This endows PyTorch with "magic" privacy-preserving deep learning powers, such as Federated Learning, Differential Privacy, encrypted training, and more. To learn more about PySyft, you can check out its awesome [tutorial notebooks](https://github.com/OpenMined/PySyft/tree/master/examples/tutorials).

Every machine in PySyft is called a worker. Since we are using SyferText to tokenize a string on our local machine, we need an instance of the object representing that worker. Let's call it `me`:

In [4]:
me = hook.local_worker

We are now ready to load the language model. The only language model currently available in SyferText is `en_core_web_lg`, an English model simplified from spaCy's language model of the same name. You can check out the properties of that model [here](https://spacy.io/models/en#en_core_web_lg).

In [5]:
nlp = syfertext.load('en_core_web_lg', owner = me)

type(nlp), nlp.owner

(syfertext.language.Language, <VirtualWorker id:me #objects:0>)

Notice from the cell's output that the `nlp` variable is an object of the `Language` class and, like all PySyft objects, it has an owner, which is a PySyft `VirtualWorker` representing our local machine.

Let's define a native Python `str` object and tokenize it using the `Language` object we created:

In [12]:
my_str = 'I !am ({tokenizing a #python string!!'

# Tokenization happens here
doc = nlp(my_str)

# A Doc object is returned
type(doc), doc.owner, len(doc)

(syfertext.doc.Doc, <VirtualWorker id:me #objects:0>, 12)

Notice that calling the `Language` object with the `str` object we defined as an argument returns a `Doc` object (a document object). The latter is also a PySyft object that has an owner (the local worker in this case).

In order to get access to `Token` objects, we can simply iterate through the `Doc` object. Again, if you know spaCy, this should be familiar to you:

In [13]:
for token in doc:
    print('%10s | %5s | %3s | %s' % (token, token.space_after, token.start_pos, token.stop_pos))

         I |  True |   0 | 1
         ! | False |   2 | 3
        am |  True |   3 | 5
         ( | False |   6 | 7
         { | False |   7 | 8
tokenizing |  True |   8 | 18
         a |  True |  19 | 20
         # | False |  21 | 22
    python |  True |  22 | 28
    string | False |  29 | 35
         ! | False |  35 | 36
         ! | False |  36 | 37


You can see that `Token` objects give access to the underlying string, to whether that string is followed by a space in the original sentence, and to the token's start and stop character positions in that sentence. A token also carries a hash of its string (the `orth` attribute used in the next section). We can also get the vector embedding of each token using the `vector` attribute.
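
Before moving on to vectors, the `space_after`, `start_pos`, and `stop_pos` values printed above can be sanity-checked with plain Python. The sketch below does not call SyferText: it copies the printed rows into a list and assumes, as the output suggests, that `stop_pos` is exclusive and that tokens are separated by single spaces at most.

```python
# Rows copied from the output printed above:
# (text, space_after, start_pos, stop_pos)
rows = [
    ("I", True, 0, 1),
    ("!", False, 2, 3),
    ("am", True, 3, 5),
    ("(", False, 6, 7),
    ("{", False, 7, 8),
    ("tokenizing", True, 8, 18),
    ("a", True, 19, 20),
    ("#", False, 21, 22),
    ("python", True, 22, 28),
    ("string", False, 29, 35),
    ("!", False, 35, 36),
    ("!", False, 36, 37),
]

my_str = 'I !am ({tokenizing a #python string!!'

# Each token's text is the slice [start_pos, stop_pos) of the original string.
slices_match = all(my_str[start:stop] == text for text, _, start, stop in rows)

# Appending a space wherever space_after is True rebuilds the original string.
rebuilt = "".join(
    text + (" " if space_after else "") for text, space_after, _, _ in rows
)

print(slices_match, rebuilt == my_str)
```

In other words, for this example the tokens plus their `space_after` flags are enough to recover the original string.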


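Let's get the off-the-shelf vector of the third word of the original sentence, "tokenizing". Punctuation marks count as tokens too, so that word sits at index 5 of the `Doc`:

In [8]:
print(doc[5])

tokenizing


In [9]:
doc[5].vector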

array([-2.7238e-01,  3.0606e-01, -3.4498e-01,  2.1782e-01,  3.7487e-01,
        9.4773e-01,  1.7543e-01,  3.8181e-01, -3.3346e-03, -1.4996e+00,
        2.5235e-01, -3.2070e-01,  2.1666e-01,  3.8842e-01,  4.8307e-02,
       -2.0648e-01,  9.9902e-03, -1.3044e+00,  8.1312e-01,  1.1367e-02,
        2.0233e-02, -6.3248e-02, -9.4263e-01,  5.4770e-01,  1.7891e-01,
        5.5033e-01,  8.9688e-01, -6.5181e-01,  2.3126e-01, -4.4428e-01,
        3.4075e-01, -1.7907e-01,  3.2387e-01, -5.2024e-01, -2.4935e-01,
        2.1229e-01,  5.6182e-01, -4.8586e-01, -2.9836e-02,  2.2277e-01,
       -4.1493e-01, -1.4441e-01, -6.6352e-01,  3.1543e-01, -4.8249e-01,
        5.2673e-01, -8.1652e-04, -3.6682e-02, -3.3230e-01, -2.8628e-01,
       -4.3110e-01,  3.3238e-01,  8.8027e-01,  9.8399e-02, -1.1932e-01,
       -2.3699e-01, -6.4863e-02,  1.6402e-01,  1.1817e-01,  3.6778e-01,
       -2.7983e-01,  7.4529e-02, -4.2815e-01, -6.0214e-01, -1.1596e-01,
        7.1352e-01, -1.7460e-01,  6.5253e-02,  3.3433e-01, -1.96
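
A common use for such token vectors is to compare them, for example with cosine similarity. The sketch below is a plain-Python illustration that does not need SyferText. The function name `cosine_similarity` and the short vectors are made up for this example and are not real embeddings.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


# Hypothetical 4-dimensional vectors, made up for illustration only.
vec_a = [1.0, 2.0, 0.0, -1.0]
vec_b = [2.0, 4.0, 0.0, -2.0]  # same direction as vec_a
vec_c = [2.0, -1.0, 3.0, 0.0]  # orthogonal to vec_a

print(cosine_similarity(vec_a, vec_b))  # close to 1: same direction
print(cosine_similarity(vec_a, vec_c))  # close to 0: orthogonal
```

With SyferText, one would presumably pass two `vector` attributes, such as `doc[i].vector` and `doc[j].vector`, to the same function.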

We have now tokenized a native Python string. Let's do the same with a PySyft `String`.

-----------------------

## 3. Tokenizing a PySyft `String` object

-------------------

PySyft has its own string type, which is basically a wrapper around the native `str` type with additional PySyft magic, such as the ability to send a string to a remote worker and manipulate it from the comfort of the local worker. We will not discuss this here, since we are only doing local string tokenization.

Let's import PySyft's `String` class:

In [17]:
from syft.generic.string import String

Let's now define a PySyft `String` to tokenize:

In [19]:
my_string = String('I am token#izing a PySyft String object')

type(my_string), my_string.owner

(syft.generic.string.String, <VirtualWorker id:me #objects:0>)

Notice that the PySyft `String` is owned by the local worker.

Let's now use the `Language` object we created earlier to tokenize it:

In [20]:
doc = nlp(my_string)

for token in doc:
    print('%10s | %5s | %s'%(token, token.space_after, token.orth))

         I |  True | 5943131912006430202
        am |  True | 11728213064939857863
     token | False | 12977505502605755571
         # | False | 15748096671724166044
     izing |  True | 7913766558530700748
         a |  True | 5182201742351716208
    PySyft |  True | 13286865392898898656
    String |  True | 13847508276233069841
    object | False | 10176415242575268008
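
The large integers in the last column look like 64-bit hashes of each token's text. To illustrate the idea only, the sketch below maps strings to unsigned 64-bit integers with BLAKE2b from Python's standard library. The helper `toy_hash64` is hypothetical, and this is an assumed stand-in rather than the hash function SyferText uses, so its values will not match the ones printed above.

```python
import hashlib


def toy_hash64(text: str) -> int:
    """Map a string to an unsigned 64-bit integer (stand-in, not SyferText's hash)."""
    digest = hashlib.blake2b(text.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, byteorder="little")


# The same text always maps to the same integer, so "token" repeats its hash.
for word in ["token", "#", "izing", "token"]:
    print("%10s | %d" % (word, toy_hash64(word)))
```

The useful property is determinism: equal strings get equal integers, which makes a hash a compact identifier for a token's text.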


The last column shows `token.orth`, the hash of each token's string. You can also get the embedding vector of each token using the `vector` attribute. Pretty convenient, right? Using a PySyft `String` instead of a `str` object does not change the way SyferText is used.

### That's it!

You should have a better sense of how SyferText works on a local worker by now. However, keep in mind that SyferText is still in its early development phase. Things are evolving and more features will be added soon.

If you have any questions or suggestions, you can DM me on OpenMined's [Slack channel](http://slack.openmined.org/), or directly on my [Twitter page](https://twitter.com/alan_aboudib).