# How to use: Simple Tagger
----------

In this tutorial, you will learn how to use the `SimpleTagger` pipeline component, which enables token-level
tagging with custom attributes and custom labels. We'll focus on tagging stop words and on tagging English articles
as definite or indefinite to showcase the different setups available for the tagger. The same component could
also be used for token-level polarity tagging or simple forms of NER.

#### Author


- `Antonio Lopardo`  -> [@AntonioLprd](https://twitter.com/AntonioLprd) (Twitter)

## Torch hooking and worker setup

We start by hooking torch with PySyft to add additional functionality, and then we create a local worker that will
own the pipeline into which the tagger will be integrated.

In [9]:
import torch
import spacy #<- We use spaCy just for a list of stop words

import syft as sy
from syft.generic.string import String #<- We import it to convert generic strings to PySyft strings

import syfertext
from syfertext.pipeline import SimpleTagger

In [10]:
# Create a torch hook for PySyft
hook = sy.TorchHook(torch)

# Create a PySyft worker
me = hook.local_worker #<- This is the worker from which the processing is managed



## StopwordsTagger setup using spacy for a set of stop words

In [11]:
#Loading a fairly extensive list of stop words from one of spaCy's models
sp = spacy.load('en_core_web_sm')
stopwords = sp.Defaults.stop_words

Now let's initialize our `stop_tagger`. Its arguments are:

- `attribute`: the name of the custom attribute we are creating.
- `lookups`: in this case, a set of tokens that should be tagged with the object passed as `tag`.
- `tag`: the object with which to tag the tokens in the `lookups` set.
- `default_tag`: the object with which to tag tokens not in the `lookups` set.
- `case_sensitive`: a boolean flag that indicates whether capitalization should be considered when matching tokens against the entries in `lookups`.

In [12]:
stop_tagger = SimpleTagger(attribute = 'is_stop',
                           lookups = stopwords,
                           tag = True,
                           default_tag = False,
                           case_sensitive = False
                          )
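
As a rough illustration of what these arguments mean, the sketch below mimics set-based tagging in plain Python. It is not SyferText's implementation: `tag_with_set` and `toy_stopwords` are hypothetical names introduced here, tokens are plain strings, and the case handling (lower-casing both sides when `case_sensitive` is `False`) is an assumption about how matching behaves.

```python
def tag_with_set(tokens, lookups, tag, default_tag, case_sensitive):
    """Toy stand-in for set-based tagging; NOT SyferText's implementation."""
    # Assumption: case-insensitive matching compares lower-cased forms on both sides.
    if not case_sensitive:
        lookups = {entry.lower() for entry in lookups}

    tagged = []
    for token in tokens:
        key = token if case_sensitive else token.lower()
        # Tokens found in the set get `tag`; every other token gets `default_tag`.
        tagged.append((token, tag if key in lookups else default_tag))
    return tagged


# A tiny hand-written stop word set, used here instead of spaCy's list.
toy_stopwords = {"a", "of", "the"}

result = tag_with_set(
    ["The", "group", "of", "people"],
    lookups=toy_stopwords,
    tag=True,
    default_tag=False,
    case_sensitive=False,
)
print(result)
```

With `case_sensitive=True`, the capitalized token `"The"` would no longer match the lower-case entry `"the"` in this sketch and would receive `default_tag` instead.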

## ArticleTagger setup

We can also pass a `dict` as `lookups`, specifying for each included token the object with which to tag it:

- `attribute`: the name of the custom attribute we are creating.
- `lookups`: in this case, a dict with tokens as keys and the objects with which to tag them as values.
- `tag`: ignored when `lookups` is a dict.
- `default_tag`: the object with which to tag tokens not in the `lookups` dict.
- `case_sensitive`: a boolean flag that indicates whether capitalization should be considered when matching tokens against the keys in `lookups`.


In [13]:
#Converting "definite" and "indefinite" to PySyft Strings
definite = String("definite")
indefinite = String("indefinite")

#Initializing the dict to feed to the SimpleTagger constructor 
articles_dict = {"the": definite, "a": indefinite, "an": indefinite}

article_tagger = SimpleTagger(attribute = 'is_article',
                           lookups = articles_dict,
                           tag = None,
                           default_tag = False,
                           case_sensitive = False
                          )
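
The dict-based setup can be pictured the same way. The following is a minimal sketch in plain Python, not SyferText's code: `tag_with_dict` and `toy_articles` are hypothetical names, plain `str` values stand in for PySyft `String` objects, and lower-casing the keys when `case_sensitive` is `False` is an assumption.

```python
def tag_with_dict(tokens, lookups, default_tag, case_sensitive):
    """Toy stand-in for dict-based tagging; NOT SyferText's implementation."""
    # Assumption: case-insensitive matching lower-cases the dict keys and the tokens.
    if not case_sensitive:
        lookups = {key.lower(): value for key, value in lookups.items()}

    tagged = []
    for token in tokens:
        key = token if case_sensitive else token.lower()
        # Each matched token gets its own value; dict.get falls back to default_tag.
        tagged.append((token, lookups.get(key, default_tag)))
    return tagged


# Plain Python strings are used here in place of PySyft String objects.
toy_articles = {"the": "definite", "a": "indefinite", "an": "indefinite"}

article_tags = tag_with_dict(
    ["An", "apple", "the", "tree"],
    lookups=toy_articles,
    default_tag=False,
    case_sensitive=False,
)
print(article_tags)
```

Because every key carries its own value, no separate `tag` argument is needed in this setup, which is why `tag` is set to `None` in the cell above.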

## Pipeline integration
Using the `add_pipe` method, it is easy to integrate the new taggers into our NLP pipeline.

In [14]:
#Initialize an nlp pipeline that by default contains a tokenizer.
nlp = syfertext.load("en_core_web_lg", owner= me) 

nlp.add_pipe(name = 'stop tagger', #<- We add the stop tagger to the pipeline with a distinctive name
             component = stop_tagger,
             remote = True
            )

nlp.add_pipe(name = 'article tagger', #<- We add the article tagger to the pipeline with a distinctive name
             component = article_tagger,
             remote = True
            )


In [17]:
test_string = String("thereafter a various group of the people left")

#Apply the pipeline components in sequence: tokenizer -> stop_tagger -> article_tagger
tagged_test_string = nlp(test_string)

for token in tagged_test_string: #<- If the data on which we operate is local, we can access the custom attributes using "._."
    print('%10s | %5s | %s'%(token, token._.is_stop, token._.is_article))

thereafter |  True | False
         a |  True | indefinite
   various |  True | False
     group | False | False
        of |  True | False
       the |  True | definite
    people | False | False
      left | False | False
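
To make the order of operations explicit, here is a toy pipeline in plain Python that tokenizes first and then applies each component in the order it was added. Everything in it is hypothetical and written only for illustration: `ToyPipeline`, `make_tagger`, the whitespace tokenizer, and the dict-per-token representation are assumptions and do not reflect SyferText's internals. The `remote` argument and worker ownership are left out entirely.

```python
class ToyPipeline:
    """Hypothetical miniature pipeline; it only mimics the order of operations."""

    def __init__(self):
        self.components = []  # (name, callable) pairs, kept in insertion order

    def add_pipe(self, name, component):
        self.components.append((name, component))

    def __call__(self, text):
        # Step 1: a whitespace split stands in for the real tokenizer.
        doc = [{"text": word} for word in text.split()]
        # Step 2: each component annotates the tokens, in the order it was added.
        for _name, component in self.components:
            component(doc)
        return doc


def make_tagger(attribute, lookups, default_tag):
    """Build a component that writes `attribute` on every token via a dict lookup."""
    def component(doc):
        for token in doc:
            token[attribute] = lookups.get(token["text"].lower(), default_tag)
    return component


toy_nlp = ToyPipeline()
toy_nlp.add_pipe("stop tagger",
                 make_tagger("is_stop", {"a": True, "of": True, "the": True}, False))
toy_nlp.add_pipe("article tagger",
                 make_tagger("is_article",
                             {"the": "definite", "a": "indefinite", "an": "indefinite"},
                             False))

for token in toy_nlp("a group of the people"):
    print("%10s | %5s | %s" % (token["text"], token["is_stop"], token["is_article"]))
```

Each tagger writes to its own attribute, so in this sketch the two components do not interfere with each other and could be added in either order.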
