# Getting Started: SimpleTagger
----------

In this tutorial, you will learn how to use the `SimpleTagger` pipeline component, which enables token-level
tagging with custom attributes and custom labels. We'll focus on tagging stop words and on tagging English articles
as definite or indefinite, to showcase the different setups available for the tagger. The same component could
be used for token-level polarity tagging or simple forms of NER.

#### Author


- `Antonio Lopardo`  -> [@AntonioLprd](https://twitter.com/AntonioLprd) (Twitter)

## 1. Torch hooking and worker setup

We start by hooking torch with PySyft, which extends it with additional functionality, and then create a local worker
that will own the pipeline in which we run the tagger.

In [1]:
import torch
import spacy #<- We use spacy just to import a list of stop words

import syft as sy

import syfertext
from syfertext.pipeline import SimpleTagger



In [2]:
# Create a torch hook for PySyft
hook = sy.TorchHook(torch)

# Create a PySyft worker
me = hook.local_worker #<- This is the worker from which we manage the processing



## 2. Creating a stop-word tagger

In [0]:
# Load a fairly extensive list of stop words from one of spacy's models
sp = spacy.load('en_core_web_sm')
stopwords = sp.Defaults.stop_words

The `SimpleTagger` can be initialized in two ways. If `lookups` is a `set` of tokens, every token in the set is
labeled with the object passed as `tag`, and every token not in the set is labeled with `default_tag`.
Alternatively, `lookups` can be a `dict` whose keys are tokens and whose values are their respective tags.
In this second case, `default_tag` is still used for tokens that are not keys of the `dict`.
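
The two modes can be sketched in plain Python. This is only an illustration of the rules described above, not SyferText's implementation: the helper name `lookup_tag` is hypothetical, and lowercasing both sides when `case_sensitive` is `False` is an assumption about how case-insensitive matching might work.

```python
def lookup_tag(token, lookups, tag=None, default_tag=None, case_sensitive=True):
    """Hypothetical stand-in for the lookup rules (not SyferText code)."""
    # Assumption: case-insensitive matching compares lowercased strings.
    key = token if case_sensitive else token.lower()

    if isinstance(lookups, dict):
        # dict mode: the value stored under the token is its tag.
        table = lookups if case_sensitive else {k.lower(): v for k, v in lookups.items()}
        return table.get(key, default_tag)

    # set mode: every token in the set receives the same `tag`.
    members = lookups if case_sensitive else {k.lower() for k in lookups}
    return tag if key in members else default_tag


# set mode, ignoring capitalization
print(lookup_tag("The", {"the", "of"}, tag=True, default_tag=False, case_sensitive=False))  # True

# dict mode, with a fallback for unknown tokens
print(lookup_tag("the", {"the": "definite"}, default_tag=False))    # definite
print(lookup_tag("group", {"the": "definite"}, default_tag=False))  # False
```

Note that `tag` plays no role in dict mode, since each key already carries its own tag.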

Now let's initialize our `stop_tagger` with the set of stop words from spacy:
* `attribute` helps us name the custom attribute we are creating
* `lookups` in this case is a set of tokens
* `tag` the object with which to tag the tokens in the `lookups` set
* `default_tag` the object with which to tag tokens not in `lookups` set
* `case_sensitive` a boolean flag that indicates if capitalization should be considered when matching tokens

In [0]:
stop_tagger = SimpleTagger(attribute = 'is_stop',
                           lookups = stopwords,
                           tag = True,
                           default_tag = False,
                           case_sensitive = False
                          )

## 3. Creating a tagger for article type

Now let's initialize our `article_tagger` with a `dict` that has tokens as keys and tags for the attribute as values:
* `attribute` helps us name the custom attribute we are creating
* `lookups` in this case is a dictionary mapping tokens to tags
* `default_tag` the object with which to tag tokens not in the `lookups` dict
* `case_sensitive` a boolean flag that indicates if capitalization should be considered when matching tokens


In [0]:
#define the tags for our dictionary
definite = "definite"
indefinite = "indefinite"

#Initializing the dict to feed to the SimpleTagger constructor 
articles_dict = {"the": definite, "a": indefinite, "an": indefinite}

article_tagger = SimpleTagger(attribute = 'is_article',
                           lookups = articles_dict,
                           default_tag = False,
                           case_sensitive = False
                          )

## 4. Pipeline integration
Using the `add_pipe` method, it is easy to integrate the new taggers into our nlp pipeline.

In [0]:
#Initialize an nlp pipeline that by default contains a tokenizer.
nlp = syfertext.load("en_core_web_lg", owner= me) 

nlp.add_pipe(name = 'stop tagger', #<- We add the stop tagger to the pipeline with a distinctive name
             component = stop_tagger,
             remote = True
            )

nlp.add_pipe(name = 'article tagger', #<- We add the article tagger to the pipeline with a distinctive name
             component = article_tagger,
             remote = True
            )
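
To make the ordering concrete, here is a minimal plain-Python sketch of a pipeline that runs a tokenizer and then each added component in the order it was registered. Everything in it is hypothetical (`ToyPipeline`, `make_tagger`, whitespace tokenization, tokens as dictionaries, a three-word stop list); it does not reflect how SyferText implements `add_pipe` and it ignores the `remote` argument entirely.

```python
class ToyPipeline:
    """Hypothetical stand-in: whitespace tokenizer plus named components run in order."""

    def __init__(self):
        self.components = []  # list of (name, callable) pairs, in insertion order

    def add_pipe(self, name, component):
        self.components.append((name, component))

    def __call__(self, text):
        # Tokenize first, then pass the tokens through every component in sequence.
        tokens = [{"text": word} for word in text.split()]
        for _name, component in self.components:
            tokens = component(tokens)
        return tokens


def make_tagger(attribute, lookups, tag=None, default_tag=None):
    """Build a case-insensitive tagger that writes `attribute` on every token."""
    def tagger(tokens):
        for token in tokens:
            word = token["text"].lower()
            if isinstance(lookups, dict):
                token[attribute] = lookups.get(word, default_tag)
            else:
                token[attribute] = tag if word in lookups else default_tag
        return tokens
    return tagger


toy_nlp = ToyPipeline()
toy_nlp.add_pipe("stop tagger",
                 make_tagger("is_stop", {"a", "of", "the"}, tag=True, default_tag=False))
toy_nlp.add_pipe("article tagger",
                 make_tagger("is_article",
                             {"the": "definite", "a": "indefinite", "an": "indefinite"},
                             default_tag=False))

for token in toy_nlp("a group of the people"):
    print(token)
```

Each component only adds its own attribute, so later components see the tags written by earlier ones and the two taggers do not interfere with each other.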


In [7]:
test_string = "thereafter a various group of the people left"

#apply in sequence tokenizer->stop_tagger->article_tagger
tagged_test_string = nlp(test_string)

for token in tagged_test_string: #<-If the data on which we operate is local we can access the custom attribute using "._."
    print('%10s | %5s | %s'%(token, token._.is_stop, token._.is_article))

thereafter |  True | False
         a |  True | indefinite
   various |  True | False
     group | False | False
        of |  True | False
       the |  True | definite
    people | False | False
      left | False | False


## And we are done!👍

With the help of `SimpleTagger`, you should now be able to tackle most token-level
tagging tasks when using SyferText.

If you have any questions or suggestions, you can find us on OpenMined's [Slack channel](http://slack.openmined.org/).
