# Getting Started: SimpleTagger
----------

In this tutorial, you will learn how to use the `SimpleTagger` pipeline component, which enables token-level
tagging with custom attributes and custom labels. We focus on tagging stop words and on tagging English articles
as definite or indefinite, to showcase the two setups available for the tagger. The same component could
be used for token-level polarity tagging or simple forms of NER.

#### Author


- `Antonio Lopardo`  -> [@AntonioLprd](https://twitter.com/AntonioLprd) (Twitter)

## 1. Torch hooking and worker setup

We start by hooking torch with PySyft, which adds extra functionality to torch, and then create a local worker
that will own the pipeline in which we run the tagger.

In [1]:
import torch

import syft as sy

import syfertext
from syfertext.pipeline import SimpleTagger
from syfertext.local_pipeline import get_test_language_model



In [2]:
# Create a torch hook for PySyft
hook = sy.TorchHook(torch)

# Create a PySyft worker
me = hook.local_worker  # <- This is the worker from which we manage the processing
me.is_client_worker = False



## 2. Creating a stop-word tagger

In [3]:
# Initialize a fairly extensive list of stop words from https://meta.wikimedia.org/wiki/Stop_word_list/google_stop_word_list#English

stopwords = ["a", "about", "above", "after", "again", "against", "all", "am", "an", "and", "any", "are", 
             "aren't", "as", "at", "be", "because", "been", "before", "being", "below", "between", "both", 
             "but", "by", "can't", "cannot", "could", "couldn't", "did", "didn't", "do", "does", "doesn't", 
             "doing", "don't", "down", "during", "each", "few", "for", "from", "further", "had", "hadn't",
             "has", "hasn't", "have", "haven't", "having", "he", "he'd", "he'll", "he's", "her", "here", 
             "here's", "hers", "herself", "him", "himself", "his", "how", "how's", "i", "i'd", "i'll", "i'm", 
             "i've", "if", "in", "into", "is", "isn't", "it", "it's", "its", "itself", "let's", "me", "more", 
             "most", "mustn't", "my", "myself", "no", "nor", "not", "of", "off", "on", "once", "only", "or", 
             "other", "ought", "our", "ours", "ourselves", "out", "over", "own", "same", "shan't", "she",
             "she'd", "she'll", "she's", "should", "shouldn't", "so", "some", "such", "than", "that", "that's",
             "the", "their", "theirs", "them", "themselves", "then", "there", "there's", "these", "they",
             "they'd", "they'll", "they're", "they've", "this", "those", "through", "to", "too", "under",
             "until", "up", "very", "was", "wasn't", "we", "we'd", "we'll", "we're", "we've", "were", "weren't",
             "what", "what's", "when", "when's", "where", "where's", "which", "while", "who", "who's", "whom",
             "why", "why's", "with", "won't", "would", "wouldn't", "you", "you'd", "you'll", "you're", "you've",
             "your", "yours", "yourself", "yourselves"]

The `SimpleTagger` can be initialized in two ways:
1. With the `lookups` argument as a `set`
2. With the `lookups` argument as a `dict`

Now let's initialize our `stop_tagger` with the stop words defined above, using the following arguments:
* **attribute**: name of the attribute to add to the tokens
* **lookups**: a `set` of tokens to be labeled with `tag`
* **tag**: the object with which to tag the tokens in the `lookups` set
* **default_tag**: the object with which to tag tokens not in `lookups` set
* **case_sensitive**: a boolean flag that indicates if capitalization should be considered when matching tokens

In [4]:
stop_tagger = SimpleTagger(attribute = 'is_stop',
                           lookups = stopwords,
                           tag = True,
                           default_tag = False,
                           case_sensitive = False
                          )
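
To make the arguments above concrete, here is a minimal plain-Python sketch of set-based tagging. It is an illustration of the described behavior only, not SyferText's implementation; the helper `tag_with_set` and the sample tokens are hypothetical.

```python
# Plain-Python illustration of set-based tagging.
# Assumption: this mirrors the behavior described above; it is NOT SyferText code.
def tag_with_set(tokens, lookups, tag, default_tag, case_sensitive):
    """Return one tag per token: `tag` if the token is in `lookups`, else `default_tag`."""
    if not case_sensitive:
        # Normalize the lookups once so matching ignores capitalization.
        lookups = {word.lower() for word in lookups}
    tags = []
    for token in tokens:
        key = token if case_sensitive else token.lower()
        tags.append(tag if key in lookups else default_tag)
    return tags


tokens = ["The", "people", "left"]
insensitive = tag_with_set(tokens, {"the", "of"}, True, False, case_sensitive=False)
sensitive = tag_with_set(tokens, {"the", "of"}, True, False, case_sensitive=True)
print(insensitive)  # [True, False, False]
print(sensitive)    # [False, False, False]
```

With `case_sensitive=False`, the capitalized "The" still matches the lowercase entry, which is usually what you want for stop words.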

## 3. Creating a tagger for article type

Now let's initialize our `article_tagger` with a `dict` whose keys are tokens and whose values are the tags to assign:
* **attribute**: name of the attribute to add to the tokens
* **lookups**: a `dict` where keys are tokens and the values are the objects with which to label them
* **tag**: not needed if `lookups` is a dict since the dict itself contains the tag objects
* **default_tag**: the object with which to tag tokens not in the `lookups` dict
* **case_sensitive**: a boolean flag that indicates if capitalization should be considered when matching tokens


In [5]:
# Define the tags for our dictionary
definite = "definite"
indefinite = "indefinite"

# Build the dict to pass to the SimpleTagger constructor
articles_dict = {"the": definite, "a": indefinite, "an": indefinite}

article_tagger = SimpleTagger(attribute = 'is_article',
                              lookups = articles_dict,
                              default_tag = False,
                              case_sensitive = False
                             )
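
The dict-based setup differs from the set-based one in that each matching token receives its own tag. A minimal plain-Python sketch of that behavior, under the assumption that it mirrors the description above (the helper `tag_with_dict` and the sample sentence are hypothetical and not part of SyferText):

```python
# Plain-Python illustration of dict-based tagging.
# Assumption: this mirrors the behavior described above; it is NOT SyferText code.
def tag_with_dict(tokens, lookups, default_tag, case_sensitive):
    """Return the tag stored under each token in `lookups`, or `default_tag` if absent."""
    if not case_sensitive:
        # Lowercase the keys once so matching ignores capitalization.
        lookups = {key.lower(): value for key, value in lookups.items()}
    return [
        lookups.get(token if case_sensitive else token.lower(), default_tag)
        for token in tokens
    ]


sample_tokens = ["An", "apple", "fell", "from", "the", "tree"]
sample_lookups = {"the": "definite", "a": "indefinite", "an": "indefinite"}
article_tags = tag_with_dict(sample_tokens, sample_lookups, False, case_sensitive=False)
print(article_tags)
# ['indefinite', False, False, False, 'definite', False]
```

No separate `tag` argument is needed here because the dict values already carry the tags.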

## 4. Pipeline integration
Using the `add_pipe` method, it is easy to integrate the new taggers into our NLP pipeline.

In [6]:
# Initialize an NLP pipeline that contains a tokenizer by default.
nlp = get_test_language_model()

# Add the stop tagger to the pipeline under a distinctive name
nlp.add_pipe(name = 'stop tagger',
             component = stop_tagger,
             access = {'*'}
            )

# Add the article tagger to the pipeline under a distinctive name
nlp.add_pipe(name = 'article tagger',
             component = article_tagger,
             access = {'*'}
            )
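
Conceptually, each added component runs after the previous one, in the order in which it was added. The toy pipeline below sketches that idea in plain Python. `ToyPipeline` and `make_tagger` are hypothetical stand-ins written for this illustration; they do not reproduce SyferText's classes, and the `access` argument is left out.

```python
# Toy illustration of ordered pipeline components.
# Assumption: hypothetical stand-ins only; this is NOT SyferText's pipeline.
class ToyPipeline:
    def __init__(self):
        self.pipes = []  # ordered list of (name, component) pairs

    def add_pipe(self, name, component):
        self.pipes.append((name, component))

    def __call__(self, text):
        # Whitespace splitting stands in for a real tokenizer.
        doc = [{"text": word} for word in text.split()]
        for _name, component in self.pipes:
            doc = component(doc)  # each component sees the previous one's output
        return doc


def make_tagger(attribute, lookups, default_tag):
    """Build a component that writes `attribute` on every token dict."""
    def tagger(doc):
        for token in doc:
            token[attribute] = lookups.get(token["text"].lower(), default_tag)
        return doc
    return tagger


toy = ToyPipeline()
toy.add_pipe("stop tagger", make_tagger("is_stop", {"a": True, "of": True, "the": True}, False))
toy.add_pipe("article tagger", make_tagger("is_article", {"the": "definite", "a": "indefinite"}, False))
toy_doc = toy("a group of the people")
print(toy_doc[0])
# {'text': 'a', 'is_stop': True, 'is_article': 'indefinite'}
```

Because the components write different attributes, their order does not change the result here; it would matter if one component depended on an attribute set by another.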


In [7]:
test_string = "thereafter a various group of the people left"

# Apply in sequence: tokenizer -> stop_tagger -> article_tagger
tagged_test_string = nlp(test_string)

# If the data we operate on is local, we can access the custom attributes through "._."
for token in tagged_test_string:
    print('%10s | %5s | %s'%(token, token._.is_stop, token._.is_article))

thereafter | False | False
         a |  True | indefinite
   various | False | False
     group | False | False
        of |  True | False
       the |  True | definite
    people | False | False
      left | False | False
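
Once tokens carry these attributes, downstream code can filter on them. The sketch below works on plain tuples copied from the printed rows rather than on live SyferText tokens, so the variable names and the tuple layout are assumptions made for illustration.

```python
# Rows copied from the output above: (token text, is_stop, is_article).
# Assumption: plain tuples stand in for SyferText tokens in this sketch.
rows = [
    ("thereafter", False, False),
    ("a", True, "indefinite"),
    ("various", False, False),
    ("group", False, False),
    ("of", True, False),
    ("the", True, "definite"),
    ("people", False, False),
    ("left", False, False),
]

# Keep only the tokens that were not tagged as stop words.
content_words = [text for text, is_stop, _ in rows if not is_stop]

# Map each article to its type; non-articles carry the default tag False.
articles = {text: kind for text, _, kind in rows if kind}

print(content_words)  # ['thereafter', 'various', 'group', 'people', 'left']
print(articles)       # {'a': 'indefinite', 'the': 'definite'}
```

Note that "thereafter" is kept because it does not appear in the stop-word list used in this tutorial.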




## And we are done!👍

With the help of `SimpleTagger`, you should now be able to tackle most token-level
tagging tasks when using SyferText.

If you have any questions or suggestions, you can find us on OpenMined's [Slack channel](http://slack.openmined.org/).
