In [None]:
import json
import pandas as pd

# Stanford Part-Of-Speech tagger

Let's load our dataset and take a peek at some of the values in it:

In [None]:
# Load dataset:
vuelos = pd.read_csv('data/vuelos.csv', index_col=0)
with pd.option_context('max_colwidth', 800):
    print(vuelos.loc[:100:5][['label']])

To interface with the Stanford tagger we can use the `StanfordPOSTagger` class from the `nltk.tag.stanford` module. We create an object by passing in both our language-specific model and the tagger `.jar` we previously downloaded from Stanford's website.

Then, as a quick test, we tag a Spanish sentence to see what we get back from the tagger.

In [None]:
from nltk.tag.stanford import StanfordPOSTagger

spanish_postagger = StanfordPOSTagger('stanford-models/spanish.tagger', 
                                      'stanford-models/stanford-postagger.jar')

tags = spanish_postagger.tag('Amo el canto del cenzontle, pájaro de cuatrocientas voces.'.split()) 
print(tags)

The first thing to note is that the tagger takes a list of strings, not a full sentence, which is why we need to split our sentence before passing it in. The second is that we get back a list of tuples, where the first element of each tuple is the token and the second is the POS tag assigned to that token. The POS tags are [explained here](https://nlp.stanford.edu/software/spanish-faq.html) and I have made a dictionary for easy lookups.

We can inspect the tokens a bit more:

In [None]:
with open("aux/spanish-tags.json", "r") as r:
    spanish_tags = json.load(r)
    
for token, tag in tags[:10]:
    print(f"{token:15} -> {spanish_tags[tag]['description']}")
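The lookup above raises a `KeyError` if the tagger ever emits a tag that is missing from the dictionary. A minimal sketch of a more forgiving lookup follows. It assumes the JSON maps each tag to an object with a `description` key, as the loop above implies. The `describe` helper, the two dictionary entries and the hand-written `(token, tag)` pairs are illustrative only; they are not real tagger output or the real contents of `aux/spanish-tags.json`.

```python
# Hypothetical excerpt of the lookup dictionary; the real one is loaded
# from aux/spanish-tags.json and its descriptions may be worded differently.
example_spanish_tags = {
    "da0000": {"description": "Article (definite)"},
    "nc0s000": {"description": "Common noun (singular)"},
}


def describe(tag, lookup):
    """Return the description for `tag`, or a placeholder if it is missing."""
    return lookup.get(tag, {}).get("description", f"<unknown tag: {tag}>")


# Hand-written (token, tag) pairs in the same shape the tagger returns.
# "zzz" is a made-up tag used to exercise the fallback.
example_tags = [("el", "da0000"), ("canto", "nc0s000"), ("cenzontle", "zzz")]

for token, tag in example_tags:
    print(f"{token:15} -> {describe(tag, example_spanish_tags)}")
```

Chaining `dict.get` keeps the loop running on unexpected tags, which helps when tagging noisy text such as tweets.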

## Specific tokenisation  

As you may imagine, using `split` to tokenise our text is not the best idea; it is almost certainly better to write our own function that takes into consideration the kind of text we are going to process. The function below uses the `TweetTokenizer` and takes flag emojis into account. As a final touch, it also returns the position of each returned token:

In [None]:
from nltk.tokenize import TweetTokenizer

TWEET_TOKENIZER = TweetTokenizer()

# This function exists in vuelax.tokenisation in this same repository
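def index_emoji_tokenize(string, return_flags=False):
    """Tokenise `string` and return the tokens with their start positions."""
    flag = ''  # first half of a flag emoji, waiting for its second half
    ix = 0  # index in `string` from which to search for the next token
    tokens, positions = [], []
    for t in TWEET_TOKENIZER.tokenize(string):
        ix = string.find(t, ix)
        if len(t) == 1 and ord(t) >= 127462:  # this is the code for 🇦
            if not return_flags:
                continue
            if flag:
                # Second half of a flag: emit both halves as a single token
                tokens.append(flag + t)
                positions.append(ix - 1)
                flag = ''
            else:
                flag = t
        else:
            tokens.append(t)
            positions.append(ix)
        ix += len(t)  # resume the search right after the current token
    return tokens, positions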

label = vuelos.iloc[75]['label']
print(label)
print()
tokens, positions = index_emoji_tokenize(label, return_flags=True)
print(tokens)
print(positions)
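The `ord(t) >= 127462` check in the function works because a flag emoji is not a single character: it is a pair of Unicode regional indicator symbols, which run from U+1F1E6 (🇦, decimal 127462) to U+1F1FF (🇿). The sketch below shows this with the Mexican flag, and shows how a returned position can be used to slice the flag back out of the original string. The `is_regional_indicator` helper and the sample text are assumptions for illustration. Unlike the function above, the helper also applies an upper bound.

```python
REGIONAL_INDICATOR_A = 0x1F1E6  # 127462, the threshold used in index_emoji_tokenize
REGIONAL_INDICATOR_Z = 0x1F1FF  # upper bound: an assumption added for this sketch


def is_regional_indicator(char):
    """True if `char` is one of the 26 regional indicator symbols."""
    return len(char) == 1 and REGIONAL_INDICATOR_A <= ord(char) <= REGIONAL_INDICATOR_Z


flag = "\U0001F1F2\U0001F1FD"  # regional indicators M + X, rendered as 🇲🇽
print(len(flag))               # a flag is two code points long
print([ord(c) for c in flag])

# A position plus the token length is enough to recover the token:
text = "CDMX " + flag + " a Madrid"  # made-up sample text
position = text.find(flag)
print(text[position:position + len(flag)] == flag)
```

This is also why the function stores `ix - 1` for a flag: by the time the second half is seen, the flag started one character earlier.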