This notebook "borrows" functions from all the other notebooks. These functions were replicated inside the `vuelax` package and are imported from there, which makes them less messy to work with.

## We've got a new offer!

Imagine getting a new offer that looks like this:  

 > ¡Sin pasar EE.UU! 🇪🇬¡Todo México a El Cairo, Egipto $13,677!

In [None]:
offer_text = "¡Sin pasar EE.UU! 🇪🇬¡Todo México a El Cairo, Egipto $13,677!"

**Tokenise**: the first step is to tokenise the offer using our `index_emoji_tokenize` function:

In [None]:
from vuelax.tokenisation import index_emoji_tokenize

tokens, positions = index_emoji_tokenize(offer_text)

print(tokens)

**POS Tagging**: the next thing in line is to obtain the POS tags corresponding to each one of the tokens. We can do this by using the `StanfordPOSTagger`:

In [None]:
from nltk.tag.stanford import StanfordPOSTagger

spanish_postagger = StanfordPOSTagger('stanford-models/spanish.tagger', 
                                      'stanford-models/stanford-postagger.jar')

In [None]:
_, pos_tags = zip(*spanish_postagger.tag(tokens))
pos_tags
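The `zip(*...)` idiom in the cell above transposes a list of `(token, tag)` pairs into one tuple of tokens and one tuple of tags. A minimal sketch of that step, using made-up pairs; the tag strings below are placeholders, not real output from the Stanford tagger:

```python
# Hypothetical (token, tag) pairs; the tag strings are placeholders,
# not real output from the Stanford tagger.
tagged = [("Todo", "TAG_A"), ("México", "TAG_B"), ("a", "TAG_C")]

# zip(*pairs) transposes the list: first items together, second items together.
words, tags = zip(*tagged)

print(words)
print(tags)
```

Since the tokens are already available, the notebook discards the first tuple with `_` and keeps only the tags.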

**Prepare for the CRF**: this step adds more features and prepares the data to be consumed by the CRF package. All the required methods exist in `vuelax.feature_selection`:

In [None]:
from vuelax.feature_selection import featurise_sentence

features = featurise_sentence(tokens, positions, pos_tags)

print(features[0])

**Sequence labelling with pycrfsuite**: the final step is to load our trained model and tag our sequence:

In [None]:
import pycrfsuite

crf_tagger = pycrfsuite.Tagger()
crf_tagger.open('model/vuelax-bad.crfsuite')

In [None]:
assigned_tags = crf_tagger.tag(features)

### Results

In [None]:
for assigned_tag, token in zip(assigned_tags, tokens):
    print(f"{assigned_tag} - {token}")

By visual inspection we can confirm that the tags are correct: "Todo México" is the origin (o), "El Cairo, Egipto" is the destination and "13,677" is the price (p).
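Beyond visual inspection, the per-token tags could be merged into spans so that each piece of the offer comes out as a single string. The following is only a sketch: `group_by_tag` is a hypothetical helper, not part of `vuelax`, and the example tags are hand-written rather than model output. It assumes `d` marks the destination and `n` marks tokens to ignore; only `o` and `p` are confirmed by the text above.

```python
from itertools import groupby


def group_by_tag(tokens, tags, ignore=("n",)):
    """Join consecutive tokens that share a tag into (tag, text) spans."""
    spans = []
    for tag, group in groupby(zip(tags, tokens), key=lambda pair: pair[0]):
        if tag in ignore:
            continue  # skip tokens that carry no information for us
        spans.append((tag, " ".join(token for _, token in group)))
    return spans


# Hand-written example; "d" and "n" are assumed label names.
example_tokens = ["Todo", "México", "a", "El", "Cairo", ",", "Egipto", "$", "13,677"]
example_tags = ["o", "o", "n", "d", "d", "d", "d", "n", "p"]

spans = group_by_tag(example_tokens, example_tags)
print(spans)
```

Tokens are joined with a single space, so punctuation ends up detached ("El Cairo , Egipto"). Slicing the original text with the token positions would avoid that, but it depends on the exact format of `positions`, which is not shown here.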

### What else is there to do?  

There are many ways this project could be improved. A few that come to mind:  
    
 - Improve the size/quality of the dataset by labelling more examples
 - Improve the way the labelling happens; using a single spreadsheet does not scale at all
 - Integrate everything under a single processing pipeline
 - "Productionify" the code and go beyond an experiment
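
As a rough idea of what a single processing pipeline might look like, the sketch below chains the four steps from this notebook behind one function. `extract_offer` is a hypothetical name, and the stages are passed in as arguments so the example can run with stand-in functions. In the notebook, the real stages would be `index_emoji_tokenize`, a wrapper around `spanish_postagger.tag`, `featurise_sentence` and `crf_tagger.tag`.

```python
def extract_offer(text, tokenize, pos_tag, featurise, label):
    """Run the four stages in order and pair each token with its label."""
    tokens, positions = tokenize(text)
    pos_tags = pos_tag(tokens)
    features = featurise(tokens, positions, pos_tags)
    labels = label(features)
    return list(zip(tokens, labels))


# Stand-in stages for illustration only; none of these mimic the real models.
def fake_tokenize(text):
    tokens = text.split()
    return tokens, list(range(len(tokens)))  # placeholder positions


def fake_pos_tag(tokens):
    return ["X"] * len(tokens)  # placeholder POS tag for every token


def fake_featurise(tokens, positions, pos_tags):
    return [{"token": t, "pos": p} for t, p in zip(tokens, pos_tags)]


def fake_label(features):
    # Toy rule: anything starting with "$" is a price, the rest is ignored.
    return ["p" if f["token"].startswith("$") else "n" for f in features]


result = extract_offer(
    "México a Egipto $13,677",
    fake_tokenize, fake_pos_tag, fake_featurise, fake_label,
)
print(result)
```

Passing the stages in keeps the function independent of how the tagger and the CRF model are loaded, which would also make it easier to test.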