### Objective: filter question tokens using a pretrained morphosyntactic (POS) tagger
---
Turn questions into keywords:
e.g. Qui est le président de la France ? --> président France

Two POS taggers are tested, then compared:
* [spaCy](#SpaCy)
* [Stanford NLP](#Stanford-Postagger)
* [Tests](#Tests)

### SpaCy

In [112]:
import spacy

In [122]:
loader = spacy.load('fr_core_news_sm')

In [125]:
from spacy.lang.fr.examples import sentences

doc = loader(sentences[0])
for token in doc:
    print(token.text, token.pos_)

Apple ADJ
cherche NOUN
a AUX
acheter VERB
une DET
startup ADJ
anglaise NOUN
pour ADP
1 NUM
milliard NOUN
de ADP
dollard NOUN


In [126]:
# Universal POS tags used by spaCy: https://spacy.io/api/annotation

spacy_filter_hard = ["NOUN","PROPN", "VERB", "X"]
spacy_filter_soft = ["NOUN","PROPN", "VERB", "X","ADJ", "ADV"]
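
To make the difference between the two filter lists concrete, here is a minimal sketch that applies them to the (token, tag) pairs printed above, without loading spaCy. The helper name `filter_tagged`, the short names `hard` and `soft`, and the hard-coded `tagged` list are illustrative assumptions, not part of the notebook's pipeline.

```python
# Tagged tokens copied from the spaCy output above (no model needed here).
tagged = [
    ("Apple", "ADJ"), ("cherche", "NOUN"), ("a", "AUX"), ("acheter", "VERB"),
    ("une", "DET"), ("startup", "ADJ"), ("anglaise", "NOUN"), ("pour", "ADP"),
    ("1", "NUM"), ("milliard", "NOUN"), ("de", "ADP"), ("dollard", "NOUN"),
]

hard = ["NOUN", "PROPN", "VERB", "X"]
soft = hard + ["ADJ", "ADV"]


def filter_tagged(pairs, allowed):
    """Hypothetical helper: keep tokens whose tag is in `allowed`."""
    allowed = set(allowed)
    return " ".join(tok for tok, tag in pairs if tag in allowed)


hard_keywords = filter_tagged(tagged, hard)
soft_keywords = filter_tagged(tagged, soft)
print("hard:", hard_keywords)
print("soft:", soft_keywords)
```

With these tags, the soft filter additionally keeps `Apple` and `startup`, both of which the small model labels as ADJ in the output above.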

In [128]:
sent = sentences[0]

In [135]:
def filter_sent_spacy(sent, filt):
    """Return the tokens of `sent` whose spaCy POS tag is in `filt`, joined by spaces."""
    doc = loader(sent)
    return " ".join([tok.text for tok in doc if tok.pos_ in filt])

In [184]:
def test_filter(sentence, func, filt):
    print("keywords: ", func(sentence, filt))

In [169]:
test_filter(sentences[0], filter_sent_spacy, spacy_filter_hard)

original:  Apple cherche a acheter une startup anglaise pour 1 milliard de dollard
keywords:  cherche acheter anglaise milliard dollard


### Stanford Postagger

In [170]:
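from nltk.tag import StanfordPOSTagger

# `model` and `jar` are assumed to hold the paths to the Stanford French
# tagger model and the Stanford POS tagger jar, defined beforehand.
pos_tagger = StanfordPOSTagger(model, jar, encoding='utf8')
res = pos_tagger.tag('je suis libre'.split())
print(res)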

The StanfordTokenizer will be deprecated in version 3.2.5.
Please use nltk.tag.corenlp.CoreNLPPOSTagger or nltk.tag.corenlp.CoreNLPNERTagger instead.
  super(StanfordPOSTagger, self).__init__(*args, **kwargs)


In [2]:
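# French Treebank annotation guide: http://ftb.linguist.univ-paris-diderot.fr/fichiers/public/guide-constit.pdf
# Keep nouns and verbs; tags are matched on their first letter only.
stan_filter_hard = ["N", "V"]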

In [172]:
from nltk import word_tokenize

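def filter_sent_stanford(sent, filt):
    """Return the words of `sent` whose Stanford tag starts with a letter in `filt`."""
    sent = word_tokenize(sent)
    tags = pos_tagger.tag(sent)
    # t[:1] is the first character of the tag, so "N" matches every tag starting with N.
    return " ".join([w for w, t in tags if t[:1] in filt])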
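
The prefix test `t[:1] in filt` is what lets the one-letter entries `"N"` and `"V"` match finer-grained tags. The sketch below shows this on hand-written pairs. The tag names (`NPP`, `NC`, `V`, `DET`, `ADJ`, `P`) are assumed to follow a French Treebank-style tagset and are chosen to be consistent with the keywords reported for this sentence in the Tests section; they are not actual tagger output, and `filter_by_prefix` is a hypothetical helper.

```python
# Hand-written (word, tag) pairs; the tags are assumptions for illustration,
# not real Stanford tagger output.
example_tags = [
    ("Londres", "NPP"), ("est", "V"), ("une", "DET"), ("grande", "ADJ"),
    ("ville", "NC"), ("du", "P"), ("Royaume-Uni", "NPP"),
]

coarse_filter = ["N", "V"]


def filter_by_prefix(pairs, prefixes):
    """Hypothetical helper mirroring `t[:1] in filt`: match on the tag's first letter."""
    return " ".join(word for word, tag in pairs if tag[:1] in prefixes)


prefix_keywords = filter_by_prefix(example_tags, coarse_filter)
print(prefix_keywords)
```

Under these assumed tags, both `NPP` and `NC` are kept by the single `"N"` entry, and the copula `est` survives because its tag starts with `V`.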

In [173]:
filter_sent_stanford(sentences[0], stan_filter_hard)

'Apple cherche a acheter startup milliard dollard'

In [176]:
test_filter(sentences[0], filter_sent_stanford, stan_filter_hard)

original:  Apple cherche a acheter une startup anglaise pour 1 milliard de dollard
keywords:  Apple cherche a acheter startup milliard dollard


### Tests

In [187]:
def test_both(sent):
    print("--"*12,"testing" ,"--"*12)
    print("Original:", sent)
    print("STANFORD:")
    test_filter(sent, filter_sent_stanford, stan_filter_hard)
    print("SPACY:")
    test_filter(sent, filter_sent_spacy, spacy_filter_hard)

In [188]:
[test_both(i) for i in sentences]

------------------------ testing ------------------------
Original: Apple cherche a acheter une startup anglaise pour 1 milliard de dollard
STANFORD:
keywords:  Apple cherche a acheter startup milliard dollard
SPACY:
keywords:  cherche acheter anglaise milliard dollard
------------------------ testing ------------------------
Original: Les voitures autonomes voient leur assurances décalées vers les constructeurs
STANFORD:
keywords:  voitures voient assurances décalées constructeurs
SPACY:
keywords:  voitures voient assurances décalées constructeurs
------------------------ testing ------------------------
Original: San Francisco envisage d'interdire les robots coursiers
STANFORD:
keywords:  envisage d'interdire
SPACY:
keywords:  San Francisco envisage interdire robots
------------------------ testing ------------------------
Original: Londres est une grande ville du Royaume-Uni
STANFORD:
keywords:  Londres est ville Royaume-Uni
SPACY:
keywords:  Londres ville Royaume-Uni
--------------

[None, None, None, None, None, None, None, None, None, None, None, None]
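
One way to summarise where the two taggers disagree is to compare their keyword strings token by token. The sketch below does this for one pair of outputs copied from the run above; `diff_keywords` is a hypothetical helper, and splitting on whitespace is a simplifying assumption.

```python
def diff_keywords(a, b):
    """Hypothetical helper: tokens kept only by `a`, only by `b`, and by both."""
    set_a, set_b = set(a.split()), set(b.split())
    return sorted(set_a - set_b), sorted(set_b - set_a), sorted(set_a & set_b)


# Keyword strings copied from the test output above.
stanford_out = "envisage d'interdire"
spacy_out = "San Francisco envisage interdire robots"

only_stanford, only_spacy, shared = diff_keywords(stanford_out, spacy_out)
print("only Stanford:", only_stanford)
print("only spaCy:", only_spacy)
print("shared:", shared)
```

For this sentence the mismatch on `d'interdire` appears to come from tokenization rather than tagging, since the elided form reaches the Stanford tagger as a single token.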