In [None]:
!pip install spacy
!python -m spacy download fr_core_news_md

# spaCy

## Loading the library

- Import the `French` class from `spacy.lang.fr`.
- Create the `nlp` object with the `French` class constructor.
- Create a `doc` and print its text.

In [None]:
# Import the French language class
from spacy.lang.____ import ____

# Create the nlp object
nlp = ____

# Process a text
doc = nlp("Ceci est une phrase narcissique puisqu'elle "
          "ne parle que d'elle-même.")

# Print the document text
print(____.text)

In [2]:
# Import the French language class
from spacy.lang.fr import French

# Create the nlp object
nlp = French()

# Process a text
doc = nlp("Ceci est une phrase narcissique puisqu'elle "
          "ne parle que d'elle-même.")

# Print the document text
print(doc.text)

Ceci est une phrase narcissique puisqu'elle ne parle que d'elle-même.


**The `nlp` object**

- contains the processing pipeline
- includes language-specific rules for tokenization etc.

**The `doc` object**

Holds a document, that is, a sequence of *tokens*.

## Working with a `Doc` object

A `Doc` object behaves like a list: it can be indexed and sliced.

Using `[]`:
- Access the first token of `doc`.
- Select the words `'est une'` with a slice.
- Select the words `'est une phrase narcissique'` with a slice.

In [None]:
# A slice of the Doc for 'est une'
est_une = ____
print(est_une.text)

# A slice of the Doc for 'est une phrase narcissique'
est_une_phrase_narcissique = ____
print(est_une_phrase_narcissique.text)
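
One possible solution to the exercise above, as a minimal sketch. It assumes the blank `French` pipeline from the first section (tokenizer only, no model download) and relies on the first five words of the sentence being split on whitespace.

```python
from spacy.lang.fr import French

nlp = French()
doc = nlp("Ceci est une phrase narcissique puisqu'elle "
          "ne parle que d'elle-même.")

# Indexing a Doc returns a Token
first_token = doc[0]
print(first_token.text)

# Slicing a Doc returns a Span; the end index is exclusive
est_une = doc[1:3]
print(est_une.text)

est_une_phrase_narcissique = doc[1:5]
print(est_une_phrase_narcissique.text)
```

Because the end index is exclusive, `doc[1:3]` covers the tokens at positions 1 and 2 only.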

## Lexical attributes

In this example, you'll use spaCy's `Doc` and `Token` objects, and lexical attributes, to find percentages in a text. You'll be looking for two consecutive tokens: a number and a percent sign.

- Use the `like_num` token attribute to check whether a token in the doc resembles a number.
- Get the token following the current token in the document. The index of the next token in the `doc` is `token.i + 1`.
- Check whether the next token's `text` attribute is a percent sign `"%"`.

In [None]:
from spacy.lang.en import English

nlp = English()

# Process the text
doc = nlp(
    "In 1990, more than 60% of people in East Asia were in extreme poverty. "
    "Now less than 4% are."
)

# Iterate over the tokens in the doc
for token in doc:
    # Check if the token resembles a number
    if ____.____:
        # Get the next token in the document
        next_token = ____[____]
        # Check if the next token's text equals "%"
        if next_token.____ == "%":
            print("Percentage found:", token.text)
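
A possible completed version of the exercise above, offered as a sketch rather than the reference solution. The bounds check on `token.i + 1` and the `percentages` list are additions for illustration; they are not part of the exercise template.

```python
from spacy.lang.en import English

nlp = English()
doc = nlp(
    "In 1990, more than 60% of people in East Asia were in extreme poverty. "
    "Now less than 4% are."
)

percentages = []
for token in doc:
    # like_num is a lexical attribute, so no trained model is needed.
    # The bounds check avoids an IndexError if the last token is a number.
    if token.like_num and token.i + 1 < len(doc):
        next_token = doc[token.i + 1]
        if next_token.text == "%":
            percentages.append(token.text)
            print("Percentage found:", token.text)
```

The year `1990` also resembles a number, but it is followed by a comma and is therefore skipped.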

## Statistical models

**What are statistical models?**

- Enable spaCy to predict linguistic attributes in context
    - Part-of-speech tags
    - Syntactic dependencies
    - Named entities
- Trained on labeled example texts
- Can be updated with more examples to fine-tune predictions

In [4]:
import spacy
from spacy import displacy

nlp = spacy.load('fr_core_news_md')

In [5]:
# Process a text
doc = nlp("Elle a mangé la pizza")

# Iterate over the tokens
for token in doc:
    # Print the text and the predicted part-of-speech tag
    print(token.text, token.pos_)

Elle PRON
a AUX
mangé VERB
la DET
pizza NOUN


In [6]:
for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)
    
displacy.render(doc, style="dep")

Elle PRON nsubj mangé
a AUX aux:tense mangé
mangé VERB ROOT mangé
la DET det pizza
pizza NOUN obj mangé


In [36]:
# Process a text
doc = nlp("Enedis cherche de nouveaux bureaux à Vaison-La-Romaine "
          "avec l'aide de François Cordel.")

# Iterate over the predicted entities
for ent in doc.ents:
    # Print the entity text and its label
    print(ent.text, ent.label_)
    
displacy.render(doc, style="ent")

Enedis ORG
Vaison-La-Romaine LOC
François Cordel PER


In [25]:
spacy.explain("ORG")

'Companies, agencies, institutions, etc.'

In [26]:
spacy.explain("LOC")

'Non-GPE locations, mountain ranges, bodies of water'

In [27]:
spacy.explain("PER")

'Named person or family.'

## Similarity

- `spaCy` can compare two objects and predict how similar they are
- `Doc.similarity()`, `Span.similarity()` and `Token.similarity()`
- Take another object and return a similarity score (typically between `0` and `1`)
- Important: needs a model that has word vectors included, for example:
    - ✅ `en_core_web_md` (medium model)
    - ✅ `en_core_web_lg` (large model)
    - 🚫 NOT `en_core_web_sm` (small model)

In [34]:
# Load a larger model with vectors
# (it must be installed first: python -m spacy download en_core_web_md)
nlp = spacy.load("en_core_web_md")

# Compare two documents
doc1 = nlp("I like fast food")
doc2 = nlp("I like pizza")
print(doc1.similarity(doc2))

OSError: [E050] Can't find model 'en_core_web_md'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.

In [33]:
# Compare two tokens
doc = nlp("I like pizza and pasta")
token1 = doc[2]
token2 = doc[4]
print(token1.similarity(token2))

1.0
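
By default, spaCy computes the scores returned by `similarity()` as the cosine similarity between vectors, where the vector of a `Doc` or `Span` is the average of its token vectors. The sketch below reproduces that computation in plain Python. The three-dimensional vectors are invented for illustration and are not real word vectors.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention used here: a zero vector is similar to nothing
    return dot / (norm_u * norm_v)

def mean_vector(vectors):
    """Average a list of vectors component-wise."""
    return [sum(component) / len(vectors) for component in zip(*vectors)]

# Hypothetical toy vectors, made up for this example
pizza = [0.9, 0.1, 0.0]
pasta = [0.8, 0.2, 0.1]
car = [0.0, 0.1, 0.9]

print(cosine_similarity(pizza, pasta))
print(cosine_similarity(pizza, car))
print(cosine_similarity(mean_vector([pizza, pasta]), car))
```

Vectors pointing in nearly the same direction score close to `1`, which is why a model without word vectors cannot give meaningful results.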


In [None]:
text = (
    "Ce jour-là, 25 mars dernier, Pétersbourg fut le théâtre d’une aventure "
    "des plus étranges. Le barbier Ivan Yakovlévitch, domicilié avenue de "
    "l’Ascension (son nom de famille est perdu et son enseigne ne porte que "
    "l’inscription : On pratique aussi les saignées, au-dessous d’un monsieur "
    "à la joue barbouillée de savon), le barbier Ivan Yakovlévitch se réveilla "
    "d’assez bonne heure et perçut une odeur de pain chaud. S’étant mis sur "
    "son séant, il vit que son épouse – personne plutôt respectable et qui "
    "prisait fort le café – défournait des pains tout frais cuits. "
    "« Aujourd’hui, Prascovie Ossipovna, je ne prendrai pas de café, déclara "
    "Ivan Yakovlévitch ; je préfère grignoter un bon pain chaud avec de la "
    "ciboule. » À la vérité, Ivan Yakovlévitch aurait bien voulu et pain et "
    "café, mais il jugeait impossible de demander les deux choses à la fois, "
    "Prascovie Ossipovna ne tolérant pas de semblables caprices. « Tant "
    "mieux, se dit la respectable épouse en jetant un pain sur la table. Que "
    "mon nigaud s’empiffre de pain ! Il me restera davantage de café. » "
    "Respectueux des convenances, Ivan Yakovlévitch passa son habit "
    "par-dessus sa chemise et se mit en devoir de déjeuner. Il posa devant "
    "lui une pincée de sel, nettoya deux oignons, prit son couteau et, la "
    "mine grave, coupa son pain en deux. Il aperçut alors, à sa grande "
    "surprise, un objet blanchâtre au beau milieu ; il le tâta "
    "précautionneusement du couteau, le palpa du doigt… « Qu’est-ce que cela "
    "peut bien être ? » se dit-il en éprouvant de la résistance. Il fourra "
    "alors ses doigts dans le pain et en retira… un nez ! Les bras lui en "
    "tombèrent."
)

In [1]:
# python -m spacy download fr
# python -m spacy download fr_core_news_md

import spacy
from spacy import displacy

In [None]:
# Factor analysis (PCA, text projection)

# Tokenization, Stemming, Lemmatization, POS-tagging, Vectorization

In [2]:
# TF-IDF
# Stop words
# Lemmatization: grouping the words of a same family found in a text and reducing them to their canonical form (the lemma), e.g. petit, petite, petits and petites.
# Stemming: grouping words that share a common root and belong to the same lexical field, by reducing them to that root (the stem).
# Named-entity recognition: identifying proper nouns in a text, such as people or places, as well as quantities, values and dates.
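
As a rough illustration of the TF-IDF item listed above, here is a minimal pure-Python sketch. It assumes whitespace tokenization, term frequency normalized by document length, and the unsmoothed `log(N / df)` inverse document frequency. Libraries such as scikit-learn use smoothed variants, so their numbers will differ. The three short documents are invented for the example.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

docs = [
    "le barbier coupe le pain",
    "le barbier prend son couteau",
    "son épouse aime le café",
]
weights = tf_idf(docs)
for doc_weights in weights:
    # Show the highest-weighted term of each document
    print(max(doc_weights, key=doc_weights.get))
```

A term that appears in every document, such as `le` here, gets a weight of zero, which is the same effect a stop-word list aims for.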


# Resources

https://www.analyticsvidhya.com/blog/2018/02/the-different-methods-deal-text-data-predictive-python/

1. Basic feature extraction using text data
    - Number of words
    - Number of characters
    - Average word length
    - Number of stopwords
    - Number of special characters
    - Number of numerics
    - Number of uppercase words
2. Basic Text Pre-processing of text data
    - Lower casing
    - Punctuation removal
    - Stopwords removal
    - Frequent words removal
    - Rare words removal
    - Spelling correction
    - Tokenization
    - Stemming
    - Lemmatization
3. Advanced Text Processing
    - N-grams
    - Term Frequency
    - Inverse Document Frequency
    - Term Frequency-Inverse Document Frequency (TF-IDF)
    - Bag of Words
    - Sentiment Analysis
    - Word Embedding
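
Several items from the first and third groups of this outline can be sketched with the standard library alone. The helper names `basic_features` and `ngrams` are hypothetical, whitespace tokenization is assumed, and the sample sentence is made up. This is not the code from the linked article.

```python
import string

def basic_features(text, stopwords):
    """Compute a few simple descriptive features for one text."""
    words = text.split()
    return {
        "n_words": len(words),
        "n_chars": len(text),
        "avg_word_length": sum(len(w) for w in words) / len(words) if words else 0.0,
        "n_stopwords": sum(w.lower() in stopwords for w in words),
        "n_special_chars": sum(ch in string.punctuation for ch in text),
        "n_numerics": sum(w.isdigit() for w in words),
        "n_uppercase_words": sum(w.isupper() for w in words),
    }

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sample = "In 1990 the WHO counted 60 cases!"
features = basic_features(sample, stopwords={"in", "the"})
print(features)
print(ngrams(sample.lower().split(), 2))
```

Punctuation stays attached to words here (`cases!`), which is one reason a real tokenizer such as spaCy's is preferable for anything beyond quick counts.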

https://www.actuia.com/contribution/victorbigand/tutoriel-tal-pour-les-debutants-classification-de-texte/

Text classification (spam in YouTube comments)

https://towardsdatascience.com/data-scientists-guide-to-summarization-fc0db952e363

**NLTK Summarizer**

We wanted to start our text summarization journey by trying something simple. So we turned to NLTK, the popular NLP package in Python. The idea here was to summarize by identifying "top" sentences based on word frequency.
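
The word-frequency idea can be sketched as follows. This is not the article's NLTK code: it uses only the standard library, a naive regular-expression sentence splitter, and a hypothetical `summarize` helper, and the sample text and stop-word set are made up.

```python
import re
from collections import Counter

def summarize(text, stopwords, n_sentences=2):
    """Keep the n highest-scoring sentences, in their original order."""
    # Naive sentence split on ., ! or ? followed by whitespace
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    words = [w for w in re.findall(r"\w+", text.lower()) if w not in stopwords]
    freq = Counter(words)

    def score(sentence):
        # Sum of the text-wide frequencies of the sentence's content words
        return sum(freq[w] for w in re.findall(r"\w+", sentence.lower())
                   if w not in stopwords)

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]),
                    reverse=True)
    keep = sorted(ranked[:n_sentences])
    return " ".join(sentences[i] for i in keep)

sample = ("The train leaves at midnight. The train carries people home. "
          "Nobody is awake. People love the midnight train.")
summary = summarize(sample, stopwords={"the", "at", "is"}, n_sentences=2)
print(summary)
```

Summing raw frequencies favors long sentences; dividing the score by the sentence length is a common adjustment.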

In [3]:
nlp = spacy.load('fr_core_news_md') # en_core_web_sm

from spacy.lang.fr.stop_words import STOP_WORDS
STOP_WORDS.add("blabla")
stopwords = STOP_WORDS
print(stopwords)

{'cinquième', 'encore', 't’', 'ouverte', 'touchant', 'uns', 'tiens', 'toi-même', 'superpose', 'tel', 'peut', 'cher', 'euh', 'de', 'personne', 'ces', 'si', 'suit', 'probable', 'chère', 'premier', 'la', 'comme', 'il', 'siens', 'exactement', 'ouf', 'bas', 'semblable', 'tac', 'retour', 'rare', 'auraient', 'à', 'lorsque', 'anterieure', 'dix-sept', 'pan', 'probante', 'basee', 'celui-ci', 'suivante', 'sinon', 'maximale', 'naturel', 'seront', 'possibles', 'passé', 'elles-mêmes', 'extenso', 'tente', 'six', 'certaine', 'hélas', 'directement', 'mêmes', 'tenir', 'tienne', 'vifs', 'près', 'ainsi', 'neuvième', 'peux', 'ont', 'malgre', 'nombreuses', 'vont', 'aux', 'lesquelles', 'celles-là', 'parseme', 'autrement', 'peuvent', 'trois', 'enfin', 'oh', 'dessous', 'aie', 'vôtres', 'auquel', 'olé', 'ta', 'deuxièmement', 'merci', 'chacune', 'soixante', 'ai', 'particulier', 'fais', "c'", 'chut', 'assez', 'semblaient', 'clac', 'mon', 'chères', 'attendu', 'restent', 'je', 'hep', 'cela', 'très

In [4]:
# text = "Ceci est un exemple de texte destiné à tester les différentes librairies Python et à préparer une démonstration. Ce texte a été préparé par Gabriel pour LinkyStat."
# text = 'European authorities fined Google a record $5.1 billion on Wednesday for abusing its power in the mobile phone market and ordered the company to alter its practices'
# text = "The engine coughs and shakes its head. The smoke, a plume of red and white, Waves madly in the face of night. And now the grave incurious stars Gleam on the groaning hurrying cars. Against the kind and awful reign Of darkness, this our angry train, A noisy little rebel, pouts Its brief defiance, flames and shouts-- And passes on, and leaves no trace. For darkness holds its ancient place, Serene and absolute, the king Unchanged, of every living thing. The houses lie obscure and still In Rutherford and Carlton Hill. Our lamps intensify the dark Of slumbering Passaic Park. And quiet holds the weary feet That daily tramp through Prospect Street. What though we clang and clank and roar Through all Passaic's streets? No door Will open, not an eye will see Who this loud vagabond may be. Upon my crimson cushioned seat, In manufactured light and heat, I feel unnatural and mean. Outside the towns are cool and clean; Curtained awhile from sound and sight They take God's gracious gift of night. The stars are watchful over them. On Clifton as on Bethlehem The angels, leaning down the sky, Shed peace and gentle dreams. And I-- I ride, I blasphemously ride Through all the silent countryside. The engine's shriek, the headlight's glare, Pollute the still nocturnal air. The cottages of Lake View sigh And sleeping, frown as we pass by. Why, even strident Paterson Rests quietly as any nun. Her foolish warring children keep The grateful armistice of sleep. For what tremendous errand's sake Are we so blatantly awake? What precious secret is our freight? What king must be abroad so late? Perhaps Death roams the hills to-night And we rush forth to give him fight. Or else, perhaps, we speed his way To some remote unthinking prey. Perhaps a woman writhes in pain And listens--listens for the train! The train, that like an angel sings, The train, with healing on its wings. Now \"Hawthorne!\" the conductor cries. My neighbor starts and rubs his eyes. He hurries yawning through the car And steps out where the houses are. This is the reason of our quest! Not wantonly we break the rest Of town and village, nor do we Lightly profane night's sanctity. What Love commands the train fulfills, And beautiful upon the hills Are these our feet of burnished steel. Subtly and certainly I feel That Glen Rock welcomes us to her And silent Ridgewood seems to stir And smile, because she knows the train Has brought her children back again. We carry people home--and so God speeds us, wheresoe'er we go. Hohokus, Waldwick, Allendale Lift sleepy heads to give us hail. In Ramsey, Mahwah, Suffern stand Houses that wistfully demand A father--son--some human thing That this, the midnight train, may bring. The trains that travel in the day They hurry folks to work or play. The midnight train is slow and old But of it let this thing be told, To its high honor be it said It carries people home to bed. My cottage lamp shines white and clear. God bless the train that brought me here."
text = "Ce jour-là, 25 mars dernier, Pétersbourg fut le théâtre d’une aventure des plus étranges. Le barbier Ivan Yakovlévitch, domicilié avenue de l’Ascension (son nom de famille est perdu et son enseigne ne porte que l’inscription : On pratique aussi les saignées, au-dessous d’un monsieur à la joue barbouillée de savon), le barbier Ivan Yakovlévitch se réveilla d’assez bonne heure et perçut une odeur de pain chaud. S’étant mis sur son séant, il vit que son épouse – personne plutôt respectable et qui prisait fort le café – défournait des pains tout frais cuits. « Aujourd’hui, Prascovie Ossipovna, je ne prendrai pas de café, déclara Ivan Yakovlévitch ; je préfère grignoter un bon pain chaud avec de la ciboule. » À la vérité, Ivan Yakovlévitch aurait bien voulu et pain et café, mais il jugeait impossible de demander les deux choses à la fois, Prascovie Ossipovna ne tolérant pas de semblables caprices. « Tant mieux, se dit la respectable épouse en jetant un pain sur la table. Que mon nigaud s’empiffre de pain ! Il me restera davantage de café. » Respectueux des convenances, Ivan Yakovlévitch passa son habit par-dessus sa chemise et se mit en devoir de déjeuner. Il posa devant lui une pincée de sel, nettoya deux oignons, prit son couteau et, la mine grave, coupa son pain en deux. Il aperçut alors, à sa grande surprise, un objet blanchâtre au beau milieu ; il le tâta précautionneusement du couteau, le palpa du doigt… « Qu’est-ce que cela peut bien être ? » se dit-il en éprouvant de la résistance. Il fourra alors ses doigts dans le pain et en retira… un nez ! Les bras lui en tombèrent."


In [5]:
print(text)

Ce jour-là, 25 mars dernier, Pétersbourg fut le théâtre d’une aventure des plus étranges. Le barbier Ivan Yakovlévitch, domicilié avenue de l’Ascension (son nom de famille est perdu et son enseigne ne porte que l’inscription : On pratique aussi les saignées, au-dessous d’un monsieur à la joue barbouillée de savon), le barbier Ivan Yakovlévitch se réveilla d’assez bonne heure et perçut une odeur de pain chaud. S’étant mis sur son séant, il vit que son épouse – personne plutôt respectable et qui prisait fort le café – défournait des pains tout frais cuits. « Aujourd’hui, Prascovie Ossipovna, je ne prendrai pas de café, déclara Ivan Yakovlévitch ; je préfère grignoter un bon pain chaud avec de la ciboule. » À la vérité, Ivan Yakovlévitch aurait bien voulu et pain et café, mais il jugeait impossible de demander les deux choses à la fois, Prascovie Ossipovna ne tolérant pas de semblables caprices. « Tant mieux, se dit la respectable épous

In [27]:
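# Tokenization, then stop-word filtering
# (the test is case-sensitive, so capitalized stop words such as "Ce" are kept)
doc = nlp(text)
tokens = [token for token in doc if token.text not in stopwords]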

In [28]:
print(tokens)

[Ce, jour, -, ,, 25, mars, ,, Pétersbourg, fut, théâtre, aventure, étranges, ., Le, barbier, Ivan, Yakovlévitch, ,, domicilié, avenue, Ascension, (, nom, famille, perdu, enseigne, porte, inscription, :, On, pratique, saignées, ,, -, monsieur, joue, barbouillée, savon, ), ,, barbier, Ivan, Yakovlévitch, réveilla, bonne, heure, perçut, odeur, pain, chaud, ., S’, mis, séant, ,, vit, épouse, –, respectable, prisait, fort, café, –, défournait, pains, frais, cuits, ., «, Aujourd’hui, ,, Prascovie, Ossipovna, ,, prendrai, café, ,, déclara, Ivan, Yakovlévitch, ;, préfère, grignoter, bon, pain, chaud, ciboule, ., », À, vérité, ,, Ivan, Yakovlévitch, voulu, pain, café, ,, jugeait, impossible, demander, choses, fois, ,, Prascovie, Ossipovna, tolérant, semblables, caprices, ., «, Tant, mieux, ,, respectable, épouse, jetant, pain, table, ., Que, nigaud, empiffre, pain, !, Il, restera, davantage, café, ., », Respectueux, convenances, ,, Ivan, Yakovlévitch, pa

In [30]:
# Lemmatization
for token in tokens:
    print('Original : %s, New: %s' % (token.text, token.lemma_))

Original : Ce, New: ce
Original : jour, New: jour
Original : -, New: -
Original : ,, New: ,
Original : 25, New: 25
Original : mars, New: mars
Original : ,, New: ,
Original : Pétersbourg, New: Pétersbourg
Original : fut, New: être
Original : théâtre, New: théâtre
Original : aventure, New: aventure
Original : étranges, New: étrange
Original : ., New: .
Original : Le, New: le
Original : barbier, New: barbier
Original : Ivan, New: Ivan
Original : Yakovlévitch, New: Yakovlévitch
Original : ,, New: ,
Original : domicilié, New: domicilier
Original : avenue, New: avenue
Original : Ascension, New: ascension
Original : (, New: (
Original : nom, New: nom
Original : famille, New: famille
Original : perdu, New: perdre
Original : enseigne, New: enseigne
Original : porte, New: porte
Original : inscription, New: inscription
Original : :, New: :
Original : On, New: on
Original : pratique, New: pratique
Original : saignées, New: saignée
Original : ,, New: ,
Original : -, New: -
Original : 

In [31]:
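# POS tags

# The tags listed below follow the Penn Treebank conventions used by spaCy's
# English models (NP is a phrase label rather than a token tag). The French
# model instead returns Universal Dependencies tags with morphological
# features, as the output of this cell shows.
# NP: noun phrase
# DT: determiner
# JJ: adjective
# JJS: adjective, superlative
# NN: noun
# NNP: proper noun, singular
# NNS: noun, plural
# IN: preposition or subordinating conjunction
# VBD: verb, past tense
# VBZ: verb, 3rd person singular present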
for token in tokens:
    print('Word: %s, POS: %s' % (token.text, token.tag_))

Word: Ce, POS: DET__Gender=Masc|Number=Sing|PronType=Dem
Word: jour, POS: NOUN__Gender=Masc|Number=Sing
Word: -, POS: PUNCT___
Word: ,, POS: PUNCT___
Word: 25, POS: NUM__NumType=Card
Word: mars, POS: NOUN__Gender=Masc|Number=Sing
Word: ,, POS: PUNCT___
Word: Pétersbourg, POS: PROPN__Gender=Masc|Number=Sing
Word: fut, POS: AUX__Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin
Word: théâtre, POS: NOUN__Gender=Masc|Number=Sing
Word: aventure, POS: NOUN__Gender=Fem|Number=Sing
Word: étranges, POS: ADJ__Gender=Masc|Number=Plur
Word: ., POS: PUNCT___
Word: Le, POS: DET__Definite=Def|Gender=Masc|Number=Sing|PronType=Art
Word: barbier, POS: NOUN__Gender=Masc|Number=Sing
Word: Ivan, POS: PROPN__Gender=Masc|Number=Sing
Word: Yakovlévitch, POS: PROPN___
Word: ,, POS: PUNCT___
Word: domicilié, POS: VERB__Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part
Word: avenue, POS: VERB__Gender=Fem|Number=Sing|Tense=Past|VerbForm=Part
Word: Ascension, POS: NOUN__Gender=Fem|Number=Sing
Word: (, POS

In [10]:
# Lemmatization and POS tags
[(token.orth_, token.pos_, token.lemma_)
 for token in doc
 if not token.is_stop and token.pos_ != 'PUNCT']

[('jour', 'NOUN', 'jour'),
 ('25', 'NUM', '25'),
 ('mars', 'NOUN', 'mars'),
 ('Pétersbourg', 'PROPN', 'Pétersbourg'),
 ('fut', 'AUX', 'être'),
 ('théâtre', 'NOUN', 'théâtre'),
 ('aventure', 'NOUN', 'aventure'),
 ('étranges', 'ADJ', 'étrange'),
 ('barbier', 'NOUN', 'barbier'),
 ('Ivan', 'PROPN', 'Ivan'),
 ('Yakovlévitch', 'PROPN', 'Yakovlévitch'),
 ('domicilié', 'VERB', 'domicilier'),
 ('avenue', 'VERB', 'avenue'),
 ('Ascension', 'NOUN', 'ascension'),
 ('nom', 'NOUN', 'nom'),
 ('famille', 'NOUN', 'famille'),
 ('perdu', 'VERB', 'perdre'),
 ('enseigne', 'NOUN', 'enseigne'),
 ('porte', 'VERB', 'porte'),
 ('inscription', 'NOUN', 'inscription'),
 ('pratique', 'NOUN', 'pratique'),
 ('saignées', 'NOUN', 'saignée'),
 ('monsieur', 'NOUN', 'Monsieur'),
 ('joue', 'NOUN', 'joue'),
 ('barbouillée', 'VERB', 'barbouiller'),
 ('savon', 'NOUN', 'savon'),
 ('barbier', 'NOUN', 'barbier'),
 ('Ivan', 'PROPN', 'Ivan'),
 ('Yakovlévitch', 'PROPN', 'Yakovlévitch'),
 ('réveilla', 'VERB', 'réve

In [33]:
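# Stemming
# Note: the Porter stemmer was designed for English, so the stems it
# produces for French words are only approximate.
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
for token in tokens:
    print('Original : %s, Root form: %s' % (token.text, stemmer.stem(token.text)))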

Original : Ce, Root form: Ce
Original : jour, Root form: jour
Original : -, Root form: -
Original : ,, Root form: ,
Original : 25, Root form: 25
Original : mars, Root form: mar
Original : ,, Root form: ,
Original : Pétersbourg, Root form: pétersbourg
Original : fut, Root form: fut
Original : théâtre, Root form: théâtre
Original : aventure, Root form: aventur
Original : étranges, Root form: étrang
Original : ., Root form: .
Original : Le, Root form: Le
Original : barbier, Root form: barbier
Original : Ivan, Root form: ivan
Original : Yakovlévitch, Root form: yakovlévitch
Original : ,, Root form: ,
Original : domicilié, Root form: domicilié
Original : avenue, Root form: avenu
Original : Ascension, Root form: ascens
Original : (, Root form: (
Original : nom, Root form: nom
Original : famille, Root form: famil
Original : perdu, Root form: perdu
Original : enseigne, Root form: enseign
Original : porte, Root form: port
Original : inscription, Root form: inscript
Original : :, Roo
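
The Porter stemmer used above was designed for English. For French text, NLTK also provides a Snowball stemmer with French rules. The sketch below is a suggested alternative rather than part of the original notebook; the word list is taken from the sample text.

```python
from nltk.stem.snowball import SnowballStemmer

# The Snowball stemmer is rule-based and needs no extra corpus download
stemmer = SnowballStemmer("french")

words = ["pain", "pains", "étranges", "respectable", "convenances"]
for word in words:
    print('Original : %s, Root form: %s' % (word, stemmer.stem(word)))
```

Unlike the lemmatizer, a stemmer only strips suffixes, so the result is not guaranteed to be a dictionary word.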

In [14]:
# Named-Entity Recognition (NER)

for entity in doc.ents:
    print(entity.text + ' - ' + entity.label_ + ' - ' + str(spacy.explain(entity.label_)))

Pétersbourg - LOC - Non-GPE locations, mountain ranges, bodies of water
Ivan Yakovlévitch - PER - Named person or family.
Ivan Yakovlévitch - PER - Named person or family.
Prascovie Ossipovna - PER - Named person or family.
Ivan Yakovlévitch - PER - Named person or family.
Ivan Yakovlévitch - PER - Named person or family.
Prascovie Ossipovna - PER - Named person or family.
» Respectueux des convenances - MISC - Miscellaneous entities, e.g. events, nationalities, products or works of art
Ivan Yakovlévitch - PER - Named person or family.


In [35]:
# Detecting noun chunks (noun phrases)

for noun in doc.noun_chunks:
    print(noun.text)

Ce jour-là, 25 mars dernier, Pétersbourg fut le théâtre d’une aventure des plus étranges.
Le barbier Ivan Yakovlévitch, domicilié avenue de l’Ascension (
son nom de famille
et son enseigne
On
les saignées
à la joue barbouillée de savon
le barbier Ivan Yakovlévitch
une odeur de pain chaud
il
son épouse – personne plutôt respectable et qui prisait fort le café –
des pains
tout frais cuits
je ne prendrai pas de café, déclara Ivan Yakovlévitch
je
un bon pain chaud avec de la ciboule
Ivan Yakovlévitch
et pain et café
il
les deux choses
Prascovie Ossipovna
de semblables caprices
la respectable épouse en jetant un pain sur la table
mon nigaud
de pain
Il
me
davantage de café
» Respectueux des convenances
Ivan Yakovlévitch
son habit
Il
une pincée de sel
deux oignons
son couteau et, la mine grave
son pain
Il
un objet blanchâtre au beau milieu
il
du couteau
le palpa
du doigt
-ce
cela
il en éprouvant de la résistance
Il
ses doigts
un nez
Les bras lui


In [36]:
# token.ent_iob_ uses the IOB scheme:
# B: the token begins an entity
# I: the token is inside an entity
# O: the token is outside any entity
# (an empty string means that no entity tag is set)
# The filtered `tokens` list is used here, which matches the output below.

print([(token, token.ent_iob_, token.ent_type_) for token in tokens])

[(Ce, 'O', ''), (jour, 'O', ''), (-, 'O', ''), (,, 'O', ''), (25, 'O', ''), (mars, 'O', ''), (,, 'O', ''), (Pétersbourg, 'B', 'LOC'), (fut, 'O', ''), (théâtre, 'O', ''), (aventure, 'O', ''), (étranges, 'O', ''), (., 'O', ''), (Le, 'O', ''), (barbier, 'O', ''), (Ivan, 'B', 'PER'), (Yakovlévitch, 'I', 'PER'), (,, 'O', ''), (domicilié, 'O', ''), (avenue, 'O', ''), (Ascension, 'O', ''), ((, 'O', ''), (nom, 'O', ''), (famille, 'O', ''), (perdu, 'O', ''), (enseigne, 'O', ''), (porte, 'O', ''), (inscription, 'O', ''), (:, 'O', ''), (On, 'O', ''), (pratique, 'O', ''), (saignées, 'O', ''), (,, 'O', ''), (-, 'O', ''), (monsieur, 'O', ''), (joue, 'O', ''), (barbouillée, 'O', ''), (savon, 'O', ''), (), 'O', ''), (,, 'O', ''), (barbier, 'O', ''), (Ivan, 'B', 'PER'), (Yakovlévitch, 'I', 'PER'), (réveilla, 'O', ''), (bonne, 'O', ''), (heure, 'O', ''), (perçut, 'O', ''), (odeur, 'O', ''), (pain, 'O', ''), (chaud, 'O', ''), (., 'O', ''), (S’, 'O', ''), (mis, 'O', ''), (séant, 'O', ''), (,
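
To make the IOB tags concrete, the sketch below groups `(text, iob, type)` triples back into entity spans. It is plain Python with no spaCy dependency. `iob_to_entities` is a hypothetical helper name, not a spaCy function, and the triples are a short excerpt of the output above. Joining tokens with a space is a simplification that ignores the original whitespace.

```python
def iob_to_entities(triples):
    """Group (text, iob, ent_type) triples into (entity_text, label) pairs."""
    entities = []
    current_tokens, current_label = [], None
    for text, iob, ent_type in triples:
        if iob == "B" or (iob == "I" and current_label != ent_type):
            # A new entity starts: flush the previous one, if any
            if current_tokens:
                entities.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [text], ent_type
        elif iob == "I":
            current_tokens.append(text)
        else:
            # "O" (or an empty tag): close any open entity
            if current_tokens:
                entities.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None
    if current_tokens:
        entities.append((" ".join(current_tokens), current_label))
    return entities

triples = [
    ("Pétersbourg", "B", "LOC"), ("fut", "O", ""), ("théâtre", "O", ""),
    ("Le", "O", ""), ("barbier", "O", ""),
    ("Ivan", "B", "PER"), ("Yakovlévitch", "I", "PER"), (",", "O", ""),
]
entities = iob_to_entities(triples)
print(entities)
```

In spaCy itself this grouping is already available through `doc.ents`, so the helper is only useful for understanding the tags or for data that arrives in IOB form.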

In [17]:
displacy.render(nlp(str(text)), jupyter=True, style='ent')

In [18]:
displacy.render(nlp(str(text)), style='dep', jupyter = True, options = {'distance': 120})

# Draft

## Matcher

- Lists of dictionaries, one per token
- Match exact token texts `[{"TEXT": "iPhone"}, {"TEXT": "X"}]`
- Match lexical attributes `[{"LOWER": "iphone"}, {"LOWER": "x"}]`
- Match any token attributes `[{"LEMMA": "buy"}, {"POS": "NOUN"}]`

In [None]:
import spacy

# Import the Matcher
from spacy.matcher import Matcher

# Load a model and create the nlp object
nlp = spacy.load("fr_core_news_md")

# Initialize the matcher with the shared vocab
matcher = Matcher(nlp.vocab)

# Add the pattern to the matcher
# (spaCy v3 signature; spaCy v2 used matcher.add("IPHONE_PATTERN", None, pattern))
pattern = [{"TEXT": "iPhone"}, {"TEXT": "X"}]
matcher.add("IPHONE_PATTERN", [pattern])

# Process some text
doc = nlp("Upcoming iPhone X release date leaked")

# Call the matcher on the doc
matches = matcher(doc)

In [None]:
pattern = [
    {"IS_DIGIT": True},
    {"LOWER": "fifa"},
    {"LOWER": "world"},
    {"LOWER": "cup"},
    {"IS_PUNCT": True}
]
doc = nlp("2018 FIFA World Cup: France won!")
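
The pattern above is defined but never registered or applied. A self-contained sketch of the remaining steps could look like this. It assumes the spaCy v3 `Matcher.add` signature, uses the blank `English` pipeline because the pattern relies only on lexical attributes, and the name `FIFA_PATTERN` is chosen for illustration.

```python
from spacy.lang.en import English
from spacy.matcher import Matcher

nlp = English()  # tokenizer only; no trained model is required here
matcher = Matcher(nlp.vocab)

pattern = [
    {"IS_DIGIT": True},
    {"LOWER": "fifa"},
    {"LOWER": "world"},
    {"LOWER": "cup"},
    {"IS_PUNCT": True},
]
# spaCy v3 signature: a match name and a list of patterns
matcher.add("FIFA_PATTERN", [pattern])

doc = nlp("2018 FIFA World Cup: France won!")

# Each match is a (match_id, start, end) tuple of token indices
matched_texts = [doc[start:end].text for match_id, start, end in matcher(doc)]
print(matched_texts)
```

The `start` and `end` values are token indices, so slicing the `doc` with them returns the matched `Span`.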