spaCy course
------------

In [1]:
# import requests as rq
# from bs4 import BeautifulSoup
from tqdm import tqdm
import pickle
# import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('ggplot')

Let's load the texts describing each artist.

In [2]:
with open("artistes.txt", "rb") as f:
    depick = pickle.Unpickler(f)
    artistes = depick.load()
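
A shorter equivalent of the cell above is `pickle.load(f)`. The sketch below round-trips a small stand-in dictionary to check that both calls return the same object. The names and descriptions are made up for illustration, and a temporary file replaces `artistes.txt`.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the real artist dictionary (name -> description).
sample = {"Artist A": "Description A", "Artist B": "Description B"}

# Write the pickle to a temporary file instead of artistes.txt.
with tempfile.NamedTemporaryFile(suffix=".pkl", delete=False) as tmp:
    pickle.dump(sample, tmp)
    path = tmp.name

# Explicit Unpickler, as in the cell above.
with open(path, "rb") as f:
    via_unpickler = pickle.Unpickler(f).load()

# Shorter equivalent.
with open(path, "rb") as f:
    via_load = pickle.load(f)

os.remove(path)
print(via_unpickler == via_load == sample)
```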

We will use one of the descriptions. Here we take the fourth one (index 3).

In [3]:
text = list(artistes.values())[3]

In [4]:
text[:60]

'\n\t\t\t\tPages pour les contributeurs déconnectés en savoir plus'

Let's import spacy and initialize the fr_core_news_md model/package.

In [5]:
import spacy

nlp = spacy.load("fr_core_news_md")

We have to create a Doc object.

In [6]:
doc = nlp(text)

Let's see how long the Doc object is.

In [7]:
len(doc)

2070

It contains 2070 tokens.

We can print some tokens.

In [8]:
for token in list(doc)[:70]:
    print(token)


				
Pages
pour
les
contributeurs
déconnectés
en
savoir
plus


Pour
les
articles
homonymes
,
voir
Limbourg
(
homonymie
)
.


Ne
doit
pas
être
confondu
avec
Frères
(
Belgique
)
.


Les
Frères
de
Limbourg
(
Paul
,
Herman
en
Johan
,
Gebroeders
Van
Lymborch
)
,
de
Gueldre
,
nés
vers
1380
à
Nimègue
,
Pays
-
Bas
,
sont
des
peintres
et
enlumineurs
néerlandais
.


Note that the text is split sensibly: each punctuation mark becomes its own token. spaCy's tokenizer relies on language-specific rules and exceptions to decide when punctuation inside a word is kept and when it is split off. Here, for example, `Pays-Bas` becomes `Pays`, `-`, `Bas`.
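
Tokenization can be tried without a trained model. The sketch below assumes only that spaCy is installed: `spacy.blank("fr")` builds a French pipeline that contains the tokenizer and nothing else. The sample sentence is adapted from the text above.

```python
import spacy

# A blank French pipeline contains only the tokenizer,
# so no trained model has to be downloaded for this check.
nlp_blank = spacy.blank("fr")

sample = "Ils sont nés vers 1380 à Nimègue, Pays-Bas."
tokens = [token.text for token in nlp_blank(sample)]
print(tokens)
```

The comma and the final period should come out as separate tokens, as in the output above.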

Let's take the sentences of the text with the `sents` attribute.

In [9]:
sentences = doc.sents

In [10]:
# show some of the sentences
for sent in list(sentences)[:50]:
    print(sent)


				Pages
pour les contributeurs déconnectés en savoir plus
Pour les articles homonymes, voir Limbourg (homonymie)
.
Ne doit pas être confondu avec Frères (Belgique).

Les Frères de Limbourg (Paul, Herman en Johan, Gebroeders Van Lymborch), de Gueldre, nés vers 1380 à Nimègue, Pays-Bas, sont des peintres et enlumineurs néerlandais.
Ils sont issus d'une famille de peintres blasonneurs, fils d'un sculpteur sur bois et neveux du peintre Jean Malouel.
Ils sont célèbres pour les livres enluminés pour le duc de Berry, Jean Ier, et notamment Belles Heures du duc de Berry mais surtout la majeure partie des enluminures du livre d'heures nommé Les Très Riches Heures du duc de Berry.
Ils laissent cette dernière œuvre inachevée quand ils meurent de la peste en 1416, la même année que leur commanditaire.
Leur grand-père, Johannes de Lymborch, est probablement originaire du village de Limbourg situé sur la Vesdre.
Il s'installe à Nimègue, alors capitale du Duché de Gueldre : les archives conservent

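Sentence boundaries are set by the pipeline (by default from the dependency parse), so most sentences are recognized correctly. The segmentation is not perfect, though: `Pages` is cut off from the rest of its line, and a `.` ends up as a sentence of its own.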

`doc.sents` is a generator, so it can be iterated only once: after the loop above it is already exhausted. Its `close` method ends the generator explicitly.

In [11]:
sentences.close()

In [12]:
# show some of the sentences
for sent in list(sentences)[:50]:
    print(sent) # nothing is shown: the generator is exhausted
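
The behaviour above is plain Python generator behaviour and does not depend on spaCy. A minimal sketch, using a hypothetical `fake_sents` function as a stand-in for `doc.sents`:

```python
def fake_sents():
    """Hypothetical stand-in for doc.sents: yields sentences one at a time."""
    yield "Première phrase."
    yield "Deuxième phrase."

gen = fake_sents()
first_pass = list(gen)    # consumes the generator
second_pass = list(gen)   # already exhausted, so this is empty

closed = fake_sents()
next(closed)              # read one sentence...
closed.close()            # ...then end the generator early
after_close = list(closed)

# Materializing once keeps the sentences available for repeated use.
kept = list(fake_sents())
print(len(first_pass), len(second_pass), len(after_close), len(kept))
```

Storing `list(doc.sents)` in a variable once avoids having to recreate the generator each time the sentences are needed.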

Let's get a fresh generator so the sentences can be read again.

In [13]:
sentences = doc.sents


It is possible to take the second token of a given sentence.

In [14]:
# let's retrieve the fifth sentence (index 4)
sentence6 = list(sentences)[4]

In [15]:
sentence6

Ils sont issus d'une famille de peintres blasonneurs, fils d'un sculpteur sur bois et neveux du peintre Jean Malouel.

In [16]:
# take the second token by indexing
token2 = sentence6[1]

In [17]:
print(token2)

sont


In [18]:
type(token2)

spacy.tokens.token.Token

We see that the token2 variable is a `Token` object.

#### Token components or attributes

- text: the text of the token.

In [19]:
token2.text

'sont'

- left_edge: the leftmost token of this token's syntactic subtree (the token itself plus all its descendants in the dependency parse).

In [25]:
token2.left_edge

sont

- right_edge: the rightmost token of this token's syntactic subtree.

In [26]:
token2.right_edge

sont

Both attributes return `sont` itself because, in this parse, `sont` has no dependents: its subtree contains only the token itself.
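
To see edges that differ from the token itself, we need a token that has dependents. The sketch below builds a small `Doc` by hand, with no trained model. The `heads` list is an assumption written for this illustration: it gives the index of each token's head, following the relations printed for this sentence further down.

```python
import spacy
from spacy.tokens import Doc

# Only the vocabulary of a blank French pipeline is needed here.
vocab = spacy.blank("fr").vocab

words = ["Ils", "sont", "issus", "d'", "une", "famille"]
# Hand-written parse (assumption): index of each token's head.
heads = [2, 2, 2, 5, 5, 2]
deps = ["nsubj", "cop", "ROOT", "case", "det", "obl:arg"]
toy = Doc(vocab, words=words, heads=heads, deps=deps)

for token in toy:
    print(token.text, "->", token.left_edge.text, "...", token.right_edge.text)
```

In this toy parse, `famille` governs `d'` and `une`, so its left edge is `d'`, while the root `issus` spans the whole fragment.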

- ent_type_: the entity type (label) of the token. An empty string means the token is not part of any entity.

In [22]:
token2.ent_type_

''

- ent_iob_: the IOB code of the token's entity annotation: 'B' if the token begins an entity, 'I' if it is inside an entity, 'O' if it is outside any entity, and '' if no entity tag is set.

In [23]:
token2.ent_iob_

'O'
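
The three IOB codes are easier to see on a multi-token entity. A minimal sketch, assuming a hand-built `Doc` whose entity tags are written manually, not predicted by a model:

```python
import spacy
from spacy.tokens import Doc

vocab = spacy.blank("fr").vocab

words = ["du", "peintre", "Jean", "Malouel", "."]
# Hand-written token-level tags (assumption): B- begins an entity, I- continues it.
ents = ["O", "O", "B-PER", "I-PER", "O"]
toy = Doc(vocab, words=words, ents=ents)

for token in toy:
    print(token.text, token.ent_iob_, repr(token.ent_type_))

# The B/I tokens are grouped into a single entity span.
print([(ent.text, ent.label_) for ent in toy.ents])
```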

- lemma_: the lemma (base form) of the token.

In [24]:
token2.lemma_

'être'

- morph: the morphological analysis of the token in its sentence (mood, number, person, tense, and so on).

In [27]:
token2.morph

Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin
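
Individual features can be read from the morphological analysis without parsing the string by hand. A minimal sketch, assuming a hand-built one-token `Doc` that carries the same feature string as the output above:

```python
import spacy
from spacy.tokens import Doc

vocab = spacy.blank("fr").vocab

# Feature string copied from the output above and attached manually.
features = "Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin"
toy = Doc(vocab, words=["sont"], morphs=[features])

morph = toy[0].morph
print(morph.get("Number"))   # list of values for one feature
print(morph.to_dict())       # all features as a dictionary
```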

- pos_: the coarse-grained part-of-speech tag of the token.

In [28]:
token2.pos_

'AUX'

- dep_: the syntactic dependency relation between the token and its head, that is, the role the token plays in the sentence.

In [29]:
token2.dep_

'cop'

- lang_: the language of the Doc object the token belongs to.

In [30]:
token2.lang_

'fr'

Let's take a sentence and print some attributes.

In [31]:
text = "Mike aimes jouer au football."
doc2 = nlp(text)
print(doc2)

Mike aimes jouer au football.


In [32]:
for token in doc2:
    print(token.text, token.pos_, token.dep_)

Mike PROPN nsubj
aimes PROPN ROOT
jouer VERB xcomp
au ADP case
football NOUN obl:arg
. PUNCT punct
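
The tag and relation names are abbreviations. `spacy.explain` returns a short description for the labels it knows and `None` for the others. A small sketch over some of the labels printed above:

```python
import spacy

labels = ["PROPN", "AUX", "VERB", "ADP", "NOUN", "PUNCT",
          "nsubj", "cop", "xcomp", "case", "punct"]

# Map each label to its description (None if spaCy has no entry for it).
explanations = {label: spacy.explain(label) for label in labels}
for label, meaning in explanations.items():
    print(f"{label:6} {meaning}")
```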


We can also draw a diagram of the dependency relations between the tokens of the sentence.

In [33]:
from spacy import displacy
displacy.render(doc2, style = "dep")
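
Outside a notebook, the diagram can be obtained as an SVG string and written to a file. A minimal sketch, assuming a hand-built three-token `Doc` (the heads and relations are written manually) and a hypothetical output file name:

```python
import os
import tempfile

import spacy
from spacy import displacy
from spacy.tokens import Doc

vocab = spacy.blank("fr").vocab

# Hand-written parse (assumption) for a short fragment.
toy = Doc(vocab, words=["Ils", "sont", "issus"],
          heads=[2, 2, 2], deps=["nsubj", "cop", "ROOT"])

# jupyter=False makes render return the markup instead of displaying it.
svg = displacy.render(toy, style="dep", jupyter=False)

# Hypothetical output location for the illustration.
out_path = os.path.join(tempfile.gettempdir(), "dependencies.svg")
with open(out_path, "w", encoding="utf-8") as f:
    f.write(svg)
print(out_path)
```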

Let's repeat the two previous steps with the sentence we took from the artist descriptions.

In [37]:
sentence6

Ils sont issus d'une famille de peintres blasonneurs, fils d'un sculpteur sur bois et neveux du peintre Jean Malouel.

Print the POS tags and syntactic relations.

In [39]:
for token in sentence6:
    print(f"{token}:\n Pos tag -> {token.pos_};\n Synt rel -> {token.dep_}")

Ils:
 Pos tag -> PRON;
 Synt rel -> nsubj
sont:
 Pos tag -> AUX;
 Synt rel -> cop
issus:
 Pos tag -> ADJ;
 Synt rel -> ROOT
d':
 Pos tag -> ADP;
 Synt rel -> case
une:
 Pos tag -> DET;
 Synt rel -> det
famille:
 Pos tag -> NOUN;
 Synt rel -> obl:arg
de:
 Pos tag -> ADP;
 Synt rel -> case
peintres:
 Pos tag -> NOUN;
 Synt rel -> nmod
blasonneurs:
 Pos tag -> VERB;
 Synt rel -> amod
,:
 Pos tag -> PUNCT;
 Synt rel -> punct
fils:
 Pos tag -> NOUN;
 Synt rel -> conj
d':
 Pos tag -> ADP;
 Synt rel -> case
un:
 Pos tag -> DET;
 Synt rel -> det
sculpteur:
 Pos tag -> NOUN;
 Synt rel -> nmod
sur:
 Pos tag -> ADP;
 Synt rel -> case
bois:
 Pos tag -> NOUN;
 Synt rel -> nmod
et:
 Pos tag -> CCONJ;
 Synt rel -> cc
neveux:
 Pos tag -> PROPN;
 Synt rel -> conj
du:
 Pos tag -> ADP;
 Synt rel -> case
peintre:
 Pos tag -> NOUN;
 Synt rel -> nmod
Jean:
 Pos tag -> PROPN;
 Synt rel -> appos
Malouel:
 Pos tag -> PROPN;
 Synt rel -> flat:name
.:
 Pos tag -> PUNCT;
 Synt rel -> punct


Rendering the sentence's relations.


In [41]:
displacy.render(sentence6)

Some tags are wrong: for example, `blasonneurs` is tagged VERB and `neveux` PROPN. This is likely because we use the medium French model. The large model may give more accurate results.

#### Work with entities

Let's print the entities of the Doc object together with their labels (entity types).

In [42]:
for ent in doc.ents:
    print(ent.text, ent.label_)

Limbourg LOC
Frères LOC
Belgique LOC
Frères de Limbourg MISC
Paul PER
Herman PER
Johan LOC
Gebroeders Van Lymborch PER
Gueldre PER
Nimègue LOC
Pays-Bas LOC
Jean Malouel PER
duc de Berry PER
Jean Ier PER
Belles Heures du duc de Berry MISC
Les Très Riches Heures du duc de Berry MISC
Johannes de Lymborch PER
Limbourg LOC
Vesdre LOC
Nimègue LOC
Duché de Gueldre : les archives LOC
Arnold PER
Arnold PER
Mechteld Maelwael PER
Malouel PER
Herman PER
Hermant PER
Paul PER
Polleke ORG
Polequinen ORG
Johan LOC
Johanneke ORG
Jacquemin LOC
Gillequin PER
Jehanequin PER
Rutger PER
Arnold PER
Greta PER
Rutger PER
Sainte-Chapelle de Bourges LOC
Arnold PER
Herman PER
Jean PER
Paris LOC
Jean Malouel PER
Johan Maelwael PER
de France LOC
Bourgogne LOC
Alebret de Bolure PER
Luxembourg LOC
duc de Bourgogne PER
Philippe le Hardi PER
Jean Malouel PER
Paris LOC
Nimègue LOC
Bruxelles LOC
duché de Brabant LOC
Gueldre LOC
Philippe II de Bourgogne PER
Jean Malouel PER
Jean PER
Paul PER
Philippe II PER
Paris LOC
Jean
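
A long entity list is easier to read once it is summarized. The sketch below counts labels with `collections.Counter`. To stay self-contained it uses the first few (text, label) pairs copied from the output above. With the real document they would come from `[(ent.text, ent.label_) for ent in doc.ents]`.

```python
from collections import Counter

# First few (text, label) pairs copied from the output above.
pairs = [
    ("Limbourg", "LOC"), ("Frères", "LOC"), ("Belgique", "LOC"),
    ("Frères de Limbourg", "MISC"), ("Paul", "PER"), ("Herman", "PER"),
    ("Johan", "LOC"), ("Gebroeders Van Lymborch", "PER"),
]

# How many entities carry each label.
label_counts = Counter(label for _, label in pairs)

# Distinct entity texts labeled as persons.
people = sorted({text for text, label in pairs if label == "PER"})

print(label_counts.most_common())
print(people)
```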

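Some of them are labeled incorrectly (for example, `Johan` is tagged LOC and `Gueldre` PER), and one span even absorbs unrelated text (`Duché de Gueldre : les archives`). This is not surprising with the medium-sized model. A larger model may do better.

We will use a better model later.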

We can visualize named entities with `displacy`.

In [43]:
displacy.render(doc, style="ent")

Let's download the large French model.

!python -m spacy download fr_core_news_lg
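
Once a model package is installed, it is loaded by name with `spacy.load`. A minimal sketch, assuming a hypothetical helper `load_french_model` that tries the French packages from largest to smallest and falls back to a blank pipeline (tokenizer only) when none is installed:

```python
import spacy
from spacy.util import is_package

def load_french_model(preferred=("fr_core_news_lg", "fr_core_news_md", "fr_core_news_sm")):
    """Hypothetical helper: load the first installed package from `preferred`."""
    for name in preferred:
        if is_package(name):
            return spacy.load(name)
    # No trained package found: fall back to a tokenizer-only pipeline.
    return spacy.blank("fr")

nlp_fr = load_french_model()
print(nlp_fr.lang, nlp_fr.pipe_names)
```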