# **Lab Series 4**
## Syntactic Analysis: POS Tagging, Parsing, and Named Entity Recognition with spaCy


### **What is the spaCy Python package?**

**spaCy** is a free, open-source Python library for natural language processing (NLP) with many built-in features. It is increasingly popular for processing and analyzing text data.
spaCy can run several kinds of analysis on text in more than 50 languages: tokenization, lemmatization, POS tagging, named entity recognition, parsing, etc.
spaCy is the main alternative to NLTK (Natural Language Toolkit), the long-standing NLP library for Python, and it offers a number of innovations and built-in visualization options.

The **tokenizer** turns a string into a **Doc** object, a sequence of tokens whose text is available through **token.text**. spaCy then applies each pipeline component to the Doc, in order.
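As a minimal sketch of this step, the snippet below uses `spacy.blank("en")`, which builds an English pipeline containing only the tokenizer, so no trained model needs to be downloaded. The variable names `nlp_blank` and `doc_blank` are illustrative choices, not part of the lab.

```python
import spacy
from spacy.tokens import Doc

# Blank English pipeline: tokenizer only, no tagger, parser or NER
nlp_blank = spacy.blank("en")

# Calling the pipeline on a string returns a Doc object
doc_blank = nlp_blank("I prefer the morning flight through Denver.")

# Each element of the Doc is a Token; token.text holds its surface form
tokens = [token.text for token in doc_blank]
print(type(doc_blank).__name__)
print(tokens)

# No components run after the tokenizer in a blank pipeline
print(nlp_blank.pipe_names)
```

Because the pipeline is blank, attributes such as `token.pos_` or `doc.ents` stay empty here; they are only filled in once trained components are present.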

spaCy provides **statistical models** that run a **pipeline** which natively includes the following linguistic components: **POS tagger**, **parser**, and **NER**. A text classification component can be added to this pipeline. Each component can easily be enabled or disabled as needed.
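Enabling and disabling components can be sketched as follows. To keep the example independent of any downloaded model, it assumes a blank English pipeline with the rule-based `sentencizer` as a stand-in component; with a trained model the same calls would apply to `tagger`, `parser` or `ner`.

```python
import spacy

# Stand-in pipeline (assumption): blank English plus the rule-based sentencizer
nlp_demo = spacy.blank("en")
nlp_demo.add_pipe("sentencizer")
active_before = list(nlp_demo.pipe_names)

# Temporarily disable a component inside a with-block
with nlp_demo.select_pipes(disable=["sentencizer"]):
    active_inside = list(nlp_demo.pipe_names)

# The component is restored when the block exits
active_after = list(nlp_demo.pipe_names)

print(active_before, active_inside, active_after)

# With a trained model, components can also be disabled at load time, e.g.
# nlp = spacy.load("en_core_web_sm", disable=["ner"])
```

Disabling components that are not needed is mainly a way to speed up processing on large collections of text.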

spaCy ships with the following built-in pipeline components:

- The part-of-speech tagger sets the **token.tag_** and **token.pos_** attributes.

- The lemmatizer sets **token.lemma_**.

- The dependency parser sets the **token.dep_** and **token.head** attributes and is also responsible for detecting sentences and noun phrases, also called **"noun chunks"**.

- The named entity recognizer adds the detected entities to the **doc.ents** property. It also sets entity-type attributes on the tokens, which indicate whether or not a token is part of an entity.


![SpaCy Pipeline Image](https://d33wubrfki0l68.cloudfront.net/3ad0582d97663a1272ffc4ccf09f1c5b335b17e9/7f49c/pipeline-fde48da9b43661abcdf62ab70a546d71.svg)

#### To **install** spaCy and **download** the models, using pip from cmd.exe or Jupyter:

>> pip install spacy

For the **English** models:
>>  python -m spacy download en_core_web_sm

spaCy - https://spacy.io/usage/linguistic-features

spaCy - https://course.spacy.io/fr

## **Practice**

### **Installation**

In [None]:
!pip install -U spacy

Collecting spacy
  Using cached https://files.pythonhosted.org/packages/99/c9/35d94c73e26b194c07a3d3adb82c06c38f76bebd3e1ba1e7195fb6f5a7cc/spacy-3.0.6-cp37-cp37m-win_amd64.whl
Collecting blis<0.8.0,>=0.4.0 (from spacy)
  Using cached https://files.pythonhosted.org/packages/7c/f3/2c18510d125d6af493120ca50fc8f2e3c21c9f58fb38d34c032f813dadcb/blis-0.7.4-cp37-cp37m-win_amd64.whl
Collecting cymem<2.1.0,>=2.0.2 (from spacy)
  Using cached https://files.pythonhosted.org/packages/3f/7f/e4b81e6355e87c08a039cbc64c3044046ba4665a74e9b62246c71976a849/cymem-2.0.5-cp37-cp37m-win_amd64.whl
Collecting srsly<3.0.0,>=2.4.1 (from spacy)
  Using cached https://files.pythonhosted.org/packages/e7/f8/62520edb641dde8ba57f7ba9aa82f3c8e6567b8b8aacb690615c9800d156/srsly-2.4.1-cp37-cp37m-win_amd64.whl
Collecting typer<0.4.0,>=0.3.0 (from spacy)
  Using cached https://files.pythonhosted.org/packages/90/34/d138832f6945432c638f32137e6c79a3b682f06a63c488dcfaca6b166c64/typer-0.3.2-py3-none-any.whl
Collecting spacy-legac

In [None]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.0.0 from https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.0.0/en_core_web_sm-3.0.0-py3-none-any.whl#egg=en_core_web_sm==3.0.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.0.0/en_core_web_sm-3.0.0-py3-none-any.whl (13.7MB)
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.0.0
[+] Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


### **Import and Models**

In [None]:
# Import and Models : https://spacy.io/usage/models
import spacy
from spacy import displacy

nlp = spacy.load('en_core_web_sm')

### **Text Data**

In [None]:
# Text
text = "I prefer the morning flight through Denver."

### **SpaCy NLP Pipeline**

In [None]:
doc = nlp(text)

In [None]:
doc

I prefer the morning flight through Denver.

In [None]:
# Pipeline information: names of the pipeline components
print(nlp.pipe_names)

['tok2vec', 'tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer']


In [None]:
# Pipeline information
print(nlp.pipeline)

[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec object at 0x000002B28FD41888>), ('tagger', <spacy.pipeline.tagger.Tagger object at 0x000002B28FD41648>), ('parser', <spacy.pipeline.dep_parser.DependencyParser object at 0x000002B28EABBD68>), ('ner', <spacy.pipeline.ner.EntityRecognizer object at 0x000002B28EABBEB8>), ('attribute_ruler', <spacy.pipeline.attributeruler.AttributeRuler object at 0x000002B28FDF8D48>), ('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer object at 0x000002B28FE08348>)]


### **Tokenization**

In [None]:
for token in doc:
    print(token.text)

I
prefer
the
morning
flight
through
Denver
.


In [None]:
[token.text for token in doc]

### **Part-of-Speech Tagging**:   `token.pos_`    and    `token.tag_`

**Tag types**:

- Coarse-grained (**Universal Part-of-Speech Tagset**): noun, verb, adjective, etc. https://universaldependencies.org/u/pos/

- Fine-grained (**Penn Treebank tagset**): https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
    - noun-proper-singular, noun-proper-plural, noun-common-mass, ...
    - verb-past, verb-present-3rd, verb-base, ...
    - adjective-simple, adjective-comparative, ...
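Fine-grained Penn Treebank tags are terse, so it can help to print their glossary descriptions. The sketch below relies only on `spacy.explain`, which looks labels up in spaCy's built-in glossary and needs no trained model; the tag list is an illustrative selection.

```python
import spacy

# A few fine-grained (Penn Treebank) tags and coarse-grained (universal) tags
fine_tags = ["PRP", "VBP", "DT", "NN", "IN", "NNP"]
coarse_tags = ["PRON", "VERB", "DET", "NOUN", "ADP", "PROPN"]

# spacy.explain returns a short description, or None for unknown labels
descriptions = {tag: spacy.explain(tag) for tag in fine_tags + coarse_tags}

for tag, description in descriptions.items():
    print(f"{tag:6} {description}")
```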

In [None]:
doc

I prefer the morning flight through Denver.

In [None]:
# TEXT  -   LEMMA  -  Coarse-grained POS TAG -   Fine-grained POS TAG

for token in doc:
    print(f'{token.text:{15}} {token.lemma_:{15}}  {token.pos_:{15}} {token.tag_}')

I               I                PRON            PRP
prefer          prefer           VERB            VBP
the             the              DET             DT
morning         morning          NOUN            NN
flight          flight           NOUN            NN
through         through          ADP             IN
Denver          Denver           PROPN           NNP
.               .                PUNCT           .


In [None]:
# Label explanations
spacy.explain("VBP")

'verb, non-3rd person singular present'

In [None]:
spacy.explain("VERB")

'verb'

In [None]:
# Coarse-grained part-of-speech tags  :  token.pos_
[token.pos_ for token in doc]

['PRON', 'VERB', 'DET', 'NOUN', 'NOUN', 'ADP', 'PROPN', 'PUNCT']

In [None]:
# Fine-grained part-of-speech tags  :   token.tag_
[token.tag_ for token in doc]

['PRP', 'VBP', 'DT', 'NN', 'NN', 'IN', 'NNP', '.']

### **Dependency Parsing**: `token.dep_`   and  `token.head`

In [None]:
doc

I prefer the morning flight through Denver.

In [None]:
#  TEXT   -   Syntactic Dependency Label -   Syntactic head token
for token in doc:
    print(f'{token.text:{15}} {token.dep_:{20}} {token.head.text}')

I               nsubj                prefer
prefer          ROOT                 prefer
the             det                  flight
morning         compound             flight
flight          dobj                 prefer
through         prep                 flight
Denver          pobj                 through
.               punct                prefer


In [None]:
# Label explanations
spacy.explain("nsubj")

'nominal subject'
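To see how the head column encodes a tree, here is a small pure-Python sketch that rebuilds the parent-to-children structure from the (text, label, head) rows printed above. It assumes every word in the sentence is unique, which holds for this example only; the helper `subtree` is a hypothetical name. In spaCy itself, `token.children` and `token.subtree` provide this directly.

```python
from collections import defaultdict

# (text, dependency label, head text) rows copied from the output above
rows = [
    ("I", "nsubj", "prefer"),
    ("prefer", "ROOT", "prefer"),
    ("the", "det", "flight"),
    ("morning", "compound", "flight"),
    ("flight", "dobj", "prefer"),
    ("through", "prep", "flight"),
    ("Denver", "pobj", "through"),
    (".", "punct", "prefer"),
]

# Map each head to its direct dependents; the ROOT token is its own head
children = defaultdict(list)
root = None
for text, dep, head in rows:
    if dep == "ROOT":
        root = text
    else:
        children[head].append(text)


def subtree(word):
    """Return word followed by all its descendants, depth-first."""
    result = [word]
    for child in children.get(word, []):
        result.extend(subtree(child))
    return result


print(root)
print(children["prefer"])
print(subtree("flight"))
```

The subtree of `flight` covers the whole object phrase, including the prepositional phrase attached to it.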

In [None]:
# Finding a verb with a subject
from spacy.symbols import nsubj, VERB

verbs = []
for possible_subject in doc:
    if possible_subject.dep == nsubj and possible_subject.head.pos == VERB:
        verbs.append(possible_subject.head)
print(verbs)

[prefer]


In [None]:
# Tree visualization using nltk

from nltk import Tree

doc2 = nlp("I prefer the morning flight through Denver")

def tok_format(tok):
    return "_".join([tok.orth_, tok.pos_])


def to_nltk_tree(node):
    if node.n_lefts + node.n_rights > 0:
        return Tree(tok_format(node), [to_nltk_tree(child) for child in node.children])
    else:
        return tok_format(node)

[to_nltk_tree(sent.root).pretty_print() for sent in doc2.sents]

               prefer_VERB                          
   _________________|___________                     
  |                        flight_NOUN              
  |        _____________________|____________        
  |       |                     |       through_ADP 
  |       |                     |            |       
I_PRON the_DET             morning_NOUN Denver_PROPN



[None]

### **Shallow Parsing - Chunking - Constituent Extraction**  :  `doc.noun_chunks`

Shallow parsing, or chunking, is the process of extracting phrases from text. Chunking groups adjacent tokens into phrases on the basis of their POS tags. There are some standard well-known chunks such as noun phrases, verb phrases, and prepositional phrases.
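The idea of grouping adjacent tokens by POS tag can be sketched in a few lines of plain Python. This is a deliberately naive chunker, not the algorithm behind `doc.noun_chunks` (which relies on the dependency parse); the tag set `NOMINAL_TAGS` and the function name `simple_noun_chunks` are assumptions made for illustration.

```python
# (text, coarse-grained POS) pairs for the example sentence
tagged = [
    ("I", "PRON"), ("prefer", "VERB"), ("the", "DET"), ("morning", "NOUN"),
    ("flight", "NOUN"), ("through", "ADP"), ("Denver", "PROPN"), (".", "PUNCT"),
]

# Tags allowed inside a noun chunk in this simplified scheme (assumption)
NOMINAL_TAGS = {"DET", "ADJ", "NOUN", "PROPN", "PRON"}


def simple_noun_chunks(tagged_tokens):
    """Group maximal runs of adjacent nominal tokens into chunks."""
    chunks, current = [], []
    for text, pos in tagged_tokens:
        if pos in NOMINAL_TAGS:
            current.append(text)
        elif current:
            # A non-nominal token closes the chunk in progress
            chunks.append(" ".join(current))
            current = []
    if current:
        chunks.append(" ".join(current))
    return chunks


chunks = simple_noun_chunks(tagged)
print(chunks)
```

A run-based chunker like this would wrongly merge two noun phrases that happen to be adjacent, which is one reason parser-based chunking is usually preferred.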

In [None]:
doc

I prefer the morning flight through Denver.

In [None]:
# TEXT   ROOT.TEXT   ROOT.DEP_   ROOT.HEAD.TEXT
for chunk in doc.noun_chunks:
    print(f'{chunk.text:{25}} {chunk.root.text:{25}} {chunk.root.dep_:{25}} {chunk.root.head.text}')

I                         I                         nsubj                     prefer
the morning flight        flight                    dobj                      prefer
Denver                    Denver                    pobj                      through


### **Dependency parsing visualization**

If you're in a **Jupyter notebook**, use **displacy.render**. Otherwise, use **displacy.serve** to start a web server and show the visualization in your browser.

In [None]:
displacy.render(doc, style='dep', jupyter=True)
#displacy.serve(doc, style="dep")
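Outside a notebook, a third option is to ask `displacy.render` for the markup and save it to a file. The sketch below assumes `jupyter=False` so that the function returns the SVG as a string. To stay independent of a downloaded model, it builds the Doc by hand from the annotations shown earlier; the output file name is hypothetical.

```python
import tempfile
from pathlib import Path

import spacy
from spacy import displacy
from spacy.tokens import Doc

# Hand-built Doc carrying the dependency annotations shown above (no model needed)
vocab = spacy.blank("en").vocab
words = ["I", "prefer", "the", "morning", "flight", "through", "Denver", "."]
spaces = [True, True, True, True, True, True, False, False]
pos = ["PRON", "VERB", "DET", "NOUN", "NOUN", "ADP", "PROPN", "PUNCT"]
heads = [1, 1, 4, 4, 1, 4, 5, 1]  # absolute index of each token's head
deps = ["nsubj", "ROOT", "det", "compound", "dobj", "prep", "pobj", "punct"]
doc_manual = Doc(vocab, words=words, spaces=spaces, pos=pos, heads=heads, deps=deps)

# jupyter=False makes render return the SVG markup instead of displaying it
svg = displacy.render(doc_manual, style="dep", jupyter=False)

# Hypothetical output location in the system temp directory
out_path = Path(tempfile.gettempdir()) / "dep_parse.svg"
out_path.write_text(svg, encoding="utf-8")
print(out_path)
```

The saved file can then be opened in any browser or embedded in a report.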

### **Named Entity Recognition** :    `doc.ents`

In [None]:
doc

I prefer the morning flight through Denver.

In [None]:
# TEXT  -  LABEL DESCRIPTION
for ent in doc.ents:
    print(ent.text, ent.label_)

the morning TIME
Denver GPE


In [None]:
# TEXT - LABEL DESCRIPTION 
[(ent.text, ent.label_) for ent in doc.ents]

[('the morning', 'TIME'), ('Denver', 'GPE')]

In [None]:
# Label explanations
spacy.explain("GPE")

'Countries, cities, states'
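Besides `doc.ents`, entity information is also stored on each token. The following sketch shows the token-level attributes `token.ent_iob_` and `token.ent_type_` along with the character offsets of each entity span. It assumes a hand-built Doc whose IOB tags reproduce the entities printed above, so it runs without a trained model.

```python
import spacy
from spacy.tokens import Doc

vocab = spacy.blank("en").vocab
words = ["I", "prefer", "the", "morning", "flight", "through", "Denver", "."]
spaces = [True, True, True, True, True, True, False, False]

# Token-level IOB tags matching the entities found above (assumed by hand)
iob_tags = ["O", "O", "B-TIME", "I-TIME", "O", "O", "B-GPE", "O"]
doc_ner = Doc(vocab, words=words, spaces=spaces, ents=iob_tags)

# Span-level view: text, label and character offsets
entities = [(ent.text, ent.label_, ent.start_char, ent.end_char) for ent in doc_ner.ents]
print(entities)

# Token-level view: B = begins an entity, I = inside, O = outside
for token in doc_ner:
    print(f"{token.text:{10}} {token.ent_iob_:{3}} {token.ent_type_}")
```

The character offsets are handy when entities need to be highlighted or replaced in the original string.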

### **Named Entity Recognition Visualization**

In [None]:
# Visualizing NER
displacy.render(doc, style="ent", jupyter=True)
#displacy.serve(doc, style="ent")

### **Parsing - Syntactic Parse Tree (Constituency-Based Parse Tree)**

In [None]:
import nltk 

In [None]:
# Grammar CFG Declaration

grammar1 = nltk.CFG.fromstring("""
  S -> NP VP
  VP -> V NP | V NP PP
  PP -> P NP
  NP -> Det N | Det N PP
  V -> "saw" | "ate" | "walked"
  NP -> "John" | "Mary" | "Bob" 
  Det -> "a" | "an" | "the" | "my"
  N -> "girl" | "dog" | "cat" | "telescope" | "park" | "bone"
  P -> "in" | "on" | "by" | "with"
  """)

# Algo - RecursiveDescentParser : https://www.nltk.org/api/nltk.parse.html

rd_parser = nltk.RecursiveDescentParser(grammar1)

In [None]:
# Text and Parse Tree
sent = ['the', 'dog', 'ate', 'the', 'bone']

for tree in rd_parser.parse(sent):
    print(tree)

(S (NP (Det the) (N dog)) (VP (V ate) (NP (Det the) (N bone))))


In [None]:
# Case: structural ambiguity
sent = ['John', 'saw', 'a', 'girl', 'with', 'a', 'telescope']

for tree in rd_parser.parse(sent):
    print(tree)

(S
  (NP John)
  (VP
    (V saw)
    (NP (Det a) (N girl) (PP (P with) (NP (Det a) (N telescope))))))
(S
  (NP John)
  (VP
    (V saw)
    (NP (Det a) (N girl))
    (PP (P with) (NP (Det a) (N telescope)))))
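The two trees differ only in where the prepositional phrase attaches: to the noun phrase (the girl has the telescope) or to the verb phrase (John uses the telescope). As a sketch, the same ambiguity can be counted with `nltk.ChartParser`, an alternative NLTK parser swapped in here for illustration. Unlike a recursive descent parser, a chart parser also copes with left-recursive rules. The grammar is repeated so the snippet is self-contained.

```python
import nltk

# Same grammar as above, repeated to keep the example self-contained
grammar = nltk.CFG.fromstring("""
  S -> NP VP
  VP -> V NP | V NP PP
  PP -> P NP
  NP -> Det N | Det N PP
  V -> "saw" | "ate" | "walked"
  NP -> "John" | "Mary" | "Bob"
  Det -> "a" | "an" | "the" | "my"
  N -> "girl" | "dog" | "cat" | "telescope" | "park" | "bone"
  P -> "in" | "on" | "by" | "with"
  """)

chart_parser = nltk.ChartParser(grammar)
tokens = ["John", "saw", "a", "girl", "with", "a", "telescope"]

# Collect every parse licensed by the grammar
trees = list(chart_parser.parse(tokens))
print(len(trees), "parses found")

# Show the children of the VP in each tree: V NP versus V NP PP
for tree in trees:
    vp = tree[1]
    print([child.label() for child in vp])
```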


In [None]:
# Tree Visualization

def parse(sent):
    # Return the first parse tree (an nltk.Tree) found for a list of tokens
    parser = nltk.RecursiveDescentParser(grammar1)
    trees = list(parser.parse(sent))
    return trees[0]

sentence = ['the', 'dog', 'ate', 'the', 'bone']
# sentence = ['John', 'saw', 'a', 'girl', 'with', 'a', 'telescope']

# Print the tree in bracketed form
print(parse(sentence))

# Draw the tree diagram in a tkinter window
parse(sentence).draw()
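`draw()` needs a graphical display, so it may fail on a remote server or in a headless environment. A text-only alternative, sketched below, is `Tree.pretty_print()`. The tree is rebuilt here from the bracketed output shown earlier with `Tree.fromstring`, so the snippet does not depend on the cells above.

```python
from nltk import Tree

# Bracketed parse copied from the earlier output for "the dog ate the bone"
bracketed = "(S (NP (Det the) (N dog)) (VP (V ate) (NP (Det the) (N bone))))"
tree = Tree.fromstring(bracketed)

# ASCII rendering of the tree, printed to standard output
tree.pretty_print()

# Basic inspection: root label and the words at the leaves
root_label = tree.label()
leaves = tree.leaves()
print(root_label, leaves)
```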