![](https://drive.google.com/uc?export=view&id=1L9JLLQHPZoMRwzYfmKcyM9VME_SHeZrr)

# Pre-processing texts

Pre-process textual data using NLTK and Spacy.

Within a computer, text is encoded as a string of characters. 
In order to analyze textual data within NLP applications, we first need to properly preprocess it. 
An NLP preprocessing pipeline generally consists of the following steps :
* sentence segmentation
* tokenisation
* normalization: lower-casing, lemmatization, optionally removing stop-words and punctuation 
* pos-tagging
* named entity recognition
* parsing

The first two steps are necessary, while the others are optional.
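
To make the first steps concrete, here is a minimal sketch in plain Python (no NLTK or Spacy) of sentence segmentation, tokenisation and lower-casing. The regular expressions and the function names `split_sentences` and `tokenize` are illustrative assumptions; they are far cruder than what the libraries used below actually do.

```python
import re

def split_sentences(text):
    # Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenize(sentence):
    # Naive rule: a token is a run of word characters or a single punctuation sign.
    return re.findall(r"\w+|[^\w\s]", sentence)

text = "Ada wrote notes. She described an algorithm!"
sentences = split_sentences(text)
# Normalization step: lower-case every token of every sentence.
tokens = [[t.lower() for t in tokenize(s)] for s in sentences]

print(sentences)
print(tokens)
```

Rules this simple break quickly on abbreviations, numbers or elisions, which is why the rest of the notebook relies on dedicated tools.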

For these exercises, we will use the **NLTK** and **spacy** modules. Both are already installed on Google Colab, although some NLTK resources may be missing; we will see later how to download them.

NLTK and Spacy both provide ways to carry out tasks such as segmentation, tokenization, lemmatization and pos-tagging.

NLTK is a rather old library, but it is still widely used. It was built by scholars and researchers as a tool to help you create complex NLP functions.
Spacy is more recent and implements a full NLP pipeline. While NLTK provides access to many algorithms to get something done, spaCy aims to provide a single, well-tuned way to do it (https://www.activestate.com/blog/natural-language-processing-nltk-vs-spacy/).

We will extract information from Wikipedia pages as an example.

## 0- Upload and read the text files

At first, we're going to use a text written in English. Then, we'll try to apply the tools to French.
We'll use the wikipedia library to extract pages from Wikipedia.

In [None]:
! pip install wikipedia

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting wikipedia
  Downloading wikipedia-1.4.0.tar.gz (27 kB)
Building wheels for collected packages: wikipedia
  Building wheel for wikipedia (setup.py) ... done
  Created wheel for wikipedia: filename=wikipedia-1.4.0-py3-none-any.whl size=11695 sha256=4f15d766c2074d8b2b5bce4cb12f74a8e2701b65e6ba5c5f6dcc69470b8788b0
  Stored in directory: /root/.cache/pip/wheels/15/93/6d/5b2c68b8a64c7a7a04947b4ed6d89fb557dcc6bc27d1d7f3ba
Successfully built wikipedia
Installing collected packages: wikipedia
Successfully installed wikipedia-1.4.0


In [None]:
import wikipedia
wikipedia.set_lang('en')
text_en = wikipedia.page("Lovelace")
print(text_en.content[:1000])

Augusta Ada King, Countess of Lovelace (née Byron; 10 December 1815 – 27 November 1852) was an English mathematician and writer, chiefly known for her work on Charles Babbage's proposed mechanical general-purpose computer, the Analytical Engine. She was the first to recognise that the machine had applications beyond pure calculation, and to have published the first algorithm intended to be carried out by such a machine. As a result, she is often regarded as the first computer programmer.Ada Byron was the only legitimate child of poet Lord Byron and mathematician Lady Byron. All of Byron's other children were born out of wedlock to other women. Byron separated from his wife a month after Ada was born and left England forever. Four months later, he commemorated the parting in a poem that begins, "Is thy face like thy mother's my fair child! ADA! sole daughter of my house and heart?". He died in Greece when Ada was eight. Her mother remained bitter and promoted Ada's interest in mathemati

In [None]:
wikipedia.set_lang('fr')
text_fr = wikipedia.page("Lovelace")
print(text_fr.content[:1000])

Ada Lovelace, de son nom complet Augusta Ada King, comtesse de Lovelace, née Ada Byron le 10 décembre 1815 à Londres et morte le 27 novembre 1852 à Marylebone dans la même ville, est une pionnière de la science informatique.
Elle est principalement connue pour avoir réalisé le premier véritable programme informatique, lors de son travail sur un ancêtre de l'ordinateur : la machine analytique de Charles Babbage. Dans ses notes, on trouve en effet le premier programme publié, destiné à être exécuté par une machine, ce qui fait d'Ada Lovelace « le premier programmeur du monde ». Elle a également entrevu et décrit certaines possibilités offertes par les calculateurs universels, allant bien au-delà du calcul numérique et de ce qu'imaginaient Babbage et ses contemporains,.
Elle est assez connue dans les pays anglo-saxons et en Allemagne, notamment dans les milieux féministes ; elle est moins connue en France, mais de nombreux développeurs connaissent le langage Ada, nommé en son honneur. 




## 1- Using NLTK



--> **For now, we work on the English file**

### 1.1 Sentence Segmentation

**Exercise 1:** Breaking the text into Sentences

* Import [NLTK](https://www.nltk.org/api/nltk.html)
* You can use help(X) to get information about how function X works, e.g. help(nltk.word_tokenize) for NLTK's word tokenizer.  
Use the [help function](https://www.nltk.org/api/nltk.html?highlight=help#module-nltk.help) to see how nltk.sent_tokenize works  
* Now use the [sent_tokenize()](https://www.nltk.org/api/nltk.tokenize.html?highlight=sent_tokenize#nltk.tokenize.sent_tokenize) function to segment the text into sentences.
* Apply print() to the output of the tokenizer to view the results
* How many sentences do we have?

You might need to download additional resources for NLTK, e.g. the package 'punkt':



In [None]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
import nltk

help(nltk.sent_tokenize)

content_en = text_en.content

sentences = nltk.sent_tokenize(content_en)
for i, s in enumerate( sentences):
  print(i, s )


### 1.2 Tokenization

**Exercise 2:** Tokenizing a text file

* Tokenize the text using NLTK [word_tokenize](https://www.nltk.org/api/nltk.tokenize.html?highlight=word_tokenize)
* Inspect the results: does it work well?
* How many tokens do we have?
* How many words / types / unique tokens do we have? Hint: use numpy.unique(list).

In [None]:
import numpy as np

# Use NLTK to tokenize the text
tokens = nltk.word_tokenize(content_en)

# Print out the tokens
print(tokens)
print('Count tokens: ', len(tokens))
print('Vocabulary / Types / Unique tokens: ', len(np.unique(tokens)))
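
The difference between tokens and types can be checked on a tiny example. The sketch below uses `collections.Counter` from the standard library instead of `numpy.unique`; the token list is made up for illustration.

```python
from collections import Counter

# Hand-made token list (illustrative, not taken from the Wikipedia page)
tokens = ["the", "machine", "and", "the", "notes", "on", "the", "machine"]

counts = Counter(tokens)
n_tokens = len(tokens)   # every occurrence counts
n_types = len(counts)    # each distinct form counts once

print(n_tokens, n_types)
print(counts.most_common(2))
```

A `Counter` also gives the frequency of each type for free, which is useful when looking for the most frequent words.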

### 1.3 Pre-processing of French text

--> **Now, we will use the French wikipedia page**

**Exercise 3** Tokenization for French
* Now, try to perform the same pre-processing on the French document
* Do you see a problem?
* Check the language option for NLTK, does it work better?
* Use the *RegexpTokenizer* to solve the issue.

In [None]:
content_fr = text_fr.content

sentences = nltk.sent_tokenize(content_fr)
for i, s in enumerate( sentences[:10]):
  print(i, s )

# Use NLTK to tokenize the text
tokens = nltk.word_tokenize(content_fr)

# Print out the tokens
print(tokens)
print('Count tokens: ', len(tokens))
print('Vocabulary / Types / Unique tokens: ', len(np.unique(tokens)))


# output omitted because it is very long

Issues with tokenization:
- look for tokens that are not properly recognized

Unfortunately, using the option `language='french'` doesn't solve all the issues...

What we can do instead is use a regular expression, more specifically NLTK's RegexpTokenizer.

In [None]:
# Use NLTK to tokenize the text
tokens = nltk.word_tokenize(content_fr, language='french')

# Print out the tokens
print(tokens)


In [None]:
from nltk import RegexpTokenizer
tokenizer = RegexpTokenizer(r'''\w'|\w+|[^\w\s]''')
tokens = tokenizer.tokenize(content_fr)

# Print out the tokens
print(tokens)
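
To see what each alternative of this regular expression does, the sketch below applies the same pattern with `re.findall` from the standard library, on short made-up strings. It assumes that, for a pattern without capturing groups, this gives the same tokens as `RegexpTokenizer`. `PATTERN_CURLY` is a hypothetical variant, not part of the exercise.

```python
import re

PATTERN = r"\w'|\w+|[^\w\s]"            # same pattern as in the cell above
PATTERN_CURLY = r"\w['’]|\w+|[^\w\s]"   # assumption: also accept the typographic apostrophe

# One letter + apostrophe is kept together, then words, then single punctuation signs
elided = re.findall(PATTERN, "L'ordinateur n'est pas là.")
# The typographic apostrophe is not covered by the original pattern
curly_before = re.findall(PATTERN, "l’histoire")
curly_after = re.findall(PATTERN_CURLY, "l’histoire")
# Elisions longer than one letter are split in three pieces
long_elision = re.findall(PATTERN, "qu'il")

print(elided)
print(curly_before)
print(curly_after)
print(long_elision)
```

The last two cases suggest that the pattern still needs tuning for real French text, for instance for "qu'" or for pages that use the curly apostrophe.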

**Conclusion on Tokenization**

Tokenization is not as simple as it seems. Even for English, the behaviour of the NLTK tokenizer is probably not what you want for many applications: try, for example, to tokenize "I don't like it". 

In [None]:
# Use NLTK to tokenize the text
tokens = nltk.word_tokenize("I don't like it.")

# Print out the tokens
print(tokens)

['I', 'do', "n't", 'like', 'it', '.']


## 2- Using Spacy

All info about Spacy: https://spacy.io/ ; More info on the pipelines: https://spacy.io/usage/processing-pipelines 

Spacy is a more production-oriented NLP library than NLTK, and it generally performs better on the basic processing steps. 

Spacy can be used to directly tokenize any text. 
With spacy, we build a pipeline that does everything at once. 
To make it work, you need to load a model specific to the target language, for example 'en_core_web_sm' for English (there are also some domain-specific models).


The model corresponds to a processing 'pipeline': 
  by default, it includes tokenisation, POS tagging, lemmatization, dependency parsing and named entity recognition
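
The idea of a pipeline can be illustrated with a toy sketch in plain Python: each component receives the document, adds its own annotations and passes it on. This is only an analogy under simplifying assumptions (a dict instead of a Spacy `Doc`, whitespace tokenisation); the names `tokenizer`, `lowercaser` and `run_pipeline` are hypothetical and are not Spacy's API.

```python
def tokenizer(doc):
    # First component: split the raw text on whitespace.
    doc["tokens"] = doc["text"].split()
    return doc

def lowercaser(doc):
    # Later component: relies on the tokens produced by the previous one.
    doc["lower"] = [t.lower() for t in doc["tokens"]]
    return doc

def run_pipeline(text, components):
    # Apply every component in order to the same document object.
    doc = {"text": text}
    for component in components:
        doc = component(doc)
    return doc

doc = run_pipeline("Ada Lovelace wrote notes", [tokenizer, lowercaser])
print(doc["tokens"])
print(doc["lower"])
```

The order of the components matters: a component can only use annotations produced earlier in the pipeline.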

Using spacy:
- import the spacy module into Python 
- load all the necessary models, e.g. for English


In [None]:
import spacy 
nlp = spacy.load('en_core_web_sm')


Then process a text with the pipeline: 



In [None]:
doc = nlp(content_en)

### 2.1 Tokenisation

**Exercise 4:** Tokenize the text in French
* Find a model for French and tokenize the text in the file. Hint: you will need to download the model first, that can be done in the notebook using: *spacy.cli.download( model_name )*
* What does the *doc* variable contain? Hint: You can either access Spacy's manual on the internet to find out how to access the information, or look at the built-in help by typing help(doc). https://spacy.io/api/doc
* Print the individual tokens. Do you see any error?
* How many tokens do we have?
* How many words / types / unique tokens do we have (i.e. vocabulary size)?
* Use Pandas to better visualize the results

In [None]:
import spacy.cli
spacy.cli.download("fr_core_news_sm")

✔ Download and installation successful
You can now load the package via spacy.load('fr_core_news_sm')


In [None]:
import spacy

# Load the model
nlp = spacy.load('fr_core_news_sm')

# Preprocess using spacy's pipeline
doc = nlp(content_fr)

print('Preprocessing done')

# Inspect token
# Our preprocessed document is now present as a list of tokens in our doc variable, and we can access its different annotations by looping through it
all_tokens =  []
for token in doc:
  #print( token.text)
  all_tokens.append(token.text)
print('Count tokens', len(all_tokens))
print('Count types', len(np.unique(all_tokens)))


Preprocessing done
Count tokens 3209
Count types 1088


In [None]:
# or alternatively
from collections import Counter
comptage_tokens = Counter((token.text for token in doc)) 
print(f"Count tokens : {sum(comptage_tokens.values())}")
print(f"Count types : {len(comptage_tokens.keys())}")
print(f"Most frequent tokens : {comptage_tokens.most_common(n=10)}")

Count tokens : 3209
Count types : 1088
Most frequent tokens : [(',', 198), ('de', 134), ('.', 94), ('Ada', 69), ('=', 66), ('et', 65), ('la', 58), ('à', 55), ('\n', 49), ('en', 47)]


Issues with tokenisation? (examples to look for)
* -il
* machine[2].
* linguistique[3].
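
Tokens such as `machine[2].` seem to come from footnote markers left in the Wikipedia text. One possible workaround, sketched below with the standard library only, is to strip these markers before tokenising. The function name `strip_footnote_markers` and the assumption that markers always look like `[number]` are ours.

```python
import re

def strip_footnote_markers(text):
    # Remove bracketed numeric markers such as "[2]" (assumed footnote residue).
    return re.sub(r"\[\d+\]", "", text)

cleaned_text = strip_footnote_markers("la machine[2]. La linguistique[3].")
print(cleaned_text)
```

The cleaned string can then be passed to the tokenizer as usual.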


#### Getting some help

Hint: You can either access Spacy's manual on the internet to find out how to access the information, or look at the built-in help by typing *help(doc)*.

https://spacy.io/api/doc

#### Pandas

You can use Pandas to better visualize the results

In [None]:
# Using pandas for a better visualization 
import pandas as pd

spacy_tokens = [w for w in doc]
pd.DataFrame(spacy_tokens,
             columns=['Word'])

Unnamed: 0,Word
0,Ada
1,Lovelace
2,","
3,de
4,son
...,...
3204,de
3205,l’
3206,histoire
3207,des


In [None]:
w0 = spacy_tokens[0]
w0

Ada

### 2.2 Sentence segmentation

**Exercise 5:**
Apart from token segmentation, Spacy has also automatically segmented our document into sentences. 
* **(a)** Print out the different sentences of the document.
Hint: Look at the "Data descriptors" in the help page for 'doc'.



In [None]:
# Print the sentences
for i, sent in enumerate( doc.sents ):
  print( i, sent.text.strip() )

## 3- Further pre-processing 

We saw earlier that the most frequent words are punctuation and function words. 
In order to find the most important words, e.g. to index documents, we probably want to remove these tokens.
We are thus now going to **remove punctuation signs and "stop words"**.
Note that for a full normalization, we would probably also lower case the first word of each sentence, and all words that are not tagged as proper nouns (but it requires pos tagging).

Exercise 6:
* Define a function that segments, tokenizes, removes punctuation and removes stop words. **Hint:** spacy.lang contains language-specific data for each language, in particular stop word lists.
* Apply this function to the French Wikipedia page
* Display a Pandas dataframe containing an ordered list of the tokens after cleaning and their frequency


In [None]:
# We saw earlier that the most frequent words are punctuation signs and function words. To keep the important words, e.g. to index the text,
# we remove punctuation tokens and "stop words".
# For a full normalization, we would also lower-case the first word of each sentence, unless it is a proper noun (we will see this later).
import string
punct = set(string.punctuation+"«»")
print("Punctuation signs: ", punct)

from spacy.lang.fr.stop_words import STOP_WORDS as fr_stop

print("Stop words for French:", fr_stop)

def cleanup(text):
  text = text.lower().replace("\n","")# first approximation (note: this glues words across line breaks)
  doc = nlp(text)
  sents = doc.sents
  tokens = []
  for s in sents:
    tokens.extend([t.text for t in s if t.text not in (fr_stop|punct)])
  return list(tokens)
    
cleaned = cleanup(content_fr)
print("Cleaned, # tokens:", len(cleaned))

Punctuation signs:  {'|', '[', '$', '?', '=', '\\', '»', '`', '^', '/', '"', '~', '#', '&', '_', '!', '(', ';', '<', '+', ')', '.', '{', '>', '}', ']', '«', '@', ':', '*', '%', "'", '-', ','}
Stop words for French: {'stop', 'abord', 'quelques', 'nôtre', 'egalement', 'revoila', 'possible', 'revoici', 'suivante', 'antérieure', 'avec', 'onzième', 'néanmoins', 'telle', 'ouste', 'ça', 'doivent', 'mille', 'ta', 'vôtres', 'dixième', 'as', 'moi-meme', "d'", 'c’', 'dès', 'dans', 'leur', 'comme', 'mien', 'que', 'pu', 'da', 'suivre', 'me', 'au', 'dits', 'quel', 'restent', 'toujours', 'lui-meme', 'dessus', "quelqu'un", 'également', 'vingt', 'surtout', 'peut', 'gens', 'cela', 'soi', 'facon', 'sans', 'peu', 'soit', 'en', 'meme', 'enfin', 'mais', 'quelque', 'car', 'très', 'celles-la', 'lorsque', 'malgré', 'de', 'exactement', 'apres', 'nul', 'quatorze', 'outre', 'quoique', 'du', 'celle-là', 'avoir', 'bat', 'duquel', 'lesquelles', 'faisaient', "j'", 'sept', 'hormis', 'celle-ci', 'assez', 'puis', 'nôtre

In [None]:
print(cleaned[:100])

['ada', 'lovelace', 'nom', 'complet', 'augusta', 'ada', 'king', 'comtesse', 'lovelace', 'née', 'ada', 'byron', '10', 'décembre', '1815', 'londres', 'morte', '27', 'novembre', '1852', 'marylebone', 'ville', 'pionnière', 'science', 'informatique.elle', 'principalement', 'connue', 'réalisé', 'véritable', 'programme', 'informatique', 'travail', 'ancêtre', 'ordinateur', 'machine', 'analytique', 'charles', 'babbage', 'notes', 'trouve', 'programme', 'publié', 'destiné', 'exécuté', 'machine', 'ada', 'lovelace', 'programmeur', 'monde', 'entrevu', 'décrit', 'possibilités', 'offertes', 'calculateurs', 'universels', 'allant', 'bien', 'au-delà', 'calcul', 'numérique', 'imaginaient', 'babbage', 'contemporains,.elle', 'connue', 'pays', 'anglo-saxons', 'allemagne', 'milieux', 'féministes', 'connue', 'france', 'développeurs', 'connaissent', 'langage', 'ada', 'nommé', 'honneur', 'biographie', 'environnement', 'familial', 'ada', 'fille', 'légitime', 'poète', 'george', 'gordon', 'byron', 'épouse', 'annabe
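
Tokens such as 'informatique.elle' in the output above are most likely a side effect of replacing line breaks with an empty string. The sketch below reproduces the effect on a short made-up string, using a whitespace split as a stand-in for the tokenizer.

```python
text = "science informatique.\nElle est connue"

# Removing the line break glues the last word of a line to the first word of the next
glued = text.lower().replace("\n", "").split()
# Replacing it with a space keeps the words apart
spaced = text.lower().replace("\n", " ").split()

print(glued)
print(spaced)
```

Replacing "\n" with a space in `cleanup` would therefore be a simple way to avoid these merged tokens.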

In [None]:
import pandas as pds
table = pds.Series(cleaned)
cts = table.value_counts()
cts.sort_values(inplace=True,ascending=False)
cts.head(20)

ada              65
babbage          27
lovelace         24
machine          22
byron            15
the              15
mathématiques    12
notes            11
informatique     10
programme         9
of                9
nom               9
charles           9
calcul            7
2                 7
fille             7
été               6
portail           6
analytique        6
augusta           6
dtype: int64

In [None]:
type(fr_stop)

set

## 4- POS tagging

Remember that the model corresponds to a processing 'pipeline' in Spacy: 
  - by default, it includes, among other steps, tokenisation, lemmatization and POS tagging

**Exercise 7**
- print each individual token, together with its lemmatized form and part of speech tag
- Use Pandas to better visualize the results
- Look at the results, do you see any error?
- You can use the method 'spacy.explain' to have information about some annotation, for example the POS tags. Apply it to each POS tag to get a more detailed label.


In [None]:
import spacy

# Load the model
nlp = spacy.load('fr_core_news_sm')


# Preprocess using spacy's pipeline
doc = nlp(content_fr)

# Inspect tokens, lemmas, and pos tags
for token in doc:
  print( token.text, token.lemma_, token.pos_)

Ada Ada PROPN
Lovelace Lovelace PROPN
, , PUNCT
de de ADP
son son DET
nom nom NOUN
complet complet ADJ
Augusta Augusta PROPN
Ada Ada PROPN
King King PROPN
, , PUNCT
comtesse comtesse NOUN
de de ADP
Lovelace Lovelace PROPN
, , PUNCT
née naître VERB
Ada Ada X
Byron Byron PROPN
le le DET
10 10 NUM
décembre décembre NOUN
1815 1815 NUM
à à ADP
Londres Londres PROPN
et et CCONJ
morte morte ADV
le le DET
27 27 NUM
novembre novembre NOUN
1852 1852 NUM
à à ADP
Marylebone Marylebone PROPN
dans dans ADP
la le DET
même même ADJ
ville ville NOUN
, , PUNCT
est être AUX
une un DET
pionnière pionnier NOUN
de de ADP
la le DET
science science NOUN
informatique informatique ADJ
. . PUNCT

 
 SPACE
Elle lui PRON
est être AUX
principalement principalement ADV
connue connaître VERB
pour pour ADP
avoir avoir AUX
réalisé réaliser VERB
le le DET
premier premier ADJ
véritable véritable ADJ
programme programme NOUN
informatique informatique ADJ
, , PUNCT
lors lors ADV
de de ADP
son son DET
travail travail NOUN
s

#### Pandas

You can use Pandas to better visualize the results

In [None]:
# Using pandas for a better visualization 
import pandas as pd

spacy_pos_tagged = [(w, w.lemma_, w.tag_, w.pos_) for w in doc]
pd.DataFrame(spacy_pos_tagged,
             columns=['Word', 'Lemma', 'POS tag', 'Tag type'])

Unnamed: 0,Word,Lemma,POS tag,Tag type
0,Ada,Ada,PROPN,PROPN
1,Lovelace,Lovelace,PROPN,PROPN
2,",",",",PUNCT,PUNCT
3,de,de,ADP,ADP
4,son,son,DET,DET
...,...,...,...,...
3204,de,de,ADP,ADP
3205,l’,l’,ADJ,ADJ
3206,histoire,histoire,NOUN,NOUN
3207,des,de,ADP,ADP


#### Look at the results:
* lemmatization:
  * ambiant ambier VERB

#### Notes on POS tags:

* You can use the method 'explain' to have information about some annotation, for example the POS tags, see the code below.
* Here we used the coarse-grained Universal POS tags exposed by `pos_`, which form a very small set (vs e.g. 36 in the PTB: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html) 

In [None]:
# Inspect POS tags
all_tags = set()
for token in doc:
  all_tags.add(token.pos_)
for tag in all_tags:
  print( tag, spacy.explain(tag)) # explain each label

ADJ adjective
PUNCT punctuation
PRON pronoun
NOUN noun
VERB verb
SPACE space
ADV adverb
AUX auxiliary
DET determiner
SCONJ subordinating conjunction
X other
PROPN proper noun
ADP adposition
NUM numeral
CCONJ coordinating conjunction
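
With POS tags available, the case normalization mentioned in section 3 (lower-casing everything except proper nouns) can be sketched as follows. The (text, tag) pairs are individual tokens taken from the tagger output above and recombined into a made-up sequence; `normalize_case` is a hypothetical helper name.

```python
# (text, coarse POS tag) pairs, as produced by token.text and token.pos_
tagged = [("Elle", "PRON"), ("est", "AUX"), ("connue", "VERB"),
          ("à", "ADP"), ("Londres", "PROPN")]

def normalize_case(tagged_tokens):
    # Lower-case every token except those tagged as proper nouns.
    return [text if pos == "PROPN" else text.lower() for text, pos in tagged_tokens]

normalized = normalize_case(tagged)
print(normalized)
```

The result is only as good as the tagger: a proper noun tagged as `X` or `NOUN` would be lower-cased by mistake.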


## 5- Named entity recognition

As part of the preprocessing pipeline, Spacy has also carried out named entity recognition.

**Exercise 8:**
* print out each named entity, together with the label assigned to it
* what do the labels stand for?
* Use the module called 'displacy' to visualize the Named Entities directly in the text.

In [None]:
entity_labels = set()
for entity in doc.ents:
  label = entity.label_
  print( entity.text, '\t', label )
  entity_labels.add( label )

Ada Lovelace 	 PER
Augusta 	 MISC
comtesse de 	 LOC
Ada Byron 	 PER
Londres 	 LOC
Marylebone 	 LOC
Charles Babbage 	 PER
Ada Lovelace 	 PER
Babbage 	 LOC
Allemagne 	 LOC
France 	 LOC
Ada 	 MISC
Ada 	 MISC
George Gordon 	 PER
Annabella 	 PER
Caroline Lamb 	 PER
Byron 	 PER
Ada 	 MISC
Augusta 	 MISC
Augusta Leigh 	 PER
Byron 	 PER
Ada 	 MISC
Byron 	 PER
Augusta 	 PER
Byron 	 PER
Annabella 	 PER
Ada 	 MISC
Byron 	 PER
Annabella 	 PER
Byron 	 PER
Ada 	 MISC
Byron 	 PER
Royaume-Uni 	 LOC
Annabella 	 PER
Byron 	 PER
Annabella 	 PER
Ada 	 MISC
Ada 	 MISC
Mary Somerville 	 PER
XIXe 	 MISC
Mary 	 PER
Charles Babbage 	 PER
Ada 	 MISC
Ada 	 MISC
Babbage 	 LOC
David Brewster 	 PER
Charles Wheatstone 	 PER
Charles Dickens 	 PER
Michael Faraday 	 PER
William King 	 PER
Byron 	 PER
Annabella 	 PER
Anne Blunt 	 PER
Ralph Gordon 	 PER
William 	 PER
Ada 	 MISC
Ada 	 MISC
Ockham Park 	 LOC
Okham 	 LOC
Augusta Ada 	 MISC
Lovelace 	 LOC
Ada Lovelace 	 PER
Lady Lovelace 	 PER
Ada 	 MISC
Babbage 	 PER
August

In [None]:
for l in entity_labels:
  print( l, spacy.explain(l))

ORG Companies, agencies, institutions, etc.
MISC Miscellaneous entities, e.g. events, nationalities, products or works of art
PER Named person or family.
LOC Non-GPE locations, mountain ranges, bodies of water
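
A quick way to spot labelling problems is to group the labels assigned to each entity string. The sketch below does this in plain Python on a handful of (text, label) pairs copied from the output above; with a real document, the pairs would come from `doc.ents`.

```python
from collections import defaultdict

# (entity text, label) pairs copied from the NER output above
entities = [("Ada Lovelace", "PER"), ("Babbage", "LOC"), ("Augusta", "MISC"),
            ("Babbage", "PER"), ("Augusta", "PER"), ("Londres", "LOC")]

labels_by_text = defaultdict(set)
for text, label in entities:
    labels_by_text[text].add(label)

# Keep only the entity strings that received more than one label
inconsistent = {t: sorted(ls) for t, ls in labels_by_text.items() if len(ls) > 1}
print(inconsistent)
```

Here "Babbage" is tagged both as a location and as a person, which points to errors of the small French model.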


#### Visualization

A module called 'displacy' can be used to visualize the Named Entities directly in the text.

In [None]:
from spacy import displacy

# Visually
displacy.render(doc, style='ent', jupyter=True)

Note on Named Entity Recognition
* "Church" = organization instead of person
* Pitts' 1943: person + date = ORG, not well segmented

## 6- Parsing 

Finally, as part of the pipeline, Spacy has also performed dependency parsing (note that each module can be disabled if not needed).

**Exercise 9:** Extract all Noun Phrases from the file

* Retrieve the information from the dependency parses: dependent and head of each token
* Use displacy to visualize a parse tree: first try with a simple sentence (e.g. *La petite brise la glace.*) then use the first sentence of the document.
* Navigating the parse tree. Each element of the tree has attributes that you can use to inspect it: 
  * Define a Pandas dataframe that associates each token with its head and the relation between them. Any children of the current token are also printed.
* Print all the adjectives and the noun they modify



In [None]:
# Load the model
#nlp = spacy.load('fr_core_news_sm')

doc_ada_fr = nlp(content_fr)

sentence = next(doc_ada_fr.sents)

for token in sentence:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_, token.head)

Ada Ada PROPN PROPN ROOT Ada
Lovelace Lovelace PROPN PROPN flat:name Ada
, , PUNCT PUNCT punct Ada
de de ADP ADP case nom
son son DET DET det nom
nom nom NOUN NOUN nmod Ada
complet complet ADJ ADJ amod nom
Augusta Augusta PROPN PROPN nmod nom


In [None]:
# Load the model
#nlp = spacy.load('fr_core_news_sm')
text = 'La petite brise la glace.'

doc_toy = nlp(text)

# Visualization
displacy.render(doc_toy, style="dep", jupyter=True)

In [None]:

# Print the first sentence of our document (doc.sents starts a new iteration each time it is accessed)
sentence0 = next(doc_ada_fr.sents)
print(sentence0)


# Visualization
displacy.render(sentence0, style="dep", jupyter=True)

Ada Lovelace, de son nom complet Augusta


#### Navigating the parse tree

In [None]:
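# Navigating the parse tree (child.idx is the character offset of the child in the document; child.i would give its token index)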
spacy_dep_rel = [(w.text, w.pos_, w.dep_, w.head.text, w.head.pos_, [";".join([str(child.idx),child.text]) for child in w.children]) for w in sentence0]
pd.DataFrame(spacy_dep_rel,
             columns=['Word', 'Pos', 'Dep', 'Head text', 'Head pos', 'children'])

Unnamed: 0,Word,Pos,Dep,Head text,Head pos,children
0,Ada,PROPN,ROOT,Ada,PROPN,"[4;Lovelace, 12;,, 21;nom]"
1,Lovelace,PROPN,flat:name,Ada,PROPN,[]
2,",",PUNCT,punct,Ada,PROPN,[]
3,de,ADP,case,nom,NOUN,[]
4,son,DET,det,nom,NOUN,[]
5,nom,NOUN,nmod,Ada,PROPN,"[14;de, 17;son, 25;complet, 33;Augusta]"
6,complet,ADJ,amod,nom,NOUN,[]
7,Augusta,PROPN,nmod,nom,NOUN,[]
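
The relation between heads and children can be rebuilt by hand from the table above: if every token knows the index of its head, the children of a token are simply the tokens that point to it. The sketch below uses plain tuples instead of Spacy tokens.

```python
from collections import defaultdict

# (text, dependency label, index of head) copied from the table above;
# the root points to itself, as in Spacy.
parse = [("Ada", "ROOT", 0), ("Lovelace", "flat:name", 0), (",", "punct", 0),
         ("de", "case", 5), ("son", "det", 5), ("nom", "nmod", 0),
         ("complet", "amod", 5), ("Augusta", "nmod", 5)]

children = defaultdict(list)
for i, (text, dep, head) in enumerate(parse):
    if head != i:  # skip the root's self-loop
        children[head].append(text)

for i, (text, dep, head) in enumerate(parse):
    print(text, dep, parse[head][0], children[i])
```

This is the information that `token.head` and `token.children` expose directly on a parsed document.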


In [None]:
for sent in doc_ada_fr.sents:
  for w in sent:
    if w.pos_ == 'ADJ':
      print( w.text, w.head.text, w.head.pos_) # not only nouns: this reveals POS tagging errors

complet nom NOUN
même ville NOUN
informatique science NOUN
premier programme NOUN
véritable programme NOUN
informatique programme NOUN
analytique machine NOUN
premier programme NOUN
premier programmeur NOUN
offertes possibilités NOUN
universels calculateurs NOUN
numérique calcul NOUN
féministes milieux NOUN
nombreux développeurs NOUN
familial Environnement NOUN
seule fille NOUN
légitime fille NOUN
poète fille NOUN
intelligente femme NOUN
premier prénom NOUN
dernier eu VERB
incestueuses relations NOUN
court « VERB
ancien « VERB
vocalique ancien ADJ
même année NOUN
quitte Annabella PROPN
« appelait VERB
jeune fille NOUN
éminente chercheuse ADJ
chercheuse , PUNCT
scientifique autrice NOUN
XIXe siècle NOUN
— Ada PROPN
âgée Ada PROPN
proches deviennent VERB
autres connaissances NOUN
1er marie VERB
dévoué dévoué ADJ
complet nom NOUN
grande partie NOUN
honorable honorable ADJ
fragile santé NOUN
mathématiques activités NOUN
enthousiaste élève NOUN
créative enthousiaste ADJ
positifs retours NOU

In [None]:
for sent in doc_ada_fr.sents:
  for w in sent:
    if w.pos_ == 'ADJ' and w.head.pos_ == 'NOUN':
      print( w.text, w.head.text, w.head.pos_) # only adjectives whose head is tagged as a noun

complet nom NOUN
même ville NOUN
informatique science NOUN
premier programme NOUN
véritable programme NOUN
informatique programme NOUN
analytique machine NOUN
premier programme NOUN
premier programmeur NOUN
offertes possibilités NOUN
universels calculateurs NOUN
numérique calcul NOUN
féministes milieux NOUN
nombreux développeurs NOUN
familial Environnement NOUN
seule fille NOUN
légitime fille NOUN
poète fille NOUN
intelligente femme NOUN
premier prénom NOUN
incestueuses relations NOUN
même année NOUN
jeune fille NOUN
scientifique autrice NOUN
XIXe siècle NOUN
autres connaissances NOUN
complet nom NOUN
grande partie NOUN
fragile santé NOUN
mathématiques activités NOUN
enthousiaste élève NOUN
positifs retours NOUN
combinaison singulière NOUN
inépuisable énergie NOUN
analytique machine NOUN
suisse journal NOUN
analytique machine NOUN
italien mathématicien NOUN
bon niveau NOUN
spécialisé journal NOUN
scientifiques articles NOUN
étrangers articles NOUN
même période NOUN
analytique machine N



## 7- Putting it all together

Now we are going to use the skills practiced in the preceding exercises to build a simple question-answering system on a toy dataset (in French).

We will focus on specific questions of the form "Qui a peint X ?". 
We will define patterns based on different ways of formulating this question, and use them to extract the answer from a small toy corpus based on Wikipedia pages about paintings. 

When you're done with this exercise, try to answer other types of questions, such as "Où est exposé X ?", "Quand a été peinte X ?".

Below, we reload the spacy French model adding specific options to merge named entities containing multiple tokens.

In [None]:
# Load the model
nlp = spacy.load('fr_core_news_sm')

nlp.add_pipe("merge_entities")
nlp.add_pipe("merge_noun_chunks")

<function spacy.pipeline.functions.merge_noun_chunks(doc: spacy.tokens.doc.Doc) -> spacy.tokens.doc.Doc>

Here is the list of questions we will consider. 
You also need a corpus of source documents, which you can find on Moodle (corpus_qa.txt).

In [None]:
question_list = [
    'La Joconde est un tableau de qui ?',
    'Le radeau de la méduse est une peinture réalisée par qui ?'
]
corpus = 'corpus_qa.txt'

**Exercise 10:** In this part, we focus on the first question. This question is designed to be similar to the document containing the answer. We can thus define a pattern based on its structure to extract the answer from the document.

- Process the question using the spacy nlp pipeline 
- display its parse tree and / or print a Pandas dataframe containing information from the parse tree
- Now to define a lexico-syntactic pattern, you are going to use spacy *DependencyMatcher* : https://spacy.io/usage/rule-based-matching#dependencymatcher
  * Look at the doc to understand how it works
  * Define a pattern that should match 'qui' in the question 

In [None]:
# Display the parse tree of the first question 
q1 = question_list[0]
q1_prep = nlp(q1)
displacy.render(q1_prep, style="dep", jupyter=True)

In [None]:
from spacy.matcher import DependencyMatcher
# Define a lexico-syntactic pattern that retrieves the answer

pattern_tableau = [
    {
        "RIGHT_ID": "tableau",
        "RIGHT_ATTRS": {"ORTH": "tableau"}
    },

    {
        "LEFT_ID": "tableau",
        "REL_OP": ">",
        "RIGHT_ID": "mod",
        "RIGHT_ATTRS": {"DEP": "nmod"},
    },

]

# If you match the pattern to the original question, it should output 'qui'

matcher = DependencyMatcher(nlp.vocab)
matcher.add("tableau", [pattern_tableau])

matches = matcher(q1_prep)

print(matches) # list of (match_id, [token_ids]) tuples
# Each token_id corresponds to one pattern dict
match_id, token_ids = matches[0]

for i in range(len(token_ids)):
  if pattern_tableau[i]["RIGHT_ID"] == 'mod':
    print(q1_prep[token_ids[i]].text)

[(2044181489145587150, [3, 5])]
qui
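
What the `>` operator checks can be sketched in plain Python: the right-hand token must be a direct dependent of the left-hand token. In the toy parse below, only the `nmod` link from 'qui' to 'tableau' is taken from the match above; the other labels and the helper `match_head_child` are illustrative guesses, not Spacy's implementation.

```python
# (text, dependency label, index of head); the root points to itself
tokens = [("La Joconde", "nsubj", 3), ("est", "cop", 3), ("un", "det", 3),
          ("tableau", "ROOT", 3), ("de", "case", 5), ("qui", "nmod", 3),
          ("?", "punct", 3)]

def match_head_child(tokens, head_text, child_dep):
    # Return (head index, child index) pairs where the child depends directly
    # on a token whose text is head_text, with the given dependency label.
    matches = []
    for i, (text, dep, head) in enumerate(tokens):
        if dep == child_dep and head != i and tokens[head][0] == head_text:
            matches.append((head, i))
    return matches

pairs = match_head_child(tokens, "tableau", "nmod")
print(pairs)
print([tokens[child][0] for _, child in pairs])
```

The indices play the same role as the token ids returned by the matcher: one index per node of the pattern.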


**Exercise 11**: Retrieve matching documents 
Retrieve the documents that are relevant to the question, i.e. the ones containing the keyword 'La Joconde'.

It is recommended to define a function that can be reused in the next exercises. It could work as follows:

* first index all the documents using the named entities they contain (i.e. build a dictionary mapping each named entity to all the documents where it occurs)
* then the *retrieve_documents(...)* function should try to match the input keyword with a named entity and return the matching documents.
* Test the function with 'Joconde':
  * Does it work with the method based on named entities?
  * Add a fallback solution: if no document is found, simply look for the keyword string in the documents

In [None]:
# Retrieve the document that is relevant to the question, i.e. the one 
# containing 'La Joconde'
# Indexation of the documents based on named entities 
# + if no NE found, simple pattern matching


# Add a lowercasing ?
def index_corpus( documents_prep ):
  index = {}
  # index on the text of the named entities, so that the same entity
  # found in several documents shares a single entry
  for i, doc in enumerate(documents_prep):
    for ent in doc.ents:
      doc_ids = index.setdefault(ent.text, [])
      if i not in doc_ids:
        doc_ids.append(i)
  return index 

def retrieve_documents(documents_prep, index, keyword):
  matching_docs = []
  for ent_text, doc_ids in index.items():
    if keyword in ent_text: # match Joconde in La Joconde
      matching_docs.extend(d for d in doc_ids if d not in matching_docs)
  if len(matching_docs) == 0: # fallback: plain substring search in the document
    for i, doc in enumerate(documents_prep):
      if keyword in doc.text:
        matching_docs.append(i)
  return matching_docs


with open(corpus, encoding='utf-8') as infile:
  documents = infile.read().split('\n\n')
documents_prep = [nlp(doc) for doc in documents]
index = index_corpus( documents_prep )

matching_docs = retrieve_documents(documents_prep, index, 'Joconde')
print(matching_docs)

[0]
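
The comment "Add a lowercasing ?" above points to a real issue: the second question writes "méduse" in lower case, while the documents may not. Below is a minimal sketch of a case-insensitive index, in plain Python. The entity lists stand in for what `doc.ents` would return and are hypothetical, as is the helper name `lookup`.

```python
from collections import defaultdict

# Hypothetical entity strings for two documents (stand-in for doc.ents)
doc_entities = [["La Joconde", "Léonard de Vinci"],
                ["Le Radeau de La Méduse", "Théodore Géricault"]]

# Index on the case-folded entity text, using sets to avoid duplicate ids
index_ci = defaultdict(set)
for doc_id, ents in enumerate(doc_entities):
    for ent in ents:
        index_ci[ent.casefold()].add(doc_id)

def lookup(index, keyword):
    # Substring match on case-folded text, so "méduse" finds "Le Radeau de La Méduse".
    key = keyword.casefold()
    hits = set()
    for ent, doc_ids in index.items():
        if key in ent:
            hits |= doc_ids
    return sorted(hits)

print(lookup(index_ci, "Joconde"))
print(lookup(index_ci, "méduse"))
```

Case-folding both the index keys and the keyword keeps the lookup symmetric; accents are left untouched in this sketch.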


**Exercise 12:** Test the pattern

- Apply the pattern to each sentence of the retrieved document, do you find the right answer?

In [None]:
#  Apply the pattern to each sentence of this document, do you find the right answer?
# --> you should define a more general function that goes through all the matched
# documents (but here toy dataset, only one doc per question)

id_doc = matching_docs[0]
answers = []
for sent in documents_prep[id_doc].sents:
  matches = matcher(sent)

  for match_id, tok_id in matches:
    for idx,i in enumerate(tok_id[:]):
      pattern_part = pattern_tableau[idx]
      text = sent[i].text
      if pattern_part["RIGHT_ID"]=="mod":
          answers.append(text)

print("Réponses possibles: ")
print("-"+"\n-".join(answers))

Réponses possibles: 
-Léonard de Vinci


**Exercise 13:** Now define a pattern to answer the second question, and find the answer!

Be careful, there is a little issue here:
- Display the parse tree for the question and the matching document: what do you observe?
- Build a pattern to match the question and the answer (Hint: you can give a list of dependency relations in the pattern, e.g. *"RIGHT_ATTRS": {"DEP": {"IN": ["acl", "advcl"]}}*)
- Finally, retrieve the answer

In [None]:
# Display the parse tree of the second question 
q2 = question_list[1]
q2_prep = nlp(q2)
displacy.render(q2_prep, style="dep", jupyter=True)

In [None]:
# Display the parse tree of the matching document (first and unique sentence)
matching_docs = retrieve_documents(documents_prep, index, 'Méduse')
id_doc = matching_docs[0]
sentences = documents_prep[id_doc].sents
displacy.render(list(sentences)[0], style="dep", jupyter=True)

In [None]:
# Define a pattern that matches both the question and the answer.
# The dependency label differs slightly between the question and the answer,
# so the pattern accepts a list of labels.
from spacy.matcher import DependencyMatcher
pattern_peinture = [
    {
        "RIGHT_ID": "peinture",
        "RIGHT_ATTRS": {"ORTH": "peinture"}
    },
    {
        "LEFT_ID": "peinture",
        "REL_OP": ">",
        "RIGHT_ID": "realise",
        "RIGHT_ATTRS": {"DEP": {"IN":[ "acl", "advcl" ] }},
    },


    {
        "LEFT_ID": "realise",
        "REL_OP": ">",
        "RIGHT_ID": "agent",
        "RIGHT_ATTRS": {"DEP": "obl:agent"},
    },

]

# If you match the pattern to the original question, it should output 'qui'

matcher = DependencyMatcher(nlp.vocab)
matcher.add("peinture", [pattern_peinture])

matches = matcher(q2_prep)

print(matches) # a list of (match_id, token_ids) tuples
# token_ids[k] is the token matched by the k-th dict of the pattern
match_id, token_ids = matches[0]

for i in range(len(token_ids)):
  if pattern_peinture[i]["RIGHT_ID"] == 'agent':
    print(q2_prep[token_ids[i]].text)

[(11010221642568005822, [5, 6, 8])]
qui


In [None]:
#  Apply the pattern to each sentence of this document, do you find the right answer?

def answer( doc, matcher):
  answers = []
  for sent in doc.sents:
    matches = matcher(sent)

    for match_id, tok_id in matches:
      for idx,i in enumerate(tok_id[:]):
        pattern_part = pattern_peinture[idx]
        text = sent[i].text
        if pattern_part["RIGHT_ID"]=="agent":
          answers.append(text)
  return answers

answers = answer( documents_prep[id_doc], matcher)
print("Réponses possibles: ")
print("-"+"\n-".join(answers))

Réponses possibles: 
-Théodore Géricault


**Exercise 14:** Use the patterns defined above to answer the questions below. Here we know that we are looking for the painter, so we don't need to match the question against the answer: we simply test the known patterns on the relevant document to find the right answer.

- find a way to extract the name of the painting from the question
- retrieve the relevant document
- test the patterns defined previously to extract the correct answer

In [None]:
new_questions = [
    'Qui a peint American Gothic ?',
    'Qui est l\'auteur de la peinture La Nuit étoilée',
    'Qui a réalisé Un Garrochista ?',
]

In [None]:
# Find the entities in each question (there should be only one per question)
q2ents = {}
for i,q in enumerate( new_questions ):
  q2ents[i] = []
  qp = nlp( q )
  for ent in qp.ents:
    q2ents[i].append( ent )
print(q2ents)

{0: [American Gothic], 1: [La Nuit étoilée], 2: [Un Garrochista]}


In [None]:
# Retrieve the document matching the entity of each question
q2doc = {}
for q in q2ents.keys():
  for ent in q2ents[q]:
    matching_docs = retrieve_documents(documents_prep, index, ent.text)
    id_doc = matching_docs[0]
    q2doc[q] = id_doc
    print( 'Q',q, ent.text, 'Doc', q2doc[q])

Q 0 American Gothic Doc 3
Q 1 La Nuit étoilée Doc 2
Q 2 Un Garrochista Doc 7


In [None]:
# Test the documents against several patterns

matcher = DependencyMatcher(nlp.vocab)
matcher.add("PAINTER", [pattern_tableau, pattern_peinture])

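def answer( doc_prep, matcher):
  # Both patterns are registered under the same key ("PAINTER"), so a match
  # does not tell us which pattern produced it: every match is read against
  # both patterns, which can yield spurious answers.
  answers = []
  for sent in doc_prep.sents:
    matches = matcher(sent)

    for match_id, tok_id in matches:
      for idx,i in enumerate(tok_id):
        try:
          pattern_part = pattern_tableau[idx]
          text = sent[i].text
          if pattern_part["RIGHT_ID"] in ["mod"]:
            print(text)
            answers.append(text)
        except IndexError: # the match has more tokens than pattern_tableau
          print("")

        try:
          pattern_part = pattern_peinture[idx]
          text = sent[i].text
          if pattern_part["RIGHT_ID"] in ["agent"]:
            answers.append(text)
        except IndexError: # the match has more tokens than pattern_peinture
          print("")

  return answers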
for q in q2ents:
  print( "\nQ", q, "Tableau:", q2ents[q] )
  answers = answer( documents_prep[q2doc[q]], matcher)
  print("Réponses possibles: ")
  print("-"+"\n-".join(answers))


Q 0 Tableau: [American Gothic]
Grant Wood
Réponses possibles: 
-Grant Wood

Q 1 Tableau: [La Nuit étoilée]
réalisée

Réponses possibles: 
-réalisée
-Vincent van Gogh

Q 2 Tableau: [Un Garrochista]
réalisée

Réponses possibles: 
-réalisée
-Francisco de Goya
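
The spurious `réalisée` in the output above most likely appears because every match is read against both patterns: a token matched by `pattern_peinture` is looked up at the same position in `pattern_tableau`, where that position presumably carries the `mod` role. One way to avoid this is to register each pattern under its own key, e.g. `matcher.add("TABLEAU", [pattern_tableau])` and `matcher.add("PEINTURE", [pattern_peinture])`, and to recover the key of a match with `nlp.vocab.strings[match_id]`. The cell below is a minimal sketch of the lookup logic only, in plain Python: the simplified patterns (only `RIGHT_ID` is kept, and the structure of `pattern_tableau` is an assumption), the `ANSWER_ROLE` table and the `extract_answers` function are hypothetical, and matches are given directly as `(key, token_ids)` pairs.

```python
# Hypothetical, simplified stand-ins for the notebook's patterns:
# only the RIGHT_ID of each pattern dict matters for this sketch.
PATTERNS = {
    "TABLEAU": [{"RIGHT_ID": "tableau"}, {"RIGHT_ID": "mod"}],
    "PEINTURE": [{"RIGHT_ID": "peinture"}, {"RIGHT_ID": "realise"}, {"RIGHT_ID": "agent"}],
}
# For each pattern key, the RIGHT_ID of the node that holds the painter
ANSWER_ROLE = {"TABLEAU": "mod", "PEINTURE": "agent"}

def extract_answers(tokens, matches):
    """Read each match against the pattern that produced it, and only that one."""
    answers = []
    for key, token_ids in matches:
        pattern = PATTERNS[key]
        # token_ids[k] is the token matched by the k-th dict of the pattern
        for part, tok_id in zip(pattern, token_ids):
            if part["RIGHT_ID"] == ANSWER_ROLE[key]:
                answers.append(tokens[tok_id])
    return answers

# Toy sentence, with the painter's name already merged into a single token
tokens = ["peinture", "réalisée", "par", "Vincent van Gogh"]
matches = [("PEINTURE", [0, 1, 3])]
print(extract_answers(tokens, matches))  # ['Vincent van Gogh']
```

Because each match is paired with its own pattern, no `try`/`except` is needed and `réalisée` is never read as a `mod`.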


In [None]:
sentences = documents_prep[7].sents
displacy.render(list(sentences)[0], style="dep", jupyter=True)