**Morpho-syntactic analysis with NLTK (https://www.nltk.org/)**
The goal of this tutorial is to understand and then test several NLTK and spaCy functions for pre-processing and text vectorization. Each feature comes with one example, so take the time to try other sentences and to understand how to manipulate this kind of text data.

**Tokenisation and POS tagging** with NLTK:

In [1]:
%pip install nltk


In [2]:
import nltk
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')
text = nltk.word_tokenize("Today is raining!")
nltk.pos_tag(text)

[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\Adam\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     C:\Users\Adam\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!


[('Today', 'NN'), ('is', 'VBZ'), ('raining', 'VBG'), ('!', '.')]
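The tags in the output above come from the Penn Treebank tagset. As a reading aid, here is a minimal sketch that attaches a short description to each tag seen in this example. The `PENN_TAGS` dictionary is a hand-written, hypothetical helper covering only these four tags; it is not part of NLTK.

```python
# Hand-written glossary for the four tags in the output above (not an NLTK API).
PENN_TAGS = {
    "NN": "noun, singular or mass",
    "VBZ": "verb, 3rd person singular present",
    "VBG": "verb, gerund or present participle",
    ".": "sentence-final punctuation",
}

# Tagged tokens copied from the output of nltk.pos_tag above.
tagged = [('Today', 'NN'), ('is', 'VBZ'), ('raining', 'VBG'), ('!', '.')]

# Attach a description to every (word, tag) pair; unknown tags are flagged.
described = [(word, tag, PENN_TAGS.get(tag, "unknown tag")) for word, tag in tagged]

for word, tag, meaning in described:
    print(f"{word:<8} {tag:<4} {meaning}")
```

NLTK also ships tag documentation through `nltk.help.upenn_tagset`, which needs an extra data download.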

Let's find out which tags are the most common in the NEWS category of the Brown corpus:

In [3]:
from nltk.corpus import brown
nltk.download('brown')
nltk.download('universal_tagset')
brown_news_tagged = brown.tagged_words(categories='news',tagset='universal')
tag_fd = nltk.FreqDist(tag for (word, tag) in brown_news_tagged)
tag_fd.most_common()

[nltk_data] Downloading package brown to
[nltk_data]     C:\Users\Adam\AppData\Roaming\nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package universal_tagset to
[nltk_data]     C:\Users\Adam\AppData\Roaming\nltk_data...
[nltk_data]   Package universal_tagset is already up-to-date!


[('NOUN', 30654),
 ('VERB', 14399),
 ('ADP', 12355),
 ('.', 11928),
 ('DET', 11389),
 ('ADJ', 6706),
 ('ADV', 3349),
 ('CONJ', 2717),
 ('PRON', 2535),
 ('PRT', 2264),
 ('NUM', 2166),
 ('X', 92)]
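Raw counts are easier to compare as proportions. `nltk.FreqDist` is a subclass of `collections.Counter`, so the sketch below uses a plain `Counter` filled with the counts printed above and converts them to relative frequencies. The variable names are illustrative only.

```python
from collections import Counter

# Counts copied from the tag_fd.most_common() output above.
tag_counts = Counter({
    "NOUN": 30654, "VERB": 14399, "ADP": 12355, ".": 11928,
    "DET": 11389, "ADJ": 6706, "ADV": 3349, "CONJ": 2717,
    "PRON": 2535, "PRT": 2264, "NUM": 2166, "X": 92,
})

# Total number of tagged tokens in the news category.
total = sum(tag_counts.values())

# Relative frequency of every tag, most frequent first.
relative = [(tag, count / total) for tag, count in tag_counts.most_common()]

for tag, share in relative[:5]:
    print(f"{tag:<5} {share:.1%}")
```

With a real `FreqDist`, the same proportion is available through `tag_fd.freq('NOUN')`.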

**Stemming** with NLTK.

In [4]:
from nltk.stem import PorterStemmer
porter = PorterStemmer()

for word in ['walking', 'walks', 'walked']:
    print(porter.stem(word))

walk
walk
walk
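To get an intuition for what a stemmer does, here is a deliberately naive suffix stripper. It is NOT the Porter algorithm, which applies several ordered rule phases with conditions on the remaining stem; `naive_stem` and its suffix list are assumptions made for illustration.

```python
def naive_stem(word):
    """Strip one common English suffix (a toy rule set, not the Porter algorithm)."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words such as "is" or "red" survive.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

stems = [naive_stem(w) for w in ["walking", "walks", "walked"]]
print(stems)

# The toy rules do not undo consonant doubling, so this prints "runn".
print(naive_stem("running"))
```

The last call shows why real stemmers need more than suffix removal: handling doubled consonants is one of the things the extra Porter rules take care of.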


**Word distribution in the text**

The text.similar() method takes a word w, finds all contexts w1 w w2 in which it occurs, and then finds all words w' that appear in the same context, i.e. w1 w' w2.

In [5]:
text = nltk.Text(word.lower() for word in nltk.corpus.brown.words())
#text.similar('woman')
# Try the following words, and others:
text.similar('bought')
#text.similar('the')

made said done put had seen found given left heard was been brought
set got that took in told felt
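The idea behind `text.similar()` can be sketched in a few lines of plain Python. This is a simplified illustration on a toy corpus, assuming that similarity is just the number of shared (left word, right word) contexts; NLTK's own implementation scores candidates differently, and `similar_words` is a hypothetical helper.

```python
from collections import Counter, defaultdict

def similar_words(tokens, target, top_n=5):
    """Rank words by how many (left, right) contexts they share with `target`."""
    # Collect the set of contexts (w1, w2) for every word w seen as w1 w w2.
    contexts = defaultdict(set)
    for left, word, right in zip(tokens, tokens[1:], tokens[2:]):
        contexts[word].add((left, right))

    target_contexts = contexts.get(target, set())

    # Count shared contexts for every other word.
    scores = Counter()
    for word, word_contexts in contexts.items():
        shared = len(word_contexts & target_contexts)
        if word != target and shared:
            scores[word] = shared
    return [word for word, _ in scores.most_common(top_n)]

toy_corpus = (
    "she bought a car . he sold a car . she made a cake . "
    "he bought a cake . they sold the house"
).split()

print(similar_words(toy_corpus, "bought"))
```

In the toy corpus, "bought" occurs in the contexts "she _ a" and "he _ a", which it shares with "made" and "sold" respectively.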


**How do you create a CFG?**

Let's define a grammar and see how to parse a simple sentence that the grammar accepts.

Which sentences can this grammar recognize?

In [6]:
# A small context-free grammar for English
grammar1 = nltk.CFG.fromstring("""
S -> NP VP
VP -> V NP | V NP PP
PP -> P NP
V -> 'saw' | 'ate' | 'walked' | 'chase'
NP -> 'John' | 'Mary' | 'Bob' | Det N | Det N PP | N
Det -> 'a' | 'an' | 'the' | 'my'
N -> 'man' | 'dog' | 'cat' | 'telescope' | 'park'| 'dogs' | 'cats'
P -> 'in' | 'on' | 'by' | 'with'
""")
#sent = "Mary saw Bob".split()
sent = "dogs chase cats".split()
rd_parser = nltk.RecursiveDescentParser(grammar1)
for tree in rd_parser.parse(sent):
    print(tree)

(S (NP (N dogs)) (VP (V chase) (NP (N cats))))
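To see what a recursive descent parser does with this grammar, the following is a minimal pure-Python sketch. It assumes the grammar is rewritten by hand as a dictionary (`GRAMMAR`) and that trees are nested tuples; it is an illustration of top-down parsing with backtracking, not NLTK's implementation.

```python
# The grammar from the cell above: nonterminal -> list of right-hand sides.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "VP": [["V", "NP"], ["V", "NP", "PP"]],
    "PP": [["P", "NP"]],
    "V": [["saw"], ["ate"], ["walked"], ["chase"]],
    "NP": [["John"], ["Mary"], ["Bob"], ["Det", "N"], ["Det", "N", "PP"], ["N"]],
    "Det": [["a"], ["an"], ["the"], ["my"]],
    "N": [["man"], ["dog"], ["cat"], ["telescope"], ["park"], ["dogs"], ["cats"]],
    "P": [["in"], ["on"], ["by"], ["with"]],
}

def parses(symbol, tokens, start):
    """Yield (tree, end) for every way `symbol` derives tokens[start:end]."""
    if symbol not in GRAMMAR:
        # Terminal: it matches only the identical token at this position.
        if start < len(tokens) and tokens[start] == symbol:
            yield symbol, start + 1
        return
    for rhs in GRAMMAR[symbol]:
        # Expand the right-hand side left to right, keeping every partial match.
        partial = [([], start)]
        for child in rhs:
            partial = [
                (kids + [tree], end)
                for kids, pos in partial
                for tree, end in parses(child, tokens, pos)
            ]
        for kids, end in partial:
            yield (symbol, *kids), end

def parse(sentence):
    """Return all trees rooted in S that cover the whole sentence."""
    tokens = sentence.split()
    return [tree for tree, end in parses("S", tokens, 0) if end == len(tokens)]

def bracket(tree):
    """Render a tuple tree in bracketed form."""
    if isinstance(tree, str):
        return tree
    label, *kids = tree
    return "(" + label + " " + " ".join(bracket(kid) for kid in kids) + ")"

for tree in parse("the dog saw a man in the park"):
    print(bracket(tree))
```

The example sentence is ambiguous under this grammar: "in the park" can attach to the noun phrase "a man" or to the verb phrase, so two trees are printed. Top-down parsing of this kind only terminates because the grammar has no left-recursive rule.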


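Exercise: modify the grammar so that it can recognize the sentence "dogs chase cats", then test it with NLTK. (The grammar shown above already contains the additions this requires: the nouns 'dogs' and 'cats', the verb 'chase', and the rule NP -> N.)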

In [7]:
# Test here!

**A CFG for French.** Which sentences can this grammar recognize?

Modify the grammar so that it can recognize other sentences!

In [8]:
grammaire = nltk.CFG.fromstring("""
S -> SN SV
SN -> Art Nom
SV -> V SN | V
Nom -> 'chien' | 'chat'
Art -> 'le'
V -> 'mange'
V -> 'dort'
""")
sent = "le chien dort".split()
rd_parser = nltk.RecursiveDescentParser(grammaire)
for tree in rd_parser.parse(sent):
    print(tree)

(S (SN (Art le) (Nom chien)) (SV (V dort)))
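One way to answer "which sentences can this grammar recognize?" is to enumerate them, which is possible here because the French grammar has no recursive rule. The sketch below assumes the grammar is copied by hand into a dictionary (`GRAMMAIRE`); `generate` is an illustrative helper written for this example. NLTK offers a comparable utility in `nltk.parse.generate`.

```python
from itertools import product

# The French grammar from the cell above: nonterminal -> list of right-hand sides.
GRAMMAIRE = {
    "S": [["SN", "SV"]],
    "SN": [["Art", "Nom"]],
    "SV": [["V", "SN"], ["V"]],
    "Nom": [["chien"], ["chat"]],
    "Art": [["le"]],
    "V": [["mange"], ["dort"]],
}

def generate(symbol):
    """Yield every word list derivable from `symbol` (non-recursive grammars only)."""
    if symbol not in GRAMMAIRE:
        # Terminal symbol: it derives exactly itself.
        yield [symbol]
        return
    for rhs in GRAMMAIRE[symbol]:
        # Combine every expansion of each child, in order.
        for parts in product(*(list(generate(child)) for child in rhs)):
            yield [word for part in parts for word in part]

sentences = [" ".join(words) for words in generate("S")]
for sentence in sentences:
    print(sentence)
```

The listing makes over-generation visible: the grammar accepts "le chien dort le chat" because every verb may take an object.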


Now let's try spaCy, another open-source library for advanced natural language processing in Python.

In [9]:
%pip install -U pip setuptools wheel
%pip install -U spacy==3.5.0




In [10]:
!python -m spacy download en_core_web_sm

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [11]:
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp('Hello     World!')
for token in doc:
    print('"' + token.text + '"')


"Hello"
"    "
"World"
"!"


In [12]:
# Sentence detection

doc = nlp("These are apples. These are oranges.")

for sent in doc.sents:
    print(sent)



These are apples.
These are oranges.


In [13]:
# POS Tagging

doc = nlp("Next week I'll be in Madrid.")
print([(token.text, token.tag_) for token in doc])



[('Next', 'JJ'), ('week', 'NN'), ('I', 'PRP'), ("'ll", 'MD'), ('be', 'VB'), ('in', 'IN'), ('Madrid', 'NNP'), ('.', '.')]


In [14]:
# NER Named Entity Recognition

doc = nlp(u"Next week I'll be in Madrid.")
for ent in doc.ents:
    print(ent.text, ent.label_)


Next week DATE
Madrid GPE


In [15]:
# spaCy entity types

doc = nlp(u"I just bought 2 shares at 9 a.m. because the stock went up 30% in just 2 days according to the WSJ")
for ent in doc.ents:
    print(ent.text, ent.label_)


2 CARDINAL
9 a.m. TIME
30% PERCENT
just 2 days DATE
WSJ ORG


In [16]:
# displaCy

from spacy import displacy

doc = nlp(u'I just bought 2 shares at 9 a.m. because the stock went up 30% in just 2 days according to the WSJ')
displacy.render(doc, style='ent', jupyter=True)

In [17]:
# Chunking

doc = nlp("Wall Street Journal just published an interesting piece on crypto currencies")
for chunk in doc.noun_chunks:
    print(chunk.text, chunk.label_, chunk.root.text)


Wall Street Journal NP Journal
an interesting piece NP piece
crypto currencies NP currencies


In [18]:
# Dependency Parsing

doc = nlp('Wall Street Journal just published an interesting piece on crypto currencies')

for token in doc:
    print("{0}/{1} <--{2}-- {3}/{4}".format(
        token.text, token.tag_, token.dep_, token.head.text, token.head.tag_))

Wall/NNP <--compound-- Street/NNP
Street/NNP <--compound-- Journal/NNP
Journal/NNP <--nsubj-- published/VBD
just/RB <--advmod-- published/VBD
published/VBD <--ROOT-- published/VBD
an/DT <--det-- piece/NN
interesting/JJ <--amod-- piece/NN
piece/NN <--dobj-- published/VBD
on/IN <--prep-- piece/NN
crypto/JJ <--amod-- currencies/NNS
currencies/NNS <--pobj-- on/IN
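Each output line above is one arc from a token to its head, so the full tree can be rebuilt from them. The sketch below copies the arcs by hand and prints the tree with indentation. It assumes every word occurs only once in the sentence (true here), which lets the word itself serve as a node identifier; with real spaCy tokens, `token.i` would be the safer key.

```python
from collections import defaultdict

# (token, dependency label, head) triples copied from the output above.
arcs = [
    ("Wall", "compound", "Street"),
    ("Street", "compound", "Journal"),
    ("Journal", "nsubj", "published"),
    ("just", "advmod", "published"),
    ("published", "ROOT", "published"),
    ("an", "det", "piece"),
    ("interesting", "amod", "piece"),
    ("piece", "dobj", "published"),
    ("on", "prep", "piece"),
    ("crypto", "amod", "currencies"),
    ("currencies", "pobj", "on"),
]

# Group the dependents of every head; the ROOT token is its own head.
children = defaultdict(list)
root = None
for token, dep, head in arcs:
    if dep == "ROOT":
        root = token
    else:
        children[head].append((token, dep))

def render(token, dep="ROOT", depth=0):
    """Return the subtree under `token` as indented lines."""
    lines = ["  " * depth + f"{token} ({dep})"]
    for child, child_dep in children[token]:
        lines.extend(render(child, child_dep, depth + 1))
    return lines

tree_lines = render(root)
print("\n".join(tree_lines))
```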


In [19]:
# Visualizing the dependency parse

from spacy import displacy

doc = nlp(u'Wall Street Journal just published an interesting piece on crypto currencies')
displacy.render(doc, style='dep', jupyter=True, options={'distance': 90})

In [20]:
!python -m spacy download en_core_web_lg


In [21]:
# Load the en_core_web_lg embeddings

nlp = spacy.load('en_core_web_lg')


In [22]:
# View vector representation for the word 'banana'

print(nlp.vocab[u'banana'].vector)

[ 0.20778  -2.4151    0.36605   2.0139   -0.23752  -3.1952   -0.2952
  1.2272   -3.4129   -0.54969   0.32634  -1.0813    0.55626   1.5195
  0.97797  -3.1816   -0.37207  -0.86093   2.1509   -4.0845    0.035405
  3.5702   -0.79413  -1.7025   -1.6371   -3.198    -1.9387    0.91166
  0.85409   1.8039   -1.103    -2.5274    1.6365   -0.82082   1.0278
 -1.705     1.5511   -0.95633  -1.4702   -1.865    -0.19324  -0.49123
  2.2361    2.2119    3.6654    1.7943   -0.20601   1.5483   -1.3964
 -0.50819   2.1288   -2.332     1.3539   -2.1917    1.8923    0.28472
  0.54285   1.2309    0.26027   1.9542    1.1739   -0.40348   3.2028
  0.75381  -2.7179   -1.3587   -1.1965   -2.0923    2.2855   -0.3058
 -0.63174   0.70083   0.16899   1.2325    0.97006  -0.23356  -2.094
 -1.737     3.6075   -1.511    -0.9135    0.53878   0.49268   0.44751
  0.6315    1.4963    4.1725    2.1961   -1.2409    0.4214    2.9678
  1.841     3.0133   -4.4652    0.96521  -0.29787   4.3386   -1.2527
 -1.7734   -3.5637   -0.20035

In [24]:
%pip install -U scipy



In [25]:
# Word embedding math: "man" - "woman" + "queen" should be close to "king"

from scipy import spatial

cosine_similarity = lambda x, y: 1 - spatial.distance.cosine(x, y)

man = nlp.vocab[u'man'].vector
woman = nlp.vocab[u'woman'].vector
queen = nlp.vocab[u'queen'].vector
king = nlp.vocab[u'king'].vector

# We now need to find the closest vector in the vocabulary to the result of "man" - "woman" + "queen"
maybe_king = man - woman + queen
computed_similarities = []

for word in nlp.vocab:
    # Ignore words without vectors
    if not word.has_vector:
        continue

    similarity = cosine_similarity(maybe_king, word.vector)
    computed_similarities.append((word, similarity))

computed_similarities = sorted(computed_similarities, key=lambda item: -item[1])
print([w[0].text for w in computed_similarities[:10]])

# Reference output (it differs from the actual output of this run, printed below):
# ['Queen', 'QUEEN', 'queen', 'King', 'KING', 'king', 'KIng', 'KINGS', 'kings', 'Kings']


['queen', 'man', 'king', 'woman', 'he', 'nothin’', "'cause", "'Cause", 'He', 'That']
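The vector arithmetic in the cell above can be illustrated without loading a model. This is a minimal sketch with hypothetical two-dimensional embeddings (`toy_vectors` is invented for the example; real model vectors have far more dimensions and no such clean axes). One possible reason for the noisy actual output above, stated here as an assumption and not verified, is that recent spaCy versions load `nlp.vocab` lazily, so the loop only visits a subset of the vocabulary.

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

# Hypothetical embeddings: axis 0 loosely encodes "royalty", axis 1 "gender".
toy_vectors = {
    "man": (0.1, 1.0),
    "woman": (0.1, -1.0),
    "king": (1.0, 1.0),
    "queen": (1.0, -1.0),
}

man, woman, queen = (toy_vectors[w] for w in ("man", "woman", "queen"))

# Same arithmetic as in the notebook cell: "man" - "woman" + "queen".
maybe_king = tuple(m - w + q for m, w, q in zip(man, woman, queen))

# Rank every word by cosine similarity to the computed vector.
ranked = sorted(
    toy_vectors,
    key=lambda word: cosine_similarity(maybe_king, toy_vectors[word]),
    reverse=True,
)
print(ranked)
```

In this toy space the computed vector coincides with "king", so "king" ranks first. With real embeddings the input words themselves often rank near the top, which is why analogy evaluations commonly exclude them from the candidates.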


In [26]:
# Computing Similarity

banana = nlp.vocab[u'banana']
dog = nlp.vocab[u'dog']
fruit = nlp.vocab[u'fruit']
animal = nlp.vocab[u'animal']

# The trailing comments are reference values; this run's output (below) differs.
print(dog.similarity(animal), dog.similarity(fruit)) # 0.6618534 0.23552845
print(banana.similarity(fruit), banana.similarity(animal)) # 0.67148364 0.2427285



0.5192115902900696 0.13643456995487213
0.6650428175926208 0.18752224743366241


In [27]:
# Computing Similarity on entire texts

target = nlp(u"Cats are beautiful animals.")

doc1 = nlp(u"Dogs are awesome.")
doc2 = nlp(u"Some gorgeous creatures are felines.")
doc3 = nlp(u"Dolphins are swimming mammals.")

# The trailing comments are reference values; this run's output (below) differs.
print(target.similarity(doc1))  # 0.8901765218466683
print(target.similarity(doc2))  # 0.9115828449161616
print(target.similarity(doc3))  # 0.7822956752876101

0.925293344292394
0.9067517259890845
0.9037427153904276
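All three scores above are high and close together. By default, spaCy computes a document vector as the average of its token vectors and compares documents by cosine similarity, so words shared by every sentence (here "are" and the final period) pull the averages towards each other. The sketch below reproduces that effect with hypothetical two-dimensional vectors; `toy_vectors` and `mean_vector` are assumptions made for illustration.

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

# Hypothetical word vectors: "cats" and "dolphins" are orthogonal on purpose.
toy_vectors = {
    "cats": (1.0, 0.0),
    "dolphins": (0.0, 1.0),
    "are": (1.0, 1.0),
}

def mean_vector(words):
    """Average the vectors of the given words, dimension by dimension."""
    vectors = [toy_vectors[word] for word in words]
    return tuple(sum(dim) / len(vectors) for dim in zip(*vectors))

# Content words alone are completely dissimilar in this toy space.
content_only = cosine_similarity(toy_vectors["cats"], toy_vectors["dolphins"])

# Adding the shared word "are" to both sides raises the similarity sharply.
with_shared_word = cosine_similarity(
    mean_vector(["cats", "are"]),
    mean_vector(["dolphins", "are"]),
)

print(content_only, with_shared_word)
```

This suggests comparing documents after removing stop words and punctuation when the goal is to rank them by topical similarity.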
