# Lab Series 3 - NLP - Lexical Processing with NLTK

## Part 1 - POS Tagging, Stemming, and Lemmatization

What is the **NLTK** Python package?

The Natural Language Toolkit (NLTK) is a Python library for building programs that work with natural language. It provides an easy-to-use interface to more than 50 corpora and lexical resources such as WordNet. The library supports operations such as tokenization, stemming, lemmatization, classification, parsing, and POS tagging.

NLTK can be used by students, researchers, and industry practitioners. It is free and open source, and it is available for Windows, macOS, and Linux.

- To install it with pip from a terminal (for example cmd.exe): pip install nltk


### Installation and Downloads

In [None]:
!pip install nltk

In [1]:
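import nltk
nltk.download('universal_tagset')
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')
# pos_tag (used below) also needs a tagger model. If it is missing, run:
# nltk.download('averaged_perceptron_tagger')
# (recent NLTK releases may ask for a differently named resource;
#  the LookupError message names the one to download)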

[nltk_data] Downloading package universal_tagset to
[nltk_data]     C:\Users\LeE\AppData\Roaming\nltk_data...
[nltk_data]   Package universal_tagset is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\LeE\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\LeE\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\LeE\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\LeE\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

### Importing the packages

In [1]:
import nltk

# importing tokenizers
from nltk.tokenize import word_tokenize
from nltk.tokenize import RegexpTokenizer
from nltk.tokenize import TweetTokenizer
from nltk.tokenize import sent_tokenize

# importing pos taggers
from nltk.tag import pos_tag

# importing the stopwords
from nltk.corpus import stopwords

# importing Porter and Lancaster stemmers 
from nltk.stem import PorterStemmer, LancasterStemmer

# importing WordNetLemmatizer
from nltk.stem import WordNetLemmatizer

from collections import Counter

# importing wordnet
from nltk.corpus import wordnet

### Tokenization with NLTK

In [3]:
text = "POS Tagging is a process to mark up the words in text format for a particular part of a speech, based on its definition and context."

In [4]:
# Default tokenization - punctuation marks are kept as separate tokens by word_tokenize
tokens = word_tokenize(text)

In [5]:
print(tokens)

['POS', 'Tagging', 'is', 'a', 'process', 'to', 'mark', 'up', 'the', 'words', 'in', 'text', 'format', 'for', 'a', 'particular', 'part', 'of', 'a', 'speech', ',', 'based', 'on', 'its', 'definition', 'and', 'context', '.']


In [6]:
# Punctuation is removed by RegexpTokenizer(r'\w+'), which keeps only runs of word characters
tokenizer = RegexpTokenizer(r'\w+')
tokens = tokenizer.tokenize(text)

In [7]:
print(tokens)

['POS', 'Tagging', 'is', 'a', 'process', 'to', 'mark', 'up', 'the', 'words', 'in', 'text', 'format', 'for', 'a', 'particular', 'part', 'of', 'a', 'speech', 'based', 'on', 'its', 'definition', 'and', 'context']


#### Stopwords filtering

In [34]:
print(stopwords.fileids())

['arabic', 'azerbaijani', 'basque', 'bengali', 'catalan', 'chinese', 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'greek', 'hebrew', 'hinglish', 'hungarian', 'indonesian', 'italian', 'kazakh', 'nepali', 'norwegian', 'portuguese', 'romanian', 'russian', 'slovene', 'spanish', 'swedish', 'tajik', 'turkish']


In [35]:
en_sw = stopwords.words('english')

en_sw[0:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

In [37]:
ar_sw = stopwords.words('arabic')

ar_sw[0:10]

['إذ', 'إذا', 'إذما', 'إذن', 'أف', 'أقل', 'أكثر', 'ألا', 'إلا', 'التي']

In [36]:
# Filter stopwords from tokens
# (the NLTK stopword lists are lowercase, so compare in lowercase)

clean_tokens = []

for token in tokens:
    if token.lower() not in en_sw:
        clean_tokens.append(token)

In [38]:
print(clean_tokens)

['POS', 'Tagging', 'process', 'mark', 'words', 'text', 'format', 'particular', 'part', 'speech', ',', 'based', 'definition', 'context', '.']


In [41]:
# Other tokenizers: TweetTokenizer
tknzr = TweetTokenizer()
tweet = "This is a cooool #dummysmiley: :-) :-P <3 and some arrows < > -> <--"
tokens = tknzr.tokenize(tweet)
print(tokens)

['This', 'is', 'a', 'cooool', '#dummysmiley', ':', ':-)', ':-P', '<3', 'and', 'some', 'arrows', '<', '>', '->', '<--']


In [44]:
# Other tokenizers: sentence tokenizer
sentences = 'This is a text written. It uses U.S. english to illustrate sentence tokenization.'

sent_tokens = sent_tokenize(sentences)
print(sent_tokens)

['This is a text written.', 'It uses U.S. english to illustrate sentence tokenization.']


### POS Tagging

POS (part-of-speech) tagging labels each word in a text with its part of speech, based on the word's definition and its context. It categorizes the tokens of a text as nouns, verbs, adjectives, and so on.

NLTK signature: nltk.tag.pos_tag(tokens, tagset=None, lang='eng'), which returns a list of (token, tag) tuples: list(tuple(str, str)).

In [10]:
# Default tagset: Penn Treebank
tags = pos_tag(tokens)
print(tags)

[('POS', 'NNP'), ('Tagging', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('process', 'NN'), ('to', 'TO'), ('mark', 'VB'), ('up', 'RP'), ('the', 'DT'), ('words', 'NNS'), ('in', 'IN'), ('text', 'JJ'), ('format', 'NN'), ('for', 'IN'), ('a', 'DT'), ('particular', 'JJ'), ('part', 'NN'), ('of', 'IN'), ('a', 'DT'), ('speech', 'NN'), ('based', 'VBN'), ('on', 'IN'), ('its', 'PRP$'), ('definition', 'NN'), ('and', 'CC'), ('context', 'NN')]


In [11]:
# With NLTK's "universal" tagset (a coarse 12-tag set; note PRT and CONJ, which differ from the Universal Dependencies tags)
tags = pos_tag(tokens, tagset = "universal")
print(tags)

[('POS', 'NOUN'), ('Tagging', 'NOUN'), ('is', 'VERB'), ('a', 'DET'), ('process', 'NOUN'), ('to', 'PRT'), ('mark', 'VERB'), ('up', 'PRT'), ('the', 'DET'), ('words', 'NOUN'), ('in', 'ADP'), ('text', 'ADJ'), ('format', 'NOUN'), ('for', 'ADP'), ('a', 'DET'), ('particular', 'ADJ'), ('part', 'NOUN'), ('of', 'ADP'), ('a', 'DET'), ('speech', 'NOUN'), ('based', 'VERB'), ('on', 'ADP'), ('its', 'PRON'), ('definition', 'NOUN'), ('and', 'CONJ'), ('context', 'NOUN')]


In [12]:
# Get all tags
tags_list = []
for tag in tags:
    tags_list.append(tag[1])
print(tags_list)

['NOUN', 'NOUN', 'VERB', 'DET', 'NOUN', 'PRT', 'VERB', 'PRT', 'DET', 'NOUN', 'ADP', 'ADJ', 'NOUN', 'ADP', 'DET', 'ADJ', 'NOUN', 'ADP', 'DET', 'NOUN', 'VERB', 'ADP', 'PRON', 'NOUN', 'CONJ', 'NOUN']


In [13]:
# Count the number of occurrences of each tag
counts = Counter(tags_list)
print(counts)

Counter({'NOUN': 9, 'DET': 4, 'ADP': 4, 'VERB': 3, 'PRT': 2, 'ADJ': 2, 'PRON': 1, 'CONJ': 1})


In [14]:
# Same count, computed directly from the (token, tag) pairs
counts = Counter(tag for token, tag in tags)
print(counts)

Counter({'NOUN': 9, 'DET': 4, 'ADP': 4, 'VERB': 3, 'PRT': 2, 'ADJ': 2, 'PRON': 1, 'CONJ': 1})


### **Stemming** - Porter and Lancaster stemmers

The Lancaster stemmer is simple but aggressive: it applies its rules iteratively, so over-stemming may occur. Over-stemmed results may not be linguistically valid stems, or may carry no meaning at all. For the same reason, Lancaster often produces shorter stems than Porter, as the examples below show.

In [15]:
# Create an object of class PorterStemmer and LancasterStemmer
porter = PorterStemmer()
lancaster = LancasterStemmer()

In [16]:
porter_stem = porter.stem("probably")
print(porter_stem)

probabl


In [17]:
lancaster_stem = lancaster.stem("probably")
print(lancaster_stem)

prob


In [18]:
print("Porter Stemmer examples: ")
print(porter.stem("changes"))
print(porter.stem("troubling"))
print(porter.stem("troubled"))
print(porter.stem("cats"))
print(porter.stem("characterization"))

Porter Stemmer examples: 
chang
troubl
troubl
cat
character


In [19]:
print("Lancaster Stemmer examples : ")
print(lancaster.stem("changes"))
print(lancaster.stem("troubling"))
print(lancaster.stem("troubled"))
print(lancaster.stem("cats"))
print(lancaster.stem("characterization"))

Lancaster Stemmer examples : 
chang
troubl
troubl
cat
charact


In [20]:
# Stemming a list of words

word_list = ["friend", "friendship", "friends", "friendships","stabil", "destabilize", "misunderstanding", "railroad", "moonlight", "football"]

for word in word_list:
    print(f"{word:20} {porter.stem(word):20} {lancaster.stem(word)}")

friend               friend               friend
friendship           friendship           friend
friends              friend               friend
friendships          friendship           friend
stabil               stabil               stabl
destabilize          destabil             dest
misunderstanding     misunderstand        misunderstand
railroad             railroad             railroad
moonlight            moonlight            moonlight
football             footbal              footbal


In [21]:
# Stemming a sentence using a defined function : stem_sentence

def stem_sentence(sentence):
    # Tokenization
    tokens = word_tokenize(sentence)
    
    # Stemming
    stems = []
    for token in tokens:
        stems.append(porter.stem(token))
    
    return " ".join(stems)

In [22]:
sentence = "Pythoners are very intelligent, and work very pythonly and now they are pythoning their way to success."

stems = stem_sentence(sentence)

print('Stemmed sentence: ', stems)

Stemmed sentence:  python are veri intellig , and work veri pythonli and now they are python their way to success .


### **Lemmatization**  - WordNet Lemmatizer

In [23]:
# Instantiating the lemmatizer object
lemmatizer = WordNetLemmatizer()

In [24]:
# Lemmatize a single word without a POS hint (the default is pos='n', noun)
print(lemmatizer.lemmatize("bats"))
print(lemmatizer.lemmatize("feet"))
print(lemmatizer.lemmatize("are"))
print(lemmatizer.lemmatize("changes"))

bat
foot
are
change


In [25]:
# Lemmatize a single word with a POS hint ('v' = verb, 'n' = noun)
print(lemmatizer.lemmatize("are", pos='v'))
print(lemmatizer.lemmatize("swimming", pos='v'))
print(lemmatizer.lemmatize("swimming", pos='n'))
print(lemmatizer.lemmatize("stripes", pos='v')) 
print(lemmatizer.lemmatize("stripes", pos='n'))

be
swim
swimming
strip
stripe


In [26]:
# Lemmatize a sentence
sentence = "He is running and eating at same time. He has bad habit of swimming after playing long hours in the Sun."

# Tokenize the sentence into a list of words, without punctuation
tokenizer = RegexpTokenizer(r'\w+')
tokens = tokenizer.tokenize(sentence)

# Lemmatize without context
for token in tokens:
    print(f"{token:20} {lemmatizer.lemmatize(token)}")

He                   He
is                   is
running              running
and                  and
eating               eating
at                   at
same                 same
time                 time
He                   He
has                  ha
bad                  bad
habit                habit
of                   of
swimming             swimming
after                after
playing              playing
long                 long
hours                hour
in                   in
the                  the
Sun                  Sun
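
The output above shows that, without a POS hint, verb forms such as "running" or "has" are not reduced correctly. One possible remedy, sketched here under assumptions, is to derive the hint from Penn Treebank tags. The helpers `penn_to_wordnet` and `lemmatize_tagged` are hypothetical names introduced for illustration; they are not part of NLTK. The letters returned are the values WordNet uses for its four parts of speech.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter ('a', 'v', 'r' or 'n')."""
    if tag.startswith('J'):
        return 'a'   # adjectives: JJ, JJR, JJS
    if tag.startswith('V'):
        return 'v'   # verbs: VB, VBD, VBG, VBN, VBP, VBZ
    if tag.startswith('RB'):
        return 'r'   # adverbs: RB, RBR, RBS (but not RP, the particle tag)
    return 'n'       # everything else falls back to noun, the lemmatizer's default


def lemmatize_tagged(tagged_tokens, lemmatizer):
    """Lemmatize (token, penn_tag) pairs.

    `lemmatizer` is any object exposing lemmatize(word, pos=...),
    for example the WordNetLemmatizer instance created earlier.
    """
    return [lemmatizer.lemmatize(token, pos=penn_to_wordnet(tag))
            for token, tag in tagged_tokens]
```

In this notebook it could be called as `lemmatize_tagged(pos_tag(tokens), lemmatizer)`. The quality of the lemmas then depends on how well the tagger labels each token.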


### Stemming Vs. Lemmatization

In [27]:
print(porter.stem("leaves"))
print(porter.stem("leafs"))

leav
leaf


In [28]:
print(lemmatizer.lemmatize("leaves", pos='v'))
print(lemmatizer.lemmatize("leaves", pos='n'))
print(lemmatizer.lemmatize("leafs"))

leave
leaf
leaf


### N-grams, Bigrams and Trigrams

An **N-gram** is a contiguous sequence of n items from a given sample of text or speech. In Natural Language Processing, N-grams are widely used for text analysis. An N-gram of size 1 is called a "unigram", size 2 a "bigram", and size 3 a "trigram".

- **Bigrams**: sequences of two consecutive words
- **Trigrams**: sequences of three consecutive words
- **N-grams**: sequences of n consecutive words
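
To make the definition concrete, here is a minimal pure-Python sketch of n-gram extraction. The function name `make_ngrams` is an assumption chosen for illustration; NLTK ships its own helpers (`nltk.bigrams`, `nltk.trigrams`, `nltk.ngrams`).

```python
def make_ngrams(tokens, n):
    """Return the list of contiguous n-grams (as tuples) of a token list."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # zip over n shifted copies of the list; zip stops at the shortest copy,
    # so the result has len(tokens) - n + 1 tuples (none if the list is too short)
    return list(zip(*(tokens[i:] for i in range(n))))


sample = ["POS", "Tagging", "is", "a", "process"]
print(make_ngrams(sample, 2))
print(make_ngrams(sample, 3))
```

A list of k tokens yields k - n + 1 n-grams, which is why the 28 tokens used below give 27 bigrams and 26 trigrams.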

In [29]:
text = "POS Tagging is a process to mark up the words in text format for a particular part of a speech, based on its definition and context."
tokens = word_tokenize(text)

In [30]:
print(tokens)

['POS', 'Tagging', 'is', 'a', 'process', 'to', 'mark', 'up', 'the', 'words', 'in', 'text', 'format', 'for', 'a', 'particular', 'part', 'of', 'a', 'speech', ',', 'based', 'on', 'its', 'definition', 'and', 'context', '.']


In [31]:
bigrams = list(nltk.bigrams(tokens))
print(bigrams)

[('POS', 'Tagging'), ('Tagging', 'is'), ('is', 'a'), ('a', 'process'), ('process', 'to'), ('to', 'mark'), ('mark', 'up'), ('up', 'the'), ('the', 'words'), ('words', 'in'), ('in', 'text'), ('text', 'format'), ('format', 'for'), ('for', 'a'), ('a', 'particular'), ('particular', 'part'), ('part', 'of'), ('of', 'a'), ('a', 'speech'), ('speech', ','), (',', 'based'), ('based', 'on'), ('on', 'its'), ('its', 'definition'), ('definition', 'and'), ('and', 'context'), ('context', '.')]


In [32]:
trigrams = list(nltk.trigrams(tokens))
print(trigrams)

[('POS', 'Tagging', 'is'), ('Tagging', 'is', 'a'), ('is', 'a', 'process'), ('a', 'process', 'to'), ('process', 'to', 'mark'), ('to', 'mark', 'up'), ('mark', 'up', 'the'), ('up', 'the', 'words'), ('the', 'words', 'in'), ('words', 'in', 'text'), ('in', 'text', 'format'), ('text', 'format', 'for'), ('format', 'for', 'a'), ('for', 'a', 'particular'), ('a', 'particular', 'part'), ('particular', 'part', 'of'), ('part', 'of', 'a'), ('of', 'a', 'speech'), ('a', 'speech', ','), ('speech', ',', 'based'), (',', 'based', 'on'), ('based', 'on', 'its'), ('on', 'its', 'definition'), ('its', 'definition', 'and'), ('definition', 'and', 'context'), ('and', 'context', '.')]


### WordNet - Distance and Word similarity - Wu-Palmer Similarity

In [51]:
dog = wordnet.synsets('dog')[0]

In [52]:
cat = wordnet.synsets('cat')[0]

In [57]:
play = wordnet.synsets('play')[0]

In [56]:
dog.wup_similarity(cat)

0.8571428571428571

In [58]:
dog.wup_similarity(play)

0.125
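
As a reminder, the Wu-Palmer measure is usually defined from depths in the hypernym hierarchy, where lcs is the least common subsumer (the deepest ancestor shared by the two synsets). How NLTK counts depths is an implementation detail not covered here.

$$
\mathrm{wup}(s_1, s_2) = \frac{2 \cdot \mathrm{depth}\big(\mathrm{lcs}(s_1, s_2)\big)}{\mathrm{depth}(s_1) + \mathrm{depth}(s_2)}
$$

A score close to 1 means the two synsets share a deep common ancestor. This is consistent with the values above: the first senses of "dog" and "cat" score much higher than those of "dog" and "play".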

### WordNet - Synonyms 

In [61]:
for ss in wordnet.synsets('small'):
    for name in ss.lemma_names():
        print(name)

small
small
small
little
minor
modest
small
small-scale
pocket-size
pocket-sized
little
small
small
humble
low
lowly
modest
small
little
minuscule
small
little
small
small
modest
small
belittled
diminished
small
small


## Part 2 - Exercises

- Write a function get_sent_stems(str) that takes a sentence as its parameter and returns the list of stems of the words in that sentence, without punctuation and without stopwords.

- Test this function on a sentence of your choice, then convert the returned list of stems into a single string (str). Print this string.

- Given the following string: "This is a text written. It uses U.S. english to illustrate sentence tokenization and so on.", write the code that prints the following information for each word of the string:
    Token --- Stem --- Lemma --- POS Universal --- POS Penn Treebank


In [2]:
def get_sent_stems(sentence):
    
    # Tokenization, without punctuation
    tokenizer = RegexpTokenizer(r'\w+')
    tokens = tokenizer.tokenize(sentence)
    
    # Remove stopwords (the NLTK list is lowercase, so compare in lowercase)
    clean_tokens = []
    sw = set(stopwords.words('english'))
    for token in tokens:
        if token.lower() not in sw:
            clean_tokens.append(token)
    
    # Stemming
    stems = []
    porter = PorterStemmer()
    for token in clean_tokens:
        stems.append(porter.stem(token))
    
    return stems

In [3]:
sentence = "Pythoners are very intelligent, and work very pythonly and now they are pythoning their way to success."
stems = get_sent_stems(sentence)
print(stems)

['python', 'intellig', 'work', 'pythonli', 'python', 'way', 'success']


In [4]:
mystr = " ".join(stems)
print(mystr)

python intellig work pythonli python way success


In [6]:
sentence = "This is a text written. It uses U.S. english to illustrate sentence tokenization and so on." 

In [17]:
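tokens = word_tokenize(sentence)

# Create the stemmer and the lemmatizer once, outside the loop
porter = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for token in tokens:
    stem = porter.stem(token)
    lemma = lemmatizer.lemmatize(token)
    
    # Note: each token is tagged in isolation here, so the tagger sees no context;
    # calling pos_tag(tokens) once on the whole list is generally more accurate.
    postag = pos_tag([token])
    postag_uni = pos_tag([token], tagset='universal')
    
    print(f"{token:20} {stem:20} {lemma:20} {postag_uni[0][1]:20} {postag[0][1]}")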

This                 thi                  This                 DET                  DT
is                   is                   is                   VERB                 VBZ
a                    a                    a                    DET                  DT
text                 text                 text                 NOUN                 NN
written              written              written              VERB                 VBN
.                    .                    .                    .                    .
It                   it                   It                   PRON                 PRP
uses                 use                  us                   NOUN                 NNS
U.S.                 u.s.                 U.S.                 NOUN                 NNP
english              english              english              ADJ                  JJ
to                   to                   to                   PRT                  TO
illustrate           illustr           

## Part 3 - Soundex Algorithm

In [29]:
groups = ['aehiouwy', #0
    'bfpv', #1
    'cgjkqsxz', #2
    'dt', #3
    'l', #4
    'mn', #5
    'r'] #6

In [22]:
numerics = {'a': '0',  'c': '2', 'b': '1', 'e': '0', 'd': '3', 'g': '2', 'f': '1', 'i': '0', 'h': '0', 'k': '2', 
            'j': '2', 'm': '5', 'l': '4', 'o': '0', 'n': '5', 'q': '2', 'p': '1', 's': '2', 'r': '6', 'u': '0', 
            't': '3', 'w': '0', 'v': '1', 'y': '0', 'x': '2', 'z': '2'}

In [43]:
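def soundex(word):
    """
    Simplified Soundex, adapted from the public-domain Soundex module by
    Gregory Jorgensen (2000-12-24), described there as conforming to
    Knuth's algorithm.
    Expects a non-empty word made of unaccented letters a-z only.
    Simplifications: the code of the first letter is not compared with the
    letter that follows it, and 'h' and 'w' are treated like vowels.
    """
    
    sndx = ''
    
    # Keep the first letter as is and encode the rest of the word
    firstchar = word[0].upper()
    word = word[1:]

    for car in word.lower():
        digit = numerics[car]
       
        # Collapse runs of adjacent letters that share the same code
        if not sndx or digit != sndx[-1]:
            sndx += digit

    # Drop the vowel-like letters (code 0)
    sndx = sndx.replace('0','')
    
    sndx = firstchar + sndx

    # Pad with zeros and keep exactly four characters
    sndx =  sndx + '0000'
    
    return sndx[:4]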

In [45]:
words =[ "physique", "physik", "phosphore", "fosfor", "hello", "robert", "rupert"]

for word in words :
    code = soundex(word)
    print(word, code)

physique P220
physik P220
phosphore P216
fosfor F216
hello H400
robert R163
rupert R163
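
The `soundex` function above treats 'h' and 'w' like vowels and never compares the first letter's code with the letter that follows it. The sketch below is one possible variant that applies those two extra rules, as they are commonly stated for American Soundex. The names `CODES` and `soundex_strict` are assumptions introduced for illustration, and skipping non-ASCII characters is a simplifying choice rather than part of the algorithm.

```python
# Digit for each coded consonant; vowels, 'y', 'h' and 'w' have no code
CODES = {}
for digit, letters in enumerate(['bfpv', 'cgjkqsxz', 'dt', 'l', 'mn', 'r'], start=1):
    for letter in letters:
        CODES[letter] = str(digit)


def soundex_strict(word):
    """Four-character Soundex code; returns '' if the word has no ASCII letter."""
    letters = [c for c in word.lower() if c.isascii() and c.isalpha()]
    if not letters:
        return ''
    first = letters[0]
    digits = []
    prev = CODES.get(first, '')      # the first letter takes part in duplicate removal
    for c in letters[1:]:
        if c in 'hw':
            continue                 # 'h' and 'w' do not separate two equal codes
        code = CODES.get(c, '')      # '' for vowels and 'y'
        if code and code != prev:
            digits.append(code)
        prev = code                  # a vowel resets prev, so equal codes around it are both kept
    return (first.upper() + ''.join(digits) + '000')[:4]


for w in ["physique", "robert", "rupert", "Ashcraft", "Pfister"]:
    print(w, soundex_strict(w))
```

The two functions agree on the seven words listed above. They differ on names such as "Ashcraft" (A261 here, A226 with `soundex`) and "Pfister" (P236 here, P123 with `soundex`).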
