# Lab: Text Processing

### Installation

This lab uses the Python library NLTK (Natural Language Toolkit), the reference Python library
for text processing. Its organization is more complex than that of most libraries, because it relies on many components and corpora that are loaded through a dedicated "downloader", as shown below:


Outside the notebook:

install the Python package nltk (with conda or pip, for example)

then run
* `import nltk`
* `nltk.download()`

which should open a window for interactively downloading corpora and language-processing modules. You will need to install further modules as the lab requires them.


### Tokenization

The goal here is to automatically split a text into sentences (and then a sentence into words). What difficulties does the function have to handle (the work has already been done for you!)? Think about punctuation and capital letters, and modify the text to confront the function with difficult cases.
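
To get a feel for these difficulties, here is a deliberately naive splitter. It is a hypothetical helper (`naive_sent_split`), not the algorithm NLTK uses: it cuts after every `.`, `!` or `?` that is followed by whitespace and an uppercase letter. On the made-up sample below it wrongly cuts after the abbreviation "Dr.", which is the kind of case a real sentence tokenizer has to get right.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' when followed by whitespace and an uppercase letter."""
    return re.split(r'(?<=[.!?])\s+(?=[A-Z])', text)

# Made-up sample containing abbreviations
sample = "Dr. Smith works in the U.S. on machine learning. He arrived at 5 p.m. yesterday."

for sentence in naive_sent_split(sample):
    print(repr(sentence))
```

The sample contains two sentences, but the naive rule returns three pieces. Abbreviations followed by a lowercase word ("U.S. on", "p.m. yesterday") happen to survive here only because of the uppercase test.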


In [10]:
from nltk import sent_tokenize, word_tokenize, pos_tag

# Here is a text that you are invited to modify
text = "Machine learning is the lovely science of getting computers to act without being explicitly programmed. In the past decade, machine learning has given us self-driving cars, practical speech recognition, effective web search, and a vastly improved understanding of the human genome. Machine learning is so pervasive today that you probably use it dozens of times a day without knowing it. Many researchers also think it is the best way to make progress towards human-level AI. In this class, you will learn about the most effective machine learning techniques, and gain practice implementing them and getting them to work for yourself. More importantly, you'll learn about not only the theoretical underpinnings of learning, but also gain the practical know-how needed to quickly and powerfully apply these techniques to new problems. Finally, you'll learn about some of Silicon Valley's best practices in innovation as it pertains to machine learning and AI."

sentences = sent_tokenize(text)
print(sentences)  # Display the split into sentences

premiere_phrase = sentences[0]
les_mots = word_tokenize(premiere_phrase)
print(les_mots)  # Display the split into words

['Machine learning is the lovely science of getting computers to act without being explicitly programmed.', 'In the past decade, machine learning has given us self-driving cars, practical speech recognition, effective web search, and a vastly improved understanding of the human genome.', 'Machine learning is so pervasive today that you probably use it dozens of times a day without knowing it.', 'Many researchers also think it is the best way to make progress towards human-level AI.', 'In this class, you will learn about the most effective machine learning techniques, and gain practice implementing them and getting them to work for yourself.', "More importantly, you'll learn about not only the theoretical underpinnings of learning, but also gain the practical know-how needed to quickly and powerfully apply these techniques to new problems.", "Finally, you'll learn about some of Silicon Valley's best practices in innovation as it pertains to machine learning and AI."]
['Machine', 'learni

### Part-of-speech tagging

We can also try to identify the grammatical category (part of speech) of each word:

In [4]:
pos_tag(les_mots)

[('Machine', 'NN'),
 ('learning', 'NN'),
 ('is', 'VBZ'),
 ('the', 'DT'),
 ('lovely', 'JJ'),
 ('science', 'NN'),
 ('of', 'IN'),
 ('getting', 'VBG'),
 ('computers', 'NNS'),
 ('to', 'TO'),
 ('act', 'VB'),
 ('without', 'IN'),
 ('being', 'VBG'),
 ('explicitly', 'RB'),
 ('programmed', 'VBN'),
 ('.', '.')]

where VBZ: verb, third-person singular present; VBP: verb, present tense (other persons); IN: preposition; NN: noun; NNP: proper noun; CC: coordinating conjunction; JJ: adjective; etc.
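
Once the words are tagged, filtering or counting by category takes only a few lines. The sketch below copies the tagged tokens from the `pos_tag` output above (so it runs without NLTK) and keeps the nouns. The `startswith('NN')` test is just one convenient way to group the noun tags.

```python
from collections import Counter

# Tagged tokens copied from the pos_tag output above
tagged = [('Machine', 'NN'), ('learning', 'NN'), ('is', 'VBZ'), ('the', 'DT'),
          ('lovely', 'JJ'), ('science', 'NN'), ('of', 'IN'), ('getting', 'VBG'),
          ('computers', 'NNS'), ('to', 'TO'), ('act', 'VB'), ('without', 'IN'),
          ('being', 'VBG'), ('explicitly', 'RB'), ('programmed', 'VBN'), ('.', '.')]

# Keep every word whose tag starts with 'NN' (NN, NNS, NNP, NNPS)
nouns = [word for word, tag in tagged if tag.startswith('NN')]
print(nouns)

# Count how often each tag occurs in the sentence
tag_counts = Counter(tag for _, tag in tagged)
print(tag_counts.most_common(3))
```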

### Collocations

These are expressions made up of several consecutive words ("machine learning", "pâté de campagne", "école d'ingénieurs", "abat-jour", "travaux pratiques") whose meaning is attached to the sequence as a whole rather than to each word taken separately.

* Propose a statistical principle for automatically identifying collocations in a corpus of texts.
* How important is this identification for translation, for example?
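
One possible principle, given here only as a sketch: compare how often two words actually appear side by side with how often they would if they were independent. Pointwise mutual information (PMI) does this with `log2(P(w1, w2) / (P(w1) * P(w2)))`. The function name `pmi_scores`, the `min_count` threshold and the toy corpus are all assumptions made for illustration. The NLTK example further down uses a different measure, the likelihood ratio.

```python
import math
from collections import Counter

def pmi_scores(words, min_count=2):
    """Score each bigram seen at least `min_count` times by pointwise mutual information."""
    unigrams = Counter(words)
    bigrams = Counter(zip(words, words[1:]))
    n_words = len(words)
    n_bigrams = len(words) - 1
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # rare bigrams give unreliable estimates
        p_pair = count / n_bigrams
        p_w1 = unigrams[w1] / n_words
        p_w2 = unigrams[w2] / n_words
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores

# Hypothetical toy corpus
words = ("machine learning is the study of the data . "
         "the machine learning class is the best of the term .").split()

scores = pmi_scores(words)
for bigram, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(bigram, round(score, 2))
```

In this toy corpus "machine learning" scores higher than "is the" or "of the": all three bigrams occur twice, but "the" is frequent on its own, so seeing it next to "is" is less surprising.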

Below is a small example on the script (text) of Monty Python and the Holy Grail.

In [11]:

import nltk
from nltk.collocations import *
from nltk.book import *
import re

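# Association measures available for scoring bigrams
bigram_measures = nltk.collocations.BigramAssocMeasures()

# Keep only the alphabetic pieces of each token of the Monty Python script (text6)
montypython_words = sum([re.findall(r'[A-Z]?[a-z]+', token) for token in text6.tokens], [])
finder = BigramCollocationFinder.from_words(montypython_words)
finder.apply_freq_filter(3)  # ignore bigrams seen fewer than 3 times
finder.nbest(bigram_measures.likelihood_ratio, 10)  # 10 best bigrams (not displayed: only the last expression is shown)
finder.score_ngrams(bigram_measures.likelihood_ratio)  # all bigrams with their scores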


*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


[(('Hello', 'Hello'), 267.112668959853),
 (('don', 't'), 261.26601625820626),
 (('clop', 'clop'), 260.48367234352037),
 (('mumble', 'mumble'), 210.87280109162216),
 (('Holy', 'Grail'), 203.6128877611493),
 (('squeak', 'squeak'), 200.11837146003575),
 (('saw', 'saw'), 199.43405994849405),
 (('ha', 'ha'), 195.5837918341728),
 (('Burn', 'her'), 189.491732138607),
 (('Sir', 'Robin'), 187.39540407668534),
 (('Ni', 'Ni'), 181.96030649440533),
 (('Run', 'away'), 176.98935205620185),
 (('witch', 'witch'), 172.35868495940355),
 (('Iesu', 'domine'), 157.49017281313883),
 (('Pie', 'Iesu'), 157.49017281313883),
 (('King', 'Arthur'), 151.01995532192427),
 (('going', 'to'), 145.03673312788362),
 (('Come', 'on'), 142.06822969964873),
 (('her', 'Burn'), 138.81153073577218),
 (('clang', 'Bring'), 134.52227490026843),
 (('Bring', 'out'), 129.873654900665),
 (('Round', 'Table'), 129.56408933134304),
 (('away', 'Run'), 121.21891004917532),
 (('clap', 'clap'), 119.30449709216805),
 (('the', 'Holy'), 119.02

## Stemming 

The goal is to automatically extract the root (stem) of words.
The first algorithm used below (Porter's) is explained here:
https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html

What is the benefit of this operation when computing the frequencies of occurrence of terms in a text?
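
As a hint for this question, the sketch below counts word frequencies before and after stemming. `crude_stem` is a hypothetical toy function that strips a few English suffixes. It is far simpler than Porter's algorithm and only serves to show how inflected forms end up in the same count.

```python
from collections import Counter

def crude_stem(word):
    """Strip a few common English suffixes (a toy stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

tokens = "work works worked working worker".split()

raw_counts = Counter(tokens)
stem_counts = Counter(crude_stem(t) for t in tokens)
print(raw_counts)   # five distinct surface forms, each seen once
print(stem_counts)  # inflected forms merged under one stem
```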


In [13]:
from nltk.stem.porter import PorterStemmer
porter_stemmer = PorterStemmer()
print(porter_stemmer.stem("computer"))
print(porter_stemmer.stem("computation"))
print(porter_stemmer.stem("working"))
print(porter_stemmer.stem("worked"))
print(porter_stemmer.stem("coming"))

comput
comput
work
work
come


In [14]:
from nltk.stem import SnowballStemmer
snowball_stemmer = SnowballStemmer("french")
print(snowball_stemmer.stem("finir"))
print(snowball_stemmer.stem("finis"))
print(snowball_stemmer.stem("finissant"))
print(snowball_stemmer.stem("fini"))
print(snowball_stemmer.stem("infini"))

fin
fin
fin
fin
infin


### Lemmatization

This is a closely related goal, but a more careful one, because it relies on linguistic resources (here the WordNet database).

In [18]:
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()
wordnet_lemmatizer.lemmatize('was', pos='v')

'be'

## Stop word removal

List of words to remove

In [21]:
from nltk.corpus import stopwords
stop = set(stopwords.words('french'))
print(stop)

{'fut', 'aux', 'eut', 'serez', 'mes', 'ma', 'ayons', 'on', 'des', 'aura', 'ayants', 'une', 'c', 'en', 'te', 'es', 'ses', 'étées', 'vos', 'le', 'ton', 'nos', 'auraient', 'fusse', 'eût', 'toi', 'ce', 'sont', 'sa', 'seriez', 'fussions', 'tu', 'soient', 'seraient', 'sommes', 'eûtes', 'aies', 'ne', 'mais', 'ta', 'eusse', 'seront', 't', 'aie', 'du', 'leur', 'de', 'avaient', 'fusses', 'eussions', 'ayantes', 'suis', 'l', 'tes', 'eue', 'aviez', 'eus', 'avez', 'avec', 'auras', 'n', 'ayante', 'son', 'pour', 'j', 'd', 'la', 'ait', 'nous', 'qu', 'dans', 'fussent', 'sois', 'serait', 'à', 'eues', 'étant', 'vous', 'notre', 'soyez', 's', 'étants', 'fussiez', 'serai', 'fût', 'moi', 'auriez', 'je', 'fûtes', 'fus', 'aurai', 'étiez', 'même', 'm', 'êtes', 'et', 'sur', 'étée', 'eux', 'eurent', 'furent', 'étions', 'auront', 'étantes', 'il', 'étés', 'aurez', 'serions', 'est', 'ai', 'as', 'eûmes', 'un', 'me', 'été', 'aurais', 'eusses', 'ayant', 'serons', 'ou', 'se', 'y', 'était', 'eussent', 'ont', 'elle', 'qui'

Splitting the sentence and removing the stop words

In [22]:
sentence = "Voici une longue phrase qui contient plusieurs petits et longs mots"

mots_non_stop = [i for i in sentence.lower().split() if i not in stop]
print(mots_non_stop)

['voici', 'longue', 'phrase', 'contient', 'plusieurs', 'petits', 'longs', 'mots']
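
Note that `split()` only cuts on whitespace, so punctuation and elided forms ("l'", "qu'") stay attached to the neighbouring word and escape the filter. The sketch below tokenizes with a regular expression instead. To stay self-contained it uses a small hand-picked subset of the French stop-word list printed above, and the sample sentence is made up.

```python
import re

# Small subset of the French stop-word list printed above
stop_subset = {"l", "qu", "d", "une", "qui", "et", "est", "le", "de", "il"}

phrase = "L'exemple qu'il donne est simple, et l'idée est claire."

# \w+ keeps runs of letters/digits, so apostrophes and punctuation act as separators
tokens = re.findall(r"\w+", phrase.lower())
kept = [t for t in tokens if t not in stop_subset]
print(kept)
```

This explains why the stop-word list contains single letters such as 'l', 'd' or 'qu': they are what remains of elided words once the apostrophe is treated as a separator.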


### The WordNet resource

In [23]:
from nltk.corpus import wordnet as wn 
wn.synsets('computer')

[Synset('computer.n.01'), Synset('calculator.n.01')]

In [24]:
wn.synset('computer.n.01').definition()

'a machine for performing calculations automatically'

In [29]:
wn.synset('dog.n.01').hypernyms()

[Synset('canine.n.02'), Synset('domestic_animal.n.01')]

In [30]:
wn.synset('computer.n.01').hyponyms()

[Synset('analog_computer.n.01'),
 Synset('digital_computer.n.01'),
 Synset('home_computer.n.01'),
 Synset('node.n.08'),
 Synset('number_cruncher.n.02'),
 Synset('pari-mutuel_machine.n.01'),
 Synset('predictor.n.03'),
 Synset('server.n.03'),
 Synset('turing_machine.n.01'),
 Synset('web_site.n.01')]

In [31]:
wn.synset('nice.a.01').lemmas()[0].antonyms()

[Lemma('nasty.a.01.nasty')]

In [32]:
wn.synsets('dark')
wn.synset('beautiful.a.01').lemmas()[0].antonyms()

[Lemma('ugly.a.01.ugly')]

In [33]:
dog = wn.synset('dog.n.01')
cat = wn.synset('cat.n.01')
dog.path_similarity(cat)

0.2
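
The value 0.2 can be read as `1 / (1 + d)`, where `d` is the length of the shortest path between the two synsets in the hypernym hierarchy (here `d = 4`). The sketch below reproduces this computation on a tiny hand-written taxonomy. The graph is a made-up stand-in that only loosely mirrors WordNet, and the helper names are hypothetical.

```python
from collections import deque

# Toy hypernym links (child -> parents); a hand-made stand-in for WordNet
hypernyms = {
    "dog": ["canine"],
    "canine": ["carnivore"],
    "cat": ["feline"],
    "feline": ["carnivore"],
    "carnivore": ["animal"],
    "animal": [],
}

def shortest_path_length(a, b):
    """Breadth-first search over the hypernym links, followed in both directions."""
    neighbours = {node: set(parents) for node, parents in hypernyms.items()}
    for node, parents in hypernyms.items():
        for parent in parents:
            neighbours[parent].add(node)
    queue, seen = deque([(a, 0)]), {a}
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nxt in neighbours[node] - seen:
            seen.add(nxt)
            queue.append((nxt, dist + 1))
    return None  # no path between the two nodes

def path_similarity(a, b):
    """1 / (1 + shortest path length), or None when the nodes are not connected."""
    d = shortest_path_length(a, b)
    return None if d is None else 1 / (1 + d)

print(path_similarity("dog", "cat"))
```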

#### WordNet for sentiment analysis

In [34]:
from nltk.corpus import sentiwordnet as swn
# helps determine whether a text speaks positively or negatively
feeling = swn.senti_synset('failure.n.03')
print(feeling)
feeling = swn.senti_synset('wonderful.a.01')
print(feeling)
feeling = swn.senti_synset('hate.v.01')
print(feeling)
all_senti = swn.all_senti_synsets()  # renamed to avoid shadowing the built-in all()
print(all_senti)

<failure.n.03: PosScore=0.125 NegScore=0.375>
<fantastic.s.02: PosScore=0.75 NegScore=0.0>
<hate.v.01: PosScore=0.0 NegScore=0.75>
<generator object SentiWordNetCorpusReader.all_senti_synsets at 0x7f075145e9e8>
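
One simple way to turn such scores into a judgement about a whole text is to add up (positive minus negative) over its words. The sketch below does so with a toy word-level lexicon whose values are copied from the output above. Mapping each word to a single sense is a strong simplifying assumption (SentiWordNet scores synsets, not words), and `polarity` is a hypothetical helper.

```python
# Toy word-level lexicon: (positive score, negative score), values copied from the output above.
# Assumption: one sense per word, which real SentiWordNet does not provide.
lexicon = {
    "failure": (0.125, 0.375),
    "wonderful": (0.75, 0.0),
    "hate": (0.0, 0.75),
}

def polarity(text):
    """Sum of (positive - negative) scores over the words found in the lexicon."""
    total = 0.0
    for word in text.lower().split():
        pos, neg = lexicon.get(word, (0.0, 0.0))  # unknown words count as neutral
        total += pos - neg
    return total

print(polarity("a wonderful film"))     # positive total
print(polarity("i hate this failure"))  # negative total
```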


## Topic identification by matrix factorization

In [36]:
# adapted from:
# Author: Olivier Grisel <olivier.grisel@ensta.org>
#         Lars Buitinck
#         Chyi-Kwei Yau <chyikwei.yau@gmail.com>


from __future__ import print_function
from time import time

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF
from sklearn.datasets import fetch_20newsgroups



n_samples = 2000
n_features = 1000
n_topics = 10
n_top_words = 20


def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("Topic #%d:" % topic_idx)
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))

# Load the 20 newsgroups dataset and vectorize it. We use a few heuristics
# to filter out useless terms early on: the posts are stripped of headers,
# footers and quoted replies, and common English words, words occurring in
# only one document or in at least 95% of the documents are removed.


dataset = fetch_20newsgroups(shuffle=True, random_state=1,
                             remove=('headers', 'footers', 'quotes'))



In [37]:

data_samples = dataset.data[:n_samples]

print(data_samples[:3])

["Well i'm not sure about the story nad it did seem biased. What\nI disagree with is your statement that the U.S. Media is out to\nruin Israels reputation. That is rediculous. The U.S. media is\nthe most pro-israeli media in the world. Having lived in Europe\nI realize that incidences such as the one described in the\nletter have occured. The U.S. media as a whole seem to try to\nignore them. The U.S. is subsidizing Israels existance and the\nEuropeans are not (at least not to the same degree). So I think\nthat might be a reason they report more clearly on the\natrocities.\n\tWhat is a shame is that in Austria, daily reports of\nthe inhuman acts commited by Israeli soldiers and the blessing\nreceived from the Government makes some of the Holocaust guilt\ngo away. After all, look how the Jews are treating other races\nwhen they got power. It is unfortunate.\n", "\n\n\n\n\n\n\nYeah, do you expect people to read the FAQ, etc. and actually accept hard\natheism?  No, you need a little leap 

### Building a tf-idf matrix

In [38]:
# Use tf-idf features for NMF.
print("Extracting tf-idf features for NMF...")
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2,
                                   max_features=n_features,
                                   stop_words='english')

tfidf = tfidf_vectorizer.fit_transform(data_samples)
# tfidf is a sparse matrix

print(tfidf)

Extracting tf-idf features for NMF...
  (0, 708)	0.12621877625178227
  (0, 410)	0.11650651629173196
  (0, 493)	0.1631127602376565
  (0, 548)	0.11873384536901997
  (0, 130)	0.13595955391213657
  (0, 567)	0.13595955391213657
  (0, 412)	0.12831668397369733
  (0, 750)	0.15376128408643466
  (0, 841)	0.18564440175793037
  (0, 206)	0.15810189392327795
  (0, 764)	0.1640284908630232
  (0, 748)	0.13595955391213657
  (0, 904)	0.08983671288492111
  (0, 923)	0.11966934266418663
  (0, 527)	0.1690393571774018
  (0, 432)	0.13369075280946802
  (0, 988)	0.12740095334833063
  (0, 488)	0.3750048191807266
  (0, 717)	0.17767638066823058
  (0, 587)	0.6454209423982519
  (0, 862)	0.1551447391479567
  (0, 286)	0.11115911128919416
  (0, 867)	0.15810189392327795
  (0, 881)	0.11227372176926384
  (1, 381)	0.20157910011124136
  :	:
  (1998, 504)	0.04875543232365812
  (1998, 991)	0.053978162418983656
  (1998, 566)	0.03637572081429063
  (1998, 611)	0.05504978412016225
  (1998, 171)	0.047384737904817335
  (1998, 414)	0
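
Each printed entry `(i, j)  value` is the weight of term `j` in document `i`. To make the computation concrete, here is a pure-Python sketch on three made-up documents. It assumes the weighting that `TfidfVectorizer` applies with its default settings as far as I know them: a smoothed idf `ln((1 + n) / (1 + df)) + 1`, then L2 normalisation of each row. It leaves out the stop-word removal and the `max_df` / `min_df` filtering used above.

```python
import math
from collections import Counter

# Made-up documents, already lowercased
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: number of documents containing each term
df = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf_row(tokens):
    """tf-idf weights of one document, L2-normalised."""
    tf = Counter(tokens)
    weights = {t: tf[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t in tf}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

row = tfidf_row(tokenized[0])
for term, weight in sorted(row.items(), key=lambda kv: -kv[1]):
    print(f"{term:4s} {weight:.3f}")
```

With equal counts inside a document, a term found in fewer documents ("cat") gets a larger weight than one found in more documents ("sat").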

### Factorizing the matrix
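
NMF approximates a nonnegative matrix V (documents × terms) by a product W·H, where W (documents × topics) and H (topics × terms) are also nonnegative. The rows of H play the role of topics. The sketch below is a minimal pure-Python version based on the classical multiplicative update rules, run on a tiny made-up matrix. scikit-learn's own solver and regularisation differ, and the helper names are hypothetical.

```python
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def frobenius_error(V, W, H):
    WH = matmul(W, H)
    return sum((v - wh) ** 2 for rv, rwh in zip(V, WH) for v, wh in zip(rv, rwh)) ** 0.5

def nmf(V, n_topics, n_iter=200, eps=1e-9, seed=0):
    """Factorise V (docs x terms) into W (docs x topics) and H (topics x terms)."""
    rng = random.Random(seed)
    W = [[rng.random() for _ in range(n_topics)] for _ in V]
    H = [[rng.random() for _ in V[0]] for _ in range(n_topics)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), element-wise
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[h * n / (d + eps) for h, n, d in zip(rh, rn, rd)]
             for rh, rn, rd in zip(H, num, den)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[w * n / (d + eps) for w, n, d in zip(rw, rn, rd)]
             for rw, rn, rd in zip(W, num, den)]
    return W, H

# Toy document-term matrix: two documents about cars, two about religion
terms = ["car", "engine", "god", "faith"]
V = [[2, 1, 0, 0],
     [1, 2, 0, 0],
     [0, 0, 2, 1],
     [0, 0, 1, 2]]

W, H = nmf(V, n_topics=2)
for k, topic in enumerate(H):
    top = sorted(range(len(terms)), key=lambda j: -topic[j])[:2]
    print("Topic #%d:" % k, " ".join(terms[j] for j in top))
print("reconstruction error:", round(frobenius_error(V, W, H), 3))
```

Because the updates only multiply by nonnegative ratios, W and H stay nonnegative throughout, which is what makes the rows of H readable as lists of weighted words.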

In [39]:


# Fit the NMF model
print("Fitting the NMF model with tf-idf features, "
      "n_samples=%d and n_features=%d..."
      % (n_samples, n_features))

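# Note: recent scikit-learn releases replace `alpha` with `alpha_W` / `alpha_H`
nmf = NMF(n_components=n_topics, random_state=1,
          alpha=.1, l1_ratio=.5).fit(tfidf)


print("\nTopics in NMF model:")
# Note: recent scikit-learn releases rename this method get_feature_names_out()
tfidf_feature_names = tfidf_vectorizer.get_feature_names()
print_top_words(nmf, tfidf_feature_names, n_top_words)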



Fitting the NMF model with tf-idf features, n_samples=2000 and n_features=1000...





Topics in NMF model:
Topic #0:
just people don think like know time good make way really say right ve want did ll new use years
Topic #1:
windows use dos using window program os drivers application help software pc running ms screen files version card code work
Topic #2:
god jesus bible faith christian christ christians does heaven sin believe lord life church mary atheism belief human love religion
Topic #3:
thanks know does mail advance hi info interested email anybody looking card help like appreciated information send list video need
Topic #4:
car cars tires miles 00 new engine insurance price condition oil power speed good 000 brake year models used bought
Topic #5:
edu soon com send university internet mit ftp mail cc pub article information hope program mac email home contact blood
Topic #6:
file problem files format win sound ftp pub read save site help image available create copy running memory self version
Topic #7:
game team games year win play season players nhl runs goal 