**School year:** 2024 - 2025

**School:** YNOV

**Program:** Data scientist

**Level:** M1

**Module:** NLP

**Course progression:** TD 9 - The LDA model - Solutions

**Instructor:** Nicolas Miotto

# Latent Dirichlet Allocation (LDA)

🎯 The goal of this challenge is to find topics within a corpus of emails using the **LDA** algorithm (unsupervised learning in NLP).

✉️ Here is a collection of more than 1,000 ***unlabeled emails***. Let's try to ***extract topics*** from them!

In [1]:
import pandas as pd

url = 'https://raw.githubusercontent.com/LucaSainteCroix/teaching-resources/main/exercises-data/lda_data'

data = pd.read_csv(url, sep=",", header=None)
data.columns = ['text']
data.head()

|   | text |
|---|------|
| 0 | From: gld@cunixb.cc.columbia.edu (Gary L Dare)... |
| 1 | From: atterlep@vela.acs.oakland.edu (Cardinal ... |
| 2 | From: miner@kuhub.cc.ukans.edu\nSubject: Re: A... |
| 3 | From: atterlep@vela.acs.oakland.edu (Cardinal ... |
| 4 | From: vzhivov@superior.carleton.ca (Vladimir Z... |


In [2]:
data.shape

(1199, 1)

## (1) Preprocessing

❓ **Question (Cleaning)** ❓ You are used to this by now... Clean it up! Store the cleaned text in a new column "clean_text" of the DataFrame.

In [24]:
from nltk.corpus import stopwords, words
import re
import string
from nltk.stem.wordnet import WordNetLemmatizer
from nltk import word_tokenize
import nltk
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('words')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.


True

In [27]:
dictionary = set(words.words())
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def clean(text):
    # Remove email addresses
    text = re.sub(r'\b[\w.-]+@[\w.-]+\.[A-Za-z]{2,7}\b', '', text)
    # Replace every punctuation character with a space
    for punctuation in string.punctuation:
        text = text.replace(punctuation, ' ')
    text = re.sub(r'\b\w\b', '', text)  # Remove one-character words
    lowercased = text.lower()
    tokenized = word_tokenize(lowercased)
    words_only = [word for word in tokenized if word.isalpha()]
    # Keep only tokens found in the NLTK English word list
    good_words = [word for word in words_only if word in dictionary]
    without_stopwords = [word for word in good_words if word not in stop_words]
    lemmatized = [lemmatizer.lemmatize(word) for word in without_stopwords]
    cleaned = ' '.join(lemmatized)
    return cleaned
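
To see what the first steps of `clean` do before tokenization, here is a minimal sketch that isolates the email removal, punctuation replacement, and one-character-word removal. The helper name `strip_noise` and the sample email are hypothetical, and the NLTK steps (tokenization, dictionary filter, stop words, lemmatization) are deliberately left out so that the snippet runs without any download.

```python
import re
import string

# Same email pattern as in the cleaning function
EMAIL_PATTERN = re.compile(r'\b[\w.-]+@[\w.-]+\.[A-Za-z]{2,7}\b')

def strip_noise(text):
    """Apply only the regex and punctuation steps of the cleaning function."""
    text = EMAIL_PATTERN.sub('', text)
    for punctuation in string.punctuation:
        text = text.replace(punctuation, ' ')
    text = re.sub(r'\b\w\b', '', text)  # drop one-character words
    # Lowercase and collapse repeated whitespace
    return ' '.join(text.lower().split())

# Hypothetical sample, shaped like the headers in the corpus
sample = "From: jane.doe@example.org (Jane Q Doe)\nSubject: Re: A question!"
print(strip_noise(sample))  # from jane doe subject re question
```

Header words such as "from", "subject" and "re" survive these steps, which is why some of them still appear in the cleaned text.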

In [28]:
data['clean_text'] = data.text.apply(clean)
data.head()

|   | text | clean_text |
|---|------|------------|
| 0 | From: gld@cunixb.cc.columbia.edu (Gary L Dare)... | dare subject summary show prior hosting postin... |
| 1 | From: atterlep@vela.acs.oakland.edu (Cardinal ... | cardinal subject arrogance organization nation... |
| 2 | From: miner@kuhub.cc.ukans.edu\nSubject: Re: A... | subject ancient organization university academ... |
| 3 | From: atterlep@vela.acs.oakland.edu (Cardinal ... | cardinal subject hell organization national as... |
| 4 | From: vzhivov@superior.carleton.ca (Vladimir Z... | subject truly brutal loss organization univers... |


## (2) The Latent Dirichlet Allocation model

❓ **Question (Training)** ❓ Train an LDA model to extract potential topics.

In [29]:
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

In [32]:
vectorizer = CountVectorizer()
data_vectorized = vectorizer.fit_transform(data['clean_text'])
print(len(vectorizer.get_feature_names_out()))

8633


In [31]:
lda_model = LatentDirichletAllocation(n_components=2)
lda_vectors = lda_model.fit_transform(data_vectorized)
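
The model above is fitted without a `random_state`, so the topics may differ from one run to the next. The sketch below, on a hypothetical toy corpus, suggests that fixing the seed makes the fit repeatable, and shows the shapes of the two matrices LDA produces. The helper `fit_toy_lda` and the seed value are assumptions made for the illustration.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, already cleaned
toy_docs = [
    "hockey team game season",
    "team game player score",
    "god church truth believe",
    "church people believe god",
]
toy_counts = CountVectorizer().fit_transform(toy_docs)

def fit_toy_lda(seed):
    """Fit a 2-topic LDA with a fixed seed; return the model and the document-topic matrix."""
    model = LatentDirichletAllocation(n_components=2, random_state=seed)
    doc_topic = model.fit_transform(toy_counts)
    return model, doc_topic

model_a, doc_topic_a = fit_toy_lda(seed=0)
model_b, doc_topic_b = fit_toy_lda(seed=0)

print(doc_topic_a.shape)          # one row per document, one column per topic
print(model_a.components_.shape)  # one row per topic, one column per vocabulary word
print(doc_topic_a.sum(axis=1))    # each row is a topic mixture that sums to 1
print(np.allclose(doc_topic_a, doc_topic_b))  # same seed, same result
```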

## (3) Visualize the potential topics

🎁 We have written a function that prints the words associated with the potential topics.

In [33]:
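def print_topics(model, vectorizer, n_top_words=10):
    # Look up the vocabulary once instead of once per printed word
    feature_names = vectorizer.get_feature_names_out()
    for idx, topic in enumerate(model.components_):
        print("Topic %d:" % (idx))
        # Indices of the n_top_words largest weights, in descending order
        top_indices = topic.argsort()[:-n_top_words - 1:-1]
        print([(feature_names[i], topic[i]) for i in top_indices])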
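
The weights that `print_topics` reads from `components_` are unnormalized; scikit-learn describes them as pseudo-counts that become a distribution over words once each row is divided by its sum. The sketch below illustrates that normalization on a hypothetical 2-topic, 3-word matrix; `top_words` and the toy values are invented for the example and are not part of the notebook.

```python
import numpy as np

def top_words(components, feature_names, n_top_words=2):
    """Return, for each topic, its n_top_words most probable (word, probability) pairs."""
    # Normalize each row of pseudo-counts so that it sums to 1
    probabilities = components / components.sum(axis=1, keepdims=True)
    result = []
    for topic in probabilities:
        top_indices = topic.argsort()[::-1][:n_top_words]
        result.append([(feature_names[i], float(topic[i])) for i in top_indices])
    return result

# Hypothetical pseudo-counts for 2 topics over a 3-word vocabulary
toy_components = np.array([[7.0, 2.0, 1.0],
                           [1.0, 3.0, 6.0]])
toy_vocabulary = ["game", "god", "church"]

toy_topics = top_words(toy_components, toy_vocabulary)
for idx, words_and_probs in enumerate(toy_topics):
    print("Topic %d:" % idx, words_and_probs)
```

Probabilities are easier to compare across topics than raw weights, since topics with more assigned words have larger pseudo-counts overall.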

❓ **Question** ❓ Print the topics extracted by your LDA.

In [34]:
print_topics(lda_model, vectorizer)

Topic 0:
[('subject', np.float64(904.4785623713537)), ('organization', np.float64(867.6523054183273)), ('would', np.float64(803.6444175965696)), ('team', np.float64(707.3510246772323)), ('one', np.float64(660.4742470345385)), ('hockey', np.float64(645.887727209439)), ('game', np.float64(609.4471107294997)), ('university', np.float64(583.6631873426428)), ('go', np.float64(542.1215385175577)), ('time', np.float64(494.5756622103872))]
Topic 1:
[('god', np.float64(1032.7719663146595)), ('one', np.float64(494.525752965431)), ('people', np.float64(470.7510884711307)), ('church', np.float64(430.2773410963121)), ('would', np.float64(405.3555824033983)), ('subject', np.float64(397.5214376286154)), ('believe', np.float64(337.19035828683656)), ('organization', np.float64(298.3476945816413)), ('think', np.float64(284.7403939293783)), ('truth', np.float64(282.78962399003854))]


## (4) Predict the document-topic mixture of a new text

❓ **Question (Prediction)** ❓

Now that your LDA model is fitted, you can use it to predict the topics of a new text.

1. Vectorize the example
2. Use the LDA on the vectorized example to predict the topics

In [35]:
example = ["My team performed poorly last season. Their best player was out injured and only played one game"]

In [36]:
example_vectorized = vectorizer.transform(example)

lda_vectors = lda_model.transform(example_vectorized)

print("topic 0 :", lda_vectors[0][0])
print("topic 1 :", lda_vectors[0][1])

topic 0 : 0.9466866219946585
topic 1 : 0.05331337800534153


🏁 Congratulations! You now know how to implement an LDA quickly.