**School year:** 2024 - 2025

**School:** YNOV

**Track:** Data scientist

**Level:** M1

**Module:** NLP

**Course progression:** TD 9 - The LDA model

**Instructor:** Nicolas Miotto

# Latent Dirichlet Allocation (LDA)

🎯 The goal of this challenge is to find topics within a corpus of emails using the **LDA** algorithm (unsupervised learning in NLP).

✉️ Here is a collection of more than 1,000 ***unlabeled emails***. Let's try to ***extract topics*** from them!
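
For reference, the generative story that LDA assumes can be written as follows. The notation is the standard one ($K$ topics, $D$ documents, Dirichlet priors $\alpha$ and $\eta$) and is an addition here; these symbols are not defined elsewhere in this notebook.

```latex
\begin{aligned}
\beta_k &\sim \mathrm{Dirichlet}(\eta), & k &= 1,\dots,K \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha), & d &= 1,\dots,D \\
z_{d,n} \mid \theta_d &\sim \mathrm{Categorical}(\theta_d) \\
w_{d,n} \mid z_{d,n}, \beta &\sim \mathrm{Categorical}(\beta_{z_{d,n}})
\end{aligned}
```

Here $\theta_d$ is the topic mixture of document $d$, $\beta_k$ is the word distribution of topic $k$, $z_{d,n}$ is the topic assigned to the $n$-th word of document $d$, and $w_{d,n}$ is that word. Fitting the model amounts to estimating $\theta$ and $\beta$ from the observed words alone.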

In [1]:
import pandas as pd

url = 'https://raw.githubusercontent.com/LucaSainteCroix/teaching-resources/main/exercises-data/lda_data'

data = pd.read_csv(url, sep=",", header=None)
data.columns = ['text']
data.head()

| | text |
|---|---|
| 0 | From: gld@cunixb.cc.columbia.edu (Gary L Dare)... |
| 1 | From: atterlep@vela.acs.oakland.edu (Cardinal ... |
| 2 | From: miner@kuhub.cc.ukans.edu\nSubject: Re: A... |
| 3 | From: atterlep@vela.acs.oakland.edu (Cardinal ... |
| 4 | From: vzhivov@superior.carleton.ca (Vladimir Z... |


In [2]:
data.shape

(1199, 1)

In [7]:
data['text'][0]

'From: gld@cunixb.cc.columbia.edu (Gary L Dare)\nSubject: Stan Fischler, 4/4\nSummary: From the Devils pregame show, prior to hosting the Penguins\nNntp-Posting-Host: cunixb.cc.columbia.edu\nReply-To: gld@cunixb.cc.columbia.edu (Gary L Dare)\nOrganization: PhDs In The Hall\nLines: 32\n\n\nAt the Lester Patrick Awards lunch, Bill Torrey mentioned that one of his\noptions next season is to be president of the Miami team, with Bob Clarke\nworking for him.  At the same dinner, Clarke said that his worst mistake\nin Philadelphia was letting Mike Keenan go -- in retrospect, almost all\nplayers came realize that Keenan knew what it took to win.  Rumours are\nnow circulating that Keenan will be back with the Flyers.\n\nNick Polano is sick of being a scapegoat for the schedule made for the\nRed Wings; After all, Bryan Murray approved it.\n\nGerry Meehan and John Muckler are worried over the Sabres\' prospects;\nAssistant Don Lever says that the Sabres have to get their share now,\nbecause a Que

## (1) Preprocessing

❓ **Question (Cleaning)** ❓ You're used to it by now... Clean it up! Store the cleaned text in a new "clean_text" column of the DataFrame.

In [46]:
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import nltk
import string
import re
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('punkt_tab')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [76]:
def cleaning(sentence):

    # Basic cleaning: trim, lowercase, drop digits
    sentence = sentence.strip()
    sentence = sentence.lower()
    sentence = ''.join(char for char in sentence if not char.isdigit())

    # Remove the first email address (From: ... Subject:), then all the others.
    # NOTE: the text is already lowercase at this point, so the case-sensitive
    # pattern 'From:.*?Subject:' never matches; this is why 'subject' survives
    # in the output below. flags=re.DOTALL | re.IGNORECASE would make it match.
    sentence = re.sub(r'From:.*?Subject:', '', sentence, flags=re.DOTALL)
    sentence = re.sub(r'\S+@\S+', '', sentence)

    # Remove words with 3+ consecutive repeating letters
    sentence = re.sub(r'\b\w*(\w)\1{2,}\w*\b', '', sentence)

    # Remove URLs
    sentence = re.sub(r'http\S+|www\S+|https\S+', '', sentence, flags=re.MULTILINE)

    # Advanced cleaning: strip punctuation
    for punctuation in string.punctuation:
        sentence = sentence.replace(punctuation, '')

    # Tokenize and drop English stopwords (build the set once, not per token)
    stop_words = set(stopwords.words('english'))
    tokenized_sentence = word_tokenize(sentence)
    tokenized_sentence_cleaned = [
        w for w in tokenized_sentence if w not in stop_words
    ]

    # Lemmatize every token as a verb, reusing a single lemmatizer
    lemmatizer = WordNetLemmatizer()
    lemmatized = [
        lemmatizer.lemmatize(word, pos="v")
        for word in tokenized_sentence_cleaned
    ]

    cleaned_sentence = ' '.join(lemmatized)

    return cleaned_sentence

In [77]:
data["clean_text"] = data["text"].apply(cleaning)
data["clean_text"][0]

'gary l dare subject stan fischler summary devil pregame show prior host penguins nntppostinghost cunixbcccolumbiaedu replyto gary l dare organization phds hall line lester patrick award lunch bill torrey mention one options next season president miami team bob clarke work dinner clarke say worst mistake philadelphia let mike keenan go retrospect almost players come realize keenan know take win rumour circulate keenan back flyers nick polano sick scapegoat schedule make red wing bryan murray approve gerry meehan john muckler worry sabre prospect assistant lever say sabre get share quebec dynasty emerge mighty duck declare throw money around loosely buy team oilers coach ted green remark guy around fill tie domis skate none fill helmet senators andrew mcbain tell security guard chicago stadium warn stairs lead locker room mcbain mouth season professional tumble entire steep flight gld je souviens gary l dare go winnipeg jet go selanne domi stanley'
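
The cleaned output above still contains header tokens such as `subject`, `organization`, and `line`, because the header block of each message is kept. Below is a minimal sketch of one possible pre-step, assuming each message separates its headers from its body with the first blank line (as the sample message shown earlier does). `strip_headers` and the sample message are hypothetical, not part of the original notebook.

```python
def strip_headers(raw_message: str) -> str:
    """Return the body of a message, dropping the header block.

    Assumption: the headers end at the first blank line. If no blank
    line is found, the message is returned unchanged.
    """
    head, separator, body = raw_message.partition("\n\n")
    return body.lstrip("\n") if separator else raw_message


# Made-up message that mimics the layout of the emails in the corpus
sample = (
    "From: someone@example.com (Some One)\n"
    "Subject: Example\n"
    "Lines: 2\n"
    "\n\n"
    "At the awards lunch, the coach spoke.\n"
)
body = strip_headers(sample)
```

It could be applied before `cleaning`, for instance `data["text"].apply(strip_headers).apply(cleaning)`. The topics obtained afterwards would likely differ from the ones printed in this notebook, since header words would no longer be counted.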

## (2) The Latent Dirichlet Allocation model

❓ **Question (Training)** ❓ Train an LDA model to extract potential topics.

In [78]:
# Vectorize the cleaned text with raw term counts
# (LDA models word counts, so CountVectorizer is used rather than TF-IDF)

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()

vectorized_documents = vectorizer.fit_transform(data["clean_text"])
vectorized_documents = pd.DataFrame(
    vectorized_documents.toarray(),
    columns = vectorizer.get_feature_names_out())
vectorized_documents

Unnamed: 0,aa,aacc,aadams,aargh,aaron,aarons,aassists,aatchoo,ab,abandon,...,zoerasterism,zombo,zone,zoo,zoom,zorasterism,zubov,zupancic,zurich,zwart
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1194,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1195,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1196,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1197,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [79]:
from sklearn.decomposition import LatentDirichletAllocation

# Instantiate the LDA
n_components = 2  # number of topics
lda_model = LatentDirichletAllocation(
    n_components=n_components,
    max_iter=100)

# Fit the LDA on the vectorized documents
lda_model.fit(vectorized_documents)
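
The fit starts from a random initialization, so topic indices and weights can change from one run to the next unless `random_state` is set. Below is a small sketch on a made-up toy corpus; the corpus, the seed value, and the helper `fit_toy_lda` are illustrative assumptions, not part of the original notebook. It also passes the sparse count matrix directly, which `LatentDirichletAllocation` accepts.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Made-up corpus with two obvious themes
toy_corpus = [
    "team game hockey season win",
    "hockey players team play game",
    "god jesus church believe people",
    "church people believe god say",
]

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_corpus)  # sparse count matrix


def fit_toy_lda(seed):
    """Fit a 2-topic LDA with a fixed seed and return the document-topic mixture."""
    model = LatentDirichletAllocation(n_components=2, max_iter=20, random_state=seed)
    return model.fit_transform(toy_counts)


# Same seed, same data: the two mixtures should be identical
mixture_a = fit_toy_lda(seed=0)
mixture_b = fit_toy_lda(seed=0)
```

Setting a seed in the cell above would make the topic numbering stable across reruns; the outputs shown in this notebook were produced without one.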

In [80]:
document_topic_mixture = lda_model.transform(vectorized_documents)
document_topic_mixture

array([[0.99354023, 0.00645977],
       [0.00352637, 0.99647363],
       [0.02280498, 0.97719502],
       ...,
       [0.0042509 , 0.9957491 ],
       [0.98108715, 0.01891285],
       [0.01043628, 0.98956372]])
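
Each row of this array is a probability distribution over the topics for one document, so it sums to 1 and its largest entry gives the document's dominant topic. A short sketch using the first three rows displayed above (NumPy is assumed to be available, as it is wherever scikit-learn is installed):

```python
import numpy as np

# First three rows of the document-topic mixture shown above
mixture_rows = np.array([
    [0.99354023, 0.00645977],
    [0.00352637, 0.99647363],
    [0.02280498, 0.97719502],
])

row_sums = mixture_rows.sum(axis=1)            # each value is close to 1.0
dominant_topics = mixture_rows.argmax(axis=1)  # index of the largest share per row
```

On the full data, `document_topic_mixture.argmax(axis=1)` would give one dominant topic per email.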

## (3) Visualize the potential topics

In [81]:
topic_word_mixture = pd.DataFrame(
    lda_model.components_,
    columns = vectorizer.get_feature_names_out()
)

🎁 We have coded a function that prints the words associated with the potential topics.

In [63]:
def print_topics(lda_model, vectorizer, top_words):
    # 1. TOPIC MIXTURE OF WORDS FOR EACH TOPIC
    topic_mixture = pd.DataFrame(
        lda_model.components_,
        columns = vectorizer.get_feature_names_out()
    )

    # 2. FINDING THE TOP WORDS FOR EACH TOPIC
    ## Number of topics
    n_components = topic_mixture.shape[0]

    ## Top words for each topic
    for topic in range(n_components):
        print("-"*10)
        print(f"For topic {topic}, here are the top {top_words} words with weights:")

        topic_df = topic_mixture.iloc[topic].sort_values(ascending = False).head(top_words)

        print(round(topic_df,3))

❓ **Question** ❓ Print the topics extracted by your LDA.

In [82]:
print_topics(lda_model, vectorizer, 20)

----------
For topic 0, here are the top 20 words with weights:
team            954.492
game            929.860
play            754.907
line            701.573
go              642.272
subject         625.790
organization    604.028
hockey          593.492
get             534.075
write           431.077
university      412.328
nhl             400.497
would           385.717
season          383.468
win             381.278
think           356.633
one             328.611
players         327.494
article         300.271
year            297.503
Name: 0, dtype: float64
----------
For topic 1, here are the top 20 words with weights:
god             1223.476
say              859.284
would            819.283
one              782.389
people           708.453
subject          684.210
think            653.367
know             652.189
line             643.427
jesus            608.491
write            601.923
believe          570.590
organization     560.972
church           495.497
make      
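
The weights printed above are not probabilities: scikit-learn documents `components_` as pseudo-counts, which become a per-topic word distribution once each row is divided by its sum. A minimal sketch on a made-up 2x3 matrix (the numbers and words are placeholders, not values from the fitted model):

```python
import numpy as np

# Hypothetical pseudo-counts: 2 topics (rows) x 3 words (columns)
toy_words = ["game", "team", "god"]
toy_components = np.array([
    [8.0, 1.0, 1.0],
    [2.0, 2.0, 6.0],
])

# Divide each row by its sum so that every topic becomes a distribution over words
topic_word_probabilities = toy_components / toy_components.sum(axis=1, keepdims=True)
```

With the fitted model, the same normalization would read `lda_model.components_ / lda_model.components_.sum(axis=1, keepdims=True)`.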

## (4) Predict the document-topic mixture of a new text

❓ **Question (Prediction)** ❓

Now that your LDA model is fitted, you can use it to predict the topics of a new text.

1. Vectorize the example
2. Use the LDA on the vectorized example to predict the topics

In [83]:
example = ["My team performed poorly last season. Their best player was out injured and only played one game"]

In [84]:
cleaned_sentence = cleaning(example[0])
vectorized_sentence = vectorizer.transform([cleaned_sentence])
# Document-topic mixture of the new sentence
topic_distribution = lda_model.transform(vectorized_sentence)
topic_distribution



array([[0.95129327, 0.04870673]])

In [85]:
dominant_topic = topic_distribution.argmax(axis=1)[0]  # dominant topic of the single example
print(f"The predicted subject is: Topic {dominant_topic}")

The predicted subject is: Topic 0
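
Topic indices carry no meaning by themselves; a human has to name them from their top words. Below is a small sketch that attaches names to the prediction above. The labels "hockey" and "religion" are one possible reading of the top words printed earlier (team, game, hockey versus god, jesus, church), not something the model outputs, and `describe_prediction` is a hypothetical helper.

```python
# Hypothetical labels chosen by reading each topic's top words
topic_labels = {0: "hockey", 1: "religion"}


def describe_prediction(distribution, labels):
    """Return (label, share) pairs sorted from most to least probable."""
    ranked = sorted(enumerate(distribution), key=lambda pair: pair[1], reverse=True)
    return [(labels[index], round(share, 3)) for index, share in ranked]


# Topic mixture predicted for the example sentence above
example_distribution = [0.95129327, 0.04870673]
ranking = describe_prediction(example_distribution, topic_labels)
```

Because the model was fitted without a fixed seed, the index-to-label mapping only holds for this particular run and would need to be checked again after refitting.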


🏁 Congratulations! You now know how to set up an LDA quickly.