# Latent Dirichlet Allocation

In [6]:
import pandas as pd

data = pd.read_csv('data', sep=",", header=None)

data.columns = ['text']

data.head()

Unnamed: 0,text
0,From: gld@cunixb.cc.columbia.edu (Gary L Dare)...
1,From: atterlep@vela.acs.oakland.edu (Cardinal ...
2,From: miner@kuhub.cc.ukans.edu\nSubject: Re: A...
3,From: atterlep@vela.acs.oakland.edu (Cardinal ...
4,From: vzhivov@superior.carleton.ca (Vladimir Z...


The data is a collection of emails that are not labelled. Let's try extract topics from them!

## Preprocessing 

👇 You're used to it by now... Clean up! Store the cleaned text in a new dataframe column "clean_text".

In [44]:
#nettoyer aussi les nombres

import string

text = ''.join([i for i in text if not i.isdigit()])

def clean_up(text):

    for punctuation in string.punctuation:
        text = text.replace(punctuation, '')
        text = text.lower()
        return text

data['clean_text'] = data['text'].apply(clean_up)
data

Unnamed: 0,text,clean_text
0,from gldcunixbcccolumbiaedu gary l dare\nsubje...,from gldcunixbcccolumbiaedu gary l dare\nsubje...
1,from atterlepvelaacsoaklandedu cardinal ximene...,from atterlepvelaacsoaklandedu cardinal ximene...
2,from minerkuhubccukansedu\nsubject re ancient ...,from minerkuhubccukansedu\nsubject re ancient ...
3,from atterlepvelaacsoaklandedu cardinal ximene...,from atterlepvelaacsoaklandedu cardinal ximene...
4,from vzhivovsuperiorcarletonca vladimir zhivov...,from vzhivovsuperiorcarletonca vladimir zhivov...
...,...,...
1194,from jerrybeskimocom jerry kaufman\nsubject re...,from jerrybeskimocom jerry kaufman\nsubject re...
1195,from golchowyalchemychemutorontoca gerald olch...,from golchowyalchemychemutorontoca gerald olch...
1196,from jaynemmaltguildorg jayne kulikauskas\nsub...,from jaynemmaltguildorg jayne kulikauskas\nsub...
1197,from sclarkepasutorontoca susan clark\nsubject...,from sclarkepasutorontoca susan clark\nsubject...


## Latent Dirichlet Allocation model

👇 Train an LDA model to extract potential topics.

In [45]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer().fit(data['text'])

data_vectorized = vectorizer.transform(data['text'])

# Train the LDA model
ldamodel = LatentDirichletAllocation(n_components=2).fit(data_vectorized)

def print_topics(model, vectorizer):
    for idx, topic in enumerate(model.components_):
        print("Topic %d:" % (idx))
        print([(vectorizer.get_feature_names_out()[i], topic[i])
                  for i in topic.argsort()[:-10 -1:-1]])

print_topics(ldamodel, vectorizer)

Topic 0:
[('the', 110.19206651465497), ('of', 67.68737522345293), ('to', 66.69173133209398), ('that', 53.08754610138504), ('is', 51.9770069238538), ('and', 48.47899302371806), ('in', 42.84053923891562), ('you', 31.8044800345981), ('it', 31.484100620336793), ('not', 29.980036957249908)]
Topic 1:
[('the', 81.01073489261918), ('in', 32.02664745833698), ('to', 30.039024028607926), ('and', 26.137994991051997), ('of', 25.377119131594892), ('for', 18.640882550678963), ('game', 17.70024917622676), ('team', 17.325445200695643), ('hockey', 16.56641489476912), ('on', 16.40325972643322)]


## Visualize potential topics

👇 The function to print the words associated with the potential topics is already made for you. You just have to pass the correct arguments!

## Predict topic of new text

👇 You can now use your LDA model to predict the topic of a new text. First, use your vectorizer to vectorize the example. Then, use your LDA model to predict the topic of the vectorized example.

In [46]:
text = data['clean_text']

text_vectorized = vectorizer.transform(text)

lda_vectors = ldamodel.transform(text_vectorized)

print("topic 0 :", lda_vectors[0][0])
print("topic 1 :", lda_vectors[0][1])

topic 0 : 0.06273869854849222
topic 1 : 0.9372613014515079
