# Latent Dirichlet Allocation

In [1]:
import pandas as pd

data = pd.read_csv('data', sep=",", header=None)

data.columns = ['text']

data.head()

Unnamed: 0,text
0,From: gld@cunixb.cc.columbia.edu (Gary L Dare)...
1,From: atterlep@vela.acs.oakland.edu (Cardinal ...
2,From: miner@kuhub.cc.ukans.edu\nSubject: Re: A...
3,From: atterlep@vela.acs.oakland.edu (Cardinal ...
4,From: vzhivov@superior.carleton.ca (Vladimir Z...


The data is a collection of unlabelled emails. Let's try to extract topics from them!

## Preprocessing 

👇 You're used to it by now... Clean up! Store the cleaned text in a new dataframe column "clean_text".

In [4]:
import string 

def remove_punct(df_to_treat):
    dataf = df_to_treat.copy()
    for col in dataf:
        if dataf[col].dtype == 'O':
            for punct in string.punctuation:
                dataf[col] = [text.replace(punct, ' ') for text in dataf[col]]
    return dataf

def lower_func(df_to_treat):
    # Lowercase every string (object) column; one pass per column is enough
    dataf = df_to_treat.copy()
    for col in dataf:
        if dataf[col].dtype == 'O':
            dataf[col] = [text.lower() for text in dataf[col]]
    return dataf

clean_data = lower_func(remove_punct(data))
clean_data

def remove_nb(df_to_treat):
    dataf = df_to_treat.copy()
    for col in dataf:
        if dataf[col].dtype == 'O':
            # Iterating over a string yields characters, so this drops every digit character
            dataf[col] = [''.join(char for char in text if not char.isdigit()) for text in dataf[col]]
    return dataf

from nltk.corpus import stopwords 
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english')) 

def remove_sw(df_to_treat):
    dataf = df_to_treat.copy()
    for col in dataf:
        if dataf[col].dtype == 'O':
            # Tokenize each text and keep only the tokens that are not stop words
            dataf[col] = [[w for w in word_tokenize(text) if w not in stop_words] for text in dataf[col]]
    return dataf

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def lemm_func(df_to_treat):
    dataf = df_to_treat.copy()
    for col in dataf:
        if dataf[col].dtype == 'O':
            dataf[col] = [" ".join([lemmatizer.lemmatize(word) for word in text]) for text in dataf[col]]
    return dataf

Unnamed: 0,text
0,from gld cunixb cc columbia edu gary l dare ...
1,from atterlep vela acs oakland edu cardinal ...
2,from miner kuhub cc ukans edu\nsubject re a...
3,from atterlep vela acs oakland edu cardinal ...
4,from vzhivov superior carleton ca vladimir z...
...,...
1194,from jerryb eskimo com jerry kaufman \nsubje...
1195,from golchowy alchemy chem utoronto ca geral...
1196,from jayne mmalt guild org jayne kulikauskas...
1197,from sclark epas utoronto ca susan clark \ns...


In [14]:
lemmatizer.lemmatize("married")

'married'

In [7]:
clean_data = lemm_func(remove_sw(remove_nb(clean_data)))
clean_data

Unnamed: 0,text
0,gld cunixb cc columbia edu gary l dare subject...
1,atterlep vela ac oakland edu cardinal ximenez ...
2,miner kuhub cc ukans edu subject ancient book ...
3,atterlep vela ac oakland edu cardinal ximenez ...
4,vzhivov superior carleton ca vladimir zhivov s...
...,...
1194,jerryb eskimo com jerry kaufman subject prayer...
1195,golchowy alchemy chem utoronto ca gerald olcho...
1196,jayne mmalt guild org jayne kulikauskas subjec...
1197,sclark epa utoronto ca susan clark subject pic...
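
The cleaning steps above can also be collapsed into one function that works on a single string. The sketch below is only an illustration: `clean_text`, `SMALL_STOP_WORDS` and `PUNCT_TO_SPACE` are hypothetical names, the stop-word set is a tiny stand-in for NLTK's English list, and lemmatization is left out so that the example runs without NLTK.

```python
import string

# Hypothetical, deliberately tiny stop-word set (the notebook uses NLTK's English list).
SMALL_STOP_WORDS = {"the", "a", "an", "is", "was", "and", "of", "to", "in", "re"}

# Translation table that maps every punctuation character to a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def clean_text(text):
    """Replace punctuation with spaces, lowercase, drop digits and stop words."""
    text = text.translate(PUNCT_TO_SPACE).lower()
    text = "".join(ch for ch in text if not ch.isdigit())
    tokens = [tok for tok in text.split() if tok not in SMALL_STOP_WORDS]
    return " ".join(tokens)

sample = "From: user42@example.edu\nSubject: Re: The 2 books!"
cleaned = clean_text(sample)
print(cleaned)
```

Applied to the notebook's data, something like `data['clean_text'] = data['text'].apply(clean_text)` would store the result in the "clean_text" column that the exercise asks for.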


## Latent Dirichlet Allocation model

👇 Train an LDA model to extract potential topics.

In [11]:
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer().fit(clean_data['text'])

data_vectorized = vectorizer.transform(clean_data['text'])

lda_model = LatentDirichletAllocation(n_components=10).fit(data_vectorized)
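
LDA is a model of word counts, so it is often fitted on raw term counts rather than on TF-IDF weights. Here is a minimal sketch of that variant, assuming a hypothetical four-document toy corpus (`toy_docs`), two topics, and a fixed `random_state` so the fit is reproducible. None of these names or values come from the notebook.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, only here to make the sketch runnable.
toy_docs = [
    "priest bishop marriage ceremony",
    "bishop pope priest temple",
    "team player season game",
    "game season player injured",
]

# Raw term counts instead of TF-IDF weights.
count_vectorizer = CountVectorizer()
counts = count_vectorizer.fit_transform(toy_docs)

# random_state fixes the random initialization of the topics.
toy_lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

# One row per document, one column per topic.
doc_topics = toy_lda.transform(counts)
print(doc_topics.shape)
```

Whether counts or TF-IDF give more readable topics on this email dataset is something to check empirically.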

## Visualize potential topics

👇 The function to print the words associated with the potential topics is already made for you. You just have to pass the correct arguments!

In [12]:
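def print_topics(model, vectorizer):
    # get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
    feature_names = vectorizer.get_feature_names_out()
    for idx, topic in enumerate(model.components_):
        print("Topic %d:" % (idx))
        # Ten highest-weighted words of the topic, in descending order of weight
        print([(feature_names[i], topic[i])
                        for i in topic.argsort()[:-10 - 1:-1]])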
        

print_topics(lda_model, vectorizer)

Topic 0:
[('married', 4.705499797201052), ('marriage', 4.541112570250016), ('pope', 3.442161479890475), ('ceremony', 2.3560648896926004), ('jcj', 2.2024051265073648), ('eye', 2.1674535586790267), ('temple', 2.037488587276869), ('priest', 1.860301499941912), ('bishop', 1.6884242457801861), ('marry', 1.5888398966340005)]
Topic 1:
[('georgia', 4.4822418298030815), ('ai', 4.39433095328003), ('fisher', 3.8751187881926175), ('indiana', 3.835869373061983), ('uga', 3.756890955379642), ('athens', 3.263873046972323), ('covington', 3.100494101159468), ('sabbath', 2.9996534586707604), ('mcovingt', 2.9841736843499693), ('darius', 2.6344921162094126)]
Topic 2:
[('babylon', 1.2418487727994154), ('darren', 1.2201192331401791), ('infallible', 1.0414174197526616), ('rowan', 1.0305854419813678), ('pregnancy', 0.9154583796427921), ('tom', 0.8744659234469662), ('kilroy', 0.8564873700940343), ('gboro', 0.856487370093691), ('dlmqc', 0.7703898307329172), ('albrecht', 0.7262032430603643)]
Topic 3:
[('god', 34.
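
The slice `topic.argsort()[:-10 - 1:-1]` in `print_topics` is compact but hard to read. The sketch below shows the same idiom on a hypothetical five-element weight vector with `n_top = 3` (both made up for illustration): `argsort()` returns indices in ascending order of weight, and the negative-step slice walks backwards from the end to keep the last `n_top` of them.

```python
import numpy as np

# Hypothetical topic weights, one per vocabulary word.
weights = np.array([0.5, 3.0, 1.2, 4.7, 0.1])
n_top = 3

# Indices sorted by ascending weight: [4, 0, 2, 1, 3]
ascending = weights.argsort()

# Walk backwards from the last index and stop after n_top elements.
top_indices = ascending[:-n_top - 1:-1]
print(top_indices.tolist())  # [3, 1, 2]
```

An equivalent and arguably clearer spelling is `weights.argsort()[::-1][:n_top]`.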

## Predict topic of new text

👇 You can now use your LDA model to predict the topic of a new text. First, use your vectorizer to vectorize the example. Then, use your LDA model to predict the topic of the vectorized example.

In [31]:
example = ["My team performed poorly last season. Their best player was out injured and only played one game"]

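# Apply the same cleaning pipeline as for the training data
to_test = lemm_func(remove_sw(remove_nb(lower_func(remove_punct(pd.DataFrame(example))))))
# The DataFrame built from the list has a single column named 0
example_vectorized = vectorizer.transform(to_test[0])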

res = lda_model.transform(example_vectorized)

for top in range(len(res[0])):
    print(f"The probability of belonging to Topic {top} is: {res[0][top]}")


The probability of belonging to Topic 0 is: 0.024695404827896117
The probability of belonging to Topic 1 is: 0.02469540400409921
The probability of belonging to Topic 2 is: 0.024695405683845508
The probability of belonging to Topic 3 is: 0.14523043891838258
The probability of belonging to Topic 4 is: 0.024695403882919766
The probability of belonging to Topic 5 is: 0.6572063184336685
The probability of belonging to Topic 6 is: 0.024695406257364605
The probability of belonging to Topic 7 is: 0.024695406027295586
The probability of belonging to Topic 8 is: 0.024695406091351374
The probability of belonging to Topic 9 is: 0.02469540587317669
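
To turn the distribution into a single prediction, one option is to take the topic with the highest probability. A small sketch using the values printed above, with `topic_probs` and `best_topic` as illustrative names:

```python
# Topic probabilities copied from the output above (index = topic number).
topic_probs = [
    0.024695404827896117, 0.02469540400409921, 0.024695405683845508,
    0.14523043891838258, 0.024695403882919766, 0.6572063184336685,
    0.024695406257364605, 0.024695406027295586, 0.024695406091351374,
    0.02469540587317669,
]

# Index of the largest probability, i.e. the most likely topic.
best_topic = max(range(len(topic_probs)), key=topic_probs.__getitem__)
print(f"Most likely topic: {best_topic} (p = {topic_probs[best_topic]:.3f})")
```

With the NumPy array returned by `lda_model.transform`, `res[0].argmax()` gives the same index directly.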
