# Latent Dirichlet Allocation

In [19]:
import pandas as pd

data = pd.read_csv('data', sep=",", header=None)

data.columns = ['text']

data.head()

Unnamed: 0,text
0,From: gld@cunixb.cc.columbia.edu (Gary L Dare)...
1,From: atterlep@vela.acs.oakland.edu (Cardinal ...
2,From: miner@kuhub.cc.ukans.edu\nSubject: Re: A...
3,From: atterlep@vela.acs.oakland.edu (Cardinal ...
4,From: vzhivov@superior.carleton.ca (Vladimir Z...


The data is a collection of emails that are not labelled. Let's try extract topics from them!

## Preprocessing 

👇 You're used to it by now... Clean up! Store the cleaned text in a new dataframe column "clean_text".

In [51]:
from string import punctuation
import re
from nltk.corpus import stopwords

def clean_text(text):

    text = re.sub(r'http\S+', ' ', text)
    text = re.sub("\d+", " ", text)
    text = text.replace('\n', ' ')
    text = text.translate(str.maketrans("", "", punctuation))
    text = text.lower()
    
    stop_words = set(stopwords.words('english'))
    filtered_sentence = [w for w in text if not w.lower() in stop_words]
    filtered_sentence = []
    for w in text:
        if w not in stop_words:
            filtered_sentence.append(w)
    return text

data['clean_text'] = data['text'].apply(clean_text)
data.head()

  text = re.sub("\d+", " ", text)


Unnamed: 0,text,clean_text
0,From: gld@cunixb.cc.columbia.edu (Gary L Dare)...,from gldcunixbcccolumbiaedu gary l dare subjec...
1,From: atterlep@vela.acs.oakland.edu (Cardinal ...,from atterlepvelaacsoaklandedu cardinal ximene...
2,From: miner@kuhub.cc.ukans.edu\nSubject: Re: A...,from minerkuhubccukansedu subject re ancient b...
3,From: atterlep@vela.acs.oakland.edu (Cardinal ...,from atterlepvelaacsoaklandedu cardinal ximene...
4,From: vzhivov@superior.carleton.ca (Vladimir Z...,from vzhivovsuperiorcarletonca vladimir zhivov...


## Latent Dirichlet Allocation model

👇 Train an LDA model to extract potential topics.

In [52]:
import gensim
from gensim import corpora
from gensim.corpora import Dictionary
from gensim.models import LdaModel

dictionary = corpora.Dictionary(data['clean_text'].str.split().tolist())

doc_term_matrix = [dictionary.doc2bow(text) for text in data['clean_text'].str.split().tolist()]

# creating the object for LDA model using gensim library
Lda = gensim.models.ldamodel.LdaModel

# runnug and trainig LDA model on the document term matrix for 3 topics
ldamodel = Lda(doc_term_matrix, num_topics = 3, id2word = dictionary, passes = 50)

dictionary

<gensim.corpora.dictionary.Dictionary at 0x1c05aa91760>

## Visualize potential topics

👇 The function to print the words associated with the potential topics is already made for you. You just have to pass the correct arguments!

In [53]:
import pyLDAvis
import pyLDAvis.gensim_models

pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim_models.prepare(ldamodel, doc_term_matrix, dictionary)
vis

  default_term_info = default_term_info.sort_values(


## Predict topic of new text

👇 You can now use your LDA model to predict the topic of a new text. First, use your vectorizer to vectorize the example. Then, use your LDA model to predict the topic of the vectorized example.