# Latent Dirichlet Allocation

In [7]:
import pandas as pd

data = pd.read_csv('data', sep=",", header=None)

data.columns = ['text']

data.head()

Unnamed: 0,text
0,From: gld@cunixb.cc.columbia.edu (Gary L Dare)...
1,From: atterlep@vela.acs.oakland.edu (Cardinal ...
2,From: miner@kuhub.cc.ukans.edu\nSubject: Re: A...
3,From: atterlep@vela.acs.oakland.edu (Cardinal ...
4,From: vzhivov@superior.carleton.ca (Vladimir Z...


The data is a collection of unlabelled emails. Let's try to extract topics from them!

## Preprocessing 

👇 You're used to it by now... Clean up! Store the cleaned text in a new dataframe column "clean_text".

In [33]:
import string

from nltk.corpus import stopwords

# Requires the NLTK stop word list: run nltk.download('stopwords') once if it is missing.
stop_words = set(stopwords.words('english'))

def punctuation_lower(text):
    # Strip every punctuation character, then lowercase.
    text = text.translate(str.maketrans('', '', string.punctuation))
    return text.lower()

def remove_stop_words(text):
    # Drop English stop words; splitting and re-joining also collapses extra whitespace.
    words = [word for word in text.split() if word not in stop_words]
    return ' '.join(words)

def remove_numbers(text):
    # Remove every digit character.
    return ''.join(c for c in text if not c.isdigit())

# Apply the three steps in the same order as before: punctuation/case, stop words, digits.
data['clean_text'] = (
    data['text']
    .apply(punctuation_lower)
    .apply(remove_stop_words)
    .apply(remove_numbers)
)
data


Unnamed: 0,text,clean_text
0,From: gld@cunixb.cc.columbia.edu (Gary L Dare)...,gldcunixbcccolumbiaedu gary l dare subject sta...
1,From: atterlep@vela.acs.oakland.edu (Cardinal ...,atterlepvelaacsoaklandedu cardinal ximenez sub...
2,From: miner@kuhub.cc.ukans.edu\nSubject: Re: A...,minerkuhubccukansedu subject ancient books org...
3,From: atterlep@vela.acs.oakland.edu (Cardinal ...,atterlepvelaacsoaklandedu cardinal ximenez sub...
4,From: vzhivov@superior.carleton.ca (Vladimir Z...,vzhivovsuperiorcarletonca vladimir zhivov subj...
...,...,...
1194,From: jerryb@eskimo.com (Jerry Kaufman)\nSubje...,jerrybeskimocom jerry kaufman subject prayers ...
1195,From: golchowy@alchemy.chem.utoronto.ca (Geral...,golchowyalchemychemutorontoca gerald olchowy s...
1196,From: jayne@mmalt.guild.org (Jayne Kulikauskas...,jaynemmaltguildorg jayne kulikauskas subject q...
1197,From: sclark@epas.utoronto.ca (Susan Clark)\nS...,sclarkepasutorontoca susan clark subject picks...
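
The three cleaning steps above can also be folded into a single function, which makes it easier to apply exactly the same cleaning to new text later. This is only a minimal sketch, not the notebook's own code: `clean_text` and the tiny `STOP_WORDS` set are hypothetical stand-ins (the notebook uses NLTK's English stop word list), chosen so the block runs without any download.

```python
import string

# Hypothetical, deliberately tiny stop word list so the sketch is self-contained;
# the notebook itself uses nltk.corpus.stopwords.words('english').
STOP_WORDS = {"from", "the", "a", "an", "of", "is", "in", "to", "and"}

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def clean_text(text, stop_words=STOP_WORDS):
    """Remove punctuation, lowercase, drop stop words, then strip digits."""
    text = text.translate(_PUNCT_TABLE).lower()
    words = [w for w in text.split() if w not in stop_words]
    text = ' '.join(words)
    return ''.join(c for c in text if not c.isdigit())

print(clean_text("From: gld@cunixb.cc.columbia.edu (Gary L Dare)"))
```

With a dataframe this would be used as `data['clean_text'] = data['text'].apply(clean_text)`. Note that tokens made only of digits leave extra spaces behind, because digits are stripped after the words are re-joined.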


## Latent Dirichlet Allocation model

👇 Train an LDA model to extract potential topics.

In [52]:
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer().fit(data["clean_text"])

data_vectorized = vectorizer.transform(data["clean_text"])
lda_model = LatentDirichletAllocation(n_components=2).fit(data_vectorized)







## Visualize potential topics

👇 The function to print the words associated with the potential topics is already made for you. You just have to pass the correct arguments!

In [53]:
def print_topics(model, vectorizer, n_top_words=10):
    # get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
    feature_names = vectorizer.get_feature_names_out()
    for idx, topic in enumerate(model.components_):
        print("Topic %d:" % idx)
        # Indices of the n_top_words largest weights, in descending order
        top_indices = topic.argsort()[:-n_top_words - 1:-1]
        print([(str(feature_names[i]), float(topic[i])) for i in top_indices])


print_topics(lda_model, vectorizer)

Topic 0:
[('testing', 1.5479902123759945), ('khettryrwpubutkedu', 1.1260396435828257), ('tennessee', 1.126039641983282), ('rfl', 1.1260396416291436), ('sturm', 0.9860209973821916), ('dee', 0.8991891370067002), ('howell', 0.8913440058627251), ('utk', 0.789968768495837), ('addresses', 0.7571852571456821), ('dohertyldcsglaacuk', 0.7317504784917539)]
Topic 1:
[('god', 29.920898926367332), ('would', 25.808759713078448), ('one', 23.025272690548928), ('subject', 22.43479138276808), ('organization', 21.552580379872396), ('lines', 21.47871132088038), ('university', 21.478577414248935), ('writes', 20.40588618493879), ('people', 20.37856514139802), ('game', 19.567286780859)]
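
The negative-step slice applied to `topic.argsort()` in `print_topics` is compact but easy to misread. As a small illustration on a made-up weight vector (the `weights` array and the `top_n_indices` helper are hypothetical, not part of the notebook), `argsort()` returns indices ordered from the smallest to the largest weight, and the slice walks that result backwards from the end, stopping after `n` items:

```python
import numpy as np

def top_n_indices(weights, n):
    """Indices of the n largest weights, largest first."""
    # argsort() sorts ascending; [:-n - 1:-1] reads the last n entries in reverse.
    return weights.argsort()[:-n - 1:-1]

weights = np.array([0.2, 1.5, 0.7, 3.1, 0.1])
top = top_n_indices(weights, 3)
print(top.tolist())  # [3, 1, 2]
```

In `print_topics`, each row of `model.components_` plays the role of `weights`, and the returned indices are looked up in the vectorizer's vocabulary to get the words.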


## Predict topic of new text

👇 You can now use your LDA model to predict the topic of a new text. First, use your vectorizer to vectorize the example. Then, use your LDA model to predict the topic of the vectorized example.

In [54]:
new_text = ["god, the book of truth, an epic saga by the famous author patrick rothfuss"]

# Reuse the vectorizer and LDA model fitted on the emails: call transform(), not fit(),
# so the new text is mapped onto the existing vocabulary and topics.
new_text_vectorized = vectorizer.transform(new_text)

# One row per document, one column per topic; each row sums to 1.
lda_model.transform(new_text_vectorized)
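
To make the two steps concrete, here is a minimal, self-contained sketch on a made-up four-document corpus. The `toy_corpus` texts, the variable names, and `random_state=0` are assumptions for illustration, not the notebook's data or settings. The key point is that the new text only goes through `transform()` on the already fitted vectorizer and model; nothing is refitted. `LatentDirichletAllocation.transform` returns one row per document with one weight per topic, and the largest weight points to the most likely topic.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for the cleaned emails.
toy_corpus = [
    "god church faith bible prayer",
    "faith prayer god christian belief",
    "hockey game team season players",
    "team players game goal playoffs",
]

# Fit the vectorizer and the LDA model once, on the training texts only.
toy_vectorizer = TfidfVectorizer().fit(toy_corpus)
toy_lda = LatentDirichletAllocation(n_components=2, random_state=0)
toy_lda.fit(toy_vectorizer.transform(toy_corpus))

# Predict: transform only, reusing the fitted vocabulary and topics.
new_text = ["god book truth epic saga famous author"]
new_vectorized = toy_vectorizer.transform(new_text)
topic_distribution = toy_lda.transform(new_vectorized)

print(topic_distribution)                      # shape (1, 2); the row sums to 1
print(int(np.argmax(topic_distribution[0])))   # index of the most likely topic
```

Topic indices carry no meaning on their own; printing the top words per topic, as `print_topics` does, shows what each index stands for. Words in the new text that are absent from the fitted vocabulary are silently ignored by the vectorizer.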
