# Latent Dirichlet Allocation

In [7]:
import pandas as pd

data = pd.read_csv('data', sep=",", header=None)

data.columns = ['text']

data.head()

|   | text |
|---|------|
| 0 | From: gld@cunixb.cc.columbia.edu (Gary L Dare)... |
| 1 | From: atterlep@vela.acs.oakland.edu (Cardinal ... |
| 2 | From: miner@kuhub.cc.ukans.edu\nSubject: Re: A... |
| 3 | From: atterlep@vela.acs.oakland.edu (Cardinal ... |
| 4 | From: vzhivov@superior.carleton.ca (Vladimir Z... |


The data is a collection of unlabelled emails. Let's try to extract topics from them!

## Preprocessing 

👇 You're used to it by now... Clean up! Store the cleaned text in a new dataframe column "clean_mail".

In [5]:
import string

from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import word_tokenize

# Requires the NLTK stop-word corpus and Punkt tokenizer data (see nltk.download).
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

def rem_punct(text):
    """Remove every ASCII punctuation character."""
    for punctuation in string.punctuation:
        text = text.replace(punctuation, '')
    return text

def low_case(text):
    """Lowercase the text."""
    return text.lower()

def rem_number(text):
    """Remove digit characters."""
    return ''.join(char for char in text if not char.isdigit())

def rem_stop_words(text):
    """Tokenize the text and drop English stop words; returns a list of tokens."""
    word_tokens = word_tokenize(text)
    return [w for w in word_tokens if w not in stop_words]

def app_lemmatize(text):
    """Stem each token. Despite the name, this is Porter stemming, not lemmatization."""
    return [stemmer.stem(word) for word in text]



In [8]:
data['clean_mail'] = data['text'].apply(rem_punct)
data['clean_mail'] = data['clean_mail'].apply(low_case)
data['clean_mail'] = data['clean_mail'].apply(rem_number)
data['clean_mail'] = data['clean_mail'].apply(rem_stop_words)
data['clean_mail'] = data['clean_mail'].apply(app_lemmatize)

# Join the token lists back into strings. Using .apply avoids chained
# assignment (data['clean_mail'][i] = ...), which may write to a copy.
data['clean_mail'] = data['clean_mail'].apply(' '.join)
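
To see what these steps do to a single message, here is a minimal, dependency-free sketch of the same sequence. It is an illustration only: `clean`, `TOY_STOP_WORDS`, and the sample message are hypothetical, whitespace splitting stands in for `word_tokenize`, and stemming is left out.

```python
import string

# Hypothetical, tiny stop-word list used only for this illustration;
# the notebook uses NLTK's full English list instead.
TOY_STOP_WORDS = {"the", "is", "a", "to", "in", "of", "and"}

def clean(text):
    """Apply the same order of steps as the cells above, minus stemming."""
    # 1. Strip punctuation in a single pass.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # 2. Lowercase.
    text = text.lower()
    # 3. Drop digit characters.
    text = "".join(ch for ch in text if not ch.isdigit())
    # 4. Split on whitespace (a stand-in for word_tokenize) and drop stop words.
    tokens = [tok for tok in text.split() if tok not in TOY_STOP_WORDS]
    # 5. Join back into one string, the form the vectorizer expects.
    return " ".join(tokens)

sample = "From: fan42@example.edu\nSubject: Re: The playoffs in 1993!"
print(clean(sample))  # from fanexampleedu subject re playoffs
```

Note how the email address collapses into the single token `fanexampleedu` once `@` and `.` are removed. The same effect explains tokens such as `gtdaprismgatechedu` in the topic listings further down.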


## Latent Dirichlet Allocation model

👇 Train an LDA model to extract potential topics.

In [30]:
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import TfidfVectorizer

n = 2  # number of topics; the outputs shown below come from a run with a larger n

vectorizer = TfidfVectorizer().fit(data['clean_mail'])

data_vectorized = vectorizer.transform(data['clean_mail'])

lda_model = LatentDirichletAllocation(n_components=n).fit(data_vectorized)
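
LDA is usually described as a model of word counts, so one variant worth trying is `CountVectorizer` in place of `TfidfVectorizer`. The sketch below is not the notebook's method: it fits a two-topic model on a hypothetical toy corpus (all `toy_*` names are made up for illustration) and checks the shapes of the fitted objects.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, not taken from the email data.
toy_docs = [
    "hockey team win game playoff",
    "goal score hockey season team",
    "god faith church belief bible",
    "church bible faith christian god",
]

# Raw term counts instead of TF-IDF weights.
toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_docs)

# random_state makes the fit reproducible from run to run.
toy_lda = LatentDirichletAllocation(n_components=2, random_state=0)
toy_doc_topics = toy_lda.fit_transform(toy_counts)

print(toy_lda.components_.shape)   # (n_topics, n_vocabulary_terms)
print(toy_doc_topics.sum(axis=1))  # each row is a distribution over topics
```

Setting `random_state` on the real model as well would make the extracted topics comparable between runs.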




## Visualize potential topics

👇 The function to print the words associated with the potential topics is already made for you. You just have to pass the correct arguments!

In [24]:
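def print_topics(model, vectorizer):
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    feature_names = vectorizer.get_feature_names_out()
    for idx, topic in enumerate(model.components_):
        print("Topic %d:" % (idx))
        # Ten highest-weighted words for this topic, in descending order
        print([(feature_names[i], topic[i])
                        for i in topic.argsort()[:-10 - 1:-1]])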
        

In [27]:
print_topics(lda_model, vectorizer)

Topic 0:
[('hrivnak', 3.7336364098066435), ('gtdaprismgatechedu', 3.5052790528906477), ('vote', 2.4715692738199357), ('poll', 2.3263337348697593), ('colon', 1.661666213586317), ('nhlpa', 1.544213735469102), ('brave', 1.5207563887578934), ('patton', 1.4591938258244483), ('hornet', 1.4591938258244483), ('friedman', 1.4219186948077607)]
Topic 1:
[('ideolog', 1.2223813100451495), ('denounc', 1.0220587322835872), ('lazaru', 0.9329203327632578), ('rolfedsuvaxdsuedu', 0.7616990845567164), ('administr', 0.7615172304457017), ('manipul', 0.6637480638369345), ('servic', 0.6266500379531555), ('mp', 0.592264029103994), ('wwaandrewcmuedu', 0.5687667841524586), ('categori', 0.5646702476005382)]
Topic 2:
[('tp', 1.1904744213787777), ('rochest', 0.923747376801503), ('terlep', 0.8856318665701742), ('disorgan', 0.8445506023497851), ('mi', 0.8376356615223374), ('oakland', 0.7976166689000257), ('alan', 0.7661201805924708), ('sledd', 0.7393109488807782), ('ylnen', 0.7107283588040664), ('centr', 0.6796547753
...
[('keller', 4.352869501951554), ('keith', 4.190935471402818), ('kkellermailsasupennedu', 3.8772217658495443), ('hartford', 2.714344832753437), ('period', 2.6412362162920484), ('lindro', 2.1604071954168886), ('nd', 2.0393199034924785), ('quaker', 1.8241852156569969), ('ivi', 1.8241852156569969), ('mailsasupennedu', 1.7777912661454607)]
Topic 27:
[('wsh', 0.8115511834622321), ('edm', 0.6791357475600194), ('mtl', 0.5999766788864254), ('wpg', 0.5495949190900543), ('phi', 0.504290106856664), ('rickgranberryptsmotcom', 0.4942606494191856), ('granberri', 0.4942606494191856), ('cgi', 0.4897605872007664), ('nyr', 0.48778175557570413), ('min', 0.4858487265963567)]
Topic 28:
[('kariya', 0.8905811893145608), ('secular', 0.8286045347505413), ('walsh', 0.7785573717014305), ('keenan', 0.7140689751654355), ('druce', 0.6259762921898607), ('neilsen', 0.602058313014648), ('embarrass', 0.5880386638823663), ('extant', 0.5676739630169999), ('elynuik', 0.5640644720001686), ('illiter', 0.5342456873007688)]
...
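
The slice `topic.argsort()[:-10 - 1:-1]` in `print_topics` is what selects the ten highest-weighted words. A small sketch with hypothetical words and weights, using `k = 3` instead of 10, shows the mechanics:

```python
import numpy as np

# Hypothetical weights for a single topic over a 6-word vocabulary.
words = np.array(["goal", "vote", "team", "poll", "puck", "mail"])
weights = np.array([0.2, 2.5, 0.9, 2.3, 0.1, 1.4])

k = 3
# argsort is ascending; [:-k - 1:-1] walks backwards from the end
# and stops after k items, giving the indices of the k largest weights.
top_idx = weights.argsort()[:-k - 1:-1]

top_words = [(str(words[i]), float(weights[i])) for i in top_idx]
print(top_words)  # [('vote', 2.5), ('poll', 2.3), ('mail', 1.4)]
```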

## Predict topic of new text

👇 You can now use your LDA model to predict the topic of a new text. First, use your vectorizer to vectorize the example. Then, use your LDA model to predict the topic of the vectorized example.

In [25]:
example = ["In God we trust"]

example_vectorized = vectorizer.transform(example)

lda_vectors = lda_model.transform(example_vectorized)


In [26]:
for i in range(n):
    print(f"topic {i} : {lda_vectors[0][i]}")

topic 0 : 0.007693823015660035
topic 1 : 0.007693823015660035
topic 2 : 0.007693823015660035
topic 3 : 0.007693823015660035
topic 4 : 0.007693823015660035
topic 5 : 0.007693823015660035
topic 6 : 0.007693823015660035
topic 7 : 0.3123911548643185
topic 8 : 0.007693823015660035
topic 9 : 0.007693823015660035
topic 10 : 0.007693823015660035
topic 11 : 0.007693823015660035
topic 12 : 0.007693823015660035
topic 13 : 0.007693823015660035
topic 14 : 0.10423283902425216
topic 15 : 0.007693823015660035
topic 16 : 0.007693823015660035
topic 17 : 0.007693823015660035
topic 18 : 0.007693823015660035
topic 19 : 0.007693823015660035
topic 20 : 0.007693823015660035
topic 21 : 0.007693823015660035
topic 22 : 0.007693823015660035
topic 23 : 0.007693823015660035
topic 24 : 0.007693823015660035
topic 25 : 0.007693823015660035
topic 26 : 0.007693823015660035
topic 27 : 0.007693823015660035
topic 28 : 0.007693823015660035
topic 29 : 0.007693823015660035
topic 30 : 0.2217663243749652
topic 31 : 0.0076938230
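
`lda_model.transform` returns a distribution over topics, not a single label. One simple way to turn it into a prediction, sketched here on a hypothetical five-topic vector (`toy_vectors` is made up for illustration), is to take the index with the highest probability:

```python
import numpy as np

# Hypothetical document-topic distribution for one document and 5 topics,
# shaped like the output of lda_model.transform: (n_documents, n_topics).
toy_vectors = np.array([[0.05, 0.60, 0.05, 0.25, 0.05]])

# Index of the highest-probability topic for each document.
best_topic = toy_vectors.argmax(axis=1)

# Topics ranked from most to least probable for the first document.
ranking = np.argsort(toy_vectors[0])[::-1]

print(int(best_topic[0]))    # 1
print(ranking[:2].tolist())  # [1, 3]
```

In the visible part of the output above, topics 7, 30, and 14 carry most of the probability mass (about 0.31, 0.22, and 0.10), so the same approach would pick topic 7 for "In God we trust", assuming no larger value sits in the truncated part.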