# Latent Dirichlet Allocation

In [6]:
import pandas as pd

data = pd.read_csv('data', sep=",", header=None)

data.columns = ['text']

data.head()

| | text |
|---|---|
| 0 | From: gld@cunixb.cc.columbia.edu (Gary L Dare)... |
| 1 | From: atterlep@vela.acs.oakland.edu (Cardinal ... |
| 2 | From: miner@kuhub.cc.ukans.edu\nSubject: Re: A... |
| 3 | From: atterlep@vela.acs.oakland.edu (Cardinal ... |
| 4 | From: vzhivov@superior.carleton.ca (Vladimir Z... |


The data is a collection of unlabelled emails. Let's try to extract topics from them!

In [3]:
# just installing everything to be sure
!pip install nltk
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')



[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\utilisateur\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\utilisateur\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\utilisateur\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\utilisateur\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

## Preprocessing 

👇 You're used to it by now... Clean up! Store the cleaned text in a new dataframe column "clean_text".

In [20]:
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

stopw = set(stopwords.words('english'))
lem = WordNetLemmatizer()
# translation table that deletes punctuation and newline characters
strip_chars = str.maketrans('', '', string.punctuation + "\n")


def clean(text):
    # lowercase, tokenize, drop stopwords and pure-digit tokens, lemmatize
    tokens = [lem.lemmatize(w) for w in word_tokenize(text.lower())
              if w not in stopw and not w.isdigit()]
    # punctuation is deleted after joining, so "n't" becomes "nt"
    return ' '.join(tokens).translate(strip_chars)


# work on a copy so the raw text in `data` stays untouched
clean_data = data.copy()
clean_data['text'] = clean_data['text'].apply(clean)

clean_data.head()

| | text |
|---|---|
| 0 | gld cunixbcccolumbiaedu gary l dare subjec... |
| 1 | atterlep velaacsoaklandedu cardinal ximenez... |
| 2 | miner kuhubccukansedu subject ancient book... |
| 3 | atterlep velaacsoaklandedu cardinal ximenez... |
| 4 | vzhivov superiorcarletonca vladimir zhivov ... |
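
The cleaned output above shows host names such as `cunixb.cc.columbia.edu` collapsing into single tokens like `cunixbcccolumbiaedu`, because punctuation is deleted rather than replaced by a space. A minimal sketch of the difference, using only the standard library; the helper names `delete_punct` and `space_punct` are hypothetical and not part of the notebook:

```python
import string

# table that deletes every punctuation character
DELETE = str.maketrans('', '', string.punctuation)
# table that maps every punctuation character to a space
TO_SPACE = str.maketrans(string.punctuation, ' ' * len(string.punctuation))


def delete_punct(text):
    """Drop punctuation; neighbouring fragments fuse into one token."""
    return text.translate(DELETE)


def space_punct(text):
    """Replace punctuation with spaces, then collapse repeated whitespace."""
    return ' '.join(text.translate(TO_SPACE).split())


sample = "cunixb.cc.columbia.edu"
print(delete_punct(sample))
print(space_punct(sample))
```

Which variant is preferable probably depends on whether fragments such as `columbia` or `edu` are useful vocabulary for the topics; the notebook itself uses the deleting behaviour.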


## Latent Dirichlet Allocation model

👇 Train an LDA model to extract potential topics.

In [28]:
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import TfidfVectorizer

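# learn the vocabulary and IDF weights, then build the document-term matrix
vect = TfidfVectorizer().fit(clean_data['text'])
vdata = vect.transform(clean_data['text'])

# fit a two-topic LDA model on the vectorized corpus
lda = LatentDirichletAllocation(n_components=2).fit(vdata)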
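
LDA is formulated as a generative model over word counts, so it is often fit on raw term counts rather than TF-IDF weights. The sketch below is an alternative setup, not what the cell above runs: it swaps in `CountVectorizer` and fixes `random_state` so that repeated runs give the same topics. The four-document `toy_docs` corpus and the names `count_vect` and `toy_lda` are made up for illustration.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# hypothetical toy corpus standing in for clean_data['text']
toy_docs = [
    "god christian church faith",
    "team game player season",
    "christian faith god bible",
    "game team score player",
]

# raw term counts instead of TF-IDF weights
count_vect = CountVectorizer()
counts = count_vect.fit_transform(toy_docs)

# random_state makes the fitted topics reproducible between runs
toy_lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = toy_lda.fit_transform(counts)

print(doc_topics.shape)           # (number of documents, number of topics)
print(toy_lda.components_.shape)  # (number of topics, vocabulary size)
```

Whether counts give more interpretable topics than TF-IDF on this email corpus would need to be checked by rerunning the notebook with both vectorizers.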

## Visualize potential topics

👇 The function to print the words associated with the potential topics is already made for you. You just have to pass the correct arguments!

In [29]:
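def print_topics(model, vect, n_top=10):
    # look up the vocabulary once instead of once per word
    names = vect.get_feature_names_out()
    for i, topic in enumerate(model.components_):
        print("Topic " + str(i) + ":")
        # indices of the n_top largest weights, in descending order
        top = topic.argsort()[:-n_top - 1:-1]
        print([(names[j], topic[j]) for j in top])

print_topics(lda, vect)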

Topic 0:
[('petch', 4.189542730199536), ('grass', 3.952153513917817), ('valley', 3.4932416848907293), ('gvg47gvgtekcom', 2.467269853340981), ('550', 2.363794987652473), ('daily', 1.8760108818565893), ('chuck', 1.780004015427136), ('testing', 1.4863279104840235), ('khettry', 1.0991932459477602), ('r1w2pubutkedu', 1.0991932456388953)]
Topic 1:
[('god', 34.70866927365021), ('nt', 33.820259448795944), ('would', 26.05697743312894), ('game', 25.573434089959537), ('team', 24.365901247743583), ('one', 23.489159972185526), ('christian', 21.874989321717905), ('line', 21.791935258865664), ('subject', 21.591197231265173), ('organization', 20.718023826657003)]
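
The numbers printed next to each word are entries of `components_`, which are unnormalized weights (scikit-learn describes them as pseudocounts). Dividing each row by its sum turns them into a per-topic word distribution, which is easier to compare across topics. A minimal sketch with a made-up `components` array standing in for `lda.components_`:

```python
import numpy as np

# hypothetical stand-in for lda.components_: 2 topics x 4 vocabulary words
components = np.array([
    [4.0, 3.0, 2.0, 1.0],
    [30.0, 30.0, 20.0, 20.0],
])

# normalize each row so the weights of a topic sum to 1
topic_word = components / components.sum(axis=1, keepdims=True)

print(topic_word)
```

In the output above, the top weights of Topic 1 are roughly ten times larger than those of Topic 0, which suggests that most of the corpus was assigned to Topic 1; normalizing makes the two word lists comparable despite that imbalance.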


## Predict topic of new text

👇 You can now use your LDA model to predict the topic of a new text. First, use your vectorizer to vectorize the example. Then, use your LDA model to predict the topic of the vectorized example.

In [30]:
example = ["rice var congratulations save upenn"] # bless you

vect_ex = vect.transform(example)

ldavect = lda.transform(vect_ex)

print("Topic 0: " + str(ldavect[0][0]))
print("Topic 1: " + str(ldavect[0][1]))

Topic 0: 0.23996297682952167
Topic 1: 0.7600370231704783
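
`lda.transform` returns one row per document with a probability for each topic, so the predicted topic is the column with the largest value. A small sketch using the two probabilities printed above, hard-coded in a `doc_topic` array (a name chosen for this example) so that it runs without the fitted model:

```python
import numpy as np

# probabilities copied from the output above (one row per document)
doc_topic = np.array([[0.23996297682952167, 0.7600370231704783]])

# index of the most probable topic for each document
predicted = doc_topic.argmax(axis=1)

print("Predicted topic: " + str(predicted[0]))
```

With the fitted model, the same call would be `lda.transform(vect_ex).argmax(axis=1)`.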
