# Latent Dirichlet Allocation

In [1]:
import pandas as pd
df = pd.read_csv('data', sep=",", header=None)
df.columns = ['text']
df.head()

                                                text
0  From: gld@cunixb.cc.columbia.edu (Gary L Dare)...
1  From: atterlep@vela.acs.oakland.edu (Cardinal ...
2  From: miner@kuhub.cc.ukans.edu\nSubject: Re: A...
3  From: atterlep@vela.acs.oakland.edu (Cardinal ...
4  From: vzhivov@superior.carleton.ca (Vladimir Z...


The data is a collection of unlabelled emails. Let's try to extract topics from them!

## Preprocessing 

👇 You're used to it by now... Clean up! Store the cleaned text in a new dataframe column "clean_text".

In [5]:
import string

def remove_punct(x):
    # Replace every punctuation character with a space
    for p in string.punctuation:
        x = x.replace(p, ' ')
    # Lowercase once, then drop digit characters
    x = x.lower()
    return ''.join(char for char in x if not char.isdigit())

df["clean_text"] = df['text'].apply(remove_punct)

In [6]:
df["clean_text"].head()

0    from  gld cunixb cc columbia edu  gary l dare ...
1    from  atterlep vela acs oakland edu  cardinal ...
2    from  miner kuhub cc ukans edu\nsubject  re  a...
3    from  atterlep vela acs oakland edu  cardinal ...
4    from  vzhivov superior carleton ca  vladimir z...
Name: clean_text, dtype: object
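
The loop-based cleaning above can also be written with a single translation table. This is a minimal sketch, assuming only ASCII digits need removing (`str.isdigit` also matches other Unicode digits). The names `CLEAN_TABLE`, `clean_text`, and the sample string are illustrative and not part of the notebook, and the optional whitespace collapse is an extra step the cleaning above does not perform.

```python
import string

# Map every punctuation character to a space and delete ASCII digits.
CLEAN_TABLE = str.maketrans(
    {**{p: " " for p in string.punctuation}, **{d: None for d in string.digits}}
)

def clean_text(text, collapse_whitespace=False):
    # Apply the table in one pass, then lowercase.
    cleaned = text.translate(CLEAN_TABLE).lower()
    if collapse_whitespace:
        # Optional: squeeze runs of whitespace into single spaces.
        cleaned = " ".join(cleaned.split())
    return cleaned

sample = "Hello, World 123!"
print(repr(clean_text(sample)))
print(repr(clean_text(sample, collapse_whitespace=True)))
```

On a DataFrame this could be applied as `df['text'].apply(clean_text)`.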

## Latent Dirichlet Allocation model

👇 Train an LDA model to extract potential topics.

In [8]:
from sklearn.decomposition import LatentDirichletAllocation

In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [14]:
vectorizer = TfidfVectorizer().fit(df['clean_text'])
data_vectorized = vectorizer.transform(df['clean_text'])
lda_model = LatentDirichletAllocation(n_components=2).fit(data_vectorized)

In [17]:
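def print_topics(model, vectorizer, n_top_words=10):
    # Look up the vocabulary once instead of once per word
    feature_names = vectorizer.get_feature_names_out()
    for idx, topic in enumerate(model.components_):
        print("Topic %d:" % (idx))
        # Indices of the n_top_words largest weights, in descending order
        top_indices = topic.argsort()[::-1][:n_top_words]
        print([(feature_names[i], topic[i]) for i in top_indices])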
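
The top-word selection relies on `argsort`, which returns indices in ascending order of weight, so the largest weights sit at the end. A toy illustration, with made-up weights and a hypothetical `n_top`, showing that the slicing idioms `[:-n_top - 1:-1]` and `[::-1][:n_top]` pick the same indices:

```python
import numpy as np

# Made-up topic weights for a four-word vocabulary (illustrative only).
toy_topic = np.array([0.1, 0.5, 0.3, 0.9])
n_top = 2

ascending = toy_topic.argsort()             # indices from smallest to largest weight
top_compact = ascending[:-n_top - 1:-1]     # walk backwards from the end, n_top steps
top_explicit = ascending[::-1][:n_top]      # reverse, then keep the first n_top

print(ascending.tolist())
print(top_compact.tolist())
print(top_explicit.tolist())
```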

## Visualize potential topics

👇 The function to print the words associated with the potential topics is already made for you. You just have to pass the correct arguments!

In [18]:
print_topics(lda_model, vectorizer)

Topic 0:
[('the', 183.07450404061046), ('to', 93.25932088951478), ('of', 89.20018968020251), ('and', 71.9353876798066), ('in', 71.56412131080592), ('that', 66.56131543562329), ('is', 64.83568725157284), ('it', 47.37432394041756), ('you', 44.956695307065644), ('edu', 39.87770410933625)]
Topic 1:
[('netlink', 1.0063458217029586), ('gilligan', 0.8495085878113668), ('howell', 0.8495085876724842), ('dee', 0.8261782316405819), ('ddf', 0.7540600808529532), ('drbombay', 0.7358135795682885), ('cts', 0.7358135748024549), ('ladwig', 0.7358135704838388), ('addresses', 0.6921013232945232), ('romford', 0.6867532431345285)]


## Predict topic of new text

👇 You can now use your LDA model to predict the topic of a new text. First, use your vectorizer to vectorize the example. Then, use your LDA model to predict the topic of the vectorized example.

In [19]:
example = ["ludwig banane hell bike"]
example_vectorized = vectorizer.transform(example)
lda_vectors = lda_model.transform(example_vectorized)
print("topic 0 :", lda_vectors[0][0])
print("topic 1 :", lda_vectors[0][1])

topic 0 : 0.753526730715796
topic 1 : 0.24647326928420393
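
Each row returned by `transform` is a distribution over topics, so the most likely topic is the column with the largest weight. A small sketch using the two weights printed above, hard-coded here so the snippet runs on its own; `predicted_topic` is an illustrative name:

```python
import numpy as np

# Topic weights copied from the output above (one row per document).
lda_vectors = np.array([[0.753526730715796, 0.24647326928420393]])

# Index of the largest weight in each row.
predicted_topic = lda_vectors.argmax(axis=1)

print("row sum:", lda_vectors.sum(axis=1)[0])
print("most likely topic:", predicted_topic[0])
```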
