# Latent Dirichlet Allocation

In [7]:
import pandas as pd

data = pd.read_csv('data', sep=",", header=None)

data.columns = ['text']

data.head()

Unnamed: 0,text
0,From: gld@cunixb.cc.columbia.edu (Gary L Dare)...
1,From: atterlep@vela.acs.oakland.edu (Cardinal ...
2,From: miner@kuhub.cc.ukans.edu\nSubject: Re: A...
3,From: atterlep@vela.acs.oakland.edu (Cardinal ...
4,From: vzhivov@superior.carleton.ca (Vladimir Z...


The data is a collection of emails that are not labelled. Let's try to extract topics from them!

## Preprocessing 

👇 You're used to it by now... Clean up! Store the cleaned text in a new dataframe column "clean_mail".

In [5]:
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer

# Assumes the NLTK stop-word list and tokenizer data are already downloaded.
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

def rem_punct(text):
    # Strip every ASCII punctuation character.
    for punctuation in string.punctuation:
        text = text.replace(punctuation, '')
    return text

def low_case(text):
    return text.lower()

def rem_number(text):
    # Drop digit characters one by one.
    return ''.join(char for char in text if not char.isdigit())

def rem_stop_words(text):
    # Tokenize, then keep only the tokens that are not stop words.
    word_tokens = word_tokenize(text)
    return [w for w in word_tokens if w not in stop_words]

def app_lemmatize(tokens):
    # Despite its name, this applies Porter stemming, not lemmatization.
    return [stemmer.stem(word) for word in tokens]



In [8]:
data['clean_mail'] = data['text'].apply(rem_punct)

data['clean_mail'] = data['clean_mail'].apply(low_case)
data['clean_mail'] = data['clean_mail'].apply(rem_number)
data['clean_mail'] = data['clean_mail'].apply(rem_stop_words)
data['clean_mail'] = data['clean_mail'].apply(app_lemmatize)

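# Join the stemmed tokens back into one string per email.
# (Writing through data['clean_mail'][i] is chained assignment and may not update the frame.)
data['clean_mail'] = data['clean_mail'].apply(' '.join)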
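
To check what this kind of cleaning does to a single string, here is a minimal, standard-library-only sketch of the same idea (lowercase, strip punctuation and digits, drop stop words). The `clean` function and the tiny `TOY_STOP_WORDS` set are hypothetical stand-ins for the NLTK-based helpers in this notebook, and stemming is left out.

```python
import string

# Hypothetical, deliberately tiny stop-word set (the notebook uses NLTK's English list).
TOY_STOP_WORDS = {"the", "are", "a", "an", "is", "of", "to", "in", "we"}

# Translation table that deletes punctuation and digit characters.
_DROP = str.maketrans("", "", string.punctuation + string.digits)

def clean(text):
    """Lowercase, strip punctuation and digits, remove stop words, re-join."""
    tokens = text.lower().translate(_DROP).split()
    return " ".join(t for t in tokens if t not in TOY_STOP_WORDS)

cleaned = clean("Re: The 1993 NHL playoffs are here!")
print(cleaned)  # re nhl playoffs here
```

Doing the character removal with one `str.translate` call avoids looping over every punctuation mark, but the result should match the step-by-step version for ASCII text.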


## Latent Dirichlet Allocation model

👇 Train an LDA model to extract potential topics.

In [23]:
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import TfidfVectorizer

n = 2  # number of topics; the outputs shown in this notebook come from a two-topic model

vectorizer = TfidfVectorizer().fit(data['clean_mail'])

data_vectorized = vectorizer.transform(data['clean_mail'])

lda_model = LatentDirichletAllocation(n_components=n).fit(data_vectorized)
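
LDA is formulated as a generative model over word counts, so it is commonly fit on raw term counts rather than TF-IDF weights. The following is a minimal sketch of that variant on a made-up four-document corpus. `toy_docs`, the variable names, and `random_state=0` are assumptions for illustration, not part of the original notebook.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Made-up corpus of already-stemmed text, only for illustration.
toy_docs = [
    "hockey team win game playoff",
    "team play game nhl player",
    "god church christian believ",
    "jesu god peopl believ church",
]

# Raw term counts instead of TF-IDF weights.
count_vectorizer = CountVectorizer()
toy_counts = count_vectorizer.fit_transform(toy_docs)

# random_state fixes the initialisation so reruns give the same topics.
toy_lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = toy_lda.fit_transform(toy_counts)

print(doc_topics.shape)  # one row per document, one column per topic
```

Each row of `doc_topics` is a probability distribution over the topics, which is the same kind of vector that `lda_model.transform` returns later in this notebook.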



## Visualize potential topics

👇 The function that prints the words associated with each potential topic is already written for you. You just have to pass the correct arguments!

In [24]:
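def print_topics(model, vectorizer):
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out().
    feature_names = vectorizer.get_feature_names_out()
    for idx, topic in enumerate(model.components_):
        print("Topic %d:" % (idx))
        # Ten highest-weighted words for this topic, in descending order.
        print([(feature_names[i], topic[i])
                        for i in topic.argsort()[:-10 - 1:-1]])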
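
The slice `[:-10 - 1:-1]` in `print_topics` can look cryptic. A small sketch with hypothetical weights (the `weights` array and `k` are made up for illustration) shows how it selects the indices of the `k` largest values in descending order:

```python
import numpy as np

# Hypothetical word weights for a single topic.
weights = np.array([0.1, 0.5, 0.3, 0.9])
k = 2

# argsort() returns indices in ascending order of weight; the slice walks
# backwards from the end and stops after k indices.
top_k = weights.argsort()[:-k - 1:-1]
print(top_k.tolist())  # [3, 1]
```

Index 3 (weight 0.9) comes first, followed by index 1 (weight 0.5).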
        

In [20]:
print_topics(lda_model, vectorizer)

Topic 0:
[('game', 27.031724323725783), ('team', 25.699181527951172), ('play', 19.985315999640438), ('hockey', 18.793405352942806), ('player', 18.37738849579615), ('go', 17.11838500373782), ('win', 14.119318828898777), ('nhl', 13.753138805545305), ('year', 13.424287042389402), ('playoff', 13.17709051558444)]
Topic 1:
[('god', 36.18878143860725), ('christian', 28.55535872098567), ('jesu', 18.928704188039866), ('peopl', 18.04760467260348), ('believ', 17.40239050912162), ('would', 17.202661064795677), ('one', 16.94260984545755), ('church', 16.68952513101939), ('say', 15.50759146229505), ('know', 14.131848834411405)]


## Predict topic of new text

👇 You can now use your LDA model to predict the topic of a new text. First, use your vectorizer to vectorize the example. Then, use your LDA model to predict the topic of the vectorized example.

In [21]:
example = ["In God we trust"]

example_vectorized = vectorizer.transform(example)

lda_vectors = lda_model.transform(example_vectorized)


In [22]:
for i in range(n):
    print(f"topic {i} : {lda_vectors[0][i]}")

topic 0 : 0.20790415630771303
topic 1 : 0.792095843692287
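
To turn these probabilities into a single predicted topic, one option is to take the index with the highest value. The sketch below hard-codes the two probabilities printed above, so it runs without the fitted model. `topic_distribution` and `best_topic` are illustrative names.

```python
# Topic distribution copied from the output above (one row per input text).
topic_distribution = [[0.20790415630771303, 0.792095843692287]]

row = topic_distribution[0]
# Index of the largest probability in the row.
best_topic = max(range(len(row)), key=row.__getitem__)
print(f"most likely topic: {best_topic} (p = {row[best_topic]:.3f})")
```

With the real model output, `lda_vectors[0].argmax()` should give the same index. Here it is topic 1, the one whose top words include 'god' and 'christian'.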
