# Latent Dirichlet Allocation

In [9]:
import pandas as pd

data = pd.read_csv('data', sep=",", header=None)

data.columns = ['text']

data.head()

Unnamed: 0,text
0,From: gld@cunixb.cc.columbia.edu (Gary L Dare)...
1,From: atterlep@vela.acs.oakland.edu (Cardinal ...
2,From: miner@kuhub.cc.ukans.edu\nSubject: Re: A...
3,From: atterlep@vela.acs.oakland.edu (Cardinal ...
4,From: vzhivov@superior.carleton.ca (Vladimir Z...


The data is a collection of emails that are not labelled. Let's try to extract topics from them!

## Preprocessing 

👇 You're used to it by now... Clean up! Store the cleaned text in a new dataframe column "clean_text".

In [10]:
import string 

def remove_punctuation(text):
    for punctuation in string.punctuation:
        text = text.replace(punctuation, ' ')
    return text

def lower_case(text):
    return text.lower()

def remove_numbers(text):
    # Iterates over characters (not words) and drops every digit
    return ''.join(char for char in text if not char.isdigit())
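
`remove_punctuation` above scans the text once per punctuation character. As a possible alternative, here is a minimal sketch that does the punctuation, digit and lower-casing steps in a single pass with `str.translate`. The name `basic_clean` is made up for this illustration, and unlike `isdigit()` the table only covers the ASCII digits 0-9.

```python
import string

# Translation table: every ASCII punctuation character becomes a space,
# every ASCII digit is deleted (mapped to None).
_TABLE = str.maketrans(
    {**{p: " " for p in string.punctuation}, **{d: None for d in string.digits}}
)

def basic_clean(text):
    # Hypothetical helper: one pass for punctuation and digits, then lower-case
    return text.translate(_TABLE).lower()

print(basic_clean("From: gld@cunixb.cc.columbia.edu (Gary L Dare) 1993").split())
```

Runs of spaces are left in place here, just as in the original helpers; the tokenizer in the next cell takes care of them.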

In [11]:
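from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Assumes the NLTK stopword list and tokenizer data were already fetched with nltk.download()
stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    # Tokenize, then keep only the tokens that are not English stopwords
    word_tokens = word_tokenize(text)
    return [w for w in word_tokens if w not in stop_words]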

In [12]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

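def lemmatize_text(words):
    # Expects the list of tokens returned by remove_stopwords
    return [lemmatizer.lemmatize(word) for word in words]

def list_to_string(words):
    # Join the tokens back into a single space-separated string
    return ' '.join(words)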

In [14]:
data.text = data.text.apply(remove_punctuation)

data.text = data.text.apply(lower_case)

data.text = data.text.apply(remove_numbers)

data.text = data.text.apply(remove_stopwords)

data.text = data.text.apply(lemmatize_text)

data.text = data.text.apply(list_to_string)

data

Unnamed: 0,text
0,gld cunixb cc columbia edu gary l dare subject...
1,atterlep vela ac oakland edu cardinal ximenez ...
2,miner kuhub cc ukans edu subject ancient book ...
3,atterlep vela ac oakland edu cardinal ximenez ...
4,vzhivov superior carleton ca vladimir zhivov s...
...,...
1194,jerryb eskimo com jerry kaufman subject prayer...
1195,golchowy alchemy chem utoronto ca gerald olcho...
1196,jayne mmalt guild org jayne kulikauskas subjec...
1197,sclark epa utoronto ca susan clark subject pic...
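
The prompt asks for a new "clean_text" column, whereas the cells above overwrite `text` in place. One way to keep the raw text around is to chain the steps into a single function and apply it once. The sketch below shows the chaining idea only: `make_pipeline` and the three stand-in steps are hypothetical names, chosen so the example runs without NLTK.

```python
from functools import reduce

def make_pipeline(*steps):
    """Return a function that applies each step, in order, to its input."""
    def run(text):
        return reduce(lambda value, step: step(value), steps, text)
    return run

# Stand-in steps (hypothetical) so the sketch runs on its own;
# in the notebook you would pass remove_punctuation, lower_case, and so on.
def strip_colons(text):
    return text.replace(":", " ")

def to_tokens(text):
    return text.lower().split()

def join_tokens(tokens):
    return " ".join(tokens)

clean = make_pipeline(strip_colons, to_tokens, join_tokens)
print(clean("From: Gary L Dare"))
```

With the notebook's own functions, something like `data['clean_text'] = data['text'].apply(make_pipeline(remove_punctuation, lower_case, remove_numbers, remove_stopwords, lemmatize_text, list_to_string))` would leave the `text` column untouched. The later cells would then need to read `clean_text` instead of `text`.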


## Latent Dirichlet Allocation model

👇 Train an LDA model to extract potential topics.

In [16]:
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import TfidfVectorizer


vectorizer = TfidfVectorizer().fit(data['text'])

data_vectorized = vectorizer.transform(data['text'])

lda_model = LatentDirichletAllocation(n_components=2).fit(data_vectorized)

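def print_topics(model, vectorizer):
    # get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
    feature_names = vectorizer.get_feature_names_out()
    for idx, topic in enumerate(model.components_):
        print("Topic %d:" % (idx))
        # Ten highest-weighted words for this topic, in descending order
        print([(feature_names[i], topic[i])
                        for i in topic.argsort()[:-10 - 1:-1]])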


## Visualize potential topics

👇 The function that prints the words associated with each potential topic is already written for you. You just have to pass the correct arguments!

In [17]:
print_topics(lda_model, vectorizer)

Topic 0:
[('gvg', 5.42645010327821), ('petch', 4.552404301099757), ('grass', 3.8698085719522024), ('valley', 3.5096547827309537), ('tek', 2.8435373135434814), ('daily', 2.3494610869038564), ('chuck', 2.332465004863031), ('holger', 1.484173816954514), ('testing', 1.4385746971874616), ('ohlwein', 1.3327751911272423)]
Topic 1:
[('edu', 43.31930415873053), ('god', 35.197689883535546), ('game', 25.009839588616114), ('ca', 24.956791857759306), ('would', 24.45170486070744), ('team', 23.83613538988385), ('one', 23.098251202621928), ('christian', 22.772603777878984), ('line', 21.18525458515692), ('subject', 20.809276295407223)]


## Predict topic of new text

👇 You can now use your LDA model to predict the topic of a new text. First, use your vectorizer to vectorize the example. Then pass the vectorized example to your LDA model's `transform` method, which returns the probability of each topic for that text.

In [26]:
example = ["Théodore a mangé une mitraillette Poulycroc et un Quick"]

example_vectorized = vectorizer.transform(example)

lda_vectors = lda_model.transform(example_vectorized)

print("topic 0 :", lda_vectors[0][0])
print("topic 1 :", lda_vectors[0][1])

topic 0 : 0.21415131913817329
topic 1 : 0.7858486808618267
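
The two values above add up to 1, so they can be read as a probability distribution over the topics. If you want a single predicted topic, take the index of the largest value. The sketch below does that on a plain list holding the numbers copied from the output; `most_likely_topic` is a hypothetical helper name.

```python
# Topic probabilities copied from the output above (one row per document)
lda_vectors = [[0.21415131913817329, 0.7858486808618267]]

def most_likely_topic(row):
    # Index of the largest probability in the row
    return max(range(len(row)), key=lambda i: row[i])

for row in lda_vectors:
    best = most_likely_topic(row)
    print("predicted topic:", best, "| probability:", round(row[best], 3))
```

On the NumPy array returned by `lda_model.transform`, `lda_vectors.argmax(axis=1)` gives the same index for every row at once.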
