# Latent Dirichlet Allocation

In [44]:
import pandas as pd

data = pd.read_csv('data', sep=",", header=None)

data.columns = ['text']

data.head()

| | text |
| --- | --- |
| 0 | From: gld@cunixb.cc.columbia.edu (Gary L Dare)... |
| 1 | From: atterlep@vela.acs.oakland.edu (Cardinal ... |
| 2 | From: miner@kuhub.cc.ukans.edu\nSubject: Re: A... |
| 3 | From: atterlep@vela.acs.oakland.edu (Cardinal ... |
| 4 | From: vzhivov@superior.carleton.ca (Vladimir Z... |


The data is a collection of unlabelled emails. Let's try to extract topics from them!

## Preprocessing 

👇 You're used to it by now... Clean up! Store the cleaned text in a new dataframe column "clean_text".

In [45]:
import string

data["clean_text"] = data["text"].apply(lambda x: ''.join([c for c in x if c not in string.punctuation])).str.lower()
data.head()

| | text | clean_text |
| --- | --- | --- |
| 0 | From: gld@cunixb.cc.columbia.edu (Gary L Dare)... | from gldcunixbcccolumbiaedu gary l dare\nsubje... |
| 1 | From: atterlep@vela.acs.oakland.edu (Cardinal ... | from atterlepvelaacsoaklandedu cardinal ximene... |
| 2 | From: miner@kuhub.cc.ukans.edu\nSubject: Re: A... | from minerkuhubccukansedu\nsubject re ancient ... |
| 3 | From: atterlep@vela.acs.oakland.edu (Cardinal ... | from atterlepvelaacsoaklandedu cardinal ximene... |
| 4 | From: vzhivov@superior.carleton.ca (Vladimir Z... | from vzhivovsuperiorcarletonca vladimir zhivov... |


In [46]:
def remove_numbers(text):
    return ''.join([c for c in text if not c.isdigit()])

data = data.assign(clean_text = data["clean_text"].apply(remove_numbers))

In [47]:
data.head()

| | text | clean_text |
| --- | --- | --- |
| 0 | From: gld@cunixb.cc.columbia.edu (Gary L Dare)... | from gldcunixbcccolumbiaedu gary l dare\nsubje... |
| 1 | From: atterlep@vela.acs.oakland.edu (Cardinal ... | from atterlepvelaacsoaklandedu cardinal ximene... |
| 2 | From: miner@kuhub.cc.ukans.edu\nSubject: Re: A... | from minerkuhubccukansedu\nsubject re ancient ... |
| 3 | From: atterlep@vela.acs.oakland.edu (Cardinal ... | from atterlepvelaacsoaklandedu cardinal ximene... |
| 4 | From: vzhivov@superior.carleton.ca (Vladimir Z... | from vzhivovsuperiorcarletonca vladimir zhivov... |


In [48]:
import nltk
from nltk.corpus import stopwords

# nltk.download("stopwords")  # uncomment on first run to fetch the stop-word lists

def remove_stopwords(language, column):
    """Return data[column] with the stop words of `language` removed."""
    stop_words = set(stopwords.words(language))
    return data[column].apply(lambda x: " ".join([word for word in x.split() if word not in stop_words]))

data = data.assign(clean_text = remove_stopwords("english", "clean_text"))


In [49]:
data.head()

| | text | clean_text |
| --- | --- | --- |
| 0 | From: gld@cunixb.cc.columbia.edu (Gary L Dare)... | gldcunixbcccolumbiaedu gary l dare subject sta... |
| 1 | From: atterlep@vela.acs.oakland.edu (Cardinal ... | atterlepvelaacsoaklandedu cardinal ximenez sub... |
| 2 | From: miner@kuhub.cc.ukans.edu\nSubject: Re: A... | minerkuhubccukansedu subject ancient books org... |
| 3 | From: atterlep@vela.acs.oakland.edu (Cardinal ... | atterlepvelaacsoaklandedu cardinal ximenez sub... |
| 4 | From: vzhivov@superior.carleton.ca (Vladimir Z... | vzhivovsuperiorcarletonca vladimir zhivov subj... |
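
The three cleaning steps above (punctuation, digits, stop words) can also be sketched as one function that works on a single string. This is only an illustration: `clean_text` and `demo_stop_words` are hypothetical names, and the tiny stop-word set stands in for NLTK's English list so the snippet runs without any download.

```python
import string

# Translation table that deletes every ASCII punctuation character and digit.
_DROP = str.maketrans("", "", string.punctuation + string.digits)

def clean_text(text, stop_words):
    """Strip punctuation and digits, lowercase, then drop stop words."""
    stripped = text.translate(_DROP).lower()
    return " ".join(w for w in stripped.split() if w not in stop_words)

# Tiny illustrative stop-word set (assumption); the notebook uses NLTK's English list.
demo_stop_words = {"from", "re", "the", "a", "of"}

sample = "From: miner@kuhub.cc.ukans.edu\nSubject: Re: Ancient Books 123"
cleaned = clean_text(sample, demo_stop_words)
print(cleaned)
```

With the real stop-word set, the same function could be applied to the whole column with `data["text"].apply(lambda t: clean_text(t, stop_words))`.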


## Latent Dirichlet Allocation model

👇 Train an LDA model to extract potential topics.

In [84]:
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import TfidfVectorizer

# Turn the cleaned emails into a document-term matrix of TF-IDF weights
vectorizer = TfidfVectorizer().fit(data['clean_text'])
data_vectorizer = vectorizer.transform(data['clean_text'])

# Fit an LDA model with two topics on the vectorized emails
lda_model = LatentDirichletAllocation(n_components=2).fit(data_vectorizer)
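
scikit-learn accepts the TF-IDF matrix used above, but LDA is formulated as a model of word counts, so raw counts are a common alternative input. Below is a minimal sketch under stated assumptions: the four-document corpus is made up and stands in for `data['clean_text']`, `CountVectorizer` is swapped in for `TfidfVectorizer`, and `random_state` is fixed so repeated runs give the same topics. None of these choices come from the notebook's own cell.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (made up for illustration), standing in for data["clean_text"].
docs = [
    "hockey team game players nhl",
    "game team play hockey season",
    "god jesus church bible believe",
    "church christians bible god faith",
]

# Document-term matrix of raw word counts.
count_vectorizer = CountVectorizer()
counts = count_vectorizer.fit_transform(docs)

# Two topics; random_state makes the fit reproducible.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)

print(doc_topics.shape)        # one row per document, one column per topic
print(doc_topics.sum(axis=1))  # each row is a distribution over topics
```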



## Visualize potential topics

👇 The function that prints the words associated with each potential topic is already written for you. You just have to pass the correct arguments!

In [57]:
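def print_topics(model, vectorizer):
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    feature_names = vectorizer.get_feature_names_out()
    for idx, topic in enumerate(model.components_):
        print("topic %d:" % (idx))
        # Ten highest-weighted words of the topic, in descending order
        print([(feature_names[i], topic[i])
               for i in topic.argsort()[:-10 - 1:-1]])

print_topics(lda_model, vectorizer)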
               

topic 0:
[('game', 19.518837013686543), ('team', 19.35168649503004), ('hockey', 18.415962080806214), ('go', 14.814848045833779), ('play', 13.459821404572315), ('nhl', 13.450223043959225), ('university', 13.396509076764461), ('players', 12.99419274015108), ('organization', 12.723747301762781), ('subject', 12.34155883901159)]
topic 1:
[('god', 29.88723551938872), ('jesus', 18.53783469174047), ('people', 16.529843284555813), ('would', 16.161730392178875), ('one', 14.890336908384578), ('church', 14.720485676230803), ('christians', 13.94100791235681), ('bible', 13.519324074147754), ('believe', 13.44194354167586), ('christian', 12.593664715532347)]
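
The numbers printed next to each word are the raw entries of `components_`, which behave like pseudocounts rather than probabilities. Dividing each row by its sum turns a topic into a distribution over the vocabulary, which can make topics easier to compare. A minimal sketch, assuming a hypothetical helper `top_words` and a made-up 2x4 weight matrix in place of the fitted model:

```python
import numpy as np

def top_words(components, feature_names, n_top=3):
    """Return, per topic, the n_top words with their normalized weights."""
    # Normalize each row so it sums to 1 (a distribution over the vocabulary).
    probs = components / components.sum(axis=1)[:, np.newaxis]
    result = []
    for row in probs:
        best = row.argsort()[::-1][:n_top]  # indices of the largest weights first
        result.append([(feature_names[i], float(row[i])) for i in best])
    return result

# Made-up pseudocounts for a 4-word vocabulary (illustration only).
components = np.array([[6.0, 3.0, 0.5, 0.5],
                       [0.5, 0.5, 5.0, 4.0]])
feature_names = np.array(["game", "team", "god", "jesus"])

topics = top_words(components, feature_names, n_top=2)
print(topics)
```

On the fitted model, the equivalent call would be `top_words(lda_model.components_, vectorizer.get_feature_names_out(), n_top=10)`.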


## Predict topic of new text

👇 You can now use your LDA model to predict the topic of a new text. First, use your vectorizer to vectorize the example. Then, use your LDA model to predict the topic of the vectorized example.

In [85]:
# Preprocess the new text
text = "Dans la ville de Paris, il y a environ 2,148,271 habitants. La tour Eiffel, construite en 1889 pour l'Exposition Universelle, mesure environ 324 mètres de haut. Chaque année, environ 7 millions de personnes visitent la tour pour profiter de la vue panoramique sur la ville. Le Louvre, l'un des plus grands musées du monde, possède plus de 35,000 œuvres d'art, y compris la fameuse Joconde de Leonardo da Vinci. Le nombre de touristes visitant le Louvre chaque année est d'environ 10 millions."

new_text = ''.join([c for c in text if c not in string.punctuation]).lower()
new_text = remove_numbers(new_text)

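# stop_words was local to remove_stopwords() above, so rebuild the English set here
stop_words = set(stopwords.words("english"))
preprocessed_text = " ".join([word for word in new_text.split() if word not in stop_words])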
 
preprocessed_text


'dans la ville de paris il environ habitants la tour eiffel construite en pour lexposition universelle mesure environ mètres de haut chaque année environ millions de personnes visitent la tour pour profiter de la vue panoramique sur la ville le louvre lun des plus grands musées du monde possède plus de œuvres dart compris la fameuse joconde de leonardo da vinci le nombre de touristes visitant le louvre chaque année est denviron millions'

In [88]:
# Vectorize the preprocessed text
vectorized_text = vectorizer.transform([preprocessed_text])

# Predict the topic of the vectorized text
topic = lda_model.transform(vectorized_text)

# Print the topic
print("Topic:", topic) 


Topic: [[0.84025849 0.15974151]]
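
`transform` returns one row per document with one weight per topic, so reading off a single predicted topic means taking the largest weight in the row. A minimal sketch, reusing the values printed above rather than the fitted model (`dominant` and `weight` are illustrative names):

```python
import numpy as np

# Topic distribution copied from the output above (one row per document).
topic = np.array([[0.84025849, 0.15974151]])

# Index of the highest-weighted topic for the first (and only) document.
dominant = int(topic.argmax(axis=1)[0])
weight = float(topic[0, dominant])

print(f"Dominant topic: {dominant} (weight {weight:.2f})")
```

Keep in mind that LDA always spreads a document over the topics it was trained with, so a high weight on topic 0 does not by itself mean the text is about that topic's subject.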
