# Latent Dirichlet Allocation (LDA)

🎯 The goal of this challenge is to find topics in a corpus of emails with the **LDA** algorithm, an unsupervised learning technique for NLP.

✉️ Here is a collection of 1K+ ***unlabelled emails***. Let's try to ***extract topics*** from them!

In [14]:
import pandas as pd

url = 'https://wagon-public-datasets.s3.amazonaws.com/05-Machine-Learning/10-Natural-Language-Processing/lda_data'

data = pd.read_csv(url, sep=",", header=None)
data.columns = ['text']
data.head()


| | text |
|---|---|
| 0 | From: gld@cunixb.cc.columbia.edu (Gary L Dare)... |
| 1 | From: atterlep@vela.acs.oakland.edu (Cardinal ... |
| 2 | From: miner@kuhub.cc.ukans.edu\nSubject: Re: A... |
| 3 | From: atterlep@vela.acs.oakland.edu (Cardinal ... |
| 4 | From: vzhivov@superior.carleton.ca (Vladimir Z... |


In [15]:
data.shape


(1199, 1)

## (1) Preprocessing 

❓ **Question (Cleaning)** ❓ You're used to it by now... Clean up! Store the cleaned text in a new column `clean_text` of the DataFrame.

In [16]:
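import re

def clean_text(text):
    # Lowercase so the same word always maps to the same token
    text = text.lower()
    # Delete everything except letters and whitespace (punctuation is removed, not replaced by a space)
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Collapse runs of whitespace, including newlines, into single spaces
    text = re.sub(r'\s+', ' ', text).strip()
    return text

data['clean_text'] = data['text'].apply(clean_text)

data.head()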


| | text | clean_text |
|---|---|---|
| 0 | From: gld@cunixb.cc.columbia.edu (Gary L Dare)... | from gldcunixbcccolumbiaedu gary l dare subjec... |
| 1 | From: atterlep@vela.acs.oakland.edu (Cardinal ... | from atterlepvelaacsoaklandedu cardinal ximene... |
| 2 | From: miner@kuhub.cc.ukans.edu\nSubject: Re: A... | from minerkuhubccukansedu subject re ancient b... |
| 3 | From: atterlep@vela.acs.oakland.edu (Cardinal ... | from atterlepvelaacsoaklandedu cardinal ximene... |
| 4 | From: vzhivov@superior.carleton.ca (Vladimir Z... | from vzhivovsuperiorcarletonca vladimir zhivov... |
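
The `clean_text` column above shows email addresses fused into single tokens (for example `gldcunixbcccolumbiaedu`), because punctuation is deleted rather than replaced. Here is a minimal sketch of one alternative, assuming it is acceptable to split tokens on punctuation. The name `clean_text_spaced` is hypothetical and not part of the challenge.

```python
import re

def clean_text_spaced(text):
    """Variant of clean_text that replaces non-letters with a space (illustrative only)."""
    text = text.lower()
    # Substitute a space instead of the empty string so neighbouring words stay separate
    text = re.sub(r'[^a-z\s]', ' ', text)
    # Collapse the extra whitespace this introduces
    return re.sub(r'\s+', ' ', text).strip()

print(clean_text_spaced("From: gld@cunixb.cc.columbia.edu (Gary L Dare)"))
# prints: from gld cunixb cc columbia edu gary l dare
```

Whether this helps depends on the corpus: it produces more tokens, but fewer one-off tokens that can never be shared between emails.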


## (2) Latent Dirichlet Allocation model

❓ **Question (Training)** ❓ Train an LDA model to extract potential topics.

In [17]:
import nltk
from gensim import corpora, models
from nltk.corpus import stopwords

nltk.download('stopwords')

# `data` and its `clean_text` column come from the preprocessing cells above

# Tokenize on whitespace and drop English stop words
stop_words = set(stopwords.words('english'))
data['tokenized_text'] = data['clean_text'].apply(
    lambda x: [word for word in x.split() if word not in stop_words]
)

# Map each unique token to an integer id
dictionary = corpora.Dictionary(data['tokenized_text'])

# Bag-of-words: one list of (token_id, count) pairs per document
corpus = [dictionary.doc2bow(text) for text in data['tokenized_text']]

# Fit LDA with 5 topics; results vary between runs unless random_state is set
lda_model = models.LdaModel(corpus, num_topics=5, id2word=dictionary, passes=15)

# Show the 5 highest-weighted words of each topic
topics = lda_model.print_topics(num_words=5)
for topic in topics:
    print(topic)


[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/reecepalmer/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
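
The training cell relies on `doc2bow` to turn each token list into `(token_id, count)` pairs. Here is a dependency-free sketch of that idea, for illustration only. The helpers `build_vocab` and `to_bow` are hypothetical names, and gensim's actual id assignment may differ from the first-appearance order assumed here.

```python
from collections import Counter

def build_vocab(docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    vocab = {}
    for tokens in docs:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(tokens, vocab):
    """Return sorted (token_id, count) pairs, skipping tokens that are not in vocab."""
    counts = Counter(vocab[t] for t in tokens if t in vocab)
    return sorted(counts.items())

docs = [["team", "hockey", "team"], ["god", "church"]]
vocab = build_vocab(docs)  # {'team': 0, 'hockey': 1, 'god': 2, 'church': 3}

# "puck" is unknown to the vocabulary, so it is dropped
print(to_bow(["team", "puck", "team", "god"], vocab))
# prints: [(0, 2), (2, 1)]
```

Word order is discarded in this representation; only the counts per token reach the LDA model.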


## (3) Visualize potential topics

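🎁 We coded a function for you that prints the words associated with the potential topics. It expects a fitted scikit-learn model (it reads `components_`) and its vectorizer, so a gensim `LdaModel` needs its own `print_topics` method instead.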

In [None]:
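def print_topics(model, vectorizer):
    # model: fitted scikit-learn LDA; vectorizer: the fitted vectorizer used to train it
    feature_names = vectorizer.get_feature_names_out()
    for idx, topic in enumerate(model.components_):
        print("Topic %d:" % (idx))
        # argsort is ascending, so the reversed slice yields the 10 largest weights
        print([(feature_names[i], topic[i])
               for i in topic.argsort()[:-10 - 1:-1]])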


❓ **Question** ❓ Print the topics extracted by your LDA.

In [None]:
topics = lda_model.print_topics(num_words=5)
# gensim topic ids are 0-based; add 1 so the printed labels start at #1
for topic_number, words in topics:
    print(f"topic #{topic_number + 1}: {words}")


topic #1: 0.014*"god" + 0.007*"one" + 0.006*"would" + 0.006*"people" + 0.006*"subject"
topic #2: 0.007*"jesus" + 0.007*"subject" + 0.006*"would" + 0.006*"lines" + 0.006*"organization"
topic #3: 0.008*"team" + 0.007*"hockey" + 0.007*"subject" + 0.007*"organization" + 0.007*"lines"
topic #4: 0.006*"vs" + 0.006*"game" + 0.005*"flyers" + 0.004*"puck" + 0.004*"team"
topic #5: 0.006*"church" + 0.006*"would" + 0.005*"one" + 0.004*"subject" + 0.004*"lines"
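
Several topics above are dominated by words such as `subject`, `lines` and `organization`, which look like email header fields rather than content. Here is a minimal sketch of one way to drop them before cleaning, assuming the headers appear as lines starting with `From:`, `Subject:`, `Organization:` or `Lines:`. Only `From:` and `Subject:` are visible in the data preview; the other two are guesses based on the topic words. The name `strip_headers` is hypothetical.

```python
import re

# Assumed header field names; adjust after inspecting the raw emails
HEADER_RE = re.compile(r'^(from|subject|organization|lines):.*$',
                       flags=re.IGNORECASE | re.MULTILINE)

def strip_headers(text):
    """Remove whole lines that start with one of the assumed header fields."""
    return HEADER_RE.sub('', text)

raw = "From: someone@example.com\nSubject: Re: hockey\nLines: 12\n\nThe team played well."
print(strip_headers(raw).strip())
# prints: The team played well.
```

It could be applied with `data['text'].apply(strip_headers)` before `clean_text`. Adding the same words to the stop word set is a simpler alternative, at the cost of also removing them from the email bodies.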


## (4) Predict the document-topic mixture of a new text

❓ **Question (Prediction)** ❓

Now that your LDA model is fitted, you can use it to predict the topics of a new text.

1. Vectorize the example
2. Use the LDA on the vectorized example to predict the topics

In [None]:
example = ["My team performed poorly last season. Their best player was out injured and only played one game"]


In [None]:
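# Apply the same cleaning as in training, then convert to bag-of-words;
# doc2bow ignores tokens missing from the dictionary (including stop words)
example_vectorized = dictionary.doc2bow(clean_text(example[0]).split())

# Indexing the model returns (topic_id, probability) pairs for the document
example_topics = lda_model[example_vectorized]

print(example_topics)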


[(0, 0.01683226), (1, 0.016753277), (2, 0.9325972), (3, 0.016957618), (4, 0.016859582)]
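
The mixture above puts about 93% of the weight on topic id 2. Here is a small sketch of how to pick the dominant topic from such a list, using the printed values verbatim. It assumes this output and the topic listing in section (3) come from the same fitted model, in which case id 2 corresponds to the hockey-related "topic #3".

```python
# Output copied from the cell above: (topic_id, probability) pairs
example_topics = [(0, 0.01683226), (1, 0.016753277), (2, 0.9325972),
                  (3, 0.016957618), (4, 0.016859582)]

# Pick the pair with the highest probability
topic_id, prob = max(example_topics, key=lambda pair: pair[1])

# gensim ids are 0-based, while the topic listing in section (3) is printed 1-based
print(f"dominant topic: #{topic_id + 1} (p={prob:.2f})")
# prints: dominant topic: #3 (p=0.93)
```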


🏁 Congratulations! You now know how to quickly implement an LDA.

💾 Don't forget to `git add/commit/push` your notebook...

🚀 ... and move on to the next challenge!