# Latent Dirichlet Allocation (LDA)

🎯 The goal of this challenge is to find topics within a corpus of emails using the **LDA** algorithm, an unsupervised learning method in NLP.

✉️ Here is a collection of 1K+ ***unlabelled emails***. Let's try to ***extract topics*** from them!

In [1]:
import pandas as pd

url = 'https://wagon-public-datasets.s3.amazonaws.com/05-Machine-Learning/10-Natural-Language-Processing/lda_data'

data = pd.read_csv(url, sep=",", header=None)
data.columns = ['text']
data.head()

|   | text |
|---|------|
| 0 | From: gld@cunixb.cc.columbia.edu (Gary L Dare)... |
| 1 | From: atterlep@vela.acs.oakland.edu (Cardinal ... |
| 2 | From: miner@kuhub.cc.ukans.edu\nSubject: Re: A... |
| 3 | From: atterlep@vela.acs.oakland.edu (Cardinal ... |
| 4 | From: vzhivov@superior.carleton.ca (Vladimir Z... |


In [2]:
data.shape

(1199, 1)

## (1) Preprocessing 

❓ **Question (Cleaning)** ❓ You're used to it by now... Clean up! Store the cleaned text in a new column `clean_text` of the DataFrame.

In [3]:
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import nltk

nltk.download('stopwords')
nltk.download('punkt')

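# Build the stop-word set once; calling stopwords.words() for every token is very slow.
STOP_WORDS = set(stopwords.words('english'))

def clean_text(text):
    text = text.lower()
    text = re.sub(r'\W', ' ', text)   # replace punctuation and symbols with spaces
    text = re.sub(r'\s+', ' ', text)  # collapse runs of whitespace
    tokens = word_tokenize(text)
    tokens = [word for word in tokens if word not in STOP_WORDS]
    return ' '.join(tokens)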

data['clean_text'] = data['text'].apply(clean_text)
data.head()

[nltk_data] Downloading package stopwords to /home/baska/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /home/baska/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


|   | text | clean_text |
|---|------|------------|
| 0 | From: gld@cunixb.cc.columbia.edu (Gary L Dare)... | gld cunixb cc columbia edu gary l dare subject... |
| 1 | From: atterlep@vela.acs.oakland.edu (Cardinal ... | atterlep vela acs oakland edu cardinal ximenez... |
| 2 | From: miner@kuhub.cc.ukans.edu\nSubject: Re: A... | miner kuhub cc ukans edu subject ancient books... |
| 3 | From: atterlep@vela.acs.oakland.edu (Cardinal ... | atterlep vela acs oakland edu cardinal ximenez... |
| 4 | From: vzhivov@superior.carleton.ca (Vladimir Z... | vzhivov superior carleton ca vladimir zhivov s... |
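
The cleaned samples above still begin with address fragments such as `cc` and `edu`, which come from the email headers rather than the message body. One optional refinement, sketched here, is to drop header lines before cleaning. The sketch assumes each message starts with `Name: value` header lines followed by the body; `strip_headers`, its `keep` parameter, and the sample message are hypothetical and not part of the challenge.

```python
import re

# Matches one "Name: value" header line, e.g. "From: someone@example.edu".
HEADER_LINE = re.compile(r"^([A-Za-z][A-Za-z-]*):\s*(.*)$")

def strip_headers(text, keep=("subject",)):
    """Drop leading header lines, keeping the values of headers named in `keep`.

    Assumes headers sit at the top of the message and end at the first line
    that does not look like "Name: value" (typically a blank line).
    """
    lines = text.split("\n")
    kept = []
    body_start = len(lines)
    for i, line in enumerate(lines):
        match = HEADER_LINE.match(line)
        if match is None:
            body_start = i
            break
        name, value = match.groups()
        if name.lower() in keep:
            kept.append(value)
    return "\n".join(kept + lines[body_start:]).strip()

# Invented message, for illustration only.
sample = (
    "From: someone@example.edu (Some One)\n"
    "Subject: Re: Playoff predictions\n"
    "Organization: Example University\n"
    "\n"
    "I think the home team wins in six games."
)
print(strip_headers(sample))
```

If this fits your data, it could be applied before `clean_text`, for example with `data['text'].apply(strip_headers).apply(clean_text)`.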


## (2) Latent Dirichlet Allocation model

❓ **Question (Training)** ❓ Train an LDA model to extract potential topics.

In [4]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

vectorizer = CountVectorizer(max_df=0.95, min_df=2, stop_words='english')
text_vectorized = vectorizer.fit_transform(data['clean_text'])

lda = LatentDirichletAllocation(n_components=5, random_state=42)
lda.fit(text_vectorized)

## (3) Visualize potential topics

🎁 We coded a function for you that prints the words most strongly associated with each potential topic.

In [5]:
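def print_topics(model, vectorizer, n_top_words=10):
    # Look up the vocabulary once instead of once per word.
    feature_names = vectorizer.get_feature_names_out()
    for idx, topic in enumerate(model.components_):
        print("Topic %d:" % (idx))
        # Indices of the n_top_words largest weights, in descending order.
        top_indices = topic.argsort()[::-1][:n_top_words]
        print([(feature_names[i], topic[i]) for i in top_indices])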

❓ **Question** ❓ Print the topics extracted by your LDA.

In [6]:
print_topics(lda, vectorizer)

Topic 0:
[('play', 346.3803377974522), ('team', 330.86558283898074), ('hockey', 321.11934305638624), ('nhl', 308.55591004390305), ('game', 260.02205886084647), ('season', 237.69022515103765), ('edu', 223.21930492405681), ('ca', 214.62761623970732), ('new', 212.2337268066501), ('players', 204.36196510242644)]
Topic 1:
[('god', 1500.1953047357692), ('edu', 977.2294807086852), ('people', 660.2259652481604), ('jesus', 623.1982519387108), ('organization', 520.1930178492269), ('think', 472.22729371665156), ('know', 439.50952376391575), ('believe', 437.86871580943136), ('church', 435.1987062948817), ('christians', 427.1991125552108)]
Topic 2:
[('10', 386.92292918683523), ('25', 357.66252981779274), ('11', 299.5597229629951), ('16', 265.13697798994497), ('14', 264.8151001153211), ('55', 264.43720969374), ('12', 262.8717117068822), ('15', 242.51150644088307), ('13', 233.8351261083762), ('la', 223.19007324678344)]
Topic 3:
[('edu', 558.8749523547375), ('ca', 421.2892221587034), ('organization', 

## (4) Predict the document-topic mixture of a new text

❓ **Question (Prediction)** ❓

Now that your LDA model is fitted, you can use it to predict the topics of a new text.

1. Vectorize the example
2. Use the LDA on the vectorized example to predict the topics

In [7]:
example = ["My team performed poorly last season. Their best player was out injured and only played one game"]

In [8]:
example_vectorized = vectorizer.transform(example)

topic_distribution = lda.transform(example_vectorized)

print("Topic distribution for the example text:")
print(topic_distribution)

Topic distribution for the example text:
[[0.71623526 0.02010986 0.02108491 0.22224424 0.02032573]]
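
Each value in the printed row is the estimated share of one topic in the example. A short sketch of reading off the dominant topic follows. The array is copied by hand from the output above so that the snippet runs on its own, and the variable names are illustrative.

```python
import numpy as np

# Values copied from the output above; in the notebook, use `topic_distribution` directly.
example_distribution = np.array(
    [[0.71623526, 0.02010986, 0.02108491, 0.22224424, 0.02032573]]
)

# Index of the largest share in the first (and only) row.
dominant_topic = int(example_distribution.argmax(axis=1)[0])
dominant_share = float(example_distribution[0, dominant_topic])
print(f"Dominant topic: {dominant_topic} ({dominant_share:.1%} of the mixture)")
```

Topic 0 is the one whose top words include `play`, `team`, `hockey`, and `game`, which is consistent with the sports-themed example.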


🏁 Congratulations! You now know how to quickly implement an LDA model.

💾 Don't forget to `git add/commit/push` your notebook...

🚀 ... and move on to the next challenge!