# Latent Dirichlet Allocation (LDA)

🎯 The goal of this challenge is to find topics within a corpus of emails using the **LDA** algorithm (unsupervised learning in NLP).

✉️ Here is a collection of 1K+ ***unlabelled emails***. Let's try to ***extract topics*** from them!

In [14]:
import warnings
warnings.filterwarnings('ignore')

In [1]:
import pandas as pd

data = pd.read_csv('data', sep=",", header=None)
data.columns = ['text']
data.head()

|   | text |
|---|---|
| 0 | From: gld@cunixb.cc.columbia.edu (Gary L Dare)... |
| 1 | From: atterlep@vela.acs.oakland.edu (Cardinal ... |
| 2 | From: miner@kuhub.cc.ukans.edu\nSubject: Re: A... |
| 3 | From: atterlep@vela.acs.oakland.edu (Cardinal ... |
| 4 | From: vzhivov@superior.carleton.ca (Vladimir Z... |


In [2]:
data.shape

(1199, 1)

## (1) Preprocessing 

❓ **Question (Cleaning)** ❓ You're used to it by now... Clean up! Store the cleaned text in a new column `clean_text` of the DataFrame.

In [3]:
import string
from nltk.corpus import stopwords
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer

# Built once and reused for every document
STOP_WORDS = set(stopwords.words('english'))
LEMMATIZER = WordNetLemmatizer()

def preprocessing(sentence):
    # Basic cleaning
    sentence = sentence.strip()  # remove leading/trailing whitespace
    sentence = sentence.lower()  # lowercase
    sentence = ''.join(char for char in sentence if not char.isdigit())  # remove digits
    # Advanced cleaning
    for punctuation in string.punctuation:
        sentence = sentence.replace(punctuation, '')  # remove punctuation
    tokens = word_tokenize(sentence)  # tokenize
    tokens = [w for w in tokens if w not in STOP_WORDS]  # remove stopwords
    # 1 - Lemmatize the verbs
    tokens = [LEMMATIZER.lemmatize(w, pos="v") for w in tokens]
    # 2 - Lemmatize the nouns
    tokens = [LEMMATIZER.lemmatize(w, pos="n") for w in tokens]
    return ' '.join(tokens)


In [4]:
data['clean_text']=data.text.apply(preprocessing)
data

|   | text | clean_text |
|---|---|---|
| 0 | From: gld@cunixb.cc.columbia.edu (Gary L Dare)... | gldcunixbcccolumbiaedu gary l dare subject sta... |
| 1 | From: atterlep@vela.acs.oakland.edu (Cardinal ... | atterlepvelaacsoaklandedu cardinal ximenez sub... |
| 2 | From: miner@kuhub.cc.ukans.edu\nSubject: Re: A... | minerkuhubccukansedu subject ancient book orga... |
| 3 | From: atterlep@vela.acs.oakland.edu (Cardinal ... | atterlepvelaacsoaklandedu cardinal ximenez sub... |
| 4 | From: vzhivov@superior.carleton.ca (Vladimir Z... | vzhivovsuperiorcarletonca vladimir zhivov subj... |
| ... | ... | ... |
| 1194 | From: jerryb@eskimo.com (Jerry Kaufman)\nSubje... | jerrybeskimocom jerry kaufman subject prayer a... |
| 1195 | From: golchowy@alchemy.chem.utoronto.ca (Geral... | golchowyalchemychemutorontoca gerald olchowy s... |
| 1196 | From: jayne@mmalt.guild.org (Jayne Kulikauskas... | jaynemmaltguildorg jayne kulikauskas subject q... |
| 1197 | From: sclark@epas.utoronto.ca (Susan Clark)\nS... | sclarkepasutorontoca susan clark subject pick ... |
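
The `clean_text` column above shows that email addresses collapse into single long tokens (for example `gldcunixbcccolumbiaedu`) once punctuation is removed. One possible tweak, which is an assumption on our side and not part of the original solution, is to drop anything that looks like an address before calling `preprocessing`. The helper name `strip_email_addresses` and the regular expression are illustrative only:

```python
import re

# Rough pattern: any run of non-space characters containing an "@".
# This is a heuristic, not a full email-address validator.
EMAIL_PATTERN = re.compile(r"\S+@\S+")

def strip_email_addresses(text):
    """Replace anything that looks like an email address with a space."""
    return EMAIL_PATTERN.sub(" ", text)

# Header fragment taken from the first row of the dataset preview
sample = "From: gld@cunixb.cc.columbia.edu (Gary L Dare)"
print(strip_email_addresses(sample))
```

In the notebook this could be chained as `data.text.apply(strip_email_addresses).apply(preprocessing)`; whether it improves the topics would need to be checked on the actual corpus.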


## (2) Latent Dirichlet Allocation model

❓ **Question (Training)** ❓ Train an LDA model to extract potential topics.

In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.decomposition import LatentDirichletAllocation

vectorizer = TfidfVectorizer().fit(data.clean_text)
vectorized_text = vectorizer.transform(data.clean_text)
vectorized_text = pd.DataFrame(vectorized_text.toarray(), 
                                    columns = vectorizer.get_feature_names_out())
vectorized_text

|   | aa | aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaauuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuugggggggggggggggg | aacc | aadams | aafreenetcarletonca | aargh | aaron | aaronbinahccbrandeisedu | aaroncathenamitedu | aassists | ... | zombo | zone | zoo | zoom | zorasterism | zubov | zupancic | zurich | zwart | zzzzzz |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.00000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.088609 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.00000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.00000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.00000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.07373 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1194 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.00000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1195 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.00000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1196 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.00000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1197 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.00000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


In [6]:
# Instantiating the LDA 
n_components = 10
lda_model = LatentDirichletAllocation(n_components=n_components, max_iter = 100)

# Fitting the LDA on the vectorized documents
lda_model.fit(vectorized_text)
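
The cells above feed TF-IDF weights to the LDA, densify them into a DataFrame, and fit without a fixed seed. LDA is usually described as a model over word counts, and scikit-learn's own topic-extraction example feeds it raw term counts, so a variant worth trying is `CountVectorizer` with a sparse matrix and a `random_state`. The sketch below uses a tiny made-up corpus (not the email dataset), and every `toy_*` name is illustrative:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Tiny made-up corpus, for illustration only (not the email dataset)
toy_docs = [
    "hockey team win game season",
    "player score goal game",
    "church faith prayer god",
    "god bible faith belief",
]

# Raw term counts; the result stays a sparse matrix, which LDA accepts directly
count_vectorizer = CountVectorizer()
toy_counts = count_vectorizer.fit_transform(toy_docs)

# random_state makes the fitted topics repeatable from run to run
toy_lda = LatentDirichletAllocation(n_components=2, max_iter=20, random_state=0)
toy_doc_topics = toy_lda.fit_transform(toy_counts)

print(toy_doc_topics.shape)  # (4, 2): one row per document, one column per topic
```

On the real data the same pattern would be `CountVectorizer().fit_transform(data.clean_text)` followed by the LDA fit; the topics it yields will differ from the ones printed in this notebook.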

In [7]:
text_topics=lda_model.transform(vectorized_text)
pd.DataFrame(text_topics)

|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.008702 | 0.008698 | 0.217207 | 0.008696 | 0.008698 | 0.538584 | 0.008696 | 0.183323 | 0.008696 | 0.008701 |
| 1 | 0.009275 | 0.009275 | 0.009275 | 0.009275 | 0.009275 | 0.916525 | 0.009275 | 0.009275 | 0.009275 | 0.009275 |
| 2 | 0.008665 | 0.008665 | 0.008665 | 0.008664 | 0.008664 | 0.684096 | 0.008664 | 0.008665 | 0.246587 | 0.008665 |
| 3 | 0.011165 | 0.011165 | 0.011165 | 0.011165 | 0.011165 | 0.899513 | 0.011165 | 0.011165 | 0.011165 | 0.011165 |
| 4 | 0.009927 | 0.009924 | 0.009946 | 0.009921 | 0.009921 | 0.630647 | 0.009921 | 0.009925 | 0.009928 | 0.289940 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1194 | 0.013542 | 0.013542 | 0.013542 | 0.013545 | 0.013542 | 0.686593 | 0.013542 | 0.170535 | 0.048076 | 0.013543 |
| 1195 | 0.085547 | 0.014098 | 0.014095 | 0.014094 | 0.014094 | 0.566585 | 0.249197 | 0.014097 | 0.014094 | 0.014098 |
| 1196 | 0.010479 | 0.010465 | 0.160847 | 0.010465 | 0.010465 | 0.755416 | 0.010465 | 0.010465 | 0.010465 | 0.010468 |
| 1197 | 0.018415 | 0.018414 | 0.441247 | 0.018414 | 0.018414 | 0.411433 | 0.018414 | 0.018416 | 0.018414 | 0.018419 |
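
Each row of this table is a document-topic mixture that sums to roughly 1. A common follow-up is to label every document with its highest-weight topic. Here is a minimal sketch using two rows copied from the output above; `sample_mixture` and `dominant_topic` are illustrative names:

```python
import numpy as np

# Rows for documents 0 and 1197, copied from the table above
sample_mixture = np.array([
    [0.008702, 0.008698, 0.217207, 0.008696, 0.008698,
     0.538584, 0.008696, 0.183323, 0.008696, 0.008701],
    [0.018415, 0.018414, 0.441247, 0.018414, 0.018414,
     0.411433, 0.018414, 0.018416, 0.018414, 0.018419],
])

# Index of the largest weight in each row = dominant topic of that document
dominant_topic = sample_mixture.argmax(axis=1)
print(dominant_topic)  # [5 2]
```

With the fitted model, the same idea would be `text_topics.argmax(axis=1)`. Note that document 1197 is almost evenly split between topics 2 and 5, so a single label hides part of the picture.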


## (3) Visualize potential topics

🎁 We coded a function for you that prints the 10 highest-weighted words of each potential topic.

In [8]:
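def print_topics(model, vectorizer):
    # One row per topic, one column per vocabulary word
    # (use the `model` argument, not the global `lda_model`)
    topic_mixture = pd.DataFrame(model.components_,
                                 columns = vectorizer.get_feature_names_out())
    for idx, topic in enumerate(model.components_):
        print("Topic %d:" % (idx))
        # Keep the 10 words with the highest weight for this topic
        topic_df = topic_mixture.iloc[idx].sort_values(ascending = False).head(10)
        print(round(topic_df,3))
        print("-"*10)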
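
If you would rather get the top words back as plain Python lists (to store or compare them) instead of printing them, the same ranking can be done with `numpy.argsort`. This is a sketch with a made-up 2-topic, 4-word weight matrix; `top_words`, `toy_components` and `toy_vocab` are hypothetical names:

```python
import numpy as np

def top_words(components, feature_names, n_words=3):
    """Return, for each topic, the n_words features with the largest weights."""
    feature_names = np.asarray(feature_names)
    result = []
    for weights in components:
        # argsort is ascending, so reverse it and keep the first n_words indices
        best = np.argsort(weights)[::-1][:n_words]
        result.append(feature_names[best].tolist())
    return result

# Made-up topic-word weights (2 topics x 4 words), for illustration only
toy_components = np.array([
    [0.1, 2.0, 0.5, 1.5],
    [3.0, 0.2, 0.1, 0.9],
])
toy_vocab = ["church", "game", "god", "team"]

print(top_words(toy_components, toy_vocab, n_words=2))
# [['game', 'team'], ['church', 'team']]
```

With a fitted model, the call would be `top_words(lda_model.components_, vectorizer.get_feature_names_out(), n_words=10)`.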

❓ **Question** ❓ Print the topics extracted by your LDA.

In [9]:
# print_topics already prints and returns None, so no outer print() is needed
print_topics(lda_model, vectorizer)

Topic 0:
valley                         4.119
grass                          4.065
finland                        3.849
maynardramseycslaurentianca    3.703
maynard                        3.378
chuck                          3.142
petchgvggvgtekcom              3.050
gilmour                        2.682
daily                          2.371
petch                          2.161
Name: 0, dtype: float64
----------
Topic 1:
colon           1.551
goodbye         1.094
finalswho       1.035
statemaine      1.035
probert         1.004
finalswinner    1.003
maine           0.959
irvin           0.948
dineen          0.936
iskander        0.911
Name: 1, dtype: float64
----------
Topic 2:
keller                    4.045
kkellermailsasupennedu    3.664
period                    3.142
pp                        3.041
shark                     2.971
disappointment            1.916
quaker                    1.903
ivy                       1.903
jose                      1.869
ticket                   

## (4) Predict the document-topic mixture of a new text

❓ **Question (Prediction)** ❓

Now that your LDA model is fitted, you can use it to predict the topics of a new text.

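1. Clean the example with the same `preprocessing` function used on the training emails
2. Vectorize the cleaned example with the already-fitted vectorizer
3. Use the LDA on the vectorized example to predict the topics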

In [10]:
example = ["My team performed poorly last season. Their best player was out injured and only played one game"]

In [15]:
clean_example=preprocessing(example[0])
vectorized_example=vectorizer.transform([clean_example])
lda_model.transform(vectorized_example)

array([[0.02463703, 0.02463703, 0.02463703, 0.02463703, 0.02463703,
        0.77826675, 0.02463703, 0.02463703, 0.02463703, 0.02463703]])

🏁 Congratulations! You know how to implement an LDA quickly.

💾 Don't forget to `git add/commit/push` your notebook...

🚀 ... and move on to the next challenge!