# Latent Dirichlet Allocation (LDA)

🎯 The goal of this challenge is to find topics within a corpus of emails using the **LDA** algorithm (unsupervised learning in NLP).

✉️ Here is a collection of 1K+ ***unlabelled emails***. Let's try to ***extract topics*** from them!

In [1]:
import pandas as pd

url = 'https://wagon-public-datasets.s3.amazonaws.com/05-Machine-Learning/10-Natural-Language-Processing/lda_data'

data = pd.read_csv(url, sep=",", header=None)
data.columns = ['text']
data.head()

Unnamed: 0,text
0,From: gld@cunixb.cc.columbia.edu (Gary L Dare)...
1,From: atterlep@vela.acs.oakland.edu (Cardinal ...
2,From: miner@kuhub.cc.ukans.edu\nSubject: Re: A...
3,From: atterlep@vela.acs.oakland.edu (Cardinal ...
4,From: vzhivov@superior.carleton.ca (Vladimir Z...


## (1) Preprocessing 

❓ **Question (Cleaning)** ❓ You're used to it by now... Clean up! Store the cleaned text in a new column `clean_text` of the DataFrame.

In [4]:
import string
from nltk.corpus import stopwords
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
punctuation = string.punctuation

In [8]:
def preprocessing(text):
    text = text.strip() # remove leading and trailing whitespace
    text = text.lower() # lowercase characters
    text = "".join(char for char in text if not char.isdigit()) # remove digits
    for punc in punctuation:
        text = text.replace(punc,"") # remove punctuation
    text_toke = word_tokenize(text) # tokenize
    lemmatizer = WordNetLemmatizer() # create the lemmatizer once per call, not once per word
    text_lem = [lemmatizer.lemmatize(w,pos='n') for w in text_toke] # lemmatize as nouns
    text = " ".join(text_lem) # assemble back into a single string
    return text

In [53]:
cleaned_text = data.text.apply(preprocessing)
data['clean_text'] = cleaned_text # store the cleaned text in the DataFrame, as the question asks

In [54]:
cleaned_text

0       from gldcunixbcccolumbiaedu gary l dare subjec...
1       from atterlepvelaacsoaklandedu cardinal ximene...
2       from minerkuhubccukansedu subject re ancient b...
3       from atterlepvelaacsoaklandedu cardinal ximene...
4       from vzhivovsuperiorcarletonca vladimir zhivov...
                              ...                        
1194    from jerrybeskimocom jerry kaufman subject re ...
1195    from golchowyalchemychemutorontoca gerald olch...
1196    from jaynemmaltguildorg jayne kulikauskas subj...
1197    from sclarkepasutorontoca susan clark subject ...
1198    from lmvecwestminsteracuk william hargreaves s...
Name: text, Length: 1199, dtype: object
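
The cleaned output above contains fused tokens such as `gldcunixbcccolumbiaedu`: stripping punctuation glues each email address into a single word that appears in only a few documents. One possible refinement, sketched below under the assumption that addresses carry no topical signal, is to drop them before removing punctuation. The helper name `strip_emails_and_punctuation` and the sample string are made up for illustration and are not part of the original notebook.

```python
import re
import string

# Hypothetical pattern: any run of non-space characters around an "@"
EMAIL_RE = re.compile(r"\S+@\S+")

def strip_emails_and_punctuation(text):
    """Drop email addresses, then lowercase and remove digits and punctuation."""
    text = EMAIL_RE.sub(" ", text)
    text = text.lower()
    text = "".join(ch for ch in text if not ch.isdigit())
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Collapse the runs of whitespace left behind by the removals
    return " ".join(text.split())

# Made-up sample in the same shape as the emails in the dataset
sample = "From: someone@example.com (Jane Doe)\nSubject: Re: Game 7"
cleaned_sample = strip_emails_and_punctuation(sample)
print(cleaned_sample)
```

Tokenizing and lemmatizing would then follow as in `preprocessing`.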

## (2) Latent Dirichlet Allocation model

❓ **Question (Training)** ❓ Train an LDA model to extract potential topics.

In [14]:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation


In [95]:
vectorizer = TfidfVectorizer()
vectorized_text = vectorizer.fit_transform(cleaned_text)
vectorized_text = pd.DataFrame(vectorized_text.toarray(),\
        columns=vectorizer.get_feature_names_out())

In [79]:
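# Create an LDA model
n_components = 2
lda_model = LatentDirichletAllocation(n_components=n_components,max_iter=100)


lda_pipeline = make_pipeline(
    TfidfVectorizer(),
    lda_model
)

param_grid = {
    'latentdirichletallocation__n_components': [2,3,4]
}
# LDA is unsupervised, so there are no labels for a supervised metric.
# With scoring=None, GridSearchCV uses the pipeline's own score method,
# which for LDA is an approximate log-likelihood.
# This search is only defined here; fit it on the cleaned text to use it.
search_lda = GridSearchCV(
    lda_pipeline,
    param_grid=param_grid,
    cv=5,
    scoring=None,
    n_jobs=-1
)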
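
A lighter-weight way to compare values of `n_components` is to fit one model per candidate and compare perplexity on held-out documents, where lower is better. The sketch below assumes a tiny made-up corpus and uses `CountVectorizer` for simplicity. It is an illustration, not the notebook's method, and on so little data the ranking is not meaningful.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, made up for illustration (not the email dataset)
train_docs = [
    "the team won the hockey game last night",
    "the goalie saved every shot in the game",
    "the church service discussed faith and scripture",
    "faith and prayer are central to the church",
    "the player scored twice for the team",
    "scripture readings guide the prayer service",
]
held_out_docs = [
    "the team lost the game",
    "prayer and faith in the church",
]

counts = CountVectorizer()
X_train = counts.fit_transform(train_docs)
# Reuse the fitted vocabulary for the held-out documents
X_held_out = counts.transform(held_out_docs)

perplexities = {}
for k in [2, 3, 4]:
    lda = LatentDirichletAllocation(n_components=k, max_iter=50, random_state=0)
    lda.fit(X_train)
    perplexities[k] = lda.perplexity(X_held_out)

# Lower held-out perplexity suggests a better fit
best_k = min(perplexities, key=perplexities.get)
print(perplexities, best_k)
```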

In [96]:
lda_model.fit(vectorized_text)
topics = lda_model.transform(vectorized_text)

In [102]:
topics_df = pd.DataFrame(topics, columns=["topic1", "topic2"])

topics_df['original_text'] = data.text

topics_df

Unnamed: 0,topic1,topic2,original_text
0,0.044220,0.955780,From: gld@cunixb.cc.columbia.edu (Gary L Dare)...
1,0.042848,0.957152,From: atterlep@vela.acs.oakland.edu (Cardinal ...
2,0.041038,0.958962,From: miner@kuhub.cc.ukans.edu\nSubject: Re: A...
3,0.049608,0.950392,From: atterlep@vela.acs.oakland.edu (Cardinal ...
4,0.051970,0.948030,From: vzhivov@superior.carleton.ca (Vladimir Z...
...,...,...,...
1194,0.065033,0.934967,From: jerryb@eskimo.com (Jerry Kaufman)\nSubje...
1195,0.066824,0.933176,From: golchowy@alchemy.chem.utoronto.ca (Geral...
1196,0.047694,0.952306,From: jayne@mmalt.guild.org (Jayne Kulikauskas...
1197,0.081409,0.918591,From: sclark@epas.utoronto.ca (Susan Clark)\nS...


## (3) Visualize potential topics

🎁 We coded a function for you that prints the ten highest-weighted words of each potential topic.

In [98]:
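def print_topics(model, vectorizer):
    # Look up the vocabulary once instead of once per word
    feature_names = vectorizer.get_feature_names_out()
    for idx, topic in enumerate(model.components_):
        print("Topic %d:" % (idx))
        # Ten highest-weighted words for this topic, in descending order
        print([(feature_names[i], topic[i])
                        for i in topic.argsort()[:-10 - 1:-1]])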
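
The slice `[:-10 - 1:-1]` in `print_topics` can be hard to read. The small sketch below shows the same idiom on a made-up five-word vocabulary with `n_top = 3`. The words and weights are invented for illustration only.

```python
import numpy as np

# Made-up weights for a five-word vocabulary (illustration only)
feature_names = np.array(["game", "team", "church", "faith", "player"])
topic = np.array([2.4, 1.7, 0.5, 0.6, 1.2])

n_top = 3
# argsort() returns indices in ascending order of weight; the slice
# [:-n_top - 1:-1] walks backwards from the last index and stops after
# n_top items, so the largest weights come first
top_idx = topic.argsort()[:-n_top - 1:-1]
top_words = [(str(feature_names[i]), float(topic[i])) for i in top_idx]
print(top_words)
```

An equivalent and arguably clearer spelling is `topic.argsort()[::-1][:n_top]`.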

❓ **Question** ❓ Print the topics extracted by your LDA.

In [104]:
topic_word_mixture = pd.DataFrame(
    lda_model.components_,
    columns = vectorizer.get_feature_names_out()
)

topic_word_mixture

Unnamed: 0,aa,aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaauuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuugggggggggggggggg,aacc,aadams,aafreenetcarletonca,aargh,aaron,aaronbinahccbrandeisedu,aaroncathenamitedu,aassists,...,zombo,zone,zoo,zoomed,zorasterism,zubov,zupancic,zurich,zwart,zzzzzz
0,0.50421,0.502096,0.500812,0.500108,0.505643,0.50528,0.505974,0.504594,0.506026,0.503334,...,0.504409,0.506137,0.505938,0.504795,0.501357,0.506829,0.503018,0.503206,0.507414,0.503968
1,0.865259,0.58752,0.527935,0.505275,2.01616,1.209159,1.651557,0.918679,0.832784,0.574094,...,0.744227,2.406766,0.654683,0.740579,0.565536,1.686846,0.583599,0.648868,0.817817,0.660034
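
The values in `components_` are unnormalized topic-word weights (pseudocounts), so they are easier to compare across topics once each row is normalized into a probability distribution over the vocabulary. A minimal sketch with a made-up 2 x 3 weight matrix standing in for `lda_model.components_`:

```python
import numpy as np

# Made-up stand-in for lda_model.components_ (2 topics, 3 words)
components = np.array([
    [0.5, 0.5, 1.0],
    [2.0, 1.0, 1.0],
])

# Divide each row by its sum so every topic becomes a distribution over words
topic_word_probs = components / components.sum(axis=1, keepdims=True)
print(topic_word_probs)
```

In the notebook, the same normalization could be applied to `lda_model.components_` before building `topic_word_mixture`.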


## (4) Predict the document-topic mixture of a new text

❓ **Question (Prediction)** ❓

Now that your LDA model is fitted, you can use it to predict the topics of a new text.

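1. Vectorize the example with the vectorizer already fitted on the corpus (`transform`, not `fit_transform`)
2. Use the fitted LDA on the vectorized example to predict the topics (`transform` only, no refitting)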

In [107]:
example = ["My team performed poorly last season. Their best player was out injured and only played one game"]

In [108]:
def vec_fun(text):
    # Clean the new text the same way as the training corpus
    cleaned = [preprocessing(t) for t in text]
    # Reuse the vectorizer fitted on the corpus (transform only) so the columns
    # match the vocabulary the LDA was trained on
    vectorized_text = vectorizer.transform(cleaned)
    vectorized_text = pd.DataFrame(vectorized_text.toarray(),
            columns=vectorizer.get_feature_names_out())
    return vectorized_text

In [110]:
example_vectorized = vec_fun(example)

example_vectorized

In [111]:
# Do not refit the LDA on the example: refitting would discard the topics
# learned from the emails. Only transform the example with the trained model.
topics_mix_example = lda_model.transform(example_vectorized)

In [112]:
topics_mix_example

🏁 Congratulations! You now know how to quickly implement an LDA model.

💾 Don't forget to `git add/commit/push` your notebook...

🚀 ... and move on to the next challenge!