# Latent Dirichlet Allocation (LDA)

🎯 The goal of this challenge is to find topics within a corpus of emails using the **LDA** algorithm, an unsupervised learning technique in NLP.

✉️ Here is a collection of 1K+ ***unlabelled emails***. Let's try to ***extract topics*** from them!

In [1]:
import pandas as pd

url = 'https://wagon-public-datasets.s3.amazonaws.com/05-Machine-Learning/10-Natural-Language-Processing/lda_data'

data = pd.read_csv(url, sep=",", header=None)
data.columns = ['text']
data.head()

|   | text |
|---|------|
| 0 | From: gld@cunixb.cc.columbia.edu (Gary L Dare)... |
| 1 | From: atterlep@vela.acs.oakland.edu (Cardinal ... |
| 2 | From: miner@kuhub.cc.ukans.edu\nSubject: Re: A... |
| 3 | From: atterlep@vela.acs.oakland.edu (Cardinal ... |
| 4 | From: vzhivov@superior.carleton.ca (Vladimir Z... |


In [2]:
data.shape

(1199, 1)

## (1) Preprocessing 

❓ **Question (Cleaning)** ❓ You're used to it by now... Clean up! Store the cleaned text in a new column `clean_text` of the DataFrame.

In [5]:
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer

In [6]:
# YOUR CODE HERE
# Assumes the required NLTK resources (tokenizer models, stopwords, WordNet) are already downloaded.
STOP_WORDS = set(stopwords.words('english'))  # built once, reused for every email
LEMMATIZER = WordNetLemmatizer()

def cleaning(sentence: str) -> str:
    # Lowercase and trim surrounding whitespace
    sentence = sentence.lower().strip()
    # Drop digits
    sentence = ''.join(char for char in sentence if not char.isdigit())
    # Delete punctuation (this also fuses e-mail addresses into single tokens)
    sentence = sentence.translate(str.maketrans('', '', string.punctuation))

    # Tokenize and remove English stop words
    tokens = [w for w in word_tokenize(sentence) if w not in STOP_WORDS]

    # Lemmatize as verbs first, then as nouns
    tokens = [LEMMATIZER.lemmatize(w, pos='v') for w in tokens]
    tokens = [LEMMATIZER.lemmatize(w, pos='n') for w in tokens]

    return ' '.join(tokens)

In [9]:
data['clean_text'] = data['text'].map(cleaning)

In [10]:
data.head()

|   | text | clean_text |
|---|------|------------|
| 0 | From: gld@cunixb.cc.columbia.edu (Gary L Dare)... | gldcunixbcccolumbiaedu gary l dare subject sta... |
| 1 | From: atterlep@vela.acs.oakland.edu (Cardinal ... | atterlepvelaacsoaklandedu cardinal ximenez sub... |
| 2 | From: miner@kuhub.cc.ukans.edu\nSubject: Re: A... | minerkuhubccukansedu subject ancient book orga... |
| 3 | From: atterlep@vela.acs.oakland.edu (Cardinal ... | atterlepvelaacsoaklandedu cardinal ximenez sub... |
| 4 | From: vzhivov@superior.carleton.ca (Vladimir Z... | vzhivovsuperiorcarletonca vladimir zhivov subj... |
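
The `clean_text` preview shows tokens such as `gldcunixbcccolumbiaedu`: deleting punctuation glues the parts of an e-mail address into one word. The sketch below reproduces this effect on a made-up header and shows one possible alternative, which removes addresses first and then replaces punctuation with spaces. The sample string, `EMAIL_RE`, and the alternative itself are illustrative assumptions, not part of the original challenge.

```python
import re
import string

# Made-up e-mail header, used only for illustration
raw = "From: alice@example.com (Alice Doe)\nSubject: Re: Playoffs"

# Deleting punctuation (as in the cleaning cell) fuses the address into one token
deleted = raw.lower().translate(str.maketrans("", "", string.punctuation))

# Hypothetical alternative: drop e-mail addresses, then map punctuation to spaces
EMAIL_RE = re.compile(r"\S+@\S+")
no_email = EMAIL_RE.sub(" ", raw.lower())
to_spaces = str.maketrans(string.punctuation, " " * len(string.punctuation))
spaced = " ".join(no_email.translate(to_spaces).split())

print(deleted.split())
print(spaced.split())
```

Whether sender names and addresses should be kept at all depends on the topics you are after; for subject-matter topics they mostly add noise.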


## (2) Latent Dirichlet Allocation model

❓ **Question (Training)** ❓ Train an LDA model to extract potential topics.

In [12]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation

In [11]:
# YOUR CODE HERE
# Build the TF-IDF weighted document-term matrix from the cleaned emails
vectorizer = TfidfVectorizer()
vectorized_text = vectorizer.fit_transform(data['clean_text'])

## (3) Visualize potential topics

🎁 We coded a function for you that prints the words associated with each potential topic.

In [15]:
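def print_topics(model, vectorizer, n_top_words=10):
    # Look up the vocabulary once instead of once per printed word
    feature_names = vectorizer.get_feature_names_out()
    for idx, topic in enumerate(model.components_):
        print("Topic %d:" % (idx))
        # Indices of the n_top_words largest weights, in descending order
        top_indices = topic.argsort()[:-n_top_words - 1:-1]
        print([(feature_names[i], topic[i]) for i in top_indices])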
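
The slice at the end of `print_topics` is compact but easy to misread: `argsort()` returns indices in ascending order of weight, and the negative-step slice walks that array backwards to keep the last ten, i.e. the ten largest. A minimal NumPy sketch with made-up weights and a top-2 selection:

```python
import numpy as np

weights = np.array([0.2, 3.0, 1.5, 0.1])  # made-up topic weights for 4 words
n_top = 2

ascending = weights.argsort()            # indices from smallest to largest weight
top_indices = ascending[:-n_top - 1:-1]  # walk backwards, keep the n_top largest

print(ascending.tolist())    # [3, 0, 2, 1]
print(top_indices.tolist())  # [1, 2]
```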

❓ **Question** ❓ Print the topics extracted by your LDA.

In [25]:
# YOUR CODE HERE
# Fit an LDA model with 4 topics, then print the top words of each topic
model = LatentDirichletAllocation(n_components=4, max_iter=100)
model.fit(vectorized_text)
print_topics(model, vectorizer)

Topic 0:
[('rickemmatfbbswimseybcca', 0.7478800685045739), ('arsenault', 0.7460530333460249), ('gakwrscom', 0.7456001987716223), ('michel', 0.7164926841723602), ('sy', 0.709529671883099), ('mvscecwustledu', 0.709529671883099), ('boxscores', 0.6734477126934293), ('ladwig', 0.6198347218732809), ('drbombaynetlinkctscom', 0.6198347218732809), ('stueven', 0.5776472280955836)]
Topic 1:
[('god', 35.872832240682634), ('game', 26.83848055671105), ('go', 26.335478508231333), ('would', 26.089715832403627), ('team', 25.508290225359435), ('one', 24.2712971782817), ('write', 23.563931271075948), ('say', 23.51818071540969), ('line', 23.03261199434653), ('subject', 22.866788595692018)]
Topic 2:
[('sturm', 0.8458097174935922), ('barbara', 0.7153682331118651), ('dee', 0.6796762000877701), ('gifford', 0.596192045593144), ('giffordoasysdtnavymil', 0.596192045593144), ('pbaronexaessharriscom', 0.5812573746359629), ('barone', 0.5812573746359629), ('paradox', 0.5729996899547987), ('bucknell', 0.5438433898404
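
The numbers printed next to each word are entries of `model.components_`, which scikit-learn documents as pseudo-counts rather than probabilities. Dividing each row by its sum turns them into a per-topic word distribution, which makes topics easier to compare. A minimal sketch on a toy corpus (the documents, `toy_model`, and `random_state=0` are assumptions for illustration):

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, illustrative only
toy_docs = [
    "team win game season",
    "player score game team",
    "god faith church believe",
    "church god pray faith",
]

counts = CountVectorizer().fit_transform(toy_docs)
toy_model = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

# Normalize each topic's pseudo-counts so that every row sums to 1
topic_word = toy_model.components_ / toy_model.components_.sum(axis=1, keepdims=True)

print(topic_word.round(3))
print(topic_word.sum(axis=1))
```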

## (4) Predict the document-topic mixture of a new text

❓ **Question (Prediction)** ❓

Now that your LDA model is fitted, you can use it to predict the topics of a new text.

1. Vectorize the example
2. Use the fitted LDA model on the vectorized example to predict its topic mixture

In [26]:
example = ["My team performed poorly last season. Their best player was out injured and only played one game"]

In [27]:
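# YOUR CODE HERE
# Reuse the fitted vectorizer (transform, not fit_transform), then infer the topic mixture
example_vectorized = vectorizer.transform(example)
model.transform(example_vectorized)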

array([[0.06971608, 0.79085163, 0.06971618, 0.0697161 ]])
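
Each row returned by `transform` is a topic mixture that sums to 1; here the example puts about 79% of its weight on topic 1, the topic whose top words include `game` and `team`. To turn a mixture into a single label, take the index of the largest entry. The sketch below does this on a toy model. In the notebook, the new text would normally also go through the same `cleaning` function as the training data before being vectorized. The toy corpus and all variable names are assumptions.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, illustrative only
toy_docs = [
    "team win game season",
    "player score game team",
    "god faith church believe",
    "church god pray faith",
]

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_docs)
toy_model = LatentDirichletAllocation(n_components=2, random_state=0).fit(toy_counts)

# Vectorize the new text with the already-fitted vectorizer, then infer its mixture
new_doc = ["my team lost the game"]
mixture = toy_model.transform(toy_vectorizer.transform(new_doc))

# Index of the topic with the largest share of the mixture
dominant_topic = int(np.argmax(mixture, axis=1)[0])

print(mixture.round(3), dominant_topic)
```

Topic indices carry no meaning of their own, so the label only makes sense next to the top words of that topic.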

🏁 Congratulations! You now know how to quickly implement an LDA model.

💾 Don't forget to `git add/commit/push` your notebook...

🚀 ... and move on to the next challenge!