# Latent Dirichlet Allocation (LDA)

🎯 The goal of this challenge is to find topics in a corpus of emails using the **LDA** algorithm, an unsupervised learning technique in NLP.

✉️ Here is a collection of 1K+ ***unlabelled emails***. Let's try to ***extract topics*** from them!

In [37]:
import pandas as pd

url = 'https://wagon-public-datasets.s3.amazonaws.com/05-Machine-Learning/10-Natural-Language-Processing/lda_data'

data = pd.read_csv(url, sep=",", header=None)
data.columns = ['text']
data.head()

| | text |
|---|---|
| 0 | From: gld@cunixb.cc.columbia.edu (Gary L Dare)... |
| 1 | From: atterlep@vela.acs.oakland.edu (Cardinal ... |
| 2 | From: miner@kuhub.cc.ukans.edu\nSubject: Re: A... |
| 3 | From: atterlep@vela.acs.oakland.edu (Cardinal ... |
| 4 | From: vzhivov@superior.carleton.ca (Vladimir Z... |


In [38]:
data.shape

(1199, 1)

## (1) Preprocessing 

❓ **Question (Cleaning)** ❓ You're used to it by now... Clean up! Store the cleaned text in a new column `clean_text` of the DataFrame.

In [39]:
# YOUR CODE HERE
import string

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# The NLTK stopword, tokenizer and WordNet resources must be downloaded first (see nltk.download).
# Use a distinct name so the imported `stopwords` module is not shadowed.
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def clean_text(text):
    # Remove punctuation, then lowercase
    text = text.translate(str.maketrans('', '', string.punctuation))
    text = text.lower()
    # Tokenize, drop stopwords, and lemmatize what remains
    tokens = word_tokenize(text)
    clean_tokens = [lemmatizer.lemmatize(token) for token in tokens if token not in stop_words]
    return ' '.join(clean_tokens)

data['clean_text'] = data['text'].apply(clean_text)
data.head()

| | text | clean_text |
|---|---|---|
| 0 | From: gld@cunixb.cc.columbia.edu (Gary L Dare)... | gldcunixbcccolumbiaedu gary l dare subject sta... |
| 1 | From: atterlep@vela.acs.oakland.edu (Cardinal ... | atterlepvelaacsoaklandedu cardinal ximenez sub... |
| 2 | From: miner@kuhub.cc.ukans.edu\nSubject: Re: A... | minerkuhubccukansedu subject ancient book orga... |
| 3 | From: atterlep@vela.acs.oakland.edu (Cardinal ... | atterlepvelaacsoaklandedu cardinal ximenez sub... |
| 4 | From: vzhivov@superior.carleton.ca (Vladimir Z... | vzhivovsuperiorcarletonca vladimir zhivov subj... |
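
The cleaned output above shows email addresses fused into single tokens (for example `gldcunixbcccolumbiaedu`), because punctuation is stripped before tokenizing. One possible refinement, sketched here with only the standard library, removes email addresses before the punctuation step. The helper name `strip_email_noise` and the regular expression are illustrative assumptions, not part of the challenge.

```python
import re
import string

# Hypothetical pattern: a rough match for email addresses, not a full address parser.
EMAIL_RE = re.compile(r"\S+@\S+")

def strip_email_noise(text):
    """Remove email addresses, then punctuation, then lowercase the text."""
    text = EMAIL_RE.sub(" ", text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    # split/join collapses the extra whitespace left by the removals
    return " ".join(text.lower().split())

sample = "From: gld@cunixb.cc.columbia.edu (Gary L Dare)"
cleaned = strip_email_noise(sample)
print(cleaned)
```

A step like this could run at the top of `clean_text`, so that sender addresses do not end up as one-off vocabulary entries.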


## (2) Latent Dirichlet Allocation model

❓ **Question (Training)** ❓ Train an LDA model to extract potential topics.

In [40]:
# YOUR CODE HERE
import gensim
from gensim import corpora

documents = [doc.split() for doc in data['clean_text']]
dictionary = corpora.Dictionary(documents)
doc_term_matrix = [dictionary.doc2bow(doc) for doc in documents]
lda_model = gensim.models.LdaModel(
    doc_term_matrix,
    num_topics=10,
    id2word=dictionary,
    passes=10,
    random_state=42
)
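
To make the vectorization step concrete, here is a standard-library sketch of the idea behind `corpora.Dictionary` and `doc2bow`: give each distinct token an integer id, then represent each document as `(token_id, count)` pairs. The names `build_vocab` and `to_bow` are hypothetical, and gensim's real implementation differs in its details (for instance in how ids are assigned).

```python
from collections import Counter

def build_vocab(documents):
    """Map each distinct token to an integer id, in order of first appearance."""
    token2id = {}
    for doc in documents:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs, ignoring tokens outside the vocabulary."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

# Toy documents, made up for illustration
toy_docs = [["hockey", "game", "game"], ["church", "book"]]
vocab = build_vocab(toy_docs)
bows = [to_bow(doc, vocab) for doc in toy_docs]
print(bows)
```

The list of bag-of-words vectors plays the role of `doc_term_matrix` above: it is the only view of the corpus that the LDA model sees, so word order is discarded.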

## (3) Visualize potential topics

🎁 We coded a function for you that prints the words associated with each potential topic.

In [51]:
def print_topics(model, dictionary):
    # `dictionary` is unused but kept so existing calls keep working.
    # show_topics(formatted=False) yields (topic_id, [(word, probability), ...]);
    # the words are already strings, so no id lookup is needed.
    for idx, topic in model.show_topics(formatted=False):
        print(f"Topic {idx}:")
        words = [word for word, _ in topic]
        print(words)
        print()

❓ **Question** ❓ Print the topics extracted by your LDA.

In [52]:
# YOUR CODE HERE
print_topics(lda_model, dictionary)
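
In gensim, `show_topics(formatted=False)` returns a list of `(topic_id, [(word, probability), ...])` pairs, where the words are already strings rather than dictionary ids. Looking those strings up as ids in the dictionary finds nothing, which yields empty word lists. The sketch below unpacks that shape using a hand-written stand-in; the words and probabilities are made up for illustration and are not output of the model trained here.

```python
def top_words(topics):
    """Turn [(topic_id, [(word, prob), ...]), ...] into {topic_id: [word, ...]}."""
    return {topic_id: [word for word, _ in pairs] for topic_id, pairs in topics}

# Made-up stand-in with the same shape as show_topics(formatted=False)
fake_topics = [
    (0, [("game", 0.03), ("team", 0.02)]),
    (1, [("god", 0.04), ("church", 0.01)]),
]
words_by_topic = top_words(fake_topics)
print(words_by_topic)
```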



## (4) Predict the document-topic mixture of a new text

❓ **Question (Prediction)** ❓

Now that your LDA model is fitted, you can use it to predict the topics of a new text.

1. Vectorize the example
2. Apply the LDA model to the vectorized example to predict its topic mixture
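
Step 1 works best when the new text is tokenized the same way as the training documents, since the dictionary only knows cleaned tokens and `doc2bow` silently skips anything else. Here is a small standard-library sketch of the effect, where `normalize` is a simplified stand-in for the cleaning function and `known_tokens` is a made-up vocabulary:

```python
import string

def normalize(text):
    """Simplified cleaning: strip punctuation, lowercase, split on whitespace."""
    text = text.translate(str.maketrans("", "", string.punctuation))
    return text.lower().split()

# Made-up vocabulary standing in for the tokens a fitted dictionary would know
known_tokens = {"team", "season", "player", "game"}

text = ("My team performed poorly last season. "
        "Their best player was out injured and only played one game")
raw_hits = [t for t in text.split() if t in known_tokens]
clean_hits = [t for t in normalize(text) if t in known_tokens]
print(raw_hits)    # "season." keeps its trailing period and is missed
print(clean_hits)
```

With a plain `split()`, tokens that carry punctuation or capital letters never match the vocabulary, so the model sees fewer words than the text actually contains.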

In [53]:
example = ["My team performed poorly last season. Their best player was out injured and only played one game"]

In [55]:
# YOUR CODE HERE
example_vectorized = dictionary.doc2bow(example[0].split())
predicted_topics = lda_model[example_vectorized]
for topic in predicted_topics:
    print(f"Topic {topic[0]}: {topic[1]}")

Topic 4: 0.9181379079818726
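
The prediction is a list of `(topic_id, probability)` pairs. Only topic 4 is printed here, presumably because gensim omits topics whose probability falls below the model's `minimum_probability` threshold. A minimal sketch for picking the dominant topic from such a list, with `dominant_topic` as a hypothetical helper:

```python
def dominant_topic(topic_probs):
    """Return the (topic_id, probability) pair with the highest probability."""
    if not topic_probs:
        raise ValueError("no topics above the probability threshold")
    return max(topic_probs, key=lambda pair: pair[1])

# The single pair printed above, written out as a plain Python list
predicted = [(4, 0.9181379079818726)]
best_id, best_prob = dominant_topic(predicted)
print(f"Dominant topic: {best_id} ({best_prob:.2f})")
```

The topic id only becomes meaningful once it is matched against the words printed for that topic in section (3).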


🏁 Congratulations! You now know how to quickly implement an LDA model.

💾 Don't forget to `git add/commit/push` your notebook...

🚀 ... and move on to the next challenge!