# Latent Dirichlet Allocation (LDA)

🎯 The goal of this challenge is to find topics within a corpus of emails using the **LDA** algorithm, an unsupervised learning technique in NLP.

✉️ Here is a collection of 1K+ ***unlabelled emails***. Let's try to ***extract topics*** from them!

In [1]:
import pandas as pd

url = 'https://wagon-public-datasets.s3.amazonaws.com/05-Machine-Learning/10-Natural-Language-Processing/lda_data'

data = pd.read_csv(url, sep=",", header=None)
data.columns = ['text']
data.head()

Unnamed: 0,text
0,From: gld@cunixb.cc.columbia.edu (Gary L Dare)...
1,From: atterlep@vela.acs.oakland.edu (Cardinal ...
2,From: miner@kuhub.cc.ukans.edu\nSubject: Re: A...
3,From: atterlep@vela.acs.oakland.edu (Cardinal ...
4,From: vzhivov@superior.carleton.ca (Vladimir Z...


In [2]:
data.shape

(1199, 1)

## (1) Preprocessing 

❓ **Question (Cleaning)** ❓ You're used to it by now... Clean up! Store the cleaned text in a new column `clean_text` of the DataFrame.

In [3]:
# YOUR CODE HERE

import string
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

# Build these once rather than on every call
# (the NLTK tokenizer, stopwords and WordNet data must already be downloaded)
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocessing(sentence):
    # Basic cleaning
    sentence = sentence.strip()  # remove leading/trailing whitespace
    sentence = sentence.lower()  # lowercase
    sentence = ''.join(char for char in sentence if not char.isdigit())  # remove digits

    # Advanced cleaning
    for punctuation in string.punctuation:
        sentence = sentence.replace(punctuation, '')  # remove punctuation

    tokenized_sentence = word_tokenize(sentence)  # tokenize

    tokenized_sentence_cleaned = [
        w for w in tokenized_sentence if w not in stop_words
    ]  # remove stopwords

    lemmatized = [
        lemmatizer.lemmatize(word, pos="v")  # lemmatize, treating each word as a verb
        for word in tokenized_sentence_cleaned
    ]

    return ' '.join(lemmatized)

data['clean_text'] = data['text'].apply(preprocessing)
data.head()

Unnamed: 0,text,clean_text
0,From: gld@cunixb.cc.columbia.edu (Gary L Dare)...,gldcunixbcccolumbiaedu gary l dare subject sta...
1,From: atterlep@vela.acs.oakland.edu (Cardinal ...,atterlepvelaacsoaklandedu cardinal ximenez sub...
2,From: miner@kuhub.cc.ukans.edu\nSubject: Re: A...,minerkuhubccukansedu subject ancient book orga...
3,From: atterlep@vela.acs.oakland.edu (Cardinal ...,atterlepvelaacsoaklandedu cardinal ximenez sub...
4,From: vzhivov@superior.carleton.ca (Vladimir Z...,vzhivovsuperiorcarletonca vladimir zhivov subj...
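💡 Notice tokens such as `gldcunixbcccolumbiaedu` in the output above: deleting punctuation glues the parts of an email address into a single word. One possible alternative, sketched below with the standard library only, is to replace punctuation with spaces instead. The helper name `split_on_punctuation` is hypothetical and is not part of the challenge solution.

```python
import string

# Translation table mapping every punctuation character to a space
# (an alternative to deleting punctuation outright)
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def split_on_punctuation(text):
    """Lowercase, drop digits and turn punctuation into spaces (hypothetical helper)."""
    text = text.lower()
    text = "".join(ch for ch in text if not ch.isdigit())
    text = text.translate(PUNCT_TO_SPACE)
    return " ".join(text.split())  # collapse repeated whitespace

print(split_on_punctuation("From: gld@cunixb.cc.columbia.edu (Gary L Dare)"))
# → from gld cunixb cc columbia edu gary l dare
```

Whether this helps the topics depends on the corpus: it yields more, shorter tokens (for example `edu` or `cc`), which you may then want to filter out.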


## (2) Latent Dirichlet Allocation model

❓ **Question (Training)** ❓ Train an LDA model to extract potential topics.

In [7]:
# YOUR CODE HERE
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()

# Vectorize the cleaned emails and keep the vocabulary as column names
vectorized_documents = vectorizer.fit_transform(data['clean_text'])
vectorized_documents = pd.DataFrame(
    vectorized_documents.toarray(),
    columns=vectorizer.get_feature_names_out()
)

# Instantiate the LDA
n_components = 2
lda_model = LatentDirichletAllocation(n_components=n_components, max_iter=100)

# Fit the LDA on the vectorized documents
lda_model.fit(vectorized_documents)
document_topic_mixture = lda_model.transform(vectorized_documents)
document_topic_mixture


array([[0.95084356, 0.04915644],
       [0.95107376, 0.04892624],
       [0.95239968, 0.04760032],
       ...,
       [0.94332662, 0.05667338],
       [0.90131988, 0.09868012],
       [0.93808503, 0.06191497]])
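💡 Each row of this array is one document's mixture over the topics, so the values in a row sum to 1. The sketch below illustrates this on a tiny made-up corpus (`toy_docs` is invented for illustration and is not the email dataset). It also makes two choices that differ from the cell above and are assumptions rather than the challenge's method: it feeds raw word counts from `CountVectorizer` to the model, since LDA's generative story is usually described in terms of word counts, and it fixes `random_state` so that reruns give the same result.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Tiny made-up corpus, for illustration only
toy_docs = [
    "the team won the hockey game last night",
    "the goalie saved every shot in the game",
    "the church service included a prayer and a hymn",
    "faith and prayer are central to the church",
]

# Raw term counts instead of TF-IDF weights (an assumption, see above)
count_vectorizer = CountVectorizer(stop_words="english")
toy_counts = count_vectorizer.fit_transform(toy_docs)

toy_lda = LatentDirichletAllocation(n_components=2, max_iter=100, random_state=0)
toy_mixture = toy_lda.fit_transform(toy_counts)  # shape: (n_documents, n_topics)

print(toy_mixture.round(3))
print(toy_mixture.sum(axis=1))     # each row sums to 1
print(toy_mixture.argmax(axis=1))  # index of the dominant topic per document
```

Topic indices are arbitrary labels: which group of documents ends up as topic 0 or topic 1 depends on the random initialization.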

## (3) Visualize potential topics

🎁 We coded a function for you that prints the ten highest-weighted words of each potential topic.

In [8]:
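def print_topics(model, vectorizer, n_top_words=10):
    # Look up the vocabulary once instead of once per word
    feature_names = vectorizer.get_feature_names_out()
    for idx, topic in enumerate(model.components_):
        print("Topic %d:" % (idx))
        # Indices of the n_top_words largest weights, in descending order
        top_indices = topic.argsort()[::-1][:n_top_words]
        print([(feature_names[i], topic[i]) for i in top_indices])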

❓ **Question** ❓ Print the topics extracted by your LDA.

In [9]:
# YOUR CODE HERE
topic_word_mixture = pd.DataFrame(
    lda_model.components_, 
    columns = vectorizer.get_feature_names_out()
)
topic_word_mixture

Unnamed: 0,aa,aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaauuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuugggggggggggggggg,aacc,aadams,aafreenetcarletonca,aargh,aaron,aaronbinahccbrandeisedu,aaroncathenamitedu,aarons,...,zombo,zone,zoo,zoom,zorasterism,zubov,zupancic,zurich,zwart,zzzzzz
0,0.903066,0.599675,0.534012,0.506364,2.222249,1.272725,1.740343,0.982526,0.863949,0.606616,...,0.762087,2.587021,0.659256,0.759127,0.583427,1.773141,0.585211,0.658669,0.832058,0.679747
1,0.504836,0.502668,0.501027,0.500118,0.506516,0.505932,0.506723,0.505158,0.506724,0.503302,...,0.504344,0.506431,0.506679,0.505466,0.501773,0.5072,0.502729,0.503487,0.509026,0.504999


## (4) Predict the document-topic mixture of a new text

❓ **Question (Prediction)** ❓

Now that your LDA model is fitted, you can use it to predict the topics of a new text.

1. Vectorize the example
2. Use the LDA on the vectorized example to predict the topics

In [11]:
example = ["My team performed poorly last season. Their best player was out injured and only played one game"]

In [15]:

# Vectorize the example text using the same vectorizer
example_vectorized = vectorizer.transform(example)

# Use the LDA model to get the topic distribution for the example
lda_result = lda_model.transform(example_vectorized)

# Print the dominant topic and its probability for each document
for topic in lda_result:
    print(f"Topic {topic.argmax()}: {topic.max()}")

# Most probable topic of the single example: index its row first, because
# argmax on the whole 2D array returns a position in the flattened array
most_probable_topic = lda_result[0].argmax()
print(f"The most probable topic is: {most_probable_topic} with probability {lda_result[0].max()}")


Topic 0: 0.8515939050445317
The most probable topic is: 0 with probability 0.8515939050445317
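💡 A new text should go through the same cleaning as the training emails before it is vectorized, otherwise its words may not match the fitted vocabulary. One way to make this hard to forget, sketched below, is to hand the cleaning function to the vectorizer and wrap everything in a scikit-learn pipeline. `simple_clean` is a hypothetical, simplified stand-in for the notebook's cleaning function, and `toy_docs` is a made-up corpus.

```python
import string
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

def simple_clean(text):
    """Hypothetical stand-in for the notebook's cleaning: lowercase, drop digits and punctuation."""
    text = text.lower()
    return "".join(ch for ch in text if ch not in string.punctuation and not ch.isdigit())

# Tiny made-up corpus, for illustration only
toy_docs = [
    "The team won the hockey game last night!",
    "The goalie saved every shot in the game.",
    "The church service included a prayer and a hymn.",
    "Faith and prayer are central to the church.",
]

# The vectorizer applies simple_clean at fit time and at prediction time
topic_pipeline = make_pipeline(
    TfidfVectorizer(preprocessor=simple_clean),
    LatentDirichletAllocation(n_components=2, random_state=0),
)
topic_pipeline.fit(toy_docs)

new_texts = ["My team performed poorly last season.", "The prayer was short."]
new_mixture = topic_pipeline.transform(new_texts)  # one row of topic probabilities per text

for text, row in zip(new_texts, new_mixture):
    print(f"{text!r}: dominant topic {row.argmax()} (p={row.max():.3f})")
```

Taking `argmax` row by row keeps the result correct when several texts are passed at once.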




🏁 Congratulations! You now know how to quickly implement an LDA model.

💾 Don't forget to `git add/commit/push` your notebook...

🚀 ... and move on to the next challenge!