# Latent Dirichlet Allocation (LDA)

🎯 The goal of this challenge is to find topics within a corpus of emails using the **LDA** algorithm (unsupervised learning in NLP).

✉️ Here is a collection of 1K+ ***unlabelled emails***. Let's try to ***extract topics*** from them!

In [99]:

import numpy as np

# Only the vectorizer is needed in this notebook
from sklearn.feature_extraction.text import TfidfVectorizer

In [100]:
import pandas as pd

url = 'https://wagon-public-datasets.s3.amazonaws.com/05-Machine-Learning/10-Natural-Language-Processing/lda_data'

data = pd.read_csv(url, sep=",", header=None)
data.columns = ['text']
data.head()

| | text |
|---|---|
| 0 | From: gld@cunixb.cc.columbia.edu (Gary L Dare)... |
| 1 | From: atterlep@vela.acs.oakland.edu (Cardinal ... |
| 2 | From: miner@kuhub.cc.ukans.edu\nSubject: Re: A... |
| 3 | From: atterlep@vela.acs.oakland.edu (Cardinal ... |
| 4 | From: vzhivov@superior.carleton.ca (Vladimir Z... |


In [101]:
data.shape

(1199, 1)

## (1) Preprocessing 

❓ **Question (Cleaning)** ❓ You're used to it by now... Clean up! Store the cleaned text in a new column `clean_text` of the DataFrame.

In [102]:
import string
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer

def preprocessing(sentence):
    # Note: stop words are not removed in this function
    # Remove leading and trailing whitespace
    sentence = sentence.strip()
    # Lowercase all characters
    sentence = sentence.lower()
    # Remove digits
    sentence = "".join(char for char in sentence if not char.isdigit())
    # Remove punctuation
    for punctuation in string.punctuation:
        sentence = sentence.replace(punctuation, "")
    # Tokenize
    tokens = word_tokenize(sentence)
    # Lemmatize verbs first, then nouns, reusing a single lemmatizer
    lemmatizer = WordNetLemmatizer()
    verb_lemmatized = [lemmatizer.lemmatize(word, pos="v") for word in tokens]
    noun_lemmatized = [lemmatizer.lemmatize(word, pos="n") for word in verb_lemmatized]
    return " ".join(noun_lemmatized)


In [103]:
data['clean_text'] = data['text'].apply(preprocessing)

In [104]:
data

| | text | clean_text |
|---|---|---|
| 0 | From: gld@cunixb.cc.columbia.edu (Gary L Dare)... | from gldcunixbcccolumbiaedu gary l dare subjec... |
| 1 | From: atterlep@vela.acs.oakland.edu (Cardinal ... | from atterlepvelaacsoaklandedu cardinal ximene... |
| 2 | From: miner@kuhub.cc.ukans.edu\nSubject: Re: A... | from minerkuhubccukansedu subject re ancient b... |
| 3 | From: atterlep@vela.acs.oakland.edu (Cardinal ... | from atterlepvelaacsoaklandedu cardinal ximene... |
| 4 | From: vzhivov@superior.carleton.ca (Vladimir Z... | from vzhivovsuperiorcarletonca vladimir zhivov... |
| ... | ... | ... |
| 1194 | From: jerryb@eskimo.com (Jerry Kaufman)\nSubje... | from jerrybeskimocom jerry kaufman subject re ... |
| 1195 | From: golchowy@alchemy.chem.utoronto.ca (Geral... | from golchowyalchemychemutorontoca gerald olch... |
| 1196 | From: jayne@mmalt.guild.org (Jayne Kulikauskas... | from jaynemmaltguildorg jayne kulikauskas subj... |
| 1197 | From: sclark@epas.utoronto.ca (Susan Clark)\nS... | from sclarkepasutorontoca susan clark subject ... |


In [105]:
# Vectorize the cleaned text with TF-IDF
vectorizer = TfidfVectorizer()

vectorized_documents = vectorizer.fit_transform(data['clean_text'])
vectorized_documents = pd.DataFrame(
    vectorized_documents.toarray(), 
    columns = vectorizer.get_feature_names_out()
)

vectorized_documents

| | aa | aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaauuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuugggggggggggggggg | aacc | aadams | aafreenetcarletonca | aargh | aaron | aaronbinahccbrandeisedu | aaroncathenamitedu | aassists | ... | zombo | zone | zoo | zoom | zorasterism | zubov | zupancic | zurich | zwart | zzzzzz |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.00000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.074328 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.00000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.00000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.00000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.06924 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1194 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.00000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1195 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.00000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1196 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.00000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1197 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.00000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


## (2) Latent Dirichlet Allocation model

❓ **Question (Training)** ❓ Train an LDA model to extract potential topics.

In [106]:
from sklearn.decomposition import LatentDirichletAllocation

# Instantiate the LDA 
n_components = 2
lda_model = LatentDirichletAllocation(n_components=n_components, max_iter = 100)

# Fit the LDA on the vectorized documents
lda_model.fit(vectorized_documents)

In [107]:
document_topic_mixture = lda_model.transform(vectorized_documents)

In [108]:
document_topic_mixture

array([[0.95558417, 0.04441583],
       [0.95633674, 0.04366326],
       [0.95724774, 0.04275226],
       ...,
       [0.95098623, 0.04901377],
       [0.91715289, 0.08284711],
       [0.947919  , 0.052081  ]])
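
Each row of `document_topic_mixture` is a probability distribution over the topics, so the rows sum to 1 and the largest entry marks the dominant topic. A minimal sketch, using only the six rows visible in the output above as a stand-in for the full array (the name `sample_mixture` is an assumption introduced here):

```python
import numpy as np

# The six rows visible in the printed output (not the full array)
sample_mixture = np.array([
    [0.95558417, 0.04441583],
    [0.95633674, 0.04366326],
    [0.95724774, 0.04275226],
    [0.95098623, 0.04901377],
    [0.91715289, 0.08284711],
    [0.947919, 0.052081],
])

# Each row is a distribution over topics, so it sums to 1
row_sums = sample_mixture.sum(axis=1)

# Index of the most probable topic for each document
dominant_topic = sample_mixture.argmax(axis=1)

print(row_sums)
print(dominant_topic)
```

In these rows every document leans heavily towards Topic 0, which hints that the two topics may not separate the emails well.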

## (3) Visualize potential topics

🎁 We coded a function for you that prints the 10 highest-weighted words of each potential topic.

In [109]:
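def print_topics(model, vectorizer, n_top_words=10):
    # Fetch the vocabulary once instead of once per printed word
    feature_names = vectorizer.get_feature_names_out()
    for idx, topic in enumerate(model.components_):
        print("Topic %d:" % (idx))
        # Indices of the n_top_words largest weights, in descending order
        top_indices = topic.argsort()[:-n_top_words - 1:-1]
        print([(feature_names[i], topic[i]) for i in top_indices])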
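
The slice applied to `topic.argsort()` inside `print_topics` can look cryptic. It has the form `[:-k - 1:-1]` with k = 10, and it picks the indices of the k largest weights in descending order. A small sketch with made-up weights and k = 3 (the array `topic` and the name `n_top` below are illustrative assumptions):

```python
import numpy as np

# Hypothetical word weights for one topic (one value per vocabulary word)
topic = np.array([0.2, 1.5, 0.7, 3.0, 0.1])
n_top = 3

# argsort() returns indices ordered from smallest to largest weight
ascending = topic.argsort()

# Walk backwards from the last index and stop after n_top items
top_indices = ascending[:-n_top - 1:-1]

print(top_indices.tolist())  # [3, 1, 2]
```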

In [114]:
lda_model.components_

array([[0.86830601, 0.58829614, 0.52761208, ..., 0.6496436 , 0.82124351,
        0.66404156],
       [0.5044349 , 0.50226793, 0.50087435, ..., 0.50346287, 0.50804547,
        0.50418996]])
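
The rows of `components_` are unnormalized word weights (they can be read as pseudo-counts), so they do not sum to 1. Dividing each row by its sum gives a per-topic word distribution. A small sketch on a hypothetical 2-topic, 4-word matrix (the values in `toy_components` are made up for illustration):

```python
import numpy as np

# Hypothetical topic-word weights: 2 topics x 4 vocabulary words
toy_components = np.array([
    [0.9, 0.6, 0.5, 2.0],
    [0.5, 0.5, 0.5, 0.5],
])

# Normalize each row so it sums to 1
topic_word_distribution = toy_components / toy_components.sum(axis=1, keepdims=True)

print(topic_word_distribution)
```

A nearly flat row, like the second one here, means the topic does not favor any word in particular.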

❓ **Question** ❓ Print the topics extracted by your LDA.

In [115]:
print_topics(lda_model,vectorizer)

Topic 0:
[('the', 194.7317042293653), ('be', 158.3756878929337), ('to', 98.03485916354676), ('of', 94.41925312617882), ('in', 76.1837191009817), ('and', 75.66833100495926), ('that', 68.66893565121416), ('it', 52.09242562409766), ('have', 51.98563894182657), ('you', 45.43072200981281)]
Topic 1:
[('wsh', 1.1086748641273485), ('hfd', 0.897697325910374), ('howell', 0.8973866358907111), ('dee', 0.8961707722380698), ('wpg', 0.8477088083413895), ('mtl', 0.796575891191681), ('edm', 0.7808216089853854), ('cgy', 0.7703518385264086), ('nyr', 0.7418383256500523), ('phi', 0.7248128314685192)]


## (4) Predict the document-topic mixture of a new text

❓ **Question (Prediction)** ❓

Now that your LDA model is fitted, you can use it to predict the topics of a new text.

1. Vectorize the example
2. Use the LDA on the vectorized example to predict the topics

In [None]:
example = ["My team performed poorly last season. Their best player was out injured and only played one game"]

In [118]:
vectorized_example=vectorizer.transform(example)

In [120]:
lda_example = lda_model.transform(vectorized_example)
lda_example



array([[0.87583032, 0.12416968]])
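
The vectorizer was fitted on `clean_text`, so a new text arguably should go through the same cleaning before it is vectorized. The sketch below shows that pattern on a toy corpus. Everything in it is an assumption for illustration: `simple_clean` is a simplified stand-in for `preprocessing` (no tokenizing or lemmatizing, to avoid NLTK downloads), and `toy_corpus`, `toy_vectorizer`, `toy_lda`, and `predict_topic_mixture` are made-up names.

```python
import string

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import TfidfVectorizer


def simple_clean(text):
    # Simplified stand-in for the notebook's preprocessing function
    text = text.strip().lower()
    text = "".join(char for char in text if not char.isdigit())
    return text.translate(str.maketrans("", "", string.punctuation))


# Hypothetical mini-corpus, only for illustration
toy_corpus = [
    "The team won the game in overtime!",
    "Our best player scored 2 goals last season.",
    "The church service starts on Sunday morning.",
    "He read a passage from the Bible at church.",
]

toy_vectorizer = TfidfVectorizer()
toy_features = toy_vectorizer.fit_transform([simple_clean(doc) for doc in toy_corpus])

# random_state makes the fit reproducible
toy_lda = LatentDirichletAllocation(n_components=2, random_state=0)
toy_lda.fit(toy_features)


def predict_topic_mixture(text):
    # Clean, vectorize, then project onto the fitted topics
    cleaned = simple_clean(text)
    return toy_lda.transform(toy_vectorizer.transform([cleaned]))


mixture = predict_topic_mixture("My team played one game last season.")
print(mixture)
```

Wrapping the three steps in one function keeps the cleaning applied at prediction time identical to the one used for training.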

##### 🏁 Congratulations! You know how to implement an LDA quickly.

💾 Don't forget to `git add/commit/push` your notebook...

🚀 ... and move on to the next challenge!