# Latent Dirichlet Allocation (LDA)

🎯 The goal of this challenge is to find topics in a corpus of emails with the **LDA** algorithm, an unsupervised learning technique in NLP.

✉️ Here is a collection of 1K+ ***unlabelled emails***. Let's try to ***extract topics*** from them!

In [11]:
import pandas as pd

url = 'https://wagon-public-datasets.s3.amazonaws.com/05-Machine-Learning/10-Natural-Language-Processing/lda_data'

data = pd.read_csv(url, sep=",", header=None)
data.columns = ['text']
data.head()

Unnamed: 0,text
0,From: gld@cunixb.cc.columbia.edu (Gary L Dare)...
1,From: atterlep@vela.acs.oakland.edu (Cardinal ...
2,From: miner@kuhub.cc.ukans.edu\nSubject: Re: A...
3,From: atterlep@vela.acs.oakland.edu (Cardinal ...
4,From: vzhivov@superior.carleton.ca (Vladimir Z...


In [12]:
data.shape

(1199, 1)

## (1) Preprocessing 

❓ **Question (Cleaning)** ❓ You're used to it by now... Clean up! Store the cleaned text in a new column `clean_text` of the DataFrame.

In [13]:
import string
from nltk.corpus import stopwords
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer

In [14]:
def cleaning(sentence):
    # Basic normalisation: trim surrounding whitespace and lowercase
    sentence = sentence.strip().lower()

    # Remove digits, then punctuation (characters are dropped, not replaced by spaces)
    sentence = ''.join(char for char in sentence if not char.isdigit())
    sentence = ''.join(char for char in sentence if char not in string.punctuation)

    # Tokenize, then drop English stop words
    tokenized_sentence = word_tokenize(sentence)
    stop_words = set(stopwords.words('english'))
    tokenized_sentence_cleaned = [
        word for word in tokenized_sentence if word not in stop_words
    ]

    # Lemmatize every remaining token as a verb, reusing a single lemmatizer
    lemmatizer = WordNetLemmatizer()
    lemmatized = [
        lemmatizer.lemmatize(word, pos='v') for word in tokenized_sentence_cleaned
    ]

    return ' '.join(lemmatized)


In [15]:
data['clean_text'] = data['text'].apply(cleaning)
data.head()

Unnamed: 0,text,clean_text
0,From: gld@cunixb.cc.columbia.edu (Gary L Dare)...,gldcunixbcccolumbiaedu gary l dare subject sta...
1,From: atterlep@vela.acs.oakland.edu (Cardinal ...,atterlepvelaacsoaklandedu cardinal ximenez sub...
2,From: miner@kuhub.cc.ukans.edu\nSubject: Re: A...,minerkuhubccukansedu subject ancient book orga...
3,From: atterlep@vela.acs.oakland.edu (Cardinal ...,atterlepvelaacsoaklandedu cardinal ximenez sub...
4,From: vzhivov@superior.carleton.ca (Vladimir Z...,vzhivovsuperiorcarletonca vladimir zhivov subj...
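
Note that in the cleaned output the sender address ends up fused into a single token (for instance `gldcunixbcccolumbiaedu`), because punctuation is deleted without leaving a space behind. Such tokens only add noise to the vocabulary. As an optional refinement, a minimal sketch could drop e-mail addresses first and replace punctuation with spaces. The helper name, the regular expression and the sample message below are assumptions made for illustration; they are not part of the original notebook.

```python
import re
import string

# Hypothetical pattern: any run of non-space characters containing an "@"
EMAIL_PATTERN = re.compile(r"\S+@\S+")

# Translation table mapping every punctuation character to a space
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def strip_emails_and_punctuation(text):
    """Lowercase, drop e-mail addresses, then turn punctuation into spaces."""
    text = EMAIL_PATTERN.sub(" ", text.lower())
    text = text.translate(PUNCT_TO_SPACE)
    # Collapse the runs of whitespace (including newlines) left behind
    return " ".join(text.split())


# Made-up message in the same shape as the e-mails in the dataset
sample = "From: jane.doe@example.com (Jane Doe)\nSubject: Re: playoffs"
cleaned_sample = strip_emails_and_punctuation(sample)
print(cleaned_sample)
```

This step could run before tokenization inside `cleaning`; whether it improves the topics on this dataset would need to be checked empirically.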


## (2) Latent Dirichlet Allocation model

❓ **Question (Training)** ❓ Train an LDA model to extract potential topics.

In [38]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation

In [39]:
vectorizer = TfidfVectorizer()
dtm = vectorizer.fit_transform(data['clean_text'])

dtm = pd.DataFrame(
    dtm.toarray(),
    columns=vectorizer.get_feature_names_out()
)
dtm

Unnamed: 0,aa,aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaauuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuugggggggggggggggg,aacc,aadams,aafreenetcarletonca,aargh,aaron,aaronbinahccbrandeisedu,aaroncathenamitedu,aarons,...,zombo,zone,zoo,zoom,zorasterism,zubov,zupancic,zurich,zwart,zzzzzz
0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.086661,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.073976,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1194,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1195,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1196,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1197,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [40]:
lda = LatentDirichletAllocation(n_components=2, max_iter=100)
lda.fit(dtm)
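
LDA is formulated as a generative model over word counts, so it is commonly fitted on raw term counts rather than TF-IDF weights. The notebook uses `TfidfVectorizer`, which runs fine, but a count-based variant is worth comparing. The sketch below is only an illustration under assumptions: the toy corpus is made up and stands in for `data['clean_text']`, and `random_state=0` is an arbitrary choice added to make reruns reproducible.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, made up for illustration (stands in for data['clean_text'])
toy_corpus = [
    "team win game hockey players",
    "game play team nhl season",
    "god jesus church believe people",
    "church christians believe god say",
]

# Raw term counts; the result stays a sparse matrix, which LDA accepts directly
count_vectorizer = CountVectorizer()
counts = count_vectorizer.fit_transform(toy_corpus)

# random_state pins the random initialisation so the topics are reproducible
toy_lda = LatentDirichletAllocation(n_components=2, max_iter=100, random_state=0)
toy_mixture = toy_lda.fit_transform(counts)

print(toy_mixture.shape)  # one row per document, one column per topic
```

Keeping the matrix sparse also avoids the memory cost of `toarray()` on a large vocabulary.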

In [41]:
topic_mixture = lda.transform(dtm)
topic_mixture

array([[0.93970397, 0.06029603],
       [0.05207369, 0.94792631],
       [0.06167687, 0.93832313],
       ...,
       [0.06984111, 0.93015889],
       [0.89929106, 0.10070894],
       [0.07403738, 0.92596262]])
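
Each row of this array is the topic mixture of one document: the values are non-negative and sum to 1, and the column with the largest value is the document's dominant topic. The short sketch below checks this on the six rows displayed above, copied by hand; the variable names are chosen for illustration.

```python
import numpy as np

# The six rows shown in the output above, copied by hand
shown_rows = np.array([
    [0.93970397, 0.06029603],
    [0.05207369, 0.94792631],
    [0.06167687, 0.93832313],
    [0.06984111, 0.93015889],
    [0.89929106, 0.10070894],
    [0.07403738, 0.92596262],
])

# Each row is a probability distribution over the two topics
row_sums = shown_rows.sum(axis=1)

# Index of the most probable topic for each document
dominant_topic = shown_rows.argmax(axis=1)

print(row_sums)
print(dominant_topic)
```

On the full notebook data, the same idea would be `topic_mixture.argmax(axis=1)`.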

In [42]:
topic_word = pd.DataFrame(
    lda.components_,
    columns=vectorizer.get_feature_names_out()
)
topic_word

Unnamed: 0,aa,aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaauuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuugggggggggggggggg,aacc,aadams,aafreenetcarletonca,aargh,aaron,aaronbinahccbrandeisedu,aaroncathenamitedu,aarons,...,zombo,zone,zoo,zoom,zorasterism,zubov,zupancic,zurich,zwart,zzzzzz
0,0.524709,0.596719,0.500622,0.506384,0.616844,1.274551,0.504432,0.503363,0.504678,0.502035,...,0.763971,2.588878,0.662154,0.760805,0.502512,1.775606,0.586832,0.659228,0.504898,0.681574
1,0.883193,0.505624,0.534417,0.500097,2.111921,0.504105,1.742634,0.984322,0.865995,0.607883,...,0.502459,0.504575,0.50378,0.503788,0.582688,0.504735,0.501108,0.502929,0.836185,0.503173
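
The values in `lda.components_` are not probabilities: scikit-learn describes them as pseudo-counts of how strongly each word is assigned to each topic. Dividing every row by its sum turns them into a per-topic distribution over words. The sketch below shows the normalisation on a small made-up array; the numbers and the 4-word vocabulary are assumptions for illustration only.

```python
import numpy as np

# Hypothetical pseudo-counts: 2 topics over a 4-word vocabulary
toy_components = np.array([
    [6.0, 3.0, 0.5, 0.5],
    [0.5, 0.5, 4.0, 5.0],
])

# Divide each row by its total so that every topic sums to 1
topic_word_dist = toy_components / toy_components.sum(axis=1, keepdims=True)

print(topic_word_dist)
```

Applied to the notebook's model, the same expression would be `lda.components_ / lda.components_.sum(axis=1, keepdims=True)`.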


In [None]:
topic = pd.DataFrame(
    topic_mixture,
    columns=[f"Topic {i+1}" for i in range(lda.n_components)]
)
topic

## (3) Visualize potential topics

🎁 We coded a function for you that prints the top 10 words associated with each potential topic, along with their weights.

In [43]:
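def print_topics(model, vectorizer):
    # Look up the vocabulary once instead of once per printed word
    feature_names = vectorizer.get_feature_names_out()
    for idx, topic in enumerate(model.components_):
        print("Topic %d:" % (idx))
        # argsort is ascending, so this slice takes the 10 largest weights in descending order
        print([(feature_names[i], topic[i])
                        for i in topic.argsort()[:-10 - 1:-1]])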

❓ **Question** ❓ Print the topics extracted by your LDA.

In [48]:
print_topics(lda, vectorizer)

Topic 0:
[('game', 26.927328463583247), ('team', 25.714177990820026), ('play', 19.97599754240888), ('go', 18.929363482241293), ('hockey', 18.703105008489917), ('get', 14.59823702870196), ('win', 14.066505747662719), ('nhl', 13.626553785970538), ('players', 13.180025501353908), ('university', 12.8351293209889)]
Topic 1:
[('god', 30.539992093831376), ('jesus', 18.91673166035435), ('say', 17.921594174967506), ('people', 17.862400658640183), ('would', 16.98134958005468), ('church', 16.72638522193773), ('believe', 16.630340837619624), ('one', 15.762716633125335), ('know', 15.710348486622566), ('christians', 14.250657916810532)]


## (4) Predict the document-topic mixture of a new text

❓ **Question (Prediction)** ❓

Now that your LDA model is fitted, you can use it to predict the topics of a new text.

1. Vectorize the example with the vectorizer you already fitted (`transform`, not `fit_transform`)
2. Use the fitted LDA on the vectorized example to predict its topic mixture
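
The two steps above can be sketched on a toy setup. This is an illustration under assumptions, not the challenge solution: the toy corpus is made up, the `toy_` names are hypothetical, and `random_state=0` is an arbitrary choice. The key point is that the new text goes through `transform` only, so it is projected onto the vocabulary learned during training.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy training corpus, made up for illustration
toy_corpus = [
    "team win game hockey players season",
    "game play team nhl season",
    "god jesus church believe people",
    "church christians believe god say",
]

# Fit the vectorizer and the LDA model on the training corpus only
toy_vectorizer = TfidfVectorizer()
toy_dtm = toy_vectorizer.fit_transform(toy_corpus)
toy_lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(toy_dtm)

example = ["My team performed poorly last season. Their best player was out injured and only played one game"]

# Step 1: vectorize with the fitted vocabulary (transform, not fit_transform)
example_vectorized = toy_vectorizer.transform(example)

# Step 2: predict the document-topic mixture of the new text
example_mixture = toy_lda.transform(example_vectorized)

print(example_mixture)
```

In the notebook itself, the example would presumably go through the same `cleaning` function as the training texts before being vectorized, so that its tokens match the learned vocabulary.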

In [None]:
example = ["My team performed poorly last season. Their best player was out injured and only played one game"]

In [None]:
# YOUR CODE HERE

🏁 Congratulations! You now know how to quickly implement an LDA model.

💾 Don't forget to `git add/commit/push` your notebook...

🚀 ... and move on to the next challenge!