# Latent Dirichlet Allocation (LDA)

🎯 The goal of this challenge is to find topics within a corpus of emails using the **LDA** algorithm, an unsupervised learning technique in NLP.

✉️ Here is a collection of 1K+ ***unlabelled emails***. Let's try to ***extract topics*** from them!

In [1]:
import string
from nltk.corpus import stopwords 
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn import set_config; set_config(display="diagram")
import matplotlib.pyplot as plt
from sklearn.decomposition import LatentDirichletAllocation

In [2]:
import pandas as pd

url = 'https://wagon-public-datasets.s3.amazonaws.com/05-Machine-Learning/10-Natural-Language-Processing/lda_data'

data = pd.read_csv(url, sep=",", header=None)
data.columns = ['text']
data.head()

| | text |
|---|---|
| 0 | From: gld@cunixb.cc.columbia.edu (Gary L Dare)... |
| 1 | From: atterlep@vela.acs.oakland.edu (Cardinal ... |
| 2 | From: miner@kuhub.cc.ukans.edu\nSubject: Re: A... |
| 3 | From: atterlep@vela.acs.oakland.edu (Cardinal ... |
| 4 | From: vzhivov@superior.carleton.ca (Vladimir Z... |


In [3]:
data.shape

(1199, 1)

## (1) Preprocessing 

❓ **Question (Cleaning)** ❓ You're used to it by now... Clean up! Store the cleaned text in a new column `clean_text` of the DataFrame.

In [4]:
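def preprocessing(sentence):
    # Assumes the NLTK tokenizer, stopwords and wordnet resources were downloaded (nltk.download)
    sentence = sentence.lower()
    # Remove digits
    sentence = ''.join(char for char in sentence if not char.isdigit())
    # Replace punctuation with spaces so that e-mail addresses are split into tokens
    for punc in string.punctuation:
        sentence = sentence.replace(punc, ' ')
    sentence = sentence.strip()
    tokens = word_tokenize(sentence)
    # Drop English stop words, then lemmatize what remains
    stop_words = set(stopwords.words('english'))
    tokens = [w for w in tokens if w not in stop_words]
    lemmatizer = WordNetLemmatizer()
    return ' '.join(lemmatizer.lemmatize(w) for w in tokens)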

In [5]:
data['clean_text']= data['text'].apply(preprocessing)

In [6]:
data.head()

| | text | clean_text |
|---|---|---|
| 0 | From: gld@cunixb.cc.columbia.edu (Gary L Dare)... | gld cunixb cc columbia edu gary l dare subject... |
| 1 | From: atterlep@vela.acs.oakland.edu (Cardinal ... | atterlep vela ac oakland edu cardinal ximenez ... |
| 2 | From: miner@kuhub.cc.ukans.edu\nSubject: Re: A... | miner kuhub cc ukans edu subject ancient book ... |
| 3 | From: atterlep@vela.acs.oakland.edu (Cardinal ... | atterlep vela ac oakland edu cardinal ximenez ... |
| 4 | From: vzhivov@superior.carleton.ca (Vladimir Z... | vzhivov superior carleton ca vladimir zhivov s... |


In [7]:
data['clean_text'][0]

'gld cunixb cc columbia edu gary l dare subject stan fischler summary devil pregame show prior hosting penguin nntp posting host cunixb cc columbia edu reply gld cunixb cc columbia edu gary l dare organization phd hall line lester patrick award lunch bill torrey mentioned one option next season president miami team bob clarke working dinner clarke said worst mistake philadelphia letting mike keenan go retrospect almost player came realize keenan knew took win rumour circulating keenan back flyer nick polano sick scapegoat schedule made red wing bryan murray approved gerry meehan john muckler worried sabre prospect assistant lever say sabre get share quebec dynasty emerging mighty duck declared throw money around loosely buy team oiler coach ted green remarked guy around fill tie domi skate none fill helmet senator andrew mcbain told security guard chicago stadium warned stair leading locker room mcbain mouthed seasoned professional tumbled entire steep flight gld je souviens gary l dar

## (2) Latent Dirichlet Allocation model

❓ **Question (Training)** ❓ Train an LDA model to extract potential topics.

In [8]:
vectorizer= TfidfVectorizer()
vectorizded_doc= vectorizer.fit_transform(data['clean_text'])

In [9]:
vectorizer.get_feature_names_out()

array(['aa',
       'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaauuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuugggggggggggggggg',
       'aacc', ..., 'zurich', 'zwart', 'zzzzzz'], dtype=object)

In [10]:
vectorizded_doc= pd.DataFrame(vectorizded_doc.toarray(), columns=vectorizer.get_feature_names_out())

In [11]:
vectorizded_doc.head()

Unnamed: 0,aa,aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaauuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuugggggggggggggggg,aacc,aargh,aaron,aaronc,aatchoo,ab,abandon,abandond,...,zombo,zone,zoo,zoomed,zorasterism,zubov,zupancic,zurich,zwart,zzzzzz
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.080838,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.072433,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [12]:
lda_model= LatentDirichletAllocation(n_components=3, max_iter=100)
lda_model.fit(vectorizded_doc)

In [13]:
doc_topic_mix= lda_model.transform(vectorizded_doc)

In [14]:
topic_df= pd.DataFrame(doc_topic_mix)
topic_df.head()

| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 0.938375 | 0.030813 | 0.030812 |
| 1 | 0.939114 | 0.030443 | 0.030443 |
| 2 | 0.941855 | 0.029072 | 0.029072 |
| 3 | 0.925147 | 0.037426 | 0.037426 |
| 4 | 0.934127 | 0.032937 | 0.032936 |
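
Each row of `doc_topic_mix` is one document's distribution over the topics, so its values should sum to 1 and the largest value points to the dominant topic. Here is a minimal sketch of that check on a hypothetical toy corpus (the documents, `n_components=2` and `random_state=0` are illustrative assumptions, not the notebook's data):

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, used only to keep the sketch self-contained
toy_docs = [
    "team game player season goal",
    "player team win game coach",
    "god church faith christian bible",
    "faith bible god prayer church",
]

toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_docs)

# random_state is fixed only to make the sketch reproducible
toy_lda = LatentDirichletAllocation(n_components=2, random_state=0)
toy_mix = toy_lda.fit_transform(toy_matrix)

# Each row is a distribution over topics, so it should sum to 1
row_sums = toy_mix.sum(axis=1)
print(np.allclose(row_sums, 1.0))

# Index of the most probable topic for each document
dominant_topic = toy_mix.argmax(axis=1)
print(dominant_topic)
```

On the notebook's own data, the same check would be `topic_df.sum(axis=1)` and `topic_df.idxmax(axis=1)`.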


## (3) Visualize potential topics

🎁 We coded a function for you that prints the top 10 words associated with each potential topic, along with their weights.

In [15]:
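def print_topics(model, vectorizer, n_top_words=10):
    # Look up the vocabulary once instead of once per word
    feature_names = vectorizer.get_feature_names_out()
    for idx, topic in enumerate(model.components_):
        # Indices of the n_top_words largest weights, in descending order
        top_indices = topic.argsort()[::-1][:n_top_words]
        print(f"Topic {idx}:")
        print([(feature_names[i], topic[i]) for i in top_indices])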

❓ **Question** ❓ Print the topics extracted by your LDA.

In [16]:
print_topics(lda_model, vectorizer)

Topic 0:
[('edu', 43.1601053551246), ('god', 35.039744291014905), ('game', 24.84922993708771), ('ca', 24.814172535390128), ('would', 24.290377499358637), ('team', 23.676964652038528), ('one', 22.939857742880523), ('christian', 22.611036489777828), ('line', 21.03894830497023), ('subject', 20.663313684147184)]
Topic 1:
[('holger', 1.3686609984666025), ('ohlwein', 1.214566276834859), ('ap', 0.9056706549368789), ('arsenault', 0.7838940385179013), ('mchp', 0.7509366342445354), ('sni', 0.7509366342445354), ('michel', 0.710483106481123), ('boxscores', 0.7074004583990405), ('howell', 0.6984719054818977), ('gilligan', 0.6984719054818977)]
Topic 2:
[('wpi', 1.4687743189333464), ('testing', 1.296128523534936), ('utk', 1.1716264171930615), ('ching', 1.168536205853562), ('gak', 0.9415640814146806), ('logistician', 0.9157881917397994), ('tennessee', 0.9054236075861237), ('khettry', 0.9054236075861237), ('rfl', 0.9054236075861237), ('rw', 0.8258753078808252)]


## (4) Predict the document-topic mixture of a new text

❓ **Question (Prediction)** ❓

Now that your LDA model is fitted, you can use it to predict the topics of a new text.

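1. Clean the example with the same preprocessing as the training data, then vectorize it with the already-fitted vectorizer
2. Use the fitted LDA model on the vectorized example to predict its topic mixture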

In [17]:
example = ["My team performed poorly last season. Their best player was out injured and only played one game"]
example[0]

'My team performed poorly last season. Their best player was out injured and only played one game'

In [18]:
preproc= [preprocessing(example[0])]
preproc

['team performed poorly last season best player injured played one game']

In [19]:
vec= vectorizer.transform(preproc)

In [20]:
vec.toarray()

array([[0., 0., 0., ..., 0., 0., 0.]])

In [22]:
topic_proba= lda_model.transform(vec.toarray())



In [23]:
topic_proba

array([[0.83072032, 0.08464087, 0.08463881]])

In [25]:
topic_proba.argmax(axis=1)[0]

0
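
The prediction steps above can be wrapped into a single helper. This is a sketch under stated assumptions: `predict_topic` is a hypothetical name, `str.lower` stands in for the notebook's `preprocessing` function, and the toy corpus and model exist only to keep the example self-contained.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import TfidfVectorizer


def predict_topic(text, vectorizer, lda, preprocess=str.lower):
    """Return (dominant topic index, topic probabilities) for one raw text."""
    cleaned = preprocess(text)
    # transform (not fit_transform): reuse the vocabulary learned at training time
    vec = vectorizer.transform([cleaned])
    proba = lda.transform(vec)[0]
    return int(proba.argmax()), proba


# Hypothetical toy corpus and model, fitted only for the demonstration
toy_docs = [
    "team game player season goal",
    "player team win game coach",
    "god church faith christian bible",
    "faith bible god prayer church",
]
toy_vectorizer = TfidfVectorizer()
toy_lda = LatentDirichletAllocation(n_components=2, random_state=0)
toy_lda.fit(toy_vectorizer.fit_transform(toy_docs))

topic, proba = predict_topic("My team played one game", toy_vectorizer, toy_lda)
print(topic, proba)
```

With the notebook's objects, the call would be `predict_topic(example[0], vectorizer, lda_model, preprocess=preprocessing)`.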

🏁 Congratulations! You now know how to quickly implement an LDA model.

💾 Don't forget to `git add/commit/push` your notebook...

🚀 ... and move on to the next challenge!