# Introduction to Latent Dirichlet Allocation
Author: Caroline Schmitt, ATL

---

## What is LDA?

Latent Dirichlet allocation is a type of **topic modeling**. A topic model is a statistical model of what themes appear in a collection of documents.

Imagine you have a collection of webpages from a pet care website. Each webpage is considered a document. Each document is about different things. One webpage might be about picking a good veterinarian, and another webpage might be about vaccination schedules for your pets.

This pet care website might discuss lots of different types of pets. One webpage might be mostly about dogs, but it mentions cats, too. Another webpage might be mostly about reptiles, but also about amphibians and insects.

We can think of each type of pet as a potential topic. Some webpages might be 80% about dogs and 20% about cats, and others might be 75% about reptiles, 10% about cats, 10% about dogs, and 5% about insects.

LDA infers the underlying (latent!) topics in a collection of documents. It is unsupervised because there is no target variable `y` to predict. The number of topics to search for is a hyperparameter we can tune, and it's up to the modeler to interpret the results.

---

## How does it work?

LDA pretends each document is generated in the following way:

1. Choose $N \sim \text{Poisson}(\xi)$.
2. Choose $\theta \sim \text{Dir}(\alpha)$.
3. For each of the $N$ words $w_n$:
    * Choose a topic $z_n \sim \text{Multinomial}(\theta)$.
    * Choose a word $w_n$ from $p(w_n \mid z_n, \beta)$, a multinomial probability conditioned on the topic $z_n$.

(Source: [Latent Dirichlet Allocation](http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf).)

Or,

1. Choose a number of words that the document will have.
2. Choose θ, the document's distribution over topics: what percentage of the document is "built from" each topic?
3. For each word in the document:
    * Choose a topic according to the topic-document distribution.
    * Choose a word from the topic, according to the probabilities of words from that topic.
    
Each topic is a distribution across words. **Every word appears in every topic, but with a different probability.**
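
The generative story above can be simulated directly. The following is a minimal sketch, not part of the original codealong: the vocabulary, the topic-word matrix `beta`, the prior `alpha`, and the mean length `xi` are all made-up toy values chosen to echo the pet care example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: 3 topics over a 6-word vocabulary.
vocab = np.array(["dog", "leash", "cat", "litter", "gecko", "terrarium"])

# beta[k] is topic k's distribution over words; every word has nonzero probability.
beta = np.array([
    [0.45, 0.45, 0.04, 0.02, 0.02, 0.02],  # a "dog" topic
    [0.04, 0.02, 0.45, 0.45, 0.02, 0.02],  # a "cat" topic
    [0.02, 0.02, 0.02, 0.04, 0.45, 0.45],  # a "reptile" topic
])
alpha = np.full(3, 0.5)  # assumed Dirichlet prior over topic mixtures
xi = 8                   # assumed mean document length


def generate_document(rng):
    n_words = rng.poisson(xi)                                # 1. N ~ Poisson(xi)
    theta = rng.dirichlet(alpha)                             # 2. theta ~ Dir(alpha)
    topics = rng.choice(len(beta), size=n_words, p=theta)    # 3a. one topic per word
    words = [rng.choice(vocab, p=beta[z]) for z in topics]   # 3b. one word per topic draw
    return theta, topics, words


theta, topics, words = generate_document(rng)
print(theta.round(2))   # the document's topic mixture
print(" ".join(words))  # the generated "document"
```

Running `generate_document` repeatedly produces documents whose word mix tracks their sampled `theta`, which is the structure LDA tries to recover from real text.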

Since LDA is pretending each document is generated this way, it can reverse-engineer the topics and the word-probabilities per topic.
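
For reference, the generative steps correspond to the following joint distribution for a single document, as given in the source paper. Here $\mathbf{z}$ and $\mathbf{w}$ are shorthand (not used elsewhere in this notebook) for the document's $N$ topic assignments and $N$ words.

$$
p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)
$$

"Reverse-engineering" then means inferring the hidden $\theta$ and $\mathbf{z}$ from the observed words:

$$
p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta) = \frac{p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)}{p(\mathbf{w} \mid \alpha, \beta)}
$$

The paper notes that this posterior cannot be computed exactly, so in practice it is approximated.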

---

## Codealong:

We're going to fit a topic model to Donald Trump's tweets from 2016.

## Imports:

In [1]:
import numpy as np
import pandas as pd

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

np.random.seed(42)

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

In [3]:
df = pd.read_csv('data/trump_tweets.csv')[['text']]
df.head()

| | text |
|---|---|
| 0 | RT @realDonaldTrump: Happy Birthday @DonaldJTr... |
| 1 | Happy Birthday @DonaldJTrumpJr!\nhttps://t.co/... |
| 2 | Happy New Year to all, including to my many en... |
| 3 | Russians are playing @CNN and @NBCNews for suc... |
| 4 | Join @AmerIcan32, founded by Hall of Fame lege... |


## Preprocess text:

In [49]:
cv = CountVectorizer(min_df=2, stop_words='english')

In [50]:
cv.fit(df['text'])
text = cv.transform(df['text'])
# get_feature_names() was removed in scikit-learn 1.2; on older versions, call it instead
features = cv.get_feature_names_out()

In [51]:
df_text = pd.DataFrame(text.todense(), columns=features)
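
To see what `min_df=2` and `stop_words='english'` do, here is a small sketch on a made-up three-document corpus (the sentences and the `toy_cv` name are assumptions for illustration, not the tweet data). Words that appear in fewer than two documents are dropped, as are common English stop words.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus standing in for the tweets.
docs = [
    "the dog chased the cat",
    "the cat ignored the dog",
    "my gecko ate a cricket",
]

toy_cv = CountVectorizer(min_df=2, stop_words="english")
counts = toy_cv.fit_transform(docs)  # sparse matrix: documents x vocabulary

# vocabulary_ maps each kept term to its column index.
terms = sorted(toy_cv.vocabulary_, key=toy_cv.vocabulary_.get)
print(terms)
print(counts.toarray())
```

Only `cat` and `dog` survive, so the third document becomes an all-zero row. A stricter `min_df` shrinks the vocabulary further, at the cost of leaving some documents with little or nothing for the topic model to use.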

## Fit an LDA model:

In [46]:
lda = LatentDirichletAllocation(n_components=5, random_state=42)
lda.fit(text)

LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
             evaluate_every=-1, learning_decay=0.7, learning_method=None,
             learning_offset=10.0, max_doc_update_iter=100, max_iter=10,
             mean_change_tol=0.001, n_components=5, n_jobs=1,
             n_topics=None, perp_tol=0.1, random_state=42,
             topic_word_prior=None, total_samples=1000000.0, verbose=0)

## Look at components:

In [29]:
lda.n_components

5

In [30]:
lda.components_

array([[43.93793042, 12.86242469,  0.20070232, ...,  1.92425968,
         0.2000196 ,  0.20147219],
       [ 0.20537154, 49.26501402,  0.20009155, ...,  0.20067156,
         1.35695758,  0.20273325],
       [ 0.2037069 ,  0.20394793, 11.04003523, ...,  0.20000534,
         1.39888098,  0.20020789],
       [ 0.20282203,  0.20565637,  0.20017756, ...,  0.20277023,
         0.24199685,  8.18655524],
       [ 0.20378987, 19.32277882,  0.20010374, ...,  0.20000695,
         0.20002324,  0.2018195 ]])

In [31]:
for component in lda.components_:
    print(len(component))

3234
3234
3234
3234
3234
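
Each row of `components_` holds one topic's weights over the whole vocabulary, which is why every row has the same length. The weights are pseudo-counts rather than probabilities. Dividing each row by its sum turns it into a distribution over words. The sketch below shows this on a small random count matrix. `toy_counts` and `toy_lda` are hypothetical stand-ins so the example runs without the tweet data.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical toy document-term counts: 20 documents, 6 vocabulary terms.
rng = np.random.default_rng(0)
toy_counts = rng.integers(0, 5, size=(20, 6))

toy_lda = LatentDirichletAllocation(n_components=3, random_state=0)
toy_lda.fit(toy_counts)

# Divide each row by its sum so every topic becomes a probability distribution.
topic_word = toy_lda.components_ / toy_lda.components_.sum(axis=1, keepdims=True)

print(topic_word.shape)        # topics x vocabulary
print(topic_word.sum(axis=1))  # each row now sums to 1
```

After normalizing, no entry is exactly zero, which matches the earlier point that every word appears in every topic with some probability.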


In [32]:
(pd.DataFrame({'words': features, 'LDA score': lda.components_[4]})
   .set_index('words')
   .sort_values('LDA score', ascending=False)
   .head(5))

| words | LDA score |
|---|---|
| great | 525.465569 |
| america | 441.946024 |
| make | 393.843754 |
| new | 216.525679 |
| 11 | 192.119313 |


## Write a function to display components:

In [33]:
lda = LatentDirichletAllocation(n_components=8)
lda.fit(text)

LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
             evaluate_every=-1, learning_decay=0.7, learning_method=None,
             learning_offset=10.0, max_doc_update_iter=100, max_iter=10,
             mean_change_tol=0.001, n_components=8, n_jobs=1,
             n_topics=None, perp_tol=0.1, random_state=None,
             topic_word_prior=None, total_samples=1000000.0, verbose=0)

In [34]:
lda.components_[0].argsort()[-1]

1352

In [35]:
features[1352]  # The highest-weighted word in the new model's first topic

'hillary'

In [38]:
def print_words(model, words, num_words=5):
    """Print the num_words highest-weighted words for each topic in model."""
    for ix, topic in enumerate(model.components_):
        print('Topic ', ix)
        # argsort() sorts the weights but returns the indices rather than the values;
        # the slice takes the last num_words indices in reverse (highest weight first).
        top_words = [words[i] for i in topic.argsort()[:-num_words - 1:-1]]
        print('\n'.join(top_words))
        print('\n')

print_words(lda, features, 10)

Topic  0
hillary
clinton
crooked
https
debate
amp
let
bernie
total
jeb


Topic  1
people
obama
just
good
president
morning
great
really
florida
kasich


Topic  2
https
realdonaldtrump
rt
trump
join
amp
tomorrow
campaign
trump2016
totally


Topic  3
trump
bad
speech
https
like
support
beat
just
marco
watch


Topic  4
new
year
50
members
poll
great
11
https
big
american


Topic  5
https
thank
great
america
make
trump2016
makeamericagreatagain
family
live
11


Topic  6
cruz
ted
people
country
job
look
did
video
realdonaldtrump
say


Topic  7
foxnews
trump
tonight
cnn
enjoy
rubio
donald
realdonaldtrump
interviewed
media




## Looking at different numbers of topics:

In [41]:
lda = LatentDirichletAllocation(n_components=5)
lda.fit(text)
print_words(lda, features, 5)

Topic  0
just
trump
win
people
great


Topic  1
https
thank
trump2016
makeamericagreatagain
great


Topic  2
great
america
make
https
realdonaldtrump


Topic  3
hillary
clinton
https
crooked
people


Topic  4
cruz
cnn
ted
enjoy
tonight
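
Reading topic lists side by side is one way to pick the number of topics. A rough quantitative check is to compare held-out perplexity across several values of `n_components`, where lower is better. The sketch below is one possible approach rather than part of the original codealong. It uses a random toy count matrix, and the candidate values `(2, 3, 5)` are arbitrary.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical toy document-term counts: 40 documents, 12 vocabulary terms.
rng = np.random.default_rng(0)
toy_counts = rng.integers(0, 5, size=(40, 12))
train, held_out = toy_counts[:30], toy_counts[30:]

scores = {}
for k in (2, 3, 5):
    model = LatentDirichletAllocation(n_components=k, random_state=0).fit(train)
    scores[k] = model.perplexity(held_out)  # lower is better

for k, score in scores.items():
    print(k, round(score, 1))
```

Perplexity does not always agree with how interpretable the topics are, so the judgment about which topics make sense still rests with the modeler.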


