# Topic Modeling using Latent Dirichlet Allocation

### Prerequisites:

- Natural Language Processing Fundamentals in Python

- Things to be familiar with: 
    - tokenization
    - stopwords
    - term frequency
    - Bag-Of-Words representation

### What we'll cover:

- What is topic modeling?

- How does Latent Dirichlet Allocation (LDA) work?

- How to train and use LDA with gensim?

## What is topic modeling? 

- **topic**: a collection of related words

- a document can be composed of several topics

### Given a collection of documents, we can ask:

- What words make up each topic?

- What topics make up each document?

<img src="http://deliveryimages.acm.org/10.1145/2140000/2133826/figs/f1.jpg">

Image credit: David Blei

### First, a simple example:

In [1]:
corpus = ['the dog and cat played tennis',
          'tennis and baseball are sports',
          'a dog or a cat can be a pet']

In [2]:
M = len(corpus) # size of corpus

print(M)

3


In [3]:
vocab = ['baseball','cat','dog','pet','played','tennis']

In [4]:
V = len(vocab) # size of vocabulary

print(V)

6


In [5]:
corpus = ['the dog and cat played tennis',
          'tennis and baseball are sports',
          'a dog or a cat can be a pet']

vocab = ['baseball','cat','dog','pet','played','tennis']

In [6]:
# term frequencies: one row per document, one column per vocab word
tf = [[0,1,1,0,1,1],
      [1,0,0,0,0,1],
      [0,1,1,1,0,0]]

In [7]:
import numpy as np
print(np.array(tf).shape) # M x V

(3, 6)
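The term-frequency matrix does not have to be written by hand. The following is a minimal sketch that counts vocabulary words per document, assuming simple whitespace tokenization; the helper name `term_frequencies` is made up for this illustration.

```python
from collections import Counter

corpus = ['the dog and cat played tennis',
          'tennis and baseball are sports',
          'a dog or a cat can be a pet']

vocab = ['baseball', 'cat', 'dog', 'pet', 'played', 'tennis']

def term_frequencies(doc, vocab):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(doc.split())            # naive whitespace tokenization
    return [counts[word] for word in vocab]  # words not in vocab are ignored

# M x V matrix: one row per document, one column per vocabulary word
tf = [term_frequencies(doc, vocab) for doc in corpus]

for row in tf:
    print(row)
```

Note that 'played' does not occur in the second document, so its count in that row is 0.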


### What words make up each topic?

In [84]:
corpus = ['the dog and cat played tennis',
          'tennis and baseball are sports',
          'a dog or a cat can be a pet']

vocab = ['baseball','cat','dog','pet','played','tennis']

In [85]:
K = 2 # number of topics

In [86]:
topic_1 = [.33,   0,   0,   0, .33, .33]

In [87]:
topic_2 = [  0, .25, .25, .25, .25,   0]

In [88]:
# per topic word distributions
phi = [topic_1, topic_2]

In [89]:
print(np.array(phi).shape) # K x V

(2, 6)


### What topics make up each document?

In [14]:
corpus = ['the dog and cat played tennis',
          'tennis and baseball are sports',
          'a dog or a cat can be a pet']

vocab = ['baseball','cat','dog','pet','played','tennis']

phi = [[.33,   0,   0,   0, .33, .33],
       [  0, .25, .25, .25, .25,   0]]

In [15]:
# per document topic distributions
theta = [[.50, .50],
         [.99, .01],
         [.01, .99]]

In [16]:
print(np.array(theta).shape) # M x K

(3, 2)
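One way to see how theta and phi fit together: mixing the topic word distributions with a document's topic weights gives the word probabilities that document is expected to follow. This is a small sketch using the toy values above; the helper name `mix_topics` is hypothetical.

```python
vocab = ['baseball', 'cat', 'dog', 'pet', 'played', 'tennis']

# per topic word distributions (K x V)
phi = [[.33,   0,   0,   0, .33, .33],
       [  0, .25, .25, .25, .25,   0]]

# per document topic distributions (M x K)
theta = [[.50, .50],
         [.99, .01],
         [.01, .99]]

def mix_topics(theta_m, phi):
    """Weighted sum of the topic word distributions for one document."""
    K, V = len(phi), len(phi[0])
    return [sum(theta_m[k] * phi[k][v] for k in range(K)) for v in range(V)]

# M x V matrix of per document word probabilities (theta times phi)
word_probs = [mix_topics(theta_m, phi) for theta_m in theta]

for doc_probs in word_probs:
    print([round(p, 3) for p in doc_probs])
```

The result has the same M x V shape as the term-frequency matrix, which is what makes the two comparable.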


### Uses for $\phi$ (phi), the per topic word distributions:

- inferring labels for topics
- word clouds
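As a rough illustration of label inference, the highest-weighted words of each topic can be listed and a human can name the topic from them. The sketch below reuses the toy `phi` from the example; the helper `top_words` is an illustrative name, not a library function.

```python
vocab = ['baseball', 'cat', 'dog', 'pet', 'played', 'tennis']

phi = [[.33,   0,   0,   0, .33, .33],
       [  0, .25, .25, .25, .25,   0]]

def top_words(topic, vocab, n=4):
    """Return up to n vocabulary words with the largest non-zero weight."""
    ranked = sorted(zip(vocab, topic), key=lambda pair: pair[1], reverse=True)
    return [word for word, weight in ranked[:n] if weight > 0]

for k, topic in enumerate(phi):
    print(k, top_words(topic, vocab))
```

Here the first topic could plausibly be labeled "sports" and the second "pets", although the label is always a human judgment.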

### Uses for $\theta$ (theta), the per document topic weights:

- dimensionality reduction
- clustering
- similarity
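For example, document similarity can be computed directly on the K-dimensional rows of theta instead of the V-dimensional term frequencies. The sketch below uses cosine similarity, which is one common choice among several and is an assumption here.

```python
import math

# per document topic distributions from the toy example (M x K)
theta = [[.50, .50],
         [.99, .01],
         [.01, .99]]

def cosine(u, v):
    """Cosine similarity between two topic-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(round(cosine(theta[0], theta[1]), 3))  # mixed document vs. sports document
print(round(cosine(theta[1], theta[2]), 3))  # sports document vs. pets document
```

The sports and pets documents come out far less similar to each other than either is to the mixed first document.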

### How do we learn phi ($\phi$) and theta ($\theta$)?

### Latent Dirichlet Allocation (LDA)

 - generative statistical model
 - *Blei, D., Ng, A., Jordan, M. Latent Dirichlet allocation. J. Mach. Learn. Res. 3 (Jan 2003)*
 

### Dirichlet Distribution

- Conjugate prior to the Multinomial Distribution

- Multinomial is like a "die"

- Dirichlet is like a "die factory"
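The die factory analogy can be sketched with the standard library alone. A Dirichlet draw can be obtained by normalizing independent Gamma draws; the parameter values and the helper name `sample_dirichlet` below are assumptions chosen for illustration.

```python
import random

def sample_dirichlet(params, rng):
    """Draw one probability vector from Dirichlet(params) via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in params]
    total = sum(draws)
    return [d / total for d in draws]

rng = random.Random(0)

# hypothetical parameters for a factory that makes three-sided dice
params = [0.5, 0.5, 0.5]

# the factory produces a die: a probability for each side, summing to 1
die = sample_dirichlet(params, rng)
print([round(p, 3) for p in die])

# rolling the die is a draw from the corresponding multinomial (categorical) distribution
rolls = rng.choices(range(len(die)), weights=die, k=10)
print(rolls)
```

Each call to `sample_dirichlet` yields a different die, while each die yields different rolls.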

<img src="https://upload.wikimedia.org/wikipedia/commons/4/4d/Smoothed_LDA.png" style="width: 30%">

In [None]:
K     # number of topics

phi   # per topic word distributions

beta  # parameters for word distribution die factory, length = V

In [None]:
M     # number of documents
N     # number of words/tokens in each document

theta # per document topic distributions

alpha # parameters for topic die factory, length = K

In [None]:
z     # topic indexes

In [None]:
Dirichlet   # dirichlet distribution (aka die factory)

<img src="https://upload.wikimedia.org/wikipedia/commons/4/4d/Smoothed_LDA.png" style="width: 30%">

In [None]:
phi = []  # word distribution die, 1 per topic

# pseudocode to generate topic word distributions
for k in range(K):
    phi.append(Dirichlet(beta,V).get_die())  # generate a word distribution die

In [None]:
corpus = []

# pseudocode to generate corpus
for m in range(M):
    document_m = []
    
    theta_m = Dirichlet(alpha,K).get_die()   # generate a topic die
    
    for n in range(N):
        z_mn = theta_m.get_topic()     # roll topic die
        w_mn = phi[z_mn].get_word()    # roll word distribution die
        
        document_m.append(w_mn)
    
    corpus.append(document_m)
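The pseudocode above can be turned into a small runnable sketch. This version makes several assumptions that are not in the original: `alpha` and `beta` are single numbers shared by all components (symmetric priors), every document has the same length `N`, and the helper names `sample_dirichlet` and `generate_corpus` are made up for this example.

```python
import random

def sample_dirichlet(params, rng):
    """Draw one probability vector from Dirichlet(params) via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in params]
    total = sum(draws)
    return [d / total for d in draws]

def generate_corpus(vocab, K, M, N, alpha, beta, seed=0):
    """Generate M documents of N words each with the LDA generative process."""
    rng = random.Random(seed)
    V = len(vocab)

    # one word distribution die per topic
    phi = [sample_dirichlet([beta] * V, rng) for _ in range(K)]

    corpus = []
    for m in range(M):
        # one topic die per document
        theta_m = sample_dirichlet([alpha] * K, rng)

        document_m = []
        for n in range(N):
            z_mn = rng.choices(range(K), weights=theta_m)[0]   # roll topic die
            w_mn = rng.choices(vocab, weights=phi[z_mn])[0]    # roll word die
            document_m.append(w_mn)

        corpus.append(document_m)
    return corpus, phi

vocab = ['baseball', 'cat', 'dog', 'pet', 'played', 'tennis']
generated, phi = generate_corpus(vocab, K=2, M=3, N=5, alpha=0.5, beta=0.5)

for doc in generated:
    print(' '.join(doc))
```

Training LDA runs this story in reverse: given only the documents, it estimates the `phi` and `theta` values that could plausibly have generated them.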

## Review

### Things we know: 

 - M : the number of documents
 - N : the length of each document
 

### Things we choose:

 - K : the number of topics
 - V : the size of our vocabulary (we choose which words to keep)

### Things we want to learn: 

 - $\theta$'s (theta's) : the per document topic weights
 - $\phi$'s (phi's) : the per topic word weights

#### Note:

We may want to infer $\alpha$ and $\beta$ as well

## Example using gensim

In [17]:
from sklearn.datasets import fetch_20newsgroups
newsgroups = fetch_20newsgroups()
corpus_fname = '../../../scikit_learn_data/20news-bydate_py3.data.txt'
with open(corpus_fname,'w') as f:
    for doc in newsgroups.data[:1000]:
        f.write(doc.replace('\n',' ') + '\n')

In [18]:
newsgroups.data[4].replace('\n',' ')[:200]

'From: jcm@head-cfa.harvard.edu (Jonathan McDowell) Subject: Re: Shuttle Launch Question Organization: Smithsonian Astrophysical Observatory, Cambridge, MA,  USA Distribution: sci Lines: 23  From artic'

In [19]:
from gensim.corpora import TextCorpus

In [20]:
%time corpus = TextCorpus(input=corpus_fname)

CPU times: user 1.6 s, sys: 8 ms, total: 1.61 s
Wall time: 1.61 s


In [21]:
corpus.length # M

1000

In [22]:
len(corpus.dictionary) # V

24635

In [23]:
from gensim.models.ldamodel import LdaModel

In [48]:
%%time 

K = 20

lda = LdaModel(corpus=corpus,
               id2word=corpus.dictionary,
               num_topics=K,
               passes=2, chunksize=100)

CPU times: user 39.3 s, sys: 268 ms, total: 39.6 s
Wall time: 27.8 s


### What words make up each topic?

In [70]:
lda.show_topic(15) # phi

[('health', 0.015285139358335125),
 ('medical', 0.011163811500712313),
 ('doctor', 0.009365141347999404),
 ('period', 0.0083968345000975091),
 ('sandvik', 0.0079942412591927099),
 ('pitt', 0.0079152770134412586),
 ('edu', 0.0070091597665548639),
 ('pgh', 0.0068247605143894581),
 ('coli', 0.0065856909931286253),
 ('medicine', 0.0063201670897790234)]

### What topics make up each document?

In [None]:
text = next(corpus.sample_texts(1))

In [76]:
lda[corpus.dictionary.doc2bow(text)] # theta

[(1, 0.38659449392494483),
 (4, 0.065591725589101937),
 (9, 0.44432883640685006),
 (11, 0.065389705776472948)]
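The output above is sparse: only some of the K = 20 topics are listed as `(topic_id, weight)` pairs. For clustering or similarity it is often handier to have a dense vector of length K. The sketch below copies the pairs from the output above and treats every unlisted topic as 0, which is a simplifying assumption.

```python
K = 20  # number of topics used to train the model

# (topic_id, weight) pairs copied from the output above
sparse_theta = [(1, 0.38659449392494483),
                (4, 0.065591725589101937),
                (9, 0.44432883640685006),
                (11, 0.065389705776472948)]

# dense vector of length K, with unlisted topics set to 0
dense_theta = [0.0] * K
for topic_id, weight in sparse_theta:
    dense_theta[topic_id] = weight

# index of the topic with the largest weight for this document
dominant_topic = max(range(K), key=lambda k: dense_theta[k])

print(dominant_topic)
print([round(w, 3) for w in dense_theta])
```

The dominant topic index can then be passed to `lda.show_topic` to see which words characterize the document.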

### Topics covered:

- What is topic modeling?

- How does Latent Dirichlet Allocation (LDA) work?

- How to train and use LDA with gensim?

## Thank you!