# LDA Assignment

LDA is a probabilistic topic model. The joint distribution is given by:
$$p(W, Z, \Theta, \Phi | \alpha, \eta) = \prod_{d=1}^D \Big( p(\theta_d | \alpha) \prod_{n=1}^{N_d} p(w_{d,n} | z_{d,n}, \Phi) p(z_{d,n} | \theta_d) \Big) \prod_{t=1}^T p(\phi_t | \eta) $$

Where:
$$p(\theta_d | \alpha) = Dir(\theta_d | \alpha)$$

$$p(w_{d,n} | z_{d,n}, \Phi) = Mult(w_{d,n} | \phi_{z_{d,n}})$$

$$p(z_{d,n} | \theta_d) = Mult(z_{d,n} | \theta_d)$$

$$p(\phi_t | \eta) = Dir(\phi_t | \eta)$$
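The factorization above corresponds to a simple generative process, which can be sketched in NumPy as follows. This is an illustration only: the function name `sample_lda_corpus`, the fixed document length, and the symmetric scalar values of `alpha` and `eta` are assumptions made for the example and are not part of the assignment.

```python
import numpy as np

def sample_lda_corpus(num_docs, doc_length, num_topics, vocab_size, alpha, eta, seed=0):
    """Draw a toy corpus from the LDA generative process (hypothetical helper)."""
    rng = np.random.RandomState(seed)
    # phi_t ~ Dir(eta): one distribution over words per topic
    phi = rng.dirichlet([eta] * vocab_size, size=num_topics)
    # theta_d ~ Dir(alpha): one distribution over topics per document
    theta = rng.dirichlet([alpha] * num_topics, size=num_docs)
    docs = []
    for d in range(num_docs):
        # z_{d,n} ~ Mult(theta_d): a topic for every word position
        z = rng.choice(num_topics, size=doc_length, p=theta[d])
        # w_{d,n} ~ Mult(phi_{z_{d,n}}): a word drawn from the assigned topic
        w = np.array([rng.choice(vocab_size, p=phi[t]) for t in z])
        docs.append(w)
    return theta, phi, docs

theta, phi, docs = sample_lda_corpus(
    num_docs=5, doc_length=20, num_topics=3, vocab_size=50, alpha=0.1, eta=0.1
)
```

Small values of `alpha` tend to produce documents dominated by one or two topics, which is worth keeping in mind when studying the prior later on.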

In this assignment, you will apply the Latent Dirichlet Allocation (LDA) topic model to a dataset of NIPS papers. You will need the `gensim` Python library, which can be installed via `pip`.

In [3]:
import numpy as np
import scipy.io
from matplotlib import pyplot
%matplotlib inline

import gensim

import logging

gensim.models.ldamodel.logger.setLevel(logging.ERROR)

Download the dataset prepared by Sam Roweis and put it into the folder with the IPython Notebook: http://www.cs.nyu.edu/~roweis/data/nips12raw_str602.mat

The following code performs the necessary preprocessing.

In [5]:
nips12 = scipy.io.loadmat('nips12raw_str602.mat', squeeze_me=True)

# num documents x num words matrix
counts = nips12['counts'].T

# keep only the 2013 (~2k) most frequent words
words_mask = np.ravel(counts.sum(axis=0) >= 121)
counts = counts[:, words_mask]

# id -> word mapping (required by gensim)
nips12_id2word = {i: w for (i, w) in enumerate(nips12['wl'][words_mask])}

# word -> id mapping (required by pyLDAvis)
nips12_word2id = {w: i for (i, w) in enumerate(nips12['wl'][words_mask])}

# NIPS issue for each document. Issue 0 is the NIPS proceedings of 1988, issue 1 is 1989, etc.
nips12_issue = np.array([int(name[4:6]) for name in nips12['docnames']])

# Titles of papers
nips12_titles = nips12['ptitles']

# Full corpus in gensim format
full_corpus = gensim.matutils.Scipy2Corpus(counts)

stream = np.random.RandomState(seed=123)
subset_mask = stream.rand(counts.shape[0]) <= 0.1

# Small corpus of 10% random papers for quick experiments
small_corpus = gensim.matutils.Scipy2Corpus(counts[subset_mask, :])

Gensim takes an iterative approach to LDA inference. First, variational inference is run for `iterations` iterations to produce new values of the variational parameters. Then the new values are "blended" with the old ones (the values from the previous iteration of the EM algorithm) by taking a weighted average. This is repeated `num_passes` times. The procedure helps the variational parameters escape local optima.
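The blending step can be illustrated with a minimal NumPy sketch. The helper names `blend` and `step_size` are hypothetical, and the step-size schedule is only modeled on the `decay` and `offset` parameters of `LdaModel`; this is a simplified illustration, not gensim's actual implementation.

```python
import numpy as np

def blend(old_params, new_params, rho):
    """Weighted average of the old and the freshly estimated parameters."""
    return (1.0 - rho) * old_params + rho * new_params

def step_size(update_index, offset=1.0, decay=0.5):
    # Assumed schedule: the weight of the new estimate shrinks with every update
    return (offset + update_index) ** (-decay)

old_params = np.ones((2, 4))
new_params = np.array([[4.0, 1.0, 1.0, 1.0],
                       [1.0, 1.0, 1.0, 4.0]])

# With rho = 0.5 the result lies halfway between the old and new values
blended = blend(old_params, new_params, rho=0.5)
```

A larger `rho` trusts the newest estimate more; a decaying `rho` makes later updates progressively more conservative.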

Use the following code template to run the LDA model in Gensim. Additionally, you will get the value of the variational lower bound after every pass. For now, we use the small corpus to make computations faster.

Note: the lower bound is related to the perplexity measure commonly used in natural language processing. With the bound normalized by the total number of word tokens $N$ in the corpus, $perplexity = \exp(-bound / N)$.
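In practice the bound is usually divided by the total number of word tokens before exponentiating, which keeps the value in a readable range. A minimal sketch of that conversion is shown here; the helper name and the numbers are hypothetical and chosen only for illustration.

```python
import numpy as np

def perplexity_from_bound(bound, num_tokens):
    """Per-word perplexity implied by a lower bound on the corpus log-likelihood."""
    return np.exp(-bound / num_tokens)

# Hypothetical values, not taken from the NIPS data
example_bound = -800000.0
example_tokens = 120000
example_perplexity = perplexity_from_bound(example_bound, example_tokens)
```

For a gensim bag-of-words corpus, the token count can be obtained by summing the counts of all `(word_id, count)` pairs over all documents.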

In [6]:
num_topics = 10  # number of topics in LDA model
alpha = [0.1] * num_topics  # parameter of the Dirichlet prior for document/topic distribution
iterations = 50  # number of variational inference iterations per update
num_passes = 10  # number of passes over the dataset

small_lda = gensim.models.LdaModel(passes=1, num_topics=num_topics, alpha=alpha, iterations=iterations, id2word=nips12_id2word, eval_every=None)
for _ in range(num_passes):
    small_lda.update(small_corpus)
    print(small_lda.bound(small_corpus))

-832804.509748
-809916.454539
-803203.478327
-800583.39914
-799440.726993
-798870.506038
-798584.572835
-798454.431585
-798403.07618
-798413.488461


Tune the number of `iterations` and `num_passes`. Try to maximize the variational lower bound.

Study the sensitivity of the lower bound value to the prior parameter $\alpha$. Use symmetric values of $\alpha$.

Fit the best model to the whole corpus.

Extract the variational parameters $\gamma$, the parameters of the variational approximation to the posterior distribution over topics for each document: $q(\theta_d) = Dir(\theta_d | \gamma_d)$.

Normalize them to get a probability distribution over the topics for each document (the mean of the corresponding Dirichlet distribution).
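The mean of a Dirichlet distribution with parameters $\gamma_d$ is $\gamma_{d,k} / \sum_j \gamma_{d,j}$, so the normalization amounts to dividing each row by its sum. A minimal sketch on a toy matrix follows; `normalize_gamma` and the toy values are hypothetical.

```python
import numpy as np

def normalize_gamma(gamma):
    """Turn each row of Dirichlet parameters into its mean distribution."""
    gamma = np.asarray(gamma, dtype=float)
    return gamma / gamma.sum(axis=1, keepdims=True)

# Toy parameters: 2 documents, 3 topics
toy_gamma = np.array([[8.0, 1.0, 1.0],
                      [0.5, 0.5, 1.0]])
toy_theta = normalize_gamma(toy_gamma)
```

Using `keepdims=True` keeps the row sums as a column vector, so the division broadcasts row by row.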

In [None]:
gamma, _ = lda.inference(full_corpus)
# Normalize gammas here.

Visualize the approximate posterior topic distributions for some documents. Do this for documents from different years. Does the sparsity of the topic distributions change over time? If so, can you explain this?

Write the code to print the most probable words and most probable documents for each topic. You may need to use `lda.num_topics`, `lda.show_topic(topic, topn=10)` and normalized gammas computed in the previous task.
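For the document side of this task, one possible approach is to sort a column of the normalized document-topic matrix. The sketch below uses toy data; `top_documents`, `toy_doc_topics`, and `toy_titles` are hypothetical stand-ins for your own normalized gammas and `nips12_titles`.

```python
import numpy as np

def top_documents(doc_topics, topic, topn=3):
    """Indices of the documents with the highest probability for `topic`."""
    order = np.argsort(doc_topics[:, topic])[::-1]  # descending order
    return order[:topn]

# Toy data: 4 documents, 2 topics
toy_doc_topics = np.array([[0.8, 0.2],
                           [0.1, 0.9],
                           [0.6, 0.4],
                           [0.3, 0.7]])
toy_titles = ["paper A", "paper B", "paper C", "paper D"]

for topic in range(toy_doc_topics.shape[1]):
    best = top_documents(toy_doc_topics, topic, topn=2)
    print(topic, [toy_titles[i] for i in best])
```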

Analyze the results. Can you interpret the topics? Write your interpretation of at least 3 topics.

Note. If you find an interesting paper in the list, you can download it online, as NIPS proceedings are freely available!

Write the code to calculate the mean probability of a topic in a given year. Analyze which topics become more popular over the years and which become less popular.
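One way to organize this computation is to group the rows of the normalized document-topic matrix by issue and average within each group. The sketch below works on toy data; `mean_topic_by_issue` and the toy arrays are hypothetical stand-ins for your normalized gammas and `nips12_issue`.

```python
import numpy as np

def mean_topic_by_issue(doc_topics, issues):
    """Average topic distribution of the documents in each issue."""
    issue_ids = np.unique(issues)
    means = np.vstack([doc_topics[issues == i].mean(axis=0) for i in issue_ids])
    return issue_ids, means

# Toy data: 4 documents, 2 topics, 2 issues
toy_doc_topics = np.array([[0.8, 0.2],
                           [0.6, 0.4],
                           [0.1, 0.9],
                           [0.3, 0.7]])
toy_issues = np.array([0, 0, 1, 1])
issue_ids, means = mean_topic_by_issue(toy_doc_topics, toy_issues)
```

Each row of `means` is itself a probability distribution, so plotting one column against `issue_ids` shows how the share of a single topic evolves.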

Use the following code to print topics found by LSI (Latent Semantic Indexing) model, a non-probabilistic topic model. What can you say about the interpretability of the topics? What about the running time?

In [None]:
lsi = gensim.models.LsiModel(full_corpus, num_topics=num_topics, id2word=nips12_id2word)
lsi.print_topics(10, num_words=20)

# Visualization

Run the following code to visualize the topics of your best model.

**Find the two most similar topics.**
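As a numerical cross-check of what the visualization suggests, the topic-word distributions can be compared pairwise. The sketch below uses the Hellinger distance, which is one choice among several and is an assumption of this example; `most_similar_topics` and `toy_topics` are hypothetical. For a fitted model, the topic-word matrix could be taken from `lda.get_topics()`, assuming your gensim version provides it.

```python
import numpy as np

def most_similar_topics(topic_word):
    """Pair of topics with the smallest Hellinger distance between word distributions."""
    root = np.sqrt(topic_word)
    # Hellinger distance: ||sqrt(p) - sqrt(q)||_2 / sqrt(2)
    diff = root[:, None, :] - root[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2)) / np.sqrt(2.0)
    np.fill_diagonal(dist, np.inf)  # ignore the distance of a topic to itself
    i, j = np.unravel_index(np.argmin(dist), dist.shape)
    return (int(min(i, j)), int(max(i, j))), dist

# Toy data: 3 topics over a 4-word vocabulary
toy_topics = np.array([[0.7, 0.2, 0.1, 0.0],
                       [0.6, 0.3, 0.1, 0.0],
                       [0.0, 0.1, 0.2, 0.7]])
pair, dist = most_similar_topics(toy_topics)
```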

In [22]:
class MyDictionary():
    def __init__(self, word2id):
        self.token2id = word2id
    
    def __len__(self):
        return len(self.token2id)

    
class MyScipy2Corpus(gensim.matutils.Scipy2Corpus):
    def __len__(self):
        return self.vecs.shape[0]



In [23]:
lda.save('model.dat')

You will need to install the `pyLDAvis` library, e.g. via `pip`.

In [25]:
import pyLDAvis.gensim

lda = gensim.models.LdaModel.load('model.dat')
my_full_corpus = MyScipy2Corpus(counts[subset_mask, :])
my_dictionary = MyDictionary(nips12_word2id)
data = pyLDAvis.gensim.prepare(lda, my_full_corpus, my_dictionary)
pyLDAvis.display(data)