<img src="img/ff-logo.png" width="30%">

## LDA workshop

### Mike Williams • Director of Research • Fast Forward Labs

#### [github.com/fastforwardlabs/ldaworkshop](https://github.com/fastforwardlabs/ldaworkshop)

# Text

There are lots of things you might want to do with documents:
 - group them together
 - summarize one or several of them
 - classify them (e.g. sentiment, tagging)
 - explore, e.g. temporal trends in what they discuss

To do any of these things using a computer, you first need to convert words into numbers.

## Bags of words

The "bag of words" approach is the simplest method for this.

![Bag of words](img/bagwords.png)

There are at least three problems with this approach:

 - **long, sparse data**: a short text is turned into a very long series of numbers, most of which are zero ("sparse").
 
 - **synonyms and multiple meanings**: "movies" and "films" are treated as different words (they should not be), while "bow" (that shoots an arrow) and "bow" (that you make to a king) are treated as the same word (they should be different)
 
 - **word order**: totally lost ("man bites dog" and "dog bites man" are the same).
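The last two problems are easy to see on a toy example. The sketch below is a hand-rolled bag of words using only the standard library, not the scikit-learn code used later in this notebook; the `bag_of_words` helper and the six-word vocabulary are made up for illustration.

```python
from collections import Counter


def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs in text, ignoring word order."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


# Toy vocabulary, made up for illustration; real vocabularies have thousands of words.
vocabulary = ["bites", "bow", "dog", "films", "man", "movies"]

a = bag_of_words("man bites dog", vocabulary)
b = bag_of_words("dog bites man", vocabulary)
print(a)       # [1, 0, 1, 0, 1, 0]
print(a == b)  # True: word order is lost

c = bag_of_words("I like movies", vocabulary)
d = bag_of_words("I like films", vocabulary)
print(c == d)  # False: the synonyms land in different columns
```

Even with six words, most entries are zero. With a realistic vocabulary of thousands of words, almost all of them are.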

We're going to look at a technique that addresses the first two of these problems.

# Topic modeling

Topic modeling is a statistical method to find groups of words that tend to co-occur in a corpus of documents.

For example, maybe the words "movie", "film" and "director" often occur in the same documents. That would make them a "topic".

Topic modeling algorithms find these groups automatically. They are an instance of the class of algorithms known as "unsupervised machine learning".

This lets us express documents as a combination of a relatively small number (~100) of topics, rather than thousands of words (most of which don't occur), and it lets us treat documents about "films" and "movies" similarly.

# Topic modeling workflow

Topic modeling has two steps:

 - learn topics from a corpus of representative documents
 - figure out which of these topics occur in the particular document(s) you're interested in
 
If you're a machine learning person, you'll recognize these as training and evaluation.

<img src="img/lda_topics.png" width=50%>

<img src="img/lda_evaluate.png" width=50%>

Once you've done the second step, you've expressed your new document as a short vector of numbers that you can now do all sorts of things with:

 - group documents together
 - summarize one or several documents
 - classify them (e.g. sentiment, tagging)
 - explore, e.g. temporal trends in what people are discussing, as in these [time-series plots of 1000 topics extracted from 20 years of the New York Times](http://christo.cs.umass.edu/NYT/)

## Latent Dirichlet Allocation

The best-known algorithm for finding topics in a corpus is Latent Dirichlet Allocation (LDA). It's got a complicated name and, to be frank, it's a complicated algorithm. If you'd like to begin to dig into the details, there are two resources I recommend very highly:

 - [Tim Hopper's PyData NYC 2015 talk](https://www.youtube.com/watch?v=_R66X_udxZQ)
 - [David Blei's ACM article](https://www.cs.princeton.edu/~blei/papers/Blei2012.pdf)
 
For the purposes of this talk, I'm just going to say it finds groups of words that co-occur by magic!

The good news is, there are several excellent open source implementations of the algorithm, and we're going to use one of those today.

We're going to apply it to a public dataset of Amazon product reviews.

# Load data

The cell below opens the file and loads it into a pandas dataframe.

There's a lot going on in this line, all of which is useful to understand if you're a Python programmer, but none of which is necessary to understand if you're only interested in LDA.

If you would like a more detailed explanation of what's going on, please see [the Data notebook](data.ipynb).

In [None]:
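from __future__ import print_function  # must be the first statement in the cell

import json
import numpy as np
import pandas as pd

# reviews.json holds one JSON object per line; parse each line into one row.
with open("reviews.json", 'rb') as f:
    reviews = pd.DataFrame(json.loads(l.decode()) for l in f)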
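If the loading cell looks dense, the core pattern is small: the file is read one line at a time, and each line is decoded and parsed as a separate JSON object. Here is a minimal sketch of that pattern on two made-up records held in memory (the field names match the columns used below, but the values are invented).

```python
import io
import json

# Two made-up records standing in for lines of reviews.json.
raw = (
    b'{"asin": "A1", "reviewerID": "R1", "reviewText": "Great book"}\n'
    b'{"asin": "A2", "reviewerID": "R2", "reviewText": "Too long"}\n'
)

# io.BytesIO behaves like a file opened in 'rb' mode: iterating yields bytes lines.
with io.BytesIO(raw) as f:
    records = [json.loads(line.decode()) for line in f]

print(len(records))              # 2
print(records[0]["reviewText"])  # Great book
```

Passing a sequence of dictionaries like `records` to `pd.DataFrame` gives one row per dictionary and one column per key.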

A pandas dataframe is a structured table-like object that, among many other things, supports a bunch of SQL-like operations and handles fiddly data types like dates and times well. I don't know much about R, but I understand R has objects like this too.

`head` allows us to see the first few rows:

In [None]:
reviews.head(2)

Individual columns can be accessed as keys:

In [None]:
reviews['asin'].head()

In [None]:
print("{} reviews".format(len(reviews)))
print("of {} products ".format(len(reviews.asin.unique())))
print("by {} unique authors ".format(len(reviews.reviewerID.unique())))

In [None]:
texts = reviews.loc[:19999, 'reviewText']

In [None]:
print(texts[0])

# Vectorize reviews

You have to preprocess the text a little before you apply LDA: you need to split documents into words, and you need to turn those words into vectors of numbers.

Ironically, in order to get the benefits of LDA, we first need to run bag of words on our data!

The code to do this is built into scikit-learn.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(stop_words='english', max_features=10000)
vectorizer.fit(texts)
X = vectorizer.transform(texts)

In [None]:
X.shape

# Learn topics using LDA

In [None]:
from sklearn.decomposition import LatentDirichletAllocation

In [None]:
%%time
# Note: scikit-learn 0.19 renamed n_topics to n_components, and 0.21 removed the old name.
lda = LatentDirichletAllocation(n_topics=100, learning_method='batch', n_jobs=-2, random_state=0)
lda.fit(X)

# Inspect topics

The first thing you should do when you fit a topic model is inspect a few of the words that dominate each topic to check that the topics are coherent.

To do this, we need to look at the `components_` attribute now attached to `lda`. This is an array with `n_topics` rows and a number of columns equal to the size of the vocabulary.

In [None]:
print(lda.components_.shape)

Each number in this array is the weight of the corresponding word in the corresponding topic.

The weights of each topic should add up to one, i.e. each row of `lda.components_` should add up to 1.

In [None]:
lda.components_.sum(axis=1)

Oh dear! The rows don't add up to 1. scikit-learn stores `components_` as unnormalized pseudo-counts rather than probabilities, so let's normalize each row ourselves.

In [None]:
lda.components_ /= lda.components_.sum(axis=1)[:, None]
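The division in the cell above relies on NumPy broadcasting: `sum(axis=1)` gives one total per row, and `[:, None]` turns those totals into a column so that each row is divided by its own total. A small sketch on a made-up array of weights:

```python
import numpy as np

# Made-up weights: 2 topics (rows) by 3 words (columns).
weights = np.array([[2.0, 1.0, 1.0],
                    [1.0, 3.0, 6.0]])

row_sums = weights.sum(axis=1)             # shape (2,): one total per topic
normalized = weights / row_sums[:, None]   # (2, 3) / (2, 1): row-wise division

print(normalized)
print(normalized.sum(axis=1))              # each row now sums to 1
```

Without `[:, None]`, NumPy would try to line the two totals up with the three columns and raise a shape error.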

This array of topics and words (or terms) is usually called the topic-term matrix, so let's save it under that name:

In [None]:
topic_term = lda.components_

The word corresponding to each column in this array, which we'll call the `vocabulary`, is available as a list from `vectorizer.get_feature_names()`

In [None]:
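# Note: on scikit-learn 1.0 and later, use vectorizer.get_feature_names_out() instead
# (get_feature_names was removed in 1.2).
vocabulary = vectorizer.get_feature_names()
print(vocabulary[:10000:1000]) # print the items at index 0, 1000, 2000, 3000, etc. in the vocabulary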

Now let's look at the top 10 words that dominate each topic.

The code below does that by going through each row (i.e. each topic) in the `topic_term` array, finding the biggest numbers in that row, then finding the corresponding words.

In [None]:
def print_topic(topic_term, topic_id, vocabulary):
    print("{:2d} ".format(topic_id) + " ".join(vocabulary[i] for i in topic_term[topic_id].argsort()[:-11:-1]))

for i, _  in enumerate(topic_term):
    print_topic(topic_term, i, vocabulary)
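The `argsort()[:-11:-1]` idiom in `print_topic` is compact but cryptic. `argsort` returns the indices that would sort the row in ascending order, and the slice walks backwards from the end, so you get the indices of the largest values first. Here is the same trick for the top 3 of a made-up row of 5 weights (the words and numbers are invented):

```python
import numpy as np

# One made-up topic: a weight for each of five words.
words = ["price", "movie", "book", "film", "plot"]
weights = np.array([0.05, 0.40, 0.10, 0.30, 0.15])

order = weights.argsort()   # indices in ascending order of weight: [0, 2, 4, 3, 1]
top3 = order[:-4:-1]        # last three indices, largest weight first: [1, 3, 4]

print([words[i] for i in top3])   # ['movie', 'film', 'plot']
```

In general, `[:-(k + 1):-1]` gives the top `k`, which is why the top 10 uses `-11`.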

In this case, the topics are reasonably coherent, so I'm going to move on.

If things look messy:
 - n_topics might be too large, either for the diversity in the corpus (maybe there really aren't 1000 topics), or for the number of documents you have (you just don't have enough data)
 - n_topics might be too low (real topics have to be merged together by the algorithm, which doesn't work well)
 - you've got a bug!
 
Setting n_topics very small (say 5) or very high (say 1000) is a good way of building up some intuition for what works (although beware `n_topics=1000` will take a long time to run).

# Inspect a document in the corpus

The topic model is the lens through which we're going to view future documents.

But let's first look at our existing documents through this lens.

To do that, we have to transform the documents we trained on into distributions over topics (e.g. document 1 is 20% topic A, 30% topic B, etc.).

We do that by running the `lda.transform` method on the vectorized documents `X`:

In [None]:
doc_topic = lda.transform(X)

`doc_topic` has a row for each document, and a column for each topic.

In [None]:
doc_topic.shape

Finally, let's look at the topic distribution of one document picked from the corpus:

In [None]:
texts[1234]

In [None]:
top_topics = (doc_topic[1234]).argsort()[:-6:-1]
print(top_topics)

What are these topics?

In [None]:
for i in top_topics:
    print_topic(topic_term, i, vocabulary)

# Visualization

pyLDAvis is a comprehensive package for visualizing the results of a topic model. It's useful for understanding the structure of the model you've just discovered. The topics exist in a huge space. This package squeezes things down to 2D so we can look at it on the screen.

In my experience, it generates a ton of spurious warnings, so let's disable warnings for this package when we import it (a useful trick!)

In [None]:
import warnings

with warnings.catch_warnings():
    warnings.filterwarnings("ignore", category=DeprecationWarning)
    try:
        import pyLDAvis
    except ImportError:
        print('ERROR: pyLDAvis not installed! Skip to next section!')

In addition to the `topic_term` and `doc_topic` matrices, pyLDAvis needs to know how often each word occurs in the entire corpus, and how long each document is. Here are calculations that give those.

In [None]:
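term_frequency = np.asarray(X.sum(axis=0)).squeeze()
# pyLDAvis expects document lengths in words, so count vocabulary words per document
# (len(t) would count characters instead).
doc_lengths = np.asarray(X.sum(axis=1)).squeeze()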
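Both quantities can be read off a document-by-word count matrix: column sums give how often each word occurs in the whole corpus, and row sums give the number of counted words in each document. A minimal sketch on a made-up dense matrix (the real `X` is sparse, which is why the notebook wraps the sums in `np.asarray(...).squeeze()`):

```python
import numpy as np

# Made-up counts: 3 documents (rows) by 3 vocabulary words (columns).
counts = np.array([[2, 0, 1],
                   [0, 1, 0],
                   [1, 1, 3]])

term_frequency = counts.sum(axis=0)   # per-word totals over the corpus
doc_lengths = counts.sum(axis=1)      # per-document totals over the vocabulary

print(term_frequency)   # [3 2 4]
print(doc_lengths)      # [3 1 5]
```

Both arrays add up to the same grand total, which is a quick sanity check that the axes are the right way round.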

In [None]:
lda_vis = pyLDAvis.prepare(topic_term_dists=topic_term,
                           doc_topic_dists=doc_topic,
                           doc_lengths=doc_lengths,
                           vocab=vocabulary,
                           term_frequency=term_frequency)

In [None]:
pyLDAvis.display(lda_vis)

# Put all this together in a Pipeline and persist the model

The process of getting from document to topic distribution is a little fiddly. We need to:
 - Vectorize the document (using the same vocabulary we used when training above)
 - Transform the document using the LDA object
 
scikit-learn allows us to bundle these steps (and more!) together in an object called a `Pipeline`, which we can save to disk, reload, and work with again. Let's build one, train it, and save it.

**WARNING**: this next cell will take a while to execute the first time you run it. After that though, the model will be loaded from disk.

In [None]:
import pickle
from sklearn.pipeline import make_pipeline

try:
    # Reuse a previously trained pipeline if one has been saved to disk.
    with open('topic_model.pkl', 'rb') as f:
        topic_pipeline = pickle.load(f)
except IOError:
    # No saved model yet: train one (slow) and save it for next time.
    topic_pipeline = make_pipeline(
        CountVectorizer(stop_words='english', max_features=10000),
        LatentDirichletAllocation(n_topics=100, learning_method='batch', n_jobs=-2, random_state=0)
    )
    topic_pipeline.fit(texts)
    with open('topic_model.pkl', 'wb') as f:
        pickle.dump(topic_pipeline, f)

pipeline_vocabulary = topic_pipeline.steps[0][1].get_feature_names()
pipeline_topic_term = topic_pipeline.steps[1][1].components_
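The try/except in the cell above is a general "load it if it's on disk, otherwise build it and save it" pattern, and it works for anything that can be pickled. Here is a minimal sketch with the pattern pulled out into a function; `load_or_build` and `expensive_build` are hypothetical names, and a small dictionary stands in for the trained pipeline.

```python
import os
import pickle
import tempfile


def load_or_build(path, build):
    """Return the object pickled at path; if the file is missing, build and save it."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except IOError:
        obj = build()
        with open(path, "wb") as f:
            pickle.dump(obj, f)
        return obj


calls = []  # records how many times the slow step actually runs


def expensive_build():
    calls.append(1)
    return {"model": "stand-in for a trained pipeline"}


path = os.path.join(tempfile.mkdtemp(), "model.pkl")
first = load_or_build(path, expensive_build)    # builds and saves
second = load_or_build(path, expensive_build)   # loads from disk

print(first == second, len(calls))   # True 1
```

One caveat with this pattern: if you change the training code or data, you need to delete the pickle file, or the stale model will keep being loaded. Also, only unpickle files you trust, because loading a pickle can run arbitrary code.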

# Determine topics of a new document

The single document we looked at above was pretty short. Let's make a longer, more interesting document by joining several reviews together.

In [None]:
randomreviews = " ".join(texts[5000:5005])

In [None]:
print(randomreviews)

In [None]:
doc_topic = topic_pipeline.transform([randomreviews])

In [None]:
print(doc_topic.shape)

In [None]:
top_topics = (doc_topic[0]).argsort()[:-6:-1]
print(top_topics)

In [None]:
for i in top_topics:
    print_topic(pipeline_topic_term, i, pipeline_vocabulary)

# What next?

Use the `doc_topic` array for a downstream task, e.g.
 - corpus exploration (remember the [NYT visualization](http://christo.cs.umass.edu/NYT/))
 - document clustering, e.g. use something like `KMeans` (in scikit-learn) to visualize which documents are most similar in terms of their topics, which may surface groups of topics or groups of documents
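As a simpler starting point than clustering, you can ask which document is closest to each other document in topic space. The sketch below uses cosine similarity rather than `KMeans`, and runs on a made-up `doc_topic` array with 4 documents and 3 topics; with the real array the same lines apply, although the full similarity matrix gets large for many documents.

```python
import numpy as np

# Made-up topic distributions: 4 documents (rows) by 3 topics (columns).
doc_topic = np.array([[0.7, 0.2, 0.1],
                      [0.6, 0.3, 0.1],
                      [0.1, 0.1, 0.8],
                      [0.2, 0.1, 0.7]])

# Scale each row to unit length so that a dot product is a cosine similarity.
unit = doc_topic / np.linalg.norm(doc_topic, axis=1, keepdims=True)
similarity = unit.dot(unit.T)

np.fill_diagonal(similarity, -1.0)    # stop each document matching itself
nearest = similarity.argmax(axis=1)   # index of the most similar other document

print(nearest)   # [1 0 3 2]
```

Documents 0 and 1 pair up because both are dominated by the first topic, and documents 2 and 3 pair up because both are dominated by the last one.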

## Summarization

Here's a short algorithm, but see [Fast Forward Labs Report 04](http://ff04.fastforwardlabs.com) for details:
  - Train LDA on all products of a certain type (e.g. all the books)
  - Treat all the reviews of a particular product as one document, and infer their topic distribution
  - Infer the topic distribution for each sentence
  - For each topic that dominates the reviews of a product, pick some sentences that are themselves dominated by that topic
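The last step of the list above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the method from the report: the sentences and topic distributions are made up, and in practice both distributions would come from something like `topic_pipeline.transform`.

```python
import numpy as np

# Made-up sentences from the reviews of one product.
sentences = ["The plot drags in the middle.",
             "Shipping was fast.",
             "The characters feel real.",
             "It arrived well packaged."]

# Made-up topic distributions over 3 topics.
product_topics = np.array([0.6, 0.3, 0.1])        # all reviews as one document
sentence_topics = np.array([[0.8, 0.1, 0.1],      # one row per sentence
                            [0.1, 0.8, 0.1],
                            [0.7, 0.2, 0.1],
                            [0.2, 0.7, 0.1]])

# The two topics that dominate the product's reviews, strongest first.
dominant = product_topics.argsort()[:-3:-1]

# For each dominant topic, pick the sentence with the most weight on that topic.
summary = [sentences[sentence_topics[:, t].argmax()] for t in dominant]

print(summary)   # ['The plot drags in the middle.', 'Shipping was fast.']
```

Picking a single best sentence per topic is the simplest choice. You could just as well take the top few sentences per topic, or skip sentences whose strongest topic has a low weight.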

<img src="img/strain.png">

Be aware of limitations:
 - Choosing `n_topics` is an art rather than a science!
 - The topics don't come with names. Sometimes they overlap. Sometimes they're not what you want them to be. For example, if you run a topic model on the NYT corpus, there's no guarantee you'll get topics that correspond to the sections of the newspaper (business, metro, world, sport, etc.!)

## Further reading

The way you used `fit` and `transform` for the `vectorizer`, `lda`, and `topic_pipeline` objects is generic across scikit-learn, so play with scikit-learn! [Andreas Mueller's presentation](https://www.youtube.com/watch?v=8CzwlZbwDkI) is a good place to start.
 
Remember, if you're interested in the LDA algorithm itself, take a look at

 - [Tim Hopper's PyData NYC 2015 talk](https://www.youtube.com/watch?v=_R66X_udxZQ)
 - [David Blei's ACM article](https://www.cs.princeton.edu/~blei/papers/Blei2012.pdf)
 
[Fast Forward Labs Report 4](http://ff04.fastforwardlabs.com/) goes into more detail on the summarization use case in particular, and also talks about the latest and greatest approach: recurrent neural networks and neural language embeddings. You may find this [PyGotham talk](https://www.youtube.com/watch?v=y7XoypvQRhY&feature=youtu.be) a useful semi-technical alternative to the report.

And our demo of [Luhn's Algorithm](http://fastforwardlabs.github.io/luhn/) is a fun and concrete example of the simplest possible thing you could do.