# Chapter 4: Latent Dirichlet Allocation (LDA)

## Instructions

- Run the cells with "assert" statements to check whether your output matches the expected output. If a cell runs without error, your answer matches! If your output differs, you'll get a hint.

In this notebook, you'll get to practice topic modeling with LDA.


In [1]:
import pandas as pd
from sklearn import datasets
from sklearn.feature_extraction.text import CountVectorizer

from gensim import corpora, models, matutils
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

_A few notes:_

_You can first run the line of code below if you'd like to suppress a warning that appears when importing gensim._

_Additionally, if you get an error about numpy or cannot run the code above, the likely cause is that gensim needs a newer version of numpy than the one installed on your computer. Run the code in the second cell below and restart your kernel; this removes the older version of numpy and installs the latest one. Both cells are commented out by default._

In [2]:
#!pip install python-Levenshtein

In [3]:
#!pip uninstall numpy -y
#!pip install numpy

We are going to use data from the well-known 20 Newsgroups dataset, specifically the posts from the `rec.motorcycles` group.

In [4]:
news = datasets.fetch_20newsgroups(subset='train', categories=['rec.motorcycles'], remove=('headers', 'footers', 'quotes'))
news.data[0]

"Now, I am jumping into the middle of this thread so I may not know\nwhat y'all been talking about, but I have a few comments:\n\n\nThere are a number of other factors that are very important, the three\nbiggest being air velocity, air momentum and shock waves.\nVelocity stacks have been used for years and are now being used inside\nof stock airboxes on a number of bikes.  At a tuned engine rpm, the\nstacks can greatly increase the speed, and thus momentum of the air\nrushing in.\nAir momentum is critical in getting good air intake: the momentum of\nthe air stack outside the combustion chamber will force its way inside\nlong after the piston has begun its compressive up-stroke.\nShock waves are used to induce air intake and to prevent fresh air from\nescaping out the exzhaust ports.  Shock waves are the product of expansion\nchambers or any other means of presenting a 'wall' (opening or closing)\nto the air in motion.  Beyond this I am lost in the mystery of how they\ndesign for shock 

Before topic modeling, the first step is to get your text data into a format that is ready for modeling. Use `CountVectorizer` with English stop words and both unigrams and bigrams to turn the corpus into a document-term matrix. Save the matrix as `doc_term`.
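To make the "unigrams and bigrams" setting concrete, here is a minimal hand-rolled sketch of the idea on a single sentence. It is not scikit-learn's implementation: the function name `unigrams_and_bigrams` and the tiny `STOP_WORDS` set are made up for illustration, and the tokenizing regex is an assumption modeled on `CountVectorizer`'s default of lowercased tokens with two or more word characters.

```python
import re
from collections import Counter

# Tiny illustrative stop list (an assumption; scikit-learn's English list is much longer).
STOP_WORDS = {"the", "a", "is", "on", "of", "and", "my"}

def unigrams_and_bigrams(text, stop_words=STOP_WORDS):
    """Return the unigram and bigram terms for one document."""
    # Lowercase, then keep tokens made of two or more word characters.
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    # Drop stop words first, so bigrams join the words that remain.
    kept = [t for t in tokens if t not in stop_words]
    bigrams = [f"{a} {b}" for a, b in zip(kept, kept[1:])]
    return kept + bigrams

terms = unigrams_and_bigrams("The helmet is on the bike")
term_counts = Counter(terms)  # one row of a document-term matrix, as a dict
print(terms)
```

Because stop words are removed before bigrams are formed, a bigram such as `helmet bike` can join two words that were not adjacent in the original text. That is worth keeping in mind when reading odd-looking bigram columns in the matrix.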

In [5]:
### BEGIN SOLUTION
vectorizer = CountVectorizer(stop_words='english', ngram_range=(1, 2))
doc_term = vectorizer.fit_transform(news.data)
### END SOLUTION
doc_term.shape

(598, 34997)

In [6]:
### CHECK YOUR OUTPUT WITH THE ANSWER

### BEGIN HIDDEN TESTS
assert doc_term.shape == (598, 34997), "The doc_term matrix should have 598 documents and 34997 terms."
### END HIDDEN TESTS

Turn the `doc_term` matrix into a dataframe. Modify the code below to do so and save the output as `doc_term_df`.

```
pd.DataFrame(INSERT_VALUE_HERE.toarray(), columns=vectorizer.get_feature_names_out())
```

In [7]:
### BEGIN SOLUTION
# get_feature_names_out() needs scikit-learn >= 1.0; older versions use get_feature_names()
doc_term_df = pd.DataFrame(doc_term.toarray(), columns=vectorizer.get_feature_names_out())
### END SOLUTION
doc_term_df



,00,00 23,00 42,00 battery,00 best,00 clothing,00 evening,00 pm,00 wasn,000,...,zx cbr,zx engine,zx spend,zx tried,zx900,zx900 payload,zx900a,zx900a supertrapp,zygot,zygot ati
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
593,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
594,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
595,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
596,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [8]:
### CHECK YOUR OUTPUT WITH THE ANSWER

### BEGIN HIDDEN TESTS
assert isinstance(doc_term_df, pd.DataFrame), "The output should be a dataframe."
assert doc_term_df.shape == (598, 34997), "The output should have 598 documents and 34997 terms."
### END HIDDEN TESTS

Fit an LDA model using `LdaModel` with three topics. Set the `passes` hyperparameter to 10 so that the corpus will be scanned 10 times. Save the fitted model as `lda`.

NOTE: This may take a few minutes to run. Take a look at the log while you're waiting.
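gensim's `LdaModel` does not take the scikit-learn matrix directly. It expects a corpus in which each document is a list of `(term_id, count)` pairs, plus a mapping from term IDs back to words. The sketch below builds both by hand for a toy matrix, only to show the shape of the data. The names `vocabulary`, `doc_term_rows`, and `bow_corpus` are illustrative, and this is not how gensim builds the corpus internally.

```python
# A toy document-term matrix: 2 documents (rows) x 3 terms (columns).
vocabulary = {"bike": 0, "helmet": 1, "ride": 2}  # term -> column index
doc_term_rows = [
    [2, 0, 1],  # document 0: "bike" twice, "ride" once
    [0, 1, 0],  # document 1: "helmet" once
]

# Invert the vocabulary so that column indices map back to terms.
id2word = {idx: term for term, idx in vocabulary.items()}

# Bag-of-words form: one list of (term_id, count) pairs per document, zeros omitted.
bow_corpus = [
    [(term_id, count) for term_id, count in enumerate(row) if count > 0]
    for row in doc_term_rows
]

for doc in bow_corpus:
    print([(id2word[term_id], count) for term_id, count in doc])
```

In the notebook itself, `matutils.Sparse2Corpus` produces this kind of structure from a sparse matrix. By default it treats each column as a document, which is why a document-term matrix is transposed before being passed in.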

In [9]:
### BEGIN SOLUTION
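# Sparse2Corpus expects documents as columns, so transpose to a term-document matrix
term_doc = doc_term.transpose()
corpus = matutils.Sparse2Corpus(term_doc)
# Map each column index back to its term so that topics print as words
id2word = {idx: term for term, idx in vectorizer.vocabulary_.items()}
lda = models.LdaModel(corpus=corpus, num_topics=3, id2word=id2word, passes=10)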
### END SOLUTION
lda

2021-11-11 17:14:53,006 : INFO : using symmetric alpha at 0.3333333333333333
2021-11-11 17:14:53,007 : INFO : using symmetric eta at 0.3333333333333333
2021-11-11 17:14:53,013 : INFO : using serial LDA version on this node
2021-11-11 17:14:53,027 : INFO : running online (multi-pass) LDA training, 3 topics, 10 passes over the supplied corpus of 598 documents, updating model once every 598 documents, evaluating perplexity every 598 documents, iterating 50x with a convergence threshold of 0.001000
2021-11-11 17:14:53,769 : INFO : -11.654 per-word bound, 3222.1 perplexity estimate based on a held-out corpus of 598 documents with 61523 words
2021-11-11 17:14:53,770 : INFO : PROGRESS: pass 0, at document #598/598
2021-11-11 17:14:54,151 : INFO : topic #0 (0.333): 0.004*"bike" + 0.002*"just" + 0.002*"like" + 0.002*"dod" + 0.001*"ride" + 0.001*"don" + 0.001*"time" + 0.001*"know" + 0.001*"riding" + 0.001*"good"
2021-11-11 17:14:54,153 : INFO : topic #1 (0.333): 0.003*"bike" + 0.002*"dod" + 0.00

2021-11-11 17:14:59,044 : INFO : topic diff=0.014165, rho=0.316228
2021-11-11 17:14:59,466 : INFO : -10.263 per-word bound, 1228.4 perplexity estimate based on a held-out corpus of 598 documents with 61523 words
2021-11-11 17:14:59,467 : INFO : PROGRESS: pass 9, at document #598/598
2021-11-11 17:14:59,587 : INFO : topic #0 (0.333): 0.004*"bike" + 0.002*"just" + 0.002*"like" + 0.001*"good" + 0.001*"don" + 0.001*"got" + 0.001*"helmet" + 0.001*"know" + 0.001*"way" + 0.001*"dod"
2021-11-11 17:14:59,588 : INFO : topic #1 (0.333): 0.003*"bike" + 0.002*"like" + 0.002*"just" + 0.002*"know" + 0.002*"motorcycle" + 0.001*"bikes" + 0.001*"right" + 0.001*"bmw" + 0.001*"don" + 0.001*"think"
2021-11-11 17:14:59,589 : INFO : topic #2 (0.333): 0.004*"dod" + 0.003*"bike" + 0.002*"like" + 0.002*"just" + 0.002*"ride" + 0.002*"don" + 0.002*"know" + 0.001*"dog" + 0.001*"time" + 0.001*"ve"
2021-11-11 17:14:59,590 : INFO : topic diff=0.009487, rho=0.301511
2021-11-11 17:14:59,594 : INFO : LdaModel lifecycle 

<gensim.models.ldamodel.LdaModel at 0x11b2bd580>

In [10]:
### CHECK YOUR OUTPUT WITH THE ANSWER

### BEGIN HIDDEN TESTS
assert lda.num_topics == 3 and lda.passes == 10, "The output should be a fitted LDA model with three topics and 10 passes."
### END HIDDEN TESTS
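When reading the training log, note that each "per-word bound" line also reports a perplexity estimate. The two numbers are related by perplexity = 2^(-bound), so a bound that moves toward zero means lower perplexity. The helper below is a small sketch for checking that relationship against the log; the function name `perplexity_from_bound` is made up for illustration.

```python
def perplexity_from_bound(per_word_bound):
    """Convert a logged per-word bound (base 2) into a perplexity estimate."""
    return 2 ** (-per_word_bound)

# Bounds copied from the pass 0 and pass 9 lines of the log above.
first_pass = perplexity_from_bound(-11.654)
last_pass = perplexity_from_bound(-10.263)

print(f"pass 0: {first_pass:.1f}, pass 9: {last_pass:.1f}")
```

The results land close to the 3222.1 and 1228.4 shown in the log; small differences come from the bound being rounded to three decimals before it is printed. The drop between the first and last pass is the main thing to watch for while training runs.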

Let's print the top words of each of the 3 topics.

In [11]:
lda.print_topics();

2021-11-11 17:14:59,631 : INFO : topic #0 (0.333): 0.004*"bike" + 0.002*"just" + 0.002*"like" + 0.001*"good" + 0.001*"don" + 0.001*"got" + 0.001*"helmet" + 0.001*"know" + 0.001*"way" + 0.001*"dod"
2021-11-11 17:14:59,633 : INFO : topic #1 (0.333): 0.003*"bike" + 0.002*"like" + 0.002*"just" + 0.002*"know" + 0.002*"motorcycle" + 0.001*"bikes" + 0.001*"right" + 0.001*"bmw" + 0.001*"don" + 0.001*"think"
2021-11-11 17:14:59,637 : INFO : topic #2 (0.333): 0.004*"dod" + 0.003*"bike" + 0.002*"like" + 0.002*"just" + 0.002*"ride" + 0.002*"don" + 0.002*"know" + 0.001*"dog" + 0.001*"time" + 0.001*"ve"


Take a look at the top words in each of these topics. Each time you run the LDA model, the results will be slightly different because the topic assignments are initialized randomly. (Passing a fixed `random_state` to `LdaModel` makes a run reproducible.)

This is the interpretation of one set of results. Yours will likely be different.
* Topic 1: certain brands of bikes
* Topic 2: new bikes and helmets
* Topic 3: good times riding bikes

The results may look fuzzy though, so to clean them up, you have several options:
* Increase the number of passes to get more stable results.
* Change the number of topics.
* Clean up the text more in the CountVectorizer step, such as adding to the stop word list, removing common words, etc.
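As a sketch of the third option, the snippet below removes a custom list of stop words and any term that appears in more than half of the documents. It is a hand-rolled illustration on pre-tokenized toy documents, not scikit-learn code: `filter_common_terms` and its parameters are hypothetical names. In `CountVectorizer`, the corresponding controls are the `stop_words` and `max_df` parameters.

```python
from collections import Counter

def filter_common_terms(tokenized_docs, extra_stop_words=(), max_doc_fraction=0.5):
    """Drop extra stop words and terms found in more than max_doc_fraction of documents."""
    n_docs = len(tokenized_docs)
    # Document frequency: the number of documents each term appears in.
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    too_common = {t for t, df in doc_freq.items() if df / n_docs > max_doc_fraction}
    drop = too_common | set(extra_stop_words)
    return [[t for t in doc if t not in drop] for doc in tokenized_docs]

# Toy documents using words that dominate the topics printed above.
docs = [
    ["bike", "just", "helmet"],
    ["bike", "like", "bmw"],
    ["bike", "dod", "ride"],
    ["dog", "dod", "ride"],
]
cleaned = filter_common_terms(docs, extra_stop_words={"just", "like", "dod"})
print(cleaned)
```

Words such as "bike", "just", and "like" appear near the top of every topic in the output above, so they do little to tell the topics apart. Filtering them out gives the more distinctive words room to surface.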

Spend a few minutes doing at least one of these things to make your model better before moving on.

Once you are satisfied with your final topics, your task is to figure out which topics are in each document. Transform the original `doc_term` matrix into a document-topic matrix and save it as `doc_topic`.

In [12]:
### BEGIN SOLUTION
doc_topic = list(lda[corpus])
### END SOLUTION
doc_topic[0:5]

[[(1, 0.99593365)],
 [(0, 0.030585783), (1, 0.028712576), (2, 0.9407016)],
 [(0, 0.33333334), (1, 0.33333334), (2, 0.33333334)],
 [(2, 0.9843929)],
 [(1, 0.9939523)]]

In [13]:
### CHECK YOUR OUTPUT WITH THE ANSWER

### BEGIN HIDDEN TESTS
assert len(doc_topic) == 598, "The doc_topic matrix should have 598 items."
### END HIDDEN TESTS

Let's take a look at a document and see if the topic distribution makes sense.

Each document is represented as a list of tuples, one per topic found in it. The first value of a tuple is the topic ID, and the second is the proportion of the document (between 0 and 1) attributed to that topic. By default, a topic that accounts for less than 1% of a document is left out of the list.

Display the topic distribution for the 0th document and save it as `doc_0_topics`.
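Because low-weight topics are omitted, the lists can have different lengths from one document to the next. The sketch below shows one way to expand such a list into a fixed-length row and to pick the highest-weighted topic. The helper names `to_dense` and `dominant_topic` are made up for illustration, and the sample values are copied from the output above.

```python
def to_dense(topic_pairs, num_topics):
    """Expand a list of (topic_id, weight) pairs into a fixed-length list."""
    dense = [0.0] * num_topics  # omitted topics are treated as weight 0
    for topic_id, weight in topic_pairs:
        dense[topic_id] = weight
    return dense

def dominant_topic(topic_pairs):
    """Return the ID of the topic with the largest weight."""
    return max(topic_pairs, key=lambda pair: pair[1])[0]

# Three of the documents shown in the doc_topic[0:5] output above.
doc_topic_sample = [
    [(1, 0.99593365)],
    [(0, 0.030585783), (1, 0.028712576), (2, 0.9407016)],
    [(2, 0.9843929)],
]

dense_rows = [to_dense(doc, num_topics=3) for doc in doc_topic_sample]
labels = [dominant_topic(doc) for doc in doc_topic_sample]
print(labels)
```

Treating omitted topics as zero is an approximation, since their true weight is small but not exactly zero. It is usually good enough for inspecting which topic dominates a document.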

In [14]:
### BEGIN SOLUTION
doc_0_topics = doc_topic[0]
### END SOLUTION
doc_0_topics

[(1, 0.99593365)]

In [15]:
### CHECK YOUR OUTPUT WITH THE ANSWER

### BEGIN HIDDEN TESTS
assert isinstance(doc_0_topics, list), "The doc_0_topics variable should be a list."
### END HIDDEN TESTS

Take a look at the full text for the document. Do the topic assignments make sense?

In [16]:
news.data[0]

"Now, I am jumping into the middle of this thread so I may not know\nwhat y'all been talking about, but I have a few comments:\n\n\nThere are a number of other factors that are very important, the three\nbiggest being air velocity, air momentum and shock waves.\nVelocity stacks have been used for years and are now being used inside\nof stock airboxes on a number of bikes.  At a tuned engine rpm, the\nstacks can greatly increase the speed, and thus momentum of the air\nrushing in.\nAir momentum is critical in getting good air intake: the momentum of\nthe air stack outside the combustion chamber will force its way inside\nlong after the piston has begun its compressive up-stroke.\nShock waves are used to induce air intake and to prevent fresh air from\nescaping out the exzhaust ports.  Shock waves are the product of expansion\nchambers or any other means of presenting a 'wall' (opening or closing)\nto the air in motion.  Beyond this I am lost in the mystery of how they\ndesign for shock 

If it makes some sense, great! If it doesn't, that indicates that you'll need to further tune your topic model.