# Supplementary: Topic Modeling

Objectives:
- Demonstrate how to apply topic modeling to real-world data.
- Give students hands-on experience through a worked example.

Create a `data` directory and store the downloaded file (state-of-the-union.csv) in it. The data contains State of the Union addresses from 1790 to 2012.

In [1]:
import os
import wget

# Create the data directory portably; do nothing if it already exists
os.makedirs('data', exist_ok=True)
os.chdir('data')
wget.download('https://nyc3.digitaloceanspaces.com/ml-files-distro/v1/text-analysis/data/state-of-the-union.csv')

'state-of-the-union.csv'

Read the data into a pandas DataFrame.

In [2]:
import pandas as pd

df = pd.read_csv("state-of-the-union.csv")

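# Clean it up a little: replace every character that is not a letter or a space
# (digits, punctuation, underscores, newlines) with a space.
# regex=True is required in pandas 2.0+, where str.replace defaults to literal matching.
df.content = df.content.str.replace("[^A-Za-z ]", " ", regex=True)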

df.head()

Unnamed: 0,year,content
0,1790,"George Washington\nJanuary 8, 1790\n\nFellow-C..."
1,1790,\nState of the Union Address\nGeorge Washingto...
2,1791,\nState of the Union Address\nGeorge Washingto...
3,1792,\nState of the Union Address\nGeorge Washingto...
4,1793,\nState of the Union Address\nGeorge Washingto...
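To see what the cleaning pattern does, here is a minimal sketch that applies the same regular expression to a single string with Python's built-in `re` module. The sample string is the visible start of the first row above; the variable names are illustrative only.

```python
import re

# Same pattern as in the cleaning cell: anything that is not an ASCII letter or a space
pattern = re.compile(r"[^A-Za-z ]")

sample = "George Washington\nJanuary 8, 1790"  # visible start of the first row
cleaned = pattern.sub(" ", sample)

# Each digit, punctuation mark, and newline becomes one space, so runs of
# spaces remain; split() shows the words that survive.
print(cleaned.split())
```

Because the pattern keeps only `A-Z`, `a-z`, and spaces, accented letters would be replaced as well.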


In [3]:
df.shape

(226, 2)

Using Gensim to perform topic modeling

In [4]:
# Run this cell if gensim has not been installed yet.
# !pip install gensim

Apply `simple_preprocess` to convert each document into a list of tokens. The input is lowercased and tokenized; de-accenting is optional and off by default.



In [6]:
from gensim.utils import simple_preprocess

texts = df.content.apply(simple_preprocess)
texts

0      [george, washington, january, fellow, citizens...
1      [state, of, the, union, address, george, washi...
2      [state, of, the, union, address, george, washi...
3      [state, of, the, union, address, george, washi...
4      [state, of, the, union, address, george, washi...
                             ...                        
221    [state, of, the, union, address, george, bush,...
222    [address, to, joint, session, of, congress, ba...
223    [state, of, the, union, address, barack, obama...
224    [state, of, the, union, address, barack, obama...
225    [state, of, the, union, address, barack, obama...
Name: content, Length: 226, dtype: object
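As a rough illustration of the preprocessing step, the sketch below approximates it in plain Python. `rough_preprocess` is a hypothetical stand-in, not gensim's implementation; it assumes the default behaviour of keeping only tokens between 2 and 15 characters long.

```python
import re

def rough_preprocess(text, min_len=2, max_len=15):
    """Hypothetical stand-in for simple_preprocess: lowercase the text,
    split it into alphabetic tokens, and drop tokens that are too short
    or too long."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if min_len <= len(t) <= max_len]

tokens = rough_preprocess("State of the Union Address: George Washington")
print(tokens)
```

Single-letter tokens such as "a" are dropped by the length filter, while two-letter words such as "of" are kept.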

Create a dictionary, using the texts that have already been preprocessed.

The `doc2bow` method converts a document (a list of words) into the bag-of-words format: a list of `(token_id, count)` pairs.

In [None]:
from gensim import corpora

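# Map each unique token to an integer id
dictionary = corpora.Dictionary(texts)
# Drop tokens found in fewer than 5 documents or in more than 50% of them,
# then keep at most the 2000 most frequent remaining tokens
dictionary.filter_extremes(no_below=5, no_above=0.5, keep_n=2000)
# Convert each document into a list of (token_id, count) pairs
corpus = [dictionary.doc2bow(text) for text in texts]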
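The bag-of-words idea can be sketched without gensim. The example below is an illustration under simple assumptions: the two toy documents are invented, `token2id` and `to_bow` are hypothetical names, and ids are assigned in order of first appearance, which is not necessarily how gensim numbers tokens.

```python
from collections import Counter

# Two toy documents, invented for illustration
docs = [
    ["state", "of", "the", "union"],
    ["the", "union", "and", "the", "states"],
]

# Assign each distinct token an integer id in order of first appearance
token2id = {}
for doc in docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

def to_bow(doc):
    """Return sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

bow = [to_bow(doc) for doc in docs]
print(bow[1])
```

Word order is discarded: each document is reduced to which tokens it contains and how often.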

Specify the number of topics manually. Let's try 2 topics first.

In [None]:
from gensim import models

n_topics = 2

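# Pass id2word so that print_topics() shows words rather than integer token ids
lda_model = models.LdaModel(corpus=corpus, id2word=dictionary, num_topics=n_topics)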
lda_model.print_topics()

Let's try 5 topics.

In [None]:
n_topics = 5

lda_model = models.LdaModel(corpus=corpus, id2word=dictionary, num_topics=n_topics)
lda_model.print_topics()

Let's try 15 topics.

In [None]:
n_topics = 15

lda_model = models.LdaModel(corpus=corpus, id2word=dictionary, num_topics=n_topics)
lda_model.print_topics()

In [None]:
# Run this cell if pyLDAvis has never been installed
# !pip install pyLDAvis

In [None]:
import pyLDAvis
# pyLDAvis 3.x renamed this module from pyLDAvis.gensim to pyLDAvis.gensim_models
import pyLDAvis.gensim_models

pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim_models.prepare(lda_model, corpus, dictionary)
vis