LDA Topic Modeling

Eric Fan 
9/26/2021

# Prepare libraries

In [None]:
! pip install pandas --upgrade
! pip install NumPy --upgrade
! pip install pyLDAvis --upgrade
! pip install gensim --upgrade
! pip install spacy --upgrade
! python -m spacy download en_core_web_sm

In [151]:
import csv
import pprint
from gensim import corpora
from gensim import utils
from gensim.parsing.preprocessing import remove_stopwords, preprocess_string
from gensim import models
from gensim.models.coherencemodel import CoherenceModel
# from gensim.parsing.porter import PorterStemmer
import spacy
import pyLDAvis.gensim_models 
import pickle 
import pyLDAvis
import random
import sys

# Part 1: AP Stories

## Load documents

In [152]:
## import documents ##
documents = {}

with open('/content/ap.csv', mode='r') as inp:
    reader = csv.reader(inp)
    documents = {rows[0]:rows[1] for rows in reader}

ap_documents_list = list(documents.values())

## Pre-processing

In [201]:
# lemmatize using spacy
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
ap_documents_list_lammatized = [ " ".join([token.lemma_ for token in nlp(doc)]) for doc in ap_documents_list]

# printing out first story as example
print(ap_documents_list_lammatized[0])

a 16 - year - old student at a private Baptist school who allegedly kill one teacher and wound another before fire into a fill classroom apparently ` ` just snap , '' the school 's pastor say . ` ` I do n't know how it could have happen , '' say George Sweet , pastor of Atlantic Shores Baptist Church . ` ` this be a good , christian school . we pride ourselves on discipline . our kid be good kid . '' the Atlantic Shores Christian School sophomore be arrest and charge with first - degree murder , attempt murder , malicious assault and related felony charge for the Friday morning shooting . Police would not release the boy 's name because he be a juvenile , but neighbor and relative identify he as Nicholas Elliott . Police say the student be tackle by a teacher and other student when his semiautomatic pistol jam as he fire on the classroom as the student cower on the floor cry ` ` Jesus save we ! God save we ! '' friend and family say the boy apparently be trouble by his grandmother 's d

In [203]:
# tokenize and remove stop words
ap_words = [utils.simple_preprocess(item) for item in ap_documents_list_lammatized]
ap_wordsToRemove = ['pron', '']
ap_words = [[remove_stopwords(word) for word in lst if remove_stopwords(word) not in ap_wordsToRemove] for lst in ap_words]

# printing out the first 20 words of the first story as example
pprint.pprint(ap_words[0][0:20])

# generate gensim dictionary 
ap_dictionary = corpora.Dictionary(ap_words)

# generate gensim bag-of-words vectors
ap_bow_corpus = [ap_dictionary.doc2bow(text) for text in ap_words]

['year',
 'old',
 'student',
 'private',
 'baptist',
 'school',
 'allegedly',
 'kill',
 'teacher',
 'wound',
 'classroom',
 'apparently',
 'snap',
 'school',
 'pastor',
 'know',
 'happen',
 'george',
 'sweet',
 'pastor']


## Generate LDA model

To determine the number of topics the LDA model should generate, I use a combination of coherence score and manual experimentation. Here I use the c_v coherence score, which consists of "a sliding window, a one-set segmentation of the top words and an indirect confirmation measure" that uses normalized pointwise mutual information (NPMI) and cosine similarity. 

The c_v score retrieves cooccurrence counts for the given words using a sliding window. The counts are used to calculate the NPMI of every top word to every other top word, resulting in a set of vectors. The one-set segmentation of the top words leads to the calculation of the similarity between every top word vector and the sum of all top word vectors. The coherence score is the arithmetic mean of these similarities. 
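To make the steps above concrete, here is a minimal pure-Python sketch of a simplified c_v-style score. It is an illustration only, not gensim's implementation: the helper names (`window_sets`, `npmi`, `cosine`, `simplified_cv`), the toy documents, and the window size of 4 are all assumptions made for this example, and the real measure includes details (such as smoothing and a much larger window) that are left out here.

```python
import math

def window_sets(tokens, size):
    """Slide a fixed-size window over one document and return each window as a set."""
    if len(tokens) <= size:
        return [set(tokens)]
    return [set(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]

def npmi(w1, w2, windows):
    """NPMI of two words, with probabilities estimated as fractions of windows."""
    n = len(windows)
    p1 = sum(w1 in w for w in windows) / n
    p2 = sum(w2 in w for w in windows) / n
    p12 = sum((w1 in w) and (w2 in w) for w in windows) / n
    if p12 == 0:
        return -1.0  # the words never co-occur (limiting value of NPMI)
    if p12 == 1:
        return 0.0   # NPMI is 0/0 here; treated as 0 in this sketch
    return math.log(p12 / (p1 * p2)) / -math.log(p12)

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def simplified_cv(top_words, windows):
    """Mean cosine similarity between each word's NPMI vector and the sum of all vectors."""
    vectors = [[npmi(w, other, windows) for other in top_words] for w in top_words]
    total = [sum(col) for col in zip(*vectors)]
    return sum(cosine(v, total) for v in vectors) / len(vectors)

# Toy corpus (hypothetical): two "school" documents and two "Iraq/Iran" documents
docs = [
    ["school", "teacher", "student", "classroom"],
    ["teacher", "student", "school", "pastor"],
    ["iraq", "iran", "attack", "force"],
    ["iran", "iraq", "force", "report"],
]
windows = [w for doc in docs for w in window_sets(doc, size=4)]

coherent = simplified_cv(["school", "teacher", "student"], windows)
mixed = simplified_cv(["school", "iraq", "student"], windows)
print(f"coherent topic: {coherent:.3f}, mixed topic: {mixed:.3f}")
```

In this toy setup, the topic whose top words co-occur in the same documents scores higher than the topic that mixes words from unrelated documents, which is the behavior the coherence score is meant to capture.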

(Many different coherence measures exist. I chose the c_v measure because research found that it tracks closely with human judgments of coherence. It is also one of the most popular measures. I am primarily relying on this research paper: https://svn.aksw.org/papers/2015/WSDM_Topic_Evaluation/public.pdf)

I also relied on some other sources: 
* https://towardsdatascience.com/evaluate-topic-model-in-python-latent-dirichlet-allocation-lda-7d57484bb5d0 
* https://developer.squareup.com/blog/topic-modeling-optimizing-for-human-interpretability/
* https://www.youtube.com/watch?v=UkmIljRIG_M


Below, I run LDA models with 10, 20, 30, ..., 200 topics and calculate their coherence scores. Note that each LDA run involves some randomness, so I set `random_state=2021` for reproducibility.

In [205]:
# compute coherence scores
scores = {}

for i in range(20):
  num_topics = 10+i*10
  ap_model = models.LdaModel(ap_bow_corpus, id2word=ap_dictionary, num_topics = num_topics, random_state=2021) # set seed=2021 for reproducibility
  cm = CoherenceModel(ap_model, texts=ap_words, corpus=ap_bow_corpus, dictionary=ap_dictionary, coherence='c_v')
  score = cm.get_coherence()
  scores[num_topics] = score
  print(f'{num_topics}: {score}')

## optimal number of topics: ~110

10: 0.34680436710212603
20: 0.35866206144564355
30: 0.3596870892607408
40: 0.33724455832795786
50: 0.3586648957123019
60: 0.36038911355767084
70: 0.3604646528145694
80: 0.3498792206426544
90: 0.36587780387436913
100: 0.3646391373629878
110: 0.37605928810880845
120: 0.3716048719496098
130: 0.35958518744095
140: 0.3609688826064688
150: 0.36226840621376744
160: 0.3614273990771643
170: 0.35708623556482344
180: 0.36322068918828404
190: 0.37030850795037246
200: 0.3648977769120733


From researching online, I understood that an improvement greater than 0.01 is considered significant, and that, as the number of topics increases, the coherence score eventually levels off. As shown above, the c_v score peaked at 110 topics and did not improve beyond that. I then ran LDA models with 30, 60, 110, and 200 topics and manually inspected the topics they produced. There was a significant improvement in interpretability from 30 to 110, but no significant improvement from 110 to 200, so I decided to go with 110 topics.
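As a quick sanity check on the scores printed above, the sketch below picks out the best-scoring topic count and every count within 0.01 of it. The scores are copied from the output and rounded to four decimals. The 0.01 margin reuses the rule of thumb mentioned above, and the variable names are only illustrative.

```python
# c_v scores from the run above, rounded to four decimals
scores = {
    10: 0.3468, 20: 0.3587, 30: 0.3597, 40: 0.3372, 50: 0.3587,
    60: 0.3604, 70: 0.3605, 80: 0.3499, 90: 0.3659, 100: 0.3646,
    110: 0.3761, 120: 0.3716, 130: 0.3596, 140: 0.3610, 150: 0.3623,
    160: 0.3614, 170: 0.3571, 180: 0.3632, 190: 0.3703, 200: 0.3649,
}

# topic count with the highest coherence
best = max(scores, key=scores.get)

# topic counts whose score is within 0.01 of the best (the rule-of-thumb margin)
margin = 0.01
near_best = sorted(k for k, s in scores.items() if s >= scores[best] - margin)

print(f"best: {best} topics (c_v = {scores[best]})")
print(f"within {margin} of best: {near_best}")
```

Several topic counts land within the margin of the best score, which is why the manual inspection of the topics was still needed to settle on a number.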

Now I generate a model with my desired number of topics. Again, I set a fixed seed for this exercise.

In [222]:
## build LDA model ##
ap_model = models.LdaModel(ap_bow_corpus, id2word=ap_dictionary, num_topics=110, random_state=2021) # set seed=2021 for reproducibility

We can print out all the topics and their most significant words.

In [223]:
pprint.pprint(ap_model.print_topics(num_topics=-1))

[(0,
  '0.029*"book" + 0.024*"novel" + 0.019*"circle" + 0.018*"fiction" + '
  '0.015*"author" + 0.012*"write" + 0.007*"hirohito" + 0.007*"emperor" + '
  '0.006*"year" + 0.005*"national"'),
 (1,
  '0.012*"year" + 0.006*"black" + 0.006*"people" + 0.005*"family" + '
  '0.005*"life" + 0.004*"woman" + 0.004*"school" + 0.004*"victim" + '
  '0.004*"state" + 0.004*"mrs"'),
 (2,
  '0.042*"study" + 0.017*"harvard" + 0.014*"percent" + 0.009*"dea" + '
  '0.009*"carnegie" + 0.009*"year" + 0.007*"cell" + 0.006*"university" + '
  '0.006*"cause" + 0.005*"power"'),
 (3,
  '0.032*"steinberg" + 0.008*"institutional" + 0.007*"ludwig" + '
  '0.005*"thursday" + 0.004*"official" + 0.003*"new" + 0.003*"school" + '
  '0.003*"harold" + 0.003*"dollar" + 0.003*"year"'),
 (4,
  '0.025*"iraq" + 0.018*"iraqi" + 0.017*"iran" + 0.007*"attack" + '
  '0.007*"earthquake" + 0.007*"aug" + 0.007*"report" + 0.007*"force" + '
  '0.006*"deadline" + 0.005*"statement"'),
 (5,
  '0.008*"engine" + 0.006*"bangkok" + 0.006*"ratify" 

## Visualize topics

Now I visualize the topics using pyLDAvis. Note that I set `sort_topics=False` to stop pyLDAvis from renumbering the topics. With the default `sort_topics=True`, pyLDAvis re-orders the topic IDs by topic proportion instead of keeping the original order from the LDA model. Even so, pyLDAvis IDs start from 1 while model IDs start from 0, so the two are still off by 1 (model #0 = pyLDAvis #1).

In [225]:
# Visualize the topics
pyLDAvis.enable_notebook()
ap_LDAvis_prepared = pyLDAvis.gensim_models.prepare(ap_model, ap_bow_corpus, ap_dictionary, sort_topics=False) # set sort_topics=False to avoid pyLDAvis reshuffling topic IDs
ap_LDAvis_prepared



In the visualization, the left panel shows topic bubbles arranged on a two-dimensional Intertopic Distance Map. The right panel is a bar plot ranking the most relevant words within the selected topic. The ranking is based on a combination of two metrics (both measured on a log scale):

1. The probability of a word appearing under the selected topic; 
2. The “lift” effect of the topic on the word. 

“Lift” is defined as the “ratio of a term’s probability within a topic to its marginal probability across the corpus.” The lift is higher when a larger proportion of a word's occurrences fall under the given topic. A high lift shows up as a high ratio of red to gray in the bar plot, and vice versa.

Now, the formula behind the overall ranking is:

`relevance(term w | topic t) = λ * log p(w | t) + (1 - λ) * log(p(w | t) / p(w))`

where p(w | t) is the probability of word w given topic t; p(w | t)/p(w) is the lift; and λ is a weight between 0 and 1. Setting λ = 1 ranks terms solely by their topic-specific probability, and setting λ = 0 ranks terms solely by their lift. After some experimentation, I decided to use λ = 0.6 for my project.
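The effect of λ can be seen in a small sketch. It follows the relevance definition in the Sievert and Shirley paper cited below, with both terms on a log scale. The probabilities are made-up numbers chosen for illustration and are not taken from the AP model.

```python
import math

def relevance(p_w_given_t, p_w, lam):
    """Weighted sum of log topic probability and log lift."""
    return lam * math.log(p_w_given_t) + (1 - lam) * math.log(p_w_given_t / p_w)

# Hypothetical (p(w | t), p(w)) pairs for three terms in one topic
terms = {
    "year": (0.020, 0.015),       # common in the topic, but common everywhere
    "opera": (0.013, 0.0005),     # fairly common in the topic, rare elsewhere
    "portrayal": (0.006, 0.0002), # rarer in the topic, but highest lift
}

def rank(terms, lam):
    """Return the terms sorted from most to least relevant for a given weight."""
    return sorted(terms, key=lambda w: relevance(*terms[w], lam), reverse=True)

for lam in (1.0, 0.6, 0.0):
    print(lam, rank(terms, lam))
```

With these numbers, λ = 1 puts the generic word "year" first, λ = 0 puts the highest-lift word first, and an intermediate weight favors a word that is both reasonably probable and specific to the topic.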

Source: Sievert, Carson and Kenneth E. Shirley. “LDAvis: A method for visualizing and interpreting topics.” 2014. https://nlp.stanford.edu/events/illvi2014/papers/sievert-illvi2014.pdf.


## Interpret 10 random topics

In [224]:
# generate 10 random integers from 0 to 109
random.seed(2021)
random10 = random.sample(range(110), 10)
print(f'Topic IDs selected: {random10}')

# let's see the 10 random topics we selected
for id in random10:
  pprint.pprint(ap_model.print_topic(id, topn=20))

Topic IDs selected: [107, 51, 109, 80, 69, 35, 31, 81, 4, 56]
('0.013*"opera" + 0.011*"hughes" + 0.010*"met" + 0.008*"metropolitan" + '
 '0.008*"character" + 0.007*"year" + 0.007*"labor" + 0.006*"secretary" + '
 '0.006*"portrayal" + 0.006*"fictional" + 0.005*"bird" + 0.005*"thatcher" + '
 '0.005*"grand" + 0.005*"rogue" + 0.005*"mrs" + 0.005*"quit" + 0.004*"plea" + '
 '0.004*"general" + 0.004*"new" + 0.004*"percent"')
('0.026*"german" + 0.021*"east" + 0.017*"germany" + 0.016*"billion" + '
 '0.011*"west" + 0.010*"year" + 0.010*"adapt" + 0.008*"unification" + '
 '0.008*"hopkins" + 0.008*"maiziere" + 0.007*"export" + 0.007*"official" + '
 '0.006*"united" + 0.006*"report" + 0.006*"socialism" + 0.005*"party" + '
 '0.005*"wheelchair" + 0.005*"sanitation" + 0.004*"plot" + 0.004*"fiscal"')
('0.010*"argentina" + 0.008*"cent" + 0.007*"elderly" + 0.006*"impending" + '
 '0.005*"social" + 0.005*"new" + 0.005*"persist" + 0.005*"list" + '
 '0.005*"issue" + 0.005*"bluff" + 0.005*"stock" + 0.004*"court"

Here, I loop through the 10 random topics and all of the roughly 2,250 documents to find which documents are dominated by these topics. I set a cutoff probability of 0.5.

In [217]:
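# look for documents that are dominated by each of the selected topics
# compute the topic distribution of every document once instead of once per lookup
ap_doc_topics = list(ap_model.get_document_topics(ap_bow_corpus, minimum_probability=0))

for topic_id in random10: # loop through my 10 random topics
  for i, doc_topics in enumerate(ap_doc_topics): # loop through all documents
    prob = dict(doc_topics).get(topic_id, 0.0) # 0.0 if the topic is missing for this document
    if prob > 0.5:
      print(f'Topic {topic_id}, Document {i+1}, Likelihood {prob}')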

Topic 107, Document 599, Likelihood 0.5824383497238159
Topic 107, Document 606, Likelihood 0.5821968913078308
Topic 107, Document 1277, Likelihood 0.52851802110672
Topic 107, Document 1877, Likelihood 0.6887549757957458
Topic 107, Document 2170, Likelihood 0.5210728049278259
Topic 51, Document 359, Likelihood 0.5224511027336121
Topic 51, Document 383, Likelihood 0.6119205951690674
Topic 51, Document 1313, Likelihood 0.672356367111206
Topic 51, Document 1679, Likelihood 0.6041008830070496
Topic 51, Document 1703, Likelihood 0.5608970522880554
Topic 51, Document 1728, Likelihood 0.5750467777252197
Topic 51, Document 1903, Likelihood 0.6021069288253784
Topic 51, Document 1951, Likelihood 0.6297021508216858
Topic 51, Document 2118, Likelihood 0.722801148891449
Topic 109, Document 398, Likelihood 0.5419617295265198
Topic 109, Document 831, Likelihood 0.6700919270515442
Topic 109, Document 2013, Likelihood 0.7876754403114319
Topic 80, Document 1069, Likelihood 0.7083643674850464
Topic 80, Do

Here are my interpretations and comments:


1.   Topic 107: This topic seems to be about performing arts, with key words related to the Metropolitan Opera, portraying fictional characters, and the artist Holly Hughes. I verified that documents #606 and #1277 are indeed about performing arts. However, documents #599 and #2170 had nothing to do with the arts, though both had female main characters. It would make sense that this topic is also generally related to women, as indicated by its key words "mrs" and "thatcher".
2.   Topic 51: A topic about East Germany, West Germany, German unification, and politics. The unusual key word "maiziere" most likely refers to Lothar de Maizière, East Germany's last prime minister. Related documents support this interpretation.
3.   Topic 109: This topic seems to be about Argentina, Brazil, and social issues. But the three related documents (#398, #831, and #2013) had nothing to do with Argentina or Brazil. I am confused.
4.   Topic 80: A topic about an emperor, a princess, and an imperial palace. Documents #1069 and #2240 were indeed news reports about the Japanese royal family.
5.   Topic 69: A topic about South Africa, its racial and economic issues, and apartheid. The unusual key word "pretoria" refers to Pretoria, one of South Africa's three capital cities.
6.   Topic 35: A topic about the World Chess Championship and a match between Garry Kasparov and Anatoly Karpov, as in documents #939 and #1142. This topic also seems to pick up a lot of France-related content, such as in documents #902 and #1031.
7.   Topic 31: This topic seems to be about the Supreme Court Justice David Souter and court issues. However, the four associated documents were unrelated. I am confused.
8.   Topic 81: A topic about military spending, buying planes, and military presence in Saudi Arabia (see document #744). The keyword "lafayette" refers to Lafayette Square outside the White House.
9.   Topic 4: This seems to be two distinct topics: one about earthquakes, as in documents #650, #780, #1092, #1620, #1662, and #2179, and a separate one about Iraq and Iran, as in documents #1713 and #2050.
10.   Topic 56: A topic about contracts, workers, wages, and labor. Associated documents were mostly reports about government regulating private industry and how contracts and wages changed.

Additional notes: Topics are distinct and mostly interpretable. However, there were some inconsistencies that seemed to be caused by extremely short documents such as #1042 #709 #712. These often got very strange topic assignments.
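One possible way to deal with the very short documents mentioned above is to drop them before building the dictionary. The sketch below is an assumption on my part and was not part of the run above: the function name `drop_short_documents` and the threshold of 20 tokens are hypothetical, and the original document positions are kept so that results can still be traced back to the source.

```python
def drop_short_documents(tokenized_docs, min_tokens=20):
    """Keep only documents with at least min_tokens tokens.

    Returns the original positions of the kept documents alongside the
    documents themselves, so results can be mapped back to the source.
    """
    kept_ids = []
    kept_docs = []
    for i, doc in enumerate(tokenized_docs):
        if len(doc) >= min_tokens:
            kept_ids.append(i)
            kept_docs.append(doc)
    return kept_ids, kept_docs

# Toy example (hypothetical token lists standing in for ap_words)
toy_docs = [
    ["school", "teacher", "student"] * 10,  # 30 tokens, kept
    ["stock", "market"],                    # 2 tokens, dropped
    ["iraq", "iran", "force", "attack"] * 6,  # 24 tokens, kept
]
kept_ids, kept_docs = drop_short_documents(toy_docs, min_tokens=20)
print(kept_ids)
```

The right threshold would depend on the corpus. A value that is too high would throw away legitimate short news briefs.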



# Part 2: SOTU Speeches

In [226]:
## import documents ##
documents = {}
csv.field_size_limit(sys.maxsize)
with open('/content/state-of-the-union.csv', mode='r') as inp:
    reader = csv.reader(inp)
    documents = {rows[0]:rows[1] for rows in reader}

sotu_documents_list = list(documents.values())

## Pre-processing

In [227]:
# lemmatize using spacy
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
sotu_documents_list_lammatized = [ " ".join([token.lemma_ for token in nlp(doc)]) for doc in sotu_documents_list]

# printing out first speech as example
print(sotu_documents_list_lammatized[0])


 State of the Union Address 
 George Washington 
 December 8 , 1790 

 Fellow - Citizens of the Senate and House of Representatives : 

 in meet you again I feel much satisfaction in be able to repeat my 
 congratulation on the favorable prospect which continue to distinguish 
 our public affair . the abundant fruit of another year have bless our 
 country with plenty and with the mean of a flourish commerce . 

 the progress of public credit be witness by a considerable rise of 
 american stock abroad as well as at home , and the revenue allot for 
 this and other national purpose have be productive beyond the 
 calculation by which they be regulate . this latter circumstance be the 
 more pleasing , as it be not only a proof of the fertility of our resource , 
 but as it assure we of a further increase of the national respectability 
 and credit , and , let I add , as it bear an honorable testimony to the 
 patriotism and integrity of the mercantile and marine part of our citizen . 

Here, to make my results more interesting, I decided to remove not just standard stop words, but also some common words in State of the Union speeches, such as 'America', 'government', 'nation' etc. 

I referred to this QZ article to get a basic idea about which words to remove: https://qz.com/593570/the-most-frequently-used-words-in-every-state-of-the-union-speech-ever-made/ 
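A data-driven complement to the article would be to count how many speeches each word appears in and treat the most widespread words as stop-word candidates. The sketch below shows the idea on toy data. The helper names and the 0.8 share threshold are assumptions for illustration, and this is not how the list in the next cell was built.

```python
from collections import Counter

def document_frequency(tokenized_docs):
    """Count the number of documents each word appears in (not total occurrences)."""
    df = Counter()
    for doc in tokenized_docs:
        df.update(set(doc))
    return df

def stopword_candidates(tokenized_docs, min_share=0.8):
    """Words that appear in at least min_share of all documents."""
    df = document_frequency(tokenized_docs)
    n = len(tokenized_docs)
    return sorted(w for w, count in df.items() if count / n >= min_share)

# Toy speeches (hypothetical token lists standing in for the tokenized SOTU texts)
toy_speeches = [
    ["nation", "congress", "treaty", "nation"],
    ["nation", "congress", "railroad"],
    ["nation", "congress", "energy", "energy"],
]
candidates = stopword_candidates(toy_speeches, min_share=0.8)
print(candidates)
```

Words that show up in nearly every speech carry little information about what distinguishes one topic from another, which is the reason for removing them.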

In [228]:
# tokenize and remove stop words
sotu_words = [utils.simple_preprocess(item) for item in sotu_documents_list_lammatized]
sotu_wordsToRemove = ['pron', '','united','states','year','government','country','federal','administration','congress','people','nation','power','public','senate','america','american','state','union','citizen','man','work','necessary','program','national','duty','act','right']
sotu_words = [[remove_stopwords(word) for word in lst if remove_stopwords(word) not in sotu_wordsToRemove] for lst in sotu_words]

# printing out 20 words (positions 100-119) of the first speech as example
pprint.pprint(sotu_words[0][100:120])

# generate gensim dictionary 
sotu_dictionary = corpora.Dictionary(sotu_words)

# generate gensim bag-of-words vectors
sotu_bow_corpus = [sotu_dictionary.doc2bow(text) for text in sotu_words]

['appear',
 'district',
 'kentucky',
 'present',
 'virginia',
 'concur',
 'certain',
 'proposition',
 'contain',
 'law',
 'consequence',
 'district',
 'distinct',
 'member',
 'case',
 'requisite',
 'sanction',
 'add',
 'sanction',
 'application']


## Generate LDA model

Again, I use a combination of c_v coherence score and manual experimentation to determine the optimal number of topics.

In [229]:
# compute coherence scores
scores = {}

for i in range(20):
  num_topics = 10+i
  model = models.LdaModel(sotu_bow_corpus, id2word=sotu_dictionary, num_topics = num_topics, random_state=2021) # set seed=2021 for reproducibility
  cm = CoherenceModel(model, texts=sotu_words, corpus=sotu_bow_corpus, dictionary=sotu_dictionary, coherence='c_v')
  score = cm.get_coherence()
  scores[num_topics] = score
  print(f'{num_topics}: {score}')

## optimal number of topics: ~15

10: 0.24766801935276755
11: 0.24487066078015005
12: 0.24625296921200035
13: 0.24640195800590428
14: 0.2441091731817808
15: 0.24648561408181469
16: 0.24350609293447706
17: 0.24403099666978592
18: 0.24432908592192945
19: 0.24505624692634187
20: 0.24407117355818914
21: 0.2437481840375933
22: 0.24542027157916949
23: 0.24384063135854384
24: 0.2442026446728562
25: 0.24494060027914974
26: 0.2456377797173445
27: 0.24582435124610505
28: 0.245269568093361
29: 0.24589888900221527


Unlike the AP stories, the SOTU speeches did not see significant improvements as I increased the number of topics. I also tried 100, 200, and 300 topics. None performed better than 15 topics, so I decided to go with 15 topics.

In [230]:
## build LDA model ##
sotu_model = models.LdaModel(sotu_bow_corpus, id2word=sotu_dictionary, num_topics=15, random_state=2021) # set seed=2021 for reproducibility

In [231]:
pprint.pprint(sotu_model.print_topics(num_topics=-1, num_words=20))

[(0,
  '0.005*"increase" + 0.005*"war" + 0.005*"great" + 0.005*"new" + 0.004*"time" '
  '+ 0.004*"law" + 0.004*"world" + 0.003*"present" + 0.003*"force" + '
  '0.003*"general" + 0.003*"shall" + 0.003*"need" + 0.003*"foreign" + '
  '0.003*"purpose" + 0.003*"subject" + 0.003*"good" + 0.003*"large" + '
  '0.002*"treaty" + 0.002*"condition" + 0.002*"legislation"'),
 (1,
  '0.006*"great" + 0.005*"time" + 0.005*"law" + 0.004*"good" + 0.004*"service" '
  '+ 0.003*"increase" + 0.003*"present" + 0.003*"war" + 0.003*"world" + '
  '0.003*"subject" + 0.003*"large" + 0.003*"treaty" + 0.003*"result" + '
  '0.003*"general" + 0.002*"peace" + 0.002*"department" + 0.002*"report" + '
  '0.002*"continue" + 0.002*"officer" + 0.002*"territory"'),
 (2,
  '0.007*"great" + 0.006*"law" + 0.005*"time" + 0.005*"new" + 0.005*"war" + '
  '0.004*"world" + 0.004*"present" + 0.004*"shall" + 0.003*"service" + '
  '0.003*"good" + 0.003*"peace" + 0.003*"increase" + 0.003*"general" + '
  '0.003*"end" + 0.003*"force" + 0.0

## Visualize topics

In [232]:
# Visualize the topics
pyLDAvis.enable_notebook()
sotu_LDAvis_prepared = pyLDAvis.gensim_models.prepare(sotu_model, sotu_bow_corpus, sotu_dictionary, sort_topics=False) # set sort_topics=False to avoid pyLDAvis reshuffling topic IDs
sotu_LDAvis_prepared



## Interpret 10 random topics

In [235]:
# generate 10 random integers from 0 to 14
random.seed(2021)
random10 = random.sample(range(15), 10)
print(f'Topic IDs selected: {random10}')

# let's see the 10 random topics we selected
for id in random10:
  pprint.pprint(sotu_model.print_topic(id, topn=20))

Topic IDs selected: [13, 6, 10, 8, 4, 3, 0, 7, 9, 12]
('0.005*"time" + 0.005*"great" + 0.004*"war" + 0.004*"present" + 0.004*"force" '
 '+ 0.004*"increase" + 0.004*"need" + 0.003*"general" + 0.003*"law" + '
 '0.003*"provide" + 0.003*"new" + 0.003*"good" + 0.003*"subject" + '
 '0.003*"large" + 0.003*"world" + 0.003*"peace" + 0.003*"policy" + '
 '0.003*"condition" + 0.002*"land" + 0.002*"continue"')
('0.006*"great" + 0.005*"time" + 0.004*"new" + 0.004*"present" + 0.004*"world" '
 '+ 0.004*"increase" + 0.004*"law" + 0.004*"war" + 0.003*"continue" + '
 '0.003*"peace" + 0.003*"shall" + 0.003*"subject" + 0.003*"service" + '
 '0.003*"good" + 0.003*"condition" + 0.003*"general" + 0.002*"force" + '
 '0.002*"high" + 0.002*"need" + 0.002*"provide"')
('0.007*"great" + 0.005*"law" + 0.005*"present" + 0.005*"time" + '
 '0.004*"increase" + 0.004*"war" + 0.003*"good" + 0.003*"new" + 0.003*"world" '
 '+ 0.003*"provide" + 0.003*"service" + 0.003*"shall" + 0.003*"subject" + '
 '0.003*"treasury" + 0.003*"


1.   Topic 13: Topic about force, war, land.
2.   Topic 6: Topic about how great things were, and the need to 'increase' and 'continue'
3.   Topic 10: Topic about goods and services, and the Treasury.
4.   Topic 8: Topic about the importance of the current time, and concerns about foreign relations and war.
5.   Topic 4: Topic about the President.
6.   Topic 3: Topic about war and peace, and foreign relations.
7.   Topic 0: Topic about wars and laws, particularly the need to 'increase' something.
8.   Topic 7: Topic about treaties, laws, land.
9.   Topic 9: Topic about how good and great things are, very similar to Topic 6. 
10.   Topic 12: Topic about something new and great.

Unfortunately, the topics produced by the LDA model were generic and not very helpful. I spent a lot of time tweaking various parameters, such as the number of topics, the number of iterations, the number of passes, and the minimum topic probability. I also tried removing more common words during pre-processing, such as 'great', 'new', 'law', 'time', and 'war'. Nothing worked. My guess is that because all SOTU addresses are so long and so similar to each other in structure and content, I might be better off using TF-IDF or a hierarchical model that can deal with repeated words better than simple LDA. I hope to get a chance to try TF-IDF for my final project.
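As a rough illustration of why TF-IDF might help here, the following pure-Python sketch computes a basic TF-IDF weighting on toy data. It assumes the plain `tf * log(N / df)` form. Libraries such as gensim use their own variants and normalization, so this should be read as a sketch of the idea rather than a drop-in replacement for the pipeline above.

```python
import math
from collections import Counter

def tfidf(tokenized_docs):
    """Return one {word: weight} dict per document using tf * log(N / df)."""
    n = len(tokenized_docs)
    df = Counter()
    for doc in tokenized_docs:
        df.update(set(doc))  # document frequency: count each word once per document
    weights = []
    for doc in tokenized_docs:
        tf = Counter(doc)
        weights.append({
            w: (count / len(doc)) * math.log(n / df[w])
            for w, count in tf.items()
        })
    return weights

# Toy speeches (hypothetical): 'great' appears in every speech
toy_speeches = [
    ["great", "war", "treaty"],
    ["great", "railroad", "land"],
    ["great", "energy", "energy"],
]
weights = tfidf(toy_speeches)
print(weights[2])
```

A word that appears in every speech, such as 'great' in this toy example, receives a weight of zero, while words that are concentrated in a single speech receive the largest weights.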