# Topic Modelling Presidential Inaugural Addresses

In [438]:
%matplotlib inline

import re
import pprint
import pandas as pd
import gensim
from nltk.corpus import stopwords, inaugural
import warnings

warnings.filterwarnings('ignore')
pd.options.display.max_colwidth = 50
pp = pprint.PrettyPrinter(indent=4)

# Data
The dataset we will use for topic modelling is the set of presidential inaugural addresses, from George Washington to Barack Obama (2009).

The first thing we will do is process the raw speeches and put them into a dataframe (think of a database table). The dataframe will also contain metadata about the speeches: year, historical era, and the president's name.

Our `process` function does a few key things:
* Filters out **stopwords**, which typically refers to the most common words in a language. e.g. "the", "is", "at", ...
* Filters out punctuation and numbers.
* Makes sure that no empty strings are present.
* Makes every word lowercase.

In [421]:
# this will allow us to map inaugural years to historical eras.
def historical_era(year):
    if year < 1830:
        return "Early Republic"
    elif year < 1854:
        return "Jacksonian Democracy"
    elif year < 1882:
        return "Sectional Conflict"
    elif year < 1898:
        return "Gilded Age"
    elif year < 1923:
        return "Progressive Era"
    elif year < 1962:
        return "Depression and World Conflict"
    elif year < 1990:
        return "Social Change and Soviet Relations"
    else:
        return "Globalization"
    
def process(speech):
    stoplist = set(stopwords.words())
    return [re.sub(r'--|;|:|\(|\.|\,|\)|[0-9]*',"", word) for word in speech.lower().split() if word not in stoplist]
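One thing to watch for in `process`: the stopword check runs before punctuation is stripped, so a token such as `me,` is not recognised as the stopword `me`. The following is a minimal sketch of an alternative ordering, written in Python 3 syntax. The name `process_clean_first` and the tiny `STOPLIST` are hypothetical stand-ins (the real list comes from NLTK), so treat it as an illustration rather than a drop-in replacement.

```python
import re

# Hypothetical tiny stoplist standing in for NLTK's much larger list.
STOPLIST = {"the", "is", "at", "me", "of", "and", "to"}

def process_clean_first(speech, stoplist=STOPLIST):
    """Lowercase, strip punctuation and digits, then drop stopwords and empties."""
    tokens = []
    for word in speech.lower().split():
        # Remove the same characters the original regex removes.
        cleaned = re.sub(r"--|;|:|\(|\.|,|\)|[0-9]", "", word)
        # Filter only after cleaning, so "me," is compared as "me".
        if cleaned and cleaned not in stoplist:
            tokens.append(cleaned)
    return tokens

sample = "The voice of my country called me, and the summons at 1789."
print(process_clean_first(sample))
```

Because empty strings are dropped inside the loop, the separate empty-string filter is not needed with this variant.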

In [282]:
raw_speeches = [inaugural.raw(fileid) for fileid in inaugural.fileids()]
years = [int(fileid[:4]) for fileid in inaugural.fileids()]
presidents = [fileid.split("-")[1].replace(".txt","") for fileid in inaugural.fileids()]

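# strip out stopwords, punctuation and numbers
speeches = [process(speech) for speech in raw_speeches]
# drop the empty strings left behind when a token was all punctuation or digits
speeches = [[word for word in speech if word != ""] for speech in speeches]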

In [285]:
speeches_df = pd.DataFrame(zip(raw_speeches,speeches, years, presidents), columns = ["raw_speech","speech", "year", "president"])
speeches_df["era"] = speeches_df["year"].apply(historical_era) 

# we will use this later when analyzing how topics change over the course of time.
eras = list(set(list(speeches_df["era"])))

Here, we see what our dataframe looks like. 

| raw_speech | speech | year | president | era |
| --- | --- | --- | --- | --- |
| exact speech | processed text | year of speech | speaker | historical era |


In [420]:
speeches_df.head(5)

| | raw_speech | speech | year | president | era |
| --- | --- | --- | --- | --- | --- |
| 0 | Fellow-Citizens of the Senate and of the House... | [fellow-citizens, senate, house, representativ... | 1789 | Washington | Early Republic |
| 1 | Fellow citizens, I am again called upon by the... | [fellow, citizens, called, upon, voice, countr... | 1793 | Washington | Early Republic |
| 2 | When it was first perceived, in early times, t... | [first, perceived, early, times, middle, cours... | 1797 | Adams | Early Republic |
| 3 | Friends and Fellow Citizens:\n\nCalled upon to... | [friends, fellow, citizens, called, upon, unde... | 1801 | Jefferson | Early Republic |
| 4 | Proceeding, fellow citizens, to that qualifica... | [proceeding, fellow, citizens, qualification, ... | 1805 | Jefferson | Early Republic |


## Transformations

Next, we will transform all of our speeches using a document representation called **bag-of-words**. 

In this representation, each document is represented by one vector where each vector element represents a question-answer pair, in the style of:

>“How many times does the word *protest* appear in the document? Once.”

It is advantageous to represent the questions only by their (integer) ids. The mapping between the questions and ids is called a dictionary.

After filtering out stopwords, as well as words that appear in very few or very many speeches, we are left with a vocabulary of 4934 unique tokens.
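The question-answer view above can be sketched in plain Python (Python 3 syntax). The two toy documents and the helper name `to_bow` are made up for illustration, and gensim assigns its integer ids in its own way, so only the shape of the result carries over.

```python
from collections import Counter

# Two toy "documents", already tokenized (illustrative, not real speeches).
docs = [
    ["people", "protest", "government", "people"],
    ["government", "union", "people"],
]

# Dictionary: give each distinct word an integer id, in order of first appearance.
token2id = {}
for doc in docs:
    for word in doc:
        token2id.setdefault(word, len(token2id))

def to_bow(doc):
    """Return a sorted list of (word_id, count) pairs for one document."""
    counts = Counter(token2id[word] for word in doc if word in token2id)
    return sorted(counts.items())

bows = [to_bow(doc) for doc in docs]
print(token2id)
print(bows)
```

Each pair answers one question: `(0, 2)` in the first document reads as "the word with id 0 (*people*) appears twice".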

In [390]:
# create a dictionary that maps words to integers.
dictionary = gensim.corpora.Dictionary(speeches)

# drop words that appear in fewer than 2 speeches; very frequent words
# are dropped by filter_extremes' default upper limit
dictionary.filter_extremes(no_below=2)

dictionary.save("../data/reviews.dict")
print(dictionary)

Dictionary(4934 unique tokens: [u'aided', u'limited', u'dissolution', u'comparatively', u'desirable']...)
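To make the filtering step more concrete, here is a minimal sketch of document-frequency filtering in plain Python (Python 3 syntax). It assumes the rule is "keep a word if it appears in at least `no_below` documents and in at most a fraction `no_above` of all documents", with `no_above=0.5` as our understanding of gensim's default. The helper `filter_by_doc_freq` is hypothetical, and gensim's additional cap on vocabulary size is ignored here.

```python
from collections import Counter

def filter_by_doc_freq(docs, no_below=2, no_above=0.5):
    """Keep words found in >= no_below docs and <= no_above * len(docs) docs."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each word once per document
    max_docs = no_above * len(docs)
    return {word for word, df in doc_freq.items() if no_below <= df <= max_docs}

# Toy tokenized documents (illustrative only).
docs = [
    ["country", "union", "liberty"],
    ["country", "union", "duty"],
    ["country", "peace"],
    ["country", "liberty"],
]
kept = filter_by_doc_freq(docs)
print(sorted(kept))
```

In this toy corpus *country* is removed for appearing in every document, while *duty* and *peace* are removed for appearing in only one.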


## Inspection

>What are the 3 most common words in Washington's first inaugural address? How about Obama's 2009 address?

In [391]:
print "Washington, 1789: ", pp.pformat([(dictionary[word[0]], word[1]) for word in sorted(dictionary.doc2bow(speeches[0]), key=lambda x: x[1], reverse=True)[:3]])
print "Obama, 2009: ", pp.pformat([(dictionary[word[0]], word[1]) for word in sorted(dictionary.doc2bow(speeches[-1]), key=lambda x: x[1], reverse=True)[:3]])

Washington, 1789:  [(u'me', 5), (u'ought', 4), (u'nature', 3)]
Obama, 2009:  [(u'america', 7), (u'today', 6), (u'cannot', 6)]


Once we have our dictionary, we can convert each document into its **bag-of-words** representation and save the collection of these vectors as our corpus. Each entry in our corpus indicates which words from our dictionary appear in the address, and how often.

In [426]:
corpus = [dictionary.doc2bow(speech) for speech in speeches]
gensim.corpora.MmCorpus.serialize("../data/reviews.mm", corpus)
print "First 10 Tokens - Obama, 2009"
print "Bag-of-Words: ", pp.pformat(corpus[-1][:10])
print "Processed: ", pp.pformat([dictionary[id] for id,_ in corpus[-1][:10]])

First 10 Words - Obama, 2009
Bag-of-Words:  [   (32, 3),
    (37, 1),
    (41, 1),
    (44, 1),
    (57, 1),
    (71, 1),
    (72, 2),
    (76, 1),
    (82, 3),
    (112, 1)]
Processed:  [   u'carried',
    u'emerged',
    u'oceans',
    u'homes',
    u'cause',
    u'enjoy',
    u'charter',
    u'tolerate',
    u'across',
    u'join']


# Topic Modelling

We are now equipped to perform topic modelling on the speeches. However, our current corpus treats all words equally: a raw count says nothing about whether a term is distinctive for a speech or simply common everywhere. We apply another transformation called **tf-idf** (term frequency - inverse document frequency), which reweights each word so as to capture how important it is to the speech.
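As a rough illustration of the idea, the sketch below computes tf-idf weights by hand in Python 3. It assumes the weight is the raw count times `log2(N / df)`, followed by scaling each document vector to unit length. We believe this is close to the default behaviour of gensim's `TfidfModel`, but treat that as an assumption. The toy corpus and the helper name `tfidf_weights` are made up.

```python
import math
from collections import Counter

# Toy bag-of-words corpus: each document is a list of (word_id, count) pairs.
toy_corpus = [
    [(0, 2), (1, 1)],
    [(0, 1), (2, 3)],
    [(0, 1), (1, 1), (2, 1)],
]

# Document frequency: in how many documents does each word id occur?
doc_freq = Counter(word_id for doc in toy_corpus for word_id, _ in doc)
n_docs = len(toy_corpus)

def tfidf_weights(doc):
    """Weight each count by log2(N / df), then scale the vector to unit length."""
    weighted = [(wid, tf * math.log2(n_docs / doc_freq[wid])) for wid, tf in doc]
    weighted = [(wid, w) for wid, w in weighted if w > 0]  # drop zero weights
    norm = math.sqrt(sum(w * w for _, w in weighted))
    return [(wid, w / norm) for wid, w in weighted] if norm else []

for doc in toy_corpus:
    print(tfidf_weights(doc))
```

Word 0 occurs in every toy document, so its inverse document frequency is zero and it drops out of every vector, no matter how often it is used.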

In [393]:
tfidf = gensim.models.TfidfModel(corpus, id2word=dictionary)
# apply the tf-idf weights to every document; corpus_tfidf is used in the next cell
corpus_tfidf = tfidf[corpus]

Next, we will perform **Latent Dirichlet Allocation (LDA)**, which will allow us to discover topics in the inaugural speeches. Note that in the cell below the model is fit on the bag-of-words `corpus`; the tf-idf representation is only passed through the fitted model afterwards.

The model outputs a series of "topics", each made up of words and weights that correspond to the "influence" each word has on the topic.

In [394]:
lda = gensim.models.ldamodel.LdaModel(corpus, id2word = dictionary, num_topics = 30, passes=4)
corpus_lda = lda[corpus_tfidf]

# print 10 of our 30 topics.
pp.pprint(lda.print_topics(10, num_words=3))

[   (9, u'0.001*"executive" + 0.001*"congress" + 0.001*"subject"'),
    (25, u'0.006*"clothed" + 0.004*"housed" + 0.004*"fed"'),
    (7, u'0.009*"sides" + 0.008*"pledge" + 0.006*"ask"'),
    (11, u'0.012*"democracy" + 0.006*"america" + 0.006*"will"'),
    (17, u'0.002*"america" + 0.002*"congress" + 0.001*"executive"'),
    (24, u'0.006*"address" + 0.005*"neither" + 0.005*"slaves"'),
    (18, u'0.001*"executive" + 0.001*"congress" + 0.001*"general"'),
    (21, u'0.004*"congress" + 0.003*"republic" + 0.003*"general"'),
    (16, u'0.007*"america" + 0.006*"today" + 0.004*"things"'),
    (26, u'0.014*"america" + 0.014*"responsibility" + 0.010*"abroad"')]


This allows us to inspect the topics that each document belongs to.

In [410]:
check = 1
print speeches_df.iloc[check]["president"], ",", speeches_df.iloc[check]["year"]
print sorted(lda.get_document_topics(corpus[check]), key = lambda x: x[1], reverse=True)
# note: topic 22 is hardcoded here; it is not necessarily the dominant topic listed above
print pp.pformat(lda.print_topic(22))

Washington , 1793
[(4, 0.97156862745096717)]
u'0.006*"false" + 0.006*"reason" + 0.005*"limits" + 0.005*"due" + 0.005*"press" + 0.005*"therefore" + 0.004*"measures" + 0.004*"truth" + 0.004*"expenses" + 0.004*"revenue"'
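The cell above prints topic 22, while the only topic reported for the speech is topic 4. A small helper can pick the dominant topic from the `(topic_id, probability)` pairs instead of hardcoding an id. This is a minimal sketch in Python 3. The name `dominant_topic` is hypothetical, the first list is copied from the output above, and the second is invented.

```python
# (topic_id, probability) pairs, in the shape shown in the output above.
doc_topics = [(4, 0.97156862745096717)]          # copied from the output above
mixed_topics = [(3, 0.21), (11, 0.58), (26, 0.19)]  # made-up example

def dominant_topic(topic_probs):
    """Return the id of the topic with the highest probability."""
    topic_id, _ = max(topic_probs, key=lambda pair: pair[1])
    return topic_id

print(dominant_topic(doc_topics))    # 4
print(dominant_topic(mixed_topics))  # 11
```

In the notebook this could be combined with the calls already used above, for example `lda.print_topic(dominant_topic(lda.get_document_topics(corpus[check])))`.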


In [437]:
# check how many topics each speech belongs to
topics_per_speech = {
    str(speeches_df.iloc[i]["year"]) + "-" + str(speeches_df.iloc[i]["president"]) :{
        "topics": len(lda.get_document_topics(corpus[i]))
    }
    for i, _ in enumerate(corpus)
}

pp.pprint(topics_per_speech)

{   '1789-Washington': {   'topics': 1},
    '1793-Washington': {   'topics': 1},
    '1797-Adams': {   'topics': 1},
    '1801-Jefferson': {   'topics': 1},
    '1805-Jefferson': {   'topics': 1},
    '1809-Madison': {   'topics': 2},
    '1813-Madison': {   'topics': 1},
    '1817-Monroe': {   'topics': 3},
    '1821-Monroe': {   'topics': 3},
    '1825-Adams': {   'topics': 3},
    '1829-Jackson': {   'topics': 1},
    '1833-Jackson': {   'topics': 2},
    '1837-VanBuren': {   'topics': 5},
    '1841-Harrison': {   'topics': 4},
    '1845-Polk': {   'topics': 3},
    '1849-Taylor': {   'topics': 1},
    '1853-Pierce': {   'topics': 3},
    '1857-Buchanan': {   'topics': 3},
    '1861-Lincoln': {   'topics': 5},
    '1865-Lincoln': {   'topics': 1},
    '1869-Grant': {   'topics': 1},
    '1873-Grant': {   'topics': 2},
    '1877-Hayes': {   'topics': 2},
    '1881-Garfield': {   'topics': 3},
    '1885-Cleveland': {   'topics': 2},
    '1889-Harrison': {   'topics': 6},
    '1893-Cl

# Exercises

1. What happens if you split up the data by era and run an LDA on each of the eras separately? How do the topics change over time?
2. Label the topics!
3. Play around with parameters, how do the topics change?
  * try filtering the dictionary differently.
  * try changing the number of topics that you are hardcoding.
4. How can you change the processing step to add more intelligent parsing of the speeches?
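As a possible starting point for the first exercise, the sketch below groups tokenized speeches by era in plain Python 3. The era boundaries are the ones from `historical_era` above, rewritten as a lookup table. The toy speeches and the name `speeches_by_era` are made up, and fitting one LDA model per group is left to the reader.

```python
from collections import defaultdict

# Same cutoffs as the historical_era function defined earlier in the notebook.
ERA_CUTOFFS = [
    (1830, "Early Republic"),
    (1854, "Jacksonian Democracy"),
    (1882, "Sectional Conflict"),
    (1898, "Gilded Age"),
    (1923, "Progressive Era"),
    (1962, "Depression and World Conflict"),
    (1990, "Social Change and Soviet Relations"),
]

def historical_era(year):
    for cutoff, name in ERA_CUTOFFS:
        if year < cutoff:
            return name
    return "Globalization"

# Toy (year, tokens) pairs standing in for the real processed speeches.
toy_speeches = [
    (1789, ["fellow", "citizens"]),
    (1793, ["voice", "country"]),
    (1861, ["union", "constitution"]),
    (2009, ["america", "today"]),
]

speeches_by_era = defaultdict(list)
for year, tokens in toy_speeches:
    speeches_by_era[historical_era(year)].append(tokens)

# Each group could then get its own dictionary, corpus and LDA model.
for era, docs in speeches_by_era.items():
    print(era, len(docs))
```

Keep in mind that some eras contain only a handful of speeches, so the number of topics per era will likely need to be much smaller than 30.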

# Introducing New Sentences, and Document Similarity.

For those interested, you can read about further applications of gensim here: https://radimrehurek.com/gensim/tut3.html

## Appendix

Gensim Link
https://algobeans.com/2015/06/21/laymans-explanation-of-topic-modeling-with-lda-2/
http://blog.echen.me/2011/08/22/introduction-to-latent-dirichlet-allocation/
http://www.enchantedlearning.com/wordlist/