# U.S.A. Presidential Vocabulary

Whenever a president of the United States of America is elected or re-elected, an inauguration ceremony takes place to mark the beginning of the president’s term. During the ceremony, the president gives an inaugural address to the nation, setting the tone and focus of the next four years of leadership.

The goal of this project is to analyze the inaugural addresses of the presidents of the United States of America using word embeddings. By training sets of word embeddings on subsets of inaugural addresses, we can learn about the different ways in which the presidents use language to convey their agenda.


## Data Loading

The texts are stored in separate files in the current directory. To create word embeddings on the corpus of all the presidents’ speeches, we need to read the text data from each file.

In [5]:
import os
from nltk.tokenize import PunktSentenceTokenizer
from collections import Counter
from nltk.corpus import stopwords
import gensim

# Find txt files in directory
files = sorted([file for file in os.listdir() if file.endswith('.txt')])
# Print the first 10 file names
files[0:10]

['1789-Washington.txt',
 '1793-Washington.txt',
 '1797-John-Adams.txt',
 '1801-Jefferson.txt',
 '1805-Jefferson.txt',
 '1809-Madison.txt',
 '1813-Madison.txt',
 '1817-Monroe.txt',
 '1821-Monroe.txt',
 '1825-John-Q-Adam.txt']

In [6]:
def read_file(file_name):
  # Read-only mode is enough here; 'r+' would also open the file for writing
  with open(file_name, 'r', encoding='utf-8') as file:
    file_text = file.read()
  return file_text

In [7]:
# Read every file into the speeches list
speeches = [read_file(file) for file in files]
# Print the first sentence of the first speech
speeches[0][0:278]

'Fellow-Citizens of the Senate and of the House of Representatives:\n\nAmong the vicissitudes incident to life no event could have filled me with greater anxieties than that of which the notification was transmitted by your order, and received on the 14th day of the present month.'
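
The file names listed earlier appear to follow a `<year>-<name>.txt` pattern. The following is a minimal sketch, assuming that pattern holds for every file, of how the year and the president's name could be recovered from a file name. `parse_file_name` is a hypothetical helper and is not part of the original notebook.

```python
import os

def parse_file_name(file_name):
    """Split a name like '1789-Washington.txt' into (year, president)."""
    # Assumption: every file is named '<year>-<name>.txt'
    stem, _extension = os.path.splitext(file_name)  # drop '.txt'
    year, president = stem.split('-', 1)            # split on the first hyphen only
    return int(year), president.replace('-', ' ')

print(parse_file_name('1789-Washington.txt'))
print(parse_file_name('1797-John-Adams.txt'))
```

Keeping the year next to each speech would make it possible to group addresses by period later on.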

## Text Preprocessing

Now we need to split each speech into sentences and each sentence into word tokens, so that every speech becomes a list of tokenized sentences. We will then merge the sentences from all the speeches into one big list of lists.

In [8]:
def process_speeches(speeches):
  # One tokenizer can be reused for every speech
  sentence_tokenizer = PunktSentenceTokenizer()
  word_tokenized_speeches = list()
  for speech in speeches:
    sentence_tokenized_speech = sentence_tokenizer.tokenize(speech)
    word_tokenized_sentences = list()
    for sentence in sentence_tokenized_speech:
      # Drop commas and colons, split hyphenated words, then lowercase each
      # token and strip sentence-final punctuation
      cleaned_sentence = sentence.replace(",","").replace("-"," ").replace(":","")
      word_tokenized_sentence = [word.lower().strip('.').strip('?').strip('!') for word in cleaned_sentence.split()]
      word_tokenized_sentences.append(word_tokenized_sentence)
    word_tokenized_speeches.append(word_tokenized_sentences)
  return word_tokenized_speeches

In [9]:
# Break down the speeches into words on a sentence-by-sentence basis
processed_speeches = process_speeches(speeches)
# Print the first sentence of the first speech
processed_speeches[0][0]

['fellow',
 'citizens',
 'of',
 'the',
 'senate',
 'and',
 'of',
 'the',
 'house',
 'of',
 'representatives',
 'among',
 'the',
 'vicissitudes',
 'incident',
 'to',
 'life',
 'no',
 'event',
 'could',
 'have',
 'filled',
 'me',
 'with',
 'greater',
 'anxieties',
 'than',
 'that',
 'of',
 'which',
 'the',
 'notification',
 'was',
 'transmitted',
 'by',
 'your',
 'order',
 'and',
 'received',
 'on',
 'the',
 '14th',
 'day',
 'of',
 'the',
 'present',
 'month']
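
To see what the word-level cleaning does in isolation, here is a small sketch that applies the same replacements and stripping to a single sentence, without NLTK. `clean_sentence` is a hypothetical name introduced for illustration, and the sample sentences are made up.

```python
def clean_sentence(sentence):
    """Lowercase a sentence and split it into word tokens with basic punctuation removed."""
    # Remove commas and colons, and turn hyphens into spaces, before splitting
    cleaned = sentence.replace(",", "").replace("-", " ").replace(":", "")
    # Strip sentence-final punctuation from both ends of each token
    return [word.lower().strip('.').strip('?').strip('!') for word in cleaned.split()]

print(clean_sentence("Fellow-Citizens of the Senate: what, then, is our duty?"))
print(clean_sentence("Peace; and order."))
```

The second call shows a limitation of this approach: punctuation that is not handled explicitly, such as semicolons or apostrophes, stays attached to the token.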

In order to build a custom set of word embeddings using `gensim`, we need to convert our data into a list of lists, where each inner list is a sentence and each item in the inner list is a word token.

In [10]:
def merge_speeches(speeches):
  all_sentences = list()
  for speech in speeches:
    for sentence in speech:
      all_sentences.append(sentence)
  return all_sentences

In [32]:
# Create list of lists where each inner list is a sentence and each item in the inner list is a word token
all_sentences = merge_speeches(processed_speeches)
# Print the first sentence of the merged list
all_sentences[0]

['fellow',
 'citizens',
 'of',
 'the',
 'senate',
 'and',
 'of',
 'the',
 'house',
 'of',
 'representatives',
 'among',
 'the',
 'vicissitudes',
 'incident',
 'to',
 'life',
 'no',
 'event',
 'could',
 'have',
 'filled',
 'me',
 'with',
 'greater',
 'anxieties',
 'than',
 'that',
 'of',
 'which',
 'the',
 'notification',
 'was',
 'transmitted',
 'by',
 'your',
 'order',
 'and',
 'received',
 'on',
 'the',
 '14th',
 'day',
 'of',
 'the',
 'present',
 'month']
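
The merge step is a one-level flatten, so it could also be written with the standard library's `itertools.chain`. This is an alternative sketch, not the notebook's own code; `merge_speeches_chain` and the toy data are assumptions made for illustration.

```python
from itertools import chain

def merge_speeches_chain(speeches):
    """Flatten a list of speeches (each a list of sentences) into one list of sentences."""
    return list(chain.from_iterable(speeches))

toy_speeches = [
    [['fellow', 'citizens'], ['among', 'the', 'vicissitudes']],  # speech 1: two sentences
    [['friends', 'and', 'fellow', 'citizens']],                  # speech 2: one sentence
]
print(merge_speeches_chain(toy_speeches))
```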

Very common words such as “the” and “of” carry little meaning on their own, so let's remove these stop words from the dataset before analysing it.

In [12]:
# Remove common English words
stop_words = set(stopwords.words('english'))
all_sentences_filtered = [[word for word in sentence if word not in stop_words] for sentence in all_sentences]
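
One side effect of this filtering is that a sentence made up entirely of stop words becomes an empty list. The sketch below shows one way such sentences could be dropped. It uses a tiny hypothetical stop-word set so that it runs without NLTK data, and `filter_sentences` is an illustrative helper rather than part of the notebook.

```python
# Hypothetical miniature stop-word set; the notebook uses NLTK's English list instead
toy_stop_words = {'we', 'the', 'of', 'and', 'it', 'is', 'so'}

def filter_sentences(sentences, stop_words):
    """Remove stop words, then drop sentences that end up empty."""
    filtered = ([word for word in sentence if word not in stop_words] for sentence in sentences)
    return [sentence for sentence in filtered if sentence]

toy_sentences = [['we', 'the', 'people'], ['so', 'it', 'is'], ['peace', 'and', 'freedom']]
print(filter_sentences(toy_sentences, toy_stop_words))
```

If NLTK reports that the stop-word resource is missing, running `nltk.download('stopwords')` once should fetch it.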

## Investigation of Most Common Words

To get a better understanding of the data, let’s take a look at the most frequently used words across all the inaugural addresses.

In [13]:
def most_frequent_words(list_of_sentences):
  all_words = [word for sentence in list_of_sentences for word in sentence]
  return Counter(all_words).most_common()

In [36]:
most_freq_words = most_frequent_words(all_sentences_filtered)
most_freq_words[0:30]

[('government', 575),
 ('people', 556),
 ('us', 469),
 ('upon', 365),
 ('must', 362),
 ('great', 336),
 ('may', 334),
 ('world', 310),
 ('states', 309),
 ('shall', 309),
 ('every', 299),
 ('country', 296),
 ('nation', 288),
 ('one', 254),
 ('new', 251),
 ('peace', 250),
 ('citizens', 241),
 ('power', 230),
 ('public', 221),
 ('time', 213),
 ('would', 209),
 ('constitution', 201),
 ('united', 198),
 ('nations', 187),
 ('america', 184),
 ('free', 183),
 ('freedom', 182),
 ('union', 171),
 ('war', 168),
 ('american', 159)]
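
Raw counts are hard to compare between corpora of different sizes, for example all presidents versus a single one. A possible refinement, sketched here under the assumption that relative frequency is the quantity of interest, is to report each word's share of all tokens. `word_shares` and the toy sentences are hypothetical.

```python
from collections import Counter

def word_shares(list_of_sentences, top_n=3):
    """Return the top_n words together with their share of all tokens."""
    counts = Counter(word for sentence in list_of_sentences for word in sentence)
    total = sum(counts.values())
    return [(word, count / total) for word, count in counts.most_common(top_n)]

toy_sentences = [['government', 'people', 'government'], ['people', 'government', 'peace']]
print(word_shares(toy_sentences, top_n=2))
```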

## Word Embedding Models

Now we are going to use `gensim` to create our first word embedding model, trained on the speeches of all the presidents.

In [15]:
all_presidents_embeddings = gensim.models.Word2Vec(all_sentences_filtered, window=5, min_count=1, workers=2, sg=1)
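
In this call, `sg=1` selects the skip-gram architecture, `window=5` is the maximum distance between a word and the context words it is trained against, `min_count=1` keeps even words that occur only once, and `workers=2` sets the number of training threads. The sketch below is a simplified illustration of which (center, context) pairs a skip-gram window covers. It is not how `gensim` is implemented internally, and `skipgram_pairs` is a hypothetical helper.

```python
def skipgram_pairs(sentence, window=5):
    """Yield (center, context) pairs for every word within `window` positions."""
    for i, center in enumerate(sentence):
        start = max(0, i - window)
        end = min(len(sentence), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                yield center, sentence[j]

toy_sentence = ['government', 'people', 'freedom', 'peace']
for pair in skipgram_pairs(toy_sentence, window=1):
    print(pair)
```

With a window of 5 and stop words removed, most words in a short sentence end up paired with each other.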

Now that we have our word embeddings, let’s have some fun exploring them. The concept of “freedom” is prevalent in the speeches made by the presidents. Let's find the top 20 words that are used in similar contexts to “freedom”.

In [16]:
# Find words similar to "freedom"
similar_to_freedom = all_presidents_embeddings.wv.most_similar("freedom", topn=20)
similar_to_freedom

[('human', 0.996030330657959),
 ('order', 0.995844841003418),
 ('love', 0.9955887794494629),
 ('history', 0.9955615401268005),
 ('mankind', 0.9955421686172485),
 ('happiness', 0.9955199360847473),
 ('prosperity', 0.9955078363418579),
 ('come', 0.9954904317855835),
 ('lasting', 0.9954744577407837),
 ('return', 0.995394229888916),
 ('result', 0.9953935146331787),
 ('greatest', 0.9953813552856445),
 ('greater', 0.9953355193138123),
 ('progress', 0.9952577948570251),
 ('promote', 0.9952285289764404),
 ('civilization', 0.9952123165130615),
 ('stability', 0.995141863822937),
 ('advancement', 0.9951395988464355),
 ('universal', 0.9951261281967163),
 ('courage', 0.9951251745223999)]
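
The scores returned by `most_similar` are cosine similarities between word vectors: 1 means the vectors point in the same direction and 0 means they are orthogonal. The following pure-Python sketch shows the calculation on made-up two-dimensional vectors; real embeddings have many more dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal
print(cosine_similarity([1.0, 1.0], [1.0, 0.9]))  # nearly parallel
```

Scores as close to 1 as the ones above mean the vectors are almost parallel, so the ranking depends on very small differences.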

What about some controversial terms in human history: "religion" and "God"?

In [17]:
# Find words similar to "religion"
similar_to_religion = all_presidents_embeddings.wv.most_similar("religion", topn=20)
similar_to_religion

[('real', 0.9990594983100891),
 ('offer', 0.9990562796592712),
 ('operation', 0.9990289211273193),
 ('tranquillity', 0.9989761710166931),
 ('devoted', 0.9989373087882996),
 ('show', 0.998914361000061),
 ('accomplish', 0.9989013075828552),
 ('favored', 0.9988955855369568),
 ('abuses', 0.9988954663276672),
 ('railroads', 0.9988946914672852),
 ('expression', 0.9988880753517151),
 ('held', 0.998881459236145),
 ('crisis', 0.9988803267478943),
 ('necessity', 0.9988754391670227),
 ('permit', 0.9988753795623779),
 ('enlightened', 0.9988750219345093),
 ('effected', 0.9988733530044556),
 ('purposes', 0.9988710880279541),
 ('renew', 0.9988645315170288),
 ('scarcely', 0.9988604187965393)]

In [18]:
# Find words similar to "God"
similar_to_god = all_presidents_embeddings.wv.most_similar("god", topn=20)
similar_to_god

[('solemn', 0.9954308271408081),
 ('bless', 0.9952143430709839),
 ('taken', 0.9945979714393616),
 ('take', 0.994469940662384),
 ('support', 0.9943830966949463),
 ('ask', 0.9943631291389465),
 ('suffrages', 0.9941916465759277),
 ('office', 0.9941672682762146),
 ('almighty', 0.9940996766090393),
 ('today', 0.994083821773529),
 ('confidence', 0.9940829873085022),
 ('trust', 0.9939519762992859),
 ('entitled', 0.9939282536506653),
 ('day', 0.9939256906509399),
 ('friends', 0.9938493371009827),
 ('called', 0.9938339591026306),
 ('high', 0.9937537312507629),
 ('portion', 0.9937481880187988),
 ('duty', 0.9937167167663574),
 ('distinguished', 0.9937167167663574)]

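Quite a revelation... Note, however, that every similarity score above exceeds 0.99, so these rankings rest on very small differences and should be read with caution.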


## One President
An interesting aspect of word embeddings is that different corpora result in different embeddings, which points to differences in how writers and speakers use words.

Let’s train a word embedding model on a single president and see how their word embeddings differ from those of the collection of all presidents. We are going to use President Franklin D. Roosevelt’s speeches as an example here.

In [19]:
def get_president_sentences(president):
  files = sorted([file for file in os.listdir() if president.lower() in file.lower()])
  speeches = [read_file(file) for file in files]
  processed_speeches = process_speeches(speeches)
  all_sentences = merge_speeches(processed_speeches)
  all_sentences_filtered = [[word for word in sentence if word not in stop_words] for sentence in all_sentences]
  return all_sentences_filtered

In [20]:
roosevelt_sent = get_president_sentences('Franklin-D-Roosevelt')
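
Because files are selected by a case-insensitive substring match, the search string has to be specific enough: a bare surname shared by two presidents would pull in both. The sketch below illustrates this on a list of file names that are assumed for the example (the real directory listing is only partly shown above). `match_files` is a hypothetical helper that also restricts matches to `.txt` files.

```python
def match_files(president, file_names):
    """Return the .txt file names that contain `president`, ignoring case."""
    return sorted(
        name for name in file_names
        if name.endswith('.txt') and president.lower() in name.lower()
    )

# Assumed file names, following the '<year>-<name>.txt' pattern seen earlier
toy_files = [
    '1905-Theodore-Roosevelt.txt',
    '1933-Franklin-D-Roosevelt.txt',
    '1937-Franklin-D-Roosevelt.txt',
]
print(match_files('roosevelt', toy_files))             # matches both presidents
print(match_files('Franklin-D-Roosevelt', toy_files))  # matches only one
```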

To get a better understanding of President Franklin D. Roosevelt’s speeches, let’s take a look at the most frequently used words across his inaugural addresses.

In [21]:
most_frequent_words(roosevelt_sent)[0:30]

[('nation', 26),
 ('people', 25),
 ('government', 23),
 ('us', 20),
 ('shall', 20),
 ('democracy', 20),
 ('men', 18),
 ('must', 17),
 ('know', 16),
 ('life', 15),
 ('spirit', 15),
 ('upon', 13),
 ('national', 12),
 ('years', 12),
 ('may', 12),
 ('new', 12),
 ('world', 12),
 ('every', 11),
 ('states', 11),
 ('way', 11),
 ('good', 11),
 ('today', 10),
 ('great', 10),
 ('power', 10),
 ('progress', 10),
 ('old', 10),
 ('united', 10),
 ('see', 10),
 ('time', 9),
 ('first', 9)]

As with our previous word embedding model, let’s train `roosevelt_embeddings` and explore it! We are going to find the top 20 words that are used in similar contexts to "freedom" and "god".

In [22]:
roosevelt_embeddings = gensim.models.Word2Vec(roosevelt_sent, window=5, min_count=1, workers=2, sg=1)

In [23]:
roosevelt_similar_to_freedom = roosevelt_embeddings.wv.most_similar("freedom", topn=20)
roosevelt_similar_to_freedom


[('ideals', 0.3555612564086914),
 ('challenge', 0.2985445261001587),
 ('engaging', 0.28151440620422363),
 ('alike', 0.2713252305984497),
 ('faithful', 0.27098703384399414),
 ('employment', 0.26910820603370667),
 ('secure', 0.25979623198509216),
 ('elementary', 0.2574740946292877),
 ('pall', 0.24631841480731964),
 ('fatalistic', 0.2450205534696579),
 ('proclaimed', 0.24280594289302826),
 ('irresistible', 0.2389444261789322),
 ('experiment', 0.23624424636363983),
 ('almighty', 0.23192434012889862),
 ('performance', 0.2289603352546692),
 ('emerson', 0.22857506573200226),
 ('remaining', 0.22777263820171356),
 ('creeds', 0.2275572270154953),
 ('ahead', 0.2275373935699463),
 ('recent', 0.2264631986618042)]

In [24]:
roosevelt_similar_to_god = roosevelt_embeddings.wv.most_similar("god", topn=20)
roosevelt_similar_to_god

[('reason', 0.35487619042396545),
 ('patchwork', 0.3050941824913025),
 ('speaks', 0.29313454031944275),
 ('wage', 0.28905582427978516),
 ('require', 0.27758654952049255),
 ('lending', 0.26661258935928345),
 ('declaration', 0.2642853260040283),
 ('greatest', 0.25546571612358093),
 ('away', 0.2506450116634369),
 ('tried', 0.2505795359611511),
 ('productiveness', 0.24845187366008759),
 ('warm', 0.23641154170036316),
 ('changed', 0.23600950837135315),
 ("country's", 0.23365409672260284),
 ('kept', 0.22523440420627594),
 ('earth', 0.22172710299491882),
 ('host', 0.22088958323001862),
 ('translated', 0.2172420769929886),
 ('move', 0.2169693410396576),
 ('lines', 0.2168179303407669)]

Even with four inaugural addresses to work with, the speeches of a single president do not provide enough data to produce robust word embeddings: the similarity scores are low, and neighbors such as "patchwork" and "lending" for "god" are hard to interpret. In the next section, we look at the speeches of several presidents to increase the corpus size and produce better word embeddings.
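
One way to put a number on how much data a corpus offers is to count its sentences, tokens, and distinct words. The sketch below does this for any list of tokenized sentences. `corpus_stats` is a hypothetical helper and the toy input is made up, so no figures for the real corpora are implied.

```python
def corpus_stats(list_of_sentences):
    """Count sentences, tokens and distinct words in a tokenized corpus."""
    tokens = [word for sentence in list_of_sentences for word in sentence]
    return {
        'sentences': len(list_of_sentences),
        'tokens': len(tokens),
        'vocabulary': len(set(tokens)),
    }

toy_corpus = [['nation', 'people'], ['nation', 'democracy', 'people']]
print(corpus_stats(toy_corpus))
```

Calling it on `roosevelt_sent` and `all_sentences_filtered` would show how the two corpora compare in size.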

## Selection of Presidents

Let’s increase our corpus size and look for better-defined word embeddings by training a model on the inaugural addresses of the four presidents featured on Mount Rushmore near Keystone, South Dakota: George Washington, Thomas Jefferson, Theodore Roosevelt, and Abraham Lincoln.

In [25]:
def get_presidents_sentences(presidents):
  all_sentences = list()
  for president in presidents:
    files = sorted([file for file in os.listdir() if president.lower() in file.lower()])
    speeches = [read_file(file) for file in files]
    processed_speeches = process_speeches(speeches)
    all_prez_sentences = merge_speeches(processed_speeches)
    all_sentences.extend(all_prez_sentences)
  # Filter stop words once, after all presidents have been collected
  all_sentences_filtered = [[word for word in sentence if word not in stop_words] for sentence in all_sentences]
  return all_sentences_filtered

In [26]:
rushmore_presidents_sent = get_presidents_sentences(["washington","jefferson","lincoln","theodore-roosevelt"])

To get a better understanding of the speeches of Presidents Washington, Jefferson, Lincoln, and Theodore Roosevelt, let’s take a look at the most frequently used words across their inaugural addresses.

In [27]:
most_frequent_words(rushmore_presidents_sent)[0:30]

[('government', 46),
 ('shall', 42),
 ('may', 41),
 ('constitution', 34),
 ('people', 33),
 ('us', 32),
 ('citizens', 30),
 ('union', 29),
 ('one', 28),
 ('public', 28),
 ('would', 25),
 ('must', 25),
 ('every', 24),
 ('states', 24),
 ('fellow', 23),
 ('right', 21),
 ('law', 20),
 ('upon', 19),
 ('state', 18),
 ('war', 18),
 ('among', 17),
 ('good', 17),
 ('country', 16),
 ('great', 16),
 ('power', 16),
 ('peace', 15),
 ('present', 14),
 ('nation', 13),
 ('let', 13),
 ('others', 13)]

Now that we have the sentences from the presidents featured on Mount Rushmore, let's create another word embedding model with `gensim`.

In [28]:
rushmore_embeddings = gensim.models.Word2Vec(rushmore_presidents_sent, window=5, min_count=1, workers=2, sg=1)

As with our previous word embedding models, let’s explore `rushmore_embeddings`. We will find words that are used in similar contexts to "freedom", "god", and "government".

In [29]:
rushmore_similar_to_freedom = rushmore_embeddings.wv.most_similar("freedom", topn=20)
rushmore_similar_to_freedom

[('given', 0.404973566532135),
 ('shall', 0.40488311648368835),
 ('union', 0.3774382472038269),
 ('first', 0.36965152621269226),
 ('made', 0.36015552282333374),
 ('power', 0.3592836856842041),
 ('war', 0.3586719036102295),
 ('foreign', 0.35098040103912354),
 ('happiness', 0.34960997104644775),
 ('treaties', 0.3417947292327881),
 ('duty', 0.3409098982810974),
 ('equal', 0.3394305408000946),
 ('therefore', 0.33861058950424194),
 ('demanded', 0.33720552921295166),
 ('every', 0.331023246049881),
 ('understand', 0.33005741238594055),
 ('authority', 0.32939761877059937),
 ('great', 0.32142385840415955),
 ('pretext', 0.32085418701171875),
 ('right', 0.3182717263698578)]

In [30]:
rushmore_similar_to_god = rushmore_embeddings.wv.most_similar("god", topn=20)
rushmore_similar_to_god

[('government', 0.5442899465560913),
 ('war', 0.5255405306816101),
 ('public', 0.5180131196975708),
 ('without', 0.5116174221038818),
 ('would', 0.49037110805511475),
 ('may', 0.47593018412590027),
 ('one', 0.4694865643978119),
 ('union', 0.4660675525665283),
 ('us', 0.4656261205673218),
 ('every', 0.4608868360519409),
 ('must', 0.4579213559627533),
 ('shall', 0.4526175856590271),
 ('constitution', 0.4501025378704071),
 ('citizens', 0.44717520475387573),
 ('best', 0.4436720609664917),
 ('duty', 0.43555450439453125),
 ('among', 0.4316003620624542),
 ('present', 0.43008318543434143),
 ('happiness', 0.4292805790901184),
 ('nation', 0.42901843786239624)]

In [31]:
rushmore_similar_to_government = rushmore_embeddings.wv.most_similar("government", topn=20)
rushmore_similar_to_government

[('public', 0.737883448600769),
 ('union', 0.7208378911018372),
 ('law', 0.7080079317092896),
 ('shall', 0.6866411566734314),
 ('may', 0.6857576370239258),
 ('would', 0.6808280348777771),
 ('must', 0.6720712780952454),
 ('citizens', 0.669371485710144),
 ('nation', 0.669234037399292),
 ('national', 0.6657862067222595),
 ('us', 0.657419741153717),
 ('one', 0.6548988223075867),
 ('duty', 0.6467574834823608),
 ('war', 0.645211398601532),
 ('best', 0.6441528797149658),
 ('without', 0.6408768892288208),
 ('among', 0.6387364864349365),
 ('people', 0.6376473307609558),
 ('states', 0.6195920705795288),
 ('every', 0.6142016649246216)]
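
To compare models more systematically than by eye, one option is to measure how many nearest neighbors two models share for the same query word. The sketch below computes the Jaccard overlap of two `most_similar`-style lists. `neighbor_overlap` is a hypothetical helper, and the sample data are the leading entries of the two "freedom" outputs shown earlier, with scores rounded to four decimals.

```python
def neighbor_overlap(neighbors_a, neighbors_b):
    """Return the shared words and the Jaccard overlap of two neighbor lists."""
    words_a = {word for word, _score in neighbors_a}
    words_b = {word for word, _score in neighbors_b}
    shared = words_a & words_b
    return shared, len(shared) / len(words_a | words_b)

# Leading "freedom" neighbors from the all-presidents model (scores rounded)
all_presidents_freedom = [('human', 0.9960), ('order', 0.9958), ('love', 0.9956),
                          ('history', 0.9956), ('mankind', 0.9955), ('happiness', 0.9955)]
# Leading "freedom" neighbors from the Rushmore model (scores rounded)
rushmore_freedom = [('given', 0.4050), ('shall', 0.4049), ('union', 0.3774),
                    ('first', 0.3697), ('made', 0.3602), ('power', 0.3593),
                    ('war', 0.3587), ('foreign', 0.3510), ('happiness', 0.3496)]

shared, jaccard = neighbor_overlap(all_presidents_freedom, rushmore_freedom)
print(shared, jaccard)
```

A low overlap suggests that the two corpora place the query word in different contexts, although with corpora this small part of the difference may simply be noise.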

## Conclusion

We trained word embeddings on the full set of inaugural addresses and on two subsets, and found that the presidents as a whole, the "Rushmore group", and a single president each use a different lexicon and emphasise different things in their speeches. Words as common in American political speech as "freedom" and "god" are used in different contexts by each group.