# U.S.A. Presidential Vocabulary

My Codecademy portfolio project from the <a href='https://www.codecademy.com/learn/paths/data-science'>Data Scientist Path</a> Natural Languages Processing (NLP) Course, Word Embeddings Section.

## Overview

Whenever a United States of America president is elected or re-elected, an inauguration ceremony takes place to mark the beginning of the president’s term. During the ceremony, the president gives an inaugural address to the nation, dictating the tone and focus of the next four years of leadership.

In this project you will have the chance to analyze the inaugural addresses of the presidents of the United States of America, as collected by the <a href="https://www.nltk.org/book/ch02.html">Natural Language Toolkit</a>, using word embeddings.

By training sets of word embeddings on subsets of inaugural address versus the collection of presidents as a whole, we can learn about the different ways in which the presidents use language to convey their agenda.

#### Project Goal:

Analyze USA presidential inaugural speeches using NLP word embeddings models.

#### Project Requirements

Be familiar with:
- Python3
- NLP (Natural Languages Processing)
<br><br>
- The Python Libraries:
    - re
    - Pandas
    - Json
    - Collections
    - NLKT
    - gensim
  

#### Links:

<a href='https://www.alex-ricciardi.com/post/u-s-a-presidential-vocabulary'>My Project Blog Presentation<a><br>
<br>
<a href='https://github.com/ARiccGitHub/us_presidential_vocabulary'>Project GitHub<a><br>

<h2 style='color : MediumBlue'>Preprocessing the Data</h2>

The project corpus data can be freely downloaded from <a href='http://www.nltk.org/nltk_data/'>NLTK Corpora</a> under the designation "68. C-Span Inaugural Address Corpus".

#### Libraries:

In [1]:
# Regex
import re
# Operating system dependent functionality
import os
# JSON encoder and decoder
import json
# Data manipulation tool
import pandas as pd
# Natural language processing
import nltk
# Tokenization into sentences
from nltk.tokenize import PunktSentenceTokenizer
# Stop words and lexical database of English  
from nltk.corpus import stopwords, wordnet
# lemmatization class
from nltk.stem import WordNetLemmatizer
# Counter Dictionary class - https://docs.python.org/3/library/collections.html#collections.Counter -
from collections import Counter

#### Saves list function:

In this project, I want to save lists of lists, a list of lists is a list of objects, I use <a hreff='https://docs.python.org/3/library/json.html'>json</a> to save the lists in my list as objects.

The save_list() function:

- Takes the arguments:
    - file_name, string data type
    - list_to_save, list data type
<br><br>
- Saves list_to_save into file_name.txt as json objects

In [3]:
def save_list(file_name, list_to_save):   
    with open(f'data/{file_name}.txt', 'w') as file:
        file.write(json.dumps(list_to_save))

To load the list files you can use

The load_list() function:

- Takes the arguments:
    - list_name, string data type
- Load list_name.txt
<br><br>
- Returns the list_name.txt as a list 

In [4]:
def load_list(list_name):
    with open(f'data/{file_name}.txt', 'r') as file:
        return json.loads(file.read())

<h3 style='color : DarkMagenta'>Converting files into a corpus</h3>

For this project, I decided to combine the files data into a corpus that I named ```speeches```.

In [5]:
# Project directory path
path = os.getcwd()
# Sorts and save files name from the corpus_data folder
file_names = sorted([file for file in os.listdir(f"{path}/corpus_data")])
# Creates a speeches list from files   
speeches = []
for name in file_names:  
    with open(f'corpus_data/{name}', 'r+') as file:  
        speeches.append(file.read())

Sample from the ```speeches``` corpus:

In [6]:
# 1793-Washington's speech
speeches[1]

'Fellow citizens, I am again called upon by the voice of my country to execute the functions of its Chief Magistrate. When the occasion proper for it shall arrive, I shall endeavor to express the high sense I entertain of this distinguished honor, and of the confidence which has been reposed in me by the people of united America.\n\nPrevious to the execution of any official act of the President the Constitution requires an oath of office. This oath I am now about to take, and in your presence: That if it shall be found during my administration of the Government I have in any instance violated willingly or knowingly the injunctions thereof, I may (besides incurring constitutional punishment) be subject to the upbraidings of all who are now witnesses of the present solemn ceremony.\n\n \n'

We can see from the ```speeches``` corpus sample that the speech's texts is not clean, it can not be processed properly by a NLT model.

<h3 style='color : DarkMagenta'>Preprocessing corpus</h3>

Before a text can be processed by a NLP model, the text data needs to be preprocessed.<br>
Text data preprocessing is the process of cleaning and prepping  the text data to be processed by NLP models.

Cleaning and prepping tasks:

- Noise removal is a text pre-processing step concerned with removing unnecessary formatting from our text.

- Tokenization is a text pre-processing step devoted to breaking up text into smaller units (usually words or discrete terms).
<br>

- Normalization is the name we give most other text preprocessing tasks, including stemming, lemmatization, upper and lowercasing, and stopword removal.

    - Stemming is the normalization preprocessing task focused on removing word affixes. 

	- Lemmatization is the normalization preprocessing task that more carefully brings words down to their root forms.

#### Tokenization

In this project, I break down the presidents' speeches into words on a sentence by sentence basis by using the sentence tokenizer <a href='https://www.nltk.org/_modules/nltk/tokenize/punkt.html'>nltk.tokenize.PunktSentenceTokenizer()</a> class.

#### Part-of-Speech Tagging

To improve the performance of <a href="https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html">lemmatization</a> (bring a word to his root), each word in the processed text is assigned parts of speech tag, 
<a href="https://nlp.stanford.edu/software/tagger.shtml#:~:text=A%20Part%2DOf%2DSpeech%20Tagger,like%20'noun%2Dplural'.">Part-of-Speech Tagging</a> is the process of reading text in some language and assigns parts of speech to each word (and other token), such as noun, verb, adjective, etc.

Part-of-Speech tagging function:

The ```get_part_of_speech()``` function:
- Takes the arguments:
    - ```word```, string data type.<br>
<br>
- Matches ```word``` with synonyms
- Tags ```word``` and count tags.<br> 
<br>
- Returns The most common tag, the tag with the highest count, ex: n for Noun, string data type.

In [7]:
def get_part_of_speech(word):
    # Synonyms matching
    probable_part_of_speech = wordnet.synsets(word)
    # Initializing Counter class object
    pos_counts = Counter()
    # Taging and counting tags
    pos_counts["n"] = len(  [ item for item in probable_part_of_speech if item.pos()=="n"]  ) # Noun
    pos_counts["v"] = len(  [ item for item in probable_part_of_speech if item.pos()=="v"]  ) # Verb
    pos_counts["a"] = len(  [ item for item in probable_part_of_speech if item.pos()=="a"]  ) # Adjectif
    pos_counts["r"] = len(  [ item for item in probable_part_of_speech if item.pos()=="r"]  ) # Adverb
    # The most common tag, the tag with the highest count, ex: n for Noun 
    most_likely_part_of_speech = pos_counts.most_common(1)[0][0]
    
    return most_likely_part_of_speech

#### The word 'us':

The word 'us' is a commonly used word in presidential inauguration addresses.<br>
The word 'us' is a commonly used word in presidential inauguration addresses.
The result of preprocessing the word 'us' through lemmatizing with the part-of-speech tagging method <a href='https://www.nltk.org/api/nltk.corpus.reader.html#nltk.corpus.reader.wordnet.Lemma.synset'>nlt.corpus.reader.wordnet.synsets()</a> function and in conjunction with `stopwords` removal and the <a href='https://www.nltk.org/_modules/nltk/stem/wordnet.html'>nltk.stem.WordNetLemmatizer().lemmatize()</a> method, is that the word 'us' becomes 'u'.<br>
<br>
This happens because the lemmatize(word, get_part_of_speech(word)) method removes the character 's' at the end of words tagged as nouns. The word 'us', which is not part of the stopwords list, is tagged as a noun causing the lemmatization result of 'us' to be 'u'.

The word `'us'` is not a `stopword`:

In [8]:
'us' in set(stopwords.words('english'))

False

The `get_part_of_speech()` function tags the word `'us'` as a noun:

In [9]:
get_part_of_speech('us')

'n'

The `lemmatize(word, get_part_of_speech(word))` method removes the character `'s'` at the of `words` tags as noun, and the word `'us'` is tagged as a noun causing the lemmatization result of `'us'` to be `'u'`.   

In [10]:
normalizer_us = WordNetLemmatizer()
normalizer_us.lemmatize('us', get_part_of_speech('us'))

'u'

In [11]:
print(f'{normalizer_us.lemmatize("us", get_part_of_speech("us"))}\n')

u



#### Preprocessing:

In [12]:
# Stop words
stop_words = set(stopwords.words('english'))
# Initializes the lemmatizer
normalizer = WordNetLemmatizer()
# Creates an empty list of processed speeches
preprocessed_speeches = []
# ---------------------------------------------------- Preprocessing loop
for speech in speeches:
    # ------------------ Tokenizing
    # Initializes sentence tokenizer
    sentence_tokenizer = PunktSentenceTokenizer()
    # Tokenizes speech into sentences
    sentence_tokenized_speech = sentence_tokenizer.tokenize(speech)
    # ------------------ Normalizing loop
    # Creates an empty sentences list 
    word_sentences = [] 
    for sentence in sentence_tokenized_speech:
        # ----------- Removes noise from sentence and tokenizes the sentence into words  
        word_tokenized_sentence = [re.sub('[^a-zA-Z0-9]+', '', word.lower()) \
                                           for word in sentence.replace(",","").replace("-"," ").replace(":","").split()] 
        # ---------------- Removes stopwords from sentences
        sentence_no_stopwords = [word for word in word_tokenized_sentence if word not in stop_words]
        # ---------------- Before lemmatizing, adds a 's' to the word 'us'
        word_sentence_us = ['uss' if word == 'us' else word for word in sentence_no_stopwords]
        # ---------------- Lemmatizes
        word_sentence = [normalizer.lemmatize(word, get_part_of_speech(word)) \
                                                        for word in word_sentence_us if not re.match(r'\d+', word)]
        # Stores preprocessed word   
        word_sentences.append(word_sentence)     
    # Stores sentence tokenized into words     
    preprocessed_speeches.append(word_sentences) 
# Saves preprocessed corpus
save_list('preprocessed_speeches', preprocessed_speeches)

A personal preference <a href='https://docs.python.org/3/library/pprint.html'>pprint-Data pretty printer</a>

In [13]:
%pprint

Pretty printing has been turned OFF


Sample from the ```preprocessed_speeches``` list, preprocessed corpus:

In [14]:
# Dispays words from the second speech first sentences
preprocessed_speeches[1][0]

['fellow', 'citizen', 'call', 'upon', 'voice', 'country', 'execute', 'function', 'chief', 'magistrate']

Dictionary of the presidents' words speeches by sentences:

In [15]:
# Creates a list of the speech's names relative to the presidents' names and year of the speech 
year_president_speech_names = [name.lower().replace('.txt', '').replace('1989-bush', '1989-bush senior') for name in file_names]
# Creates a dictionary of the presidents preprocessed speeches
presidents_speeches = dict(zip(year_president_speech_names, preprocessed_speeches))

Presidents pre-processed speeches DataFrame:

In [16]:
df_presidents_speeches = pd.DataFrame({'Preprocessed Speech' : preprocessed_speeches}, index = year_president_speech_names)
df_presidents_speeches.to_csv('data/processed_presidents_speeches.csv')
df_presidents_speeches.head()

Unnamed: 0,Preprocessed Speech
1789-washington,"[[fellow, citizen, senate, house, representati..."
1793-washington,"[[fellow, citizen, call, upon, voice, country,..."
1797-adams,"[[first, perceive, early, time, middle, course..."
1801-jefferson,"[[friend, fellow, citizen, call, upon, underta..."
1805-jefferson,"[[proceed, fellow, citizen, qualification, con..."


Sample from the ```df_presidents_speeches``` DataFrame:

In [17]:
# Dispays words in each sentences from the 1793-Washington's speech 
df_presidents_speeches.loc['1793-washington'][0]

[['fellow', 'citizen', 'call', 'upon', 'voice', 'country', 'execute', 'function', 'chief', 'magistrate'], ['occasion', 'proper', 'shall', 'arrive', 'shall', 'endeavor', 'express', 'high', 'sense', 'entertain', 'distinguish', 'honor', 'confidence', 'repose', 'people', 'unite', 'america'], ['previous', 'execution', 'official', 'act', 'president', 'constitution', 'require', 'oath', 'office'], ['oath', 'take', 'presence', 'shall', 'find', 'administration', 'government', 'instance', 'violate', 'willingly', 'knowingly', 'injunction', 'thereof', 'may', 'besides', 'incur', 'constitutional', 'punishment', 'subject', 'upbraiding', 'witness', 'present', 'solemn', 'ceremony']]

In [18]:
# Dispays words in the first sentence from the 1793-Washington's speech 
df_presidents_speeches.loc['1793-washington'][0][0]

['fellow', 'citizen', 'call', 'upon', 'voice', 'country', 'execute', 'function', 'chief', 'magistrate']

Combine list of all the sentences from all the president speeches  

In [19]:
# Creates an empty list of all the stences in processed_speeches
all_sentences = [sentence for speech in preprocessed_speeches for sentence in speech]
# Saves all_sentences
save_list('all_sentences', all_sentences)

Sample from the ```all_sentences``` list:

In [20]:
all_sentences[23]

['fellow', 'citizen', 'call', 'upon', 'voice', 'country', 'execute', 'function', 'chief', 'magistrate']

Words in all sentences list:

In [21]:
all_words = [word for sentence in all_sentences for word in sentence]
# Saves all_words
save_list('all_words', all_words)

<h2 style='color : MediumBlue'>Word Embeddings</h2>

<a href='https://www.codecademy.com/learn/natural-language-processing/modules/nlp-word-embeddings'>Word embeddings</a> are a type of word representation that allows words with similar meaning to have a similar representation. In NLP words are often represented as numeric vectors, the algorithms used to vectorize words are referred to as "words to vectors"(<a href='https://en.wikipedia.org/wiki/Word2vec'>word2vec</a>).

#### Libraries:
In addition of the libraries imported for pre-processing the data, I use the fallowing libraries for Word Embeddings

In [22]:
# word2vec model library  
import gensim

Note: 
<a href='https://blog.usejournal.com/how-does-the-model-produce-different-results-on-same-dataset-54486f951dbf'>Machine learning models will produce different results on same dataset</a>, the models generate a sequence of random numbers called <a href='https://towardsdatascience.com/how-to-use-random-seeds-effectively-54a4cd855a79'>random seed</a> used within the process of generating test, validation and training datasets from a given dataset. Configurating a model's seed to a set value will ensure that the results are reproducible.<br>
The  python library `gensim` relies on different processes to initialize and train its word embeddings model class, if you need to generate more consistent results (not recommended), click on the following link:<br>
<a href='https://stackoverflow.com/questions/34831551/ensure-the-gensim-generate-the-same-word2vec-model-for-different-runs-on-the-sam'>Ensure the gensim generate the same Word2Vec model for different runs on the same data</a>

#### Input Function:

The input function is optional, a personal preference, I created the function to easily input different variable values without having to change the code.<br>
The option to use the function is by default turn off. 

In [23]:
input_option = False

The input_word() function:

- Takes the arguments:
    - input_subject, string data type
    - word_list, list data integer type
<br><br>
- Outputs on screen the input_subject
- Take a user input, inputted_word
- Compares inputted_word with items in the word_list
<br><br>
- Returns inputted_word.lower()

In [49]:
 def input_word(input_subject, word_list):
        # User Input a word
        inputted_word = input(f'\nEnter a {input_subject}: ')

        while inputted_word.lower() not in word_list:
            print(f'\n{inputted_word} is not in the {input_subject} list')
            inputted_word = input(f'\nPlease reenter a {input_subject}: ')    
        print()

        return inputted_word.lower()

<h3 style='color : DarkMagenta'>All Presidents</h3>

Analysis of the presidential vocabulary by looking at all the inaugural addresses.

Most frequently used terms:<br>
The following list of words are the ten most frequently used presidential inauguration speech terms.  
The numeric values represent the number of times the corresponding words appeared in the combined  presidential inauguration speeches.  

In [25]:
most_freq_words = Counter(all_words).most_common()
# Saves most_freq_words
save_list('most_freq_words', most_freq_words)

# 10 most frequently used words
most_freq_words[:10]

[('government', 651), ('people', 623), ('nation', 515), ('us', 480), ('state', 448), ('great', 394), ('upon', 371), ('must', 366), ('make', 357), ('country', 355)]

In [26]:
# 3 most frequently used words with count
most_freq_words[:3]

[('government', 651), ('people', 623), ('nation', 515)]

The fallowing list of words are the three most frequently used presidential inauguration speeches' terms.  

In [27]:
# 3 most frequently used words
[word[0] for word in most_freq_words[:3]]

['government', 'people', 'nation']

From the three most frequently used term results, we can see that the main topic within the combined inaugural addresses seem to be centered around the terms `government`, `people` and `nation`.<br>

#### Word2Vec, word embeddings model

The idea behind word embeddings is a theory known as the distributional hypothesis. This hypothesis states that words that co-occur in the same contexts tend to have similar meanings.
Word2Vec is a shallow neural network model that can build word embeddings using either continuous bag-of-words or continuous skip-grams.<br>
<br>
The word2vec method that I use to create word embeddings is based on continuous skip-grams. Skip-grams function similarly to n-grams, except instead of looking at groupings of n-consecutive words in a text, we can look at sequences of words that are separated by some specified distance between them.<br>
<br>
For this project, we want to create word embeddings model using the skip-grams word2vec model, within the USA presidential inaugural speeches context.    

In [28]:
word_embeddings = gensim.models.Word2Vec(all_sentences, size=96, window=5, min_count=1, workers=2, sg=1)

Note: ```gensim.models.Word2Vec()``` takes a text as an argument to give context to the words, for this project the sentences in ```all_sentences``` list are the context.

Vocabulary of terms:<br>
For this project, the vocabulary of terms is the list unique words from the words in all sentences list, ```all_words```.<br>
In other words, the vocabulary of terms is the list of words, which are not stop words, use within the inaugural speeches. 

In [29]:
# Removing duplicated words in all_words
vocabulary_of_terms = list(set(all_words))
# Saves vocabulary_of_terms
save_list('vocabulary_of_terms', vocabulary_of_terms)

A sample of a word vector representation generate by the ```word_embedding``` model.  

In [30]:
vec_word = 'us'

# Word verctor representation
print(vec_word)
print(word_embeddings.wv[vec_word])

us
[ 0.26552063  0.38270256  0.01712053  0.19299783 -0.13889301  0.118939
 -0.19792747 -0.24101111 -0.3297443   0.172899    0.08144623  0.15756929
  0.21964279 -0.25878748 -0.04133303 -0.00970211  0.18739106  0.15589143
 -0.22941768 -0.4307043   0.10464998  0.08081249 -0.2098069   0.15595986
  0.0677221  -0.22751325  0.1134905   0.00726796  0.09760104  0.05393142
 -0.22067231 -0.0713683   0.11761973 -0.12144969 -0.07166863  0.04331708
  0.26426578 -0.31814927  0.31505206  0.03957349 -0.26045322 -0.35896567
 -0.15110052  0.02246088  0.04070946  0.3254533  -0.07534913 -0.12527435
 -0.00893944 -0.2963276   0.02741916  0.00778137  0.23978262  0.31302834
 -0.11767249 -0.17107272  0.16745056  0.11205339  0.13693185  0.08523529
  0.0431967  -0.00858821 -0.11992967 -0.06488071 -0.21549924  0.00797113
  0.18548536  0.02269054  0.02282404 -0.08039232  0.03061922  0.03910693
  0.02918376 -0.02713229 -0.03112097 -0.05021107  0.00249148 -0.16663195
  0.06929523  0.09279496 -0.26356545 -0.02614382 -

Using the word vectors we can calculate the cosine distances between the vectors to find out how similar terms are within the USA presidential inaugural speeches context.

Similar terms sample:<br>
Using the word vectors created by word embeddings model, we can calculate the cosine distances between the vectors to find out how similar terms are within the USA presidential inaugural speeches context.

In [31]:
# Optional input function
if input_option:
    similar_to_word = input_word('similar word', vocabulary_of_terms)
else:
    similar_to_word = 'government'
    
# Similar to 
print(similar_to_word)
# Calculate the cosine distance between word vectors outputting the 20 most similar words to the inputted word
similar_word_dist_vec = word_embeddings.wv.most_similar(similar_to_word, topn=20)
# Saves vocabulary_of_terms
save_list('vocabulary_of_terms', vocabulary_of_terms)
# List of similar words and their vectors cosine distance relative to the inputted word
print(similar_word_dist_vec) 

government
[('power', 0.9976229667663574), ('federal', 0.9969825744628906), ('authority', 0.9964199066162109), ('executive', 0.9959920644760132), ('general', 0.9956094622612), ('support', 0.9955458641052246), ('within', 0.9953200221061707), ('grant', 0.9953069090843201), ('territory', 0.9948582649230957), ('exercise', 0.9948453903198242), ('department', 0.9947342276573181), ('establish', 0.9944707751274109), ('reserve', 0.9944655895233154), ('limit', 0.9944608807563782), ('union', 0.9943926930427551), ('administration', 0.9943851232528687), ('protect', 0.9940714836120605), ('defend', 0.9940675497055054), ('action', 0.9937326908111572), ('preserve', 0.9937060475349426)]


In [32]:
# List of the similar words no cosine distance
print(similar_to_word)
print([word[0] for word in similar_word_dist_vec])

government
['power', 'federal', 'authority', 'executive', 'general', 'support', 'within', 'grant', 'territory', 'exercise', 'department', 'establish', 'reserve', 'limit', 'union', 'administration', 'protect', 'defend', 'action', 'preserve']


From  the presidential inaugural addresses data and  by training a word embeddings model with it, I was able to create a some what accurate U.S.A. presidential vocabulary,  the small size of corpus limits how efficiently the model can be trained, nonetheless the model gives us good insight into how terms are connected to each other within the presidential inauguration addresses context. 

<h3 style='color : DarkMagenta'>One President</h3>

Analysis of a president vocabulary by looking at his inaugural addresses.

Preprocessing president names:<br>
From the list `year_president_speech_names`, I can extract the president names, but the list as duplicated president names, ex: `['1789-washington', '1792-washington']`<br>
After removing the the years from the `year_president_speech_names` values, I could used `set()` to remove duplicated values, but `set(`) does not preserve the list values insertion order and I want to keep the values insertion order as `['washington', ..., ..., ... , 'trump']`, the best method to remove a list duplicated values and preserve the values insertion order is to use a `dictionary` as follow:

In [33]:
president_names = list(dict.fromkeys([re.sub(r'^....-', '', name) for name in year_president_speech_names]))
# Saves president_names
save_list('president_names', president_names)

president_names

['washington', 'adams', 'jefferson', 'madison', 'monroe', 'jackson', 'vanburen', 'harrison', 'polk', 'taylor', 'pierce', 'buchanan', 'lincoln', 'grant', 'hayes', 'garfield', 'cleveland', 'mckinley', 'roosevelt', 'taft', 'wilson', 'harding', 'coolidge', 'hoover', 'truman', 'eisenhower', 'kennedy', 'johnson', 'nixon', 'carter', 'reagan', 'bush senior', 'clinton', 'bush', 'obama', 'trump']

In [34]:
# Optional input function
if input_option:
    president_name = input_word('president name', president_names)
else:
    president_name = 'madison'


# Speeches list
one_president_speeches = [presidents_speeches[name] for name in year_president_speech_names if president_name in name]
# Sentences list 
one_president_sentences = [sentence for speech in one_president_speeches for sentence in speech]
# Words list
one_president_all_words = [word for sentence in one_president_sentences for word in sentence]

The president most frequently used terms:

In [35]:
one_president_most_freq_words = Counter(one_president_all_words).most_common()
# 10 most frequently used words
print(president_name)
print(one_president_most_freq_words[:10])

madison
[('war', 17), ('nation', 13), ('country', 11), ('state', 10), ('public', 8), ('unite', 8), ('right', 7), ('every', 7), ('without', 6), ('long', 6)]


In [36]:
# 3 most frequently used words
print(president_name)
print([word[0] for word in one_president_most_freq_words[0:3]])

madison
['war', 'nation', 'country']


The president word embeddings model

In [37]:
one_president_word_embeddings = gensim.models.Word2Vec(one_president_sentences, size=96, window=5, min_count=1, workers=2, sg=1)

The president vocabulary of terms:

In [38]:
# Removing duplicated words in one_president_all_words
one_president_vocabulary_of_terms = list(set(one_president_all_words))

How similar terms are within the president's presidential inaugural speeches context.

In [39]:
# Optional input function
if input_option:
    one_president_similar_to_word = input_word('word', one_president_vocabulary_of_terms)
else:
    one_president_similar_to_word = 'government'
    
# Similar to 
print(f'{president_name}\'s {one_president_similar_to_word}')
# Calculate the cosine distance between word vectors outputting the 20 most similar words to the inputted word
one_president_similar_word_dist = one_president_word_embeddings.wv.most_similar(one_president_similar_to_word, topn=20)
# List of similar words and their vectors cosine distance relative to the inputted word
print(one_president_similar_word_dist)

madison's government
[('consideration', 0.28006595373153687), ('captive', 0.26598575711250305), ('department', 0.26537370681762695), ('authorize', 0.2532273232936859), ('advancement', 0.24378815293312073), ('would', 0.23810715973377228), ('vessel', 0.22875340282917023), ('conversion', 0.22632265090942383), ('bring', 0.22400909662246704), ('enemy', 0.21681468188762665), ('however', 0.21531297266483307), ('devote', 0.21416102349758148), ('sword', 0.21349751949310303), ('equality', 0.21096564829349518), ('torture', 0.2059607207775116), ('supplication', 0.20000843703746796), ('value', 0.19880428910255432), ('wrong', 0.19419923424720764), ('another', 0.19280079007148743), ('refuse', 0.19142824411392212)]


The cosine distance values of the most similar words are under 0.5, the result are less than satisfying due to the small size of the corpus, ```one_president_sentences```, used to train the word embeddings model.

The list of the similar terms with no cosine distance:

In [40]:
print(f'{president_name}\'s {one_president_similar_to_word}')
print([word[0] for word in one_president_similar_word_dist])

madison's government
['consideration', 'captive', 'department', 'authorize', 'advancement', 'would', 'vessel', 'conversion', 'bring', 'enemy', 'however', 'devote', 'sword', 'equality', 'torture', 'supplication', 'value', 'wrong', 'another', 'refuse']


I thought that it will good idea to create a presidents' vocabularies DataFrame :

In [41]:
# Creates a DataFrame 
df_presidents_vocabularies = pd.DataFrame(index=president_names)

# Sepeeches list
all_presidents_speeches = [[presidents_speeches[name] for name in year_president_speech_names if president in name] \
                                                                                        for president in president_names]
# Sentences list 
all_presidents_sentences = [[sentence for speech in speeches for sentence in speech] \
                                                                        for speeches in all_presidents_speeches]
# Words list
all_presidents_all_words = [[word for sentence in sentences for word in sentence] \
                                                                        for sentences in all_presidents_sentences]

# Each president most three recurrent words 
df_presidents_vocabularies['Three Most  Recurrent Terms'] = [[word[0] for word in Counter(words).most_common()[:3]] \
                                                                                    for words in all_presidents_all_words]
 
# Each president most 10 recurrent words 
df_presidents_vocabularies['Ten Most Recurrent Terms'] = [[word[0] for word in Counter(words).most_common()[:15]] \
                                                                                      for words in all_presidents_all_words]
# Each president vocabulary of terms
df_presidents_vocabularies['Terms List'] = [list(set(one_president_all_words)) \
                                                                   for presidents_all_words in all_presidents_all_words]
# Saves DataFrame
df_presidents_vocabularies.to_csv('data/presidents_vocabularies.csv')
df_presidents_vocabularies

Unnamed: 0,Three Most Recurrent Terms,Ten Most Recurrent Terms,Terms List
washington,"[government, every, may]","[government, every, may, citizen, present, cou...","[remember, let, isle, publicly, manufacture, p..."
adams,"[government, nation, people]","[government, nation, people, union, upon, coun...","[remember, let, isle, publicly, manufacture, p..."
jefferson,"[may, public, citizen]","[may, public, citizen, us, government, fellow,...","[remember, let, isle, publicly, manufacture, p..."
madison,"[war, nation, country]","[war, nation, country, state, public, unite, r...","[remember, let, isle, publicly, manufacture, p..."
monroe,"[state, great, government]","[state, great, government, war, citizen, unite...","[remember, let, isle, publicly, manufacture, p..."
jackson,"[government, people, state]","[government, people, state, power, public, uni...","[remember, let, isle, publicly, manufacture, p..."
vanburen,"[people, every, country]","[people, every, country, institution, governme...","[remember, let, isle, publicly, manufacture, p..."
harrison,"[power, people, state]","[power, people, state, government, upon, const...","[remember, let, isle, publicly, manufacture, p..."
polk,"[government, state, union]","[government, state, union, power, would, one, ...","[remember, let, isle, publicly, manufacture, p..."
taylor,"[shall, government, duty]","[shall, government, duty, interest, country, h...","[remember, let, isle, publicly, manufacture, p..."


<h3 style='color : DarkMagenta'>Selection of Presidents</h3>

We can analyze further, using word embeddings, the presidential vocabulary by combining the first five US presidents' inaugural speeches and compare the results with the results from last five US presidents' inaugural speeches.

Preprocessing the data:

In [42]:
# All words list
first_5_presidents_all_words = [word for words in all_presidents_all_words[:5] for word in words]
last_5_presidents_all_words = [word for words in all_presidents_all_words[len(all_presidents_all_words)-6:-1] \
                                                                                                       for word in words]

# Sentences list
first_5_presidents_sentences = [sentence for sentences in all_presidents_sentences[:5] for sentence in sentences]
last_5_presidents_sentences = [sentence for sentences in all_presidents_sentences[len(all_presidents_sentences)-6:-1] \
                                                                                                for sentence in sentences]
# Vocabulary of terms:
first_5_presidents_vocabulary = list(set(first_5_presidents_all_words))
last_5_presidents_vocabulary = list(set(last_5_presidents_all_words))

Most frequently used terms:

In [43]:
# First five presidents
first_5_presidents_most_freq_words = Counter(first_5_presidents_all_words).most_common()
# 10 most frequently used words
print('First Five Presidents')
print(first_5_presidents_most_freq_words[:10])

# Last five presidents
last_5_presidents_most_freq_words = Counter(last_5_presidents_all_words).most_common()
# 10 most frequently used words
print('\nLast Five Presidents')
print(last_5_presidents_most_freq_words[:10])

First Five Presidents
[('government', 105), ('state', 103), ('nation', 81), ('great', 75), ('may', 69), ('citizen', 66), ('country', 65), ('people', 64), ('war', 61), ('public', 60)]

Last Five Presidents
[('us', 176), ('america', 111), ('must', 105), ('nation', 104), ('world', 101), ('new', 101), ('american', 95), ('time', 91), ('people', 90), ('freedom', 81)]


In [44]:
# First five presidents
first_5_presidents_most_freq_words = Counter(first_5_presidents_all_words).most_common()
# 3 most frequently used words
print('First Five Presidents')
print(first_5_presidents_most_freq_words[:3])

# Last five presidents
last_5_presidents_most_freq_words = Counter(last_5_presidents_all_words).most_common()
# 3 most frequently used words
print('\nLast Five Presidents')
print(last_5_presidents_most_freq_words[:3])

First Five Presidents
[('government', 105), ('state', 103), ('nation', 81)]

Last Five Presidents
[('us', 176), ('america', 111), ('must', 105)]


In [45]:
# 3 first most frequently used words

print('First Five Presidents')
print([word[0] for word in first_5_presidents_most_freq_words[:3]])

print('\nLast Five Presidents')
print([word[0] for word in last_5_presidents_most_freq_words[:3]])

First Five Presidents
['government', 'state', 'nation']

Last Five Presidents
['us', 'america', 'must']


Word embeddings:

In [46]:
first_5_presidents_word_embeddings = gensim.models.Word2Vec(first_5_presidents_sentences, size=96, window=5, 
                                                                                                   min_count=1, workers=2, sg=1)
last_5_presidents_word_embeddings = gensim.models.Word2Vec(last_5_presidents_sentences, size=96, window=5, 
                                                                                                   min_count=1, workers=2, sg=1)

Similar words:

In [47]:
# Optional input function
input_option = False
if input_option:
    first_last_pre_voc = list(set(first_5_presidents_vocabulary + last_5_presidents_vocabulary))
    first_last_pre_similar_to_word = input_word('First and last four presidents word', first_last_pre_voc)
else:
    first_last_pre_similar_to_word = 'government'

# Calculate the cosine distance between word vectors outputting the 20 most similar words to the inputted word
first_5_pre_similar_word_dist = first_5_presidents_word_embeddings.wv.most_similar(first_last_pre_similar_to_word, topn=20)
last_5_pre_similar_word_dist = last_5_presidents_word_embeddings.wv.most_similar(first_last_pre_similar_to_word, topn=20)

# List of similar words and their vectors cosine distance relative to the inputted word
print(f'First five presidents {first_last_pre_similar_to_word}')
print(first_5_pre_similar_word_dist)
print(f'\nLast five presidents {first_last_pre_similar_to_word}')
print(last_5_pre_similar_word_dist)

First five presidents government
[('great', 0.9993847012519836), ('would', 0.9993748068809509), ('nation', 0.999360978603363), ('power', 0.9993358254432678), ('may', 0.9992808103561401), ('people', 0.9992750883102417), ('union', 0.9992700815200806), ('country', 0.9992687702178955), ('state', 0.9992679357528687), ('war', 0.9992614388465881), ('place', 0.9992455840110779), ('every', 0.9992408156394958), ('party', 0.9992370009422302), ('high', 0.99916011095047), ('right', 0.9991583228111267), ('principle', 0.9991539716720581), ('part', 0.9991317391395569), ('citizen', 0.9991250038146973), ('public', 0.999118447303772), ('us', 0.9991143345832825)]

Last five presidents government
[('america', 0.9995765089988708), ('work', 0.999546229839325), ('world', 0.9995343685150146), ('people', 0.9995193481445312), ('day', 0.9995089769363403), ('new', 0.9995080232620239), ('nation', 0.99950110912323), ('us', 0.999485433101654), ('citizen', 0.9994753003120422), ('great', 0.9994724988937378), ('know', 0

The cosine distance values of the most similar words are close to 1, the results are satisfying, better than ones from one president, the results are better due to the larger size of the corpus used to train the word embeddings models.

Most similar words, no cosine distances:

In [48]:
print(f'First five presidents {first_last_pre_similar_to_word}')
print([word[0] for word in first_5_pre_similar_word_dist])
print(f'\nLast five presidents {first_last_pre_similar_to_word}')
print([word[0] for word in last_5_pre_similar_word_dist])

First five presidents government
['great', 'would', 'nation', 'power', 'may', 'people', 'union', 'country', 'state', 'war', 'place', 'every', 'party', 'high', 'right', 'principle', 'part', 'citizen', 'public', 'us']

Last five presidents government
['america', 'work', 'world', 'people', 'day', 'new', 'nation', 'us', 'citizen', 'great', 'know', 'child', 'must', 'freedom', 'life', 'call', 'place', 'need', 'power', 'every']
