
WordNet is a lexical database (in effect, a dictionary) of the English language, designed specifically for natural language processing. A lemma is WordNet's version of a dictionary entry: the base form of a word, so that "running" and "ran" both map to "run". The following analysis identifies some characteristics of each speech: the number of sentences and words, the average sentence length, and the number of unique words and unique lemmas.
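The statistics themselves can be illustrated without NLTK. The sketch below is a minimal stand-in, assuming a naive regex tokenizer: the helpers `split_sentences`, `split_words`, and `speech_stats` are hypothetical names, not part of NLTK, and are much cruder than `sent_tokenize` and `word_tokenize`. It only shows what is being counted.

```python
import re

def split_sentences(text):
    """Naively split after '.', '!' or '?' followed by whitespace (an assumption, not NLTK)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def split_words(text):
    """Naively extract runs of letters and apostrophes (an assumption, not NLTK)."""
    return re.findall(r"[A-Za-z']+", text)

def speech_stats(text):
    """Return sentence count, word count, average sentence length, and unique word count."""
    sentences = split_sentences(text)
    words = split_words(text)
    return {
        "num_sentences": len(sentences),
        "num_words": len(words),
        "avg_sentence_len": len(words) / len(sentences),
        # Case-sensitive on purpose, to mirror len(set(words)) in the analysis
        "num_unique_words": len(set(words)),
    }

sample = "We the people are here. The people are strong. We are one!"
stats = speech_stats(sample)
print(stats)
```

Because the unique-word count is case-sensitive, "The" and "the" in the sample are counted as two different words, which is also how the analysis below behaves.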


In [1]:
#!/usr/bin/env python3

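# Import libraries. The tokenizers, POS tagger and lemmatizer used below
# also need their NLTK data packages, installed once with nltk.download().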

import glob
import nltk
import os
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer




In [2]:

def get_wordnet_tag(tag):

  """Map a Penn Treebank POS tag to a WordNet POS code ('a', 'r', 'n', 'v'), or None."""

  if tag.startswith('JJ'):

    return 'a'

  elif tag.startswith('RB') or tag == "WRB":

    return 'r'

  elif tag.startswith('NN') or tag.startswith("WP"):

    return 'n'

  elif tag.startswith('VB'):

    return 'v'

  else:

    return None
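
To see what this mapping does, the sketch below restates it as a prefix table and applies it to a few Penn Treebank tags. The table-driven form and the names `PREFIX_TO_POS` and `wordnet_pos` are illustrative assumptions, not the notebook's own code, and the sample tags are hand-written rather than produced by `nltk.pos_tag`.

```python
# Prefix of a Penn Treebank tag -> WordNet part-of-speech code.
# Illustrative restatement of get_wordnet_tag; wordnet_pos is a hypothetical name.
PREFIX_TO_POS = (
    ("JJ", "a"),   # adjectives: JJ, JJR, JJS
    ("RB", "r"),   # adverbs: RB, RBR, RBS
    ("WRB", "r"),  # wh-adverbs such as "where" and "when"
    ("NN", "n"),   # nouns: NN, NNS, NNP, NNPS
    ("WP", "n"),   # wh-pronouns such as "who" and "what"
    ("VB", "v"),   # verbs: VB, VBD, VBG, VBN, VBP, VBZ
)

def wordnet_pos(tag):
    """Return the WordNet POS code for a tag, or None if no prefix matches."""
    for prefix, pos in PREFIX_TO_POS:
        if tag.startswith(prefix):
            return pos
    return None

for tag in ["JJR", "RBS", "WRB", "NNS", "WP$", "VBD", "DT"]:
    print(tag, "->", wordnet_pos(tag))
```

Tags with no WordNet counterpart, such as the determiner tag `DT`, map to `None`; the main loop then keeps the word unchanged instead of lemmatizing it.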





In [3]:
# Form a dictionary of speeches keyed by "YYYY-Name".
# File names are assumed to look like YYYY_First_Last.txt.

speeches = dict()

for filename in glob.glob("Data/*.txt"):

  basename = os.path.basename(filename)

  name     = os.path.splitext(basename)[0]

  name     = name.replace("_", " ")

  year     = name[:4]

  name     = year+"-"+name[5:]

  with open(filename, 'r', encoding='utf8') as f:

    speech         = f.read()

    speeches[name] = speech



lemmatizer = WordNetLemmatizer()



# Per-speech statistics, each keyed by "YYYY-Name"

num_sent        = dict()

num_words       = dict()

avg_sent_len    = dict()

num_uniq_lemmas = dict()

num_uniq_words  = dict()



print("# YYYY name num_sentences num_words avg_sentence_len num_unique_words num_unique_lemmas")



for president in sorted(speeches.keys()):

  speech    = speeches[president]

  sentences = sent_tokenize(speech)

  words     = word_tokenize(speech)



  # Average sentence length in tokens; word_tokenize counts punctuation marks as tokens too
  avg_sent_len[president]  = 0.0

  for sentence in sentences:

    avg_sent_len[president] += len(word_tokenize(sentence))

  avg_sent_len[president] /= len(sentences)



  num_sent[president]    = len(sentences)

  num_words[president]        = len(words)

  # Case-sensitive count: "The" and "the" are distinct, and punctuation tokens are included
  num_uniq_words[president] = len(set(words))



  tagged = nltk.pos_tag(words)

  lemmas  = set()



  for word,tag in tagged:

    pos = get_wordnet_tag(tag)

    if pos:

      lemmas.add(lemmatizer.lemmatize(word, pos=pos))

    else:

      lemmas.add(word)



  year = int(president[:4])

  name = president[5:]



  num_uniq_lemmas[president] = len(lemmas)

  print('%d "%s" %d %d %f %d %d' % (year, name, num_sent[president], num_words[president], avg_sent_len[president], num_uniq_words[president], num_uniq_lemmas[president] ) )

# YYYY name num_sentences num_words avg_sentence_len num_unique_words num_unique_lemmas
1789 "George Washington" 27 1580 58.518519 639 600
1793 "George Washington" 6 177 29.500000 110 105
1797 "John Adams" 39 2608 66.871795 833 768
1801 "Thomas Jefferson" 43 1951 45.372093 724 684
1805 "Thomas Jefferson" 47 2404 51.148936 814 739
1809 "James Madison" 23 1287 55.956522 544 516
1813 "James Madison" 35 1332 38.057143 552 520
1817 "James Monroe" 123 3697 30.056911 1047 940
1821 "James Monroe" 131 4905 37.442748 1268 1119
1825 "John Quincy Adams" 76 3167 41.671053 1012 924
1829 "Andrew Jackson" 27 1232 45.629630 528 509
1833 "Andrew Jackson" 32 1290 40.312500 507 479
1837 "Martin Van Buren" 97 4171 43.000000 1324 1187
1841 "William Henry Harrison" 212 9113 42.985849 1919 1653
1845 "James K. Polk" 155 5201 33.554839 1345 1204
1849 "Zachary Taylor" 24 1206 50.250000 507 487
1853 "Franklin Pierce" 106 3659 34.518868 1176 1084
1857 "James Buchanan" 91 3107 34.142857 951 879
1861 "Abraham Lincol

Comparing the two most recent presidents, President Trump uses more sentences than President Obama, but Obama uses more unique words and more unique lemmas than Trump. The shortest speech is George Washington's in 1793, with only 6 sentences and 177 words.
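
Observations like these can be checked directly against the printed table. The sketch below, assuming the output format shown above and using only its first three rows as sample input, parses each row with `shlex.split` (so the quoted name stays in one field) and finds the speech with the fewest sentences. The names `rows`, `parse_row`, `parsed`, and `shortest` are hypothetical.

```python
import shlex

# First three rows of the printed table, copied as sample input.
rows = [
    '1789 "George Washington" 27 1580 58.518519 639 600',
    '1793 "George Washington" 6 177 29.500000 110 105',
    '1797 "John Adams" 39 2608 66.871795 833 768',
]

def parse_row(line):
    """Split one output row into typed fields; shlex keeps the quoted name together."""
    year, name, n_sent, n_words, avg_len, n_uniq, n_lemmas = shlex.split(line)
    return {
        "year": int(year),
        "name": name,
        "num_sentences": int(n_sent),
        "num_words": int(n_words),
        "avg_sentence_len": float(avg_len),
        "num_unique_words": int(n_uniq),
        "num_unique_lemmas": int(n_lemmas),
    }

parsed = [parse_row(r) for r in rows]

# Speech with the fewest sentences among the sample rows
shortest = min(parsed, key=lambda s: s["num_sentences"])
print(shortest["year"], shortest["name"], shortest["num_sentences"])
```

The same `min` or `max` call with a different key, for example `num_unique_lemmas`, would support the other comparisons once the full table is loaded.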

