This notebook analyzes U.S. presidential inauguration speeches using TF-IDF information. The script below extracts the five most relevant words of each speech (according to the TF-IDF weights) and then uses LDA to obtain 5 different topics present in the corpus of speeches.





In [92]:
#!/usr/bin/env python3

import glob
import os
import sys

import numpy
import pandas as pd

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import TfidfVectorizer

# Number of LDA topics and number of top TF-IDF words to report
num_topics = 5
num_words  = 5





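def get_top_words(row, feature_names, weights, n=5):
  """Return the n highest-weighted words of one TF-IDF row and their weights."""
  # Indices of the n largest entries, in descending order
  top_ids     = numpy.argsort(row)[:-n-1:-1]
  top_names   = [feature_names[i] for i in top_ids]
  # Dividing by the IDF weight leaves the (normalized) term-frequency part
  top_weights = [row[i] / weights[i] for i in top_ids]

  return top_names, top_weights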





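def get_topics(model, feature_names, n=3):
  """Return one string per topic, made of the topic's n highest-weighted words."""
  topics = []
  for topic in model.components_:
    # argsort is ascending, so the reversed slice yields the n largest entries
    words = " ".join(feature_names[i] for i in topic.argsort()[:-n-1:-1])
    topics.append(words)

  return topics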





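def get_year_and_name(filename):
  """Split a file name of the form YYYY_First_Last.txt into (year, name)."""
  basename = os.path.basename(filename)
  name     = os.path.splitext(basename)[0]
  name     = name.replace("_", " ")
  year     = name[:4]   # first four characters are the year
  name     = name[5:]   # skip the year and the separator after it

  return year, name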




In [93]:
# TF-IDF analysis
# main

if __name__ == "__main__":

    path = "C:/data620_finalproject/data"

    # Lookup tables from speaker name / year to the speech file
    filename_to_name = {}
    filename_to_year = {}

    for filename in glob.glob(os.path.join(path, '*.txt')):
        year, name = get_year_and_name(filename)
        #print(year,name)
        filename_to_name[name] = filename
        filename_to_year[year] = filename

        # glob already returns the full path, so the file can be opened directly
        with open(filename, 'r', encoding='utf8') as f:
            text = [f.read()]

        # 'content' is the valid setting for raw strings ('text' is not an accepted value).
        # Note: the vectorizer is fit on one speech at a time.
        tf_idf_vectorizer = TfidfVectorizer(input='content',
                                            stop_words='english',
                                            #max_df=0.50,
                                            strip_accents='unicode')
        tf_idf = tf_idf_vectorizer.fit_transform(text)

        # get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
        feature_names = tf_idf_vectorizer.get_feature_names_out()

        for index, _ in enumerate(text):
            top_words, top_weights = get_top_words(tf_idf[index].toarray().ravel(),
                                                   feature_names,
                                                   tf_idf_vectorizer.idf_,
                                                   num_words)
            print("%s (%s): %s" % (year, name, " ".join(top_words)))

        # Min-max scale the top weights to [0, 1] within this speech
        lowest, highest = min(top_weights), max(top_weights)
        for index, (word, weight) in enumerate(zip(top_words, top_weights)):
            if highest != lowest:
                weight = (weight - lowest) / (highest - lowest)
            else:
                weight = 1.0

            print("%s %d \"%s\" %s %f" % (year, index, name, word, weight))

        

1789 (George Washington): government public united citizens country
1789 0 "George Washington" government 1.000000
1789 1 "George Washington" public 0.500000
1789 2 "George Washington" united 0.250000
1789 3 "George Washington" citizens 0.000000
1789 4 "George Washington" country 0.000000
1793 (George Washington): shall united work government oath
1793 0 "George Washington" shall 1.000000
1793 1 "George Washington" united 1.000000
1793 2 "George Washington" work 0.000000
1793 3 "George Washington" government 0.000000
1793 4 "George Washington" oath 0.000000
1797 (John Adams): people government nations states country
1797 0 "John Adams" people 1.000000
1797 1 "John Adams" government 0.700000
1797 2 "John Adams" nations 0.100000
1797 3 "John Adams" states 0.100000
1797 4 "John Adams" country 0.000000
1801 (Thomas Jefferson): government let citizens fellow man
1801 0 "Thomas Jefferson" government 1.000000
1801 1 "Thomas Jefferson" let 0.142857
1801 2 "Thomas Jefferson" citizens 0.142857
1

1921 (Warren Harding): world america war new civilization
1921 0 "Warren Harding" world 1.000000
1921 1 "Warren Harding" america 0.187500
1921 2 "Warren Harding" war 0.187500
1921 3 "Warren Harding" new 0.062500
1921 4 "Warren Harding" civilization 0.000000
1925 (Calvin Coolidge): country great people government world
1925 0 "Calvin Coolidge" country 1.000000
1925 1 "Calvin Coolidge" great 0.750000
1925 2 "Calvin Coolidge" people 0.500000
1925 3 "Calvin Coolidge" government 0.500000
1925 4 "Calvin Coolidge" world 0.000000
1929 (Herbert Hoover): government people progress world peace
1929 0 "Herbert Hoover" government 1.000000
1929 1 "Herbert Hoover" people 0.105263
1929 2 "Herbert Hoover" progress 0.052632
1929 3 "Herbert Hoover" world 0.000000
1929 4 "Herbert Hoover" peace 0.000000
1933 (Franklin Roosevelt): national people helped leadership shall
1933 0 "Franklin Roosevelt" national 1.000000
1933 1 "Franklin Roosevelt" people 0.333333
1933 2 "Franklin Roosevelt" helped 0.000000
1933 

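The scores above make sense. They are min-max scaled within each speech, so the most heavily weighted of the five words gets 1.0 and the least heavily weighted gets 0.0. In TF-IDF generally, the more common a word is across documents, the lower its score, and the more distinctive a word is, the higher its score. Note, however, that the loop above fits a separate vectorizer on each speech, so the IDF factor is the same for every word and the ranking mostly reflects how often a word occurs within that speech. "Country", a word common to all the documents, sits at the bottom of several of the lists. Comparing Obama and Trump, "people" appears as a distinctive word for Obama, whereas for Trump it is a common word.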
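
As a rough illustration of how the IDF factor behaves, the sketch below computes TF-IDF by hand on a made-up three-document corpus. It assumes the smoothed IDF formula that scikit-learn's TfidfVectorizer uses by default, ln((1 + n) / (1 + df)) + 1, and it skips the L2 normalization step. The corpus, the document names and the helper functions are hypothetical and are not part of the notebook.

```python
import math
from collections import Counter

# Hypothetical mini-corpus, invented purely for illustration
corpus = {
    "speech_a": "country people freedom people",
    "speech_b": "country government union",
    "speech_c": "country peace progress",
}
tokenized = {name: text.split() for name, text in corpus.items()}


def document_frequency(docs):
    """Count in how many documents each term occurs."""
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))
    return df


def idf(term, docs):
    """Smoothed IDF: ln((1 + n) / (1 + df)) + 1."""
    df = document_frequency(docs)
    return math.log((1 + len(docs)) / (1 + df[term])) + 1


def tf_idf(name, docs):
    """Raw term count times IDF (no L2 normalization)."""
    counts = Counter(docs[name])
    return {term: count * idf(term, docs) for term, count in counts.items()}


# Weights for the first document, highest first
weights_a = tf_idf("speech_a", tokenized)
for term, weight in sorted(weights_a.items(), key=lambda kv: -kv[1]):
    print("%-8s %.3f" % (term, weight))

# With a single document, every term has df = n = 1, so the IDF is always 1
single = {"speech_a": tokenized["speech_a"]}
print({term: idf(term, single) for term in set(single["speech_a"])})
```

In this toy corpus "country" occurs in every document and therefore receives the smallest IDF, while "people" and "freedom" occur in only one document and are weighted up. The last line shows that when the statistics come from a single document, the IDF is identical for all terms and only the term counts separate them.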

In brief, LDA assumes a fixed number of topics, each represented by a set of words. The goal of LDA is to map every document to those topics in such a way that the words in each document are mostly accounted for by the assumed topics.

In [91]:
path = "C:/data620_finalproject/data"

# Lookup tables from speaker name / year to the speech file
filename_to_name = {}
filename_to_year = {}

for filename in glob.glob(os.path.join(path, '*.txt')):
    year, name = get_year_and_name(filename)
    #print(year,name)
    filename_to_name[name] = filename
    filename_to_year[year] = filename

    # glob already returns the full path, so the file can be opened directly
    with open(filename, 'r', encoding='utf8') as f:
        text = [f.read()]

    # 'content' is the valid setting for raw strings ('text' is not an accepted value)
    tf_idf_vectorizer = TfidfVectorizer(input='content',
                                        stop_words='english',
                                        #max_df=0.50,
                                        strip_accents='unicode')
    tf_idf = tf_idf_vectorizer.fit_transform(text)

    # One LDA model is fit per speech
    lda = LatentDirichletAllocation(n_components=num_topics,
                                    learning_method="online",
                                    max_iter=20,
                                    random_state=42)
    lda.fit(tf_idf)

    # get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
    topics = get_topics(lda, tf_idf_vectorizer.get_feature_names_out())
    for index, topic in enumerate(topics):
        print("Topic %d %s (%s): %s" % (index, year, name, topic))

    # Topic scores for this speech; index [-2] selects the topic with the
    # second-highest score (use [-1] for the highest-scoring topic instead)
    scores   = lda.transform(tf_idf).ravel()
    selected = numpy.argsort(scores)[-2]

    print(topics[selected])


Topic 0 1789 (George Washington): government public united
Topic 1 1789 (George Washington): prosperity whilst repaired
Topic 2 1789 (George Washington): watch inferior tribute
Topic 3 1789 (George Washington): genuine almighty add
Topic 4 1789 (George Washington): feelings executive considered
feelings executive considered
Topic 0 1793 (George Washington): united shall government
Topic 1 1793 (George Washington): confidence presence fellow
Topic 2 1793 (George Washington): constitution endeavor office
Topic 3 1793 (George Washington): requires punishment distinguished
Topic 4 1793 (George Washington): constitutional office country
constitution endeavor office
Topic 0 1797 (John Adams): continue love rampart
Topic 1 1797 (John Adams): discontents gratitude election
Topic 2 1797 (John Adams): immortal classes conventions
Topic 3 1797 (John Adams): concerning add christians
Topic 4 1797 (John Adams): people government states
continue love rampart
Topic 0 1801 (Thomas Jefferson): governme

Topic 0 1905 (Theodore Roosevelt): govern great development
Topic 1 1905 (Theodore Roosevelt): failed righteousness crises
Topic 2 1905 (Theodore Roosevelt): undertaken lofty earth
Topic 3 1905 (Theodore Roosevelt): people life problems
Topic 4 1905 (Theodore Roosevelt): wither dead tremendous
govern great development
Topic 0 1909 (William Howard Taft): boycott principles remedies
Topic 1 1909 (William Howard Taft): feeling question continue
Topic 2 1909 (William Howard Taft): favorably enactment injuries
Topic 3 1909 (William Howard Taft): government business congress
Topic 4 1909 (William Howard Taft): usual factory sufficient
boycott principles remedies
Topic 0 1913 (Woodrow Wilson): efficiency sanitary concentrating
Topic 1 1913 (Woodrow Wilson): great men government
Topic 2 1913 (Woodrow Wilson): shown taken stirred
Topic 3 1913 (Woodrow Wilson): duty weakening model
Topic 4 1913 (Woodrow Wilson): control come convictions
shown taken stirred
Topic 0 1917 (Woodrow Wilson): far vind

LDA works backward from the documents to find a set of topics that are likely to have generated the collection. For instance, for President Obama's 2009 speech the topic words were spirit, history, fist. This makes sense, since the election of Obama was historic and was about the spirit of America. LDA thus offers a way to make sense of a president's speech.

https://medium.com/@ajgabriel_30288/sentiment-analysis-of-political-speeches-managing-unstructured-text-using-r-b090a42c0bf5  
https://bastian.rieck.me/blog/posts/2017/inauguration_speeches_brief/  
https://github.com/hunsnowboarder/sentiment_analyis_aws_cloud/blob/master/AWS_NLP.ipynb  
https://github.com/Pseudomanifold/us-inauguration-speeches/tree/master/Data  
https://bastian.rieck.me/blog/posts/2017/inauguration_speeches_sentiment_analysis/  
https://www.dataquest.io/blog/web-scraping-beautifulsoup/  
