# Mini-Project: Comparing Trump's and Biden's Inaugural Speeches

We will use a mini CSS project as an extended example to put into practice the concepts we are learning. The project aims to analyze and compare the inaugural speeches of the current and last US presidents. We will guide you through each successive step.

The speech transcripts were obtained from https://millercenter.org/the-presidency/presidential-speeches and copied in the text files `biden_inauguration_millercenter.txt` and `trump_inauguration_millercenter.txt` in the `data` folder.


## 1. Import data

First, we will get the data into a Python-native format. Create a function that reads one of the text files into a single string and returns the string. We have provided some skeleton code for you to use.  

In [104]:
# import

import pandas as pd

trump = "/Users/loiswong/Downloads/css/data/trump_inauguration_millercenter.txt"
biden = "/Users/loiswong/Downloads/css/data/biden_inauguration_millercenter.txt"

In [105]:
def get_text(fname):
    """Read given text file and return a string with the contents."""
    
    # Open the file and get the text into a string variable called txt
    with open(fname) as f:
        txt = f.read()
        
    # Remove any trailing white space and paragraphs
    # Format consistently by replacing ’ with '
    return txt.strip().replace("’", "'")
    

In [106]:
# Call the function on Trump's speech and print the first 500 words
trump = get_text(trump)
print(trump[:500])

Chief Justice Roberts, President Carter, President Clinton, President Bush, President Obama, fellow Americans, and people of the world: thank you.

We, the citizens of America, are now joined in a great national effort to rebuild our country and to restore its promise for all of our people.

Together, we will determine the course of America and the world for years to come.

We will face challenges. We will confront hardships. But we will get the job done.

Every four years, we gather on these st


## 2. Clean and tokenize text

In the next step, we will process the data so that a machine can analyze it. Create another function called `get_tokens()` that takes a string with something that looks like a speech, cleans up the text, and extract a list of all the words used in the speech in the order they appear. We have provided some clues below.

In [107]:
def get_tokens(txt):
    """Take given string and return a list with all words in lowercase
    in the order they appear in the text. Common contractions are expanded
    and hyphenated words are combined in one word.
    """
    modified = txt.lower().replace("'", "").replace("I've", "I have").replace("can't", "cannot")
    modified.replace(",", "").replace(".", "").replace(":", "").replace("–", "")
    return list(modified.split(" "))
                
    # Get rid of possessives so that nation's becomes nation 
    # Expand contractions such that I've becomes I have, can't becomes cannot, etc.
    # Remove all punctuation 
    # Convert to lower-case      
    # Break into words
    # Return the list of tokens

In [108]:
# Call the function on Trump's speech and print the first 50 tokens
trump_tokens = get_tokens(trump)
print(trump_tokens[:50])

['chief', 'justice', 'roberts,', 'president', 'carter,', 'president', 'clinton,', 'president', 'bush,', 'president', 'obama,', 'fellow', 'americans,', 'and', 'people', 'of', 'the', 'world:', 'thank', 'you.\n\nwe,', 'the', 'citizens', 'of', 'america,', 'are', 'now', 'joined', 'in', 'a', 'great', 'national', 'effort', 'to', 'rebuild', 'our', 'country', 'and', 'to', 'restore', 'its', 'promise', 'for', 'all', 'of', 'our', 'people.\n\ntogether,', 'we', 'will', 'determine', 'the']


## 3. Count words

Now, tokenize Biden's speech in the same way. How many words does each speech contains? Who has the longer speech?

In [109]:
biden_tokens = get_tokens(biden)
biden = get_text(biden)
print(biden_tokens[:50])

['chief', 'justice', 'roberts,', 'vice', 'president', 'harris,', 'speaker', 'pelosi,', 'leader', 'schumer,', 'leader', 'mcconnell,', 'vice', 'president', 'pence,', 'distinguished', 'guests,', 'and', 'my', 'fellow', 'americans.\n\nthis', 'is', 'americas', 'day.\n\nthis', 'is', 'democracys', 'day.\n\na', 'day', 'of', 'history', 'and', 'hope.\n\nof', 'renewal', 'and', 'resolve.\n\nthrough', 'a', 'crucible', 'for', 'the', 'ages', 'america', 'has', 'been', 'tested', 'anew', 'and', 'america', 'has', 'risen', 'to']


## 4. Evaluate vocabulary

Next, look at the unique words used by each speaker. Who uses more unique words? Whose speech is more repetitive?

Biden has more unique words in his speech than Trump.

In [110]:
#unique words from Trump's speech 
trump_unique_words = set(trump_tokens)

#number of unique words
print("Trump has", len(trump_unique_words), "unique words in his speech")


Trump has 628 unique words in his speech


In [111]:
#unique words from Biden's speech
biden_unique_words = set(biden_tokens)

#number of unique words
print("Biden has", len(biden_unique_words), "unique words in his speech")


Biden has 905 unique words in his speech


## 5. Discover the main themes

Finally, we will identify the most repeated words, which will give us an idea of the main recurring themes in the speeches.

To begin with, write a function that identifies the most commonly used meaningful words. We will count the number of times each unique word is mentioned in the speech but exclude non-meaningful words such as articles and prepositions, because these are trivially common. 

Use the helping code below and write your code around it to complete the function and call it.


In [112]:
# We declare a global variable to list stop words. 
# Stop words are common words that are not meaningful in this context.
STOP_WORDS = ['a', 'about', 'across', 'after', 'an', 'and', 'any', 'are', 'as', 'at', 
              'be', 'because', 'but', 'by', 'did', 'do', 'does', 'for', 'from',
              'get', 'has', 'have', 'if', 'in', 'is', 'it', 'its',
              'many', 'more', 'much', 'no', 'not', 'of', 'on', 'or', 'out',
              'so', 'some', 'than', 'the', 'this', 'that', 'those', 'through', 'to',
              'very', 'what', 'where', 'whether', 'which', 'while', 'who', 'with']


def get_word_counts(tokens, stopwords = STOP_WORDS):
    """Take a list of tokens and a list of stopwords and 
    return a list with the unique meaningful words (words that are not stopwords)
    sorted by how often they appear. The list contains (word, count) tuples 
    and is sorted in descending order by count.
    """
    # Create an empty dictionary where we will have word: count
    count_dic = {}
    
    # For each token, if it is not in stopwords, either add it as a key with 
    # count 1 if it is new, or increase its count by 1 if it already exists
    for word in tokens:
        if word not in STOP_WORDS:
            if word not in count_dic.keys():
                count_dic[word] = 1
            else:
                count_dic[word] += 1       
    
    # Get the dictionary items as tuples and sort them by the counts in descending order
    sorted_word_counts = sorted(count_dic.items(), key=lambda x: x[1], reverse=True)

    # Return
    return sorted_word_counts

Now, identify the 10 most commonly used meaningful words for Trump and Biden to reveal the theme and tone of their speech. 

In [113]:
#Trump's most commonly used meaningful words
trump_counts = get_word_counts(trump_tokens)
print(trump_counts[:10])

[('our', 48), ('will', 40), ('we', 26), ('all', 12), ('–', 11), ('american', 11), ('their', 10), ('your', 10), ('america', 9), ('people', 6)]


In [114]:
##Biden's most commonly used meaningful words
biden_counts = get_word_counts(biden_tokens)
print(biden_counts[:10])

[('we', 59), ('our', 42), ('will', 29), ('my', 16), ('can', 16), ('us', 16), ('one', 14), ('all', 14), ('i', 13), ('america', 10)]


In the end, can you get the words that are unique to either Trump or Biden? These are words that Trump mentions at least twice but Biden doesn't, and vice versa. We impose the rule of the word being repeated to get more robust results. Do you notice any trends?

In [117]:
# only keep pairs where the value > 2 
def count_filter(dict):
    key, value = dict
    if value > 1:
        return True  # keep pair in the filtered dictionary
    else:
        return False  # filter pair out of the dictionary

In [119]:
# words Trump used 2+ times 
trump_top_words = dict(filter(count_filter, trump_counts)) 

# words Trump used 2+ times and Biden used 0 times 
trump_top_words_unique = [word for word in trump_top_words.keys() if word not in biden_unique_words]
trump_top_words_unique

['back',
 '',
 'bring',
 'again.\n\nwe',
 'too',
 'factories',
 'protected',
 'everyone',
 'millions',
 'foreign',
 'countries',
 'heart',
 'citizens',
 'now',
 'obama',
 'transferring',
 'party',
 'small',
 'government',
 'share',
 'capital,',
 'belongs',
 'forgotten',
 'men',
 'movement',
 'stops',
 'glorious',
 'allegiance',
 'borders',
 'made',
 'workers',
 'jobs.',
 'breath',
 'winning',
 'life',
 'old',
 'loyalty',
 'talk',
 'dreams,']

In [121]:
# words Biden used 2+ times 
biden_top_words = dict(filter(count_filter, biden_counts)) 

# words Biden used 2+ times and Trump used 0 times 
biden_top_words_unique = [word for word in biden_top_words.keys() if word not in trump_unique_words]
biden_top_words_unique

['me',
 'story',
 'know',
 'days',
 'history',
 'democracy',
 'come',
 'war,',
 'soul',
 'need',
 'vice',
 'them',
 'sacred',
 'centuries',
 'were',
 'lost',
 'cry',
 'whole',
 'join',
 'common',
 'unity',
 'once',
 'meet',
 'better',
 'believe',
 'stand,',
 'gave',
 'say',
 'truth',
 'do,',
 'may',
 'leader',
 'ages',
 'tested',
 'cause',
 'again',
 'friends,',
 'ago',
 'violence',
 'nation,',
 'ahead',
 'set',
 'constitution',
 'strength',
 'spoke',
 'last',
 'today,',
 'his',
 'taken',
 '—',
 'still',
 'difficult',
 'year',
 'war',
 'thousands',
 'racial',
 'comes',
 'cant',
 'rise',
 'overcome',
 'future',
 'requires',
 'things',
 'january',
 'act',
 'nation.\n\ni',
 'ask',
 'virus.\n\nwe',
 'sound',
 'civil',
 'enough',
 'forward.\n\nand,',
 'show',
 'way,',
 'stop',
 'unity,',
 'peace,',
 'progress,',
 'raging',
 'disagreement',
 'measure',
 'support',
 'this:',
 'saint',
 'objects',
 'leaders',
 'honor',
 'americans',
 'turn',
 'dont',
 'versus',
 'called',
 'how',
 'would',
 's