## Inaugural Similarity Analysis
1. Implement TF-IDF from scratch and use it to find the historic inaugural address closest to President Trump's 2017 address.
2. Use Latent Semantic Indexing to identify the inaugural address most similar to a user-defined query.

In [1]:
# import library
import re
import math
import pandas as pd
import numpy as np

import nltk
from nltk.corpus import stopwords
from nltk.corpus import inaugural
from nltk import FreqDist

In [2]:
# initialize stemmer and lemmatizer
porter    = nltk.PorterStemmer()
lancaster = nltk.LancasterStemmer()
wnl       = nltk.WordNetLemmatizer()

In [3]:
# display the first 4 items of the inaugural datasets
inaugural.fileids()[:4]

['1789-Washington.txt',
 '1793-Washington.txt',
 '1797-Adams.txt',
 '1801-Jefferson.txt']

In [4]:
# display the tokenized words from the first inaugural 
inaugural.words(inaugural.fileids()[0])

['Fellow', '-', 'Citizens', 'of', 'the', 'Senate', ...]

In [5]:
# initial stop words
stop_words = set(stopwords.words('english'))

### 1. Stopwords

**Update stopwords**

Add punctuation marks, stray Unicode characters, and other extraneous tokens to the stop-word set so that they are filtered out along with the standard English stop words. The updated list is not exhaustive.

In [6]:
# update stopwords
stop_words.update(['-', ',', '.', '!', '?', ' ', ':', ';', '/', '[', ']', 
                   '*', '"', '(', ')', '#', '$', '%', '&', '~', '@', '<', '>',
                  '{', '}', '^', '--', "'", '."', '14th', '";', '"?', '),', '...', 
                   '¡¦', '¡§', '¡¨¡', ',"', '.)', '....', '\x80\x94', '¡', '.¡¨'])
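Because an enumerated list cannot be exhaustive, a pattern-based filter can complement it. The sketch below is an assumption rather than part of the original notebook: it keeps a token only if it contains at least one ASCII letter or digit, and the sample tokens are hypothetical. Note that it would keep a token such as '14th', which the list above removes explicitly.

```python
import re

# Hypothetical sample tokens, chosen to resemble punctuation-heavy corpus output.
tokens = ["Fellow", "-", "Citizens", "--", "...", "14th", "of", "....", ";"]

# Keep a token only if it contains at least one ASCII letter or digit.
has_alnum = re.compile(r"[A-Za-z0-9]")
kept = [t for t in tokens if has_alnum.search(t)]

print(kept)
```

A filter like this catches punctuation-only tokens that were never listed, while the explicit stop-word set still handles ordinary words such as 'of'.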

---
### 2. Read each inaugural address into a pandas DataFrame

**2.1 Create a vocabulary as a set of all unique stemmed terms in the corpus**

In [12]:
# define an empty set to store vocabs
vocab = set()

In [13]:
# build the vocab: lowercase each token, drop stop words, then stem and lemmatize
temp_list = [wnl.lemmatize(porter.stem(w.lower())) for w in inaugural.words() if w.lower() not in stop_words]
vocab = set(temp_list)

In [14]:
# check out vocab length
print('there are ', len(vocab), ' unique words in our corpus.')

there are  5399  unique words in our corpus.


**2.2 Use the vocabulary to read each inaugural address into a dataframe**

Each row of the dataframe represents a document (one of the addresses), and each column is a term from the vocab.

In [15]:
# initiate a dictionary for tracking document size (all tokens, including stop words)
doc_length = {}
# collect one word-count series per document; rows are gathered in a list
# because DataFrame.append is unavailable in pandas >= 2.0
rows = []
# loop through all documents in corpus
for fileid in inaugural.fileids():
    # populate doc_length
    doc_length[fileid] = len(inaugural.words(fileid))
    # count the stemmed, lemmatized, non-stop-word tokens in this document
    temp_list = [wnl.lemmatize(porter.stem(w.lower())) for w in inaugural.words(fileid) 
                 if w.lower() not in stop_words]
    temp_dict = dict(FreqDist(temp_list))
    # convert dictionary to series and keep it for the dataframe
    rows.append(pd.Series(temp_dict, name = fileid))
# build the dataframe, sort the term columns, and fill missing counts with 0
df = pd.DataFrame(rows).sort_index(axis = 1).fillna(0)
# check df size
print(df.shape)

(58, 5399)
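The same document-by-term layout can be seen on toy data. The sketch below assumes two already-tokenized documents with hypothetical names; it builds one count series per document and creates the frame in a single step.

```python
from collections import Counter

import pandas as pd

# Hypothetical, already-stemmed toy documents.
docs = {
    "doc_a.txt": ["nation", "peopl", "nation"],
    "doc_b.txt": ["peopl", "zeal"],
}

# One count series per document, named after the document.
rows = [pd.Series(Counter(tokens), name=name) for name, tokens in docs.items()]

# Rows are documents, columns are terms; absent terms get a count of 0.
dtm = pd.DataFrame(rows).fillna(0).sort_index(axis=1)
print(dtm)
```

Terms missing from a document appear as NaN after the frame is built, which is why the counts are filled with 0 before any further arithmetic.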


---
### 3. Compute TF-IDF for the document-term matrix ###

**3.1. Write a function to compute term frequency (TF) for each document**

In [16]:
# compute term frequency
# inputs: wordvec is a series that contains, for a given doc, 
#                 the word counts for each term in the vocab
#         doclen  is the length of the document
# returns: a series with new term-frequencies (raw counts divided by doc length)
def computetf(wordvec,doclen):
    return wordvec/doclen
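As a quick check of what `computetf` returns, the sketch below applies it to a hypothetical three-term count vector. The document length of 8 is deliberately larger than the sum of the counts, because in this notebook `doc_length` counts every token in the address, including stop words and punctuation.

```python
import pandas as pd

def computetf(wordvec, doclen):
    # raw counts divided by document length
    return wordvec / doclen

# Hypothetical counts for one document.
counts = pd.Series({"nation": 3, "peopl": 1, "zeal": 0})
tf = computetf(counts, doclen=8)
print(tf)
```

Dividing by the full token count means the term frequencies of a document do not sum to 1; they express each term's share of the whole address.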

**3.2 Write a function to compute inverse document frequency (IDF)**

In [19]:
import math

# input:   document-by-term (row-by-column) dataframe
# returns: dictionary of key-value pairs. Keys are terms in the vocab, values are IDF.
def computeidf(df):
    idf_dict = {}
    for term in df.columns.values:
        # ratio of the total number of documents to the number of documents
        # containing the current term; IDF is the natural log of this ratio
        ratio = df.shape[0] / (df[term] > 0).sum()
        idf_dict[term] = math.log(ratio)
    return idf_dict
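A small worked example may help show how this IDF behaves. The sketch assumes a hypothetical three-document count matrix and repeats the function so the block runs on its own.

```python
import math

import pandas as pd

def computeidf(df):
    idf_dict = {}
    for term in df.columns.values:
        # log of (number of documents / number of documents containing the term)
        ratio = df.shape[0] / (df[term] > 0).sum()
        idf_dict[term] = math.log(ratio)
    return idf_dict

# Hypothetical counts: rows are documents, columns are terms.
toy = pd.DataFrame(
    {"nation": [3, 1, 2], "peopl": [1, 0, 2], "zeal": [0, 1, 0]},
    index=["doc_a", "doc_b", "doc_c"],
)
idf = computeidf(toy)
print(idf)
```

Here 'nation' occurs in all three documents, so its IDF is log(3/3) = 0 and it contributes nothing to TF-IDF, while 'zeal' occurs in only one document and receives the largest weight, log(3). The function assumes every column has at least one non-zero count, which holds when the columns are built from the corpus itself.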

**3.3 Create a new dataframe and populate it with the TF-IDF values for each document-term combination**

In [20]:
# compute idf and store it as a series indexed by term
idfdict = computeidf(df)
idf = pd.Series(idfdict)
# compute tf for each document (row) using that document's length
tf = df.apply(lambda row: computetf(row, doc_length[row.name]), axis = 1)
# multiply each term column by its idf to get the TF-IDF dataframe
newdf = tf.mul(idf, axis = 1)

In [21]:
# check the shape of tf-idf dataframe
print('shape: ', newdf.shape)

shape:  (58, 5399)


In [22]:
# look at the first 5 rows of the tf-idf dataframe
newdf.head()

Unnamed: 0,000,1,100,120,125,13,15th,16,1774,1776,...,york,yorktown,young,younger,youngest,youth,zeal,zealou,zealous,zone
1789-Washington.txt,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1793-Washington.txt,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1797-Adams.txt,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.000766,0.0,0.0,0.0
1801-Jefferson.txt,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.001024,0.0,0.0,0.0
1805-Jefferson.txt,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.002493,0.0,0.0,0.0


---
### 4. Using TF-IDF values, find and rank order the 3 closest inaugural addresses to Donald Trump's 2017 address, measured by cosine similarity

In [23]:
# President Trump's address is at row position 57 (0-indexed)
newdf.iloc[57,:]

000        0.0
1          0.0
100        0.0
120        0.0
125        0.0
          ... 
youth      0.0
zeal       0.0
zealou     0.0
zealous    0.0
zone       0.0
Name: 2017-Trump.txt, Length: 5399, dtype: float64

**4.1 Create a list called dist that contains the cosine similarity between the 2017 inaugural address and each of the inaugural addresses**

In [24]:
import numpy as np
# assign President Trump's tf_idf values to d1
d1 = newdf.iloc[57,:]
# find its magnitude 
norm_d1 = np.sqrt(np.sum(np.square(d1)))
# create a list for storing cosine similarities between each inaugural and President Trump
dist = []
for index,row in newdf.iterrows():
    temp_norm = np.sqrt(np.sum(np.square(row)))
    cos_sim = np.dot(d1, row) / (norm_d1 * temp_norm)
    dist.append(cos_sim)
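The same calculation can also be done without an explicit loop. The sketch below is an alternative under stated assumptions, not the notebook's own method: `cosine_to_row` is a hypothetical helper, the matrix is toy data, and every row is assumed to have a non-zero norm (an all-zero row would produce a division by zero).

```python
import numpy as np

def cosine_to_row(matrix, i):
    """Cosine similarity between row i and every row of matrix."""
    matrix = np.asarray(matrix, dtype=float)
    norms = np.linalg.norm(matrix, axis=1)   # one magnitude per row
    dots = matrix @ matrix[i]                # dot product of each row with row i
    return dots / (norms * norms[i])

# Hypothetical weights: row 1 is row 0 scaled by 2, row 2 shares no terms with row 0.
toy = np.array([[1.0, 0.0, 1.0],
                [2.0, 0.0, 2.0],
                [0.0, 3.0, 0.0]])
sims = cosine_to_row(toy, 0)
print(sims)
```

Up to floating-point rounding, the result is 1 for row 0 itself, 1 for the scaled copy, and 0 for the row with no shared terms. With the TF-IDF dataframe, `newdf.to_numpy()` could be passed as the matrix.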

**4.2 Find the 3 closest inaugural addresses, as measured by cosine similarity**

In [25]:
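# find the 3 inaugurals most similar to Trump's; the last argsort position is
# Trump's own address (similarity 1), so it is skipped.
# the result is in ascending order of similarity (most similar last)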
indices = np.array(dist).argsort()[-4:-1]
df.index[indices]

Index(['1969-Nixon.txt', '1997-Clinton.txt', '1993-Clinton.txt'], dtype='object')
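To present the result most-similar-first and to exclude the query by position rather than by assuming it sorts last, a ranking step along the following lines could be used. The labels, similarity values, and `query_pos` are hypothetical.

```python
import numpy as np

# Hypothetical document labels and their similarities to the query document.
labels = np.array(["a.txt", "b.txt", "c.txt", "d.txt", "query.txt"])
sims = np.array([0.10, 0.40, 0.25, 0.05, 1.00])
query_pos = 4  # position of the query document itself

order = np.argsort(sims)[::-1]       # indices sorted from most to least similar
order = order[order != query_pos]    # drop the query document
top3 = labels[order[:3]]
print(top3)
```

Reading the result from the front gives the ranking directly, whereas a slice such as `[-4:-1]` on an ascending sort lists the closest match last.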

Cosine similarity is often used in text classification to measure how similar two documents are in content. It is normally computed on term weights that account for how common each word is across the entire corpus, which TF-IDF represents meaningfully.
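For reference, the quantities computed in this notebook can be written out as follows. The symbols are notation introduced here for illustration: $c(t,d)$ is the count of term $t$ in document $d$, $|d|$ is the document length stored in `doc_length` (all tokens), $N$ is the number of documents, and $n_t$ is the number of documents containing $t$.

$$
\mathrm{tf}(t,d) = \frac{c(t,d)}{|d|}, \qquad
\mathrm{idf}(t) = \ln\frac{N}{n_t}, \qquad
w_{t,d} = \mathrm{tf}(t,d)\,\mathrm{idf}(t)
$$

$$
\cos(d_1, d_2) = \frac{\sum_t w_{t,d_1}\, w_{t,d_2}}{\sqrt{\sum_t w_{t,d_1}^2}\;\sqrt{\sum_t w_{t,d_2}^2}}
$$

Because the cosine divides by the magnitudes of both vectors, multiplying every weight in one document by a positive constant leaves the similarity unchanged.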

In this case, inaugural A is considered 'close' to inaugural B if the two use similar words at similar relative frequencies. As illustrated in the code below, the first dataframe shows the words that appear at least 5 times in Trump's inaugural and at least once in each of the 3 closest inaugural addresses. Words such as 'america', 'great', 'people' (stemmed to 'peopl'), and 'nation' were also used frequently in the other 3 addresses.

In contrast, the second dataframe shows the 3 furthest inaugural addresses, where no such pattern is observed.

In [30]:
# find words that appear in all 4 inaugurals and at least 5 times in Trump's inaugural
similar_df = df.loc[['1969-Nixon.txt', '1997-Clinton.txt', '1993-Clinton.txt', '2017-Trump.txt'], :]
keywords = [w for w in similar_df.columns if (similar_df[w]>0).sum() >= 4 and similar_df.loc['2017-Trump.txt', w] >= 5]
similar_df[keywords]

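| | american | great | nation | new | one | peopl | presid | world | america | make | god | across | today |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1969-Nixon.txt | 4.0 | 7.0 | 9.0 | 8.0 | 7.0 | 15.0 | 3.0 | 15.0 | 6.0 | 10.0 | 6.0 | 1.0 | 5.0 |
| 1997-Clinton.txt | 16.0 | 6.0 | 15.0 | 29.0 | 10.0 | 11.0 | 1.0 | 15.0 | 15.0 | 5.0 | 1.0 | 2.0 | 6.0 |
| 1993-Clinton.txt | 14.0 | 2.0 | 5.0 | 9.0 | 1.0 | 12.0 | 3.0 | 20.0 | 19.0 | 3.0 | 2.0 | 4.0 | 10.0 |
| 2017-Trump.txt | 15.0 | 6.0 | 13.0 | 6.0 | 9.0 | 10.0 | 5.0 | 6.0 | 20.0 | 6.0 | 5.0 | 5.0 | 5.0 |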


In [19]:
# find 3 inaugurals that are most different from Trump. 
df.index[np.array(dist).argsort()[:3]]

Index(['1789-Washington.txt', '1809-Madison.txt', '1829-Jackson.txt'], dtype='object')

In [20]:
# find words that appear in all 4 inaugurals and at least 5 times in Trump's inaugural
diff_df = df.loc[['1789-Washington.txt', '1809-Madison.txt', '1829-Jackson.txt', '2017-Trump.txt'], :]
keywords = [w for w in diff_df.columns if (diff_df[w]>0).sum() >= 4 and diff_df.loc['2017-Trump.txt', w] >= 5]
diff_df[keywords]

| | countri | everi | nation | never | peopl | right |
|---|---|---|---|---|---|---|
| 1789-Washington.txt | 5.0 | 9.0 | 4.0 | 2.0 | 4.0 | 2.0 |
| 1809-Madison.txt | 5.0 | 2.0 | 10.0 | 1.0 | 1.0 | 5.0 |
| 1829-Jackson.txt | 2.0 | 1.0 | 6.0 | 1.0 | 4.0 | 4.0 |
| 2017-Trump.txt | 12.0 | 7.0 | 13.0 | 6.0 | 10.0 | 5.0 |
