## Inaugural Similarity Analysis
1. Implement TF-IDF from scratch and use it to find the closest historic inaugural address to the 2017 address by President Trump
2. Use Latent Semantic Indexing to identify the inaugural address most similar to a user-defined query.

In [1]:
# import library
import re
import math
import pandas as pd
import numpy as np

import nltk
from nltk.corpus import stopwords
from nltk.corpus import inaugural
from nltk import FreqDist

In [2]:
# initialize stemmer and lemmatizer
porter    = nltk.PorterStemmer()
lancaster = nltk.LancasterStemmer()
wnl       = nltk.WordNetLemmatizer()

In [3]:
# display the first 4 items of the inaugural datasets
inaugural.fileids()[:4]

['1789-Washington.txt',
 '1793-Washington.txt',
 '1797-Adams.txt',
 '1801-Jefferson.txt']

In [4]:
# display the tokenized words from the first inaugural 
inaugural.words(inaugural.fileids()[0])

['Fellow', '-', 'Citizens', 'of', 'the', 'Senate', ...]

In [5]:
# initial stop words
stop_words = set(stopwords.words('english'))

### 1. Stopwords

**Update stopwords**

Remove all punctuation and strange unicode characters, and anything else that might be extraneous. Note that the updated list of stopwords is not exhaustive.

In [6]:
# update stopwords
stop_words.update(['-', ',', '.', '!', '?', ' ', ':', ';', '/', '[', ']', 
                   '*', '"', '(', ')', '#', '$', '%', '&', '~', '@', '<', '>',
                  '{', '}', '^', '--', "'", '."', '14th', '";', '"?', '),', '...', 
                   '¡¦', '¡§', '¡¨¡', ',"', '.)', '....', '\x80\x94', '¡', '.¡¨'])
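
Because the hand-written punctuation list above is not exhaustive, one alternative is to drop any token that contains no letter or digit. The sketch below is an assumption and not part of the original pipeline; `is_noise`, `sample_stop_words`, and `sample_tokens` are hypothetical names and made-up data.

```python
import re

# hypothetical helper: a token counts as noise if it has no ASCII letter or digit
_HAS_ALNUM = re.compile(r'[A-Za-z0-9]')

def is_noise(token, stop_words):
    """Return True for stopwords and for tokens made only of punctuation or symbols."""
    return token.lower() in stop_words or _HAS_ALNUM.search(token) is None

# small made-up sample, standing in for inaugural.words()
sample_stop_words = {'of', 'the'}
sample_tokens = ['Fellow', '-', 'Citizens', 'of', 'the', 'Senate', '."', '--', '14th']
kept = [t for t in sample_tokens if not is_noise(t, sample_stop_words)]
print(kept)
```

Tokens such as `14th` survive this filter, so anything that should still be excluded would have to stay in the explicit stopword set.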

---
### 2. Read each inaugural address into a pandas DataFrame

**2.1 Create a vocabulary as a set of all unique stemmed terms in the corpus**

In [7]:
# define an empty set to store vocabs
vocab = set()

In [8]:
# build the vocab: stem, then lemmatize, every non-stopword token in the corpus
temp_list = [wnl.lemmatize(porter.stem(w.lower())) for w in inaugural.words() if w.lower() not in stop_words]
# a set keeps only the unique terms (sorting first is unnecessary)
vocab = set(temp_list)

In [9]:
# check out vocab length
print('there are ', len(vocab), ' unique words in our corpus.')

there are  5399  unique words in our corpus.


**2.2 Use the vocabulary to read each inaugural address into a dataframe**

Each row of the dataframe represents a document (one of the addresses), each column of the dataframe is a term from the vocab.

In [10]:
# initiate a dictionary for tracking document size
doc_length = {}
# collect one word-count series per document
rows = []
# loop through all documents in corpus
for fileid in inaugural.fileids():
    # populate doc_length
    doc_length[fileid] = len(inaugural.words(fileid))
    # count the processed (stemmed, then lemmatized) non-stopword tokens in this document
    temp_list = [wnl.lemmatize(porter.stem(w.lower())) for w in inaugural.words(fileid) 
                 if w.lower() not in stop_words]
    temp_dict = dict(FreqDist(temp_list))
    # convert dictionary to series and store it
    rows.append(pd.Series(temp_dict, name = fileid))
# build the dataframe in one step (DataFrame.append was removed in pandas 2.0),
# fill null values and sort the term columns
df = pd.DataFrame(rows).fillna(0).sort_index(axis = 1)
# check df size
print(df.shape)

(58, 5399)


---
### 3. Compute TF-IDF for the document-term matrix ###

**3.1. Write a function to compute term frequency (TF) for each document**

In [11]:
# compute term frequency
# inputs: wordvec is a series that contains, for a given doc, 
#                 the word counts for each term in the vocab
#         doclen  is the length of the document
# returns: a series with new term-frequencies (raw counts divided by doc length)
def computetf(wordvec,doclen):
    return wordvec/doclen

**3.2 Write a function to compute inverse document frequency (IDF)**

In [12]:
import math

# input:   document-by-term (row-by-column) dataframe
# returns: dictionary of key-value pairs. Keys are terms in the vocab, values are IDF.
def computeidf(df):
    idf_dict = {}
    for vocab in df.columns.values:
        # calculate the ratio of total number of documents over number of documents
        # containing current vocab, and take the log of this ratio.
        ratio = df.shape[0] / (df[vocab] > 0).sum()
        idf_dict[vocab] = math.log(ratio)
        pass
    return idf_dict
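
To make the two definitions concrete, here is a minimal worked example on a made-up three-document corpus, written in plain Python. The documents and the names `docs`, `idf`, and `tfidf` are illustrative assumptions; the arithmetic mirrors `computetf` (raw count divided by document length) and `computeidf` (log of total documents over documents containing the term).

```python
import math

# made-up toy corpus: three tiny "documents" of already-processed tokens
docs = {
    'd1': ['nation', 'peopl', 'nation'],
    'd2': ['peopl', 'war'],
    'd3': ['peopl', 'peac', 'peac', 'nation'],
}

n_docs = len(docs)
vocab = sorted({t for tokens in docs.values() for t in tokens})

# IDF: log(number of documents / number of documents containing the term)
idf = {t: math.log(n_docs / sum(t in tokens for tokens in docs.values())) for t in vocab}

# TF-IDF: (raw count / document length) * IDF
tfidf = {
    name: {t: tokens.count(t) / len(tokens) * idf[t] for t in vocab}
    for name, tokens in docs.items()
}

print(idf['peopl'])        # 'peopl' is in every document, so log(3/3) = 0.0
print(tfidf['d2']['war'])  # (1/2) * log(3/1)
```

A term that occurs in every document gets an IDF of zero, so it contributes nothing to the similarity computed later.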

**3.3 Create a new dataframe and populate it with the TF-IDF values for each document-term combination**

In [13]:
# compute idf
idfdict = computeidf(df)
idf = pd.Series(idfdict)
# compute tf-idf one document at a time
rows = []
for index, row in df.iterrows():
    # term frequency times inverse document frequency, aligned on term
    newrow = computetf(row, doc_length[index]) * idf
    rows.append(newrow.rename(index))
# build the TF-IDF dataframe in one step (DataFrame.append was removed in pandas 2.0)
newdf = pd.DataFrame(rows)

In [14]:
# check the shape of tf-idf dataframe
print('shape: ', newdf.shape)

shape:  (58, 5399)
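
The same TF-IDF table can also be computed without an explicit loop. The following is a sketch under assumptions: `counts` and `lengths` are made-up stand-ins for `df` and `doc_length`, and the weighting is the one defined above (raw count over document length, times the log of the document ratio).

```python
import numpy as np
import pandas as pd

# made-up counts standing in for df (rows: documents, columns: terms)
counts = pd.DataFrame(
    {'nation': [2, 0, 1], 'peopl': [1, 1, 1], 'war': [0, 1, 0]},
    index=['d1', 'd2', 'd3'],
)
# made-up document lengths standing in for doc_length
lengths = pd.Series({'d1': 3, 'd2': 2, 'd3': 4})

# TF: divide each row by its document length
tf = counts.div(lengths, axis=0)
# IDF: log(N / document frequency), one value per column
idf = np.log(len(counts) / (counts > 0).sum(axis=0))
# TF-IDF: scale each column by its IDF
tfidf = tf.mul(idf, axis=1)
print(tfidf)
```

On a matrix of this notebook's size the loop is fast enough, so the vectorized form is mainly a readability choice.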


In [15]:
# look at the first 5 rows of the tf-idf dataframe
newdf.head()

Unnamed: 0,000,1,100,120,125,13,15th,16,1774,1776,...,york,yorktown,young,younger,youngest,youth,zeal,zealou,zealous,zone
1789-Washington.txt,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1793-Washington.txt,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1797-Adams.txt,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.000766,0.0,0.0,0.0
1801-Jefferson.txt,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.001024,0.0,0.0,0.0
1805-Jefferson.txt,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.002493,0.0,0.0,0.0


---
### 4. Using TF-IDF values, find and rank order the 3 closest inaugural addresses to Donald Trump's 2017 address, measured by cosine similarity

In [16]:
# President Trump's address is 57 (0-indexed)
newdf.iloc[57,:]

000        0.0
1          0.0
100        0.0
120        0.0
125        0.0
          ... 
youth      0.0
zeal       0.0
zealou     0.0
zealous    0.0
zone       0.0
Name: 2017-Trump.txt, Length: 5399, dtype: float64

**4.1 Create a list called dist that contains the cosine similarity between the 2017 inaugural address and each of the inaugural addresses**

In [17]:
import numpy as np
# assign President Trump's tf_idf values to d1
d1 = newdf.iloc[57,:]
# find its magnitude 
norm_d1 = np.sqrt(np.sum(np.square(d1)))
# create a list for storing cosine similarities between each inaugural and President Trump
dist = []
for index,row in newdf.iterrows():
    temp_norm = np.sqrt(np.sum(np.square(row)))
    cos_sim = np.dot(d1, row) / (norm_d1 * temp_norm)
    dist.append(cos_sim)
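
The loop above can also be written as a single matrix operation. This is a minimal sketch, assuming no row is all zeros (a zero row would give a division by zero); `cosine_to_all` and `toy` are hypothetical names, with `toy` standing in for `newdf.values`.

```python
import numpy as np

def cosine_to_all(matrix, row_index):
    """Cosine similarity between one row of `matrix` and every row."""
    matrix = np.asarray(matrix, dtype=float)
    # Euclidean norm of every row
    norms = np.linalg.norm(matrix, axis=1)
    # dot product of every row with the reference row
    dots = matrix @ matrix[row_index]
    return dots / (norms * norms[row_index])

# made-up 3 x 3 matrix standing in for newdf.values
toy = np.array([[1.0, 0.0, 1.0],
                [0.0, 2.0, 0.0],
                [2.0, 0.0, 2.0]])
sims = cosine_to_all(toy, 2)
print(sims)
```

Rows 0 and 2 point in the same direction, so their similarity is 1 up to floating-point rounding, while row 1 shares no terms with row 2 and scores 0.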

**4.2 Find the 3 closest inaugural addresses, as measured by cosine similarity**

In [18]:
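# find the 3 inaugurals that are most similar to Trump.
# argsort is ascending, and the last index should be Trump's own address (similarity 1),
# so [-4:-1] gives the 3 closest, ordered from least to most similar.
indices = np.array(dist).argsort()[-4:-1]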
df.index[indices]

Index(['1969-Nixon.txt', '1997-Clinton.txt', '1993-Clinton.txt'], dtype='object')

Cosine similarity is often used in text classification to measure how similar the textual content of two documents is. It is normally calculated from word frequencies weighted relative to the entire corpus, which can be meaningfully represented using TF-IDF.
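
For reference, writing $\mathbf{a}$ and $\mathbf{b}$ for the TF-IDF vectors of two addresses and $t$ for a term in the vocabulary (notation introduced here for illustration), the quantity computed in step 4.1 is:

$$
\cos(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert} = \frac{\sum_t a_t b_t}{\sqrt{\sum_t a_t^2} \, \sqrt{\sum_t b_t^2}}
$$

Because the TF-IDF values used here are never negative, this similarity lies between 0 (no shared weighted terms) and 1 (identical direction).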

In this case, inaugural A is considered 'close' to inaugural B if the words used, and the relative frequency of each word, are similar in the two addresses. As illustrated in the code snippets below, the first dataframe depicts the words that appeared at least 5 times in Trump's inaugural and also appeared at least once in each of the 3 closest inaugural addresses. We can see that words such as 'america', 'great', 'people', and 'nation' were also frequently used by the other 3 presidents.

In contrast, the second dataframe covers the 3 furthest inaugural addresses, where no such pattern is observed.

In [19]:
# find words that appear in all 4 inaugurals and appear at least 5 times in Trump's inaugural
similar_df = df.loc[['1969-Nixon.txt', '1997-Clinton.txt', '1993-Clinton.txt', '2017-Trump.txt'], :]
keywords = [w for w in similar_df.columns if (similar_df[w]>0).sum() >= 4 and similar_df.loc['2017-Trump.txt', w] >= 5]
similar_df[keywords]

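| | american | great | nation | new | one | peopl | presid | world | america | make | god | across | today |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1969-Nixon.txt | 4.0 | 7.0 | 9.0 | 8.0 | 7.0 | 15.0 | 3.0 | 15.0 | 6.0 | 10.0 | 6.0 | 1.0 | 5.0 |
| 1997-Clinton.txt | 16.0 | 6.0 | 15.0 | 29.0 | 10.0 | 11.0 | 1.0 | 15.0 | 15.0 | 5.0 | 1.0 | 2.0 | 6.0 |
| 1993-Clinton.txt | 14.0 | 2.0 | 5.0 | 9.0 | 1.0 | 12.0 | 3.0 | 20.0 | 19.0 | 3.0 | 2.0 | 4.0 | 10.0 |
| 2017-Trump.txt | 15.0 | 6.0 | 13.0 | 6.0 | 9.0 | 10.0 | 5.0 | 6.0 | 20.0 | 6.0 | 5.0 | 5.0 | 5.0 |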


In [20]:
# find 3 inaugurals that are most different from Trump. 
df.index[np.array(dist).argsort()[:3]]

Index(['1789-Washington.txt', '1809-Madison.txt', '1829-Jackson.txt'], dtype='object')

In [21]:
# find words that appear in all 4 inaugurals (Trump and the 3 most different) and appear at least 5 times in Trump's inaugural
diff_df = df.loc[['1789-Washington.txt', '1809-Madison.txt', '1829-Jackson.txt', '2017-Trump.txt'], :]
keywords = [w for w in diff_df.columns if (diff_df[w]>0).sum() >= 4 and diff_df.loc['2017-Trump.txt', w] >= 5]
diff_df[keywords]

| | countri | everi | nation | never | peopl | right |
|---|---|---|---|---|---|---|
| 1789-Washington.txt | 5.0 | 9.0 | 4.0 | 2.0 | 4.0 | 2.0 |
| 1809-Madison.txt | 5.0 | 2.0 | 10.0 | 1.0 | 1.0 | 5.0 |
| 1829-Jackson.txt | 2.0 | 1.0 | 6.0 | 1.0 | 4.0 | 4.0 |
| 2017-Trump.txt | 12.0 | 7.0 | 13.0 | 6.0 | 10.0 | 5.0 |


---
### 5. Transform the original inaugural addresses using TfidfVectorizer from scikit-learn
**5.1 Create a new dataframe for storing inaugural addresses**

In [22]:
# build the dataframe in one step (DataFrame.append was removed in pandas 2.0)
new_df = pd.DataFrame([{'name': fileid, 'doc': inaugural.raw(fileid)}
                       for fileid in inaugural.fileids()])

**5.2 Define a tokenizer to stem and lemmatize tokenized words within TfidfVectorizer**

In [23]:
from nltk import word_tokenize
# from nltk.stem import WordNetLemmatizer
# from nltk.stem import PorterStemmer

class word_processor:
    def __init__(self):
        self.porter = porter
        self.wnl = wnl
    def __call__(self, doc):
        return [self.wnl.lemmatize(self.porter.stem(t)) for t in word_tokenize(doc)]

In [24]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vec = TfidfVectorizer(use_idf = True, 
                            tokenizer = word_processor(), 
                            lowercase = True, 
                            # pass a list: newer scikit-learn versions reject a set here
                            stop_words = list(stop_words))

X = tfidf_vec.fit_transform(new_df['doc'])
print('shape of transformed docs: ',X.shape)


shape of transformed docs:  (58, 5574)


---
### 6. Compute the truncated SVD
**6.1 Compute singular vectors and singular values using SVD**

In [25]:
# import library
from sklearn.decomposition import TruncatedSVD
# instantiate a TruncatedSVD object with 5 components
svd = TruncatedSVD(n_components = 5, n_iter = 10, random_state = 44)
svd.fit(X)

TruncatedSVD(n_components=5, n_iter=10, random_state=44)

In [26]:
# calculate singular vectors and singular values
# (fit_transform returns U * sigma, so divide by the singular values to get U)
U = svd.fit_transform(X) / svd.singular_values_
S = svd.singular_values_
VT = svd.components_

print('shape of U: ', U.shape)
print('shape of sigma (singular values): ', S.shape)
print('shape of V transposed: ', VT.shape)

shape of U:  (58, 5)
shape of sigma (singular values):  (5,)
shape of V transposed:  (5, 5574)
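
The division by the singular values in the cell above relies on the projected data being equal to $U \Sigma$. A minimal check of that identity on a made-up matrix, using NumPy's full SVD instead of `TruncatedSVD` (an assumption made to keep the sketch small; `X_toy` and `k` are illustrative):

```python
import numpy as np

# made-up 4 x 3 matrix standing in for the TF-IDF matrix X
X_toy = np.array([[1.0, 0.0, 2.0],
                  [0.0, 3.0, 0.0],
                  [4.0, 0.0, 1.0],
                  [0.0, 1.0, 1.0]])
k = 2

# full SVD, then keep the k largest singular triplets
U_full, S_full, VT_full = np.linalg.svd(X_toy, full_matrices=False)
U_k, S_k, VT_k = U_full[:, :k], S_full[:k], VT_full[:k]

# projecting X onto the top-k right singular vectors gives U_k * S_k ...
projected = X_toy @ VT_k.T
# ... so dividing each column by its singular value recovers U_k
U_recovered = projected / S_k

print(np.allclose(U_recovered, U_k))
```

With `TruncatedSVD` the solver is randomized, so the result is an approximation of these factors and individual vectors may differ in sign.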


**6.2 Visualize sigma by looking at $\sigma_i, i=1,\ldots,k$. The singular values $\sigma_i$ diminish as $i$ increases.**

In [27]:
import matplotlib.pyplot as plt

plt.figure(figsize = (7,7))
plt.plot([x for x in range(1,len(S)+1)], S, 'bo')
plt.xticks(ticks = [1, 2, 3, 4, 5], labels = [1, 2, 3, 4, 5])
plt.show()

<Figure size 700x700 with 1 Axes>

**6.3 For each singular vector (feature vector), find the indices of the 5 largest entries and extract the corresponding terms**

In [28]:
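# get_feature_names() was removed in newer scikit-learn versions; use get_feature_names_out()
terms = tfidf_vec.get_feature_names_out()
for k in range(5):
    # argsort is ascending, so the last 5 indices are the largest weights
    print(k, [str(terms[i]) for i in VT[k].argsort()[-5:]])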

0 ['ha', 'peopl', 'govern', 'nation', 'thi']
1 ['new', "'s", 'world', 'u', 'america']
2 ['peopl', 'econom', 'busi', 'law', 'upon']
3 ['counsel', 'success', 'peac', 'war', 'nation']
4 ['ani', 'offens', 'union', 'war', 'shall']


---
### 7. Inaugural Similarity on user-defined query
**7.1 Get and transform user query**

In [29]:
# get user input
query = input('Please enter a sentence: ')

Please enter a sentence:  peace and prosperity and love and care


In [30]:
print('You entered: ', query, '\n')
# transform user query and display its tf-idf values
query_xform = tfidf_vec.transform([query])
print(query_xform)

You entered:  peace and prosperity and love and care 

  (0, 3779)	0.47196368319888016
  (0, 3500)	0.41054832177184714
  (0, 2936)	0.5463965610047704
  (0, 754)	0.5569121612551006


**7.2 Project the transformed query into the document-by-feature space**

In [31]:
# convert sigma from vector into a matrix and invert it
S_values = np.linalg.inv(np.diag(S))
# multiply V by S_values
Z = np.matmul(np.transpose(VT),S_values)
# multiply transformed query and Z
query_feature = np.matmul(query_xform.toarray(), Z)
print('shape of query in document-by-feature space: ', query_feature.shape)
# display features used to represent query
print('feature embedding of the query: ', query_feature)

shape of query in document-by-feature space:  (1, 5)
feature embedding of the query:  [[ 0.02073696  0.00746746  0.00702104  0.0178147  -0.01463502]]
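
The projection in the cell above follows from the truncated decomposition. Writing $U_k$, $\Sigma_k$, and $V_k$ for the rank-$k$ factors (`U`, `np.diag(S)`, and the transpose of `VT` in the code) and $q$ for the TF-IDF row vector of the query, with $\hat{q}$ as notation introduced here:

$$
X \approx U_k \Sigma_k V_k^{\top}
\;\Longrightarrow\;
U_k \approx X V_k \Sigma_k^{-1},
\qquad
\hat{q} = q \, V_k \Sigma_k^{-1}
$$

Treating the query as one more row of $X$ therefore places $\hat{q}$ in the same 5-dimensional space as the rows of $U_k$, which is what makes the cosine comparison in the next cell meaningful.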


In [32]:
# import library for calculating cosine similarity
from sklearn.metrics.pairwise import cosine_similarity

# define counters
maxcos = -1
maxname = new_df.name.values[0][:-4]
# calculate the cosine similarity with all addresses
# and find the closest one
for i in range(U.shape[0]):
    current = np.expand_dims(U[i], axis = 0)
    cos = cosine_similarity(current, query_feature)
    print(new_df.name.values[i], cos)
    if cos > maxcos:
        maxcos = cos
        maxname = new_df.name.values[i]
        pass
    pass

print('\nthe most similar inaugural is: ', maxname)

1789-Washington.txt [[0.3962494]]
1793-Washington.txt [[-0.65416742]]
1797-Adams.txt [[0.64309032]]
1801-Jefferson.txt [[0.44674683]]
1805-Jefferson.txt [[0.5187397]]
1809-Madison.txt [[0.7041189]]
1813-Madison.txt [[0.61570922]]
1817-Monroe.txt [[0.48549318]]
1821-Monroe.txt [[0.42035131]]
1825-Adams.txt [[0.42208722]]
1829-Jackson.txt [[0.41099476]]
1833-Jackson.txt [[-0.03719566]]
1837-VanBuren.txt [[0.56534884]]
1841-Harrison.txt [[0.00941384]]
1845-Polk.txt [[-0.07826999]]
1849-Taylor.txt [[-0.21271954]]
1853-Pierce.txt [[0.37856621]]
1857-Buchanan.txt [[-0.18095897]]
1861-Lincoln.txt [[-0.43871353]]
1865-Lincoln.txt [[-0.21272667]]
1869-Grant.txt [[0.07931181]]
1873-Grant.txt [[0.22063421]]
1877-Hayes.txt [[0.18575765]]
1881-Garfield.txt [[-0.2123006]]
1885-Cleveland.txt [[0.0410068]]
1889-Harrison.txt [[-0.07613883]]
1893-Cleveland.txt [[0.32823755]]
1897-McKinley.txt [[0.12126114]]
1901-McKinley.txt [[0.28575667]]
1905-Roosevelt.txt [[0.87690435]]
1909-Taft.txt [[-0.04408204]]


In [33]:
# display the most similar inaugural
inaugural.raw(maxname)

'On each national day of inauguration since 1789, the people have renewed their sense of dedication to the United States.\n\nIn Washington\'s day the task of the people was to create and weld together a nation.\n\nIn Lincoln\'s day the task of the people was to preserve that Nation from disruption from within.\n\nIn this day the task of the people is to save that Nation and its institutions from disruption from without.\n\nTo us there has come a time, in the midst of swift happenings, to pause for a moment and take stock -- to recall what our place in history has been, and to rediscover what we are and what we may be. If we do not, we risk the real peril of inaction.\n\nLives of nations are determined not by the count of years, but by the lifetime of the human spirit. The life of a man is three-score years and ten: a little more, a little less. The life of a nation is the fullness of the measure of its will to live.\n\nThere are men who doubt this. There are men who believe that democr