## Assignment 2: Apply TF-IDF to the Inaugural Corpus 
In this second assignment, you will write code to:
1. Define or augment a set of stopwords for this problem
2. Construct a document-by-term matrix (documents will be rows, terms will be columns), along with a vocabulary, while controlling for stopwords
3. Write functions to compute TF-IDF and apply them to the document-by-term matrix
4. Find the closest historic inaugural address to the 2017 address by President Trump
5. Learn to use the PCA transformation and plot the inaugural addresses along the first two principal components

This assignment is to be done individually. Your code should be your own (with the exception of question 5, for which you're free to get some help from the web).

**Due Date: 2020-09-20 5 pm ET**

Please submit your completed assignment through GradeScope. You should submit a PDF of your notebook with all output.

In [6]:
import re
import math
import pandas as pd

import nltk
from nltk.corpus import stopwords
from nltk.corpus import inaugural

# Feel free to add your own libs as needed

In [2]:
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

porter    = nltk.PorterStemmer()
lancaster = nltk.LancasterStemmer()
wnl       = nltk.WordNetLemmatizer()

In [3]:
inaugural.fileids()[:4]

['1789-Washington.txt',
 '1793-Washington.txt',
 '1797-Adams.txt',
 '1801-Jefferson.txt']

In [4]:
inaugural.words(inaugural.fileids()[0])

['Fellow', '-', 'Citizens', 'of', 'the', 'Senate', ...]

In [5]:
stop_words = set(stopwords.words('english'))

### 1. Stopwords

**Update stopwords once you've had a chance to explore the text**

Add punctuation, stray Unicode characters, and anything else you consider extraneous to the stopword set, so that these tokens are removed from the text.

In [None]:
# here is where you should update the stop words
stop_words.update(...)
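
As a rough illustration of the kind of update meant here, the sketch below extends a stopword set with punctuation and then filters a token list. The names `stop_words_demo`, `tokens` and `kept`, and the tiny starting set, are hypothetical stand-ins; in the notebook you would update the NLTK-derived `stop_words` with whatever you find while exploring the corpus.

```python
import string

# Hypothetical starting set; the notebook's `stop_words` comes from NLTK instead.
stop_words_demo = {"the", "of", "and"}

# set.update accepts any iterable, so passing a string adds each character.
stop_words_demo.update(string.punctuation)

# Multi-character tokens have to be added explicitly (examples only).
stop_words_demo.update({"--", "...", "''", "``"})

tokens = ["Fellow", "-", "Citizens", "of", "the", "Senate", ","]

# Compare in lower case so that "The" and "the" are treated alike.
kept = [t for t in tokens if t.lower() not in stop_words_demo]
print(kept)
```

Which extra tokens belong in the set is a judgment call that depends on what you see in the addresses.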

---
### 2. Read each inaugural address into a pandas DataFrame

**2.1 Create a vocabulary as a set of all unique stemmed terms in the corpus**

In [7]:
vocab = set()

In [8]:
# your code for creating the vocab goes here
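
To make "unique stemmed terms" concrete, here is a minimal sketch on a hand-written token list rather than the corpus. `porter_demo`, `stop_words_demo`, `tokens` and `vocab_demo` are hypothetical names, and filtering stopwords before stemming is an assumption, not a requirement of the assignment.

```python
from nltk.stem import PorterStemmer

porter_demo = PorterStemmer()

# Hypothetical inputs; the assignment uses the corpus tokens and `stop_words`.
stop_words_demo = {"of", "the"}
tokens = ["Citizens", "of", "the", "citizen"]

vocab_demo = set()
for tok in tokens:
    word = tok.lower()
    if word in stop_words_demo:
        continue  # skip stopwords before stemming (one possible ordering)
    vocab_demo.add(porter_demo.stem(word))

print(vocab_demo)
```

Because the vocabulary is a set, inflected forms that share a stem collapse into a single entry.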


**2.2 Use your vocabulary now to read each inaugural address into a dataframe**

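Each row of the dataframe should represent a document (one of the addresses). It is handy at this point to also track the size (length) of each document, since you'll need it later when computing TF. The TF-IDF snippet in section 3 reads these lengths from a list-like object named `sizearr`, indexed by document row.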

You should ignore the README file. Hint: use `inaugural.fileids()`.

In [3]:
# Feel free to use your own approach, but you can iteratively update a dataframe using dictionaries.
# Here's an example.
# Note: DataFrame.append was removed in pandas 2.0, so this example needs pandas < 2.0.
df = pd.DataFrame()

dict1 = {'a':2, 'b':3, 'c':4}
dict2 = {'b':30, 'a':20, 'd':50, 'c':40}

df = df.append(dict1, ignore_index=True)
df = df.append(dict2, ignore_index=True)

print(df)

      a     b     c     d
0   2.0   3.0   4.0   NaN
1  20.0  30.0  40.0  50.0
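
On pandas 2.0 and later, where `DataFrame.append` is no longer available, one possible equivalent is to collect the dictionaries in a list and build the dataframe in a single call. This is a sketch, not the required approach; `rows` and `df_demo` are hypothetical names.

```python
import pandas as pd

dict1 = {'a': 2, 'b': 3, 'c': 4}
dict2 = {'b': 30, 'a': 20, 'd': 50, 'c': 40}

# Collect one dict per row, then build the frame once at the end.
rows = [dict1, dict2]
df_demo = pd.DataFrame(rows)

# Keys missing from a row (here 'd' in the first row) become NaN.
print(df_demo)
```

Building the frame once is also usually faster than growing it row by row, which matters with 58 documents and several thousand terms.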


In [5]:
df = pd.DataFrame()
# your code for creating the dataframe goes here.
# the resulting dataframe should be called "df"
#
# Each row of the dataframe should represent a document (inaugural address)
# Each column of the dataframe should be a term from the vocab


In [32]:
len(inaugural.fileids())

58

---
### 3. Compute TF-IDF for the document-term matrix ###

**3.1. Write a function to compute term frequency (TF) for each document**

Please write your own code here, and resist the urge to rely on Google for your answer.

In [10]:
# compute term frequency
# inputs: wordvec is a series that contains, for a given doc, 
#                 the word counts for each term in the vocab
#         doclen  is the length of the document
# returns: a series with new term-frequencies (raw counts divided by doc length)
def computetf(wordvec,doclen):
    ...

**3.2 Write a function to compute inverse document frequency (IDF)**

Please write your own code here, and resist the urge to rely on Google for your answer.

In [11]:
import math # you may need this for the log function

# input:   document-by-term (row-by-column) dataframe
# returns: dictionary of key-value pairs. Keys are terms in the vocab, values are IDF.

def computeidf(df):
    ...
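
For reference, one common way to write these quantities is given below. The TF definition follows the comment on `computetf` (raw count divided by document length); the IDF variant shown is an assumption, since several variants exist and the assignment does not fix one, and the symbols are introduced here for illustration.

```latex
\mathrm{tf}(t, d) = \frac{f_{t,d}}{|d|}, \qquad
\mathrm{idf}(t) = \log \frac{N}{n_t}, \qquad
\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
```

Here $f_{t,d}$ is the raw count of term $t$ in document $d$, $|d|$ is the length of $d$, $N$ is the number of documents, and $n_t$ is the number of documents that contain $t$. A term that appears in every document gets $\mathrm{idf}(t) = \log 1 = 0$ under this variant.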

**Create a new dataframe and populate it with the TF-IDF values for each document-term combination**

The functions you wrote above should work with the code snippet below.

In [12]:
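idfdict = computeidf(df)

cols = df.columns
rows = []
for index, row in df.iterrows():
    # term frequencies for this document
    newrow = computetf(row, sizearr[index])
    # scale each term frequency by the term's IDF
    for c in cols:
        newrow[c] = newrow[c] * idfdict[c]
    rows.append(newrow)

# DataFrame.append was removed in pandas 2.0, so build the frame from the collected rows
newdf = pd.DataFrame(rows)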

In [53]:
newdf.shape

(58, 4439)

---
### 4. Using TF-IDF values, find and rank order the 3 closest inaugural addresses to Donald Trump's 2017 address, measured by cosine similarity

In [56]:
# President Trump's address is 57 (0-indexed)
# newdf.iloc[57,:].head(100)

**4.1 Create an array called dist that contains the cosine similarity between the 2017 inaugural address (called d1 below) and each of the inaugural addresses**

In [None]:
d1 = newdf.iloc[57,:]
dist = []
for index,row in newdf.iterrows():
    # your code goes here
    ...
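
As a reminder of what is being measured, the sketch below computes cosine similarity for two plain Python lists. The function name `cosine_similarity_demo` and the toy vectors are hypothetical, and returning 0 for an all-zero vector is a convention chosen here, not something the assignment specifies.

```python
import math

def cosine_similarity_demo(u, v):
    """Cosine of the angle between two equal-length numeric sequences."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)

# Vectors pointing the same way score (approximately) 1, whatever their lengths.
same_direction = cosine_similarity_demo([1.0, 2.0, 0.0], [2.0, 4.0, 0.0])

# Vectors with no shared non-zero terms score 0.
orthogonal = cosine_similarity_demo([1.0, 0.0], [0.0, 3.0])
```

Cosine similarity is a similarity, not a distance: larger values mean closer documents. The 2017 address compared with itself scores 1, so leave it out when ranking the closest addresses.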


In [None]:
dist

**4.2 Find the 3 closest inaugural addresses, as measured by cosine similarity. Which one is the closest?**

**4.3 Why are these documents "close" to the 2017 address? Please explain.**

*Your explanation goes here*



---
### 5. Compute the first two principal components of the TF-IDF matrix, and plot each document along each of the PCA components
For this question, feel free to use Google, Stack Overflow, etc. to help you compute the PCA (it's pretty easy, just one or two lines). Don't worry too much about the theory for now; we're going to discuss the Principal Component Decomposition later in the semester.

In [71]:
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

In [72]:
# your code to compute the PCA goes here
# The result should be X, an array of 2-element arrays
...
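
Since outside help is allowed for this question, here is a minimal sketch of the scikit-learn call on a small random matrix. `toy_matrix` and `X_demo` are hypothetical names; in the assignment the input would be your TF-IDF dataframe and the result would be stored in `X`.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical stand-in for the TF-IDF matrix: 6 "documents" by 5 "terms".
rng = np.random.default_rng(0)
toy_matrix = rng.random((6, 5))

# Project each row onto the first two principal components.
pca = PCA(n_components=2)
X_demo = pca.fit_transform(toy_matrix)

print(X_demo.shape)  # one 2-element row per document
```

`fit_transform` centers the data before projecting, so each column of the result has mean zero.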

In [77]:
len(X)

58

In [None]:
plt.scatter(X[:, 0], X[:, 1], alpha=0.2)
plt.axis('equal')

# label each point with its file id (row i of X corresponds to fileids[i])
fileids = inaugural.fileids()
for i in range(len(X)):
    plt.text(X[i, 0], X[i, 1], fileids[i])