# Demo 07

In [None]:
import nltk

import os
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
pd.set_option("display.max_rows", 20)

## TF-IDF for Inaugural Addresses

#### Making a Document-Term Matrix with scikit-learn

In yesterday's demo we used scikit-learn's [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) to
> Convert a collection of text documents to a matrix of token counts

Today's demo will use scikit-learn's [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) to
> Convert a collection of raw documents to a matrix of TF-IDF features.

In [None]:
# Create a new TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(input='filename', stop_words='english') 
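
To get a feel for the numbers a TF-IDF vectorizer produces, here is a minimal plain-Python sketch on a toy corpus. It assumes the smoothed formula that scikit-learn documents as its default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each row. The three toy documents and the helper name `tfidf_rows` are made up for illustration, tokenization is a bare `split()`, and no stop words are removed.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, not the inaugural speeches)
docs = [
    "peace and freedom",
    "war and peace",
    "freedom freedom liberty",
]

def tfidf_rows(docs):
    """Return (vocabulary, rows): one L2-normalized TF-IDF row per document."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency (assumed scikit-learn default)
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in raw))
        rows.append([v / norm for v in raw])
    return vocab, rows

vocab, rows = tfidf_rows(docs)
for doc, row in zip(docs, rows):
    weights = {t: round(w, 2) for t, w in zip(vocab, row) if w > 0}
    print(doc, "->", weights)
```

In the second toy document, `war` (found in one document) outweighs `peace` and `and` (each found in two), which is the behavior that makes TF-IDF useful for spotting distinctive words.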

**Question:** What do the options `input='filename'` and `stop_words='english'` do?

Let's read the documentation in the Contextual help

Let's set the path to the folder of speeches

In [None]:
Speeches_path = "data/inaugural_speeches/"

Store the titles of the speeches; we will use this information later

In [None]:
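# splitext drops only the file extension; strip(".txt") would also trim trailing 't', 'x' or '.' characters
titles = [os.path.splitext(title)[0] for title in os.listdir(Speeches_path)]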
" ".join(titles)

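A caveat when turning file names into titles: `str.strip` removes any of the listed characters from both ends rather than a literal suffix, whereas `os.path.splitext` removes only the extension. A small sketch, using a made-up file name in the same style as the data folder and assuming every file ends in `.txt`:

```python
import os

fname = "1933-Franklin_D._Roosevelt.txt"  # hypothetical example file name

# strip() treats ".txt" as a set of characters {'.', 't', 'x'} to trim from both ends
print(fname.strip(".txt"))          # 1933-Franklin_D._Roosevel

# os.path.splitext() splits off only the final extension
print(os.path.splitext(fname)[0])   # 1933-Franklin_D._Roosevelt
```
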
This list comprehension creates the paths to the speech files

*Note for in class: show students how to hide the output of a Jupyter cell*

In [None]:
[Speeches_path + fname for fname in os.listdir(Speeches_path)]

Now let's create our Document-Term matrix, where the features are TF-IDF values

In [None]:
tfidf_vector = tfidf_vectorizer.fit_transform([Speeches_path + fname for fname in os.listdir(Speeches_path)])
tfidf_vector

Let's read contextual help for `.fit_transform`

**Question:** Let's find the size of the matrix, i.e. how many rows and how many columns

In [None]:
tfidf_vector.shape

**Answer:**

Let's store the Document-Term matrix of TF-IDF values in a DataFrame

In [None]:
tfidf_df = pd.DataFrame(tfidf_vector.toarray())
tfidf_df

Let's clean up the dataframe by adding appropriate column names and indices

In [None]:
# Make the titles of the speeches the index of the dataframe
tfidf_df.index = titles
# Make the word types the column names
# (get_feature_names_out() replaces get_feature_names(), which newer scikit-learn versions removed)
tfidf_df.columns = tfidf_vectorizer.get_feature_names_out()
tfidf_df

## Distinctive Words

Sometimes we already know which words indicate the specific things we want to quantify.


In [None]:
tfidf_slice = tfidf_df[['government', 'borders', 'people', 'war', 'honor','foreign', 'men', 'women', 'children']]
tfidf_slice.sort_index().round(decimals=2)

Let's determine the most distinctive terms for each Inaugural Address

In [None]:
doc_word_tfidf_df = tfidf_df.stack().reset_index()
doc_word_tfidf_df

Let's rename the columns so the dataframe is interpretable

In [None]:
doc_word_tfidf_df = doc_word_tfidf_df.rename(columns=
                         {0:'tfidf', 'level_0': 'document','level_1': 'term'})
doc_word_tfidf_df

Now we can sort the terms based on the document (ascending) and tfidf of term (descending)

In [None]:
doc_word_tfidf_df = doc_word_tfidf_df.sort_values(by=['document','tfidf'], ascending=[True,False]).groupby(['document']).head(10)
doc_word_tfidf_df
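
The sort-then-`head` pattern above keeps the highest-scoring rows within each group. A plain-Python sketch of the same idea on made-up (document, term, tfidf) rows follows; the document names, values, and the helper name `top_terms` are illustrative only.

```python
from itertools import groupby

# Hypothetical long-format rows: (document, term, tfidf)
rows = [
    ("1961-Speech_A", "peace", 0.40),
    ("1961-Speech_A", "war", 0.10),
    ("1961-Speech_A", "arms", 0.25),
    ("1973-Speech_B", "peace", 0.30),
    ("1973-Speech_B", "era", 0.35),
    ("1973-Speech_B", "home", 0.05),
]

def top_terms(rows, k):
    """Keep the k highest-tfidf rows for each document."""
    # Sort by document ascending, then tfidf descending
    ordered = sorted(rows, key=lambda r: (r[0], -r[2]))
    result = []
    # groupby() needs its input sorted by the grouping key, which it is here
    for _, group in groupby(ordered, key=lambda r: r[0]):
        result.extend(list(group)[:k])
    return result

for row in top_terms(rows, k=2):
    print(row)
```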

**Question:** How did word usage change between a President's first and second address?

In [None]:
doc_word_tfidf_df[doc_word_tfidf_df['document'].str.contains("George_W._Bush")] 

In [None]:
doc_word_tfidf_df[doc_word_tfidf_df['document'].str.contains("Washington")] 

In [None]:
doc_word_tfidf_df[doc_word_tfidf_df['document'].str.contains("Franklin_D")]

**Question:** How did TF-IDF change over time in these Inaugural Addresses?

In [None]:
doc_word_tfidf_df = doc_word_tfidf_df.sort_values(by=['document','tfidf'], ascending=[True,False]).groupby(['document']).head(10)
doc_word_tfidf_df

In [None]:
doc_word_tfidf_df['document'].apply(lambda x: x.split("-")[0])

In [None]:
doc_word_tfidf_df['year'] = doc_word_tfidf_df['document'].apply(lambda x: x.split("-")[0])

doc_word_tfidf_df[doc_word_tfidf_df['term'] == 'government']

In [None]:
doc_word_tfidf_df[doc_word_tfidf_df['term'] == 'peace']


Regenerate `doc_word_tfidf_df`, this time keeping more than the top 10 words in each document

In [None]:
doc_word_tfidf_df = tfidf_df.stack().reset_index()
doc_word_tfidf_df = doc_word_tfidf_df.rename(columns=
                         {0:'tfidf', 'level_0': 'document','level_1': 'term'})
doc_word_tfidf_df = doc_word_tfidf_df.sort_values(by=['document','tfidf'], ascending=[True,False]).groupby(['document']).head(10000)
doc_word_tfidf_df['year'] = doc_word_tfidf_df['document'].apply(lambda x: x.split("-")[0])

In [None]:
def plot_tfidf_over_time(word):
    ax = doc_word_tfidf_df[doc_word_tfidf_df['term'] == word].plot(kind='line', x='year', y='tfidf')
    ax.set_title(f"TF-IDF of {word} over time")
    ax.set_ylabel("TF-IDF")
    
plot_tfidf_over_time('government')

In [None]:
plot_tfidf_over_time('peace')

In [None]:
plot_tfidf_over_time('war')

(back to slides)

## Similar Addresses

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics.pairwise import cosine_distances

In [None]:
similarity_df = pd.DataFrame(cosine_similarity(tfidf_df))
similarity_df.index =  tfidf_df.index
similarity_df.columns = tfidf_df.index
similarity_df
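
As a reminder of what `cosine_similarity` computes for each pair of rows, here is a minimal plain-Python sketch for two vectors: the dot product divided by the product of their lengths. The vectors are invented, and the helper `cosine` is hypothetical rather than part of scikit-learn.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

a = [1.0, 2.0, 0.0]
b = [2.0, 4.0, 0.0]   # points in the same direction as a
c = [0.0, 0.0, 3.0]   # shares no non-zero dimension with a

print(cosine(a, b))  # parallel vectors: similarity of (approximately) 1
print(cosine(a, c))  # orthogonal vectors: similarity of 0
```

Because TF-IDF weights are non-negative, similarities between speeches fall between 0 and 1, and every speech has a similarity of 1 with itself.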

**Question:** Which two speeches are the most similar?

In [None]:
for key in similarity_df:
    # Sort all speeches by similarity to `key`; position 0 is normally the speech itself
    sorted_similar_speeches = similarity_df[key].sort_values(ascending=False)
    print(f"{sorted_similar_speeches.index[1]} is the most similar speech to {key} "
          f"with a cosine similarity of {sorted_similar_speeches.iloc[1]}")


In [None]:
sorted_similar_speeches = similarity_df[key].sort_values(ascending=False)
sorted_similar_speeches.iloc[1]
sorted_similar_speeches.index[1]
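
The loop above reports the nearest neighbor of each speech; to answer the question for the collection as a whole, one option is to scan every unordered pair and keep the highest score. A minimal sketch on a made-up symmetric similarity table follows (the labels, scores, and the helper name `most_similar_pair` are hypothetical):

```python
from itertools import combinations

# Hypothetical symmetric similarity scores between three speeches
sim = {
    "A": {"A": 1.0, "B": 0.42, "C": 0.17},
    "B": {"A": 0.42, "B": 1.0, "C": 0.58},
    "C": {"A": 0.17, "B": 0.58, "C": 1.0},
}

def most_similar_pair(sim):
    """Return (score, first, second) for the highest-scoring distinct pair."""
    # combinations() yields each unordered pair once and never pairs a key with itself
    return max((sim[x][y], x, y) for x, y in combinations(sim, 2))

score, first, second = most_similar_pair(sim)
print(f"{first} and {second} are the most similar (cosine similarity {score})")
```

Skipping the diagonal matters because every speech is trivially most similar to itself.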

**Question:** Which speech was most similar to Kennedy's famous "Ask not what your country can do for you"?

In [None]:
similarity_df.sort_values('1961-John_F._Kennedy', ascending=False).index

In [None]:
doc_word_tfidf_df[doc_word_tfidf_df['document'] == '1973-Richard_Nixon']

In [None]:
doc_word_tfidf_df[doc_word_tfidf_df['document'] == '1961-John_F._Kennedy']

In [None]:
doc_word_tfidf_df[doc_word_tfidf_df['document'] == '1993-Bill_Clinton']

In [None]:
open("data/inaugural_speeches/1973-Richard_Nixon.txt").read()

## Homework 02

Analyzing NYTimes obituaries using TF-IDF

### Questions for homework

Are your findings robust to:
1. Stopwords
1. Lemmatization
1. Stemming