# TF-IDF applied to Inaugural Addresses using Scikit-Learn

This notebook is based on [TF-IDF with Scikit-Learn](https://melaniewalsh.github.io/Intro-Cultural-Analytics/05-Text-Analysis/03-TF-IDF-Scikit-Learn.html).

We are going to calculate tf-idf scores using the Python library scikit-learn, which provides a class called [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html).

We will apply this to calculate tf-idf scores for U.S. Inaugural Addresses.

Import necessary modules and libraries

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

We are going to look for the "interesting" words in the inaugural speeches. In this case, we wish to see which president said what, so rather than using the NLTK corpus, we use [the same speeches as published on Kaggle](https://www.kaggle.com/code/pabheeshta/us-presidential-inaugural-speeches).

Download this data and place it in a `data` folder inside the directory that contains this notebook.

In [None]:
speechDf = pd.read_csv('data/inaug_speeches.csv', usecols=['Name','Date','text'], encoding='latin1')
speechDf.head()

We need to prepare the DataFrame so that each speech gets a label combining its year and the president's name.

In [None]:
speechDf['year'] = pd.DatetimeIndex(speechDf['Date']).year
speechDf['year_Name'] = speechDf['year'].astype(str).str.cat(speechDf[['Name']], sep="_")
speechDf.head()
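
As a small illustration of the labelling step, the sketch below builds the same kind of `year_Name` label on a made-up two-row table. The names and ISO-style dates are hypothetical, and the date format in the real file may differ; `pd.to_datetime(...).dt.year` is used here as an equivalent way of extracting the year.

```python
import pandas as pd

# Hypothetical rows standing in for the real speech table
toy = pd.DataFrame({
    "Name": ["Speaker A", "Speaker B"],
    "Date": ["1901-03-04", "1905-03-04"],
})

# Parse the dates, keep only the year, and join it to the name
toy["year"] = pd.to_datetime(toy["Date"]).dt.year
toy["year_Name"] = toy["year"].astype(str) + "_" + toy["Name"]
print(toy["year_Name"].tolist())
```

Putting the year first means that sorting the labels alphabetically also sorts the speeches chronologically.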

## Calculate tf–idf

To calculate tf–idf scores for every word, we're going to use scikit-learn's [`TfidfVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html).

When you initialize TfidfVectorizer, you can choose to set it with different parameters. These parameters will change the way you calculate tf–idf.

The recommended way to run `TfidfVectorizer` is with smoothing (`smooth_idf=True`) and normalization (`norm='l2'`) turned on. L2 normalization scales each document's vector to unit length, which accounts for differences in text length, and smoothing adds one to every document frequency, which prevents division by zero. Together they tend to produce more meaningful tf–idf scores. Both are the default settings for `TfidfVectorizer`, so you don't need any extra code to turn them on.
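
To get a feel for what these two defaults do, the sketch below recomputes tf–idf by hand for a toy corpus, using only the standard library. It assumes the smoothed formula given in the scikit-learn documentation, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each document's vector. The toy documents and the helper names `smoothed_idf` and `tfidf_row` are made up for illustration, and the sketch does not reproduce scikit-learn's tokenization or stop-word removal.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, for illustration only)
docs = [
    "peace and liberty",
    "war and peace",
    "liberty liberty liberty",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: number of documents containing each term
df = Counter(term for tokens in tokenized for term in set(tokens))

def smoothed_idf(term):
    # Smoothing adds 1 to numerator and denominator, as if one extra
    # document contained every term; the trailing + 1 keeps terms that
    # occur in every document from being zeroed out.
    return math.log((1 + n_docs) / (1 + df[term])) + 1

def tfidf_row(tokens):
    counts = Counter(tokens)
    raw = {term: count * smoothed_idf(term) for term, count in counts.items()}
    # L2 normalization: scale the vector to unit Euclidean length
    length = math.sqrt(sum(v * v for v in raw.values()))
    return {term: v / length for term, v in raw.items()}

rows = [tfidf_row(tokens) for tokens in tokenized]
for doc, row in zip(docs, rows):
    print(doc, {term: round(score, 2) for term, score in sorted(row.items())})
```

In the second toy document, "war" scores higher than "peace" because it appears in fewer documents, and every row has unit length regardless of how many words the document contains.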

In [None]:
# Initialize TfidfVectorizer with the desired parameters (default
# smoothing and normalization)
tfidf_vectorizer = TfidfVectorizer(input='content', stop_words='english')

# Run TfidfVectorizer on the text in speechDf.
tfidf_vector = tfidf_vectorizer.fit_transform(speechDf["text"])

# Make a DataFrame out of the resulting tf-idf matrix, setting the
# "feature names" (words) as columns and the year_Name labels as rows
tfidf_df = pd.DataFrame(tfidf_vector.toarray(), index=speechDf['year_Name'],
  columns=tfidf_vectorizer.get_feature_names_out())

Add a row for document frequency, i.e. the number of documents in which each word appears.

In [None]:
tfidf_df.loc['00_Document Frequency'] = (tfidf_df > 0).sum()
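
To make the document-frequency row concrete, here is a minimal sketch on a made-up score table (the labels and values are hypothetical). Comparing against zero gives a True/False table, and summing it column-wise counts the documents in which each term has a nonzero score.

```python
import pandas as pd

# Hypothetical tf-idf scores: rows are documents, columns are terms
toy = pd.DataFrame(
    {"war": [0.0, 0.4, 0.7], "peace": [0.5, 0.2, 0.1], "union": [0.0, 0.0, 0.3]},
    index=["doc_a", "doc_b", "doc_c"],
)

# True where a term occurs in a document; summing counts documents per term
doc_freq = (toy > 0).sum()
print(doc_freq)
```

Here "peace" appears in all three toy documents, "war" in two, and "union" in one.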

In [None]:
tfidf_slice = tfidf_df[['government', 'borders', 'people', 'obama', 'war', 'honor','foreign', 'men', 'women', 'children']]
tfidf_slice.sort_index().round(decimals=2)

Let's drop "00_Document Frequency" since we were just using it for illustration purposes.

In [None]:
tfidf_df = tfidf_df.drop('00_Document Frequency', errors='ignore')

Let's reorganize the DataFrame so that the words are in rows rather than columns.

In [None]:
tfidf_df.stack().reset_index()

In [None]:
tfidf_df = tfidf_df.stack().reset_index()
tfidf_df = tfidf_df.rename(columns={0: 'tfidf', 'level_0': 'year_Name', 'level_1': 'term'})
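
The wide-to-long reshaping can be seen more easily on a tiny table. The sketch below uses made-up labels and scores; the column names assigned at the end are chosen for illustration.

```python
import pandas as pd

# Hypothetical wide table: one row per document, one column per term
wide = pd.DataFrame(
    {"war": [0.4, 0.0], "peace": [0.2, 0.5]},
    index=pd.Index(["doc_a", "doc_b"], name="document"),
)

# stack() moves the column labels into a second index level, and
# reset_index() turns both index levels into ordinary columns
long = wide.stack().reset_index()
long.columns = ["document", "term", "tfidf"]
print(long)
```

Each (document, term) pair now has its own row, which is the shape that sorting and grouping operations expect.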

To find the 10 words with the highest tf–idf in every speech, we're going to sort by document and tf–idf score, then group by `year_Name` and take the first 10 rows of each group.

In [None]:
tfidf_df.sort_values(by=['year_Name','tfidf'], ascending=[True,False]).groupby(['year_Name']).head(10)

In [None]:
top_tfidf = tfidf_df.sort_values(by=['year_Name','tfidf'], ascending=[True,False]).groupby(['year_Name']).head(10)
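
The sort-then-`head` pattern is sketched below on a made-up long table, keeping the top 2 terms per document instead of the top 10. All names and values are hypothetical.

```python
import pandas as pd

# Hypothetical long table of scores
scores = pd.DataFrame({
    "document": ["doc_a", "doc_a", "doc_a", "doc_b", "doc_b", "doc_b"],
    "term": ["war", "peace", "union", "war", "peace", "union"],
    "tfidf": [0.1, 0.6, 0.3, 0.5, 0.2, 0.4],
})

# Sort by document (ascending) and score (descending); head(2) then keeps
# the first two rows of each group in that sorted order
top2 = (
    scores.sort_values(by=["document", "tfidf"], ascending=[True, False])
    .groupby("document")
    .head(2)
)
print(top2)
```

Because `head` keeps rows in the order they already have, the sort has to come before the `groupby`.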

We can zoom in on particular words and particular documents.

In [None]:
top_tfidf[top_tfidf['term'].str.contains('women')]

It turns out that the term "women" is very distinctive in Obama's Inaugural Address.
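
One caveat when filtering this way: `str.contains` matches substrings (and by default treats its pattern as a regular expression), so a short pattern can also pick up longer terms. A small sketch with made-up terms shows the difference from an exact comparison.

```python
import pandas as pd

terms = pd.Series(["men", "women", "government", "war"])

# Substring match: every term containing the letters "men"
substring_hits = terms[terms.str.contains("men")].tolist()
print(substring_hits)  # ['men', 'women', 'government']

# Exact match: only the term "men" itself
exact_hits = terms[terms == "men"].tolist()
print(exact_hits)  # ['men']
```

When only the exact term is wanted, comparing with `==` avoids these extra matches.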

In [None]:
top_tfidf[top_tfidf['year_Name'].str.contains('Obama')]

In [None]:
top_tfidf[top_tfidf['year_Name'].str.contains('Trump')]

In [None]:
top_tfidf[top_tfidf['year_Name'].str.contains('Kennedy')]

## Visualize TF-IDF

We can also visualize our TF-IDF results with the data visualization library Altair, which can be installed with conda:

    conda install -c conda-forge altair

Let's make a heatmap that shows the highest TF-IDF scoring words for each inaugural address, and let's put a red dot next to two terms of interest: "war" and "peace".

The code below was contributed by [Eric Monson](https://github.com/emonson). Thanks, Eric!

In [None]:
import altair as alt
import numpy as np

# Terms in this list will get a red dot in the visualization
term_list = ['war', 'peace']

# adding a little randomness to break ties in term ranking
top_tfidf_plusRand = top_tfidf.copy()
top_tfidf_plusRand['tfidf'] = top_tfidf_plusRand['tfidf'] + np.random.rand(top_tfidf.shape[0])*0.0001

# base for all visualizations, with rank calculation
base = alt.Chart(top_tfidf_plusRand).encode(
    x = 'rank:O',
    y = 'year_Name:N'
).transform_window(
    rank = "rank()",
    sort = [alt.SortField("tfidf", order="descending")],
    groupby = ["year_Name"],
)

# heatmap specification
heatmap = base.mark_rect().encode(
    color = 'tfidf:Q'
)

# red circle over terms in above list
circle = base.mark_circle(size=100).encode(
    color = alt.condition(
        alt.FieldOneOfPredicate(field='term', oneOf=term_list),
        alt.value('red'),
        alt.value('#FFFFFF00')  # fully transparent, so other terms show no dot
    )
)

# text labels, white for darker heatmap colors
text = base.mark_text(baseline='middle').encode(
    text = 'term:N',
    color = alt.condition(alt.datum.tfidf >= 0.23, alt.value('white'), alt.value('black'))
)

# display the three superimposed visualizations
(heatmap + circle + text).properties(width = 600)
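
The random offset added to `tfidf` in the cell above changes on every run. If a reproducible ranking is preferred, one option (an assumption, not part of the contributed code) is to draw the offsets from a seeded NumPy generator. The sketch below uses made-up scores to show that a tiny jitter breaks ties without reordering scores that differ by more than the jitter.

```python
import numpy as np

# Hypothetical scores with a tie between the first two terms
scores = np.array([0.30, 0.30, 0.25, 0.10])

# Seeded generator so the same offsets are drawn on every run
rng = np.random.default_rng(seed=0)
jittered = scores + rng.random(scores.shape[0]) * 0.0001

# The tie is broken, and the untied scores keep their relative order
print(jittered[0] != jittered[1])
print(jittered[2] > jittered[3])
```

Without the jitter, tied terms would receive the same rank and be drawn on top of one another in the same heatmap cell.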