# TF-IDF applied to Inaugural Addresses using Scikit-Learn

This notebook is based on [TF-IDF with Scikit-Learn](https://melaniewalsh.github.io/Intro-Cultural-Analytics/05-Text-Analysis/03-TF-IDF-Scikit-Learn.html).

We are going to calculate tf-idf scores using the Python library scikit-learn, which has a class called [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html).

We will apply this to calculate tf-idf scores for U.S. Inaugural Addresses.

Import necessary modules and libraries

In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

We are going to look for the "interesting" words in the inaugural speeches. Because we want to see which president said what, we use [the same data from Kaggle](https://www.kaggle.com/code/pabheeshta/us-presidential-inaugural-speeches) rather than the NLTK corpus.

Download this data and put it in a `data` folder in the same directory as this notebook.

In [2]:
speechDf = pd.read_csv('data/inaug_speeches.csv', usecols=['Name','Date','text'], encoding='latin1')
speechDf.head()

Unnamed: 0,Name,Date,text
0,George Washington,"Thursday, April 30, 1789",Fellow-Citizens of the Senate and o...
1,George Washington,"Monday, March 4, 1793",Fellow Citizens: I AM again cal...
2,John Adams,"Saturday, March 4, 1797","WHEN it was first perceived, in ..."
3,Thomas Jefferson,"Wednesday, March 4, 1801",Friends and Fellow-Citizens: CA...
4,Thomas Jefferson,"Monday, March 4, 1805","PROCEEDING, fellow-citizens, to ..."


We need to prepare the dataframe so that we can label each speech appropriately.

In [3]:
speechDf['year'] = pd.DatetimeIndex(speechDf['Date']).year
speechDf['year_Name'] = speechDf['year'].astype(str).str.cat(speechDf[['Name']], sep="_")
speechDf.head()

Unnamed: 0,Name,Date,text,year,year_Name
0,George Washington,"Thursday, April 30, 1789",Fellow-Citizens of the Senate and o...,1789,1789_George Washington
1,George Washington,"Monday, March 4, 1793",Fellow Citizens: I AM again cal...,1793,1793_George Washington
2,John Adams,"Saturday, March 4, 1797","WHEN it was first perceived, in ...",1797,1797_John Adams
3,Thomas Jefferson,"Wednesday, March 4, 1801",Friends and Fellow-Citizens: CA...,1801,1801_Thomas Jefferson
4,Thomas Jefferson,"Monday, March 4, 1805","PROCEEDING, fellow-citizens, to ...",1805,1805_Thomas Jefferson


## Calculate tf–idf

To calculate tf–idf scores for every word, we're going to use scikit-learn's [`TfidfVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html).

When you initialize TfidfVectorizer, you can choose to set it with different parameters. These parameters will change the way you calculate tf–idf.

The recommended way to run `TfidfVectorizer` is with smoothing (`smooth_idf=True`) and normalization (`norm='l2'`) turned on. Smoothing adds one to every document frequency, as if an extra document contained every term, which prevents division by zero. L2 normalization scales each document's vector to unit length, which compensates for differences in text length. Together they produce more meaningful tf–idf scores. Both are the default settings for `TfidfVectorizer`, so you don't need any extra code to turn them on.
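
To make the defaults concrete, here is a minimal sketch that recomputes the scores by hand on a toy corpus and compares them with `TfidfVectorizer`. It assumes the smoothed formula given in the scikit-learn documentation, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each row. The three toy sentences and all variable names are made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus (illustrative only, not the inaugural data)
docs = ["the war ended", "the peace began", "the war and the peace"]

# Raw term counts: one row per document, one column per term
counts = CountVectorizer().fit_transform(docs).toarray()

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)  # number of documents containing each term

# Smoothed idf, as with smooth_idf=True
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# tf * idf, then scale every row to unit L2 length, as with norm='l2'
raw = counts * idf
manual = raw / np.linalg.norm(raw, axis=1, keepdims=True)

# Scores from TfidfVectorizer with its default settings
sklearn_scores = TfidfVectorizer().fit_transform(docs).toarray()

matches = np.allclose(manual, sklearn_scores)
print(matches)
```

If the two arrays agree, the manual steps reproduce what the default settings do on this small example.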

Initialize TfidfVectorizer with desired parameters (default smoothing and normalization)

In [4]:
tfidf_vectorizer = TfidfVectorizer(input='content', stop_words='english')

Run TfidfVectorizer on the `text` in `speechDf`.

In [5]:
tfidf_vector = tfidf_vectorizer.fit_transform(speechDf["text"])

Make a DataFrame out of the resulting tf–idf vector, setting the "feature names" or words as columns and the `year_Name` labels as rows

In [6]:
tfidf_df = pd.DataFrame(tfidf_vector.toarray(), index=speechDf['year_Name'], columns=tfidf_vectorizer.get_feature_names_out())

Add a row for document frequency, i.e. the number of documents in which each word appears

In [7]:
tfidf_df.loc['00_Document Frequency'] = (tfidf_df > 0).sum()
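
The expression `(tfidf_df > 0).sum()` works because a term has a non-zero score only in documents where it occurs. A small sketch on a made-up frame (the values and labels are illustrative, not taken from the speeches):

```python
import pandas as pd

# Toy tf-idf table: rows are documents, columns are terms
toy = pd.DataFrame(
    {"war": [0.3, 0.0, 0.1], "peace": [0.0, 0.0, 0.2]},
    index=["doc_a", "doc_b", "doc_c"],
)

# Boolean mask of non-zero scores, summed down each column,
# counts how many documents contain each term
doc_freq = (toy > 0).sum()
print(doc_freq)
```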

In [8]:
tfidf_slice = tfidf_df[['government', 'borders', 'people', 'obama', 'war', 'honor','foreign', 'men', 'women', 'children']]
tfidf_slice.sort_index().round(decimals=2)

year_Name,government,borders,people,obama,war,honor,foreign,men,women,children
00_Document Frequency,53.0,5.0,56.0,1.0,45.0,32.0,32.0,47.0,15.0,22.0
1789_George Washington,0.11,0.0,0.05,0.0,0.0,0.0,0.0,0.02,0.0,0.0
1793_George Washington,0.06,0.0,0.06,0.0,0.0,0.09,0.0,0.0,0.0,0.0
1797_John Adams,0.16,0.0,0.19,0.0,0.01,0.1,0.12,0.04,0.0,0.0
1801_Thomas Jefferson,0.16,0.0,0.02,0.0,0.01,0.04,0.0,0.04,0.0,0.0
1805_Thomas Jefferson,0.03,0.0,0.0,0.0,0.04,0.0,0.06,0.01,0.0,0.02
1809_James Madison,0.0,0.0,0.02,0.0,0.02,0.05,0.05,0.0,0.0,0.0
1813_James Madison,0.04,0.0,0.04,0.0,0.26,0.02,0.02,0.0,0.0,0.0
1817_James Monroe,0.17,0.0,0.11,0.0,0.09,0.01,0.1,0.04,0.0,0.0
1821_James Monroe,0.08,0.0,0.07,0.0,0.11,0.02,0.04,0.01,0.0,0.01


Let's drop "00_Document Frequency" since we were just using it for illustration purposes.

In [9]:
tfidf_df = tfidf_df.drop('00_Document Frequency', errors='ignore')

Let's reorganize the DataFrame so that the words are in rows rather than columns.

In [10]:
tfidf_df.stack().reset_index()

Unnamed: 0,year_Name,level_1,0
0,1789_George Washington,0085,0.000000
1,1789_George Washington,0092,0.000000
2,1789_George Washington,0093,0.000000
3,1789_George Washington,0094,0.000000
4,1789_George Washington,0097,0.014789
...,...,...,...
514107,2017_Donald J. Trump,youthful,0.000000
514108,2017_Donald J. Trump,zeal,0.000000
514109,2017_Donald J. Trump,zealous,0.000000
514110,2017_Donald J. Trump,zealously,0.000000


In [11]:
tfidf_df = tfidf_df.stack().reset_index()

In [12]:
tfidf_df = tfidf_df.rename(columns={0: 'tfidf', 'level_1': 'term'})
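
The wide-to-long reshaping above can be followed on a tiny frame. This is a sketch with made-up labels and values; only the `stack`, `reset_index` and `rename` steps mirror the cells above.

```python
import pandas as pd

# Toy wide table: one row per document, one column per term
wide = pd.DataFrame(
    {"peace": [0.0, 0.2], "war": [0.3, 0.1]},
    index=pd.Index(["1789_A", "1793_B"], name="year_Name"),
)

# stack() moves the column labels into an inner index level;
# reset_index() turns both index levels into ordinary columns
long_df = wide.stack().reset_index()

# The unnamed term level becomes 'level_1' and the values land in column 0
long_df = long_df.rename(columns={"level_1": "term", 0: "tfidf"})
print(long_df)
```

Each (document, term) pair now occupies its own row, which is the shape needed for sorting and grouping.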

To find the 10 words with the highest tf–idf in every speech, we sort by `year_Name` and then by `tfidf` in descending order, group by `year_Name`, and take the first 10 rows of each group.

In [13]:
tfidf_df.sort_values(by=['year_Name','tfidf'], ascending=[True,False]).groupby(['year_Name']).head(10)

Unnamed: 0,year_Name,term,tfidf
3593,1789_George Washington,government,0.114344
3988,1789_George Washington,immutable,0.104489
4055,1789_George Washington,impressions,0.104489
6213,1789_George Washington,providential,0.104489
5512,1789_George Washington,ought,0.104333
...,...,...,...
509760,2017_Donald J. Trump,jobs,0.134850
511443,2017_Donald J. Trump,protected,0.125096
513178,2017_Donald J. Trump,thank,0.109974
510951,2017_Donald J. Trump,people,0.106140


In [14]:
top_tfidf = tfidf_df.sort_values(by=['year_Name','tfidf'], ascending=[True,False]).groupby(['year_Name']).head(10)
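
The sort-then-`groupby().head()` pattern used above can be checked on a toy frame. The rows below are invented for illustration and the example keeps the top 2 terms per document instead of the top 10.

```python
import pandas as pd

# Toy long table with made-up scores
scores = pd.DataFrame({
    "year_Name": ["1793_B", "1789_A", "1789_A", "1793_B", "1789_A"],
    "term": ["war", "peace", "duty", "union", "war"],
    "tfidf": [0.4, 0.1, 0.5, 0.2, 0.3],
})

# Sort documents ascending and scores descending, then keep
# the first 2 rows of every document group
top2 = (
    scores.sort_values(by=["year_Name", "tfidf"], ascending=[True, False])
          .groupby("year_Name")
          .head(2)
)
print(top2)
```

Because `head` keeps the rows in their sorted order, each group's highest-scoring terms come first.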

We can zoom in on particular words and particular documents.

In [15]:
top_tfidf[top_tfidf['term'].str.contains('women')]

Unnamed: 0,year_Name,term,tfidf


The result is empty: "women" is not among the top 10 terms of any speech. We can also look at the top terms of a single speech, such as Obama's Inaugural Address.

In [16]:
top_tfidf[top_tfidf['year_Name'].str.contains('Obama')]

Unnamed: 0,year_Name,term,tfidf
487521,2009_Barack Obama,0092,0.317896
487524,2009_Barack Obama,0097,0.219136
487866,2009_Barack Obama,america,0.136347
492747,2009_Barack Obama,nation,0.110501
492808,2009_Barack Obama,new,0.108454
487520,2009_Barack Obama,0085,0.100977
491049,2009_Barack Obama,generation,0.092509
495537,2009_Barack Obama,today,0.090431
492215,2009_Barack Obama,let,0.083729
492032,2009_Barack Obama,jobs,0.083386


In [17]:
top_tfidf[top_tfidf['year_Name'].str.contains('Trump')]

Unnamed: 0,year_Name,term,tfidf
505594,2017_Donald J. Trump,america,0.330747
505249,2017_Donald J. Trump,0092,0.32131
507772,2017_Donald J. Trump,dreams,0.153805
505595,2017_Donald J. Trump,american,0.140953
510592,2017_Donald J. Trump,obama,0.134954
509760,2017_Donald J. Trump,jobs,0.13485
511443,2017_Donald J. Trump,protected,0.125096
513178,2017_Donald J. Trump,thank,0.109974
510951,2017_Donald J. Trump,people,0.10614
506179,2017_Donald J. Trump,borders,0.101138


In [18]:
top_tfidf[top_tfidf['year_Name'].str.contains('Kennedy')]

Unnamed: 0,year_Name,term,tfidf
381156,1961_John F. Kennedy,0097,0.317063
385847,1961_John F. Kennedy,let,0.254751
388374,1961_John F. Kennedy,sides,0.249977
386993,1961_John F. Kennedy,pledge,0.153077
381711,1961_John F. Kennedy,ask,0.102438
381945,1961_John F. Kennedy,begin,0.101279
383072,1961_John F. Kennedy,dare,0.101279
384395,1961_John F. Kennedy,final,0.101279
389957,1961_John F. Kennedy,world,0.09806
386440,1961_John F. Kennedy,new,0.091869


## Visualize TF-IDF

We can also visualize our TF-IDF results with the data visualization library Altair, which needs to be installed using

    conda install -c conda-forge altair

Let's make a heatmap that shows the highest TF-IDF scoring words for each speech, and let's put a red dot on two terms of interest: "war" and "peace".

The code below was contributed by [Eric Monson](https://github.com/emonson). Thanks, Eric!

In [19]:
import altair as alt
import numpy as np

# Terms in this list will get a red dot in the visualization
term_list = ['war', 'peace']

# adding a little randomness to break ties in term ranking
top_tfidf_plusRand = top_tfidf.copy()
top_tfidf_plusRand['tfidf'] = top_tfidf_plusRand['tfidf'] + np.random.rand(top_tfidf.shape[0])*0.0001

# base for all visualizations, with rank calculation
base = alt.Chart(top_tfidf_plusRand).encode(
    x = 'rank:O',
    y = 'year_Name:N'
).transform_window(
    rank = "rank()",
    sort = [alt.SortField("tfidf", order="descending")],
    groupby = ["year_Name"],
)

# heatmap specification
heatmap = base.mark_rect().encode(
    color = 'tfidf:Q'
)

# red circle over terms in above list
circle = base.mark_circle(size=100).encode(
    color = alt.condition(
        alt.FieldOneOfPredicate(field='term', oneOf=term_list),
        alt.value('red'),
        alt.value('#FFFFFF00')        
    )
)

# text labels, white for darker heatmap colors
text = base.mark_text(baseline='middle').encode(
    text = 'term:N',
    color = alt.condition(alt.datum.tfidf >= 0.23, alt.value('white'), alt.value('black'))
)

# display the three superimposed visualizations
(heatmap + circle + text).properties(width = 600)