# Text Analysis in Python 3: Comparing Texts

# Term Frequency - Inverse Document Frequency (TF-IDF)

<img src = "https://miro.medium.com/max/720/1*qQgnyPLDIkUmeZKN2_ZWbQ.webp" style="width:60%">

Image from Yassine Hamdaoui, ["TF(Term Frequency)-IDF(Inverse Document Frequency) from scratch in python"](https://towardsdatascience.com/tf-term-frequency-idf-inverse-document-frequency-from-scratch-in-python-6c2b61b78558) *Towards Data Science (Medium)* (Dec. 9, 2019).

For this lesson, we will be drawing on the [TFIDF section](https://melaniewalsh.github.io/Intro-Cultural-Analytics/05-Text-Analysis/01-TF-IDF.html) in the online book: Melanie Walsh, [*Introduction to Cultural Analytics and Python*](https://melaniewalsh.github.io/Intro-Cultural-Analytics/welcome.html), Version 1 (2021), https://doi.org/10.5281/zenodo.4411250. 

<img src="https://melaniewalsh.github.io/Intro-Cultural-Analytics/_static/favicon.ico" style="width:30%">

All sections below labeled **MW** come from this book. Please consider supporting that project if you find it useful.




## TF-IDF with Scikit-Learn [MW]

Tf-idf is a method that tries to identify the most distinctively frequent or significant words in a document. 

In this lesson, we’re going to learn how to calculate tf-idf scores using a collection of plain text (.txt) files and the Python library scikit-learn, which has a quick and nifty module called TfidfVectorizer.

In this lesson, we will cover how to:

    Calculate and normalize tf-idf scores for U.S. Inaugural Addresses with scikit-learn


## Dataset: U.S. Inaugural Addresses [MW]

    This is the meaning of our liberty and our creed; why men and women and children of every race and every faith can join in celebration across this magnificent Mall, and why a man whose father less than 60 years ago might not have been served at a local restaurant can now stand before you to take a most sacred oath. So let us mark this day with remembrance of who we are and how far we have traveled.

    —Barack Obama, Inaugural Presidential Address, January 2009

During Barack Obama’s Inaugural Address in January 2009, he mentioned “women” four different times, including in the passage quoted above. How distinctive is Obama’s inclusion of women in this address compared to all other U.S. Presidents? This is one of the questions that we’re going to try to answer with tf-idf.


## Breaking Down the TF-IDF Formula [MW]

But first, let’s quickly discuss the tf-idf formula. The idea is pretty simple.

**tf-idf = term_frequency * inverse_document_frequency**

**term_frequency** = number of times a given term appears in document

**inverse_document_frequency** = log(total number of documents / number of documents with term) + 1\*

You take the number of times a term occurs in a document (term frequency). Then you take the number of documents in which the same term occurs at least once divided by the total number of documents (document frequency), flip that fraction on its head, and take its logarithm plus 1 (inverse document frequency). Then you multiply the two numbers together (term_frequency * inverse_document_frequency).

The reason we take the inverse, or flipped fraction, of document frequency is to boost the rarer words that occur in relatively few documents. Think about the inverse document frequency for the word “said” vs. the word “pigeons” in *Lost in the City*, a collection of 14 short stories used as an example in Walsh's book. The term “said” appears in 13 (document frequency) of the 14 stories (total documents) (14 / 13 –> a smaller inverse document frequency), while the term “pigeons” occurs in only 2 (document frequency) of the 14 stories (total documents) (14 / 2 –> a bigger inverse document frequency, a bigger tf-idf boost).

\*There are several slightly different ways to calculate inverse document frequency. The version of idf that we’re going to use is the scikit-learn default, which uses “smoothing,” meaning that it adds 1 to both the numerator and the denominator:

**inverse_document_frequency** = log((1 + total_number_of_documents) / (number_of_documents_with_term +1)) + 1
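To make the effect of smoothing concrete, here is a small sketch that evaluates both versions of the formula for the “said” and “pigeons” counts quoted above (14 documents, document frequencies of 13 and 2). The function names are illustrative and are not part of scikit-learn, and the natural logarithm is assumed.

```python
import math

def idf_plain(n_docs, doc_freq):
    """Unsmoothed idf: log(N / df) + 1."""
    return math.log(n_docs / doc_freq) + 1

def idf_smooth(n_docs, doc_freq):
    """Smoothed idf: log((1 + N) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

n_docs = 14
for term, doc_freq in [("said", 13), ("pigeons", 2)]:
    # Print the term followed by its unsmoothed and smoothed idf
    print(term, round(idf_plain(n_docs, doc_freq), 3), round(idf_smooth(n_docs, doc_freq), 3))
```

In both versions the rarer term (“pigeons”) receives the larger weight; smoothing only shrinks the gap slightly.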

## TF-IDF with scikit-learn [MW]

scikit-learn, imported as sklearn, is a popular Python library for machine learning approaches such as clustering, classification, and regression. Though we’re not doing any machine learning in this lesson, we’re nevertheless going to use scikit-learn’s TfidfVectorizer and CountVectorizer.

Install scikit-learn

In [None]:
# !pip install scikit-learn
# already included with Anaconda; the package is installed as "scikit-learn" and imported as "sklearn"

Import necessary modules and libraries

In [24]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')
import pandas as pd, numpy as np, altair as alt
# pd.set_option("max_rows", 600)  # raises an OptionError
pd.set_option("display.max_rows", 600)  # using the full option name, "display.max_rows", avoids the error
from pathlib import Path
import glob, collections
              

The cell above also imports pandas and changes its default display setting, and it imports two libraries that will help us work with files and the file system: [pathlib](https://docs.python.org/3/library/pathlib.html#basic-use) and [glob](https://docs.python.org/3/library/glob.html).

In [2]:
#MW: Below we’re setting the directory filepath that contains all the text files that we want to analyze.
directory_path="US_Inaugural_Addresses"
#MW: Then we’re going to use glob and Path to make a list of all the filepaths in that directory and a list of all the text titles (here, the inaugural addresses).
text_files=glob.glob(f"{directory_path}/*.txt")
text_files

['US_Inaugural_Addresses\\01_washington_1789.txt',
 'US_Inaugural_Addresses\\02_washington_1793.txt',
 'US_Inaugural_Addresses\\03_adams_john_1797.txt',
 'US_Inaugural_Addresses\\04_jefferson_1801.txt',
 'US_Inaugural_Addresses\\05_jefferson_1805.txt',
 'US_Inaugural_Addresses\\06_madison_1809.txt',
 'US_Inaugural_Addresses\\07_madison_1813.txt',
 'US_Inaugural_Addresses\\08_monroe_1817.txt',
 'US_Inaugural_Addresses\\09_monroe_1821.txt',
 'US_Inaugural_Addresses\\10_adams_john_quincy_1825.txt',
 'US_Inaugural_Addresses\\11_jackson_1829.txt',
 'US_Inaugural_Addresses\\12_jackson_1833.txt',
 'US_Inaugural_Addresses\\13_van_buren_1837.txt',
 'US_Inaugural_Addresses\\14_harrison_1841.txt',
 'US_Inaugural_Addresses\\15_polk_1845.txt',
 'US_Inaugural_Addresses\\16_taylor_1849.txt',
 'US_Inaugural_Addresses\\17_pierce_1853.txt',
 'US_Inaugural_Addresses\\18_buchanan_1857.txt',
 'US_Inaugural_Addresses\\19_lincoln_1861.txt',
 'US_Inaugural_Addresses\\20_lincoln_1865.txt',
 'US_Inaugural_Addre

In [3]:
text_titles = [Path(text).stem for text in text_files]
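The same two lists can also be built with pathlib alone. The sketch below is an alternative, not the code used in this lesson: it writes three placeholder files into a temporary folder (the file names are borrowed from the listing above, the contents are dummies) and sorts the paths explicitly, since the order in which glob returns files is not guaranteed.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    corpus_dir = Path(tmp) / "US_Inaugural_Addresses"
    corpus_dir.mkdir()
    # Create dummy files in a deliberately shuffled order
    for name in ["02_washington_1793", "01_washington_1789", "03_adams_john_1797"]:
        (corpus_dir / f"{name}.txt").write_text("placeholder text", encoding="utf-8")

    # Path.glob yields Path objects; sorted() puts them in filename order
    demo_files = sorted(corpus_dir.glob("*.txt"))
    # .stem drops both the folder and the ".txt" extension
    demo_titles = [path.stem for path in demo_files]

print(demo_titles)
```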

In [4]:
text_titles

['01_washington_1789',
 '02_washington_1793',
 '03_adams_john_1797',
 '04_jefferson_1801',
 '05_jefferson_1805',
 '06_madison_1809',
 '07_madison_1813',
 '08_monroe_1817',
 '09_monroe_1821',
 '10_adams_john_quincy_1825',
 '11_jackson_1829',
 '12_jackson_1833',
 '13_van_buren_1837',
 '14_harrison_1841',
 '15_polk_1845',
 '16_taylor_1849',
 '17_pierce_1853',
 '18_buchanan_1857',
 '19_lincoln_1861',
 '20_lincoln_1865',
 '21_grant_1869',
 '22_grant_1873',
 '23_hayes_1877',
 '24_garfield_1881',
 '25_cleveland_1885',
 '26_harrison_1889',
 '27_cleveland_1893',
 '28_mckinley_1897',
 '29_mckinley_1901',
 '30_roosevelt_theodore_1905',
 '31_taft_1909',
 '32_wilson_1913',
 '33_wilson_1917',
 '34_harding_1921',
 '35_coolidge_1925',
 '36_hoover_1929',
 '37_roosevelt_franklin_1933',
 '38_roosevelt_franklin_1937',
 '39_roosevelt_franklin_1941',
 '40_roosevelt_franklin_1945',
 '41_truman_1949',
 '42_eisenhower_1953',
 '43_eisenhower_1957',
 '44_kennedy_1961',
 '45_johnson_1965',
 '46_nixon_1969',
 '47_

## Calculate tf-idf [MW]

To calculate tf–idf scores for every word, we’re going to use scikit-learn’s TfidfVectorizer.

When you initialize TfidfVectorizer, you can choose to set it with different parameters. These parameters will change the way you calculate tf–idf.

The recommended way to run TfidfVectorizer is with smoothing (smooth_idf = True) and normalization (norm='l2') turned on. These parameters will better account for differences in text length, and overall produce more meaningful tf–idf scores. Smoothing and L2 normalization are actually the default settings for TfidfVectorizer, so to turn them on, you don’t need to include any extra code at all.
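The following sketch spells out what smoothing and L2 normalization do, using three made-up mini-documents and plain Python rather than scikit-learn. It applies the smoothed idf formula given earlier and then rescales each document's scores to unit length. The toy texts and helper names are assumptions for illustration, and scikit-learn's own tokenization and stop-word handling are left out.

```python
import math
from collections import Counter

docs = {
    "doc_a": "war war peace people",
    "doc_b": "peace people people",
    "doc_c": "people government",
}

# Raw term counts per document (term frequency)
counts = {name: Counter(text.split()) for name, text in docs.items()}
vocab = sorted({term for c in counts.values() for term in c})
n_docs = len(docs)

# Smoothed idf: log((1 + N) / (1 + df)) + 1
doc_freq = {t: sum(1 for c in counts.values() if t in c) for t in vocab}
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}

def tfidf_row(term_counts):
    """tf * idf for one document, scaled to unit Euclidean (L2) length."""
    raw = {t: term_counts[t] * idf[t] for t in vocab}
    length = math.sqrt(sum(v * v for v in raw.values()))
    return {t: v / length for t, v in raw.items()}

scores = {name: tfidf_row(c) for name, c in counts.items()}
for name, row in scores.items():
    # Show only the terms that actually occur in each document
    print(name, {t: round(v, 2) for t, v in row.items() if v > 0})
```

Because every row is rescaled to length 1, a long address does not get larger scores than a short one simply by repeating words more often.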

Initialize TfidfVectorizer with desired parameters (default smoothing and normalization)

In [5]:
tfidf_vectorizer = TfidfVectorizer(input='filename', stop_words='english')

Run TfidfVectorizer on our text_files

In [6]:
tfidf_vector = tfidf_vectorizer.fit_transform(text_files)

Make a DataFrame out of the resulting tf–idf vector, setting the “feature names” or words as columns and the titles as rows

In [7]:
# Note from Simon: TfidfVectorizer returns a sparse matrix, so we call .toarray() before building the DataFrame.
tfidf_df = pd.DataFrame(tfidf_vector.toarray(), index=text_titles, columns=tfidf_vectorizer.get_feature_names_out())
# get_feature_names() has been deprecated in favor of get_feature_names_out(), which is used above.

Add a row for document frequency, i.e., the number of documents in which each word appears

In [8]:
tfidf_df.loc['00_Document Frequency'] = (tfidf_df > 0).sum()

In [9]:
tfidf_slice = tfidf_df[['government', 'borders', 'people', 'obama', 'war', 'honor','foreign', 'men', 'women', 'children']]
tfidf_slice.sort_index().round(decimals=2)

Unnamed: 0,government,borders,people,obama,war,honor,foreign,men,women,children
00_Document Frequency,53.0,5.0,56.0,3.0,45.0,32.0,32.0,47.0,15.0,22.0
01_washington_1789,0.11,0.0,0.05,0.0,0.0,0.0,0.0,0.02,0.0,0.0
02_washington_1793,0.06,0.0,0.05,0.0,0.0,0.08,0.0,0.0,0.0,0.0
03_adams_john_1797,0.16,0.0,0.19,0.0,0.01,0.1,0.12,0.04,0.0,0.0
04_jefferson_1801,0.16,0.0,0.01,0.0,0.01,0.04,0.0,0.04,0.0,0.0
05_jefferson_1805,0.03,0.0,0.0,0.0,0.04,0.0,0.06,0.01,0.0,0.02
06_madison_1809,0.0,0.0,0.02,0.0,0.02,0.05,0.05,0.0,0.0,0.0
07_madison_1813,0.04,0.0,0.04,0.0,0.25,0.02,0.02,0.0,0.0,0.0
08_monroe_1817,0.17,0.0,0.11,0.0,0.09,0.01,0.1,0.04,0.0,0.0
09_monroe_1821,0.08,0.0,0.06,0.0,0.11,0.02,0.04,0.01,0.0,0.01


Let’s drop “00_Document Frequency” since we were just using it for illustration purposes.

In [10]:
tfidf_df = tfidf_df.drop('00_Document Frequency', errors='ignore')

Let’s reorganize the DataFrame so that the words are in rows rather than columns.

In [11]:
tfidf_df.stack().reset_index()

Unnamed: 0,level_0,level_1,0
0,01_washington_1789,000,0.000000
1,01_washington_1789,03,0.000000
2,01_washington_1789,04,0.023259
3,01_washington_1789,05,0.000000
4,01_washington_1789,100,0.000000
...,...,...,...
521937,58_trump_2017,zachary,0.000000
521938,58_trump_2017,zeal,0.000000
521939,58_trump_2017,zealous,0.000000
521940,58_trump_2017,zealously,0.000000


In [12]:
tfidf_df = tfidf_df.stack().reset_index()

In [13]:
tfidf_df = tfidf_df.rename(columns={0:'tfidf', 'level_0': 'document','level_1': 'term', 'level_2': 'term'})

To find the 10 words with the highest tf–idf for every address, we’re going to sort by document and tfidf score, then group by document and take the first 10 values.

In [14]:
tfidf_df.sort_values(by=['document','tfidf'], ascending=[True,False]).groupby(['document']).head(10)

Unnamed: 0,document,term,tfidf
3707,01_washington_1789,government,0.113681
4108,01_washington_1789,immutable,0.103883
4175,01_washington_1789,impressions,0.103883
6337,01_washington_1789,providential,0.103883
5631,01_washington_1789,ought,0.103728
6351,01_washington_1789,public,0.103102
6117,01_washington_1789,present,0.097516
6389,01_washington_1789,qualifications,0.096372
5811,01_washington_1789,peculiarly,0.090546
653,01_washington_1789,article,0.085786


In [15]:
top_tfidf = tfidf_df.sort_values(by=['document','tfidf'], ascending=[True,False]).groupby(['document']).head(10)
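The sort-then-`head` pattern is easier to follow on a tiny table. The sketch below uses a made-up long-format DataFrame with the same three columns as `tfidf_df` and keeps the top 2 terms per document instead of the top 10; the values are invented for illustration.

```python
import pandas as pd

# Toy long-format table with the same columns as tfidf_df (values are made up)
toy = pd.DataFrame({
    "document": ["a", "a", "a", "b", "b", "b"],
    "term": ["war", "peace", "people", "war", "peace", "people"],
    "tfidf": [0.10, 0.40, 0.25, 0.30, 0.05, 0.20],
})

# Sort by document, then by descending score, and keep the first 2 rows per document
top2 = (
    toy.sort_values(by=["document", "tfidf"], ascending=[True, False])
       .groupby("document")
       .head(2)
)
print(top2)
```

Because `head` simply takes the first rows of each group, the result depends entirely on the preceding sort: the descending `tfidf` order is what makes these the top-scoring terms.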

We can zoom in on particular words and particular documents.

In [16]:
top_tfidf[top_tfidf['term'].str.contains('women')]

Unnamed: 0,document,term,tfidf
503861,56_obama_2009,women,0.084859


It turns out that the term “women” is very distinctive in Obama’s Inaugural Address.

In [17]:
top_tfidf[top_tfidf['document'].str.contains('obama')]

Unnamed: 0,document,term,tfidf
495406,56_obama_2009,america,0.148351
500298,56_obama_2009,nation,0.120229
500358,56_obama_2009,new,0.118002
503093,56_obama_2009,today,0.114792
498590,56_obama_2009,generation,0.100654
499762,56_obama_2009,let,0.0911
499578,56_obama_2009,jobs,0.090727
496911,56_obama_2009,crisis,0.087235
498779,56_obama_2009,hard,0.084859
503861,56_obama_2009,women,0.084859


In [18]:
top_tfidf[top_tfidf['document'].str.contains('trump')]

Unnamed: 0,document,term,tfidf
513404,58_trump_2017,america,0.350162
515585,58_trump_2017,dreams,0.156436
513405,58_trump_2017,american,0.149226
517576,58_trump_2017,jobs,0.142766
519262,58_trump_2017,protected,0.132439
518409,58_trump_2017,obama,0.120288
518766,58_trump_2017,people,0.11237
521001,58_trump_2017,thank,0.109171
513989,58_trump_2017,borders,0.107075
521596,58_trump_2017,ve,0.107075


In [19]:
top_tfidf[top_tfidf['document'].str.contains('kennedy')]

Unnamed: 0,document,term,tfidf
391774,44_kennedy_1961,let,0.267869
394306,44_kennedy_1961,sides,0.262849
392921,44_kennedy_1961,pledge,0.16096
387632,44_kennedy_1961,ask,0.107713
387864,44_kennedy_1961,begin,0.106495
388991,44_kennedy_1961,dare,0.106495
395895,44_kennedy_1961,world,0.10311
390313,44_kennedy_1961,final,0.102311
392370,44_kennedy_1961,new,0.0966
390120,44_kennedy_1961,explore,0.094223


## Visualize TF-IDF [MW]

We can also visualize our TF-IDF results with the data visualization library Altair.

In [None]:
#!pip install altair

Let’s make a heatmap that shows the highest TF-IDF scoring words for each president, and let’s put a red dot next to two terms of interest: “war” and “peace”:

The code below was contributed by [Eric Monson](https://github.com/emonson). Thanks, Eric!

In [20]:
import altair as alt
import numpy as np

# Terms in this list will get a red dot in the visualization
term_list = ['war', 'peace']

# adding a little randomness to break ties in term ranking
top_tfidf_plusRand = top_tfidf.copy()
top_tfidf_plusRand['tfidf'] = top_tfidf_plusRand['tfidf'] + np.random.rand(top_tfidf.shape[0])*0.0001

# base for all visualizations, with rank calculation
base = alt.Chart(top_tfidf_plusRand).encode(
    x = 'rank:O',
    y = 'document:N'
).transform_window(
    rank = "rank()",
    sort = [alt.SortField("tfidf", order="descending")],
    groupby = ["document"],
)

# heatmap specification
heatmap = base.mark_rect().encode(
    color = 'tfidf:Q'
)

# red circle over terms in above list
circle = base.mark_circle(size=100).encode(
    color = alt.condition(
        alt.FieldOneOfPredicate(field='term', oneOf=term_list),
        alt.value('red'),
        alt.value('#FFFFFF00')        
    )
)

# text labels, white for darker heatmap colors
text = base.mark_text(baseline='middle').encode(
    text = 'term:N',
    color = alt.condition(alt.datum.tfidf >= 0.23, alt.value('white'), alt.value('black'))
)

# display the three superimposed visualizations
(heatmap + circle + text).properties(width = 600)


Your Turn!

Take a few minutes to explore the dataframe above and then answer the following questions.

1. What is the difference between a tf-idf score and raw word frequency?

2. Based on the dataframe above, what is one potential problem or limitation that you notice with tf-idf scores?

3. What’s another collection of texts that you think might be interesting to analyze with tf-idf scores? Why?


## Applying TF-IDF to the SOTU

Now we will apply the lessons and code from Melanie Walsh's book to our corpus of State of the Union (SOTU) addresses. To begin, this simply involves plugging our dataset into her code. We will make more modifications as we go...

In [59]:
directory_path="sotu"
text_files=glob.glob(f"{directory_path}/*.txt")
text_files
                     

['sotu\\Adams_1797.txt',
 'sotu\\Adams_1798.txt',
 'sotu\\Adams_1799.txt',
 'sotu\\Adams_1800.txt',
 'sotu\\Adams_1825.txt',
 'sotu\\Adams_1826.txt',
 'sotu\\Adams_1827.txt',
 'sotu\\Adams_1828.txt',
 'sotu\\Arthur_1881.txt',
 'sotu\\Arthur_1882.txt',
 'sotu\\Arthur_1883.txt',
 'sotu\\Arthur_1884.txt',
 'sotu\\Buchanan_1857.txt',
 'sotu\\Buchanan_1858.txt',
 'sotu\\Buchanan_1859.txt',
 'sotu\\Buchanan_1860.txt',
 'sotu\\Buren_1837.txt',
 'sotu\\Buren_1838.txt',
 'sotu\\Buren_1839.txt',
 'sotu\\Buren_1840.txt',
 'sotu\\Bush_1989.txt',
 'sotu\\Bush_1990.txt',
 'sotu\\Bush_1991.txt',
 'sotu\\Bush_1992.txt',
 'sotu\\Bush_2001.txt',
 'sotu\\Bush_2002.txt',
 'sotu\\Bush_2003.txt',
 'sotu\\Bush_2004.txt',
 'sotu\\Bush_2005.txt',
 'sotu\\Bush_2006.txt',
 'sotu\\Bush_2007.txt',
 'sotu\\Bush_2008.txt',
 'sotu\\Carter_1978.txt',
 'sotu\\Carter_1979.txt',
 'sotu\\Carter_1980.txt',
 'sotu\\Carter_1981.txt',
 'sotu\\Cleveland_1885.txt',
 'sotu\\Cleveland_1886.txt',
 'sotu\\Cleveland_1887.txt',
 'sot

In [60]:
text_titles = [Path(text).stem for text in text_files]
text_titles

['Adams_1797',
 'Adams_1798',
 'Adams_1799',
 'Adams_1800',
 'Adams_1825',
 'Adams_1826',
 'Adams_1827',
 'Adams_1828',
 'Arthur_1881',
 'Arthur_1882',
 'Arthur_1883',
 'Arthur_1884',
 'Buchanan_1857',
 'Buchanan_1858',
 'Buchanan_1859',
 'Buchanan_1860',
 'Buren_1837',
 'Buren_1838',
 'Buren_1839',
 'Buren_1840',
 'Bush_1989',
 'Bush_1990',
 'Bush_1991',
 'Bush_1992',
 'Bush_2001',
 'Bush_2002',
 'Bush_2003',
 'Bush_2004',
 'Bush_2005',
 'Bush_2006',
 'Bush_2007',
 'Bush_2008',
 'Carter_1978',
 'Carter_1979',
 'Carter_1980',
 'Carter_1981',
 'Cleveland_1885',
 'Cleveland_1886',
 'Cleveland_1887',
 'Cleveland_1888',
 'Cleveland_1893',
 'Cleveland_1894',
 'Cleveland_1895',
 'Cleveland_1896',
 'Clinton_1993',
 'Clinton_1994',
 'Clinton_1995',
 'Clinton_1996',
 'Clinton_1997',
 'Clinton_1998',
 'Clinton_1999',
 'Clinton_2000',
 'Coolidge_1923',
 'Coolidge_1924',
 'Coolidge_1925',
 'Coolidge_1926',
 'Coolidge_1927',
 'Coolidge_1928',
 'Eisenhower_1954',
 'Eisenhower_1955',
 'Eisenhower_195

In [61]:
tfidf_vectorizer=TfidfVectorizer(input='filename',stop_words='english')
tfidf_vector=tfidf_vectorizer.fit_transform(text_files)

In [62]:
tfidf_df = pd.DataFrame(tfidf_vector.toarray(), index=text_titles, columns=tfidf_vectorizer.get_feature_names_out())

In [63]:
tfidf_df.loc['00_Document Frequency'] = (tfidf_df > 0).sum()
## replace the words below with your own search terms
tfidf_slice = tfidf_df[['government', 'borders', 'people', 'obama', 'war', 'honor','foreign', 'men', 'women', 'children']]
tfidf_slice.sort_index().round(decimals=2)

Unnamed: 0,government,borders,people,obama,war,honor,foreign,men,women,children
00_Document Frequency,226.0,79.0,219.0,1.0,220.0,140.0,200.0,196.0,96.0,125.0
Adams_1797,0.02,0.0,0.03,0.0,0.05,0.05,0.06,0.03,0.0,0.0
Adams_1798,0.07,0.0,0.02,0.0,0.06,0.05,0.04,0.0,0.0,0.0
Adams_1799,0.11,0.0,0.06,0.0,0.04,0.06,0.03,0.02,0.0,0.0
Adams_1800,0.08,0.0,0.05,0.0,0.02,0.02,0.02,0.0,0.0,0.0
Adams_1825,0.05,0.01,0.03,0.0,0.06,0.01,0.02,0.05,0.0,0.01
Adams_1826,0.08,0.01,0.01,0.0,0.07,0.01,0.03,0.0,0.0,0.01
Adams_1827,0.08,0.0,0.01,0.0,0.04,0.01,0.01,0.02,0.0,0.0
Adams_1828,0.05,0.01,0.04,0.0,0.05,0.01,0.05,0.01,0.0,0.01
Arthur_1881,0.2,0.0,0.05,0.0,0.01,0.0,0.04,0.0,0.0,0.0


In [64]:
tfidf_df = tfidf_df.drop('00_Document Frequency', errors='ignore')
tfidf_df.stack().reset_index()

Unnamed: 0,level_0,level_1,0
0,Adams_1797,00,0.0
1,Adams_1797,000,0.0
2,Adams_1797,0000,0.0
3,Adams_1797,0001,0.0
4,Adams_1797,001,0.0
...,...,...,...
5606515,Wilson_1920,zone,0.0
5606516,Wilson_1920,zones,0.0
5606517,Wilson_1920,zoological,0.0
5606518,Wilson_1920,zooming,0.0


In [65]:
tfidf_df = tfidf_df.stack().reset_index()
tfidf_df = tfidf_df.rename(columns={0:'tfidf', 'level_0': 'document','level_1': 'term', 'level_2': 'term'})
# To find the 10 words with the highest tf–idf for every address, we sort by document and tfidf score, then group by document and take the first 10 values.


In [66]:
#tfidf_df.tail(100)
tfidf_df[['pres','year']] = tfidf_df['document'].str.split("_",expand=True)


In [67]:
tfidf_df.tail(20)

Unnamed: 0,document,term,tfidf,pres,year
5606500,Wilson_1920,zeal,0.0,Wilson,1920
5606501,Wilson_1920,zealand,0.0,Wilson,1920
5606502,Wilson_1920,zealous,0.0,Wilson,1920
5606503,Wilson_1920,zealously,0.0,Wilson,1920
5606504,Wilson_1920,zelaya,0.0,Wilson,1920
5606505,Wilson_1920,zeppelin,0.0,Wilson,1920
5606506,Wilson_1920,zero,0.0,Wilson,1920
5606507,Wilson_1920,zeros,0.0,Wilson,1920
5606508,Wilson_1920,zest,0.0,Wilson,1920
5606509,Wilson_1920,zigzag,0.0,Wilson,1920


In [68]:
tfidf_df.sort_values(by=['year','pres','tfidf'], ascending=[True,True,False]).groupby(['document']).head(10)

Unnamed: 0,document,term,tfidf,pres,year
5237670,Washington_1790,00,0.000000,Washington,1790
5237671,Washington_1790,000,0.000000,Washington,1790
5237672,Washington_1790,0000,0.000000,Washington,1790
5237673,Washington_1790,0001,0.000000,Washington,1790
5237674,Washington_1790,001,0.000000,Washington,1790
...,...,...,...,...,...
5127504,Trump_2018,isis,0.138086,Trump,2018
5119650,Trump_2018,cj,0.123817,Trump,2018
5127779,Trump_2018,kenton,0.123817,Trump,2018
5134685,Trump_2018,seong,0.123817,Trump,2018


In [69]:
top_tfidf = tfidf_df.sort_values(by=['year','pres','tfidf'], ascending=[True,True,False]).groupby(['document']).head(10)
top_tfidf[top_tfidf['term'].str.contains('women')]

Unnamed: 0,document,term,tfidf,pres,year


In [72]:
top_tfidf.head()

Unnamed: 0,document,term,tfidf,pres,year
5237670,Washington_1790,00,0.0,Washington,1790
5237671,Washington_1790,000,0.0,Washington,1790
5237672,Washington_1790,0000,0.0,Washington,1790
5237673,Washington_1790,0001,0.0,Washington,1790
5237674,Washington_1790,001,0.0,Washington,1790


In [74]:
#import altair as alt
#import numpy as np

# Terms in this list will get a red dot in the visualization
term_list = ['war', 'peace']

# adding a little randomness to break ties in term ranking
top_tfidf_plusRand = top_tfidf.copy()
top_tfidf_plusRand['tfidf'] = top_tfidf_plusRand['tfidf'] + np.random.rand(top_tfidf.shape[0])*0.0001

# base for all visualizations, with rank calculation
base = alt.Chart(top_tfidf_plusRand).encode(
    x = 'rank:O',
    #y = 'document:N'
    y = 'year'
).transform_window(
    rank = "rank()",
    sort = [alt.SortField("tfidf", order="descending")],
    groupby = ["document"],
)

# heatmap specification
heatmap = base.mark_rect().encode(
    color = 'tfidf:Q'
)

# red circle over terms in above list
circle = base.mark_circle(size=100).encode(
    color = alt.condition(
        alt.FieldOneOfPredicate(field='term', oneOf=term_list),
        alt.value('red'),
        alt.value('#FFFFFF00')        
    )
)

# text labels, white for darker heatmap colors
text = base.mark_text(baseline='middle').encode(
    text = 'term:N',
    color = alt.condition(alt.datum.tfidf >= 0.23, alt.value('white'), alt.value('black'))
)

# display the three superimposed visualizations
(heatmap + circle + text).properties(width = 600)

### SOTU TFIDF - GROUP BY President

Re-create the dataframe of tokens, then group by `presnum` and aggregate the tokens for each president.



In [27]:
sotudir2=Path("sotu2")
pathlist = sorted(sotudir2.glob('*.txt'))  # Path.glob returns a one-shot generator; sorted() turns it into a reusable list in filename (year) order, which the presnum step below relies on
txtList = []
for path in pathlist:
    fn=path.stem
    #print(fn)
    fileType=path.suffix
    year,pres=fn.split("_")
    with open(path,'r') as f:  
        sotu = f.read()
    tokens=tokenizer.tokenize(sotu)
    txt = sotu+"\n\n\n\n"
    txtLen = len(sotu)
    ltokens = [tok.lower() for tok in tokens]
    numtoks = len(ltokens)
    tokfreqs=collections.Counter(ltokens)
    #wordFreq = tokfreqs[searchTerm]
    #print(wordFreq)
    txtList.append([pres,year,numtoks,txtLen,tokens,txt])
#print(len(txtList))
colnames=['pres','year','numtoks','txtLen','tokens','txt']
sotudf=pd.DataFrame(txtList,columns=colnames)  ##


In [4]:
sotudf.head(30)

Unnamed: 0,pres,year,numtoks,txtLen,tokens,txt
0,Washington,1790,0,0,[],\n\n\n\n
1,Washington,1791,2314,14136,"[Fellow, Citizens, of, the, Senate, and, House...",Fellow-Citizens of the Senate and House of Rep...
2,Washington,1792,2104,12697,"[Fellow, Citizens, of, the, Senate, and, House...",Fellow-Citizens of the Senate and House of Rep...
3,Washington,1793,1973,11629,"[Fellow, Citizens, of, the, Senate, and, House...",Fellow-Citizens of the Senate and House of Rep...
4,Washington,1794,2918,17575,"[Fellow, Citizens, of, the, Senate, and, House...",Fellow-Citizens of the Senate and House of Rep...
5,Washington,1795,1988,12257,"[Fellow, Citizens, of, the, Senate, and, House...",Fellow-Citizens of the Senate and House of Rep...
6,Washington,1796,2878,17301,"[Fellow, Citizens, of, the, Senate, and, House...",Fellow-Citizens of the Senate and House of Rep...
7,Adams,1797,2060,12440,"[Gentlemen, of, the, Senate, and, Gentlemen, o...",Gentlemen of the Senate and Gentlemen of the H...
8,Adams,1798,2218,13362,"[Gentlemen, of, the, Senate, and, Gentlemen, o...",Gentlemen of the Senate and Gentlemen of the H...
9,Adams,1799,1505,9204,"[Gentlemen, of, the, Senate, and, Gentlemen, o...",Gentlemen of the Senate and Gentlemen of the H...


In [28]:
sotudf["presnum"] = (sotudf["pres"] != sotudf["pres"].shift()).cumsum()
sotudf.tail(30)

Unnamed: 0,pres,year,numtoks,txtLen,tokens,txt,presnum
198,Bush,1989,4917,27817,"[Mr, Speaker, Mr, President, and, distinguishe...","Mr. Speaker, Mr. President, and distinguished ...",39
199,Bush,1990,3880,21396,"[Tonight, I, come, not, to, speak, about, the,...","Tonight, I come not to speak about the ""State ...",39
200,Bush,1991,3878,22395,"[Mr, President, Mr, Speaker, members, of, the,...","Mr. President, Mr. Speaker, members of the Uni...",39
201,Bush,1992,4855,26606,"[Mr, Speaker, Mr, President, distinguished, me...","Mr. Speaker, Mr. President, distinguished memb...",39
202,Clinton,1993,7127,39214,"[Mr, President, Mr, Speaker, Members, of, the,...","Mr. President, Mr. Speaker, Members of the Hou...",40
203,Clinton,1994,7570,42280,"[Mr, Speaker, Mr, President, members, of, the,...","Mr. Speaker, Mr. President, members of the 103...",40
204,Clinton,1995,9437,51285,"[Mr, President, Mr, Speaker, members, of, the,...","Mr. President, Mr. Speaker, members of the 104...",40
205,Clinton,1996,6412,36346,"[Mr, Speaker, Mr, Vice, President, members, of...","Mr. Speaker, Mr. Vice President, members of th...",40
206,Clinton,1997,6876,38998,"[Mr, Speaker, Mr, Vice, President, members, of...","Mr. Speaker, Mr. Vice President, members of th...",40
207,Clinton,1998,7458,42215,"[Mr, Speaker, Mr, Vice, President, members, of...","Mr. Speaker, Mr. Vice President, members of th...",40
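The `presnum` column numbers each unbroken run of the same surname, which is why the two Bush presidencies receive different numbers. The plain-Python sketch below restates that logic with `itertools.groupby` instead of pandas, on a shortened, illustrative list of surnames. It assumes the rows are already in chronological order, as the pandas version also does.

```python
from itertools import groupby

# Surnames in chronological order (a shortened, illustrative sequence)
surnames = ["Reagan", "Bush", "Bush", "Clinton", "Clinton", "Bush", "Bush", "Obama"]

run_numbers = []
for run_number, (surname, run) in enumerate(groupby(surnames), start=1):
    # Every row in the same unbroken run of a surname shares one number
    run_numbers.extend([run_number] * len(list(run)))

print(list(zip(surnames, run_numbers)))
```

One caveat of this approach is that two different presidents with the same surname serving back to back would be merged into a single run.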


In [29]:
sotudf2 = sotudf.groupby(['pres','presnum']).agg({'numtoks':'sum','txtLen':'sum','year':'first','tokens':'sum','txt':'sum'})
#sotudf2['numtoks2'] = sotudf2['tokens'].apply(len) # a test to see that all tokens were included in combined token list
sotudf2 = sotudf2.sort_values(['presnum'])
sotudf2.tail(10)

pres,presnum,numtoks,txtLen,year,tokens,txt
Johnson,34,29422,168522,1964,"[Mr, Speaker, Mr, President, Members, of, the,...","Mr. Speaker, Mr. President, Members of the Hou..."
Nixon,35,19952,113200,1970,"[Mr, Speaker, Mr, President, my, colleagues, i...","Mr. Speaker, Mr. President, my colleagues in t..."
Ford,36,13908,82347,1975,"[Mr, Speaker, Mr, Vice, President, Members, of...","Mr. Speaker, Mr. Vice President, Members of th..."
Carter,37,45591,284077,1978,"[Two, years, ago, today, we, had, the, first, ...",Two years ago today we had the first caucus in...
Reagan,38,33031,190597,1982,"[Mr, Speaker, Mr, President, distinguished, Me...","Mr. Speaker, Mr. President, distinguished Memb..."
Bush,39,17530,98214,1989,"[Mr, Speaker, Mr, President, and, distinguishe...","Mr. Speaker, Mr. President, and distinguished ..."
Clinton,40,60117,338094,1993,"[Mr, President, Mr, Speaker, Members, of, the,...","Mr. President, Mr. Speaker, Members of the Hou..."
Bush,41,40867,237345,2001,"[To, the, Congress, of, the, United, States, M...",To the Congress of the United States:\n\nMr. S...
Obama,42,53807,301862,2009,"[Madame, Speaker, Mr, Vice, President, Members...","Madame Speaker, Mr. Vice President, Members of..."
Trump,43,10299,59616,2017,"[Thank, you, very, much, Mr, Speaker, Mr, Vice...","Thank you very much. Mr. Speaker, Mr. Vice Pre..."


In [30]:
#testRow = sotudf2.query("presnum == 39")
#string1 = list(testRow['tokens'].str[4915:4930])
#print(string1)
import os
sotupath3 = "sotu_presComb"
# Create the output directory if it does not already exist
os.makedirs(sotupath3, exist_ok=True)

for i, r in sotudf2.iterrows():
    #print(r['txt'][:200])
    pres, presnum = i[0],i[1]
    year = r['year']
    with open(Path(sotupath3,"%s_%s_%s.txt"%(pres,presnum,year)),'w') as text_file:
        text_file.write(r['txt'])
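For comparison, the same write-one-file-per-president step can be sketched with pathlib only. This is a hedged alternative rather than the code used above: the `combined` dictionary and its texts are hypothetical stand-ins for the rows of `sotudf2`, and the files go into a temporary folder so that nothing on disk is overwritten.

```python
import tempfile
from pathlib import Path

# Hypothetical combined texts keyed by (surname, presidency number, first year)
combined = {
    ("Bush", 39, "1989"): "first combined text",
    ("Clinton", 40, "1993"): "second combined text",
}

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "sotu_presComb"
    # exist_ok=True makes the call safe whether or not the folder already exists
    out_dir.mkdir(parents=True, exist_ok=True)
    for (pres, presnum, year), text in combined.items():
        # File names follow the pres_presnum_year pattern used above
        (out_dir / f"{pres}_{presnum}_{year}.txt").write_text(text, encoding="utf-8")
    written = sorted(path.name for path in out_dir.glob("*.txt"))

print(written)
```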


In [31]:
directory_path="sotu_presComb"
# Use glob to list all the filepaths in that directory (one combined file per president).
text_files2=glob.glob(f"{directory_path}/*.txt")
text_files2

['sotu_presComb\\Adams_2_1797.txt',
 'sotu_presComb\\Adams_6_1825.txt',
 'sotu_presComb\\Arthur_19_1881.txt',
 'sotu_presComb\\Buchanan_14_1857.txt',
 'sotu_presComb\\Buren_8_1837.txt',
 'sotu_presComb\\Bush_39_1989.txt',
 'sotu_presComb\\Bush_41_2001.txt',
 'sotu_presComb\\Carter_37_1978.txt',
 'sotu_presComb\\Cleveland_20_1885.txt',
 'sotu_presComb\\Cleveland_22_1893.txt',
 'sotu_presComb\\Clinton_40_1993.txt',
 'sotu_presComb\\Coolidge_28_1923.txt',
 'sotu_presComb\\Eisenhower_32_1954.txt',
 'sotu_presComb\\Fillmore_12_1850.txt',
 'sotu_presComb\\Ford_36_1975.txt',
 'sotu_presComb\\Grant_17_1869.txt',
 'sotu_presComb\\Harding_27_1921.txt',
 'sotu_presComb\\Harrison_21_1889.txt',
 'sotu_presComb\\Hayes_18_1877.txt',
 'sotu_presComb\\Hoover_29_1929.txt',
 'sotu_presComb\\Jackson_7_1829.txt',
 'sotu_presComb\\Jefferson_3_1801.txt',
 'sotu_presComb\\Johnson_16_1865.txt',
 'sotu_presComb\\Johnson_34_1964.txt',
 'sotu_presComb\\Kennedy_33_1962.txt',
 'sotu_presComb\\Lincoln_15_1861.txt',


In [32]:
text_titles2 = [Path(text).stem for text in text_files2]
print(text_titles2[:10])

['Adams_2_1797', 'Adams_6_1825', 'Arthur_19_1881', 'Buchanan_14_1857', 'Buren_8_1837', 'Bush_39_1989', 'Bush_41_2001', 'Carter_37_1978', 'Cleveland_20_1885', 'Cleveland_22_1893']


In [33]:
text_files = text_files2
text_titles = text_titles2
tfidf_vectorizer = TfidfVectorizer(input='filename', stop_words='english')
tfidf_vector = tfidf_vectorizer.fit_transform(text_files)
tfidf_df = pd.DataFrame(tfidf_vector.toarray(), index=text_titles, columns=tfidf_vectorizer.get_feature_names_out())
tfidf_df = tfidf_df.stack().reset_index()
tfidf_df.head(10)

Unnamed: 0,level_0,level_1,0
0,Adams_2_1797,0,0.0
1,Adams_2_1797,0,0.0
2,Adams_2_1797,0,0.0
3,Adams_2_1797,1,0.0
4,Adams_2_1797,1,0.0
5,Adams_2_1797,2,0.0
6,Adams_2_1797,3,0.0
7,Adams_2_1797,4,0.0
8,Adams_2_1797,5,0.0
9,Adams_2_1797,6,0.0


In [34]:
tfidf_df = tfidf_df.rename(columns={0:'tfidf', 'level_0': 'document','level_1': 'term', 'level_2': 'term'})
tfidf_df[['pres','presnum','year']] = tfidf_df['document'].str.split("_",expand=True)

In [35]:
tfidf_df.head(10)

Unnamed: 0,document,term,tfidf,pres,presnum,year
0,Adams_2_1797,0,0.0,Adams,2,1797
1,Adams_2_1797,0,0.0,Adams,2,1797
2,Adams_2_1797,0,0.0,Adams,2,1797
3,Adams_2_1797,1,0.0,Adams,2,1797
4,Adams_2_1797,1,0.0,Adams,2,1797
5,Adams_2_1797,2,0.0,Adams,2,1797
6,Adams_2_1797,3,0.0,Adams,2,1797
7,Adams_2_1797,4,0.0,Adams,2,1797
8,Adams_2_1797,5,0.0,Adams,2,1797
9,Adams_2_1797,6,0.0,Adams,2,1797


In [36]:
#identifies and sorts the top 10 terms with the highest tfidf scores for each president (across all addresses given by that president)
top_tfidf = tfidf_df.sort_values(by=['presnum','pres','tfidf'], ascending=[True,True,False]).groupby(['document']).head(10)
top_tfidf[top_tfidf['term'].str.contains('public')]
#print(top_tfidf.shape)

Unnamed: 0,document,term,tfidf,pres,presnum,year
1025807,Washington_1_1790,public,0.164076,Washington,1,1790
804497,Polk_10_1845,public,0.152876,Polk,10,1845
337287,Fillmore_12_1850,public,0.109596,Fillmore,12,1850
779907,Pierce_13_1853,public,0.14495,Pierce,13,1853
460237,Hayes_18_1877,public,0.175067,Hayes,18,1877
214337,Cleveland_20_1885,public,0.129465,Cleveland,20,1885
853677,Roosevelt_24_1901,public,0.137248,Roosevelt,24,1901
411057,Harding_27_1921,public,0.168416,Harding,27,1921
288107,Coolidge_28_1923,public,0.133585,Coolidge,28,1923
484827,Hoover_29_1929,public,0.130528,Hoover,29,1929
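The rows above jump from Washington (1) to Polk (10) because `str.split` returns strings, so `presnum` is sorted as text rather than as a number. A small sketch of the difference, using a few presidency numbers as sample values:

```python
# presnum values as they come out of str.split: strings, not integers
presnums = ["1", "2", "10", "12", "13", "18", "20", "3"]

as_text = sorted(presnums)              # compares character by character
as_numbers = sorted(presnums, key=int)  # compares numeric values

print(as_text)
print(as_numbers)
```

If chronological order is wanted, one option (not applied in the cell above) would be to convert the column with `astype(int)` before sorting.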


In [37]:
#def tfidf (text_files)






#import altair as alt
#import numpy as np

# Terms in this list will get a red dot in the visualization
term_list = ['war', 'peace']

# adding a little randomness to break ties in term ranking
top_tfidf_plusRand = top_tfidf.copy()
top_tfidf_plusRand['tfidf'] = top_tfidf_plusRand['tfidf'] + np.random.rand(top_tfidf.shape[0])*0.0001

# base for all visualizations, with rank calculation
base = alt.Chart(top_tfidf_plusRand).encode(
    x = 'rank:O',
    #y = 'document:N'
    y = 'year'
).transform_window(
    rank = "rank()",
    sort = [alt.SortField("tfidf", order="descending")],
    groupby = ["document"],
)

# heatmap specification
heatmap = base.mark_rect().encode(
    color = 'tfidf:Q'
)

# red circle over terms in above list
circle = base.mark_circle(size=100).encode(
    color = alt.condition(
        alt.FieldOneOfPredicate(field='term', oneOf=term_list),
        alt.value('red'),
        alt.value('#FFFFFF00')        
    )
)

# text labels, white for darker heatmap colors
text = base.mark_text(baseline='middle').encode(
    text = 'term:N',
    color = alt.condition(alt.datum.tfidf >= 0.23, alt.value('white'), alt.value('black'))
)

# display the three superimposed visualizations
(heatmap + circle + text).properties(width = 600)

