# Text Analysis in Python 3: Comparing Texts

# Term Frequency - Inverse Document Frequency (TFIDF)

<img src = "https://miro.medium.com/max/720/1*qQgnyPLDIkUmeZKN2_ZWbQ.webp" style="width:60%">

Image from Yassine Hamdaoui, ["TF(Term Frequency)-IDF(Inverse Document Frequency) from scratch in python"](https://towardsdatascience.com/tf-term-frequency-idf-inverse-document-frequency-from-scratch-in-python-6c2b61b78558) *Towards Data Science (Medium)* (Dec. 9, 2019).

For this lesson, we will be drawing on the [TFIDF section](https://melaniewalsh.github.io/Intro-Cultural-Analytics/05-Text-Analysis/01-TF-IDF.html) in the online book: Melanie Walsh, [*Introduction to Cultural Analytics and Python*](https://melaniewalsh.github.io/Intro-Cultural-Analytics/welcome.html), Version 1 (2021), https://doi.org/10.5281/zenodo.4411250. 

<img src="https://melaniewalsh.github.io/Intro-Cultural-Analytics/_static/favicon.ico" style="width:30%">

All sections below labeled with **MW** come from this book. Please consider supporting that project if you find it useful.




## TF-IDF with Scikit-Learn [MW]

Tf-idf is a method that tries to identify the most distinctively frequent or significant words in a document. 

In this lesson, we’re going to learn how to calculate tf-idf scores using a collection of plain text (.txt) files and the Python library scikit-learn, which has a quick and nifty module called TfidfVectorizer.

In this lesson, we will cover how to:

- Calculate and normalize tf-idf scores for U.S. Inaugural Addresses with scikit-learn


## Dataset: U.S. Inaugural Addresses [MW]

    This is the meaning of our liberty and our creed; why men and women and children of every race and every faith can join in celebration across this magnificent Mall, and why a man whose father less than 60 years ago might not have been served at a local restaurant can now stand before you to take a most sacred oath. So let us mark this day with remembrance of who we are and how far we have traveled.

    —Barack Obama, Inaugural Presidential Address, January 2009

During Barack Obama’s Inaugural Address in January 2009, he mentioned “women” four different times, including in the passage quoted above. How distinctive is Obama’s inclusion of women in this address compared to all other U.S. Presidents? This is one of the questions that we’re going to try to answer with tf-idf.


## Breaking Down the TF-IDF Formula [MW]

But first, let’s quickly discuss the tf-idf formula. The idea is pretty simple.

**tf-idf = term_frequency * inverse_document_frequency**

**term_frequency** = number of times a given term appears in document

**inverse_document_frequency** = log(total number of documents / number of documents with term) + 1\*

You take the number of times a term occurs in a document (term frequency). Then you take the number of documents in which the same term occurs at least once divided by the total number of documents (document frequency), flip that fraction on its head, and take its logarithm (inverse document frequency). Then you multiply the two numbers together (term_frequency * inverse_document_frequency).

The reason we take the inverse, or flipped fraction, of document frequency is to boost the rarer words that occur in relatively few documents. Think about the inverse document frequency for the word “said” vs. the word “pigeons” in the 14 short stories of Edward P. Jones's *Lost in the City*, an example corpus used in Walsh's book. The term “said” appears in 13 (document frequency) of the 14 stories (total documents), so 14 / 13 yields a smaller inverse document frequency. The term “pigeons” occurs in only 2 of the 14 stories, so 14 / 2 yields a bigger inverse document frequency and a bigger tf-idf boost.

\*There are several slightly different ways to calculate inverse document frequency. The version of idf that we’re going to use is the scikit-learn default, which uses “smoothing”: it adds 1 to both the numerator and the denominator, as if one extra document contained every term, which also prevents division by zero:

**inverse_document_frequency** = log((1 + total_number_of_documents) / (number_of_documents_with_term +1)) + 1
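To make the formula concrete, here is a minimal sketch that plugs the “said” and “pigeons” numbers from above into the smoothed idf formula. It uses the natural logarithm, as scikit-learn does. The function name `smoothed_idf` and the term frequency of 5 are assumptions made up for illustration.

```python
import math

def smoothed_idf(total_documents, documents_with_term):
    """Smoothed inverse document frequency, following the formula above."""
    return math.log((1 + total_documents) / (1 + documents_with_term)) + 1

# "said" appears in 13 of 14 stories; "pigeons" appears in 2 of 14
idf_said = smoothed_idf(14, 13)
idf_pigeons = smoothed_idf(14, 2)

# Hypothetical raw count: suppose each word occurs 5 times in one story
term_frequency = 5
tfidf_said = term_frequency * idf_said
tfidf_pigeons = term_frequency * idf_pigeons

print(f"idf(said)    = {idf_said:.3f}, tf-idf = {tfidf_said:.3f}")
print(f"idf(pigeons) = {idf_pigeons:.3f}, tf-idf = {tfidf_pigeons:.3f}")
```

With the same raw count, the rarer word ends up with the larger score, which is exactly the boost described above.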

## Part I: TF-IDF with scikit-learn [MW]

*Additional comments by Jeremy denoted by [jm]*

scikit-learn, imported as sklearn, is a popular Python library for machine learning approaches such as clustering, classification, and regression. Though we’re not doing any machine learning in this lesson, we’re nevertheless going to use scikit-learn’s TfidfVectorizer and CountVectorizer.

<p><del>1. Install scikit-learn</del></p>

In [1]:
# !pip install scikit-learn     # the PyPI package is named scikit-learn; it is imported as sklearn
# already installed on JHub!!

2. Import necessary modules and libraries

In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')                           #[jm] remember, this is the tokenizer that removes punctuation
import pandas as pd, numpy as np, altair as alt
# pd.set_option("max_rows",600)                            # [jm] raises an OptionError
pd.set_option("display.max_rows",600)                      # [jm] the full option name display.max_rows is now required
from pathlib import Path
import glob, collections
              

We’re also going to import pandas and change its default display setting so that up to 600 rows are shown. And we’re going to import two libraries that will help us work with files and the file system: [pathlib](https://docs.python.org/3/library/pathlib.html#basic-use) and [glob](https://docs.python.org/3/library/glob.html).

3. Below we’re setting the directory filepath that contains all the text files that we want to analyze.

In [3]:
directory_path = "US_Inaugural_Addresses"
#MW: Then we’re going to use glob and Path to make a list of all the filepaths in that directory and a list of all the inaugural address titles.
#text_files = glob.glob(f"../../RR/shared/{directory_path}/*.txt")
inaugdir = Path(Path.home(), "shared", "RR-workshop-data", directory_path)
text_files = sorted(glob.glob(f"{inaugdir}/*.txt"))     # glob does not guarantee any order, so sort to keep the addresses in numbered order
print(text_files)

['../../RR/shared/US_Inaugural_Addresses\\01_washington_1789.txt', '../../RR/shared/US_Inaugural_Addresses\\02_washington_1793.txt', '../../RR/shared/US_Inaugural_Addresses\\03_adams_john_1797.txt', '../../RR/shared/US_Inaugural_Addresses\\04_jefferson_1801.txt', '../../RR/shared/US_Inaugural_Addresses\\05_jefferson_1805.txt', '../../RR/shared/US_Inaugural_Addresses\\06_madison_1809.txt', '../../RR/shared/US_Inaugural_Addresses\\07_madison_1813.txt', '../../RR/shared/US_Inaugural_Addresses\\08_monroe_1817.txt', '../../RR/shared/US_Inaugural_Addresses\\09_monroe_1821.txt', '../../RR/shared/US_Inaugural_Addresses\\10_adams_john_quincy_1825.txt', '../../RR/shared/US_Inaugural_Addresses\\11_jackson_1829.txt', '../../RR/shared/US_Inaugural_Addresses\\12_jackson_1833.txt', '../../RR/shared/US_Inaugural_Addresses\\13_van_buren_1837.txt', '../../RR/shared/US_Inaugural_Addresses\\14_harrison_1841.txt', '../../RR/shared/US_Inaugural_Addresses\\15_polk_1845.txt', '../../RR/shared/US_Inaugural_Add

In [4]:
text_titles = [Path(text).stem for text in text_files]

In [5]:
text_titles

['01_washington_1789',
 '02_washington_1793',
 '03_adams_john_1797',
 '04_jefferson_1801',
 '05_jefferson_1805',
 '06_madison_1809',
 '07_madison_1813',
 '08_monroe_1817',
 '09_monroe_1821',
 '10_adams_john_quincy_1825',
 '11_jackson_1829',
 '12_jackson_1833',
 '13_van_buren_1837',
 '14_harrison_1841',
 '15_polk_1845',
 '16_taylor_1849',
 '17_pierce_1853',
 '18_buchanan_1857',
 '19_lincoln_1861',
 '20_lincoln_1865',
 '21_grant_1869',
 '22_grant_1873',
 '23_hayes_1877',
 '24_garfield_1881',
 '25_cleveland_1885',
 '26_harrison_1889',
 '27_cleveland_1893',
 '28_mckinley_1897',
 '29_mckinley_1901',
 '30_roosevelt_theodore_1905',
 '31_taft_1909',
 '32_wilson_1913',
 '33_wilson_1917',
 '34_harding_1921',
 '35_coolidge_1925',
 '36_hoover_1929',
 '37_roosevelt_franklin_1933',
 '38_roosevelt_franklin_1937',
 '39_roosevelt_franklin_1941',
 '40_roosevelt_franklin_1945',
 '41_truman_1949',
 '42_eisenhower_1953',
 '43_eisenhower_1957',
 '44_kennedy_1961',
 '45_johnson_1965',
 '46_nixon_1969',
 '47_

## Part II. Calculate tf-idf [MW]

To calculate tf–idf scores for every word, we’re going to use scikit-learn’s TfidfVectorizer.

4. When you initialize TfidfVectorizer, you can choose to set it with different parameters. These parameters will change the way you calculate tf–idf.

The recommended way to run TfidfVectorizer is with smoothing (smooth_idf = True) and normalization (norm='l2') turned on. These parameters will better account for differences in text length, and overall produce more meaningful tf–idf scores. Smoothing and L2 normalization are actually the default settings for TfidfVectorizer, so to turn them on, you don’t need to include any extra code at all.
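To see what smoothing and L2 normalization actually do, the sketch below rebuilds the default TfidfVectorizer output by hand on a tiny made-up corpus. The three sentences in `toy_docs` and all variable names are assumptions for illustration only; they are not part of the inaugural dataset.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical three-document corpus, passed as strings rather than filenames
toy_docs = [
    "the people and the government",
    "the people want peace",
    "war and peace",
]

# Step 1: raw term counts (one row per document, one column per word)
counts = CountVectorizer().fit_transform(toy_docs).toarray()

# Step 2: smoothed idf, log((1 + N) / (1 + df)) + 1
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Step 3: multiply, then divide each row by its Euclidean (L2) length
raw_tfidf = counts * idf
manual = raw_tfidf / np.linalg.norm(raw_tfidf, axis=1, keepdims=True)

# The defaults (smooth_idf=True, norm='l2') should give the same numbers
sklearn_result = TfidfVectorizer().fit_transform(toy_docs).toarray()
print(np.allclose(manual, sklearn_result))
```

Because every row is scaled to length 1, a long address and a short one end up on a comparable scale, which is why the scores in this lesson are small decimals rather than raw counts.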

Initialize TfidfVectorizer with desired parameters (default smoothing and normalization)

In [6]:
tfidf_vectorizer = TfidfVectorizer(input='filename', stop_words='english')

5. Run TfidfVectorizer on our text_files

In [7]:
tfidf_vector = tfidf_vectorizer.fit_transform(text_files)


6. Make a DataFrame out of the resulting tf–idf vector, setting the “feature names” or words as columns and the titles as rows

In [8]:
# Note from Simon: TfidfVectorizer returns a sparse matrix, which is why we have to call .toarray() before proceeding.
tfidf_df = pd.DataFrame(tfidf_vector.toarray(), index=text_titles, columns=tfidf_vectorizer.get_feature_names_out())
# Warning: get_feature_names() was deprecated in scikit-learn 1.0 and removed in 1.2; use get_feature_names_out() instead.
# The code above already uses get_feature_names_out().


## Part III. Practice with Dataframes [jm]

### Summary Info

7. Often, when working with a dataframe in Python, it is helpful to get a quick overview of the type, size, and distribution of the data it contains.

Run the following commands and add a comment next to each explaining what it does.



In [9]:
tfidf_df.head(3)   ## [explain what this does here]

Unnamed: 0,000,03,04,05,100,120,125,13,14th,151,...,young,younger,youngest,youth,youthful,zachary,zeal,zealous,zealously,zone
01_washington_1789,0.0,0.0,0.023259,0.0,0.0,0.0,0.0,0.0,0.057234,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
02_washington_1793,0.0,0.091299,0.093202,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
03_adams_john_1797,0.0,0.016127,0.016463,0.0,0.0,0.0,0.0,0.0,0.0,0.040513,...,0.0,0.0,0.0,0.0,0.0,0.0,0.026615,0.0,0.0,0.0


In [10]:
tfidf_df.shape

(58, 8999)

In [11]:
tfidf_df.describe()

Unnamed: 0,000,03,04,05,100,120,125,13,14th,151,...,young,younger,youngest,youth,youthful,zachary,zeal,zealous,zealously,zone
count,58.0,58.0,58.0,58.0,58.0,58.0,58.0,58.0,58.0,58.0,...,58.0,58.0,58.0,58.0,58.0,58.0,58.0,58.0,58.0,58.0
mean,0.004981,0.009356,0.009214,0.00165,0.000324,0.00043,0.000711,0.000589,0.000987,0.000699,...,0.008279,0.00038,0.000668,0.002467,0.00055,0.001171,0.005519,0.002439,0.00284,0.00055
std,0.017523,0.014687,0.014943,0.009177,0.00247,0.003278,0.005418,0.004486,0.007515,0.00532,...,0.015424,0.002892,0.005088,0.010125,0.004186,0.008917,0.015763,0.008691,0.009761,0.004186
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,0.0,0.014688,0.014162,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.019517,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,0.093551,0.091299,0.093202,0.061633,0.018814,0.024966,0.041266,0.034165,0.057234,0.040513,...,0.07552,0.022021,0.038746,0.062869,0.031879,0.067914,0.084613,0.050156,0.048508,0.031879


In [12]:
tfidf_df.dtypes

000          float64
03           float64
04           float64
05           float64
100          float64
              ...   
zachary      float64
zeal         float64
zealous      float64
zealously    float64
zone         float64
Length: 8999, dtype: object

### Subsetting

8. We can also create smaller subsets of a dataframe in a variety of ways. Run the following code and explain what each does in a comment next to it.

In [13]:
sub = tfidf_df[['women']]       # [explain what this does]
sub.tail(10)



Unnamed: 0,women
49_reagan_1981,0.040455
50_reagan_1985,0.035922
51_bush_george_h_w_1989,0.05817
52_clinton_1993,0.024003
53_clinton_1997,0.019067
54_bush_george_w_2001,0.0
55_bush_george_w_2005,0.040739
56_obama_2009,0.084859
57_obama_2013,0.044705
58_trump_2017,0.050075


In [14]:
#sub = tfidf_df[['freedom','liberty','democracy']]
sub = tfidf_df.loc[:, ['freedom','liberty','democracy']]
sub.head()

Unnamed: 0,freedom,liberty,democracy
01_washington_1789,0.0,0.017184,0.0
02_washington_1793,0.0,0.0,0.0
03_adams_john_1797,0.0,0.03649,0.0
04_jefferson_1801,0.069921,0.047067,0.0
05_jefferson_1805,0.028723,0.038669,0.0


In [15]:
sub = tfidf_df.loc["02_washington_1793":"05_jefferson_1805",["war", "economy"]]
print(sub)

                         war   economy
02_washington_1793  0.000000  0.000000
03_adams_john_1797  0.011540  0.000000
04_jefferson_1801   0.014885  0.019211
05_jefferson_1805   0.036688  0.000000


In [16]:
sub = tfidf_df.iloc[55:58,:]
sub.head()

Unnamed: 0,000,03,04,05,100,120,125,13,14th,151,...,young,younger,youngest,youth,youthful,zachary,zeal,zealous,zealously,zone
56_obama_2009,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.020657,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
57_obama_2013,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.021764,0.0,0.0,0.033632,0.0,0.0,0.0,0.0,0.0,0.0
58_trump_2017,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.024379,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [17]:
sub = tfidf_df.iloc[:,1000:1005]
sub.head()

Unnamed: 0,blend,blended,bless,blessed,blessing
01_washington_1789,0.0,0.0,0.0,0.0,0.033845
02_washington_1793,0.0,0.0,0.0,0.0,0.0
03_adams_john_1797,0.0,0.0,0.0,0.0,0.023957
04_jefferson_1801,0.0,0.0,0.0,0.0,0.0
05_jefferson_1805,0.0,0.0,0.0,0.0,0.025387


In [18]:
sub = tfidf_df.iloc[[10,20,30,40,50],[-1,-2,-3,-4]]
sub.head()

Unnamed: 0,zone,zealously,zealous,zeal
11_jackson_1829,0.0,0.0,0.050156,0.043967
21_grant_1869,0.0,0.0,0.0,0.0
31_taft_1909,0.0,0.0,0.0,0.0
41_truman_1949,0.0,0.0,0.0,0.0
51_bush_george_h_w_1989,0.0,0.0,0.0,0.0
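One detail that is easy to miss in the cells above: `.loc` selects by label and includes the end of a slice, while `.iloc` selects by integer position and excludes it. The small dataframe below is invented purely to show that difference.

```python
import pandas as pd

# Hypothetical tf-idf-like scores for four documents and three terms
toy = pd.DataFrame(
    {
        "war": [0.0, 0.2, 0.1, 0.0],
        "peace": [0.3, 0.0, 0.1, 0.2],
        "union": [0.1, 0.1, 0.0, 0.4],
    },
    index=["01_a", "02_b", "03_c", "04_d"],
)

# Label-based: the slice "02_b":"03_c" includes BOTH endpoints
by_label = toy.loc["02_b":"03_c", ["war", "peace"]]

# Position-based: 1:3 means positions 1 and 2 (3 is excluded)
by_position = toy.iloc[1:3, 0:2]

# Lists of positions pick rows and columns individually; negative numbers count from the end
picked = toy.iloc[[0, 2], [-1, -2]]

print(by_label.equals(by_position))
print(picked)
```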


### Filtering

9. We can also filter dataframes by the values found in columns or rows. Review what the following code does.

In [19]:
filterdf = tfidf_df[tfidf_df['democracy']>0.1]      # keeps only those rows with a tf-idf score greater than 0.1 in the democracy column
filterdf.head()                                     

Unnamed: 0,000,03,04,05,100,120,125,13,14th,151,...,young,younger,youngest,youth,youthful,zachary,zeal,zealous,zealously,zone
38_roosevelt_franklin_1937,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
39_roosevelt_franklin_1941,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
41_truman_1949,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [20]:
filterdf = tfidf_df.loc[tfidf_df['democracy']>0.1, 'democracy']
filterdf.head()

38_roosevelt_franklin_1937    0.178041
39_roosevelt_franklin_1941    0.244486
41_truman_1949                0.154140
Name: democracy, dtype: float64

In [21]:
filterdf = tfidf_df.loc[(tfidf_df['democracy']>0.05) & (tfidf_df['freedom']>0.03), ['democracy','freedom']]
filterdf.head()

Unnamed: 0,democracy,freedom
34_harding_1921,0.056336,0.04722
39_roosevelt_franklin_1941,0.244486,0.109295
40_roosevelt_franklin_1945,0.094707,0.031753
41_truman_1949,0.15414,0.149297
51_bush_george_h_w_1989,0.091996,0.074027


In [22]:
tfidf_df['god'].idxmax()

'20_lincoln_1865'
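Each filter above works in two stages: a comparison produces a True/False value for every row, and that mask is then used to keep only the True rows. Here is a minimal sketch on an invented dataframe that makes the intermediate mask visible.

```python
import pandas as pd

# Hypothetical scores for four documents
toy = pd.DataFrame(
    {
        "democracy": [0.00, 0.12, 0.06, 0.20],
        "freedom": [0.05, 0.01, 0.04, 0.10],
    },
    index=["01_a", "02_b", "03_c", "04_d"],
)

# Each comparison yields a boolean Series; & combines them row by row.
# The parentheses are required because & binds more tightly than >.
mask = (toy["democracy"] > 0.05) & (toy["freedom"] > 0.03)

# Keep only the rows where the mask is True
both_high = toy.loc[mask]

# idxmax returns the row label holding the largest value in a column
top_document = toy["democracy"].idxmax()

print(mask)
print(both_high)
print(top_document)
```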

## Part IV. Subsetting, Filtering, and Sorting the TFIDF Dataframe

10. Add a row for document frequency, i.e. the number of documents in which each word appears at least once

In [23]:
tfidf_df.loc['00_DocumentFrequency'] = (tfidf_df > 0).sum()

In [24]:
tfidf_slice = tfidf_df[['government', 'borders', 'people', 'obama', 'war', 'honor','foreign', 'men', 'women', 'children']]
tfidf_slice.sort_index().round(decimals=2)

Unnamed: 0,government,borders,people,obama,war,honor,foreign,men,women,children
00_DocumentFrequency,53.0,5.0,56.0,3.0,45.0,32.0,32.0,47.0,15.0,22.0
01_washington_1789,0.11,0.0,0.05,0.0,0.0,0.0,0.0,0.02,0.0,0.0
02_washington_1793,0.06,0.0,0.05,0.0,0.0,0.08,0.0,0.0,0.0,0.0
03_adams_john_1797,0.16,0.0,0.19,0.0,0.01,0.1,0.12,0.04,0.0,0.0
04_jefferson_1801,0.16,0.0,0.01,0.0,0.01,0.04,0.0,0.04,0.0,0.0
05_jefferson_1805,0.03,0.0,0.0,0.0,0.04,0.0,0.06,0.01,0.0,0.02
06_madison_1809,0.0,0.0,0.02,0.0,0.02,0.05,0.05,0.0,0.0,0.0
07_madison_1813,0.04,0.0,0.04,0.0,0.25,0.02,0.02,0.0,0.0,0.0
08_monroe_1817,0.17,0.0,0.11,0.0,0.09,0.01,0.1,0.04,0.0,0.0
09_monroe_1821,0.08,0.0,0.06,0.0,0.11,0.02,0.04,0.01,0.0,0.01


11. Let’s drop the “00_DocumentFrequency” row since we were just using it for illustration purposes.

In [25]:
tfidf_df = tfidf_df.drop('00_DocumentFrequency', errors='ignore')
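An alternative that avoids adding and then dropping a row is to keep the document frequencies in a separate Series. This is only a sketch of that design choice on a made-up dataframe, not the approach used in the cells above.

```python
import pandas as pd

# Hypothetical tf-idf scores: three documents, three terms
toy = pd.DataFrame(
    {
        "people": [0.3, 0.2, 0.0],
        "war": [0.0, 0.0, 0.5],
        "union": [0.1, 0.0, 0.2],
    },
    index=["01_a", "02_b", "03_c"],
)

# (toy > 0) is True wherever a term occurs in a document;
# summing down each column counts the documents that contain the term
doc_freq = (toy > 0).sum()

print(doc_freq)
```

Keeping the counts apart means the scores dataframe never mixes whole-number counts with tf-idf values, so later sorting and filtering cannot pick up the extra row by accident.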

12. Let’s reorganize the DataFrame so that the words are in rows rather than columns.

In [26]:
tfidf_df.stack().reset_index()

Unnamed: 0,level_0,level_1,0
0,01_washington_1789,000,0.000000
1,01_washington_1789,03,0.000000
2,01_washington_1789,04,0.023259
3,01_washington_1789,05,0.000000
4,01_washington_1789,100,0.000000
...,...,...,...
521937,58_trump_2017,zachary,0.000000
521938,58_trump_2017,zeal,0.000000
521939,58_trump_2017,zealous,0.000000
521940,58_trump_2017,zealously,0.000000


In [27]:
tfidf_df = tfidf_df.stack().reset_index()

In [28]:
tfidf_df = tfidf_df.rename(columns={0:'tfidf', 'level_0': 'document','level_1': 'term'})

13. To find out the top 10 words with the highest tf–idf for every address, we’re going to sort by document and tfidf score and then groupby document and take the first 10 values.

In [29]:
tfidf_df.sort_values(by=['document','tfidf'], ascending=[True,False]).groupby(['document']).head(10)

Unnamed: 0,document,term,tfidf
3707,01_washington_1789,government,0.113681
4108,01_washington_1789,immutable,0.103883
4175,01_washington_1789,impressions,0.103883
6337,01_washington_1789,providential,0.103883
5631,01_washington_1789,ought,0.103728
6351,01_washington_1789,public,0.103102
6117,01_washington_1789,present,0.097516
6389,01_washington_1789,qualifications,0.096372
5811,01_washington_1789,peculiarly,0.090546
653,01_washington_1789,article,0.085786


In [30]:
top_tfidf = tfidf_df.sort_values(by=['document','tfidf'], ascending=[True,False]).groupby(['document']).head(10)
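The reshape-and-rank pattern from steps 12 and 13 is easier to follow on a dataframe small enough to read in full. The sketch below uses invented scores for two documents and keeps the top 2 terms instead of the top 10.

```python
import pandas as pd

# Hypothetical wide table: one row per document, one column per term
wide = pd.DataFrame(
    {
        "liberty": [0.5, 0.1],
        "union": [0.2, 0.7],
        "war": [0.0, 0.3],
    },
    index=["01_a", "02_b"],
)

# stack() turns the columns into a second index level;
# reset_index() turns both levels back into ordinary columns
long = wide.stack().reset_index()
long.columns = ["document", "term", "tfidf"]

# Sort documents A-Z and scores high-to-low, then keep the first 2 rows per document
top2 = (
    long.sort_values(by=["document", "tfidf"], ascending=[True, False])
    .groupby("document")
    .head(2)
)

print(top2)
```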

14. We can zoom in on particular words and particular documents.

In [31]:
top_tfidf[top_tfidf['term'].str.contains('women')]

Unnamed: 0,document,term,tfidf
503861,56_obama_2009,women,0.084859


In [32]:
top_tfidf['term'].str.contains('e')

3707       True
4108       True
4175       True
6337       True
5631      False
6351      False
6117       True
6389      False
5811       True
653        True
9018      False
9643       True
17576     False
13250     False
17700      True
17872     False
13368     False
13705     False
15157      True
17910      True
23821      True
21705      True
23959      True
21462      True
23356     False
26711     False
22016      True
22799      True
19759     False
21983     False
30704      True
33171      True
31814      True
34063      True
32010     False
35095     False
30979      True
30327      True
33893      True
32574     False
42347     False
39257      True
43618      True
44836      True
40872     False
37368      True
42486      True
37481     False
42131      True
39138      True
49179      True
45923      True
51346     False
50353     False
51728      True
45782      True
45793     False
48041      True
46898     False
50558      True
62759     False
55101     False
59062   
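Two things are worth noticing about `str.contains`. On its own it returns a True/False mask like the one above, which only filters rows once it is placed inside square brackets. It also matches substrings (and by default treats its argument as a regular expression), so a short search string can match more terms than intended. The terms and scores below are made up to show this.

```python
import pandas as pd

# Hypothetical long-format table of terms and scores
terms = pd.DataFrame(
    {
        "term": ["women", "men", "womens", "government"],
        "tfidf": [0.08, 0.04, 0.02, 0.11],
    }
)

# Substring match: every term that has "men" anywhere inside it
substring_matches = terms[terms["term"].str.contains("men")]

# Exact match: only the term that is exactly "men"
exact_matches = terms[terms["term"] == "men"]

print(substring_matches["term"].tolist())
print(exact_matches["term"].tolist())
```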

15. It turns out that the term “women” is very distinctive in Obama’s Inaugural Address.

In [33]:
top_tfidf[top_tfidf['document'].str.contains('obama')]

Unnamed: 0,document,term,tfidf
495406,56_obama_2009,america,0.148351
500298,56_obama_2009,nation,0.120229
500358,56_obama_2009,new,0.118002
503093,56_obama_2009,today,0.114792
498590,56_obama_2009,generation,0.100654
499762,56_obama_2009,let,0.0911
499578,56_obama_2009,jobs,0.090727
496911,56_obama_2009,crisis,0.087235
498779,56_obama_2009,hard,0.084859
503861,56_obama_2009,women,0.084859


In [34]:
top_tfidf[top_tfidf['document'].str.contains('trump')]

Unnamed: 0,document,term,tfidf
513404,58_trump_2017,america,0.350162
515585,58_trump_2017,dreams,0.156436
513405,58_trump_2017,american,0.149226
517576,58_trump_2017,jobs,0.142766
519262,58_trump_2017,protected,0.132439
518409,58_trump_2017,obama,0.120288
518766,58_trump_2017,people,0.11237
521001,58_trump_2017,thank,0.109171
513989,58_trump_2017,borders,0.107075
521596,58_trump_2017,ve,0.107075


In [35]:
top_tfidf[top_tfidf['document'].str.contains('kennedy')]

Unnamed: 0,document,term,tfidf
391774,44_kennedy_1961,let,0.267869
394306,44_kennedy_1961,sides,0.262849
392921,44_kennedy_1961,pledge,0.16096
387632,44_kennedy_1961,ask,0.107713
387864,44_kennedy_1961,begin,0.106495
388991,44_kennedy_1961,dare,0.106495
395895,44_kennedy_1961,world,0.10311
390313,44_kennedy_1961,final,0.102311
392370,44_kennedy_1961,new,0.0966
390120,44_kennedy_1961,explore,0.094223


## Visualize TF-IDF [MW]

16. We can also visualize our TF-IDF results with the data visualization library Altair.

In [36]:
#!pip install altair

17. Let’s make a heatmap that shows the highest TF-IDF scoring words for each inaugural address, and let’s put a red dot next to two terms of interest: “war” and “peace”:

The code below was contributed by [Eric Monson](https://github.com/emonson). Thanks, Eric!

In [37]:
import altair as alt
import numpy as np

# Terms in this list will get a red dot in the visualization
term_list = ['war', 'peace']

# adding a little randomness to break ties in term ranking
top_tfidf_plusRand = top_tfidf.copy()
top_tfidf_plusRand['tfidf'] = top_tfidf_plusRand['tfidf'] + np.random.rand(top_tfidf.shape[0])*0.0001

# base for all visualizations, with rank calculation
base = alt.Chart(top_tfidf_plusRand).encode(
    x = 'rank:O',
    y = 'document:N'
).transform_window(
    rank = "rank()",
    sort = [alt.SortField("tfidf", order="descending")],
    groupby = ["document"],
)

# heatmap specification
heatmap = base.mark_rect().encode(
    color = 'tfidf:Q'
)

# red circle over terms in above list
circle = base.mark_circle(size=100).encode(
    color = alt.condition(
        alt.FieldOneOfPredicate(field='term', oneOf=term_list),
        alt.value('red'),
        alt.value('#FFFFFF00')        
    )
)

# text labels, white for darker heatmap colors
text = base.mark_text(baseline='middle').encode(
    text = 'term:N',
    color = alt.condition(alt.datum.tfidf >= 0.23, alt.value('white'), alt.value('black'))
)

# display the three superimposed visualizations
(heatmap + circle + text).properties(width = 600)
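The random jitter in the cell above exists because tied scores would otherwise share a rank and overlap in the chart. Since the jitter is unseeded, tied terms can swap places each time the cell runs. The sketch below shows two ways to make the tie-breaking repeatable in pandas: a seeded random generator, or `rank(method="first")`, which breaks ties by row order. The three-row table is invented, and precomputing ranks in pandas is an assumed alternative rather than what the Altair code does.

```python
import numpy as np
import pandas as pd

# Hypothetical scores with a tie between "hard" and "women"
scores = pd.DataFrame(
    {
        "document": ["a", "a", "a"],
        "term": ["hard", "women", "jobs"],
        "tfidf": [0.084859, 0.084859, 0.090727],
    }
)

# Option 1: same jitter idea, but seeded so that every run gives the same result
rng = np.random.default_rng(seed=0)
jittered = scores["tfidf"] + rng.random(len(scores)) * 0.0001
rank_jitter = jittered.groupby(scores["document"]).rank(ascending=False)

# Option 2: no randomness; ties are ranked in the order the rows appear
rank_first = scores.groupby("document")["tfidf"].rank(method="first", ascending=False)

print(rank_jitter.tolist())
print(rank_first.tolist())
```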

<div class="alert alert-info" role="alert"><h3 style='color:blue'> Review Questions </h3>

<p style='color:blue'>Your Turn!</p>

<p style='color:blue'>Take a few minutes to explore the dataframe below and then answer the following questions.</p>

<p style='color:blue'>1. What is the difference between a tf-idf score and raw word frequency?</p>


</div>


<div class="alert alert-info" role="alert">
    <p style='color:blue'>2. Based on the dataframe above, what is one potential problem or limitation that you notice with tf-idf scores?</p>
</div>


<div class="alert alert-info" role="alert">
    <p style='color:blue'>3. What’s another collection of texts that you think might be interesting to analyze with tf-idf scores? Why?</p>
</div>

<div class="alert alert-info" role="alert"><h3 style='color:blue'> EXERCISES </h3>

<p style='color:blue'>Write new code below that creates a TFIDF dataframe and color-coded graphic (just like the one above) but this time for either:</p>

<ol style='color:blue'>
    <li>our 233 SOTU addresses</li>
    <li>or a collection of texts of your own choosing (which should be saved as plain text [.txt] files)</li>
</ol>

</div>

In [38]:
sotudir = Path(Path.cwd().parent, "strings-and-files", "state-of-the-union-dataset", "txt")  # creates a filepath to our dataset or corpus of texts
sotu_files = sorted(sotudir.glob("*.txt")) 
sotu_names = [file.stem for file in sotu_files]
print(sotu_names)


['1790_Washington', '1791_Washington', '1792_Washington', '1793_Washington', '1794_Washington', '1795_Washington', '1796_Washington', '1797_Adams', '1798_Adams', '1799_Adams', '1800_Adams', '1801_Jefferson', '1802_Jefferson', '1803_Jefferson', '1804_Jefferson', '1805_Jefferson', '1806_Jefferson', '1807_Jefferson', '1808_Jefferson', '1809_Madison', '1810_Madison', '1811_Madison', '1812_Madison', '1813_Madison', '1814_Madison', '1815_Madison', '1816_Madison', '1817_Monroe', '1818_Monroe', '1819_Monroe', '1820_Monroe', '1821_Monroe', '1822_Monroe', '1823_Monroe', '1824_Monroe', '1825_Adams', '1826_Adams', '1827_Adams', '1828_Adams', '1829_Jackson', '1830_Jackson', '1831_Jackson', '1832_Jackson', '1833_Jackson', '1834_Jackson', '1835_Jackson', '1836_Jackson', '1837_Buren', '1838_Buren', '1839_Buren', '1840_Buren', '1841_Tyler', '1842_Tyler', '1843_Tyler', '1844_Tyler', '1845_Polk', '1846_Polk', '1847_Polk', '1848_Polk', '1849_Taylor', '1850_Fillmore', '1851_Fillmore', '1852_Fillmore', '185

In [39]:
tfidf_vectorizer = TfidfVectorizer(input='filename', stop_words='english')
sotu_tfidf_vector = tfidf_vectorizer.fit_transform(sotu_files)
sotu_tfidf_df = pd.DataFrame(sotu_tfidf_vector.toarray(), index=sotu_names, columns=tfidf_vectorizer.get_feature_names_out())
sotu_tfidf_df = sotu_tfidf_df.stack().reset_index()     # reshape from wide to long: one row per (document, term) pair
sotu_tfidf_df.head(10)

Unnamed: 0,level_0,level_1,0
0,1790_Washington,0,0.0
1,1790_Washington,0,0.0
2,1790_Washington,0,0.0
3,1790_Washington,1,0.0
4,1790_Washington,1,0.0
5,1790_Washington,2,0.0
6,1790_Washington,3,0.0
7,1790_Washington,4,0.0
8,1790_Washington,5,0.0
9,1790_Washington,6,0.0


In [40]:
sotu_tfidf_df = sotu_tfidf_df.rename(columns={0:'tfidf', 'level_0': 'document','level_1': 'term'})
sotu_tfidf_df.head(10)

Unnamed: 0,document,term,tfidf
0,1790_Washington,0,0.0
1,1790_Washington,0,0.0
2,1790_Washington,0,0.0
3,1790_Washington,1,0.0
4,1790_Washington,1,0.0
5,1790_Washington,2,0.0
6,1790_Washington,3,0.0
7,1790_Washington,4,0.0
8,1790_Washington,5,0.0
9,1790_Washington,6,0.0


In [41]:
sotu_tfidf_df[['year','pres']] = sotu_tfidf_df['document'].str.split("_",expand=True)

In [42]:
sotu_tfidf_df.head(10)

Unnamed: 0,document,term,tfidf,year,pres
0,1790_Washington,0,0.0,1790,Washington
1,1790_Washington,0,0.0,1790,Washington
2,1790_Washington,0,0.0,1790,Washington
3,1790_Washington,1,0.0,1790,Washington
4,1790_Washington,1,0.0,1790,Washington
5,1790_Washington,2,0.0,1790,Washington
6,1790_Washington,3,0.0,1790,Washington
7,1790_Washington,4,0.0,1790,Washington
8,1790_Washington,5,0.0,1790,Washington
9,1790_Washington,6,0.0,1790,Washington
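Splitting on every underscore works here because each SOTU name contains exactly one. For names with several underscores, such as the inaugural titles, the split would produce more than two pieces. A hedged sketch of a safer variant, using invented sample rows, limits the split with `n=1` and also converts the year to a number.

```python
import pandas as pd

# Hypothetical sample of document names in the SOTU style
docs = pd.DataFrame({"document": ["1790_Washington", "1837_Buren"]})

# n=1 splits only at the first underscore, so there are always exactly two pieces
docs[["year", "pres"]] = docs["document"].str.split("_", n=1, expand=True)

# The pieces are strings; convert the year so it sorts and plots as a number
docs["year"] = docs["year"].astype(int)

# Without a limit, a name with several underscores yields more than two columns
unlimited = pd.Series(["10_adams_john_quincy_1825"]).str.split("_", expand=True)

print(docs)
print(unlimited.shape)
```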


In [43]:
top_sotu_tfidf = sotu_tfidf_df.sort_values(by=['year','tfidf'], ascending=[True,False]).groupby(['document']).head(10)
top_sotu_tfidf.head(10)

Unnamed: 0,document,term,tfidf,year,pres
0,1790_Washington,0,0.0,1790,Washington
1,1790_Washington,0,0.0,1790,Washington
2,1790_Washington,0,0.0,1790,Washington
3,1790_Washington,1,0.0,1790,Washington
4,1790_Washington,1,0.0,1790,Washington
5,1790_Washington,2,0.0,1790,Washington
6,1790_Washington,3,0.0,1790,Washington
7,1790_Washington,4,0.0,1790,Washington
8,1790_Washington,5,0.0,1790,Washington
9,1790_Washington,6,0.0,1790,Washington


In [44]:
# Terms in this list will get a red dot in the visualization
term_list = ['war', 'peace']

# adding a little randomness to break ties in term ranking
top_sotu_tfidf_plusRand = top_sotu_tfidf.copy()
top_sotu_tfidf_plusRand['tfidf'] = top_sotu_tfidf_plusRand['tfidf'] + np.random.rand(top_sotu_tfidf.shape[0])*0.0001

# base for all visualizations, with rank calculation
base = alt.Chart(top_sotu_tfidf_plusRand).encode(
    x = 'rank:O',
    #y = 'document:N'
    y = 'year'
).transform_window(
    rank = "rank()",
    sort = [alt.SortField("tfidf", order="descending")],
    groupby = ["document"],
)

# heatmap specification
heatmap = base.mark_rect().encode(
    color = 'tfidf:Q'
)

# red circle over terms in above list
circle = base.mark_circle(size=100).encode(
    color = alt.condition(
        alt.FieldOneOfPredicate(field='term', oneOf=term_list),
        alt.value('red'),
        alt.value('#FFFFFF00')        
    )
)

# text labels, white for darker heatmap colors
text = base.mark_text(baseline='middle').encode(
    text = 'term:N',
    color = alt.condition(alt.datum.tfidf >= 0.23, alt.value('white'), alt.value('black'))
)

# display the three superimposed visualizations
(heatmap + circle + text).properties(width = 600)

This is still a little bit of a mess, and there are things that could be cleaned up. For example, the "ve" terms are fragments of contractions such as "we've", which the tokenizer splits at the apostrophe; they should be removed along with other stopwords.

Also, notice that for the last few presidents a variety of personal names appear with high tfidf scores. These are the result of the relatively recent practice of presidents inviting civilian guests to the address and mentioning them by name to put a human face on their administration's policies or political arguments.