# Text Analysis in Python 3: Comparing Texts

# Term Frequency - Inverse Document Frequency (TFIDF)

<img src = "https://miro.medium.com/max/720/1*qQgnyPLDIkUmeZKN2_ZWbQ.webp" style="width:60%">

Image from Yassine Hamdaoui, ["TF(Term Frequency)-IDF(Inverse Document Frequency) from scratch in python"](https://towardsdatascience.com/tf-term-frequency-idf-inverse-document-frequency-from-scratch-in-python-6c2b61b78558) *Towards Data Science (Medium)* (Dec. 9, 2019).

For this lesson, we will be drawing on the [TFIDF section](https://melaniewalsh.github.io/Intro-Cultural-Analytics/05-Text-Analysis/01-TF-IDF.html) in the online book: Melanie Walsh, [*Introduction to Cultural Analytics and Python*](https://melaniewalsh.github.io/Intro-Cultural-Analytics/welcome.html), Version 1 (2021), https://doi.org/10.5281/zenodo.4411250. 

<img src="https://melaniewalsh.github.io/Intro-Cultural-Analytics/_static/favicon.ico" style="width:30%">

All sections below labeled with an **MW** come from this book. Please consider supporting that project if you find it useful.




## TF-IDF with Scikit-Learn [MW]

Tf-idf is a method that tries to identify the most distinctively frequent or significant words in a document. 

In this lesson, we’re going to learn how to calculate tf-idf scores using a collection of plain text (.txt) files and the Python library scikit-learn, which has a quick and nifty module called TfidfVectorizer.

In this lesson, we will cover how to:

    Calculate and normalize tf-idf scores for U.S. Inaugural Addresses with scikit-learn


## Dataset: U.S. Inaugural Addresses [MW]

    This is the meaning of our liberty and our creed; why men and women and children of every race and every faith can join in celebration across this magnificent Mall, and why a man whose father less than 60 years ago might not have been served at a local restaurant can now stand before you to take a most sacred oath. So let us mark this day with remembrance of who we are and how far we have traveled.

    —Barack Obama, Inaugural Presidential Address, January 2009

During Barack Obama’s Inaugural Address in January 2009, he mentioned “women” four different times, including in the passage quoted above. How distinctive is Obama’s inclusion of women in this address compared to all other U.S. Presidents? This is one of the questions that we’re going to try to answer with tf-idf.


## Breaking Down the TF-IDF Formula [MW]

But first, let’s quickly discuss the tf-idf formula. The idea is pretty simple.

**tf-idf = term_frequency * inverse_document_frequency**

**term_frequency** = number of times a given term appears in document

**inverse_document_frequency** = log(total number of documents / number of documents with term) + 1 \*

You take the number of times a term occurs in a document (term frequency). Then you take the number of documents in which the same term occurs at least once, divided by the total number of documents (document frequency), and you flip that fraction on its head (inverse document frequency); the formula above also takes the logarithm of the flipped fraction and adds 1. Then you multiply the two numbers together (term_frequency * inverse_document_frequency).

The reason we take the inverse, or flipped fraction, of document frequency is to boost the rarer words that occur in relatively few documents. Think about the inverse document frequency for the word “said” vs. the word “pigeons” in a different collection, the 14 short stories of *Lost in the City*. The term “said” appears in 13 (document frequency) of the 14 stories (total documents) (14 / 13 –> a smaller inverse document frequency), while the term “pigeons” occurs in only 2 (document frequency) of the 14 stories (total documents) (14 / 2 –> a bigger inverse document frequency, a bigger tf-idf boost).

\* There are a bunch of slightly different ways that you can calculate inverse document frequency. The version of idf that we’re going to use is the scikit-learn default, which uses “smoothing”, meaning that it adds 1 to both the numerator and the denominator:

**inverse_document_frequency** = log((1 + total_number_of_documents) / (number_of_documents_with_term +1)) + 1
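To make the arithmetic concrete, here is a minimal sketch of the two idf variants above, applied to the “said” and “pigeons” counts from the example. It assumes the natural logarithm, which is what scikit-learn uses. The function names `idf_plain` and `idf_smooth` are made up for this illustration and are not part of any library.

```python
import math

def idf_plain(n_docs, doc_freq):
    """idf without smoothing: log(N / df) + 1."""
    return math.log(n_docs / doc_freq) + 1

def idf_smooth(n_docs, doc_freq):
    """Smoothed idf (the scikit-learn default): log((1 + N) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

n_docs = 14  # total number of stories in the example
for term, doc_freq in [("said", 13), ("pigeons", 2)]:
    # The rarer term ("pigeons") should get the larger idf under both variants
    print(term, round(idf_plain(n_docs, doc_freq), 3), round(idf_smooth(n_docs, doc_freq), 3))
```

Multiplying either idf value by a term's count in one document gives that document's raw tf-idf score for the term.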

## Part I: TF-IDF with scikit-learn [MW]

*Additional comments by Jeremy denoted by [jm]*

scikit-learn, imported as sklearn, is a popular Python library for machine learning approaches such as clustering, classification, and regression. Though we’re not doing any machine learning in this lesson, we’re nevertheless going to use scikit-learn’s TfidfVectorizer and CountVectorizer.

<p><del>1. Install scikit-learn</del></p>

In [None]:
# !pip install scikit-learn    # the package is installed as "scikit-learn" but imported as "sklearn"
# already installed on JHub!!

2. Import necessary modules and libraries

In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')                           #[jm] remember, this is the tokenizer that removes punctuation
import pandas as pd, numpy as np, altair as alt
# pd.set_option("max_rows",600)                            # [jm] raises an OptionError
pd.set_option("display.max_rows",600)                      # [jm] use the full option name, display.max_rows, instead
from pathlib import Path
import glob, collections
              

We’re also going to import pandas and change its default display setting. And we’re going to import two libraries that will help us work with files and the file system: [pathlib](https://docs.python.org/3/library/pathlib.html#basic-use) and [glob](https://docs.python.org/3/library/glob.html).

3. Below we’re setting the directory filepath that contains all the text files that we want to analyze.

In [None]:
directory_path = "US_Inaugural_Addresses"
#MW: Then we’re going to use glob and Path to make a list of all the filepaths in that directory and a list of all the address titles.
#text_files = glob.glob(f"../../RR/shared/{directory_path}/*.txt")
inaugdir = Path(Path.home(), "shared", "RR-workshop-data", directory_path)
text_files = glob.glob(f"{inaugdir}/*.txt")
print(text_files)

In [None]:
text_titles = [Path(text).stem for text in text_files]

In [None]:
text_titles
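The file-listing step above can be tried without the workshop data. The sketch below builds a throwaway folder with two empty .txt files and one .csv file (all file names are invented for illustration) and then lists the .txt files the same way. It also wraps the result in `sorted()`, an optional addition, because `glob.glob()` does not guarantee any particular order.

```python
import glob
import tempfile
from pathlib import Path

# Create a temporary folder holding a few empty, made-up files
with tempfile.TemporaryDirectory() as tmp:
    for name in ["02_example_1793.txt", "01_example_1789.txt", "notes.csv"]:
        Path(tmp, name).touch()

    # Same pattern as above: every .txt file in the folder, sorted for a stable order
    demo_files = sorted(glob.glob(f"{tmp}/*.txt"))
    # .stem drops both the folder and the ".txt" extension
    demo_titles = [Path(f).stem for f in demo_files]

print(demo_titles)
```

Because the titles are derived from the file list, the two lists stay in the same order, which matters later when the titles become the row labels of the DataFrame.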

In [3]:
Path("~/shared/RR-workshops-data/US_Inaugural_Addresses").expanduser() 




## Part II. Calculate tf-idf [MW]

To calculate tf–idf scores for every word, we’re going to use scikit-learn’s TfidfVectorizer.

4. When you initialize TfidfVectorizer, you can choose to set it with different parameters. These parameters will change the way you calculate tf–idf.

The recommended way to run TfidfVectorizer is with smoothing (smooth_idf = True) and normalization (norm='l2') turned on. These parameters will better account for differences in text length, and overall produce more meaningful tf–idf scores. Smoothing and L2 normalization are actually the default settings for TfidfVectorizer, so to turn them on, you don’t need to include any extra code at all.
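To see what L2 normalization does, the sketch below takes a made-up vector of raw tf-idf scores for one document and rescales it so that its Euclidean length is 1. The numbers are invented. The point is only that long documents, which tend to have bigger raw counts, end up on the same scale as short ones.

```python
import math

# Hypothetical raw tf-idf scores (term frequency * idf) for one document
raw_scores = {"liberty": 3.0, "union": 4.0, "pigeons": 0.0}

# Euclidean (L2) length of the score vector: sqrt(3^2 + 4^2 + 0^2) = 5
length = math.sqrt(sum(score ** 2 for score in raw_scores.values()))

# Divide every score by that length
normalized = {term: score / length for term, score in raw_scores.items()}
print(normalized)
```

After this step the squared scores of a document sum to 1, so scores from different documents can be compared directly.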

Initialize TfidfVectorizer with desired parameters (default smoothing and normalization)

In [None]:
tfidf_vectorizer = TfidfVectorizer(input='filename', stop_words='english')

5. Run TfidfVectorizer on our text_files

In [None]:
tfidf_vector = tfidf_vectorizer.fit_transform(text_files)


6. Make a DataFrame out of the resulting tf–idf vector, setting the “feature names” or words as columns and the titles as rows

In [None]:
# Note from Simon: TfidfVectorizer returns a sparse matrix, which is why we call .toarray() before building the DataFrame.
tfidf_df = pd.DataFrame(tfidf_vector.toarray(), index=text_titles, columns=tfidf_vectorizer.get_feature_names_out())
# [jm] get_feature_names() was deprecated and then removed from scikit-learn; the code above uses get_feature_names_out() instead


## Part III. Practice with Dataframes [jm]

### Summary Info

7. Often, when working with dataframes in Python, it is helpful to get a quick overview of the type, size, and distribution of the data they contain.

Run the following commands and add a comment next to each explaining what it does.



In [None]:
tfidf_df.head(3)   ## [explain what this does here]

In [None]:
tfidf_df.shape

In [None]:
tfidf_df.describe()

In [None]:
tfidf_df.dtypes

### Subsetting

8. We can also create smaller subsets of a dataframe in a variety of ways. Run the following code and explain what each does in a comment next to it.

In [None]:
sub = tfidf_df[['women']]       # [explain what this does]
sub.tail(10)



In [None]:
#sub = tfidf_df[['freedom','liberty','democracy']]
sub = tfidf_df.loc[:, ['freedom','liberty','democracy']]
sub.head()

In [None]:
sub = tfidf_df.loc["02_washington_1793":"05_jefferson_1805",["war", "economy"]]
print(sub)

In [None]:
sub = tfidf_df.iloc[55:58,:]
sub.head()

In [None]:
sub = tfidf_df.iloc[:,1000:1005]
sub.head()

In [None]:
sub = tfidf_df.iloc[[10,20,30,40,50],[-1,-2,-3,-4]]
sub.head()
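The difference between `.loc` (labels) and `.iloc` (positions) is easier to see on a small table. The sketch below uses an invented three-row DataFrame; its row labels and scores are made up and do not come from the Inaugural Address data.

```python
import pandas as pd

# A small made-up table: rows are documents, columns are terms
toy = pd.DataFrame(
    {"freedom": [0.10, 0.00, 0.25], "war": [0.00, 0.30, 0.05], "women": [0.00, 0.00, 0.12]},
    index=["01_first", "02_second", "03_third"],
)

by_label = toy.loc["01_first":"02_second", ["war", "women"]]  # label slice: end label is included
by_position = toy.iloc[0:2, [1, 2]]                          # position slice: end position is excluded

# Both selections pick the same two rows and the same two columns
print(by_label.equals(by_position))

# Negative positions count from the end, so -1 is the last column
print(toy.iloc[:, [-1]])
```

One thing to watch: a label slice such as `"01_first":"02_second"` depends on the order of the rows, so it is only predictable when the index is in a known order.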

### Filtering

9. We can also filter dataframes by the values found in columns or rows. Review what the following code does.

In [None]:
filterdf = tfidf_df[tfidf_df['democracy']>0.1]      # keeps only those rows with a tf-idf score greater than 0.1 in the democracy column
filterdf.head()                                     

In [None]:
filterdf = tfidf_df.loc[tfidf_df['democracy']>0.1, 'democracy']
filterdf.head()

In [None]:
filterdf = tfidf_df.loc[(tfidf_df['democracy']>0.05) & (tfidf_df['freedom']>0.03), ['democracy','freedom']]
filterdf.head()

In [None]:
tfidf_df['god'].idxmax()
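Each filter above works by first building a column of True/False values and then keeping the rows marked True. The sketch below shows that intermediate step on an invented three-row table; the labels and scores are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame(
    {"democracy": [0.02, 0.15, 0.08], "freedom": [0.04, 0.01, 0.20]},
    index=["01_first", "02_second", "03_third"],
)

mask = toy["democracy"] > 0.05          # a True/False value for every row
print(mask)
print(toy[mask])                        # keeps only the rows where the mask is True

# Combine conditions with & (and) or | (or); each condition needs its own parentheses
both = toy.loc[(toy["democracy"] > 0.05) & (toy["freedom"] > 0.03)]
print(both)

# idxmax() returns the row label that holds the column's largest value
top_freedom = toy["freedom"].idxmax()
print(top_freedom)
```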

## Part IV. Subsetting, Filtering, and Sorting the TFIDF Dataframe

10. Add a row for document frequency, that is, the number of documents in which each word appears

In [None]:
tfidf_df.loc['00_DocumentFrequency'] = (tfidf_df > 0).sum()

In [None]:
tfidf_slice = tfidf_df[['government', 'borders', 'people', 'obama', 'war', 'honor','foreign', 'men', 'women', 'children']]
tfidf_slice.sort_index().round(decimals=2)
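The expression `(tfidf_df > 0).sum()` packs two steps into one line. A minimal sketch on an invented table, with made-up labels and scores, separates them:

```python
import pandas as pd

toy = pd.DataFrame(
    {"peace": [0.0, 0.4, 0.7], "war": [0.3, 0.0, 0.0]},
    index=["01_first", "02_second", "03_third"],
)

nonzero = toy > 0              # True wherever a term appears in a document
doc_freq = nonzero.sum()       # True counts as 1, so each column sum is a document count
print(doc_freq)
```

Here “peace” has a nonzero score in two documents and “war” in one, so those are their document frequencies.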

11. Let’s drop “00_DocumentFrequency” since we were just using it for illustration purposes.

In [None]:
tfidf_df = tfidf_df.drop('00_DocumentFrequency', errors='ignore')

12. Let’s reorganize the DataFrame so that the words are in rows rather than columns.

In [None]:
tfidf_df.stack().reset_index()

In [None]:
tfidf_df = tfidf_df.stack().reset_index()

In [None]:
tfidf_df = tfidf_df.rename(columns={0:'tfidf', 'level_0': 'document','level_1': 'term'})
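To see what the reshaping in step 12 does, the sketch below runs the same three calls on an invented two-by-two table. The variable names `wide` and `long` and the scores are made up. The default column names `level_0`, `level_1`, and `0` are what pandas assigns here because neither the rows nor the columns of the toy table have a name.

```python
import pandas as pd

# "Wide" layout: one row per document, one column per term
wide = pd.DataFrame(
    {"peace": [0.4, 0.7], "war": [0.3, 0.0]},
    index=["01_first", "02_second"],
)

# stack() moves the column labels into the index; reset_index() turns the index back into columns
long = wide.stack().reset_index()
long = long.rename(columns={"level_0": "document", "level_1": "term", 0: "tfidf"})
print(long)
```

The result has one row per document-term pair, which is the layout that the sorting and grouping steps expect.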

13. To find out the top 10 words with the highest tf–idf for every address, we’re going to sort by document and tfidf score and then group by document and take the first 10 values.

In [None]:
tfidf_df.sort_values(by=['document','tfidf'], ascending=[True,False]).groupby(['document']).head(10)

In [None]:
top_tfidf = tfidf_df.sort_values(by=['document','tfidf'], ascending=[True,False]).groupby(['document']).head(10)
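The sort-then-group pattern in step 13 can be checked by eye on a small example. The sketch below uses an invented long-format table with two documents and keeps the top 2 terms per document rather than the top 10. All names and scores are made up.

```python
import pandas as pd

long = pd.DataFrame({
    "document": ["a", "a", "a", "b", "b", "b"],
    "term":     ["peace", "war", "union", "peace", "war", "union"],
    "tfidf":    [0.2, 0.5, 0.1, 0.6, 0.0, 0.3],
})

top2 = (
    long.sort_values(by=["document", "tfidf"], ascending=[True, False])  # documents A-Z, best scores first
        .groupby("document")                                             # one group per document
        .head(2)                                                         # first 2 rows of each group
)
print(top2)
```

Because the rows are already sorted by score within each document, taking the first rows of each group is the same as taking that document's highest-scoring terms.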

14. We can zoom in on particular words and particular documents.

In [None]:
top_tfidf[top_tfidf['term'].str.contains('women')]

In [None]:
top_tfidf['term'].str.contains('e')

15. It turns out that the term “women” is very distinctive in Obama’s Inaugural Address.

In [None]:
top_tfidf[top_tfidf['document'].str.contains('obama')]

In [None]:
top_tfidf[top_tfidf['document'].str.contains('trump')]

In [None]:
top_tfidf[top_tfidf['document'].str.contains('kennedy')]

## Visualize TF-IDF [MW]

16. We can also visualize our TF-IDF results with the data visualization library Altair.

In [None]:
#!pip install altair

17. Let’s make a heatmap that shows the highest TF-IDF scoring words for each president, and let’s put a red dot next to two terms of interest: “war” and “peace”:

The code below was contributed by [Eric Monson](https://github.com/emonson). Thanks, Eric!

In [None]:
import altair as alt
import numpy as np

# Terms in this list will get a red dot in the visualization
term_list = ['war', 'peace']

# adding a little randomness to break ties in term ranking
top_tfidf_plusRand = top_tfidf.copy()
top_tfidf_plusRand['tfidf'] = top_tfidf_plusRand['tfidf'] + np.random.rand(top_tfidf.shape[0])*0.0001

# base for all visualizations, with rank calculation
base = alt.Chart(top_tfidf_plusRand).encode(
    x = 'rank:O',
    y = 'document:N'
).transform_window(
    rank = "rank()",
    sort = [alt.SortField("tfidf", order="descending")],
    groupby = ["document"],
)

# heatmap specification
heatmap = base.mark_rect().encode(
    color = 'tfidf:Q'
)

# red circle over terms in above list
circle = base.mark_circle(size=100).encode(
    color = alt.condition(
        alt.FieldOneOfPredicate(field='term', oneOf=term_list),
        alt.value('red'),
        alt.value('#FFFFFF00')        
    )
)

# text labels, white for darker heatmap colors
text = base.mark_text(baseline='middle').encode(
    text = 'term:N',
    color = alt.condition(alt.datum.tfidf >= 0.23, alt.value('white'), alt.value('black'))
)

# display the three superimposed visualizations
(heatmap + circle + text).properties(width = 600)

<div class="alert alert-info" role="alert"><h3 style='color:blue'> Review Questions </h3>

<p style='color:blue'>Your Turn!</p>

<p style='color:blue'>Take a few minutes to explore the dataframe below and then answer the following questions.</p>

<p style='color:blue'>1. What is the difference between a tf-idf score and raw word frequency?</p>

<p style='color:blue'>2. Based on the dataframe above, what is one potential problem or limitation that you notice with tf-idf scores?</p>

<p style='color:blue'>3. What’s another collection of texts that you think might be interesting to analyze with tf-idf scores? Why?</p>
</div>


<div class="alert alert-info" role="alert"><h3 style='color:blue'> EXERCISES </h3>

<p style='color:blue'>Write new code below that creates a TFIDF dataframe and color-coded graphic (just like the one above) but this time for either:</p>

<ol style='color:blue'>
    <li>our 233 SOTU addresses</li>
    <li>or a collection of texts of your own choosing (which should be saved as plain text [.txt] files)</li>
</ol>

</div>