# Class 13.1: TF-IDF

You should have most of these installed already, but you can double check. The last two or three might be new for you.

```
python3 -m pip install jupyter
python3 -m pip install nltk
python3 -m pip install numpy
python3 -m pip install matplotlib
python3 -m pip install scikit-learn
python3 -m pip install pandas
python3 -m pip install altair
```

In this folder you'll find a directory called `Endangered_animals` containing about a hundred text files, each one corresponding to one of the pages in the Wikipedia Category "Endangered Animals". This is the data used in this notebook.


### TF-IDF

TF-IDF stands for "term frequency - inverse document frequency". It's a way of measuring how common a word is in a document ("term frequency") relative to how common that word is in all your documents ("inverse document frequency"). This metric allows you to discover the words that are central to a particular document and make that document special or unique compared to other documents. 

For example, in my endangered animals data, all of the documents of course contain the same stop words, but they are also likely to contain many of the same content words -- *animal, endangered, threatened, species, male*. Applying TF-IDF will allow the words that are particularly important for a particular document to be highlighted.
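
To make the idea concrete, here is a minimal sketch that computes TF-IDF by hand for three made-up one-sentence "documents" (the sentences and the `tfidf` helper are my own illustration, not part of the notebook's data). It uses a raw word count for TF and a smoothed IDF, which as far as I know matches the default formula in scikit-learn's `TfidfVectorizer`; scikit-learn also rescales each document's scores to unit length by default, which this sketch skips.

```python
import math
from collections import Counter

# Three made-up "documents" (hypothetical data, not from the Wikipedia folder)
docs = {
    "elephant": "the elephant is an endangered animal with tusks",
    "parrot": "the parrot is an endangered animal with feathers",
    "turtle": "the turtle is an endangered animal with a shell",
}

tokenized = {name: text.split() for name, text in docs.items()}
n_docs = len(tokenized)

# Document frequency: in how many documents does each word appear?
doc_freq = Counter()
for words in tokenized.values():
    doc_freq.update(set(words))

def tfidf(word, name):
    """Raw count of `word` in one document times a smoothed inverse document frequency."""
    tf = tokenized[name].count(word)
    idf = math.log((1 + n_docs) / (1 + doc_freq[word])) + 1
    return tf * idf

print(round(tfidf("animal", "elephant"), 3))  # 1.0   (appears in every document)
print(round(tfidf("tusks", "elephant"), 3))   # 1.693 (appears only in this document)
```

Both words occur once in the elephant sentence, but *tusks* scores higher because no other document contains it.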

This technique will be useful to you in your projects, when you try to highlight for your audience how different examples of the same kind of document (e.g., song lyrics, earnings calls, literature) are different from one another.

Read my comments and then run the code blocks. In the end you will get a very cool heat map-style visualization of the words that are most particular to each of the Wikipedia pages in your chosen category.

In [3]:
# some import statements

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from pathlib import Path
import pandas as pd
import glob

In [4]:
# If you were running this with your data, you'd replace "Endangered_animals" below with your directory name
directoryname = "Endangered_animals"

# Then run this code to get the files in that directory and their names.
text_files = glob.glob(directoryname + "/*.txt")
file_names = [Path(text).stem for text in text_files]

In [5]:
# This code here does all the tf-idf counting for you.

tfidf_vectorizer = TfidfVectorizer(input='filename', stop_words='english')
tfidf_vector = tfidf_vectorizer.fit_transform(text_files)


The next block of code converts the results to a pandas dataframe, which makes it easier to process and visualize the resulting TF-IDF values. The final line of code displays the dataframe, showing you, for each Wikipedia page in the category, its ten most salient and special words -- the words with the highest frequency in that document (TF) relative to the frequency of that word in all the documents (IDF).

In [6]:
# This converts the results to a pandas dataframe, which makes it easier to
# process and visualize
tfidf_df = pd.DataFrame(tfidf_vector.toarray(), index=file_names, columns=tfidf_vectorizer.get_feature_names_out())
tfidf_df = tfidf_df.stack().reset_index()
tfidf_df = tfidf_df.rename(columns={0: 'tfidf', 'level_0': 'document', 'level_1': 'term'})
tfidf_df.sort_values(by=['document','tfidf'], ascending=[True,False]).groupby(['document']).head(10)

Unnamed: 0,document,term,tfidf
225213,African_bush_elephant,elephants,0.405045
225211,African_bush_elephant,elephant,0.366661
223595,African_bush_elephant,bulls,0.343408
222750,African_bush_elephant,african,0.302117
223615,African_bush_elephant,bush,0.275314
...,...,...,...
245093,White-collared_kite,flight,0.086618
244463,White-collared_kite,eastern,0.081678
242838,White-collared_kite,brazil,0.072956
242879,White-collared_kite,broad,0.072956


In [7]:
# This line of code just saves the above output to a variable so that you can query it.

top_tfidf = tfidf_df.sort_values(by=['document','tfidf'], ascending=[True,False]).groupby(['document']).head(10)

In [8]:
# This says "find all the documents that have a term containing 'bird' in their top 10."

top_tfidf[top_tfidf['term'].str.contains('bird')]

Unnamed: 0,document,term,tfidf
165450,Funds_for_Endangered_Parrots,bird,0.149423
522872,Green_peafowl,birds,0.082535
551852,Pterodroma_madeira,birds,0.118991
513210,Red-billed_curassow,bird,0.165133
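
Notice that `str.contains('bird')` matches any term that has "bird" somewhere inside it, which is why *birds* shows up too. If you only want exact matches, here is a small sketch of the alternatives, using a tiny stand-in dataframe (the rows and scores in `toy_top` are made up for illustration).

```python
import pandas as pd

# A tiny stand-in for top_tfidf (hypothetical rows and scores, for illustration only)
toy_top = pd.DataFrame({
    "document": ["Doc_A", "Doc_B", "Doc_C"],
    "term": ["bird", "birds", "birdwatching"],
    "tfidf": [0.15, 0.08, 0.05],
})

# Substring match: keeps every term that contains "bird" anywhere
substring_matches = toy_top[toy_top["term"].str.contains("bird")]

# Exact match: keeps only rows whose term is exactly "bird"
exact_matches = toy_top[toy_top["term"] == "bird"]

# Several exact terms at once
either_form = toy_top[toy_top["term"].isin(["bird", "birds"])]

print(len(substring_matches), len(exact_matches), len(either_form))  # 3 1 2
```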


In [9]:
# This says "find the top ten words in the African bush elephant document."

top_tfidf[top_tfidf['document'].str.contains('African_bush_elephant')]

Unnamed: 0,document,term,tfidf
225213,African_bush_elephant,elephants,0.405045
225211,African_bush_elephant,elephant,0.366661
223595,African_bush_elephant,bulls,0.343408
222750,African_bush_elephant,african,0.302117
223615,African_bush_elephant,bush,0.275314
227987,African_bush_elephant,musth,0.197433
231782,African_bush_elephant,years,0.186117
224451,African_bush_elephant,cows,0.166857
222756,African_bush_elephant,age,0.116129
231236,African_bush_elephant,tusks,0.103417


### The pièce de résistance: A heat map

The code below (from [here](https://melaniewalsh.github.io/Intro-Cultural-Analytics/05-Text-Analysis/03-TF-IDF-Scikit-Learn.html)) makes a lovely visualization showing what terms are most particular to each document, according to TF-IDF. Again, you can just run the code and see what happens.

You'll see that, unsurprisingly, the most prominent or special or particular word for each document is usually the name of the relevant animal, but then you'll start to see other interesting patterns.

Many of you will find this useful in projects that try to compare sets of documents over time.

In [10]:
import altair as alt
import numpy as np


# adding a little randomness to break ties in term ranking
top_tfidf_plusRand = top_tfidf.copy()
top_tfidf_plusRand['tfidf'] = top_tfidf_plusRand['tfidf'] + np.random.rand(top_tfidf.shape[0])*0.0001

# base for all visualizations, with rank calculation
base = alt.Chart(top_tfidf_plusRand).encode(
    x = 'rank:O',
    y = 'document:N'
).transform_window(
    rank = "rank()",
    sort = [alt.SortField("tfidf", order="descending")],
    groupby = ["document"],
)

# heatmap specification
heatmap = base.mark_rect().encode(
    color = 'tfidf:Q'
)

# text labels, white for darker heatmap colors
text = base.mark_text(baseline='middle').encode(
    text = 'term:N',
    color = alt.condition(alt.datum.tfidf >= 0.23, alt.value('white'), alt.value('black'))
)

# display the two superimposed visualizations (heatmap and text labels)
(heatmap + text).properties(width = 600)
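
If you're wondering why the code above adds a little randomness before ranking, here is a sketch of the problem in isolation, using pandas ranking as a stand-in for the chart's `rank()` window (the scores are made up, and I'm assuming the chart treats ties the same way pandas does with `method="min"`). Two terms with the same score get the same rank, so they would be drawn on top of each other in one cell.

```python
import numpy as np
import pandas as pd

# Hypothetical scores with a tie between the second and third terms
scores = pd.Series([0.30, 0.07, 0.07, 0.05], index=["kite", "brazil", "broad", "wing"])

# Tied values share a rank, so two terms would land in the same heat map cell
print(scores.rank(method="min", ascending=False).astype(int).tolist())  # [1, 2, 2, 4]

# A tiny random offset (much smaller than any real difference) separates them
rng = np.random.default_rng(0)
jittered = scores + rng.random(len(scores)) * 0.0001
print(sorted(jittered.rank(method="min", ascending=False).astype(int).tolist()))
```

After the offset every term gets its own rank, and because the offset is at most 0.0001 it doesn't reorder terms whose scores really differ.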