<img align="left" src="https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/CC_BY.png"><br />

Created by [Nathan Kelber](http://nkelber.com) and Ted Lawless for [JSTOR Labs](https://labs.jstor.org/) under [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/)<br />
For questions/comments/improvements, email nathan.kelber@ithaka.org.<br />
___

# Finding Significant Words Using TF/IDF

**Description:**
This [notebook](https://docs.constellate.org/key-terms/#jupyter-notebook) shows how to discover significant words. The method for finding significant terms is [tf-idf](https://docs.constellate.org/key-terms/#tf-idf).  The following processes are described:

* An educational overview of TF-IDF, including how it is calculated
* Using the `constellate` client to retrieve a dataset
* Filtering based on a pre-processed ID list
* Cleaning the tokens in the dataset
* Creating a [gensim dictionary](https://docs.constellate.org/key-terms/#gensim-dictionary)
* Creating a [gensim](https://docs.constellate.org/key-terms/#gensim) [bag of words](https://docs.constellate.org/key-terms/#bag-of-words) [corpus](https://docs.constellate.org/key-terms/#corpus)
* Computing the most significant words in your [corpus](https://docs.constellate.org/key-terms/#corpus) using [gensim](https://docs.constellate.org/key-terms/#gensim) implementation of [TF-IDF](https://docs.constellate.org/key-terms/#tf-idf)

**Use Case:** For Learners (Detailed explanation, not ideal for researchers)

[Take me to the **Research Version** of this notebook ->](./finding-significant-terms-for-research.ipynb)

**Difficulty:** Intermediate

**Completion time:** 60 minutes

**Knowledge Required:** 
* Python Basics Series ([Start Python Basics I](./python-basics-1.ipynb))

**Knowledge Recommended:**
* [Exploring Metadata](./metadata.ipynb)
* [Working with Dataset Files](./working-with-dataset-files.ipynb)
* [Pandas I](./pandas-1.ipynb)
* A familiarity with [gensim](https://docs.constellate.org/key-terms/#gensim) is helpful but not required.

**Data Format:** [JSON Lines (.jsonl)](https://docs.constellate.org/key-terms/#jsonl)

**Libraries Used:**
* [constellate](https://docs.constellate.org/key-terms/#tdm-client) client to collect, unzip, and read our dataset
* [pandas](https://constellate.org/docs/key-terms/#pandas) to load a preprocessing list
* [gensim](https://docs.constellate.org/key-terms/#gensim) to help compute the [tf-idf](https://docs.constellate.org/key-terms/#tf-idf) calculations
* [NLTK](https://docs.constellate.org/key-terms/#nltk) to create a stopwords list (if no list is supplied)

**Research Pipeline:**

1. Build a dataset
2. Create a "Pre-Processing CSV" with [Exploring Metadata](./exploring-metadata.ipynb) (Optional)
3. Complete the TF-IDF analysis with this notebook
____

## What is "Term Frequency- Inverse Document Frequency" (TF-IDF)?

[TF-IDF](https://docs.constellate.org/key-terms/#tf-idf) is used in [machine learning](https://docs.constellate.org/key-terms/#machine-learning) and [natural language processing](https://docs.constellate.org/key-terms//#nlp) for measuring the significance of particular terms for a given document. It consists of two parts that are multiplied together:

1. Term Frequency- A measure of how many times a given word appears in a document
2. Inverse Document Frequency- A measure of how rare the word is across the corpus; the more documents that contain the word, the lower this value

**Before starting this lesson,** we recommend reading the [explanation of TF/IDF](https://ghost.constellate.org/what-is-tf-idf/) in the Constellate documentation.


### TF-IDF Calculation in Plain English

$$(Times-the-word-occurs-in-given-document) \cdot \mbox{log} \frac{(Total-number-of-documents)}{(Number-of-documents-containing-word)}$$

There are variations on the [TF-IDF](https://docs.constellate.org/key-terms/#tf-idf) formula, but this is the most widely-used version.
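
To make the formula concrete, here is a minimal sketch that applies it by hand to a tiny, made-up corpus. The documents, the function name `plain_tf_idf`, and the choice of the natural logarithm are assumptions for illustration only; they are not part of the dataset or of gensim.

```python
import math

# Toy corpus (hypothetical): each document is a list of tokens
toy_docs = [
    ["hamlet", "ghost", "ghost", "king"],
    ["king", "crown", "king"],
    ["ghost", "storm", "king"],
]

def plain_tf_idf(term, doc, docs):
    """Count of `term` in `doc` times log(total docs / docs containing `term`)."""
    term_count = doc.count(term)
    docs_with_term = sum(1 for d in docs if term in d)
    if docs_with_term == 0:
        return 0.0
    # Natural log is an assumption here; other bases only rescale the scores
    return term_count * math.log(len(docs) / docs_with_term)

king_score = plain_tf_idf("king", toy_docs[1], toy_docs)    # "king" is in all 3 documents
crown_score = plain_tf_idf("crown", toy_docs[1], toy_docs)  # "crown" is in 1 document
ghost_score = plain_tf_idf("ghost", toy_docs[0], toy_docs)  # "ghost" is in 2 documents
print(king_score, crown_score, ghost_score)
```

In this toy corpus, "king" scores zero even though it is frequent, because it appears in every document, while "crown" scores highest because it appears in only one.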

## Computing TF-IDF with your Dataset

We'll use the `constellate` client to automatically retrieve the dataset in the JSON Lines (.jsonl) file format.

Enter a [dataset ID](https://docs.constellate.org/key-terms/#dataset-ID) in the next code cell.

If you don't have a dataset ID, you can:
* Use the sample dataset ID already in the code cell
* [Create a new dataset](https://constellate.org/builder)
* [Use a dataset ID from other pre-built sample datasets](https://constellate.org/dataset/dashboard)

In [1]:
# Default dataset is "Shakespeare Quarterly," 1950-present
dataset_id = "34eb1175-92d1-4fd3-ca54-438e575b6e64"

Next, import the `constellate` client and pass the `dataset_id` as an argument to its `get_dataset` method.

In [2]:
# Importing your dataset with a dataset ID
import constellate
# Pull in the sampled dataset (1500 documents) that matches `dataset_id`
# in the form of a gzipped JSON lines file.
# The .get_dataset() method downloads the gzipped JSONL file
# to the /data folder and returns a string for the file name and location
dataset_file = constellate.get_dataset(dataset_id)

# To download the full dataset (up to a limit of 25,000 documents),
# request it first in the builder environment. See the Constellate Client
# documentation at: https://constellate.org/docs/constellate-client
# Then use the `constellate.download` method shown below.
#dataset_file = constellate.download(dataset_id, 'jsonl')

Constellate: use and download of datasets is covered by the Terms & Conditions of Use: https://constellate.org/terms-and-conditions/
Trump from 1900 - 2022. 25000 documents.
INFO:root:File /root/data/34eb1175-92d1-4fd3-ca54-438e575b6e64-sampled-jsonl.jsonl.gz exists. Not re-downloading.


## Apply Pre-Processing Filters (if available)
If you completed pre-processing with the "Exploring Metadata and Pre-processing" notebook, you can use your CSV file of dataset IDs to automatically filter the dataset. Your pre-processed CSV file must be in the `data` folder and named `pre-processed_` followed by the dataset ID, which is where the next cell looks for it.

In [3]:
# Import a pre-processed CSV file of filtered dataset IDs.
# If you do not have a pre-processed CSV file, the analysis
# will run on the full dataset and may take longer to complete.
import pandas as pd
import os

pre_processed_file_name = f'data/pre-processed_{dataset_id}.csv'

if os.path.exists(pre_processed_file_name):
    df = pd.read_csv(pre_processed_file_name)
    filtered_id_list = df["id"].tolist()
    use_filtered_list = True
    print('Pre-Processed CSV found. Successfully read in ' + str(len(df)) + ' documents.')
else: 
    use_filtered_list = False
    print('No pre-processed CSV file found. Full dataset will be used.')

INFO:numexpr.utils:NumExpr defaulting to 4 threads.
No pre-processed CSV file found. Full dataset will be used.
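
The filtering idea can be sketched without pandas. The CSV text, the document IDs, and the variable names below are hypothetical stand-ins; the only detail taken from the cell above is that the file has an `id` column. Storing the IDs in a set is a design choice of this sketch, since membership tests on a set stay fast as the list of IDs grows.

```python
import csv
import io

# Hypothetical CSV content standing in for the pre-processed file
csv_text = "id,title\ndoc-1,First\ndoc-3,Third\n"

# Read the "id" column into a set for fast membership tests
keep_ids = {row["id"] for row in csv.DictReader(io.StringIO(csv_text))}

# Hypothetical records shaped like dataset documents (only "id" is used here)
records = [{"id": "doc-1"}, {"id": "doc-2"}, {"id": "doc-3"}]

# Keep only the documents whose ID appears in the CSV
kept = [rec["id"] for rec in records if rec["id"] in keep_ids]
print(kept)
```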


## Define a Unigram Processing Function
In this step, we gather the unigrams. If there is a Pre-Processing Filter, we will only analyze documents from the filtered ID list. We will also process each unigram, assessing them individually. We will complete the following tasks:

* Lowercase all tokens
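* Remove tokens in stopwords list (note: the `process_token` function below does not include this step; it applies only the lowercase, length, and alphabetic checks)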
* Remove tokens with fewer than 4 characters
* Remove tokens with non-alphabetic characters

We can define this process in a function.

In [4]:
# Define a function that will process individual tokens
# Only a token that passes through both `if` 
# statements will be returned. A `True` result for
# any `if` statement does not return the token. 

def process_token(token):
    token = token.lower()
    if len(token) < 4: # If True, do not return token
        return None
    if not(token.isalpha()): # If True, do not return token
        return None
    return token # If all are False, return the lowercased token
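
If you also want to drop stopwords, one possible variant is sketched below. The function name `process_token_with_stopwords` and the tiny `example_stopwords` set are hypothetical; a real stopword list (for example, one from NLTK) would be much longer. Using this variant would change the outputs shown in the rest of this notebook.

```python
# Hypothetical stopword set; a real list would contain many more words
example_stopwords = {"that", "from", "with", "this", "have"}

def process_token_with_stopwords(token, stop_words=example_stopwords):
    """Lowercase a token and return it only if it passes every filter."""
    token = token.lower()
    if token in stop_words:   # Drop stopwords
        return None
    if len(token) < 4:        # Drop tokens with fewer than 4 characters
        return None
    if not token.isalpha():   # Drop tokens with non-alphabetic characters
        return None
    return token

sample_tokens = ["From", "Handshake", "the", "mid-air", "care"]
results = [process_token_with_stopwords(t) for t in sample_tokens]
print(results)
```

Only "handshake" and "care" survive: "From" is a stopword once lowercased, "the" is too short, and "mid-air" contains a hyphen.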

## Collect lists of Document IDs, Titles, and Unigrams

Next, we process all the unigrams into a list called `documents`. The code below runs on every document in the downloaded file (by default, the 1,500-document sample). We also collect the document titles and IDs so we can reference them later.

In [5]:
documents = [] # A list that will contain all of our unigrams
document_ids = [] # A list that will contain all of our document ids
document_titles = [] # A list that will contain all of our titles

for document in constellate.dataset_reader(dataset_file):
    processed_document = [] # Temporarily store the unigrams for this document
    document_id = document['id'] # Temporarily store the document id for this document
    document_title = document['title'] # Temporarily store the document title for this document
    if use_filtered_list is True:
        # Skip documents not in our filtered_id_list
        if document_id not in filtered_id_list:
            continue
    unigrams = document.get("unigramCount", {}) # Default to an empty dictionary so .items() always works
    for gram, count in unigrams.items():
        clean_gram = process_token(gram)
        if clean_gram is None:
            continue
        processed_document += [clean_gram] * count # Add the unigram as many times as it was counted
    if len(processed_document) > 0:
        document_ids.append(document_id)
        document_titles.append(document_title)
        documents.append(processed_document)


At this point, we have unigrams collected for all our documents inside the `documents` list variable. Each index of our list is a single document, starting with `documents[0]`. Each document is, in turn, a list with a single string for each unigram.

**Note:** As we collect the unigrams for each document, we are simply including them in a list of strings. This is not the same as collecting them into word counts, and we are not using a Counter() object here like the Word Frequencies notebook. 

The next cell demonstrates the contents of each item in our `documents` list. Essentially, every token is repeated once for each time it occurs in the document.

In [6]:
# Show the unigrams collected for a particular document
# Change the value of n to see a different document
n = 0

print(document_titles[n])
list(documents[n])

Russell and the Handshake: Greeting in Spiritual Care


['vigorously',
 'malette',
 'emphasizing',
 'power',
 'power',
 'power',
 'power',
 'power',
 'power',
 'power',
 'power',
 'power',
 'verbal',
 'room',
 'room',
 'room',
 'theological',
 'provide',
 'would',
 'would',
 'whether',
 'whether',
 'whether',
 'whether',
 'whether',
 'whether',
 'whether',
 'whether',
 'whether',
 'whether',
 'whether',
 'whether',
 'recognize',
 'recognize',
 'naturally',
 'offers',
 'offers',
 'black',
 'signals',
 'numerous',
 'numerous',
 'numerous',
 'numerous',
 'numerous',
 'while',
 'while',
 'fixed',
 'shake',
 'shake',
 'shake',
 'shake',
 'shake',
 'away',
 'same',
 'proven',
 'from',
 'from',
 'from',
 'from',
 'from',
 'from',
 'from',
 'from',
 'from',
 'from',
 'from',
 'from',
 'from',
 'from',
 'from',
 'from',
 'from',
 'from',
 'from',
 'from',
 'from',
 'from',
 'from',
 'from',
 'from',
 'from',
 'from',
 'from',
 'from',
 'from',
 'from',
 'from',
 'from',
 'from',
 'from',
 'from',
 'from',
 'from',
 'from',
 'light',
 'gift',
 'lesso

If we wanted to see word frequencies, we could convert the lists at this point into `Counter()` objects. The next cell demonstrates that operation.

In [7]:
# Convert a given document into a Counter object to determine
# word frequencies count

# Import counter to help count word frequencies
from collections import Counter

word_freq = Counter(documents[0]) # Change documents index to see a different document
word_freq.most_common(25) 

[('that', 71),
 ('greeting', 69),
 ('from', 40),
 ('handshake', 36),
 ('spiritual', 32),
 ('care', 30),
 ('such', 29),
 ('other', 28),
 ('this', 27),
 ('with', 27),
 ('touch', 24),
 ('approach', 23),
 ('when', 22),
 ('have', 21),
 ('social', 20),
 ('there', 19),
 ('though', 18),
 ('about', 18),
 ('whether', 17),
 ('handshakes', 17),
 ('many', 17),
 ('experience', 17),
 ('counseling', 16),
 ('hand', 16),
 ('several', 15)]

Now that we have all the cleaned unigrams for every document in a list called `documents`, we can use Gensim to compute TF/IDF.

---
## Using Gensim to Compute "Term Frequency- Inverse Document Frequency"

It will be helpful to remember the basic steps we did in the explanatory [TF-IDF](https://docs.constellate.org/key-terms/#tf-idf) example:

1. Create a list of the frequency of every word in every document
2. Create a list of every word in the [corpus](https://docs.constellate.org/key-terms/#corpus)
3. Compute [TF-IDF](https://docs.constellate.org/key-terms/#tf-idf) based on that data

So far, we have gathered the data for the first item: the `documents` list holds every cleaned token of every document, repeated once per occurrence. Now we need to create a list of every word in the corpus. In [gensim](https://docs.constellate.org/key-terms/#gensim), this is called a "dictionary". A [gensim dictionary](https://docs.constellate.org/key-terms/#gensim-dictionary) is similar to a [Python dictionary](https://docs.constellate.org/key-terms/#python-dictionary), but we call it a gensim dictionary here to make clear that it is a specialized kind of dictionary.

### Creating a Gensim Dictionary

Let's create our [gensim dictionary](https://docs.constellate.org/key-terms/#gensim-dictionary). A [gensim dictionary](https://docs.constellate.org/key-terms/#gensim-dictionary) is a kind of masterlist of all the words across all the documents in our corpus. Each unique word is assigned an ID in the gensim dictionary. The result is a set of key/value pairs of unique tokens and their unique IDs.

In [8]:
import gensim
dictionary = gensim.corpora.Dictionary(documents)

INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(1039616 unique tokens: ['aaron', 'abdominal', 'aborted', 'about', 'above']...) from 1500 documents (total 17800866 corpus positions)
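
To see the idea behind a token-to-ID mapping, here is a pure-Python sketch on a tiny, made-up corpus. The function `build_token2id` is hypothetical and is not how gensim is implemented; it only illustrates that each unique token receives one integer ID and that word counts are not stored.

```python
def build_token2id(docs):
    """Assign each unique token an integer ID the first time it is seen.

    Tokens within a document are visited in sorted order, which is a
    simplifying assumption of this sketch.
    """
    token2id = {}
    for doc in docs:
        for token in sorted(set(doc)):
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

# Toy corpus (hypothetical): "ghost" appears twice but gets a single ID
toy_docs = [["king", "ghost", "ghost"], ["crown", "king"]]
toy_token2id = build_token2id(toy_docs)
print(toy_token2id)
```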


The [gensim dictionary](https://docs.constellate.org/key-terms/#gensim-dictionary) stores a unique identifier (starting with 0) for every unique token in the corpus. The [gensim dictionary](https://docs.constellate.org/key-terms/#gensim-dictionary) does not contain information on word frequencies; it only catalogs all the unique words in the corpus. You can see the unique ID for each token in the text using the `.token2id` attribute, which is a Python dictionary mapping tokens to IDs.

In [9]:
list(dictionary.token2id.items())

[('aaron', 0),
 ('abdominal', 1),
 ('aborted', 2),
 ('about', 3),
 ('above', 4),
 ('abovementioned', 5),
 ('abrupt', 6),
 ('abstract', 7),
 ('abuse', 8),
 ('accept', 9),
 ('acceptable', 10),
 ('accepted', 11),
 ('accepting', 12),
 ('accompanies', 13),
 ('accompanying', 14),
 ('accurately', 15),
 ('achieving', 16),
 ('acknowledge', 17),
 ('across', 18),
 ('acteristics', 19),
 ('actions', 20),
 ('activities', 21),
 ('actualising', 22),
 ('addition', 23),
 ('additional', 24),
 ('admitted', 25),
 ('advisable', 26),
 ('after', 27),
 ('again', 28),
 ('alike', 29),
 ('allowed', 30),
 ('allowing', 31),
 ('almighty', 32),
 ('almost', 33),
 ('alone', 34),
 ('along', 35),
 ('also', 36),
 ('alter', 37),
 ('always', 38),
 ('ambiguity', 39),
 ('america', 40),
 ('american', 41),
 ('among', 42),
 ('amputated', 43),
 ('anglican', 44),
 ('another', 45),
 ('answering', 46),
 ('anyone', 47),
 ('anything', 48),
 ('applied', 49),
 ('appraised', 50),
 ('appreciating', 51),
 ('apprehensive', 52),
 ('approach'

We could also look up the corresponding ID for a token using the ``.get`` method.

In [10]:
# Get the value for the key 'people'. Return None if there is no token matching 'people'
# (a default of 0 would be ambiguous, because 0 is itself a valid token ID).
# The number returned is the gensim dictionary ID for the token. 

dictionary.token2id.get('people')

732

For the sake of example, we could also discover a particular token using just the ID number. This is not something likely to happen in practice, but it serves here as a demonstration of the connection between tokens and their ID number.

Normally, [Python dictionaries](https://docs.constellate.org/key-terms/#python-dictionary) only map from keys to values (not from values to keys). However, we can write a quick for loop to go the other direction. This cell is simply to demonstrate how the [gensim dictionary](https://docs.constellate.org/key-terms/#gensim-dictionary) is connected to the list entries in the [gensim](https://docs.constellate.org/key-terms/#gensim) ``bow_corpus``.

In [11]:
# Find the token associated with a token id number
token_id = 100

# If the token id matches, print out the associated token
for dict_id, token in dictionary.items():
    if dict_id == token_id:
        print(token)

become
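
When many reverse lookups are needed, looping over the whole dictionary each time becomes slow. A minimal sketch of one alternative is to invert the mapping once. The `toy_token2id` mapping below is a hypothetical stand-in for `dictionary.token2id`.

```python
# Hypothetical token-to-ID mapping standing in for dictionary.token2id
toy_token2id = {"ghost": 0, "king": 1, "crown": 2}

# Invert the mapping once so every ID lookup is a single dictionary access
toy_id2token = {token_id: token for token, token_id in toy_token2id.items()}

print(toy_id2token[1])
```

The gensim dictionary also supports this direction directly through indexing, as in `dictionary[token_id]`, which later cells in this notebook rely on.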


## Creating a Bag of Words Corpus

The next step is to connect our word frequency data found within ``documents`` to our [gensim dictionary](https://docs.constellate.org/key-terms/#gensim-dictionary) token IDs. For every document, we want to know how many times a word (notated by its ID) occurs. We will create a [Python list](https://docs.constellate.org/key-terms/#python-list) called ``bow_corpus`` that will turn our word counts into a series of [tuples](https://docs.constellate.org/key-terms/#tuple) where the first number is the [gensim dictionary](https://docs.constellate.org/key-terms/#gensim-dictionary) token ID and the second number is the word frequency.

![Combining Gensim dictionary with documents list to create Bag of Words Corpus](https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/bag-of-words-creation.png)

In [12]:
# Create a bag of words corpus
bow_corpus = []

for document in documents:
    bow_corpus.append(dictionary.doc2bow(document))

print('Bag of words corpus created successfully.')

# The for loop could also be written as a list comprehension
# bow_corpus = [dictionary.doc2bow(document) for document in documents]

Bag of words corpus created successfully.
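
As a rough illustration of what each bag-of-words entry contains, the sketch below builds the same kind of `(token ID, count)` tuples in pure Python. The function `toy_doc2bow` and the toy mapping are hypothetical; this is not gensim's implementation, only a picture of the result.

```python
from collections import Counter

# Hypothetical token-to-ID mapping standing in for the gensim dictionary
toy_token2id = {"ghost": 0, "king": 1, "crown": 2}

def toy_doc2bow(doc, token2id):
    """Return sorted (token_id, count) tuples for the tokens found in token2id."""
    counts = Counter(token2id[tok] for tok in doc if tok in token2id)
    return sorted(counts.items())

# "storm" is not in the mapping, so it is skipped
toy_bow = toy_doc2bow(["king", "ghost", "king", "storm"], toy_token2id)
print(toy_bow)
```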


In [13]:
# Examine the bag of words corpus for a specific document n
# Change the value of n to see another document
n = 100

list(bow_corpus[n][:25]) # List out a slice of the first 25 items

[(3, 8),
 (8, 1),
 (9, 1),
 (17, 1),
 (18, 1),
 (23, 4),
 (27, 8),
 (28, 3),
 (34, 2),
 (35, 3),
 (36, 20),
 (38, 1),
 (42, 1),
 (45, 5),
 (47, 1),
 (56, 4),
 (58, 3),
 (59, 2),
 (63, 1),
 (68, 2),
 (73, 3),
 (78, 1),
 (91, 1),
 (94, 2),
 (98, 1)]

Using IDs can seem a little abstract, but we can discover the word associated with a particular ID. For demonstration purposes, the following code prints the associated token with the token counts. Each line printed below matches the token ID and count from above.

In [14]:
# For each id and count in the bag of words corpus
# Print the corresponding word from the Gensim dictionary and count
for token_id, count in bow_corpus[n]: # `token_id` avoids shadowing Python's built-in `id`
    print(dictionary[token_id].ljust(15), count)

about           8
abuse           1
accept          1
acknowledge     1
across          1
addition        4
after           8
again           3
alone           2
along           3
also            20
always          1
among           1
another         5
anyone          1
appropriate     4
area            3
argue           2
around          1
aspect          2
assumptions     3
attention       1
away            1
back            2
became          1
because         14
become          2
been            13
being           6
benefits        5
best            7
between         19
beyond          5
body            2
both            19
brown           1
called          1
cannot          3
care            2
caregiver       1
case            15
central         2
challenges      4
change          1
chapter         3
child           39
choose          1
cited           2
clear           5
close           2
come            2
common          19
compared        1
comparing       1
concept         1
co

virtually       2
demonstration   1
minimal         3
assessing       1
chapters        1
committed       1
disparate       1
dissent         3
dissenting      2
doctrinal       1
doctrines       3
estimates       1
footnote        1
gendered        2
illegitimacy    9
justifying      2
margins         1
norton          1
nurtured        1
persuasive      2
reluctance      1
sketch          1
translated      1
alabama         5
confer          2
elevated        1
levy            1
preventing      1
statute         3
steadily        1
cleveland       2
constituted     1
couples         1
delicacy        1
gerald          3
grandmother     2
ignored         1
obligated       1
obstacle        1
southeastern    1
stevens         1
tified          1
tilt            1
touched         1
veto            1
absolute        1
amendment       2
amounted        1
casey           1
confers         2
controlled      2
declaring       1
denial          2
denying         2
favoring        1
fects     

## Create the `TfidfModel`

The next step is to create the [TF-IDF](https://docs.constellate.org/key-terms/#tf-idf) model which will set the parameters for our implementation of [TF-IDF](https://docs.constellate.org/key-terms/#tf-idf). In our [TF-IDF](https://docs.constellate.org/key-terms/#tf-idf) example, the formula for [TF-IDF](https://docs.constellate.org/key-terms/#tf-idf) was:

$$(Times-the-word-occurs-in-given-document) \cdot \mbox{log} \frac{(Total-number-of-documents)}{(Number-of-documents-containing-word)}$$

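In [gensim](https://docs.constellate.org/key-terms/#gensim), the default formula for measuring [TF-IDF](https://docs.constellate.org/key-terms/#tf-idf) uses log base 2 instead of log base 10. By default, gensim also normalizes each document's scores to unit length afterward, which is why the scores shown later fall between 0 and 1. The weight before normalization is: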

$$(Times-the-word-occurs-in-given-document) \cdot \log_{2} \frac{(Total-number-of-documents)}{(Number-of-documents-containing-the-word)}$$

If you would like to use a different formula for your [TF-IDF](https://docs.constellate.org/key-terms/#tf-idf) calculation, there is a description of [parameters you can pass](https://radimrehurek.com/gensim/models/tfidfmodel.html).
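
The sketch below works through the base-2 formula on a tiny, made-up bag-of-words corpus and then scales each document's scores to unit length. It assumes gensim's default settings (base-2 logarithm, unit-length normalization); the function `toy_tfidf` and the data are hypothetical, and this is an illustration rather than gensim's own code.

```python
import math

# Hypothetical bag-of-words corpus: lists of (token_id, count) tuples
toy_bow_corpus = [
    [(0, 2), (1, 1), (2, 1)],
    [(1, 2), (2, 1)],
    [(1, 1), (3, 3)],
    [(1, 1)],
]

def toy_tfidf(bow_corpus):
    """Weight counts by log2(N / document frequency), then L2-normalize each document."""
    n_docs = len(bow_corpus)
    # Count how many documents contain each token ID
    doc_freq = {}
    for doc in bow_corpus:
        for token_id, _ in doc:
            doc_freq[token_id] = doc_freq.get(token_id, 0) + 1
    weighted_corpus = []
    for doc in bow_corpus:
        weights = [(tid, count * math.log2(n_docs / doc_freq[tid])) for tid, count in doc]
        norm = math.sqrt(sum(w * w for _, w in weights))
        if norm == 0:
            # Every token appears in all documents, so nothing is distinctive
            weighted_corpus.append([])
            continue
        # Drop zero weights and scale so the document vector has length 1
        weighted_corpus.append([(tid, w / norm) for tid, w in weights if w > 0])
    return weighted_corpus

toy_scores = toy_tfidf(toy_bow_corpus)
print(toy_scores[0])
```

Token 1 appears in all four toy documents, so its weight is zero everywhere and it never shows up in the results.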

In [15]:
# Create our gensim TF-IDF model
model = gensim.models.TfidfModel(bow_corpus) 

INFO:gensim.models.tfidfmodel:collecting document frequencies
INFO:gensim.models.tfidfmodel:PROGRESS: processing document #0
INFO:gensim.models.tfidfmodel:calculating IDF weights for 1500 documents and 1039616 features (5637886 matrix non-zeros)


Now, we apply our model to the ``bow_corpus`` to create our results in ``corpus_tfidf``. Like ``bow_corpus``, ``corpus_tfidf`` holds one entry per document. Instead of listing the frequency next to the [gensim dictionary](https://docs.constellate.org/key-terms/#gensim-dictionary) ID, however, each entry contains the [TF-IDF](https://docs.constellate.org/key-terms/#tf-idf) score for the associated token. Below, we display the scores for the first ten tokens of the first document in ``corpus_tfidf``.

In [16]:
# Create TF-IDF scores for the ``bow_corpus`` using our model
corpus_tfidf = model[bow_corpus]

In [17]:
# List out the TF-IDF scores for the nth document's first 10 tokens
# Change n to change the document
n = 0

list(corpus_tfidf[n][:10])

[(0, 0.044073615688884696),
 (1, 0.011192904454479775),
 (2, 0.01298589065779011),
 (3, 0.0028082698488385593),
 (4, 0.003944021514914812),
 (5, 0.02921211714692628),
 (6, 0.009859209718329077),
 (7, 0.004184247891380116),
 (8, 0.005595275303462857),
 (9, 0.004499175751650634)]

Let's display the tokens instead of the [gensim dictionary](https://docs.constellate.org/key-terms/#gensim-dictionary) IDs.

In [18]:
for token_id, score in corpus_tfidf[n][:10]: # `token_id` avoids shadowing Python's built-in `id`
    print(dictionary[token_id].ljust(20), score)

aaron                0.044073615688884696
abdominal            0.011192904454479775
aborted              0.01298589065779011
about                0.0028082698488385593
above                0.003944021514914812
abovementioned       0.02921211714692628
abrupt               0.009859209718329077
abstract             0.004184247891380116
abuse                0.005595275303462857
accept               0.004499175751650634


## Find Top Terms in a Single Document
Finally, let's sort the terms by their [TF-IDF](https://docs.constellate.org/key-terms/#tf-idf) weights to find the most significant terms in the document.

In [19]:
# Sort the tuples in our tf-idf scores list

# Choosing a document by its index number
# Change n to see a different document
n = 0

def sort_by_score(tfidf_tuples):
    """Return the (token id, score) tuples sorted by tf-idf score, highest first."""
    return sorted(tfidf_tuples, key=lambda pair: pair[1], reverse=True)

# Print the document id and title
print('Title: ', document_titles[n])
print('ID: ', document_ids[n])
print('----------------------------------------')

# List the top twenty tokens in our example document by their TF-IDF scores
# First we sort the tokens with their scores
most_significant_terms = sort_by_score(corpus_tfidf[n])[:20]

# Next we print the list, replacing the token ids with the tokens
for token_id, score in most_significant_terms:
    print(dictionary[token_id].ljust(20), score)

Title:  Russell and the Handshake: Greeting in Spiritual Care
ID:  ark://27927/phzdsj96w2k
----------------------------------------
handshake            0.5421573560361246
greeting             0.5386189238803416
handshakes           0.3350811687564916
counseling           0.17683069345124747
spiritual            0.15648025034313418
greetings            0.11831051661994892
vanier               0.10498328303848525
greet                0.09633719761236037
welcoming            0.09104484561350143
client               0.08559527917188679
nonverbal            0.08416677368770462
retrieved            0.08194036105215129
touch                0.08108127784155768
handshaking          0.0788426279427039
renison              0.06958526244664735
pastoral             0.06948508256655142
wesson               0.0692815624197191
therapeutic          0.06922484107943937
awkward              0.06538760890308223
greets               0.061014665548110615


We could also analyze across the entire corpus to find the most unique terms. These are terms that appear in a particular text, but rarely or never appear in other texts. (Often, these will be proper names since a particular article may mention a name often but the name may rarely appear in other articles. There's also a fairly good chance these will be typos or errors in optical character recognition.)

In [20]:
# Record the highest TF-IDF score that each term reaches in any document
td = {}
for document in corpus_tfidf:
    for token_id, score in document:
        term = dictionary.get(token_id)
        # Keep the score only if it beats the best one seen so far for this term
        if score > td.get(term, 0):
            td[term] = score

In [21]:
# Sort the items of ``td`` into a new variable ``sorted_td``
# the ``reverse`` starts from highest to lowest
sorted_td = sorted(td.items(), key=lambda kv: kv[1], reverse=True) 

for term, weight in sorted_td[:25]: # Print the top 25 terms in the entire corpus
    print(term, weight)

ornstein 0.9566290035722528
depuy 0.9469110306689708
pornography 0.9448809540950692
transgender 0.9370654172748747
jurats 0.9214825375935485
circumcision 0.921159716804361
hydroxyurea 0.9190001229829636
stimpson 0.9169873171776958
shuttlesworth 0.9112300665177129
methadone 0.9108329742182604
pushkin 0.9054269106510094
fundraising 0.9047704684885292
bayesian 0.9029780119877477
agresto 0.9004906816196825
toxicology 0.8969706143259554
psyop 0.894534379139476
disability 0.8934702105529245
sherley 0.8928710413351351
justifications 0.8903675573047136
productivity 0.8870315568043946
latinos 0.8822137514301951
footlight 0.8812132515545672
mungiki 0.8747949824667993
sursurunga 0.8717689162716556
coudert 0.8693206192317072


## Display Most Significant Term for each Document
We can see the most significant term in every document.

In [22]:
# For each document, print the ID, most significant/unique word, and TF/IDF score

n = 0

for n, doc in enumerate(corpus_tfidf):
    if len(doc) < 1:
        continue
    word_id, score = max(doc, key=lambda x: x[1])
    print(document_ids[n], dictionary.get(word_id), score)
    if n >= 10:
        break

ark://27927/phzdsj96w2k handshake 0.5421573560361246
ark://27927/phzbq5pkrd8 algorithmic 0.4858308014428951
ark://27927/phxwvkbqgn testing 0.5042442811971646
ark://27927/phx829nc410 nnos 0.5090595274585864
https://chroniclingamerica.loc.gov/lccn/sn94057002/1921-09-23/ed-1 edgeworth 0.28665957481825827
ark://27927/phw1hwdh22j natal 0.4358440931638784
ark://27927/phxwn96wc4 nordic 0.5998063952186988
ark://27927/phx85j9xkxb mystery 0.45488544925971786
ark://27927/phx2bnw9171 preferential 0.5276347400683752
ark://27927/pbd93gdsqn muslim 0.6256232220007466
https://chroniclingamerica.loc.gov/lccn/sn84026824/1914-04-23/ed-1 shepherdstown 0.32450991408598096


## Ranking documents by TF-IDF Score for a Search Word

We can also pick a single search term and rank the documents by that term's TF-IDF score. The first cell below builds a lookup from each term to the documents that contain it, and the second cell sorts the matching documents for one term.


In [23]:
# Set a limit on the number of documents analyzed
limit = 500

from collections import defaultdict
terms_to_docs = defaultdict(list)
for doc_id, doc in enumerate(corpus_tfidf):
    for term_id, value in doc:
        term = dictionary.get(term_id)
        terms_to_docs[term].append((doc_id, value))
    if doc_id >= limit:
        break


In [24]:
# Pick a unigram to discover its score across documents
search_term = 'coriolanus'

# Display a list of documents and scores for the search term

matching = terms_to_docs.get(search_term) # Returns None if the term is absent

if matching is None:
    print('Search term not found. Change the term or expand the corpus size.')
else:
    # Sort the matching documents from highest to lowest score
    for doc_id, score in sorted(matching, key=lambda x: x[1], reverse=True):
        print(document_ids[doc_id], score)

Search term not found. Change the term or expand the corpus size.
