By <a href="https://nkelber.com">Nathan Kelber</a> and Ted Lawless <br />
For questions/comments/improvements, email nathan.kelber@ithaka.org.
____

# Finding Significant Words within a Dataset Using TF/IDF
**Difficulty:** Intermediate

**Programming Knowledge Required:** 
This notebook can be run on a JSTOR/Portico [non-consumptive](./key-terms.ipynb#non-consumptive) [JSON Lines (.jsonl)](./key-terms.ipynb#jsonl) [dataset](./key-terms.ipynb#dataset) with little to no knowledge of [Python](./key-terms.ipynb#python). To have a full understanding of the code used in this [notebook](./key-terms.ipynb#jupyter-notebook), we recommend learning:
* [Python Basics](https://automatetheboringstuff.com/2e/chapter1/)
* [Flow Control](https://automatetheboringstuff.com/2e/chapter2/)
* [Functions](https://automatetheboringstuff.com/2e/chapter3/)
* [Lists](https://automatetheboringstuff.com/2e/chapter4/)
* [Dictionaries](https://automatetheboringstuff.com/2e/chapter5/)

**Completion time:** 75 minutes

**Data Format:** [JSTOR](./key-terms.ipynb#jstor) and/or [Portico](./key-terms.ipynb#portico) [non-consumptive](./key-terms.ipynb#non-consumptive) [JSON Lines (.jsonl)](./key-terms.ipynb#jsonl)

**Libraries Used:**
* **[json](./key-terms.ipynb#json-python-library)** to convert our dataset from json lines format to a Python list
* **[gensim](./key-terms.ipynb#gensim)** to help compute the [tf-idf](./key-terms.ipynb#tf-idf) calculation

**Description of methods in this notebook:**
This [notebook](./key-terms.ipynb#jupyter-notebook) shows how to discover significant words in your [JSTOR](./key-terms.ipynb#jstor) and/or [Portico](./key-terms.ipynb#portico) [dataset](./key-terms.ipynb#dataset) using [Python](./key-terms.ipynb#python). The method for finding significant terms is [tf-idf](./key-terms.ipynb#tf-idf).  The following processes are described:

* Converting your [JSTOR](./key-terms.ipynb#jstor) and/or [Portico](./key-terms.ipynb#portico)[dataset](./key-terms.ipynb#dataset) into a Python list
* Writing a helper function to help clean up a single [token](./key-terms.ipynb#token)
* Cleaning each document of your dataset, one [token](./key-terms.ipynb#token) at a time
* Using a dictionary of English words to remove words with poor [OCR](./key-terms.ipynb#ocr)
* Computing the most significant words in your [corpus](./key-terms.ipynb#corpus) using [TFIDF](./key-terms.ipynb#tf-idf) with the [gensim](./key-terms.ipynb#gensim) library

A familiarity with [gensim](./key-terms.ipynb#gensim) is helpful but not required.
____

## Understanding "Term Frequency- Inverse Document Frequency" (TF-IDF)

TF-IDF is used in machine learning and text mining for measuring the significance of particular terms for a given document. It consists of two parts that are multiplied together:

1. Term Frequency- A measure of how many times a given word appears in a document
2. Inverse Document Frequency- A measure of how many times the same word occurs in other documents within the corpus

If we were to merely consider word frequency, the most frequent words would be common function words like: "the", "and", "of". We could use a stopwords list to remove the common function words, but that still may not give us results that describe the unique terms in the document since the uniqueness of terms depends on the context of a larger body of documents. In other words, the same term could be significant or insignificant depending on the context. Consider these examples:

* Given a set of scientific journal articles in biology, the term "lab" may not be significant since biologists often rely on and mention labs in their research. However, if the term "lab" were to occur frequently in a history or English article, then it is likely to be significant since humanities articles rarely discuss labs. 
* If we were to look at thousands of articles in literary studies, then the term "postcolonial" may be significant for any given article. However, if were to look at a few hundred articles on the topic of "the global south," then the term "postcolonial" may occur so frequently that it is not a significant way to differentiate between the articles.

The TF-IDF calculation reveals the words that are frequent in this document **yet rare in other documents**. The goal is to find out what is unique or remarkable about a document given the context and that context can change the results of the analysis. 

Here is how the calculation is mathematically written:

$$tfidf_{t,d} = tf_{t,d} \cdot idf_{t,D}$$

In plain English, this means: **The value of TF-IDF is the product (or multiplication) of a given term's frequency multiplied by its inverse document frequency.** Let's unpack these terms one at a time.

### Term Frequency Function

$$tf_{t,d}$$
The number of times (t) a term occurs in a given document (d)

### Inverse Document Frequency Function

$$idf_i = \mbox{log} \frac{N}{|{d : t_i \in d}|}$$
The inverse document frequency can be expanded to the calculation on the right. In plain English, this means: **The log of the total number of documents (N) divided by the number of documents that contain the term**

### TF-IDF Calculation in Plain English

$$ tf-idf = (Number-of-times-the-word-occurs-in-given-document) \cdot \mbox{log} \frac{(Total-number-of-documents)}{(Total-number-of-documents-containing-the-word)}$$

There are variations on the TF-IDF formula, but this is the most widely-used version.

### An Example Calculation of TF-IDF

Let's take a look at an example to illustrate the fundamentals of TF-IDF. First, we need several texts to compare. Our texts will be very simple.

* text1 = 'The grass was green and spread out the distance like the sea.'
* text2 = 'Green eggs and ham were spread out like the book.'
* text3 = 'Green sailors were met like the sea met troubles.'
* text4 = 'The grass was green.'

The first step is we need to discover how many unique words are in each text. 

|text1|text2|text3|text4|
|    ---    | ---| --- | --- |
|the|green|green|the|
|grass|eggs|sailors|grass|
|was|and|were|was|
|green|ham|met|green|
|and|were|like| |
|spread|spread|the| |
|out|out|sea| |
|into|like|met| |
|distance|the|troubles| |
|like|book| | |
|sea| | | |


Our four texts share some similar words. Next, we create a single list of unique words that occur across all three texts.

|Unique Words|
| --- |
|and|
|book|
|distance|
|eggs|
|grass|
|green|
|ham|
|like|
|met|
|out|
|sailors|
|sea|
|spread|
|the|
|troubles|
|was|
|were|

Now let's count the occurences of each unique word in each sentence

|word|text1|text2|text3|text4|
|---|---|---|---|---|
|and|1|1|0|0|
|book|0|1|0|0|
|distance|1|0|0|0|
|eggs|0|1|0|0|
|grass|1|0|0|1|
|green|1|1|1|1|
|ham|0|1|0|0|
|like|1|1|1|0|
|met|0|0|2|0|
|out|1|1|0|0|
|sailors|0|0|1|0|
|sea|1|0|1|0|
|spread|1|1|0|0|
|the|3|1|1|1|
|troubles|0|0|1|0|
|was|1|0|0|1|
|were|0|1|1|0|

### Computing TF-IDF (Example 1)

We have enough information now to compute TF-IDF for every word in our corpus. Recall the plain English formula.

$$ tf-idf = (Number-of-times-the-word-occurs-in-given-document) \cdot \mbox{log} \frac{(Total-number-of-documents)}{(Total-number-of-documents-containing-the-word)}$$

We can use the formula to compute TF-IDF for the most common word in our corpus: 'the'. We will compute TF-IDF four times, once for each of our texts. 

|word|text1|text2|text3|text4|
|---|---|---|---|---|
|the|3|1|1|1|

text1: $$ tf-idf = 3 \cdot \mbox{log} \frac{4}{(4)} = 3 \cdot \mbox{log} 1 = 3 \cdot 0 = 0$$
text2: $$ tf-idf = 1 \cdot \mbox{log} \frac{4}{(4)} = 1 \cdot \mbox{log} 1 = 1 \cdot 0 = 0$$
text3: $$ tf-idf = 1 \cdot \mbox{log} \frac{4}{(4)} = 1 \cdot \mbox{log} 1 = 1 \cdot 0 = 0$$
text4: $$ tf-idf = 1 \cdot \mbox{log} \frac{4}{(4)} = 1 \cdot \mbox{log} 1 = 1 \cdot 0 = 0$$

The results of our analysis suggest 'the' has a weight of 0 in every document. The word 'the' exists in all of our documents, and therefore it is not a significant term to differentiate one document from another.

Given that idf is

$$\mbox{log} \frac{(Total-number-of-documents)}{(Total-number-of-documents-containing-the-word)}$$

and 

$$\mbox{log} 1 = 0$$
we can see that TF-IDF will be 0 for any word that occurs in every document. That is, if a word occurs in every document, then it is not a significant term for any individual document.



### Computing TF-IDF (Example 2)

Let's try a second example with the word 'out'. Recall the plain English formula.

$$ tf-idf = (Number-of-times-the-word-occurs-in-given-document) \cdot \mbox{log} \frac{(Total-number-of-documents)}{(Total-number-of-documents-containing-the-word)}$$

We will compute TF-IDF four times, once for each of our texts. 

|word|text1|text2|text3|text4|
|---|---|---|---|---|
|out|1|1|0|0|

text1: $$ tf-idf = 1 \cdot \mbox{log} \frac{4}{(2)} = 1 \cdot \mbox{log} 2 = 1 \cdot .3010 = .3010$$
text2: $$ tf-idf = 1 \cdot \mbox{log} \frac{4}{(2)} = 1 \cdot \mbox{log} 2 = 1 \cdot .3010 = .3010$$
text3: $$ tf-idf = 0 \cdot \mbox{log} \frac{4}{(2)} = 0 \cdot \mbox{log} 2 = 0 \cdot .3010 = 0$$
text4: $$ tf-idf = 0 \cdot \mbox{log} \frac{4}{(2)} = 0 \cdot \mbox{log} 2 = 0 \cdot .3010 = 0$$

The results of our analysis suggest 'out' has some significance in text1 and text2, but no significance for text3 and text4 where the word does not occur.

### Computing TF-IDF (Example 3)

Let's try one last example with the word 'met'. Here's the tf-idf formula again:

$$ tf-idf = (Number-of-times-the-word-occurs-in-given-document) \cdot \mbox{log} \frac{(Total-number-of-documents)}{(Total-number-of-documents-containing-the-word)}$$

And here's how many times the word 'met' occurs in each text.

|word|text1|text2|text3|text4|
|---|---|---|---|---|
|met|0|0|2|0|

text1: $$ tf-idf = 0 \cdot \mbox{log} \frac{4}{(1)} = 0 \cdot \mbox{log} 4 = 1 \cdot .6021 = 0$$
text2: $$ tf-idf = 0 \cdot \mbox{log} \frac{4}{(1)} = 0 \cdot \mbox{log} 4 = 1 \cdot .6021 = 0$$
text3: $$ tf-idf = 2 \cdot \mbox{log} \frac{4}{(1)} = 2 \cdot \mbox{log} 4 = 1 \cdot .6021 = 1.2042$$
text4: $$ tf-idf = 0 \cdot \mbox{log} \frac{4}{(1)} = 0 \cdot \mbox{log} 4 = 1 \cdot .6021 = 0$$

As should be expected, we can see that the word 'met' is very significant in text3 but not significant in any other text since it does not occur in any other text. 

## The Full TF-IDF Example Table

Here are the original sentences for each text:

* text1 = 'The grass was green and spread out the distance like the sea.'
* text2 = 'Green eggs and ham were spread out like the book.'
* text3 = 'Green sailors were met like the sea met troubles.'
* text4 = 'The grass was green.'

And here's the corresponding TF-IDF scores for each word in each text:

|word|text1|text2|text3|text4|
|---|---|---|---|---|
|and|.3010|.3010|0|0|
|book|0|.6021|0|0|
|distance|.6021|0|0|0|
|eggs|0|.6021|0|0|
|grass|.3010|0|0|.3010|
|green|0|0|0|0|
|ham|0|.6021|0|0|
|like|.1249|.1249|.1249|0|
|met|0|0|.6021|0|
|out|.3010|.3010|0|0|
|sailors|0|0|.6021|0|
|sea|.3010|0|.3010|0|
|spread|.3010|.3010|0|0|
|the|0|0|0|0|
|troubles|0|0|.6021|0|
|was|.3010|0|0|.3010|
|were|0|.3010|.3010|0|

There are a few noteworthy things in this data. 

* The TF-IDF score for any word that does not occur in a text is 0.
* The scores for almost every word in text4 are 0 since it is a shorter version of text1. There are no unique words in text4 since text1 contains all the same words. It is also a short text which means that there are only four words to consider. The words 'the' and 'green' occur in every text, leaving only 'was' and 'grass' which are also found in text1.
* The words 'book', 'eggs', and 'ham' are significant in text2 since they only occur in that text.

Now that you have a basic understanding of how TF-IDF is computed at a small scale, let's try computing TF-IDF on a JSTOR/Portico corpus which could contain millions of words.

---

# Computing TF-IDF with your JSTOR/Portico Dataset

## Importing your dataset

You have two options for bringing your dataset into the local environment:

1. Manually download and upload your dataset
2. Use a dataset id to automatically upload a dataset

### Option one: Manually download and upload your dataset

You can download your dataset from the corpus builder in the link shown below. (You may also have a link to your dataset in your email.) If you wish, you can modify your dataset on your local machine before the next upload phase. This gives you some more flexibility than automatically pulling in your dataset using a dataset ID using option 2 below.

![The link for downloading your dataset](https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/downloadDataset.png)

Once you have your dataset ready on your local machine, you can then upload your dataset into JupyterLab by clicking the upload button in the file pane on the left.

![The upload button in the file pane](https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/uploadDataset.png)

Make sure to upload your dataset to the "datasets" folder. 

### Option Two: Use a Dataset ID to automatically upload a dataset

You'll use the tdm_client library to automatically upload your dataset. We import the `Dataset` module from the `tdm_client` library. The tdm_client library contains functions for connecting to the JSTOR server containing our [corpus](./key-terms.ipynb#corpus) [dataset](./key-terms.ipynb#dataset). To analyze your dataset, use the [dataset ID](./key-terms.ipynb/#dataset-ID) provided when you created your [dataset](./key-terms.ipynb/#dataset). A copy of your [dataset ID](./key-terms.ipynb/#dataset-ID) was sent to your email when you created your [corpus](./key-terms.ipynb#corpus). It should look like a long series of characters surrounded by dashes. 

In [1]:
#Importing your dataset with a dataset ID
import tdm_client
tdm_client.get_dataset("f6ae29d4-3a70-36ee-d601-20a8c0311273", "sampleJournalAnalysis") #Load the sample dataset, the full run of Shakespeare Quarterly from 1950-2013.

# Other humanities datasets:

#English
# Negro American Literature Forum (1967-1976) + Black American Literature Forum (1976-1991) + African American Review (1992-2016) (b4668c50-a970-c4d7-eb2c-bb6d04313542)
# Shakespeare Quarterly (1950-2013) (f6ae29d4-3a70-36ee-d601-20a8c0311273)
# ELH (1934-2014) (4999901a-fa17-31da-cfe5-2abf3a429df7)
# College English (1939-2016) (a161f384-720b-b6bf-a0cc-4d7d3b857e1c)
# PMLA (1889-2014) (1aea53b9-26d5-fe54-e35c-8259156ce6cd)

#History

#Philosophy

#Anthropology

#Law

#Art

#Classics
#Classical Quarterly (1907-2014) (82014740-8ed9-3c34-5716-d0879b8317f6)

'datasets/sampleJournalAnalysis.jsonl'

Before we can begin working with our [dataset](./key-terms.ipynb#dataset), we need to convert the [JSON lines](./key-terms.ipynb#jsonl) file written in [JavaScript](./key-terms.ipynb#javascript) into [Python](./key-terms.ipynb#python) so we can work with it. Remember that each line of our [JSON lines](./key-terms.ipynb#jsonl) file represents a single text, whether that is a journal article, book, or something else. We will create a [Python](./key-terms.ipynb#python) list that contains every document. Within each list item for each document, we will use a [Python dictionary](./key-terms.ipynb#python-dictionary) of [key/value pairs](./key-terms.ipynb#key-value-pair) to store information related to that document. 

Essentially we will have a [list](./key-terms.ipynb#python-list) of documents numbered, from zero to the last document. Each [list](./key-terms.ipynb#python-list) item then will be composed of a [dictionary](./key-terms.ipynb#python-dictionary) of [key/value pairs](./key-terms.ipynb#key-value-pair) that allows us to retrieve information from that particular document by number. The structure will look something like this:

![Structure of the corpus, a list of dictionaries](https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/CorpusView.png)

For each item in our list we will be able to use [key/value pairs](./key-terms.ipynb#key-value-pair) to get a **value** if we supply a **key**. We will call our [Python list](./key-terms.ipynb#python-list) variable `all_documents` since it will contain all of the documents in our [corpus](./key-terms.ipynb#corpus).

In [2]:
# Replace with your filename and be sure your file is in your datasets folder
file_name = 'sampleJournalAnalysis.jsonl' 

# Import the json module
import json
# Create an empty new list variable named `all_documents`
all_documents = [] 
# Temporarily open the file `filename` in the datasets/ folder
with open('./datasets/' + file_name) as dataset_file: 
    #for each line in the dataset file
    for line in dataset_file: 
        # Read each line into a Python dictionary.
        # Create a variable document that contains the line using json.loads to convert the json key/value pairs to a python dictionary
        document = json.loads(line) 
        # Append a new list item to `all_documents` containing the dictionary we created.
        all_documents.append(document) 

Now all of our documents have been converted from our original [JSON lines](./key-terms.ipynb#jsonl) file format (.jsonl) into a [python List](./key-terms.ipynb#python-list) variable named `all_documents`. Let's see what we can discover about our [corpus](./key-terms.ipynb#corpus) with a few simple methods.

First, we can determine how many texts are in our [dataset](./key-terms.ipynb#dataset) by using the `len()` function to get the size of `all_documents`. 

In [3]:
len(all_documents)

6687

---
## Removing Articles That Are Not Full-Length

When journal articles are added to JSTOR, they are broken up into chunks called articles. If we want to analyze every word in every issue, this approach works well. However, if we only want to analyze the full-length articles, we may want to remove some articles. These articles could be things like:

* Tables of Contents
* Indices
* Unauthored Materials
* Short Notes
* Book Reviews

We can design a set of assessments that will remove these materials. Depending on the journal and the field, shorter articles may still be relevant. The example code below demonstrates a set of rules for filtering using the article metadata.

### Articles To Be Removed
* Articles with no authors
* Articles with title: "Review Article"
* Articles with title: "Front Matter"
* Articles with title: "Back Matter"
* Articles with a word count less than 3000 words

The function below outputs a list of the first ten articles followed by the reason each article would or would not be kept. This can be a useful exploratory tool for getting the best starting corpus. You may need to adjust the word count number up or down, particularly for distinguishing between say short notes and full-length articles. You might also consider writing other metadata field tests to narrow your corpus. **Note: The Article Assessment Exploration code below does not change your corpus found within ``all_documents``. It merely serves as a convenient way to consider the logic of your filtering.**

In [4]:
#Article Assessment Exploration

## Define a function ``remove_non_articles`` that will test a single document
def remove_non_articles(test_doc):
    print('Article ' + str(i) + ':') # Print the list index for each article so they can be easily referenced
    print('Title: ' + test_doc.get('title')) # Print the title for the article in question
    print('URL: ' + test_doc.get('id')) # Print the URL for the article so it can be quickly reviewed
    print('Status: ', end='') # Print the phrase 'Status' without a following line-break
    if test_doc.get('creators') == None: # Get the value for the key 'creators' in the test document and check if it is equal to none
        print('Removed--No author') # If the value for 'creators' is none, print 'Removed--No author'
    elif test_doc.get('title') == 'Review Article': # Get the value for the key 'title' in the test document and check if it is equal to 'Review Article'
        print('Removed--Review Article') # If the value for 'title' is 'Review Article', print 'Removed--Review Article'
    elif test_doc.get('title') == 'Front Matter': # Get the value for the key 'title' in the test document and check if it is equal to 'Front Matter'
        print('Removed--Front Matter') # If the value for 'title' is 'Front Matter', print 'Removed--Front Matter'
    elif test_doc.get('title') == 'Back Matter': # Get the value for the key 'title' in the test document and check if it is equal to 'Back Matter'
        print('Removed--Back Matter')  # If the value for 'title' is 'Back Matter', print 'Removed--Back Matter'
    elif test_doc.get('wordCount') < 3000: # Get the value for 'wordCount' in the test document and check if the integer is less than 3000 words (Change this number if you want more or less words)
        print('Removed--Too short at ' + str(test_doc.get('wordCount')) + ' words') # If the value for wordCount is less than 3000, print 'Removed--Too short at' followed by the actual word count
    else:
        print('GOOD ARTICLE at '+ str(test_doc.get('wordCount')) + ' words') # If the article passes all the above tests, print 'GOOD ARTICLE at ' with the article word count

articles_to_show = 10 # Show the first ten articles (Change this number to show more or fewer articles)
#articles_to_show = len(all_documents) # Uncomment to show all articles
for i in range(articles_to_show): # Repeat this process the number of times as the value of ``articles_to_show`` + 1
    remove_non_articles(all_documents[i])  # Run the remove_non_articles function on a single document with list index value of ``i``

Article 0:
Title: Review Article
URL: http://www.jstor.org/stable/2869980
Status: Removed--Review Article
Article 1:
Title: Shakespeare in Sydney
URL: http://www.jstor.org/stable/2870198
Status: Removed--Too short at 2032 words
Article 2:
Title: Shakespeare in the Berkshires, 1985
URL: http://www.jstor.org/stable/2870199
Status: Removed--Too short at 1805 words
Article 3:
Title: Review Article
URL: http://www.jstor.org/stable/2870209
Status: Removed--Review Article
Article 4:
Title: Review Article
URL: http://www.jstor.org/stable/2870208
Status: Removed--Review Article
Article 5:
Title: "Thrift is Blessing": Exchange and Explanation In The Merchant of Venice
URL: http://www.jstor.org/stable/2870189
Status: GOOD ARTICLE at 9332 words
Article 6:
Title: 'Willm̄ Shakespeare 1609': The Flower Portrait Revisited
URL: http://www.jstor.org/stable/2870193
Status: GOOD ARTICLE at 5803 words
Article 7:
Title: Creative Uncreation In King Lear
URL: http://www.jstor.org/stable/2870188
Status: GOOD A

Now that we've done some exploratory analysis to figure out the right parameters for filtering our corpus, let's put them into practice. The following set of list comprehensions creates a new list called ``reduced_list`` from our original corpus ``all_documents``. After each filtering, the number of kept articles is printed.

In [5]:
print('Original number of documents: ' + str(len(all_documents))) # Print the original number of documents in ``all_documents``
reduced_list = [all_documents[x] for x in range(len(all_documents)) if all_documents[x].get('creators') != None] # Copy each list item from ``all_documents`` to ``reduced_list`` if the ``creators`` key does not have a value pair of None
print('After removing articles with no authors: ' + str(len(reduced_list))) # Print the current size of ``reduced_list``
reduced_list = [all_documents[x] for x in range(len(reduced_list)) if all_documents[x].get('title') != 'Review Article'] # Copy each list item from ``reduced_list`` to ``reduced_list`` if the ``title`` key does not have a value pair of 'Review Article'
print('After removing "Review Articles": ' + str(len(reduced_list))) # Print the current size of ``reduced_list``
reduced_list = [all_documents[x] for x in range(len(reduced_list)) if reduced_list[x].get('title') != 'Front Matter'] # Copy each list item from ``reduced_list`` to ``reduced_list`` if the ``title`` key does not have a value pair of 'Front Matter'
print('After removing articles labeled "Front Matter": ' + str(len(reduced_list))) # Print the current size of ``reduced_list``
reduced_list = [all_documents[x] for x in range(len(reduced_list)) if reduced_list[x].get('title') != 'Back Matter'] # Copy each list item from ``reduced_list`` to ``reduced_list`` if the ``title`` key does not have a value pair of 'Back Matter'
print('After removing articles labeled "Back Matter": ' + str(len(reduced_list))) # Print the current size of ``reduced_list``
reduced_list = [all_documents[x] for x in range(len(reduced_list)) if reduced_list[x].get('wordCount') < 3000] # Copy each list item from ``reduced_list`` to ``reduced_list`` if the ``wordCount`` has a value pair less than 3000
print('After removing short articles: ' + str(len(reduced_list))) # Print the current size of ``reduced_list``

Original number of documents: 6687
After removing articles with no authors: 5303
After removing "Review Articles": 3610
After removing articles labeled "Front Matter": 3413
After removing articles labeled "Back Matter": 3276
After removing short articles: 2399


## Cleaning Up the Tokens in the Corpus

Let's create a helper function that can standardize and [clean](./key-terms.ipynb#clean-data) up the [tokens](./key-terms.ipynb#token) in our [dataset](./key-terms.ipynb#dataset). The function will:
* lower case all [tokens](./key-terms.ipynb#token)
* use a dictionary from [The HathiTrust Research Center](./key-terms.ipynb#htrc) to correct common [Optical Character Recognition](./key-terms.ipynb#ocr) problems
* discard [tokens](./key-terms.ipynb#token) less than 4 characters in length
* discard [tokens](./key-terms.ipynb#token) with non-alphabetical characters
* remove [stopwords](./key-terms.ipynb#stop-words) based on [The HathiTrust Research Center](./key-terms.ipynb#htrc) [stopword](./key-terms.ipynb#stop-words) list

In [6]:
from tdm_client import htrc_corrections # Import the htrc_corrections that helps correct common OCR problems

def process_token(token): #define a function `process_token` that takes the argument `token`
    token = token.lower() #set the string in token to a new string with all lowercase letters
    corrected = htrc_corrections.get(token) #initialize a new variable `corrected` that runs token through the `htrc_corrections.get()` function to fix common OCR errors
    if corrected is not None: #if corrected has a value, set the `token` variable to the same value as `corrected`
        token = corrected
    if len(token) < 4: #if token is less than four characters, return nothing for process_function (no output here essentially erases this token)
        return
    if not(token.isalpha()): #if token contains non-alphabetic characters, return nothing for process_function (no output here essentially erases this token)
        return
    return token #return the `token` variable which has been set equal to the `corrected` variable

def process_document(chosen_document): # Create a new function ``process_document`` that takes the argument chosen_document
    this_doc = [] # Create a new list ``this_doc`` that will hold the contents of the current document
    singleDoc = chosen_document.get('unigramCount') # Create a list variable ``singleDoc` that will contain the contents of `unigramCount` for the current document
    for token, count in singleDoc.items(): # For each token in the document, 
        clean_token = process_token(token) # Use the ``process_token`` function above to clean that token
        if clean_token is None: # If there is no token returned, proceed
            continue
        this_doc += [clean_token] * count # Add to ``this_doc`` list the number of token occurences
    documents.append(this_doc) # Add the token count results for ``this_doc`` to the ``documents`` list

Now let's cycle through each document in the [corpus](./key-terms.ipynb#corpus) with our helper function.

In [7]:
documents = [] # An empty variable ``documents`` that will contain all our documents with cleaned tokens
for i in range(len(reduced_list)): # Repeat this process once for every document in ``reduced_list``
    process_document(reduced_list[i]) # Run the ``process_token`` function on the single article by reference to its index number of i

---
# Using Gensim to Compute "Term Frequency- Inverse Document Frequency"

It will be helpful to remember the basic steps we did in the explanatory TF-IDF example:

1. Create a list of the frequency of every word in every document
2. Create a list of every word in the corpus
3. Compute TF-IDF based on that data

So far, we have completed the first item by creating a list of the frequency of every word in every document. Now we need to create a list of every word in the corpus. In gensim, this is called a "dictionary". A "gensim dictionary" is a kind of Python dictionary, but here it is called a **gensim dictionary** to show it is a specialized kind of Python dictionary.

## Creating a Gensim Dictionary

Let's create our gensim dictionary. A gensim dictionary is a kind of masterlist of all the words across all the documents in our corpus. Each unique word is assigned an ID in the gensim dictionary. The result is a set of key/value pairs of unique tokens and their unique IDs.

In [8]:
import gensim 
dictionary = gensim.corpora.Dictionary(documents) # Create the Gensim dictionary based on our ``documents`` variable

Now that we have a gensim dictionary, we can get a preview that displays the number of unique tokens across all of our texts.

In [9]:
print(dictionary)

Dictionary(122408 unique tokens: ['account', 'accounts', 'acted', 'acting', 'admirably']...)


The gensim dictionary stores a unique identifier (starting with 0) for every token in the corpus. The gensim dictionary does not contain information on word frequencies; it only catalogs all the words in the corpus. You can see the unique ID for each token in the text using the .token2id() method. Before running the following line of code, keep in mind that your corpus may have hundreds of thousands of unique words so the output will be lengthy.

In [64]:
#print(dictionary.token2id) # Print all of the tokens and their associated IDs. The output will be very large.


We can also look up the corresponding ID for a token using the .get method.

In [44]:
dictionary.token2id.get('people', 0) # Get the value for the key 'people'. Return 0 if there is no token matching 'people'. The number returned is the gensim dictionary ID for the token. 

647

## Creating a Bag of Words Corpus


### A Single Document Example

The next step is to combine our word frequency data found within ``documents`` and our gensim dictionary token IDs. For every document, we want to know how many times a word (notated by its ID) occurs. We can do a single document first to show how this works. We will create a Python list called ``example_bow_corpus`` that will turn our word counts into a series of two-number sets where the first number is the token ID and the second number is the word frequency.

In [10]:
example_bow_corpus = [dictionary.doc2bow(documents[31])] # Create an example bag of words corpus. We select a document at random to use as our sample.
print(example_bow_corpus)

[[(0, 1), (2, 1), (6, 2), (8, 1), (15, 4), (21, 3), (27, 1), (55, 1), (78, 1), (89, 3), (98, 1), (107, 6), (118, 4), (135, 5), (140, 2), (143, 1), (150, 2), (167, 4), (168, 2), (174, 1), (178, 3), (183, 1), (188, 3), (190, 2), (205, 3), (208, 1), (214, 1), (231, 1), (234, 2), (239, 1), (240, 5), (252, 2), (253, 1), (258, 2), (266, 11), (267, 3), (269, 2), (272, 1), (273, 12), (274, 1), (281, 1), (292, 11), (293, 3), (295, 2), (297, 1), (298, 2), (299, 17), (305, 2), (310, 1), (312, 1), (321, 1), (325, 1), (334, 1), (338, 3), (340, 1), (387, 1), (390, 1), (411, 1), (416, 1), (423, 1), (445, 1), (460, 2), (477, 1), (509, 1), (519, 1), (530, 2), (536, 1), (549, 1), (553, 1), (565, 2), (571, 4), (599, 2), (632, 1), (650, 1), (662, 2), (676, 3), (694, 2), (722, 3), (731, 1), (732, 4), (734, 2), (738, 4), (751, 1), (769, 3), (772, 1), (779, 1), (784, 1), (792, 1), (800, 1), (818, 4), (821, 1), (823, 1), (826, 3), (855, 2), (860, 1), (875, 1), (877, 1), (878, 1), (879, 3), (891, 3), (895, 1),

Using IDs can seem a little abstract, but we can discover the word associated with a particular ID. For demonstration purposes, the following code will replace the token IDs in the last example with the actual tokens.

In [21]:
word_counts = [[(dictionary[id], count) for id, count in line] for line in example_bow_corpus]
print(word_counts)

[[('account', 1), ('acted', 1), ('also', 2), ('another', 1), ('august', 4), ('been', 3), ('both', 1), ('costumes', 1), ('early', 1), ('even', 3), ('filled', 1), ('from', 6), ('have', 4), ('james', 5), ('kind', 2), ('later', 1), ('made', 2), ('more', 4), ('most', 2), ('nothing', 1), ('only', 3), ('other', 1), ('performance', 3), ('perhaps', 2), ('presented', 3), ('probably', 1), ('quarterly', 1), ('reviews', 1), ('running', 2), ('shakespeare', 1), ('shakespearean', 5), ('stage', 2), ('staging', 1), ('such', 2), ('that', 11), ('theatre', 3), ('their', 2), ('they', 1), ('this', 12), ('thorough', 1), ('university', 1), ('were', 11), ('when', 3), ('which', 2), ('whose', 1), ('will', 2), ('with', 17), ('about', 2), ('action', 1), ('actors', 1), ('among', 1), ('appeared', 1), ('attempt', 1), ('audience', 3), ('back', 1), ('christian', 1), ('clear', 1), ('could', 1), ('created', 1), ('curtain', 1), ('during', 1), ('equally', 2), ('face', 1), ('gave', 1), ('hand', 1), ('himself', 2), ('huge', 1

We saw before that you could discover the gensim dictionary ID number by running:

> dictionary.token2id.get('people', 0)

If you wanted to discover the token given only the ID number, the method is a little more involved. You could use list comprehension to find the **key** token based on the **value** ID. Normally, Python dictionaries only map from keys to values (not from values to keys). However, we can write a quick list comprehension to go the other direction. (It is unlikely one would ever do these methods in practice, but they are shown here to demonstrate how the gensim dictionary is connected to the list entries in the gensim bow_corpus. 

In [11]:
[token for dict_id, token in dictionary.items() if dict_id == 239] # Find the corresponding token in our gensim dictionary for the gensim dictionary ID 239

['shakespeare']

## Creating a Bag of Words Corpus Using Every Document

We have seen an example that demonstrates how the gensim bag of words corpus works on a single document. Let's apply it now to all of our documents. 

In [12]:
bow_corpus = [dictionary.doc2bow(doc) for doc in documents]
#print(bow_corpus[:3]) #Show the bag of words corpus for the first 3 documents

The next step is to create the TF-IDF model which will set the parameters for our implementation of TF-IDF. In our TF-IDF example, the formula for TF-IDF was:

$$ tf-idf = (Number-of-times-the-word-occurs-in-given-document) \cdot \mbox{log} \frac{(Total-number-of-documents)}{(Total-number-of-documents-containing-the-word)}$$

In gensim, the default formula for measuring TF-IDF uses log base 2 instead of log base 10, as shown:

$$ tf-idf = (Number-of-times-the-word-occurs-in-given-document) \cdot \log_{2} \frac{(Total-number-of-documents)}{(Total-number-of-documents-containing-the-word)}$$

If you would like to use a different formula for your TF-IDF calculation, there is a description of [parameters you can pass](https://radimrehurek.com/gensim/models/tfidfmodel.html).

In [39]:
model = gensim.models.TfidfModel(bow_corpus) # Create our gensim TF-IDF model

Now, we apply our model to the ``bow_corpus`` to create our results in ``corpus_tfidf``. The ``corpus_tfidf`` is a python list of each document similar to ``bow_document``. Instead of listing the frequency next to gensim dictionary ID, however, it contains the TF-IDF weight for the associated token. Below, we display the first document in ``corpus_tfidf``. 

In [45]:
corpus_tfidf = model[bow_corpus]
print(corpus_tfidf[0])

[(0, 0.05762667855835049), (1, 0.034102597860575276), (2, 0.037026500927199675), (3, 0.024573056776732383), (4, 0.055572026677451654), (5, 0.05768848742697945), (6, 0.005788813398634042), (7, 0.0324326430165722), (8, 0.013885354304141131), (9, 0.022178791051670185), (10, 0.02121789616390214), (11, 0.0804344500828935), (12, 0.06700485401995936), (13, 0.09039784167442563), (14, 0.01892290473462724), (15, 0.03868128113006783), (16, 0.04052918392207929), (17, 0.09784735758138051), (18, 0.011435281517478949), (19, 0.06267317885904176), (20, 0.01817764065930487), (21, 0.005646660773540439), (22, 0.029714214548092827), (23, 0.0389875691432937), (24, 0.007400700225646965), (25, 0.023665371110920628), (26, 0.038649408076145496), (27, 0.007806309319420407), (28, 0.03546623448951073), (29, 0.028078753540238678), (30, 0.06528967828521084), (31, 0.08353894862815951), (32, 0.029887439348310733), (33, 0.06013176205293062), (34, 0.10138416311140862), (35, 0.04815005189203068), (36, 0.07845062534967462

In [67]:
example_tfidf_scores = [[(dictionary[id], count) for id, count in line] for line in corpus_tfidf]
list(example_tfidf_scores[0])

[('account', 0.05762667855835049),
 ('accounts', 0.034102597860575276),
 ('acted', 0.037026500927199675),
 ('acting', 0.024573056776732383),
 ('admirably', 0.055572026677451654),
 ('affair', 0.05768848742697945),
 ('also', 0.005788813398634042),
 ('andrew', 0.0324326430165722),
 ('another', 0.013885354304141131),
 ('appear', 0.022178791051670185),
 ('appears', 0.02121789616390214),
 ('assembling', 0.0804344500828935),
 ('assembly', 0.06700485401995936),
 ('assists', 0.09039784167442563),
 ('attention', 0.01892290473462724),
 ('august', 0.03868128113006783),
 ('authorship', 0.04052918392207929),
 ('batch', 0.09784735758138051),
 ('because', 0.011435281517478949),
 ('beckerman', 0.06267317885904176),
 ('become', 0.01817764065930487),
 ('been', 0.005646660773540439),
 ('began', 0.029714214548092827),
 ('bernard', 0.0389875691432937),
 ('between', 0.007400700225646965),
 ('body', 0.023665371110920628),
 ('book', 0.038649408076145496),
 ('both', 0.007806309319420407),
 ('bound', 0.035466234

In [66]:
# Sort the tuples in our tf-idf scores list

def Sort(tfidf_tuples): 
    tfidf_tuples.sort(key = lambda x: x[1], reverse=True) 
    return tfidf_tuples 

list(Sort(example_tfidf_scores[0])[:10]) #List the top ten tokens in our example document by their TF-IDF scores

[('henslowe', 0.3421259940022984),
 ('privy', 0.2307539497079178),
 ('rutter', 0.21895480252781593),
 ('council', 0.17834282096187465),
 ('relating', 0.14010649796998942),
 ('documents', 0.13181604072574934),
 ('penditure', 0.12335680598537459),
 ('operation', 0.119110676226009),
 ('rose', 0.11885548406245758),
 ('slowe', 0.1123704845483916)]

Find the most significant terms, by TFIDF, in the curated dataset. 

In [494]:
td = {
        dictionary.get(_id): value for doc in corpus_tfidf
        for _id, value in doc
    }
sorted_td = sorted(td.items(), key=lambda kv: kv[1], reverse=True)

In [495]:
for term, weight in sorted_td[:25]:
    print(term, weight)

ofamiem 1.0
ouderdom 0.9127008148928485
sturgess 0.9086024303923905
zamir 0.8776651519018113
santayana 0.8665736199609847
falocco 0.8562928124271615
weingust 0.8547899776401692
chinese 0.8462001652815331
weils 0.8445705864452131
rudanko 0.8390830868389877
enbiemata 0.8280102498481464
daileader 0.8171301955562135
nodier 0.8168018882816901
usury 0.8005782510346803
menas 0.7909230276348473
beaurline 0.7905965479055058
spectogram 0.7879121817261375
franciscus 0.7771865284620213
soellner 0.77327831058108
bastarde 0.7712490973605648
unton 0.7682807524450844
cohens 0.7677867862204198
falco 0.7635813973078707
callimachus 0.7604638066069844
wynkyn 0.7583409894849417


Print the most significant word, by TFIDF, for the first 50 documents in the corpus. 

In [498]:
for n, doc in enumerate(corpus_tfidf):
    if len(doc) < 1:
        continue
    word_id, score = max(doc, key=lambda x: x[1])
    print(reduced_list[n].get('id'), dictionary.get(word_id), score)
    if n >= 50:
        break

http://www.jstor.org/stable/2869980 henslowe 0.33564303350040253
http://www.jstor.org/stable/2870198 stairs 0.15472663369841613
http://www.jstor.org/stable/2870199 beatrice 0.23846910220032727
http://www.jstor.org/stable/2870209 donaldson 0.4977445938192701
http://www.jstor.org/stable/2870208 cheng 0.6067690431054176
http://www.jstor.org/stable/2870203 hartwig 0.6244428220115
http://www.jstor.org/stable/2870194 hall 0.5136214288878345
http://www.jstor.org/stable/2870206 novy 0.5645876593477805
http://www.jstor.org/stable/2870202 booth 0.3864621355873546
http://www.jstor.org/stable/2870313 vizcaya 0.6046299393365276
http://www.jstor.org/stable/2870327 hollar 0.639842687006522
http://www.jstor.org/stable/2870308 longleat 0.5039894018479222
http://www.jstor.org/stable/2869730 andidentifies 0.1635030038209591
http://www.jstor.org/stable/2869726 rubinstein 0.5497022223906934
http://www.jstor.org/stable/2871196 carded 0.2820633806904763
http://www.jstor.org/stable/2871206 jaggard 0.270489292