# An Introduction to Simple Counting Methods for Text Analysis - Complete Notebook

With this notebook and its corresponding video, you will learn the rudiments of text analysis. This is where all of the preparatory work you've done in the last couple of units will start to pay off, and you'll see how quickly you can move from the simplest of observations - how often particular words occur in a text - to real insights about history and literature. Like many of the lessons in this course, this notebook will follow a regular structure with which you're likely starting to become familiar. First, we'll examine a single text. In this case, we'll use <i>Gulliver's Travels</i>, which was written by Jonathan Swift and published anonymously in 1726. Then, we'll begin to ask larger questions about our enriched ECCO corpus.

## 1. Open a Single Text: Gulliver's Travels

We chose <i>Gulliver's Travels</i> because it's a work of fiction that is fairly well-known, although many who think they are familiar with Gulliver's story only know it through adaptations and abridgements, which often do away with crucial parts of Swift's original.

For the sake of this first part of the notebook, we'll divide the novel into its four separate parts, and run our scripts against each one.

As with the text processing notebook, we first want to read each of the texts into working memory. Let's get the first one, and call it `gt1`, and its corresponding text string `gt1Txt`. So far, we're following exactly what we did with <i>Robinson Crusoe</i>.

In [None]:
import os
from pathlib import Path
home = str(Path.home())

textdirectory = home + '/dh2/sec4/'

os.chdir(textdirectory)

In [None]:
os.getcwd()

In [None]:
gt1 = open(textdirectory + "gt1.txt", "r")

# Read the document and print its contents
gt1Txt = gt1.read()
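
The cell above leaves the file handle `gt1` open after reading. As a minimal sketch, assuming you would rather have Python close the file for you, the same read can be wrapped in a `with` block. The temporary file and the name `sample_part1.txt` below are stand-ins invented so the example runs anywhere; they are not the course's `gt1.txt`.

```python
import os
import tempfile

# Hypothetical stand-in for gt1.txt, created only so this sketch runs anywhere.
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample_part1.txt")
with open(sample_path, "w", encoding="utf-8") as out_file:
    out_file.write("my father had a small estate in nottinghamshire")

# Reading inside a `with` block closes the file as soon as the block ends.
with open(sample_path, "r", encoding="utf-8") as in_file:
    sample_txt = in_file.read()

print(sample_txt[:9])     # the first nine characters of the string
print(in_file.closed)     # True once the `with` block has finished
```

The later cells in this notebook that loop over many files already use this `with open(...)` pattern.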

Excellent. Let's read the other three parts into working memory, and name them accordingly.

In [None]:
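gt2 = open(textdirectory + "gt2.txt", "r")   # Repeating what you did in the previous cell, read the second volume into working memory (filename assumed to follow the `gt1.txt` pattern)
gt2Txt = gt2.read()                          # And read its contents as the new variable `gt2Txt`

gt3 = open(textdirectory + "gt3.txt", "r")   # Do the same for the third volume
gt3Txt = gt3.read()

gt4 = open(textdirectory + "gt4.txt", "r")   # And the fourth.
gt4Txt = gt4.read()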

Now, let's examine the first 500 characters in the first part of the novel.

In [None]:
print(gt1Txt[:500])

Let's proceed, then, and tokenize each text and count the number of words it contains. To do so, we simply use the built-in `len()` function. As before, we'll need to download `punkt`.

In [None]:
import nltk
nltk.download("punkt")

words1 = nltk.tokenize.word_tokenize(gt1Txt)
words2 = nltk.tokenize.word_tokenize(gt2Txt)
words3 = nltk.tokenize.word_tokenize(gt3Txt)
words4 = nltk.tokenize.word_tokenize(gt4Txt)

print(len(words1), len(words2), len(words3), len(words4))

Now use the `count()` method, which counts how many times a given item occurs in a list. We can count any token we want, but at this point we have to remember that we haven't stemmed the words, or taken any other steps to reduce different forms to a single root. Let's start by counting the name of the place where Gulliver spends most of his time in each of the four parts (he visits other lands, but they're more minor - let's stick to the main ones, to keep things simple).

In [None]:
print(words1.count("lilliput"), words2.count("lilliput"), words3.count("lilliput"), words4.count("lilliput"))

In [None]:
print(words1.count("brobdingnag"), words2.count("brobdingnag"), words3.count("brobdingnag"), words4.count("brobdingnag"))

In [None]:
print(words1.count("laputa"), words2.count("laputa"), words3.count("laputa"), words4.count("laputa"))

In [None]:
print(words1.count("houyhnhnmland"), words2.count("houyhnhnmland"), words3.count("houyhnhnmland"), words4.count("houyhnhnmland"))
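
Because `count()` only matches whole items exactly, both capitalization and word form affect the totals. The sketch below uses a handful of invented tokens (not taken from the course files) to show the difference; if your cleaned texts are already lowercased, the folding step changes nothing.

```python
# Invented tokens for illustration; not taken from the course files.
tokens = ["Lilliput", "lilliput", "lilliput", "lilliputians", "the"]

# list.count() matches whole items exactly, so case and word form both matter.
exact = tokens.count("lilliput")

# Lowercasing first folds "Lilliput" and "lilliput" together,
# but "lilliputians" is still a different token.
folded = [t.lower() for t in tokens].count("lilliput")

print(exact, folded)
```

This is one reason the counts above should be read as counts of a specific token rather than of every mention of a place.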


## 2. Calculating Variance

What is exciting, now, is that, already, we have almost all the information we need to perform the same sort of analysis that Matt Daniels did in his hip hop article. We know how long each of the four parts of the novel is. Now we just need to figure out how many discrete words are in each one. To do that, we simply turn our list of words, `words1`, into a dictionary. If you need a refresher about the `dictionary` object type in Python, you can read about it <a href="https://www.w3schools.com/python/python_dictionaries.asp">on this page</a>.

In [None]:
words1[:10]

Now, we can do something pretty clever. When we turn this list into a dictionary, we'll produce a series of keys mapped to values. By counting the length of the index, the unique keys in that dictionary, we'll get the total number of discrete words in this volume.

In [None]:
dict1 = dict.fromkeys(words1, 0)

In [None]:
type(dict1)

Then, we just need the length of the index.

In [None]:
print(len(dict1))
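
The dictionary works here because its keys must be unique, so repeated tokens collapse into one key. As a small sketch on invented tokens, a Python `set` gives the same count without the placeholder values; this is offered as an equivalent shortcut, not as the method the lesson prescribes.

```python
# Invented tokens for illustration.
tokens = ["the", "king", "and", "the", "queen", "and", "the", "court"]

# dict.fromkeys keeps one key per distinct token (every value is the placeholder 0).
as_dict = dict.fromkeys(tokens, 0)

# A set also keeps one entry per distinct token, with no values attached.
as_set = set(tokens)

print(len(tokens), len(as_dict), len(as_set))
```

One practical difference: the dictionary preserves the order in which tokens first appear, while a set does not.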

There we go! The first volume of <i>Gulliver's Travels</i> contains 3,555 unique words. Now, it should be a simple matter to divide this number by the total number of words in the volume, which we already have. That will give us the variance.

(Of course, you could choose to do this with a function, but we'll continue to perform the calculations singly, for now, so that the operation is clear.)

### Pause the video here and calculate the variance for each volume.

In [None]:
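gtVar1 = len(dict1) / len(words1)
gtVar2 = len(dict.fromkeys(words2, 0)) / len(words2)
gtVar3 = len(dict.fromkeys(words3, 0)) / len(words3)
gtVar4 = len(dict.fromkeys(words4, 0)) / len(words4)
print(gtVar1, gtVar2, gtVar3, gtVar4)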
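
As noted above, the same calculation could be wrapped in a function. Here is a minimal sketch; the name `variance_ratio` and the sample tokens are invented for illustration, and the guard for an empty list is an added assumption so the function never divides by zero. (Elsewhere this measure is often called the type-token ratio.)

```python
def variance_ratio(tokens):
    """Return the number of unique tokens divided by the total number of tokens."""
    if not tokens:
        # Assumption: treat an empty text as having a ratio of 0.0.
        return 0.0
    return len(set(tokens)) / len(tokens)

# Invented tokens: 5 unique words out of 8 in total.
sample = ["a", "voyage", "to", "a", "remote", "nation", "to", "a"]
print(variance_ratio(sample))
```

With a function like this, each volume becomes a single call such as `variance_ratio(words1)`.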

Bingo! This will all come in useful later on, when we apply this analysis to the whole dataset. This is only a rough approximation of Daniels's analysis, of course, because we didn't bother to standardize the size of each of the texts we're analyzing. That said, they're not so far off from Daniels's benchmark of 30,000 words. With a rough variance of 0.14, we can see that most of Gulliver's Travels is about as variant as Daniels's sample of Jay-Z's lyrics (i.e. 4,275 unique words / 30,000). It's tempting to apply this analysis to the whole dataset, but let's move on to get an aggregate view of the occurrence of all the words in Gulliver's Travels. Once we've done that, then we can apply this analysis to the whole TCP!

## 3. Counting All the Words at Once

Now, rather than counting single tokens, let's instead try counting every word in each part of the book, and then plot their frequency. To do this, first make an aggregated table of counts. We'll use <a href="https://www.nltk.org/">nltk</a>, which we've seen before, as well as <a href="https://pandas.pydata.org/docs/user_guide/10min.html">pandas</a> and `HTML()`, a special function that allows us to display rendered HTML in the notebook itself.

Since the dataframe itself is very long, make sure you right-click on the chart and select "Clear Outputs" to proceed.

In [None]:
import pandas as pd
from nltk.probability import FreqDist
from IPython.display import HTML, display


# Get the frequency distribution of the words into a data frame
fdist = FreqDist(words1)
count_frame = pd.DataFrame(fdist, index =[0]).T
count_frame.columns = ['Count']
count_frame = count_frame.sort_values('Count', ascending=False)

# Display the dataframe as HTML (so it's not truncated)
display(HTML(count_frame.to_html()))
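
If you only want to peek at the most frequent items without building a dataframe, the standard library's `Counter` offers a similar tally. This is a sketch on invented tokens, shown as a lightweight alternative rather than a replacement for the `FreqDist` approach used in this notebook.

```python
from collections import Counter

# Invented tokens for illustration.
tokens = ["the", "emperor", "and", "the", "court", "and", "the", "people"]

# Counter tallies how often each distinct token occurs.
counts = Counter(tokens)

# most_common(n) returns the n most frequent (token, count) pairs, highest first.
top_two = counts.most_common(2)
print(top_two)
```

`FreqDist` in NLTK is built on `Counter`, so `fdist.most_common(50)` should work in the same way on the real word lists.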

Then, simply run this cell, which uses <a href="https://matplotlib.org/">pyplot</a> to create a graph of the information in your dataframe.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(16, 9))
ax = fig.gca()    
count_frame['Count'][:50].plot(kind = 'bar', ax = ax, color='teal')
ax.set_title('Frequency of the most common words')
ax.set_ylabel('Frequency of word')
ax.set_xlabel('Word')
plt.show()

Let's run the same analysis, but this time, let's strip out stop words, which really seem to be swamping everything else, as we'd expect. In this case, let's expand the NLTK list of English stopwords quite a bit.

First, we combine the two lists of stop words.

In [None]:
# Get a set of common stopwords from NLTK
from nltk.corpus import stopwords
more_stopwords = ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '100', 'able', 'also', 'although', 'among', 'another', 'away', 'began', 'came', 'could', 'done', 'eight', 'even', 'ever', 'every', 'first', 'five', 'found', 'four', 'gave', 'give', 'go', 'however', 'indeed', 'left', 'like', 'made', 'make', 'many', 'may', 'might', 'much', 'must', 'near', 'never', 'nine', 'nothing', 'often', 'one', 'part', 'put', 'said', 'saw', 'see', 'seven', 'several', 'shall', 'six', 'soon', 'take', 'ten', 'thee', 'therefore', 'thing', 'things', 'thou', 'though', 'three', 'thy', 'till', 'time', 'told', 'took', 'two', 'upon', 'us', 'way', 'well', 'went', 'whether', 'without', 'would', 'yet', '’', '“', '”', ',']
full_stopwords = (stopwords.words('english')) + more_stopwords

Now, let's check our work by comparing the two lists.

In [None]:
print(stopwords.words('english'))

In [None]:
len(stopwords.words('english'))

In [None]:
print(full_stopwords)

In [None]:
len(full_stopwords)

In [None]:
# remove stopwords from the text
usefulWords1 = [word for word in words1 if word not in full_stopwords]
usefulWords2 = [word for word in words2 if word not in full_stopwords]
usefulWords3 = [word for word in words3 if word not in full_stopwords]
usefulWords4 = [word for word in words4 if word not in full_stopwords]

print(usefulWords1)

In [None]:
len(words1)

In [None]:
len(usefulWords1)

Okay! We've removed stopwords, and we're left with a list of what remains. We can see, clearly, that removing the stop words made a huge difference. The size of the first list was cut down by nearly two-thirds! This gives you a sense of how, in addition to being useful to analysis itself, removing stopwords can make a big difference in processing times when you run more involved scripts.
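
On the subject of processing times: checking `word not in full_stopwords` against a list scans the list for every word. A common speed-up is to convert the stop list to a `set` first. The sketch below uses a tiny invented stop list to show that the results are identical; the names `stop_list` and `stop_set` are hypothetical.

```python
# A tiny invented stop list standing in for full_stopwords.
stop_list = ["the", "and", "of", "to"]

# Membership tests against a set do not need to scan every entry.
stop_set = set(stop_list)

tokens = ["the", "emperor", "of", "lilliput", "and", "the", "court"]

kept_with_list = [w for w in tokens if w not in stop_list]
kept_with_set = [w for w in tokens if w not in stop_set]

print(kept_with_set)
print(kept_with_list == kept_with_set)
```

In this notebook that would amount to filtering against `set(full_stopwords)` instead of the list itself; the gain matters most on long texts such as <i>Clarissa</i>.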

Now, reproduce the same graph we made earlier, but try it for our four lists of useful words, modifying the object name `usefulWords1` for each of the four lists.

In [None]:
# Get the frequency distribution of the remaining words
fdist = FreqDist(usefulWords1)
count_frame = pd.DataFrame(fdist, index =[0]).T
count_frame.columns = ['Count']

# Plot the frequency of the top 50 words
counts = count_frame.sort_values('Count', ascending = False)
fig = plt.figure(figsize=(16, 9))
ax = fig.gca()    
counts['Count'][:50].plot(kind = 'bar', ax = ax, color='teal')
ax.set_title('Frequency of the most common words, stop words removed')
ax.set_ylabel('Frequency of word')
ax.set_xlabel('Word')
plt.show()

We can get even cleaner results if we now use WordNet to reduce each word to its lemma, or the base dictionary form of a word, as we did in the lesson on cleaning text.

In [None]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

usefulLemmas1 = [lemmatizer.lemmatize(word, pos="v") for word in usefulWords1]
usefulLemmas2 = [lemmatizer.lemmatize(word, pos="v") for word in usefulWords2]
usefulLemmas3 = [lemmatizer.lemmatize(word, pos="v") for word in usefulWords3]
usefulLemmas4 = [lemmatizer.lemmatize(word, pos="v") for word in usefulWords4]

We can also combine these four lists of lemmas to get a complete set for the entire novel:

In [None]:
len(usefulLemmas1)

In [None]:
usefulLemmasGT = usefulLemmas1 + usefulLemmas2 + usefulLemmas3 + usefulLemmas4
len(usefulLemmasGT)

Let's now plot the most frequent lemmas. Doing so will give us a sense of the meaningful words that occur across all of <i>Gulliver's Travels</i>.

In [None]:
# Get the frequency distribution of the remaining words
fdist = FreqDist(usefulLemmasGT)
count_frame = pd.DataFrame(fdist, index =[0]).T
count_frame.columns = ['Count']

# Plot the frequency of the top 50 words
counts = count_frame.sort_values('Count', ascending = False)
fig = plt.figure(figsize=(16, 9))
ax = fig.gca()    
counts['Count'][:50].plot(kind = 'bar', ax = ax, color='teal')
ax.set_title('Frequency of the most common words, stop words removed')
ax.set_ylabel('Frequency of word')
ax.set_xlabel('Word')
plt.show()

We can see how these words might indicate the general subject matter of the book if we compare them with the results for Samuel Richardson's <i>Clarissa</i>.

In [None]:
Cl = open(textdirectory + "clarissa_complete.txt", "r")
ClTxt = Cl.read()
wordsCl = nltk.tokenize.word_tokenize(ClTxt)
usefulWordsCl = [word for word in wordsCl if word not in full_stopwords]
usefulLemmasCl = [lemmatizer.lemmatize(word, pos="v") for word in usefulWordsCl]  # pos="v" matches the setting used for Gulliver's Travels above

fdist = FreqDist(usefulLemmasCl)
count_frame = pd.DataFrame(fdist, index =[0]).T
count_frame.columns = ['Count']

# Plot the frequency of the top 50 words
counts = count_frame.sort_values('Count', ascending = False)
fig = plt.figure(figsize=(16, 9))
ax = fig.gca()    
counts['Count'][:50].plot(kind = 'bar', ax = ax, color='teal')
ax.set_title('Frequency of the most common words, stop words removed')
ax.set_ylabel('Frequency of word')
ax.set_xlabel('Word')
plt.show()

## 4. Working with N-Grams in a Single Text

Now that we've had an opportunity to work with single word-tokens, let's turn to n-grams, or set sequences of a number of words. We've already worked with Google's Ngram Viewer, but what's important to see, here, is that you have many more options available to you when you code for yourself. The Ngram Viewer could essentially only do one thing - produce timelines that showed the use of n-grams over time, and those results were exceedingly dubious because of how texts were catalogued in the Google Books archive.

In Python, it's surprisingly easy to produce a list of the n-grams in a document. You'll need the `ngrams` function, which is included in `nltk`, and once you have that, you simply generate the n-grams with the following script:

NB: We'll use `words1` here, because we obviously want to preserve stop words when we're doing this sort of analysis. Removing stop words would necessarily interrupt the natural sequence of the words, so that words would appear contiguous that in fact are not.

In [None]:
from nltk import ngrams

# By specifying n=2, we're going to derive the bigrams for our list words1
n = 2
nGrams = ngrams(words1, n)

Let's take a look at what sort of Python object `nGrams` is.

In [None]:
type(nGrams)

Okay. It's a generator. We haven't seen that before. Put simply, a generator is an object that produces its items one at a time as you iterate over it, and it can only be consumed once. What that means, for our purposes, is that we need to write a loop to extract information from it.

Let's then extract the n-grams in the first part of <i>Gulliver's Travels</i> from the `nGrams` generator and place them in a list called `nGramsInDoc1`. Once you've created that list, create additional ones for the subsequent parts of the novel. (You'll need to use `words2`, for instance, to create `nGramsInDoc2`.)

In [None]:
nGramsInDoc1 = []
for grams in nGrams:
    nWords = ' '.join(g for g in grams)
    nGramsInDoc1.append(nWords)
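
To see what the loop above is doing, here is a sketch that builds bigrams with plain Python on a few invented tokens, pairing each word with its successor. It also shows why a generator has to be rebuilt before a second pass. This illustrates the idea only; it is not a substitute for `nltk`'s `ngrams`.

```python
from collections import Counter

# Invented tokens for illustration.
tokens = ["his", "majesty", "the", "emperor", "and", "his", "majesty"]

# Pair each token with the one that follows it, then join each pair into a string.
bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
bigram_counts = Counter(bigrams)
print(bigram_counts.most_common(1))

# A generator can be consumed only once: a second pass yields nothing.
gen = (t for t in tokens)
first_pass = list(gen)
second_pass = list(gen)
print(len(first_pass), len(second_pass))
```

For the same reason, `nGrams` is empty once the loop above has run, so call `ngrams(words1, n)` again if you need to repeat the extraction.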

Let's examine the top 50 ngrams, as we've been doing throughout this notebook.

In [None]:
# Count the frequency of each n-gram
fdist = FreqDist(nGramsInDoc1)
count_frame = pd.DataFrame(fdist, index =[0]).T
count_frame.columns = ['Count']

# Plot the frequency of the top 50 n-grams
counts = count_frame.sort_values('Count', ascending = False)
fig = plt.figure(figsize=(16, 9))
ax = fig.gca()    
counts['Count'][:50].plot(kind = 'bar', ax = ax, color='teal')
ax.set_title('Frequency of the most common n-grams')
ax.set_ylabel('Frequency of n-gram')
ax.set_xlabel('n-gram')
plt.show()

Create this graph for each of the three subsequent parts of Gulliver's Travels. Most of the ngrams aren't that interesting, except to note that there's a fair amount of language about royal titles throughout the novel: "the emperor", "his majesty", "his majestys", "his imperial", "the king", "the queen", "my master". It's remarkable that language like this comes as easily to Gulliver as "it was" and "by the". If you think about this finding in those terms, it's almost as though the grammar of paying homage to royal authority were somehow encoded in Gulliver's literary DNA, as though he were hardwired to be subservient. Try the same analysis with Richardson's <i>Clarissa</i>, expand the graph out to 100 bigrams, and you won't find any evidence of this sort of language. That's not to say that it isn't at all present in <i>Clarissa</i>, but only that, proportionally, there seems to be a great deal more of it in <i>Gulliver's Travels</i>.

## 5. Simple Counting Applied to an Entire Corpus

We're just getting started, and already we've come a long way, having covered aggregate word counts, variance, and frequency distributions for each word in a text. Now it's time to take that analysis a step further, and to apply it to the entire ECCO dataset. You've already done a fair amount of transforming scripts for single texts into loops for aggregate analysis, so try this out on your own.

Start by pointing your notebook at the `corpora_and_metadata` directory, which contains our complete enriched metadata from the last unit.

In [None]:
textdirectory = home + '/dh2/corpora_and_metadata/'

os.chdir(textdirectory)
print(os.getcwd())

Now, as you did in the lesson on processing text, import the `glob` library and get a list of the `.csv` files in this directory.

In [None]:
import glob
print(glob.glob("*.csv"))
filenames = glob.glob("*.csv")

We're looking for `enriched_ecco_metadata.csv`. Now, read that into working memory using `pandas`. We've used this module to write `.csv` files in our lessons on APIs and web scraping. It's a rather simple matter to reverse the process and read that data into working memory as a dataframe.

In [None]:
ecco_metadata = pd.read_csv("enriched_ecco_metadata.csv")

In [None]:
ecco_metadata

In [None]:
type(ecco_metadata)

Now that we have a dataframe that contains our metadata, let's move to the directory we made in the last unit, the one that contains all our cleaned ECCO documents, and then list them all using `glob`. Check to make sure you have `2640` files, the complete ECCO-TCP set, with Stephen's additions.

In [None]:
textdirectory = home + '/dh2/corpora_and_metadata/working_set_cleaned/'

os.chdir(textdirectory)
print(os.getcwd())

In [None]:
print(glob.glob("*.txt"))
filenames = glob.glob("*.txt")

In [None]:
len(filenames)

Let's combine a number of the analyses we've done in a single loop, in order to apply them to each document in the directory. Our goal here is to do a large run that uses these calculations to create useful metadata about the most basic elements of our dataset, so let's focus on `total words`, `discrete words` and `variance`. Fill in the blank spaces marked by question marks and the hashed comments that indicate where you should write a line of code yourself.

For all the work this cell does, when you run it, it should only take about five minutes to complete. As always, you can consult the complete notebook for this lesson if you need help.

In [None]:
# Create a blank list called `calculated_data`.
calculated_data = []

for file in filenames:
    with open(str(file), 'r') as inputFile:
        readFile = inputFile.read()
        words = nltk.tokenize.word_tokenize(readFile)   # Tokenize the object `readFile`
        total_count = len(words)                        # Count the length of `words`
        discrete_count = len(dict.fromkeys(words, 0))   # Count the number of discrete words in `words`
        variance = str(discrete_count / total_count)    # Divide discrete_count by total_count and make the result a string
        filename = file.replace(".txt", "")             # Remove ".txt" so the results match the TCP column of the larger metadata file
        new_data = {'TCP':filename,'total_words':total_count,'discrete_count':discrete_count,'variance':variance}
        calculated_data.append(new_data)
calculated_data_df = pd.DataFrame(calculated_data)

Cross your fingers! Let's check out what we have.

In [None]:
calculated_data_df[0:10]

In [None]:
len(calculated_data_df)

Let's now sort what we have on the TCP column, which lists our filenames.

In [None]:
calculated_data_df = calculated_data_df.sort_values(by=['TCP'])

In [None]:
calculated_data_df[0:10]

## 6. Turning Your Results into Metadata

This is a good start, but now that we have all this data, we want to combine it with the enriched metadata that Christine helped us to compile in our last lesson. Let's do that now.

In [None]:
calculated_data_df.shape

In [None]:
ecco_metadata.shape

In [None]:
ecco_metadata_w_counts = ecco_metadata.merge(calculated_data_df, left_on='TCP', right_on='TCP')
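
By default, `merge` keeps only rows whose key appears in both dataframes, so a document missing from either side silently drops out. The sketch below uses two tiny invented dataframes (the `TCP` values and titles are made up) to show how an outer merge with `indicator=True` can reveal which keys failed to match.

```python
import pandas as pd

# Invented stand-ins for ecco_metadata and calculated_data_df.
metadata = pd.DataFrame({"TCP": ["A001", "A002", "A003"],
                         "Title": ["One", "Two", "Three"]})
counts = pd.DataFrame({"TCP": ["A001", "A002", "A004"],
                       "total_words": [100, 250, 75]})

# Default (inner) merge: only keys present in both frames survive.
inner = metadata.merge(counts, on="TCP")
print(inner.shape)

# Outer merge with an indicator column shows where each row came from.
outer = metadata.merge(counts, on="TCP", how="outer", indicator=True)
unmatched = outer[outer["_merge"] != "both"]
print(unmatched[["TCP", "_merge"]])
```

Comparing the three `.shape` results in the surrounding cells serves the same purpose: if the merged frame has fewer rows than you expect, some keys did not line up.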

In [None]:
ecco_metadata_w_counts.shape

In [None]:
ecco_metadata_w_counts[0:10]

Before we write this new dataframe with all our results to a file, let's quickly make one last change, and add ".txt" to the end of each filename, while calling the "TCP" column "DocName", which is a bit more descriptive. This will make it a lot easier, whenever we want to access specific files, based on their metadata.

In [None]:
ecco_metadata_w_counts.rename(columns = {'TCP':'DocName'}, inplace = True)

In [None]:
ecco_metadata_w_counts.insert(1, 'FileName', ecco_metadata_w_counts['DocName'] + ".txt")

Finally, let's confirm what we've done.

In [None]:
ecco_metadata_w_counts.shape

In [None]:
ecco_metadata_w_counts[0:10]

All that's left to do is write our big dataframe to a csv file.

In [None]:
textdirectory = home + '/dh2/corpora_and_metadata/'

os.chdir(textdirectory)

print(os.getcwd())

In [None]:
ecco_metadata_w_counts.to_csv("compiled_ecco_metadata.csv",index=None)

Outstanding! Use your browser to open up your new metadata file, with the results, and check that everything is in place. Think, for a moment, about what you've accomplished. You now have the ability to run comprehensive analyses on your entire dataset, and to use the results you get to feed information back into your metadata, which can then open up new possibilities for analysis.

Now that you've reached this point, you have an enormous amount of metadata that will deepen your ability to draw meaningful information out of your dataset. On the next page, we're going to celebrate this milestone by reading a wonderful article by Stephen Ramsay, called "The Hermeneutics of Screwing Around." Then, we're going to show you how the world of DH really starts to open up, now that you have these skills under your belt.

## 7. Using Enriched Metadata to Drive Discovery

We've done the hard work. Now it's time to play. Start here once you've read Stephen Ramsay's article on "Screwing Around." The rest of this notebook is a sort of free-for-all. We're going to introduce one more vitally important coding idea - how to use your enriched metadata to select certain texts for analyses. This is the sort of operation that would allow you to study, for instance, how particular women used novelistic effects in their nonfictional discourse, or how members of the Royal Society discussed issues of race and ethnicity. It's a simple technique, creating a subset out of the larger set of data, but a powerful one, and we're going to return to it again and again throughout the rest of the course.

To this point, whenever we've wanted to analyse a group of texts, we've either specified their filenames individually, or used the `glob` module to get a bunch of filenames from a specific local directory. This works well enough when you want to work with a complete directory, but it's a rather blunt instrument. Here, we're going to learn how to use our metadata to access specific files.

To start, let's do this manually. Open up your metadata file, and make a list of filenames for all the works in the dataset by Mary Wollstonecraft, Mary Hays, and Helen Maria Williams, three women who supported the ideals of the French Revolution. Read their cleaned documents into working memory. First, point at the `working_set_cleaned` directory, and then make a list of the filenames you want.

In [None]:
textdirectory = home + '???'
os.chdir(???)

filenames =[???]


Then, populate a dictionary with all the files. The dictionary should have keys that correspond to the filenames, and values that correspond to the text for each volume.

In [None]:
text_dictionary = {}

for file in filenames:
    with open(str(file), 'r') as inputFile:
        readFile = inputFile.read()
        text_dictionary[str(file).format(file)] = readFile

Let's check out the dictionary we've just made.

In [None]:
type(text_dictionary)

In [None]:
text_dictionary.keys()

In [None]:
print(text_dictionary["K046614.000.txt"])

Now, let's write a very simple script to search for any word we might like to find, across these texts. We'll use [Counter](https://docs.python.org/3/library/collections.html#collections.Counter), which will simply count the occurrences of our search word, once we've split the text.

In [None]:
search_word = "liberty"

In [None]:
from collections import Counter

for key in text_dictionary:
    word_counts = Counter(text_dictionary[key].split())
    print(key, "-> " + search_word + " =", word_counts[search_word])
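
Splitting on whitespace leaves capital letters and punctuation attached to words, so `"Liberty,"` and `"liberty"` count as different items. The cleaned course files may already be lowercased and stripped of punctuation, in which case this makes no difference; the sketch below, on an invented sentence, shows what happens when they are not.

```python
import re
from collections import Counter

# Invented sentence for illustration.
sample_text = "Liberty, they said, is precious. True liberty requires virtue; liberty!"

# Plain split(): punctuation and capitals stay attached to the words.
raw_counts = Counter(sample_text.split())

# Lowercase first, then keep only runs of letters.
normalized = re.findall(r"[a-z]+", sample_text.lower())
clean_counts = Counter(normalized)

print(raw_counts["liberty"], clean_counts["liberty"])
```

Note that the simple pattern `[a-z]+` would also split hyphenated or accented words, so treat it as a rough normalization only.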

This is all well and good, but this isn't a very flexible technique, and it frankly isn't all that much fun. 

Not only do you have to compile and transform a new list of documents every time you want to perform a new search on a set of documents, but you also can't easily tell which text is which in the results. It would be much better, obviously, if the results clearly said something like `E000085.001.txt (Mary Wollstonecraft, An Historical and Moral View of the French Revolution) -> liberty = 117`. Had we integrated the metadata into this process, we could have solved all these problems from the beginning. Let's do that now. First, we'll read the metadata into working memory, then we'll isolate the filenames for the documents by Wollstonecraft.

In [None]:
textdirectory = home + '/dh2/corpora_and_metadata/'

os.chdir(textdirectory)
print(os.getcwd())

In [None]:
import glob
print(glob.glob("*.csv"))
filenames = glob.glob("*.csv")

In [None]:
ecco_metadata_w_counts = pd.read_csv("compiled_ecco_metadata.csv")

In [None]:
ecco_metadata_w_counts[0:10]

The `.loc[]` indexer allows you to access elements of your dataframe through the row and column labels. It's extremely useful, so be sure to consult <a href="https://pandas.pydata.org/pandas-docs/version/0.23.1/generated/pandas.DataFrame.loc.html">its documentation</a>. It generally takes this format:

`data_frame.loc[ Whatever You Want To Get ]`

The "Whatever You Want To Get", enclosed in square brackets, is called an "item" in Python. Items are extremely flexible. So, for instance, if you wanted to access all rows for which the column `main_author` is `Mary Wollstonecraft`, you would use the following item:

`ecco_metadata_w_counts["main_author"] == "Mary Wollstonecraft"`

The full cell would look like this:

In [None]:
ecco_metadata_w_counts.loc[ecco_metadata_w_counts["main_author"] == "Mary Wollstonecraft"]

Let's quickly take a look at a few ways to specify items, so that we can access different elements in our dataframe.

In our earlier example, we wanted to access all rows for volumes written by Mary Wollstonecraft, Mary Hays, and Helen Maria Williams. To do that, we would use a script with the format

`array = ['item1','item2']`

`df.loc[df['column_name'].isin(array)]`

In [None]:
array = ['Mary Wollstonecraft', 'Mary Hays', "Helen Maria Williams"]
ecco_metadata_w_counts.loc[ecco_metadata_w_counts["main_author"].isin(array)]

Or, you could select rows based on multiple column conditions, in this case, the `main_author` and `total_words`, which must be at least 25000.

In [None]:
ecco_metadata_w_counts.loc[(ecco_metadata_w_counts["main_author"] == "Mary Wollstonecraft") & (ecco_metadata_w_counts["total_words"] >= 25000)]

This line of script selects rows for which the `occupation` column contains `novelist`. (You could also try `opera singer`, `spy`, and `volcanologist`.) Our use of `contains` is important, because the occupation field is often a collection of multiple strings. John Adams, for instance, is listed as `lawyer, politician, diplomat, political philosopher, statesperson`, so we couldn't attempt an exact match with `==`.

In [None]:
ecco_metadata_w_counts.loc[ecco_metadata_w_counts["occupation"].str.contains("novelist", na=False)]

The last and most important step is to extract a list of filenames from this restricted set of rows. This is easy to do. As in the cell below, stipulate that you want the information contained in the `FileName` column. Dataframe columns are pandas Series when you pull them out, but you can simply convert one to a list with `x.tolist()`. Give it a shot.

In [None]:
filenames = ecco_metadata_w_counts.loc[(ecco_metadata_w_counts["occupation"].str.contains("novelist", na=False)) & (ecco_metadata_w_counts["total_words"] >= 30000)]["FileName"].tolist()
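
The one-line selection above can be easier to read when each condition is given a name. The sketch below does this on a tiny invented dataframe (all authors, filenames, and counts are made up); it assumes only the column names used in this notebook.

```python
import pandas as pd

# Invented stand-in for ecco_metadata_w_counts.
toy = pd.DataFrame({
    "FileName": ["a.txt", "b.txt", "c.txt", "d.txt"],
    "main_author": ["Author One", "Author Two", "Author One", "Author Three"],
    "occupation": ["novelist, poet", None, "printer", "playwright, novelist"],
    "total_words": [45000, 12000, 31000, 8000],
})

# Each condition is a Series of True/False values, one per row.
is_novelist = toy["occupation"].str.contains("novelist", na=False)
is_long = toy["total_words"] >= 30000

# Combine the conditions with &, then pick the column and convert it to a list.
selected = toy.loc[is_novelist & is_long, "FileName"].tolist()
print(selected)
```

Here `na=False` makes rows with a missing occupation count as non-matches instead of raising an error when the mask is applied.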

In [None]:
filenames

Great. Now, repeat the search that we did for `liberty`, but for these long `novelist` files.

In [None]:
textdirectory = home + '/dh2/corpora_and_metadata/working_set_cleaned/'
os.chdir(textdirectory)

In [None]:
text_dictionary = {}

for file in filenames:
    with open(str(file), 'r') as inputFile:
        readFile = inputFile.read()
        text_dictionary[str(file).format(file)] = readFile

In [None]:
search_word = "liberty"

In [None]:
for key in text_dictionary:
    word_counts = Counter(text_dictionary[key].split())
    print(key, "-> " + search_word + " =", word_counts[search_word])

Since we're working directly with the metadata, we can now print out information that simply wasn't available to us before when we manually grabbed the texts by specifying filenames. Let's print out the author and the first fifty characters of the title, for each volume in our `novelist` set.

In [None]:
for key in text_dictionary:
    word_counts = Counter(text_dictionary[key].split())
    print(key, ecco_metadata_w_counts.loc[ecco_metadata_w_counts["FileName"] == key]["main_author"].values[0] + ", " + ecco_metadata_w_counts.loc[ecco_metadata_w_counts["FileName"] == key]["Title"].values[0][:50] +" -> " + search_word + " =", word_counts[search_word])
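
The `print` line above repeats the same `.loc` lookup twice for every file. One way to tidy this, sketched here on an invented two-row dataframe, is to index the metadata by `FileName` once and wrap the lookup in a small helper. The names `lookup` and `describe` are hypothetical, and the sketch assumes each filename appears only once in the metadata.

```python
import pandas as pd

# Invented stand-in for ecco_metadata_w_counts.
toy_meta = pd.DataFrame({
    "FileName": ["a.txt", "b.txt"],
    "main_author": ["Author One", "Author Two"],
    "Title": ["A Short Invented Title", "Another Invented Title"],
})

# Index by FileName once so each later lookup is a single step.
lookup = toy_meta.set_index("FileName")

def describe(file_name, width=50):
    """Return 'author, title' with the title cut to `width` characters."""
    row = lookup.loc[file_name]
    return row["main_author"] + ", " + row["Title"][:width]

print(describe("a.txt"))
print(describe("b.txt", width=7))
```

Inside the loop, the long expression could then be replaced with a call such as `describe(key)`.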

Excellent. This is a huge leap forward, but it doesn't make much sense for us to print out the results in this format, every time. If we append each line of data to a dataframe, we'll be able to do much more with these results. Take a stab at this now. Remember that you can always refer to the complete notebook, if you run into difficulty.

In [None]:
search_word = "liberty"
results_data = []

for key in text_dictionary:
    word_counts = Counter(text_dictionary[key].split())
    search_count = word_counts[search_word]
    file_names = key
    main_author = ecco_metadata_w_counts.loc[ecco_metadata_w_counts["FileName"] == key]["main_author"].values[0]  # The author's name, as in the cell above
    title = ecco_metadata_w_counts.loc[ecco_metadata_w_counts["FileName"] == key]["Title"].values[0]  # The full title of the work
    results = {'file_names':file_names,'main_author':main_author,'title':title,'search_word':search_word,'search_count':search_count}
    results_data.append(results)
results_df = pd.DataFrame(results_data)

Let's check what we have.

In [None]:
results_df.shape

In [None]:
results_df[0:10]

That's so much easier on the eyes! What's more, though, we can now transform our data, by, for example, sorting it.

In [None]:
results_df = results_df.sort_values(by=['search_count'], ascending=False) 

In [None]:
results_df[0:10]

This is fantastically useful information. We can examine our set of volumes by novelists to quickly see which have the most instances of the word `liberty`. Now that we've produced a dataframe of results, let's look at other ways that we could present the data. It's a simple matter to turn `results_df` into a bar graph.

In [None]:
# Plot the frequency of the searchword
counts = results_df.sort_values('search_count', ascending = False).set_index("title")
fig = plt.figure(figsize=(16, 9))
ax = fig.gca()    
counts['search_count'][:60].plot(kind = 'bar', ax = ax, color='teal')
ax.set_title('Frequency of the term ' + search_word)
ax.set_ylabel('Frequency of ' + search_word)
ax.set_xlabel('Volume')
plt.show()

Let's do the same, but now for `principle`. In this case, however, let's lemmatize each work, as we read it into the dictionary. This will be a very important run, for thinking about the larger project, as it will point us towards those works by novelists that are most concerned with the word `principle`.

In [None]:
textdirectory = home + '/dh2/corpora_and_metadata/working_set_cleaned/'
os.chdir(textdirectory)

In [None]:
#print(glob.glob("*.txt"))
#filenames = glob.glob("*.txt")

filenames = ecco_metadata_w_counts.loc[(ecco_metadata_w_counts["occupation"].str.contains("novelist", na=False)) & (ecco_metadata_w_counts["total_words"] >= 30000)]["FileName"].tolist()

In [None]:
len(filenames)

Let's first create a dictionary with all the lemmatized `novelist` files.

In [None]:
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

text_dictionary = {}

for file in filenames:
    with open(str(file), 'r') as inputFile:
        readFile = inputFile.read()
        usefulWordList = [word for word in readFile.split()]
        lemmas = [wordnet_lemmatizer.lemmatize(word) for word in usefulWordList]       
        text_dictionary[str(file).format(file)] = lemmas

And then run the same analysis we did on the texts in the `liberty` search.

In [None]:
search_word = wordnet_lemmatizer.lemmatize("principle")
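results_data = []

for key in text_dictionary:
    word_counts = Counter(text_dictionary[key])  # Each value is already a list of lemmas, so no split() is needed
    search_count = word_counts[search_word]
    file_names = key
    main_author = ecco_metadata_w_counts.loc[ecco_metadata_w_counts["FileName"] == key]["main_author"].values[0]  # The author's name
    title = ecco_metadata_w_counts.loc[ecco_metadata_w_counts["FileName"] == key]["Title"].values[0]  # The full title of the work
    results = {'file_names':file_names,'main_author':main_author,'title':title,'search_word':search_word,'search_count':search_count}
    results_data.append(results)
results_df = pd.DataFrame(results_data)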

Okay! That should do it. Let's sort our results as we did before, and then take a look.

In [None]:
results_df = results_df.sort_values(by=['search_count'], ascending=False) 

In [None]:
results_df[0:10]

This is outstanding! We've got a clear indication, here, of where we might want to start, if we were interested in studying the use of the word `principle` in novels. As we'll see later on, this is still a fairly blunt analysis, far less supple than concept search, but we're beginning to approach results that are both meaningful for research and that, by virtue of the way we're using metadata, go beyond the capabilities of even those search engines that most scholars use. And we're just beginning to scratch the surface!

Let's plot these results.

In [None]:
# Plot the frequency of the searchword
counts = results_df.sort_values('search_count', ascending = False).set_index("title")
fig = plt.figure(figsize=(16, 9))
ax = fig.gca()    
counts['search_count'][:60].plot(kind = 'bar', ax = ax, color='teal')
ax.set_title('Frequency of the term ' + search_word)
ax.set_ylabel('Frequency of ' + search_word)
ax.set_xlabel('Volume')
plt.show()

The page that corresponds to this part of the notebook provides some really fascinating exercises that you can work through on your own. Give them a shot now, making as much use of the code in this notebook as you can. Create new coding cells in the workspace below, and feel free to copy any code you might need from the cells above, to solve the problems outlined in the exercises. Good luck!

### Free Workspace for Exercises: Using Enriched Metadata to Drive Discovery