# Evaluating Writing Style with Unigrams

After several weeks on chatbots, we have covered a lot of programming techniques that are useful in this area.
In order to progress at the same rate as before, we have to move on to new shores.
So we are shifting gears and will now spend a few weeks on word-oriented techniques such as stylistic analysis and spell checking.
We will start with stylistic analysis.


## Unigrams and n-grams

As you probably know from your high school days, comparing works of fiction can be a very hard and time-consuming task.
It would be much nicer if we could just have the computer do all the work for us.
But how could that work?

One simple idea is that an author's style is represented by which words (s)he uses, and in particular which words (s)he uses most.
Words are also known as *unigrams*.
This is in contrast to *bigrams*, which consist of two words, *trigrams* (three words), and so on.
For instance, the sentence

    John likes Mary and Peter
    
contains the unigrams

    John, likes, Mary, and, Peter
    
the bigrams

    John likes, likes Mary, Mary and, and Peter
    
and the trigrams

    John likes Mary, likes Mary and, Mary and Peter
    
We could also have 4-grams, 5-grams, or 127-grams.
Quite generally, a model that is based on words or sequences of words is called an *n-gram model*.
So if we want to analyze an author's style in terms of their word usage, we are proposing a unigram model of stylistic analysis.
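
To make the terminology concrete, here is a minimal sketch that extracts the n-grams of the example sentence. The helper `ngrams` is a hypothetical name introduced only for this illustration, and splitting on whitespace is a simplification that works for this toy sentence but not for real texts.

```python
def ngrams(tokens, n):
    # hypothetical helper: slide a window of width n over the list of words
    # and join the words in each window with a space
    return [str.join(" ", tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "John likes Mary and Peter"
# naive tokenization: split the sentence at whitespace
words = str.split(sentence)

unigrams = ngrams(words, 1)
bigrams = ngrams(words, 2)
trigrams = ngrams(words, 3)

print(unigrams)
print(bigrams)
print(trigrams)
```

If `n` is larger than the number of words, the window never fits and the result is an empty list.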

But does a unigram model actually work for comparing writing style?
In class we talked about a study that used a unigram model to predict the success of novels.
If that is possible, stylistic analysis might be feasible, too.
Well, let's put the idea to the test by comparing three works of fiction with this technique:

- William Shakespeare's *Hamlet*
- Christopher Marlowe's *The Tragical History of Dr. Faustus*
- Edgar Rice Burroughs's *A Princess of Mars*

If we find something interesting, then unigram models might be worthwhile for stylistic analysis after all.

A brief remark on those works: The first two are world-famous plays, whereas the third is an early 20th century pulp novel that you might know as the basis for Disney's 2012 box office debacle *John Carter*.
Although the movie is better than its reputation, it still doesn't do justice to the book, so give it a read if you are in the mood for a fun science-fantasy read.

## Getting the files

First we need to have the books in some digital format that we can feed into Python.
Ideally, we want this to be a plaintext format, i.e. the pure text without any layout information.
We do not want a pdf or doc file, as those are much harder to work with.
We can use Python to download all the files from [Project Gutenberg](https://www.gutenberg.org/), an online platform that hosts literary works that are no longer under copyright.

In [None]:
# import the urllib.request library
import urllib.request
# download Hamlet
urllib.request.urlretrieve("http://www.gutenberg.org/cache/epub/1524/pg1524.html", "hamlet.txt")
# download Faustus
urllib.request.urlretrieve("http://www.gutenberg.org/cache/epub/811/pg811.txt", "faustus.txt")
# download Princess of Mars
urllib.request.urlretrieve("http://www.gutenberg.org/cache/epub/62/pg62.txt", "johncarter.txt")

# wait till you see an output below this cell;
# as long as you see an * next to In[], Python is busy downloading the files

After running the cell above, you will have three files in the same folder as the notebook:

1. `hamlet.txt`
1. `faustus.txt`
1. `johncarter.txt`

Like any other `.txt` file, you can open them in a text editor, e.g. Notepad.
Do that right now: open the files and take a peek at what they look like.
You'll notice that they're full of stuff that isn't part of the books' text itself.
This includes copyright disclaimers, footnotes, and HTML tags.
HTML tags are the things between pointy brackets, for instance `<p id="id00057">` or `</p>` in *Hamlet*.
We will have to get rid of all that later on, but let's hold off on that for now.

Instead, let's look at the code above to see how it works.
The `urllib.request` module provides a function `urlretrieve` that takes two strings as arguments.
The first one tells us the URL of the file we want to download, and the second one is the filename we want to save it as.

```python
urllib.request.urlretrieve("url", "filename")
```

As you can see, downloading files in Python is easy-peasy.

However, it is somewhat annoying that we have to type `urllib.request.urlretrieve` all the time; that's one really long function name.
We can save ourselves some typing by changing the `import` statement.

In [None]:
# directly import urlretrieve from the urllib.request library
from urllib.request import urlretrieve
# download Hamlet
urlretrieve("http://www.gutenberg.org/cache/epub/1524/pg1524.html", "hamlet.txt")
# download Faustus
urlretrieve("http://www.gutenberg.org/cache/epub/811/pg811.txt", "faustus.txt")
# download Princess of Mars
urlretrieve("http://www.gutenberg.org/cache/epub/62/pg62.txt", "johncarter.txt")

Whenever we write `import xyz`, the entire module `xyz` is loaded and all its functions `f`, `g`, `h`, and so on become accessible as `xyz.f`, `xyz.g`, `xyz.h`, and so on.
But if we only want one specific function `f`, there is an alternative.
Instead of `import xyz`, we can write `from xyz import f`.
Then we can directly write `f` instead of `xyz.f`.
Note that you cannot write `import urllib.request.urlretrieve`. This will not work because only modules can be imported this way, whereas `urllib.request.urlretrieve` is a function.

In [None]:
# this code won't work
import urllib.request.urlretrieve
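
If you want to see the difference between the import styles without crashing a cell, the following sketch may help. It uses `importlib.import_module` from the standard library, which does the same job as an `import` statement but lets us catch the error. The variable `import_worked` is just an illustrative name.

```python
import importlib
import urllib.request
from urllib.request import urlretrieve

# both import styles give access to the very same function
print(urlretrieve is urllib.request.urlretrieve)

# trying to import a function as if it were a module fails
try:
    importlib.import_module("urllib.request.urlretrieve")
    import_worked = True
except ImportError as error:
    import_worked = False
    print("Import failed:", error)
```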

Alright, let's get back to our files, which are still idly hanging around on your hard drive.
The files won't do us much good unless we can find a way for Python to actually work with them.
We have to tell Python to read in each file as a string.
While this isn't too difficult, it involves some new commands and concepts that would only be distracting at this point.
So instead, I wrote a function below that already does all the work for you.
The expansion unit explains in detail how to work with files in Python.

In [None]:
# function for reading in files as strings
def read_file(filename):
    with open(filename, "r", encoding="utf-8") as text:
        return text.read()

Once you run the cell above, the function `read_file` will be available everywhere else in the notebook (if you restart the kernel, Python forgets how the function is defined, so you'll have to execute the cell again).

So now we can read in every file as a string and store the string in a variable.

In [None]:
# first hamlet
hamlet = read_file("hamlet.txt")
# then faustus
faustus = read_file("faustus.txt")
# then johncarter
johncarter = read_file("johncarter.txt")

Run the three cells below to see what the first 1000 characters of each string look like (the meaning of `[:1000]` will be explained in one of the next units).

In [None]:
hamlet[:1000]

In [None]:
faustus[:1000]

In [None]:
johncarter[:1000]

Notice how there are no visible linebreaks; instead we have the special character `\n`.
Those strings are what the text files look like to Python, which is quite different from how a text editor displays them for a human.
And in the case of `hamlet`, the string is also littered with HTML tags, which provide additional information for web browsers, in particular how to display the text.
Whenever you are looking at a website, your browser actually sees a string like the one for `hamlet`.
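
The difference between what Python stores and what a human sees can be tried out on a small string. This is only a sketch with an invented two-line snippet; `repr` shows the stored form, whereas `print` turns `\n` into an actual linebreak.

```python
snippet = "To be, or not to be,\nthat is the question."

# repr shows the string the way Python stores it, with \n spelled out
print(repr(snippet))

# print interprets \n as an actual linebreak, like a text editor does
print(snippet)

# the string contains exactly one linebreak, so it splits into two lines
number_of_breaks = str.count(snippet, "\n")
lines = str.split(snippet, "\n")
print(number_of_breaks, lines)
```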

**Exercise.**
We now have all the code we need to download the relevant files and read them in as strings, but it's all scattered across the notebook.
If you have to restart the kernel, you'll have to execute all those cells again.
So let's add a final step for our convenience that does it all at once.
And since we might want to repeat the steps at a later point, let's pack it all into a function `get_text`.
That way, we can run something like `get_text("hamlet")` at any point later on in the notebook to redownload the original version of Hamlet.
That will come in mighty handy if we try to clean up the files and one of our clean-up steps accidentally rips out half the text.

So your task is to define a custom function `get_text` that satisfies all of the following criteria:

1. `get_text` takes a string as its only argument;
1. The values for the argument can be `hamlet`, `faustus`, or `johncarter`.
1. If a different string is passed as the argument, e.g. `foobar`, the function will behave as if the argument were `hamlet`.
1. Based on what argument was provided, the function
    1. downloads the correct file (check the beginning of the notebook for the URLs), and
    1. reads in the file as a string (copy-paste the relevant code from the `read_file` function).
    
Make sure that your code works even if this is the first cell in the notebook to be run.
So you will also have to add the necessary `import` statement(s) above the function definition.
Once the function has been defined, use it to instantiate the variables `hamlet`, `faustus`, and `johncarter`.

In [None]:
# put your code here

# Cleaning up the files

## Analysis

Take one more look at the files you downloaded.
Scroll up and down to make sure you get a good impression of what the files look like.
Pay close attention to anything that you think is not part of the author's actual writing and should be removed.
Go ahead, check out the files, I'll wait here in the meantime.

Done? Okay, you should have noticed quite a few problems with the files, only some of which we could ever hope to fix by hand.

1. While `faustus.txt` and `johncarter.txt` are fairly easy to read, `hamlet.txt` is cluttered with HTML tags like `<p id="id00057">` and `<br/>`. That's because we downloaded a text file for `faustus.txt` and `johncarter.txt`, but an HTML file for `hamlet.txt`.

1. All files start with information about Project Gutenberg, which we do not want.

1. All files have information at the end that is not part of the text. In `hamlet.txt`, it's just a disclaimer that the play is over, `johncarter.txt` ends with the Project Gutenberg license, and `faustus.txt` is full of footnotes.

1. In `faustus.txt`, the text is often interrupted by strings like `[17]`. Those are references to footnotes.

1. For the two plays, different formats are used to indicate who is speaking.
    - In `hamlet.txt`, names are abbreviated and occur between the markup `<p id="id...">` and `<br/>`.
    - In `faustus.txt`, names are fully capitalized.
    
1. Both plays put stage instructions between square brackets, for example `[Francisco at his post. Enter to him Bernardo.]`.

1. In `faustus.txt`, stage instructions are also indicated by indentation.

1. In `faustus.txt`, all dialog is indented, but less so than the stage instructions.
    
1. All three files contain many empty lines.

1. Both plays capitalize words at the beginning of a new line.

1. In `johncarter.txt`, chapter headings are written in all caps.

Some of them are very problematic for us:

- We just want to be able to see which words are used in each play, and how often each word is used.
- We do not want extraneous material such as HTML markup, information about Project Gutenberg, or footnotes.
- We also do not want to keep track of names if they just indicate who is speaking. That's not part of the play as such.
- We should also exclude stage instructions because those do not belong to the literary part of the play either.

Fixing all these things by hand would be tons of work.
Fortunately, Python can do it all for us with just the right regular expressions.

## Clean-up step 1: Deleting the start and end

The first step is to remove the parts at the beginning and the end that aren't part of the text itself.
The easiest way to do this is to delete everything up to or after a given line number.
We can find out the relevant line numbers by opening the files we downloaded in a text editor.

Let's start with Hamlet.
The play doesn't start until the description *SCENE. Elsinore.* on line 366.
So we want to delete the first 365 lines.
And the non-play part at the end is marked with *The End of Project Gutenberg Etext of Hamlet by Shakespeare* on line 10929.
So everything after line 10928 should be deleted, too.

For your convenience, I have already written two custom functions `delete_before_line` and `delete_after_line` that handle the deletion.
They take two arguments, a string and a line number, and then delete everything before or after that line number.

In [None]:
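def delete_before_line(string, line):
    # lines are counted from 1, as in a text editor;
    # splitting at the first line-1 linebreaks leaves line `line`
    # and everything after it as the last item of the list
    return str.split(string, "\n", line - 1)[-1]

def delete_after_line(string, line):
    # keep the first `line` lines, i.e. everything up to and including line `line`
    return str.join("\n", str.split(string, "\n")[:line])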

You do not need to worry about how these functions work.
That is the beauty of functions: you can treat them as black boxes and do not need to worry about how exactly they do what they are supposed to do.
You didn't worry about how `print` or `re.sub` work, and you do not need to worry about how those two functions work.
The important thing is that we can use them to clean up our version of *Hamlet*.
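
If you are curious anyway, the idea of line-based trimming can be tried out on a toy string. The sketch below does not reproduce the two functions above. It uses two hypothetical helpers, `keep_from_line` and `keep_up_to_line`, which count lines from 1 like a text editor, and the toy text is invented.

```python
def keep_from_line(string, line):
    # hypothetical helper: keep line number `line` and everything after it
    lines = str.split(string, "\n")
    return str.join("\n", lines[line - 1:])

def keep_up_to_line(string, line):
    # hypothetical helper: keep everything up to and including line number `line`
    lines = str.split(string, "\n")
    return str.join("\n", lines[:line])

toy = "header\nACT I\nsome dialog\nTHE END\nlicense"

# trim the end first: the line numbers refer to the untrimmed text,
# and cutting lines off the start would shift all later line numbers
trimmed = keep_from_line(keep_up_to_line(toy, 4), 2)
print(trimmed)

# trimming the start first and reusing the old line number keeps too much
wrong_order = keep_up_to_line(keep_from_line(toy, 2), 4)
print(wrong_order)
```

The second result still contains the last line of the toy text, which shows why the order of the two trimming steps matters when both line numbers come from the original file.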

In [None]:
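# trim the end first: the line numbers refer to the original file,
# and deleting lines at the start would shift every later line number
hamlet_clean = delete_after_line(hamlet, 10928)
hamlet_clean = delete_before_line(hamlet_clean, 366)

Actually, we can condense the two lines above into just one.

In [None]:
hamlet_clean = delete_before_line(delete_after_line(hamlet, 10928), 366)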

**Exercise.**
Explain in a step-wise fashion how the single line of code above does the same work as the two lines we had before.

*put your explanation here*

We can compare `hamlet` and `hamlet_clean` to make sure that the deletion has worked correctly.
Since we might want to do this with `faustus` and `johncarter`, too, it is once again convenient to define a custom function.
This one is different from the ones we have seen so far in that it has no `return` statement. Instead it just uses `print` to show a few strings.

In [None]:
def comparison_print(text1, text2, number, position):
    # we compare a given number of characters in text1 and text2;
    # either the first n characters, or the last n characters,
    # depending on the value of position
    if position == "start":
        print("TEXT 1")
        print(text1[:number])
        # add a separator in the output
        print()
        print("--------")
        print()
        print("TEXT 2")
        print(text2[:number])
    if position == "end":
        print("TEXT 1")
        print(text1[-number:])
        # add a separator in the output
        print()
        print("--------")
        print()
        print("TEXT 2")
        print(text2[-number:])

In [None]:
# compare the first 500 characters of hamlet and hamlet_clean
comparison_print(hamlet, hamlet_clean, 500, "start")

In [None]:
# compare the last 500 characters of hamlet and hamlet_clean
comparison_print(hamlet, hamlet_clean, 500, "end")
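
The expressions `text1[:number]` and `text1[-number:]` inside `comparison_print` pick out the first and the last `number` characters of a string. A fuller explanation follows in a later unit, but this small sketch with an invented example string shows the basic behavior.

```python
text = "Something is rotten in the state of Denmark."

# the first 9 characters
first_nine = text[:9]
# the last 8 characters
last_eight = text[-8:]
# asking for more characters than the string has just returns the whole string
everything = text[:1000]

print(first_nine)
print(last_eight)
print(everything)
```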

**Exercise.**
There is a minor redundancy in `comparison_print`.
No matter whether `position == "start"` or `position == "end"` is true, a separator needs to be inserted between `text1` and `text2`.
Right now the separator is specified directly in the function as

```python
print()
print("--------")
print()
```

But maybe we would like to change it in the future, in which case we would have to change it in two places in the function.
Fix this issue by defining a custom function `print_separator()` that prints a separator (you can reuse the code above as it is).
Then replace the separator code in `comparison_print` by calls to `print_separator()`.

In [None]:
# add your definition of print_separator here,
# then modify comparison_print accordingly

def comparison_print(text1, text2, number, position):
    # we compare a given number of characters in text1 and text2;
    # either the first n characters, or the last n characters,
    # depending on the value of position
    if position == "start":
        print("TEXT 1")
        print(text1[:number])
        # add a separator in the output
        print()
        print("--------")
        print()
        print("TEXT 2")
        print(text2[:number])
    if position == "end":
        print("TEXT 1")
        print(text1[-number:])
        # add a separator in the output
        print()
        print("--------")
        print()
        print("TEXT 2")
        print(text2[-number:])

Alright, let's move on to the remaining two files.
In `faustus`, we want to delete the passage *FROM THE QUARTO OF 1616* on line 138 and everything before it.
And we also want to get rid of everything after *Terminat hora diem; terminat auctor opus.* on line 2853.
In `johncarter`, we can delete everything before *CHAPTER 1* on line 235, and everything after *that I shall soon know* on line 6939.

**Exercise.**
Similar to `hamlet_clean`, use the functions `delete_after_line` and `delete_before_line` to define new variables `faustus_clean` and `johncarter_clean`.
Make sure you pick the line numbers correctly:

- A command of the form `delete_before_line(faustus, n)` deletes everything **before** line `n` but not line `n` itself.
- A command of the form `delete_after_line(faustus, n)` deletes everything **after** line `n` but not line `n` itself.

In [None]:
# put your code here

## Clean-up step 2: Regex galore

Alright, we have removed some unnecessary lines from the texts, but the remaining lines still contain quite a bit of crud.
We will now clean this up with regular expressions.
As before, we start with *Hamlet*, although it is probably the hardest case.

In [None]:
import re

def hamlet_cleaner(text):
    # 1. remove all headers, i.e. lines starting with <h1, <h2, <h3, and so on
    text = re.sub(r"<h[0-9].*", r"", text)
    # 2. remove speaker information, i.e. lines of the form <p id="id012345789"...<br/>
    text = re.sub(r'<p id="id[0-9]*">[^<]*<br/>', r"", text)
    # 3. remove html tags, i.e. anything of the form <...>
    text = re.sub(r"<[^>]*>", r"", text)
    # 4. remove anything after [ or before ] on a line (this takes care of stage descriptions)
    text = re.sub(r"\[[^\]\n]*", r"", text)
    text = re.sub(r"[^\[\n]*\]", r"", text)
    return text

# apply all the regular expressions to hamlet
hamlet_regexed = hamlet_cleaner(hamlet_clean)

In [None]:
# run this cell if you want to compare the new version to the previous
comparison_print(hamlet_clean, hamlet_regexed, 500, "start")

The regular expressions above aren't exactly for the faint of heart, but at least one of them is fairly easy to figure out.
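
To see steps 2 and 3 of `hamlet_cleaner` at work, we can apply them to a small string. The snippet below is invented to imitate the markup in `hamlet.txt`; it is not copied from the file, so treat it as a sketch.

```python
import re

# invented snippet that imitates the markup in hamlet.txt
snippet = '<p id="id00058">Ber.<br/>\nWho\'s there?</p>'

# step 2: drop the speaker information, i.e. <p id="id...">, the name, and <br/>
no_speaker = re.sub(r'<p id="id[0-9]*">[^<]*<br/>', r"", snippet)
print(repr(no_speaker))

# step 3: drop whatever tags are left
no_tags = re.sub(r"<[^>]*>", r"", no_speaker)
print(repr(no_tags))
```

The order matters here: if all tags were removed first, the abbreviated speaker name would no longer be surrounded by markup and could not be found by the regex in step 2.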

In [None]:
# run this cell if you want to compare the new version to the previous one
comparison_print(hamlet_clean, hamlet_regexed, 500, "end")

**Exercise.**
Explain how the command `re.sub(r"<h[0-9].*", r"", text)` removes headers from every line in the file.
Here are a few examples of what headers look like in HTML:

- `<h1>A level-one header</h1>`
- `<h2 style="color:#069">A level-two header with additional styling</h2>`
- `<h3 class="test">A level-three header belonging to a custom class "test"</h3>`
- `<h4 class="test" id="320" style="color:#069">A level-four header with class, id, and style specifications</h4>`

*put your explanation here*

Good, *Hamlet* is now in a workable state.
So let's try *Faustus* next.

In [None]:
def faustus_cleaner(text):
    # 1. remove stage information
    #    (anything after 10 spaces)
    text = re.sub(r"(\s){10}[^\n]*", r"", text)
    # 2. remove speaker information
    #    (any word in upper caps followed by space or dot)
    text = re.sub(r"[A-Z]{2,}[\s\.]", r"", text)
    # 3. remove anything between square brackets (this takes care of footnote markers)
    text = re.sub(r"\[[^\]]*\]", r"", text)
    return text
    
faustus_regexed = faustus_cleaner(faustus_clean)

**Exercise.**
Explain in a step-wise fashion how `re.sub(r"\[[^\]]*\]", r"", text)` takes a string of the form `bla bla [ble ble] bli bli [blo blo] blu blu` and reduces it to `bla bla  bli bli  blu blu`.
That is to say, how does this regex delete material between square brackets?

Keep in mind that `[` and `]` have special meaning in regexes, so `\[` and `\]` are used to escape their special meaning and refer to the actual brackets.
And remember that `[^abc]` means *match anything except a, b, or c*.
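
As a refresher on those two ingredients, here is a small sketch on unrelated toy strings (so it does not give away the answer to the exercise).

```python
import re

# [^aeiou ] matches any single character except a vowel or a space,
# so adding + collects runs of consonants
consonant_runs = re.findall(r"[^aeiou ]+", "hello world")
print(consonant_runs)

# escaped brackets match the literal bracket characters
brackets = re.findall(r"\[|\]", "a[b]c")
print(brackets)
```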

*put your description here*

Again it is advisable to check the output of our cleaning operations.

In [None]:
# run this cell if you want to compare the new version to the previous
comparison_print(faustus_clean, faustus_regexed, 500, "start")

In [None]:
# run this cell if you want to compare the new version to the previous
comparison_print(faustus_clean, faustus_regexed, 500, "end")

Splendid, our regular expressions are doing an excellent job at automatically ripping out all the stuff we do not want.
This leaves us with only one more text, `johncarter`.

In [None]:
def johncarter_cleaner(text):
    # 1. delete CHAPTER I
    # (must be done like this because Roman 1 looks like English I)
    text = re.sub("CHAPTER I", "", text)
    # 2. remove any word in upper caps that is longer than 1 character
    text = re.sub(r"[A-Z]{2,}", r"", text)
    # 3. remove anything after [ or before ] on a line
    text = re.sub(r"\[[^\]\n]*", r"", text)
    text = re.sub(r"[^\[\n]*\]", r"", text)
    return text
    
johncarter_regexed = johncarter_cleaner(johncarter_clean)

In [None]:
# run this cell if you want to compare the new version to the previous
comparison_print(johncarter_clean, johncarter_regexed, 500, "start")

In [None]:
# run this cell if you want to compare the new version to the previous
comparison_print(johncarter_clean, johncarter_regexed, 500, "end")

**Exercise.**
Once again we've reached a milestone, so it's a good idea to collect all the relevant code into a single cell.
Write a custom function `get_and_clean` that will

- download hamlet, faustus, johncarter, and
- read them in as strings, and
- clean them up.

It is perfectly fine to copy-paste the previous function definitions into the cell below and then just write a custom function that runs them all in the correct order.

As before, you should use your new function to instantiate the three variables `hamlet_regexed`, `faustus_regexed`, and `johncarter_regexed`.
That way you can always run this cell to automatically repeat all the steps we have taken up to this point in the notebook.

In [None]:
# put your code here

## Tokenizing and counting

Now that we have cleaned up the source files, we can finally move on with the analysis.
Remember, never rush straight into the analysis; always clean up the data first.
Even the best analysis is worthless if it uses bad data.
As computer scientists like to say: **garbage in, garbage out**.

In our case, the analysis will be fairly straightforward.
We first want to tokenize the texts, which means that we convert them from strings into lists of word tokens.
We do this with a regular expression.

In [None]:
import re

tokens_hamlet = re.findall(r"\w+", str.lower(hamlet_regexed))
tokens_faustus = re.findall(r"\w+", str.lower(faustus_regexed))
tokens_johncarter = re.findall(r"\w+", str.lower(johncarter_regexed))

The function `re.findall` takes two arguments, a regular expression and a string.
It then constructs a list that contains all parts of the string that matched the regular expression.
By using the regex `r"\w+"`, we instruct `re.findall` to look for sequences that consist only of word characters.
But that's exactly what a word is, a sequence of word characters!
So we are telling `re.findall` to find all words in the string.
The cell below illustrates this for a very short string.

In [None]:
import re

test_string = "FTL is short for faster-than-light; we probably won't ever have space ships capable of FTL-travel."
tokens = re.findall(r"\w+", str.lower(test_string))
print(tokens)

**Exercise.**
Note that *FTL-travel* is split into two words *FTL* and *travel*, and *won't* is split into *won* and *t*.
Explain why this is exactly what we expect given the regex `r"\w+"`.
Then fix the code below (it's the same as above) so that the regex also considers hyphens and apostrophes as part of a single word.

In [None]:
import re

test_string = "FTL is short for faster-than-light; we probably won't ever have space ships capable of FTL-travel."

# tokenize the string
tokens = re.findall(r"\w+", str.lower(test_string))
print(tokens)

Notice that we apply `str.lower` to the string before it gets fed into `re.findall`.
That's because we are tokenizing the string to get word counts, and we want, say, *the* and *The* to be counted as tokens of the same word type *the*.
By making everything lowercase, we eliminate these distinctions.
However, it may have unintended side effects, like conflating the company name *Google* with the verb *google*.
But this is a minor problem compared to capitalization, so we should still get better results with `str.lower` than without it.
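
The effect of lowercasing on the counts is easy to check on a toy sentence. This is a minimal sketch with an invented sentence; it simply counts how often a token occurs in each token list.

```python
import re

sentence = "The cat chased the dog, and the dog chased the cat."

# without lowercasing, "The" and "the" are different tokens
tokens_raw = re.findall(r"\w+", sentence)
# with lowercasing, they are tokens of the same word type
tokens_lower = re.findall(r"\w+", str.lower(sentence))

print(list.count(tokens_raw, "The"), list.count(tokens_raw, "the"))
print(list.count(tokens_lower, "the"))
```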

At any rate we now have three variables `tokens_hamlet`, `tokens_faustus`, and `tokens_johncarter` that each store a tokenized version of the three files.
The only thing left to do is to count the tokens for each word type.
Python makes this very easy for us: the `collections` library provides `Counter`, which we can call like a function and which does the counting for us.
`Counter` takes as its only argument a list (like the ones produced by `re.findall` for tokenization).
It then converts the list into a *Counter*.
Here is what this looks like with our short example string.

In [None]:
import re
from collections import Counter

test_string = "FTL is short for faster-than-light; we probably won't ever have space ships capable of FTL-travel."

# tokenize the string
tokens = re.findall(r"\w+", str.lower(test_string))
print("The list of tokens:", tokens)

# add an empty line
print()

# and now do the counting
counts = Counter(tokens)
print("Number of tokens for each word type:", counts)

So now let's do the same thing with our three token lists, stored in the variables `tokens_hamlet`, `tokens_faustus`, and `tokens_johncarter`.

In [None]:
from collections import Counter

counts_hamlet = Counter(tokens_hamlet)
counts_faustus = Counter(tokens_faustus)
counts_johncarter = Counter(tokens_johncarter)

**Exercise.**
Again it would be nice to have a function that does all the previous steps for us in one fell swoop.
Write a custom function `count_tokens` that takes a string as argument and computes its word counts.
Then instantiate the variables `counts_hamlet`, `counts_faustus`, and `counts_johncarter` using your custom function.
Make sure you also load all required libraries in the cell.

Remember, the point of this cell is that when you open the notebook at a later point, you can save some time by just running this cell instead of all the others in this section.

In [None]:
# put your code here

**Exercise.**
Actually, let's be even lazier and combine the custom functions from the previous exercises.
That way, it's enough to just run this cell, skipping all the previous ones.

First, copy-paste the relevant code into the cell below.
Then add a custom function `get_and_count` that satisfies the following criteria:

1. `get_and_count` takes a string as its only argument;
1. The values for the argument can be `hamlet`, `faustus`, or `johncarter`.
1. If a different string is passed as the argument, e.g. `foobar`, the function will behave as if the argument were `hamlet`.
1. Based on what argument was provided, the function
    1. downloads the correct file, and
    1. reads in the file as a string, and
    1. tokenizes the string, and
    1. counts the tokens.
    
Then instantiate the variables `counts_hamlet`, `counts_faustus`, and `counts_johncarter` using `get_and_count`.
    
Remember that you can use a custom function inside the definition of another custom function, so you should be able to write `get_and_count` with a minimal amount of additional code.
However, you may have to move some pieces of code from `get_text` into `get_and_count`.

In [None]:
# put your code here

## Looking at the counts

Let's take a quick peek at what the counts look like for each text.
We don't want to do this with something like `print(counts_hamlet)`, because the output would be so large that your browser might actually choke on it (it has happened to me sometimes).
Instead, we will look at the 100 most common words.
We can do this with the function `Counter.most_common`, which takes two arguments: a Counter, and a positive number.

In [None]:
print("Most common Hamlet words:", Counter.most_common(counts_hamlet, 100))
print()
print("Most common Faustus words:", Counter.most_common(counts_faustus, 100))
print()
print("Most common John Carter words:", Counter.most_common(counts_johncarter, 100))
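
Here is the same call on a tiny Counter, so that the output is short enough to check by hand. The example sentence is invented. The sketch also shows the method-style call `counts.most_common(2)`, which you will often see elsewhere and which gives the same result.

```python
from collections import Counter

counts = Counter(str.split("to be or not to be that is the question"))

# function-style call, as used in this notebook
top_two = Counter.most_common(counts, 2)
# method-style call with the same result
top_two_method = counts.most_common(2)

print(top_two)
print(top_two_method)
```

Words with the same count are listed in the order in which they were first encountered.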

Well, that doesn't look too bad, but suppose we want to have each entry on its own line, like in a spreadsheet.
For this we can use the function `pprint` from the `pprint` library.
The name *pprint* is short for *pretty-print*.

In [None]:
from pprint import pprint

print("Most common Hamlet words:")
pprint(Counter.most_common(counts_hamlet, 100))
print()
print("Most common Faustus words:")
pprint(Counter.most_common(counts_faustus, 100))
print()
print("Most common John Carter words:")
pprint(Counter.most_common(counts_johncarter, 100))

While the list format is good for figuring out which specific words are most common, it doesn't tell us much about the frequency distribution in general.
Is one of the texts more Zipfian than the others?
Questions like this are best answered with plots.
Plotting will be described in an expansion unit. For the main units, the plotting code will always be present in the cell already.
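
As a reminder of what *Zipfian* means: a word's frequency is roughly inversely proportional to its rank, so rank times count stays roughly constant. The sketch below checks this on invented counts that follow the law exactly; `rank_times_count` is a hypothetical helper, and real texts will only approximate the pattern.

```python
from collections import Counter

def rank_times_count(counts, top):
    # hypothetical helper: for the `top` most common words,
    # return tuples of the form (rank, word, count, rank * count)
    ranked = Counter.most_common(counts, top)
    return [(rank, word, count, rank * count)
            for rank, (word, count) in enumerate(ranked, start=1)]

# invented counts that follow Zipf's law exactly: count = 120 / rank
toy_counts = Counter({"the": 120, "and": 60, "of": 40, "to": 30})

table = rank_times_count(toy_counts, 4)
for row in table:
    print(row)
```

The same helper could be applied to `counts_hamlet` and the other Counters once they have been computed.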

In [None]:
# tell Jupyter to display the plots directly in the browser
%matplotlib inline

# load the pandas library
import pandas

def plotting(counts):
    counts_sorted = pandas.Series.sort_values(pandas.Series(counts), ascending=False)
    counts_sorted.plot(figsize=(15,15))

plotting(counts_hamlet)

In [None]:
%matplotlib inline
plotting(counts_faustus)

In [None]:
%matplotlib inline
plotting(counts_johncarter)

The graphs show very nicely that all three texts have a very Zipfian distribution, with a tall neck and a very long tail.
However, we also see that *A Princess of Mars* has a harsher shift from neck to tail, with a very slim body, whereas *Hamlet* and *Faustus* are smoother in this area.
We can't say much more than that, though, at this point, because any other differences in word frequencies are drowned out by the rather uninformative high-frequency words like *the* and *and*.
If we want to dig deeper, we will have to get rid of them.
More on how to do that next time.