# Literature analysis with unigrams: Now let's have a little fun

The previous notebook showed you how to download and read in files, clean them up, and tokenize them.
A tokenized list is nice, but not enough.
Slices make it easier for us to compare two versions of the same file, but that's not quite what we want either.
In order to carry out a quantitative analysis of each author's writing style, we also need to know how often each word is used.
As usual, Python is very kind to us and provides an off-the-shelf solution.

But before you proceed, make sure to run the cell below.
This will once again read in the cleaned-up text files and store them as tokenized lists in the variables `hamlet`, `faustus`, and `mars`.
If you get an error, make sure that you did the previous notebook and that this notebook is in a folder containing the files `hamlet_clean.txt`, `faustus_clean.txt`, and `mars_clean.txt` (which should be the case if you did the previous notebook).

In [None]:
import re

def tokenize(the_string):
    """Convert string to list of words"""
    return re.findall(r"\w+", the_string)


def tokenize_file(the_file):
    """Read file as string and tokenize it"""
    with open(the_file, mode="r") as text:
        return tokenize(text.read())


# define a variable for each token list
hamlet = tokenize_file("hamlet_clean.txt")
faustus = tokenize_file("faustus_clean.txt")
mars = tokenize_file("mars_clean.txt")

**Caution.**
If you restart the kernel at any point, make sure to run this cell again so that the variables `hamlet`, `faustus`, and `mars` are defined.

## Counting words

Python makes it very easy to count how often an element occurs in a list: the `collections` library provides a function `Counter` that does the counting for us.
The `Counter` function takes as its only argument a list (like the ones produced by `re.findall` for tokenization).
It then converts the list into a *Counter*.
Here is what this looks like with a short example string.

In [None]:
import re
from collections import Counter  # this allows us to use Counter instead of collections.Counter

test_string = "FTL is short for faster-than-light; we probably won't ever have space ships capable of FTL-travel."

# tokenize the string
tokens = re.findall(r"\w+", str.lower(test_string))
print("The list of tokens:", tokens)

# add an empty line
print()

# and now do the counting
counts = Counter(tokens)
print("Number of tokens for each word type:", counts)

Let's take a quick peek at what the counts look like for each text.
We don't want to do this with something like `print(counts_hamlet)`, because the output would be so large that your browser might actually choke on it (it has happened to me sometimes).
Instead, we will look at the 100 most common words.
We can do this with the function `Counter.most_common`, which takes two arguments: a Counter, and a positive number.

In [None]:
from collections import Counter

# construct the counters
counts_hamlet = Counter(hamlet)
counts_faustus = Counter(faustus)
counts_mars = Counter(mars)

print("Most common Hamlet words:", Counter.most_common(counts_hamlet, 100))
print()
print("Most common Faustus words:", Counter.most_common(counts_faustus, 100))
print()
print("Most common John Carter words:", Counter.most_common(counts_mars, 100))
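
A side note on notation: this notebook writes `Counter.most_common(some_counter, n)`, but the same function can also be called as a method on the counter itself, which is the form you will often see in other people's code.
The snippet below is a small sketch on a made-up token list (`toy_tokens` is not one of our texts) showing that both spellings give the same result.

```python
from collections import Counter

# a made-up token list, just for illustration
toy_tokens = ["to", "be", "or", "not", "to", "be", "to"]
toy_counts = Counter(toy_tokens)

# function-style call, as used throughout this notebook
function_style = Counter.most_common(toy_counts, 2)

# method-style call; it does exactly the same thing
method_style = toy_counts.most_common(2)

print(function_style)
print(method_style)
```

Which spelling you use is a matter of taste; this notebook sticks with the function-style call.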

**Exercise.**
The code below uses `import collections` instead of `from collections import Counter`.
As you can test for yourself, the code now produces various errors.
Fix the code so that the cell runs correctly.
You must not change the `import` statement.

In [None]:
import collections

# construct the counters
counts_hamlet = Counter(hamlet)
counts_faustus = Counter(faustus)
counts_mars = Counter(mars)

print("Most common Hamlet words:", Counter.most_common(counts_hamlet, 100))
print()
print("Most common Faustus words:", Counter.most_common(counts_faustus, 100))
print()
print("Most common John Carter words:", Counter.most_common(counts_mars, 100))

Python's output for `Counter.most_common` doesn't look too bad, but it is a bit convoluted.
We can use the function `pprint` from the `pprint` library to have each word on its own line.
The name *pprint* is short for *pretty-print*.

In [None]:
from pprint import pprint  # we want to use pprint instead of pprint.pprint
from collections import Counter

# construct the counters
counts_hamlet = Counter(hamlet)
counts_faustus = Counter(faustus)
counts_mars = Counter(mars)

# we have to split lines now because pprint cannot take multiple arguments like print
print("Most common Hamlet words:")
pprint(Counter.most_common(counts_hamlet, 100))
print()
print("Most common Faustus words:")
pprint(Counter.most_common(counts_faustus, 100))
print()
print("Most common John Carter words:")
pprint(Counter.most_common(counts_mars, 100))

**Exercise.**
What is the difference between the following two pieces of code?
How do they differ in their output, and why?

In [None]:
from collections import Counter

counts = Counter(hamlet[:50])
print(counts)

In [None]:
from collections import Counter

count = Counter(hamlet)
print(Counter.most_common(count, 50))

*put your answer here*

## A problem

If you look at the lists of 100 most common words for each text, you'll notice that they are fairly similar.
For instance, all of them have *a*, *the*, and *to* among the most frequent ones.
That's not a peculiarity of these few texts, it's a general property of English texts.
This is because of **Zipf's law**: if we rank words by their frequency, the frequency of the n-th word is roughly proportional to 1/n.
So the most common word is about twice as frequent as the second most common one, three times as frequent as the third most common one, and so on.
As a result, a small number of very common words can make up around half of all words in a text.
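
To get a feeling for these numbers, here is a small sketch that computes the frequencies an idealized Zipf distribution would predict.
The vocabulary size of 10,000 distinct words is an arbitrary assumption made for illustration, and real texts follow the law only approximately.

```python
# idealized Zipf distribution: the word at rank n gets the weight 1/n
vocabulary_size = 10000  # assumed number of distinct words

# add up all weights so that we can turn them into relative frequencies
total = 0
rank = 1
while rank <= vocabulary_size:
    total = total + 1 / rank
    rank = rank + 1

# relative frequencies of the two most common words
first = (1 / 1) / total
second = (1 / 2) / total
print("Most common word:", first)
print("Second most common word:", second)

# share of the whole text covered by the 100 most common words
covered = 0
rank = 1
while rank <= 100:
    covered = covered + (1 / rank) / total
    rank = rank + 1
print("Share covered by the top 100 words:", covered)
```

Under these assumptions, the most common word alone accounts for about 10% of the text, and the 100 most common words together cover a bit more than half of it.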

Zipf's law means that word frequencies in a text give rise to a peculiar shape that we might call the Zipf dinosaur.

![The Zipf dinosaur](./media/zipfgraph_dinosaur.jpeg)

A super-high neck, followed by a very long tail.
For English texts, the distribution usually resembles the one below, and that's even though this graph only shows the most common words.

![Zipf distribution for English](./media/zipfgraph_english.png)

There is precious little variation between English texts with respect to which words are at the top.
These common but uninformative words are called **stop words**.
If we want to find any interesting differences between *Hamlet*, *Doctor Faustus*, and *Princess of Mars*, we have to filter out all these stop words.
That's not something we can do by hand, but our existing box of tricks doesn't really seem to fit either.
We could use a regular expression to delete all these words from the string before it even gets tokenized.
But that's not the best solution:

1. A minor mistake in the regular expression might accidentally delete many things we want to keep.
   Odds are that this erroneous deletion would go unnoticed, possibly invalidating our stylistic analysis.
1. There are hundreds of stop words, so the regular expression would be very long.
   Ideally, our code should be compact and easy to read.
   A super-long regular expression is the opposite of that, and it's no fun to type either.
   And of course, the longer a regular expression, the higher the chance that you make a typo (which takes us back to point 1).
1. While regular expressions are fast, they are not as fast as most of the operations Python can perform on lists and counters.
   If there is an easy alternative to a regular expression, that alternative is worth exploring.
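
To make point 1 concrete, here is a small sketch of how such a regular expression can go wrong.
The sample sentence and the three-word stop list are made up for illustration.
The first pattern forgets the word boundaries `\b`, so it also deletes matching letters inside longer words.

```python
import re

sample = "the theory is that this island is the best"

# careless pattern: no word boundaries, so it also matches inside words
careless = re.sub(r"the|is|that", "", sample)
print(re.findall(r"\w+", careless))

# more careful pattern: \b only allows matches at word boundaries
careful = re.sub(r"\b(the|is|that)\b", "", sample)
print(re.findall(r"\w+", careful))
```

The careless version mangles *theory* into *ory*, *this* into *th*, and *island* into *land*, and with a text as long as *Hamlet* you would probably never notice.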

Alright, so if regexes aren't the best solution, what's the alternative?
Why, it's simple: 0.

## Changing counts

The values in a Python counter can be changed very easily.

In [None]:
from collections import Counter
from pprint import pprint

# define a test counter and show its values
test = Counter(["John", "said", "that", "Mary", "said", "that", "Bill", "stinks"])
pprint(test)

# 'that' is a stop word; set its count to 0
test["that"] = 0
pprint(test)

The code above uses the new notation `test["that"]`.
Remember that Python allows us to reference specific elements with an index that corresponds to their position in the list, e.g. `some_list[2]`.
Counters cannot be accessed by position, so we cannot use indices in this way.
But instead of an index we can just use the element itself.
So `test["that"]` points to the value for `"that"` in the counter `test`.
We also say that `"that"` is a **key** that points to a specific **value**.
The line

```python
test["that"] = 0
```

instructs Python to set the value for the key `"that"` to `0`.
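
One detail about keys is worth knowing, shown in the sketch below with a made-up counter: asking a counter for a key it has never seen does not cause an error.
The counter simply reports a count of 0.

```python
from collections import Counter

toy = Counter(["tea", "coffee", "tea"])

# a key that is in the counter
print(toy["tea"])

# a key that is not in the counter
print(toy["water"])

# looking up a missing key does not add it to the counter
print("water" in toy)
```

This is different from lists, where asking for a position that does not exist makes Python raise an error.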

**Exercise.**
Look at the code cell below.
For each line, add a comment that briefly describes what it does (for instance, *set value of 'that' to 0*).
If the line causes an error, fix the error and add two comments:

1. What caused the error?
1. What does the corrected line do?

You might want to use `pprint` to look at how the counter changes after each line.

In [None]:
from collections import Counter

# define a test counter and show its values
test = Counter(["John", "said", "that", "Mary", "said", "that", "Bill", "stinks"])

test["that"] = 0  # set value of 'that' to 0
test["Mary"] = test["that"]
test[John] = 10
test["said"] = test["John' - 'said"]
test["really"] = 0

Since we can change the values of keys in counters, stop words become very easy to deal with.
Recall that the problem with stop words is not so much that they occur in the counter, but that they make up the large majority of high frequency words.
Our intended fix was to delete them from the counter.
But instead, we can just set the count of each stop word to 0.
Then every stop word is still technically contained in the counter, but since its frequency is 0 it will no longer show up among the most common words, which is what we really care about.
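
The sketch below, on a made-up counter, illustrates what "still technically contained" means.
For contrast it also uses `del`, a built-in Python statement that this notebook does not otherwise rely on, which removes a key altogether.

```python
from collections import Counter

toy = Counter(["the", "cat", "saw", "the", "dog", "the", "cat"])

# set the count of the stop word to 0
toy["the"] = 0
top_after_zeroing = Counter.most_common(toy, 1)
still_present = "the" in toy
print(top_after_zeroing)  # 'the' is no longer the most common word
print(still_present)      # but the key is still in the counter

# del removes the key entirely
del toy["the"]
present_after_del = "the" in toy
print(present_after_del)
```

For our purposes the difference does not matter, so setting counts to 0 is all we need.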

Alright, let's do that.

**Exercise.**
The figure above shows you the most common stop words of English (except for *whale*, you can ignore that one).
Extend the code below so that the count for each one of the stop words listed in the figure is set to 0.
Compare the output before and after stop word removal and ask yourself whether there has been significant progress.

In [None]:
from collections import Counter

# construct the counters
counts_hamlet = Counter(hamlet)
# output with stop words
print("Most common Hamlet words before clean-up:\n", Counter.most_common(counts_hamlet, 25))

# set stop word counts to 0
# put your code here

# output without stop words
print("Most common Hamlet words after clean-up:\n", Counter.most_common(counts_hamlet, 25))

Okay, this is an improvement, but it's really tedious.
You have to write the same code over and over again, changing only the key.
And you aren't even done yet, there are still many more stop words to be removed.
But don't despair, you don't have to add another 100 lines of code.
No, repetitive tasks like that are exactly why programming languages have **`for` loops**.

## A new kind of loop: `for`

You already know `while` loops, which keep executing the same code over and over again until a condition is no longer met.
A `for` loop is similar in that it runs the same code over and over again.
But instead of a condition, the `for` loop takes a collection of elements and runs the code once for each element.

In [None]:
# some tedious repetition
print("H")
print("e")
print("l")
print("l")
print("o")
print("!")
print("!")
print("!")

In [None]:
# a for-loop with a list as container is much shorter
for character in ["H", "e", "l", "l", "o", "!", "!", "!"]:
    print(character)

In [None]:
# strings count as containers, too
for character in "Hello!!!":
    print(character)

The general format of a `for`-loop is very intuitive.

```python
for element in some_container:
    # code to be run for each element
```

Just keep in mind that the code inside the `for`-loop gets run once for each element in the container.
Many different kinds of objects can be containers, including lists and strings.
Intuitively, if something is built up from smaller elements, it is a container.
So `"5"` would be a container (a string built up from a single character), but `5` would not be (an integer isn't a collection of smaller things).
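
A quick way to check this for yourself is sketched below.
Looping over the string `"5"` works, whereas looping over the integer `5` makes Python raise a `TypeError`.
The `try`/`except` construction is not covered in this notebook; it is used here only so that the cell keeps running after the error.

```python
# a string is a container, even if it consists of a single character
for character in "5":
    print("Found the character", character)

# an integer is not a container, so Python refuses to loop over it
loop_failed = False
try:
    for digit in 5:
        print(digit)
except TypeError as error:
    loop_failed = True
    print("Python complains:", error)
```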

Just like with `while`-loops, there are no restrictions on how complex the code inside a `for`-loop may be.

In [None]:
for character in "Hello!!!":
    character = str.lower(character)
    if character in ["a", "e", "i", "o", "u"]:
        print("Found a vowel:", character)
    else:
        if character not in ["!", ".", "?", "-", ";", " "]:
            print("Found a consonant:", character)

**Exercise.**
The output of the code below might be unexpected to you.
Explain what is going on here.

In [None]:
# strings count as containers, too
for character in ["Hello!!!"]:
    print(character)

*put your explanation here*

**Exercise.**
The code below contains several mistakes.
Fix all of them.

In [None]:
to_buy = ["honey"]

for ingredient in "flour", "sugar", "honey"
if ingredient not in to_buy:
    list.append(to_buy, the_ingredient)

**Exercise.**
Look at the code below.
Provide a more succinct implementation that uses a `for`-loop instead.

In [None]:
print("The first natural number is 0.")
print("The number after that is 1.")
print("The number after that is 2.")
print("The number after that is 3.")
print("The number after that is 4.")
print("The number after that is 5.")
print("The number after that is 6.")
print("It goes on like that forever.")

With a `for`-loop, setting the counts of stop words to 0 becomes a matter of just a few lines.

In [None]:
from collections import Counter

# construct the counters
counts_hamlet = Counter(hamlet)
counts_faustus = Counter(faustus)
counts_mars = Counter(mars)

stopwords = ["the", "of", "and", "a", "to", "in",
             "that", "his", "it", "he", "but", "as",
             "is", "with", "was", "for", "all", "this",
             "at", "while", "by", "not", "from", "him",
             "so", "be", "one", "you", "there", "now",
             "had", "have", "or", "were", "they", "which",
             "like"]

for word in stopwords:
    counts_hamlet[word] = 0
    counts_faustus[word] = 0
    counts_mars[word] = 0

**Exercise.**
Since there are no limits on the complexity of the code inside a `for`-loop, it can also contain another `for`-loop.
Copy-paste the code above into the cell below and replace the last three lines by a `for`-loop inside the stop word `for`-loop.

In [None]:
# put the modified code here

**Exercise.**
This continues the previous exercise.
Keep expanding the list of stop words until the top 100 words for each text are sufficiently distinct for meaningful comparisons.
It's up to you to decide when that is the case, but you want to see things like proper names, nouns, and adjectives, rather than just coordinations, pronouns, and forms of *have* or *be*.

Okay, now we can finally compare the three texts based on their unigram counts.
You can use the `Counter.most_common` function to see which words are most common in each text.
We can also compare the overall frequency distribution.
The code below will plot the counters, giving you a graphical representation of the frequency distribution, similar to the Zipf figures above.

(Don't worry about what any of the code below does.
Just run the cell and look at the pretty output.)

In [None]:
%matplotlib inline

# import relevant matplotlib code
import matplotlib.pyplot as plt

# the lines above are needed for Jupyter to display the plots in your browser
# do not remove them

# a little bit of preprocessing so that the data is ordered by frequency
def plot_preprocess(the_counter, n):
    """format data for plotting n most common items"""
    sorted_list = sorted(the_counter.items(), key=lambda x: x[1], reverse=True)[:n]
    words, counts = zip(*sorted_list)
    return words, counts


# you can change the max words value to look at more or fewer words in one plot
max_words = 30

for text in [counts_hamlet, counts_faustus, counts_mars]:
    # start a new, wide figure for each text so that all three plots have the same size
    plt.figure(figsize=(20, 10))
    # preprocess only once and unpack the two results
    words, counts = plot_preprocess(text, max_words)
    plt.bar(range(len(counts)), counts, align="center")
    plt.xticks(range(len(words)), words)
    plt.show()

So there you have it.
Your first, fairly simple quantitative analysis of writing style.
You can compare the three texts along several dimensions:

1. What are the most common words in each text?
1. Are the frequency distributions very different?
   Perhaps one of them keeps repeating the same words over and over, whereas another author varies their vocabulary more and thus has a smoother curve that is less heavily tilted towards the left?
   
Play around with this a bit.
You can change the `max_words` variable in the code above to "zoom in" and "zoom out".
But keep in mind that it might take a while to compute the plots for values past 100.
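
If you would like a number to go with the visual impression for point 2, one simple option is the ratio of distinct words (types) to total words (tokens).
This measure is not part of the analysis above; the sketch below is just one possible way to compute it, shown on two made-up token lists, and the function name `type_token_ratio` is my own choice.

```python
from collections import Counter


def type_token_ratio(tokens):
    """Number of distinct words divided by total number of words"""
    counts = Counter(tokens)
    return len(counts) / len(tokens)


# two made-up token lists of the same length
repetitive = ["no", "no", "no", "never", "no", "never"]
varied = ["to", "be", "or", "not", "to", "be"]

ratio_repetitive = type_token_ratio(repetitive)
ratio_varied = type_token_ratio(varied)
print(ratio_repetitive)
print(ratio_varied)
```

The ratio drops as texts get longer, so it only makes sense to compare samples of the same size, for instance slices like `hamlet[:5000]` and `faustus[:5000]`.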

We're of course far away from a truly insightful analysis of these texts.
But this is in fact how a lot of language technology works.
For example, there are algorithms that determine what ads to place on a given website depending on its content.
But the analysis of the content often doesn't go much beyond working out which words, stop words aside, are most frequent in the text.
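
As a rough illustration of that idea, the sketch below picks the most frequent words of a short text, ignoring stop words, as its "keywords".
The function name `guess_keywords`, the tiny stop word list, and the sample text are all made up for this example; real ad-placement systems are certainly more elaborate.

```python
import re
from collections import Counter


def guess_keywords(text, stopwords, n):
    """Return the n most frequent words in text that are not stop words"""
    counts = Counter(re.findall(r"\w+", str.lower(text)))
    for word in stopwords:
        counts[word] = 0
    keywords = []
    for pair in Counter.most_common(counts, n):
        # each pair consists of a word and its count; we only keep the word
        list.append(keywords, pair[0])
    return keywords


# made-up page content and a deliberately tiny stop word list
page = "Cheap flights to Rome! Book flights and hotels in Rome today. Rome is waiting."
tiny_stopwords = ["to", "and", "in", "is"]

keywords = guess_keywords(page, tiny_stopwords, 2)
print(keywords)
```

Everything in this function is something you have already seen in this notebook: tokenizing, counting, zeroing out stop words with a `for`-loop, and `Counter.most_common`.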

## Bullet point summary
    
- **Counters**
    - Counters count the number of tokens for each type in a list.
    - Load them with `from collections import Counter`.
    - Use `some_counter = Counter(some_list)` to convert `some_list` to a counter `some_counter`.
    - Use `Counter.most_common(your_counter, n)` to get the `n` most common words in the counter.
    - Use `some_counter[some_key]` to get the value of `some_key`.
      Don't forget the quotation marks for strings.
    - Values can also be modified, e.g. `some_counter[some_key] = some_counter[other_key] + 5`.
    
- **Loading libraries**  
  When loading only a part of a library, you can use `from libraryX import partY` instead of `import libraryX`.
  Then you can simply write `partY` in your code instead of `libraryX.partY`.
      
    ```python
    from re import sub
    # we now use sub instead of re.sub
    sub(r"\d+", "", some_string)
    ```
    
  By default, you should use `import X`.
  Only use `from X import Y` if
    1. `X` is a pretty long library name, and
    1. you need to use `Y` a lot, and
    1. you really need only `Y` and none of the other functions provided by `X`.
   
- **`for`-loops**
  Use `for`-loops to apply a piece of code to each element in a collection.
  
    ```python
    for element in collection:
        # do something; you may use element as a variable
    ```