# For-Loops for a More Detailed Analysis of Word Counts

In the previous unit we started our quantitative analysis of *Hamlet*, *Doctor Faustus*, and *A Princess of Mars*.
The cell below contains all the code to immediately repeat the relevant steps from the previous unit so that we can continue where we left off.
As you can see, this is once again a case where it is convenient to break up individual steps into functions that are then combined to yield the desired result.

In [None]:
import urllib.request
import re
from collections import Counter

# we first define custom functions for all individual steps

def get_file(text):
    if text == "hamlet":
        urllib.request.urlretrieve("http://www.gutenberg.org/cache/epub/1524/pg1524.html", "hamlet.txt")
    if text == "faustus":
        urllib.request.urlretrieve("http://www.gutenberg.org/cache/epub/811/pg811.txt", "faustus.txt")
    if text == "johncarter":
        urllib.request.urlretrieve("http://www.gutenberg.org/cache/epub/62/pg62.txt", "johncarter.txt")
        
def read_file(filename):
    with open(filename, "r", encoding="utf-8") as text:
        return text.read()
    
def delete_before_line(string, line):
    return str.split(string, "\n", line)[-1]

def delete_after_line(string, line):
    return str.join("\n", str.split(string, "\n")[:line+1])

def hamlet_cleaner(text):
    # 0. delete unwanted lines
    text = delete_after_line(delete_before_line(text, 366), 10928)
    # 1. remove all headers, i.e. lines starting with <h1, <h2, <h3, and so on
    text = re.sub(r"<h[0-9].*", r"", text)
    # 2. remove speaker information, i.e. lines of the form <p id="id012345789"...<br/>
    text = re.sub(r'<p id="id[0-9]*">[^<]*<br/>', r"", text)
    # 3. remove html tags, i.e. anything of the form <...>
    text = re.sub(r"<[^>]*>", r"", text)
    # 4. remove anything after [ or before ] on a line (this takes care of stage descriptions)
    text = re.sub(r"\[[^\]\n]*", r"", text)
    text = re.sub(r"[^\[\n]*\]", r"", text)
    return text

def faustus_cleaner(text):
    # 0. delete unwanted lines
    text = delete_after_line(delete_before_line(text, 139), 2854)
    # 1. remove stage information
    #    (anything after 10 spaces)
    text = re.sub(r"(\s){10}[^\n]*", r"", text)
    # 2. remove speaker information
    #    (any word in upper caps followed by space or dot)
    text = re.sub(r"[A-Z]{2,}[\s\.]", r"", text)
    # 3. remove anything between square brackets (this takes care of footnote markers)
    text = re.sub(r"\[[^\]]*\]", r"", text)
    return text

def johncarter_cleaner(text):
    # 0. delete unwanted lines
    text = delete_after_line(delete_before_line(text, 234), 6940)
    # 1. delete CHAPTER I
    # (must be done like this because Roman 1 looks like English I)
    text = re.sub("CHAPTER I", "", text)
    # 2. remove any word in upper caps that is longer than 1 character
    text = re.sub(r"[A-Z]{2,}", r"", text)
    # 3. remove anything after [ or before ] on a line
    text = re.sub(r"\[[^\]\n]*", r"", text)
    text = re.sub(r"[^\[\n]*\]", r"", text)
    return text

def tokenize(string):
    return re.findall(r"\w+", string)

def count(token_list):
    return Counter(token_list)


# and now we have two functions that use all the previous functions
# to do all the necessary work for us
def get_and_clean(text):
    get_file(text)
    string = read_file(text + ".txt")
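    # file-specific cleaning steps
    if text == "hamlet":
        # the Hamlet patterns are all lower case, so lowercasing first is safe
        return hamlet_cleaner(str.lower(string))
    if text == "faustus":
        # this cleaner matches upper-case words, so lowercase only afterwards
        return str.lower(faustus_cleaner(string))
    if text == "johncarter":
        # same here: "CHAPTER I" and other upper-case words must still be upper case
        return str.lower(johncarter_cleaner(string))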

def tokenize_and_count(string):
    return count(tokenize(string))

# and finally we get to run all the code
hamlet = tokenize_and_count(get_and_clean("hamlet"))
faustus = tokenize_and_count(get_and_clean("faustus"))
johncarter = tokenize_and_count(get_and_clean("johncarter"))

Once the cell above has finished executing, you can run the cell below to see the 10 most common words for each text.

In [None]:
print(Counter.most_common(hamlet, 10))
print(Counter.most_common(faustus, 10))
print(Counter.most_common(johncarter, 10))
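
To make the shape of that output concrete, here is a small sketch on a made-up counter (the names `toy` and `top_two` are only for illustration and do not appear elsewhere in this unit).
`Counter.most_common` returns a list of pairs, each consisting of a word and its count, ordered from most to least frequent.

```python
from collections import Counter

# a made-up counter, not one of our three texts
toy = Counter(["a", "a", "a", "b", "b", "c"])

# ask for the 2 most common words
top_two = Counter.most_common(toy, 2)
print(top_two)
```

Running this should print `[('a', 3), ('b', 2)]`: the word `c` is left out because we only asked for two entries.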

But looking at the most common words only gets us so far.
There are many other things we might want to look at based on the token counts:

- How often does a word appear on average?

- What are the relative frequencies, rather than the absolute counts?
  Knowing this would make it easier to compare frequencies across the texts, since absolute counts naturally vary with text length.
 
- What is the average frequency of a word?
  This is different from the average count, as the latter once again varies with text length.

- What is the average word length?
  The texts might show a large difference in this area. After all, *Hamlet* and *Faustus* have to obey their meter.
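
As a rough illustration of the difference between absolute counts and relative frequencies, here is a sketch for a single word in a made-up counter.
The names `toy`, `total_tokens`, `absolute`, and `relative` are illustrative only, and the total is added up by hand, which is feasible only because the toy counter has just three word types.

```python
from collections import Counter

# a made-up counter with 12 tokens spread over 3 types
toy = Counter(["a", "a", "a", "a", "a", "a", "b", "b", "b", "b", "c", "c"])

# total number of tokens, added up by hand (only feasible for a tiny counter)
total_tokens = toy["a"] + toy["b"] + toy["c"]

# absolute count vs. relative frequency of "a"
absolute = toy["a"]
relative = toy["a"] / total_tokens
print(absolute, relative)
```

Here `a` has an absolute count of 6 but a relative frequency of 0.5, i.e. half of all tokens, and the latter number would stay the same if the text were twice as long with the same proportions.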

Most of these questions require us to look at each word type in the Counter.
But we do not know all the word types, and even if we did, there are probably thousands of them.
So we cannot tell Python something like "look at *and*, and now look at *the*, and now look at *I*, ..." because this would take forever to write.
No, what we need is the final missing piece in our Python toolbox: `for` loops.
And that is exactly what this unit is about.

## Looking at individual words

You already know that you can look at the entire counter by putting it inside a `print` statement.

In [None]:
print(hamlet)

And if you want the output to look a little prettier, you can *pretty print* it with `pprint`.

In [None]:
from pprint import pprint

pprint(hamlet)

But what if you want to just look up the number of tokens for a specific word type?
Printing a giant list and then looking for the word in the output isn't exactly convenient.
Fortunately, we can tell the counter directly to show us the value for a specific word.

In [None]:
print(hamlet["the"])

In [None]:
print(hamlet["hamlet"])
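
One detail worth knowing before you experiment: a `Counter` does not complain when you ask for a word it has never seen, it simply reports a count of `0`.
The sketch below shows this on a made-up counter (`toy` is just an illustrative name).

```python
from collections import Counter

# a made-up counter built from six tokens
toy = Counter(["to", "be", "or", "not", "to", "be"])

print(toy["to"])       # a word that occurs twice
print(toy["hamlet"])   # a word that never occurs in toy
```

The first line of output is `2`, the second is `0`.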

**Exercise.**
Experimentation time!
Try various ways of using this new syntax to get the values of words from certain counters.
Pay particular attention to whether the part in square brackets, e.g. `["hamlet"]`, behaves like a list:

- What happens if you have multiple words between square brackets?
- Is it possible to have nothing between square brackets?
- Can you use list-related functions like `list.append`?

Include at least five mini-experiments, and add comments to explain what you are testing.
Then provide a description of how this technique is used correctly.

In [None]:
# put your code here

*put your description here*

## Calculating the average number of tokens per type

Alright, we now know how to look at the counts for individual words.
In principle, this is all we need to calculate the things we are interested in.
For example, the average number of tokens per word type is obtained by adding up the values for all types and then dividing this number by the number of types.
(Remember, the average of $2$ numbers $a$ and $b$ is $\frac{a+b}{2}$, for $3$ numbers it is $\frac{a+b+c}{3}$, for $4$ it is $\frac{a+b+c+d}{4}$, and so on.)
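
As a quick sanity check of the formula, here is the average of three made-up numbers computed with plain arithmetic (Python uses the slash `/` for division).
The variable names are illustrative.
Note the parentheses: without them, Python would divide only the last number.

```python
# average of the three numbers 6, 4, and 2
average = (6 + 4 + 2) / 3
print(average)

# without parentheses, only the 2 is divided by 3
not_the_average = 6 + 4 + 2 / 3
print(not_the_average)
```

The first result is `4.0`, the second is a number larger than 10 and clearly not the average.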

### The hard way

We could do something like the following:

1.  Instantiate two variables `words` and `total`, both are set to `0`.
1.  Look at the first word in the counter.
    1. Add 1 to the value of `words`.
    1. Add the count for the word to `total`.
1.  Look at the second word in the counter. 
    1. Add 1 to the value of `words`.
    1. Add the count for the word to `total`.
1.  Continue this until all words have been looked at.
1.  Divide `total` by `words`.
    This is the average number of tokens per word type.
    
Let's do this for a toy example to see that it works the way we want.
We will instantiate a counter where `a` has 6 tokens, `b` 4, and `c` 2.
So the average number of tokens per type is $\frac{6+4+2}{3} = 4$.

In [None]:
test = Counter(["a", "a", "a", "a", "a", "a", "b", "b", "b", "b", "c", "c"])
print(test)
print(test["a"])
print(test["b"])
print(test["c"])

We now apply our algorithm above to this counter to calculate the average number of tokens per type.

In [None]:
# define our test counter
test = Counter(["a", "a", "a", "a", "a", "a", "b", "b", "b", "b", "c", "c"])

# Instantiate two variables `words` and `total`, both are set to `0`.
words = 0
total = 0

# Look at the first word in the counter.
current_word = "a"
# - Add 1 to the value of `words`.
words = words + 1
# - Add the count for the word to `total`.
total = total + test[current_word]

# Look at the second word in the counter.
current_word = "b"
# - Add 1 to the value of `words`.
words = words + 1
# - Add the count for the word to `total`.
total = total + test[current_word]

# Look at the third word in the counter.
current_word = "c"
# - Add 1 to the value of `words`.
words = words + 1
# - Add the count for the word to `total`.
total = total + test[current_word]

# all words have been looked at, time for the final step:
# divide `total` by `words` (Python uses the slash / for division)
average = total / words
print(average)

When you run the code above, you'll get the output `4.0`, indicating that the average number of tokens per type in the counter is `4`.
And as we confirmed earlier on, that's indeed the case because `a` has 6 tokens, `b` has 4, and `c` has 2.
So the code does what we want.
All we have to do now is to use this code with our counters `hamlet`, `faustus`, and `johncarter`.

**Exercise.**
Alright, this might take a while, so you better roll up your sleeves and get hacking.
Copy-paste the code from the cell above, then adapt it so that it looks at every word in `hamlet`.

What, you don't want to do that?
Okay, but then you have to leave a justification below why that would be a horrible way of writing the code.

*put your explanation here*

### The easy way: A for-loop

Consider once more the code we just wrote.
It is very mechanical in that we keep repeating the same steps over and over again.

In [None]:
# define our test counter
test = Counter(["a", "a", "a", "a", "a", "a", "b", "b", "b", "b", "c", "c"])

# Instantiate two variables `words` and `total`, both are set to `0`.
words = 0
total = 0

################################
# Here we start repeating code #
################################
# Iteration 1
current_word = "a"
words = words + 1
total = total + test[current_word]

# Iteration 2
current_word = "b"
words = words + 1
total = total + test[current_word]

# Iteration 3
current_word = "c"
words = words + 1
total = total + test[current_word]
############################################################
# We are done repeating the same steps over and over again #
############################################################

# all words have been looked at, time for the final step:
# divide `total` by `words` (Python uses the slash / for division)
average = total / words
print(average)

Each one of the three iterations above runs exactly the same code, except that the value of `current_word` changes.
Whenever we want to run exactly the same piece of code over and over again, changing only the value of a single variable, we can use a `for`-loop.
A `for`-loop allows us to define a list of possible values for a specific variable, and then the code in the `for`-loop gets run over and over again until every possible value for the variable has been used.
Here is what this looks like for the code above.

In [None]:
# define our test counter
test = Counter(["a", "a", "a", "a", "a", "a", "b", "b", "b", "b", "c", "c"])

# Instantiate two variables `words` and `total`, both are set to `0`.
words = 0
total = 0

################################
# Here we start repeating code #
################################
# start a for-loop with "a", "b", and "c" as possible values
for current_word in ["a", "b", "c"]:
    words = words + 1
    total = total + test[current_word]
############################################################
# We are done repeating the same steps over and over again #
############################################################

# all words have been looked at, time for the final step:
# divide `total` by `words` (Python uses the slash / for division)
average = total / words
print(average)

Run the code above, and you'll get the same answer as before: the average number of tokens per word is 4.

Here's what's going on inside Python:

1.  Python instantiates `words` and `total` as usual.
1.  Then it encounters a `for`-loop, and realizes that it has to run the code below multiple times based on the list `["a", "b", "c"]` of possible values for `current_word`:
    1. Python first sets `current_word` to `"a"`, then it runs the two lines below.
    1. Then it sets `current_word` to `"b"`, and once again runs the two lines below.
    1. After that it sets `current_word` to `"c"`, and runs the two lines of code one more time.
    1. At this point, all the possible values have been used and the `for`-loop ends.
1.  Python continues with the calculation of the average and prints the result to screen.

If you still find this confusing, run the cell below.
It is exactly the same code, but `print`-statements have been added to show what is going on inside the `for`-loop.

In [None]:
print("Defining a test counter:")
test = Counter(["a", "a", "a", "a", "a", "a", "b", "b", "b", "b", "c", "c"])
print(test)

print("Instantiating words and total")
words = 0
total = 0
print("words is", words)
print("total is", total)

print("Encountered a for-loop")
print("\t ======================")
for current_word in ["a", "b", "c"]:
    print("\t ------------------------")
    print("\t Starting a new iteration")
    print("\t Setting current_word to", current_word)
    print("\t Increasing words from", words, "to", words + 1)
    words = words + 1
    print("\t", current_word, "has count of", test[current_word])
    print("\t Increasing total from", total, "to", total + test[current_word])
    total = total + test[current_word]
    print("\t Finished an iteration")
    print("\t ------------------------")
print("\t ======================")
print("Finished for-loop")

# all words have been looked at, time for the final step:
# divide `total` by `words` (Python uses the slash / for division)
print("Computing average")
average = total / words
print(average)

Every `for`-loop follows the same template:

```python
for variable in range_of_values:
    code_using_variable
```
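
Here is a minimal instance of this template, with made-up names: `number` plays the role of `variable`, the list `[1, 2, 3]` is the `range_of_values`, and the indented line is the code that uses the variable.

```python
running_sum = 0

for number in [1, 2, 3]:
    # this indented line runs once for each value of number
    running_sum = running_sum + number

# this line is not indented, so it runs only once, after the loop
print(running_sum)
```

The output is `6`, the sum of the three values.
The indentation matters: it is how Python knows which lines belong to the loop.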

**Exercise.**
Below is a code cell that we encountered earlier on when we looked at our test counter.
Combine the last three `print` statements into a `for`-loop.

In [None]:
test = Counter(["a", "a", "a", "a", "a", "a", "b", "b", "b", "b", "c", "c"])
print(test)
print(test["a"])
print(test["b"])
print(test["c"])

Most of the time the `range_of_values` part of the `for`-loop template is a list.
But sometimes one might also want to use strings, Counters, or other objects.
The important thing is that `range_of_values` must be container-like in that it is a collection of some smaller building blocks, e.g. the elements of a list or the characters in a string.
An integer, for instance, is not a possible choice for `range_of_values` because a number is just a number. It is not a container of smaller numbers or anything like that.
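
To see what happens with a non-container, the sketch below tries to loop over the integer `5`.
Python raises a `TypeError`.
The `try`/`except` construction, which may be new to you, is only there to catch that error and print it instead of stopping the cell, and the name `error_message` is illustrative.

```python
error_message = ""

try:
    # an integer is not a container, so this loop never starts
    for digit in 5:
        print(digit)
except TypeError as error:
    # store and show Python's complaint instead of crashing
    error_message = str(error)
    print("TypeError:", error_message)
```

The printed message says that an `int` object is not iterable, which is Python's way of saying that it is not container-like.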

In [None]:
# a for-loop over a string
def vertical_print(string):
    for character in string:
        print(character)
        
vertical_print("Look, a vertical string!!!")

In [None]:
# for-loop over a counter
test = Counter(["a", "a", "a", "a", "a", "a", "b", "b", "b", "b", "c", "c"])

for word in test:
    print(word)

**Exercise.**
As you can tell from the previous example, a `for`-loop over a counter by default chooses the words as possible values for the variable, rather than the numbers.
Using this fact, adapt the for-loop in the code below so that it no longer explicitly mentions `"a"`, `"b"`, or `"c"`.

In [None]:
# define our test counter
test = Counter(["a", "a", "a", "a", "a", "a", "b", "b", "b", "b", "c", "c"])

# Instantiate two variables `words` and `total`, both are set to `0`.
words = 0
total = 0

# start a for-loop over the counter
for current_word in ["a", "b", "c"]:
    words = words + 1
    total = total + test[current_word]

# divide `total` by `words` (Python uses the slash / for division)
average = total / words
print(average)

## Simplifying our code with `len`

We now have a pretty nice piece of code to compute the average number of tokens per type.
And in a minute we will use this to compare *Hamlet*, *Dr. Faustus*, and *A Princess of Mars*.
But first let's apply one final tweak to simplify our code.

Right now, we are doing two things inside the `for`-loop:

1. We increment the value of `words` by 1 to compute the total number of types in the counter.
1. We add the number of tokens for the type to the running total of tokens to calculate the total number of tokens.

We can skip the first step using the built-in function `len`, which tells us for any container-like object what its length is.

In [None]:
example_string = "abc"
example_list = ["a", "b", "c"]
example_counter = Counter(["a", "a", "a", "a", "a", "a", "b", "b", "b", "b", "c", "c"])

for example in [example_string, example_list, example_counter]:
    print("The current example is")
    print(example)
    print("Its length is", len(example))
    print("----------")

**Exercise.**
Modify the code below so that it only prints a string if its length is exactly 10.

In [None]:
examples = ["an example", "0123456789", "Hi!", "What's up???", "Honeybunny"]
for string in examples:
    print(string)

Instead of incrementing the value of `words` during each iteration, we can just use `len(test)` to determine the total number of word types in the counter `test`.
The code then looks as follows:

In [None]:
# define our test counter
test = Counter(["a", "a", "a", "a", "a", "a", "b", "b", "b", "b", "c", "c"])

# Instantiate only `total`
total = 0

# start a for-loop over the counter
for current_word in test:
    total = total + test[current_word]

# divide `total` by number of word types (Python uses the slash / for division)
average = total / len(test)
print(average)

## Comparing the three texts

Alright, time to do the final comparison.
What do you think, which one of the three texts will have the highest average number of tokens per type?
Well, let's see how it goes.

**Exercise.**
Adapting the code from the previous section, finish the definition of the function `avg_token_count` below.
Then run the next cell to calculate the average number of tokens per type for *Hamlet*, *Dr. Faustus*, and *A Princess of Mars* (remember that you must have run the first cell of this unit for the variables to be defined).
You should see quite a striking pattern, with two texts being very close together and the other one having a noticeably larger average.
Can you think of a reason as to why we find this difference?

In [None]:
def avg_token_count(word_counter):
    # complete this function (replace `pass` with your own code)
    pass

In [None]:
for text in [hamlet, faustus, johncarter]:
    print(avg_token_count(text))

*put your explanation of the difference here (there are no wrong answers, the important thing is that you think about it for a bit)*

That's it for this unit, but we are not yet done with our quantitative analysis of these texts.
So far we have looked at only one of the four metrics described at the beginning of the unit, and there is still a massive confound in our analysis that we need to take care of to get more reliable numbers.