# Additional uses for `for`-loops

This unit directly continues the previous one.
We now have a function to calculate the average number of tokens per word type in a text.
This allows us to estimate how often an author reuses words in the text.
But it would still be nice to get a few other metrics, such as

1. the frequencies of word types rather than their total counts (this makes it easier to compare different texts, since 50 mentions of "buletic" in a 1000-page novel don't carry the same weight as 50 mentions in a 1000-word essay),
1. the average word length.

Before we continue, though, we once again have to run all the relevant code to get counts for our three texts *Hamlet*, *Dr. Faustus*, and *Princess of Mars*.
This is exactly the same code as what we had at the beginning of the previous unit.
Make sure you run the cell every time you use this notebook; otherwise many functions will be undefined.

In [None]:
import urllib.request
import re
from collections import Counter

# we first define custom functions for all individual steps

def get_file(text):
    if text == "hamlet":
        urllib.request.urlretrieve("http://www.gutenberg.org/cache/epub/1524/pg1524.html", "hamlet.txt")
    if text == "faustus":
        urllib.request.urlretrieve("http://www.gutenberg.org/cache/epub/811/pg811.txt", "faustus.txt")
    if text == "johncarter":
        urllib.request.urlretrieve("http://www.gutenberg.org/cache/epub/62/pg62.txt", "johncarter.txt")
        
def read_file(filename):
    with open(filename, "r", encoding="utf-8") as text:
        return text.read()
    
def delete_before_line(string, line):
    return str.split(string, "\n", line)[-1]

def delete_after_line(string, line):
    return str.join("\n", str.split(string, "\n")[:line+1])

def hamlet_cleaner(text):
    # 0. delete unwanted lines
    text = delete_after_line(delete_before_line(text, 366), 10928)
    # 1. remove all headers, i.e. lines starting with <h1, <h2, <h3, and so on
    text = re.sub(r"<h[0-9].*", r"", text)
    # 2. remove speaker information, i.e. lines of the form <p id="id012345789"...<br/>
    text = re.sub(r'<p id="id[0-9]*">[^<]*<br/>', r"", text)
    # 3. remove html tags, i.e. anything of the form <...>
    text = re.sub(r"<[^>]*>", r"", text)
    # 4. remove anything after [ or before ] on a line (this takes care of stage descriptions)
    text = re.sub(r"\[[^\]\n]*", r"", text)
    text = re.sub(r"[^\[\n]*\]", r"", text)
    return text

def faustus_cleaner(text):
    # 0. delete unwanted lines
    text = delete_after_line(delete_before_line(text, 139), 2854)
    # 1. remove stage information
    #    (anything after 10 spaces)
    text = re.sub(r"(\s){10}[^\n]*", r"", text)
    # 2. remove speaker information
    #    (any word in upper caps followed by space or dot)
    text = re.sub(r"[A-Z]{2,}[\s\.]", r"", text)
    # 3. remove anything between square brackets (this takes care of footnote markers)
    text = re.sub(r"\[[^\]]*\]", r"", text)
    return text

def johncarter_cleaner(text):
    # 0. delete unwanted lines
    text = delete_after_line(delete_before_line(text, 234), 6940)
    # 1. delete CHAPTER I
    # (must be done like this because Roman 1 looks like English I)
    text = re.sub("CHAPTER I", "", text)
    # 2. remove any word in upper caps that is longer than 1 character
    text = re.sub(r"[A-Z]{2,}", r"", text)
    # 3. remove anything after [ or before ] on a line
    text = re.sub(r"\[[^\]\n]*", r"", text)
    text = re.sub(r"[^\[\n]*\]", r"", text)
    return text

def tokenize(string):
    return re.findall(r"\w+", string)

def count(token_list):
    return Counter(token_list)


# and now we have two functions that use all the previous functions
# to do all the necessary work for us
def get_and_clean(text):
    get_file(text)
    string = read_file(text + ".txt")
    # file-specific cleaning steps;
    # these run before lowercasing because some of them match words in upper caps
    if text == "hamlet":
        string = hamlet_cleaner(string)
    if text == "faustus":
        string = faustus_cleaner(string)
    if text == "johncarter":
        string = johncarter_cleaner(string)
    # lowercase only after cleaning
    return str.lower(string)

def tokenize_and_count(string):
    return count(tokenize(string))

# and finally we get to run all the code
hamlet = tokenize_and_count(get_and_clean("hamlet"))
faustus = tokenize_and_count(get_and_clean("faustus"))
johncarter = tokenize_and_count(get_and_clean("johncarter"))

## Calculating frequencies

The frequency of a word indicates what percentage of a text is taken up by its tokens.
For example, if a word type has 6 tokens in a text of 1000 words, then its frequency is $\frac{6}{1000} = 0.006 = 0.6\%$.
So we get the frequency of a word *w* by dividing the count of *w* by the total number of tokens in the text.
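
To make the comparison from the start of this unit concrete, here is a small sketch of the same calculation for an essay and a novel.
All numbers are made up for illustration: we assume an essay of 1,000 words and a novel of 300,000 words (roughly 1000 pages at 300 words per page), each containing 50 tokens of the same word.

```python
# hypothetical numbers, chosen only for illustration
count_in_essay = 50
essay_length = 1000      # total number of tokens in the essay
count_in_novel = 50
novel_length = 300000    # assumed total number of tokens in the novel

# frequency = count of the word / total number of tokens
essay_frequency = count_in_essay / essay_length
novel_frequency = count_in_novel / novel_length

print("Frequency in essay:", essay_frequency)
print("Frequency in novel:", novel_frequency)
```

The counts are identical, but under these assumptions the word takes up 5% of the essay and less than 0.02% of the novel.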

We already have many of the tools that are needed to calculate frequencies.
The code provides us with counters for the three texts, and in the previous unit we saw how to calculate the average number of tokens per word type, which taught us a few core techniques: `for`-loops, the `len`-function, and how to use `[x]` to get the value of a specific item `x` in a counter.

In [None]:
def avg_token_count(word_counter):
    # keep track of total number of tokens
    total = 0
    # start a for-loop over the counter
    for current_word in word_counter:
        # add count of current word to total
        total = total + word_counter[current_word]
    # divide `total` by number of word types (Python uses the slash / for division)
    average = total / len(word_counter)
    return average
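
As a quick sanity check, here is a sketch that runs this function on a small toy counter where the answer is easy to verify by hand.
The counter `toy` is made up for illustration; the function is repeated (without comments) only so that the example can be run on its own.

```python
from collections import Counter

def avg_token_count(word_counter):
    total = 0
    for current_word in word_counter:
        total = total + word_counter[current_word]
    return total / len(word_counter)

# toy counter: "a" has 6 tokens, "b" has 4, "c" has 2
toy = Counter(["a", "a", "a", "a", "a", "a", "b", "b", "b", "b", "c", "c"])

# 12 tokens spread over 3 word types
toy_average = avg_token_count(toy)
print(toy_average)
```

With 12 tokens and 3 word types, the result should be 4 tokens per word type.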

We can recombine these techniques to define a custom function for computing word frequencies.

First, we will need to determine the total number of tokens in the text.
But the code above already tells us how to do that: this is exactly what the variable `total` keeps track of.
Once we know the total, we can calculate the frequency of a word type by dividing its number of tokens by `total`.
Let us put the relevant code for these steps into a function that prints the frequency of every word type.

In [None]:
def frequencies(word_counter):
    # keep track of total number of tokens
    total = 0
    # start a for-loop over the counter
    for current_word in word_counter:
        # add count of current word to total
        total = total + word_counter[current_word]
    # we have computed the total,
    # now we calculate frequencies for all the words in the counter
    for current_word in word_counter:
        number_of_tokens = word_counter[current_word]
        frequency = number_of_tokens / total
        print(frequency)

**Exercise.**
The functions `avg_token_count` and `frequencies` use exactly the same code to calculate the total number of tokens.
Convert this code into a custom function `count_total`, then change `avg_token_count` and `frequencies` so that they use `count_total` for calculating the value of `total`.

In [None]:
# define count_total here, and include modified versions of avg_token_count and frequencies

### Why we need two loops

Note that the `frequencies` function above has two `for`-loops that iterate over the same counter.
This is because we first have to look at all elements of the counter to compute the value of `total`.
Once we know `total`, we can again look at each element to calculate its frequency.
We really need two distinct `for`-loops for this; it cannot be done in a single loop over the counter.
If you don't understand why, consider the toy example below where we use only one `for`-loop and get the wrong frequencies.

In [None]:
# define our test counter
test = Counter(["a", "a", "a", "a", "a", "a", "b", "b", "b", "b", "c", "c"])

# keep track of total number of tokens
total = 0
# start a for-loop over the counter
for current_word in test:
    # add count of current word to total
    total = total + test[current_word]
    number_of_tokens = test[current_word]
    frequency = number_of_tokens / total
    # and print frequency
    print("Frequency of", current_word, "is", number_of_tokens, "/", total, "=", frequency)
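
One way to spot that something is off: the frequencies of all word types in a text should add up to 1.
The sketch below adds up the frequencies produced by the single-loop approach and compares them to the sum we get when dividing by the true total.
The variable names are chosen for this illustration only, and the true total is simply taken to be the length of the toy list.

```python
from collections import Counter

tokens = ["a", "a", "a", "a", "a", "a", "b", "b", "b", "b", "c", "c"]
test = Counter(tokens)

# single loop: each frequency is computed against a running total
running_total = 0
one_loop_sum = 0
for current_word in test:
    running_total = running_total + test[current_word]
    one_loop_sum = one_loop_sum + test[current_word] / running_total

# for comparison: divide every count by the true total
true_total = len(tokens)
correct_sum = 0
for current_word in test:
    correct_sum = correct_sum + test[current_word] / true_total

print("Sum of single-loop frequencies:", one_loop_sum)
print("Sum of correct frequencies:", correct_sum)
```

The single-loop sum comes out well above 1 because the early words are divided by a total that is still too small.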

**Exercise.**
There is actually a way to make things work with just one loop.
The first loop is only needed to determine the total number of word tokens in the text.
But this is the same as the length of `tokenize(text)`.
So we could instead design our frequency function as follows:

1. `frequencies` takes some text as its only argument.
1. The function then tokenizes the text.
1. It then determines the length of the tokenized text and stores it in the variable `total`.
1. The rest of the function proceeds as before.

Copy-paste the definition of `frequencies` into the cell below, then modify it in the fashion just described.

In [None]:
# copy-paste frequencies here, then modify it

### Adding frequencies to the counter

Alright, so now we know that two `for`-loops are indeed needed, but the function is still somewhat unsatisfying in that it prints the frequency of each word type.
Printing to screen isn't very useful most of the time.
It would be better if we could simply replace the absolute values in the counter by frequencies.
This is actually fairly easy.
The `[x]` notation is not only useful for retrieving the value of an element, it also allows us to **specify** the value of an element.

In [None]:
# define our test counter
test = Counter(["a", "a", "a", "a", "a", "a", "b", "b", "b", "b", "c", "c"])
print(test)

# let's change the value of "a";
# here's what it is right now
print("a's current value:", test["a"])
# and now we'll change it to 0.1
test["a"] = 0.1
print("a's new value:", test["a"])

# and now we add a new element "d" to the counter
test["d"] = 10
print("d was added with count:", test["d"])
print("The new counter is:", test)
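
A related detail that is easy to miss: retrieving the value of a word that is not in the counter does not cause an error.
The sketch below, which uses a made-up toy counter, shows that such a lookup gives 0 and leaves the counter unchanged, whereas an assignment really does add the word.

```python
from collections import Counter

test = Counter(["a", "a", "b"])

# looking up a word that was never counted gives 0, not an error
missing_value = test["z"]
print("Value of z:", missing_value)

# the lookup alone does not add "z" to the counter
present_before = "z" in test
print("Is z in the counter?", present_before)

# assignment, in contrast, does add it
test["z"] = 5
present_after = "z" in test
print("Is z in the counter now?", present_after)
```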

**Exercise.**
Now we can finalize the `frequencies` function.
Instead of printing the frequency of `current_word`, the function should overwrite the value of `current_word` in `word_counter` with `frequency`.
At the end, the function returns `word_counter`.
You can test your code in the second cell.

In [None]:
# change and complete the code below

def frequencies(word_counter):
    total = 0
    for current_word in word_counter:
        total = total + word_counter[current_word]
    for current_word in word_counter:
        number_of_tokens = word_counter[current_word]
        frequency = number_of_tokens / total
        print(frequency)

In [None]:
# test your code here
test_counts = Counter(["a", "a", "a", "a", "a", "a", "b", "b", "b", "b", "c", "c"])
print(test_counts)

test_frequency = frequencies(test_counts)
print(test_frequency)

### An Unintended Side-Effect

Let us run the test one more time, with just a minor change in the order of the `print`-statements.
Now we first compute `test_frequency` and then print `test_counts` and `test_frequency`.

In [None]:
# test your code here
test_counts = Counter(["a", "a", "a", "a", "a", "a", "b", "b", "b", "b", "c", "c"])
test_frequency = frequencies(test_counts)

print(test_counts)
print(test_frequency)

Uhm, what's going on here?
Why do `test_counts` and `test_frequency` look the same?
Where did the absolute word counts go?

The problem is with how we wrote the function `frequencies`.
This is a function that takes a word counter as an argument and then **overwrites** the count of each word type with its frequency.
So if we run `frequencies` over `test_counts`, all the values of `test_counts` are replaced by frequencies.
That's not really what we want.
Instead, we want to produce a copy of `test_counts` with frequencies while keeping the original version of `test_counts` untouched.
We can create a copy of a counter with the function `Counter.copy`.

In [None]:
# test your code here
test_counts = Counter(["a", "a", "a", "a", "a", "a", "b", "b", "b", "b", "c", "c"])
test_frequency = frequencies(Counter.copy(test_counts))

print(test_counts)
print(test_frequency)

Now we no longer run `frequencies` on `test_counts`, but on a dynamically created copy of `test_counts`.
Hence the values of `test_counts` remain unaltered, and we get different outputs for `print` at the end.
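
The underlying behavior can be seen without any function at all.
In the sketch below, the names `original`, `alias`, and `duplicate` are made up for illustration: `alias` is just a second name for the very same counter, whereas `duplicate` is an independent copy.
A function argument behaves like `alias`, which is why changes made inside the function show up in the counter that was passed in.

```python
from collections import Counter

original = Counter(["a", "a", "b"])

# a second name for the SAME counter (no copy is made)
alias = original
# an independent copy
duplicate = Counter.copy(original)

# change the value of "a" via the alias
alias["a"] = 100

print("original:", original)     # has changed too
print("duplicate:", duplicate)   # still has the old count for "a"
```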

**Exercise.**
Copy-paste your definition of the `frequencies` function into the cell below, then change it so that it always creates a copy `temp_copy` of `word_counter` at the beginning and then carries out all operations over `temp_copy` instead of `word_counter`.
Then run the code in the next cell to verify that your new definition of `frequencies` works correctly.

In [None]:
# copy-paste your code for frequencies here, then modify it as described

In [None]:
# test your code here
test_counts = Counter(["a", "a", "a", "a", "a", "a", "b", "b", "b", "b", "c", "c"])
test_frequency = frequencies(test_counts)

print(test_counts)
print(test_frequency)

## Calculating average word length

Let us turn to average word length next.
Rather than explain at length how it works, I have already created a working solution for you.
Your job is just to figure out what it does.
I will tell you this much, though: `*` is multiplication.

In [None]:
def avg_word_length(word_counter):
    total_number = 0
    total_length = 0
    for word in word_counter:
        total_number = total_number + word_counter[word]
        total_length = total_length + (word_counter[word] * len(word))
    return total_length / total_number

**Exercise.**
Carefully read the code above.
Then describe what this function does.
Pay particular attention to the `for`-loop and the use of `len`, and explain why `len(word)` must be multiplied with `word_counter[word]`.

*put your explanation here*

Let us see what happens when we run this function on our three texts.

In [None]:
for text in [hamlet, faustus, johncarter]:
    print(avg_word_length(text))

Hmm, that's surprisingly close together.
One would expect a novel like *Princess of Mars* to have longer words than a play like *Hamlet* or *Dr. Faustus*, simply because the latter have to fit a specific meter.

But wait a second!
*Zipf's law* tells us that most of a text is made up of a few very high-frequency words, called *stop words*.
So if those stop words are very short, then that will drastically lower average word length.
And that is indeed the case, as you can probably tell from eyeballing [this list of stopwords](https://raw.githubusercontent.com/Alir3z4/stop-words/master/english.txt).
We can even use Python to see this point more clearly.

In [None]:
# download list of stop words
urllib.request.urlretrieve("https://raw.githubusercontent.com/Alir3z4/stop-words/master/english.txt", "stopwords.txt")

# read it in as a string
stopwords = read_file("stopwords.txt")

# since each word is on its own line,
# we can convert the string to a list of words by
# matching everything except newlines (\n)
stopwords = re.findall(r"[^\n]+", stopwords)

# and here is what we have
print(stopwords)

# and this is the average word length
print("\nAverage word length:", avg_word_length(Counter(stopwords)))

As you can see, the average word length for all three texts is very close to the average word length of English stop words.
So there is a good chance that we will get a much more pronounced difference in average word length between the three texts if we ignore stop words.
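
To get a feel for how much of a text a handful of frequent words can cover, here is a minimal sketch that measures the share of tokens taken up by the most frequent word types.
It uses a made-up toy sentence rather than one of our three texts, and it relies on `Counter.most_common`, which returns the most frequent items as (word, count) pairs.
A toy sentence is far too short to show Zipf's law properly, so treat this only as a template that you could run on `hamlet` and the other counters.

```python
from collections import Counter
import re

# a made-up toy sentence, not taken from any of the three texts
toy_text = "the cat and the dog saw the bird and the bird saw the cat"
toy_counts = Counter(re.findall(r"\w+", toy_text))

# total number of tokens
total = 0
for word in toy_counts:
    total = total + toy_counts[word]

# add up the counts of the 2 most frequent word types;
# each item is a pair (word, count), and pair[1] picks out the count
covered = 0
for pair in Counter.most_common(toy_counts, 2):
    covered = covered + pair[1]

share = covered / total
print("Share of tokens covered by the 2 most frequent words:", share)
```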

## Removing stopwords

Our goal at this point is clear: We have to get rid of those pesky stopwords.
This is actually fairly easy with a `for`-loop.

First, we need to get a tokenized list for each text.
With the functions from the beginning of the unit, that is easy-peasy.

In [None]:
hamlet_full = tokenize(get_and_clean("hamlet"))
faustus_full = tokenize(get_and_clean("faustus"))
johncarter_full = tokenize(get_and_clean("johncarter"))

# check the output to see what these lists look like
for text in [hamlet_full, faustus_full, johncarter_full]:
    print(text[:50])

Next we want to construct a version of these lists where all stopwords are omitted.
We can do this as follows:

1. We create an empty list `words`.
1. We iterate over the tokenized text, and every token that is not a stop word gets added to `words`.
1. When the `for`-loop finishes, `words` will contain all tokens of the text except the stop words.

Here's what this looks like for `hamlet_full`.

In [None]:
# define hamlet_full
hamlet_full = tokenize(get_and_clean("hamlet"))
# define list of stop words
urllib.request.urlretrieve("https://raw.githubusercontent.com/Alir3z4/stop-words/master/english.txt", "stopwords.txt")
stopwords = re.findall(r"[^\n]+", read_file("stopwords.txt"))

# empty list of words
words = []

# start for-loop
for token in hamlet_full:
    if token not in stopwords:
        # add token to words
        list.append(words, token)
        
        
# let's compare the two by looking at the first 50 tokens
print(hamlet_full[:50])
print(words[:50])

**Exercise.**
You might have noticed a minor problem: `tokenize` treats strings like *who's* as two tokens *who* and *s*.
Our list of stop words, however, contains *who* and *who's* but not *s*, because it assumes that *who's* is never split into *who* and *s*.
As a result, the code above removes *who* from `hamlet_full`, but not *s*.
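
The problem can be reproduced on a tiny example.
In the sketch below, `toy_stopwords` is a made-up stand-in for the real list of stop words, which is much longer; the sketch only demonstrates the problem and does not fix it.

```python
import re

def tokenize(string):
    return re.findall(r"\w+", string)

# a small stand-in for the stop word list
toy_stopwords = ["who", "who's", "there"]

# the apostrophe is not matched by \w, so who's is split in two
tokens = tokenize("who's there")
print(tokens)

# collect the tokens that are not in the stop word list
leftover = []
for token in tokens:
    if token not in toy_stopwords:
        list.append(leftover, token)
print(leftover)
```

The stray *s* survives the filter even though every word in the original string was meant to be a stop word.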

Copy-paste the code into the cell below, then fix it so that *s* is also filtered out correctly.

In [None]:
# copy-paste the code here, then modify it

**Exercise.**
Based on the code above, design a custom function `filter_words` such that:

1. `filter_words` takes two lists as arguments, `token_list` and `stopwords`, and
1. `filter_words` returns a version of `token_list` that only contains those elements that aren't also listed in `stopwords`.

In [None]:
# define the tokenized texts
hamlet_full = tokenize(get_and_clean("hamlet"))
faustus_full = tokenize(get_and_clean("faustus"))
johncarter_full = tokenize(get_and_clean("johncarter"))

# define list of stop words
urllib.request.urlretrieve("https://raw.githubusercontent.com/Alir3z4/stop-words/master/english.txt", "stopwords.txt")
stopwords = re.findall(r"[^\n]+", read_file("stopwords.txt"))
# fixme: add s to stopwords

def filter_words(token_list, stopwords):
    words = []
    # some mystery happens here
    return words

hamlet_filtered = filter_words(hamlet_full, stopwords)
faustus_filtered = filter_words(faustus_full, stopwords)
johncarter_filtered = filter_words(johncarter_full, stopwords)

for text in [hamlet_filtered, faustus_filtered, johncarter_filtered]:
    print(avg_word_length(Counter(text)))

**Exercise.**
Has the removal of stop words led to a greater difference in word length between the texts?
Do we see a more pronounced difference between plays and novels now?

*put your answer here*

## Wrapping up

This unit hasn't introduced a lot of new concepts. Instead, you got to see how `for`-loops can be used to accomplish a variety of things.
They really are one of Python's most important and versatile tools.

**Exercises.**
We can even use `for`-loops to implement a very simple spellchecker.
[This textfile](https://raw.githubusercontent.com/dwyl/english-words/master/words.txt) lists almost all words of English (it has over 500,000 entries).
It is similar to our list of stopwords in that it contains one word per line.

The code below downloads the file for you and reads it in as a string.
Here is your task:

1.  Tokenize the string to get a list of correctly spelled English words.
1.  Write a function `spellcheck` such that
    1. The function takes as its arguments a string and a list of words, the dictionary.
    1. The function tokenizes the string (words like *who's* should be tokenized as *who's*, so you'll have to modify our standard tokenization function).
    1. It then checks every token in the string and returns a list of all misspelled tokens.

In [None]:
# download the file
url = "https://raw.githubusercontent.com/dwyl/english-words/master/words.txt"
urllib.request.urlretrieve(url, "words.txt")
dict_string = read_file("words.txt")

# Step 1: tokenize dict_string
# dictionary = ...

# Step 2: define custom function
# def spellcheck(string, wordlist):
    # some mystery happens here

# let's test the code
examples = ["My neighbour is to tried for work.",
            "I am teh world's gr8test typist",
            "U don't look 2 bed",
            "Their are titilating oportunitys here."]
for example in examples:
    print(example, "contains the following errors:", spellcheck(example, dictionary))