# Additional uses for `for`-loops

We now have a function to calculate the average number of tokens per word type in a text.
This allows us to estimate how often an author reuses words in the text.
But it would still be nice to get a few other metrics, such as

1. the frequencies of word types rather than their total counts (this makes it easier to compare different texts since 50 mentions of "buletic" in a 1000-page novel doesn't have the same weight as 50 mentions in a 1000-word essay),
1. the average word length.

Before we continue, though, we once again have to run all the relevant code to get counts for our three texts *Hamlet*, *Dr. Faustus*, and *Princess of Mars*.
As before, run the cell below to make the appropriate counters available under the variable names `hamlet`, `faustus`, and `mars`.

In [1]:
%run wordcounts.py

## Calculating frequencies

The frequency of a word indicates what percentage of a text is taken up by its tokens.
For example, if a word type has 6 tokens in a text of 1000 words, then its frequency is $\frac{6}{1000} = 0.006 = 0.6\%$.
So we get the frequency of a word *w* by dividing the count of *w* by the total number of tokens in the text.
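As a quick sanity check, the arithmetic above can be reproduced in a few lines. This is only a sketch: the variable names `count_of_word` and `total_tokens` are made up for illustration and do not come from `wordcounts.py`.

```python
# hypothetical numbers taken from the example in the text
count_of_word = 6      # tokens of one word type
total_tokens = 1000    # tokens in the whole text

# frequency = count of the word divided by total number of tokens
frequency = count_of_word / total_tokens
print(frequency)            # 0.006

# the same value formatted as a percentage with one decimal place
print(f"{frequency:.1%}")   # 0.6%
```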

We already have many of the tools that are needed to calculate frequencies:

1. a counter for each tokenized text,
1. `for`-loops,
1. the `len`-function,
1. the `sum`-function, and
1. the `[x]` notation for getting the value of a specific item `x` in a counter.

We can recombine these techniques to define a custom function for computing word frequencies.

First, we will need to determine the total number of tokens in the text.
But we already know how to do that with `sum` and `Counter.values`.
Once we know the total, we can calculate the frequency of a word type by dividing its number of tokens by `total`.
Let us put the relevant code for these steps into a function that prints the frequency of every word type.

In [2]:
def frequencies(word_counter):
    """print relative frequency for each word type in counter"""
    total = sum(Counter.values(word_counter))
    # calculate frequencies for all the words in the counter
    for current_word in word_counter:
        number_of_tokens = word_counter[current_word]
        frequency = number_of_tokens / total
        print(frequency)

**Exercise.**
Modify the `frequencies` function so that instead of printing each word's frequency, it returns a list of all the frequencies.
Then use the function `sum` to verify that the sum of all word frequencies for *Hamlet* is 1.
After all, all the words added together should make up 100% of the text, no more, no less.
But odds are that instead you'll get a number that's very close to 1, but not exactly 1.
Ask your TA what's up with that.

In [4]:
# put your modified version of frequencies here

0.9999999999999578
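If you want a head start before talking to your TA, here is a small sketch of the effect, assuming the usual explanation applies: most decimal fractions cannot be stored exactly as floating-point numbers, so tiny rounding errors add up. The name `ten_tenths` is just for illustration.

```python
import math

# 0.1 has no exact binary floating-point representation,
# so adding it ten times misses 1.0 by a tiny amount
ten_tenths = sum([0.1] * 10)
print(ten_tenths)          # 0.9999999999999999
print(ten_tenths == 1.0)   # False

# compare with a small tolerance instead of testing exact equality
print(math.isclose(ten_tenths, 1.0))   # True
```

When checking your own result for *Hamlet*, a tolerance-based comparison like `math.isclose` is usually a safer test than `==`.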

### Adding frequencies to the counter

The `frequencies` function is still somewhat unsatisfying in that it prints the frequency of each word type.
Printing to screen isn't very useful most of the time, in particular with tens of thousands of words.
It would be better if we could simply replace the absolute values in the counter by frequencies.
This is actually fairly easy.
The `[x]` notation is not only useful for retrieving the value of an element, it also allows us to **specify** the value of an element.

In [5]:
# define our test counter
test = Counter(["a", "a", "a", "a", "a", "a", "b", "b", "b", "b", "c", "c"])
print(test)

# let's change the value of "a";
# here's what it is right now
print("a's current value:", test["a"])
# and now we'll change it to 0.1
test["a"] = 0.1
print("a's new value:", test["a"])

# and now we add a new element "d" to the counter
test["d"] = 10
print("d was added with count:", test["d"])
print("The new counter is:", test)

Counter({'a': 6, 'b': 4, 'c': 2})
a's current value: 6
a's new value: 0.1
d was added with count: 10
The new counter is: Counter({'d': 10, 'b': 4, 'c': 2, 'a': 0.1})


**Exercise.**
Now we can finalize the `frequencies` function.
Instead of printing the frequency of `current_word`, the function should overwrite the value of `current_word` in `word_counter` with `frequency`.
At the end, the function returns `word_counter`.
You can test your code in the second cell.
The results should be `0.5` for `a`, `0.3333` for `b`, and `0.1666` for `c`.

In [6]:
# change and complete the code below

def frequencies(word_counter):
    # add an updated docstring here
    total = sum(Counter.values(word_counter))
    for current_word in word_counter:
        number_of_tokens = word_counter[current_word]
        frequency = number_of_tokens / total
        print(frequency)  # this part needs to change

In [7]:
# test your code here
test_counts = Counter(["a", "a", "a", "a", "a", "a", "b", "b", "b", "b", "c", "c"])
print(test_counts)

test_frequency = frequencies(test_counts)
print(test_frequency)

Counter({'a': 6, 'b': 4, 'c': 2})
0.3333333333333333
0.16666666666666666
0.5
None


### An unintended side-effect

Let us run the test one more time, with just a minor change in the order of the `print`-statements.
Now we first compute `test_frequency` and then print `test_counts` and `test_frequency`.

In [8]:
# test your code here
test_counts = Counter(["a", "a", "a", "a", "a", "a", "b", "b", "b", "b", "c", "c"])
test_frequency = frequencies(test_counts)

print(test_counts)
print(test_frequency)

0.3333333333333333
0.16666666666666666
0.5
Counter({'a': 6, 'b': 4, 'c': 2})
None


Uhm, what's going on here?
Why do `test_counts` and `test_frequency` look the same?
Where did the absolute word counts go?

The problem is with how we wrote the function `frequencies`.
This is a function that takes a word counter as an argument and then **overwrites** the count of each word type with its frequency.
So if we run `frequencies` over `test_counts`, all the values of `test_counts` are replaced by frequencies.
That's not really what we want.
Instead, we want to produce a copy of `test_counts` with frequencies while keeping the original version of `test_counts` untouched.
We can create a copy of a counter with the function `Counter.copy`.

In [None]:
# test your code here
test_counts = Counter(["a", "a", "a", "a", "a", "a", "b", "b", "b", "b", "c", "c"])
test_frequency = frequencies(Counter.copy(test_counts))

print(test_counts)
print(test_frequency)

Now we no longer run `frequencies` on `test_counts`, but on a dynamically created copy of `test_counts`.
Hence the values of `test_counts` remain unaltered, and we get different outputs for `print` at the end.
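The difference between a second name for the same counter and a real copy can be seen in isolation. The following is a minimal sketch; the names `original`, `alias`, and `duplicate` are invented for this illustration.

```python
from collections import Counter

original = Counter(["a", "a", "b"])

# plain assignment only creates a second name for the same counter
alias = original
# Counter.copy creates a new, independent counter
duplicate = Counter.copy(original)

alias["a"] = 100       # also changes original
duplicate["b"] = 200   # leaves original untouched

print(original)    # Counter({'a': 100, 'b': 1})
print(duplicate)   # Counter({'b': 200, 'a': 2})
```

Passing a counter to a function behaves like the plain assignment here, which is why changes made inside the function show up in the counter outside of it.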

**Exercise.**
Copy-paste your definition of the `frequencies` function into the cell below, then change it so that it always creates a copy `temp_copy` of `word_counter` at the beginning and then carries out all operations over `temp_copy` instead of `word_counter`.
Then run the code in the next cell to verify that your new definition of `frequencies` works correctly.

In [None]:
# copy-paste your code for frequencies here, then modify it as described

In [None]:
# test your code here
test_counts = Counter(["a", "a", "a", "a", "a", "a", "b", "b", "b", "b", "c", "c"])
test_frequency = frequencies(test_counts)

print(test_counts)
print(test_frequency)

## Calculating average word length

Let us turn to average word length next.
Rather than explain at length how it works, I have already created a working solution for you.
Your job is just to figure out what it does.
I will tell you this much, though: `*` is multiplication.

In [9]:
def avg_word_length(word_counter):
    """average word length for counter"""
    total_number = sum(Counter.values(word_counter))
    total_length = 0
    for word in word_counter:
        total_length = total_length + (word_counter[word] * len(word))
    return total_length / total_number

**Exercise.**
Carefully read the code above.
Then describe what this function does.
Pay particular attention to the `for`-loop and the use of `len`, and explain why `len(word)` must be multiplied with `word_counter[word]`.

*put your explanation here*

Let us see what happens when we run this function on our three texts.

In [10]:
for text in [hamlet, faustus, mars]:
    print(avg_word_length(text))

4.058128662261564
4.074778200253485
4.345717874600845


Hmm, that's surprisingly close together.
One would expect a novel like *Princess of Mars* to have longer words than a play like *Hamlet* or *Dr. Faustus*, simply because the latter have to fit a specific meter.

But wait a second!
*Zipf's law* tells us that the bulk of a text is made up of a few very high-frequency words, most of which are so-called *stop words*.
So if those stop words are very short, then that will drastically lower average word length.
And that is indeed the case, as you can probably tell from eyeballing [this list of stopwords](https://raw.githubusercontent.com/Alir3z4/stop-words/master/english.txt).
We can even use Python to see this point more clearly.

In [11]:
# download list of stop words
# (re is imported here as well in case it is not already available)
import re
import urllib.request
url = "https://raw.githubusercontent.com/Alir3z4/stop-words/master/english.txt"
urllib.request.urlretrieve(url, "stopwords.txt")

# read it in as a string
with open("stopwords.txt", "r", encoding="utf-8") as stopwords_file:
    stopwords = stopwords_file.read()

# since each word is on its own line,
# we can convert the string to a list of words by
# matching everything except newlines (\n)
stopwords = re.findall(r"[^\n]+", stopwords)

# and here is what we have
print(stopwords)

# and this is the average word length
print("\nAverage word length:", avg_word_length(Counter(stopwords)))

['a', 'about', 'above', 'after', 'again', 'against', 'all', 'am', 'an', 'and', 'any', 'are', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', "can't", 'cannot', 'could', "couldn't", 'did', "didn't", 'do', 'does', "doesn't", 'doing', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', "hadn't", 'has', "hasn't", 'have', "haven't", 'having', 'he', "he'd", "he'll", "he's", 'her', 'here', "here's", 'hers', 'herself', 'him', 'himself', 'his', 'how', "how's", 'i', "i'd", "i'll", "i'm", "i've", 'if', 'in', 'into', 'is', "isn't", 'it', "it's", 'its', 'itself', "let's", 'me', 'more', 'most', "mustn't", 'my', 'myself', 'no', 'nor', 'not', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'ought', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 'same', "shan't", 'she', "she'd", "she'll", "she's", 'should', "shouldn't", 'so', 'some', 'such', 'than', 'that', "that's", 'the', 'their', 'theirs', 'them', 'themselves', 't

As you can see, the average word length for all three texts is very close to the average word length of English stop words.
So there is a good chance that we will get a much more pronounced difference in average word length between the three texts if we ignore stop words.
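To get a feeling for how much of a text a handful of frequent words can cover, here is a small sketch on a made-up sentence. The toy data and the names `toy_tokens`, `toy_counts`, `top_two`, and `covered` are assumptions for illustration only; real texts are far larger, so the exact share will differ.

```python
from collections import Counter

# a made-up mini text, already lowercased and split into tokens
toy_tokens = ("the cat and the dog and the bird saw the "
              "extraordinarily magnificent elephant").split()
toy_counts = Counter(toy_tokens)
total = sum(Counter.values(toy_counts))

# most_common(2) returns the two word types with the highest counts
top_two = toy_counts.most_common(2)
print(top_two)   # [('the', 4), ('and', 2)]

# add up the tokens that belong to those two word types
covered = 0
for word, count in top_two:
    covered = covered + count

# share of all tokens covered by just two short words
print(covered / total)
```

In this toy example, two three-letter words account for 6 of the 13 tokens, which is why they weigh so heavily on the average word length.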

## Removing stopwords

Our goal at this point is clear: We have to get rid of those pesky stopwords.
As you might recall from the previous unit, this is actually fairly easy with a `for`-loop.
So what are you waiting for, get hacking!

**Exercise.**
Use what you've learned so far to filter out all the elements of `stopwords` before calculating average word length.
Yes, this is deliberately phrased very vaguely.
There are multiple routes you can take, and it is up to you to decide which one is the most appropriate.

In [None]:
# put your solution here

**Exercise.**
You might have noticed a minor problem.
The counters are built from a tokenized list, and our tokenization function treats words like *who's* as two tokens *who* and *s*.
Our list of stop words, however, contains *who* and *who's* but not *s*, because it assumes that *who's* is never split into *who* and *s*.
As a result, your solution will filter out *who* when calculating average word length, but not *s*.

Copy-paste your previous solution into the cell below, then fix it so that *s* is also filtered out correctly.

In [None]:
# copy-paste the code here, then modify it

**Exercise.**
Has the removal of stop words led to a greater difference in word length between the texts?
Do we see a more pronounced difference between plays and novels now?

*put your answer here*

## Wrapping up

This unit hasn't introduced a lot of new concepts. Instead, you got to see how the tools we have encountered so far can be used to accomplish a variety of things.
In particular `for`-loops have quickly become indispensable.
They really are one of Python's most important and versatile tools.

**Exercises.**
We can even use `for`-loops to implement a very simple spellchecker.
[This textfile](https://raw.githubusercontent.com/dwyl/english-words/master/words.txt) lists almost all words of English (it has over 500,000 entries).
It is similar to our list of stopwords in that it contains one word per line.

The code below downloads the file for you and reads it in as a string.
Here is your task:

1.  Tokenize the string to get a list of correctly spelled English words.
1.  Write a function `spellcheck` such that
    1. The function takes as its arguments a string and a list of words, the dictionary.
    1. The function tokenizes the string (words like *who's* should be tokenized as *who's*, so you'll have to modify our standard tokenization function).
    1. It then checks every token in the string and returns a list of all misspelled words (types, not tokens!).

In [None]:
# download the file
url = "https://raw.githubusercontent.com/dwyl/english-words/master/words.txt"
urllib.request.urlretrieve(url, "words.txt")

with open("words.txt", "r", encoding="utf-8") as dict_file:
    dict_string = dict_file.read()

# Step 1: tokenize dict_string; don't forget that who's should be treated as one word, not two
# dictionary = for you to do

# Step 2: define custom function
# def spellcheck(string, wordlist):
    # some mystery happens here

# let's test the code
examples = ["My neighbour is to tried for work.",
            "I am teh world's gr8test typist",
            "U don't look 2 bed",
            "Their are titilating oportunitys here."]
for example in examples:
    print(example, "contains the following errors:", spellcheck(example, dictionary))

*Hints:*
If you're stuck with the exercise, highlight the text below to read some tips.

<span style="color:#000000;background-color:#000000;">
For the tokenization function, remember that \w+ matches sequences of word characters.
But you can combine \w with the [...] notation to define alternatives.
So [xyz\w]+ matches sequences that consist of word characters and/or instances of x, y, and z.
Use this to broaden the match from word characters to word characters and apostrophes.
</span>
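As a neutral illustration of the `[...]` notation with `\w`, here is a sketch that uses hyphens rather than apostrophes, so it does not give away the exercise. The example sentence is made up.

```python
import re

text = "a well-known, state-of-the-art trick"

# \w+ alone stops at every hyphen
print(re.findall(r"\w+", text))
# ['a', 'well', 'known', 'state', 'of', 'the', 'art', 'trick']

# [-\w]+ matches sequences of word characters and/or hyphens
print(re.findall(r"[-\w]+", text))
# ['a', 'well-known', 'state-of-the-art', 'trick']
```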

## Bullet point summary

- Nothing new here, just using familiar tools in creative ways.
  Like language, programming is about combining familiar things to create something new.