<div style="background-color:lightgrey;
            padding:10px;
            color:black;
            border:black dashed 2px; 
            border-radius:5px;
            margin: 20px 0;">
            
            
# Frequency distributions in Python



**Staff:** Walter Daelemans <br/>
**Support Material:** [exercises](https://github.com/dtaantwerp/dtaantwerp.github.io/blob/DTA_Bootcamp_2021_students/exercises/Questions_2023/06_EX_freq_distributions.ipynb) <br/>
**Support Sessions:**  Tuesday, October 3, 10.30AM

</div>

## An evergreen

Let's start this class with an exercise that recurs in every Python 101 course: constructing a character-level frequency distribution of a string, using a `dict`. We will work with the first lines of Virgil's *Aeneid*:

In [None]:
aeneid = """Arma virumque cano, Troiae qui primus ab oris
Italiam fato profugus Lavinaque venit
litora—multum ille et terris iactatus et alto
vi superum, saevae memorem Iunonis ob iram,
5 multa quoque et bello passus, dum conderet urbem
inferretque deos Latio; genus unde Latinum
Albanique patres atque altae moenia Romae.
Musa, mihi causas memora, quo numine laeso
quidve dolens regina deum tot volvere casus
10 insignem pietate virum, tot adire labores
impulerit. tantaene animis caelestibus irae?"""

aeneid

Let us start with the simplest approach and explore some variations:

In [None]:
d = {}

for char in aeneid:
    if char in d:
        d[char] += 1
    else:
        d[char] = 1

print(d)

Implement the following simple extensions, step by step:
- make sure all characters are lowercased
- ignore all punctuation symbols
- ignore all whitespace characters
- ignore digits (cf. line numbering)
- print the three characters with the highest frequency

In [None]:
# code for variations here



This exercise is important because, if you understand the components well, you can introduce minor adaptations to the code that will help you solve a wide range of related problems (like filtering). We will use this standard block of code today to introduce two additional concepts in Python: (1) Exceptions and (2) Objects.
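As an illustration of such a minor adaptation, here is one possible sketch of the filtering variations. It assumes that keeping only alphabetic characters (via `str.isalpha()`) is an acceptable way to drop punctuation, whitespace and digits in one test; the short `text` sample and the name `top_three` are made up for this example.

```python
# Short sample (an assumption for this sketch), including a line number.
text = "Arma virumque cano, Troiae qui primus ab oris\n5 multa quoque et bello passus"

d = {}
for char in text.lower():      # lowercase everything first
    if not char.isalpha():     # skips punctuation, whitespace and digits
        continue
    if char in d:
        d[char] += 1
    else:
        d[char] = 1

# Sort the (character, count) pairs by count, highest first, and keep three.
top_three = sorted(d.items(), key=lambda pair: pair[1], reverse=True)[:3]
print(top_three)
```

Other filters are possible too, for instance testing membership in `string.punctuation`, but note that this constant only covers ASCII punctuation.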

## Exceptions

The `if`/`else` statement in our exercise is necessary because we cannot increment the frequency of a particular character if we didn't **initialize** that count in the first place:

In [None]:
d = {}

for char in aeneid:
    d[char] += 1


Already in the very first iteration, something goes wrong, because `d` is empty and yet we try to **index** it for the non-existing key `'A'`. We run into a **KeyError** and the execution of our program gets **interrupted**. That's a technical term that you should remember.

Another common way to describe this is that an **"exception gets raised"**. This means that the normal flow of the program gets disturbed because an "exceptional" situation is encountered. The problem isn't so much that you run into an error (trust me, your computer will not explode in such cases), but rather that Python hasn't been explicitly instructed what to do in such situations. This is exactly why the program gets halted: **Python doesn't know what to do next**.

### `try` and `except`

Luckily, we can work around this by using a control structure with two new **keywords**: `try` and `except`. They function similarly to `if` and `else`, but they respond to errors rather than to logical conditions.

If we anticipate that certain errors might come about, we can add explicit instructions to our scripts as to what needs to be done in these exceptional circumstances. That might sound a little abstract, so here is a straightforward example:

In [None]:
d = {}

for char in aeneid:
    try:
        d[char] += 1
    except:
        d[char] = 1

print(d)

In terms of syntax, there is nothing new under the sun because the `try` and `except` construction is very similar to the `if`/`else` that you already know. What does it do? In a way, Python tells you already: you instruct Python to `try` to execute the first indented block; if something unexpected happens while doing so, the second block (under `except`) is executed instead.

This block is perfect for our problem: by default, our script will assume that a certain character is already present in `d` and it will `try` to increment its count. An exception will be raised, however, when our script hits the `KeyError` that we anticipate, and in that case the second block will be executed (i.e. the creation of a new key:value pair). This behaviour is perfectly equivalent to our earlier solution with `if/else`.
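To make this equivalence concrete, the following sketch builds the same frequency dictionary twice on a short sample string and compares the results. The names `with_if` and `with_try` and the sample `text` are chosen for this illustration only.

```python
text = "Arma virumque cano"  # short sample, assumed for this sketch

# Version 1: test for the key before incrementing.
with_if = {}
for char in text:
    if char in with_if:
        with_if[char] += 1
    else:
        with_if[char] = 1

# Version 2: just try to increment, and initialize on a KeyError.
with_try = {}
for char in text:
    try:
        with_try[char] += 1
    except KeyError:
        with_try[char] = 1

print(with_if == with_try)
```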

> *Question: in many cases, especially for longer `for`-loops, the use of a `try/except` construction will be faster than using the more straightforward `if/else` construction. Can you guess why?*

The previous code block will "catch" any exception/error. That seems easy, but it is typically considered bad practice. We want to be more specific and only catch the errors that we truly anticipate. If not, we might be ignoring really bad exceptions. Therefore, we want to be more explicit and name the actual exception, like so:

In [None]:
d = {}

for char in aeneid:
    try:
        d[char] += 1
    except KeyError: # more specific error catching!
        d[char] = 1

print(d)

Here, we restrict the "emergency" solution to the one scenario we anticipate, which is safer.
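A small sketch may show why the specific version is safer. It plants a deliberate bug (adding the string `"1"` instead of the integer `1`) and compares both styles; the names `d_bare`, `d_specific` and `message` are hypothetical.

```python
text = "arma"

# Bare except: the TypeError caused by the bug is silently swallowed,
# so the second 'a' resets the count instead of incrementing it.
d_bare = {}
for char in text:
    try:
        d_bare[char] += "1"   # bug: a string is added to an integer
    except:
        d_bare[char] = 1

# Specific except: only the anticipated KeyError is handled,
# so the TypeError escapes from the loop and the bug becomes visible.
d_specific = {}
message = None
try:
    for char in text:
        try:
            d_specific[char] += "1"   # same bug
        except KeyError:
            d_specific[char] = 1
except TypeError as error:
    message = f"bug surfaced: {error}"

print(d_bare)
print(message)
```

With the bare `except`, the script finishes without complaint but reports a wrong count for `'a'`; with `except KeyError`, the mistake is reported the first time it occurs.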

> *Question: Exceptions come in many forms. Can you think of other exceptions that we've already encountered?*

An **IndexError** is another commonly encountered error, which can be very inconvenient:

In [None]:
cnt = 0
while True:
    print(aeneid[cnt], end='')
    cnt += 1

> *Question: change the code block above and jump out of the `while` loop once we have printed all characters.*
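One possible approach, sketched here on a short sample string rather than on `aeneid`, is to catch the `IndexError` and `break` out of the loop. The characters are collected in a list (a hypothetical `collected`) instead of being printed, so that the result can be inspected afterwards.

```python
text = "Arma virumque cano"  # short sample, assumed for this sketch

collected = []
cnt = 0
while True:
    try:
        collected.append(text[cnt])
    except IndexError:
        break   # we ran past the last character: leave the loop
    cnt += 1

result = "".join(collected)
print(result)
```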

Reading files is another context in which exceptions are often used. If you work with large corpora, a small number of files will often have encoding errors, and you might want to skip these. Catching the relevant exception (the `UnicodeDecodeError`, which is raised when the bytes in a file cannot be decoded into text) will save you a lot of trouble. Do you understand this dummy example?

```python
for filename in filenames:
    try:
        with open(filename) as f:
            text = f.read()
        print(filename, 'correctly parsed!')
    except UnicodeDecodeError:
        pass # do you know what this does?
```
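For a version that can actually be run, the sketch below first creates two throwaway files in a temporary directory, one valid and one containing bytes that are not valid UTF-8, and then reads them back. The file names and the lists `parsed` and `skipped` are invented for this illustration, and UTF-8 is assumed as the intended encoding.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    good_path = os.path.join(tmpdir, "good.txt")
    bad_path = os.path.join(tmpdir, "bad.txt")

    # A well-formed UTF-8 file.
    with open(good_path, "w", encoding="utf-8") as f:
        f.write("Arma virumque cano")

    # A file with raw bytes that cannot be decoded as UTF-8.
    with open(bad_path, "wb") as f:
        f.write(b"\xff\xfe\xfa broken bytes")

    parsed, skipped = [], []
    for filename in [good_path, bad_path]:
        try:
            with open(filename, encoding="utf-8") as f:
                text = f.read()
            parsed.append(os.path.basename(filename))
        except UnicodeDecodeError:
            # Remember which file failed instead of silently passing.
            skipped.append(os.path.basename(filename))

print("parsed:", parsed)
print("skipped:", skipped)
```

Keeping a list of skipped files, rather than using a bare `pass`, makes it easy to check afterwards how much of the corpus was left out.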

As always, remember that errors are your friend in Python and you should always pay close attention to what they are saying. They might be annoying, but you can put them to use with `try/except`. Remember that exceptions protect you from something even worse.

## The Secret Counter

We've gone through the frequency dictionary exercise a couple of times by now. You might hate us for saying this, but there's a little secret that we haven't told you about before... We are about to tell you about one of the best kept secrets in the Python universe:

In [None]:
from collections import Counter

cnt = Counter()
cnt.update(aeneid)

#cnt = Counter(aeneid) would also work

print(cnt)

Wow, that's amazing! This gives us everything we need, in just a few lines of code. We introduce it only now because it's crucial to first understand how low-level Python dictionaries work.

Note that we explicitly have to import `Counter` from the `collections` module in Python's Standard Library (which has many other really useful tools). After that, we can **instantiate** or **initialize** a Counter through what is known as the **constructor function**.

In [None]:
counter = Counter() # constructor
print(type(counter))

Our variable `counter` has a type that indicates that it's not one of the primitive data types that we covered so far, but a more specific kind of object. Using `help()`, we can find out what it has to offer.

In [None]:
help(Counter)

Interestingly, the documentation tells us that `Counter` behaves like a standard `dict`, but one that is specialized in counting. Because it already expects the values to be integers (that's what counts are!), it will assume a default value of 0 whenever we try to access a key that is not present yet:

In [None]:
print(counter['a'])
print(counter['b'])
print(counter['c'])

NOT. A. SINGLE. ERROR. GETS. THROWN. How cool is that? This explains why we *can* do the following:

In [None]:
for char in aeneid:
    counter[char] += 1

print(counter)
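The contrast with a plain `dict` can be sketched as follows. Note that merely looking up a missing key in a `Counter` returns 0 without storing that key; the variable names used here are illustrative only.

```python
from collections import Counter

plain = {}
counts = Counter()

# A plain dict raises a KeyError for a missing key...
try:
    plain["a"]
    plain_lookup = "no error"
except KeyError:
    plain_lookup = "KeyError"

# ...whereas a Counter simply answers 0.
counter_lookup = counts["a"]

# The lookup alone did not add the key to the Counter.
stored_after_lookup = "a" in counts

# Incrementing does store it.
counts["a"] += 1
stored_after_increment = "a" in counts

print(plain_lookup, counter_lookup, stored_after_lookup, stored_after_increment)
```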

Amazing... And that is not all: remember how cumbersome it is in Python to sort a dictionary by its values. This is especially cumbersome for linguists, who often want frequency lists that show the most common items first. This is really easy with `Counter`, thanks to its `.most_common()` method:

In [None]:
print(counter.most_common(3))

> *Questions*
> - Can you analyze what is being returned by this function?
> - What happens when you change the **argument value** `3`?
> - What happens when you don't specify an argument at all, when calling the function?

The `Counter` object is really useful.
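As a closing illustration, a `Counter` can also be fed a filtered stream of characters directly. This is a minimal sketch on the first line of the poem; the filter (`str.isalpha()` after lowercasing) and the names `letters` and `top` are assumptions made for this example.

```python
from collections import Counter

text = "Arma virumque cano, Troiae qui primus ab oris"

# Count only the letters, ignoring case, punctuation and whitespace.
letters = Counter(char for char in text.lower() if char.isalpha())

top = letters.most_common(3)
print(top)
```

Compared to the `dict` version, the initialization, the increment and the sorting are all handled by the `Counter` object itself.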