## Use case: exploring Hamlet

### ✏️ Exercises that combine reading files, basic text processing and data structures

### Upload a file

Go to [Project Gutenberg](https://www.gutenberg.org/) and download _Hamlet_.

Have a look at the document and familiarise yourself with the data!

Remove the header at the beginning of the file and the license and copyright notices at the end. Also remove the table of contents and the dramatis personae. Save the result as `hamlet.txt` (the code below expects it at `data/hamlet.txt`).

### Read the file

Ask Python to load the book. You can load the whole book at once with the `.read()` method:

In [None]:
book = ""
with open("data/hamlet.txt") as fr:
    book = fr.read()

We now have the whole book in a **string** variable, called `book`.

You can also read the book line by line with the `.readlines()` method:

In [None]:
lines = []
with open("data/hamlet.txt") as fr:
    lines = fr.readlines()

We now have the whole book in a variable called `lines`: a list of strings, where each string is one line of the book (including its trailing line break).
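
To see the difference between the two methods without touching the real file, here is a minimal sketch that uses `io.StringIO` as a stand-in for an open file. The sample text and the names `sample_text`, `whole` and `as_lines` are made up for illustration.

```python
import io

# A tiny in-memory stand-in for hamlet.txt (illustrative sample, not the real file).
sample_text = "BARNARDO.\nWho's there?\n\nFRANCISCO.\nNay, answer me.\n"

# .read() returns the whole content as one single string.
whole = io.StringIO(sample_text).read()

# .readlines() returns a list of strings, one per line,
# and each string keeps its trailing "\n".
as_lines = io.StringIO(sample_text).readlines()

print(type(whole).__name__, len(whole))
print(type(as_lines).__name__, len(as_lines))
print(repr(as_lines[0]))
```

Note that the blank line between the two speeches shows up in the list as the string `"\n"`, which will matter when looking for paragraph boundaries later on.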

Let's first iterate over the first 50 lines of _Hamlet_ and print each line:

In [None]:
for line in lines[:50]:
    print(line)

### Exercise 1: How many times does each character talk?

In [None]:
characters = ['LORD', 'FORTINBRAS', 'ALL', 'FIRST SAILOR', 'SERVANT', 'BOTH',
              'PRIEST', 'PROLOGUE', 'CAPTAIN', 'HORATIO', 'HAMLET', 'DANES',
              'FIRST AMBASSADOR', 'SECOND CLOWN', 'FIRST PLAYER', 'GENTLEMAN',
              'FRANCISCO', 'PLAYER KING', 'PLAYER QUEEN', 'MARCELLUS', 'KING',
              'BARNARDO', 'VOLTEMAND', 'OSRIC', 'FIRST CLOWN', 'GUILDENSTERN',
              'REYNALDO', 'LAERTES', 'OPHELIA', 'GHOST', 'MESSENGER', 'LUCIANUS',
              'QUEEN', 'ROSENCRANTZ', 'POLONIUS']

In [None]:
# Iterate through the lines of the file. Count how many times a character has a line, and sort the
# list of characters by how many times they speak in the play. You can use the list of characters
# that is provided above.
# 
# There are many ways to do that, some more efficient and some less. We don't care about that now!
#
# **Extra points** if you get the list of characters directly from the file, instead of using the
# one provided above! (tip: use Regular Expressions).
#
# Type your code here:



**👀 Tips:**

Don't read this unless you want some tips on how to address this problem!

Think about how you would solve this problem without using Python.

Here is one possible approach:

**Main exercise: counting character lines and sorting them:**
* We can use a dictionary to keep counts of how many times a character speaks. To do so:
  * Instantiate an empty dictionary where we will store counts of character lines.
  * Assuming we already know the list of characters (provided above), we can iterate over the list of characters.
  * Then, for each character, we iterate over the lines in the file.
  * If the line starts with the name of the character (followed by a period), there are two possibilities:
    * If the character is not yet in the dictionary, we add it with value 1 (because we've seen it once).
    * If the character is already in the dictionary, we do not add it again (it's already there!) but increase its value by 1.
* After iterating over all characters, we have a dictionary of counts per character. We can now sort its items by count (using the `sorted()` function) and print them.

**Extra exercise: finding the characters in the play:**
* Instantiate an empty list where we will keep the list of characters.
* You can iterate over the list of lines in the Hamlet txt file.
* Notice the pattern: character names appear at the beginning of a line, in capital letters, followed by a period. Create a regular expression that captures the character name. Then, if a line matches this pattern, add the captured name to the list of characters. Watch out: the word "SCENE" follows the same pattern, so you may want to add a condition that ignores "SCENE" and does not add it to the list of characters.

**👀 Suggested solution:**

In [None]:
# ---------------------------------------
# Main exercise: get character counts:
dict_character_counts = dict()
for character in characters:
    for line in lines:
        if line.startswith(character + "."):
            if character in dict_character_counts:
                dict_character_counts[character] += 1
            else:
                dict_character_counts[character] = 1

# Sort the (character, count) pairs by count, from most to fewest lines:
sorted_characters = sorted(dict_character_counts.items(), key=lambda x: x[1], reverse=True)
for ch in sorted_characters:
    print(ch)
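
As an alternative, the same counting can be sketched with `collections.Counter` from the standard library, which handles the "add or increase" logic for us. This is only one possible variant, shown on made-up stand-ins (`sample_lines`, `sample_characters`) for the real `lines` and `characters` variables.

```python
from collections import Counter

# Illustrative stand-ins for `lines` and `characters` (not the real play).
sample_lines = [
    "HAMLET.\n", "To be, or not to be.\n", "\n",
    "OPHELIA.\n", "Good my lord.\n", "\n",
    "HAMLET.\n", "I humbly thank you.\n",
]
sample_characters = ["HAMLET", "OPHELIA", "KING"]

# Count one occurrence for every line that starts with "<NAME>.".
counts = Counter(
    name
    for line in sample_lines
    for name in sample_characters
    if line.startswith(name + ".")
)

# most_common() returns (name, count) pairs sorted from highest to lowest count.
for name, count in counts.most_common():
    print(name, count)
```

Characters that never speak in the sample (here `KING`) simply do not appear in the result, just as in the dictionary-based solution.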

In [None]:
# ---------------------------------------
# Extra points: find the characters in the play:
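import re

# A speaker line starts with capital letters and spaces, followed by a period.
speaker_pattern = re.compile(r'^([A-Z ]+)\.')
characters = set()
for line in lines:
    match = speaker_pattern.match(line)
    # Scene headings such as "SCENE I." share the pattern, so skip them.
    if match and 'SCENE' not in match.group(1):
        characters.add(match.group(1))
characters = list(characters)
print(characters)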
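
Before running a regular expression over the whole play, it can help to check it on a few hand-written lines. The sketch below does that; the sample lines and the names `speaker_pattern`, `sample_lines` and `found` are assumptions chosen for illustration.

```python
import re

# Capital letters and spaces at the start of the line, followed by a period.
speaker_pattern = re.compile(r'^([A-Z ]+)\.')

# Hand-written sample lines (illustrative, not read from the file).
sample_lines = [
    "SCENE I. Elsinore. A platform before the Castle.\n",
    "FIRST CLOWN.\n",
    "Is she to be buried in Christian burial?\n",
    "HAMLET.\n",
]

found = []
for line in sample_lines:
    match = speaker_pattern.match(line)
    # Keep the captured name unless it is a scene heading.
    if match and "SCENE" not in match.group(1):
        found.append(match.group(1))

print(found)
```

The scene heading matches the pattern but is filtered out by the `"SCENE"` check, and the ordinary verse line does not match at all because its capital letter is not followed directly by a period.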

### Exercise 2: Remove superfluous line breaks

In [None]:
# You will have noticed that the text file contains line breaks that do not correspond to the
# logical breaks of the text (i.e. paragraphs). Paragraphs are marked with two or more consecutive
# line breaks. Can you produce one paragraph per line? This means removing the superfluous line
# breaks. The output should be a list of strings, where each string corresponds to a paragraph.
#
# Type your code here:



**👀 Suggested solution:**

In [None]:
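book_paragraphs = []
with open("data/hamlet.txt") as fr:
    paragraph_lines = []
    for line in fr:
        line = line.strip()
        if line:
            # Non-empty line: it belongs to the current paragraph.
            paragraph_lines.append(line)
        elif paragraph_lines:
            # Blank line after some text: the paragraph is complete.
            book_paragraphs.append(" ".join(paragraph_lines))
            paragraph_lines = []
    # Keep the last paragraph if the file does not end with a blank line.
    if paragraph_lines:
        book_paragraphs.append(" ".join(paragraph_lines))

for paragraph in book_paragraphs[:100]:
    print(paragraph)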
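
Another possible approach starts from the whole book as a single string (as returned by `.read()`) and splits it on blank lines with a regular expression. The sketch below assumes a made-up `sample_book` string in place of the real file content.

```python
import re

# Illustrative stand-in for the string returned by fr.read() (not the real play).
sample_book = (
    "HAMLET.\n"
    "To be, or not to be, that is the question:\n"
    "Whether 'tis nobler in the mind to suffer\n"
    "\n"
    "\n"
    "OPHELIA.\n"
    "Good my lord,\n"
    "How does your honour for this many a day?\n"
)

# Split wherever a line break is followed by one or more blank
# (or whitespace-only) lines.
chunks = re.split(r'\n\s*\n', sample_book.strip())

# Inside each chunk, replace the remaining line breaks (and any runs of
# whitespace) with a single space.
paragraphs = [" ".join(chunk.split()) for chunk in chunks]

for paragraph in paragraphs:
    print(paragraph)
```

Compared with the line-by-line loop, this version is shorter but requires the whole text to be in memory as one string, which is not a concern for a single play.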