# Data structures: dictionaries

A dictionary is a collection of elements that are accessed by key rather than by position.

A dictionary in Python is a mapping between keys and values:
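* **Key**: the label used to look up an item. It can be a string, a number, or a tuple (more generally, any hashable object, which excludes lists, sets and other dictionaries). Keys are unique.
* **Value**: the data stored under a given key. It can be of any type! Values don't need to be unique.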

Why is it called a dictionary? Like a traditional dictionary, it consists of a collection of entries (i.e. **keys**), each with a **value** (in a traditional dictionary, that's the definition).

Example from a traditional dictionary:
> _programming (n)_: The art of writing a program.

In Python:
* A dictionary is represented in curly brackets.
* A colon (```:```) is used to separate the key from the value.
* Different items in a dictionary are separated by commas (```,```).

For example:

In [None]:
dictionary = {"programming" : "The art of writing a program.",
              "dictionary" : "Collection of words with their definitions."
             }
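The rules about keys can be checked with a short sketch. The variable names (`mixed_keys`, `repeated_key`, `list_key_failed`) are made up for this illustration; the point is that different kinds of keys can live in the same dictionary, that a repeated key keeps only its last value, and that a list is not accepted as a key.

```python
# Keys may be strings, numbers or tuples, even within the same dictionary.
mixed_keys = {
    "Endurance": 1915,                 # string key
    1492: "Santa Maria",               # integer key
    (41.60, -71.35): "HMS Endeavour",  # tuple key
}
print(len(mixed_keys))  # 3

# Keys are unique: a key written twice keeps only its last value.
repeated_key = {"programming": "first", "programming": "second"}
print(repeated_key)  # {'programming': 'second'}

# A list cannot be used as a key: Python raises a TypeError.
list_key_failed = False
try:
    bad = {["a", "list"]: 1}
except TypeError as error:
    list_key_failed = True
    print("TypeError:", error)
```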

You can have as many elements as necessary in a dictionary:

In [None]:
shipwrecks_by_year = {
    "Santa Maria": 1492,
    "USS Indianapolis" : 1945,
    "HMS Endeavour": 1778,
    "Endurance": 1915
    }

Values of dictionaries can also be tuples (like here), or lists, or sets...

In [None]:
shipwrecks_by_location = {
    "Santa Maria": (19.00, -71.00),
    "USS Indianapolis" : (12.03, 134.80),
    "HMS Endeavour": (41.60, -71.35),
    "Endurance": (-69.08, -51.50)
    }

... and dictionary values can also be nested dictionaries!

In [None]:
# Here the outer keys are strings, values are dictionaries!
# The keys of the inner dictionary ('Coordinates', 'Year')
# are strings, their values are (1) a tuple of floats for
# key "Coordinates", and (2) an integer for key "Year".
shipwrecks = {
    "Santa Maria": {
        "Coordinates": (19.00, -71.00),
        "Year": 1492
    },
    "USS Indianapolis": {
        "Coordinates": (12.03, 134.80),
        "Year": 1945
    },
    "HMS Endeavour": {
        "Coordinates": (41.60, -71.35),
        "Year": 1778
    },
    "Endurance": {
        "Coordinates": (-69.08, -51.50),
        "Year": 1915
    }
}

## Creating a dictionary

Creating an **empty** dictionary (with no items), two options:

In [None]:
new_dict = {}
new_dict = dict()

Creating a dictionary with some items in it:

In [None]:
new_dict = {
    "item1": 1,
    "item2": 2,
    "item3": 3
    }

✏️ **Exercise:**

In [None]:
# Create a dictionary that maps all places you've lived in with the country they belong to, and print it.
#
# Type your code here:



## Adding and updating items into a dictionary

You add a key-value pair into a dictionary like this, where "London" in this case is the new key you want to add, and "United Kingdom" is its value:

In [None]:
# Instantiate a dictionary of lived places:
dictionary_of_lived_places = {"Ottawa": "Canada",
                              "Ithaca": "Greece",
                              "Moose Jaw": "Canada"}

# Below, we're adding a new key-value pair to our dictionary_of_lived_places:
dictionary_of_lived_places["London"] = "United Kingdom"

# Print the dictionary:
print(dictionary_of_lived_places)

You can update the value of a given key in the same way. For example, let's suppose for a moment that 'Moose Jaw' is not in Canada, but in Iceland:

In [None]:
dictionary_of_lived_places["Moose Jaw"] = "Iceland"
print(dictionary_of_lived_places)
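The same syntax therefore does two different things, depending on whether the key already exists. Here is a minimal sketch of the difference, using a small `places` dictionary made up for the illustration: the number of items only grows when the key is new.

```python
places = {"Ottawa": "Canada", "Ithaca": "Greece"}
print(len(places))  # 2

# "London" is not a key yet, so this ADDS a new key-value pair.
places["London"] = "United Kingdom"
print(len(places))  # 3

# "London" is now a key, so this UPDATES its value instead of adding an item.
places["London"] = "UK"
print(len(places))       # 3
print(places["London"])  # UK
```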

## Accessing a value from a dictionary

Dictionaries provide a very straightforward way to access the value of a given key.

Suppose we want to get the coordinates of the 'Endurance', given the following dictionary:

In [None]:
shipwrecks_by_location = {
    "Santa Maria": (19.00, -71.00),
    "USS Indianapolis" : (12.03, 134.80),
    "HMS Endeavour": (41.60, -71.35),
    "Endurance": (-69.08, -51.50)
    }

We just need to write the name of the dictionary, followed by the key in square brackets:

In [None]:
print(shipwrecks_by_location["Endurance"])
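The same square-bracket syntax can be chained to reach into a nested dictionary, like the `shipwrecks` dictionary from the introduction. The sketch below uses a shortened copy of it; the names `endurance`, `latitude` and `longitude` are only chosen for this example.

```python
shipwrecks = {
    "Santa Maria": {"Coordinates": (19.00, -71.00), "Year": 1492},
    "Endurance": {"Coordinates": (-69.08, -51.50), "Year": 1915},
}

# The first lookup returns the inner dictionary...
endurance = shipwrecks["Endurance"]
print(endurance["Year"])  # 1915

# ...so both lookups can also be written in one go.
print(shipwrecks["Endurance"]["Year"])  # 1915

# A tuple value can be unpacked into separate variables.
latitude, longitude = shipwrecks["Santa Maria"]["Coordinates"]
print(latitude)   # 19.0
print(longitude)  # -71.0
```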

What happens if you try to access a key that does not exist in the dictionary?

In [None]:
print(shipwrecks_by_location["Titanic"])

Python raises a `KeyError`. To avoid it, you can always check beforehand if the key exists:

In [None]:
query = "Titanic"
if query in shipwrecks_by_location:
    print(shipwrecks_by_location[query])
else:
    print("Warning: " + query + " is not in the dictionary.")

Alternatively, you can use the method ```.get()```: its first argument is the key you want to look for in the dictionary, and its second argument is the default value that is returned if the key has not been found:

In [None]:
query = "Endurance"
print(shipwrecks_by_location.get(query, "Key not in dictionary!"))

In [None]:
query = "Titanic"
print(shipwrecks_by_location.get(query, "Key not in dictionary!"))
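The second argument of `.get()` is optional. As a small sketch (the variable names `result` and `year` are made up here): if you leave it out, `.get()` returns `None` for a missing key, and in both cases the dictionary itself is left unchanged.

```python
shipwrecks_by_year = {"Santa Maria": 1492, "Endurance": 1915}

# Without a second argument, a missing key gives None instead of an error.
result = shipwrecks_by_year.get("Titanic")
print(result)  # None

# The default can be any value, not only a message.
year = shipwrecks_by_year.get("Titanic", 0)
print(year)  # 0

# .get() only reads: the missing key has not been added.
print("Titanic" in shipwrecks_by_year)  # False
```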

## Other methods to retrieve data from a dictionary

Get all the **keys** in the dictionary:

In [None]:
print(dictionary_of_lived_places.keys())

Get all the **values** in the dictionary (as a list-like view, which you can convert into a list with `list()`):

In [None]:
print(dictionary_of_lived_places.values())

Get the **key-value pairs** in the dictionary (as a list-like view of tuples):

In [None]:
print(dictionary_of_lived_places.items())
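What these three methods return are "view" objects rather than plain lists. The sketch below (with a made-up `places` dictionary) shows how to turn them into real lists with `list()`, and that a view keeps following the dictionary when it changes.

```python
places = {"Ottawa": "Canada", "Ithaca": "Greece"}

# Convert the views into ordinary lists.
keys = list(places.keys())
values = list(places.values())
pairs = list(places.items())
print(keys)      # ['Ottawa', 'Ithaca']
print(values)    # ['Canada', 'Greece']
print(pairs[0])  # ('Ottawa', 'Canada')

# A view is "live": it reflects later changes to the dictionary...
key_view = places.keys()
places["London"] = "United Kingdom"
print(len(key_view))  # 3

# ...whereas the list we made earlier is a frozen copy.
print(len(keys))  # 2
```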

## Sorting a dictionary

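Since Python 3.7, dictionaries keep their items in the order in which they were inserted, but they have no method to sort themselves. What we can do is sort a representation of the dictionary, for example its list of key-value pairs.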

Sort **by key** in ascending order:

In [None]:
sorted(dictionary_of_lived_places.items())

Sort **by key** in descending order:

In [None]:
sorted(dictionary_of_lived_places.items(), reverse=True)

Sort **by value** in ascending order:

In [None]:
sorted(dictionary_of_lived_places.items(),key=lambda x:x[1])
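Note that `sorted()` returns a list of tuples, not a dictionary. If you need a dictionary again, one option is to pass that list to `dict()`. This is a sketch that assumes Python 3.7 or later, where a dictionary keeps the order in which its items were inserted; the name `shipwrecks_sorted_by_year` is made up for the example.

```python
shipwrecks_by_year = {
    "Santa Maria": 1492,
    "USS Indianapolis": 1945,
    "HMS Endeavour": 1778,
    "Endurance": 1915,
}

# sorted() gives back a list of (key, value) tuples, ordered by value here.
sorted_pairs = sorted(shipwrecks_by_year.items(), key=lambda x: x[1])
print(sorted_pairs[0])  # ('Santa Maria', 1492)

# dict() rebuilds a dictionary, inserting the pairs in their sorted order.
shipwrecks_sorted_by_year = dict(sorted_pairs)
print(list(shipwrecks_sorted_by_year))
# ['Santa Maria', 'HMS Endeavour', 'Endurance', 'USS Indianapolis']
```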

✏️ **Exercise:**

In [None]:
# Can you guess how to sort by value in descending order?
#
# Type your code here:



## Iterate over a dictionary

We can iterate over a dictionary in a similar way to how we iterate over a list. By default, iteration is done over the keys of the dictionary. See the example:

In [None]:
dictionary_of_lived_places = {
    "Moose Jaw": "Canada",
    "Saskatoon": "Canada",
    "Ithaca": "Greece"
}

for k in dictionary_of_lived_places:
    print(k)

We can also iterate over the dictionary key-value pairs (`k, v`) with the `.items()` method.

In [None]:
dictionary_of_lived_places = {
    "Moose Jaw": "Canada",
    "Saskatoon": "Canada",
    "Ithaca": "Greece"
}

for k, v in dictionary_of_lived_places.items():
    print(k, v)

Remember that we can access the value of a key like this: ```dictionary[key]```. The cell below achieves the same as the cell above:

In [None]:
for k in dictionary_of_lived_places:
    print(k, dictionary_of_lived_places[k])
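The same loops work on nested dictionaries, and `.values()` can be used when the keys are not needed. A minimal sketch, using a shortened copy of the `shipwrecks` dictionary from the introduction (the names `name`, `details` and `years` are only chosen for this example):

```python
shipwrecks = {
    "Santa Maria": {"Coordinates": (19.00, -71.00), "Year": 1492},
    "Endurance": {"Coordinates": (-69.08, -51.50), "Year": 1915},
}

# Each value is itself a dictionary, so we can look up its keys inside the loop.
for name, details in shipwrecks.items():
    print(name, details["Year"])

# If we only need the values, we can iterate over .values() directly.
years = []
for details in shipwrecks.values():
    years.append(details["Year"])
print(years)  # [1492, 1915]
```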

## Use case: counting words

Dictionaries are very helpful for counting things and keeping track of them. Look at the example below:

In [None]:
text = "This is the second day of the summer school, we're covering data structures in Python."

word_list = text.split(" ")
dict_wordcounts = dict()
for word in word_list:
    if word in dict_wordcounts:
        dict_wordcounts[word] += 1
    else:
        dict_wordcounts[word] = 1

print(dict_wordcounts)

Do you understand what is happening at each line of code?

Now with comments:

In [None]:
text = "This is the second day of the summer school, we're covering data structures in Python."
word_list = text.split(" ") # First of all, split the text by white space.
dict_wordcounts = dict() # We start an empty dictionary, which we will be filling with data as we read it.
for word in word_list: # Read and iterate over the data, in this case a list of words. 
    if word in dict_wordcounts: # Check if the word exists as a key in the dictionary
        dict_wordcounts[word] += 1 # Update the value (i.e. count) of the key (i.e. word) by adding 1 to it.
    else: # If the word does not exist as a key in the dictionary: 
        dict_wordcounts[word] = 1 # Add the key to the dictionary, and give it the value 1.

print(dict_wordcounts)
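The `if`/`else` pattern above is worth understanding, but it is not the only way to count. As a sketch of two common alternatives (not what the cell above uses): `.get()` with a default of 0 replaces the `if`/`else`, and `Counter` from the standard library's `collections` module does the counting for us. The sentence and variable names below are made up for the example.

```python
from collections import Counter

text = "the cat and the hat and the bat"
words = text.split(" ")

# Alternative 1: .get() returns 0 the first time we see a word.
counts = {}
for word in words:
    counts[word] = counts.get(word, 0) + 1
print(counts)  # {'the': 3, 'cat': 1, 'and': 2, 'hat': 1, 'bat': 1}

# Alternative 2: Counter is a dictionary specialised in counting.
counter = Counter(words)
print(counter["the"])          # 3
print(counter.most_common(1))  # [('the', 3)]
```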

## Use case: exploring Hamlet

### ✏️ Exercises that combine reading files, basic text processing and data structures

### Upload a file

Go to [Project Gutenberg](https://www.gutenberg.org/) and download _Hamlet_.

Have a look at the document: get familiarised with the data!

Remove the header and the license and copyright considerations at the end of the file. Remove table of contents, and dramatis personae. Save it and rename it `hamlet.txt`.

### Read the file

Ask Python to load the whole book. You can load the whole book at once with the `.read()` method:

In [None]:
book = ""
with open("../Data/hamlet.txt") as fr:
    book = fr.read()

We now have the whole book in a **string** variable, called `book`.

You can read the book line by line with the `.readlines()` method:

In [None]:
lines = []
with open("../Data/hamlet.txt") as fr:
    lines = fr.readlines()

We now have the whole book in a variable, called `lines`, that is a list of strings, each string a line of the book.
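To see the difference between the two methods without needing `hamlet.txt`, here is a small self-contained sketch. It writes two hand-typed lines to a temporary file (the file name `sample.txt` and the variables `sample_whole` and `sample_lines` are made up for this example) and reads them back both ways. Notice that each line returned by `.readlines()` still ends with its line break.

```python
import os
import tempfile

sample_text = "BARNARDO.\nWho's there?\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.txt")
    with open(path, "w", encoding="utf-8") as fw:
        fw.write(sample_text)

    # .read() returns the whole file as one string.
    with open(path, encoding="utf-8") as fr:
        sample_whole = fr.read()

    # .readlines() returns a list with one string per line.
    with open(path, encoding="utf-8") as fr:
        sample_lines = fr.readlines()

print(type(sample_whole).__name__)  # str
print(sample_lines)                 # ['BARNARDO.\n', "Who's there?\n"]

# rstrip("\n") removes the trailing line break of a line.
print(sample_lines[0].rstrip("\n"))  # BARNARDO.
```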

Let's first iterate over the first 50 lines of the _Hamlet_ play, and print each line:

In [None]:
for line in lines[:50]:
    print(line)

### Exercise 1: How many times does each character talk?

In [None]:
characters = ['LORD', 'FORTINBRAS', 'ALL', 'FIRST SAILOR', 'SERVANT', 'BOTH',
              'PRIEST', 'PROLOGUE', 'CAPTAIN', 'HORATIO', 'HAMLET', 'DANES',
              'FIRST AMBASSADOR', 'SECOND CLOWN', 'FIRST PLAYER', 'GENTLEMAN',
              'FRANCISCO', 'PLAYER KING', 'PLAYER QUEEN', 'MARCELLUS', 'KING',
              'BARNARDO', 'VOLTEMAND', 'OSRIC', 'FIRST CLOWN', 'GUILDENSTERN',
              'REYNALDO', 'LAERTES', 'OPHELIA', 'GHOST', 'MESSENGER', 'LUCIANUS',
              'QUEEN', 'ROSENCRANTZ', 'POLONIUS']

In [None]:
# Iterate through the lines of the file. Count how many times a character has a line, and sort the
# list of characters by how many times they speak in the play. You can use the list of characters
# that is provided above.
# 
# There are many ways to do that, some more efficient and some less. We don't care about that now!
#
# **Extra points** if you get the list of characters directly from the file, instead of using the
# one provided above! (tip: use Regular Expressions).
#
# Type your code here:



**👀 Tips:**

Don't read this unless you want some tips on how to address this problem!

Think how you would solve this problem without using Python.

One possible solution can be achieved in this way:

**Main exercise: counting character lines and sorting them:**
* We can use a dictionary to keep counts of how many times a character speaks. To do so:
  * Instantiate an empty dictionary where we will store counts of character lines.
  * Assuming we already know the list of characters (provided above), we can iterate over the list of characters.
  * Then, for each character, we iterate over the lines in the file.
  * If the line starts with the name of the character, there are two possibilities:
    * 1: if the character is not in the dictionary, we will add it to the dictionary, with value 1 (because we've seen it once)
    * 2: if the character is already in the dictionary, we will not add it to the dictionary (because it's already there!) but we will increase the value by 1.
* After having iterated over all characters, we already have a dictionary of counts per character. We can now sort this dictionary (using the `sorted()` function), and print the contents.

**Extra exercise: finding the characters in the play:**
* Instantiate an empty list where we will keep the list of characters.
* You can iterate over the list of lines in the Hamlet txt file.
* Notice the pattern: character names appear at the beginning of the line, in capital letters, followed by a dot. Create a regular expression that captures the character name. Then, if a line starts this way, add the captured character name to the list of characters. Watch out: the word "SCENE" has the same pattern, so you may want to write a condition that ignores "SCENE" and does not add it to the list of characters.

**👀 Suggested solution:**

In [None]:
# ---------------------------------------
# Main exercise: get character counts:
dict_character_counts = dict()
for character in characters:
    for line in lines:
        if line.startswith(character + "."):
            if character in dict_character_counts:
                dict_character_counts[character] += 1
            else:
                dict_character_counts[character] = 1

sorted_characters = sorted(dict_character_counts.items(),key=lambda x:x[1], reverse=True)
for ch in sorted_characters:
    print(ch)
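The solution above reads all the lines once per character. As one possible variation (a sketch, not the only way to do it), we can read the lines a single time and check whether the text before the first dot is a known character. The example runs on a few hand-typed lines in the same layout as the play, with a shortened set of characters; `sample_lines`, `sample_characters` and `speech_counts` are names made up for this illustration.

```python
sample_lines = [
    "BARNARDO.\n",
    "Who's there?\n",
    "\n",
    "FRANCISCO.\n",
    "Nay, answer me. Stand and unfold yourself.\n",
    "\n",
    "BARNARDO.\n",
    "Long live the King!\n",
]
# A set makes the "is this a character?" check fast.
sample_characters = {"BARNARDO", "FRANCISCO", "HAMLET"}

speech_counts = {}
for line in sample_lines:
    # Everything before the first dot is a candidate character name.
    candidate = line.split(".", 1)[0]
    if "." in line and candidate in sample_characters:
        speech_counts[candidate] = speech_counts.get(candidate, 0) + 1

# Sort by count, from the most to the least talkative character.
sorted_counts = sorted(speech_counts.items(), key=lambda x: x[1], reverse=True)
print(sorted_counts)  # [('BARNARDO', 2), ('FRANCISCO', 1)]
```

Characters that never speak in the sample (here `HAMLET`) simply do not appear in the result.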

In [None]:
# ---------------------------------------
# Extra points: find the characters in the play:
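import re
characters = []
for line in lines:
    # A character name is a run of capital letters and spaces, followed by a dot.
    match = re.match(r'^([A-Z ]+)\.', line)
    if match and 'SCENE' not in match.group(1):
        characters.append(match.group(1))
# Converting to a set and back to a list removes the duplicates.
characters = list(set(characters))
print(characters)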

### Exercise 2: Remove superfluous line breaks

In [None]:
# You will have noticed that there are line breaks in the text file that do not correspond with the
# text logical line breaks (i.e. paragraphs). Paragraphs are marked with two or more line breaks. Could
# you try and have one paragraph per line? This means removing superfluous line breaks. The output should
# be a list of strings, where each string corresponds to a paragraph.
#
# Type your code here:



#### 👀 Suggested solution

In [None]:
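book_paragraphs = []
with open("../Data/hamlet.txt") as fr:
    book = fr.readlines()
    paragraph = ""
    for line in book:
        line = line.strip()
        if line != "":
            paragraph += line + " "
        elif paragraph != "":
            # A blank line closes the current paragraph (repeated blank lines are skipped).
            book_paragraphs.append(paragraph.strip())
            paragraph = ""
    # Don't lose the last paragraph if the file does not end with a blank line.
    if paragraph != "":
        book_paragraphs.append(paragraph.strip())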

for paragraph in book_paragraphs[:100]:
    print(paragraph)
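Another possible approach, sketched below, is to work on the whole text as a single string and split it wherever there are one or more blank lines, using `re.split()`. The example uses a short hand-typed sample in the same layout as the play rather than the real file; `sample_text`, `blocks` and `paragraphs` are names made up for this illustration.

```python
import re

sample_text = (
    "BARNARDO.\nWho's there?\n\n"
    "FRANCISCO.\nNay, answer me. Stand and unfold yourself.\n\n\n"
    "BARNARDO.\nLong live the King!\n"
)

# Split wherever a line break is followed by one or more blank lines.
blocks = re.split(r"\n\s*\n", sample_text.strip())

# Inside each block, replace the remaining line breaks by single spaces.
paragraphs = [" ".join(block.split()) for block in blocks]

for paragraph in paragraphs:
    print(paragraph)
# BARNARDO. Who's there?
# FRANCISCO. Nay, answer me. Stand and unfold yourself.
# BARNARDO. Long live the King!
```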