# Diving deeper into Python data structures

As discussed in the last lecture, prefix trees are a very useful data structure.
They provide:

- compact storage
- quick search
- easy addition of new elements

But they aren't as easy to use as simpler data structures.
In particular, Python does not have a built-in implementation of prefix trees.
So we'll have to roll our own.
This will be a topic for another notebook, though.
First, let's take another gander at what data structures we already have: lists, sets, and Counters.
As we will see next, sets and Counters are actually specialized instances of another data type, called **dictionaries**.

## Reminder: Lists, sets, counters

By now you have seen your fair share of lists, sets, and counters.
**Lists** are the default data structure:

- They are linearly ordered.
- They can contain duplicates.
- Appending items at the end is cheap.
- Search is slow.
- They lack "semantics": items are only identified by their position in the list, which might change.

The last two points can be big detriments depending on the application.
Let's first consider the speed of search.
Run the cells below to see how much faster it can be to store a word list in a set compared to a list.
Each cell uses the Jupyter notebook command `%%time` to keep track of how long the computation takes.

In [None]:
%%time
from nltk.corpus import brown, words

corpus = brown.words()
wordlist = list(words.words())  # we explicitly tell Python to store this as a list


def spellcheck(text, wordlist):
    """Compute list of misspelled words and their positions"""
    return [[text[n], n] for n in range(len(text))
            if text[n] not in wordlist]


spellcheck(corpus, wordlist)

In [None]:
%%time
from nltk.corpus import brown, words

corpus = brown.words()
wordlist = set(words.words())  # and now we store it as a set


def spellcheck(text, wordlist):
    """Compute list of misspelled words and their positions"""
    return [[text[n], n] for n in range(len(text))
            if text[n] not in wordlist]


spellcheck(corpus, wordlist)

Depending on your machine, the code with sets might have taken 10 seconds to a minute to run.
In that time, it checked over a million words.
The list code, on the other hand, might have taken so long that you just decided to restart the kernel.
So yes, **sets** are much more efficient than lists:

- almost instantaneous search
- no linear order
- no duplicates
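
If you do not have NLTK at hand, the same effect shows up with made-up data.
The sketch below is only an illustration: the vocabulary and the tokens are invented stand-ins for the word list and the corpus, and `count_unknown` and `timed` are hypothetical helper names.
Exact timings will differ from machine to machine.

```python
import time

# Hypothetical stand-in data: 50,000 made-up "words" instead of the NLTK word list
vocabulary_list = [f"word{n}" for n in range(50_000)]
vocabulary_set = set(vocabulary_list)

# 200 tokens that are NOT in the vocabulary, which is the worst case for a list
tokens = [f"typo{n}" for n in range(200)]


def count_unknown(tokens, vocabulary):
    """Count the tokens that do not occur in the vocabulary."""
    return sum(1 for token in tokens if token not in vocabulary)


def timed(vocabulary):
    """Return the number of unknown tokens and the seconds needed to count them."""
    start = time.perf_counter()
    unknown = count_unknown(tokens, vocabulary)
    return unknown, time.perf_counter() - start


unknown_list, seconds_list = timed(vocabulary_list)
unknown_set, seconds_set = timed(vocabulary_set)

print(f"list: {unknown_list} unknown tokens in {seconds_list:.4f} seconds")
print(f"set:  {unknown_set} unknown tokens in {seconds_set:.4f} seconds")
```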

But why are sets so much faster than lists?
In order to understand this, we have to look at a more general data type, **dictionaries**.
Dictionaries will look instantly familiar to you, as you already know their two most common subtypes: sets and Counters.

## Python dictionaries (i.e. hashmaps)

Just like a counter, a dictionary is a collection of keys and values.
In all the counters we have used so far, the keys were words and the values were numbers.
Dictionaries work pretty much the same, but allow a wider range of values.
For instance, we could have a dictionary that uses phone numbers as keys and the owner's name as the value.

In [None]:
phone_matcher = {"555-732": "Bruce", "555-238": "Diana", "555-3927": "Clark"}

As with counters, we can use the keys to get the values.
While the bracket notation is available, it is preferable to use the `.get` method (the reason for that will be explained later).

In [None]:
phone_matcher["555-732"]

In [None]:
phone_matcher.get("555-732")

But maybe we want more structured information, such as both first name and last name.
Of course we can just add that to the strings.

In [None]:
phone_matcher = {"555-732": "Bruce Wayne", "555-238": "Diana", "555-3927": "Clark Kent"}
phone_matcher.get("555-732")

But then it gets tricky to extract just first names or just last names.
And what should we do if an entry lacks a last name, like Diana above?
Is that just a mistake in the records?
In the case at hand, it isn't --- while Wonder Woman once used the secret identity of Diana Prince, Prince is not her actual last name.
Her name is just Diana.
To handle cases like this more gracefully, we can make our values more complex by using lists.

In [None]:
phone_matcher = {"555-732": ["Bruce", "Wayne"], "555-238": ["Diana", ""], "555-3927": ["Clark", "Kent"]}
phone_matcher.get("555-732")[1]

But that's still tedious because we have to remember whether the first component is the first name or the last name.
So we can use dictionaries instead of lists.
Yes, we can use dictionaries inside dictionaries!

In [None]:
phone_matcher = {"555-732": {"first name": "Bruce",
                             "last name": "Wayne"},
                 "555-238": {"first name": "Diana",
                             "last name": ""},
                 "555-3927": {"first name": "Clark",
                              "last name": "Kent"}}
phone_matcher.get("555-732").get("last name")

So dictionaries are similar to lists in the sense that they can contain items that themselves contain other items.
They can also contain duplicate items, as long as the keys are distinct.

In [None]:
phone_matcher = {"555-732": {"first name": "Bruce",
                             "last name": "Wayne"},
                 "555-739": {"first name": "Bruce",
                             "last name": "Wayne"},
                 "555-238": {"first name": "Diana",
                             "last name": ""},
                 "555-3927": {"first name": "Clark",
                              "last name": "Kent"}}
print(phone_matcher.get("555-732").get("last name"))
print(phone_matcher.get("555-739").get("last name"))

But in most other respects, dictionaries are much closer to sets and counters.

## Comparing dictionaries and sets

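Like sets, dictionaries cannot be indexed by position (although, unlike sets, they do remember the order in which keys were inserted as of Python 3.7), and search is very fast: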

In [None]:
%%time
from nltk.corpus import brown, words

corpus = brown.words()
# we use a dictionary comprehension to store it as a dictionary
wordlist = {word: word for word in words.words()}

def spellcheck(text, wordlist):
    """Compute list of misspelled words and their positions"""
    return [[text[n], n] for n in range(len(text))
            if wordlist.get(text[n]) is None]


spellcheck(corpus, wordlist)

The code above takes approximately as much time as the one using a set for `wordlist`.
That's because a set is essentially a dictionary where each value is also its own key.

In [None]:
# a set
some_set = {"a", "b", "c"}
# the corresponding dictionary
some_dict = {"a": "a", "b": "b", "c": "c"}

Don't take this metaphor too literally: sets aren't just a shorthand for defining dictionaries.
Any equality test between the two will fail because sets and dictionaries are still data structures with different types.
And they provide different methods (for instance, `.get` doesn't work with sets).

In [None]:
# a set
some_set = {"a", "b", "c"}
# the corresponding dictionary
some_dict = {"a": "a", "b": "b", "c": "c"}

print("Sets and dictionaries are the same?", some_set == some_dict)
print("Type of set:", type(some_set))
print("Type of dictionary:", type(some_dict))

In [None]:
print(some_dict.get("a"))
print(some_set.get("b"))  # fails: sets have no .get method

So why would one ever use a set instead of a dictionary?
For limited cases, sets are easier to use.
They can be defined directly from a list, and they behave like a list for membership tests (`in`, `not in`).
They are also slightly more efficient than dictionaries.
Whenever you need more speed than what a list provides, both sets and dictionaries are plausible candidates.
But if you have no need for keeping keys and values distinct, or for retrieving a specific element, then a dictionary offers no advantages over a set and is a bit clunkier to work with.
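
A small sketch of this comparison, using a made-up `word_list`: the set is built directly from the list, whereas the dictionary needs a value for every key.
Note that `in` also works with dictionaries, where it checks the keys.

```python
word_list = ["the", "cat", "sat", "on", "the", "mat"]

# a set can be built directly from a list; duplicates disappear
word_set = set(word_list)

# for a dictionary we have to decide on values; here each key maps to itself
word_dict = {word: word for word in word_list}

# membership tests look the same for lists, sets, and dictionaries
# (for a dictionary, `in` checks the keys)
print("cat" in word_list, "cat" in word_set, "cat" in word_dict)
print("dog" in word_list, "dog" in word_set, "dog" in word_dict)

# the list keeps both copies of "the", the set and the dictionary do not
print(len(word_list), len(word_set), len(word_dict))
```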

## Comparing dictionaries and counters

Dictionaries and counters are very similar.
That's because the latter is a dictionary with additional tweaks for the specialized purpose of counting.

- Both dictionaries and counters use the key-value format.
- Both dictionaries and counters allow the use of `[key]` and `.get(key)` for retrieving values.
- Only counters provide the `.most_common` method.
- Only counters can be defined from a list of single values.

Let us look at the last point in detail:

In [None]:
from collections import Counter
some_list = ["a", "b", "b", "c", "c", "c"]
# construct counter: succeeds
counts = Counter(some_list)
print(counts)

In [None]:
some_list = ["a", "b", "b", "c", "c", "c"]
# construct dictionary: fails
counts = dict(some_list)
print(counts)

The error message in the code cell above might seem a bit cryptic to you.
What it is saying is that there actually is a way to define a dictionary from a list, but it doesn't work for `some_list` above.
Instead, we need to provide a list of key-value pairs.

In [None]:
some_list = [["a", 1], ["b", 2], ["c", 3]]
# construct dictionary: succeeds
counts = dict(some_list)
print(counts)

When printed, this dictionary shows the same keys and values as the counter.
But the two are still distinct data types.
For one thing, only the counter provides the `most_common` method.

In [None]:
from collections import Counter

value_list = ["a", "b", "b", "c", "c", "c"]
keyvalue_list = [["a", 1], ["b", 2], ["c", 3]]

# construct counter and dictionary
counts_counter = Counter(value_list)
counts_dict = dict(keyvalue_list)

print(counts_counter.most_common(1))
print(counts_dict.most_common(1))  # fails: dictionaries have no .most_common method

They also have different types: dictionaries are of type `dict`, whereas counters are of type `collections.Counter`.

In [None]:
print(type(counts_counter))
print(type(counts_dict))

Finally, counters behave more gracefully when you're trying to get the value for a key that does not exist.

In [None]:
from collections import Counter

value_list = ["a", "b", "b", "c", "c", "c"]
keyvalue_list = [["a", 1], ["b", 2], ["c", 3]]

# construct counter and dictionary
counts_counter = Counter(value_list)
counts_dict = dict(keyvalue_list)

print("Counter's value for e", counts_counter["e"])
print("Dictionary's value for e", counts_dict["e"])

As you can see, the counter returns a count of 0, whereas the dictionary aborts with a key error.
This is why you should use the `.get`-method with dictionaries, which returns a default value of `None` instead of crashing the program.
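With counters, on the other hand, `.get` is less useful because the bracket notation already returns a reasonable default value of 0, whereas `.get` returns `None` for a missing key just as it does with dictionaries.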

In [None]:
from collections import Counter

value_list = ["a", "b", "b", "c", "c", "c"]
keyvalue_list = [["a", 1], ["b", 2], ["c", 3]]

# construct counter and dictionary
counts_counter = Counter(value_list)
counts_dict = dict(keyvalue_list)

print("Counter's value for e", counts_counter.get("e"))
print("Dictionary's value for e", counts_dict.get("e"))

In sum, then, counters are a better choice than dictionaries for their specialized purpose of calculating counts from a list of items.
Dictionaries are more versatile, but this also means that they cannot be optimized for the specific needs of each application.
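
To make the trade-off concrete, here is a rough sketch of counting by hand with a plain dictionary.
It relies on the optional second argument of `.get`, which is returned instead of `None` when the key is missing.
The variable names are only illustrative.

```python
from collections import Counter

value_list = ["a", "b", "b", "c", "c", "c"]

# counting by hand with a plain dictionary
counts_dict = {}
for item in value_list:
    # .get takes an optional second argument that is returned for missing keys
    counts_dict[item] = counts_dict.get(item, 0) + 1

# the counter does the same work in one step
counts_counter = Counter(value_list)

print(counts_dict)
print(counts_dict.get("e", 0), counts_counter["e"])
```

Both approaches yield the same counts, but the counter needs less code and adds `.most_common` on top.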

## Interim summary: Overview of data types

With all the different options, it's easy to lose track of their capabilities.
The table below gives a quick overview of the most important distinctions.

|                   | **Integer** | **String** | **List** | **Set** | **Counter**       | **Dictionary**    |
| --:               | :-:         | :-:        | :-:      | :-:     | :-:               | :-:               |
| Container?        | N           | only char  | Y        | Y       | Y                 | Y                 |
| Iterable?         | N           | Y          | Y        | Y       | Y (default: keys) | Y (default: keys) |
| Duplicate values? | N           | Y          | Y        | N       | Y                 | Y                 |
| Order?            | N           | Y          | Y        | N       | N                 | N                 |
| Index?            | N           | Y          | Y        | N       | N                 | N                 |
| Fast search?      | N           | N          | N        | Y       | Y                 | Y                 |
| `[key]`?          | N           | N          | N        | N       | Y (safe)          | Y (not safe)      |
| `.get`?           | N           | N          | N        | N       | Y (dispreferred)  | Y (preferred)     |

Here's what each property means:

- **Container**: Can it contain multiple items?
- **Iterable**: Can it be iterated over with a `for`-loop? For dictionaries and counters, `for`-loops by default iterate over keys, not values (with sets, the distinction does not matter). Note that all containers are iterable, and the other way round.
- **Duplicate values**: Can a value appear multiple times? Dictionaries and counters allow duplicate values, but not duplicate keys. In a set, each value is its own key, so duplicates are ruled out entirely.
- **Order**: Does the data type have an intrinsic order?
- **Index**: Can items be retrieved by their index? This presupposes having an intrinsic order, but there are data types that have order but not indices (see the expansion unit on some less common data structures).
- **Fast search**: Can items be retrieved quickly even if the data structure is very large? For small data structures (fewer than a couple thousand entries), this usually does not matter in practice.
- **`[key]`**: Can items be retrieved using the `[key]`-notation? This is not recommended for dictionaries because it gives an error if the key does not exist.
- **`.get`**: Can items be retrieved using the `.get`-method? This is possible for Counters, but it is rarely needed because they handle non-existing keys more gracefully.
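
A few of these properties can be checked directly.
The following sketch uses made-up example data and shows what a `for`-loop over a dictionary visits, and what happens when you try to index a set.

```python
counts = {"a": 1, "b": 2, "c": 3}

# a plain for-loop over a dictionary visits the keys
keys = [key for key in counts]

# values and key-value pairs have to be requested explicitly
values = list(counts.values())
pairs = list(counts.items())

print(keys)
print(values)
print(pairs)

# lists can be indexed by position, sets cannot
letters_list = ["a", "b", "c"]
letters_set = {"a", "b", "c"}
print(letters_list[0])
try:
    letters_set[0]
except TypeError as error:
    print("Sets have no indices:", error)
```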

## Explaining the speed of dictionaries

### Hash maps

You should now have a better idea of what all the different data structures are and how they differ from each other.
But our initial question still remains open: why are sets and dictionaries so much faster than lists when it comes to look-up?
The answer lies in the key-value architecture.
Remember, a set is essentially a dictionary where each value is its own key.
Therefore, sets also follow the key-value architecture even though it is not readily apparent.
The key-value architecture allows us to get around major shortcomings of the index-architecture used by lists.

In a list, there is no easy way of telling the position of an item.
If we want to know if `some_list` contains `foo`, there is no obvious starting point.
Just about any index could contain `foo`.
We have to go through the list from left to right until we find `foo` (if the list is ordered, there are some shortcuts, but the basic problem remains).
This necessarily means looking at many items that are not `foo`, wasting time.
The central problem is that there is no direct connection between an item and its position in the list, and thus the only safe option is an exhaustive search.

The key-value architecture works very differently.
It can be used to define a **hash map**.
Without going too much into the math, the hash map provides a mechanism for converting keys to indices.
In a loose sense, a dictionary is a list with a function that converts the key to the index of the item in the list.
Returning to our previous example, suppose that our item `foo` were accessible under the key `bar`.
The dictionary basically takes `bar`, runs it through the hash map to get back some value, say, 17.
It then immediately retrieves `some_list[17]`, giving us `foo`.
The whole process doesn't really involve any search at all.
We never search through `some_list`.
Instead, we immediately get the correct index from the hash map and immediately retrieve the desired item from the list.
Metaphorically speaking, the hash map is a bit like an address book where we can use an item's key to look up its index.
But like all metaphors, this is not quite right because an address book is itself a list, whereas a hash map is more like a system of rules for mechanically transforming a key into an index.
Be that as it may, the bottom line is easy: search in dictionaries (and sets and Counters) is fast because it does not involve any actual searching.
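
The idea can be sketched in a few lines of Python.
This is a deliberately simplified toy, not how Python's dictionaries are actually implemented: `ToyHashMap` and its method names are made up for illustration, and the index is computed with the built-in `hash` function.
One detail the description above glosses over is that two different keys can end up with the same index, which is why each slot in this sketch holds a short list of key-value pairs.

```python
class ToyHashMap:
    """A deliberately simplified hash map; real dictionaries are more sophisticated."""

    def __init__(self, n_slots=8):
        # the "list" that actually stores the items;
        # each slot holds a small list of (key, value) pairs
        self.slots = [[] for _ in range(n_slots)]

    def _index(self, key):
        # convert the key to a list index without looking at the stored items
        return hash(key) % len(self.slots)

    def put(self, key, value):
        slot = self.slots[self._index(key)]
        for position, (stored_key, _) in enumerate(slot):
            if stored_key == key:
                slot[position] = (key, value)  # overwrite an existing key
                return
        slot.append((key, value))

    def get(self, key, default=None):
        # only the one slot computed from the key is inspected
        for stored_key, value in self.slots[self._index(key)]:
            if stored_key == key:
                return value
        return default


toy = ToyHashMap()
toy.put("bar", "foo")
print(toy.get("bar"))
print(toy.get("missing"))
```

No matter how many items are stored, `get` only ever looks at a single slot, which is what keeps look-up fast.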

### A limitation of hash maps

Hash maps define a way of converting a key to the address where the corresponding value can be found.
But since this is a mechanical process that operates solely on the shape of the key, any change to the key would also change the address that is returned by the hash map.

Here is a concrete example (again ignoring all the math that's going on under the hood).
Suppose we have a dictionary `ex_dict` with a fixed hash map *h*.
We want to add the value `"I'm here"` to `ex_dict`, and we use the key `bad_choice`.
Our key `bad_choice` is a list of the form `["a"]`.
Based on this key, *h* computes a specific address for our value `"I'm here"`, and Python stores the value at this address in our dictionary `ex_dict`.
Let's say this address is 17.
Now, whenever we want to get the value for the key `bad_choice`, Python runs the key through the hash map *h*, determines that the output is 17, and returns the value at address 17 in `ex_dict`.
So far, so good.

But at a later point, we append `"b"` to `bad_choice`, so that it now is `["a", "b"]` instead of just `["a"]`.
The next time we try to get the value for the key `bad_choice`, the hash map *h* returns a different value because the key has changed.
Instead of 17, we now get, say, 23.
This is because the hash map is a mechanical translation from key shapes to addresses.
If the shape of the key changes, so does the address.
But now we have the wrong address, and all of a sudden we no longer get our desired value `"I'm here"`.
Instead, we get whatever is stored at address 23 (which might be nothing at all).
Changing the key has broken the whole dictionary.
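
This scenario can be acted out with a hand-made stand-in for the hash map.
In the sketch below, `toy_address` is a hypothetical function that derives an address from the printed form of the key; it is not the hash function Python actually uses, and the concrete addresses will not be 17 and 23.

```python
def toy_address(key, n_slots=32):
    """Toy stand-in for a hash map: derive a slot number from the key's printed form."""
    return sum(ord(char) for char in repr(key)) % n_slots


# the storage of our pretend dictionary
slots = [None] * 32

# store the value under a key that can change
bad_choice = ["a"]
slots[toy_address(bad_choice)] = "I'm here"

before = toy_address(bad_choice)
found_before = slots[before]
print("address before the change:", before, "->", found_before)

# changing the key changes the address that is computed from it
bad_choice.append("b")
after = toy_address(bad_choice)
found_after = slots[after]
print("address after the change:", after, "->", found_after)
```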

As this example shows, hash maps only work if keys cannot change.
This is why Python requires that the keys of a dictionary must be **hashable**.
The next notebook will explain in detail what this means.
For now, the only hashable data types we know are numbers and strings.
So for the time being, all our dictionary keys will be numbers or strings.
**Keys cannot be lists, sets, counters, or dictionaries** (however, values are not restricted in this way).

In [None]:
# a working dictionary
{"trigram as string": .5}

In [None]:
# a broken dictionary
{["trigram", "as", "list"]: .5}

In [None]:
# since values are their own keys in sets,
# sets cannot contain unhashable data types
{["a", "bigram"], ["another", "bigram"]}

In [None]:
# since sets are not hashable,
# sets cannot contain sets either
{{"no", "sets"}, {"inside", "sets", "please"}}
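
The failing cells above all stop with a `TypeError`.
If you want to check in advance whether something can serve as a key, one possible approach is to try Python's built-in `hash` function on it.
The helper `is_hashable` below is a made-up name for this sketch, not part of the standard library.

```python
from collections import Counter


def is_hashable(item):
    """Return True if Python can compute a hash for item, else False."""
    try:
        hash(item)
    except TypeError:
        return False
    return True


candidates = [42, "trigram as string", ["a", "list"], {"a", "set"},
              Counter("abc"), {"a": "dictionary"}]
for candidate in candidates:
    print(type(candidate).__name__, is_hashable(candidate))
```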

## Bullet-point summary

- Counters are a subtype of **dictionaries**, and sets behave like dictionaries in which each value is its own key.
- Dictionaries use hash maps to translate a key into the address of its item.
  This makes search very fast.
- Keys must be hashable.

|                   | **Integer** | **String** | **List** | **Set** | **Counter**       | **Dictionary**    |
| --:               | :-:         | :-:        | :-:      | :-:     | :-:               | :-:               |
| Container?        | N           | only char  | Y        | Y       | Y                 | Y                 |
| Iterable?         | N           | Y          | Y        | Y       | Y (default: keys) | Y (default: keys) |
| Duplicate values? | N           | Y          | Y        | N       | Y                 | Y                 |
| Order?            | N           | Y          | Y        | N       | N                 | N                 |
| Index?            | N           | Y          | Y        | N       | N                 | N                 |
| Fast search?      | N           | N          | N        | Y       | Y                 | Y                 |
| `[key]`?          | N           | N          | N        | N       | Y (safe)          | Y (not safe)      |
| `.get`?           | N           | N          | N        | N       | Y (dispreferred)  | Y (preferred)     |
| Hashable?         | Y           | Y          | N        | N       | N                 | N                 |