# Less common data structures

There's a myriad of data structures, each one of them with its unique advantages and drawbacks.
As a beginning programmer, you will rarely need more than the Python data structures covered so far: lists, tuples, dictionaries, sets, and Counters.

But sometimes, another data structure might make your life quite a bit easier.
This extension unit covers three fairly simple ones: ordered dictionaries, frozen sets, and default dictionaries.

## Ordered dictionaries

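Before Python 3.7, dictionaries had no guaranteed order (from 3.7 on, a normal dictionary remembers the order in which keys were inserted).
Either way, it is sometimes useful to be explicit about the order in which items are added to a dictionary.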
A concrete example of that is one of your previous assignments, where you had to present the user with a list of their wrong guesses, in exactly the order they made them.
This was probably a little tricky to do with just lists and Counters.
Python's `OrderedDict` would have made your life quite a bit easier.
Just check out this toy example:

In [None]:
from collections import OrderedDict

# let's assume the user will make the following guesses
future_guesses = ["eggs", "piano", "eggs", "tree", "aardvark", "eggs"]

# let's collect the guesses in an ordered dictionary
record = OrderedDict()
for guess in future_guesses:
    if guess not in record:
        record[guess] = 1
    else:
        record[guess] += 1
        
# and now let's show the user their guesses, plus how often they made them
print("Your guesses were", list(record.keys()))
for guess, count in record.items():
    print("You guessed", guess, count, "times")

Notice how the order of printed guesses exactly matches that of the user's guesses.
With a normal dictionary, this is only guaranteed from Python 3.7 on; in older versions, the order can vary arbitrarily.

In [None]:
# let's assume the user will make the following guesses
future_guesses = ["eggs", "piano", "eggs", "tree", "aardvark", "eggs"]

# let's collect the guesses in a normal dictionary
record = {}
for guess in future_guesses:
    if guess not in record:
        record[guess] = 1
    else:
        record[guess] += 1
        
# and now let's show the user their guesses, plus how often they made them
print("Your guesses were", list(record.keys()))
for guess, count in record.items():
    print("You guessed", guess, count, "times")

So whenever you need the key-value architecture, but also want to keep track of the order in which items are added to a dictionary, use `OrderedDict` from the `collections` library.
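
Beyond remembering insertion order, `OrderedDict` also offers a few order-specific extras. The sketch below, using made-up guesses, shows two of them: `move_to_end` for reordering, and equality comparisons that take order into account.

```python
from collections import OrderedDict

record = OrderedDict()
for guess in ["eggs", "piano", "tree"]:
    record[guess] = 1

# move an existing key to the end of the ordering
record.move_to_end("eggs")
print(list(record.keys()))  # ['piano', 'tree', 'eggs']

# two OrderedDicts are equal only if their items appear in the same order
first = OrderedDict([("a", 1), ("b", 2)])
second = OrderedDict([("b", 2), ("a", 1)])
print(first == second)              # False
print(dict(first) == dict(second))  # True, normal dictionaries ignore order
```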

## Immutable sets: `frozenset`

Lists have tuples as their immutable counterpart.
This makes it possible to use something list-like as the keys for a dictionary.
But that means that two keys will be distinct if they differ in order or numerosity.

In [None]:
test = {}
test[("a", "b")] = 1
test[("b", "a")] = 2
test[("a", "a", "b")] = 3

for k, v in test.items():
    print(k, v)

Sometimes, though, we may not want these distinctions.
For instance, we might be interested in how often *Trump* and *Twitter* show up in a bigram, irrespective of their order.
In that case, we do not want separate counts for `("Trump", "Twitter")` and `("Twitter", "Trump")`.
Instead, the counts of the two should be added up.
This would happen naturally if we used sets instead of tuples, because then the key `{"Trump", "Twitter"}` is the same as the key `{"Twitter", "Trump"}`.
But sets are mutable, and consequently they aren't hashable and cannot be used as keys.

For special cases like this, `frozenset` can be used to create immutable sets.

In [None]:
# this raises a TypeError because sets are unhashable and cannot be keys in Counters
from collections import Counter


def text_to_bigram_sets(text):
    """Convert a text (list of strings) to bigrams."""
    return [set(text[n:n+2]) for n in range(len(text) - 1)]


bg_list = text_to_bigram_sets(["A", "simple", "example", "is", "an", "example", "simple", "in", "French"])
print(Counter(bg_list))

In [None]:
# this works because frozensets are immutable and thus hashable
from collections import Counter


def text_to_bigram_sets(text):
    """Convert a text (list of strings) to bigrams."""
    return [frozenset(text[n:n+2]) for n in range(len(text) - 1)]


bg_list = text_to_bigram_sets(["A", "simple", "example", "is", "an", "example", "simple", "in", "French"])
print(Counter(bg_list))

Notice how the counts for *simple example* and *example simple* were added together because they are the same frozen set.
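
Frozen sets work as keys in ordinary dictionaries, too. The following sketch uses invented word pairs to show that order is ignored. It also points out one caveat: a bigram that repeats the same word collapses to a one-element frozen set.

```python
# made-up word pairs for illustration
pairs = [("Trump", "Twitter"), ("Twitter", "Trump"), ("very", "very")]

pair_counts = {}
for pair in pairs:
    key = frozenset(pair)
    pair_counts[key] = pair_counts.get(key, 0) + 1

# both orders were counted under the same key
print(pair_counts[frozenset(["Trump", "Twitter"])])  # 2

# a repeated word leaves only one element in the frozen set
print(frozenset(("very", "very")) == frozenset(["very"]))  # True
```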

## Default dictionaries

As you know, dictionaries don't respond well if we want to get the value of a key that doesn't exist.

In [None]:
test = {"a": 5}
print(test["b"])

We can use the `.get` method instead to avoid a `KeyError`.
For keys that aren't in the dictionary, we simply get `None` as the default value.
This is a special value that represents the absence of a value.

In [None]:
test = {"a": 5}
print(test.get("b"))

At least sometimes, though, it would be nice if we could tell Python that if we're asking for the value of a key that does not exist, the key should be added to the dictionary with a default value.
This is exactly what the `collections` library's `defaultdict` allows us to do (please don't ask why `Counter` and `OrderedDict` are titlecase but `defaultdict` is all lowercase, I don't think anybody knows).

In [None]:
from collections import defaultdict


def defaultvalue():
    return "added as default"


# instantiate a default dictionary,
# using defaultvalue for unspecified values
test = defaultdict(defaultvalue)
# add a with specific value
test["a"] = 5
# add b with default
test["b"]

print(test)

Here we define a function `defaultvalue` that will act as the **factory** for the default values.
If we request the value for a key that isn't in the dictionary yet, the return value of the factory is set as the value of this key.
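
One consequence worth keeping in mind is that merely looking up a missing key with square brackets already adds it. The sketch below (with a hypothetical factory `unknown`) suggests that `in` and `.get` leave the dictionary untouched, whereas indexing does not.

```python
from collections import defaultdict


def unknown():
    # hypothetical factory, used only for this illustration
    return "unknown"


test = defaultdict(unknown)
test["a"] = 5

print("b" in test)    # False
print(test.get("b"))  # None
print(len(test))      # 1, neither check added the key

print(test["b"])      # unknown
print(len(test))      # 2, indexing added the key with its default
```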

A less obvious usage invokes `list` as the factory.

In [None]:
from collections import defaultdict

test = defaultdict(list)
words = ["some", "keys", "for", "testing"]
for pos in range(len(words)):
    key = words[pos]
    for _ in range(pos + 1):  # we use _ to indicate that the variable is not used anywhere
        test[key].append(key.upper())
    
print(test)

By using `list` as the factory for default values, the default value becomes the empty list `[]`.
So when we invoke, say, `test["some"]` for the first time, it is immediately set to `[]` because it has no value yet.
But since the first time we encounter this is as part of `test["some"].append("some".upper())`, we immediately append `"some".upper()` (i.e. `"SOME"`) to `[]`.
Hence a single command immediately sets the value for `"some"` to `["SOME"]`.
With `"some"`, we append only one time.
With the later words, the number of append steps reflects their position in the list `words`.
Hence `"keys"` is appended twice, first to `[]`, then to `["KEYS"]`.
We append three times for `"for"`, and four times for `"testing"`.
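
A more typical use of the same pattern is grouping items under a shared key. As a minimal sketch, assuming a made-up word list, the following groups words by their first letter without ever checking whether the letter has been seen before.

```python
from collections import defaultdict

# made-up word list for illustration
words = ["apple", "avocado", "banana", "blueberry", "cherry"]

by_initial = defaultdict(list)
for word in words:
    # the first lookup for a new letter creates an empty list to append to
    by_initial[word[0]].append(word)

print(dict(by_initial))
# {'a': ['apple', 'avocado'], 'b': ['banana', 'blueberry'], 'c': ['cherry']}
```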

We could also use `int` to create a poor man's Counter.
With `int` as the factory, the default value is 0.
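
The defaults come from calling the factory without arguments: `int()` returns `0` and `list()` returns `[]`. A small sketch on a made-up word list (standing in for a real corpus) shows the counting pattern and checks that it agrees with a Counter.

```python
from collections import Counter, defaultdict

# calling the factories directly shows where the defaults come from
print(int())   # 0
print(list())  # []

# made-up word list standing in for a real corpus
words = ["the", "cat", "saw", "the", "dog", "the"]

counts = defaultdict(int)
for word in words:
    counts[word] += 1  # a missing word starts at 0, then gets incremented

print(counts["the"])                         # 3
print(dict(counts) == dict(Counter(words)))  # True
```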

In [None]:
from collections import defaultdict
from nltk.corpus import brown

words = brown.words()
counts = defaultdict(int)

for word in words:
    counts[word] += 1
    
print(counts["the"])

This code is more straightforward than the alternative with a dictionary (but of course a Counter would still be the easiest and also most powerful choice for counting).

In [None]:
from nltk.corpus import brown

words = brown.words()
counts = {}

for word in words:
    if counts.get(word):
        counts[word] += 1
    else:
        counts[word] = 1
    
print(counts["the"])

As a final example, consider how we can use a default dictionary to simplify the translation from an n-gram counter to a prefix tree.

In [None]:
from collections import defaultdict


def default_prefixtree():
    return defaultdict(default_prefixtree)


def ngramcounter_to_prefixtree(counter):
    """Convert counter with n-gram frequencies to prefix tree."""
    tree = defaultdict(default_prefixtree)
    for ngram, freq in counter.items(): 
        current_subtree = tree
        for word in ngram:
            current_subtree = current_subtree[word]
        current_subtree["freq"] = freq
    return tree


example_counts = {("the", "old", "man"): 0.1,
                  ("the", "old", "woman"): 0.2,
                  ("the", "ugly", "man"): 0.3}
ngramcounter_to_prefixtree(example_counts)

This code is quite a bit shorter than a version that builds the tree from normal dictionaries.
The essential trick is the `default_prefixtree` function, which sets the default value to be an empty default dictionary whose default value is an empty default dictionary whose default value is an empty default dictionary, and so on.
It wouldn't be enough to just create an empty dictionary by default, because then the dictionaries that are created as default values wouldn't have a default value of their own.
We have to keep passing along the default value, and that's exactly what `default_prefixtree` does by mentioning itself in the `return` value.

Thanks to `default_prefixtree`, we can remove the whole `if`-test from the conversion function.
Now it no longer matters if the key already exists or not.
If it does, we retrieve it as usual; if it doesn't, we add the key with a default dictionary as its value.
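
Nested default dictionaries can be hard to read when printed. The sketch below repeats the recursive factory so that it runs on its own and adds a hypothetical helper, `to_plain_dict`, that converts the tree back into normal dictionaries for display. The helper is an assumption made for illustration and is not part of the conversion function above.

```python
from collections import defaultdict


def default_prefixtree():
    return defaultdict(default_prefixtree)


def to_plain_dict(tree):
    """Recursively convert nested defaultdicts to normal dicts (hypothetical helper)."""
    if isinstance(tree, defaultdict):
        return {key: to_plain_dict(value) for key, value in tree.items()}
    return tree


tree = default_prefixtree()
# every missing level is created on the fly
tree["the"]["old"]["man"]["freq"] = 0.1
tree["the"]["old"]["woman"]["freq"] = 0.2

plain = to_plain_dict(tree)
print(plain)
# {'the': {'old': {'man': {'freq': 0.1}, 'woman': {'freq': 0.2}}}}
```

Converting to normal dictionaries also means that later lookups of missing keys raise a `KeyError` again instead of silently adding new branches.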

## Bullet-point summary

- If you need a dictionary that keeps track of the order in which key-value pairs are added, use `OrderedDict` (from Python 3.7 on, normal dictionaries preserve insertion order as well).
- The immutable counterpart of `set` is `frozenset`.
  Frozen sets can be used as keys.
- Default dictionaries automatically assign a default value to a key if the key does not exist.
  You have to specify the **factory** of the default value.
  It can be a custom function, `list` (default: `[]`), or `int` (default: `0`).

```python
from collections import OrderedDict, defaultdict
# frozensets are built-in and do not need to be loaded
```