# Section 02 - Sequences

##  Sequence Types

Sequence types have the general concept of a first element, a second element, and so on - basically an ordering of the sequence items using the natural numbers. In Python (and many other languages) the starting index is `0`, not `1`.

So the first item has index `0`, the second item has index `1`, and so on.

Python has built-in mutable and immutable sequence types.

Strings and tuples are immutable - we can access but not modify the **content** of the **sequence**:

In [1]:
t = (1, 2, 3)

In [2]:
t[0]

1

In [3]:
t[0] = 100

TypeError: 'tuple' object does not support item assignment

But of course, if the sequence contains mutable objects, then although we cannot modify the sequence of elements (cannot replace, delete or insert elements), we certainly **can** change the contents of the mutable objects:

In [4]:
t = ( [1, 2], 3, 4)

`t` is immutable, but its first element is a mutable object:

In [5]:
t[0][0] = 100

In [6]:
t

([100, 2], 3, 4)

#### Iterables

An **iterable** is just something that can be iterated over, for example using a `for` loop:

In [7]:
t = (10, 'a', 1+3j)

In [8]:
s = {10, 'a', 1+3j}

In [9]:
for c in t:
    print(c)

10
a
(1+3j)


In [10]:
for c in s:
    print(c)

a
10
(1+3j)


Note how we could iterate over both the tuple and the set. Iterating the tuple preserved the **order** of the elements in the tuple, but not for the set. Sets do not have an ordering of elements - they are iterable, but not sequences.
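
One way to see the sequence/iterable distinction in code is with the abstract base classes in `collections.abc`. This is only a sketch, and the variable names are made up for illustration:

```python
from collections.abc import Iterable, Sequence

t = (10, 'a', 1+3j)
s = {10, 'a', 1+3j}

# both objects can be iterated over
both_iterable = isinstance(t, Iterable) and isinstance(s, Iterable)

# but only the tuple is a sequence (it has positional ordering)
tuple_is_sequence = isinstance(t, Sequence)
set_is_sequence = isinstance(s, Sequence)

# indexing a set is not supported at all
try:
    s[0]
except TypeError as ex:
    index_error = str(ex)

print(both_iterable, tuple_is_sequence, set_is_sequence)  # True True False
```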

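Most sequence types support the `in` and `not in` operations. Ranges do too, and for integers the test is computed directly from the range's start, stop and step rather than by scanning the elements, so it is efficient even for very large ranges.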

In [11]:
'a' in ['a', 'b', 100]

True

In [12]:
100 in range(200)

True

#### Min, Max and Length

Sequences also generally support the `len` function to obtain the number of items in the collection. Some iterables may also support it.

In [13]:
len('python'), len([1, 2, 3]), len({10, 20, 30}), len({'a': 1, 'b': 2})

(6, 3, 3, 2)

Sequences (and even some iterables) may support `max` and `min` as long as the data types in the collection can be **ordered** in some sense (`<` or `>`).

In [14]:
a = [100, 300, 200]
min(a), max(a)

(100, 300)

In [15]:
s = 'python'
min(s), max(s)

('h', 'y')

In [16]:
s = {'p', 'y', 't', 'h', 'o', 'n'}
min(s), max(s)

('h', 'y')

But if the elements do not have an ordering defined:

In [17]:
a = [1+1j, 2+2j, 3+3j]
min(a)

TypeError: '<' not supported between instances of 'complex' and 'complex'
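
If the elements have no natural ordering, `min` and `max` accept a `key` function that supplies one. As an illustration only (comparing by magnitude is my choice here, not something complex numbers imply):

```python
a = [1+1j, 2+2j, 3+3j]

# compare the elements by magnitude instead of comparing them directly
smallest = min(a, key=abs)
largest = max(a, key=abs)

print(smallest, largest)  # (1+1j) (3+3j)
```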

`min` and `max` will work for heterogeneous types as long as the elements are pairwise comparable (`<` or `>` is defined). 

For example:

In [18]:
from decimal import Decimal

In [19]:
t = 10, 20.5, Decimal('30.5')

In [20]:
min(t), max(t)

(10, Decimal('30.5'))

In [21]:
t = ['a', 10, 1000]
min(t)

TypeError: '<' not supported between instances of 'int' and 'str'

Even `range` objects support `min` and `max`:

In [22]:
r = range(10, 200)
min(r), max(r)

(10, 199)

#### Concatenation

We can **concatenate** sequences using the `+` operator:

In [23]:
[1, 2, 3] + [4, 5, 6]

[1, 2, 3, 4, 5, 6]

In [24]:
(1, 2, 3) + (4, 5, 6)

(1, 2, 3, 4, 5, 6)

Note that the type of the concatenated result is the same as the type of the sequences being concatenated, so concatenating sequences of varying types will not work:

In [25]:
(1, 2, 3) + [4, 5, 6]

TypeError: can only concatenate tuple (not "list") to tuple

In [26]:
'abc' + ['d', 'e', 'f']

TypeError: must be str, not list

Note: if you really want to concatenate varying types you'll have to transform them to a common type first:

In [27]:
(1, 2, 3) + tuple([4, 5, 6])

(1, 2, 3, 4, 5, 6)

In [28]:
tuple('abc') + ('d', 'e', 'f')

('a', 'b', 'c', 'd', 'e', 'f')

In [29]:
''.join(tuple('abc') + ('d', 'e', 'f'))

'abcdef'

#### Repetition

Most sequence types also support **repetition**, which is essentially concatenating the same sequence an integer number of times:

In [30]:
'abc' * 5

'abcabcabcabcabc'

In [31]:
[1, 2, 3] * 5

[1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3]

We'll come back to some caveats of concatenation and repetition in a bit.

#### Finding things in Sequences

We can find the index of the occurrence of an element in a sequence:

In [32]:
s = "gnu's not unix"

In [33]:
s.index('n')

1

In [34]:
s.index('n', 1), s.index('n', 2), s.index('n', 8)

(1, 6, 11)

An exception is raised if the element is not found, so you'll want to catch it if you don't want your app to crash:

In [35]:
s.index('n', 13)

ValueError: substring not found

In [36]:
try:
    idx = s.index('n', 13)
except ValueError:
    print('not found')

not found
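
For strings specifically, the `find` method is an alternative that returns `-1` instead of raising an exception. A quick sketch (note that `find` exists on strings, but not on lists or tuples):

```python
s = "gnu's not unix"

found = s.find('n', 2)     # same result as s.index('n', 2)
missing = s.find('n', 13)  # no exception, just -1

print(found, missing)  # 6 -1
```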


Note that these methods of finding objects in sequences do not assume that the objects in the sequence are ordered in any way. These are basically searches that iterate over the sequence until they find (or not) the requested element.

If you have a sorted sequence, then other search techniques are available - such as binary searches. I'll cover some of these topics in the extras section of this course.
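
As a preview, the standard library's `bisect` module provides the binary search building blocks. Here's a minimal sketch, assuming the sequence is already sorted in ascending order; `index_sorted` is a hypothetical helper name, not a built-in:

```python
from bisect import bisect_left

def index_sorted(seq, value):
    """Return the index of the first occurrence of value in a sorted sequence."""
    i = bisect_left(seq, value)  # leftmost position where value could be inserted
    if i < len(seq) and seq[i] == value:
        return i
    raise ValueError(f'{value!r} is not in the sequence')

data = [1, 3, 5, 7, 9, 11]
print(index_sorted(data, 7))  # 3
```

Just like `index`, this helper raises a `ValueError` when the value is missing, so the same `try`/`except` pattern applies.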

#### Slicing

We'll come back to slicing in a later lecture, but sequence types generally support slicing, even ranges (as of Python 3.2). Just like concatenation, slices will return the same type as the sequence being sliced:

In [37]:
s = 'python'
l = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

In [38]:
s[0:3], s[4:6]

('pyt', 'on')

In [39]:
l[0:3], l[4:6]

([1, 2, 3], [5, 6])

It's ok to extend slices past the bounds of the sequence:

In [40]:
s[4:1000]

'on'

If your first argument in the slice is `0`, you can even omit it. Omitting the second argument means it will include all the remaining elements:

In [41]:
s[0:3], s[:3]

('pyt', 'pyt')

In [42]:
s[3:1000], s[3:], s[:]

('hon', 'hon', 'python')

We can even have extended slicing, which provides a start, stop and a step:

In [43]:
s, s[0:5], s[0:5:2]

('python', 'pytho', 'pto')

In [44]:
s, s[::2]

('python', 'pto')

Technically we can also use negative values in slices, including extended slices (more on that later):

In [45]:
s, s[-3:-1], s[::-1]

('python', 'ho', 'nohtyp')

In [46]:
r = range(11)  # numbers from 0 to 10 (inclusive)

In [47]:
print(r)
print(list(r))

range(0, 11)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]


In [48]:
print(r[:5])

range(0, 5)


In [49]:
print(list(r[:5]))

[0, 1, 2, 3, 4]


As you can see, slicing a range returns a range object as well, as expected.

#### Hashing

Immutable sequences generally support a `hash` method that we'll discuss in detail in the section on mapping types:

In [50]:
l = (1, 2, 3)
hash(l)

2528502973977326415

In [51]:
s = '123'
hash(s)

-1892188276802162953

In [52]:
r = range(10)
hash(r)

-6299899980521991026

But mutable sequences (and mutable types in general) do not:

In [53]:
l = [1, 2, 3]

In [54]:
hash(l)

TypeError: unhashable type: 'list'

Note also that a hashable sequence is no longer hashable if one (or more) of its elements is not hashable:

In [55]:
t = (1, 2, [10, 20])
hash(t)

TypeError: unhashable type: 'list'

But this would work:

In [56]:
t = ('python', (1, 2, 3))
hash(t)

-8790163410081325536

In general, immutable types are likely hashable, while mutable types are not. So numbers, strings, tuples, etc. are hashable, but lists and sets are not:

In [57]:
from decimal import Decimal
d = Decimal(10.5)
hash(d)

1152921504606846986

Sets are not hashable:

In [58]:
s = {1, 2, 3}
hash(s)

TypeError: unhashable type: 'set'

But frozensets, an immutable variant of the set, are:

In [59]:
s = frozenset({1, 2, 3})

In [60]:
hash(s)

-7699079583225461316
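
One practical consequence, sketched here ahead of the detailed discussion: only hashable objects can be used as dictionary keys or set members. The names below are just for illustration:

```python
# tuples of hashable elements can be dictionary keys
locations = {(0, 0): 'origin', (1, 0): 'unit x'}
print(locations[(0, 0)])  # origin

# frozensets can be members of a set
groups = {frozenset({1, 2}), frozenset({2, 1})}
print(len(groups))  # 1 - both frozensets are equal, so only one is kept

# lists are not hashable, so this fails
try:
    bad = {[0, 0]: 'origin'}
except TypeError as ex:
    error_message = str(ex)
    print('TypeError:', error_message)
```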

#### Caveats with Concatenation and Repetition

Consider this:

In [61]:
x = [2000]

In [62]:
id(x[0])

2177520743920

In [63]:
l = x + x

In [64]:
l

[2000, 2000]

In [65]:
id(l[0]), id(l[1])

(2177520743920, 2177520743920)

As expected, the objects in `l[0]` and `l[1]` are the same.

We could also use:

In [66]:
l[0] is l[1]

True

This is not a big deal if the objects being concatenated are immutable. But if they are mutable:

In [67]:
x = [ [0, 0] ]
l = x + x

In [68]:
l

[[0, 0], [0, 0]]

In [69]:
l[0] is l[1]

True

And then we have the following:

In [70]:
l[0][0] = 100

In [71]:
l[0]

[100, 0]

In [72]:
l

[[100, 0], [100, 0]]

Notice how changing the 1st item of the 1st element also changed the 1st item of the second element.

While this seems fairly obvious when concatenating using the `+` operator as we have just done, the same actually happens with repetition and may not seem so obvious:

In [73]:
x = [ [0, 0] ]

In [74]:
m = x * 3

In [75]:
m

[[0, 0], [0, 0], [0, 0]]

In [76]:
m[0][0] = 100

In [77]:
m

[[100, 0], [100, 0], [100, 0]]

And in fact, even `x` changed:

In [78]:
x

[[100, 0]]

If you really want these repeated objects to be different objects, you'll have to copy them somehow. A simple list comprehension would work well here:

In [79]:
x = [ [0, 0] ]
m = [e.copy() for e in x*3]

In [80]:
m

[[0, 0], [0, 0], [0, 0]]

In [81]:
m[0][0] = 100

In [82]:
m

[[100, 0], [0, 0], [0, 0]]

In [83]:
x

[[0, 0]]
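
If you're building the repeated structure from scratch, another option (just a sketch, there are other ways) is to create a fresh inner list on every iteration instead of repeating and then copying:

```python
m = [[0, 0] for _ in range(3)]  # a new inner list is created on each iteration

m[0][0] = 100
print(m)  # [[100, 0], [0, 0], [0, 0]]
print(m[0] is m[1])  # False
```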

##  Mutable Sequences

When dealing with mutable sequences, we have a few more things we can do - essentially adding, removing and replacing elements in the sequence.

This **mutates** the sequence. The sequence's memory address has not changed, but the internal **state** of the sequence has.

#### Replacing Elements

We can replace a single element as follows:

In [1]:
l = [1, 2, 3, 4, 5]
print(id(l))
l[0] = 'a'
print(id(l), l)

1979932141064
1979932141064 ['a', 2, 3, 4, 5]


We can remove all elements from the sequence:

In [2]:
l = [1, 2, 3, 4, 5]
l.clear()
print(l)

[]


Note that this is **NOT** the same as doing this:

In [3]:
l = [1, 2, 3, 4, 5]
l = []
print(l)

[]


The net effect may look the same, `l` is an empty list, but observe the memory addresses:

In [4]:
l = [1, 2, 3, 4, 5]
print(id(l))
l.clear()
print(l, id(l))

1979932698824
[] 1979932698824


vs

In [5]:
l = [1, 2, 3, 4, 5]
print(id(l))
l = []
print(l, id(l))

1979932699144
[] 1979932698824


In the second case you can see that the object referenced by `l` has changed, but not in the first case.

Why might this be important?

Suppose you have the following setup:

In [6]:
suits = ['Spades', 'Hearts', 'Diamonds', 'Clubs']
alias = suits
suits = []
print(suits, alias)

[] ['Spades', 'Hearts', 'Diamonds', 'Clubs']


But using clear:

In [7]:
suits = ['Spades', 'Hearts', 'Diamonds', 'Clubs']
alias = suits
suits.clear()
print(suits, alias)

[] []


Big difference!!
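
This difference tends to show up when a list is passed to a function. In this sketch, `reset_by_rebinding` and `reset_in_place` are hypothetical helper names used only to illustrate the point:

```python
def reset_by_rebinding(items):
    # rebinds the local name only; the caller's list is untouched
    items = []

def reset_in_place(items):
    # mutates the list object the caller also refers to
    items.clear()

suits = ['Spades', 'Hearts', 'Diamonds', 'Clubs']

reset_by_rebinding(suits)
after_rebinding = list(suits)

reset_in_place(suits)
after_clear = list(suits)

print(after_rebinding)  # ['Spades', 'Hearts', 'Diamonds', 'Clubs']
print(after_clear)  # []
```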

We can also replace elements using slicing and extended slicing. Here's an example, but we'll come back to this in a lot of detail:

In [8]:
l = [1, 2, 3, 4, 5]
print(id(l))
l[0:2] = ['a', 'b', 'c', 'd', 'e']
print(id(l), l)

1979932698504
1979932698504 ['a', 'b', 'c', 'd', 'e', 3, 4, 5]


#### Appending and Extending

We can also append elements to the sequence (note that this is **not** the same as concatenation):

In [9]:
l = [1, 2, 3]
print(id(l))
l.append(4)
print(l, id(l))

1979932697992
[1, 2, 3, 4] 1979932697992


If we had "appended" the value `4` using concatenation:

In [10]:
l = [1, 2, 3]
print(id(l))
l = l + [4]
print(id(l), l)

1979932193288
1979932698312 [1, 2, 3, 4]
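
A related point worth knowing: for lists, the augmented assignment `+=` mutates the list in place (much like `extend`), whereas `l = l + ...` builds a new list. A short sketch, where `alias` is just a second name for the same list:

```python
l = [1, 2, 3]
alias = l

l += [4]               # in-place: alias sees the change
same_after_iadd = l is alias

l = l + [5]            # concatenation: l now refers to a new list
same_after_concat = l is alias

print(same_after_iadd, same_after_concat)  # True False
print(alias, l)  # [1, 2, 3, 4] [1, 2, 3, 4, 5]
```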


If we want to add more than one element at a time, we can extend a sequence with the contents of any iterable (not just sequences):

In [11]:
l = [1, 2, 3, 4, 5]
print(id(l))
l.extend({'a', 'b', 'c'})
print(id(l), l)

1979932844488
1979932844488 [1, 2, 3, 4, 5, 'c', 'b', 'a']


Of course, since we extended using a set, there was no guarantee of positional ordering.

If we extend with another sequence, then positional ordering is retained:

In [12]:
l = [1, 2, 3]
l.extend(('a', 'b', 'c'))
print(l)

[1, 2, 3, 'a', 'b', 'c']


#### Removing Elements

We can remove (and retrieve at the same time) an element from a mutable sequence:

In [13]:
l = [1, 2, 3, 4]
print(id(l))
popped = l.pop(1)
print(id(l), popped, l)

1979932193288
1979932193288 2 [1, 3, 4]


If we do not specify an index for `pop`, then the **last** element is popped:

In [14]:
l = [1, 2, 3, 4]
popped = l.pop()
print(popped)
print(id(l), popped, l)

4
1979932696968 4 [1, 2, 3]
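
Besides `pop`, lists can also drop elements with the `del` statement (by index or slice) and the `remove` method (by value, first match only). A brief sketch of both, along with what happens when `pop` is called on an empty list:

```python
l = [10, 20, 30, 20, 40]

del l[0]        # delete by index -> [20, 30, 20, 40]
l.remove(20)    # delete the first occurrence of a value -> [30, 20, 40]
print(l)  # [30, 20, 40]

# remove raises ValueError if the value is not present
try:
    l.remove(999)
except ValueError:
    print('value not found')

# pop raises IndexError on an empty list
try:
    [].pop()
except IndexError:
    print('nothing to pop')
```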


#### Inserting Elements

We can insert an element at a specific index. What this means is that the element we are inserting will be **at** that index position, and the element that was at that position, along with all the remaining elements to the right, is pushed out:

In [15]:
l = [1, 2, 3, 4]
print(id(l))
l.insert(1, 'a')
print(id(l), l)

1979932143176
1979932143176 [1, 'a', 2, 3, 4]


#### Reversing a Sequence

We can also do in-place reversal:

In [16]:
l = [1, 2, 3, 4]
print(id(l))
l.reverse()
print(id(l), l)

1979930587080
1979930587080 [4, 3, 2, 1]


We can also reverse a sequence using extended slicing (we'll come back to this later):

In [17]:
l = [1, 2, 3, 4]
l[::-1]

[4, 3, 2, 1]

But this is **NOT** mutating the sequence - the slice is returning a **new** sequence - that happens to be reversed.

In [18]:
l = [1, 2, 3, 4]
print(id(l))
l = l[::-1]
print(id(l), l)

1979932143176
1979932696968 [4, 3, 2, 1]
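
A third option is the built-in `reversed()` function, which returns an iterator and leaves the original sequence untouched. Since it doesn't mutate anything, it also works with immutable sequences. A small sketch:

```python
l = [1, 2, 3, 4]
rev_iter = reversed(l)       # an iterator, not a list
rev_list = list(rev_iter)

t = (1, 2, 3, 4)
rev_tuple = tuple(reversed(t))

print(l, rev_list, rev_tuple)  # [1, 2, 3, 4] [4, 3, 2, 1] (4, 3, 2, 1)
```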


#### Copying Sequences

We can create a copy of a sequence:

In [19]:
l = [1, 2, 3, 4]
print(id(l))
l2 = l.copy()
print(id(l2), l2)

1979932700040
1979932696968 [1, 2, 3, 4]


Note that the `id` of `l` and `l2` is not the same.

In this case, using slicing does work the same as using the `copy` method:

In [20]:
l = [1, 2, 3, 4]
print(id(l))
l2 = l[:]
print(id(l2), l2)

1979932847304
1979932700040 [1, 2, 3, 4]


As you can see in both cases we end up with new objects.

So, use `copy()` or `[:]` - up to you, they end up doing the same thing.

We'll come back to copying in some detail in an upcoming video as this is an important topic with some subtleties.

##  Lists vs Tuples

Remember that both lists and tuples are considered **sequence** types.

Remember also that we should consider tuples as data structures (position has meaning) as we saw in an earlier section on named tuples.

However, in this context we are going to view tuples as "immutable lists".

Generally, tuples are more efficient than lists, so, unless you need mutability of the container, prefer using a tuple over a list.

#### Creating Tuples

We saw some of this already in the first section of this course when we looked at some of the optimizations Python implements, but let's revisit it in this context.

Here is Wikipedia's definition of constant folding:

`
Constant folding is the process of recognizing and evaluating constant expressions at compile time rather than computing them at runtime.
`

To see how this works, we are going to use the `dis` module, which allows us to see the disassembled Python bytecode - not for the faint of heart, but it can be really useful!

In [1]:
from dis import dis

We want to understand what Python does when it compiles statements such as:

In [2]:
(1, 2, 3)
[1, 2, 3]

[1, 2, 3]

In [3]:
dis(compile('(1,2,3, "a")', 'string', 'eval'))

  1           0 LOAD_CONST               4 ((1, 2, 3, 'a'))
              2 RETURN_VALUE


In [4]:
dis(compile('[1,2,3, "a"]', 'string', 'eval'))

  1           0 LOAD_CONST               0 (1)
              2 LOAD_CONST               1 (2)
              4 LOAD_CONST               2 (3)
              6 LOAD_CONST               3 ('a')
              8 BUILD_LIST               4
             10 RETURN_VALUE


Notice how for a tuple containing constants (such as ints and strings in this case), the values are loaded in one step, a single constant value essentially. 

Lists, on the other hand are built-up one element at a time.

So, that's one reason why tuples can "load" faster than a list.
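
Another way to check this, without reading bytecode, is to look at the constants stored on the compiled code object. This is a sketch that assumes CPython; the exact contents of `co_consts` vary between versions, so it only checks that the whole tuple is present as a single constant:

```python
code = compile('(1, 2, 3, "a")', 'string', 'eval')

# the folded tuple should appear as one of the code object's constants
tuple_is_constant = any(
    isinstance(c, tuple) and c == (1, 2, 3, 'a') for c in code.co_consts
)

print(code.co_consts)
print(tuple_is_constant)
```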

In fact, we can easily time this:

In [5]:
from timeit import timeit

In [6]:
timeit("(1,2,3,4,5,6,7,8,9)", number=10_000_000)

0.10997921960979677

In [7]:
timeit("[1,2,3,4,5,6,7,8,9]", number=10_000_000)

0.8128158471672868

As you can see creating a tuple was faster.

Now this changes if the tuple elements are not constants, such as lists or functions for example:

In [8]:
def fn1():
    pass

In [9]:
dis(compile('(fn1, 10, 20)', 'string', 'eval'))

  1           0 LOAD_NAME                0 (fn1)
              2 LOAD_CONST               0 (10)
              4 LOAD_CONST               1 (20)
              6 BUILD_TUPLE              3
              8 RETURN_VALUE


In [10]:
dis(compile('[fn1, 10, 20]', 'string', 'eval'))

  1           0 LOAD_NAME                0 (fn1)
              2 LOAD_CONST               0 (10)
              4 LOAD_CONST               1 (20)
              6 BUILD_LIST               3
              8 RETURN_VALUE


or

In [11]:
dis(compile('([1,2], 10, 20)', 'string', 'eval'))

  1           0 LOAD_CONST               0 (1)
              2 LOAD_CONST               1 (2)
              4 BUILD_LIST               2
              6 LOAD_CONST               2 (10)
              8 LOAD_CONST               3 (20)
             10 BUILD_TUPLE              3
             12 RETURN_VALUE


In [12]:
dis(compile('[[1,2], 10, 20]', 'string', 'eval'))

  1           0 LOAD_CONST               0 (1)
              2 LOAD_CONST               1 (2)
              4 BUILD_LIST               2
              6 LOAD_CONST               2 (10)
              8 LOAD_CONST               3 (20)
             10 BUILD_LIST               3
             12 RETURN_VALUE


And of course this is reflected in the timings too:

In [13]:
timeit("([1, 2], 10, 20)", number=1_000_000)

0.0702705514158215

In [14]:
timeit("[[1, 2], 10, 20]", number=1_000_000)

0.06704527015184514

#### Copying Lists and Tuples

Let's look at creating a copy of both a list and a tuple:

In [15]:
l1 = [1, 2, 3, 4, 5, 6, 7, 8, 9]
t1 = (1, 2, 3, 4, 5, 6, 7, 8, 9)

In [16]:
id(l1), id(t1)

(1636690847432, 1636690312912)

In [17]:
l2 = list(l1)
t2 = tuple(t1)

Let's time this:

In [18]:
timeit('tuple((1,2,3,4,5,6,7,8,9))', number=1_000_000)

0.1410398213607027

In [19]:
timeit('list([1,2,3,4,5,6,7,8,9])', number=1_000_000)

0.2807509005242097

That's another win for tuples. But why?

Let's look at the id's of the copies:

In [20]:
id(l1), id(l2), id(t1), id(t2)

(1636690847432, 1636690849096, 1636690312912, 1636690312912)

in other words:

In [21]:
l1 is l2, t1 is t2

(False, True)

Notice how `l1` and `l2` are **not** the same objects, whereas `t1` and `t2` are!

So for lists, the elements had to be copied (shallow copy, more on this later), but for tuples it did not.

Note that this is the case even if the tuple contains non constant elements:

In [22]:
t1 = ([1,2], fn1, 3)
t2 = tuple(t1)
t1 is t2

True

#### Storage Efficiency

When mutable container objects such as lists, sets, dictionaries, etc. are created, and during their lifetime, the allocated capacity of these containers (the number of items they can contain) is greater than the number of elements in the container. This is done to make adding elements to the collection more efficient, and is called over-allocating.

Immutable containers on the other hand, since their item count is fixed once they have been created, do not need this overallocation - so their storage efficiency is greater.

Let's look at the size (memory) of lists and tuples as they get larger:

In [23]:
import sys

In [24]:
prev = 0
for i in range(10):
    c = tuple(range(i+1))
    size_c = sys.getsizeof(c)
    delta, prev = size_c - prev, size_c
    print(f'{i+1} items: {size_c}, delta={delta}')

1 items: 56, delta=56
2 items: 64, delta=8
3 items: 72, delta=8
4 items: 80, delta=8
5 items: 88, delta=8
6 items: 96, delta=8
7 items: 104, delta=8
8 items: 112, delta=8
9 items: 120, delta=8
10 items: 128, delta=8


In [25]:
prev = 0
for i in range(10):
    c = list(range(i+1))
    size_c = sys.getsizeof(c)
    delta, prev = size_c - prev, size_c
    print(f'{i+1} items: {size_c}, delta={delta}')

1 items: 96, delta=96
2 items: 104, delta=8
3 items: 112, delta=8
4 items: 120, delta=8
5 items: 128, delta=8
6 items: 136, delta=8
7 items: 144, delta=8
8 items: 160, delta=16
9 items: 192, delta=32
10 items: 200, delta=8


As you can see the size delta for tuples as they get larger, remains a constant 8 bytes (the pointer to the element), but not so for lists which will over-allocate space (this is done to achieve better performance when appending elements to a list).

Let's see what happens to the same list when we keep appending elements to it:

In [26]:
c = []
prev = sys.getsizeof(c)
print(f'0 items: {sys.getsizeof(c)}')
for i in range(255):
    c.append(i)
    size_c = sys.getsizeof(c)
    delta, prev = size_c - prev, size_c
    print(f'{i+1} items: {size_c}, delta={delta}')

0 items: 64
1 items: 96, delta=32
2 items: 96, delta=0
3 items: 96, delta=0
4 items: 96, delta=0
5 items: 128, delta=32
6 items: 128, delta=0
7 items: 128, delta=0
8 items: 128, delta=0
9 items: 192, delta=64
10 items: 192, delta=0
11 items: 192, delta=0
12 items: 192, delta=0
13 items: 192, delta=0
14 items: 192, delta=0
15 items: 192, delta=0
16 items: 192, delta=0
17 items: 264, delta=72
18 items: 264, delta=0
19 items: 264, delta=0
20 items: 264, delta=0
21 items: 264, delta=0
22 items: 264, delta=0
23 items: 264, delta=0
24 items: 264, delta=0
25 items: 264, delta=0
26 items: 344, delta=80
27 items: 344, delta=0
28 items: 344, delta=0
29 items: 344, delta=0
30 items: 344, delta=0
31 items: 344, delta=0
32 items: 344, delta=0
33 items: 344, delta=0
34 items: 344, delta=0
35 items: 344, delta=0
36 items: 432, delta=88
37 items: 432, delta=0
38 items: 432, delta=0
39 items: 432, delta=0
40 items: 432, delta=0
41 items: 432, delta=0
42 items: 432, delta=0
43 items: 432, delta=0
44 ite

As you can see, the size of the list doesn't grow every time we append an element - it only does so occasionally. Resizing a list is expensive, so Python uses *overallocation*: it creates a larger container than strictly required, so that it doesn't have to resize every time an item is added. On the other hand, you don't want to overallocate too much, as this has a memory cost.

If you're interested in learning more about why over-allocating is done and how it works (amortization), Wikipedia also has an excellent article on it: https://en.wikipedia.org/wiki/Dynamic_array

The book "Introduction to Algorithms", by "Cormen, Leiserson, Rivest and Stein" has a thorough discussion on it (under dynamic tables).
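
To get a rough feel for the amortization, we can count how often the reported size actually changes while appending. This is a sketch that assumes CPython (where `sys.getsizeof` reflects the allocated capacity of a list); `count_resizes` is a hypothetical helper, and the exact counts depend on the Python version:

```python
import sys

def count_resizes(n):
    """Append n items to an empty list and count how often its reported size changes."""
    c = []
    size = sys.getsizeof(c)
    resizes = 0
    for i in range(n):
        c.append(i)
        new_size = sys.getsizeof(c)
        if new_size != size:
            resizes += 1
            size = new_size
    return resizes

for n in (10, 100, 1_000, 10_000):
    print(f'{n} appends: {count_resizes(n)} resizes')
```

The number of resizes should grow much more slowly than the number of appends, which is the whole point of over-allocating.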

#### Retrieving Elements

Let's time retrieving an element from a tuple and a list:

In [27]:
t = tuple(range(100_000))
l = list(t)

In [28]:
timeit('t[99_999]', globals=globals(), number=10_000_000)

0.5068397038462553

In [29]:
timeit('l[99_999]', globals=globals(), number=10_000_000)

0.4900615230663412

In this run the two timings are practically identical (the list lookup was even marginally faster), so any difference in retrieval speed is small enough to be lost in measurement noise. I'm not sure I would worry about it too much.

There is a reason why we might expect tuples to be very slightly faster, and it has to do with how tuples and lists are implemented in CPython. Tuples have direct access (pointers) to their elements, while lists need to first access another array that contains the pointers to the elements of the list.

##  Copying Sequences

#### Shallow Copies

##### Simple Loop

Really not a very Pythonic approach, but it works...

In [1]:
l1 = [1, 2, 3]

l1_copy = []
for item in l1:
    l1_copy.append(item)

print(l1_copy)

[1, 2, 3]


And we can see that `l1` and `l1_copy` are not the same objects:

In [2]:
l1 is l1_copy

False

##### List Comprehension

We can use a list comprehension to do exactly what we did in the previous example:

In [3]:
l1 = [1, 2, 3]
l1_copy = [item for item in l1]
print(l1_copy)

[1, 2, 3]


And once again, the objects are not the same:

In [4]:
l1 is l1_copy

False

##### Using the copy() method

Since lists are mutable sequence types, they have the `copy()` method.

In [5]:
l1 = [1, 2, 3]
l1_copy = l1.copy()
print(l1_copy)

[1, 2, 3]


And once again, the objects are different:

In [6]:
l1 is l1_copy

False

##### Using the built-in list() Function

The built-in `list()` function will make a list out of any iterable. This always ends up with a copy of the iterable:

In [7]:
l1 = [1, 2, 3]

In [8]:
l1_copy = list(l1)
print(l1_copy)

[1, 2, 3]


In [9]:
l1 is l1_copy

False

Note that `list()` will take in any iterable, so you can technically copy any iterable into a list:

In [10]:
t1 = (1, 2, 3)
t1_copy = list(t1)
print(t1_copy)

[1, 2, 3]


Of course, we get a list, not a tuple - so not exactly a copy.

We've seen this before, but be careful with the `tuple()` built-in function. When we copy tuples, since they are immutable, we just get the original tuple back:

In [11]:
t1 = (1, 2, 3)
t1_copy = tuple(t1)
print(t1_copy)

(1, 2, 3)


But here, the objects are the **same**:

In [12]:
t1 is t1_copy

True

##### Using Slicing

We can also use slicing to copy sequences.

We'll cover slicing in detail in an upcoming lecture, but with slicing we can also access subsets of the sequence - here we use slicing to select the entire sequence:

In [13]:
l1 = [1, 2, 3]
l1_copy = l1[:]
print(l1_copy)
print(l1 is l1_copy)

[1, 2, 3]
False


But again, watch out with tuples!!

In [14]:
t1 = (1, 2, 3)
t1_copy = t1[:]
print(t1_copy)
print(t1 is t1_copy)

(1, 2, 3)
True


As you can see, since the slice was the entire tuple, a copy was not made, instead the reference to the original tuple was returned!

Same deal with strings:

In [15]:
s1 = 'python'
s2 = str(s1)
print(s2)
print(s1 is s2)

python
True


In [16]:
s1 = 'python'
s2 = s1[:]
print(s2)
print(s1 is s2)

python
True


If you're wondering why Python has that behavior, just think about it.

If you create a copy of a tuple, what are you going to do to that copy? Modify it?? You can't!

Modify the contents of a contained mutable element? Sure you can, but whether you had a copy or not, you would still be modifying the **same** element - having the sequence copied is no safer than not.

Not needed, so Python basically optimizes things for us.

##### The `copy` module

In [17]:
import copy

The `copy` module has a generic `copy` function as well:

In [18]:
l1 = [1, 2, 3]
l1_copy = copy.copy(l1)
print(l1_copy)
print(l1 is l1_copy)

[1, 2, 3]
False


And for tuples:

In [19]:
t1 = (1, 2, 3)
t1_copy = copy.copy(t1)
print(t1_copy)
print(t1 is t1_copy)

(1, 2, 3)
True


As you can see the same thing happens with tuples as we saw before.

#### Shallow vs Deep Copies

What we have been doing so far is creating **shallow** copies.

This means that when a sequence is copied, each element of the new sequence is bound to precisely the same memory address as the corresponding element in the original sequence:

In [20]:
v1 = [0, 0]
v2 = [0, 0]

line1 = [v1, v2]

In [21]:
print(line1)
print(id(line1[0]), id(line1[1]))

[[0, 0], [0, 0]]
2018950765896 2018950764104


Now let's make a copy of the line using any of the techniques we just looked at:

In [22]:
line2 = line1.copy()

In [23]:
line1 is line2

False

So not the same objects. Now let's look at the contained elements themselves:

In [24]:
print(id(line1[0]), id(line1[1]))
print(id(line2[0]), id(line2[1]))

2018950765896 2018950764104
2018950765896 2018950764104


As you can see, the element references are the same!

So, if we do this:

In [25]:
line2[0][0] = 100

In [26]:
line2

[[100, 0], [0, 0]]

In [27]:
line1

[[100, 0], [0, 0]]

`line1`'s contents have also changed.

If we want the contained elements **also** to be copied, then we need to explicitly do so as well. This is called creating a **deep** copy.

Let's see how we might do this:

In [28]:
v1 = [0, 0]
v2 = [0, 0]

line1 = [v1, v2]

In [29]:
line2 = [item[:] for item in line1]

In [30]:
print(id(line1[0]), id(line1[1]))
print(id(line2[0]), id(line2[1]))

2018950613000 2018948315720
2018950712904 2018950766408


As you can see, now we have copies of the elements as well:

In [31]:
line1[0][0] = 100
print(line1)
print(line2)

[[100, 0], [0, 0]]
[[0, 0], [0, 0]]


and `line2` is unaffected when we modify `line1`.

So not only did we do a copy of `line1`, but we also made a shallow copy of `v1` and `v2` as well.

But the problem is that we only went two levels deep - what if the variables `v1` and `v2` themselves contained mutable types instead of just integers? We would have to nest deeper and deeper - in general that's what a deep copy needs to do, and usually recursive approaches need to be used.
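
To make the recursive idea concrete, here's a minimal sketch that only handles nested lists. The function name `deep_copy_lists` is made up for illustration, and it deliberately ignores other container types and circular references:

```python
def deep_copy_lists(obj):
    """Recursively copy nested lists; anything that is not a list is returned as-is."""
    if isinstance(obj, list):
        return [deep_copy_lists(item) for item in obj]
    return obj

nested = [[0, [1, 2]], [3, 4]]
clone = deep_copy_lists(nested)

clone[0][1][0] = 100
print(nested)  # [[0, [1, 2]], [3, 4]]
print(clone)   # [[0, [100, 2]], [3, 4]]
```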

Fortunately, Python has that functionality built-in for us so we don't have to do that!

The `copy` module has a `deepcopy()` function we can use to create deep copies. It handles all kinds of weird situations where we might have circular references - doing it ourselves is certainly possible, but does take some work.

In [32]:
v1 = [0, 0]
v2 = [0, 0]
line1 = [v1, v2]

In [33]:
line2 = copy.deepcopy(line1)
print(id(line1[0]), id(line1[1]))
print(id(line2[0]), id(line2[1]))

2018950611912 2018948484808
2018950458184 2018950611976


In [34]:
line2[0][0] = 100

In [35]:
print(line1)
print(line2)

[[0, 0], [0, 0]]
[[100, 0], [0, 0]]


And of course, it works with any level of nested objects:

In [36]:
v1 = [11, 12]
v2 = [21, 22]
line1 = [v1, v2]

v3 = [31, 32]
v4 = [41, 42]
line2 = [v3, v4]

plane1 = [line1, line2]
print(plane1)

[[[11, 12], [21, 22]], [[31, 32], [41, 42]]]


In [37]:
plane2 = copy.deepcopy(plane1)

In [38]:
print(plane2)

[[[11, 12], [21, 22]], [[31, 32], [41, 42]]]


In [39]:
print(plane1[0], id(plane1[0]))
print(plane2[0], id(plane2[0]))

[[11, 12], [21, 22]] 2018950458632
[[11, 12], [21, 22]] 2018950611080


In [40]:
print(plane1[0][0], id(plane1[0][0]))
print(plane2[0][0], id(plane2[0][0]))

[11, 12] 2018948481288
[11, 12] 2018950763080
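
The circular reference case mentioned earlier can be sketched with a list that contains itself. `deepcopy` keeps track of the objects it has already copied, so the copy ends up referring to itself rather than to the original:

```python
import copy

a = [1, 2]
a.append(a)          # a now contains a reference to itself

b = copy.deepcopy(a)

print(b is a)        # False
print(b[2] is b)     # True
print(b[2] is a)     # False
```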


#### Even works with custom classes

In [41]:
class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y
    
    def __repr__(self):
        return f'Point({self.x}, {self.y})'
    
class Line:
    def __init__(self, p1, p2):
        self.p1 = p1
        self.p2 = p2
        
    def __repr__(self):
        return f'Line({self.p1!r}, {self.p2!r})'

In [42]:
p1 = Point(0, 0)
p2 = Point(10, 10)
line1 = Line(p1, p2)
line2 = copy.deepcopy(line1)

print(line1.p1, id(line1.p1))
print(line2.p1, id(line2.p1))

Point(0, 0) 2018950806944
Point(0, 0) 2018950807280


As you can see, the memory addresses of the points are different - that was because of the deep copy.

However, if we had done a shallow copy:

In [43]:
p1 = Point(0, 0)
p2 = Point(10, 10)
line1 = Line(p1, p2)
line2 = copy.copy(line1)

print(line1.p1, id(line1.p1))
print(line2.p1, id(line2.p1))

Point(0, 0) 2018950806832
Point(0, 0) 2018950806832


As you can see, the memory addresses of the points are now the **same**.

##  Slicing

Slices can actually be defined using the `slice()` function which creates a `slice` object:

In [1]:
s = slice(0, 2)

In [2]:
type(s)

slice

In [3]:
s.start

0

In [4]:
s.stop

2

In [5]:
l = [1, 2, 3, 4, 5]
l[s]

[1, 2]

This can be useful in practice to make code more readable.

Suppose you are parsing a fixed-width file. You would need to define the start/end column of each field in the rows of the file.

So you might write something like this:

In [6]:
data = []  # a collection of rows, read from a file maybe
for row in data:
    first_name = row[0:51]
    last_name = row[51:101]
    ssn = row[101:111]
    # etc

Instead, you might write:

In [7]:
range_first_name = slice(0, 51)
range_last_name = slice(51, 101)
range_ssn = slice(101, 111)

These might even be defined in your global scope, or maybe a config file.

Then in your code you would write this instead:

In [8]:
for row in data:
    first_name = row[range_first_name]
    last_name = row[range_last_name]
    ssn = row[range_ssn]

Separating the slice definition from the code that uses the slice makes it now much easier to update your slice definitions in one place, rather than hunt for them all over the place.
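
Here's a small runnable version of the same idea. The column widths and the sample rows are invented for this sketch (much narrower than the ones above) just to keep the output readable:

```python
# hypothetical layout: 10 characters per name field, 11 for the SSN
range_first_name = slice(0, 10)
range_last_name = slice(10, 20)
range_ssn = slice(20, 31)

# made-up rows standing in for lines read from a fixed-width file
data = [
    'John'.ljust(10) + 'Cleese'.ljust(10) + '123-45-6789',
    'Eric'.ljust(10) + 'Idle'.ljust(10) + '987-65-4321',
]

parsed = []
for row in data:
    first_name = row[range_first_name].strip()
    last_name = row[range_last_name].strip()
    ssn = row[range_ssn]
    parsed.append((first_name, last_name, ssn))

print(parsed)
```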

#### Slice Fundamentals

Indexing is zero-based in Python, and slices are inclusive of their start-index, and exclusive of their end-index:

In [9]:
l = 'python'
l[0:1], l[0:6]

('p', 'python')

Additionally, extended slicing allows specifying a step value:

In [10]:
l = 'python'
l[0:6:2], l[0:6:3]

('pto', 'ph')

And extended slices can also be defined using `slice`:

In [11]:
s1 = slice(0, 6, 2)
s2 = slice(0, 6, 3)
l[s1], l[s2]

('pto', 'ph')

Unlike regular indexing (e.g. `l[n]`), it's OK for slice indexes to be "out of bounds":

In [12]:
l = [1, 2, 3, 4, 5, 6]
l[0:100]

[1, 2, 3, 4, 5, 6]

In [13]:
l[-10:100]

[1, 2, 3, 4, 5, 6]

But regular indexing will raise an exception for an out-of-bounds index:

In [14]:
l = [1, 2, 3, 4, 5, 6]
l[100]

IndexError: list index out of range

In slicing, if we do not specify the start/end index, Python will automatically use the start/end of the sequence we are slicing:

In [15]:
l = [1, 2, 3, 4, 5, 6]

In [16]:
l[:4]

[1, 2, 3, 4]

In [17]:
l[4:]

[5, 6]

In fact, we can omit both:

In [18]:
l[:]

[1, 2, 3, 4, 5, 6]

In addition to the start/stop values allowing for negative values, the step value can also be negative. This simply means the sequence will be traversed in the opposite direction:

In [19]:
l = [0, 1, 2, 3, 4, 5]

In [20]:
l[3:0:-1]

[3, 2, 1]

Basically we started at `3` (inclusive) and went in steps of `-1`, ending at (but not including) `0`.

If we wanted to include the `0` index element, we could do it by omitting the end value:

In [21]:
l[3::-1]

[3, 2, 1, 0]

We could also do the following:

In [22]:
l[3:-100:-1]

[3, 2, 1, 0]

But this would not work as expected:

In [23]:
l[3:-1:-1]

[]

Why?

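Remember from the lecture how the start and stop values are translated into an equivalent range when the step is negative (here the length of the list is `6`):

`3 --> 3`

`-1 < 0 --> max(-1, 6 + (-1)) --> max(-1, 5) --> 5`

so the equivalent range would be given by: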

In [24]:
list(range(3, 5, -1))

[]

which of course is an empty range!
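We can double-check this remapping with the `indices` method of `slice` objects (covered in more detail in the next section). This is just a quick sketch comparing the three slices used above:

```python
l = [0, 1, 2, 3, 4, 5]

# A stop of -1 is first remapped to len(l) - 1 == 5, so the range is empty
print(slice(3, -1, -1).indices(len(l)))     # (3, 5, -1)

# Omitting the stop, or using a value far below -len(l), lets the
# traversal run all the way down to (and including) index 0
print(slice(3, None, -1).indices(len(l)))   # (3, -1, -1)
print(slice(3, -100, -1).indices(len(l)))   # (3, -1, -1)
```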

#### Easily Converting a Slice to a Range

We can easily determine the effective range of a slice by using the `indices` method in the `slice` object. The only thing is that in order to do this we must know the length of the sequence we are slicing.

For example, if our list has a length of 10:

In [25]:
slice(1, 5).indices(10)

(1, 5, 1)

In [26]:
list(range(1, 5, 1))

[1, 2, 3, 4]

In [27]:
l = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
l[1:5]

[1, 2, 3, 4]

The `slice` object can also handle extended slicing:

In [28]:
slice(0, 100, 2).indices(10)

(0, 10, 2)

In [29]:
list(range(0, 10, 2))

[0, 2, 4, 6, 8]

In [30]:
l[0:100:2]

[0, 2, 4, 6, 8]

We can easily retrieve a list of indices from a slice by passing the unpacked tuple returned by the `indices` method to the range function's arguments and converting to a list:

In [31]:
list(range(*slice(None, None, -1).indices(10)))

[9, 8, 7, 6, 5, 4, 3, 2, 1, 0]

As we can see from this example, using a slice such as `[::-1]` returns a sequence that is in reverse order from the original one.

In [32]:
l

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [33]:
l[::]

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [34]:
l[::-1]

[9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
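To tie this together, here is a small sketch of a helper (the name `slice_indices` is made up for this example) that converts a slice into its list of indices, and checks that indexing with those indices one at a time gives the same result as slicing directly:

```python
def slice_indices(s, length):
    """Return the indices that slice `s` selects from a sequence of the given length."""
    return list(range(*s.indices(length)))


seq = 'python'
slices = (slice(None, None, -1), slice(1, 100, 2), slice(-2, None), slice(3, -1, -1))

for s in slices:
    idx = slice_indices(s, len(seq))
    picked = [seq[i] for i in idx]
    # Picking the elements one index at a time matches slicing directly
    assert picked == list(seq[s])
    print(s, '->', idx)
```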

##  Custom Sequences (Part 1)

We'll focus first on how to create a custom sequence type that supports indexing, slicing (read only) and iteration. We'll look into mutable custom sequences in an upcoming video.

First we should understand how the `__getitem__` method works for iteration and retrieving individual elements from a sequence:

In [2]:
my_list = [0, 1, 2, 3, 4, 5]

In [3]:
my_list.__getitem__(0)

0

In [4]:
my_list.__getitem__(5)

5

But if our index is out of bounds:

In [5]:
my_list.__getitem__(6)

IndexError: list index out of range

we get an IndexError.

Technically, the `list` object's `__getitem__` method also supports negative indexing and slicing:

In [6]:
my_list.__getitem__(-1)

5

In [7]:
my_list.__getitem__(slice(0,6,2))

[0, 2, 4]

In [8]:
my_list.__getitem__(slice(None, None, -1))

[5, 4, 3, 2, 1, 0]

#### Mimicking Python's `for` loop using the `__getitem__` method

In [9]:
my_list = [0, 1, 2, 3, 4, 5]

In [10]:
for item in my_list:
    print(item ** 2)

0
1
4
9
16
25


Now let's do the same thing ourselves without a for loop:

In [11]:
index = 0
while True:
    try:
        item = my_list.__getitem__(index)
    except IndexError:
        # reached the end of the sequence
        break
    # do something with the item...
    print(item ** 2)
    index += 1

0
1
4
9
16
25
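This is also why a class that implements only `__getitem__` can be iterated: when no `__iter__` method is defined, Python falls back to calling `__getitem__` with `0`, `1`, `2`, ... until an `IndexError` is raised. A minimal sketch, using a made-up `Squares` class:

```python
class Squares:
    # Hypothetical example class: only __getitem__ is defined
    def __init__(self, n):
        self._n = n

    def __getitem__(self, index):
        if not 0 <= index < self._n:
            # Signals the end of the sequence to the iteration machinery
            raise IndexError(index)
        return index ** 2


print(list(Squares(6)))   # [0, 1, 4, 9, 16, 25]

for item in Squares(3):
    print(item)
```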


#### Implementing a custom Sequence 

Custom objects can support slicing - we'll see this later in this course, but for now we'll take a quick peek ahead.

To make a custom class support indexing (and slicing) we only need to implement the `__getitem__` method, which receives the index (or slice) we are interested in.

In [12]:
class MySequence:
    def __getitem__(self, index):
        print(type(index), index)

In [13]:
my_seq = MySequence()

In [14]:
my_seq[0]

<class 'int'> 0


In [15]:
my_seq[100]

<class 'int'> 100


In [16]:
my_seq[0:2]

<class 'slice'> slice(0, 2, None)


In [17]:
my_seq[0:10:2]

<class 'slice'> slice(0, 10, 2)


As you can see, the `__getitem__` method receives an index number of type `int` when we use `[n]` and a `slice` object when we use `[i:j]` or `[i:j:k]`.

As we saw in a previous lecture, given the bounds for a slice, and the length of the sequence we are slicing, we can always define a `range` that will generate the desired indices.

We also saw that the `slice` object has a method, `indices`, that precisely tells us the start/stop/step values we would need for an equivalent `range`, given the length of the sequence we are slicing.

Let's recall a simple example first:

In [18]:
l = 'python'
len(l)

6

In [19]:
s = slice(0, 6, 2)
l[s]

'pto'

In [20]:
s.start, s.stop, s.step

(0, 6, 2)

In [21]:
s.indices(6)

(0, 6, 2)

In [22]:
list(range(0, 6, 2))

[0, 2, 4]

This matches exactly the indices that were selected from the sequence `'python'`.

### Example

So, why am I re-emphasizing this equivalence between the indices in a `slice` and an equivalent `range` object?

Let's say we want to implement our own sequence type and we want to support slicing.

For this example we'll create a custom Fibonacci sequence type.

First recall that the `__getitem__` will receive either an integer (for simple indexing), or a slice object:

In [23]:
class Fib:
    def __getitem__(self, s):
        print(type(s), s)

In [24]:
f = Fib()
f[2]
f[2:10:2]

<class 'int'> 2
<class 'slice'> slice(2, 10, 2)


We'll use that to implement both indexing and slicing for our custom Fibonacci sequence type.

We'll make our sequence type bounded (i.e. we'll have to specify the size of the sequence). But we are not going to pre-generate the entire sequence of Fibonacci numbers, we'll only generate the ones that are being requested as needed.

In [25]:
class Fib:
    def __init__(self, n):
        self._n = n
    
    def __getitem__(self, s):
        if isinstance(s, int):
            # single item requested
            print(f'requesting [{s}]')
        else:
            # slice being requested
            print(f'requesting [{s.start}:{s.stop}:{s.step}]')

In [26]:
f = Fib(10)

In [27]:
f[3]

requesting [3]


In [28]:
f[:5]

requesting [None:5:None]


Let's now add in what the equivalent range would be:

In [29]:
class Fib:
    def __init__(self, n):
        self._n = n
    
    def __getitem__(self, s):
        if isinstance(s, int):
            # single item requested
            print(f'requesting [{s}]')
        else:
            # slice being requested
            print(f'requesting [{s.start}:{s.stop}:{s.step}]')
            idx = s.indices(self._n)
            rng = range(*idx)
            print(f'\trange({idx[0]}, {idx[1]}, {idx[2]}) --> {list(rng)}')

In [30]:
f = Fib(10)
f[3:5]
f[::-1]

requesting [3:5:None]
	range(3, 5, 1) --> [3, 4]
requesting [None:None:-1]
	range(9, -1, -1) --> [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]


The next step is for us to actually calculate the n-th Fibonacci number. We'll use memoization as well (see the lecture on decorators and memoization if you need to refresh your memory on that):

In [31]:
from functools import lru_cache

In [32]:
@lru_cache(2**10)
def fib(n):
    if n < 2:
        return 1
    else:
        return fib(n-1) + fib(n-2)

In [33]:
fib(0), fib(1), fib(2), fib(3), fib(4), fib(5), fib(50)

(1, 1, 2, 3, 5, 8, 20365011074)

Now, let's make this function part of our class:

In [34]:
class Fib:
    def __init__(self, n):
        self._n = n
    
    def __getitem__(self, s):
        if isinstance(s, int):
            # single item requested
            print(f'requesting [{s}]')
        else:
            # slice being requested
            print(f'requesting [{s.start}:{s.stop}:{s.step}]')
            idx = s.indices(self._n)
            rng = range(idx[0], idx[1], idx[2])
            print(f'\trange({idx[0]}, {idx[1]}, {idx[2]}) --> {list(rng)}')
    
    @staticmethod
    @lru_cache(2**32)
    def _fib(n):
        if n < 2:
            return 1
        else:
            return Fib._fib(n-1) + Fib._fib(n-2)

The next step is to implement the `__getitem__` method. Let's start by implementing the simple indexing:

In [35]:
class Fib:
    def __init__(self, n):
        self._n = n
    
    def __getitem__(self, s):
        if isinstance(s, int):
            # single item requested
            return self._fib(s)
        else:
            # slice being requested
            print(f'requesting [{s.start}:{s.stop}:{s.step}]')
            idx = s.indices(self._n)
            rng = range(idx[0], idx[1], idx[2])
            print(f'\trange({idx[0]}, {idx[1]}, {idx[2]}) --> {list(rng)}')
            
    @staticmethod
    @lru_cache(2**32)
    def _fib(n):
        if n < 2:
            return 1
        else:
            return Fib._fib(n-1) + Fib._fib(n-2)

Let's test that out:

In [36]:
f = Fib(100)

In [37]:
f[0], f[1], f[2], f[3], f[4], f[5], f[50]

(1, 1, 2, 3, 5, 8, 20365011074)

But we still have a few problems.

First, we do not handle negative indices, and we also return results for indices that should technically be out of bounds. This means we can't really iterate through this sequence yet, as we would end up with an infinite iteration!

In [38]:
f[200], f[-5]

(453973694165307953197296969697410619233826, 1)

So we first need to raise an `IndexError` exception when the index is out of bounds, and we also need to remap negative indices (for example, `-1` should correspond to the last element of the sequence, and so on).

In [39]:
class Fib:
    def __init__(self, n):
        self._n = n
    
    def __getitem__(self, s):
        if isinstance(s, int):
            # single item requested
            if s < 0:
                s = self._n + s
            if s < 0 or s > self._n - 1:
                raise IndexError
            return self._fib(s)
        else:
            # slice being requested
            print(f'requesting [{s.start}:{s.stop}:{s.step}]')
            idx = s.indices(self._n)
            rng = range(idx[0], idx[1], idx[2])
            print(f'\trange({idx[0]}, {idx[1]}, {idx[2]}) --> {list(rng)}')
            
    @staticmethod
    @lru_cache(2**32)
    def _fib(n):
        if n < 2:
            return 1
        else:
            return Fib._fib(n-1) + Fib._fib(n-2)

In [40]:
f = Fib(10)

In [41]:
f[9], f[-1]

(55, 55)

In [42]:
f[10]

IndexError: 

In [43]:
f[-100]

IndexError: 

In [44]:
for item in f:
    print(item)

1
1
2
3
5
8
13
21
34
55


We still don't support slicing though...

In [45]:
f[0:2]

requesting [0:2:None]
	range(0, 2, 1) --> [0, 1]


So let's implement slicing as well:

In [46]:
class Fib:
    def __init__(self, n):
        self._n = n
    
    def __getitem__(self, s):
        if isinstance(s, int):
            # single item requested
            if s < 0:
                s = self._n + s
            if s < 0 or s > self._n - 1:
                raise IndexError
            return self._fib(s)
        else:
            # slice being requested
            idx = s.indices(self._n)
            rng = range(idx[0], idx[1], idx[2])
            return [self._fib(n) for n in rng]
            
    @staticmethod
    @lru_cache(2**32)
    def _fib(n):
        if n < 2:
            return 1
        else:
            return Fib._fib(n-1) + Fib._fib(n-2)

In [47]:
f = Fib(10)

In [48]:
f[0:5]

[1, 1, 2, 3, 5]

In [49]:
f[5::-1]

[8, 5, 3, 2, 1, 1]

In [50]:
list(f)

[1, 1, 2, 3, 5, 8, 13, 21, 34, 55]

In [51]:
f[::-1]

[55, 34, 21, 13, 8, 5, 3, 2, 1, 1]

One other thing is that the built-in `len` function will not work with our class:

In [52]:
f = Fib(10)

In [53]:
len(f)

TypeError: object of type 'Fib' has no len()

That's an easy fix, we just need to implement the `__len__` method:

In [54]:
class Fib:
    def __init__(self, n):
        self._n = n
    
    def __len__(self):
        return self._n
    
    def __getitem__(self, s):
        if isinstance(s, int):
            # single item requested
            if s < 0:
                s = self._n + s
            if s < 0 or s > self._n - 1:
                raise IndexError
            return self._fib(s)
        else:
            # slice being requested
            idx = s.indices(self._n)
            rng = range(idx[0], idx[1], idx[2])
            return [self._fib(n) for n in rng]
            
    @staticmethod
    @lru_cache(2**32)
    def _fib(n):
        if n < 2:
            return 1
        else:
            return Fib._fib(n-1) + Fib._fib(n-2)

In [55]:
f = Fib(10)

In [56]:
len(f)

10

One thing I want to point out here: we did not need to use inheritance! There was no need to inherit from another sequence type. All we really needed was to implement the `__getitem__` and `__len__` methods.

The other thing I want to mention is that I would not use recursion in production code for a Fibonacci sequence, even with memoization - partly because of the cost of recursion, and partly because of the limit on recursion depth.

Also, when we look at generators, and more particularly generator expressions, we'll see better ways of doing this as well.

I really wanted to show you a simple example of how to create your own sequence types.
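For comparison, here is one possible non-recursive variant. This is only a sketch: the class name `IterativeFib` is made up, it keeps the same convention as above (`fib(0) == fib(1) == 1`), and it trades memoization for a simple loop, so each lookup recomputes the value from scratch:

```python
class IterativeFib:
    # Hypothetical variant of the Fib class above, without recursion
    def __init__(self, n):
        self._n = n

    def __len__(self):
        return self._n

    def __getitem__(self, s):
        if isinstance(s, int):
            # single item requested
            if s < 0:
                s += self._n
            if not 0 <= s < self._n:
                raise IndexError('index out of range')
            return self._fib(s)
        # slice being requested
        return [self._fib(i) for i in range(*s.indices(self._n))]

    @staticmethod
    def _fib(n):
        # Iterative calculation: no recursion depth limit to worry about
        a, b = 1, 1
        for _ in range(n):
            a, b = b, a + b
        return a


f = IterativeFib(10)
print(list(f))      # [1, 1, 2, 3, 5, 8, 13, 21, 34, 55]
print(f[5::-1])     # [8, 5, 3, 2, 1, 1]
```

Because nothing is recursive, a large sequence such as `IterativeFib(5000)` can be indexed near its end without hitting the recursion limit.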

##  In-Place Concatenation and Repetition

##### In-Place Concatenation

We saw that using concatenation ended up creating a new sequence object:

In [1]:
l1 = [1, 2, 3, 4]
l2 = [5, 6]
print(id(l1), l1)
print(id(l2), l2)

2674852946824 [1, 2, 3, 4]
2674852947208 [5, 6]


In [2]:
l1 = l1 + l2
print(id(l1), l1)

2674853399624 [1, 2, 3, 4, 5, 6]


But watch what happens when we use the in-place concatenation operator `+=`:

In [3]:
l1 = [1, 2, 3, 4]
l2 = [5, 6]
print(id(l1), l1)
print(id(l2), l2)

2674853400520 [1, 2, 3, 4]
2674852590920 [5, 6]


In [4]:
l1 += l2
print(id(l1), l1)

2674853400520 [1, 2, 3, 4, 5, 6]


Notice how the `id` of `l1` has **not** changed - it is the same object, just mutated!

So far in this course I have often said that:

`a = a + 1`

and 

`a += 1`

are the same thing.

And for immutable objects such as integers, that is indeed true.

But in fact `+` and `+=` are two different operators.
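The difference becomes observable as soon as a second name refers to the same list. A minimal sketch (the variable names are arbitrary):

```python
a = [1, 2]
alias_a = a
a = a + [3]        # builds a new list and rebinds the name a
print(alias_a)     # [1, 2]

b = [1, 2]
alias_b = b
b += [3]           # mutates the existing list in place
print(alias_b)     # [1, 2, 3]
```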

It is interesting to note that the implementation of `+=` for lists will actually extend the list given any iterable, not just another list. This is really just the particular implementation of that operator for lists.

In [5]:
l1 = [1, 2, 3, 4]
t1 = 5, 6, 7
print(id(l1), l1)
print(id(t1), t1)

2674853566344 [1, 2, 3, 4]
2674853559968 (5, 6, 7)


In [6]:
l1 += t1
print(id(l1), l1)

2674853566344 [1, 2, 3, 4, 5, 6, 7]


And this will work with other iterables as well:

In [7]:
l1 += range(8, 11)
print(id(l1), l1)

2674853566344 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]


or even with iterable non-sequence types:

In [8]:
l1 += {11, 12, 13}
print(id(l1), l1)

2674853566344 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]


Of course, in-place mutation will **not work** with **immutable** sequence types, such as tuples or strings:

In [9]:
t1 = 1, 2, 3
t2 = 4, 5, 6
print(id(t1), t1)
print(id(t2), t2)

2674852634768 (1, 2, 3)
2674853559968 (4, 5, 6)


In [10]:
print(id(t1), t1)

2674852634768 (1, 2, 3)


We cannot mutate an immutable container!
`+=` is not actually defined for `tuple`, so a statement such as `t1 += t2` makes Python essentially execute this code:

`t1 = t1 + t2`

which, as we already know, always creates a new object.
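A minimal sketch of that fallback, using `is` rather than specific `id` values to show that `t1` ends up bound to a new object:

```python
t1 = 1, 2, 3
t2 = 4, 5, 6
original = t1          # keep a second reference to the original tuple

t1 += t2               # falls back to t1 = t1 + t2

print(t1)              # (1, 2, 3, 4, 5, 6)
print(original)        # (1, 2, 3)
print(t1 is original)  # False

# tuple does not define __iadd__, whereas list does
print(hasattr(tuple, '__iadd__'), hasattr(list, '__iadd__'))  # False True
```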

##### In-Place Repetition

A similar result holds for in-place repetition.

Let's see this using a list (mutable sequence type) first:

In [11]:
l = [1, 2, 3]
print(id(l), l)

2674853567560 [1, 2, 3]


In [12]:
l *= 2
print(id(l), l)

2674853567560 [1, 2, 3, 1, 2, 3]


But obviously this operator will work differently if the sequence type is immutable:

In [13]:
t = (1, 2, 3)
print(id(t), t)

2674853646840 (1, 2, 3)


In [14]:
t *= 2
print(id(t), t)

2674829349224 (1, 2, 3, 1, 2, 3)


##  Assignments in Mutable Sequences

We have seen how to mutate mutable sequences using append, insert, extend and in-place concatenation (`+=`).

But mutable sequences also allow us to mutate the sequence by assigning values (iterables) to slices. Depending on how we specify the slice, and what the value is, we can actually insert, modify and delete elements of the sequence.

#### Standard Slices

##### Replacement

We can replace a slice of a sequence with any other iterable. They need not even be of the same length.

In [35]:
l = [1, 2, 3, 4, 5]
id(l)

1449221423176

In [36]:
l[0:3]

[1, 2, 3]

In [37]:
l[0:3] = ['a', 'b', 'c', 'd']

In [38]:
l, id(l)

(['a', 'b', 'c', 'd', 4, 5], 1449221423176)

In fact, since strings are iterables, this would work too:

In [39]:
l = [1, 2, 3, 4, 5]

In [40]:
l[0:3] = 'python'

In [41]:
l

['p', 'y', 't', 'h', 'o', 'n', 4, 5]

##### Deleting

Delete is really just a special case of replacement, where we replace with an empty iterable.

In [27]:
l = [1, 2, 3, 4, 5]
id(l)

1449221214024

We can delete a single element by defining a slice that precisely selects that element, and replacing it with an empty iterable.

In [28]:
l[0:1]

[1]

In [29]:
l[0:1] = []

In [30]:
l, id(l)

([2, 3, 4, 5], 1449221214024)

If we want, we can delete multiple elements at once using the same technique:

In [31]:
l = [1, 2, 3, 4, 5]
id(l)

1449221446792

In [32]:
l[0:2]

[1, 2]

In [33]:
l[0:2] = []

In [34]:
l, id(l)

([3, 4, 5], 1449221446792)

##### Inserting

In [18]:
l = [1, 2, 3, 4, 5]
id(l)

1449221213576

Here we have to be careful if we want to insert something.

If we replace a slice that contains elements from the sequence, we'll be replacing those elements.

So what we really want here is a way to replace an empty slice in our sequence!

Let's say we want to insert some elements at the second position in the sequence (index 1):

In [43]:
l = [1, 2, 3, 4, 5]
id(l)

1449221445832

In [44]:
l[1:2]

[2]

The problem is that if we assign something to that slice it will replace the element `2`, and that's not what we want.

Instead, we have to define an empty slice at the location where we want the insert to take place, in this case at index `1`:

In [45]:
l[1:1]

[]

So, this is an empty slice starting at `1` and ending *before* `1`.

In [46]:
l[1:1] = 'abc'

In [47]:
l, id(l)

([1, 'a', 'b', 'c', 2, 3, 4, 5], 1449221445832)

And of course the memory address of `l` has not changed. We mutated the list.
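The same empty-slice technique works at any position, including the very start and the very end of the list. A short sketch:

```python
l = [1, 2, 3, 4, 5]
original = l

l[1:1] = ['a', 'b']        # insert several items before index 1
print(l)                   # [1, 'a', 'b', 2, 3, 4, 5]

l[len(l):] = [6, 7]        # an empty slice at the end behaves like extend
print(l)                   # [1, 'a', 'b', 2, 3, 4, 5, 6, 7]

l[0:0] = [0]               # an empty slice at the start prepends
print(l)                   # [0, 1, 'a', 'b', 2, 3, 4, 5, 6, 7]

print(l is original)       # True
```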

#### Side Note: Immutable Sequences

As a side note, what would happen if we tried the same technique using immutable sequences, such as tuples for example?

In [48]:
t = 1, 2, 3, 4, 5

In [49]:
t[0:3] = (10, 20)

TypeError: 'tuple' object does not support item assignment

As expected, we cannot mutate an immutable type.

#### Extended Slices

So now let's explore what happens if we assign iterables to extended slices.

In [51]:
l = [1, 2, 3, 4, 5]
id(l)

1449221445384

In [52]:
l[::2]

[1, 3, 5]

Let's see if we can replace those items:

In [54]:
l[::2] = ['a', 'b', 'c']

In [55]:
l, id(l)

(['a', 2, 'b', 4, 'c'], 1449221445384)

Yes!! That worked, and the id of `l` is unchanged.

There is one restriction when assigning to extended slices - the slice and the iterable on the right-hand side must have the **same length**:

In [56]:
l = [1, 2, 3, 4, 5]

In [57]:
l[::2]

[1, 3, 5]

In [58]:
l[::2] = ['a', 'b']

ValueError: attempt to assign sequence of size 2 to extended slice of size 3

In [59]:
l[::2] = ['a', 'b', 'c', 'd']

ValueError: attempt to assign sequence of size 4 to extended slice of size 3

This means that we cannot delete items by assigning an empty iterable to an extended slice:

In [60]:
l[::2] = []

ValueError: attempt to assign sequence of size 0 to extended slice of size 3

And of course insertion does not even make sense with an extended slice - remember that for insertion we need an empty slice...
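Note that the restriction above applies to slice *assignment*. As a small sketch, the `del` statement can still remove the elements selected by an extended slice of a list:

```python
l = [1, 2, 3, 4, 5]
original = l

del l[::2]            # removes the items at indices 0, 2 and 4
print(l)              # [2, 4]
print(l is original)  # True
```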

#### Note

One last note, the right hand side can be any iterable. We saw how we could use a string for example. But any iterable, even non-sequence types will work as well.

In [61]:
l = [1, 2, 3, 4, 5]

In [62]:
l[3:3] = {'c', 'b', 'a'}

In [63]:
l

[1, 2, 3, 'b', 'a', 'c', 4, 5]

Of course, since we are inserting the values from a set, there is no guarantee of the order in which the elements will be inserted into our list.

##  Custom Sequences (Part 2a)

We have seen before how we could define our own custom sequence type by implementing the `__len__` and `__getitem__` methods.

Here we are going to look at how to implement:
* concatenation (`+`)
* in-place concatenation (`+=`)
* repetition (`*`)
* in-place repetition (`*=`)
* index assignment (`seq[i]=val`)
* slice assignment (`seq[i:j]=iter` and `seq[i:j:k]=iter`)
* append, extend, in, del, pop

#### The `+` and `+=` Operators

First we look at how we can overload the `+` and `+=` operators in a custom class in general. Then we'll look at how to use this in the context of sequences.

We use the special methods `__add__` and `__iadd__`.

Just to see how those methods get called, we're actually going to implement them to just print out that they were called. As you can see, we can implement them however we want!

In [1]:
class MyClass:
    def __init__(self, name):
        self.name = name
        
    def __repr__(self):
        return f'MyClass(name={self.name})'
    
    def __add__(self, other):
        print(f'You called + on {self} and {other}')
        return 'Hello from __add__'
        
    def __iadd__(self, other):
        print(f'You called += on {self} and {other}')
        return 'Hello from __iadd__'

In [2]:
c1 = MyClass('instance 1')
c2 = MyClass('instance 2')

In [3]:
c3 = c1 + c2

You called + on MyClass(name=instance 1) and MyClass(name=instance 2)


In [4]:
c3

'Hello from __add__'

In [5]:
c1 += c2

You called += on MyClass(name=instance 1) and MyClass(name=instance 2)


In [6]:
c1

'Hello from __iadd__'

Now let's tweak this code to make those operators concatenate the `name` property.

The thing to note is that when we add two objects together we generally expect them to be of the same type and to return an object of the same type (and in the case of `+=` it needs to return the original object).

Let's quickly recall how those operators behave with lists:

In [7]:
l1 = [1, 2, 3]
l2 = [4, 5, 6]
id(l1)

1727174473032

In [8]:
l1 = l1 + l2
id(l1), l1

(1727175173064, [1, 2, 3, 4, 5, 6])

Notice how the `id` of `l1` changed.

But, with `+=`:

In [9]:
l1 = [1, 2, 3]
l2 = [4, 5, 6]
id(l1)

1727175172552

In [10]:
l1 += l2
id(l1), l1

(1727175172552, [1, 2, 3, 4, 5, 6])

we can see that the concatenation results in the same elements, but this time the `id` of `l1` has not changed - an in-place operation took place.

Let's do something similar:

In [11]:
class MyClass:
    def __init__(self, name):
        self.name = name
        
    def __repr__(self):
        return f'MyClass(name={self.name})'
    
    def __add__(self, other):
        return MyClass(self.name + ' ' + other.name)
        
    def __iadd__(self, other):
        self.name += ' ' + other.name
        return self
        

In [12]:
c1 = MyClass('Eric')
c2 = MyClass('Idle')

In [13]:
c3 = c1 + c2

In [14]:
c3

MyClass(name=Eric Idle)

In [15]:
c1, c2

(MyClass(name=Eric), MyClass(name=Idle))

In [16]:
c1 += c2

In [17]:
c1

MyClass(name=Eric Idle)
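One refinement worth considering: when the right-hand operand is not a type we know how to handle, the usual convention is to return `NotImplemented`, so that Python can try the other operand's reflected method or raise a `TypeError`. The sketch below uses a made-up `Name` class that mirrors `MyClass`; it is an illustration, not part of the original example:

```python
class Name:
    # Hypothetical class, similar in spirit to MyClass above
    def __init__(self, name):
        self.name = name

    def __repr__(self):
        return f'Name(name={self.name})'

    def __add__(self, other):
        if not isinstance(other, Name):
            return NotImplemented
        return Name(self.name + ' ' + other.name)

    def __iadd__(self, other):
        if not isinstance(other, Name):
            return NotImplemented
        self.name += ' ' + other.name
        return self   # in-place: hand back the same object


n1 = Name('Eric')
n2 = n1
n1 += Name('Idle')
print(n1, n1 is n2)   # Name(name=Eric Idle) True

try:
    n1 + 10
except TypeError as ex:
    print('TypeError:', ex)
```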

#### The `*` and `*=` Operators

Just as easily we can overload the `*` and `*=` operators too, using the `__mul__` and `__imul__` methods.

In [18]:
class MyClass:
    def __init__(self, name):
        self.name = name
        
    def __repr__(self):
        return f'MyClass(name={self.name})'
    
    def __add__(self, other):
        return MyClass(self.name + ' ' + other.name)
        
    def __iadd__(self, other):
        self.name += ' ' + other.name
        return self
    
    def __mul__(self, n):
        return MyClass(self.name * n)
        
    def __imul__(self, n):
        self.name *= n
        return self

In [19]:
c1 = MyClass('Eric')

In [20]:
c1 * 3

MyClass(name=EricEricEric)

In [21]:
c1

MyClass(name=Eric)

In [22]:
c1 *= 4 

In [23]:
c1

MyClass(name=EricEricEricEric)

And if we try something not supported:

In [24]:
c1 = MyClass('Eric')
c1 * 'hello'

TypeError: can't multiply sequence by non-int of type 'str'

As you can see, we get the correct exception - and we didn't even have to guard against that exception and raise our own error. Since we delegated our `*` call to multiplying a sequence by something else, we could simply let Python handle any exceptions.

We'll actually get into a lot of detail with exception handling later in this course.

What about multiplying an integer by the sequence?

In [25]:
c1 = MyClass('Monty')
2 * c1

TypeError: unsupported operand type(s) for *: 'int' and 'MyClass'

To handle this we need to implement the `__rmul__` method:

In [26]:
class MyClass:
    def __init__(self, name):
        self.name = name
        
    def __repr__(self):
        return f'MyClass(name={self.name})'
    
    def __add__(self, other):
        return MyClass(self.name + ' ' + other.name)
        
    def __iadd__(self, other):
        self.name += ' ' + other.name
        return self
    
    def __mul__(self, n):
        return MyClass(self.name * n)
        
    def __imul__(self, n):
        self.name *= n
        return self
    
    def __rmul__(self, n):
        # n * obj: return a new object, leaving self unchanged
        return self * n

In [27]:
c1 = MyClass('Monty')

In [28]:
2 * c1

MyClass(name=MontyMonty)
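To see why `__rmul__` is needed, it helps to look at what Python tries first. For `2 * obj`, it calls `int.__mul__`, which returns `NotImplemented` for an unknown type, and only then tries the reflected method on the right operand. The sketch below uses a made-up `Name` class and has `__rmul__` delegate to `__mul__`, so that the operand is not modified by a plain `*`:

```python
class Name:
    # Hypothetical class used only for this illustration
    def __init__(self, name):
        self.name = name

    def __repr__(self):
        return f'Name(name={self.name})'

    def __mul__(self, n):
        return Name(self.name * n)

    def __rmul__(self, n):
        # n * obj: delegate to __mul__ so that obj itself is left untouched
        return self * n


n = Name('Monty')

# int does not know how to multiply by a Name...
print((2).__mul__(n) is NotImplemented)   # True

# ...so Python falls back to Name.__rmul__
print(2 * n)   # Name(name=MontyMonty)
print(n)       # Name(name=Monty)
```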

#### Implementing the `in` operator

For this example, we'll want `in` to test if something is contained in the name string of our class:

In [29]:
class MyClass:
    def __init__(self, name):
        self.name = name
        
    def __repr__(self):
        return f'MyClass(name={self.name})'
    
    def __add__(self, other):
        return MyClass(self.name + ' ' + other.name)
        
    def __iadd__(self, other):
        self.name += ' ' + other.name
        return self
    
    def __mul__(self, n):
        return MyClass(self.name * n)
        
    def __imul__(self, n):
        self.name *= n
        return self
    
    def __rmul__(self, n):
        # n * obj: return a new object, leaving self unchanged
        return self * n
    
    def __contains__(self, value):
        return value in self.name

In [30]:
c1 = MyClass('MontyPython')

In [31]:
'ty' in c1

True
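As a side note, `in` still works for classes that do not define `__contains__`: Python falls back to iterating over the object (using `__iter__`, or `__getitem__` with `0`, `1`, `2`, ... if there is no `__iter__`) and comparing each element. A small sketch with a made-up `Letters` class that records which indices were requested:

```python
class Letters:
    # Hypothetical class with no __contains__ and no __iter__
    def __init__(self, text):
        self._text = text
        self.requested = []   # indices passed to __getitem__

    def __getitem__(self, index):
        self.requested.append(index)
        return self._text[index]   # raises IndexError past the end


letters = Letters('abc')
print('b' in letters)       # True
print(letters.requested)    # [0, 1]

letters = Letters('abc')
print('z' in letters)       # False
print(letters.requested)    # [0, 1, 2, 3]
```

Implementing `__contains__` explicitly is still worthwhile when a faster or more specific membership test is available.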

##  Custom Sequences (Part 2b/c)

For this example we'll re-use the Polygon class from a previous lecture on extending sequences.

We are going to consider a polygon as nothing more than a collection of points (and we'll stick to a 2-dimensional space).

So, we'll need a `Point` class, but we're going to use our own custom class instead of just using a named tuple.

We do this because we want to enforce a rule that our Point co-ordinates will be real numbers. We would not be able to use a named tuple to do that and we could end up with points whose `x` and `y` coordinates could be of any type.

First we'll need to see how we can test if a type is a numeric real type.

We can do this by using the numbers module.

In [1]:
import numbers

This module contains certain base types for numbers that we can use, such as Number, Real, Complex, etc.

In [2]:
isinstance(10, numbers.Number)

True

In [3]:
isinstance(10.5, numbers.Number)

True

In [4]:
isinstance(1+1j, numbers.Number)

True

We will want our point coordinates to be real numbers only, so we can check it this way:

In [5]:
isinstance(1+1j, numbers.Real)

False

In [6]:
isinstance(10, numbers.Real)

True

In [7]:
isinstance(10.5, numbers.Real)

True
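It is worth knowing how a few other standard library types behave with this check before relying on it. The sketch below only uses the standard `fractions` and `decimal` modules:

```python
import numbers
from decimal import Decimal
from fractions import Fraction

# Fractions are registered as Rational, which is a subtype of Real
print(isinstance(Fraction(1, 2), numbers.Real))     # True

# bool is a subclass of int, so booleans pass the check as well
print(isinstance(True, numbers.Real))               # True

# Decimal is registered as a Number, but not as a Real
print(isinstance(Decimal('1.5'), numbers.Real))     # False
print(isinstance(Decimal('1.5'), numbers.Number))   # True
```

So a class that validates with `numbers.Real` will accept `Fraction` and `bool` values but reject `Decimal` values.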

So now let's write our Point class. We want it to have these properties:

  1. The `x` and `y` coordinates should be real numbers only
  2. Point instances should be a sequence type so that we can unpack it as needed in the same way we were able to unpack the values of a named tuple.

In [8]:
class Point:
    def __init__(self, x, y):
        if isinstance(x, numbers.Real) and isinstance(y, numbers.Real):
            self._pt = (x, y)
        else:
            raise TypeError('Point co-ordinates must be real numbers.')
            
    def __repr__(self):
        return f'Point(x={self._pt[0]}, y={self._pt[1]})'
    
    def __len__(self):
        return 2
    
    def __getitem__(self, s):
        return self._pt[s]

Let's use our point class and make sure it works as intended:

In [9]:
p = Point(1, 2)

In [10]:
p

Point(x=1, y=2)

In [11]:
len(p)

2

In [13]:
p[0], p[1]

(1, 2)

In [14]:
x, y = p

In [15]:
x, y

(1, 2)

Now, we can start creating our Polygon class, which will essentially be a mutable sequence of points making up the vertices of the polygon.

In [21]:
class Polygon:
    def __init__(self, *pts):
        if pts:
            self._pts = [Point(*pt) for pt in pts]
        else:
            self._pts = []
            
    def __repr__(self):
        return f'Polygon({self._pts})'

Let's try it and see if everything is as we expect:

In [22]:
p = Polygon()

In [23]:
p

Polygon([])

In [24]:
p = Polygon((0,0), [1,1])

In [25]:
p

Polygon([Point(x=0, y=0), Point(x=1, y=1)])

In [26]:
p = Polygon(Point(0, 0), [1, 1])

In [27]:
p

Polygon([Point(x=0, y=0), Point(x=1, y=1)])

That seems to be working, with one minor issue: our representation contains square brackets, which technically should not be there, since the Polygon class's `__init__` expects multiple arguments, not a single iterable.

So we should fix that:

In [37]:
class Polygon:
    def __init__(self, *pts):
        if pts:
            self._pts = [Point(*pt) for pt in pts]
        else:
            self._pts = []
            
    def __repr__(self):
        pts_str = ', '.join(self._pts)
        return f'Polygon({pts_str})'

But that still won't work, because the `join` method expects an iterable of **strings** - here we are passing it an iterable of `Point` objects:

In [29]:
p = Polygon((0,0), (1,1))

In [30]:
p

TypeError: sequence item 0: expected str instance, Point found

So, let's fix that:

In [34]:
class Polygon:
    def __init__(self, *pts):
        if pts:
            self._pts = [Point(*pt) for pt in pts]
        else:
            self._pts = []
            
    def __repr__(self):
        pts_str = ', '.join([str(pt) for pt in self._pts])
        return f'Polygon({pts_str})'

In [35]:
p = Polygon((0,0), (1,1))

In [36]:
p

Polygon(Point(x=0, y=0), Point(x=1, y=1))

Ok, so now we can start making our Polygon into a sequence type, by implementing methods such as `__len__` and `__getitem__`:

In [39]:
class Polygon:
    def __init__(self, *pts):
        if pts:
            self._pts = [Point(*pt) for pt in pts]
        else:
            self._pts = []
            
    def __repr__(self):
        pts_str = ', '.join([str(pt) for pt in self._pts])
        return f'Polygon({pts_str})'
    
    def __len__(self):
        return len(self._pts)
    
    def __getitem__(self, s):
        return self._pts[s]

Notice how we are simply delegating those methods to the ones supported by lists since we are storing our sequence of points internally using a list!

In [40]:
p = Polygon((0,0), Point(1,1), [2,2])

In [41]:
p

Polygon(Point(x=0, y=0), Point(x=1, y=1), Point(x=2, y=2))

In [42]:
p[0]

Point(x=0, y=0)

In [43]:
p[::-1]

[Point(x=2, y=2), Point(x=1, y=1), Point(x=0, y=0)]

Now let's implement concatenation (we'll skip repetition - wouldn't make much sense anyway):

In [45]:
class Polygon:
    def __init__(self, *pts):
        if pts:
            self._pts = [Point(*pt) for pt in pts]
        else:
            self._pts = []
            
    def __repr__(self):
        pts_str = ', '.join([str(pt) for pt in self._pts])
        return f'Polygon({pts_str})'
    
    def __len__(self):
        return len(self._pts)
    
    def __getitem__(self, s):
        return self._pts[s]
    
    def __add__(self, other):
        if isinstance(other, Polygon):
            new_pts = self._pts + other._pts
            return Polygon(*new_pts)
        else:
            raise TypeError('can only concatenate with another Polygon')

In [47]:
p1 = Polygon((0,0), (1,1))
p2 = Polygon((2,2), (3,3))
print(id(p1), p1)
print(id(p2), p2)

1869044255880 Polygon(Point(x=0, y=0), Point(x=1, y=1))
1869044253528 Polygon(Point(x=2, y=2), Point(x=3, y=3))


In [48]:
result = p1 + p2

In [49]:
print(id(result), result)

1869044256552 Polygon(Point(x=0, y=0), Point(x=1, y=1), Point(x=2, y=2), Point(x=3, y=3))
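Raising the `TypeError` ourselves works fine. A common alternative convention is to return `NotImplemented` for unsupported operands, and let Python raise the `TypeError` after giving the other operand a chance to handle the operation. A minimal sketch, assuming a `namedtuple` stand-in for `Point` and a hypothetical class name `Poly`:

```python
from collections import namedtuple

# Stand-in for the Point class defined earlier (assumption: no validation)
Point = namedtuple('Point', 'x y')


class Poly:
    """Hypothetical trimmed-down polygon used only to show NotImplemented."""

    def __init__(self, *pts):
        self._pts = [Point(*pt) for pt in pts]

    def __len__(self):
        return len(self._pts)

    def __add__(self, other):
        if isinstance(other, Poly):
            return Poly(*(self._pts + other._pts))
        # tell Python we do not know how to add this type
        return NotImplemented


combined = Poly((0, 0), (1, 1)) + Poly((2, 2))

try:
    Poly((0, 0)) + [(1, 1)]
except TypeError as ex:
    # Python raised the TypeError for us
    error_message = str(ex)

print(len(combined))
print(error_message)
```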


Now, let's handle in-place concatenation. Let's start by only allowing the RHS of the in-place concatenation to be another Polygon:

In [71]:
class Polygon:
    def __init__(self, *pts):
        if pts:
            self._pts = [Point(*pt) for pt in pts]
        else:
            self._pts = []
            
    def __repr__(self):
        pts_str = ', '.join([str(pt) for pt in self._pts])
        return f'Polygon({pts_str})'
    
    def __len__(self):
        return len(self._pts)
    
    def __getitem__(self, s):
        return self._pts[s]
    
    def __add__(self, other):
        if isinstance(other, Polygon):
            new_pts = self._pts + other._pts
            return Polygon(*new_pts)
        else:
            raise TypeError('can only concatenate with another Polygon')
            
    def __iadd__(self, pt):
        if isinstance(pt, Polygon):
            self._pts = self._pts + pt._pts
            return self
        else:
            raise TypeError('can only concatenate with another Polygon')

In [72]:
p1 = Polygon((0,0), (1,1))
p2 = Polygon((2,2), (3,3))
print(id(p1), p1)
print(id(p2), p2)

1869044255600 Polygon(Point(x=0, y=0), Point(x=1, y=1))
1869044255656 Polygon(Point(x=2, y=2), Point(x=3, y=3))


In [73]:
p1 += p2

In [74]:
print(id(p1), p1)

1869044255600 Polygon(Point(x=0, y=0), Point(x=1, y=1), Point(x=2, y=2), Point(x=3, y=3))
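The id stayed the same because `__iadd__` mutates and returns `self`. If a class defines only `__add__`, the `+=` operator still works, but it falls back to `__add__` and rebinds the name to a brand new object. A small sketch with two hypothetical classes (not part of the Polygon code) makes the difference visible:

```python
class WithAddOnly:
    """Hypothetical container that supports + but has no __iadd__."""

    def __init__(self, items):
        self.items = list(items)

    def __add__(self, other):
        return WithAddOnly(self.items + other.items)


class WithIAdd(WithAddOnly):
    """Same container, with in-place concatenation added."""

    def __iadd__(self, other):
        self.items.extend(other.items)
        return self


a = WithAddOnly([1, 2])
original_a = a
a += WithAddOnly([3])    # falls back to __add__: a is rebound to a new object

b = WithIAdd([1, 2])
original_b = b
b += WithIAdd([3])       # uses __iadd__: b is the same, mutated, object

print(a is original_a, a.items)
print(b is original_b, b.items)
```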


So that worked, but this would not:

In [75]:
p1 = Polygon((0,0), (1,1))

In [76]:
p1 += [(2,2), (3,3)]

TypeError: can only concatenate with another Polygon

As you can see we get that type error. But we really should be able to handle appending any iterable of Points - and of course Points could also be specified as just iterables of length 2 containing numbers:

In [77]:
class Polygon:
    def __init__(self, *pts):
        if pts:
            self._pts = [Point(*pt) for pt in pts]
        else:
            self._pts = []
            
    def __repr__(self):
        pts_str = ', '.join([str(pt) for pt in self._pts])
        return f'Polygon({pts_str})'
    
    def __len__(self):
        return len(self._pts)
    
    def __getitem__(self, s):
        return self._pts[s]
    
    def __add__(self, pt):
        if isinstance(pt, Polygon):
            new_pts = self._pts + pt._pts
            return Polygon(*new_pts)
        else:
            raise TypeError('can only concatenate with another Polygon')
            
    def __iadd__(self, pts):
        if isinstance(pts, Polygon):
            self._pts = self._pts + pts._pts
        else:
            # assume we are being passed an iterable containing Points
            # or something compatible with Points
            points = [Point(*pt) for pt in pts]
            self._pts = self._pts + points
        return self

In [78]:
p1 = Polygon((0,0), (1,1))

In [79]:
p1 += [(2,2), (3,3)]

In [80]:
p1

Polygon(Point(x=0, y=0), Point(x=1, y=1), Point(x=2, y=2), Point(x=3, y=3))

Now let's implement some methods such as `append`, `extend` and `insert`:

In [81]:
class Polygon:
    def __init__(self, *pts):
        if pts:
            self._pts = [Point(*pt) for pt in pts]
        else:
            self._pts = []
            
    def __repr__(self):
        pts_str = ', '.join([str(pt) for pt in self._pts])
        return f'Polygon({pts_str})'
    
    def __len__(self):
        return len(self._pts)
    
    def __getitem__(self, s):
        return self._pts[s]
    
    def __add__(self, pt):
        if isinstance(pt, Polygon):
            new_pts = self._pts + pt._pts
            return Polygon(*new_pts)
        else:
            raise TypeError('can only concatenate with another Polygon')
            
    def __iadd__(self, pts):
        if isinstance(pts, Polygon):
            self._pts = self._pts + pts._pts
        else:
            # assume we are being passed an iterable containing Points
            # or something compatible with Points
            points = [Point(*pt) for pt in pts]
            self._pts = self._pts + points
        return self
    
    def append(self, pt):
        self._pts.append(Point(*pt))
        
    def extend(self, pts):
        if isinstance(pts, Polygon):
            self._pts = self._pts + pts._pts
        else:
            # assume we are being passed an iterable containing Points
            # or something compatible with Points
            points = [Point(*pt) for pt in pts]
            self._pts = self._pts + points
            
    def insert(self, i, pt):
        self._pts.insert(i, Point(*pt))

Notice how we used almost the same code for `__iadd__` and `extend`?
The only difference is that `__iadd__` returns the object, while `extend` does not - so let's clean that up a bit:

In [82]:
class Polygon:
    def __init__(self, *pts):
        if pts:
            self._pts = [Point(*pt) for pt in pts]
        else:
            self._pts = []
            
    def __repr__(self):
        pts_str = ', '.join([str(pt) for pt in self._pts])
        return f'Polygon({pts_str})'
    
    def __len__(self):
        return len(self._pts)
    
    def __getitem__(self, s):
        return self._pts[s]
    
    def __add__(self, pt):
        if isinstance(pt, Polygon):
            new_pts = self._pts + pt._pts
            return Polygon(*new_pts)
        else:
            raise TypeError('can only concatenate with another Polygon')

    def append(self, pt):
        self._pts.append(Point(*pt))
        
    def extend(self, pts):
        if isinstance(pts, Polygon):
            self._pts = self._pts + pts._pts
        else:
            # assume we are being passed an iterable containing Points
            # or something compatible with Points
            points = [Point(*pt) for pt in pts]
            self._pts = self._pts + points
    
    def __iadd__(self, pts):
        self.extend(pts)
        return self
    
    def insert(self, i, pt):
        self._pts.insert(i, Point(*pt))

Now let's give all this a try:

In [93]:
p1 = Polygon((0,0), Point(1,1))
p2 = Polygon([2, 2], [3, 3])
print(id(p1), p1)
print(id(p2), p2)

1869044425392 Polygon(Point(x=0, y=0), Point(x=1, y=1))
1869044427464 Polygon(Point(x=2, y=2), Point(x=3, y=3))


In [94]:
p1 += p2

In [95]:
print(id(p1), p1)

1869044425392 Polygon(Point(x=0, y=0), Point(x=1, y=1), Point(x=2, y=2), Point(x=3, y=3))


That still works. Now let's try `append`:

In [96]:
p1

Polygon(Point(x=0, y=0), Point(x=1, y=1), Point(x=2, y=2), Point(x=3, y=3))

In [97]:
p1.append((4, 4))

In [98]:
p1

Polygon(Point(x=0, y=0), Point(x=1, y=1), Point(x=2, y=2), Point(x=3, y=3), Point(x=4, y=4))

In [99]:
p1.append(Point(5,5))

In [104]:
print(id(p1), p1)

1869044425392 Polygon(Point(x=0, y=0), Point(x=1, y=1), Point(x=2, y=2), Point(x=3, y=3), Point(x=4, y=4), Point(x=5, y=5))


`append` seems to be working, now for `extend`:

In [101]:
p3 = Polygon((6,6), (7,7))

In [102]:
p1.extend(p3)

In [103]:
print(id(p1), p1)

1869044425392 Polygon(Point(x=0, y=0), Point(x=1, y=1), Point(x=2, y=2), Point(x=3, y=3), Point(x=4, y=4), Point(x=5, y=5), Point(x=6, y=6), Point(x=7, y=7))


In [106]:
p1.extend([(8,8), Point(9,9)])

In [107]:
print(id(p1), p1)

1869044425392 Polygon(Point(x=0, y=0), Point(x=1, y=1), Point(x=2, y=2), Point(x=3, y=3), Point(x=4, y=4), Point(x=5, y=5), Point(x=6, y=6), Point(x=7, y=7), Point(x=8, y=8), Point(x=9, y=9))


Now let's see if `insert` works as expected:

In [108]:
p1 = Polygon((0,0), (1,1), (2,2))

In [109]:
print(id(p1), p1)

1869044022576 Polygon(Point(x=0, y=0), Point(x=1, y=1), Point(x=2, y=2))


In [110]:
p1.insert(1, (100, 100))

In [111]:
print(id(p1), p1)

1869044022576 Polygon(Point(x=0, y=0), Point(x=100, y=100), Point(x=1, y=1), Point(x=2, y=2))


In [112]:
p1.insert(1, Point(50, 50))

In [113]:
print(id(p1), p1)

1869044022576 Polygon(Point(x=0, y=0), Point(x=50, y=50), Point(x=100, y=100), Point(x=1, y=1), Point(x=2, y=2))


Now that we have that working, let's turn our attention to the `__setitem__` method so we can support index and slice assignments:

In [114]:
class Polygon:
    def __init__(self, *pts):
        if pts:
            self._pts = [Point(*pt) for pt in pts]
        else:
            self._pts = []
            
    def __repr__(self):
        pts_str = ', '.join([str(pt) for pt in self._pts])
        return f'Polygon({pts_str})'
    
    def __len__(self):
        return len(self._pts)
    
    def __getitem__(self, s):
        return self._pts[s]
    
    def __setitem__(self, s, value):
        # value could be a single Point (or compatible type) for s an int
        # or it could be an iterable of Points if s is a slice
        # let's start by handling slices only first
        self._pts[s] = [Point(*pt) for pt in value]
            
    def __add__(self, pt):
        if isinstance(pt, Polygon):
            new_pts = self._pts + pt._pts
            return Polygon(*new_pts)
        else:
            raise TypeError('can only concatenate with another Polygon')

    def append(self, pt):
        self._pts.append(Point(*pt))
        
    def extend(self, pts):
        if isinstance(pts, Polygon):
            self._pts = self._pts + pts._pts
        else:
            # assume we are being passed an iterable containing Points
            # or something compatible with Points
            points = [Point(*pt) for pt in pts]
            self._pts = self._pts + points
    
    def __iadd__(self, pts):
        self.extend(pts)
        return self
    
    def insert(self, i, pt):
        self._pts.insert(i, Point(*pt))

So, we are only handling slice assignments at this point, not assignments such as `p[0] = Point(0,0)`:

In [117]:
p = Polygon((0,0), (1,1), (2,2))
print(id(p), p)

1869044422304 Polygon(Point(x=0, y=0), Point(x=1, y=1), Point(x=2, y=2))


In [118]:
p[0:2] = [(10, 10), (20, 20), (30, 30)]

In [119]:
print(id(p), p)

1869044422304 Polygon(Point(x=10, y=10), Point(x=20, y=20), Point(x=30, y=30), Point(x=2, y=2))


So this seems to work fine. But this won't yet:

In [120]:
p[0] = Point(100, 100)

TypeError: type object argument after * must be an iterable, not int

If we look at the precise error, we see that our list comprehension is the cause of the error - we fail to correctly handle the case where the value passed in is not an iterable of Points...

In [124]:
class Polygon:
    def __init__(self, *pts):
        if pts:
            self._pts = [Point(*pt) for pt in pts]
        else:
            self._pts = []
            
    def __repr__(self):
        pts_str = ', '.join([str(pt) for pt in self._pts])
        return f'Polygon({pts_str})'
    
    def __len__(self):
        return len(self._pts)
    
    def __getitem__(self, s):
        return self._pts[s]
    
    def __setitem__(self, s, value):
        # value could be a single Point (or compatible type) for s an int
        # or it could be an iterable of Points if s is a slice
        # we could do this:
        if isinstance(s, int):
            self._pts[s] = Point(*value)
        else:
            self._pts[s] = [Point(*pt) for pt in value]
            
    def __add__(self, pt):
        if isinstance(pt, Polygon):
            new_pts = self._pts + pt._pts
            return Polygon(*new_pts)
        else:
            raise TypeError('can only concatenate with another Polygon')

    def append(self, pt):
        self._pts.append(Point(*pt))
        
    def extend(self, pts):
        if isinstance(pts, Polygon):
            self._pts = self._pts + pts._pts
        else:
            # assume we are being passed an iterable containing Points
            # or something compatible with Points
            points = [Point(*pt) for pt in pts]
            self._pts = self._pts + points
    
    def __iadd__(self, pts):
        self.extend(pts)
        return self
    
    def insert(self, i, pt):
        self._pts.insert(i, Point(*pt))

This will now work as expected:

In [125]:
p = Polygon((0,0), (1,1), (2,2))
print(id(p), p)

1869044254368 Polygon(Point(x=0, y=0), Point(x=1, y=1), Point(x=2, y=2))


In [126]:
p[0] = Point(10, 10)

In [127]:
print(id(p), p)

1869044254368 Polygon(Point(x=10, y=10), Point(x=1, y=1), Point(x=2, y=2))


What happens if we try to assign a single Point to a slice:

In [128]:
p[0:2] = Point(10, 10)

TypeError: type object argument after * must be an iterable, not int

As expected this will not work. What about assigning an iterable of points to an index:

In [130]:
p[0] = [Point(10, 10), Point(20, 20)]

TypeError: Point co-ordinates must be real numbers.

Both assignments fail, as they should, but the error messages are a bit misleading - we should probably do something about that:

In [162]:
class Polygon:
    def __init__(self, *pts):
        if pts:
            self._pts = [Point(*pt) for pt in pts]
        else:
            self._pts = []
            
    def __repr__(self):
        pts_str = ', '.join([str(pt) for pt in self._pts])
        return f'Polygon({pts_str})'
    
    def __len__(self):
        return len(self._pts)
    
    def __getitem__(self, s):
        return self._pts[s]
    
    def __setitem__(self, s, value):
        # we first should see if we have a single Point
        # or an iterable of Points in value
        try:
            rhs = [Point(*pt) for pt in value]
            is_single = False
        except TypeError:
            # not a valid iterable of Points
            # maybe a single Point?
            try:
                rhs = Point(*value)
                is_single = True
            except TypeError:
                # still no go
                raise TypeError('Invalid Point or iterable of Points')
        
        # reached here, so rhs is either an iterable of Points, or a Point
        # we want to make sure we are assigning to a slice only if we 
        # have an iterable of points, and assigning to an index if we 
        # have a single Point only
        if (isinstance(s, int) and is_single) \
            or isinstance(s, slice) and not is_single:
            self._pts[s] = rhs
        else:
            raise TypeError('Incompatible index/slice assignment')
                
    def __add__(self, pt):
        if isinstance(pt, Polygon):
            new_pts = self._pts + pt._pts
            return Polygon(*new_pts)
        else:
            raise TypeError('can only concatenate with another Polygon')

    def append(self, pt):
        self._pts.append(Point(*pt))
        
    def extend(self, pts):
        if isinstance(pts, Polygon):
            self._pts = self._pts + pts._pts
        else:
            # assume we are being passed an iterable containing Points
            # or something compatible with Points
            points = [Point(*pt) for pt in pts]
            self._pts = self._pts + points
    
    def __iadd__(self, pts):
        self.extend(pts)
        return self
    
    def insert(self, i, pt):
        self._pts.insert(i, Point(*pt))

So now let's see if we get better error messages:

In [154]:
p1 = Polygon((0,0), (1,1), (2,2))

In [155]:
p1[0:2] = (10,10)

TypeError: Incompatible index/slice assignment

In [156]:
p1[0] = [(0,0), (1,1)]

TypeError: Incompatible index/slice assignment

And the allowed slice/index assignments work as expected:

In [157]:
p1[0] = Point(100, 100)

In [158]:
p1

Polygon(Point(x=100, y=100), Point(x=1, y=1), Point(x=2, y=2))

In [159]:
p1[0:2] = [(0,0), (1,1), (2,2)]

In [160]:
p1

Polygon(Point(x=0, y=0), Point(x=1, y=1), Point(x=2, y=2), Point(x=2, y=2))

And if we try to replace with bad Point data:

In [161]:
p1[0] = (0, 2+2j)

TypeError: Invalid Point or iterable of Points

Here too we get a clearer error message: the `TypeError` raised by `Point` is caught in `__setitem__` and replaced with one that describes the right-hand side of the assignment.

Lastly let's see how we would implement the `del` keyword and the `pop` method.

Recall how the `del` keyword works for a list:

In [163]:
l = [1, 2, 3, 4, 5]

In [164]:
del l[0]

In [165]:
l

[2, 3, 4, 5]

In [166]:
del l[0:2]

In [167]:
l

[4, 5]

In [168]:
del l[-1]

In [169]:
l

[4]

So, `del` works with indices (positive or negative) and slices too. We'll do the same:

In [180]:
class Polygon:
    def __init__(self, *pts):
        if pts:
            self._pts = [Point(*pt) for pt in pts]
        else:
            self._pts = []
            
    def __repr__(self):
        pts_str = ', '.join([str(pt) for pt in self._pts])
        return f'Polygon({pts_str})'
    
    def __len__(self):
        return len(self._pts)
    
    def __getitem__(self, s):
        return self._pts[s]
    
    def __setitem__(self, s, value):
        # we first should see if we have a single Point
        # or an iterable of Points in value
        try:
            rhs = [Point(*pt) for pt in value]
            is_single = False
        except TypeError:
            # not a valid iterable of Points
            # maybe a single Point?
            try:
                rhs = Point(*value)
                is_single = True
            except TypeError:
                # still no go
                raise TypeError('Invalid Point or iterable of Points')
        
        # reached here, so rhs is either an iterable of Points, or a Point
        # we want to make sure we are assigning to a slice only if we 
        # have an iterable of points, and assigning to an index if we 
        # have a single Point only
        if (isinstance(s, int) and is_single) \
            or isinstance(s, slice) and not is_single:
            self._pts[s] = rhs
        else:
            raise TypeError('Incompatible index/slice assignment')
                
    def __add__(self, pt):
        if isinstance(pt, Polygon):
            new_pts = self._pts + pt._pts
            return Polygon(*new_pts)
        else:
            raise TypeError('can only concatenate with another Polygon')

    def append(self, pt):
        self._pts.append(Point(*pt))
        
    def extend(self, pts):
        if isinstance(pts, Polygon):
            self._pts = self._pts + pts._pts
        else:
            # assume we are being passed an iterable containing Points
            # or something compatible with Points
            points = [Point(*pt) for pt in pts]
            self._pts = self._pts + points
    
    def __iadd__(self, pts):
        self.extend(pts)
        return self
    
    def insert(self, i, pt):
        self._pts.insert(i, Point(*pt))
        
    def __delitem__(self, s):
        del self._pts[s]

In [181]:
p = Polygon(*zip(range(6), range(6)))

In [182]:
p

Polygon(Point(x=0, y=0), Point(x=1, y=1), Point(x=2, y=2), Point(x=3, y=3), Point(x=4, y=4), Point(x=5, y=5))

In [183]:
del p[0]

In [184]:
p

Polygon(Point(x=1, y=1), Point(x=2, y=2), Point(x=3, y=3), Point(x=4, y=4), Point(x=5, y=5))

In [185]:
del p[-1]

In [186]:
p

Polygon(Point(x=1, y=1), Point(x=2, y=2), Point(x=3, y=3), Point(x=4, y=4))

In [187]:
del p[0:2]

In [188]:
p

Polygon(Point(x=3, y=3), Point(x=4, y=4))

Now, we just have to implement `pop`:

In [189]:
class Polygon:
    def __init__(self, *pts):
        if pts:
            self._pts = [Point(*pt) for pt in pts]
        else:
            self._pts = []
            
    def __repr__(self):
        pts_str = ', '.join([str(pt) for pt in self._pts])
        return f'Polygon({pts_str})'
    
    def __len__(self):
        return len(self._pts)
    
    def __getitem__(self, s):
        return self._pts[s]
    
    def __setitem__(self, s, value):
        # we first should see if we have a single Point
        # or an iterable of Points in value
        try:
            rhs = [Point(*pt) for pt in value]
            is_single = False
        except TypeError:
            # not a valid iterable of Points
            # maybe a single Point?
            try:
                rhs = Point(*value)
                is_single = True
            except TypeError:
                # still no go
                raise TypeError('Invalid Point or iterable of Points')
        
        # reached here, so rhs is either an iterable of Points, or a Point
        # we want to make sure we are assigning to a slice only if we 
        # have an iterable of points, and assigning to an index if we 
        # have a single Point only
        if (isinstance(s, int) and is_single) \
            or isinstance(s, slice) and not is_single:
            self._pts[s] = rhs
        else:
            raise TypeError('Incompatible index/slice assignment')
                
    def __add__(self, pt):
        if isinstance(pt, Polygon):
            new_pts = self._pts + pt._pts
            return Polygon(*new_pts)
        else:
            raise TypeError('can only concatenate with another Polygon')

    def append(self, pt):
        self._pts.append(Point(*pt))
        
    def extend(self, pts):
        if isinstance(pts, Polygon):
            self._pts = self._pts + pts._pts
        else:
            # assume we are being passed an iterable containing Points
            # or something compatible with Points
            points = [Point(*pt) for pt in pts]
            self._pts = self._pts + points
    
    def __iadd__(self, pts):
        self.extend(pts)
        return self
    
    def insert(self, i, pt):
        self._pts.insert(i, Point(*pt))
        
    def __delitem__(self, s):
        del self._pts[s]
        
    def pop(self, i):
        return self._pts.pop(i)

In [190]:
p = Polygon(*zip(range(6), range(6)))

In [191]:
p

Polygon(Point(x=0, y=0), Point(x=1, y=1), Point(x=2, y=2), Point(x=3, y=3), Point(x=4, y=4), Point(x=5, y=5))

In [192]:
p.pop(1)

Point(x=1, y=1)

In [193]:
p

Polygon(Point(x=0, y=0), Point(x=2, y=2), Point(x=3, y=3), Point(x=4, y=4), Point(x=5, y=5))
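We wrote every method by hand here, which is useful for seeing how the pieces fit together. As an alternative, the standard library's `collections.abc.MutableSequence` abstract base class can supply `append`, `extend`, `pop`, `reverse`, `__iadd__`, `__contains__`, iteration and more, once we provide `__getitem__`, `__setitem__`, `__delitem__`, `__len__` and `insert`. The sketch below is only an illustration under some assumptions: `PolygonABC` is a hypothetical name, a `namedtuple` stands in for `Point`, and `__setitem__` is kept deliberately simple (no friendly error messages).

```python
from collections import namedtuple
from collections.abc import MutableSequence

# Stand-in for the Point class defined earlier (assumption: no validation)
Point = namedtuple('Point', 'x y')


class PolygonABC(MutableSequence):
    """Hypothetical polygon that inherits the mixin methods."""

    def __init__(self, *pts):
        self._pts = [Point(*pt) for pt in pts]

    def __repr__(self):
        pts_str = ', '.join(str(pt) for pt in self._pts)
        return f'PolygonABC({pts_str})'

    # the five methods MutableSequence requires us to implement
    def __len__(self):
        return len(self._pts)

    def __getitem__(self, s):
        return self._pts[s]

    def __setitem__(self, s, value):
        if isinstance(s, slice):
            self._pts[s] = [Point(*pt) for pt in value]
        else:
            self._pts[s] = Point(*value)

    def __delitem__(self, s):
        del self._pts[s]

    def insert(self, i, pt):
        self._pts.insert(i, Point(*pt))


p = PolygonABC((0, 0), (1, 1))
p.append((2, 2))               # mixin method, built on insert
p.extend([(3, 3), (4, 4)])     # mixin method, built on append
p += [(5, 5)]                  # mixin __iadd__, built on extend
popped = p.pop()               # mixin method, index defaults to -1
p.reverse()                    # mixin method, built on __setitem__
print(popped)
print(p)
```

The trade-off is less control: the inherited `extend` appends one element at a time, for example, which may or may not matter for a given use.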

##  Sorting Sequences

Just like with the concatenation and in-place concatenation we saw previously, we have two different ways of sorting a mutable sequence:

* returning a new sorted sequence
* in-place sorting (mutating sequence) - obviously this works for mutable sequence types only!


For any iterable, the built-in `sorted` function will return a **list** containing the sorted elements of the iterable.

So a few things here: 
* any iterable can be sorted (as long as it is finite)
* the elements must be pair-wise comparable (possibly indirectly via a sort key)
* the returned result is always a list
* the original iterable is not mutated

In addition:
* optionally specify a `key` - a function that extracts a comparison key from each element. If no key is specified, Python uses the natural ordering of the elements (defined by `__lt__`, `__gt__`, etc.), so the sort fails if the elements do not define one!
* optionally specify the `reverse` argument, which returns the elements in descending sort order
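The points above can be checked with a short sketch (the variable names are just for illustration):

```python
data = (3, 1, 2)

# the result is always a list, and the original is not mutated
result = sorted(data)
print(type(result), result, data)

# any finite iterable works, including a generator
squares = sorted(x * x for x in range(-2, 3))
print(squares)

# the key function is applied to each element to obtain its sort key
by_abs = sorted([-3, 1, -2], key=abs)
print(by_abs)

# reverse=True gives a descending sort
descending = sorted(data, reverse=True)
print(descending)

# elements that cannot be compared with each other make the sort fail
try:
    sorted([1, 'a'])
    mixed_failed = False
except TypeError:
    mixed_failed = True
print(mixed_failed)
```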

Numbers have a natural ordering for example, so sorting an iterable of numbers is easy:

In [1]:
t = 10, 3, 5, 8, 9, 6, 1
sorted(t)

[1, 3, 5, 6, 8, 9, 10]

As you can see we sorted a `tuple` and got a `list` back.

We can sort non-sequence iterables too:

In [2]:
s = {10, 3, 5, 8, 9, 6, 1}
sorted(s)

[1, 3, 5, 6, 8, 9, 10]

For things like dictionaries, this works slightly differently. Remember what happens when we iterate a dictionary?

In [8]:
d = {3: 100, 2: 200, 1: 10}
for item in d:
    print(item)

3
2
1


We actually are iterating the keys.

Same thing happens with sorting - we'll end up just sorting the keys:

In [9]:
d = {3: 100, 2: 200, 1: 10}
sorted(d)

[1, 2, 3]

But what if we wanted to sort the dictionary keys based on the values instead?

This is where the `key` argument of `sorted` will come in handy.

We are going to specify to the `sorted` function that it should use the value of each item to use as a sort key:

In [11]:
d = {'a': 100, 'b': 50, 'c': 10}
sorted(d, key=lambda k: d[k])

['c', 'b', 'a']

Basically the `key` argument was called on every item being sorted - these items were the keys of the dictionary: `a`, `b`, `c`.
For every key it used the result of the lambda as the sorting key:

dictionary keys --> sorting key:
* `a --> 100`
* `b --> 50`
* `c --> 10`

Hence the sort order was 10, 50, 100, which means `c, b, a`
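Two closely related variations may be handy here, shown as a small sketch: passing the dictionary's own `get` method as the key, and sorting the `(key, value)` pairs instead of just the keys.

```python
d = {'a': 100, 'b': 50, 'c': 10}

# d.get(k) returns d[k], so it can serve directly as the key function
keys_by_value = sorted(d, key=d.get)
print(keys_by_value)

# sort the (key, value) pairs by the value, i.e. element 1 of each pair
items_by_value = sorted(d.items(), key=lambda item: item[1])
print(items_by_value)

# dictionaries keep insertion order (Python 3.7+), so this dictionary
# iterates in order of increasing value
sorted_dict = dict(items_by_value)
print(sorted_dict)
```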

Here's a different example, where we want to sort strings, not based on the lexicographic ordering, but based on the length of the string.

We can easily do this as follows:

In [12]:
t = 'this', 'parrot', 'is', 'a', 'late', 'bird'
sorted(t)

['a', 'bird', 'is', 'late', 'parrot', 'this']

As you can see the natural ordering for strings was used here, but we can change the behavior by specifying the sort key:

Remember that the `key` is a function that receives the item being sorted, and should return something (else usually!) that we want to use as the sort key. We use lambdas, but you can also use a straight `def` function too:

In [13]:
def sort_key(s):
    return len(s)

In [14]:
sorted(t, key=sort_key)

['a', 'is', 'this', 'late', 'bird', 'parrot']

or, using a lambda:

In [15]:
sorted(t, key=lambda s: len(s))

['a', 'is', 'this', 'late', 'bird', 'parrot']

#### Stable Sorting

You might have noticed that the words `this`,  `late` and `bird` all have four characters - so how did Python determine which one should come first? Randomly? No!

The sort algorithm that Python uses, called the *TimSort* (named after Python core developer Tim Peters - yes, the same Tim Peters that wrote the Zen of Python!!), is what is called a **stable** sort algorithm.

This means that items with equal sort keys maintain their relative position.

But first:

In [16]:
import this

The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!


If you haven't read this in a while, take a few minutes now to do so again!

Now back to stable sorting:

In [20]:
t = 'aaaa', 'bbbb', 'cccc', 'dddd', 'eeee'

In [21]:
sorted(t, key = lambda s: len(s))

['aaaa', 'bbbb', 'cccc', 'dddd', 'eeee']

Now let's change our tuple a bit:

In [22]:
t = 'bbbb', 'cccc', 'aaaa', 'eeee', 'dddd'

In [23]:
sorted(t, key = lambda s: len(s))

['bbbb', 'cccc', 'aaaa', 'eeee', 'dddd']

As you can see, when the sort keys are equal (they are all equal to 4), the original ordering of the iterable is preserved.

So in our original example:

In [24]:
t = 'this', 'parrot', 'is', 'a', 'late', 'bird'

In [25]:
sorted(t, key = lambda s: len(s))

['a', 'is', 'this', 'late', 'bird', 'parrot']

So, `this`, will come before `late` which will come before `bird`.

If we change it up a bit:

In [26]:
t = 'this', 'bird', 'is', 'a', 'late', 'parrot'
sorted(t, key = lambda s: len(s))

['a', 'is', 'this', 'bird', 'late', 'parrot']

you'll notice that now `bird` ends up before `late`.
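One practical consequence of stability is that we can sort in several passes: sort by the secondary criterion first, then by the primary one, and ties in the primary key keep the order produced by the first pass. The sketch below compares that with the equivalent single pass using a tuple as the sort key:

```python
words = ['this', 'bird', 'is', 'a', 'late', 'parrot']

# pass 1: secondary criterion (alphabetical)
alphabetical = sorted(words)

# pass 2: primary criterion (length); words of equal length keep
# the alphabetical order from pass 1 because the sort is stable
by_len_then_alpha = sorted(alphabetical, key=len)
print(by_len_then_alpha)

# the same result in one pass, using a tuple as the sort key
single_pass = sorted(words, key=lambda s: (len(s), s))
print(single_pass)
```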

So this `key` argument makes the `sorted` function extremely flexible. We can now even sort objects that are not even comparable!

In [27]:
c1 = 10 + 2j
c2 = 5 - 3j

In [28]:
c1 < c2

TypeError: '<' not supported between instances of 'complex' and 'complex'

As you can see, we do not have an ordering defined for complex numbers.

But we may want to sort a sequence of complex numbers based on their distance from the origin:

In [30]:
t = 0, 10+10j, 3-3j, 4+4j, 5-2j

We can easily calculate the distance from the origin by using the `abs` function:

In [33]:
abs(3+4j)

5.0

So now we can use that as a sort key:

In [34]:
sorted(t, key=abs)

[0, (3-3j), (5-2j), (4+4j), (10+10j)]

Of course, you could decide to sort based on the imaginary component instead:

In [36]:
sorted(t, key=lambda c: c.imag)

[(3-3j), (5-2j), 0, (4+4j), (10+10j)]

#### Reversed Sort

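We also have the `reverse` keyword-only argument. With `reverse=True` the elements are sorted in descending order of their sort keys. Note that this is not quite the same as sorting and then reversing the result: elements with equal keys still keep their original relative order, as `this`, `bird` and `late` do here: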

In [37]:
t = 'this', 'bird', 'is', 'a', 'late', 'parrot'

In [38]:
sorted(t, key=lambda s: len(s))

['a', 'is', 'this', 'bird', 'late', 'parrot']

In [40]:
sorted(t, key=lambda s: len(s), reverse=True)

['parrot', 'this', 'bird', 'late', 'is', 'a']

Of course in this case we could have done it this way too:

In [41]:
sorted(t, key=lambda s: -len(s))

['parrot', 'this', 'bird', 'late', 'is', 'a']
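It is worth comparing these two approaches with simply reversing the sorted result. The sort stays stable with `reverse=True`, so the four-letter words keep their original relative order, whereas reversing the list afterwards flips them. A short sketch:

```python
t = ('this', 'bird', 'is', 'a', 'late', 'parrot')

# descending by length, ties keep their original order
with_reverse = sorted(t, key=len, reverse=True)
print(with_reverse)

# negating the key gives the same result for numeric keys
negated_key = sorted(t, key=lambda s: -len(s))
print(negated_key)

# reversing an ascending sort also reverses the order of the ties
flipped = sorted(t, key=len)[::-1]
print(flipped)
```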

#### In-Place Sorting

So far we have seen the `sorted` function - it returns a new list containing the sorted elements, and the original iterable remains the same.

But mutable sequence types, such as lists, also implement in-place sorting - where the original list is sorted (the memory address does not change, the object is actually mutated).

The syntax for calling the `sort` method is identical to that of the `sorted` function (apart from the iterable, which is the list itself), and it is implemented using the same TimSort algorithm.

Of course, this will not work with tuples, which are immutable.

In [42]:
l = ['this', 'bird', 'is', 'a', 'late', 'parrot']

In [43]:
id(l)

1437890262984

In [44]:
sorted(l, key=lambda s: len(s))

['a', 'is', 'this', 'bird', 'late', 'parrot']

In [46]:
l, id(l)

(['this', 'bird', 'is', 'a', 'late', 'parrot'], 1437890262984)

As you can see, the list `l` was not mutated and is still the same object.

But this way is different:

In [48]:
result = l.sort(key=lambda s: len(s))

First, the `sort` **method** does not return anything:

In [49]:
type(result)

NoneType

and the original list is still the same object:

In [50]:
id(l)

1437890262984

but it has mutated:

In [51]:
l

['a', 'is', 'this', 'bird', 'late', 'parrot']

That's really the only fundamental difference between the two sorts - one is in-place, while the other is not.
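A side-by-side sketch of the two, including the easy-to-make mistake of keeping the return value of `sort`, and the usual way to get a sorted tuple:

```python
words = ['this', 'bird', 'is', 'a', 'late', 'parrot']

# sorted() builds a new list and leaves the original alone
by_length = sorted(words, key=len)
print(words[0])        # still 'this'

# list.sort() mutates the list and returns None, so its
# return value should not be assigned back to the list name
returned = words.sort(key=len)
print(returned)
print(words)

# tuples have no sort method; sorted() plus tuple() gives a sorted tuple
t = ('b', 'c', 'a')
sorted_tuple = tuple(sorted(t))
print(sorted_tuple)
```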

You might be wondering if one is more efficient than the other. 

As far as algorithms go, they are the same, so no difference there (one sort is not more efficient than the other). 

But `list.sort()` will be faster than `sorted()` because it does not have to create a copy of the sequence. 

Of course, for iterables other than lists, you don't have much of a choice, and need to use `sorted` anyways.

Let's try timing this a bit to see if we can see the difference:

In [77]:
from timeit import timeit
import random

In [95]:
random.seed(0)
n = 10_000_000
l = [random.randint(0, 100) for _ in range(n)]

This produces a list of `n` random integers between 0 and 100. 

If you're wondering about what the seed does, look at my video on random seeds in Part 1|Extras of this course - basically it makes sure I will generate the same random sequence every time.

If you're unsure about the `timeit` module, again I have a video on that in Part 1|Extras of this course.

Now, I'm only going to run the tests once, because when using in-place sorting of `l` we'll end up sorting an already sorted list - and that may very well affect the timing...

In [96]:
timeit(stmt='sorted(l)', globals=globals(), number=1)

2.1579852871381036

In [97]:
timeit(stmt='l.sort()', globals=globals(), number=1)

2.088879100541135

As you can see, the time difference between the two methods, even for `n=10_000_000` is quite small.

I also just want to point out that sorting a list that is already sorted results in much better performance!

In [99]:
random.seed(0)
n = 10_000_000
l = [random.randint(0, 100) for _ in range(n)]
timeit(stmt='l.sort()', globals=globals(), number=1)

2.0949547388245264

So now `l` is sorted, and if we re-run the sort on it (either method), here's what we get:

In [100]:
timeit(stmt='sorted(l)', globals=globals(), number=1)

0.1799462218783674

In [101]:
timeit(stmt='l.sort()', globals=globals(), number=1)

0.11247461711673168

Substantially faster!!

Hence why I only timed using a single iteration...
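
If you are curious why an already sorted list sorts so much faster: CPython's sort (Timsort) looks for runs of elements that are already in order, so sorted input needs far fewer comparisons. Here is a rough sketch that counts the comparisons instead of timing them. The `Counted` wrapper and `count_comparisons` helper are names I made up for this illustration, and the exact counts may vary between Python versions:

```python
import random

class Counted:
    """Hypothetical wrapper that counts how often `<` is evaluated."""
    calls = 0

    def __init__(self, val):
        self.val = val

    def __lt__(self, other):
        Counted.calls += 1
        return self.val < other.val

def count_comparisons(values):
    # reset the counter, sort wrapped copies, and report the number of `<` calls
    Counted.calls = 0
    sorted(Counted(v) for v in values)
    return Counted.calls

random.seed(0)
n = 1_000
shuffled = random.sample(range(n), n)   # n distinct values in random order
ordered = sorted(shuffled)              # the same values, already sorted

shuffled_calls = count_comparisons(shuffled)
ordered_calls = count_comparisons(ordered)
print(shuffled_calls, ordered_calls)
```

The second number should be much smaller than the first, which matches the timings we saw above.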

#### Natural Ordering for Custom Classes

I just want to quickly show you that in order to have a "natural ordering" for our custom classes, we just need to implement the `<` or `>` operators. (I discuss these operators in Part 1 of this course)

In [1]:
class MyClass:
    def __init__(self, name, val):
        self.name = name
        self.val = val
        
    def __repr__(self):
        return f'MyClass({self.name}, {self.val})'
    
    def __lt__(self, other):
        return self.val < other.val

In [2]:
c1 = MyClass('c1', 20)
c2 = MyClass('c2', 10)
c3 = MyClass('c3', 20)
c4 = MyClass('c4', 10)

Now we can sort those objects, without specifying a key, since that class has a natural ordering (`<` in this case). Moreover, notice that the sort is stable.

In [4]:
sorted([c1, c2, c3, c4])

[MyClass(c2, 10), MyClass(c4, 10), MyClass(c1, 20), MyClass(c3, 20)]

In fact, we can modify our class slightly so we can see that `sorted` is calling our `__lt__` method repeatedly to perform the sort:

In [8]:
class MyClass:
    def __init__(self, name, val):
        self.name = name
        self.val = val
        
    def __repr__(self):
        return f'MyClass({self.name}, {self.val})'
    
    def __lt__(self, other):
        print(f'called {self.name} < {other.name}')
        return self.val < other.val

In [9]:
c1 = MyClass('c1', 20)
c2 = MyClass('c2', 10)
c3 = MyClass('c3', 20)
c4 = MyClass('c4', 10)

In [10]:
sorted([c1, c2, c3, c4])

called c2 < c1
called c3 < c2
called c3 < c1
called c4 < c1
called c4 < c2


[MyClass(c2, 10), MyClass(c4, 10), MyClass(c1, 20), MyClass(c3, 20)]
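
I mentioned that implementing either `<` or `>` is enough. Here is a small sketch to back that up, using a made-up class `OnlyGt` that defines only `__gt__`. When Python evaluates `a < b` and `a` has no `__lt__`, it falls back to the reflected call `b.__gt__(a)`, so `sorted` still works:

```python
class OnlyGt:
    """Hypothetical class that defines `>` but not `<`."""
    def __init__(self, name, val):
        self.name = name
        self.val = val

    def __repr__(self):
        return f'OnlyGt({self.name}, {self.val})'

    def __gt__(self, other):
        return self.val > other.val

items = [OnlyGt('a', 20), OnlyGt('b', 10), OnlyGt('c', 20), OnlyGt('d', 10)]
result = sorted(items)
print(result)
# [OnlyGt(b, 10), OnlyGt(d, 10), OnlyGt(a, 20), OnlyGt(c, 20)]
```

Items with equal values keep their original relative order, so the sort is still stable.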

##  List Comprehensions

We've used list comprehensions throughout this course quite a bit, so the concept should not be new, but let's recap quickly what we have seen so far with list comprehensions.

A list comprehension is a language construct that allows us to easily build a list by transforming, and optionally filtering, another iterable.

For example, using a more traditional Java style approach we might create a list of squares of the first 100 positive integers in this way:

In [1]:
squares = []  # create an empty list
for i in range(1, 101):
    squares.append(i**2)

We now have a list containing the desired numbers:

In [2]:
squares[0:10]

[1, 4, 9, 16, 25, 36, 49, 64, 81, 100]

Using a list comprehension we can achieve the same results in a far more expressive way:

In [3]:
squares = [i**2 for i in range(1, 101)]

In [4]:
squares[0:10]

[1, 4, 9, 16, 25, 36, 49, 64, 81, 100]

When building a list from another iterable we may sometimes want to skip certain values.

For example, we may want to build a list of squares for even positive integers only, up to 100.

The more traditional way would go like this:

In [5]:
squares = []
for i in range(1, 101):
    if i % 2 == 0:
        squares.append(i**2)

In [6]:
squares[0:10]

[4, 16, 36, 64, 100, 144, 196, 256, 324, 400]

We can also use a list comprehension to achieve the same thing:

In [7]:
squares = [i**2 for i in range(1, 101) if i % 2 == 0]

In [8]:
squares[0:10]

[4, 16, 36, 64, 100, 144, 196, 256, 324, 400]

Although I have been writing the list comprehension on a single line, we can write them over multiple lines if we prefer:

In [9]:
squares = [i**2
          for i in range(1, 101)
          if i % 2 == 0]

In [10]:
squares[0:10]

[4, 16, 36, 64, 100, 144, 196, 256, 324, 400]

#### Internal Mechanics of List Comprehensions

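As we discussed in the lecture, we need to recognize that list comprehensions are essentially temporary functions: Python creates the function, executes it, and returns the resulting list. (This is how CPython compiled comprehensions up to version 3.11, which is what the disassembly below shows. From Python 3.12 on, comprehensions are inlined into the enclosing code, but the loop variable is still kept local to the comprehension.)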

We can see this by compiling a comprehension, and then disassembling the compiled code to see what happened:

In [11]:
import dis

In [12]:
compiled_code = compile('[i**2 for i in (1, 2, 3)]', 
                        filename='', mode='eval')

In [13]:
dis.dis(compiled_code)

  1           0 LOAD_CONST               0 (<code object <listcomp> at 0x000001F77210ED20, file "", line 1>)
              2 LOAD_CONST               1 ('<listcomp>')
              4 MAKE_FUNCTION            0
              6 LOAD_CONST               5 ((1, 2, 3))
              8 GET_ITER
             10 CALL_FUNCTION            1
             12 RETURN_VALUE


As you can see, in step 4, Python created a function (`MAKE_FUNCTION`), called it (`CALL_FUNCTION`), and then returned the result (`RETURN_VALUE`) in the last step.

So, comprehensions will behave like functions in terms of **scope**. They have local scope, and can access global and nonlocal scopes too. And nested comprehensions will also behave like nested functions and closures.

#### Nested Comprehensions

Let's look at a simple example that uses nested comprehensions.

For example, suppose we want to generate a multiplication table:

The traditional way first:

In [14]:
table = []
for i in range(1, 11):
    row = []
    for j in range(1, 11):
        row.append(i*j)
    table.append(row)

In [15]:
table

[[1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
 [2, 4, 6, 8, 10, 12, 14, 16, 18, 20],
 [3, 6, 9, 12, 15, 18, 21, 24, 27, 30],
 [4, 8, 12, 16, 20, 24, 28, 32, 36, 40],
 [5, 10, 15, 20, 25, 30, 35, 40, 45, 50],
 [6, 12, 18, 24, 30, 36, 42, 48, 54, 60],
 [7, 14, 21, 28, 35, 42, 49, 56, 63, 70],
 [8, 16, 24, 32, 40, 48, 56, 64, 72, 80],
 [9, 18, 27, 36, 45, 54, 63, 72, 81, 90],
 [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]]

We can easily do the same thing using a list comprehension:

In [1]:
table2 = [ [i * j for j in range(1, 11)] 
          for i in range(1, 11)]

In [2]:
table2

[[1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
 [2, 4, 6, 8, 10, 12, 14, 16, 18, 20],
 [3, 6, 9, 12, 15, 18, 21, 24, 27, 30],
 [4, 8, 12, 16, 20, 24, 28, 32, 36, 40],
 [5, 10, 15, 20, 25, 30, 35, 40, 45, 50],
 [6, 12, 18, 24, 30, 36, 42, 48, 54, 60],
 [7, 14, 21, 28, 35, 42, 49, 56, 63, 70],
 [8, 16, 24, 32, 40, 48, 56, 64, 72, 80],
 [9, 18, 27, 36, 45, 54, 63, 72, 81, 90],
 [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]]

You'll notice here that we nested one list comprehension inside another.

You should also notice that the inner comprehension (the one that has `i*j`) is accessing its own local variable `j`, as well as a variable from the enclosing comprehension - the `i` variable. Just like a closure! And in fact, it is exactly that. We'll come back to that in a bit.

Let's do another example - we'll construct Pascal's triangle - which is basically just a triangle of binomial coefficients:

```
1
1 1
1 2 1
1 3 3 1
1 4 6 4 1
```

we just need to know how to calculate combinations:
```
C(n, k) = n! / (k! (n-k)!)
```

* row 0, column 0: n=0, k=0: c(0, 0) = 0! / (0! 0!) = 1/1 = 1
* row 4, column 2: n=4, k=2: c(4, 2) = 4! / (2! 2!) = (4x3x2) / (2x2) = 6

In other words, we need to calculate the following list of lists:
```
c(0,0)
c(1,0) c(1,1)
c(2,0) c(2,1) c(2,2)
c(3,0) c(3,1) c(3,2) c(3,3)
...
```

We can use a nested comprehension for that!

In [18]:
from math import factorial

def combo(n, k):
    return factorial(n) // (factorial(k) * factorial(n-k))

size = 10  # global variable
pascal = [ [combo(n, k) for k in range(n+1)] for n in range(size+1) ]

In [19]:
pascal

[[1],
 [1, 1],
 [1, 2, 1],
 [1, 3, 3, 1],
 [1, 4, 6, 4, 1],
 [1, 5, 10, 10, 5, 1],
 [1, 6, 15, 20, 15, 6, 1],
 [1, 7, 21, 35, 35, 21, 7, 1],
 [1, 8, 28, 56, 70, 56, 28, 8, 1],
 [1, 9, 36, 84, 126, 126, 84, 36, 9, 1],
 [1, 10, 45, 120, 210, 252, 210, 120, 45, 10, 1]]

Again note how the outer comprehension accessed a global variable (`size`), created a local variable (`n`), and the inner comprehension created its own local variable (`k`) and also accessed the nonlocal variable `n`.
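
As a side note, if you are running Python 3.8 or later, the standard library already provides `math.comb`, so the `combo` helper is not strictly needed. This is just an alternative sketch and produces the same triangle:

```python
from math import comb  # available in Python 3.8 and later

size = 10  # global variable, same role as before
pascal = [[comb(n, k) for k in range(n + 1)] for n in range(size + 1)]

print(pascal[4])       # [1, 4, 6, 4, 1]
print(pascal[10][5])   # 252
```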

#### Nested Loops

We can also create comprehensions that use nested loops (not nested comprehensions, just nested loops).

Let's start with a simple example.

Suppose we have two lists of characters, and we want to produce a new list consisting of the pairwise concatenated characters.

e.g. 
`l1 = ['a', 'b', 'c']`

`l2 = ['x', 'y', 'z']`

and we want to produce the result:

`['ax', 'ay', 'az', 'bx', 'by', 'bz', 'cx', 'cy', 'cz']`


The traditional way first:

In [20]:
l1 = ['a', 'b', 'c']
l2 = ['x', 'y', 'z']
result = []
for s1 in l1:
    for s2 in l2:
        result.append(s1+s2)


In [21]:
result

['ax', 'ay', 'az', 'bx', 'by', 'bz', 'cx', 'cy', 'cz']

We can do the same nested loop using a comprehension instead:

In [22]:
result = [s1 + s2 for s1 in l1 for s2 in l2]

In [23]:
result

['ax', 'ay', 'az', 'bx', 'by', 'bz', 'cx', 'cy', 'cz']
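
One thing worth spelling out: the `for` clauses in the comprehension are read left to right in the same order as the nested loops, so the first `for` is the outer loop. A short sketch showing what happens if we swap them:

```python
l1 = ['a', 'b', 'c']
l2 = ['x', 'y', 'z']

# first `for` is the outer loop, second `for` is the inner loop
outer_l1 = [s1 + s2 for s1 in l1 for s2 in l2]

# swapping the clauses makes l2 the outer loop instead
outer_l2 = [s1 + s2 for s2 in l2 for s1 in l1]

print(outer_l1)  # ['ax', 'ay', 'az', 'bx', 'by', 'bz', 'cx', 'cy', 'cz']
print(outer_l2)  # ['ax', 'bx', 'cx', 'ay', 'by', 'cy', 'az', 'bz', 'cz']
```

Same elements, but in a different order.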

We could expand this slightly by specifying that pairs resulting in the same letter twice should be omitted:

In [24]:
l1 = ['a', 'b', 'c']
l2 = ['b', 'c', 'd']

In [25]:
result = []
for s1 in l1:
    for s2 in l2:
        if s1 != s2:
            result.append(s1 + s2)

In [26]:
result

['ab', 'ac', 'ad', 'bc', 'bd', 'cb', 'cd']

And the comprehension equivalent:

In [27]:
result = [s1 + s2 for s1 in l1 for s2 in l2 if s1 != s2]

In [28]:
result

['ab', 'ac', 'ad', 'bc', 'bd', 'cb', 'cd']

Building up the complexity, let's see how we might reproduce the `zip` function.

Remember what the `zip` function does:

In [29]:
l1 = [1, 2, 3, 4, 5, 6, 7, 8, 9]
l2 = ['a', 'b', 'c', 'd']
list(zip(l1, l2))

[(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')]

We can do the same thing using a traditional nested loop:

In [30]:
result = []
for index_1, item_1 in enumerate(l1):
    for index_2, item_2 in enumerate(l2):
        if index_1 == index_2:
            result.append((item_1, item_2))

In [31]:
result

[(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')]

But we can do this using a list comprehension as well:

In [32]:
result = [ (item_1, item_2)
         for index_1, item_1 in enumerate(l1)
         for index_2, item_2 in enumerate(l2)
         if index_1 == index_2]

In [33]:
result

[(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')]

Of course, using `zip` is way simpler!

List comprehensions can also be quite handy when used in conjunction with functions such as `sum` for example.

Suppose we have two n-dimensional vectors, represented as tuples of numbers, and we want to find the dot product of the two vectors:

`v1 = (c1, c2, c3, ..., cn)`

`v2 = (d1, d2, d3, ..., dn)`

Then, the dot product is:

`c1 * d1 + c2 * d2 + ... + cn * dn`

The trick here is that we want to step through both vectors at the same time (a simple nested loop would not work), so a Java-like approach might be:

In [34]:
v1 = (1, 2, 3, 4, 5, 6)
v2 = (10, 20, 30, 40, 50, 60)

In [35]:
dot = 0
for i in range(len(v1)):
    dot += (v1[i] * v2[i])
print(dot)

910


But using zip and a list comprehension we can do it this way:

In [36]:
dot = sum([i * j for i, j in zip(v1, v2)])
print(dot)

910


In fact, and we'll cover this later in generator expressions, we don't even need the `[]`:

In [37]:
dot = sum(i * j for i, j in zip(v1, v2))
print(dot)

910
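
One caveat with the `zip` approach: `zip` stops at the shortest iterable, so vectors of different lengths are silently truncated instead of raising an error. If that matters, we could wrap the calculation in a small function. The `dot` function below is just a sketch of my own, not something from the standard library:

```python
def dot(v1, v2):
    """Hypothetical helper: dot product of two equal-length vectors."""
    if len(v1) != len(v2):
        raise ValueError('vectors must have the same length')
    return sum(a * b for a, b in zip(v1, v2))

result = dot((1, 2, 3, 4, 5, 6), (10, 20, 30, 40, 50, 60))
print(result)  # 910

# without the length check, the extra component is silently ignored
truncated = sum(a * b for a, b in zip((1, 2, 3), (10, 20)))
print(truncated)  # 50
```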


#### Things to watch out for

There are a few things we have to be careful with, and that relates to the scope of variables used inside a comprehension.

Let's first make sure we don't have the `number` symbol in our global scope:

In [38]:
if 'number' in globals():
    del number

In [39]:
l = [number**2 for number in range(5)]
print(l)

[0, 1, 4, 9, 16]


What was the scope of `number`?

In [40]:
'number' in globals()

False

As you can see, `number` was local to the comprehension, not the enclosing (global in this case) scope.

But what if `number` was in our global scope:

In [41]:
number = 100

In [42]:
l = [number**2 for number in range(5)]

In [43]:
number

100

As you can see, `number` in the comprehension was still local to the comprehension, and our global `number` was not affected. 

This is similar to global and nonlocal variables in functions.

Because `number` is the loop item, it means that it gets *assigned* a value before being referenced, hence it is considered local - even if that symbol exists in a global or nonlocal scope.

On the other hand, consider this example:


In [44]:
number = 100
l = [number * i for i in range(5)]
print(l)

[0, 100, 200, 300, 400]


As you can see, the scope of the comprehension was able to reach out for `number` in the global scope. Same as functions.

Now let's look at an example we've seen before when we studied closures.

Suppose we want to generate a list of functions that will calculate powers of their argument, i.e. we want to define a bunch of functions

* `fn_1(arg) --> arg ** 1`
* `fn_2(arg) --> arg ** 2`
* `fn_3(arg) --> arg ** 3`
etc...

We could certainly define a bunch of functions one by one:

In [45]:
fn_0 = lambda x: x**0
fn_1 = lambda x: x**1
fn_2 = lambda x: x**2
fn_3 = lambda x: x**3
# etc

But this would be very tedious if we had to do it more than just a few times.

Instead, why don't we create those functions as lambdas and put them into a list where the index of the list will correspond to the power we are looking for?

Something like this if we were doing it manually:

In [46]:
funcs = [lambda x: x**0, lambda x: x**1, lambda x: x**2, lambda x: x**3]

Now we can call these functions this way:

In [47]:
print(funcs[0](10))
print(funcs[1](10))
print(funcs[2](10))
print(funcs[3](10))

1
10
100
1000


Now all we need to do is to create these functions using a loop - the traditional way first:

First let's make sure `i` is not in our global symbol table:

In [1]:
if 'i' in globals():
    del i

In [2]:
funcs = []
for i in range(6):
    funcs.append(lambda x: x**i)

And let's use them as before:

In [3]:
print(funcs[0](10))
print(funcs[1](10))
print(funcs[2](10))
print(funcs[3](10))

100000
100000
100000
100000


What happened?? It looks like every function is actually calculating `10**5`.

Let's break down what happened in the loop, but without using a loop.

First notice that `i` is now in our global symbol table:

In [4]:
print(i)

5


You'll also note that it has a value of `5` (from the last iteration that ran).

Now let's walk through what happened manually:

In the first iteration, the symbol `i` was created, and assigned a value of `0`:

In [50]:
i = 0
def fn_0(x):
    return x ** i

The `i` in `fn_0` is actually the global variable `i`.

For the next 'iteration' we increment `i` by `1`:

In [51]:
i=1
def fn_1(x):
    return x ** i

The `i` in `fn_1` is still the global variable `i`.

Now let's set `i` to something else:

In [52]:
i = 5

In [53]:
fn_0(10)

100000

In [54]:
fn_1(10)

100000

and if we change `i` again:

In [55]:
i = 10

In [56]:
fn_0(10)

10000000000

And this is **exactly** what happened in our loop based approach:

In [57]:
funcs = []
for i in range(6):
    funcs.append(lambda x: x**i)

When the loop ran, `i` was created in our **global** scope.

By the time the loop finished running, `i` was `5`:

In [58]:
print(i)

5


So when we call the functions, they are referencing the global variable `i` which is now set to `5`.

And precisely the same thing will happen if we use a comprehension instead:

Let's delete the global `i` symbol first:

In [61]:
del i

In [62]:
'i' in globals()

False

In [63]:
funcs = [lambda x: x**i for i in range(6)]

In [64]:
'i' in globals()

False

As we can see `i` is not in our globals, but `i` was a **local** variable in the list comprehension, and each function created in the comprehension is referencing the same `i` - it is local to the comprehension, and each lambda is therefore a closure with (the same) free variable `i`. And by the time the comprehension has finished running, `i` had a value of 5:

In [65]:
funcs[0](10), funcs[1](10)

(100000, 100000)
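
We can actually check that every lambda shares the same free variable by looking at the closure cells directly. In this sketch I wrap the comprehension in a function (the name `build` is just for illustration) so that it behaves the same way regardless of Python version:

```python
def build():
    # the comprehension's loop variable `i` is captured by every lambda
    return [lambda x: x**i for i in range(6)]

funcs = build()

# each lambda has exactly one free variable, stored in a closure cell
cells = [fn.__closure__[0] for fn in funcs]
free_vars = funcs[0].__code__.co_freevars
same_cell = all(cell is cells[0] for cell in cells)

print(free_vars)                # ('i',)
print(same_cell)                # True
print(cells[0].cell_contents)   # 5
```

All six functions point to the very same cell, and that cell holds the last value `i` was assigned.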

Can we somehow fix this problem?

Yes, and it relies on default values and when default values are calculated and stored with the function definition. Recall that default values are evaluated and stored with the function's definition **when the function is created (i.e. when the `def` statement or `lambda` expression is executed)**, not each time the function is called. Right now we are running into a problem because the free variable `i` is being evaluated inside each function's body at **run time**, when the function is called.

So, we can fix this by making the current value of `i` a parameter default of each lambda - this will get evaluated at the function's creation time - i.e. at each loop iteration:

In [66]:
funcs = [lambda x, pow=i: x**pow for i in range(6)]

In [67]:
funcs[0](10), funcs[1](10), funcs[2](10)

(1, 10, 100)

As you can see that solved the problem. But this relies on some pretty detailed understanding of Python's behavior, and it is better not to use such techniques - other people reading your code will find it confusing and will make the code much harder to understand.
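
One alternative that many people find easier to read is a small factory function: each call to the factory creates a new scope, so each returned function gets its own free variable. This is only a sketch, and the names `make_power` and `power` are my own:

```python
def make_power(exponent):
    """Hypothetical factory: each call creates a new scope with its own `exponent`."""
    def power(x):
        return x ** exponent
    return power

funcs = [make_power(i) for i in range(6)]

results = (funcs[0](10), funcs[1](10), funcs[2](10))
print(results)  # (1, 10, 100)
```

Here `i` is evaluated when `make_power(i)` is called, at each iteration, and the closure refers to `exponent`, not to the comprehension's `i`.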

We will come back to this comprehension syntax. We used it so far to create lists, but the same syntax will be used to create sets, dictionaries, and generators.

# Section 03 - Project 1

##  Project

In this project you are asked to create a sequence type that will return a series of (regular convex) Polygon objects.

Each polygon will be uniquely defined by:
* it is a regular convex polygon:
    * edges (sides) are all of equal length
    * angles between edges are all equal
* the center of the polygon is `(0,0)`
* the number of vertices (minimum `3`)
* the distance from the center to any vertex should be `R` units (this is sometimes described as the polygon having a *circumradius* of `R`)

The sequence should be finite - so creating an instance of this sequence will require passing in the number of polygons in the sequence to the initializer.

The Polygon objects should be immutable, as should the sequence itself.

In addition, each Polygon should have the following properties:
* number of vertices
* number of edges (sides)
* the edge length
* the apothem (distance from center to mid-point of any edge)
* surface area
* perimeter
* interior angle (angle between each edge) - in degrees
* supports equality based on # edges and circumradius
* supports ordering based on number of edges only

The sequence object should also have the following properties:

* should support fully-featured slicing and indexing (positive indices, negative indices, slicing, and extended slicing)
* should support the `len()` function
* should provide the polygon with the highest `area:perimeter` ratio

You will need to do a little bit of math for this project. The necessary formulas are included in the video.
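
For reference, here are the formulas written out, with `n` the number of vertices, `R` the circumradius, `s` the edge length and `a` the apothem. I have transcribed them from the solution code further down in this section, so treat them as a summary of that code, not as a replacement for the video:

```latex
\begin{aligned}
\text{interior angle} &= \frac{(n - 2) \cdot 180^\circ}{n} \\
s &= 2 R \sin\left(\frac{\pi}{n}\right) \\
a &= R \cos\left(\frac{\pi}{n}\right) \\
\text{area} &= \frac{1}{2}\, n\, s\, a \\
\text{perimeter} &= n\, s
\end{aligned}
```

As a sanity check, for a square (`n = 4`) with `R = 1` we get `s = sqrt(2)`, `a = sqrt(2) / 2`, and an area of `1/2 * 4 * sqrt(2) * sqrt(2) / 2 = 2`.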

##### Goal 1

Create a Polygon class with the properties defined above. The initializer for the class will need the number of vertices (or edges, same), and the circumradius (`R`).

Make sure you test all your methods and properties. (This is called unit testing)

##### Goal 2

Create a finite sequence type that is a sequence of Polygons starting with `3` vertices, up to, and including some maximum value `m` which will need to be passed to the initializer of the sequence type.

The value for the circumradius `R`, will also need to be provided to the initializer.

Again make sure you test your code!

##  Project Solution:  Goal 1

We need to create a Polygon class with the following properties:

* number of vertices `n` - passed to the initializer
* circumradius `R` - passed to the initializer
* number of edges
* number of sides
* interior angle (in degrees)
* side length
* apothem
* surface area
* perimeter
* supports equality based on number of vertices and circumradius
* supports `>` based on number of vertices

Let's start building our Polygon class.

Apart from number of edges / vertices (`n`) and circumradius (`R`), all the other properties are computed properties.

We will make our Polygon immutable (by basically making `n` and `R` "private" variables - by convention using the `_` prefix).
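
To see what this kind of "immutability" buys us, here is a minimal sketch using a made-up `Circle` class: a property without a setter cannot be assigned to, but the underscore attribute behind it is still reachable, so this is immutability by convention only:

```python
class Circle:
    """Hypothetical class used only to illustrate read-only properties."""
    def __init__(self, r):
        self._r = r

    @property
    def radius(self):
        return self._r

c = Circle(1)

error = None
try:
    c.radius = 10   # no setter was defined, so this raises AttributeError
except AttributeError as ex:
    error = ex
    print('cannot assign:', ex)

c._r = 10           # nothing prevents this; the underscore is only a convention
print(c.radius)     # 10
```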

In [1]:
import math

class Polygon:
    def __init__(self, n, R):
        self._n = n
        self._R = R
        
    def __repr__(self):
        return f'Polygon(n={self._n}, R={self._R})'
    
    @property
    def count_vertices(self):
        return self._n
    
    @property
    def count_edges(self):
        return self._n
    
    @property
    def circumradius(self):
        return self._R
    
    @property
    def interior_angle(self):
        return (self._n - 2) * 180 / n

    @property
    def side_length(self):
        return 2 * self._R * math.sin(math.pi / self._n)
    
    @property
    def apothem(self):
        return self._R * math.cos(math.pi / self._n)
    
    @property
    def area(self):
        return self._n / 2 * self.side_length * self.apothem
    
    @property
    def perimeter(self):
        return self._n * self.side_length

Let's make sure everything works as expected.

To do that we are going to use the fact that we have pre-calculated what some results should evaluate to, and we'll make sure they match.

For example:
* the side length of a square whose circumradius is `1`, should be `sqrt(2)`
* the area of a square whose circumradius is `1`, should be `2`

Let me show you the `assert` statement - used extensively for unit testing:

In [2]:
assert 1 == 1

No output...

In [3]:
assert 1 > 10

AssertionError: 

We get an assertion error.

We can even specify what should be in the message of the assertion exception:

In [4]:
assert 1 > 10, '1 is not greater than 10'

AssertionError: 1 is not greater than 10

Let's start with just testing the representation of our Polygon:

In [5]:
def test_polygon():
    n=3
    R=1
    p = Polygon(n, R)
    assert str(p) == f'Polygon(n=3,R=1)', f'actual: {str(p)}'

In [6]:
test_polygon()

AssertionError: actual: Polygon(n=3, R=1)

As we can see, we have an exception - that's because our test was incorrect - we need to include that space just before `R`.

Let's fix it and add a few more tests:

In [7]:
def test_polygon():
    n = 3
    R = 1
    p = Polygon(n, R)
    assert str(p) == 'Polygon(n=3, R=1)', f'actual: {str(p)}'
    assert p.count_vertices == n, (f'actual: {p.count_vertices},'
                                   f' expected: {n}')
    assert p.count_edges == n, f'actual: {p.count_edges}, expected: {n}'
    assert p.circumradius == R, f'actual: {p.circumradius}, expected: {R}'
    assert p.interior_angle == 60, (f'actual: {p.interior_angle},'
                                    f' expected: 60')

In [8]:
test_polygon()

NameError: name 'n' is not defined

Ok, so as we can see here, we have a bug in our code - we used `n` instead of `self._n`. Let's go back and fix it, and then run the tests again:

In [9]:
import math

class Polygon:
    def __init__(self, n, R):
        self._n = n
        self._R = R
        
    def __repr__(self):
        return f'Polygon(n={self._n}, R={self._R})'
    
    @property
    def count_vertices(self):
        return self._n
    
    @property
    def count_edges(self):
        return self._n
    
    @property
    def circumradius(self):
        return self._R
    
    @property
    def interior_angle(self):
        return (self._n - 2) * 180 / self._n

    @property
    def side_length(self):
        return 2 * self._R * math.sin(math.pi / self._n)
    
    @property
    def apothem(self):
        return self._R * math.cos(math.pi / self._n)
    
    @property
    def area(self):
        return self._n / 2 * self.side_length * self.apothem
    
    @property
    def perimeter(self):
        return self._n * self.side_length

In [10]:
test_polygon()

Let's continue writing our tests:

In [11]:
def test_polygon():
    n = 3
    R = 1
    p = Polygon(n, R)
    assert str(p) == 'Polygon(n=3, R=1)', f'actual: {str(p)}'
    assert p.count_vertices == n, (f'actual: {p.count_vertices},'
                                   f' expected: {n}')
    assert p.count_edges == n, f'actual: {p.count_edges}, expected: {n}'
    assert p.circumradius == R, f'actual: {p.circumradius}, expected: {R}'
    assert p.interior_angle == 60, (f'actual: {p.interior_angle},'
                                    ' expected: 60')
    n = 4
    R = 1
    p = Polygon(n, R)
    assert p.interior_angle == 90, (f'actual: {p.interior_angle}, '
                                    f' expected: 90')
    assert p.area == 2.0, (f'actual: {p.area}, '
                           'expected: 2.0')

In [12]:
test_polygon()

AssertionError: actual: 2.0000000000000004, expected: 2.0

As you should already be aware, comparing floats for equality is not something we should do.

Instead, we are going to use the math module's `isclose` function with relative and absolute tolerances set to `0.001`. I cover this in Part 1 of this series, but you can also see the documentation here:
* https://docs.python.org/3/library/math.html
* https://www.python.org/dev/peps/pep-0485/
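
As a quick refresher, here is a small sketch of why `==` is unreliable for floats and what the two tolerances do (the sample numbers are my own):

```python
import math

x = 0.1 + 0.2

exact = (x == 0.3)
close = math.isclose(x, 0.3)   # default rel_tol is 1e-09
print(exact)   # False
print(close)   # True

# near zero a relative tolerance alone is not enough, so abs_tol matters
near_zero_rel = math.isclose(1e-10, 0.0)
near_zero_abs = math.isclose(1e-10, 0.0, abs_tol=0.001)
print(near_zero_rel)   # False
print(near_zero_abs)   # True
```

That is why the tests below pass both `rel_tol` and `abs_tol`.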

In [13]:
def test_polygon():
    abs_tol = 0.001
    rel_tol = 0.001
    
    n = 3
    R = 1
    p = Polygon(n, R)
    assert str(p) == 'Polygon(n=3, R=1)', f'actual: {str(p)}'
    assert p.count_vertices == n, (f'actual: {p.count_vertices},'
                                   f' expected: {n}')
    assert p.count_edges == n, f'actual: {p.count_edges}, expected: {n}'
    assert p.circumradius == R, f'actual: {p.circumradius}, expected: {R}'
    assert p.interior_angle == 60, (f'actual: {p.interior_angle},'
                                    ' expected: 60')
    
    n = 4
    R = 1
    p = Polygon(n, R)
    assert p.interior_angle == 90, (f'actual: {p.interior_angle}, '
                                    ' expected: 90')
    assert math.isclose(p.area, 2, 
                        rel_tol=rel_tol, 
                        abs_tol=abs_tol), (f'actual: {p.area},'
                                           ' expected: 2.0')

In [14]:
test_polygon()

Let's continue testing a few things:

In [15]:
def test_polygon():
    abs_tol = 0.001
    rel_tol = 0.001
    
    n = 3
    R = 1
    p = Polygon(n, R)
    assert str(p) == 'Polygon(n=3, R=1)', f'actual: {str(p)}'
    assert p.count_vertices == n, (f'actual: {p.count_vertices},'
                                   f' expected: {n}')
    assert p.count_edges == n, f'actual: {p.count_edges}, expected: {n}'
    assert p.circumradius == R, f'actual: {p.circumradius}, expected: {R}'
    assert p.interior_angle == 60, (f'actual: {p.interior_angle},'
                                    ' expected: 60')
    n = 4
    R = 1
    p = Polygon(n, R)
    assert p.interior_angle == 90, (f'actual: {p.interior_angle}, '
                                    ' expected: 90')
    assert math.isclose(p.area, 2, 
                        rel_tol=rel_tol, 
                        abs_tol=abs_tol), (f'actual: {p.area},'
                                           ' expected: 2.0')
    
    assert math.isclose(p.side_length, math.sqrt(2),
                       rel_tol=rel_tol,
                       abs_tol=abs_tol), (f'actual: {p.side_length},'
                                          f' expected: {math.sqrt(2)}')
    
    assert math.isclose(p.perimeter, 4 * math.sqrt(2),
                       rel_tol=rel_tol,
                       abs_tol=abs_tol), (f'actual: {p.perimeter},'
                                          f' expected: {4 * math.sqrt(2)}')

In [16]:
test_polygon()

So far so good.
Now we have to work a little harder to test the apothem...
But I'm just going to use an online calculator to come up with some numbers!
* https://www.calculatorsoup.com/calculators/geometry-plane/polygon.php

In [19]:
def test_polygon():
    abs_tol = 0.001
    rel_tol = 0.001
    
    n = 3
    R = 1
    p = Polygon(n, R)
    assert str(p) == 'Polygon(n=3, R=1)', f'actual: {str(p)}'
    assert p.count_vertices == n, (f'actual: {p.count_vertices},'
                                   f' expected: {n}')
    assert p.count_edges == n, f'actual: {p.count_edges}, expected: {n}'
    assert p.circumradius == R, f'actual: {p.circumradius}, expected: {R}'
    assert p.interior_angle == 60, (f'actual: {p.interior_angle},'
                                    ' expected: 60')
    n = 4
    R = 1
    p = Polygon(n, R)
    assert p.interior_angle == 90, (f'actual: {p.interior_angle}, '
                                    ' expected: 90')
    assert math.isclose(p.area, 2, 
                        rel_tol=rel_tol, 
                        abs_tol=abs_tol), (f'actual: {p.area},'
                                           ' expected: 2.0')
    
    assert math.isclose(p.side_length, math.sqrt(2),
                       rel_tol=rel_tol,
                       abs_tol=abs_tol), (f'actual: {p.side_length},'
                                          f' expected: {math.sqrt(2)}')
    
    assert math.isclose(p.perimeter, 4 * math.sqrt(2),
                       rel_tol=rel_tol,
                       abs_tol=abs_tol), (f'actual: {p.perimeter},'
                                          f' expected: {4 * math.sqrt(2)}')
    
    assert math.isclose(p.apothem, 0.707,
                       rel_tol=rel_tol,
                       abs_tol=abs_tol), (f'actual: {p.apothem},'
                                          ' expected: 0.707')

In [20]:
test_polygon()

For good measure I'm going to add a few more assertions using that online calculator and comparing to the values I get from my class:

For n = 6, R = 2:
* side = 2 m
* apothem = 1.73205 m
* area = 10.3923 m²
* perim = 12 m
* int_angle = 120 °

For n = 12, R = 3:
* side = 1.55291 m
* apothem = 2.89778 m
* area = 27 m²
* perimeter = 18.635 m
* int_angle = 150 °

In [21]:
def test_polygon():
    abs_tol = 0.001
    rel_tol = 0.001
    
    n = 3
    R = 1
    p = Polygon(n, R)
    assert str(p) == 'Polygon(n=3, R=1)', f'actual: {str(p)}'
    assert p.count_vertices == n, (f'actual: {p.count_vertices},'
                                   f' expected: {n}')
    assert p.count_edges == n, f'actual: {p.count_edges}, expected: {n}'
    assert p.circumradius == R, f'actual: {p.circumradius}, expected: {R}'
    assert p.interior_angle == 60, (f'actual: {p.interior_angle},'
                                    ' expected: 60')
    n = 4
    R = 1
    p = Polygon(n, R)
    assert p.interior_angle == 90, (f'actual: {p.interior_angle}, '
                                    ' expected: 90')
    assert math.isclose(p.area, 2, 
                        rel_tol=rel_tol, 
                        abs_tol=abs_tol), (f'actual: {p.area},'
                                           ' expected: 2.0')
    
    assert math.isclose(p.side_length, math.sqrt(2),
                       rel_tol=rel_tol,
                       abs_tol=abs_tol), (f'actual: {p.side_length},'
                                          f' expected: {math.sqrt(2)}')
    
    assert math.isclose(p.perimeter, 4 * math.sqrt(2),
                       rel_tol=rel_tol,
                       abs_tol=abs_tol), (f'actual: {p.perimeter},'
                                          f' expected: {4 * math.sqrt(2)}')
    
    assert math.isclose(p.apothem, 0.707,
                       rel_tol=rel_tol,
                       abs_tol=abs_tol), (f'actual: {p.apothem},'
                                          ' expected: 0.707')
    p = Polygon(6, 2)
    assert math.isclose(p.side_length, 2,
                        rel_tol=rel_tol, abs_tol=abs_tol)
    assert math.isclose(p.apothem, 1.73205,
                        rel_tol=rel_tol, abs_tol=abs_tol)
    assert math.isclose(p.area, 10.3923,
                        rel_tol=rel_tol, abs_tol=abs_tol)
    assert math.isclose(p.perimeter, 12,
                        rel_tol=rel_tol, abs_tol=abs_tol)
    assert math.isclose(p.interior_angle, 120,
                        rel_tol=rel_tol, abs_tol=abs_tol)
    
    p = Polygon(12, 3)
    assert math.isclose(p.side_length, 1.55291,
                        rel_tol=rel_tol, abs_tol=abs_tol)
    assert math.isclose(p.apothem, 2.89778,
                        rel_tol=rel_tol, abs_tol=abs_tol)
    assert math.isclose(p.area, 27,
                        rel_tol=rel_tol, abs_tol=abs_tol)
    assert math.isclose(p.perimeter, 18.635,
                        rel_tol=rel_tol, abs_tol=abs_tol)
    assert math.isclose(p.interior_angle, 150,
                        rel_tol=rel_tol, abs_tol=abs_tol)
    

In [22]:
test_polygon()
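
The tests above pass both `rel_tol` and `abs_tol` to `math.isclose`. As a rough sketch of how the two tolerances combine, the rule documented for `math.isclose` can be written out by hand. The helper name `is_close_sketch` is made up for illustration, and special cases such as infinities and NaN are deliberately ignored:

```python
import math

def is_close_sketch(a, b, *, rel_tol, abs_tol):
    # The difference must be within the larger of:
    #   - the relative tolerance, scaled by the larger magnitude
    #   - the absolute tolerance
    return abs(a - b) <= max(rel_tol * max(abs(a), abs(b)), abs_tol)

# The apothem of a square with circumradius 1 is cos(pi/4), roughly 0.70711
apothem = math.cos(math.pi / 4)

pairs = [(apothem, 0.707), (1.0, 1.01), (1000.0, 1000.5)]
for a, b in pairs:
    ours = is_close_sketch(a, b, rel_tol=0.001, abs_tol=0.001)
    builtin = math.isclose(a, b, rel_tol=0.001, abs_tol=0.001)
    print(a, b, ours, builtin)
```

With these tolerances `0.707` counts as close to the computed apothem, `1.0` and `1.01` do not, and `1000.0` and `1000.5` do, because the relative tolerance scales with the size of the values.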

Next we need to add support for equality (same number of vertices and same circumradius) and for ordering (based on the number of vertices).

We'll do that by implementing the `__eq__` and `__gt__` methods.

In [23]:
import math

class Polygon:
    def __init__(self, n, R):
        self._n = n
        self._R = R
        
    def __repr__(self):
        return f'Polygon(n={self._n}, R={self._R})'
    
    @property
    def count_vertices(self):
        return self._n
    
    @property
    def count_edges(self):
        return self._n
    
    @property
    def circumradius(self):
        return self._R
    
    @property
    def interior_angle(self):
        return (self._n - 2) * 180 / self._n

    @property
    def side_length(self):
        return 2 * self._R * math.sin(math.pi / self._n)
    
    @property
    def apothem(self):
        return self._R * math.cos(math.pi / self._n)
    
    @property
    def area(self):
        return self._n / 2 * self.side_length * self.apothem
    
    @property
    def perimeter(self):
        return self._n * self.side_length
    
    def __eq__(self, other):
        if isinstance(other, self.__class__):
            return (self.count_edges == other.count_edges 
                    and self.circumradius == other.circumradius)
        else:
            return NotImplemented
        
    def __gt__(self, other):
        if isinstance(other, self.__class__):
            return self.count_vertices > other.count_vertices
        else:
            return NotImplemented
            

Let's add these to our unit tests:

In [24]:
def test_polygon():
    abs_tol = 0.001
    rel_tol = 0.001
    
    n = 3
    R = 1
    p = Polygon(n, R)
    assert str(p) == 'Polygon(n=3, R=1)', f'actual: {str(p)}'
    assert p.count_vertices == n, (f'actual: {p.count_vertices},'
                                   f' expected: {n}')
    assert p.count_edges == n, f'actual: {p.count_edges}, expected: {n}'
    assert p.circumradius == R, f'actual: {p.circumradius}, expected: {R}'
    assert p.interior_angle == 60, (f'actual: {p.interior_angle},'
                                    ' expected: 60')
    n = 4
    R = 1
    p = Polygon(n, R)
    assert p.interior_angle == 90, (f'actual: {p.interior_angle},'
                                    ' expected: 90')
    assert math.isclose(p.area, 2,
                        rel_tol=rel_tol,
                        abs_tol=abs_tol), (f'actual: {p.area},'
                                           ' expected: 2.0')
    
    assert math.isclose(p.side_length, math.sqrt(2),
                       rel_tol=rel_tol,
                       abs_tol=abs_tol), (f'actual: {p.side_length},'
                                          f' expected: {math.sqrt(2)}')
    
    assert math.isclose(p.perimeter, 4 * math.sqrt(2),
                       rel_tol=rel_tol,
                       abs_tol=abs_tol), (f'actual: {p.perimeter},'
                                          f' expected: {4 * math.sqrt(2)}')
    
    assert math.isclose(p.apothem, 0.707,
                        rel_tol=rel_tol,
                        abs_tol=abs_tol), (f'actual: {p.apothem},'
                                           ' expected: 0.707')
    p = Polygon(6, 2)
    assert math.isclose(p.side_length, 2,
                        rel_tol=rel_tol, abs_tol=abs_tol)
    assert math.isclose(p.apothem, 1.73205,
                        rel_tol=rel_tol, abs_tol=abs_tol)
    assert math.isclose(p.area, 10.3923,
                        rel_tol=rel_tol, abs_tol=abs_tol)
    assert math.isclose(p.perimeter, 12,
                        rel_tol=rel_tol, abs_tol=abs_tol)
    assert math.isclose(p.interior_angle, 120,
                        rel_tol=rel_tol, abs_tol=abs_tol)
    
    p = Polygon(12, 3)
    assert math.isclose(p.side_length, 1.55291,
                        rel_tol=rel_tol, abs_tol=abs_tol)
    assert math.isclose(p.apothem, 2.89778,
                        rel_tol=rel_tol, abs_tol=abs_tol)
    assert math.isclose(p.area, 27,
                        rel_tol=rel_tol, abs_tol=abs_tol)
    assert math.isclose(p.perimeter, 18.635,
                        rel_tol=rel_tol, abs_tol=abs_tol)
    assert math.isclose(p.interior_angle, 150,
                        rel_tol=rel_tol, abs_tol=abs_tol)
    
    p1 = Polygon(3, 10)
    p2 = Polygon(10, 10)
    p3 = Polygon(15, 10)
    p4 = Polygon(15, 100)
    p5 = Polygon(15, 100)
    
    assert p2 > p1
    assert p2 < p3
    assert p3 != p4
    assert p1 != p4
    assert p4 == p5

In [25]:
test_polygon()
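
Notice that the tests use `p2 < p3` even though `Polygon` only defines `__gt__`. Since `Polygon` does not implement `__lt__`, Python falls back to the reflected operation on the right operand, that is `p3.__gt__(p2)`. A minimal sketch of that behavior, using a hypothetical `Level` class rather than `Polygon`:

```python
class Level:
    """Hypothetical class that defines only __eq__ and __gt__."""
    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        if isinstance(other, Level):
            return self.value == other.value
        return NotImplemented

    def __gt__(self, other):
        if isinstance(other, Level):
            return self.value > other.value
        return NotImplemented

low, high = Level(1), Level(2)

print(high > low)    # calls high.__gt__(low)
print(low < high)    # no __lt__ defined: Python tries high.__gt__(low)
print(low != high)   # by default, != inverts the result of __eq__

try:
    low <= high      # neither __le__ nor __ge__ is defined
except TypeError:
    le_supported = False
else:
    le_supported = True
print(le_supported)
```

So `<` and `!=` come for free here, but `<=` and `>=` would still need to be implemented separately if they were required.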

Now, there's one last thing we need to take care of:

In [26]:
p = Polygon(1, 10)

That's not right, a strictly convex regular polygon must have a minimum of 3 vertices. We should raise an exception!

Let's fix that:

In [27]:
import math

class Polygon:
    def __init__(self, n, R):
        if n < 3:
            raise ValueError('Polygon must have at least 3 vertices.')
        self._n = n
        self._R = R
        
    def __repr__(self):
        return f'Polygon(n={self._n}, R={self._R})'
    
    @property
    def count_vertices(self):
        return self._n
    
    @property
    def count_edges(self):
        return self._n
    
    @property
    def circumradius(self):
        return self._R
    
    @property
    def interior_angle(self):
        return (self._n - 2) * 180 / self._n

    @property
    def side_length(self):
        return 2 * self._R * math.sin(math.pi / self._n)
    
    @property
    def apothem(self):
        return self._R * math.cos(math.pi / self._n)
    
    @property
    def area(self):
        return self._n / 2 * self.side_length * self.apothem
    
    @property
    def perimeter(self):
        return self._n * self.side_length
    
    def __eq__(self, other):
        if isinstance(other, self.__class__):
            return (self.count_edges == other.count_edges 
                    and self.circumradius == other.circumradius)
        else:
            return NotImplemented
        
    def __gt__(self, other):
        if isinstance(other, self.__class__):
            return self.count_vertices > other.count_vertices
        else:
            return NotImplemented
            

And add this to our unit tests:

In [28]:
def test_polygon():
    abs_tol = 0.001
    rel_tol = 0.001
    
    try:
        p = Polygon(2, 10)
        assert False, ('Creating a Polygon with 2 sides: '
                       'Exception expected, not received')
    except ValueError:
        pass
                       
    n = 3
    R = 1
    p = Polygon(n, R)
    assert str(p) == 'Polygon(n=3, R=1)', f'actual: {str(p)}'
    assert p.count_vertices == n, (f'actual: {p.count_vertices},'
                                   f' expected: {n}')
    assert p.count_edges == n, f'actual: {p.count_edges}, expected: {n}'
    assert p.circumradius == R, f'actual: {p.circumradius}, expected: {R}'
    assert p.interior_angle == 60, (f'actual: {p.interior_angle},'
                                    ' expected: 60')
    n = 4
    R = 1
    p = Polygon(n, R)
    assert p.interior_angle == 90, (f'actual: {p.interior_angle},'
                                    ' expected: 90')
    assert math.isclose(p.area, 2,
                        rel_tol=rel_tol,
                        abs_tol=abs_tol), (f'actual: {p.area},'
                                           ' expected: 2.0')
    
    assert math.isclose(p.side_length, math.sqrt(2),
                       rel_tol=rel_tol,
                       abs_tol=abs_tol), (f'actual: {p.side_length},'
                                          f' expected: {math.sqrt(2)}')
    
    assert math.isclose(p.perimeter, 4 * math.sqrt(2),
                       rel_tol=rel_tol,
                       abs_tol=abs_tol), (f'actual: {p.perimeter},'
                                          f' expected: {4 * math.sqrt(2)}')
    
    assert math.isclose(p.apothem, 0.707,
                        rel_tol=rel_tol,
                        abs_tol=abs_tol), (f'actual: {p.apothem},'
                                           ' expected: 0.707')
    p = Polygon(6, 2)
    assert math.isclose(p.side_length, 2,
                        rel_tol=rel_tol, abs_tol=abs_tol)
    assert math.isclose(p.apothem, 1.73205,
                        rel_tol=rel_tol, abs_tol=abs_tol)
    assert math.isclose(p.area, 10.3923,
                        rel_tol=rel_tol, abs_tol=abs_tol)
    assert math.isclose(p.perimeter, 12,
                        rel_tol=rel_tol, abs_tol=abs_tol)
    assert math.isclose(p.interior_angle, 120,
                        rel_tol=rel_tol, abs_tol=abs_tol)
    
    p = Polygon(12, 3)
    assert math.isclose(p.side_length, 1.55291,
                        rel_tol=rel_tol, abs_tol=abs_tol)
    assert math.isclose(p.apothem, 2.89778,
                        rel_tol=rel_tol, abs_tol=abs_tol)
    assert math.isclose(p.area, 27,
                        rel_tol=rel_tol, abs_tol=abs_tol)
    assert math.isclose(p.perimeter, 18.635,
                        rel_tol=rel_tol, abs_tol=abs_tol)
    assert math.isclose(p.interior_angle, 150,
                        rel_tol=rel_tol, abs_tol=abs_tol)
    
    p1 = Polygon(3, 10)
    p2 = Polygon(10, 10)
    p3 = Polygon(15, 10)
    p4 = Polygon(15, 100)
    p5 = Polygon(15, 100)
    
    assert p2 > p1
    assert p2 < p3
    assert p3 != p4
    assert p1 != p4
    assert p4 == p5

In [29]:
test_polygon()

Ok! I think this is good enough unit testing.

Looks like our `Polygon` class is working properly.

By the way, did you notice that we spent at least as much time **testing** our code as we did **writing** it?

In practice, that is often how it goes. You should always test your code. You obviously cannot test every data combination, but you should try to test all your methods and code branches at least once, and also cover any edge cases to make sure they are handled as expected.

You should also try to ensure, within reason, that all the code you wrote is tested (i.e. executed, or *exercised*) during your tests - this is called **test coverage**, or sometimes **code coverage**.
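
The same kinds of checks can also be organized with the standard library's `unittest` module, which runs each test method independently and reports failures per test. Below is a minimal sketch, not the approach used in this project. It uses a trimmed-down stand-in for `Polygon` so that it runs on its own, and the test class and method names are made up for illustration:

```python
import math
import unittest

class Polygon:
    # Trimmed-down stand-in for the project's Polygon class
    def __init__(self, n, R):
        if n < 3:
            raise ValueError('Polygon must have at least 3 vertices.')
        self._n = n
        self._R = R

    @property
    def side_length(self):
        return 2 * self._R * math.sin(math.pi / self._n)

class PolygonTests(unittest.TestCase):
    def test_too_few_vertices(self):
        # Replaces the try / assert False / except ValueError pattern
        with self.assertRaises(ValueError):
            Polygon(2, 10)

    def test_square_side_length(self):
        p = Polygon(4, 1)
        self.assertAlmostEqual(p.side_length, math.sqrt(2), delta=0.001)

# Build and run the suite explicitly so this also works inside a notebook
suite = unittest.defaultTestLoader.loadTestsFromTestCase(PolygonTests)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.testsRun, result.wasSuccessful())
```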

##  Project Solution: Goal 2

Here is the final `Polygon` class we ended up with in goal 1:

In [1]:
import math

class Polygon:
    def __init__(self, n, R):
        if n < 3:
            raise ValueError('Polygon must have at least 3 vertices.')
        self._n = n
        self._R = R
        
    def __repr__(self):
        return f'Polygon(n={self._n}, R={self._R})'
    
    @property
    def count_vertices(self):
        return self._n
    
    @property
    def count_edges(self):
        return self._n
    
    @property
    def circumradius(self):
        return self._R
    
    @property
    def interior_angle(self):
        return (self._n - 2) * 180 / self._n

    @property
    def side_length(self):
        return 2 * self._R * math.sin(math.pi / self._n)
    
    @property
    def apothem(self):
        return self._R * math.cos(math.pi / self._n)
    
    @property
    def area(self):
        return self._n / 2 * self.side_length * self.apothem
    
    @property
    def perimeter(self):
        return self._n * self.side_length
    
    def __eq__(self, other):
        if isinstance(other, self.__class__):
            return (self.count_edges == other.count_edges 
                    and self.circumradius == other.circumradius)
        else:
            return NotImplemented
        
    def __gt__(self, other):
        if isinstance(other, self.__class__):
            return self.count_vertices > other.count_vertices
        else:
            return NotImplemented

Now we need to create a sequence type that will return these Polygons, starting with 3 vertices, up to (and including) a polygon of `m` sides.

Our sequence type will need to implement:
* a `__len__` method
* a `__getitem__` method
* a method that identifies the polygon with largest area to perimeter ratio: let's call it `max_efficiency_polygon` - note that the Polygon class does not have an `efficiency` method, so we'll have to calculate it outside of the Polygon class.

Let's start with some of the basics:

In [2]:
class Polygons:
    def __init__(self, m, R):
        if m < 3:
            raise ValueError('m must be at least 3')
        self._m = m
        self._R = R
        
    def __len__(self):
        return self._m - 2
    
    def __repr__(self):
        return f'Polygons(m={self._m}, R={self._R})'

Let's make sure this works as intended:

In [3]:
polygons = Polygons(2, 10)

ValueError: m must be at least 3

That's exactly what we want, so that's good.

In [4]:
polygons = Polygons(3, 1)

In [5]:
len(polygons)

1

In [6]:
polygons = Polygons(6, 1)
len(polygons)

4

Let's also test the representation:

In [7]:
polygons

Polygons(m=6, R=1)

Let's now implement a list that will contain all the polygons.

We'll do that in the `__init__` method as well.

In [8]:
class Polygons:
    def __init__(self, m, R):
        if m < 3:
            raise ValueError('m must be at least 3')
        self._m = m
        self._R = R
        self._polygons = [Polygon(i, R) for i in range(3, m+1)]
        
    def __len__(self):
        return self._m - 2
    
    def __repr__(self):
        return f'Polygons(m={self._m}, R={self._R})'

Before we can test this, we need to implement the `__getitem__` method:

In [9]:
class Polygons:
    def __init__(self, m, R):
        if m < 3:
            raise ValueError('m must be at least 3')
        self._m = m
        self._R = R
        self._polygons = [Polygon(i, R) for i in range(3, m+1)]
        
    def __len__(self):
        return self._m - 2
    
    def __repr__(self):
        return f'Polygons(m={self._m}, R={self._R})'
    
    def __getitem__(self, s):
        return self._polygons[s]

Notice how easy it was using delegation to the underlying list of polygons!

Let's test this out:

In [10]:
polygons = Polygons(8, 1)

In [11]:
for p in polygons:
    print(p)

Polygon(n=3, R=1)
Polygon(n=4, R=1)
Polygon(n=5, R=1)
Polygon(n=6, R=1)
Polygon(n=7, R=1)
Polygon(n=8, R=1)


In [12]:
for p in polygons[2:5]:
    print(p)

Polygon(n=5, R=1)
Polygon(n=6, R=1)
Polygon(n=7, R=1)


In [13]:
for p in polygons[::-1]:
    print(p)

Polygon(n=8, R=1)
Polygon(n=7, R=1)
Polygon(n=6, R=1)
Polygon(n=5, R=1)
Polygon(n=4, R=1)
Polygon(n=3, R=1)
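
The `for` loop and the slices above work even though `Polygons` defines neither `__iter__` nor any slicing logic of its own. Here is a small sketch of what happens behind the scenes, using a hypothetical `Letters` class (not part of the project) that records every argument passed to `__getitem__`:

```python
class Letters:
    """Hypothetical sequence that records what __getitem__ receives."""
    def __init__(self, text):
        self._items = list(text)
        self.requested = []

    def __len__(self):
        return len(self._items)

    def __getitem__(self, s):
        self.requested.append(s)
        # Delegating to the list raises IndexError past the end,
        # which is what tells the loop to stop
        return self._items[s]

letters = Letters('abc')

collected = [ch for ch in letters]
print(collected)
print(letters.requested)   # integer indexes 0, 1, 2, ... until IndexError

part = letters[0:2]        # __getitem__ receives a slice object
print(letters.requested[-1])
print(type(part), part)
```

One consequence of delegating to the underlying list is that a slice comes back as a plain `list`, not as another instance of the custom sequence type.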


We still need to implement a method that identifies the polygon with the highest area to perimeter ratio.

In [14]:
class Polygons:
    def __init__(self, m, R):
        if m < 3:
            raise ValueError('m must be at least 3')
        self._m = m
        self._R = R
        self._polygons = [Polygon(i, R) for i in range(3, m+1)]
        
    def __len__(self):
        return self._m - 2
    
    def __repr__(self):
        return f'Polygons(m={self._m}, R={self._R})'
    
    def __getitem__(self, s):
        return self._polygons[s]
    
    @property
    def max_efficiency_polygon(self):
        sorted_polygons = sorted(self._polygons,
                                 key=lambda p: p.area/p.perimeter,
                                 reverse=True)
        return sorted_polygons[0]

In [15]:
polygons = Polygons(10, 1)

In [16]:
polygons.max_efficiency_polygon

Polygon(n=10, R=1)

Let's test this to make sure that is correct:

In [17]:
[(p, p.area/p.perimeter) for p in polygons]

[(Polygon(n=3, R=1), 0.25000000000000006),
 (Polygon(n=4, R=1), 0.35355339059327384),
 (Polygon(n=5, R=1), 0.4045084971874737),
 (Polygon(n=6, R=1), 0.4330127018922193),
 (Polygon(n=7, R=1), 0.4504844339512096),
 (Polygon(n=8, R=1), 0.4619397662556434),
 (Polygon(n=9, R=1), 0.46984631039295427),
 (Polygon(n=10, R=1), 0.47552825814757677)]
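
These ratios can also be checked by hand. Using the formulas from the `Polygon` class (area $= \frac{n}{2} \cdot s \cdot a$ and perimeter $= n \cdot s$, where $s$ is the side length and $a$ is the apothem), the ratio reduces to half the apothem:

$$
\frac{\text{area}}{\text{perimeter}} = \frac{\frac{n}{2} \, s \, a}{n \, s} = \frac{a}{2} = \frac{R}{2} \cos\left(\frac{\pi}{n}\right)
$$

For $n = 3$ and $R = 1$ this gives $\frac{1}{2}\cos(\pi/3) = 0.25$, which matches the first entry in the list. Since $\cos(\pi/n)$ increases with $n$ for $n \ge 3$, the polygon with the most vertices always has the highest ratio.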

So, looks like our `max_efficiency_polygon` method is working correctly.
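
Sorting the whole list just to take its first element does more work than strictly necessary. As an alternative sketch (not the approach used above), the built-in `max` accepts the same `key` argument and finds the element in a single pass. The `Shape` tuples below are stand-ins with approximate values for a triangle, a square and a hexagon of circumradius 1:

```python
from collections import namedtuple

# Hypothetical stand-in for Polygon: only the fields needed here
Shape = namedtuple('Shape', 'name area perimeter')

shapes = [
    Shape('triangle', 1.299, 5.196),
    Shape('square', 2.0, 5.657),
    Shape('hexagon', 2.598, 6.0),
]

def efficiency(shape):
    return shape.area / shape.perimeter

# Approach used in max_efficiency_polygon: sort, then take the first item
best_sorted = sorted(shapes, key=efficiency, reverse=True)[0]

# Alternative: let max scan the list once
best_max = max(shapes, key=efficiency)

print(best_sorted.name, best_max.name)
```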

As one last thing, we could look at the area of our polygons as the number of vertices becomes larger and larger.

As we have more and more sides, the polygon becomes a closer and closer approximation to a circle. So, the area should get closer and closer to $\pi$ if we use a circumradius of `1`.

In [18]:
polygons = Polygons(500, 1)

In [19]:
polygons[-1].area

3.1415099708381518

Yep, seems to be working!
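
To see the convergence more directly, here is a small standalone sketch. The function name `regular_polygon_area` is made up for illustration. It uses the same formula as `Polygon.area`, simplified with the identity $\sin(2x) = 2\sin(x)\cos(x)$:

```python
import math

def regular_polygon_area(n, R=1):
    # n/2 * side_length * apothem
    #   = n/2 * (2 R sin(pi/n)) * (R cos(pi/n))
    #   = n/2 * R**2 * sin(2 pi / n)
    return n / 2 * R ** 2 * math.sin(2 * math.pi / n)

gaps = []
for n in (6, 50, 500, 5000):
    area = regular_polygon_area(n)
    gaps.append(math.pi - area)
    print(n, area, math.pi - area)
```

The gap between $\pi$ and the polygon's area stays positive and shrinks as `n` grows.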

# Section 04 - Iterables and Iterators

##  Iterating Collections

We saw how sequence types support iteration by being able to access elements by index. We could even write our own custom sequence types by implementing the `__getitem__` method.

But there are some limitations:

* items must be numerically indexable, with indexing starting at `0`
* cannot be used with unordered collections, such as sets

If we think about iterating over a collection, what we really need is a way to request the **next** item in the collection.

If we can do that, our collection does not require being indexable, nor does it need to be ordered (i.e. we don't need the notion of relative positions of elements in the container).

This is exactly what iterables are in general - they provide a method that returns the "next" element in the collection. This approach works equally well with sequence type collections, as well as unordered collection types such as sets.

Of course, the order in which **next** returns items from an unordered collection is not known in advance, and we see that when we iterate over a set, for example:

In [1]:
s = {'x', 'y', 'b', 'c', 'a'}
for item in s:
    print(item)

y
a
c
b
x


As you can see, the order in which the elements of the set were returned did not match the order in which we listed them when creating the set.

Furthermore, we cannot use indexing to access elements in a set:

In [2]:
s[0]

TypeError: 'set' object does not support indexing
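
When a predictable order or positional access is needed for the elements of a set, one option is to build a list from it first. A short sketch (the exact wording of the `TypeError` message depends on the Python version):

```python
s = {'x', 'y', 'b', 'c', 'a'}

try:
    s[0]
except TypeError as ex:
    # Sets have no notion of position, so indexing fails
    print(f'TypeError: {ex}')

# A set can still be iterated. sorted() iterates it and returns a list,
# which has a well-defined order and supports indexing.
ordered = sorted(s)
print(ordered)
print(ordered[0])
```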

### Rolling our own Next method

Let's go ahead and define a kind of iterable ourselves. 

What we'll want to do is have a container-type class that implements a `next` method instead of the `__getitem__` method.

Every time we call `next`, it should return the next element in the collection - so we'll have to keep track of where we are in the iteration somehow.

Since `next` is a built-in function, which we'll look at in a bit, we'll use `next_` instead.

In [3]:
class Squares:
    def __init__(self):
        self.i = 0
    
    def next_(self):
        result = self.i ** 2
        self.i += 1
        return result

In [4]:
sq = Squares()

In [5]:
sq.next_()

0

In [6]:
sq.next_()

1

In [7]:
sq.next_()

4

How do we re-start the iteration from the beginning?

We can't - we have to create a new instance of `Squares`:

In [8]:
sq = Squares()

In [9]:
for i in range(10):
    print(sq.next_())

0
1
4
9
16
25
36
49
64
81


We are even able to iterate over the squares.

But you'll notice that we essentially have an **infinite** number of items.

We can fix that easily enough by specifying a length when we create the collection, and raising an exception if `next_()` goes beyond the number of elements in the collection. We'll raise a `StopIteration` exception, a built-in exception that Python provides specifically for this kind of scenario.

We'll even implement a `__len__` method to support the `len()` function:

In [10]:
class Squares:
    def __init__(self, length):
        self.length = length
        self.i = 0
    
    def next_(self):
        if self.i >= self.length:
            raise StopIteration
        else:
            result = self.i ** 2
            self.i += 1
            return result           
        
    def __len__(self):
        return self.length

In [11]:
sq = Squares(3)

In [12]:
len(sq)

3

In [13]:
sq.next_()

0

In [14]:
sq.next_()

1

In [15]:
sq.next_()

4

In [16]:
sq.next_()

StopIteration: 

So now, we can essentially loop over the collection in a very similar way to how we did it with sequences and the `__getitem__` method:

In [17]:
sq = Squares(5)
while True:
    try:
        print(sq.next_())
    except StopIteration:
        # reached end of iteration
        # stop looping
        break       

0
1
4
9
16


There are two issues here.
The first is that the "iterable" `sq` has been exhausted - we can't just "re-start" the iteration:

In [18]:
sq.next_()

StopIteration: 

The second problem is that we can't use a `for` loop - Python does not know about our `next_()` method:

In [19]:
for i in Squares(10):
    print(i)

TypeError: 'Squares' object is not iterable

Of course, if we had a `__getitem__` method, everything would work again, but remember that `__getitem__` means we have a sequence type. Although our `Squares` class could actually be written as a sequence, we want to look at a more general way of creating containers that are not necessarily sequences.

Much like Python's `len()` function and the `__len__()` method, Python has a built-in `next()` function - it calls the `__next__()` method in our class if there is one.

Let's see this:

In [20]:
class Squares:
    def __init__(self, length):
        self.length = length
        self.i = 0
    
    def __next__(self):
        if self.i >= self.length:
            raise StopIteration
        else:
            result = self.i ** 2
            self.i += 1
            return result   
    
    def __len__(self):
        return self.length

In [21]:
sq = Squares(3)

In [22]:
next(sq)

0

In [23]:
next(sq)

1

In [24]:
next(sq)

4

In [25]:
next(sq)

StopIteration: 

So that's nice, and it makes typing a bit easier. The loop we wrote earlier would now look something like this:

In [26]:
sq = Squares(5)
while True:
    try:
        print(next(sq))
    except StopIteration:
        break  

0
1
4
9
16
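
As a side note, the built-in `next` function also accepts an optional second argument: a default value to return instead of raising `StopIteration`. Here is a sketch of the same loop written that way. The `done` sentinel is an assumption made for illustration, and the class is repeated so the example runs on its own:

```python
class Squares:
    def __init__(self, length):
        self.length = length
        self.i = 0

    def __next__(self):
        if self.i >= self.length:
            raise StopIteration
        result = self.i ** 2
        self.i += 1
        return result

sq = Squares(5)
done = object()   # unique sentinel that can never be one of the squares
results = []
while True:
    item = next(sq, done)   # returns `done` instead of raising StopIteration
    if item is done:
        break
    results.append(item)

print(results)
```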


Does this mean Python can now iterate over an instance of Squares?

In [27]:
for i in Squares(10):
    print(i)

TypeError: 'Squares' object is not iterable

Nope, Python still does not recognize our class as an iterable collection.

We need to do a little bit more work to get there.

We also are going to need to look at how to "reset" the iteration without having to create a whole new object.

You'll notice that technically our `Squares` class could be built as a sequence type - it was just a very simple example.

Instead, let's build another collection that is a container of random numbers, but in no particular order.

In [28]:
import random

In [29]:
class RandomNumbers:
    def __init__(self, length, *, range_min=0, range_max=10):
        self.length = length
        self.range_min = range_min
        self.range_max = range_max
        self.num_requested = 0
        
    def __len__(self):
        return self.length
    
    def __next__(self):
        if self.num_requested >= self.length:
            raise StopIteration
        else:
            self.num_requested += 1
            return random.randint(self.range_min, self.range_max)

We can now iterate over instances of this object:

In [30]:
numbers = RandomNumbers(10)

In [31]:
len(numbers)

10

In [32]:
while True:
    try:
        print(next(numbers))
    except StopIteration:
        break

8
9
3
10
10
9
0
10
10
1


We still cannot use a `for` loop, and if we want to 'restart' the iteration, we have to create a new object every time.

In [33]:
numbers = RandomNumbers(10)

In [34]:
for item in numbers:
    print(item)

TypeError: 'RandomNumbers' object is not iterable
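
Since "restarting" means creating a new object, every new `RandomNumbers` instance produces a different series of values. If a repeatable series is wanted (in a test, for example), one possible variation is to give each instance its own seeded generator. This is only a sketch: the class name `SeededRandomNumbers`, the `seed` parameter and the `drain` helper are all assumptions made for illustration.

```python
import random

class SeededRandomNumbers:
    def __init__(self, length, *, range_min=0, range_max=10, seed=None):
        self.length = length
        self.range_min = range_min
        self.range_max = range_max
        self.num_requested = 0
        # Hypothetical addition: a private generator, so seeding it
        # does not affect the module-level random functions
        self._rng = random.Random(seed)

    def __len__(self):
        return self.length

    def __next__(self):
        if self.num_requested >= self.length:
            raise StopIteration
        self.num_requested += 1
        # randint includes both endpoints
        return self._rng.randint(self.range_min, self.range_max)

def drain(numbers):
    # Collect values with next() until StopIteration is raised
    values = []
    while True:
        try:
            values.append(next(numbers))
        except StopIteration:
            return values

first_run = drain(SeededRandomNumbers(10, seed=42))
second_run = drain(SeededRandomNumbers(10, seed=42))
print(first_run == second_run)
```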

##  Iterators

In the last lecture we saw that we could approach iterating over a collection using this concept of `next`.

But there were some downsides that we did not resolve (yet!):
* we cannot use a `for` loop
* once we exhaust the iteration (by repeatedly calling `next`), we're essentially done with the object. The only way to iterate through it again is to create a new instance of the object.

First, we are going to look at making our `__next__`-based object usable in a `for` loop.

This idea of using `__next__` and the `StopIteration` exception is exactly what Python does.

So, somehow we need to tell Python that the object we are dealing with can be used with `next`.

To do so, we create an `iterator` type object.

Iterators are objects that implement:
* a `__next__` method
* an `__iter__` method that simply returns the object itself

That's it - that's all there is to an iterator - two methods, `__iter__` and `__next__`.

Let's go back to our `Squares` example:

In [1]:
class Squares:
    def __init__(self, length):
        self.length = length
        self.i = 0
        
    def __iter__(self):
        return self
    
    def __next__(self):
        if self.i >= self.length:
            raise StopIteration
        else:
            result = self.i ** 2
            self.i += 1
            return result

Now we can still call `next`:

In [2]:
sq = Squares(5)

In [3]:
print(next(sq))
print(next(sq))
print(next(sq))

0
1
4


Of course, we still cannot "reset" our iterator - we just have to create a new instance:

In [4]:
sq = Squares(5)

But now, we can also use a `for` loop:

In [5]:
for item in sq:
    print(item)

0
1
4
9
16


Now `sq` is **exhausted**, so if we try to loop through again:

In [6]:
for item in sq:
    print(item)

We get nothing...

All we need to do is create a new iterator:

In [7]:
sq = Squares(5)

In [8]:
for item in sq:
    print(item)

0
1
4
9
16


Just like Python's built-in `next` function calls our `__next__` method, Python has a built-in function `iter` which calls the `__iter__` method:

In [9]:
sq = Squares(5)

In [10]:
id(sq)

1965579635736

In [11]:
id(sq.__iter__())

1965579635736

In [12]:
id(iter(sq))

1965579635736

And of course we can also use a list comprehension on our iterator object:

In [13]:
sq = Squares(5)

In [14]:
[item for item in sq if item%2==0]

[0, 4, 16]

We can even use any function that requires an iterable as an argument (iterators are iterable):

In [15]:
sq = Squares(5)
list(enumerate(sq))

[(0, 0), (1, 1), (2, 4), (3, 9), (4, 16)]

But of course we have to be careful: our iterator was exhausted, so if we try that again:

In [16]:
list(enumerate(sq))

[]

we get an empty list - instead we have to create a new iterator first:

In [17]:
sq = Squares(5)
list(enumerate(sq))

[(0, 0), (1, 1), (2, 4), (3, 9), (4, 16)]

We can even use the `sorted` function on it:

In [18]:
sq = Squares(5)
sorted(sq, reverse=True)

[16, 9, 4, 1, 0]

#### Python Iterators Summary

Iterators are objects that implement the `__iter__` and `__next__` methods.

The `__iter__` method of an iterator just returns itself.

Once we fully iterate over an iterator, the iterator is **exhausted** and we can no longer use it for iteration purposes.
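
One way to double-check these points is with the abstract base classes in `collections.abc`, which recognize any class that defines the required methods. A minimal sketch, using two hypothetical classes (`OnlyNext` and `Countdown`) that are not part of the lecture code:

```python
from collections.abc import Iterable, Iterator

class OnlyNext:
    # Has __next__ but no __iter__, so it is not an iterator
    def __next__(self):
        raise StopIteration

class Countdown:
    # Implements both methods of the iterator protocol
    def __init__(self, start):
        self._current = start

    def __iter__(self):
        return self

    def __next__(self):
        if self._current <= 0:
            raise StopIteration
        self._current -= 1
        return self._current + 1

print(isinstance(OnlyNext(), Iterator))
print(isinstance(Countdown(3), Iterator))
print(isinstance(Countdown(3), Iterable))

c = Countdown(3)
first_pass = list(c)
second_pass = list(c)   # the iterator is exhausted by now
print(first_pass, second_pass)
```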

The way Python applies a `for` loop to an iterator object is basically what we saw with the `while` loop and the `StopIteration` exception.

In [19]:
sq = Squares(5)
while True:
    try:
        print(next(sq))
    except StopIteration:
        break

0
1
4
9
16


In fact we can easily see this by tweaking our iterator a bit:

In [20]:
class Squares:
    def __init__(self, length):
        self.length = length
        self.i = 0
        
    def __iter__(self):
        print('calling __iter__')
        return self
    
    def __next__(self):
        print('calling __next__')
        if self.i >= self.length:
            raise StopIteration
        else:
            result = self.i ** 2
            self.i += 1
            return result

In [21]:
sq = Squares(5)

In [22]:
for i in sq:
    print(i)

calling __iter__
calling __next__
0
calling __next__
1
calling __next__
4
calling __next__
9
calling __next__
16
calling __next__


As you can see Python calls `__next__` (and stops once a `StopIteration` exception is raised).

But you'll notice that it also called the `__iter__` method.

In fact we'll see this happening in other places too:

In [23]:
sq = Squares(5)
[item for item in sq if item%2==0]

calling __iter__
calling __next__
calling __next__
calling __next__
calling __next__
calling __next__
calling __next__


[0, 4, 16]

In [24]:
sq = Squares(5)
list(enumerate(sq))

calling __iter__
calling __next__
calling __next__
calling __next__
calling __next__
calling __next__
calling __next__


[(0, 0), (1, 1), (2, 4), (3, 9), (4, 16)]

In [25]:
sq = Squares(5)
sorted(sq, reverse=True)

calling __iter__
calling __next__
calling __next__
calling __next__
calling __next__
calling __next__
calling __next__


[16, 9, 4, 1, 0]

Why is `__iter__` being called? After all, it just returns itself!

That's the topic of the next lecture!

But let's see how we can mimic what Python is doing:

In [26]:
sq = Squares(5)
sq_iterator = iter(sq)
print(id(sq), id(sq_iterator))
while True:
    try:
        item = next(sq_iterator)
        print(item)
    except StopIteration:
        break

calling __iter__
1965579704808 1965579704808
calling __next__
0
calling __next__
1
calling __next__
4
calling __next__
9
calling __next__
16
calling __next__


As you can see, we first request an iterator from `sq` using the `iter` function, and then we iterate using the returned iterator. In the case of an iterator, the `iter` function just gets the iterator itself back.

##  Iterators and Iterables

Previously we saw that we could create **iterator** objects by simply implementing:

* a `__next__` method that returns the next element in the container
* an `__iter__` method that just returns the object itself (the iterator object)

Doing that we could use a `for` loop, list comprehensions, and in fact use that iterator object anywhere an iterable was expected (like `enumerate`, `sorted`, and so on).

However, we had two outstanding issues/questions:
* when we looped over the iterator using a `for` loop (or a comprehension, or other functions that do some form of iteration), we saw that the `__iter__` method was always called first.
* the iterator gets exhausted after we have finished iterating it fully, which means we have to create a new iterator every time we want to start a new iteration over the collection. Can we somehow avoid having to remember to do that every time?

The answers to both of these questions are related.

Let's start by looking at how we might avoid having to create a new instance of the collection every time we want to iterate over it.

After all, we don't need a new instance of the elements, just some kind of *resetting* of the *current* item.

Let's start with a simple example that has those issues:

In [1]:
class Cities:
    def __init__(self):
        self._cities = ['Paris', 'Berlin', 'Rome', 'Madrid', 'London']
        self._index = 0
    
    def __iter__(self):
        return self
    
    def __next__(self):
        if self._index >= len(self._cities):
            raise StopIteration
        else:
            item = self._cities[self._index]
            self._index += 1
            return item

Now, we have an **iterator** object, but we need to re-create it every time we want to start the iterations from the beginning:

In [2]:
cities = Cities()
list(enumerate(cities))

[(0, 'Paris'), (1, 'Berlin'), (2, 'Rome'), (3, 'Madrid'), (4, 'London')]

In [3]:
cities = Cities()
[item.upper() for item in cities]

['PARIS', 'BERLIN', 'ROME', 'MADRID', 'LONDON']

In [4]:
cities = Cities()
sorted(cities)

['Berlin', 'London', 'Madrid', 'Paris', 'Rome']

So, we basically have to "restart" an iterator by **creating a new one each time**.

But in this case, we are also re-creating the underlying data every time - seems wasteful!

Instead, maybe we can split the **iterator** part of our code from the **data** part of our code.

In [5]:
class Cities:
    def __init__(self):
        self._cities = ['New York', 'Newark', 'New Delhi', 'Newcastle']
        
    def __len__(self):
        return len(self._cities)

And let's create our iterator this way:

In [6]:
class CityIterator:
    def __init__(self, city_obj):
        # city_obj is an instance of Cities
        self._city_obj = city_obj
        self._index = 0
        
    def __iter__(self):
        return self
    
    def __next__(self):
        if self._index >= len(self._city_obj):
            raise StopIteration
        else:
            item = self._city_obj._cities[self._index]
            self._index += 1
            return item

So now we can create our `Cities` instance **once**:

In [7]:
cities = Cities()

and create as many iterators as we want, passing each one the same `Cities` instance every time:

In [8]:
iter_1 = CityIterator(cities)

In [9]:
for city in iter_1:
    print(city)

New York
Newark
New Delhi
Newcastle


In [10]:
iter_2 = CityIterator(cities)
[city.upper() for city in iter_2]

['NEW YORK', 'NEWARK', 'NEW DELHI', 'NEWCASTLE']

So, we're almost at a solution now. At least we can create the **iterator** objects without having to recreate the `Cities` object every time.

But we still have to remember to create a new iterator, **and** we can no longer iterate over the `cities` object itself!

In [11]:
for city in cities:
    print(city)

TypeError: 'Cities' object is not iterable

This is where the first question we asked comes into play. Whenever we iterated our iterator, the first thing Python did was call `__iter__`.

In fact, let's just check that again:

In [12]:
class CityIterator:
    def __init__(self, city_obj):
        # city_obj is an instance of Cities
        print('Calling CityIterator __init__')
        self._city_obj = city_obj
        self._index = 0
        
    def __iter__(self):
        print('Calling CityIterator instance __iter__')
        return self
    
    def __next__(self):
        print('Calling __next__')
        if self._index >= len(self._city_obj):
            raise StopIteration
        else:
            item = self._city_obj._cities[self._index]
            self._index += 1
            return item

In [13]:
iter_1 = CityIterator(cities)

Calling CityIterator __init__


In [14]:
for city in iter_1:
    print(city)

Calling CityIterator instance __iter__
Calling __next__
New York
Calling __next__
Newark
Calling __next__
New Delhi
Calling __next__
Newcastle
Calling __next__


#### Iterables

Now we finally come to how an **iterable** is defined in Python.

An **iterable** is an object that:
* implements the `__iter__` method
* and that method returns an **iterator** which can be used to iterate over the object
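
As a rough sketch of that definition, the standard library's `collections.abc.Iterable` and `collections.abc.Iterator` can be used to check which protocol an object satisfies. The `Countdown` class here is a hypothetical example, not part of the `Cities` code, and it simply borrows the iterator of a built-in `range` instead of defining its own:

```python
from collections.abc import Iterable, Iterator


class Countdown:
    """Hypothetical iterable: it has __iter__ but no __next__."""

    def __init__(self, start):
        self._start = start

    def __iter__(self):
        # Return a brand new iterator on every call
        return iter(range(self._start, 0, -1))


countdown = Countdown(3)

# Each pass asks the iterable for a fresh iterator, so both passes are complete
first_pass = list(countdown)
second_pass = list(countdown)

is_iterable = isinstance(countdown, Iterable)        # has __iter__
is_iterator = isinstance(countdown, Iterator)        # would also need __next__
iter_is_iterator = isinstance(iter(countdown), Iterator)
```

The iterable itself never gets exhausted; only the iterators it hands out do.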

What would happen if we put an `__iter__` method in the `Cities` object and then try to iterate?

When we try to iterate over the `Cities` instance, Python will first call `__iter__`. The `__iter__` method should then return an **iterator** which Python will use for the iteration.

We actually have everything we need to now make `Cities` an **iterable** since we already have the `CityIterator` created:

In [15]:
class CityIterator:
    def __init__(self, city_obj):
        # city_obj is an instance of Cities
        print('Calling CityIterator __init__')
        self._city_obj = city_obj
        self._index = 0
        
    def __iter__(self):
        print('Calling CityIterator instance __iter__')
        return self
    
    def __next__(self):
        print('Calling __next__')
        if self._index >= len(self._city_obj):
            raise StopIteration
        else:
            item = self._city_obj._cities[self._index]
            self._index += 1
            return item

In [16]:
class Cities:
    def __init__(self):
        self._cities = ['New York', 'Newark', 'New Delhi', 'Newcastle']
        
    def __len__(self):
        return len(self._cities)
    
    def __iter__(self):
        print('Calling Cities instance __iter__')
        return CityIterator(self)

In [17]:
cities = Cities()

In [18]:
for city in cities:
    print(city)

Calling Cities instance __iter__
Calling CityIterator __init__
Calling __next__
New York
Calling __next__
Newark
Calling __next__
New Delhi
Calling __next__
Newcastle
Calling __next__


And watch what happens if we try to run that loop again:

In [19]:
for city in cities:
    print(city)

Calling Cities instance __iter__
Calling CityIterator __init__
Calling __next__
New York
Calling __next__
Newark
Calling __next__
New Delhi
Calling __next__
Newcastle
Calling __next__


A new **iterator** was created when the `for` loop started.

In fact, the same happens for anything that is going to iterate our iterable - it first calls the `__iter__` method of the iterable to get a **new** iterator, then uses the iterator to call `__next__`.

In [20]:
list(enumerate(cities))

Calling Cities instance __iter__
Calling CityIterator __init__
Calling __next__
Calling __next__
Calling __next__
Calling __next__
Calling __next__


[(0, 'New York'), (1, 'Newark'), (2, 'New Delhi'), (3, 'Newcastle')]

In [21]:
sorted(cities, reverse=True)

Calling Cities instance __iter__
Calling CityIterator __init__
Calling __next__
Calling __next__
Calling __next__
Calling __next__
Calling __next__


['Newcastle', 'Newark', 'New York', 'New Delhi']

Now we can put the iterator class inside our `Cities` class to keep the code self-contained:

In [22]:
del CityIterator  # just to make sure CityIterator is not in our global scope

In [23]:
class Cities:
    def __init__(self):
        self._cities = ['New York', 'Newark', 'New Delhi', 'Newcastle']
        
    def __len__(self):
        return len(self._cities)
    
    def __iter__(self):
        print('Calling Cities instance __iter__')
        return self.CityIterator(self)
    
    class CityIterator:
        def __init__(self, city_obj):
            # city_obj is an instance of Cities
            print('Calling CityIterator __init__')
            self._city_obj = city_obj
            self._index = 0

        def __iter__(self):
            print('Calling CityIterator instance __iter__')
            return self

        def __next__(self):
            print('Calling __next__')
            if self._index >= len(self._city_obj):
                raise StopIteration
            else:
                item = self._city_obj._cities[self._index]
                self._index += 1
                return item

In [24]:
cities = Cities()

In [25]:
list(enumerate(cities))

Calling Cities instance __iter__
Calling CityIterator __init__
Calling __next__
Calling __next__
Calling __next__
Calling __next__
Calling __next__


[(0, 'New York'), (1, 'Newark'), (2, 'New Delhi'), (3, 'Newcastle')]

Technically we can even get an iterator instance ourselves directly, by calling `iter()` on the `cities` object:

In [26]:
iter_1 = iter(cities)
iter_2 = iter(cities)

Calling Cities instance __iter__
Calling CityIterator __init__
Calling Cities instance __iter__
Calling CityIterator __init__


As you can see, Python created and returned two different instances of the `CityIterator` object.

In [27]:
id(iter_1), id(iter_2)

(1741231353928, 1741231354320)

And now we should also understand why **iterators** also implement the `__iter__` method (that just returns themselves) - it makes them **iterables** too!
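
A small sketch of the consequence, using a built-in list (the variable names are illustrative only): because an iterator's `__iter__` returns the iterator itself, looping over it a second time does not restart it.

```python
numbers = [10, 20, 30]

numbers_iter = iter(numbers)

# Calling iter() on an iterator hands back the very same object
same_object = iter(numbers_iter) is numbers_iter

# The first pass consumes the iterator...
first_pass = [n for n in numbers_iter]

# ...so a second pass over the same iterator finds nothing left
second_pass = [n for n in numbers_iter]

# The list itself is an iterable, so it produces a fresh iterator each time
fresh_pass = [n for n in numbers]
```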

#### Mixing Iterables and Sequences

`Cities` is an iterable, but it is not a sequence type:

In [28]:
cities = Cities()

In [29]:
len(cities)

4

In [30]:
cities[1]

TypeError: 'Cities' object does not support indexing

Since our Cities **could** also be a sequence, we could also decide to implement the `__getitem__` method to make it into a sequence:

In [31]:
class Cities:
    def __init__(self):
        self._cities = ['New York', 'Newark', 'New Delhi', 'Newcastle']
        
    def __len__(self):
        return len(self._cities)
    
    def __getitem__(self, s):
        print('getting item...')
        return self._cities[s]
    
    def __iter__(self):
        print('Calling Cities instance __iter__')
        return self.CityIterator(self)
    
    class CityIterator:
        def __init__(self, city_obj):
            # city_obj is an instance of Cities
            print('Calling CityIterator __init__')
            self._city_obj = city_obj
            self._index = 0

        def __iter__(self):
            print('Calling CityIterator instance __iter__')
            return self

        def __next__(self):
            print('Calling __next__')
            if self._index >= len(self._city_obj):
                raise StopIteration
            else:
                item = self._city_obj._cities[self._index]
                self._index += 1
                return item

In [32]:
cities = Cities()

It's a sequence:

In [33]:
cities[0]

getting item...


'New York'

It's also an iterable:

In [34]:
next(iter(cities))

Calling Cities instance __iter__
Calling CityIterator __init__
Calling __next__


'New York'

Now that Cities is both a sequence type (`__getitem__`) and an iterable (`__iter__`), when we loop over `cities`, is Python going to use `__getitem__` or `__iter__`?

In [35]:
cities = Cities()
for city in cities:
    print(city)

Calling Cities instance __iter__
Calling CityIterator __init__
Calling __next__
New York
Calling __next__
Newark
Calling __next__
New Delhi
Calling __next__
Newcastle
Calling __next__


It uses the iterator - so Python will use the iterator if there is one, otherwise it will fall back to using `__getitem__`. If neither is implemented, we'll get an exception.

Of course, for selection by index or slice, the `__getitem__` method **must** be implemented.

We'll come back to this very topic in an upcoming video, because behind the scenes, even if we only implement the `__getitem__` method, Python will auto-generate an iterator for us!
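
As a brief preview of that fallback, here is a minimal sketch with a hypothetical `Squares` class (not part of the `Cities` example) that implements only `__getitem__`. Python keeps requesting indices `0, 1, 2, ...` until an `IndexError` is raised:

```python
class Squares:
    """Hypothetical sequence-like class with __getitem__ only (no __iter__)."""

    def __init__(self, n):
        self._n = n

    def __getitem__(self, index):
        if index < 0 or index >= self._n:
            # IndexError is what tells Python the iteration is over
            raise IndexError(index)
        return index ** 2


squares = Squares(4)

has_iter = hasattr(squares, '__iter__')   # no __iter__ defined anywhere

# Iteration still works: Python builds an iterator on top of __getitem__
values = list(squares)
looped = [value for value in squares]
```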

### Python Built-In Iterables and Iterators

The way iterables and iterators work in our custom `Cities` example is exactly the way Python iterables work too.

In [36]:
l = [1, 2, 3]

Since lists are iterables, they implement the `__iter__` method and we can get an **iterator** for the list:

In [37]:
iter_l = iter(l)
# or could use iter_l = l.__iter__()

In [38]:
type(iter_l)

list_iterator

In [39]:
next(iter_l)

1

In [40]:
next(iter_l)

2

In [41]:
next(iter_l)

3

In [42]:
next(iter_l)

StopIteration: 

See? The same `StopIteration` exception is raised.

Since `iter_l` is an iterator, it also implements the `__iter__` method, which just returns the iterator itself:

In [43]:
id(iter_l), id(iter(iter_l))

(1741231347248, 1741231347248)

In [44]:
'__next__' in dir(iter_l)

True

In [45]:
'__iter__' in dir(iter_l)

True

Since the list `l` is an iterable it also implements the `__iter__` method:

In [46]:
'__iter__' in dir(l)

True

but does not implement a `__next__` method:

In [47]:
'__next__' in dir(l)

False

Of course, since lists are also sequence types, they also implement the `__getitem__` method:

In [48]:
'__getitem__' in dir(l)

True

Sets and dictionaries on the other hand are not sequence types:

In [49]:
'__getitem__' in dir(set)

False

In [50]:
'__iter__' in dir(set)

True

In [51]:
s = {1, 2, 3}
'__next__' in dir(iter(s))

True

In [52]:
'__iter__' in dir(dict)

True

But what does the iterator for a dictionary actually return? What does it iterate over? You can probably already guess the answer to that one!

In [53]:
d = dict(a=1, b=2, c=3)

In [54]:
iter_d = iter(d)

In [55]:
next(iter_d)

'a'

Dictionary iterators will iterate over the **keys** of the dictionary.

To iterate over the values, we could use the `values()` method which returns an **iterable** over the values of the dictionary:

In [56]:
iter_vals = iter(d.values())

In [57]:
next(iter_vals)

1

And to iterate over both the keys and values, dictionaries provide an `items()` iterable:

In [58]:
iter_items = iter(d.items())

In [59]:
next(iter_items)

('a', 1)

Here we get an iterator over (key, value) tuples.

We'll examine the usefulness of being able to iterate using `next` instead of a `for` loop, or comprehension, in the next video.

##  Consuming Iterators Manually

We've already seen how to do this:

* get an iterator from the iterable
* call next on the iterator (until the `StopIteration` exception is raised)
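
Those two steps can be sketched as an explicit `while` loop, which is roughly what a `for` loop does for us behind the scenes. The helper name `consume` is made up for this illustration:

```python
def consume(iterable):
    """Collect every item of an iterable by calling next() manually."""
    iterator = iter(iterable)          # step 1: get an iterator
    items = []
    while True:
        try:
            item = next(iterator)      # step 2: ask for the next item
        except StopIteration:
            break                      # the iterator is exhausted
        items.append(item)
    return items


letters = consume('abc')
keys = consume({'a': 1, 'b': 2})
```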

Let's quickly see how to do this again, using a string as the underlying iterable:

In [1]:
s = 'I sleep all night, and I work all day'

In [2]:
iter_s = iter(s)

In [3]:
print(next(iter_s))
print(next(iter_s))
print(next(iter_s))
print(next(iter_s))
print(next(iter_s))

I
 
s
l
e


This means we can get the next item in a collection without actually using a loop of any kind.

Why might this be useful?

#### Example 1

A fairly typical use case for this would be when reading data from a CSV file where you know the first few lines consist of information about the data rather than just the data itself.

Let's try this using a CSV file I have saved alongside the Jupyter notebook.

Let's first load the data and see what it looks like:

In [4]:
with open('cars.csv') as file:
    for line in file:
        print(line)    

Car;MPG;Cylinders;Displacement;Horsepower;Weight;Acceleration;Model;Origin

STRING;DOUBLE;INT;DOUBLE;DOUBLE;DOUBLE;DOUBLE;INT;CAT

Chevrolet Chevelle Malibu;18.0;8;307.0;130.0;3504.;12.0;70;US

Buick Skylark 320;15.0;8;350.0;165.0;3693.;11.5;70;US

Plymouth Satellite;18.0;8;318.0;150.0;3436.;11.0;70;US

AMC Rebel SST;16.0;8;304.0;150.0;3433.;12.0;70;US

Ford Torino;17.0;8;302.0;140.0;3449.;10.5;70;US

Ford Galaxie 500;15.0;8;429.0;198.0;4341.;10.0;70;US

Chevrolet Impala;14.0;8;454.0;220.0;4354.;9.0;70;US

Plymouth Fury iii;14.0;8;440.0;215.0;4312.;8.5;70;US

Pontiac Catalina;14.0;8;455.0;225.0;4425.;10.0;70;US

AMC Ambassador DPL;15.0;8;390.0;190.0;3850.;8.5;70;US

Citroen DS-21 Pallas;0;4;133.0;115.0;3090.;17.5;70;Europe

Chevrolet Chevelle Concours (sw);0;8;350.0;165.0;4142.;11.5;70;US

Ford Torino (sw);0;8;351.0;153.0;4034.;11.0;70;US

Plymouth Satellite (sw);0;8;383.0;175.0;4166.;10.5;70;US

AMC Rebel SST (sw);0;8;360.0;175.0;3850.;11.0;70;US

Dodge Challenger SE;15.0;8;383.0;170.

As we can see, the values are delimited by `;` and the first two lines consist of the column names, and column types.

The reason for the spacing between each line is that each line ends with a newline, and our print statement also emits a newline by default. So we'll have to strip those out.

Here's what we want to do: 
* read the first line to get the column headers and create a named tuple class
* read data types from second line and store this so we can cast the strings we are reading to the correct data type
* read the data rows and parse them into named tuples

We could do it this way:

In [5]:
with open('cars.csv') as file:
    row_index = 0
    for line in file:
        if row_index == 0:
            # header row
            headers = line.strip('\n').split(';')
            print(headers)
        elif row_index == 1:
            # data type row
            data_types = line.strip('\n').split(';')
            print(data_types)
        else:
            # data rows
            data = line.strip('\n').split(';')
            print(data)
        row_index += 1

['Car', 'MPG', 'Cylinders', 'Displacement', 'Horsepower', 'Weight', 'Acceleration', 'Model', 'Origin']
['STRING', 'DOUBLE', 'INT', 'DOUBLE', 'DOUBLE', 'DOUBLE', 'DOUBLE', 'INT', 'CAT']
['Chevrolet Chevelle Malibu', '18.0', '8', '307.0', '130.0', '3504.', '12.0', '70', 'US']
['Buick Skylark 320', '15.0', '8', '350.0', '165.0', '3693.', '11.5', '70', 'US']
['Plymouth Satellite', '18.0', '8', '318.0', '150.0', '3436.', '11.0', '70', 'US']
['AMC Rebel SST', '16.0', '8', '304.0', '150.0', '3433.', '12.0', '70', 'US']
['Ford Torino', '17.0', '8', '302.0', '140.0', '3449.', '10.5', '70', 'US']
['Ford Galaxie 500', '15.0', '8', '429.0', '198.0', '4341.', '10.0', '70', 'US']
['Chevrolet Impala', '14.0', '8', '454.0', '220.0', '4354.', '9.0', '70', 'US']
['Plymouth Fury iii', '14.0', '8', '440.0', '215.0', '4312.', '8.5', '70', 'US']
['Pontiac Catalina', '14.0', '8', '455.0', '225.0', '4425.', '10.0', '70', 'US']
['AMC Ambassador DPL', '15.0', '8', '390.0', '190.0', '3850.', '8.5', '70', 'US']
[

In [6]:
from collections import namedtuple
cars = []

with open('cars.csv') as file:
    row_index = 0
    for line in file:
        if row_index == 0:
            # header row
            headers = line.strip('\n').split(';')
            Car = namedtuple('Car', headers)
        elif row_index == 1:
            # data type row
            data_types = line.strip('\n').split(';')
            print(data_types)
        else:
            # data rows
            data = line.strip('\n').split(';')
            car = Car(*data)
            cars.append(car)
        row_index += 1

['STRING', 'DOUBLE', 'INT', 'DOUBLE', 'DOUBLE', 'DOUBLE', 'DOUBLE', 'INT', 'CAT']


In [7]:
print(cars[0])

Car(Car='Chevrolet Chevelle Malibu', MPG='18.0', Cylinders='8', Displacement='307.0', Horsepower='130.0', Weight='3504.', Acceleration='12.0', Model='70', Origin='US')


We still need to parse the data into strings, integers, floats...

Let's break this problem down into smaller chunks:

First we need to figure out how to cast a value to a data type based on the data type string:
* STRING --> `str`
* DOUBLE --> `float`
* INT --> `int`
* CAT --> `str`

In [8]:
def cast(data_type, value):
    if data_type == 'DOUBLE':
        return float(value)
    elif data_type == 'INT':
        return int(value)
    else:
        return str(value)

Next we somehow have to cast all the items in a list, based on their corresponding data type in the `data_types` list:

In [9]:
data_types = ['STRING', 'DOUBLE', 'INT', 'DOUBLE', 'DOUBLE', 'DOUBLE', 'DOUBLE', 'INT', 'CAT']

In [10]:
data_row = ['Chevrolet Chevelle Malibu', '18.0', '8', '307.0', '130.0', '3504.', '12.0', '70', 'US']

For something like this, we can just zip up the two lists:

In [11]:
list(zip(data_types, data_row))

[('STRING', 'Chevrolet Chevelle Malibu'),
 ('DOUBLE', '18.0'),
 ('INT', '8'),
 ('DOUBLE', '307.0'),
 ('DOUBLE', '130.0'),
 ('DOUBLE', '3504.'),
 ('DOUBLE', '12.0'),
 ('INT', '70'),
 ('CAT', 'US')]

And we can either use a `map()` or a list comprehension to apply the cast function to each one:

In [12]:
[cast(data_type, value) for data_type, value in zip(data_types, data_row)]

['Chevrolet Chevelle Malibu', 18.0, 8, 307.0, 130.0, 3504.0, 12.0, 70, 'US']
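
For comparison, here is a sketch of the `map()` variant mentioned above. It relies on the fact that `map()` accepts several iterables and passes one item from each to the function, so no explicit `zip` is needed. The `cast` function and sample data are repeated only to keep the snippet self-contained:

```python
def cast(data_type, value):
    if data_type == 'DOUBLE':
        return float(value)
    elif data_type == 'INT':
        return int(value)
    else:
        return str(value)


data_types = ['STRING', 'DOUBLE', 'INT', 'DOUBLE', 'DOUBLE',
              'DOUBLE', 'DOUBLE', 'INT', 'CAT']
data_row = ['Chevrolet Chevelle Malibu', '18.0', '8', '307.0', '130.0',
            '3504.', '12.0', '70', 'US']

# map() pairs up the two lists item by item and calls cast(data_type, value)
casted = list(map(cast, data_types, data_row))
```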

So now we can write this in a function:

In [13]:
def cast_row(data_types, data_row):
    return [cast(data_type, value) 
            for data_type, value in zip(data_types, data_row)]

Let's go back and fix up our original code now:

In [14]:
from collections import namedtuple
cars = []

with open('cars.csv') as file:
    row_index = 0
    for line in file:
        if row_index == 0:
            # header row
            headers = line.strip('\n').split(';')
            Car = namedtuple('Car', headers)
        elif row_index == 1:
            # data type row
            data_types = line.strip('\n').split(';')
        else:
            # data rows
            data = line.strip('\n').split(';')
            data = cast_row(data_types, data)
            car = Car(*data)
            cars.append(car)
        row_index += 1

In [15]:
cars[0]

Car(Car='Chevrolet Chevelle Malibu', MPG=18.0, Cylinders=8, Displacement=307.0, Horsepower=130.0, Weight=3504.0, Acceleration=12.0, Model=70, Origin='US')

Now let's see if we can clean up this code by using iterators directly:

In [16]:
from collections import namedtuple
cars = []

with open('cars.csv') as file:
    file_iter = iter(file)
    headers = next(file_iter).strip('\n').split(';')
    Car = namedtuple('Car', headers)
    data_types = next(file_iter).strip('\n').split(';')
    for line in file_iter:
        data = line.strip('\n').split(';')
        data = cast_row(data_types, data)
        car = Car(*data)
        cars.append(car)

In [17]:
cars[0]

Car(Car='Chevrolet Chevelle Malibu', MPG=18.0, Cylinders=8, Displacement=307.0, Horsepower=130.0, Weight=3504.0, Acceleration=12.0, Model=70, Origin='US')

That's already quite a bit cleaner... But why stop there!

In [18]:
from collections import namedtuple

with open('cars.csv') as file:
    file_iter = iter(file)
    headers = next(file_iter).strip('\n').split(';')
    Car = namedtuple('Car', headers)
    data_types = next(file_iter).strip('\n').split(';')
    cars_data = [cast_row(data_types, 
                          line.strip('\n').split(';'))
                   for line in file_iter]
    cars = [Car(*item) for item in cars_data]

In [19]:
cars_data[0]

['Chevrolet Chevelle Malibu', 18.0, 8, 307.0, 130.0, 3504.0, 12.0, 70, 'US']

In [20]:
cars[0]

Car(Car='Chevrolet Chevelle Malibu', MPG=18.0, Cylinders=8, Displacement=307.0, Horsepower=130.0, Weight=3504.0, Acceleration=12.0, Model=70, Origin='US')

I chose to split creating the parsed cars_data and the named tuple list into two steps for readability - but we could combine them into a single step:

In [21]:
from collections import namedtuple

with open('cars.csv') as file:
    file_iter = iter(file)
    headers = next(file_iter).strip('\n').split(';')
    Car = namedtuple('Car', headers)
    data_types = next(file_iter).strip('\n').split(';')
    cars = [Car(*cast_row(data_types, 
                          line.strip('\n').split(';')))
            for line in file_iter]


In [22]:
cars[0]

Car(Car='Chevrolet Chevelle Malibu', MPG=18.0, Cylinders=8, Displacement=307.0, Horsepower=130.0, Weight=3504.0, Acceleration=12.0, Model=70, Origin='US')
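
The same "pull the first two rows with `next`, then loop over the rest" pattern also works with the standard library's `csv.reader`, which returns an iterator over parsed rows. This is only a sketch under a few assumptions: an in-memory `io.StringIO` with a handful of made-up columns stands in for `cars.csv`, and the `CASTS` lookup table is a hypothetical replacement for the `cast` function:

```python
import csv
import io
from collections import namedtuple

# A small stand-in for cars.csv (same layout, fewer columns and rows)
sample = (
    "Car;MPG;Cylinders\n"
    "STRING;DOUBLE;INT\n"
    "Ford Torino;17.0;8\n"
    "Chevrolet Impala;14.0;8\n"
)

# Hypothetical lookup table; any type not listed is kept as a string
CASTS = {'DOUBLE': float, 'INT': int}

with io.StringIO(sample) as file:
    reader = csv.reader(file, delimiter=';')   # an iterator over parsed rows
    headers = next(reader)                      # first row: column names
    data_types = next(reader)                   # second row: column types
    Car = namedtuple('Car', headers)
    cars = [
        Car(*(CASTS.get(data_type, str)(value)
              for data_type, value in zip(data_types, row)))
        for row in reader                       # remaining rows: the data
    ]
```

Since `csv.reader` already splits each line on the delimiter and drops the line ending, the `strip` and `split` calls are no longer needed.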

##  Cyclic Iterators

Iterables do not have to be finite. In fact we can easily create an infinite cyclical iterator.

Here's an example - suppose we have a loop that iterates over some range of integers. As we loop through those integers we want to create a tuple containing the integer and a string that cycles over a finite set (smaller than the list of integers).

```
1, 2, 3, 4, 5, 6, 7, 8, 9, ...

N, S, W, E
```

and we want to generate

```
1N, 2S, 3W, 4E, 5N, 6S, 7W, 8E, 9N, ...
```


We could do it this way by creating a custom iterator for the list `['N', 'S', 'W', 'E']` that will cycle over that list indefinitely:

In [1]:
class CyclicIterator:
    def __init__(self, lst):
        self.lst = lst
        self.i = 0
        
    def __iter__(self):
        return self
    
    def __next__(self):
        result = self.lst[self.i % len(self.lst)]
        self.i += 1
        return result

In [2]:
iter_cycl = CyclicIterator('NSWE')

In [3]:
for i in range(10):
    print(next(iter_cycl))

N
S
W
E
N
S
W
E
N
S


So, now we can tackle our original problem:

In [4]:
n = 10
iter_cycl = CyclicIterator('NSWE')
for i in range(1, n+1):
    direction = next(iter_cycl)
    print(f'{i}{direction}')

1N
2S
3W
4E
5N
6S
7W
8E
9N
10S


And re-working this into a list comprehension:

In [5]:
n = 10
iter_cycl = CyclicIterator('NSWE')
[f'{i}{next(iter_cycl)}' for i in range(1, n+1)]

['1N', '2S', '3W', '4E', '5N', '6S', '7W', '8E', '9N', '10S']

Of course, there's an easy alternative way to do this as well, using:
* repetition
* zip
* a list comprehension

We need to repeat the sequence `'NSWE'` at least as many times as needed to cover the elements in our range of integers - we can even create way more than we need - because when we `zip` it up with the range of integers, the zip stops as soon as the shortest iterable is exhausted:

In [6]:
n = 10
list(zip(range(1, n+1), 'NSWE' * (n//4 + 1)))

[(1, 'N'),
 (2, 'S'),
 (3, 'W'),
 (4, 'E'),
 (5, 'N'),
 (6, 'S'),
 (7, 'W'),
 (8, 'E'),
 (9, 'N'),
 (10, 'S')]

In [7]:
[f'{i}{direction}'
 for i, direction in zip(range(1, n+1), 'NSWE' * (n//4 + 1))]

['1N', '2S', '3W', '4E', '5N', '6S', '7W', '8E', '9N', '10S']

There's an even easier way yet: keep the approach we used with `CyclicIterator`, but instead of building the iterator ourselves, use the one provided by Python in the standard library (`itertools.cycle`).

In [8]:
import itertools

In [9]:
n = 10
iter_cycl = CyclicIterator('NSWE')
[f'{i}{next(iter_cycl)}' for i in range(1, n+1)]

['1N', '2S', '3W', '4E', '5N', '6S', '7W', '8E', '9N', '10S']

and using itertools:

In [10]:
n = 10
iter_cycl = itertools.cycle('NSWE')
[f'{i}{next(iter_cycl)}' for i in range(1, n+1)]

['1N', '2S', '3W', '4E', '5N', '6S', '7W', '8E', '9N', '10S']
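
The two ideas can also be combined. In this sketch, `zip` pairs the finite range with the infinite `itertools.cycle` iterator and stops as soon as the range runs out, so there is no need to call `next` by hand or to work out how many repetitions are required:

```python
import itertools

n = 10

# zip stops when the shortest iterable (the range) is exhausted,
# even though itertools.cycle would keep going forever
labels = [f'{i}{direction}'
          for i, direction in zip(range(1, n + 1), itertools.cycle('NSWE'))]
```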

##  Lazy Iterables

An iterable is an object that can return an iterator (`__iter__`).

In turn an iterator is an object that can return itself (`__iter__`), and return the next value when asked (`__next__`).

Nothing in all this says that the iterable needs to be a finite collection, or that the elements in the iterable need to be materialized (pre-created) at the time the iterable / iterator is created.

Lazy evaluation is when evaluating a value is deferred until it is actually requested.

It is not specific to iterables however.

Simple examples of lazy evaluation are often seen in classes for calculated properties.

Let's work up to an example of a lazy class property, starting with a version that is **not** lazy:

In [1]:
import math

class Circle:
    def __init__(self, r):
        self.radius = r
        
    @property
    def radius(self):
        return self._radius
    
    @radius.setter
    def radius(self, r):
        self._radius = r
        self.area = math.pi * r**2

As you can see, in this circle class, every time we set the radius, we re-calculate and store the area. When we request the area of the circle, we simply return the stored value.

In [2]:
c = Circle(1)

In [3]:
c.area

3.141592653589793

In [4]:
c.radius = 2

In [5]:
c.radius, c.area

(2, 12.566370614359172)

But instead of doing it this way, we could just calculate the area every time it is requested without actually storing the value:

In [6]:
class Circle:
    def __init__(self, r):
        self.radius = r
        
    @property
    def radius(self):
        return self._radius
    
    @radius.setter
    def radius(self, r):
        self._radius = r

    @property
    def area(self):
        return math.pi * self.radius ** 2

In [7]:
c = Circle(1)

In [8]:
c.area

3.141592653589793

In [9]:
c.radius = 2

In [10]:
c.area

12.566370614359172

But now the area is recalculated on every request. We can take a hybrid approach: store the area so we don't need to recalculate it every time (except when the radius is modified), but delay calculating it until it is first requested. That way, if it is never requested, we don't waste the CPU cycles to calculate it or the memory to store it.

In [11]:
class Circle:
    def __init__(self, r):
        self.radius = r
        
    @property
    def radius(self):
        return self._radius
    
    @radius.setter
    def radius(self, r):
        self._radius = r
        self._area = None

    @property
    def area(self):
        if self._area is None:
            print('Calculating area...')
            self._area = math.pi * self.radius ** 2
        return self._area

In [12]:
c = Circle(1)

In [13]:
c.area

Calculating area...


3.141592653589793

In [14]:
c.area

3.141592653589793

In [15]:
c.radius = 2

In [16]:
c.area

Calculating area...


12.566370614359172

This is an example of lazy evaluation. We don't actually calculate and store an attribute of the class until it is actually needed.
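
A similar effect can be sketched with `functools.cached_property` (available in Python 3.8 and later). This is an alternative to the hand-written `_area` bookkeeping, not the approach used above. It assumes the documented behaviour that the cached value is stored in the instance `__dict__` under the property name, so the setter can invalidate it by removing that entry. The `calculations` counter exists only to make the laziness observable:

```python
import math
from functools import cached_property


class Circle:
    calculations = 0  # illustration only: counts how often area is computed

    def __init__(self, r):
        self.radius = r

    @property
    def radius(self):
        return self._radius

    @radius.setter
    def radius(self, r):
        self._radius = r
        # Drop any cached area so it is recomputed on the next request
        self.__dict__.pop('area', None)

    @cached_property
    def area(self):
        Circle.calculations += 1
        return math.pi * self.radius ** 2


c = Circle(1)
calls_before = Circle.calculations   # nothing computed yet
first = c.area                       # computed now
second = c.area                      # served from the cache
c.radius = 2                         # invalidates the cache
third = c.area                       # computed again
```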

We can sometimes do something similar with iterables - we don't actually have to store every item of the collection - we may be able to just calculate the item as needed.

In the following example we'll create an iterable of the first `n` factorials, starting at `0!`, i.e.

`0!, 1!, 2!, 3!, ..., (n-1)!`

In [18]:
class Factorials:
    def __init__(self, length):
        self.length = length
    
    def __iter__(self):
        return self.FactIter(self.length)
    
    class FactIter:
        def __init__(self, length):
            self.length = length
            self.i = 0
            
        def __iter__(self):
            return self
        
        def __next__(self):
            if self.i >= self.length:
                raise StopIteration
            else:
                result = math.factorial(self.i)
                self.i += 1
                return result
            

In [19]:
facts = Factorials(5)

In [20]:
list(facts)

[1, 1, 2, 6, 24]

So as you can see, we do not store the values of the iterable, instead we just calculate the items as needed.

In fact, now that we have this iterable, we don't even need it to be finite:

In [23]:
class Factorials:
    def __iter__(self):
        return self.FactIter()
    
    class FactIter:
        def __init__(self):
            self.i = 0
            
        def __iter__(self):
            return self
        
        def __next__(self):
            result = math.factorial(self.i)
            self.i += 1
            return result

In [25]:
factorials = Factorials()
fact_iter = iter(factorials)

for _ in range(10):
    print(next(fact_iter))

1
1
2
6
24
120
720
5040
40320
362880
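
Instead of a counting loop around `next`, a bounded number of items can also be taken from an infinite iterable with `itertools.islice`. A minimal sketch, with the infinite `Factorials` class repeated so the snippet stands on its own:

```python
import math
from itertools import islice


class Factorials:
    def __iter__(self):
        return self.FactIter()

    class FactIter:
        def __init__(self):
            self.i = 0

        def __iter__(self):
            return self

        def __next__(self):
            result = math.factorial(self.i)
            self.i += 1
            return result


# islice stops after the requested number of items,
# so the infinite iterator is never run to "completion"
first_five = list(islice(Factorials(), 5))

# islice(iterable, start, stop) skips the first `start` items
next_five = list(islice(Factorials(), 5, 10))
```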


You'll notice that the main part of the iterable code is in the iterator, and the iterable itself is nothing more than a thin shell that allows us to create and access the iterator. This is so common that there is a better way of doing this, which we'll see when we deal with generators.

##  Python's Built-In Iterables and Iterators

Python has a lot of built-in functions that return iterators or iterables.

Let's look at the simple `range` function first:

In [48]:
r_10 = range(10)

Now, `r_10` is an **iterable**:

In [49]:
'__iter__' in dir(r_10)

True

But it is not an **iterator**:

In [50]:
'__next__' in dir(r_10)

False

However, we can request an iterator by calling the `__iter__` method, or simply using the `iter()` function:

In [51]:
r_10_iter = iter(r_10)

And of course this is now an iterator:

In [52]:
'__iter__' in dir(r_10_iter)

True

In [53]:
'__next__' in dir(r_10_iter)

True

Most built-in iterables in Python use lazy evaluation (including the `range` function) - i.e. when we execute `range(10)` Python does not pre-compute a "list" of all the elements in the range. Instead it uses lazy evaluation, and the iterator computes and returns elements one at a time.

This is why when we print a range object we do not actually see the contents of the range - they don't exist yet!

Instead, we need to iterate through the range and put its elements into something like a list:

In [54]:
[num for num in range(10)]

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
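
A quick sketch of what that laziness buys us: a very large range can be created, measured, indexed and searched without its elements ever being stored. The size `10**9` is an arbitrary choice for illustration:

```python
big = range(10**9)

# None of these operations builds a billion-element list
size = len(big)
item = big[123_456]              # computed from start and step
found = 999_999_999 in big       # a membership test on ints needs no scan
text = repr(big)                 # shows the definition, not the contents
```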

The `zip` function on the other hand returns an iterator:

In [1]:
z = zip([1, 2, 3], 'abc')

In [2]:
z

<zip at 0x28b01b684c8>

It is an **iterator**:

In [3]:
print('__iter__' in dir(z))
print('__next__' in dir(z))

True
True


Just like `range()` though, it also uses lazy evaluation, so we need to iterate through the iterator and make a list in order to see the contents:

In [4]:
list(z)

[(1, 'a'), (2, 'b'), (3, 'c')]
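
Because `zip` returns an iterator rather than an iterable that hands out fresh iterators, it can only be consumed once. A minimal sketch of that behaviour:

```python
z = zip([1, 2, 3], 'abc')

first = list(z)    # consumes the iterator
second = list(z)   # nothing left the second time around

# To iterate again we have to call zip again
again = list(zip([1, 2, 3], 'abc'))
```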

Even reading a file line by line is done using lazy evaluation:

In [59]:
with open('cars.csv') as f:
    print(type(f))
    print('__iter__' in dir(f))
    print('__next__' in dir(f))

<class '_io.TextIOWrapper'>
True
True


As you can see, the `open()` function returns an **iterator** (of type `TextIOWrapper`), and we can read lines from the file one by one using the `next()` function, or calling the `__next__()` method. The class also implements a `readline()` method we can use to get the next row:

In [60]:
with open('cars.csv') as f:
    print(next(f))
    print(f.__next__())
    print(f.readline())

Car;MPG;Cylinders;Displacement;Horsepower;Weight;Acceleration;Model;Origin

STRING;DOUBLE;INT;DOUBLE;DOUBLE;DOUBLE;DOUBLE;INT;CAT

Chevrolet Chevelle Malibu;18.0;8;307.0;130.0;3504.;12.0;70;US



Of course we can just iterate over all the lines using a `for` loop as well:

In [23]:
with open('cars.csv') as f:
    for row in f:
        print(row, end='')

Car;MPG;Cylinders;Displacement;Horsepower;Weight;Acceleration;Model;Origin
STRING;DOUBLE;INT;DOUBLE;DOUBLE;DOUBLE;DOUBLE;INT;CAT
Chevrolet Chevelle Malibu;18.0;8;307.0;130.0;3504.;12.0;70;US
Buick Skylark 320;15.0;8;350.0;165.0;3693.;11.5;70;US
Plymouth Satellite;18.0;8;318.0;150.0;3436.;11.0;70;US
AMC Rebel SST;16.0;8;304.0;150.0;3433.;12.0;70;US
Ford Torino;17.0;8;302.0;140.0;3449.;10.5;70;US
Ford Galaxie 500;15.0;8;429.0;198.0;4341.;10.0;70;US
Chevrolet Impala;14.0;8;454.0;220.0;4354.;9.0;70;US
Plymouth Fury iii;14.0;8;440.0;215.0;4312.;8.5;70;US
Pontiac Catalina;14.0;8;455.0;225.0;4425.;10.0;70;US
AMC Ambassador DPL;15.0;8;390.0;190.0;3850.;8.5;70;US
Citroen DS-21 Pallas;0;4;133.0;115.0;3090.;17.5;70;Europe
Chevrolet Chevelle Concours (sw);0;8;350.0;165.0;4142.;11.5;70;US
Ford Torino (sw);0;8;351.0;153.0;4034.;11.0;70;US
Plymouth Satellite (sw);0;8;383.0;175.0;4166.;10.5;70;US
AMC Rebel SST (sw);0;8;360.0;175.0;3850.;11.0;70;US
Dodge Challenger SE;15.0;8;383.0;170.0;3563.;10.0;70;U

The `TextIOWrapper` class also provides a method `readlines()` that will read the entire file and return a list containing all the rows:

In [25]:
with open('cars.csv') as f:
    l = f.readlines()

In [26]:
l

['Car;MPG;Cylinders;Displacement;Horsepower;Weight;Acceleration;Model;Origin\n',
 'STRING;DOUBLE;INT;DOUBLE;DOUBLE;DOUBLE;DOUBLE;INT;CAT\n',
 'Chevrolet Chevelle Malibu;18.0;8;307.0;130.0;3504.;12.0;70;US\n',
 'Buick Skylark 320;15.0;8;350.0;165.0;3693.;11.5;70;US\n',
 'Plymouth Satellite;18.0;8;318.0;150.0;3436.;11.0;70;US\n',
 'AMC Rebel SST;16.0;8;304.0;150.0;3433.;12.0;70;US\n',
 'Ford Torino;17.0;8;302.0;140.0;3449.;10.5;70;US\n',
 'Ford Galaxie 500;15.0;8;429.0;198.0;4341.;10.0;70;US\n',
 'Chevrolet Impala;14.0;8;454.0;220.0;4354.;9.0;70;US\n',
 'Plymouth Fury iii;14.0;8;440.0;215.0;4312.;8.5;70;US\n',
 'Pontiac Catalina;14.0;8;455.0;225.0;4425.;10.0;70;US\n',
 'AMC Ambassador DPL;15.0;8;390.0;190.0;3850.;8.5;70;US\n',
 'Citroen DS-21 Pallas;0;4;133.0;115.0;3090.;17.5;70;Europe\n',
 'Chevrolet Chevelle Concours (sw);0;8;350.0;165.0;4142.;11.5;70;US\n',
 'Ford Torino (sw);0;8;351.0;153.0;4034.;11.0;70;US\n',
 'Plymouth Satellite (sw);0;8;383.0;175.0;4166.;10.5;70;US\n',
 'AMC Rebe

So which approach should we use: the `readlines()` method, or the iterator protocol?

Especially if we end up reading the entire file anyway - would one approach be better than the other?

Consider this example, where we want to find out all the different origins in the file (last column of each row) - let's do this using both approaches.

In [1]:
origins = set()
with open('cars.csv') as f:
    rows = f.readlines()
for row in rows[2:]:
    origin = row.strip('\n').split(';')[-1]
    origins.add(origin)
print(origins)

{'Japan', 'Europe', 'US'}


In [2]:
origins = set()
with open('cars.csv') as f:
    next(f), next(f)
    for row in f:
        origin = row.strip('\n').split(';')[-1]
        origins.add(origin)
print(origins)

{'Japan', 'Europe', 'US'}


Now consider the first approach: we loaded the **entire** file into memory (a list), and then iterated through all the rows.

But in the second approach, we still iterated through all the rows, but we only needed to hold **one row** in memory at a time - the memory overhead was therefore far smaller.

Often we can process files one row at a time, so loading the entire file first is not always desirable, especially for huge files.
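
Here is a minimal sketch of the streaming approach wrapped in a reusable function. It assumes semicolon-delimited text with two header rows, as in `cars.csv`. The helper name `distinct_last_column` and the sample rows are made up for illustration, and `io.StringIO` stands in for a real file so the snippet runs anywhere:

```python
import io
from itertools import islice


def distinct_last_column(lines, header_rows=2, sep=';'):
    """Collect the distinct values in the last column, one row at a time."""
    values = set()
    # islice skips the header rows lazily, without building a list
    for row in islice(lines, header_rows, None):
        values.add(row.rstrip('\n').split(sep)[-1])
    return values


# Hypothetical sample data standing in for an open file object
sample = io.StringIO(
    'Car;MPG;Origin\n'
    'STRING;DOUBLE;CAT\n'
    'Ford Torino;17.0;US\n'
    'Citroen DS-21 Pallas;0;Europe\n'
    'Chevrolet Impala;14.0;US\n'
)
origins = distinct_last_column(sample)
print(sorted(origins))
```

Because the function only ever asks for the next row, it should work the same way with a file object returned by `open()`.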

The `enumerate` function is another lazy iterator:

In [42]:
e = enumerate('Python rocks!')

In [43]:
print('__iter__' in dir(e))
print('__next__' in dir(e))

True
True


In [44]:
iter(e)

<enumerate at 0x1d75df12fc0>

In [45]:
e

<enumerate at 0x1d75df12fc0>

As we can see, the object and its iterator are the same object.

But `enumerate` is also lazy, so we need to iterate through it in order to recover all the elements:

In [46]:
list(e)

[(0, 'P'),
 (1, 'y'),
 (2, 't'),
 (3, 'h'),
 (4, 'o'),
 (5, 'n'),
 (6, ' '),
 (7, 'r'),
 (8, 'o'),
 (9, 'c'),
 (10, 'k'),
 (11, 's'),
 (12, '!')]

Of course, once we have exhausted the iterator, we cannot use it again:

In [47]:
list(e)

[]

The dictionary object provides methods that return iterables for the keys, values or tuples of key/value pairs:

In [63]:
d = {'a': 1, 'b': 2}

In [64]:
keys = d.keys()

In [66]:
'__iter__' in dir(keys), '__next__' in dir(keys)

(True, False)

More simply, we can just test to see if `iter(keys)` **is** the same object as `keys` - if not then we are dealing with an iterable.

In [67]:
iter(keys) is keys

False

So we have an iterable.

Similarly for `.values()` and `.items()`:

In [68]:
values = d.values()
iter(values) is values

False

In [69]:
items = d.items()
iter(items) is items

False

There are many other such functions and methods in Python, and we'll cover more of them in some upcoming videos.

Just be careful and know whether you are dealing with an iterable or an iterator. You can iterate an iterable over and over again, but can only do so once with an iterator.
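
A quick side-by-side sketch of that difference, using only built-ins (the variable names are arbitrary):

```python
d = {'a': 1, 'b': 2}

keys = d.keys()            # a view: an iterable, not an iterator
first_pass = list(keys)
second_pass = list(keys)   # a fresh iterator is requested each time

e = enumerate('ab')        # an iterator: iter(e) is e
first_enum = list(e)
second_enum = list(e)      # already exhausted, so nothing is left

print(first_pass, second_pass)
print(first_enum, second_enum)
```

The keys view produces the same elements on both passes, while the second pass over the `enumerate` object comes back empty.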

##  Sorting Iterables

There's nothing really new here - we have seen the `sorted()` function before when we looked at sorting sequences.

The `sorted()` function will in fact work with any iterable, not just sequences.

Let's try this by creating a custom iterable and then sorting it.

For this example, we'll create an iterable of random numbers, and then sort it.

In [1]:
import random

In [2]:
random.seed(0)

In [8]:
for i in range(10):
    print(random.randint(1, 10))

10
4
9
3
5
3
2
10
5
9


In [4]:
import random

class RandomInts:
    def __init__(self, length, *, seed=0, lower=0, upper=10):
        self.length = length
        self.seed = seed
        self.lower = lower
        self.upper = upper
        
    def __len__(self):
        return self.length
    
    def __iter__(self):
        return self.RandomIterator(self.length, 
                                   seed = self.seed, 
                                   lower = self.lower,
                                   upper=self.upper)
    
    
    class RandomIterator:
        def __init__(self, length, *, seed, lower, upper):
            self.length = length
            self.lower = lower
            self.upper = upper
            self.num_requests = 0
            random.seed(seed)
            
        def __iter__(self):
            return self
        
        def __next__(self):
            if self.num_requests >= self.length:
                raise StopIteration
            else:
                result = random.randint(self.lower, self.upper)
                self.num_requests += 1
                return result

In [5]:
randoms = RandomInts(10)

In [6]:
for num in randoms:
    print(num)

6
6
0
4
8
7
6
4
7
5


We can now sort our iterable using the `sorted()` function:

In [7]:
sorted(randoms)

[0, 4, 4, 5, 6, 6, 6, 7, 7, 8]

In [9]:
sorted(randoms, reverse=True)

[8, 7, 7, 6, 6, 6, 5, 4, 4, 0]
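
The same holds for other non-sequence iterables. As a small sketch (the values are chosen arbitrarily), here `sorted()` is applied to a set and to a generator expression:

```python
# A set has no ordering, and a generator is a one-shot iterator,
# yet sorted() consumes both and always returns a new list.
unordered = {3, 1, 2}
squares = (n * n for n in (3, -1, 2))

sorted_set = sorted(unordered)
sorted_squares = sorted(squares)
leftover = list(squares)  # the generator was exhausted by sorted()

print(sorted_set, sorted_squares, leftover)
```

Note that sorting an iterator consumes it, which is worth keeping in mind if you need the elements again afterwards.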

##  The `iter()` Function

As we have seen before, the `iter()` function is used to request an iterator object from an iterable.

For example:

In [1]:
l = [1, 2, 3, 4]

In [2]:
l_iter = iter(l)

In [3]:
type(l_iter)

list_iterator

And we can use that iterator to iterate the collection by calling `next()` until a `StopIteration` exception is raised.

In [4]:
next(l_iter)

1

In [5]:
next(l_iter)

2

We also saw that a custom sequence type can be iterated even when it does not implement the iterable protocol - it may have no `__iter__` method at all, as long as it has a `__getitem__` method.

Python had no problem iterating a sequence object - in fact behind the scenes an iterator is built by Python in order to iterate using the `__getitem__` method:

In [6]:
class Squares:
    def __init__(self, n):
        self._n = n
    
    def __len__(self):
        return self._n
    
    def __getitem__(self, i):
        if i >= self._n:
            raise IndexError
        else:
            return i ** 2

In [7]:
sq = Squares(5)

In [8]:
for i in sq:
    print(i)

0
1
4
9
16


But, we can also do this:

In [9]:
sq_iter = iter(sq)

And we now have an iterator for `sq`!

In [10]:
type(sq_iter)

iterator

In [11]:
'__next__' in dir(sq_iter)

True

What happens is that Python will first try to get the iterator by invoking the `__iter__` method on our object.

If it does not have that method, it will look for `__getitem__` next - if it's there it will create an iterator for us that will leverage `__getitem__` and the fact that sequence indices should start at 0.

If neither `__iter__` nor `__getitem__` is found, then we'll get an exception such as this one:

In [12]:
for i in 10:
    print(i)

TypeError: 'int' object is not iterable

Here's how we might build an iterator using the `__getitem__` method ourselves - not that we have to do that since Python does it for us.

In [13]:
class Squares:
    def __init__(self, n):
        self._n = n
    
    def __len__(self):
        return self._n
    
    def __getitem__(self, i):
        if i >= self._n:
            raise IndexError
        else:
            return i ** 2

In [19]:
class SquaresIterator:
    def __init__(self, squares):
        self._squares = squares
        self._i = 0
        
    def __iter__(self):
        return self
    
    def __next__(self):
        if self._i >= len(self._squares):
            raise StopIteration
        else:
            result = self._squares[self._i]
            self._i += 1
            return result

In [20]:
sq = Squares(5)
sq_iterator = SquaresIterator(sq)

In [21]:
type(sq_iterator)

__main__.SquaresIterator

In [22]:
print(next(sq_iterator))
print(next(sq_iterator))
print(next(sq_iterator))
print(next(sq_iterator))
print(next(sq_iterator))

0
1
4
9
16


The iterator is now exhausted, so:

In [23]:
print(next(sq_iterator))

StopIteration: 

Technically, we don't actually need to implement the `__len__` method in our sequence type, but since our iterator relies on it, we'll have to think of something else - we can leverage the fact that the sequence will raise an `IndexError` if the index is out of bounds:

In [25]:
class SquaresIterator:
    def __init__(self, squares):
        self._squares = squares
        self._i = 0
        
    def __iter__(self):
        return self
    
    def __next__(self):
        try:
            result = self._squares[self._i]
            self._i += 1
            return result
        except IndexError:
            raise StopIteration()

And things will work as before:

In [26]:
sq_iterator = SquaresIterator(sq)

In [27]:
for i in sq_iterator:
    print(i)

0
1
4
9
16


#### How to test if an object is iterable

Basically an object is iterable if it:
* implements the **iterable** protocol (`__iter__` that returns an iterator)
* implements the **sequence** protocol (`__getitem__`, and `__len__`) - although `__len__` is not required for iteration


Given some object, how can we test to see if it is iterable or not?

The problem is that we would need to test for both `__iter__` (making sure it returns an iterator), and `__getitem__`. Far easier to do a try/except.

For example, just testing that `__iter__` is defined is not sufficient:

In [32]:
class SimpleIter:
    def __init__(self):
        pass
    
    def __iter__(self):
        return 'Nope'

In [33]:
s = SimpleIter()

In [34]:
'__iter__' in dir(s)

True

However, if we call `iter()` on an instance of `SimpleIter`, look at what happens:

In [36]:
iter(s)

TypeError: iter() returned non-iterator of type 'str'

So the best way, if you have some need to detect if something is iterable or not, is the following:

In [45]:
def is_iterable(obj):
    try:
        iter(obj)
        return True
    except TypeError:
        return False

In [50]:
is_iterable(SimpleIter())

False

In [51]:
is_iterable(Squares(5))

True
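
As a side note, here is a small sketch of why calling `iter()` is more reliable than an `isinstance` check against `collections.abc.Iterable`: that check only detects `__iter__`, so it misses objects that are iterable through `__getitem__` alone. The `GetItemOnly` class is a made-up example for illustration:

```python
from collections.abc import Iterable


class GetItemOnly:
    """Hypothetical sequence-like class with no __iter__ method."""

    def __getitem__(self, i):
        if i >= 3:
            raise IndexError
        return i * 10


obj = GetItemOnly()

abc_says = isinstance(obj, Iterable)   # only looks for __iter__

try:
    iter(obj)
    iter_says = True
except TypeError:
    iter_says = False

print(abc_says, iter_says, list(obj))
```

The `isinstance` check reports `False` even though `iter()` succeeds and the object can be turned into a list.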

That said, there is rarely a need to test whether something is iterable, only to then go ahead and iterate over it right after if it is. (We'll cover exception handling in Python in detail later in this course.)

Consider the following two alternatives:

In [52]:
obj = 100
if is_iterable(obj):
    for i in obj:
        print(i)
else:
    print('Error: obj is not iterable')

Error: obj is not iterable


vs

In [53]:
obj = 100
for i in obj:
    print(i)

TypeError: 'int' object is not iterable

As you can see, the error Python itself raises tells us the same thing, and provides even more information!

Instead of guarding for potential errors as we did in the first example, try doing the action you really want to do, and let Python raise the exception for you.

If you want to handle the exception, wrap your action inside a try/except.

So instead of writing it this way (*look before you leap*):

In [54]:
obj = 100
if is_iterable(obj):
    for i in obj:
        print(i)
else:
    print('Error: obj is not iterable')
    print('Taking some action as a consequence of this error')

Error: obj is not iterable
Taking some action as a consequence of this error


prefer writing it this way (*ask for forgiveness later*):

In [55]:
obj = 100
try:
    for i in obj:
        print(i)
except TypeError:
    print('Error: obj is not iterable')
    print('Taking some action as a consequence of this error')

Error: obj is not iterable
Taking some action as a consequence of this error


We'll cover this approach to exception handling in a lot more detail later, but it boils down to a simple idea:

*"It's easier to ask forgiveness than it is to get permission"*

(commonly attributed to Grace Hopper)

##  Iterating Callables

We can easily create iterators that are based on callables in general.

Let's look at an example:

##### Example 1

In this example we are going to create a counter function (using a closure) - it's a pretty simplistic function - `counter()` will return a closure that we can then call to increment an internal counter by `1` every time it is called:

In [2]:
def counter():
    i = 0
    
    def inc():
        nonlocal i
        i += 1
        return i
    return inc

This function allows us to create a simple counter, which we can use as follows:

In [3]:
cnt = counter()

In [4]:
cnt()

1

In [5]:
cnt()

2

Technically we can make an iterator to iterate over this counter:

In [6]:
class CounterIterator:
    def __init__(self, counter_callable):
        self.counter_callable = counter_callable
        
    def __iter__(self):
        return self
    
    def __next__(self):
        return self.counter_callable()

Do note that this is an **infinite** iterator!

In [7]:
cnt = counter()
cnt_iter = CounterIterator(cnt)
for _ in range(5):
    print(next(cnt_iter))

1
2
3
4
5


So basically we were able to create an **iterator** from some arbitrary callable.

But one issue is that we have an **infinite** iterator.

One way around this issue would be to specify a "stop" value that tells the iterator when to end the iteration.

Let's see how we would do this:

In [8]:
class CounterIterator:
    def __init__(self, counter_callable, sentinel):
        self.counter_callable = counter_callable
        self.sentinel = sentinel
        
    def __iter__(self):
        return self
    
    def __next__(self):
        result = self.counter_callable()
        if result == self.sentinel:
            raise StopIteration
        else:
            return result

Now we can provide a value that, if returned from the callable, will result in a `StopIteration` exception, essentially terminating the iteration:

In [9]:
cnt = counter()
cnt_iter = CounterIterator(cnt, 5)
for c in cnt_iter:
    print(c)

1
2
3
4


Now there is technically an issue here: `cnt_iter` is still "alive" - our iterator raised a `StopIteration` exception, but if we call `next()` on it again, it will happily resume from where it left off!

In [10]:
next(cnt_iter)

6

We really should make sure the iterator has been consumed, so let's fix that:

In [11]:
class CounterIterator:
    def __init__(self, counter_callable, sentinel):
        self.counter_callable = counter_callable
        self.sentinel = sentinel
        self.is_consumed = False
        
    def __iter__(self):
        return self
    
    def __next__(self):
        if self.is_consumed:
            raise StopIteration
        else:
            result = self.counter_callable()
            if result == self.sentinel:
                self.is_consumed = True
                raise StopIteration
            else:
                return result

Now it should behave as a normal iterator that cannot continue iterating once the first `StopIteration` exception has been raised:

In [12]:
cnt = counter()
cnt_iter = CounterIterator(cnt, 5)
for c in cnt_iter:
    print(c)

1
2
3
4


In [13]:
next(cnt_iter)

StopIteration: 

As we just saw, we can essentially make an iterator based on any callable, and our `CounterIterator` was actually quite generic: it only needed a callable and a sentinel value to work.

In fact, that's exactly what the second form of the `iter()` function allows us to do!

Let's see the help on `iter`:

In [16]:
help(iter)

Help on built-in function iter in module builtins:

iter(...)
    iter(iterable) -> iterator
    iter(callable, sentinel) -> iterator
    
    Get an iterator from an object.  In the first form, the argument must
    supply its own iterator, or be a sequence.
    In the second form, the callable is called until it returns the sentinel.



As we can see `iter` has a second form, that takes in a callable and a sentinel value.

And it will result in exactly what we have been doing, but without having to create the iterator class ourselves!

In [17]:
cnt = counter()
cnt_iter = iter(cnt, 5)
for c in cnt_iter:
    print(c)

1
2
3
4


In [15]:
next(cnt_iter)

StopIteration: 
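
A common use of this second form is reading a stream in fixed-size chunks. The sketch below assumes an in-memory `io.StringIO` object as a stand-in for a real file, and relies on the fact that `read()` returns an empty string once the end is reached:

```python
import io
from functools import partial

# Hypothetical in-memory "file"; a file object from open() behaves the same way
stream = io.StringIO('abcdefghij')

# read(4) returns '' at end of file, so '' is the sentinel
chunks = list(iter(partial(stream.read, 4), ''))
print(chunks)
```

This prints `['abcd', 'efgh', 'ij']`. `functools.partial` is used here simply to turn `stream.read(4)` into a callable that takes no arguments, which is what the second form of `iter()` expects.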

##### Example 2

Both of these approaches can be made to work with any callable.

For example, you may want to iterate through random numbers until a specific random number is generated:

In [18]:
import random

In [25]:
random.seed(0)
for i in range(10):
    print(i, random.randint(0, 10))

0 6
1 6
2 0
3 4
4 8
5 7
6 6
7 4
8 7
9 5


As you can see in this example (I set my seed to 0 to have repeatable results), the number `8` is reached at the `5`th iteration.

(I am just doing this to find an easy sentinel value so we can easily verify that our code is working properly)

In [28]:
random_iterator = iter(lambda : random.randint(0, 10), 8)

In [29]:
random.seed(0)

for num in random_iterator:
    print(num)

6
6
0
4


Neat!

##### Example 3

Let's try a countdown example like the one we discussed in the lecture.

We'll use a closure to get our countdown working:

In [1]:
def countdown(start=10):
    def run():
        nonlocal start
        start -= 1
        return start
    return run

In [6]:
takeoff = countdown(10)
for _ in range(15):
    print(takeoff())

9
8
7
6
5
4
3
2
1
0
-1
-2
-3
-4
-5


So the countdown function works, but we would like to be able to iterate over it and stop the iteration once we reach 0.

In [7]:
takeoff  = countdown(10)
takeoff_iter = iter(takeoff, -1)

In [8]:
for val in takeoff_iter:
    print(val)

9
8
7
6
5
4
3
2
1
0


##  Delegating Iterators

Often we write classes that use some existing iterable for the data contained in our class. By default, that class is not iterable, and we would need to implement an iterator for our class and implement the `__iter__` method in our class to return new instances of that iterator.

But, if our underlying data structure for our class is already an iterable, there's a much quicker way of doing it - delegation.

We'll start with a really simple example first:

In [3]:
from collections import namedtuple

Person = namedtuple('Person', 'first last')

In [4]:
class PersonNames:
    def __init__(self, persons):
        try:
            self._persons = [person.first.capitalize()
                             + ' ' + person.last.capitalize()
                            for person in persons]
        except (TypeError, AttributeError):
            self._persons = []

In [5]:
persons = [Person('michaeL', 'paLin'), Person('eric', 'idLe'), 
           Person('john', 'cLeese')]

In [13]:
person_names = PersonNames(persons)

Technically we can see the underlying data by accessing the (pseudo) private variable `_persons`.

In [14]:
person_names._persons

['Michael Palin', 'Eric Idle', 'John Cleese']

But we really would prefer making our `PersonNames` instances iterable.

To do so we need to implement the `__iter__` method that returns an iterator that can be used for iterating over the `_persons` list.

But lists are iterables, so they can provide an iterator, and that's precisely what we'll do - we'll **delegate** our own iterator, to the list's iterator:

In [8]:
class PersonNames:
    def __init__(self, persons):
        try:
            self._persons = [person.first.capitalize()
                             + ' ' + person.last.capitalize()
                            for person in persons]
        except (TypeError, AttributeError):
            self._persons = []
    
    def __iter__(self):
        return iter(self._persons)

And now, `PersonNames` is iterable!

In [15]:
persons = [Person('michaeL', 'paLin'), Person('eric', 'idLe'), 
           Person('john', 'cLeese')]
person_names = PersonNames(persons)

In [16]:
for p in person_names:
    print(p)

Michael Palin
Eric Idle
John Cleese


And of course we can sort, use list comprehensions, and so on - our `PersonNames` **is** an iterable.

Here we sort the names based on the full name, then split the names (on the space) and return a tuple of first name, last name:

In [20]:
[tuple(person_name.split()) for person_name in sorted(person_names)]

[('Eric', 'Idle'), ('John', 'Cleese'), ('Michael', 'Palin')]

Or, if we want to sort based on the last name:

In [21]:
sorted(person_names, key=lambda x: x.split()[1])

['John Cleese', 'Eric Idle', 'Michael Palin']
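
One consequence of delegating to `iter()` on the underlying list is that every request for an iterator produces a new, independent one. Here is a small sketch of that behavior, using a made-up `Playlist` class rather than `PersonNames`:

```python
class Playlist:
    """Hypothetical container that stores its data in a list."""

    def __init__(self, tracks):
        self._tracks = list(tracks)

    def __iter__(self):
        # Delegate: every call hands back a brand-new list iterator
        return iter(self._tracks)


playlist = Playlist(['intro', 'verse', 'outro'])

it1 = iter(playlist)
it2 = iter(playlist)
first = next(it1)         # advances it1 only
also_first = next(it2)    # it2 starts from the beginning

print(first, also_first, list(playlist))
```

Both iterators start at `'intro'`, and the object can still be iterated in full afterwards, which is exactly what we expect from an iterable.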

##  Reversed Iteration

Sometimes we may want to iterate through an iterable but in **reverse** order.

Of course, this means the collection being iterated must be finite.

Python has a built-in function called `reversed()` to do this that will work with any type that implements the sequence protocol. But for iterables in general it's a little more complicated.

Let's first build a custom iterable.

For this example we are going to build a custom iterable that returns cards from a 52-card deck.

The deck will be in order of suits (Spades, Hearts, Diamonds and Clubs) and card values (from 2 (lowest) to Ace (highest)).

We are going to use lazy loading - i.e. we are not going to pre-build our card deck.

We just need to recognize that each suit contains `13` cards, so an integer division of the card's index in the deck by `13` will tell us which suit it is. But of course we start indexing at 0.

**Example**

If the requested card is the `6`th in the deck (i.e. index = `5`):

`5 // 13 = 0` ==> first suit (Spades)

If the requested card is the `13`th in the deck (i.e. index = `12`):

`12 // 13 = 0` ==> first suit (Spades)

If the requested card is the `14`th in the deck (i.e. index = `13`):

`13 // 13 = 1` ==> second suit (Hearts)

To determine which card in the suit we are interested in, we simply need to use the `%` operator, again recognizing that there are `13` cards in each suit:

**Example**

If the requested card is the `6`th in the deck (i.e. index = `5`):

`5 % 13 = 5` ==> `6`th card in the suit (index `5`)

If the requested card is the `13`th in the deck (i.e. index = `12`):

`12 % 13 = 12` ==> `13`th card in the suit (index `12`)

If the requested card is the `14`th in the deck (i.e. index = `13`):

`13 % 13 = 0` ==> `1`st card in the suit (index `0`)
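
The two calculations above can be sketched in one step with the built-in `divmod()`, which returns the quotient and the remainder together. The function name `card_at` and the constant names are just for illustration:

```python
SUITS = ('Spades', 'Hearts', 'Diamonds', 'Clubs')
RANKS = tuple(range(2, 11)) + tuple('JQKA')


def card_at(index):
    """Map a 0-based deck index to a (rank, suit) pair."""
    suit_index, rank_index = divmod(index, len(RANKS))
    return RANKS[rank_index], SUITS[suit_index]


for index in (5, 12, 13):
    print(index, card_at(index))
```

For the three indices used in the examples this gives `(7, 'Spades')`, `('A', 'Spades')` and `(2, 'Hearts')`.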

In [12]:
_SUITS = ('Spades', 'Hearts', 'Diamonds', 'Clubs')
_RANKS = tuple(range(2, 11) ) + tuple('JQKA')
from collections import namedtuple

Card = namedtuple('Card', 'rank suit')

class CardDeck:
    def __init__(self):
        self.length = len(_SUITS) * len(_RANKS)

    def __len__(self):
        return self.length
    
    def __iter__(self):
        return self.CardDeckIterator(self.length)
        
    class CardDeckIterator:
        def __init__(self, length):
            self.length = length
            self.i = 0
            
        def __iter__(self):
            return self
        
        def __next__(self):
            if self.i >= self.length:
                raise StopIteration
            else:
                suit = _SUITS[self.i // len(_RANKS)]
                rank = _RANKS[self.i % len(_RANKS)]
                self.i += 1
                return Card(rank, suit)

We can now iterate over a deck of cards as follows:

In [3]:
deck = CardDeck()

In [4]:
for card in deck:
    print(card)

Card(rank=2, suit='Spades')
Card(rank=3, suit='Spades')
Card(rank=4, suit='Spades')
Card(rank=5, suit='Spades')
Card(rank=6, suit='Spades')
Card(rank=7, suit='Spades')
Card(rank=8, suit='Spades')
Card(rank=9, suit='Spades')
Card(rank=10, suit='Spades')
Card(rank='J', suit='Spades')
Card(rank='Q', suit='Spades')
Card(rank='K', suit='Spades')
Card(rank='A', suit='Spades')
Card(rank=2, suit='Hearts')
Card(rank=3, suit='Hearts')
Card(rank=4, suit='Hearts')
Card(rank=5, suit='Hearts')
Card(rank=6, suit='Hearts')
Card(rank=7, suit='Hearts')
Card(rank=8, suit='Hearts')
Card(rank=9, suit='Hearts')
Card(rank=10, suit='Hearts')
Card(rank='J', suit='Hearts')
Card(rank='Q', suit='Hearts')
Card(rank='K', suit='Hearts')
Card(rank='A', suit='Hearts')
Card(rank=2, suit='Diamonds')
Card(rank=3, suit='Diamonds')
Card(rank=4, suit='Diamonds')
Card(rank=5, suit='Diamonds')
Card(rank=6, suit='Diamonds')
Card(rank=7, suit='Diamonds')
Card(rank=8, suit='Diamonds')
Card(rank=9, suit='Diamonds')
Card(rank=10, 

Now that we have our deck, how would we obtain the last `7` cards in reverse order from the deck?

One option is to generate a list of all the cards in the deck, then use a slice.

What about iterating in reverse? Using the same technique we generate a list that contains all the cards, reverse the list, and then iterate over the reversed list.

In [5]:
deck = list(CardDeck())

In [8]:
deck[:-8:-1]

[Card(rank='A', suit='Clubs'),
 Card(rank='K', suit='Clubs'),
 Card(rank='Q', suit='Clubs'),
 Card(rank='J', suit='Clubs'),
 Card(rank=10, suit='Clubs'),
 Card(rank=9, suit='Clubs'),
 Card(rank=8, suit='Clubs')]

And to iterate backwards:

In [7]:
deck = list(CardDeck())
deck = deck[::-1]
for card in deck:
    print(card)

Card(rank='A', suit='Clubs')
Card(rank='K', suit='Clubs')
Card(rank='Q', suit='Clubs')
Card(rank='J', suit='Clubs')
Card(rank=10, suit='Clubs')
Card(rank=9, suit='Clubs')
Card(rank=8, suit='Clubs')
Card(rank=7, suit='Clubs')
Card(rank=6, suit='Clubs')
Card(rank=5, suit='Clubs')
Card(rank=4, suit='Clubs')
Card(rank=3, suit='Clubs')
Card(rank=2, suit='Clubs')
Card(rank='A', suit='Diamonds')
Card(rank='K', suit='Diamonds')
Card(rank='Q', suit='Diamonds')
Card(rank='J', suit='Diamonds')
Card(rank=10, suit='Diamonds')
Card(rank=9, suit='Diamonds')
Card(rank=8, suit='Diamonds')
Card(rank=7, suit='Diamonds')
Card(rank=6, suit='Diamonds')
Card(rank=5, suit='Diamonds')
Card(rank=4, suit='Diamonds')
Card(rank=3, suit='Diamonds')
Card(rank=2, suit='Diamonds')
Card(rank='A', suit='Hearts')
Card(rank='K', suit='Hearts')
Card(rank='Q', suit='Hearts')
Card(rank='J', suit='Hearts')
Card(rank=10, suit='Hearts')
Card(rank=9, suit='Hearts')
Card(rank=8, suit='Hearts')
Card(rank=7, suit='Hearts')
Card(ran

This is kind of inefficient since we had to generate the entire list of cards, then reverse it, only to pick the first 7 cards from that reversed list.

Maybe we can try Python's built-in `reversed` function instead:

In [8]:
deck = CardDeck()

In [9]:
deck = reversed(deck)

TypeError: 'CardDeck' object is not reversible

As we can see, Python's `reversed` function will not work with our iterable. (It would work automatically with a sequence type, but in this case we don't have a sequence type.)

What to do?

We need to somehow define a "reverse" iteration option for our iterable!

We do so by defining the `__reversed__` special method in our iterable and instructing our iterator to return elements in reverse order.

If the `__reversed__` method is in our iterable, Python will use that to get the iterator when we call the `reversed()` function.

Let's try that out:

In [10]:
_SUITS = ('Spades', 'Hearts', 'Diamonds', 'Clubs')
_RANKS = tuple(range(2, 11) ) + ('J', 'Q', 'K', 'A')
from collections import namedtuple

Card = namedtuple('Card', 'rank suit')

class CardDeck:
    def __init__(self):
        self.length = len(_SUITS) * len(_RANKS)

    def __len__(self):
        return self.length
    
    def __iter__(self):
        return self.CardDeckIterator(self.length)
        
    def __reversed__(self):
        return self.CardDeckIterator(self.length, reverse=True)
    
    class CardDeckIterator:
        def __init__(self, length, *, reverse=False):
            self.length = length
            self.reverse = reverse
            self.i = 0
            
        def __iter__(self):
            return self
        
        def __next__(self):
            if self.i >= self.length:
                raise StopIteration
            else:
                if self.reverse:
                    index = self.length - 1 - self.i
                else:
                    index = self.i
                suit = _SUITS[index // len(_RANKS)]
                rank = _RANKS[index % len(_RANKS)]
                self.i += 1
                return Card(rank, suit)
            


In [11]:
deck = CardDeck()

In [12]:
for card in deck:
    print(card)

Card(rank=2, suit='Spades')
Card(rank=3, suit='Spades')
Card(rank=4, suit='Spades')
Card(rank=5, suit='Spades')
Card(rank=6, suit='Spades')
Card(rank=7, suit='Spades')
Card(rank=8, suit='Spades')
Card(rank=9, suit='Spades')
Card(rank=10, suit='Spades')
Card(rank='J', suit='Spades')
Card(rank='Q', suit='Spades')
Card(rank='K', suit='Spades')
Card(rank='A', suit='Spades')
Card(rank=2, suit='Hearts')
Card(rank=3, suit='Hearts')
Card(rank=4, suit='Hearts')
Card(rank=5, suit='Hearts')
Card(rank=6, suit='Hearts')
Card(rank=7, suit='Hearts')
Card(rank=8, suit='Hearts')
Card(rank=9, suit='Hearts')
Card(rank=10, suit='Hearts')
Card(rank='J', suit='Hearts')
Card(rank='Q', suit='Hearts')
Card(rank='K', suit='Hearts')
Card(rank='A', suit='Hearts')
Card(rank=2, suit='Diamonds')
Card(rank=3, suit='Diamonds')
Card(rank=4, suit='Diamonds')
Card(rank=5, suit='Diamonds')
Card(rank=6, suit='Diamonds')
Card(rank=7, suit='Diamonds')
Card(rank=8, suit='Diamonds')
Card(rank=9, suit='Diamonds')
Card(rank=10, 

In [13]:
deck = reversed(CardDeck())
for card in deck:
    print(card)

Card(rank='A', suit='Clubs')
Card(rank='K', suit='Clubs')
Card(rank='Q', suit='Clubs')
Card(rank='J', suit='Clubs')
Card(rank=10, suit='Clubs')
Card(rank=9, suit='Clubs')
Card(rank=8, suit='Clubs')
Card(rank=7, suit='Clubs')
Card(rank=6, suit='Clubs')
Card(rank=5, suit='Clubs')
Card(rank=4, suit='Clubs')
Card(rank=3, suit='Clubs')
Card(rank=2, suit='Clubs')
Card(rank='A', suit='Diamonds')
Card(rank='K', suit='Diamonds')
Card(rank='Q', suit='Diamonds')
Card(rank='J', suit='Diamonds')
Card(rank=10, suit='Diamonds')
Card(rank=9, suit='Diamonds')
Card(rank=8, suit='Diamonds')
Card(rank=7, suit='Diamonds')
Card(rank=6, suit='Diamonds')
Card(rank=5, suit='Diamonds')
Card(rank=4, suit='Diamonds')
Card(rank=3, suit='Diamonds')
Card(rank=2, suit='Diamonds')
Card(rank='A', suit='Hearts')
Card(rank='K', suit='Hearts')
Card(rank='Q', suit='Hearts')
Card(rank='J', suit='Hearts')
Card(rank=10, suit='Hearts')
Card(rank=9, suit='Hearts')
Card(rank=8, suit='Hearts')
Card(rank=7, suit='Hearts')
Card(ran
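
With reverse iteration in place, getting just the last `7` cards no longer requires building the whole deck. The sketch below uses a compact stand-in class, `LazyDeck`, written with generator expressions instead of a separate iterator class (an assumption made to keep the example short, not the implementation above), together with `itertools.islice`:

```python
from collections import namedtuple
from itertools import islice

Card = namedtuple('Card', 'rank suit')
SUITS = ('Spades', 'Hearts', 'Diamonds', 'Clubs')
RANKS = tuple(range(2, 11)) + tuple('JQKA')


class LazyDeck:
    """Compact stand-in for CardDeck; generators replace the iterator class."""

    def __len__(self):
        return len(SUITS) * len(RANKS)

    @staticmethod
    def _card(index):
        suit_index, rank_index = divmod(index, len(RANKS))
        return Card(RANKS[rank_index], SUITS[suit_index])

    def __iter__(self):
        return (self._card(i) for i in range(len(self)))

    def __reversed__(self):
        return (self._card(i) for i in reversed(range(len(self))))


# Only seven cards are ever constructed here
last_seven = list(islice(reversed(LazyDeck()), 7))
for card in last_seven:
    print(card)
```

This yields the Ace down to the 8 of Clubs, the same cards we obtained earlier by slicing a fully built list.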

#### Reversing Sequences

I just want to point out that if we have a custom **sequence** type we don't need to worry about this.

Let's see a quick example:

In [14]:
class Squares:
    def __init__(self, length):
        self.squares = [i **2 for i in range(length)]
        
    def __len__(self):
        return len(self.squares)
    
    def __getitem__(self, s):
        return self.squares[s]

In [15]:
sq = Squares(10)

In [16]:
for num in Squares(5):
    print(num)

0
1
4
9
16


In [17]:
for num in reversed(Squares(5)):
    print(num)

16
9
4
1
0


As you can see Python was able to automatically reverse the sequence for us.

Also worth noting is that the `__len__` method **must** be implemented for `reversed()` to work:

In [18]:
class Squares:
    def __init__(self, length):
        self.squares = [i **2 for i in range(length)]
        
#     def __len__(self):
#         return len(self.squares)
    
    def __getitem__(self, s):
        return self.squares[s]

In [19]:
for num in reversed(Squares(5)):
    print(num)

TypeError: object of type 'Squares' has no len()
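
To see why the length is needed, here is a rough sketch of what the sequence fallback has to do. This is an approximation written as a generator function, not Python's actual implementation, and the `Cubes` class is made up for the example:

```python
def reversed_via_sequence_protocol(seq):
    """Roughly what reversed() falls back to when there is no __reversed__."""
    # len() is needed to know where to start, hence the __len__ requirement
    for i in range(len(seq) - 1, -1, -1):
        yield seq[i]


class Cubes:
    """Hypothetical sequence type implementing only __len__ and __getitem__."""

    def __init__(self, n):
        self._n = n

    def __len__(self):
        return self._n

    def __getitem__(self, i):
        if not 0 <= i < self._n:
            raise IndexError(i)
        return i ** 3


manual = list(reversed_via_sequence_protocol(Cubes(4)))
builtin = list(reversed(Cubes(4)))
print(manual, builtin)
```

Both approaches produce `[27, 8, 1, 0]`. Without `__len__` there is no way to know which index to start from.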

In addition, we can override what is returned when the `reversed()` function is called on our custom sequence type. Here, I'll return the list of the integers themselves instead of squares just to make this really stand out:

In [9]:
class Squares:
    def __init__(self, length):
        self.length = length
        self.squares = [i **2 for i in range(length)]
        
    def __len__(self):
        return len(self.squares)
    
    def __getitem__(self, s):
        return self.squares[s]
    
    def __reversed__(self):
        print('__reversed__ called')
        return [i for i in range(self.length-1, -1, -1)]

In [10]:
for num in Squares(5):
    print(num)

0
1
4
9
16


In [11]:
for num in reversed(Squares(5)):
    print(num)

__reversed__ called
4
3
2
1
0
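
The example above returns a list, which is enough for a `for` loop. The Python data model describes `__reversed__` as returning a new iterator object, so a version closer to that convention might look like the following sketch, which simply delegates to the underlying list:

```python
class Squares:
    def __init__(self, length):
        self.squares = [i ** 2 for i in range(length)]

    def __len__(self):
        return len(self.squares)

    def __getitem__(self, s):
        return self.squares[s]

    def __reversed__(self):
        # delegate to the list: reversed() on a list returns an iterator
        return reversed(self.squares)


rev = reversed(Squares(5))
print(next(rev))   # an iterator supports next(); a plain list would not
print(list(rev))   # the remaining items, still in reverse order
```

Returning an iterator keeps `reversed()` lazy and lets callers use `next()` on the result, which would fail if a list were returned.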


# Section 05 - Project 2

##  Project: Description

The starting point for this project is the `Polygon` class and the `Polygons` sequence type we created in the previous project.

The code for these classes along with the unit tests for the `Polygon` class are below if you want to use those as your starting point. But use whatever you came up with in the last project.

We have two goals:

##### Goal 1

Refactor the `Polygon` class so that all the calculated properties are lazy properties, i.e. they should still be calculated properties, but they should not be calculated more than once (since we made our `Polygon` class "immutable").

##### Goal 2

Refactor the `Polygons` (sequence) type into an **iterable**. Also make sure that the elements produced by the iterator are computed lazily - i.e. you can no longer use a list as the underlying storage mechanism for your polygons.

You'll need to implement both an iterable, and an iterator.

##### Code from Previous Project

In [1]:
import math

class Polygon:
    def __init__(self, n, R):
        if n < 3:
            raise ValueError('Polygon must have at least 3 vertices.')
        self._n = n
        self._R = R
        
    def __repr__(self):
        return f'Polygon(n={self._n}, R={self._R})'
    
    @property
    def count_vertices(self):
        return self._n
    
    @property
    def count_edges(self):
        return self._n
    
    @property
    def circumradius(self):
        return self._R
    
    @property
    def interior_angle(self):
        return (self._n - 2) * 180 / self._n

    @property
    def side_length(self):
        return 2 * self._R * math.sin(math.pi / self._n)
    
    @property
    def apothem(self):
        return self._R * math.cos(math.pi / self._n)
    
    @property
    def area(self):
        return self._n / 2 * self.side_length * self.apothem
    
    @property
    def perimeter(self):
        return self._n * self.side_length
    
    def __eq__(self, other):
        if isinstance(other, self.__class__):
            return (self.count_edges == other.count_edges 
                    and self.circumradius == other.circumradius)
        else:
            return NotImplemented
        
    def __gt__(self, other):
        if isinstance(other, self.__class__):
            return self.count_vertices > other.count_vertices
        else:
            return NotImplemented

In [2]:
def test_polygon():
    abs_tol = 0.001
    rel_tol = 0.001
    
    try:
        p = Polygon(2, 10)
        assert False, ('Creating a Polygon with 2 sides: '
                       ' Exception expected, not received')
    except ValueError:
        pass
                       
    n = 3
    R = 1
    p = Polygon(n, R)
    assert str(p) == 'Polygon(n=3, R=1)', f'actual: {str(p)}'
    assert p.count_vertices == n, (f'actual: {p.count_vertices},'
                                   f' expected: {n}')
    assert p.count_edges == n, f'actual: {p.count_edges}, expected: {n}'
    assert p.circumradius == R, f'actual: {p.circumradius}, expected: {R}'
    assert p.interior_angle == 60, (f'actual: {p.interior_angle},'
                                    ' expected: 60')
    n = 4
    R = 1
    p = Polygon(n, R)
    assert p.interior_angle == 90, (f'actual: {p.interior_angle}, '
                                    ' expected: 90')
    assert math.isclose(p.area, 2, 
                        rel_tol=rel_tol,
                        abs_tol=abs_tol), (f'actual: {p.area},'
                                           ' expected: 2.0')
    
    assert math.isclose(p.side_length, math.sqrt(2),
                       rel_tol=rel_tol,
                       abs_tol=abs_tol), (f'actual: {p.side_length},'
                                          f' expected: {math.sqrt(2)}')
    
    assert math.isclose(p.perimeter, 4 * math.sqrt(2),
                       rel_tol=rel_tol,
                       abs_tol=abs_tol), (f'actual: {p.perimeter},'
                                          f' expected: {4 * math.sqrt(2)}')
    
    assert math.isclose(p.apothem, 0.707,
                       rel_tol=rel_tol,
                       abs_tol=abs_tol), (f'actual: {p.apothem},'
                                          ' expected: 0.707')
    p = Polygon(6, 2)
    assert math.isclose(p.side_length, 2,
                        rel_tol=rel_tol, abs_tol=abs_tol)
    assert math.isclose(p.apothem, 1.73205,
                        rel_tol=rel_tol, abs_tol=abs_tol)
    assert math.isclose(p.area, 10.3923,
                        rel_tol=rel_tol, abs_tol=abs_tol)
    assert math.isclose(p.perimeter, 12,
                        rel_tol=rel_tol, abs_tol=abs_tol)
    assert math.isclose(p.interior_angle, 120,
                        rel_tol=rel_tol, abs_tol=abs_tol)
    
    p = Polygon(12, 3)
    assert math.isclose(p.side_length, 1.55291,
                        rel_tol=rel_tol, abs_tol=abs_tol)
    assert math.isclose(p.apothem, 2.89778,
                        rel_tol=rel_tol, abs_tol=abs_tol)
    assert math.isclose(p.area, 27,
                        rel_tol=rel_tol, abs_tol=abs_tol)
    assert math.isclose(p.perimeter, 18.635,
                        rel_tol=rel_tol, abs_tol=abs_tol)
    assert math.isclose(p.interior_angle, 150,
                        rel_tol=rel_tol, abs_tol=abs_tol)
    
    p1 = Polygon(3, 10)
    p2 = Polygon(10, 10)
    p3 = Polygon(15, 10)
    p4 = Polygon(15, 100)
    p5 = Polygon(15, 100)
    
    assert p2 > p1
    assert p2 < p3
    assert p3 != p4
    assert p1 != p4
    assert p4 == p5

In [1]:
class Polygons:
    def __init__(self, m, R):
        if m < 3:
            raise ValueError('m must be at least 3')
        self._m = m
        self._R = R
        self._polygons = [Polygon(i, R) for i in range(3, m+1)]
        
    def __len__(self):
        return self._m - 2
    
    def __repr__(self):
        return f'Polygons(m={self._m}, R={self._R})'
    
    def __getitem__(self, s):
        return self._polygons[s]
    
    @property
    def max_efficiency_polygon(self):
        sorted_polygons = sorted(self._polygons, 
                                 key=lambda p: p.area/p.perimeter,
                                reverse=True)
        return sorted_polygons[0]

##  Project Solution: Goal 1

Our starting point is where we left off at the end of Project 1.

Our first goal is to rewrite all properties as lazy properties.

Here is the code we ended up with in Project 1:

In [1]:
import math

class Polygon:
    def __init__(self, n, R):
        if n < 3:
            raise ValueError('Polygon must have at least 3 vertices.')
        self._n = n
        self._R = R
        
    def __repr__(self):
        return f'Polygon(n={self._n}, R={self._R})'
    
    @property
    def count_vertices(self):
        return self._n
    
    @property
    def count_edges(self):
        return self._n
    
    @property
    def circumradius(self):
        return self._R
    
    @property
    def interior_angle(self):
        return (self._n - 2) * 180 / self._n

    @property
    def side_length(self):
        return 2 * self._R * math.sin(math.pi / self._n)
    
    @property
    def apothem(self):
        return self._R * math.cos(math.pi / self._n)
    
    @property
    def area(self):
        return self._n / 2 * self.side_length * self.apothem
    
    @property
    def perimeter(self):
        return self._n * self.side_length
    
    def __eq__(self, other):
        if isinstance(other, self.__class__):
            return (self.count_edges == other.count_edges 
                    and self.circumradius == other.circumradius)
        else:
            return NotImplemented
        
    def __gt__(self, other):
        if isinstance(other, self.__class__):
            return self.count_vertices > other.count_vertices
        else:
            return NotImplemented
            

Let's rewrite computed properties as lazy properties.

To do that we need to do two things:

* create a private backing variable for the property
* compute the property if the backing variable is None and store the result into the backing variable

In [2]:
class Polygon:
    def __init__(self, n, R):
        if n < 3:
            raise ValueError('Polygon must have at least 3 vertices.')
        self._n = n
        self._R = R
        
        self._interior_angle = None
        self._side_length = None
        self._apothem = None
        self._area = None
        self._perimeter = None
        
    def __repr__(self):
        return f'Polygon(n={self._n}, R={self._R})'
    
    @property
    def count_vertices(self):
        return self._n
    
    @property
    def count_edges(self):
        return self._n
    
    @property
    def circumradius(self):
        return self._R
    
    @property
    def interior_angle(self):
        if self._interior_angle is None:
            self._interior_angle = (self._n - 2) * 180 / self._n
        return self._interior_angle

    @property
    def side_length(self):
        if self._side_length is None:
            self._side_length = 2 * self._R * math.sin(math.pi / self._n)
        return self._side_length
    
    @property
    def apothem(self):
        if self._apothem is None:
            self._apothem = self._R * math.cos(math.pi / self._n)
        return self._apothem
    
    @property
    def area(self):
        if self._area is None:
            self._area = self._n / 2 * self.side_length * self.apothem
        return self._area
    
    @property
    def perimeter(self):
        if self._perimeter is None:
            self._perimeter = self._n * self.side_length
        return self._perimeter
    
    def __eq__(self, other):
        if isinstance(other, self.__class__):
            return (self.count_edges == other.count_edges 
                    and self.circumradius == other.circumradius)
        else:
            return NotImplemented
        
    def __gt__(self, other):
        if isinstance(other, self.__class__):
            return self.count_vertices > other.count_vertices
        else:
            return NotImplemented

And let's run the same unit test we wrote for Project 1:

In [3]:
def test_polygon():
    abs_tol = 0.001
    rel_tol = 0.001
    
    try:
        p = Polygon(2, 10)
        assert False, ('Creating a Polygon with 2 sides: '
                       ' Exception expected, not received')
    except ValueError:
        pass
                       
    n = 3
    R = 1
    p = Polygon(n, R)
    assert str(p) == 'Polygon(n=3, R=1)', f'actual: {str(p)}'
    assert p.count_vertices == n, (f'actual: {p.count_vertices},'
                                   f' expected: {n}')
    assert p.count_edges == n, f'actual: {p.count_edges}, expected: {n}'
    assert p.circumradius == R, f'actual: {p.circumradius}, expected: {R}'
    assert p.interior_angle == 60, (f'actual: {p.interior_angle},'
                                    ' expected: 60')
    n = 4
    R = 1
    p = Polygon(n, R)
    assert p.interior_angle == 90, (f'actual: {p.interior_angle}, '
                                    ' expected: 90')
    assert math.isclose(p.area, 2, 
                        rel_tol=rel_tol,
                        abs_tol=abs_tol), (f'actual: {p.area},'
                                           ' expected: 2.0')
    
    assert math.isclose(p.side_length, math.sqrt(2),
                       rel_tol=rel_tol,
                       abs_tol=abs_tol), (f'actual: {p.side_length},'
                                          f' expected: {math.sqrt(2)}')
    
    assert math.isclose(p.perimeter, 4 * math.sqrt(2),
                       rel_tol=rel_tol,
                       abs_tol=abs_tol), (f'actual: {p.perimeter},'
                                          f' expected: {4 * math.sqrt(2)}')
    
    assert math.isclose(p.apothem, 0.707,
                       rel_tol=rel_tol,
                       abs_tol=abs_tol), (f'actual: {p.apothem},'
                                          ' expected: 0.707')
    p = Polygon(6, 2)
    assert math.isclose(p.side_length, 2,
                        rel_tol=rel_tol, abs_tol=abs_tol)
    assert math.isclose(p.apothem, 1.73205,
                        rel_tol=rel_tol, abs_tol=abs_tol)
    assert math.isclose(p.area, 10.3923,
                        rel_tol=rel_tol, abs_tol=abs_tol)
    assert math.isclose(p.perimeter, 12,
                        rel_tol=rel_tol, abs_tol=abs_tol)
    assert math.isclose(p.interior_angle, 120,
                        rel_tol=rel_tol, abs_tol=abs_tol)
    
    p = Polygon(12, 3)
    assert math.isclose(p.side_length, 1.55291,
                        rel_tol=rel_tol, abs_tol=abs_tol)
    assert math.isclose(p.apothem, 2.89778,
                        rel_tol=rel_tol, abs_tol=abs_tol)
    assert math.isclose(p.area, 27,
                        rel_tol=rel_tol, abs_tol=abs_tol)
    assert math.isclose(p.perimeter, 18.635,
                        rel_tol=rel_tol, abs_tol=abs_tol)
    assert math.isclose(p.interior_angle, 150,
                        rel_tol=rel_tol, abs_tol=abs_tol)
    
    p1 = Polygon(3, 10)
    p2 = Polygon(10, 10)
    p3 = Polygon(15, 10)
    p4 = Polygon(15, 100)
    p5 = Polygon(15, 100)
    
    assert p2 > p1
    assert p2 < p3
    assert p3 != p4
    assert p1 != p4
    assert p4 == p5

In [4]:
test_polygon()

OK, looks like this goal is complete.
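
As an aside, the standard library offers `functools.cached_property` (Python 3.8 and later), which packages the same compute-once idea. The sketch below is an alternative to the backing-variable approach used in this solution, not the approach the project asks for. The `LazyPolygon` class and its `calls` counter are hypothetical names, included only to show that the computation runs a single time.

```python
import math
from functools import cached_property


class LazyPolygon:
    def __init__(self, n, R):
        self._n = n
        self._R = R
        self.calls = 0  # counts how often side_length is actually computed

    @cached_property
    def side_length(self):
        self.calls += 1
        return 2 * self._R * math.sin(math.pi / self._n)


p = LazyPolygon(6, 2)
first = p.side_length   # computed here
second = p.side_length  # served from the cached value
print(first == second, p.calls)
```

Note that `cached_property` stores the result in the instance dictionary, so it assumes instances have a `__dict__` and that the inputs (`_n`, `_R`) do not change afterwards.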

##  Project Solution: Goal 2

For this goal we need to rewrite our sequence type `Polygons` into something that implements the iterable protocol.

Furthermore, all iterators should be lazy.

We'll need our `Polygon` class:

In [1]:
import math

class Polygon:
    def __init__(self, n, R):
        if n < 3:
            raise ValueError('Polygon must have at least 3 vertices.')
        self._n = n
        self._R = R
        
        self._interior_angle = None
        self._side_length = None
        self._apothem = None
        self._area = None
        self._perimeter = None
        
    def __repr__(self):
        return f'Polygon(n={self._n}, R={self._R})'
    
    @property
    def count_vertices(self):
        return self._n
    
    @property
    def count_edges(self):
        return self._n
    
    @property
    def circumradius(self):
        return self._R
    
    @property
    def interior_angle(self):
        if self._interior_angle is None:
            self._interior_angle = (self._n - 2) * 180 / self._n
        return self._interior_angle

    @property
    def side_length(self):
        if self._side_length is None:
            self._side_length = 2 * self._R * math.sin(math.pi / self._n)
        return self._side_length
    
    @property
    def apothem(self):
        if self._apothem is None:
            self._apothem = self._R * math.cos(math.pi / self._n)
        return self._apothem
    
    @property
    def area(self):
        if self._area is None:
            self._area = self._n / 2 * self.side_length * self.apothem
        return self._area
    
    @property
    def perimeter(self):
        if self._perimeter is None:
            self._perimeter = self._n * self.side_length
        return self._perimeter
    
    def __eq__(self, other):
        if isinstance(other, self.__class__):
            return (self.count_edges == other.count_edges 
                    and self.circumradius == other.circumradius)
        else:
            return NotImplemented
        
    def __gt__(self, other):
        if isinstance(other, self.__class__):
            return self.count_vertices > other.count_vertices
        else:
            return NotImplemented

And here's our original implementation of `Polygons` as a sequence type:

In [2]:
class Polygons:
    def __init__(self, m, R):
        if m < 3:
            raise ValueError('m must be at least 3')
        self._m = m
        self._R = R
        self._polygons = [Polygon(i, R) for i in range(3, m+1)]
        
    def __len__(self):
        return self._m - 2
    
    def __repr__(self):
        return f'Polygons(m={self._m}, R={self._R})'
    
    def __getitem__(self, s):
        return self._polygons[s]
    
    @property
    def max_efficiency_polygon(self):
        sorted_polygons = sorted(self._polygons, 
                                 key=lambda p: p.area/p.perimeter,
                                reverse=True)
        return sorted_polygons[0]

We now need to implement the iterable protocol - which means we'll need to implement an iterator first.

In [3]:
class PolygonsIterator:
    def __init__(self, m, R):
        if m < 3:
            raise ValueError('m must be at least 3')
        self._m = m
        self._R = R
        
    def __iter__(self):
        return self
    
    def __next__(self):
        pass

This is the basic skeleton we need to implement the iterator protocol.

So now we need to implement the `__next__` method.

This method should simply return the `next` instance of a Polygon - we start with polygons with `3` sides, and work our way up.

To do this, we'll use a private variable `_i` to track the number of sides of the next Polygon to hand out. So, we'll start `_i` at `3` and work our way up.

In [4]:
class PolygonsIterator:
    def __init__(self, m, R):
        if m < 3:
            raise ValueError('m must be at least 3')
        self._m = m
        self._R = R
        self._i = 3
        
    def __iter__(self):
        return self
    
    def __next__(self):
        if self._i > self._m:
            raise StopIteration
        else:
            result = Polygon(self._i, self._R)
            self._i += 1
            return result

Let's make sure the iterator works as expected:

In [5]:
p_iter = PolygonsIterator(5, 1)
for p in p_iter:
    print(p)

Polygon(n=3, R=1)
Polygon(n=4, R=1)
Polygon(n=5, R=1)


Of course, this is an iterator, so it should be exhausted now:

In [6]:
list(p_iter)

[]

Looks good, so next we have to replace the sequence protocol with the iterable protocol in our `Polygons` class.

In [7]:
class Polygons:
    def __init__(self, m, R):
        if m < 3:
            raise ValueError('m must be at least 3')
        self._m = m
        self._R = R
        
    def __len__(self):
        return self._m - 2
    
    def __repr__(self):
        return f'Polygons(m={self._m}, R={self._R})'
    
    def __iter__(self):
        return PolygonsIterator(self._m, self._R)
    
    @property
    def max_efficiency_polygon(self):
        sorted_polygons = sorted(self._polygons, 
                                 key=lambda p: p.area/p.perimeter,
                                reverse=True)
        return sorted_polygons[0]

And now we should have an iterable:

In [8]:
polygons = Polygons(5, 1)

In [9]:
for p in polygons:
    print(p)

Polygon(n=3, R=1)
Polygon(n=4, R=1)
Polygon(n=5, R=1)


In [10]:
for p in polygons:
    print(p)

Polygon(n=3, R=1)
Polygon(n=4, R=1)
Polygon(n=5, R=1)


Finally, we also need to make our `max_efficiency_polygon` a lazy property:

In [11]:
class Polygons:
    def __init__(self, m, R):
        if m < 3:
            raise ValueError('m must be at least 3')
        self._m = m
        self._R = R
        self._max_efficiency_polygon = None
        
    def __len__(self):
        return self._m - 2
    
    def __repr__(self):
        return f'Polygons(m={self._m}, R={self._R})'
    
    def __iter__(self):
        return PolygonsIterator(self._m, self._R)
    
    @property
    def max_efficiency_polygon(self):
        if self._max_efficiency_polygon is None:
            sorted_polygons = sorted(self._polygons, 
                                     key=lambda p: p.area/p.perimeter,
                                    reverse=True)
            self._max_efficiency_polygon = sorted_polygons[0]
        return self._max_efficiency_polygon

Let's test that to make sure it still calculates correctly (it should always return the largest Polygon in the iterable, in terms of edges/vertices).

In [12]:
polygons = Polygons(10, 1)
print(polygons.max_efficiency_polygon)

AttributeError: 'Polygons' object has no attribute '_polygons'

As you can see, we have a slight problem. We also need to change the iterable passed to the `sorted` function, since we no longer have a list of Polygons.

But that's easily fixed since `sorted` can work with iterables and iterators in general!

In [13]:
class Polygons:
    def __init__(self, m, R):
        if m < 3:
            raise ValueError('m must be at least 3')
        self._m = m
        self._R = R
        self._max_efficiency_polygon = None
        
    def __len__(self):
        return self._m - 2
    
    def __repr__(self):
        return f'Polygons(m={self._m}, R={self._R})'
    
    def __iter__(self):
        return PolygonsIterator(self._m, self._R)
    
    @property
    def max_efficiency_polygon(self):
        if self._max_efficiency_polygon is None:
            sorted_polygons = sorted(PolygonsIterator(self._m, self._R), 
                                     key=lambda p: p.area/p.perimeter,
                                    reverse=True)
            self._max_efficiency_polygon = sorted_polygons[0]
        return self._max_efficiency_polygon

And let's test that again:

In [14]:
polygons = Polygons(10, 1)
print(polygons.max_efficiency_polygon)

Polygon(n=10, R=1)


OK, that seems to work!
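
Since only the single most efficient polygon is needed, sorting the whole collection does more work than necessary. A possible alternative, sketched below, is the built-in `max` with the same `key`, which makes one pass over any iterable without building a sorted list. The `SimplePolygon` class and the `max_efficiency` function are hypothetical, cut-down stand-ins, included only so that the block runs on its own.

```python
import math


class SimplePolygon:
    # hypothetical stand-in for Polygon with just what the key function needs
    def __init__(self, n, R):
        self.n = n
        self.R = R

    def __repr__(self):
        return f'SimplePolygon(n={self.n}, R={self.R})'

    @property
    def side_length(self):
        return 2 * self.R * math.sin(math.pi / self.n)

    @property
    def area(self):
        apothem = self.R * math.cos(math.pi / self.n)
        return self.n / 2 * self.side_length * apothem

    @property
    def perimeter(self):
        return self.n * self.side_length


def max_efficiency(polygons):
    # single pass over the iterable; no sorted list is materialized
    return max(polygons, key=lambda p: p.area / p.perimeter)


# map() is lazy, so the polygons are created one at a time as max() consumes them
lazy_polygons = map(lambda n: SimplePolygon(n, 1), range(3, 11))
print(max_efficiency(lazy_polygons))
```

In the `Polygons` class the same idea would amount to replacing the `sorted(...)[0]` expression with a `max(...)` call over the iterator.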

# Section 06 - Generators

##  Yielding and Generators

Let's start by writing a "simple" iterator first using the techniques we learned in the previous section.

In [1]:
import math

In [2]:
class FactIter:
    def __init__(self, n):
        self.n = n
        self.i = 0

    def __iter__(self):
        return self

    def __next__(self):
        if self.i >= self.n:
            raise StopIteration
        else:
            result = math.factorial(self.i)
            self.i += 1
            return result

In [3]:
fact_iter = FactIter(5)

In [4]:
for num in fact_iter:
    print(num)

1
1
2
6
24


We could achieve the same thing using the second form of the `iter` function - we just have to know our sentinel value. In this case it would be the factorial of `n+1`, where `n` is the last integer whose factorial we want our iterator to produce:

In [5]:
def fact():
    i = 0
    def inner():
        nonlocal i
        result = math.factorial(i)
        i += 1
        return result
    return inner           

In [6]:
fact_iter = iter(fact(), math.factorial(5))

In [7]:
for num in fact_iter:
    print(num)

1
1
2
6
24


You'll note that in both cases `fact_iter` was an **iterator**. In the first example we implemented the iterator ourselves, in the second example Python built it for us.

The second example was a little less code, but maybe a little more difficult to understand if we were just shown the code without having written it ourselves.
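
One caveat with the sentinel form, illustrated in the sketch below: iteration stops the first time the callable returns a value equal to the sentinel, so the sentinel must not occur earlier in the sequence. Because `0!` and `1!` are both `1`, choosing `math.factorial(1)` as the sentinel ends the iteration before anything is produced.

```python
import math


def fact():
    i = 0

    def inner():
        nonlocal i
        result = math.factorial(i)
        i += 1
        return result
    return inner


# the very first call returns 0! == 1, which already equals the sentinel 1! == 1
print(list(iter(fact(), math.factorial(1))))

# 5! == 120 does not occur earlier, so this sentinel behaves as intended
print(list(iter(fact(), math.factorial(5))))
```

The first call prints an empty list, the second the five factorials shown earlier.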

There has to be a better way!!

And indeed, there is... generators.

Let's look at the `yield` statement first.

The `yield` statement is used almost like a `return` statement in a function - but there is a huge difference - when the `yield` statement is encountered, Python returns whatever value `yield` specifies, but it "pauses" execution of the function. We can then "call" the same function again and it will "resume" from where the last `yield` was encountered.

I say "call" because we do not "resume" the function by calling it - instead we use the function... `next()` !!!

Let's try it:

In [8]:
def my_func():
    print('line 1')
    yield 'Flying'
    print('line 2')
    yield 'Circus'    

In [9]:
my_func()

<generator object my_func at 0x0000019DA77D3BA0>

So, executing `my_func()` returned a generator object - it did not actually "run" the body of `my_func` (none of our print statements actually ran).

To do that, we need to use the `next()` function. 

`next()`?? Isn't that what we use for iteration??

In [10]:
gen_my_func = my_func()

In [11]:
next(gen_my_func)

line 1


'Flying'

In [12]:
next(gen_my_func)

line 2


'Circus'

And let's call it one more time:

In [13]:
next(gen_my_func)

StopIteration: 

A `StopIteration` exception.

Hmmm... `next`, `StopIteration`? What does this look like? 

An **iterator**!

And in fact that's exactly what Python generators are - they **are** iterators. 

If generators are iterators, they should implement the iterator **protocol**.

Let's see:

In [14]:
gen_my_func = my_func()

In [15]:
'__iter__' in dir(gen_my_func)

True

In [16]:
'__next__' in dir(gen_my_func)

True

And so we just have an iterator, which we can use with the `iter()` function and the `next()` function like any other iterator:

In [17]:
gen_my_func

<generator object my_func at 0x0000019DA78660A0>

In [18]:
iter(gen_my_func)

<generator object my_func at 0x0000019DA78660A0>

As you can see, the `iter` function returned the same object - something we expect with iterators.

So if this is an iterator that Python builds, how does it know when to stop the iteration (raise the `StopIteration` exception)?

In the example above, it seemed clear - when the function finished running - there were no more statements after that last `yield`.

What actually happens if a function finishes running and we don't explicitly return something?

Remember that Python fills in the gap, and returns `None`.

In general, the iteration will terminate when we **return** something from the function.

Let's take a look:

In [19]:
def squares(sentinel):
    i = 0
    while True:
        if i < sentinel:
            result = i**2
            i += 1
            yield result
        else:
            return 'all done!'

In [20]:
sq = squares(3)

In [21]:
next(sq)

0

In [22]:
next(sq)

1

In [23]:
next(sq)

4

In [24]:
next(sq)

StopIteration: all done!

And the return value of our function became the message of the `StopIteration` exception.
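
The returned value is also available programmatically, as the `value` attribute of the `StopIteration` exception. A short sketch, reusing the same `squares` generator function; a `for` loop handles the exception internally, so the value is only visible when `next()` is called by hand:

```python
def squares(sentinel):
    i = 0
    while True:
        if i < sentinel:
            result = i**2
            i += 1
            yield result
        else:
            return 'all done!'


sq = squares(3)
values = []
try:
    while True:
        values.append(next(sq))
except StopIteration as ex:
    # the generator's return value travels on the exception
    final = ex.value

print(values)
print(repr(final))
```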

But, we can simplify this slightly:

In [25]:
def squares(sentinel):
    i = 0
    while True:
        if i < sentinel:
            yield i**2
            i += 1 # note how we can increment **after** the yield
        else:
            return 'all done!'

In [26]:
for num in squares(5):
    print(num)

0
1
4
9
16


So now let's see how we could re-write our initial `factorial` example:

In [27]:
def factorials(n):
    for i in range(n):
        yield math.factorial(i)    

In [28]:
for num in factorials(5):
    print(num)

1
1
2
6
24


Now that's a much simpler and more understandable way to create the iterator!

Note that a generator **is** an iterator, but not vice-versa - iterators are not necessarily generators, just like sequences are iterables, but iterables are not necessarily sequences.
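
This relationship can be checked with the abstract base classes in `collections.abc`. In the sketch below, `count_up` is a hypothetical generator function used only for the comparison; the list iterator is an ordinary iterator that is not a generator.

```python
from collections.abc import Generator, Iterator


def count_up(n):
    # hypothetical generator function, just for the type checks below
    for i in range(n):
        yield i


gen = count_up(3)
list_iter = iter([1, 2, 3])

# a generator is both a Generator and an Iterator
print(isinstance(gen, Iterator), isinstance(gen, Generator))

# a list iterator is an Iterator, but not a Generator
print(isinstance(list_iter, Iterator), isinstance(list_iter, Generator))
```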

Another thing to note is that since generators are iterators, they also become exhausted (consumed) just like an iterator does.

In [29]:
facts = factorials(5)

In [30]:
list(facts)

[1, 1, 2, 6, 24]

In [31]:
list(facts)

[]

As you can see, our second iteration through the same generator ended up with nothing - that's because the generator has been exhausted:

In [32]:
next(facts)

StopIteration: 

##  Example: Fibonacci Sequence

Here is the Fibonacci sequence:

```
1 1 2 3 5 8 13 ...
```

As you can see there is a recursive definition of the numbers in this sequence:

```
Fib(n) = Fib(n-1) + Fib(n-2)
```
where 

```
Fib(0) = 1
``` 

and

```
Fib(1) = 1
```

Although we can certainly use a recursive approach to calculate the *n-th* number in the sequence, it is not a very effective method. We can of course help it by using memoization, but we'll still run into Python's maximum recursion depth: there is a maximum number of times a recursive function can call itself (creating a stack frame at every nested call) before Python raises an exception telling us that we have exceeded the maximum permitted depth. We can actually change that limit if we want to, but if we're running into it, it might be better to create a non-recursive algorithm - recursion can be elegant, but not particularly efficient.

In [1]:
def fib_recursive(n):
    if n <= 1:
        return 1
    else:
        return fib_recursive(n-1) + fib_recursive(n-2)

In [2]:
[fib_recursive(i) for i in range(7)]

[1, 1, 2, 3, 5, 8, 13]

But this quickly becomes an issue as `n` grows larger:

In [3]:
from timeit import timeit

In [4]:
timeit('fib_recursive(10)', globals=globals(), number=10)

0.00027306209887231856

In [5]:
timeit('fib_recursive(28)', globals=globals(), number=10)

1.5438638503706388

In [6]:
timeit('fib_recursive(29)', globals=globals(), number=10)

2.507533317368592

We can alleviate this by using memoization:

In [7]:
from functools import lru_cache

In [8]:
@lru_cache()
def fib_recursive(n):
    if n <= 1:
        return 1
    else:
        return fib_recursive(n-1) + fib_recursive(n-2)

In [9]:
timeit('fib_recursive(10)', globals=globals(), number=10)

9.75221781729374e-06

In [10]:
timeit('fib_recursive(29)', globals=globals(), number=10)

1.9775330573068572e-05

As you can see, performance is greatly improved, but we still have a recursion depth limit:

In [11]:
@lru_cache()
def fib_recursive(n):
    if n <= 1:
        return 1
    else:
        return fib_recursive(n-1) + fib_recursive(n-2)

In [12]:
fib_recursive(2000)

RecursionError: maximum recursion depth exceeded while calling a Python object
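
The limit itself can be inspected with `sys.getrecursionlimit()` (and changed with `sys.setrecursionlimit()`). The sketch below uses a hypothetical `depth` function, unrelated to Fibonacci, simply to show the exception being raised once the nesting exceeds the limit:

```python
import sys


def depth(n):
    # recurses n levels deep before returning
    if n == 0:
        return 0
    return 1 + depth(n - 1)


limit = sys.getrecursionlimit()
print(depth(50))  # shallow enough to stay within the default limit

try:
    depth(limit + 100)
    exceeded = False
except RecursionError:
    # nesting deeper than the limit raises RecursionError
    exceeded = True

print(exceeded)
```

Raising the limit only postpones the problem, since every nested call still consumes a stack frame.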

So we can use a non-recursive approach to calculate the `n-th` Fibonacci number:

In [13]:
def fib(n):
    fib_0 = 1
    fib_1 = 1
    for i in range(n-1):
        fib_0, fib_1 = fib_1, fib_0 + fib_1
    return fib_1

In [14]:
[fib(i) for i in range(7)]

[1, 1, 2, 3, 5, 8, 13]

This works well for large `n` values too:

In [15]:
timeit('fib(5000)', globals=globals(), number=10)

0.006382826561139865

So now, let's create an iterator approach so we can iterate over the sequence, but without materializing it (i.e. we want to use lazy evaluation, not eager evaluation).

Our first approach is going to be a custom iterator and iterable:

In [16]:
class Fib:
    def __init__(self, n):
        self.n = n
        
    def __iter__(self):
        return self.FibIter(self.n)
        
    class FibIter:
        def __init__(self, n):
            self.n = n
            self.i = 0
            
        def __iter__(self):
            return self
        
        def __next__(self):
            if self.i >= self.n:
                raise StopIteration
            else:
                result = fib(self.i)
                self.i += 1
                return result

And we can now iterate the usual way:

In [17]:
fib_iterable = Fib(7)

In [18]:
for num in fib_iterable:
    print(num)

1
1
2
3
5
8
13


Of course, we can also use the second form of the `iter` function too, but we have to create a closure first:

In [19]:
def fib_closure():
    i = 0
    def inner():
        nonlocal i
        result = fib(i)
        i += 1
        return result
    return inner

In [20]:
fib_numbers = fib_closure()
fib_iter = iter(fib_numbers, fib(7))
for num in fib_iter:
    print(num)

1
1
2
3
5
8
13


But there are two things to note here:

1. The syntax for either implementation is a little convoluted and not very clear
2. More importantly, notice what happens every time the `next` method is called - it has to calculate every Fibonacci number from scratch (using the `fib` function) - that is wasteful...

Instead, we can use a generator function very effectively here.

Here is our original `fib` function:

In [21]:
def fib(n):
    fib_0 = 1
    fib_1 = 1
    for i in range(n-1):
        fib_0, fib_1 = fib_1, fib_0 + fib_1
    return fib_1    

In [22]:
[fib(i) for i in range(7)]

[1, 1, 2, 3, 5, 8, 13]

Now let's modify it into a generator function:

In [23]:
def fib_gen(n):
    fib_0 = 1
    fib_1 = 1
    for i in range(n-1):
        fib_0, fib_1 = fib_1, fib_0 + fib_1
        yield fib_1    

In [24]:
[num for num in fib_gen(7)]

[2, 3, 5, 8, 13, 21]

We're almost there. We're missing the first two Fibonacci numbers in the sequence - we need to yield those too.

In [25]:
def fib_gen(n):
    fib_0 = 1
    yield fib_0
    fib_1 = 1
    yield fib_1
    for i in range(n-1):
        fib_0, fib_1 = fib_1, fib_0 + fib_1
        yield fib_1    

In [26]:
[num for num in fib_gen(7)]

[1, 1, 2, 3, 5, 8, 13, 21]

And finally we're returning one number too many if `n` is meant to indicate the length of the sequence:

In [27]:
def fib_gen(n):
    fib_0 = 1
    yield fib_0
    fib_1 = 1
    yield fib_1
    for i in range(n-2):
        fib_0, fib_1 = fib_1, fib_0 + fib_1
        yield fib_1    

And now everything works fine:

In [28]:
[num for num in fib_gen(7)]

[1, 1, 2, 3, 5, 8, 13]
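One edge case is worth noting: because the version above always yields the first two numbers, `fib_gen(0)` and `fib_gen(1)` both produce two values. Here is a minimal sketch of a variant that yields exactly `n` values for any non-negative `n` (the name `fib_gen_exact` is hypothetical and not part of the original code):

```python
def fib_gen_exact(n):
    # Yield exactly n Fibonacci numbers (1, 1, 2, 3, ...),
    # including the edge cases n = 0 and n = 1.
    fib_0, fib_1 = 1, 1
    for _ in range(n):
        yield fib_0
        fib_0, fib_1 = fib_1, fib_0 + fib_1


first_seven = list(fib_gen_exact(7))  # [1, 1, 2, 3, 5, 8, 13]
none_at_all = list(fib_gen_exact(0))  # []
just_one = list(fib_gen_exact(1))     # [1]
```

The loop yields the current value before advancing, so no special handling of the first two numbers is needed.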

Let's time it as well to compare it with the other methods:

In [29]:
timeit('[num for num in Fib(5_000)]', globals=globals(), number=1)

1.4024426054891919

In [30]:
fib_numbers = fib_closure()
sentinel = fib(5_001)

timeit('[num for num in iter(fib_numbers, sentinel)]', globals=globals(),
      number=1)

1.4315486413535154

In [31]:
timeit('[num for num in fib_gen(5_000)]', globals=globals(), number=1)

0.0013831895603644284

##  Making an Iterable from a Generator

As we now know, generators are iterators.

This means that they become exhausted - so sometimes we want to create an iterable instead.

There's no magic here: we simply have to implement a class that implements the iterable protocol.

Let's write a simple generator that generates the squares of integers:

In [1]:
def squares_gen(n):
    for i in range(n):
        yield i ** 2

Now, we can create a new generator:

In [2]:
sq = squares_gen(5)

In [3]:
for num in sq:
    print(num)

0
1
4
9
16


But, `sq` was an iterator - so now it's been exhausted:

In [4]:
next(sq)

StopIteration: 

To restart the iteration we have to create a new instance of the generator (iterator):

In [5]:
sq = squares_gen(5)

In [6]:
[num for num in sq]

[0, 1, 4, 9, 16]

So, let's wrap this in an iterable:

In [7]:
class Squares:
    def __init__(self, n):
        self.n = n
        
    def __iter__(self):
        return squares_gen(self.n)

In [8]:
sq = Squares(5)

In [9]:
[num for num in sq]

[0, 1, 4, 9, 16]

And we can do it again:

In [10]:
[num for num in sq]

[0, 1, 4, 9, 16]

We can put those pieces of code together if we prefer:

In [11]:
class Squares:
    def __init__(self, n):
        self.n = n
        
    @staticmethod
    def squares_gen(n):
        for i in range(n):
            yield i ** 2
        
    def __iter__(self):
        return Squares.squares_gen(self.n)

In [12]:
sq = Squares(5)

In [13]:
[num for num in sq]

[0, 1, 4, 9, 16]

#### Generators used with other Generators

I want to point out that you can also easily run into various bugs when you use generators with other generator functions.

Consider this example:

In [14]:
def squares(n):
    for i in range(n):
        yield i ** 2

In [15]:
sq = squares(5)

In [16]:
enum_sq = enumerate(sq)

Now `enumerate` is lazy, so `sq` has not, at this point, been consumed:

In [17]:
next(sq)

0

In [18]:
next(sq)

1

Since we have consumed two elements from `sq`, when we now use `enumerate` it will have two fewer elements from `sq`:

In [19]:
next(enum_sq)

(0, 4)

You'll notice that we don't get the first element of the original `sq` - instead we get the third element (`2 ** 2`).

Moreover, you'll notice that the index returned in the tuple produced by `enumerate` is 0, not 2!
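The behavior above can be reproduced in a few lines, together with one way to avoid it: give `enumerate` its own fresh generator instead of one that other code also advances. This is only a sketch, and the variable names `shared` and `independent` are illustrative:

```python
def squares(n):
    for i in range(n):
        yield i ** 2


# Shared generator: advancing sq also advances what enumerate will see.
sq = squares(5)
enum_sq = enumerate(sq)
next(sq)  # consumes 0
next(sq)  # consumes 1
shared = list(enum_sq)  # [(0, 4), (1, 9), (2, 16)]

# Fresh generator handed directly to enumerate: nothing else can consume it.
independent = list(enumerate(squares(5)))
# [(0, 0), (1, 1), (2, 4), (3, 9), (4, 16)]
```

The index produced by `enumerate` counts the items it has pulled itself, which is why it starts at 0 even though two squares were already consumed.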

##  Example: Card Deck

Let's go back to a previous example we worked with, our card deck.

Let's rebuild it quickly, but this time using generators instead of a custom iterator:

In [1]:
from collections import namedtuple

Card = namedtuple('Card', 'rank, suit')
SUITS = ('Spades', 'Hearts', 'Diamonds', 'Clubs')
RANKS = tuple(range(2, 11)) + tuple('JQKA')

Recall how we recovered the suit and rank for any given card index in the deck (assuming the sorting is `2S...AS  2H...AH  2D...AD  2C...AC`):

```
suit_index = card_index // len(RANKS)
rank_index = card_index % len(RANKS)
```

In [2]:
def card_gen():
    for i in range(len(SUITS) * len(RANKS)):
        suit = SUITS[i // len(RANKS)]
        rank = RANKS[i % len(RANKS)]
        card = Card(rank, suit)
        yield card

So now we can iterate using our generator:

In [3]:
for card in card_gen():
    print(card)

Card(rank=2, suit='Spades')
Card(rank=3, suit='Spades')
Card(rank=4, suit='Spades')
Card(rank=5, suit='Spades')
Card(rank=6, suit='Spades')
Card(rank=7, suit='Spades')
Card(rank=8, suit='Spades')
Card(rank=9, suit='Spades')
Card(rank=10, suit='Spades')
Card(rank='J', suit='Spades')
Card(rank='Q', suit='Spades')
Card(rank='K', suit='Spades')
Card(rank='A', suit='Spades')
Card(rank=2, suit='Hearts')
Card(rank=3, suit='Hearts')
Card(rank=4, suit='Hearts')
Card(rank=5, suit='Hearts')
Card(rank=6, suit='Hearts')
Card(rank=7, suit='Hearts')
Card(rank=8, suit='Hearts')
Card(rank=9, suit='Hearts')
Card(rank=10, suit='Hearts')
Card(rank='J', suit='Hearts')
Card(rank='Q', suit='Hearts')
Card(rank='K', suit='Hearts')
Card(rank='A', suit='Hearts')
Card(rank=2, suit='Diamonds')
Card(rank=3, suit='Diamonds')
Card(rank=4, suit='Diamonds')
Card(rank=5, suit='Diamonds')
Card(rank=6, suit='Diamonds')
Card(rank=7, suit='Diamonds')
Card(rank=8, suit='Diamonds')
Card(rank=9, suit='Diamonds')
Card(rank=10, 

But we can really simplify this further!

We don't have to use these indices at all!

In [4]:
def card_gen():
    for suit in SUITS:
        for rank in RANKS:
            card = Card(rank, suit)
            yield card

And we now have the same functionality:

In [5]:
for card in card_gen():
    print(card)

Card(rank=2, suit='Spades')
Card(rank=3, suit='Spades')
Card(rank=4, suit='Spades')
Card(rank=5, suit='Spades')
Card(rank=6, suit='Spades')
Card(rank=7, suit='Spades')
Card(rank=8, suit='Spades')
Card(rank=9, suit='Spades')
Card(rank=10, suit='Spades')
Card(rank='J', suit='Spades')
Card(rank='Q', suit='Spades')
Card(rank='K', suit='Spades')
Card(rank='A', suit='Spades')
Card(rank=2, suit='Hearts')
Card(rank=3, suit='Hearts')
Card(rank=4, suit='Hearts')
Card(rank=5, suit='Hearts')
Card(rank=6, suit='Hearts')
Card(rank=7, suit='Hearts')
Card(rank=8, suit='Hearts')
Card(rank=9, suit='Hearts')
Card(rank=10, suit='Hearts')
Card(rank='J', suit='Hearts')
Card(rank='Q', suit='Hearts')
Card(rank='K', suit='Hearts')
Card(rank='A', suit='Hearts')
Card(rank=2, suit='Diamonds')
Card(rank=3, suit='Diamonds')
Card(rank=4, suit='Diamonds')
Card(rank=5, suit='Diamonds')
Card(rank=6, suit='Diamonds')
Card(rank=7, suit='Diamonds')
Card(rank=8, suit='Diamonds')
Card(rank=9, suit='Diamonds')
Card(rank=10, 

We can now make it into an iterable:

In [6]:
class CardDeck:
    SUITS = ('Spades', 'Hearts', 'Diamonds', 'Clubs')
    RANKS = tuple(range(2, 11)) + tuple('JQKA')
        
    def __iter__(self):
        return CardDeck.card_gen()
    
    @staticmethod
    def card_gen():
        for suit in CardDeck.SUITS:
            for rank in CardDeck.RANKS:
                card = Card(rank, suit)
                yield card
        

In [7]:
deck = CardDeck()

In [8]:
[card for card in deck]

[Card(rank=2, suit='Spades'),
 Card(rank=3, suit='Spades'),
 Card(rank=4, suit='Spades'),
 Card(rank=5, suit='Spades'),
 Card(rank=6, suit='Spades'),
 Card(rank=7, suit='Spades'),
 Card(rank=8, suit='Spades'),
 Card(rank=9, suit='Spades'),
 Card(rank=10, suit='Spades'),
 Card(rank='J', suit='Spades'),
 Card(rank='Q', suit='Spades'),
 Card(rank='K', suit='Spades'),
 Card(rank='A', suit='Spades'),
 Card(rank=2, suit='Hearts'),
 Card(rank=3, suit='Hearts'),
 Card(rank=4, suit='Hearts'),
 Card(rank=5, suit='Hearts'),
 Card(rank=6, suit='Hearts'),
 Card(rank=7, suit='Hearts'),
 Card(rank=8, suit='Hearts'),
 Card(rank=9, suit='Hearts'),
 Card(rank=10, suit='Hearts'),
 Card(rank='J', suit='Hearts'),
 Card(rank='Q', suit='Hearts'),
 Card(rank='K', suit='Hearts'),
 Card(rank='A', suit='Hearts'),
 Card(rank=2, suit='Diamonds'),
 Card(rank=3, suit='Diamonds'),
 Card(rank=4, suit='Diamonds'),
 Card(rank=5, suit='Diamonds'),
 Card(rank=6, suit='Diamonds'),
 Card(rank=7, suit='Diamonds'),
 Card(rank

And of course we can do it again:

In [9]:
[card for card in deck]

[Card(rank=2, suit='Spades'),
 Card(rank=3, suit='Spades'),
 Card(rank=4, suit='Spades'),
 Card(rank=5, suit='Spades'),
 Card(rank=6, suit='Spades'),
 Card(rank=7, suit='Spades'),
 Card(rank=8, suit='Spades'),
 Card(rank=9, suit='Spades'),
 Card(rank=10, suit='Spades'),
 Card(rank='J', suit='Spades'),
 Card(rank='Q', suit='Spades'),
 Card(rank='K', suit='Spades'),
 Card(rank='A', suit='Spades'),
 Card(rank=2, suit='Hearts'),
 Card(rank=3, suit='Hearts'),
 Card(rank=4, suit='Hearts'),
 Card(rank=5, suit='Hearts'),
 Card(rank=6, suit='Hearts'),
 Card(rank=7, suit='Hearts'),
 Card(rank=8, suit='Hearts'),
 Card(rank=9, suit='Hearts'),
 Card(rank=10, suit='Hearts'),
 Card(rank='J', suit='Hearts'),
 Card(rank='Q', suit='Hearts'),
 Card(rank='K', suit='Hearts'),
 Card(rank='A', suit='Hearts'),
 Card(rank=2, suit='Diamonds'),
 Card(rank=3, suit='Diamonds'),
 Card(rank=4, suit='Diamonds'),
 Card(rank=5, suit='Diamonds'),
 Card(rank=6, suit='Diamonds'),
 Card(rank=7, suit='Diamonds'),
 Card(rank

One thing we don't have here is the support for `reversed`:

In [10]:
reversed(CardDeck())

TypeError: 'CardDeck' object is not reversible

But we can add it in by implementing the `__reversed__` method and returning an iterator that iterates the deck in reverse order.

In [11]:
class CardDeck:
    SUITS = ('Spades', 'Hearts', 'Diamonds', 'Clubs')
    RANKS = tuple(range(2, 11)) + tuple('JQKA')
        
    def __iter__(self):
        return CardDeck.card_gen()
    
    def __reversed__(self):
        return CardDeck.reversed_card_gen()
    
    @staticmethod
    def card_gen():
        for suit in CardDeck.SUITS:
            for rank in CardDeck.RANKS:
                card = Card(rank, suit)
                yield card
        
    @staticmethod
    def reversed_card_gen():
        for suit in reversed(CardDeck.SUITS):
            for rank in reversed(CardDeck.RANKS):
                card = Card(rank, suit)
                yield card

In [12]:
rev = reversed(CardDeck())

In [13]:
[card for card in rev]

[Card(rank='A', suit='Clubs'),
 Card(rank='K', suit='Clubs'),
 Card(rank='Q', suit='Clubs'),
 Card(rank='J', suit='Clubs'),
 Card(rank=10, suit='Clubs'),
 Card(rank=9, suit='Clubs'),
 Card(rank=8, suit='Clubs'),
 Card(rank=7, suit='Clubs'),
 Card(rank=6, suit='Clubs'),
 Card(rank=5, suit='Clubs'),
 Card(rank=4, suit='Clubs'),
 Card(rank=3, suit='Clubs'),
 Card(rank=2, suit='Clubs'),
 Card(rank='A', suit='Diamonds'),
 Card(rank='K', suit='Diamonds'),
 Card(rank='Q', suit='Diamonds'),
 Card(rank='J', suit='Diamonds'),
 Card(rank=10, suit='Diamonds'),
 Card(rank=9, suit='Diamonds'),
 Card(rank=8, suit='Diamonds'),
 Card(rank=7, suit='Diamonds'),
 Card(rank=6, suit='Diamonds'),
 Card(rank=5, suit='Diamonds'),
 Card(rank=4, suit='Diamonds'),
 Card(rank=3, suit='Diamonds'),
 Card(rank=2, suit='Diamonds'),
 Card(rank='A', suit='Hearts'),
 Card(rank='K', suit='Hearts'),
 Card(rank='Q', suit='Hearts'),
 Card(rank='J', suit='Hearts'),
 Card(rank=10, suit='Hearts'),
 Card(rank=9, suit='Hearts'),


##  Generator Expressions

Recall how list comprehensions worked:

In [1]:
l = [i ** 2 for i in range(5)]

In [2]:
l

[0, 1, 4, 9, 16]

The expression inside the `[]` brackets is called a comprehension expression.

The `[]` brackets resulted in a list being created.

We can easily create a **generator** by using `()` parentheses instead of the `[]` brackets:

In [3]:
g = (i ** 2 for i in range(5))

Note that `g` is a generator, and is also lazily evaluated:

In [4]:
type(g)

generator

In [5]:
for item in g:
    print(item)

0
1
4
9
16


And now the generator has been exhausted:

In [6]:
for item in g:
    print(item)

Scoping works the same way with generator expressions as with list comprehensions, i.e. generator expressions are created by Python using a function, and therefore have local scopes and can access enclosing nonlocal and global scopes.
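As a small illustration of these scoping rules, here is a sketch in which a generator expression reads a global name and a variable from its enclosing function, while its own loop variable stays local to the expression. The names `multiplier`, `offset`, `make_gen` and `loop_variable_leaks` are made up for this example:

```python
multiplier = 10  # global name


def make_gen():
    offset = 1  # local to make_gen, nonlocal as seen from the generator expression
    return (multiplier * i + offset for i in range(3))


values = list(make_gen())  # [1, 11, 21]


def loop_variable_leaks():
    g = (i for i in range(3))
    # The loop variable i lives in the generator expression's own scope,
    # so it does not show up among this function's local names.
    return 'i' in locals()


leaked = loop_variable_leaks()  # False
```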

In [7]:
import dis

Recall for list comprehensions:

In [8]:
exp = compile('[i**2 for i in range(5)]', filename='<string>', mode='eval')

In [9]:
dis.dis(exp)

  1           0 LOAD_CONST               0 (<code object <listcomp> at 0x000002181BDEEDB0, file "<string>", line 1>)
              2 LOAD_CONST               1 ('<listcomp>')
              4 MAKE_FUNCTION            0
              6 LOAD_NAME                0 (range)
              8 LOAD_CONST               2 (5)
             10 CALL_FUNCTION            1
             12 GET_ITER
             14 CALL_FUNCTION            1
             16 RETURN_VALUE


In [10]:
exp = compile('(i ** 2 for i in range(5))', filename='<string>', mode='eval')

In [11]:
dis.dis(exp)

  1           0 LOAD_CONST               0 (<code object <genexpr> at 0x000002181BE32150, file "<string>", line 1>)
              2 LOAD_CONST               1 ('<genexpr>')
              4 MAKE_FUNCTION            0
              6 LOAD_NAME                0 (range)
              8 LOAD_CONST               2 (5)
             10 CALL_FUNCTION            1
             12 GET_ITER
             14 CALL_FUNCTION            1
             16 RETURN_VALUE


As you can see the internal mechanism for list comprehensions and generator expressions is almost the same - in particular note how a function is created. (This bytecode comes from an older version of Python; since Python 3.12, list comprehensions are inlined rather than compiled into a separate function, while generator expressions still create one.) The main difference is that in one case a list is created (an iterable), while in the other a generator (an iterator) is produced.

We can iterate over the list produced by a list comprehension multiple times, since it is an iterable. However, we can only iterate over a generator expression once, since it is an iterator.

In [12]:
l = [i ** 2 for i in range(5)]

In [13]:
type(l)

list

In [14]:
g = (i ** 2 for i in range(5))

In [15]:
type(g)

generator

#### Nested Comprehensions

Just as with list comprehensions, we can nest generator expressions too:

Let's use some of the same examples we saw with nested list comprehensions.

##### Example 1

A multiplication table:

Using a list comprehension approach first:

In [16]:
start = 1
stop = 10

mult_list = [ [i * j 
               for j in range(start, stop+1)]
             for i in range(start, stop+1)]

In [17]:
mult_list

[[1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
 [2, 4, 6, 8, 10, 12, 14, 16, 18, 20],
 [3, 6, 9, 12, 15, 18, 21, 24, 27, 30],
 [4, 8, 12, 16, 20, 24, 28, 32, 36, 40],
 [5, 10, 15, 20, 25, 30, 35, 40, 45, 50],
 [6, 12, 18, 24, 30, 36, 42, 48, 54, 60],
 [7, 14, 21, 28, 35, 42, 49, 56, 63, 70],
 [8, 16, 24, 32, 40, 48, 56, 64, 72, 80],
 [9, 18, 27, 36, 45, 54, 63, 72, 81, 90],
 [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]]

The equivalent generator expression would be:

In [18]:
start = 1
stop = 10

mult_list = ( (i * j 
               for j in range(start, stop+1))
             for i in range(start, stop+1))

In [19]:
mult_list

<generator object <genexpr> at 0x000002181BDD9CA8>

We can iterate through `mult_list`:

In [20]:
table = list(mult_list)

In [21]:
table

[<generator object <genexpr>.<genexpr> at 0x000002181BDD9DB0>,
 <generator object <genexpr>.<genexpr> at 0x000002181BDD9E08>,
 <generator object <genexpr>.<genexpr> at 0x000002181BDD9A98>,
 <generator object <genexpr>.<genexpr> at 0x000002181BDD9D58>,
 <generator object <genexpr>.<genexpr> at 0x000002181BDD96D0>,
 <generator object <genexpr>.<genexpr> at 0x000002181BDD9BF8>,
 <generator object <genexpr>.<genexpr> at 0x000002181BDD9F68>,
 <generator object <genexpr>.<genexpr> at 0x000002181BDD9E60>,
 <generator object <genexpr>.<genexpr> at 0x000002181BDD9BA0>,
 <generator object <genexpr>.<genexpr> at 0x000002181BDD9A40>]

But you'll notice that our rows are themselves generators!

To fully materialize the table we need to iterate through the row generators too:

In [22]:
table_rows = [list(gen) for gen in table]

In [23]:
table_rows

[[10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
 [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
 [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
 [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
 [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
 [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
 [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
 [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
 [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
 [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]]
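Notice that every row in this output is the same. Each inner generator refers to `i` as a free variable and only looks it up when the row is actually consumed; since `list(mult_list)` had already exhausted the outer generator, `i` was `10` for every row. Below is a small sketch of the problem and of one way around it (consuming each row while the outer generator is still positioned on it), using a 3x3 table to keep the output short. The names `lazy_table`, `stale` and `correct` are illustrative only:

```python
start = 1
stop = 3

# Problem: exhaust the outer generator first, then consume the rows.
lazy_table = ((i * j for j in range(start, stop + 1))
              for i in range(start, stop + 1))
rows = list(lazy_table)             # outer loop has finished, so i is now 3
stale = [list(row) for row in rows]
# [[3, 6, 9], [3, 6, 9], [3, 6, 9]]

# One fix: materialize each row as soon as the outer generator yields it.
lazy_table = ((i * j for j in range(start, stop + 1))
              for i in range(start, stop + 1))
correct = [list(row) for row in lazy_table]
# [[1, 2, 3], [2, 4, 6], [3, 6, 9]]
```

In the second version each inner generator is consumed while `i` still has the value belonging to that row.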

Of course, we can mix list comprehensions and generators. 

In this modification, we'll make the rows list comprehensions, and retain the generator expression in the outer comprehension:

In [24]:
start = 1
stop = 10

mult_list = ( [i * j 
               for j in range(start, stop+1)]
             for i in range(start, stop+1))

Notice what is happening here: the table itself is lazily evaluated, i.e. the rows are not yielded until they are requested - but once a row is requested, the list comprehension that defines the row will be entirely evaluated at that point:

In [25]:
for item in mult_list:
    print(item)

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
[2, 4, 6, 8, 10, 12, 14, 16, 18, 20]
[3, 6, 9, 12, 15, 18, 21, 24, 27, 30]
[4, 8, 12, 16, 20, 24, 28, 32, 36, 40]
[5, 10, 15, 20, 25, 30, 35, 40, 45, 50]
[6, 12, 18, 24, 30, 36, 42, 48, 54, 60]
[7, 14, 21, 28, 35, 42, 49, 56, 63, 70]
[8, 16, 24, 32, 40, 48, 56, 64, 72, 80]
[9, 18, 27, 36, 45, 54, 63, 72, 81, 90]
[10, 20, 30, 40, 50, 60, 70, 80, 90, 100]


##### Example 2

Let's try Pascal's triangle again:

```
1
1 1
1 2 1
1 3 3 1
1 4 6 4 1
```

We just need to know how to calculate combinations:
```
C(n, k) = n! / (k! (n-k)!)
```

* row 0, column 0: n=0, k=0: c(0, 0) = 0! / (0! 0!) = 1/1 = 1
* row 4, column 2: n=4, k=2: c(4, 2) = 4! / (2! 2!) = (4x3x2) / (2x2) = 6

In other words, we need to calculate the following list of lists:
```
c(0,0)
c(1,0) c(1,1)
c(2,0) c(2,1) c(2,2)
c(3,0) c(3,1) c(3,2) c(3,3)
...
```
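As a quick sanity check of the formula, the sketch below compares a factorial-based `combo` function against `math.comb`, which is available in the standard library from Python 3.8 onwards (that version requirement is the only assumption here):

```python
from math import comb, factorial


def combo(n, k):
    # C(n, k) = n! / (k! (n-k)!), using integer division to stay in integers
    return factorial(n) // (factorial(k) * factorial(n - k))


row_4 = [combo(4, k) for k in range(5)]  # [1, 4, 6, 4, 1]

# The hand-written formula agrees with math.comb for the first few rows.
agrees = all(combo(n, k) == comb(n, k)
             for n in range(11)
             for k in range(n + 1))
```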

Here's how we did it using a list comprehension:

In [26]:
from math import factorial

def combo(n, k):
    return factorial(n) // (factorial(k) * factorial(n-k))

size = 10  # global variable
pascal = [ [combo(n, k) for k in range(n+1)] for n in range(size+1) ]

In [27]:
pascal

[[1],
 [1, 1],
 [1, 2, 1],
 [1, 3, 3, 1],
 [1, 4, 6, 4, 1],
 [1, 5, 10, 10, 5, 1],
 [1, 6, 15, 20, 15, 6, 1],
 [1, 7, 21, 35, 35, 21, 7, 1],
 [1, 8, 28, 56, 70, 56, 28, 8, 1],
 [1, 9, 36, 84, 126, 126, 84, 36, 9, 1],
 [1, 10, 45, 120, 210, 252, 210, 120, 45, 10, 1]]

We can now use generator expressions for either one or both of the nested list comprehensions. In this case I'll use it for both:

In [28]:
size = 10  # global variable
pascal = ( (combo(n, k) for k in range(n+1)) for n in range(size+1) )

If we want to materialize the triangle into a list we'll need to do so ourselves:

In [29]:
[list(row) for row in pascal]

[[1],
 [1, 1],
 [1, 2, 1],
 [1, 3, 3, 1],
 [1, 4, 6, 4, 1],
 [1, 5, 10, 10, 5, 1],
 [1, 6, 15, 20, 15, 6, 1],
 [1, 7, 21, 35, 35, 21, 7, 1],
 [1, 8, 28, 56, 70, 56, 28, 8, 1],
 [1, 9, 36, 84, 126, 126, 84, 36, 9, 1],
 [1, 10, 45, 120, 210, 252, 210, 120, 45, 10, 1]]

#### Timings

So we see that the main difference between the two approaches is that in one case we have a fully materialized list (i.e. all the elements have been created and put into list objects), while in the other we are dealing with lazily evaluated iterators.

One main advantage to using generators is that we do not need the up-front calculations - if we end up not consuming the entire iterator, we have saved some time.

The other advantage, as we saw with lazy iterators, is that you do not need to have the entire data set in memory at one time. We saw an example of this when reading files - we can read extremely large files one row at a time, without having to store the entire file in memory.

Let's see the time difference between creating a list comprehension and a generator expression for a large Pascal triangle:

In [30]:
from timeit import timeit

In [31]:
size = 600

In [32]:
timeit('[[combo(n, k) for k in range(n+1)] for n in range(size+1)]',
      globals=globals(), number=1)

3.937809023402072

In [33]:
timeit('((combo(n, k) for k in range(n+1)) for n in range(size+1))',
      globals=globals(), number=1)

3.5216342118005173e-06

As you can see, much faster - but that's because we haven't actually done anything other than set up the nested iterators. Since no iteration took place, no calculations were performed.

In fact, even if we make the inner generator expression a list comprehension, those will not be calculated until the individual rows from the outer generator expression are requested:

In [34]:
timeit('([combo(n, k) for k in range(n+1)] for n in range(size+1))',
      globals=globals(), number=1)

7.314163362526216e-06

In fact, we can quickly create a **huge** Pascal triangle using the generator approach:

In [35]:
size = 100_000

timeit('([combo(n, k) for k in range(n+1)] for n in range(size+1))',
      globals=globals(), number=1)

5.959688666123952e-06

What about timing both creating **and** iterating through all the elements?

Let's do this by creating some functions that will do that:

In [36]:
def pascal_list(size):
    l = [[combo(n, k) for k in range(n+1)] for n in range(size+1)]
    for row in l:
        for item in row:
            pass

In [37]:
def pascal_gen(size):
    g = ((combo(n, k) for k in range(n+1)) for n in range(size+1))
    for row in g:
        for item in row:
            pass

In [38]:
size = 600
timeit('pascal_list(size)', globals=globals(), number=1)

3.9256347339324087

In [39]:
size = 600
timeit('pascal_gen(size)', globals=globals(), number=1)

3.835240885197976

So as you can see, if we actually iterate through each element, we don't end up saving any time - however, creating the iterator is faster, and if we don't use all the elements, then it will be more efficient.
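To make the "don't use all the elements" case concrete, here is a sketch that sets up a very large lazy triangle but only ever computes its first five rows. It uses `itertools.islice` from the standard library; the name `first_rows` is illustrative:

```python
from itertools import islice
from math import factorial


def combo(n, k):
    return factorial(n) // (factorial(k) * factorial(n - k))


size = 100_000

# Setting this up is cheap: no row is computed until it is requested.
pascal = ([combo(n, k) for k in range(n + 1)] for n in range(size + 1))

# Only the first five rows are ever calculated.
first_rows = list(islice(pascal, 5))
# [[1], [1, 1], [1, 2, 1], [1, 3, 3, 1], [1, 4, 6, 4, 1]]
```

The remaining rows are never evaluated unless something keeps iterating over `pascal`.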

#### Memory Usage

Memory usage is another area where generators can be far more efficient.

To see this, we'll use a rough technique and the `tracemalloc` standard library module:

In [40]:
import tracemalloc

In [49]:
def pascal_list(size):
    l = [[combo(n, k) for k in range(n+1)] for n in range(size+1)]
    for row in l:
        for item in row:
            pass
    stats = tracemalloc.take_snapshot().statistics('lineno')
    print(stats[0].size, 'bytes')

In [50]:
def pascal_gen(size):
    g = ((combo(n, k) for k in range(n+1)) for n in range(size+1))
    for row in g:
        for item in row:
            pass
    stats = tracemalloc.take_snapshot().statistics('lineno')
    print(stats[0].size, 'bytes')

In [47]:
tracemalloc.stop()
tracemalloc.clear_traces()
tracemalloc.start()
pascal_list(300)

1998608 bytes


In [48]:
tracemalloc.stop()
tracemalloc.clear_traces()
tracemalloc.start()
pascal_gen(300)

222 bytes


As you can see, using a generator did not require as much memory. Because we are essentially using a lazy iterator, the memory required by a previous result is released once the next iteration is requested.
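The snapshot statistics used above report the size of the largest allocation site. As a complementary rough check, `tracemalloc.get_traced_memory()` returns the current and peak traced sizes, so the peak of both approaches can be compared directly. This is a sketch rather than a rigorous benchmark; the helper `peak_bytes` and the function names are hypothetical, and the exact byte counts will vary between Python versions and platforms:

```python
import tracemalloc
from math import factorial


def combo(n, k):
    return factorial(n) // (factorial(k) * factorial(n - k))


def consume_list(size):
    rows = [[combo(n, k) for k in range(n + 1)] for n in range(size + 1)]
    for row in rows:
        for item in row:
            pass


def consume_gen(size):
    rows = ((combo(n, k) for k in range(n + 1)) for n in range(size + 1))
    for row in rows:
        for item in row:
            pass


def peak_bytes(func, size):
    # Trace allocations only while func runs and report the peak traced size.
    tracemalloc.start()
    try:
        func(size)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


list_peak = peak_bytes(consume_list, 100)
gen_peak = peak_bytes(consume_gen, 100)
print(list_peak, 'bytes (list) vs', gen_peak, 'bytes (generator)')
```

The list version has to keep every row alive until the function returns, while the generator version only holds one value at a time, so its peak should be much lower.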

##  Yield From

In the last video we saw that when we had two nested generators, we had to use a nested loop in order to iterate through both iterators:

In [1]:
def matrix(n):
    gen = ( (i * j for j in range(1, n+1))
            for i in range(1, n+1)
          )
    return gen

In [2]:
m = list(matrix(5))

In [3]:
m

[<generator object matrix.<locals>.<genexpr>.<genexpr> at 0x0000028236EC2BF8>,
 <generator object matrix.<locals>.<genexpr>.<genexpr> at 0x0000028236EC2C50>,
 <generator object matrix.<locals>.<genexpr>.<genexpr> at 0x0000028236EC2EB8>,
 <generator object matrix.<locals>.<genexpr>.<genexpr> at 0x0000028236EC2F10>,
 <generator object matrix.<locals>.<genexpr>.<genexpr> at 0x0000028236EC29E8>]

Suppose we want an iterator to iterate over all the values of the matrix, element by element.

We could write it this way:

In [4]:
def matrix_iterator(n):
    for row in matrix(n):
        for item in row:
            yield item

All we have done here is create a generator (iterator) that can be used to iterate over the elements of a nested iterator.

We can then use it this way:

In [5]:
for i in matrix_iterator(3):
    print(i)

1
2
3
2
4
6
3
6
9


But we can avoid using that nested for loop by using a special form of `yield`: `yield from`

In [6]:
def matrix_iterator(n):
    for row in matrix(n):
        yield from row

In [7]:
for i in matrix_iterator(3):
    print(i)

1
2
3
2
4
6
3
6
9


As you can see we obtain the same result.

We can think of 
```
yield from <iterator>
```
as a replacement for the code:
```
for i in <iterator>:
    yield i
```
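Note that `yield from` accepts any iterable, not only iterators or generators. A minimal sketch (the function name `chain_all` is made up for this example):

```python
def chain_all(*iterables):
    # Delegate to each iterable in turn: lists, strings, ranges, generators...
    for iterable in iterables:
        yield from iterable


combined = list(chain_all([1, 2], 'ab', range(3)))
# [1, 2, 'a', 'b', 0, 1, 2]
```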

We'll come back to `yield from` in more detail, because there's a **lot** more to it than just a simple replacement for that inner loop!

#### Example

Here's an example where using `yield from` can be quite effective.

In this example we need to read car brands from multiple files and combine them into a single collection.

We might do it this way:

In [8]:
brands = []

with open('car-brands-1.txt') as f:
    for brand in f:
        brands.append(brand.strip('\n'))
        
with open('car-brands-2.txt') as f:
    for brand in f:
        brands.append(brand.strip('\n'))
        
with open('car-brands-3.txt') as f:
    for brand in f:
        brands.append(brand.strip('\n'))

In [9]:
for brand in brands:
    print(brand, end=', ')

Alfa Romeo, Aston Martin, Audi, Bentley, Benz, BMW, Bugatti, Cadillac, Chevrolet, Chrysler, Citroën, Corvette, DAF, Dacia, Daewoo, Daihatsu, Datsun, De Lorean, Dino, Dodge, Farboud, Ferrari, Fiat, Ford, Honda, Hummer, Hyundai, Jaguar, Jeep, KIA, Koenigsegg, Lada, Lamborghini, Lancia, Land Rover, Lexus, Ligier, Lincoln, Lotus, Martini, Maserati, Maybach, Mazda, McLaren, Mercedes-Benz, Mini, Mitsubishi, Nissan, Noble, Opel, Peugeot, Pontiac, Porsche, Renault, Rolls-Royce, Saab, Seat, Škoda, Smart, Spyker, Subaru, Suzuki, Toyota, Vauxhall, Volkswagen, Volvo, 

But notice that we had to load up the entire data set in memory.

As we have discussed before this is not very efficient.

Instead we could use a generator approach as follows:

In [10]:
def brands(*files):
    for f_name in files:
        with open(f_name) as f:
            for line in f:
                yield line.strip('\n')

In [11]:
files = 'car-brands-1.txt', 'car-brands-2.txt', 'car-brands-3.txt'
for brand in brands(*files):
    print(brand, end = ', ')

Alfa Romeo, Aston Martin, Audi, Bentley, Benz, BMW, Bugatti, Cadillac, Chevrolet, Chrysler, Citroën, Corvette, DAF, Dacia, Daewoo, Daihatsu, Datsun, De Lorean, Dino, Dodge, Farboud, Ferrari, Fiat, Ford, Honda, Hummer, Hyundai, Jaguar, Jeep, KIA, Koenigsegg, Lada, Lamborghini, Lancia, Land Rover, Lexus, Ligier, Lincoln, Lotus, Martini, Maserati, Maybach, Mazda, McLaren, Mercedes-Benz, Mini, Mitsubishi, Nissan, Noble, Opel, Peugeot, Pontiac, Porsche, Renault, Rolls-Royce, Saab, Seat, Škoda, Smart, Spyker, Subaru, Suzuki, Toyota, Vauxhall, Volkswagen, Volvo, 

We can simplify our function by using `yield from`:

In [12]:
def brands(*files):
    for f_name in files:
        with open(f_name) as f:
            yield from f

In [13]:
for brand in brands(*files):
    print(brand, end=', ')

Alfa Romeo
, Aston Martin
, Audi
, Bentley
, Benz
, BMW
, Bugatti
, Cadillac
, Chevrolet
, Chrysler
, Citroën
, Corvette
, DAF
, Dacia
, Daewoo
, Daihatsu
, Datsun
, De Lorean
, Dino
, Dodge, Farboud
, Ferrari
, Fiat
, Ford
, Honda
, Hummer
, Hyundai
, Jaguar
, Jeep
, KIA
, Koenigsegg
, Lada
, Lamborghini
, Lancia
, Land Rover
, Lexus
, Ligier
, Lincoln
, Lotus
, Martini, Maserati
, Maybach
, Mazda
, McLaren
, Mercedes-Benz
, Mini
, Mitsubishi
, Nissan
, Noble
, Opel
, Peugeot
, Pontiac
, Porsche
, Renault
, Rolls-Royce
, Saab
, Seat
, Škoda
, Smart
, Spyker
, Subaru
, Suzuki
, Toyota
, Vauxhall
, Volkswagen
, Volvo, 

Now we still have to clean up that trailing `\n` character...

So, we are going to create generators that can read each line of the file, and yield a clean result, and we'll `yield from` that generator:

In [14]:
def gen_clean_read(file):
    with open(file) as f:
        for line in f:
            yield line.strip('\n')

As you can see, this generator function will clean each line of the file before yielding it. Let's try it with a single file and make sure it works:

In [15]:
f1 = gen_clean_read('car-brands-1.txt')
for line in f1:
    print(line, end=', ')

Alfa Romeo, Aston Martin, Audi, Bentley, Benz, BMW, Bugatti, Cadillac, Chevrolet, Chrysler, Citroën, Corvette, DAF, Dacia, Daewoo, Daihatsu, Datsun, De Lorean, Dino, Dodge, 

Ok, that works. So now, we can proceed with our overarching generator function as before, except we'll `yield from` our generators, instead of directly from the file iterator:

In [16]:
files = 'car-brands-1.txt', 'car-brands-2.txt', 'car-brands-3.txt'

In [17]:
def brands(*files):
    for file in files:
        yield from gen_clean_read(file)

In [18]:
for brand in brands(*files):
    print(brand, end=', ')

Alfa Romeo, Aston Martin, Audi, Bentley, Benz, BMW, Bugatti, Cadillac, Chevrolet, Chrysler, Citroën, Corvette, DAF, Dacia, Daewoo, Daihatsu, Datsun, De Lorean, Dino, Dodge, Farboud, Ferrari, Fiat, Ford, Honda, Hummer, Hyundai, Jaguar, Jeep, KIA, Koenigsegg, Lada, Lamborghini, Lancia, Land Rover, Lexus, Ligier, Lincoln, Lotus, Martini, Maserati, Maybach, Mazda, McLaren, Mercedes-Benz, Mini, Mitsubishi, Nissan, Noble, Opel, Peugeot, Pontiac, Porsche, Renault, Rolls-Royce, Saab, Seat, Škoda, Smart, Spyker, Subaru, Suzuki, Toyota, Vauxhall, Volkswagen, Volvo, 

I want to point out that in this particular instance, we are using `yield from` as a simple replacement for a `for` loop. We could equally well have written it this way:

Using `yield from`:

In [19]:
def brands(*files):
    for file in files:
        yield from gen_clean_read(file)

Without using `yield from`:

In [20]:
def brands(*files):
    for file in files:
        for line in gen_clean_read(file):
            yield line

In [21]:
for brand in brands(*files):
    print(brand, end=', ')

Alfa Romeo, Aston Martin, Audi, Bentley, Benz, BMW, Bugatti, Cadillac, Chevrolet, Chrysler, Citroën, Corvette, DAF, Dacia, Daewoo, Daihatsu, Datsun, De Lorean, Dino, Dodge, Farboud, Ferrari, Fiat, Ford, Honda, Hummer, Hyundai, Jaguar, Jeep, KIA, Koenigsegg, Lada, Lamborghini, Lancia, Land Rover, Lexus, Ligier, Lincoln, Lotus, Martini, Maserati, Maybach, Mazda, McLaren, Mercedes-Benz, Mini, Mitsubishi, Nissan, Noble, Opel, Peugeot, Pontiac, Porsche, Renault, Rolls-Royce, Saab, Seat, Škoda, Smart, Spyker, Subaru, Suzuki, Toyota, Vauxhall, Volkswagen, Volvo, 

We'll come back to `yield from` in a lot more detail later when we study coroutines - there's a whole lot more to `yield from` than a replacement for a simple loop!
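If you do not have the `car-brands-*.txt` files at hand, the same pattern can be tried out with a self-contained sketch that writes two small temporary files first. The file names and contents below are invented purely for illustration:

```python
import os
import tempfile


def gen_clean_read(file):
    with open(file) as f:
        for line in f:
            yield line.strip('\n')


def brands(*files):
    for file in files:
        yield from gen_clean_read(file)


with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical sample data; the second file has no trailing newline.
    sample = {'brands-a.txt': 'Audi\nBMW\n', 'brands-b.txt': 'Fiat\nFord'}
    paths = []
    for name, text in sample.items():
        path = os.path.join(tmp_dir, name)
        with open(path, 'w') as f:
            f.write(text)
        paths.append(path)

    # Consume the generator while the temporary files still exist.
    result = list(brands(*paths))

print(result)
```

Because the generator is lazy, it has to be consumed inside the `with` block, before the temporary directory is removed.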

# Section 07 - Project 3

##  Project

For this project you are given a file that contains some parking ticket violations for NYC.

(It's just a tiny extract!)

If you're wondering where I get these data sets, Kaggle is an **excellent** source of data sets in a whole variety of topics: 
https://www.kaggle.com/

You have to sign up, but it's free.

If you want the full data set, it's available here: https://www.kaggle.com/new-york-city/nyc-parking-tickets/version/2#

For this sample data set, the file is named: 
```
nyc_parking_tickets_extract.csv
```

Your goals are as follows:

##### Goal 1
Create a lazy iterator that will return a named tuple of the data in each row. The data types should be appropriate - i.e. if the column is a date, you should be storing dates in the named tuple, if the field is an integer, then it should be stored as an integer, etc.

##### Goal 2

Calculate the number of violations by car make.

##### Note:
Try to use lazy evaluation as much as possible - it may not always be possible though! That's OK, as long as it's kept to a minimum.

##  Project Solution - Goal 1

First we should look at what's in the file itself. Just a few records should be enough. (You can also "cheat" and look in Excel - but this only works because the file is relatively small.)

In [1]:
file_name = 'nyc_parking_tickets_extract.csv'

In [2]:
with open(file_name) as f:
    for _ in range(10):
        print(next(f))

Summons Number,Plate ID,Registration State,Plate Type,Issue Date,Violation Code,Vehicle Body Type,Vehicle Make,Violation Description

4006478550,VAD7274,VA,PAS,10/5/2016,5,4D,BMW,BUS LANE VIOLATION

4006462396,22834JK,NY,COM,9/30/2016,5,VAN,CHEVR,BUS LANE VIOLATION

4007117810,21791MG,NY,COM,4/10/2017,5,VAN,DODGE,BUS LANE VIOLATION

4006265037,FZX9232,NY,PAS,8/23/2016,5,SUBN,FORD,BUS LANE VIOLATION

4006535600,N203399C,NY,OMT,10/19/2016,5,SUBN,FORD,BUS LANE VIOLATION

4007156700,92163MG,NY,COM,4/13/2017,5,VAN,FRUEH,BUS LANE VIOLATION

4006687989,MIQ600,SC,PAS,11/21/2016,5,VN,HONDA,BUS LANE VIOLATION

4006943052,2AE3984,MD,PAS,2/1/2017,5,SW,LINCO,BUS LANE VIOLATION

4007306795,HLG4926,NY,PAS,5/30/2017,5,SUBN,TOYOT,BUS LANE VIOLATION



So we should notice that we have these `\n` line terminators in the file - we'll need to strip those out.

Secondly, we see that the first row of the file contains the column headers - we'll need to skip that line when we want to look at just the data.

We should also not make the assumption that the data is entirely clean - we probably have missing values and will need to deal with that accordingly.

We will also need to determine an appropriate data type for every column in the data set.

#### Column Definitions and Named Tuple

Let's start with the column definitions, data types and named tuple.

In [3]:
with open(file_name) as f:
    column_headers = next(f).strip('\n').split(',')
    sample_data = next(f).strip('\n').split(',')

In [4]:
column_headers

['Summons Number',
 'Plate ID',
 'Registration State',
 'Plate Type',
 'Issue Date',
 'Violation Code',
 'Vehicle Body Type',
 'Vehicle Make',
 'Violation Description']

In [5]:
sample_data

['4006478550',
 'VAD7274',
 'VA',
 'PAS',
 '10/5/2016',
 '5',
 '4D',
 'BMW',
 'BUS LANE VIOLATION']

In [6]:
list(zip(column_headers, sample_data))

[('Summons Number', '4006478550'),
 ('Plate ID', 'VAD7274'),
 ('Registration State', 'VA'),
 ('Plate Type', 'PAS'),
 ('Issue Date', '10/5/2016'),
 ('Violation Code', '5'),
 ('Vehicle Body Type', '4D'),
 ('Vehicle Make', 'BMW'),
 ('Violation Description', 'BUS LANE VIOLATION')]

Let's start by creating a list that contains the names of the columns:

In [7]:
column_names = [header.replace(' ', '_').lower() 
                for header in column_headers]

In [8]:
column_names

['summons_number',
 'plate_id',
 'registration_state',
 'plate_type',
 'issue_date',
 'violation_code',
 'vehicle_body_type',
 'vehicle_make',
 'violation_description']

Next we need to determine the data types for each of these fields:

    0. summons_number: looks like integers
    1. plate_id: string
    2. registration_state: string
    3. plate_type: string
    4. issue_date: looks like valid dates
    5. violation_code: looks like integers
    6. vehicle_body_type: string
    7. vehicle_make: string
    8. violation_description: string


We'll create utility functions to cast the data (which will always be strings) into the appropriate data type for each field.

We have to be careful though, we may have issues with data integrity and our assumptions about the data type.

What we'll do as a first pass is to keep track of the rows where a value we expected to be an integer or a date was missing or could not be converted.

Let's create our named tuple data structure:

In [9]:
from collections import namedtuple

Ticket = namedtuple('Ticket', column_names)
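As a quick illustration of what this structure gives us, here is a minimal sketch that builds one `Ticket` by hand and reads it back. The column names are typed out literally so the snippet runs on its own, and the values are simply the first data row seen earlier, still as unparsed strings where no conversion has been written yet.

```python
from collections import namedtuple

# Same names as column_names above, repeated here so the sketch is self-contained
column_names = ['summons_number', 'plate_id', 'registration_state',
                'plate_type', 'issue_date', 'violation_code',
                'vehicle_body_type', 'vehicle_make', 'violation_description']

Ticket = namedtuple('Ticket', column_names)

# Values taken from the first data row of the file (date left as a string here)
ticket = Ticket(4006478550, 'VAD7274', 'VA', 'PAS', '10/5/2016',
                5, '4D', 'BMW', 'BUS LANE VIOLATION')

print(ticket.vehicle_make)   # access by field name
print(ticket[7])             # access by position still works
print(ticket._fields[:2])    # the field names are available on the tuple
```

Access by name is what will make the later counting code readable (`data.vehicle_make` instead of `data[7]`).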

#### Reading and Cleaning a data row

In [10]:
with open(file_name) as f:
    next(f)
    raw_data_row = next(f)

In [11]:
raw_data_row

'4006478550,VAD7274,VA,PAS,10/5/2016,5,4D,BMW,BUS LANE VIOLATION\n'

You'll notice that to read the data in the file, we have to skip the first row in the file. Also, I have to use a `with` statement and the file name every time. To make life easier, I'm going to write a small utility function that will yield just the data rows from the file:

In [12]:
def read_data():
    with open(file_name) as f:
        next(f)
        yield from f

We can test it out easily:

In [13]:
raw_data = read_data()
for _ in range(5):
    print(next(raw_data))

4006478550,VAD7274,VA,PAS,10/5/2016,5,4D,BMW,BUS LANE VIOLATION

4006462396,22834JK,NY,COM,9/30/2016,5,VAN,CHEVR,BUS LANE VIOLATION

4007117810,21791MG,NY,COM,4/10/2017,5,VAN,DODGE,BUS LANE VIOLATION

4006265037,FZX9232,NY,PAS,8/23/2016,5,SUBN,FORD,BUS LANE VIOLATION

4006535600,N203399C,NY,OMT,10/19/2016,5,SUBN,FORD,BUS LANE VIOLATION



Let's write a function that will try to convert a value to an integer, or return some default if the value is missing or not an integer:

In [14]:
def parse_int(value, *, default=None):
    try:
        return int(value)
    except ValueError:
        return default

We need to do the same thing with dates.
It looks like the dates are provided in M/D/YYYY format, so we'll use that to parse the date. 

We'll use the `strptime` method of the `datetime` class, available in the `datetime` module.

In [15]:
from datetime import datetime
def parse_date(value, *, default=None):
    date_format='%m/%d/%Y'
    try:
        return datetime.strptime(value, date_format).date()
    except ValueError:
        return default

Let's make sure those functions work as expected:

In [16]:
parse_int('123')

123

In [17]:
parse_int('hello', default='N/A')

'N/A'

In [18]:
parse_date('3/28/2018')

datetime.date(2018, 3, 28)

In [19]:
parse_date('31/31/2000', default='N/A')

'N/A'

OK, so these seem to work as expected.
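A few more edge cases are worth a look. The sketch below repeats the two parsers so it runs on its own; the input values are made up for illustration. It shows that `int` tolerates surrounding whitespace but not a decimal point, and that the `%m` and `%d` directives accept values with or without a leading zero.

```python
from datetime import datetime, date

def parse_int(value, *, default=None):
    try:
        return int(value)
    except ValueError:
        return default

def parse_date(value, *, default=None):
    date_format = '%m/%d/%Y'
    try:
        return datetime.strptime(value, date_format).date()
    except ValueError:
        return default

padded_int = parse_int(' 42 ')          # whitespace around the digits is accepted
decimal_int = parse_int('4.0')          # not an integer literal -> default (None)
unpadded = parse_date('3/8/2018')       # no leading zeros
padded = parse_date('03/08/2018')       # leading zeros
wrong_format = parse_date('2018-03-08') # does not match M/D/YYYY -> default (None)

print(padded_int, decimal_int, unpadded, padded, wrong_format)
```

Note that both functions only catch `ValueError`. That is enough here because the values always come from splitting a string, but passing something like `None` to `parse_int` would raise a `TypeError` instead of returning the default.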

We also need to write a string parser - we want to remove any potential leading and trailing spaces.

In [20]:
def parse_string(value, *, default=None):
    try:
        cleaned = str(value).strip()
        if not cleaned:
            # empty string
            return default
        else:
            return cleaned
    except ValueError:
        return default

Let's test this one as well:

In [21]:
parse_string('   hello   ')

'hello'

In [22]:
parse_string('  ', default='N/A')

'N/A'

Now that we have our utility functions, we can write our row parser.

To make life easier, I'm going to create a tuple that contains the functions that should be called to clean up each field. The tuple positions will correspond to the fields in the data row.

I'm also going to specify what the default value should be when there is a problem parsing the fields. To do this, I will use `partials`, because I still need a callable for each element of the column parser tuple. (Note that I could just as easily use a lambda as well instead of partials).

In [23]:
from functools import partial

In [24]:
column_names

['summons_number',
 'plate_id',
 'registration_state',
 'plate_type',
 'issue_date',
 'violation_code',
 'vehicle_body_type',
 'vehicle_make',
 'violation_description']

In [25]:
column_parsers = (parse_int,  # summons_number, default is None
                  parse_string,  # plate_id, default is None
                  partial(parse_string, default=''),  # state
                  partial(parse_string, default=''),  # plate_type
                  parse_date,  # issue_date, default is None
                  parse_int,  # violation_code
                  partial(parse_string, default=''),  # body type
                  parse_string,  # make, default is None
                  lambda x: parse_string(x, default='')  # description
                 )
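To see that the `partial` and `lambda` forms used above behave the same way, here is a small sketch. It uses a trimmed-down `parse_string` so it can run on its own, and the names `with_partial` and `with_lambda` are just for illustration.

```python
from functools import partial

def parse_string(value, *, default=None):
    # simplified version of the parser above, enough for this comparison
    cleaned = str(value).strip()
    return cleaned if cleaned else default

with_partial = partial(parse_string, default='')
with_lambda = lambda x: parse_string(x, default='')

# Both are callables taking a single argument, with the default already fixed
print(repr(with_partial('   ')), repr(with_lambda('   ')))
print(repr(with_partial(' NY ')), repr(with_lambda(' NY ')))

# A partial keeps a reference to the original function and its frozen keywords
print(with_partial.func is parse_string, with_partial.keywords)
```

One practical difference is that a `partial` object can be inspected (`.func`, `.keywords`), which can help when debugging a tuple of parsers like this one.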

To parse each field in a row, I'll first separate the data fields into a list of values, then I'll apply the functions in `column_parsers` to the data in that list. 

To do that, I'm going to zip up the parser functions and the data, and use a generator expression to apply each function to its corresponding data field:

In [26]:
def parse_row(row):
    fields = row.strip('\n').split(',')
    parsed_data = (func(field) 
                   for func, field in zip(column_parsers, fields))
    return parsed_data

This is not quite what we want yet, but let's test it out and make sure it does what we expect:

In [27]:
rows = read_data()
for _ in range(5):
    row = next(rows)
    parsed_data = parse_row(row)
    print(list(parsed_data))

[4006478550, 'VAD7274', 'VA', 'PAS', datetime.date(2016, 10, 5), 5, '4D', 'BMW', 'BUS LANE VIOLATION']
[4006462396, '22834JK', 'NY', 'COM', datetime.date(2016, 9, 30), 5, 'VAN', 'CHEVR', 'BUS LANE VIOLATION']
[4007117810, '21791MG', 'NY', 'COM', datetime.date(2017, 4, 10), 5, 'VAN', 'DODGE', 'BUS LANE VIOLATION']
[4006265037, 'FZX9232', 'NY', 'PAS', datetime.date(2016, 8, 23), 5, 'SUBN', 'FORD', 'BUS LANE VIOLATION']
[4006535600, 'N203399C', 'NY', 'OMT', datetime.date(2016, 10, 19), 5, 'SUBN', 'FORD', 'BUS LANE VIOLATION']
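Splitting on `,` works for this extract, but it would break if a field ever contained a comma inside quotes. As a hedged alternative (not what this solution uses), the standard library `csv` module handles quoting for us and is also lazy. The sample text below is hypothetical and only there to show the difference.

```python
import csv
import io

# Hypothetical sample: the description contains a comma inside quotes
raw_line = '1,ABC123,"NO PARKING, STREET CLEANING"'
sample = io.StringIO('Summons Number,Plate ID,Violation Description\n'
                     + raw_line + '\n')

reader = csv.reader(sample)   # csv.reader is itself a lazy iterator
headers = next(reader)        # first row: column headers
row = next(reader)            # second row: data, with quoting handled

naive = raw_line.split(',')   # plain split cuts the quoted field in two

print(row)
print(naive)
```

With a real file, the same idea would be `csv.reader(f)` inside the `with open(...)` block, keeping in mind that `open` should then be called with `newline=''` as the `csv` documentation recommends.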


Let's finish up the row parser.

First I want it to return a named tuple instead of a plain iterator.

Also, the way I have set up the parsers, I only want to look at data where none of the fields are `None` - that's why I had some fields default to an empty string instead of `None` - those are the ones I still want to retain, even if they are empty.

To do this efficiently, I'm going to use `all`.

Let's just quickly recall how `all` works:

In [28]:
all([10, 'hello'])

True

In [29]:
all([None, 'hello'])

False

But we have to watch out, since we are allowing empty strings in our valid data, we cannot simply use `all`:

In [30]:
all([10, ''])

False

That's because empty strings are falsy. So, we need to tweak this slightly.

I'll use a generator expression for this:

In [31]:
l = [10, '', 0]
all(item is not None for item in l)

True

In [32]:
l = [10, '', 0, None]
all(item is not None for item in l)

False

So, now let's finish up our row parser. We'll return a Ticket named tuple if none of the parsed fields are `None`, and we'll allow the user to specify a default otherwise.

In [33]:
def parse_row(row, *, default=None):
    fields = row.strip('\n').split(',')
    # note that I'm using a list comprehension here, 
    # since we'll need to iterate through the entire parsed fields
    # twice - one time to check if nothing is None
    # and another time to create the named tuple
    parsed_data = [func(field) 
                   for func, field in zip(column_parsers, fields)]
    if all(item is not None for item in parsed_data):
        print(*parsed_data)  # debugging aid: echo the parsed fields
        return Ticket(*parsed_data)
    else:
        return default

Now let's test it out again:

In [34]:
rows = read_data()
for _ in range(5):
    row = next(rows)
    parsed_data = parse_row(row)
    print(parsed_data)

4006478550 VAD7274 VA PAS 2016-10-05 5 4D BMW BUS LANE VIOLATION
Ticket(summons_number=4006478550, plate_id='VAD7274', registration_state='VA', plate_type='PAS', issue_date=datetime.date(2016, 10, 5), violation_code=5, vehicle_body_type='4D', vehicle_make='BMW', violation_description='BUS LANE VIOLATION')
4006462396 22834JK NY COM 2016-09-30 5 VAN CHEVR BUS LANE VIOLATION
Ticket(summons_number=4006462396, plate_id='22834JK', registration_state='NY', plate_type='COM', issue_date=datetime.date(2016, 9, 30), violation_code=5, vehicle_body_type='VAN', vehicle_make='CHEVR', violation_description='BUS LANE VIOLATION')
4007117810 21791MG NY COM 2017-04-10 5 VAN DODGE BUS LANE VIOLATION
Ticket(summons_number=4007117810, plate_id='21791MG', registration_state='NY', plate_type='COM', issue_date=datetime.date(2017, 4, 10), violation_code=5, vehicle_body_type='VAN', vehicle_make='DODGE', violation_description='BUS LANE VIOLATION')
4006265037 FZX9232 NY PAS 2016-08-23 5 SUBN FORD BUS LANE VIOLATION

#### Checking What Rows are Missing Required Values

Let's quickly run through the file and see what data issues we might have - maybe our assumptions were incorrect about the various data types.

In [35]:
for row in read_data():
    parsed_row = parse_row(row)
    if parsed_row is None:
        print(list(zip(column_names, row.strip('\n').split(','))), end='\n\n')

4006478550 VAD7274 VA PAS 2016-10-05 5 4D BMW BUS LANE VIOLATION
4006462396 22834JK NY COM 2016-09-30 5 VAN CHEVR BUS LANE VIOLATION
4007117810 21791MG NY COM 2017-04-10 5 VAN DODGE BUS LANE VIOLATION
4006265037 FZX9232 NY PAS 2016-08-23 5 SUBN FORD BUS LANE VIOLATION
4006535600 N203399C NY OMT 2016-10-19 5 SUBN FORD BUS LANE VIOLATION
4007156700 92163MG NY COM 2017-04-13 5 VAN FRUEH BUS LANE VIOLATION
4006687989 MIQ600 SC PAS 2016-11-21 5 VN HONDA BUS LANE VIOLATION
4006943052 2AE3984 MD PAS 2017-02-01 5 SW LINCO BUS LANE VIOLATION
4007306795 HLG4926 NY PAS 2017-05-30 5 SUBN TOYOT BUS LANE VIOLATION
4007124590 T715907C NY OMT 2017-04-03 5 SUBN TOYOT BUS LANE VIOLATION
5096061966 HRC9475 NY PAS 2017-04-18 7 SUBN CADIL FAILURE TO STOP AT RED LIGHT
5094070400 DYP8042 NY PAS 2016-10-26 7 SUBN CHEVR FAILURE TO STOP AT RED LIGHT
5094906770 G30ESY NJ PAS 2017-01-01 7 WAGO CHRYS FAILURE TO STOP AT RED LIGHT
5093319363 GGT8868 NY PAS 2016-09-06 7 SUBN CHRYS FAILURE TO STOP AT RED LIGHT
5092638

8464532088 HJR5750 NY PAS 2017-02-03 14 SUBN KIA 14-No Standing
8488825948 HAU1278 NY PAS 2017-03-31 14 SUBN LEXUS 14-No Standing
8559362496 HLW7798 NY PAS 2017-06-13 14 4DSD LEXUS 14-No Standing
8478353860 VUE95C NJ PAS 2016-12-06 14 4DSD LEXUS 14-No Standing
8518911631 ZD53LE NJ PAS 2017-05-17 14 SUBN LEXUS 14-No Standing
1416091415 75213MH NY COM 2016-11-28 14 VAN ME/BE 
8482374059 F96FBF NJ PAS 2017-04-23 14 4DSD ME/BE 14-No Standing
8544960947 HMX8950 NY PAS 2017-05-27 14 4DSD ME/BE 14-No Standing
1418609274 8P82H NY OMT 2016-12-21 14 TAXI NISSA 
1417565895 HEB8184 NY PAS 2017-02-11 14 SDN NISSA 
8155550278 W56GSE NJ PAS 2016-10-22 14 4DSD NISSA 14-No Standing
7433187960 2330830 ME PAS 2017-03-06 14 TRLR NS/OT 14-No Standing
7645052715 FRZ3573 NY PAS 2017-01-07 14 SUBN NS/OT 14-No Standing
7922559173 XCYU94 NJ PAS 2016-10-24 14 VAN NS/OT 14-No Standing
7767415582 XW915N NJ PAS 2017-03-29 14 DELV NS/OT 14-No Standing
8524350581 HGR5953 NY PAS 2017-02-27 14 4DSD OLDSM 14-No Standing

8236569895 EPY9505 NY PAS 2017-03-22 21 4DSD BUICK 21-No Parking (street clean)
8502707383 JGL6885 PA PAS 2017-01-24 21 SUBN BUICK 21-No Parking (street clean)
8511711946 42283JZ NY COM 2017-04-27 21 VAN CHEVR 21-No Parking (street clean)
1418142980 483TFM TN PAS 2016-12-19 21 VAN CHEVR 
1420029915 53468JZ 99 COM 2017-04-27 21 VAN CHEVR 
8539666996 85121ME NY COM 2017-05-22 21 VAN CHEVR 21-No Parking (street clean)
1413709760 DGN6881 NY PAS 2016-09-02 21 SDN CHEVR 
8473311693 DTG6286 NY PAS 2016-11-23 21 SUBN CHEVR 21-No Parking (street clean)
7658191050 GEL4496 NY PAS 2016-07-22 21 SUBN CHEVR 21-No Parking (street clean)
8478626049 GRU5176 NY PAS 2017-05-15 21 4DSD CHEVR 21-No Parking (street clean)
7369979570 GTC5499 NY PAS 2016-07-20 21 4DSD CHEVR 21-No Parking (street clean)
1400876217 GVB9839 NY PAS 2016-09-19 21 P-U CHEVR 
1422284335 HHJ5747 NY PAS 2017-06-03 21 SUBN CHEVR 
8556051480 M322307 NJ PAS 2017-04-26 21 PICK CHEVR 21-No Parking (street clean)
8514570973 T738567C NY OMT 

8566704320 L62FLC NJ PAS 2017-05-17 24 SUBN CHEVR 24-No Parking (exc auth veh)
8536961156 X92FSM NJ PAS 2017-03-28 24 SUBN HYUND 24-No Parking (exc auth veh)
1408122029 DSN6323 NY PAS 2016-07-04 24 SUBN JEEP 
8266568753 F12GRE NJ PAS 2016-11-11 24 SUBN JEEP 24-No Parking (exc auth veh)
8446524806 GYG8911 NY PAS 2016-12-18 24 SUBN JEEP 24-No Parking (exc auth veh)
8316519355 HDJ7785 PA PAS 2016-07-14 24 2DSD NISSA 24-No Parking (exc auth veh)
8533035640 T687600C NY OMT 2017-05-31 24 4DSD NISSA 24-No Parking (exc auth veh)
8526732663 ASV2478 NY PAS 2017-06-22 24 4DSD TOYOT 24-No Parking (exc auth veh)
1410027223 ENS9253 NY PAS 2016-10-31 24 SDN VOLKS 
8479824712 58910MG NY COM 2017-04-21 26 DELV ISUZU 26-No Stnd (for-hire veh only)
1416527722 AHG9422 NY PAS 2017-01-05 27 SUBN TOYOT 
8512270482 Y26CRJ NJ PAS 2017-06-03 31 SUBN BSA 31-No Stand (Com. Mtr. Zone)
8498706890 THANEDAR NY OMT 2017-03-22 31 SUBN CADIL 31-No Stand (Com. Mtr. Zone)
8328546632 HHP7446 NY PAS 2016-12-02 31 2DSD CHEVR

4622727298 GWF1975 NY PAS 2016-08-04 36 4DSD LEXUS PHTO SCHOOL ZN SPEED VIOLATION
4622272659 HEA8485 NY OMS 2016-07-14 36 4DSD LEXUS PHTO SCHOOL ZN SPEED VIOLATION
4631639052 HEK6758 NY PAS 2017-03-09 36 SUBN LEXUS PHTO SCHOOL ZN SPEED VIOLATION
4629635506 HHA6371 NY PAS 2017-01-18 36 SUBN LEXUS PHTO SCHOOL ZN SPEED VIOLATION
4633257055 HJN6853 NY PAS 2017-05-02 36 SUBN LEXUS PHTO SCHOOL ZN SPEED VIOLATION
4629197515 N78FLU NJ PAS 2017-01-09 36 WAGO LEXUS PHTO SCHOOL ZN SPEED VIOLATION
4633226356 PTJ969 NY PAS 2017-05-01 36 SUBN LEXUS PHTO SCHOOL ZN SPEED VIOLATION
4626735174 HGR5863 NY PAS 2016-11-14 36 4DSD LINCO PHTO SCHOOL ZN SPEED VIOLATION
4622247501 T635722C NY OMT 2016-07-13 36 4DSD LINCO PHTO SCHOOL ZN SPEED VIOLATION
4626383063 T701050C NY OMT 2016-11-04 36 4DSD LINCO PHTO SCHOOL ZN SPEED VIOLATION
4631263763 92814 NY MED 2017-03-03 36 4DSD ME/BE PHTO SCHOOL ZN SPEED VIOLATION
4626546109 39999MB NY COM 2016-11-07 36 VAN ME/BE PHTO SCHOOL ZN SPEED VIOLATION
4626838832 EVV1706 

8478953528 JNB5044 PA PAS 2016-12-17 38 SUBN HYUND 38-Failure to Display Muni Rec
8471611235 T13AZX NJ PAS 2016-10-10 38 4DSD HYUND 38-Failure to Display Muni Rec
7015856128 GMU5413 NY PAS 2016-09-06 38 4DSD INFIN 38-Failure to Display Muni Rec
8532505170 HAZ7926 NY PAS 2017-04-15 38 4DSD INFIN 38-Failure to Display Muni Rec
8156045178 HLF3741 NY PAS 2016-12-21 38 4DSD INFIN 38-Failure to Display Muni Rec
8477154788 AP656S NJ PAS 2017-02-07 38 REFG INTER 38-Failure to Display Muni Rec
7992738184 63162MD NY COM 2016-10-24 38 VAN ISUZU 38-Failure to Display Muni Rec
8488913308 GYV3087 NY PAS 2017-03-13 38 SUBN JEEP 38-Failure to Display Muni Rec
8303532224 HAB4875 NY PAS 2017-02-01 38 SUBN JEEP 38-Failure to Display Muni Rec
7172330286 MHM933 SC PAS 2017-01-18 38 SUBN KIA 38-Failure to Display Muni Rec
8279018967 V41GCR NJ PAS 2016-08-13 38 SUBN KIA 38-Failure to Display Muni Rec
8530104183 EMK4874 NY PAS 2017-03-21 38 4DSD LEXUS 38-Failure to Display Muni Rec
8228552543 GEA2261 NY PAS 2

1387522700 71017JM NY COM 2016-08-25 46 VAN CHEVR 
8008621527 HFP7192 NY PAS 2016-10-25 46 VAN CHEVR 46A-Double Parking (Non-COM)
1413687738 25570MC NY COM 2016-09-06 46 DELV DODGE 
8513917291 33147MK NY COM 2017-06-27 46 VAN DODGE 46A-Double Parking (Non-COM)
8505564479 99828MC NY COM 2017-03-07 46 VAN DODGE 46B-Double Parking (Com-100Ft)
8513554716 XWD5801 VA PAS 2017-03-03 46 SUBN DODGE 46A-Double Parking (Non-COM)
1420122800 AB80443 CT PAS 2017-03-13 46 SDN FIAT 
8472012591 48768JZ NY COM 2016-11-09 46 VAN FORD 46B-Double Parking (Com-100Ft)
1401381406 EUM7025 NY PAS 2016-11-14 46 SUBN FORD 
1417599716 12203MG NY COM 2016-12-01 46 VAN FRUEH 
8019072664 12211MG NY COM 2016-11-11 46 VAN FRUEH 46B-Double Parking (Com-100Ft)
8429513220 12246MG NY COM 2016-10-04 46 VAN FRUEH 46B-Double Parking (Com-100Ft)
8470521172 28672MH NY COM 2016-11-21 46 VAN FRUEH 46B-Double Parking (Com-100Ft)
1414783784 30182JF NY COM 2016-11-25 46 VAN FRUEH 
8267055393 76253JY NY COM 2016-10-10 46 DELV FRUEH 4

8397514636 FYW3850 NY PAS 2016-10-22 74 SUBN FORD 74A-Improperly Displayed Plate
8528912840 HNY5206 NY PAS 2017-04-21 74 SUBN GMC 74A-Improperly Displayed Plate
8511501113 FBP6836 NY PAS 2017-02-16 74 SUBN HONDA 74-Missing Display Plate
8276551890 HJR4081 NY PAS 2017-01-07 74 SUBN HONDA 74A-Improperly Displayed Plate
8214550531 HJY2207 NY PAS 2017-03-09 74 4DSD HYUND 74-Missing Display Plate
8487469012 27527ME NY COM 2017-01-27 74 DELV INTER 74-Missing Display Plate
8034780236 HCD3158 NY PAS 2016-09-02 74 SUBN KIA 74A-Improperly Displayed Plate
8545052455 HMK1149 NY PAS 2017-06-07 74 4DSD ME/BE 74-Missing Display Plate
1411263467 BLANKPLATE 99 999 2017-02-13 74 SDN NISSA 
8540001550 HLA4803 NY PAS 2017-03-23 74 4DSD SAAB 74-Missing Display Plate
8556155431 HFB9919 NY PAS 2017-05-26 75 4DSD DODGE 75-No Match-Plate/Reg. Sticker
8394016790 31690BB NY OMR 2016-09-06 77 BUS AM/T 77-Parked Bus (exc desig area)
1419472306 56090BA NY OMR 2017-02-20 77 BUS INTER 
8363508688 97256 MA APP 2016-08

OK, so mostly the data is clean. Looks like we have a few rows without descriptions. 
Technically there's a whole lot more validation and cleaning we should do. For example, it looks like the states are not always proper state abbreviations (such as `99` in some records). But this is good enough for now.
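If we wanted a more compact view of the data problems than printing whole rows, one option is to count how often each column fails to parse. The sketch below is only an illustration: it uses a shortened, hypothetical set of three columns and three made-up rows, with simplified versions of the parsers, so that it can run without the data file.

```python
from collections import Counter

def parse_int(value, *, default=None):
    try:
        return int(value)
    except ValueError:
        return default

def parse_string(value, *, default=None):
    cleaned = str(value).strip()
    return cleaned if cleaned else default

# Hypothetical, shortened column set and sample rows (not from the real file)
column_names = ('summons_number', 'plate_id', 'vehicle_make')
column_parsers = (parse_int, parse_string, parse_string)
sample_rows = ['4006478550,VAD7274,BMW\n',
               '4006462396,22834JK,\n',
               'abc,,FORD\n']

missing = Counter()
for row in sample_rows:
    fields = row.strip('\n').split(',')
    for name, func, field in zip(column_names, column_parsers, fields):
        if func(field) is None:
            missing[name] += 1   # this column could not be parsed for this row

print(dict(missing))
```

The same loop could be pointed at `read_data()` and the full `column_parsers` tuple to get a per-column tally for the real file.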

#### Creating an Iterator for the data

Finally, let's create an iterator to easily iterate over the cleaned up and structured data in the file, skipping `None` rows:

In [36]:
def parsed_data():
    for row in read_data():
        parsed = parse_row(row)
        if parsed:
            yield parsed

Let's test it out by iterating a few times:

In [37]:
parsed_rows = parsed_data()
for _ in range(5):
    print(next(parsed_rows))

4006478550 VAD7274 VA PAS 2016-10-05 5 4D BMW BUS LANE VIOLATION
Ticket(summons_number=4006478550, plate_id='VAD7274', registration_state='VA', plate_type='PAS', issue_date=datetime.date(2016, 10, 5), violation_code=5, vehicle_body_type='4D', vehicle_make='BMW', violation_description='BUS LANE VIOLATION')
4006462396 22834JK NY COM 2016-09-30 5 VAN CHEVR BUS LANE VIOLATION
Ticket(summons_number=4006462396, plate_id='22834JK', registration_state='NY', plate_type='COM', issue_date=datetime.date(2016, 9, 30), violation_code=5, vehicle_body_type='VAN', vehicle_make='CHEVR', violation_description='BUS LANE VIOLATION')
4007117810 21791MG NY COM 2017-04-10 5 VAN DODGE BUS LANE VIOLATION
Ticket(summons_number=4007117810, plate_id='21791MG', registration_state='NY', plate_type='COM', issue_date=datetime.date(2017, 4, 10), violation_code=5, vehicle_body_type='VAN', vehicle_make='DODGE', violation_description='BUS LANE VIOLATION')
4006265037 FZX9232 NY PAS 2016-08-23 5 SUBN FORD BUS LANE VIOLATION

##  Project Solution: Goal 2

Here's what we wrote in Goal 1:

In [1]:
from collections import namedtuple
from datetime import datetime
from functools import partial

file_name = 'nyc_parking_tickets_extract.csv'

with open(file_name) as f:
    column_headers = next(f).strip('\n').split(',')
    
column_names = [header.replace(' ', '_').lower() 
                for header in column_headers]

Ticket = namedtuple('Ticket', column_names)

def read_data():
    with open(file_name) as f:
        next(f)
        yield from f
        
def parse_int(value, *, default=None):
    try:
        return int(value)
    except ValueError:
        return default
    
def parse_date(value, *, default=None):
    date_format='%m/%d/%Y'
    try:
        return datetime.strptime(value, date_format).date()
    except ValueError:
        return default
    
def parse_string(value, *, default=None):
    try:
        cleaned = str(value).strip()
        if not cleaned:
            # empty string
            return default
        else:
            return cleaned
    except ValueError:
        return default
    
column_parsers = (parse_int,  # summons_number, default is None
                  parse_string,  # plate_id, default is None
                  partial(parse_string, default=''),  # state
                  partial(parse_string, default=''),  # plate_type
                  parse_date,  # issue_date, default is None
                  parse_int,  # violation_code
                  partial(parse_string, default=''),  # body type
                  parse_string,  # make, default is None
                  lambda x: parse_string(x, default='')  # description
                 )

def parse_row(row, *, default=None):
    fields = row.strip('\n').split(',')
    # note that I'm using a list comprehension here, 
    # since we'll need to iterate through the entire parsed fields
    # twice - one time to check if nothing is None
    # and another time to create the named tuple
    parsed_data = [func(field) 
                   for func, field in zip(column_parsers, fields)]
    if all(item is not None for item in parsed_data):
        return Ticket(*parsed_data)
    else:
        return default
    
def parsed_data():
    for row in read_data():
        parsed = parse_row(row)
        if parsed:
            yield parsed

#### Goal 2: Calculating Number of Violations by Car Make

What we want to do here is iterate through the file and keep a counter, for each make, of how many rows for that make were encountered.

A good approach is to use a dictionary to keep track of the makes (as keys), and the value can be a counter that is initialized to 1 if the key (make) does not already exist, or incremented by 1 if it does.

We could do this using regular dictionaries first:

In [2]:
makes_counts = {}

for data in parsed_data():
    if data.vehicle_make in makes_counts:
        makes_counts[data.vehicle_make] += 1
    else:
        makes_counts[data.vehicle_make] = 1
        
for make, cnt in sorted(makes_counts.items(), 
                        key=lambda t: t[1], 
                        reverse=True):
    print(make, cnt)
    

TOYOT 112
HONDA 106
FORD 104
CHEVR 76
NISSA 70
DODGE 45
FRUEH 44
ME/BE 38
GMC 35
HYUND 35
BMW 34
LEXUS 26
INTER 25
JEEP 22
NS/OT 18
SUBAR 18
INFIN 13
LINCO 12
CHRYS 12
ACURA 12
AUDI 12
VOLVO 12
MITSU 11
ISUZU 10
CADIL 9
KIA 8
VOLKS 8
HIN 6
KENWO 5
ROVER 5
BUICK 5
MAZDA 5
MERCU 4
JAGUA 3
SMART 3
PORSC 3
WORKH 2
SATUR 2
SCION 2
SAAB 2
HINO 2
FIR 1
OLDSM 1
PETER 1
CITRO 1
GEO 1
YAMAH 1
BSA 1
MINI 1
PONTI 1
SPRI 1
PLYMO 1
UPS 1
FIAT 1
UD 1
UTILI 1
GMCQ 1
STAR 1
AM/T 1
MI/F 1


We can also make use of a special type of dictionary called a `defaultdict`. If you try to retrieve a non-existent key from a `defaultdict`, it will create that key with a **default** value and return it. To do that, it needs a callable (a factory) that it can call with no arguments to produce the default - a type such as `str` or `int` works well - so we should provide one.

Let's take a look:

In [3]:
from collections import defaultdict

In [4]:
d = defaultdict(str)

In [5]:
d['a'] = 100

In [6]:
d['a']

100

In [7]:
d['b']

''

As you can see it returned an empty string.
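Two details are easy to miss here, so here is a short sketch of them. The factory can be any callable that takes no arguments (a `lambda` is used below purely as an example), and looking up a missing key with `[]` actually adds that key to the dictionary, whereas `get` does not.

```python
from collections import defaultdict

d = defaultdict(lambda: 'N/A')   # any zero-argument callable works as a factory

value = d['missing']             # key does not exist: factory is called
print(value)                     # the default produced by the factory
print('missing' in d)            # the lookup inserted the key

other = d.get('other')           # get() bypasses the factory
print(other)
print('other' in d)              # and does not insert anything
```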

In our case, we want to use it to count, so we can make our default be integers:

In [8]:
d = defaultdict(int)

In [9]:
d['a'] = 1

In [10]:
d['b']

0

So, if we want to either set a key's value to `1` if it does not already exist, or increment it by `1` if it does, it's quite simple. In both cases, we just need to retrieve the key's value, and increment by 1:

In [11]:
d = defaultdict(int)

In [12]:
d['make1'] += 1

In [13]:
d['make1']

1

In [14]:
d['make1'] += 1

In [15]:
d['make1']

2

So, we could simplify our counter algorithm using a default dict:

In [16]:
makes_counts = defaultdict(int)

for data in parsed_data():
    makes_counts[data.vehicle_make] += 1
    
for make, cnt in sorted(makes_counts.items(), 
                        key=lambda t: t[1], 
                        reverse=True):
    print(make, cnt)
    

TOYOT 112
HONDA 106
FORD 104
CHEVR 76
NISSA 70
DODGE 45
FRUEH 44
ME/BE 38
GMC 35
HYUND 35
BMW 34
LEXUS 26
INTER 25
JEEP 22
NS/OT 18
SUBAR 18
INFIN 13
LINCO 12
CHRYS 12
ACURA 12
AUDI 12
VOLVO 12
MITSU 11
ISUZU 10
CADIL 9
KIA 8
VOLKS 8
HIN 6
KENWO 5
ROVER 5
BUICK 5
MAZDA 5
MERCU 4
JAGUA 3
SMART 3
PORSC 3
WORKH 2
SATUR 2
SCION 2
SAAB 2
HINO 2
FIR 1
OLDSM 1
PETER 1
CITRO 1
GEO 1
YAMAH 1
BSA 1
MINI 1
PONTI 1
SPRI 1
PLYMO 1
UPS 1
FIAT 1
UD 1
UTILI 1
GMCQ 1
STAR 1
AM/T 1
MI/F 1


To wrap up this goal, let's make a function that will return that dictionary, and while we're at it we'll sort the dictionary keys based on a descending count. (Remember that dictionaries maintain their insertion order - this is guaranteed by the language starting in Python 3.7, and was already the case in CPython 3.6 - so we don't need to use an `OrderedDict`.)

In [17]:
def violation_counts_by_make():
    makes_counts = defaultdict(int)
    for data in parsed_data():
        makes_counts[data.vehicle_make] += 1
        
    return {make: cnt 
            for make, cnt in sorted(makes_counts.items(), 
                                    key=lambda t: t[1], 
                                    reverse=True)
           }

In [18]:
print(violation_counts_by_make())

{'TOYOT': 112, 'HONDA': 106, 'FORD': 104, 'CHEVR': 76, 'NISSA': 70, 'DODGE': 45, 'FRUEH': 44, 'ME/BE': 38, 'GMC': 35, 'HYUND': 35, 'BMW': 34, 'LEXUS': 26, 'INTER': 25, 'JEEP': 22, 'NS/OT': 18, 'SUBAR': 18, 'INFIN': 13, 'LINCO': 12, 'CHRYS': 12, 'ACURA': 12, 'AUDI': 12, 'VOLVO': 12, 'MITSU': 11, 'ISUZU': 10, 'CADIL': 9, 'KIA': 8, 'VOLKS': 8, 'HIN': 6, 'KENWO': 5, 'ROVER': 5, 'BUICK': 5, 'MAZDA': 5, 'MERCU': 4, 'JAGUA': 3, 'SMART': 3, 'PORSC': 3, 'WORKH': 2, 'SATUR': 2, 'SCION': 2, 'SAAB': 2, 'HINO': 2, 'FIR': 1, 'OLDSM': 1, 'PETER': 1, 'CITRO': 1, 'GEO': 1, 'YAMAH': 1, 'BSA': 1, 'MINI': 1, 'PONTI': 1, 'SPRI': 1, 'PLYMO': 1, 'UPS': 1, 'FIAT': 1, 'UD': 1, 'UTILI': 1, 'GMCQ': 1, 'STAR': 1, 'AM/T': 1, 'MI/F': 1}
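For completeness, the standard library also has `collections.Counter`, which does this kind of tallying directly. This is an alternative to the approach above rather than part of the solution. The list of makes below is a small hypothetical sample so that the snippet runs without the data file.

```python
from collections import Counter

# Hypothetical sample of makes; with the real data we could instead pass
# a generator expression such as (data.vehicle_make for data in parsed_data())
makes = ['TOYOT', 'HONDA', 'FORD', 'TOYOT', 'HONDA', 'TOYOT']

counts = Counter(makes)           # consumes any iterable, one item at a time
ranked = counts.most_common()     # list of (make, count), highest count first

print(ranked)
```

Since `Counter` accepts any iterable, feeding it a generator keeps the reading and parsing lazy; only the counts themselves are held in memory.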


# Section 08 - Iteration Tools

##  Slicing Iterables

We know that sequence types can be sliced:

In [1]:
l = [1, 2, 3, 4, 5]

In [2]:
l[0:2]

[1, 2]

Equivalently we can use the `slice` object:

In [3]:
s = slice(0, 2)

In [4]:
l[s]

[1, 2]

But this does not work with iterables that are not also sequence types:

In [5]:
import math

def factorials(n):
    for i in range(n):
        yield math.factorial(i)

In [6]:
facts = factorials(100)

In [7]:
facts[0:2]

TypeError: 'generator' object is not subscriptable

But we could write a function to mimic this. Let's try a simplistic approach that will only work for a consecutive slice:

In [8]:
def slice_(iterable, start, stop):
    for _ in range(0, start):
        next(iterable)
        
    for _ in range(start, stop):
        yield next(iterable)

In [9]:
list(slice_(factorials(100), 1, 5))

[1, 2, 6, 24]

This is quite simple, however we don't support a `step` value.
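Just to show the idea, here is one way a `step` could be bolted onto this simplistic approach. The name `slice_step` is made up for this sketch, it only handles non-negative `start` and `stop` and a positive `step`, and it is not how the standard library does it.

```python
import math

def factorials(n):
    for i in range(n):
        yield math.factorial(i)

def slice_step(iterable, start, stop, step=1):
    iterator = iter(iterable)   # also lets us pass plain iterables like lists
    # zip stops as soon as range(stop) is exhausted, so nothing past
    # index stop - 1 is requested from the iterator
    for index, item in zip(range(stop), iterator):
        if index >= start and (index - start) % step == 0:
            yield item

consecutive = list(slice_step(factorials(100), 1, 5))
stepped = list(slice_step(factorials(10), 0, 10, 2))

print(consecutive)
print(stepped)
```

Like the version above, every skipped element still has to be requested from the underlying iterator; the function just doesn't yield it.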

The `itertools` module has a function, `islice` which implements this for us:

In [10]:
list(factorials(10))

[1, 1, 2, 6, 24, 120, 720, 5040, 40320, 362880]

Now let's use the `islice` function to obtain the first 3 elements:

In [11]:
from itertools import islice

In [12]:
islice(factorials(10), 0, 3)

<itertools.islice at 0x28b2120fe08>

`islice` is itself a lazy iterator, so we can iterate through it:

In [13]:
list(islice(factorials(10), 0, 3))

[1, 1, 2]

We can even use a step value:

In [14]:
list(islice(factorials(10), 0, 10, 2))

[1, 2, 24, 720, 40320]

`islice` does not support negative values for the start, stop or step, but it does support `None` for all three arguments. The defaults, as expected, are then the first element, the last element, and a step of 1:

In [15]:
list(islice(factorials(10), None, None, 2))

[1, 2, 24, 720, 40320]

This function can be very useful when dealing with infinite iterators for example.

In [16]:
def factorials():
    index = 0
    while True:
        yield math.factorial(index)
        index += 1

Let's say we want to see the first 5 elements. We could do it the way we have up to now:

In [17]:
facts = factorials()
for _ in range(5):
    print(next(facts))

1
1
2
6
24


Or we could use `islice` as follows:

In [18]:
list(islice(factorials(), 5))

[1, 1, 2, 6, 24]

One thing to note is that `islice` is a lazy iterator, but when we use a `step` value, there is no magic: Python still has to call `next` on our iterable for every element - it just doesn't yield all of them back to us.

To see this, we'll add a print statement to our generator function:

In [19]:
def factorials():
    index = 0
    while True:
        print(f'yielding factorial({index})...')
        yield math.factorial(index)
        index += 1

In [20]:
list(islice(factorials(), 9))

yielding factorial(0)...
yielding factorial(1)...
yielding factorial(2)...
yielding factorial(3)...
yielding factorial(4)...
yielding factorial(5)...
yielding factorial(6)...
yielding factorial(7)...
yielding factorial(8)...


[1, 1, 2, 6, 24, 120, 720, 5040, 40320]

In [21]:
list(islice(factorials(), None, 10, 2))

yielding factorial(0)...
yielding factorial(1)...
yielding factorial(2)...
yielding factorial(3)...
yielding factorial(4)...
yielding factorial(5)...
yielding factorial(6)...
yielding factorial(7)...
yielding factorial(8)...
yielding factorial(9)...


[1, 2, 24, 720, 40320]

As you can see, even though 5 elements were yielded from `islice`, it still had to call our generator 10 times!

The same thing happens if we skip elements at the start of the slice, since `islice` still has to call `next` for the skipped elements:

In [22]:
list(islice(factorials(), 5, 10))

yielding factorial(0)...
yielding factorial(1)...
yielding factorial(2)...
yielding factorial(3)...
yielding factorial(4)...
yielding factorial(5)...
yielding factorial(6)...
yielding factorial(7)...
yielding factorial(8)...
yielding factorial(9)...


[120, 720, 5040, 40320, 362880]

The other thing to watch out for is that `islice` returns an **iterator** - which means it becomes exhausted, **even if you pass an iterable such as a list to it**!

In [23]:
l = [1, 2, 3, 4, 5]

In [24]:
s = islice(l, 0, 3)

In [25]:
list(s)

[1, 2, 3]

In [26]:
list(s)

[]

So watch out!

Furthermore, keep in mind that `islice` iterates over our iterable in order to yield the appropriate values. This means that if we use an iterator, that iterator will get consumed, and possibly exhausted:

In [30]:
facts = factorials()

In [31]:
next(facts), next(facts), next(facts), next(facts)

yielding factorial(0)...
yielding factorial(1)...
yielding factorial(2)...
yielding factorial(3)...


(1, 1, 2, 6)

If we now start slicing `facts` with `islice`, remember that the first four values of `facts` have already been consumed!

In [32]:
list(islice(facts, 0, 3))

yielding factorial(4)...
yielding factorial(5)...
yielding factorial(6)...


[24, 120, 720]

And of course, `islice` further consumed our iterator:

In [33]:
next(facts)

yielding factorial(7)...


5040

So, just something to keep in mind when we pass iterators to `islice`, and more generally to any of the functions in `itertools`.
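The fact that `islice` consumes the underlying iterator can also be put to good use. The sketch below uses a hypothetical helper named `chunks` (not part of `itertools`) that calls `islice` repeatedly on the same iterator, so that each call picks up where the previous one stopped:

```python
from itertools import islice

def chunks(iterable, size):
    # hypothetical helper: split any iterable into lists of at most `size` items
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))  # consumes the next `size` items of `it`
        if not chunk:
            return
        yield chunk

result = list(chunks(range(7), 3))
print(result)  # [[0, 1, 2], [3, 4, 5], [6]]
```

This only works because every `islice` call shares the single iterator `it`; passing the original list each time would return the first chunk over and over.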

##  Selecting and Filtering Iterators

#### *filter*  and *filterfalse*

You should already be aware of the Python built-in function `filter`.

Remember that the `filter` function can work with any iterable, including of course iterators and generators.

Let's see a quick example:

In [1]:
def gen_cubes(n):
    for i in range(n):
        print(f'yielding {i}')
        yield i**3

Now let's say we only want to use cubes that are odd.

We need a function that returns `True` if the number is odd, `False` otherwise. (This is technically called a **predicate**, by the way: any function that, given an input, returns `True` or `False`.)

In [2]:
def is_odd(x):
    return x % 2 == 1

Let's make sure the function works as expected:

In [3]:
is_odd(4), is_odd(81)

(False, True)

Now we can use that function (or we could have just used a lambda as well) with the `filter` function.

Note that the `filter` function is also lazy.

In [4]:
filtered = filter(is_odd, gen_cubes(10))

Notice that the `gen_cubes(10)` generator was not actually used (no print output).

We can however iterate through it:

In [5]:
list(filtered)

yielding 0
yielding 1
yielding 2
yielding 3
yielding 4
yielding 5
yielding 6
yielding 7
yielding 8
yielding 9


[1, 27, 125, 343, 729]

As we can see `filtered` will drop any values where the predicate is False.

We could easily reverse this to return not-odd (i.e. even) values:

In [6]:
def is_even(x):
    return x % 2 == 0

In [7]:
list(filter(is_even, gen_cubes(10)))

yielding 0
yielding 1
yielding 2
yielding 3
yielding 4
yielding 5
yielding 6
yielding 7
yielding 8
yielding 9


[0, 8, 64, 216, 512]

But we had to create a new function - instead we could use the `filterfalse` function in the `itertools` module that does the same work as `filter` but retains values where the predicate is False (instead of True as the `filter` function does).

The `filterfalse` function also uses lazy evaluation.

In [8]:
from itertools import filterfalse

In [9]:
evens = filterfalse(is_odd, gen_cubes(10))

No print output --> lazy evaluation

In [10]:
list(evens)

yielding 0
yielding 1
yielding 2
yielding 3
yielding 4
yielding 5
yielding 6
yielding 7
yielding 8
yielding 9


[0, 8, 64, 216, 512]


This way we can filter using the same predicate, depending on whether the result is `True` (using `filter`) or `False` (using `filterfalse`).
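As a compact illustration of that point, the sketch below splits one list into two groups using a single predicate. The list `cubes` is just sample data for this example; the last line relies on the documented behavior that passing `None` as the predicate makes `filterfalse` keep the falsy items.

```python
from itertools import filterfalse

def is_odd(x):
    return x % 2 == 1

cubes = [i**3 for i in range(10)]

odds = list(filter(is_odd, cubes))        # predicate is True
evens = list(filterfalse(is_odd, cubes))  # predicate is False
print(odds)   # [1, 27, 125, 343, 729]
print(evens)  # [0, 8, 64, 216, 512]

# With None as the predicate, filterfalse keeps the falsy items
falsy = list(filterfalse(None, [0, 1, '', 'a', None]))
print(falsy)  # [0, '', None]
```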

#### *dropwhile* and *takewhile*

The `takewhile` function in the `itertools` module will yield elements from an iterable as long as a specific criterion (the predicate) is `True`.

As soon as the predicate is `False`, iteration stops, even if subsequent elements would have satisfied the predicate. This is not a filter: it simply iterates over an iterable for as long as the predicate remains `True`.

As we might expect, this function also uses lazy evaluation.

In [11]:
from math import sin, pi

def sine_wave(n):
    start = 0
    max_ = 2 * pi
    step = (max_ - start) / (n-1)
    for _ in range(n):
        yield round(sin(start), 2)
        start += step    

In [12]:
list(sine_wave(15))

[0.0,
 0.43,
 0.78,
 0.97,
 0.97,
 0.78,
 0.43,
 0.0,
 -0.43,
 -0.78,
 -0.97,
 -0.97,
 -0.78,
 -0.43,
 -0.0]

In [13]:
from itertools import takewhile

list(takewhile(lambda x: 0 <= x <= 0.9, sine_wave(15)))

[0.0, 0.43, 0.78]

As you can see iteration stopped at `0.78`, even though we had values later that would have had a `True` predicate. This is different from the `filter` function:

In [14]:
list(filter(lambda x: 0 <= x <= 0.9, sine_wave(15)))

[0.0, 0.43, 0.78, 0.78, 0.43, 0.0, -0.0]

The `dropwhile` function on the other hand starts the iteration once the predicate becomes `False`:

In [15]:
from itertools import dropwhile

In [16]:
l = [1, 3, 5, 2, 1]

In [17]:
list(dropwhile(lambda x: x < 5, l))

[5, 2, 1]

As you can see the iterable skipped `1` and `3` and started the iteration once the predicate was `False`. Once the iteration begins, it no longer checks the predicate, and so we ended up with `5` and `2` and `1` in the iteration.
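The two functions are complementary, which the following sketch tries to show on the same data. The predicate name `below_5` is made up for this example. The second half illustrates a subtlety worth knowing: when the source is an iterator, `takewhile` has to consume the first failing element in order to test it.

```python
from itertools import dropwhile, takewhile

data = [1, 3, 5, 2, 1]

def below_5(x):
    return x < 5

head = list(takewhile(below_5, data))  # leading run where the predicate holds
tail = list(dropwhile(below_5, data))  # everything from the first failure onward
print(head, tail)  # [1, 3] [5, 2, 1]

# With an iterator, the first failing element (5) is consumed by takewhile
# and is therefore not available to later consumers of the iterator.
it = iter(data)
taken = list(takewhile(below_5, it))
remaining = list(it)
print(taken, remaining)  # [1, 3] [2, 1]
```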

#### The *compress* function

The `compress` function is essentially a filter that takes two iterables as parameters.
The first argument is the iterable (data) that will be filtered, and the second iterable contains elements (selectors), possibly of a different length than the data (iteration stops as soon as either one is exhausted). As always in Python, any object has an associated truth value, and the selectors therefore each have a truth value as well.

The resulting iterator yields elements from the data iterable where the selector at the same "position" is truthy.

A simple analogous way to look at it would be as follows using the `zip` function:


In [18]:
data = ['a', 'b', 'c', 'd', 'e']
selectors = [True, False, 1, 0]

In [19]:
list(zip(data, selectors))

[('a', True), ('b', False), ('c', 1), ('d', 0)]

And only retain the elements where the second value in the tuple is truthy:

In [20]:
[item for item, truth_value in zip(data, selectors) if truth_value]

['a', 'c']

The `compress` function works the same way, except that it is evaluated lazily and returns an iterator:

In [21]:
from itertools import compress

In [22]:
list(compress(data, selectors))

['a', 'c']
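The selectors do not have to be a literal list; they can be computed lazily from other data. In the sketch below, `names` and `scores` are invented sample data, and the selectors come from a generator expression:

```python
from itertools import compress

names = ['alice', 'bob', 'carol', 'dave']
scores = [82, 45, 91, 60]

# keep each name whose score at the same position is at least 60
passed = list(compress(names, (score >= 60 for score in scores)))
print(passed)  # ['alice', 'carol', 'dave']
```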

##  Infinite Iterators

There are three functions in the `itertools` module that produce infinite iterators: `count`, `cycle` and `repeat`.

In [1]:
from itertools import (
    count,
    cycle,
    repeat, 
    islice)

#### count

The `count` function is similar to range, except it does not have a `stop` value. It has both a `start` and a `step`:

In [2]:
g = count(10)

In [3]:
list(islice(g, 5))

[10, 11, 12, 13, 14]

In [4]:
g = count(10, step=2)

In [5]:
list(islice(g, 5))

[10, 12, 14, 16, 18]

And so on. 

Unlike the `range` function, whose arguments must always be integers, `count` works with floats as well:

In [6]:
g = count(10.5, 0.5)

In [7]:
list(islice(g, 5))

[10.5, 11.0, 11.5, 12.0, 12.5]

In fact, we can even use other data types as well:

In [8]:
g = count(1+1j, 1+2j)

In [9]:
list(islice(g, 5))

[(1+1j), (2+3j), (3+5j), (4+7j), (5+9j)]

We can even use Decimal numbers:

In [10]:
from decimal import Decimal

In [11]:
g = count(Decimal('0.0'), Decimal('0.1'))

In [12]:
list(islice(g, 5))

[Decimal('0.0'),
 Decimal('0.1'),
 Decimal('0.2'),
 Decimal('0.3'),
 Decimal('0.4')]
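Two small usage sketches for `count` follow. The first pairs `count` with `zip` to number items, much like `enumerate` with a start value. The second uses a multiplicative form, a pattern the `itertools` documentation suggests for floats because it avoids accumulating rounding error from repeated addition. The variable names are only illustrative.

```python
from itertools import count, islice

letters = ['a', 'b', 'c']

# zip stops when `letters` runs out, so the infinite counter is safe here
numbered = list(zip(count(1), letters))
print(numbered)  # [(1, 'a'), (2, 'b'), (3, 'c')]

# start + step * i, computed from an integer counter
halves = list(islice((10.5 + 0.5 * i for i in count()), 5))
print(halves)  # [10.5, 11.0, 11.5, 12.0, 12.5]
```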

#### Cycle

`cycle` is used to repeatedly loop over an iterable:

In [13]:
g = cycle(('red', 'green', 'blue'))

In [14]:
list(islice(g, 8))

['red', 'green', 'blue', 'red', 'green', 'blue', 'red', 'green']

One thing to note is that this works **even** if the argument is an iterator (i.e. gets exhausted after the first complete iteration over it)!

Let's see a simple example of this:

In [15]:
def colors():
    yield 'red'
    yield 'green'
    yield 'blue'

In [16]:
cols = colors()

In [17]:
list(cols)

['red', 'green', 'blue']

In [18]:
list(cols)

[]

As expected, `cols` was exhausted after the first iteration.

Now let's see how `cycle` behaves:

In [19]:
cols = colors()
g = cycle(cols)

In [20]:
list(islice(g, 10))

['red', 'green', 'blue', 'red', 'green', 'blue', 'red', 'green', 'blue', 'red']

As you can see, `cycle` iterated over the elements of the iterator, and continued the iteration even though the first run through the iterator technically exhausted it.
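One way to understand this behavior is that `cycle` keeps its own copy of every element it sees during the first pass. The function `cycle_` below is a rough pure-Python sketch of that idea (similar to the equivalent shown in the `itertools` documentation), not the actual implementation:

```python
from itertools import islice

def cycle_(iterable):
    # sketch: remember each element during the first pass...
    saved = []
    for element in iterable:
        yield element
        saved.append(element)
    # ...then replay the saved elements forever
    while saved:
        for element in saved:
            yield element

def colors():
    yield 'red'
    yield 'green'
    yield 'blue'

first_seven = list(islice(cycle_(colors()), 7))
print(first_seven)
# ['red', 'green', 'blue', 'red', 'green', 'blue', 'red']
```

A consequence of this design is that cycling over a very long iterable may require a significant amount of memory.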

##### Example

A simple application of `cycle` is dealing a deck of cards into separate hands:

In [21]:
from collections import namedtuple

In [22]:
Card = namedtuple('Card', 'rank suit')

In [23]:
def card_deck():
    ranks = tuple(str(num) for num in range(2, 11)) + tuple('JQKA')
    suits = ('Spades', 'Hearts', 'Diamonds', 'Clubs')
    for suit in suits:
        for rank in ranks:
            yield Card(rank, suit)

Assume we want 4 hands, so we can think of the hands as a list containing 4 elements - each of which is itself a list containing cards.

The indices of the hands would be `0, 1, 2, 3` in the hands list.

We could certainly do it this way:

In [24]:
hands = [list() for _ in range(4)]

In [25]:
hands

[[], [], [], []]

In [26]:
index = 0
for card in card_deck():
    index = index % 4
    hands[index].append(card)
    index += 1

In [27]:
hands

[[Card(rank='2', suit='Spades'),
  Card(rank='6', suit='Spades'),
  Card(rank='10', suit='Spades'),
  Card(rank='A', suit='Spades'),
  Card(rank='5', suit='Hearts'),
  Card(rank='9', suit='Hearts'),
  Card(rank='K', suit='Hearts'),
  Card(rank='4', suit='Diamonds'),
  Card(rank='8', suit='Diamonds'),
  Card(rank='Q', suit='Diamonds'),
  Card(rank='3', suit='Clubs'),
  Card(rank='7', suit='Clubs'),
  Card(rank='J', suit='Clubs')],
 [Card(rank='3', suit='Spades'),
  Card(rank='7', suit='Spades'),
  Card(rank='J', suit='Spades'),
  Card(rank='2', suit='Hearts'),
  Card(rank='6', suit='Hearts'),
  Card(rank='10', suit='Hearts'),
  Card(rank='A', suit='Hearts'),
  Card(rank='5', suit='Diamonds'),
  Card(rank='9', suit='Diamonds'),
  Card(rank='K', suit='Diamonds'),
  Card(rank='4', suit='Clubs'),
  Card(rank='8', suit='Clubs'),
  Card(rank='Q', suit='Clubs')],
 [Card(rank='4', suit='Spades'),
  Card(rank='8', suit='Spades'),
  Card(rank='Q', suit='Spades'),
  Card(rank='3', suit='Hearts'),


Notice how we had to use the modulo operator (`%`) and an `index` to **cycle** through the hands.

So, we can use the `cycle` function instead:

In [28]:
hands = [list() for _ in range(4)]

In [29]:
index_cycle = cycle([0, 1, 2, 3])
for card in card_deck():
    hands[next(index_cycle)].append(card)

In [30]:
hands

[[Card(rank='2', suit='Spades'),
  Card(rank='6', suit='Spades'),
  Card(rank='10', suit='Spades'),
  Card(rank='A', suit='Spades'),
  Card(rank='5', suit='Hearts'),
  Card(rank='9', suit='Hearts'),
  Card(rank='K', suit='Hearts'),
  Card(rank='4', suit='Diamonds'),
  Card(rank='8', suit='Diamonds'),
  Card(rank='Q', suit='Diamonds'),
  Card(rank='3', suit='Clubs'),
  Card(rank='7', suit='Clubs'),
  Card(rank='J', suit='Clubs')],
 [Card(rank='3', suit='Spades'),
  Card(rank='7', suit='Spades'),
  Card(rank='J', suit='Spades'),
  Card(rank='2', suit='Hearts'),
  Card(rank='6', suit='Hearts'),
  Card(rank='10', suit='Hearts'),
  Card(rank='A', suit='Hearts'),
  Card(rank='5', suit='Diamonds'),
  Card(rank='9', suit='Diamonds'),
  Card(rank='K', suit='Diamonds'),
  Card(rank='4', suit='Clubs'),
  Card(rank='8', suit='Clubs'),
  Card(rank='Q', suit='Clubs')],
 [Card(rank='4', suit='Spades'),
  Card(rank='8', suit='Spades'),
  Card(rank='Q', suit='Spades'),
  Card(rank='3', suit='Hearts'),


But we really can simplify this even further - why are we cycling through the indices? Why not simply cycle through the hands themselves, and append the card to each hand?

In [31]:
hands = [list() for _ in range(4)]

In [32]:
hands_cycle = cycle(hands)
for card in card_deck():
    next(hands_cycle).append(card)

In [33]:
hands

[[Card(rank='2', suit='Spades'),
  Card(rank='6', suit='Spades'),
  Card(rank='10', suit='Spades'),
  Card(rank='A', suit='Spades'),
  Card(rank='5', suit='Hearts'),
  Card(rank='9', suit='Hearts'),
  Card(rank='K', suit='Hearts'),
  Card(rank='4', suit='Diamonds'),
  Card(rank='8', suit='Diamonds'),
  Card(rank='Q', suit='Diamonds'),
  Card(rank='3', suit='Clubs'),
  Card(rank='7', suit='Clubs'),
  Card(rank='J', suit='Clubs')],
 [Card(rank='3', suit='Spades'),
  Card(rank='7', suit='Spades'),
  Card(rank='J', suit='Spades'),
  Card(rank='2', suit='Hearts'),
  Card(rank='6', suit='Hearts'),
  Card(rank='10', suit='Hearts'),
  Card(rank='A', suit='Hearts'),
  Card(rank='5', suit='Diamonds'),
  Card(rank='9', suit='Diamonds'),
  Card(rank='K', suit='Diamonds'),
  Card(rank='4', suit='Clubs'),
  Card(rank='8', suit='Clubs'),
  Card(rank='Q', suit='Clubs')],
 [Card(rank='4', suit='Spades'),
  Card(rank='8', suit='Spades'),
  Card(rank='Q', suit='Spades'),
  Card(rank='3', suit='Hearts'),


#### Repeat

The `repeat` function is used to create an iterator that just returns the same value again and again. By default it is infinite, but a count can be specified optionally:

In [34]:
g = repeat('Python')
for _ in range(5):
    print(next(g))

Python
Python
Python
Python
Python


We can also optionally specify a count to make the iterator finite:

In [35]:
g = repeat('Python', 4)

In [36]:
list(g)

['Python', 'Python', 'Python', 'Python']

The important thing to note as well, is that the "value" that is returned is the **same** object every time!

Let's see this:

In [37]:
l = [1, 2, 3]

In [38]:
result = list(repeat(l, 3))

In [39]:
result

[[1, 2, 3], [1, 2, 3], [1, 2, 3]]

In [40]:
l is result[0], l is result[1], l is result[2]

(True, True, True)

So be careful here. If you try to use repeat to create three separate instances of a list, you'll actually end up with shared references:

In [41]:
result[0], result[1], result[2]

([1, 2, 3], [1, 2, 3], [1, 2, 3])

In [42]:
result[0][0] = 100

In [43]:
result[0], result[1], result[2]

([100, 2, 3], [100, 2, 3], [100, 2, 3])

If you want to end up with three separate copies of your argument, then you'll need to use a copy mechanism (either shallow or deep depending on your needs).

This is easily done using a comprehension expression:

In [44]:
l = [1, 2, 3]
result = [item[:] for item in repeat(l, 3)]

In [45]:
result

[[1, 2, 3], [1, 2, 3], [1, 2, 3]]

In [46]:
l is result[0], l is result[1], l is result[2]

(False, False, False)

In [47]:
result[0][0] = 100

In [48]:
result

[[100, 2, 3], [1, 2, 3], [1, 2, 3]]
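A typical use of `repeat` is to supply a constant argument alongside another iterable, for instance with the built-in `map`. This is a minimal sketch; because `map` stops at the shortest iterable, the infinite `repeat(2)` is not a problem here.

```python
from itertools import repeat

# pow(0, 2), pow(1, 2), pow(2, 2), ...
squares = list(map(pow, range(5), repeat(2)))
print(squares)  # [0, 1, 4, 9, 16]
```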

##  Chaining and Teeing Iterators

Often we need to chain iterators/iterables together to behave like a single iterable.

We can think of this as analogous to sequence concatenation.

For example, suppose we have some generators producing squares:

In [1]:
l1 = (i**2 for i in range(4))
l2 = (i**2 for i in range(4, 8))
l3 = (i**2 for i in range(8, 12))

And we want to essentially iterate through all the values as if they were a single iterator.

We could do it this way:

In [2]:
for gen in (l1, l2, l3):
    for item in gen:
        print(item)

0
1
4
9
16
25
36
49
64
81
100
121


In fact, we could even create our own generator function to do this:

In [3]:
def chain_iterables(*iterables):
    for iterable in iterables:
        yield from iterable

In [4]:
l1 = (i**2 for i in range(4))
l2 = (i**2 for i in range(4, 8))
l3 = (i**2 for i in range(8, 12))

for item in chain_iterables(l1, l2, l3):
    print(item)

0
1
4
9
16
25
36
49
64
81
100
121


But, a much simpler way is to use `chain` in the `itertools` module:

In [5]:
from itertools import chain

In [6]:
l1 = (i**2 for i in range(4))
l2 = (i**2 for i in range(4, 8))
l3 = (i**2 for i in range(8, 12))

for item in chain(l1, l2, l3):
    print(item)

0
1
4
9
16
25
36
49
64
81
100
121


Note that `chain` expects a variable number of positional arguments, each of which should be an iterable.

It will not work if we pass it a single iterable:

In [7]:
l1 = (i**2 for i in range(4))
l2 = (i**2 for i in range(4, 8))
l3 = (i**2 for i in range(8, 12))

lists = [l1, l2, l3]
for item in chain(lists):
    print(item)

<generator object <genexpr> at 0x0000020AAFE06F10>
<generator object <genexpr> at 0x0000020AAFDB43B8>
<generator object <genexpr> at 0x0000020AAFE06FC0>


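As you can see, `chain` simply iterated over our single list and yielded its elements (the generators themselves) - there was nothing else to chain with. We would still have to iterate over each generator ourselves: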

In [8]:
l1 = (i**2 for i in range(4))
l2 = (i**2 for i in range(4, 8))
l3 = (i**2 for i in range(8, 12))

lists = [l1, l2, l3]
for item in chain(lists):
    for i in item:
        print(i)

0
1
4
9
16
25
36
49
64
81
100
121


Instead, we could use unpacking:

In [9]:
l1 = (i**2 for i in range(4))
l2 = (i**2 for i in range(4, 8))
l3 = (i**2 for i in range(8, 12))

lists = [l1, l2, l3]
for item in chain(*lists):
    print(item)

0
1
4
9
16
25
36
49
64
81
100
121


Unpacking works with iterables in general, so even the following would work just fine:

In [10]:
def squares():
    yield (i**2 for i in range(4))
    yield (i**2 for i in range(4, 8))
    yield (i**2 for i in range(8, 12))

In [11]:
for item in chain(*squares()):
    print(item)

0
1
4
9
16
25
36
49
64
81
100
121


But unpacking is not lazy! Here's a simple example that shows this, and why we have to be careful with unpacking if we want to preserve lazy evaluation:

In [12]:
def squares():
    print('yielding 1st item')
    yield (i**2 for i in range(4))
    print('yielding 2nd item')
    yield (i**2 for i in range(4, 8))
    print('yielding 3rd item')
    yield (i**2 for i in range(8, 12))

In [13]:
def read_values(*args):
    print('finished reading args')

In [14]:
read_values(*squares())

yielding 1st item
yielding 2nd item
yielding 3rd item
finished reading args


Instead we can use an alternate "constructor" for chain, that takes a single iterable (of iterables) and lazily iterates through the outer iterable as well:

In [15]:
c = chain.from_iterable(squares())

In [16]:
for item in c:
    print(item)

yielding 1st item
0
1
4
9
yielding 2nd item
16
25
36
49
yielding 3rd item
64
81
100
121
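The laziness of `chain.from_iterable` matters most when the outer iterable is very large or infinite. In the sketch below, `rows` is a hypothetical generator that never ends; unpacking it with `chain(*rows())` would never return, whereas `chain.from_iterable` only pulls what is requested:

```python
from itertools import chain, count, islice

def rows():
    # hypothetical infinite outer iterable: range(1), range(2), range(3), ...
    for n in count(1):
        yield range(n)

first_six = list(islice(chain.from_iterable(rows()), 6))
print(first_six)  # [0, 0, 1, 0, 1, 2]
```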


Note also that we can reproduce the behavior of either form of `chain` quite easily:

In [17]:
def chain_(*args):
    for item in args:
        yield from item

In [18]:
def chain_iter(iterable):
    for item in iterable:
        yield from item

And we can use those just as we saw before with `chain` and `chain.from_iterable`:

In [19]:
c = chain_(*squares())

yielding 1st item
yielding 2nd item
yielding 3rd item


In [20]:
c = chain_iter(squares())
for item in c:
    print(item)

yielding 1st item
0
1
4
9
yielding 2nd item
16
25
36
49
yielding 3rd item
64
81
100
121


### "Copying" an Iterator

Sometimes we may have an iterator that we want to use multiple times for some reason.

As we saw, iterators get exhausted, so simply making multiple references to the same iterator will not work - they will just point to the same iterator object.

What we would really like is a way to "copy" an iterator and use these copies independently of each other.

We can use `tee` in `itertools`:

In [21]:
from itertools import tee

In [22]:
def squares(n):
    for i in range(n):
        yield i**2

In [23]:
gen = squares(10)
gen

<generator object squares at 0x0000020AAFE067D8>

In [24]:
iters = tee(squares(10), 3)

In [25]:
iters

(<itertools._tee at 0x20aafe6b608>,
 <itertools._tee at 0x20aafe6bdc8>,
 <itertools._tee at 0x20aafe6b448>)

In [26]:
type(iters)

tuple

As you can see, `iters` is a **tuple** containing 3 iterators - let's put them into some variables and see what each one is:

In [27]:
iter1, iter2, iter3 = iters

In [28]:
next(iter1), next(iter1), next(iter1)

(0, 1, 4)

In [29]:
next(iter2), next(iter2)

(0, 1)

In [30]:
next(iter3)

0

As you can see, `iter1`, `iter2`, and `iter3` are essentially three independent "copies" of our original iterator (`squares(10)`).

Note that this works for any iterable, so even sequence types such as lists:

In [31]:
l = [1, 2, 3, 4]

In [32]:
lists = tee(l, 2)

In [33]:
lists[0]

<itertools._tee at 0x20aafe6b688>

In [34]:
lists[1]

<itertools._tee at 0x20aafe6b048>

But you'll notice that the elements of `lists` are not lists themselves!

In [35]:
list(lists[0])

[1, 2, 3, 4]

In [36]:
list(lists[0])

[]

As you can see, the elements returned by `tee` are actually `iterators` - even if we used an iterable such as a list to start off with!

In [37]:
lists[1] is lists[1].__iter__()

True

In [38]:
'__next__' in dir(lists[1])

True

Yep, the elements of `lists` are indeed iterators!
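A classic application of `tee` is walking over the same iterator with two cursors that are offset from each other. The function `pairwise_` below is a sketch written for this example (the trailing underscore marks it as our own helper); it yields overlapping pairs of consecutive items:

```python
from itertools import tee

def pairwise_(iterable):
    # two independent iterators over the same source
    a, b = tee(iterable, 2)
    # advance the second one by a single item (ignore an empty source)
    next(b, None)
    return zip(a, b)

pairs = list(pairwise_(i**2 for i in range(5)))
print(pairs)  # [(0, 1), (1, 4), (4, 9), (9, 16)]
```

Once an iterator has been passed to `tee`, it is safest to use only the returned iterators and leave the original one alone, since advancing the original would make the copies skip those items.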

##  Mapping and Reducing

#### *map* and *starmap*

You should already know the built-in `map` function and the `reduce` function from the `functools` module, so let's quickly review them:

The `map` function applies a given function (that takes a single argument) to an iterable of values and yields (lazily) the result of applying the function to each element of the iterable.

Let's see a simple example that calculates the square of values in an iterable:

In [1]:
maps = map(lambda x: x**2, range(5))

In [2]:
list(maps)

[0, 1, 4, 9, 16]

Keep in mind that `map` returns an iterator, so it will become exhausted:

In [3]:
list(maps)

[]

Of course, we can supply multiple values to a function by using an iterable of iterables (e.g. tuples) and unpacking the tuple in the function - but we still only use a single argument:

In [4]:
def add(t):
    return t[0] + t[1]

In [5]:
list(map(add, [(0,0), [1,1], range(2,4)]))

[0, 2, 5]

Remember how we can unpack an iterable into separate positional arguments?

In [6]:
def add(x, y):
    return x + y

In [7]:
t = (2, 3)
add(*t)

5

It would be nice if we could do that with the `map` function as well.

For example, it would be nice to do the following:

In [8]:
list(map(add, [(0,0), (1,1), (2,2)]))

TypeError: add() missing 1 required positional argument: 'y'

But of course that is not going to work, since `add` expects two arguments, and only a single one (the tuple) was provided.

This is where `starmap` comes in - it will essentially `*` each element of the iterable before passing it to the function defined in the map:

In [10]:
from itertools import starmap

In [11]:
list(starmap(add, [(0,0), (1,1), (2,2)]))

[0, 2, 4]
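To tie this back to unpacking, the sketch below shows that `starmap(fn, iterable)` produces the same values as calling `fn(*args)` for each element. The sample data in `pairs` is made up for the illustration; the last line combines `starmap` with `zip`.

```python
import operator
from itertools import starmap

pairs = [(2, 3), (3, 2), (10, 0)]

powers = list(starmap(pow, pairs))
same = [pow(*args) for args in pairs]  # equivalent, but eager
print(powers)  # [8, 9, 1]

# zip builds the argument tuples, starmap unpacks them again
sums = list(starmap(operator.add, zip([1, 2, 3], [10, 20, 30])))
print(sums)  # [11, 22, 33]
```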

#### Accumulation

You should already know the `sum` function - it simply calculates the sum of all the elements in an iterable:

In [12]:
sum([10, 20, 30])

60

It simply returns the final sum.

Sometimes we want to perform other operations than just summing up the values. Maybe we want to find the product of all the values in an iterable.

To do so, we would then use the `reduce` function available in the `functools` module. You should already be familiar with that function, but let's review it quickly.

The `reduce` function requires a `binary` function (a function that takes two arguments). It then applies that binary function to the first two elements of the iterable, obtains a result, then continues applying the binary function using the previous result and the next item in the iterable.

Optionally we can specify a seed value that is used as the 'first' element.

For example, to obtain the product of all values in an iterable:

In [13]:
from functools import reduce

In [14]:
reduce(lambda x, y: x*y, [1, 2, 3, 4])

24

We can even specify a "start" value:

In [15]:
reduce(lambda x, y: x*y, [1, 2, 3, 4], 10)

240

You'll note that with both `sum` and `reduce`, only the final result is shown - none of the intermediate results are available.

Sometimes we want to see the intermediate results as well.

Let's see how we might try it with the `sum` function:

In [16]:
def sum_(iterable):
    it = iter(iterable)
    acc = next(it)
    yield acc
    for item in it:
        acc += item
        yield acc

And we can use it as follows:

In [17]:
for item in sum_([10, 20, 30]):
    print(item)

10
30
60


Of course, this is just going to work for a sum.

We may want the same functionality with arbitrary binary functions, just like `reduce` was more general than `sum`.

We could try doing it ourselves as follows:

In [18]:
def running_reduce(fn, iterable, start=None):
    it = iter(iterable)
    if start is None:
        accumulator = next(it)
    else:
        accumulator = start
    yield accumulator
    
    for item in it:
        accumulator = fn(accumulator, item)
        yield accumulator
    

Let's try a running sum first.

We'll use the `operator` module instead of using lambdas.

In [19]:
import operator

In [20]:
list(running_reduce(operator.add, [10, 20, 30]))

[10, 30, 60]

Now we can also use other binary operators, such as multiplication:

In [21]:
list(running_reduce(operator.mul, [1, 2, 3, 4]))

[1, 2, 6, 24]

And of course, we can even set a "start" value:

In [22]:
list(running_reduce(operator.mul, [1, 2, 3, 4], 10))

[10, 10, 20, 60, 240]

While this certainly works, we really don't need to code this ourselves - that's exactly what the `accumulate` function in `itertools` does for us.

The order of the arguments is different, however: the iterable is specified first, because the binary function is optional and defaults to addition if we don't specify it. There is also no positional "start" argument; since Python 3.8 an initial value can be supplied with the keyword-only `initial` argument, and on older versions you could use the technique I just showed you.

In [23]:
from itertools import accumulate

In [24]:
list(accumulate([10, 20, 30]))

[10, 30, 60]

We can find the running product of an iterable:

In [25]:
list(accumulate([1, 2, 3, 4], operator.mul))

[1, 2, 6, 24]
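Any two-argument function works with `accumulate`, not just the ones in `operator`. The sketch below computes a running maximum with the built-in `max`. The second call assumes Python 3.8 or later, where `accumulate` accepts a keyword-only `initial` argument; on older versions that call raises a `TypeError`.

```python
import operator
from itertools import accumulate

running_max = list(accumulate([3, 1, 4, 1, 5, 9, 2], max))
print(running_max)  # [3, 3, 4, 4, 5, 9, 9]

# Python 3.8+ only: seed the accumulation with an initial value
seeded = list(accumulate([1, 2, 3, 4], operator.mul, initial=10))
print(seeded)  # [10, 10, 20, 60, 240]
```

With `initial`, the output has one more element than the input, which matches the behavior of the hand-written `running_reduce` above when a start value is given.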

##  Zipping

We've already used the `zip` function quite a bit.

It zips up two or more iterables and yields tuples containing elements from all the iterables in "parallel". It is also lazy, and it stops as soon as any one of the iterables is exhausted.

Let's look at a simple example:

In [1]:
l1 = [1, 2, 3, 4, 5]
l2 = [1, 2, 3, 4]
l3 = [1, 2, 3]

In [2]:
list(zip(l1, l2, l3))

[(1, 1, 1), (2, 2, 2), (3, 3, 3)]

As you can see, the shortest iterable we provided to the `zip` function had a length of 3 (so it reached the end of iteration first), and our output therefore only had 3 tuples in it.

Of course, this works with iterators and generators too:

In [3]:
def integers(n):
    for i in range(n):
        yield i
        
def squares(n):
    for i in range(n):
        yield i**2
        
def cubes(n):
    for i in range(n):
        yield i**3

In [4]:
iter1 = integers(6)
iter2 = squares(5)
iter3 = cubes(4)

In [5]:
list(zip(iter1, iter2, iter3))

[(0, 0, 0), (1, 1, 1), (2, 4, 8), (3, 9, 27)]

Sometimes we want to zip up iterables but iterate all of them completely, instead of stopping at the shortest. Of course, the question is then what to do with the iterables that are exhausted before the longest one is.

Simple, we just need to provide a default "filler" value.

And that's how the `zip_longest` function from `itertools` works:

In [6]:
from itertools import zip_longest

In [7]:
help(zip_longest)

Help on class zip_longest in module itertools:

class zip_longest(builtins.object)
 |  zip_longest(iter1 [,iter2 [...]], [fillvalue=None]) --> zip_longest object
 |  
 |  Return a zip_longest object whose .__next__() method returns a tuple where
 |  the i-th element comes from the i-th iterable argument.  The .__next__()
 |  method continues until the longest iterable in the argument sequence
 |  is exhausted and then it raises StopIteration.  When the shorter iterables
 |  are exhausted, the fillvalue is substituted in their place.  The fillvalue
 |  defaults to None or can be specified by a keyword argument.
 |  
 |  Methods defined here:
 |  
 |  __getattribute__(self, name, /)
 |      Return getattr(self, name).
 |  
 |  __iter__(self, /)
 |      Implement iter(self).
 |  
 |  __new__(*args, **kwargs) from builtins.type
 |      Create and return a new object.  See help(type) for accurate signature.
 |  
 |  __next__(self, /)
 |      Implement next(self).
 |  
 |  __reduce__(...)
 |

As you can see, we can only specify a single default value. That default will be used for every provided iterable once it has been fully iterated.

As expected, `zip_longest` yields its values - it is lazy.

Let's see an example:

In [8]:
l1 = [1, 2, 3, 4, 5]
l2 = [1, 2, 3, 4]
l3 = [1, 2, 3]

In [9]:
list(zip_longest(l1, l2, l3, fillvalue='N/A'))

[(1, 1, 1), (2, 2, 2), (3, 3, 3), (4, 4, 'N/A'), (5, 'N/A', 'N/A')]

Of course, since this zips over the longest iterable, beware of using an infinite iterable!

You don't have to worry about this with the normal `zip` function as long as at least one of the iterables is finite:

In [10]:
def squares():
    i = 0
    while True:
        yield i ** 2
        i += 1

def cubes():
    i = 0
    while True:
        yield i ** 3
        i += 1

Obviously `squares` and `cubes` both produce infinite iterators. But we can still zip them with a finite iterable:

In [11]:
iter1 = squares()
iter2 = cubes()
list(zip(range(10), iter1, iter2))

[(0, 0, 0),
 (1, 1, 1),
 (2, 4, 8),
 (3, 9, 27),
 (4, 16, 64),
 (5, 25, 125),
 (6, 36, 216),
 (7, 49, 343),
 (8, 64, 512),
 (9, 81, 729)]

Don't try the same thing with `zip_longest`!
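If an infinite iterator really does have to go through `zip_longest`, one way to stay safe is to bound the result with `islice`, as in this minimal sketch (`count()` stands in for any infinite iterator):

```python
from itertools import count, islice, zip_longest

# zip_longest alone would never finish here; islice caps it at 5 tuples
pairs = list(islice(zip_longest(range(3), count()), 5))
print(pairs)
# [(0, 0), (1, 1), (2, 2), (None, 3), (None, 4)]
```

Since no `fillvalue` was given, the exhausted `range(3)` is padded with the default `None`.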

##  Combinatorics

There are a number of functions in `itertools` that are concerned with things like permutations and combinations.

Let's look at each one briefly - I am not going to go into much depth as to what permutations and combinations are though - this is not meant to be a statistics course :-)

In [1]:
import itertools

#### Cartesian Product

The Cartesian product is actually a lot more useful than it might appear at first.

Consider this example, where we want to create a multiplication table as we have seen before:

In [2]:
def matrix(n):
    for i in range(1, n+1):
        for j in range(1, n+1):
            yield f'{i} x {j} = {i*j}'

We can look at a few elements using `islice`:

In [3]:
list(itertools.islice(matrix(10), 10, 20))

['2 x 1 = 2',
 '2 x 2 = 4',
 '2 x 3 = 6',
 '2 x 4 = 8',
 '2 x 5 = 10',
 '2 x 6 = 12',
 '2 x 7 = 14',
 '2 x 8 = 16',
 '2 x 9 = 18',
 '2 x 10 = 20']

Notice that we iterated through the same sets (the numbers from 1 to 10) in a nested fashion.

If we think of those two sets as
$$
s_1 = \{1, 2, 3, \ldots, 10\}
$$
$$
s_2 = \{1, 2, 3, \ldots, 10\}
$$
then the Cartesian product of the two sets is:
$$
s_1 \times s_2 = \{(x_1, x_2) \, \vert \, x_1 \in s_1 \, \textrm{and} \, x_2 \in s_2\}
$$

Another way to think of it is by creating a table (just like our multiplication table!):

```
        y1        y2        y3
x1  (x1, y1)  (x1, y2)  (x1, y3)

x2  (x2, y1)  (x2, y2)  (x2, y3)

x3  (x3, y1)  (x3, y2)  (x3, y3)

x4  (x4, y1)  (x4, y2)  (x4, y3)
```

Our multiplication table was just the product of each $x_i$ and $y_j$:

```
       y1       y2       y3      y4
x1  x1 * y1  x1 * y2  x1 * y3  x1 * y4

x2  x2 * y1  x2 * y2  x2 * y3  x2 * y4  

x3  x3 * y1  x3 * y2  x3 * y3  x3 * y4  

x4  x4 * y1  x4 * y2  x4 * y3  x4 * y4  
```

So, the Cartesian product of two iterables in general can be generated using a nested loop:

In [4]:
l1 = ['x1', 'x2', 'x3', 'x4']
l2 = ['y1', 'y2', 'y3']
for x in l1:
    for y in l2:
        print((x, y), end=' ')
    print('')

('x1', 'y1') ('x1', 'y2') ('x1', 'y3') 
('x2', 'y1') ('x2', 'y2') ('x2', 'y3') 
('x3', 'y1') ('x3', 'y2') ('x3', 'y3') 
('x4', 'y1') ('x4', 'y2') ('x4', 'y3') 


We can achieve the same result with the `product` function in `itertools`. As usual, it is lazy as well.

In [5]:
l1 = ['x1', 'x2', 'x3', 'x4']
l2 = ['y1', 'y2', 'y3']
list(itertools.product(l1, l2))

[('x1', 'y1'),
 ('x1', 'y2'),
 ('x1', 'y3'),
 ('x2', 'y1'),
 ('x2', 'y2'),
 ('x2', 'y3'),
 ('x3', 'y1'),
 ('x3', 'y2'),
 ('x3', 'y3'),
 ('x4', 'y1'),
 ('x4', 'y2'),
 ('x4', 'y3')]
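As a quick check of both claims, the sketch below compares `product` with the equivalent nested comprehension, and pulls a single tuple with `next` to show that nothing is produced until we ask for it. The variable names are only for illustration.

```python
import itertools

l1 = ['x1', 'x2', 'x3', 'x4']
l2 = ['y1', 'y2', 'y3']

# the nested loop, written as a comprehension
nested = [(x, y) for x in l1 for y in l2]

# product returns an iterator - tuples are only produced on request
prod = itertools.product(l1, l2)
first = next(prod)    # pulls just the first pair
rest = list(prod)     # pulls the remaining pairs

same = [first] + rest == nested
```

One thing to keep in mind: `product` reads each of its input iterables completely before it yields the first tuple, so it is lazy in what it produces, but it cannot be used with infinite inputs.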

As a simple example, let's go back to the multiplication table we created using a generator function.

In [6]:
def matrix(n):
    for i in range(1, n+1):
        for j in range(1, n+1):
            yield (i, j, i*j)

In [7]:
list(matrix(4))

[(1, 1, 1),
 (1, 2, 2),
 (1, 3, 3),
 (1, 4, 4),
 (2, 1, 2),
 (2, 2, 4),
 (2, 3, 6),
 (2, 4, 8),
 (3, 1, 3),
 (3, 2, 6),
 (3, 3, 9),
 (3, 4, 12),
 (4, 1, 4),
 (4, 2, 8),
 (4, 3, 12),
 (4, 4, 16)]

In [8]:
def matrix(n):
    for i, j in itertools.product(range(1, n+1), range(1, n+1)):
        yield (i, j, i*j)

In [9]:
list(matrix(4))

[(1, 1, 1),
 (1, 2, 2),
 (1, 3, 3),
 (1, 4, 4),
 (2, 1, 2),
 (2, 2, 4),
 (2, 3, 6),
 (2, 4, 8),
 (3, 1, 3),
 (3, 2, 6),
 (3, 3, 9),
 (3, 4, 12),
 (4, 1, 4),
 (4, 2, 8),
 (4, 3, 12),
 (4, 4, 16)]

And of course this is now simple enough to even use just a generator expression:

In [10]:
def matrix(n):
    return ((i, j, i*j) 
            for i, j in itertools.product(range(1, n+1), range(1, n+1)))

In [11]:
list(matrix(4))

[(1, 1, 1),
 (1, 2, 2),
 (1, 3, 3),
 (1, 4, 4),
 (2, 1, 2),
 (2, 2, 4),
 (2, 3, 6),
 (2, 4, 8),
 (3, 1, 3),
 (3, 2, 6),
 (3, 3, 9),
 (3, 4, 12),
 (4, 1, 4),
 (4, 2, 8),
 (4, 3, 12),
 (4, 4, 16)]

You'll notice how we repeated the `range(1, n+1)` twice?

This is a great example of where `tee` can be useful:

In [12]:
import itertools

def matrix(n):
    return ((i, j, i*j) 
            for i, j in itertools.product(*itertools.tee(range(1, n+1), 2)))

In [13]:
list(matrix(4))

[(1, 1, 1),
 (1, 2, 2),
 (1, 3, 3),
 (1, 4, 4),
 (2, 1, 2),
 (2, 2, 4),
 (2, 3, 6),
 (2, 4, 8),
 (3, 1, 3),
 (3, 2, 6),
 (3, 3, 9),
 (3, 4, 12),
 (4, 1, 4),
 (4, 2, 8),
 (4, 3, 12),
 (4, 4, 16)]
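For the specific case of taking the product of an iterable with itself, `product` also accepts a `repeat` keyword argument, so `tee` is not strictly needed here. A minimal sketch of the same generator written that way:

```python
import itertools

def matrix(n):
    # repeat=2 means: take the product of the iterable with itself
    return ((i, j, i*j)
            for i, j in itertools.product(range(1, n+1), repeat=2))

result = list(matrix(4))
```

This yields the same 16 tuples as the versions above.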

#### Example 1

A common usage of Cartesian products might be to generate a grid of coordinates.

For a 2D space for example, we might want to generate a grid of points ranging from -5 to 5 in both the x and y axes, with a step of 0.5.

We can't use a `range`, since ranges only work with integers, but we can use the `count` function in `itertools` that we have seen before:

In [14]:
def grid(min_val, max_val, step, *, num_dimensions=2):
    axis = itertools.takewhile(lambda x: x <= max_val,
                               itertools.count(min_val, step))
    
    # to handle multiple dimensions, we just need to repeat the axis that
    # many times - tee is perfect for that
    axes = itertools.tee(axis, num_dimensions)

    # and now we just need the product of all these iterables
    return itertools.product(*axes)

In [15]:
list(grid(-1, 1, 0.5))

[(-1, -1),
 (-1, -0.5),
 (-1, 0.0),
 (-1, 0.5),
 (-1, 1.0),
 (-0.5, -1),
 (-0.5, -0.5),
 (-0.5, 0.0),
 (-0.5, 0.5),
 (-0.5, 1.0),
 (0.0, -1),
 (0.0, -0.5),
 (0.0, 0.0),
 (0.0, 0.5),
 (0.0, 1.0),
 (0.5, -1),
 (0.5, -0.5),
 (0.5, 0.0),
 (0.5, 0.5),
 (0.5, 1.0),
 (1.0, -1),
 (1.0, -0.5),
 (1.0, 0.0),
 (1.0, 0.5),
 (1.0, 1.0)]
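A step of `0.5` is exactly representable as a float, so repeated addition works well here. For steps such as `0.1`, the `itertools` documentation notes that better accuracy can sometimes be achieved by multiplying instead of repeatedly adding. The sketch below applies that idea; `axis_values` and `grid_mult` are hypothetical names introduced only for this illustration, and `repeat` is used in place of `tee`.

```python
import itertools

def axis_values(min_val, max_val, step):
    # multiply instead of repeatedly adding, so rounding errors do not build up
    values = (min_val + step * i for i in itertools.count())
    return itertools.takewhile(lambda x: x <= max_val, values)

def grid_mult(min_val, max_val, step, *, num_dimensions=2):
    axis = axis_values(min_val, max_val, step)
    # product reads the axis once and reuses it for every dimension
    return itertools.product(axis, repeat=num_dimensions)

points = list(grid_mult(-1, 1, 0.5))
```

For this particular input the points are the same as those produced by `grid` above.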

And of course we can now do it in 3D as well:

In [16]:
list(grid(-1, 1, 0.5, num_dimensions=3))

[(-1, -1, -1),
 (-1, -1, -0.5),
 (-1, -1, 0.0),
 (-1, -1, 0.5),
 (-1, -1, 1.0),
 (-1, -0.5, -1),
 (-1, -0.5, -0.5),
 (-1, -0.5, 0.0),
 (-1, -0.5, 0.5),
 (-1, -0.5, 1.0),
 (-1, 0.0, -1),
 (-1, 0.0, -0.5),
 (-1, 0.0, 0.0),
 (-1, 0.0, 0.5),
 (-1, 0.0, 1.0),
 (-1, 0.5, -1),
 (-1, 0.5, -0.5),
 (-1, 0.5, 0.0),
 (-1, 0.5, 0.5),
 (-1, 0.5, 1.0),
 (-1, 1.0, -1),
 (-1, 1.0, -0.5),
 (-1, 1.0, 0.0),
 (-1, 1.0, 0.5),
 (-1, 1.0, 1.0),
 (-0.5, -1, -1),
 (-0.5, -1, -0.5),
 (-0.5, -1, 0.0),
 (-0.5, -1, 0.5),
 (-0.5, -1, 1.0),
 (-0.5, -0.5, -1),
 (-0.5, -0.5, -0.5),
 (-0.5, -0.5, 0.0),
 (-0.5, -0.5, 0.5),
 (-0.5, -0.5, 1.0),
 (-0.5, 0.0, -1),
 (-0.5, 0.0, -0.5),
 (-0.5, 0.0, 0.0),
 (-0.5, 0.0, 0.5),
 (-0.5, 0.0, 1.0),
 (-0.5, 0.5, -1),
 (-0.5, 0.5, -0.5),
 (-0.5, 0.5, 0.0),
 (-0.5, 0.5, 0.5),
 (-0.5, 0.5, 1.0),
 (-0.5, 1.0, -1),
 (-0.5, 1.0, -0.5),
 (-0.5, 1.0, 0.0),
 (-0.5, 1.0, 0.5),
 (-0.5, 1.0, 1.0),
 (0.0, -1, -1),
 (0.0, -1, -0.5),
 (0.0, -1, 0.0),
 (0.0, -1, 0.5),
 (0.0, -1, 1.0),
 (0.0, -0.5, -1

#### Example 2

Another simple application of this might be to determine the odds of rolling an 8 with a pair of dice (with values 1 - 6).

We can brute force this by generating all the possible results (the sample space), and counting how many items add up to 8.

Let's break it up into a few steps:

In [17]:
sample_space = list(itertools.product(range(1, 7), range(1, 7)))
print(sample_space)

[(1, 1), (1, 2), (1, 3), (1, 4), (1, 5), (1, 6), (2, 1), (2, 2), (2, 3), (2, 4), (2, 5), (2, 6), (3, 1), (3, 2), (3, 3), (3, 4), (3, 5), (3, 6), (4, 1), (4, 2), (4, 3), (4, 4), (4, 5), (4, 6), (5, 1), (5, 2), (5, 3), (5, 4), (5, 5), (5, 6), (6, 1), (6, 2), (6, 3), (6, 4), (6, 5), (6, 6)]


Now we want to keep only the tuples whose elements add up to 8:

In [18]:
outcomes = list(filter(lambda x: x[0] + x[1] == 8, sample_space))
print(outcomes)

[(2, 6), (3, 5), (4, 4), (5, 3), (6, 2)]


And we can calculate the odds by dividing the number of acceptable outcomes by the size of the sample space. I'll actually use a `Fraction` so we retain our result as a rational number:

In [19]:
from fractions import Fraction
odds = Fraction(len(outcomes), len(sample_space))
print(odds)

5/36
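The same brute force idea extends to every possible total at once. As a sketch (using `Counter` from `collections`, which is a choice made here and not part of the steps above), we can tally how many outcomes produce each total and turn each tally into a `Fraction`:

```python
import itertools
from collections import Counter
from fractions import Fraction

rolls = list(itertools.product(range(1, 7), repeat=2))

# how many outcomes produce each total
totals = Counter(a + b for a, b in rolls)

# divide each tally by the size of the sample space
chances = {total: Fraction(count, len(rolls))
           for total, count in totals.items()}
```

Here `chances[8]` is the same `5/36` we calculated step by step.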


#### Permutations

From Wikipedia: 


> In mathematics, the notion of permutation relates to the act of arranging all the members of a set into some sequence or order, or if the set is already ordered, rearranging (reordering) its elements, a process called permuting. These differ from combinations, which are selections of some members of a set where order is disregarded.


https://en.wikipedia.org/wiki/Permutation

We can create permutations of length n from an iterable of length m (n <= m) using the `permutations` function:

In [20]:
l1 = 'abc'
list(itertools.permutations(l1))

[('a', 'b', 'c'),
 ('a', 'c', 'b'),
 ('b', 'a', 'c'),
 ('b', 'c', 'a'),
 ('c', 'a', 'b'),
 ('c', 'b', 'a')]

As you can see all the permutations are, by default, the same length as the original iterable.

We can create permutations of smaller length by specifying the `r` value:

In [21]:
list(itertools.permutations(l1, 2))

[('a', 'b'), ('a', 'c'), ('b', 'a'), ('b', 'c'), ('c', 'a'), ('c', 'b')]

The important thing to note is that elements are not 'repeated' in the permutation. The uniqueness of an element is **not** based on its value, but rather on its **position** in the original iterable.

Take this example:

In [22]:
list(itertools.permutations('aaa'))

[('a', 'a', 'a'),
 ('a', 'a', 'a'),
 ('a', 'a', 'a'),
 ('a', 'a', 'a'),
 ('a', 'a', 'a'),
 ('a', 'a', 'a')]

This means that the following will yield what looks like the same permutations when considering the **values** of the iterable:

In [23]:
list(itertools.permutations('aba', 2))

[('a', 'b'), ('a', 'a'), ('b', 'a'), ('b', 'a'), ('a', 'a'), ('a', 'b')]

As you can see, each tuple looks like it appears twice - but considering the elements are unique based on their position, this is actually quite correct.
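If what we actually want is permutations that are distinct by **value**, one possible approach (a sketch, not something `permutations` does for us) is to drop the repeats afterwards. Dictionary keys are unique and keep their insertion order, so `dict.fromkeys` works well for that:

```python
import itertools

perms = itertools.permutations('aba', 2)

# dict keys are unique and keep insertion order, so repeated tuples are dropped
unique_by_value = list(dict.fromkeys(perms))
```

Keep in mind that this still generates every positional permutation first, so it can be slow for long inputs with many repeated values.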

#### Combinations

From Wikipedia:
>Combinations refer to the combination of n things taken k at a time without repetition. To refer to combinations in which repetition is allowed, the terms k-selection,[1] k-multiset,[2] or k-combination with repetition are often used.

https://en.wikipedia.org/wiki/Combination

`itertools` offers both flavors - with and without replacement.

Let's look at a simple example without replacement first:

In [24]:
list(itertools.combinations([1, 2, 3, 4], 2))

[(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]

As you can see `(4, 3)` is not included in the result since, as a combination, it is the same as `(3, 4)` - order is not important.

If we want replacement:

In [25]:
list(itertools.combinations_with_replacement([1, 2, 3, 4], 2))

[(1, 1),
 (1, 2),
 (1, 3),
 (1, 4),
 (2, 2),
 (2, 3),
 (2, 4),
 (3, 3),
 (3, 4),
 (4, 4)]
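We can sanity check the number of results against the standard counting formulas: there are C(n, k) combinations without replacement and C(n + k - 1, k) with replacement. The sketch below assumes Python 3.8 or later, where `math.comb` is available:

```python
import itertools
from math import comb   # available in Python 3.8 and later

items = [1, 2, 3, 4]
n, k = len(items), 2

without = list(itertools.combinations(items, k))
with_repl = list(itertools.combinations_with_replacement(items, k))

counts_match = (len(without) == comb(n, k)
                and len(with_repl) == comb(n + k - 1, k))
```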

#### Example 3

A simple application of this might be to calculate the odds of pulling four consecutive aces from a deck of 52 cards.

That's very easy to figure out, but we could use a brute force approach by calculating all the 4-combinations (without repetition) from a deck of 52 cards.

Let's try it:

First we need a deck:

In [26]:
SUITS = 'SHDC'
RANKS = tuple(map(str, range(2, 11))) + tuple('JQKA')

In [27]:
RANKS

('2', '3', '4', '5', '6', '7', '8', '9', '10', 'J', 'Q', 'K', 'A')

I wanted all the elements in my `RANKS` to be strings - just to have a consistent data type, and to show you how handy `map` can be!

Next I need to create the deck:

In [28]:
deck = [rank + suit for suit in SUITS for rank in RANKS]

In [29]:
deck[0:5]

['2S', '3S', '4S', '5S', '6S']

Hmm... A nested loop. Maybe `product` would work well here!

In [30]:
deck = [rank + suit for suit, rank in itertools.product(SUITS, RANKS)]

I would much prefer having a named tuple for the deck, so let's do that as well:

In [31]:
from collections import namedtuple
Card = namedtuple('Card', 'rank suit')

In [32]:
deck = [Card(rank, suit) for suit, rank in itertools.product(SUITS, RANKS)]

And I really don't need it as a list - a generator expression will do just as well...

In [33]:
deck = (Card(rank, suit) for suit, rank in itertools.product(SUITS, RANKS))

Next we need to produce our sample space - all combinations of 4 cards from the deck, without repetition:

In [34]:
sample_space = itertools.combinations(deck, 4)

Next we need to count the number of acceptable outcomes - but we also need to count the size of our sample space.
We can't use `len()` though - iterables in general don't support that method. 
I could create the sample space twice, but that seems wasteful - so instead I'm going to iterate through the sample space once and just keep track of both counts:

In [35]:
deck = (Card(rank, suit) for suit, rank in itertools.product(SUITS, RANKS))
sample_space = itertools.combinations(deck, 4)
total = 0
acceptable = 0
for outcome in sample_space:
    total += 1
    for card in outcome:
        if card.rank != 'A':
            break
    else:
        # else block is executed if loop terminated without a break
        acceptable += 1
print(f'total={total}, acceptable={acceptable}')
print('odds={}'.format(Fraction(acceptable, total)))
print('odds={:.10f}'.format(acceptable/total))

total=270725, acceptable=1
odds=1/270725
odds=0.0000036938


We can easily verify that this is correct:

The odds of successively picking four aces from a shuffled deck are:

$$
\frac{4}{52} \times \frac{3}{51} \times \frac{2}{50} \times \frac{1}{49}
= \frac{24}{6497400} = \frac{1}{270725}
$$

I also want to point out that we could use the `all` function instead of that inner `for` loop and the `else` block.

Remember that `all(iterable)` will evaluate to True if all the elements of the iterable are truthy.
Now in our case, since ranks are non-empty strings, they will always be truthy, so we can't use `all` directly:

In [36]:
all(['A', 'A', '10', 'J'])

True

Instead we can use the `map` function, yet again, to test whether each value is an `'A'` or not:

In [37]:
l1 = ['K', 'A', 'A', 'A']
l2 = ['A', 'A', 'A', 'A']

print(list(map(lambda x: x == 'A', l1)))
print(list(map(lambda x: x == 'A', l2)))

[False, True, True, True]
[True, True, True, True]


So now we can use `all` (and we don't have to create a list):

In [38]:
print(all(map(lambda x: x == 'A', l1)))
print(all(map(lambda x: x == 'A', l2)))

False
True


So, we could rewrite our algorithm as follows:

In [39]:
deck = (Card(rank, suit) for suit, rank in itertools.product(SUITS, RANKS))
sample_space = itertools.combinations(deck, 4)
total = 0
acceptable = 0
for outcome in sample_space:
    total += 1
    if all(map(lambda x: x.rank == 'A', outcome)):
        acceptable += 1

print(f'total={total}, acceptable={acceptable}')
print('odds={}'.format(Fraction(acceptable, total)))
print('odds={:.10f}'.format(acceptable/total))

total=270725, acceptable=1
odds=1/270725
odds=0.0000036938
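The size of the sample space can also be computed directly, without iterating, since it is just the number of 4-combinations of 52 cards. A short check, assuming Python 3.8 or later for `math.comb`:

```python
from fractions import Fraction
from math import comb   # available in Python 3.8 and later

total = comb(52, 4)        # number of distinct 4-card hands
acceptable = comb(4, 4)    # only one hand is made up of all four aces
odds = Fraction(acceptable, total)
```

This agrees with the `total=270725` we obtained by brute force.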


##  Aggregators

We have already used many built-in aggregators.

In [1]:
def squares(n):
    for i in range(n):
        yield i**2

In [2]:
list(squares(5))

[0, 1, 4, 9, 16]

We can find the `min` and `max` of elements in an iterable:

In [3]:
min(squares(5))

0

In [4]:
max(squares(5))

16

Be careful, all these aggregation functions will **exhaust** any iterator being used.

In [5]:
sq = squares(5)

In [6]:
max(sq)

16

In [7]:
min(sq)

ValueError: min() arg is an empty sequence
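Two simple ways around this, sketched below: materialise the values into a list when more than one aggregate is needed, or pass the `default` argument that `min` and `max` accept, so that an exhausted iterator gives back a fallback value instead of raising an exception.

```python
def squares(n):
    for i in range(n):
        yield i**2

# materialise once, then aggregate as often as needed
values = list(squares(5))
smallest, largest = min(values), max(values)

sq = squares(5)
max(sq)                            # exhausts the generator
fallback = min(sq, default=None)   # default is returned for an empty iterable
```

Materialising into a list gives up the memory benefit of the generator, so it is a trade-off rather than a general fix.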

We also have `sum`:

In [8]:
list(squares(5))

[0, 1, 4, 9, 16]

In [9]:
sum(squares(5))

30

#### The `any` function

The `any` function is a predicate (a function that returns `True` or `False`) that takes an iterable and returns `True` if **any** element of that iterable is True (or has an associated True truth-value, i.e. is **truthy**).

Remember that by default custom objects are always truthy:

In [10]:
class Person:
    pass

In [11]:
p = Person()

In [12]:
bool(p)

True

For numbers, anything other than `0` is truthy, and strings, lists, tuples, dictionaries, etc. are falsy if they are empty.

In fact, any empty sequence type (i.e. length = 0) is falsy, including custom sequence types:

In [13]:
class MySeq:
    def __init__(self, n):
        self.n = n
        
    def __len__(self):
        return self.n
    
    def __getitem__(self, s):
        pass

In [14]:
my_seq = MySeq(0)

In [15]:
bool(my_seq)

False

In [16]:
my_seq = MySeq(10)

In [17]:
bool(my_seq)

True

The `any` function can be used to quickly test if any element is **truthy**:

In [18]:
any([0, '', None])

False

In [19]:
any([0, '', None, 'hello'])

True

Basically, the `any` function is like doing an `or` between all the elements of the iterable, and casting the result to a Boolean:

In [20]:
result = 0 or '' or None or 'hello'
result, bool(result)

('hello', True)
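Just like a chain of `or`, `any` stops as soon as it finds a truthy element. The sketch below makes that visible by recording every element that gets pulled from the iterable; `tracked` is a hypothetical helper written only for this demonstration.

```python
def tracked(values, seen):
    # yields each value, recording it so we can see how far iteration went
    for v in values:
        seen.append(v)
        yield v

seen = []
result = any(tracked([0, '', 'hello', 'never reached'], seen))
```

After this runs, `seen` holds only the elements up to and including `'hello'` - the last element was never requested.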

#### The `all` Function

The `all` function is very similar to the `any` function, but it determines if **all** the elements of the iterable are truthy.

Basically it is equivalent to doing an `and` between all the elements of the iterable and casting the result to a Boolean.

In [21]:
all([1, 'abc', [1, 2], range(5)])

True

In [22]:
all([1, 'abc', [1, 2], range(5), ''])

False
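An edge case worth knowing is the empty iterable: `all` returns `True` (no element is falsy) while `any` returns `False` (no element is truthy). The sketch below shows both, along with the `and` analogy written out:

```python
empty_all = all([])   # no element is falsy, so the result is True
empty_any = any([])   # no element is truthy, so the result is False

# the `and` analogy: the first falsy operand is returned, then cast to a Boolean
result = 1 and 'abc' and '' and [1, 2]
as_bool = bool(result)
```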

#### In Practice

In practice, we often need to test if all elements of an iterable satisfy some criteria, not necessarily whether the elements are truthy or falsy.

But we can easily apply a predicate to an iterable to first evaluate the conditions we want, and then feed that into the `any` or `all` functions.

This is where the `map` function is extremely useful! Alternatively, we can also use generator expressions.

Let's see a few examples.

##### Example 1

Suppose we want to test if an iterable contains only numeric values.

First, we need to figure out how we determine if something is a number.

This is actually a very common question on the web, with all kinds of weird and wonderful solutions - most of which actually work (for the most part).

But the simplest is to test if the object we are looking at is an instance of the `Number` class!

In [23]:
from numbers import Number

In [24]:
isinstance(10, Number), isinstance(10.5, Number)

(True, True)

In [25]:
isinstance(2+3j, Number)

True

In [26]:
from decimal import Decimal

In [27]:
isinstance(Decimal('10.3'), Number)

True

In [28]:
isinstance(True, Number)

True

On the other hand:

In [29]:
isinstance('100', Number)

False

In [30]:
isinstance([10, 20], Number)

False

Now suppose we have a list (or iterable in general) and we want to see if they are all numbers:

We could proceed with a rather clunky approach this way:

In [31]:
l = [10, 20, 30, 40]

is_all_numbers = True
for item in l:
    if not isinstance(item, Number):
        is_all_numbers = False
        break
print(is_all_numbers)

True


In [32]:
l = [10, 20, 30, 40, 'hello']

is_all_numbers = True
for item in l:
    if not isinstance(item, Number):
        is_all_numbers = False
        break
print(is_all_numbers)

False


Now we can actually simplify this a little, by using the `else` clause of the `for` loop - remember that the `else` clause of a `for` loop will execute if the loop terminated normally (i.e. did not `break` out of the loop).

In [33]:
l = [10, 20, 30, 40, 'hello']
is_all_numbers = False
for item in l:
    if not isinstance(item, Number):
        break
else: # nobreak --> all numbers
    is_all_numbers = True
print(is_all_numbers)

False


Still this is clunky - there has to be a better way!

Yes, of course - the `all` function.

But we can't use it directly on the items - we're not interested in whether they are all truthy or not, we are interested in whether they are all numbers or not.

To achieve this we need to transform each element of the list using a predicate that will return `True` if the element is a number and `False` otherwise.

We can use the `map` function to apply a function (with a single parameter) to all the elements of an iterable:

In [34]:
map(str, [0, 1, 2, 3, 4])

<map at 0x1740e6fbf28>

Now `map` is lazy, so let's put it into a list to see what it contains:

In [35]:
list(map(str, [0, 1, 2, 3, 4]))

['0', '1', '2', '3', '4']

The function we actually want to use is the `isinstance` function - but that requires **two** parameters - the element we are testing, and the `type` we are testing for.

Somehow we need to create a form of `isinstance` that only requires a single variable and simply holds the type (`Number`) fixed.

We can do this very simply using a function or a lambda.

In [36]:
def is_number(x):
    return isinstance(x, Number)

or, simply a lambda:

In [37]:
lambda x: isinstance(x, Number)

<function __main__.<lambda>>

So now, let's map that function to our iterable:

In [38]:
l

[10, 20, 30, 40, 'hello']

In [39]:
list(map(lambda x: isinstance(x, Number), l))

[True, True, True, True, False]

And of course, **now** we can use the `all` function to determine if all the elements are numbers or not:

In [40]:
l = [10, 20, 30, 40, 'hello']
all(map(lambda x: isinstance(x, Number), l))

False

In [41]:
l = [10, 20, 30, 40]
all(map(lambda x: isinstance(x, Number), l))

True

A lot less typing than the first approach we did!

If you don't like using `map` for some reason, we can easily use a generator expression as well:

In [42]:
l = [10, 20, 30, 40]
all(isinstance(x, Number) for x in l)

True

In [43]:
l = [10, 20, 30, 40, 'hello']
all(isinstance(x, Number) for x in l)

False

Both approaches work equally well - use whichever one you are most comfortable with - but do try to use both and once you are comfortable with both approaches, then choose!
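As we saw earlier, `isinstance(True, Number)` is `True`, because `bool` is a subclass of `int`. If Booleans should not count as numbers in your use case (that is an assumption about your requirements, not a general rule), the predicate can exclude them explicitly. `is_strict_number` is a hypothetical helper name:

```python
from numbers import Number

def is_strict_number(x):
    # bool is a subclass of int, so it has to be excluded explicitly
    return isinstance(x, Number) and not isinstance(x, bool)

l = [10, 20.5, 3+4j, True]

loose = all(isinstance(x, Number) for x in l)
strict = all(map(is_strict_number, l))
```

Here `loose` accepts the list, while `strict` rejects it because of the `True` at the end.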

##### Example 2

Let's look at another simple example.

Suppose we have a file and we want to make sure that all the rows in the file have length > some number.

Let's just see what data we have in our sample data file:

In [44]:
with open('car-brands.txt') as f:
    for row in f:
        print(len(row), row, end='')

11 Alfa Romeo
13 Aston Martin
5 Audi
8 Bentley
5 Benz
4 BMW
8 Bugatti
9 Cadillac
10 Chevrolet
9 Chrysler
8 Citroën
9 Corvette
4 DAF
6 Dacia
7 Daewoo
9 Daihatsu
7 Datsun
10 De Lorean
5 Dino
5 Dodge

We can easily test to make sure that every row in our file is at least 3 characters long (keep in mind that `len(row)` also counts the trailing newline character, where there is one):

In [45]:
with open('car-brands.txt') as f:
    result = all(map(lambda row: len(row) >= 3, f))
print(result)

True


And we can test to see if any line is more than 10 characters:

In [46]:
with open('car-brands.txt') as f:
    result = any(map(lambda row: len(row) > 10, f))
print(result)

True


More than 13?

In [47]:
with open('car-brands.txt') as f:
    result = any(map(lambda row: len(row) > 13, f))
print(result)

False


Of course, we can also do this using generator expressions instead of `map`:

In [48]:
with open('car-brands.txt') as f:
    result = any(len(row) > 13 for row in f)
print(result)

False
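Rows read from a text file keep their trailing newline, so `len(row)` is usually one more than the length of the brand name itself. If we want to measure the name only, we can strip the newline first. In this sketch an `io.StringIO` object with made-up contents stands in for the open file:

```python
import io

# stand-in for an open text file; the contents are made up for this sketch
f = io.StringIO('Alfa Romeo\nAudi\nBMW\n')
raw_lengths = [len(row) for row in f]

f.seek(0)  # rewind so we can read the same rows again
stripped_lengths = [len(row.rstrip('\n')) for row in f]
```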


##  Grouping

If you're familiar with SQL and the `group by` clause, then this will be familiar to you (with the exception that in SQL the order in which rows are selected does not affect the group by - i.e. we have an automatic implicit sort on the group by key - not so here).

If you're not familiar with the `group by` in SQL, let's consider an example to understand what's going on:

Let's look at the file `cars_2014.csv`:

In [1]:
import itertools

with open('cars_2014.csv') as f:
    for row in itertools.islice(f, 0, 20):
        print(row, end = '')

make,model
ACURA,ILX
ACURA,MDX
ACURA,RDX
ACURA,RLX
ACURA,TL
ACURA,TSX
ALFA ROMEO,4C
ALFA ROMEO,GIULIETTA
APRILIA,CAPONORD 1200
APRILIA,RSV4 FACTORY APRC ABS
APRILIA,RSV4 R APRC ABS
APRILIA,SHIVER 750
ARCTIC CAT,1000 XT
ARCTIC CAT,500 XT
ARCTIC CAT,550 XT
ARCTIC CAT,700 LTD
ARCTIC CAT,700 SUPER DUTY DIESEL
ARCTIC CAT,700 XT
ARCTIC CAT,90 2X4 4-STROKE


This file contains car make and model ordered by make (so all the same makes are together in the file already) and then model.

We may want to know how many models exist for each make.

This is what a group by is used for: we need to make groups of makes, then count the number of items in each group.

Trivial to do with SQL, but a little more work with Python.

We might try doing it this way:

In [2]:
from collections import defaultdict

makes = defaultdict(int)

with open('cars_2014.csv') as f:
    next(f)  # skip header row
    for row in f:
        make, _ = row.strip('\n').split(',')
        makes[make] += 1
        
for key, value in makes.items():
    print(f'{key}: {value}')

ACURA: 6
ALFA ROMEO: 2
APRILIA: 4
ARCTIC CAT: 96
ARGO: 4
ASTON MARTIN: 5
AUDI: 27
BENTLEY: 2
BLUE BIRD: 1
BMW: 86
BUGATTI: 1
BUICK: 5
CADILLAC: 7
CAN-AM: 61
CHEVROLET: 33
CHRYSLER: 2
DODGE: 7
DUCATI: 4
FERRARI: 6
FIAT: 2
FORD: 34
FREIGHTLINER: 7
GMC: 12
HARLEY DAVIDSON: 29
HINO: 7
HONDA: 91
HUSABERG: 4
HUSQVARNA: 9
HYUNDAI: 13
INDIAN: 3
INFINITI: 8
JAGUAR: 9
JEEP: 5
JOHN DEERE: 19
KAWASAKI: 59
KENWORTH: 11
KIA: 10
KTM: 13
KUBOTA: 4
KYMCO: 28
LAMBORGHINI: 2
LAND ROVER: 6
LEXUS: 14
LINCOLN: 6
LOTUS: 1
MACK: 9
MASERATI: 3
MAZDA: 5
MCLAREN: 2
MERCEDES-BENZ: 60
MINI: 3
MITSUBISHI: 8
NISSAN: 24
PEUGEOT: 3
POLARIS: 101
PORSCHE: 4
RAM: 6
RENAULT: 4
ROLLS ROYCE: 3
SCION: 5
SEAT: 3
SKI-DOO: 67
SMART: 1
SRT: 1
SUBARU: 10
SUZUKI: 48
TESLA: 2
TOYOTA: 19
TRIUMPH: 10
VESPA: 4
VICTORY: 14
VOLKSWAGEN: 16
VOLVO: 8
YAMAHA: 110


Instead of doing all this, we could use the `groupby` function in `itertools`.

Again, it is a lazy iterator, so we'll use lists to see what's happening - but let's use a slightly smaller data set as an example first:

In [3]:
data = (1, 1, 2, 2, 3)

In [4]:
list(itertools.groupby(data))

[(1, <itertools._grouper at 0x204a6988dd8>),
 (2, <itertools._grouper at 0x204a69883c8>),
 (3, <itertools._grouper at 0x204a6988208>)]

As you can see, we ended up with an iterable of tuples. The first element of each tuple is the group key - here the distinct numbers in `data`, so `1`, `2`, and `3`. But what's in the second element of the tuple? Well it's an iterator, but what does it contain?

In [5]:
it = itertools.groupby(data)
for group in it:
    print(group[0], list(group[1]))

1 [1, 1]
2 [2, 2]
3 [3]


Basically it just contained the grouped elements themselves.

This might seem a bit confusing at first - so let's look at the second, optional, argument of `groupby` - it is a key. Basically the idea behind that key is the same as the sort keys, or filter keys we have worked with in the past. It is a **function** that returns a grouping key.

Let's try it out with a simple example:

In [6]:
data = (
    (1, 'abc'),
    (1, 'bcd'),
   
    (2, 'pyt'),
    (2, 'yth'),
    (2, 'tho'),
    
    (3, 'hon')
)

So we want to group the data, using the first item of each tuple as the group key:

In [7]:
groups = list(itertools.groupby(data, key=lambda x: x[0]))

In [8]:
print(groups)

[(1, <itertools._grouper object at 0x00000204A6990C50>), (2, <itertools._grouper object at 0x00000204A6990BE0>), (3, <itertools._grouper object at 0x00000204A6990BA8>)]


Once again you'll notice that we have the group keys, and some iterable. Let's see what those contain:

In [9]:
groups = itertools.groupby(data, key=lambda x: x[0])
for group in groups:
    print(group[0], list(group[1]))

1 [(1, 'abc'), (1, 'bcd')]
2 [(2, 'pyt'), (2, 'yth'), (2, 'tho')]
3 [(3, 'hon')]


So now let's go back to our car make example.

We want to get all the makes and how many models are in each make.

We could start approaching it this way:

In [10]:
with open('cars_2014.csv') as f:
    make_groups = itertools.groupby(f, key=lambda x: x.split(',')[0])

In [11]:
list(itertools.islice(make_groups, 5))

ValueError: I/O operation on closed file.

What's going on?

Remember that `groupby` is a **lazy** iterator. This means it did not actually do any work when we called it apart from setting up the iterator.

When we called `list()` on that iterator, **then** it went ahead and tried to do the iteration.

However, our `with` (context manager) closed the file by then!

So we will need to do our work inside the context manager.

In [12]:
with open('cars_2014.csv') as f:
    next(f)  # skip header row
    make_groups = itertools.groupby(f, key=lambda x: x.split(',')[0])
    print(list(itertools.islice(make_groups, 5)))

[('ACURA', <itertools._grouper object at 0x00000204A69974A8>), ('ALFA ROMEO', <itertools._grouper object at 0x00000204A6997438>), ('APRILIA', <itertools._grouper object at 0x00000204A65C01D0>), ('ARCTIC CAT', <itertools._grouper object at 0x00000204A6990198>), ('ARGO', <itertools._grouper object at 0x00000204A69885F8>)]


Next, we need to know how many items are in each of the `itertools._grouper` iterators.

How about using the `len()` function on the iterator?

In [13]:
with open('cars_2014.csv') as f:
    next(f)  # skip header row
    make_groups = itertools.groupby(f, key=lambda x: x.split(',')[0])
    make_counts = ((key, len(models)) for key, models in make_groups)
    print(list(make_counts))

TypeError: object of type 'itertools._grouper' has no len()

Aww... Iterators don't necessarily implement a `__len__` method - and this one definitely does not.

Well, if we think about this, we could simply "replace" each element in 
the models, with a `1`, and sum that up...

In [14]:
with open('cars_2014.csv') as f:
    next(f)  # skip header row
    make_groups = itertools.groupby(f, key=lambda x: x.split(',')[0])
    make_counts = ((key, sum(1 for model in models)) 
                    for key, models in make_groups)
    print(list(make_counts))

[('ACURA', 6), ('ALFA ROMEO', 2), ('APRILIA', 4), ('ARCTIC CAT', 96), ('ARGO', 4), ('ASTON MARTIN', 5), ('AUDI', 27), ('BENTLEY', 2), ('BLUE BIRD', 1), ('BMW', 86), ('BUGATTI', 1), ('BUICK', 5), ('CADILLAC', 7), ('CAN-AM', 61), ('CHEVROLET', 33), ('CHRYSLER', 2), ('DODGE', 7), ('DUCATI', 4), ('FERRARI', 6), ('FIAT', 2), ('FORD', 34), ('FREIGHTLINER', 7), ('GMC', 12), ('HARLEY DAVIDSON', 29), ('HINO', 7), ('HONDA', 91), ('HUSABERG', 4), ('HUSQVARNA', 9), ('HYUNDAI', 13), ('INDIAN', 3), ('INFINITI', 8), ('JAGUAR', 9), ('JEEP', 5), ('JOHN DEERE', 19), ('KAWASAKI', 59), ('KENWORTH', 11), ('KIA', 10), ('KTM', 13), ('KUBOTA', 4), ('KYMCO', 28), ('LAMBORGHINI', 2), ('LAND ROVER', 6), ('LEXUS', 14), ('LINCOLN', 6), ('LOTUS', 1), ('MACK', 9), ('MASERATI', 3), ('MAZDA', 5), ('MCLAREN', 2), ('MERCEDES-BENZ', 60), ('MINI', 3), ('MITSUBISHI', 8), ('NISSAN', 24), ('PEUGEOT', 3), ('POLARIS', 101), ('PORSCHE', 4), ('RAM', 6), ('RENAULT', 4), ('ROLLS ROYCE', 3), ('SCION', 5), ('SEAT', 3), ('SKI-DOO', 6

#### Caveat

I want to show you something that you may find odd at first. Notice how I iterated through the groups.

Maybe I want to be able to iterate multiple times through that iterator, so let's make a list out of it first:

In [15]:
groups = list(itertools.groupby(data, key=lambda x: x[0]))
for group in groups:
    print(group[0], group[1])

1 <itertools._grouper object at 0x00000204A6A33080>
2 <itertools._grouper object at 0x00000204A6A330B8>
3 <itertools._grouper object at 0x00000204A6A33128>


Ok, so this looks fine - we now have a list containing tuples - the first element is the group key, the second is an iterator - we can check that easily:

In [16]:
it = groups[0][1]

In [17]:
iter(it) is it

True

So yes, this is an iterator - what's in it?

In [18]:
list(it)

[]

Empty?? But we did not iterate through it - what happened?

Let's try again, just in case calling the `iter` method did something odd:

In [19]:
groups = list(itertools.groupby(data, key=lambda x: x[0]))
for group in groups:
    print(group[0], list(group[1]))

1 []
2 []
3 [(3, 'hon')]


So, the 3rd element is OK, but looks like the first two got exhausted somehow...

Let's make sure they are indeed exhausted:

In [20]:
groups = list(itertools.groupby(data, key=lambda x: x[0]))

In [21]:
next(groups[0][1])

StopIteration: 

In [22]:
next(groups[1][1])

StopIteration: 

In [23]:
next(groups[2][1])

(3, 'hon')

So, yes, the first two were exhausted when we converted the groups to a list.

The solution here is actually in the Python docs. Let's take a look:

```
The returned group is itself an iterator that shares the underlying iterable with groupby(). Because the source is shared, when the groupby() object is advanced, the previous group is no longer visible. So, if that data is needed later, it should be stored as a list
```

The key thing here is that the elements yielded from the different groups are using the **same** underlying iterable over all the elements. As the documentation states, when we advance to the next group, the previous one's iterator is automatically exhausted - it basically iterates over all the elements until it hits the next group key.

Let's see this by stepping through the iteration manually:

In [24]:
groups = itertools.groupby(data, key=lambda x: x[0])

In [25]:
group1 = next(groups)

In [26]:
group1

(1, <itertools._grouper at 0x204a69905f8>)

And the iterator in the tuple is not exhausted:

In [27]:
next(group1[1])

(1, 'abc')

Now, let's try again, but this time we'll advance to group2, and see what is in `group1`'s iterator:

In [28]:
groups = itertools.groupby(data, key=lambda x: x[0])

In [29]:
group1 = next(groups)

In [30]:
group2 = next(groups)

Now `group1`'s iterator has been exhausted (because we moved to `group2`):

In [31]:
next(group1[1])

StopIteration: 

But `group2`'s iterator is still OK:

In [32]:
next(group2[1])

(2, 'pyt')

We know that there are still two elements in `group2`, so let's advance to `group3` and go back and see what's left in `group2`'s iterator:

In [33]:
group3 = next(groups)

In [34]:
next(group2[1])

StopIteration: 

But `group3`'s iterator is just fine:

In [35]:
next(group3[1])

(3, 'hon')

So, just be careful here with `groupby()` - if you want to save all the data into a list you cannot first convert the groups into a list - you **must** step through the groups iterator, and retrieve each individual iterator's elements into a list, the way we did it in the first example, or simply using a comprehension:

In [36]:
groups = itertools.groupby(data, key=lambda x: x[0])

In [37]:
groups_list = [(key, list(items)) for key, items in groups]

In [38]:
groups_list

[(1, [(1, 'abc'), (1, 'bcd')]),
 (2, [(2, 'pyt'), (2, 'yth'), (2, 'tho')]),
 (3, [(3, 'hon')])]

# Section 09 - Project 4

##  Project

For this project you have 4 files containing information about persons.

The files are:
* `personal_info.csv` -   personal information such as name, gender, etc. (one row per person)
* `vehicles.csv` -   what vehicle people own (one row per person)
* `employment.csv` -   where a person is employed (one row per person)
* `update_status.csv` -   when the person's data was created and last updated

Each file contains a key, `SSN`, which **uniquely** identifies a person.

This key is present in **all** four files.

You are guaranteed that the same SSN value is present in **every** file, and that it only appears **once per file**.

In addition, the files are all sorted by SSN, i.e. the SSN values appear in the same order in each file.

##### Goal 1

Your first task is to create iterators for each of the four files that contain cleaned up data, of the correct type (e.g. string, int, date, etc), and represented by a named tuple.

For now these four iterators are just separate, independent iterators.

##### Goal 2

Create a single iterable that combines all the columns from all the iterators.

The iterable should yield named tuples containing all the columns.
Make sure that the SSNs across the files match!

All the files are guaranteed to be in SSN sort order, and every SSN is unique, and every SSN appears in every file.

Make sure the SSN is not repeated 4 times - one time per row is enough!

##### Goal 3

Next, you want to identify any stale records, where stale simply means the record has not been updated since 3/1/2017 (i.e. last update date < 3/1/2017). Create an iterator that only contains current records (i.e. not stale) based on the `last_updated` field from the `update_status` file.

##### Goal 4

Find the largest group of car makes for each gender.

Possibly more than one such group per gender exists (equal sizes).

#### Hints

You will not be able to use a simple split approach here, as I explain in the video.

Instead you should use the `csv` module and the `reader` function.

Here's a simple example of how to use it - you will need to expand on this for your project goals, but this is a good starting point.

In [1]:
import csv

def read_file(file_name):
    with open(file_name) as f:
        rows = csv.reader(f, delimiter=',', quotechar='"')
        yield from rows
    

In [2]:
from itertools import islice

rows = read_file('personal_info.csv')
for row in islice(rows, 5):
    print(row)

['ssn', 'first_name', 'last_name', 'gender', 'language']
['100-53-9824', 'Sebastiano', 'Tester', 'Male', 'Icelandic']
['101-71-4702', 'Cayla', 'MacDonagh', 'Female', 'Lao']
['101-84-0356', 'Nomi', 'Lipprose', 'Female', 'Yiddish']
['104-22-0928', 'Justinian', 'Kunzelmann', 'Male', 'Dhivehi']


As you can see, the data is already separated into a list containing the individual fields - but of course they are all just strings.
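
One possible way to start on Goal 1, shown here only as a rough sketch: use the header row to build a named tuple class and wrap every remaining row in it. The helper name `parse_rows` and the class name `Person` are hypothetical, the sample rows are copied from the output above, and all fields are still left as strings, so the type conversions remain up to you.

```python
from collections import namedtuple

def parse_rows(rows, class_name='Row'):
    """Yield one named tuple per data row, using the first row as field names."""
    rows = iter(rows)
    header = next(rows)                        # e.g. ['ssn', 'first_name', ...]
    RowClass = namedtuple(class_name, header)
    for row in rows:
        yield RowClass(*row)                   # values are still plain strings

# sample data copied from the output above (normally this would come from read_file)
sample = [
    ['ssn', 'first_name', 'last_name', 'gender', 'language'],
    ['100-53-9824', 'Sebastiano', 'Tester', 'Male', 'Icelandic'],
    ['101-71-4702', 'Cayla', 'MacDonagh', 'Female', 'Lao'],
]

people = list(parse_rows(sample, 'Person'))
print(people[0].first_name, people[1].language)
```

Because `parse_rows` is a generator function, it stays lazy and can be chained directly onto the iterator returned by `read_file`.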

### Good luck!

# Section 10 - Context Managers

##  Context Managers in Python

You should be familiar with `try` and `finally`.

We use the `finally` block to make sure a piece of code is executed, whether an exception has happened or not:

In [1]:
try:
    10 / 2
except ZeroDivisionError:
    print('Zero division exception occurred')
finally:
    print('finally ran!')    

finally ran!


In [2]:
try:
    1 / 0
except ZeroDivisionError:
    print('Zero division exception occurred')
finally:
    print('finally ran!')

Zero division exception occurred
finally ran!


You'll see that in both instances, the `finally` block was executed. Even if an exception is raised in the `except` block, the `finally` block will **still** execute!
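
The claim about exceptions raised inside the `except` block can be checked with a small sketch. The function name `fail_in_except` is made up for this illustration:

```python
def fail_in_except():
    try:
        1 / 0
    except ZeroDivisionError:
        # a new exception is raised while handling the first one
        raise ValueError('raised inside except')
    finally:
        # still runs before the ValueError leaves the function
        print('finally ran!')

try:
    fail_in_except()
except ValueError as ex:
    caught = ex
    print('caught:', ex)
```

The `finally` block prints its message first, and only then does the `ValueError` reach the outer handler.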

This holds even if the `finally` block is inside a function and the `try` or `except` block executes a `return` statement:

In [3]:
def my_func():
    try:
        1/0
    except:
        return
    finally:
        print('finally running...')

In [4]:
my_func()

finally running...


This is very handy to release resources even in cases where an exception occurs. For example making sure a file is closed after being opened:

In [5]:
f = open('test.txt', 'w')  # open before the try, so f always exists in finally
try:
    a = 1 / 0
except ZeroDivisionError:
    print('an exception occurred...')
finally:
    print('Closing file...')
    f.close()

an exception occurred...
Closing file...


We should **always** do that when dealing with files.

But that can get cumbersome...

So, there is a better way.

Let's talk about context managers, and the pattern we are trying to solve:

1. Run some code to create some object(s)
2. Work with object(s)
3. Run some code when done to clean up object(s)

Context managers do precisely that.

We use a context manager to create and clean up some objects. The key point is that the cleanup needs to happen automatically - we should not have to write code such as the `try...except...finally` code we saw above.

When we use context managers in conjunction with the `with` statement, we end up with the "cleanup" phase happening as soon as the `with` statement finishes:

In [6]:
with open('test.txt', 'w') as file:
    print('inside with: file closed?', file.closed)
print('after with: file closed?', file.closed)

inside with: file closed? False
after with: file closed? True


This works even in this case:

In [7]:
def test():
    with open('test.txt', 'w') as file:
        print('inside with: file closed?', file.closed)
        return file

As you can see, we return directly out of the `with` block...

In [8]:
file = test()

inside with: file closed? False


In [9]:
file.closed

True

And yet, the file was still closed.

It also works even if we have an exception in the middle of the block:

In [10]:
with open('test.txt', 'w') as f:
    print('inside with: file closed?', f.closed)
    raise ValueError()

inside with: file closed? False


ValueError: 

In [11]:
print('after with: file closed?', f.closed)

after with: file closed? True


Context managers can be used for more than just opening and closing files.

If we think about it there are two phases to a context manager:
1. when the `with` statement is executing: we **enter** the context
2. when the `with` block is done: we **exit** the context

We can create our own context manager using a class that implements an `__enter__` method which is executed when we enter the context, and an `__exit__` method that is executed when we exit the context.

There is a general pattern that context managers can help us deal with:
* Open - Close
* Lock - Release
* Change - Reset
* Enter - Exit
* Start - Stop

The `__enter__` method is quite straightforward. It can (but does not have to) return one or more objects we then use inside the `with` block.

The `__exit__` method however is slightly more complicated.

1. It should return a truthy or falsy value (typically `True` or `False`). This tells Python whether to suppress any exception that occurred in the `with` block. As we saw with files, the exception was not suppressed - i.e. the file's `__exit__` returns a falsy value.
2. If an exception does occur in the `with` block, the exception information is passed to the `__exit__` method - so it needs three parameters: the exception type, the exception value and the traceback. If no exception occurred, then those values will simply be `None`.

We haven't covered exceptions in detail yet, so let's quickly see what those three things are:

In [12]:
def my_func():
    return 1.0 / 0.0

my_func()

ZeroDivisionError: float division by zero

The exception type here is `ZeroDivisionError`.

The exception value is `float division by zero`.

The traceback is an object of type `traceback` (that itself points to other `traceback` objects forming the trace stack) used to generate that text shown in the output.
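
To see these three pieces without going into the details of `traceback` objects, here is a small sketch that catches the exception and pulls them out of the exception object. The variable names simply mirror the `__exit__` parameters used later in this section:

```python
def my_func():
    return 1.0 / 0.0

try:
    my_func()
except ZeroDivisionError as ex:
    exc_type = type(ex)            # the exception class
    exc_value = ex                 # the exception instance (carries the message)
    exc_tb = ex.__traceback__      # the traceback object attached to the exception

print(exc_type)
print(exc_value)
print(type(exc_tb))
```

These are the same three values that Python hands to `__exit__` when an exception escapes a `with` block.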

I am not going to cover `traceback` objects at this point - we'll do this in a future part (OOP) of this series.

Let's go ahead and create a context manager:

In [13]:
class MyContext:
    def __init__(self):
        self.obj = None
        
    def __enter__(self):
        print('entering context...')
        self.obj = 'the Return Object'
        return self.obj

    def __exit__(self, exc_type, exc_value, exc_traceback):
        print('exiting context...')
        if exc_type:
            print(f'*** Error occurred: {exc_type}, {exc_value}')
        return False  # do not suppress exceptions

We can even cause an exception inside the `with` block:

In [14]:
with MyContext() as obj:
    raise ValueError

entering context...
exiting context...
*** Error occurred: <class 'ValueError'>, 


ValueError: 

As you can see, the `__exit__` method was still called - which is exactly what we wanted in the first place. Also, the exception that was raised inside the `with` block still propagated.

We can change that by returning `True` from the `__exit__` method:

In [15]:
class MyContext:
    def __init__(self):
        self.obj = None
        
    def __enter__(self):
        print('entering context...')
        self.obj = 'the Return Object'
        return self.obj

    def __exit__(self, exc_type, exc_value, exc_traceback):
        print('exiting context...')
        if exc_type:
            print(f'*** Error occurred: {exc_type}, {exc_value}')
        return True  # suppress exceptions

In [16]:
with MyContext() as obj:
    raise ValueError
print('reached here without an exception...')

entering context...
exiting context...
*** Error occurred: <class 'ValueError'>, 
reached here without an exception...
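
Returning `True` unconditionally hides every exception, which is rarely what we want. A more selective variant is sketched below, under the assumption that we only want to silence specific exception types. The class name `IgnoreErrors` is hypothetical; the standard library offers `contextlib.suppress` for this purpose.

```python
class IgnoreErrors:
    def __init__(self, *exc_types):
        self._exc_types = exc_types

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, exc_traceback):
        # truthy only when an exception occurred AND it is one we were asked to ignore
        return exc_type is not None and issubclass(exc_type, self._exc_types)

with IgnoreErrors(ZeroDivisionError):
    1 / 0                          # suppressed
print('division error was suppressed')

try:
    with IgnoreErrors(ZeroDivisionError):
        raise ValueError('not in the ignore list')
except ValueError:
    propagated = True              # other exceptions still propagate
    print('ValueError propagated')
```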


Look at the output of this code:

In [17]:
with MyContext() as obj:
    print('running inside with block...')
    print(obj)
print(obj)

entering context...
running inside with block...
the Return Object
exiting context...
the Return Object


Notice that the `obj` we obtained from the context manager still exists in our scope after the `with` statement.

The `with` statement does **not** have its own local scope - it's not a function!

However, the context manager can modify the object it returned when the context is exited:

In [18]:
class Resource:
    def __init__(self, name):
        self.name = name
        self.state = None

In [19]:
class ResourceManager:
    def __init__(self, name):
        self.name = name
        self.resource = None
        
    def __enter__(self):
        print('entering context')
        self.resource = Resource(self.name)
        self.resource.state = 'created'
        return self.resource
    
    def __exit__(self, exc_type, exc_value, exc_traceback):
        print('exiting context')
        self.resource.state = 'destroyed'
        if exc_type:
            print('error occurred')
        return False

In [20]:
with ResourceManager('spam') as res:
    print(f'{res.name} = {res.state}')
print(f'{res.name} = {res.state}')

entering context
spam = created
exiting context
spam = destroyed


We still have access to `res`, but its internal state was changed by the resource manager's `__exit__` method.

Although we already have a context manager for files built into Python, let's go ahead and write our own anyway - it's good practice.

In [21]:
class File:
    def __init__(self, name, mode):
        self.name = name
        self.mode = mode
        
    def __enter__(self):
        print('opening file...')
        self.file = open(self.name, self.mode)
        return self.file
    
    def __exit__(self, exc_type, exc_value, exc_traceback):
        print('closing file...')
        self.file.close()
        return False

In [22]:
with File('test.txt', 'w') as f:
    f.write('This is a late parrot!')

opening file...
closing file...


Even if we have an exception inside the `with` statement, our file will still get closed.
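
A quick way to check this, sketched with the same `File` class (minus the print calls) and a temporary directory so that no existing file is overwritten. The temporary path is an assumption of this example only:

```python
import os
import tempfile

class File:
    def __init__(self, name, mode):
        self.name = name
        self.mode = mode

    def __enter__(self):
        self.file = open(self.name, self.mode)
        return self.file

    def __exit__(self, exc_type, exc_value, exc_traceback):
        self.file.close()
        return False  # do not suppress the exception

path = os.path.join(tempfile.mkdtemp(), 'test.txt')

try:
    with File(path, 'w') as f:
        f.write('This is a late parrot!')
        raise ValueError('something went wrong')
except ValueError:
    # __exit__ already ran by the time we get here
    print('file closed?', f.closed)
```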

Same applies if we return out of the `with` block if we're inside a function:

In [23]:
def test():
    with File('test.txt', 'w') as f:
        f.write('This is a late parrot')
        if True:
            return f
        print(f.closed)
    print(f.closed)

In [24]:
f = test()

opening file...
closing file...


Note that the `__enter__` method can return anything, including the context manager itself.

If we wanted to, we could re-write our file context manager this way:

In [25]:
class File():
    def __init__(self, name, mode):
        self.name = name
        self.mode = mode
        
    def __enter__(self):
        print('opening file...')
        self.file = open(self.name, self.mode)
        return self
    
    def __exit__(self, exc_type, exc_value, exc_traceback):
        print('closing file...')
        self.file.close()
        return False

Of course, now we would have to use the context manager object's `file` attribute to get a handle to the file:

In [26]:
with File('test.txt', 'r') as file_ctx:
    print(next(file_ctx.file))
    print(file_ctx.name)
    print(file_ctx.mode)

opening file...
This is a late parrot
test.txt
r
closing file...


##  Additional Uses

Remember what I said in the last lecture about some common patterns we can implement with context managers:

* Open - Close
* Change - Reset
* Start - Stop

The open file context manager is an example of the **Open - Close** pattern. But we have other ones as well.

#### Decimal Contexts

Decimals have a context which can be used to define many things, such as precision, rounding mechanism, etc.

By default, Decimals have a "global" context - i.e. one that will apply to any Decimal object by default:

In [1]:
import decimal

In [2]:
decimal.getcontext()

Context(prec=28, rounding=ROUND_HALF_EVEN, Emin=-999999, Emax=999999, capitals=1, clamp=0, flags=[], traps=[InvalidOperation, DivisionByZero, Overflow])

If we create a decimal object, then it will use those settings.

We can certainly change the properties of that global context:

In [3]:
decimal.getcontext().prec=14

In [4]:
decimal.getcontext()

Context(prec=14, rounding=ROUND_HALF_EVEN, Emin=-999999, Emax=999999, capitals=1, clamp=0, flags=[], traps=[InvalidOperation, DivisionByZero, Overflow])

And now the default (global) context has a precision set to 14.

Let's reset it back to 28:

In [5]:
decimal.getcontext().prec = 28

Suppose now that we just want to temporarily change something in the context - we would have to do something like this:

In [6]:
old_prec = decimal.getcontext().prec
decimal.getcontext().prec = 4
print(decimal.Decimal(1) / decimal.Decimal(3))
decimal.getcontext().prec = old_prec
print(decimal.Decimal(1) / decimal.Decimal(3))

0.3333
0.3333333333333333333333333333


Of course, it is kind of a pain to have to store the current value, set it to something new, and then remember to set it back to its original value.

How about writing a context manager to do that seamlessly for us:

In [7]:
class precision:
    def __init__(self, prec):
        self.prec = prec
        self.current_prec = decimal.getcontext().prec
        
    def __enter__(self):
        decimal.getcontext().prec = self.prec
        
    def __exit__(self, exc_type, exc_value, exc_traceback):
        decimal.getcontext().prec = self.current_prec
        return False      
        

Now we can do this:

In [8]:
with precision(3):
    print(decimal.Decimal(1) / decimal.Decimal(3))
print(decimal.Decimal(1) / decimal.Decimal(3))    

0.333
0.3333333333333333333333333333


And as you can see, the precision was set back to its original value once the context was exited.

In fact, the `decimal` module already provides a context manager, `localcontext()`, and it's way better than ours, because we can set not only the precision, but anything else we want:

In [9]:
with decimal.localcontext() as ctx:
    ctx.prec = 3
    print(decimal.Decimal(1) / decimal.Decimal(3))
print(decimal.Decimal(1) / decimal.Decimal(3))

0.333
0.3333333333333333333333333333


So this is an example of using a context manager for a **Change - Reset** type of situation.
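
As a small illustration of changing more than the precision, the sketch below also switches the rounding mode inside the local context. The choice of `ROUND_UP` is arbitrary and only meant to make the effect visible:

```python
import decimal

with decimal.localcontext() as ctx:
    ctx.prec = 3
    ctx.rounding = decimal.ROUND_UP      # round away from zero, just for this block
    inside = decimal.Decimal(1) / decimal.Decimal(3)

outside = decimal.Decimal(1) / decimal.Decimal(3)

print(inside)    # computed with 3 digits and ROUND_UP
print(outside)   # computed with the unchanged global context
```

Both settings are reset together when the block exits, without us having to save and restore each one by hand.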

#### Timing a with block

Here's another example of a **Start - Stop** type of context manager.
We'll create a context manager to time the code inside the `with` block:

In [10]:
from time import perf_counter, sleep

class Timer:
    def __init__(self):
        self.elapsed = 0
        
    def __enter__(self):
        self.start = perf_counter()
        return self
    
    def __exit__(self, exc_type, exc_value, exc_traceback):
        self.stop = perf_counter()
        self.elapsed = self.stop - self.start
        return False

You'll note that this time we are returning the context manager itself from the `__enter__` method. This will allow us to look at the `elapsed` attribute of the context manager once the `with` statement has finished running.

In [11]:
with Timer() as timer:
    sleep(1)
print(timer.elapsed)

0.9993739623039163


#### Redirecting stdout

Here we are going to temporarily redirect `stdout` to a file instead of the console:

In [12]:
import sys

class OutToFile:
    def __init__(self, fname):
        self._fname = fname
        self._current_stdout = sys.stdout
        
    def __enter__(self):
        self._file = open(self._fname, 'w')
        sys.stdout = self._file
        
    def __exit__(self, exc_type, exc_value, exc_tb):
        sys.stdout = self._current_stdout
        if self._file:
            self._file.close()
        return False

In [13]:
with OutToFile('test.txt'):
    print('Line 1')
    print('Line 2')

As you can see, no output happened on the console... Instead the output went to the file we specified. And our print statements will now output to the console again:

In [14]:
print('back to console output')

back to console output


In [15]:
with open('test.txt') as f:
    print(f.readlines())

['Line 1\n', 'Line 2\n']


#### HTML Tags

In this example, we're going to use a context manager to inject opening and closing HTML tags as we print to the console (of course we could redirect our prints somewhere else as we just saw!):

In [23]:
class Tag:
    def __init__(self, tag):
        self._tag = tag
        
    def __enter__(self):
        print(f'<{self._tag}>', end='')
        
    def __exit__(self, exc_type, exc_value, exc_tb):
        print(f'</{self._tag}>', end='')
        return False

In [17]:
with Tag('p'):
    print('some ', end='')
    with Tag('b'):
        print('bold', end='')
    print(' text', end='')

<p>some <b>bold</b> text</p>

#### Re-entrant Context Managers

We can also write context managers that can be re-entered in the sense that we can call `__enter__` and `__exit__` more than once on the **same** context manager. 

These methods are called when a `with` statement is used, so we'll need to be able to get our hands on the context manager object itself - but that's easy, we just return `self` from the `__enter__` method.

Let's write a `ListMaker` context manager to see how this works.

In [22]:
class ListMaker:
    def __init__(self, title, prefix='- ', indent=3):
        self._title = title
        self._prefix = prefix
        self._indent = indent
        self._current_indent = 0
        print(title)
        
    def __enter__(self):
        self._current_indent += self._indent
        return self
    
    def __exit__(self, exc_type, exc_value, exc_tb):
        self._current_indent -= self._indent
        return False
        
    def print(self, arg):
        s = ' ' * self._current_indent + self._prefix + str(arg)
        print(s)

Because `__enter__` is returning `self`, the instance of the context manager, we can call `with` on that context manager and it will automatically call the `__enter__` and `__exit__` methods. Each time we run `__enter__` we increase the indentation, each time we run `__exit__` we decrease the indentation.

Our `print` method then takes that into account when it prints the requested string argument.

In [19]:
with ListMaker('Items') as lm:
    lm.print('Item 1')
    with lm:
        lm.print('item 1a')
        lm.print('item 1b')
    lm.print('Item 2')
    with lm:
        lm.print('item 2a')
        lm.print('item 2b')
    

Items
   - Item 1
      - item 1a
      - item 1b
   - Item 2
      - item 2a
      - item 2b


Of course, we can easily redirect the output to a file instead, using the context manager we wrote earlier:

In [20]:
with OutToFile('my_list.txt'):
    with ListMaker('Items') as lm:
        lm.print('Item 1')
        with lm:
            lm.print('item 1a')
            lm.print('item 1b')
        lm.print('Item 2')
        with lm:
            lm.print('item 2a')
            lm.print('item 2b')

In [21]:
with open('my_list.txt') as f:
    for row in f:
        print(row, end='')

Items
   - Item 1
      - item 1a
      - item 1b
   - Item 2
      - item 2a
      - item 2b


##  Generators and Context Managers

Let's see how we might write something that almost behaves like a context manager, using a generator function:

In [1]:
def my_gen():
    try:
        print('creating context and yielding object')
        lst = [1, 2, 3, 4, 5]
        yield lst
    finally:
        print('exiting context and cleaning up')

In [2]:
gen = my_gen()  # create generator

In [3]:
lst = next(gen)  # enter context and get "as" object

creating context and yielding object


In [4]:
print(lst)

[1, 2, 3, 4, 5]


In [5]:
next(gen)  # exit context

exiting context and cleaning up


StopIteration: 

As you can see, the exiting context code ran.
But to make this cleaner, we'll catch the StopIteration exception:

In [6]:
gen = my_gen()
lst = next(gen)
print(lst)
try:
    next(gen)
except StopIteration:
    pass

creating context and yielding object
[1, 2, 3, 4, 5]
exiting context and cleaning up


Now let's write a context manager class that wraps the type of generator we just wrote, so that we can use it in a `with` statement instead:

In [7]:
class GenCtxManager:
    def __init__(self, gen_func):
        self._gen = gen_func()
        
    def __enter__(self):
        return next(self._gen)
    
    def __exit__(self, exc_type, exc_value, exc_tb):
        try:
            next(self._gen)
        except StopIteration:
            pass
        return False

In [8]:
with GenCtxManager(my_gen) as lst:
    print(lst)

creating context and yielding object
[1, 2, 3, 4, 5]
exiting context and cleaning up


Our `GenCtxManager` class is not very flexible - we cannot pass arguments to the generator function for example. We are also not doing any exception handling...

We could try some of this ourselves. 
For example handling arguments:

In [9]:
class GenCtxManager:
    def __init__(self, gen_func, *args, **kwargs):
        self._gen = gen_func(*args, **kwargs)
        
    def __enter__(self):
        return next(self._gen)
    
    def __exit__(self, exc_type, exc_value, exc_tb):
        try:
            next(self._gen)
        except StopIteration:
            pass
        return False

In [10]:
def open_file(fname, mode):
    try:
        print('opening file...')
        f = open(fname, mode)
        yield f
    finally:
        print('closing file...')
        f.close()

In [11]:
with GenCtxManager(open_file, 'test.txt', 'w') as f:
    print('writing to file...')
    f.write('testing...')

opening file...
writing to file...
closing file...


In [12]:
with open('test.txt') as f:
    print(next(f))

testing...


This works, but is not very elegant, and we still are not doing much exception handling. 
In the next video, we'll look at a decorator in the standard library that does this far more robustly and elegantly...

##  Using Decorators to Create Context Managers using Generator Functions

In the last video we saw how we could create a generic class that could take a generator function that had a specific structure and turn it into a context manager.

Let's see if we can do one step better, using a decorator instead.

Recall the basic structure our generator function needs to have:

```
def gen(args):
    # set up happens here, or inside try
    try:
        yield obj # whatever normally gets returned by __enter__
    finally:
        # perform clean up code here
```

First let's define a generator function to open a file, yield it, and then close it - same as the example we saw in the previous video:

In [1]:
def open_file(fname, mode='r'):
    print('opening file...')
    f = open(fname, mode)
    try:
        yield f
    finally:
        print('closing file...')
        f.close()

Next, let's re-create the context manager wrapper we did in the last video, but this time, I'm going to pass it a generator object, instead of a generator function. But basically the same idea:

In [2]:
class GenContextManager:
    def __init__(self, gen):
        self.gen = gen
        
    def __enter__(self):
        return next(self.gen)
        
    def __exit__(self, exc_type, exc_value, exc_tb):
        print('calling next to perform cleanup in generator')
        try:
            next(self.gen)
        except StopIteration:
            pass
        return False

At this point we can use this with our generator function as follows:

In [3]:
file_gen = open_file('test.txt', 'w')

with GenContextManager(file_gen) as f:
    f.write('Sir Spamalot')

opening file...
calling next to perform cleanup in generator
closing file...


And we can read back from the file too:

In [4]:
file_gen = open_file('test.txt')
with GenContextManager(file_gen) as f:
    print(f.readlines())

opening file...
['Sir Spamalot']
calling next to perform cleanup in generator
closing file...


Of course, our context manager object is not very robust - there is no exception handling for example. But let's leave that "minor detail" aside for now :-)

We still have to create the generator object from the generator function before we can use the context manager class.

We can simplify things even more by using a decorator:

In [5]:
def context_manager_dec(gen_fn):
    def helper(*args, **kwargs):
        gen = gen_fn(*args, **kwargs)
        ctx = GenContextManager(gen)
        return ctx
    return helper

Notice what this decorator does.

It decorates a generator function and returns `helper`. When we invoke `helper` it will create an instance of the generator, and create and return an instance of the context manager.

Let's try it out:

In [6]:
@context_manager_dec
def open_file(fname, mode='r'):
    print('opening file...')
    f = open(fname, mode)
    try:
        yield f
    finally:
        print('closing file...')
        f.close()    

In [7]:
with open_file('test.txt') as f:
    print(f.readlines())

opening file...
['Sir Spamalot']
calling next to perform cleanup in generator
closing file...


So now we have an approach to using a decorator to turn any generator function (that has the structure we mentioned earlier) into a context manager!

Our code was not very robust, either in the context manager class or in the decorator - and it would take quite a bit more work to make it so.
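
To give an idea of what "more robust" might involve, here is a rough sketch in which `__exit__` forwards an exception from the `with` block into the generator using the generator's `throw()` method, so that `except` clauses inside the generator get a chance to handle it. The class name `RobustGenContextManager` and the `managed` generator are hypothetical, and this is only an approximation: the standard library version deals with more edge cases.

```python
class RobustGenContextManager:
    def __init__(self, gen):
        self.gen = gen

    def __enter__(self):
        return next(self.gen)

    def __exit__(self, exc_type, exc_value, exc_tb):
        if exc_type is None:
            # normal exit: resume the generator so its cleanup code runs
            try:
                next(self.gen)
            except StopIteration:
                return False
            raise RuntimeError("generator didn't stop")
        if exc_value is None:
            exc_value = exc_type()
        try:
            # re-raise the exception inside the generator, at the yield
            self.gen.throw(exc_value)
        except StopIteration:
            return True        # the generator handled it: suppress
        except BaseException as exc:
            if exc is exc_value:
                return False   # not handled: let the original propagate
            raise              # the generator raised something different
        raise RuntimeError("generator didn't stop after throw()")

events = []

def managed():
    events.append('enter')
    try:
        yield 'resource'
    except KeyError:
        events.append('handled KeyError')
    finally:
        events.append('cleanup')

with RobustGenContextManager(managed()):
    raise KeyError('handled inside the generator, so it is suppressed')

print(events)
```

An exception the generator does not handle (for example a `ValueError` here) still triggers the `finally` cleanup and then propagates out of the `with` statement as usual.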

Fortunately the standard library already has this implemented for us - in fact that was one of the critical goals of Python's context managers - the ability to create context managers using generator functions (see PEP 343).

In [8]:
from contextlib import contextmanager

In [9]:
@contextmanager
def open_file(fname, mode='r'):
    print('opening file...')
    f = open(fname, mode)
    try:
        yield f
    finally:
        print('closing file...')
        f.close() 

In [10]:
with open_file('test.txt') as f:
    print(f.readlines())

opening file...
['Sir Spamalot']
closing file...


And of course, this works for more than just opening and closing files. 

Here are some more examples:

#### Example 1

Let's implement a timer.

In [11]:
from time import perf_counter, sleep

In [12]:
@contextmanager
def timer():
    stats = dict()
    start = perf_counter()
    stats['start'] = start
    yield stats
    end = perf_counter()
    stats['end'] = end
    stats['elapsed'] = end - start

In [13]:
with timer() as stats:
    sleep(1)

In [14]:
print(stats)

{'start': 2.708953062782967e-07, 'end': 1.0097690265350079, 'elapsed': 1.0097687556397017}
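
One thing to keep in mind: this `timer` has no `try...finally` around the `yield`, so if the code inside the `with` block raises an exception, the lines after the `yield` never run and `elapsed` is never recorded. A variant that still records the timing in that case might look like the sketch below; the short sleep and the `ValueError` are only there to demonstrate the behavior:

```python
from contextlib import contextmanager
from time import perf_counter, sleep

@contextmanager
def timer():
    stats = dict()
    start = perf_counter()
    stats['start'] = start
    try:
        yield stats
    finally:
        # runs whether or not the with block raised an exception
        end = perf_counter()
        stats['end'] = end
        stats['elapsed'] = end - start

try:
    with timer() as stats:
        sleep(0.1)
        raise ValueError('failure inside the timed block')
except ValueError:
    pass

print(stats['elapsed'])
```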


#### Example 2

In this example, let's redirect `stdout`.

In [15]:
import sys

In [16]:
@contextmanager
def out_to_file(fname):
    current_stdout = sys.stdout
    file = open(fname, 'w')
    sys.stdout = file
    try:
        yield None
    finally:
        file.close()
        sys.stdout = current_stdout

In [17]:
with out_to_file('test.txt'):
    print('line 1')
    print('line 2')

In [18]:
with open('test.txt') as f:
    print(f.readlines())

['line 1\n', 'line 2\n']


And of course, `stdout` is back to "normal":

In [19]:
print('line 1')

line 1


The `contextlib` module actually implements a `stdout` redirect context manager, so we technically don't have to write one ourselves.

The difference from the one we wrote is that it needs an open file object, not just a file name. So we would have to open the file, then redirect stdout. We can do this easily by nesting two context managers as follows:

In [20]:
from contextlib import redirect_stdout

In [21]:
with open('test.txt', 'w') as f:
    with redirect_stdout(f):
        print('Look on the bright side of life')

And we can check that this worked:

In [22]:
with open('test.txt') as f:
    print(f.readlines())

['Look on the bright side of life\n']
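
The two nested `with` statements can also be written as a single `with` statement that lists both context managers, separated by a comma. The sketch below writes to a temporary file, which is an assumption of this example so that `test.txt` is left alone:

```python
import os
import tempfile
from contextlib import redirect_stdout

path = os.path.join(tempfile.mkdtemp(), 'test.txt')

# contexts are entered left to right, and exited in reverse order
with open(path, 'w') as f, redirect_stdout(f):
    print('Look on the bright side of life')

with open(path) as f:
    lines = f.readlines()

print(lines)
```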


##  Caveat with Lazy Iterators

We have to be careful when working with context managers and lazy iterators.

Consider this example where we want to create a generator from a file:

In [5]:
import csv

def read_data():
    with open('nyc_parking_tickets_extract.csv') as f:
        return csv.reader(f, delimiter=',', quotechar='"')

In [6]:
for row in read_data():
    print(row)

ValueError: I/O operation on closed file.

As you can see, `read_data` returns a lazy iterator (`csv.reader`), but by the time we iterate over it, the `with` context that opened the file was exited, and the file was closed!

We have two possible solutions here:

The first one is not very desirable since it involves reading the entire file into memory by iterating the file and putting it into a list before we exit the `with` block:

In [7]:
def read_data():
    with open('nyc_parking_tickets_extract.csv') as f:
        return list(csv.reader(f, delimiter=',', quotechar='"'))

for row in read_data():
    print(row)

['Summons Number', 'Plate ID', 'Registration State', 'Plate Type', 'Issue Date', 'Violation Code', 'Vehicle Body Type', 'Vehicle Make', 'Violation Description']
['4006478550', 'VAD7274', 'VA', 'PAS', '10/5/2016', '5', '4D', 'BMW', 'BUS LANE VIOLATION']
['4006462396', '22834JK', 'NY', 'COM', '9/30/2016', '5', 'VAN', 'CHEVR', 'BUS LANE VIOLATION']
['4007117810', '21791MG', 'NY', 'COM', '4/10/2017', '5', 'VAN', 'DODGE', 'BUS LANE VIOLATION']
['4006265037', 'FZX9232', 'NY', 'PAS', '8/23/2016', '5', 'SUBN', 'FORD', 'BUS LANE VIOLATION']
['4006535600', 'N203399C', 'NY', 'OMT', '10/19/2016', '5', 'SUBN', 'FORD', 'BUS LANE VIOLATION']
['4007156700', '92163MG', 'NY', 'COM', '4/13/2017', '5', 'VAN', 'FRUEH', 'BUS LANE VIOLATION']
['4006687989', 'MIQ600', 'SC', 'PAS', '11/21/2016', '5', 'VN', 'HONDA', 'BUS LANE VIOLATION']
['4006943052', '2AE3984', 'MD', 'PAS', '2/1/2017', '5', 'SW', 'LINCO', 'BUS LANE VIOLATION']
['4007306795', 'HLG4926', 'NY', 'PAS', '5/30/2017', '5', 'SUBN', 'TOYOT', 'BUS LANE

The second method, the one we have used quite a bit, involves yielding each row from the csv reader:

In [8]:
def read_data():
    with open('nyc_parking_tickets_extract.csv') as f:
        yield from csv.reader(f, delimiter=',', quotechar='"')

for row in read_data():
    print(row)

['Summons Number', 'Plate ID', 'Registration State', 'Plate Type', 'Issue Date', 'Violation Code', 'Vehicle Body Type', 'Vehicle Make', 'Violation Description']
['4006478550', 'VAD7274', 'VA', 'PAS', '10/5/2016', '5', '4D', 'BMW', 'BUS LANE VIOLATION']
['4006462396', '22834JK', 'NY', 'COM', '9/30/2016', '5', 'VAN', 'CHEVR', 'BUS LANE VIOLATION']
['4007117810', '21791MG', 'NY', 'COM', '4/10/2017', '5', 'VAN', 'DODGE', 'BUS LANE VIOLATION']
['4006265037', 'FZX9232', 'NY', 'PAS', '8/23/2016', '5', 'SUBN', 'FORD', 'BUS LANE VIOLATION']
['4006535600', 'N203399C', 'NY', 'OMT', '10/19/2016', '5', 'SUBN', 'FORD', 'BUS LANE VIOLATION']
['4007156700', '92163MG', 'NY', 'COM', '4/13/2017', '5', 'VAN', 'FRUEH', 'BUS LANE VIOLATION']
['4006687989', 'MIQ600', 'SC', 'PAS', '11/21/2016', '5', 'VN', 'HONDA', 'BUS LANE VIOLATION']
['4006943052', '2AE3984', 'MD', 'PAS', '2/1/2017', '5', 'SW', 'LINCO', 'BUS LANE VIOLATION']
['4007306795', 'HLG4926', 'NY', 'PAS', '5/30/2017', '5', 'SUBN', 'TOYOT', 'BUS LANE
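
With the `yield from` approach, the file stays open for as long as the generator is suspended inside the `with` block. It is closed once the generator is exhausted, or earlier if we close the generator ourselves. The sketch below illustrates this with a small throwaway CSV file; the file, its contents and the `files` list used to inspect the file object are assumptions made only for this demonstration:

```python
import csv
import os
import tempfile

# hypothetical sample file, created only for this demonstration
path = os.path.join(tempfile.mkdtemp(), 'sample.csv')
with open(path, 'w', newline='') as f:
    csv.writer(f).writerows([['a', 'b'], ['1', '2'], ['3', '4']])

files = []  # keeps a reference to the file object so we can inspect it

def read_data(fname):
    with open(fname, newline='') as f:
        files.append(f)
        yield from csv.reader(f, delimiter=',', quotechar='"')

rows = read_data(path)
first = next(rows)               # the file is opened on the first next() call
print(first, files[0].closed)    # still inside the with block: file is open
rows.close()                     # raises GeneratorExit at the yield, exiting the with block
print(files[0].closed)           # the file has now been closed
```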

##  Not Just a Context Manager

Just because our class implements the context manager protocol does not mean it cannot do other things as well!

In fact the `open` function we use to open files can be used with or without a context manager:

In [3]:
f = open('test.txt', 'w')
f.write('this is a test')
f.close()

Here we did not use a context manager - the `open` function simply returned the file object - but we had to close the file ourselves, since no context was used.

On the other hand we can also use it with a context manager:

In [4]:
with open('test.txt') as f:
    print(f.readlines())

['this is a test']


We can implement classes that implement their own functionality as well as a context manager if we want to.

##### Example

In [2]:
class DataIterator:
    def __init__(self, fname):
        self._fname = fname
        self._f = None
    
    def __iter__(self):
        return self
    
    def __next__(self):
        row = next(self._f)
        return row.strip('\n').split(',')
    
    def __enter__(self):
        self._f = open(self._fname)
        return self
    
    def __exit__(self, exc_type, exc_value, exc_tb):
        if not self._f.closed:
            self._f.close()
        return False

In [3]:
with DataIterator('nyc_parking_tickets_extract.csv') as data:
    for row in data:
        print(row)

['Summons Number', 'Plate ID', 'Registration State', 'Plate Type', 'Issue Date', 'Violation Code', 'Vehicle Body Type', 'Vehicle Make', 'Violation Description']
['4006478550', 'VAD7274', 'VA', 'PAS', '10/5/2016', '5', '4D', 'BMW', 'BUS LANE VIOLATION']
['4006462396', '22834JK', 'NY', 'COM', '9/30/2016', '5', 'VAN', 'CHEVR', 'BUS LANE VIOLATION']
['4007117810', '21791MG', 'NY', 'COM', '4/10/2017', '5', 'VAN', 'DODGE', 'BUS LANE VIOLATION']
['4006265037', 'FZX9232', 'NY', 'PAS', '8/23/2016', '5', 'SUBN', 'FORD', 'BUS LANE VIOLATION']
['4006535600', 'N203399C', 'NY', 'OMT', '10/19/2016', '5', 'SUBN', 'FORD', 'BUS LANE VIOLATION']
['4007156700', '92163MG', 'NY', 'COM', '4/13/2017', '5', 'VAN', 'FRUEH', 'BUS LANE VIOLATION']
['4006687989', 'MIQ600', 'SC', 'PAS', '11/21/2016', '5', 'VN', 'HONDA', 'BUS LANE VIOLATION']
['4006943052', '2AE3984', 'MD', 'PAS', '2/1/2017', '5', 'SW', 'LINCO', 'BUS LANE VIOLATION']
['4007306795', 'HLG4926', 'NY', 'PAS', '5/30/2017', '5', 'SUBN', 'TOYOT', 'BUS LANE

Of course, we cannot use this iterator without also using the context manager since the file would not be opened otherwise:

In [4]:
data = DataIterator('nyc_parking_tickets_extract.csv')

In [6]:
for row in data:
    print(row)

TypeError: 'NoneType' object is not an iterator

But I want to point out that creating the context manager and using the `with` statement can be done in two steps if we want to:

In [8]:
data_iter = DataIterator('nyc_parking_tickets_extract.csv')

At this stage, the object has been created, but the `__enter__` method has not been called yet.

Once we use `with`, then the file will be opened, and the iterator will be ready for use:

In [10]:
with data_iter as data:
    for row in data:
        print(row)

['Summons Number', 'Plate ID', 'Registration State', 'Plate Type', 'Issue Date', 'Violation Code', 'Vehicle Body Type', 'Vehicle Make', 'Violation Description']
['4006478550', 'VAD7274', 'VA', 'PAS', '10/5/2016', '5', '4D', 'BMW', 'BUS LANE VIOLATION']
['4006462396', '22834JK', 'NY', 'COM', '9/30/2016', '5', 'VAN', 'CHEVR', 'BUS LANE VIOLATION']
['4007117810', '21791MG', 'NY', 'COM', '4/10/2017', '5', 'VAN', 'DODGE', 'BUS LANE VIOLATION']
['4006265037', 'FZX9232', 'NY', 'PAS', '8/23/2016', '5', 'SUBN', 'FORD', 'BUS LANE VIOLATION']
['4006535600', 'N203399C', 'NY', 'OMT', '10/19/2016', '5', 'SUBN', 'FORD', 'BUS LANE VIOLATION']
['4007156700', '92163MG', 'NY', 'COM', '4/13/2017', '5', 'VAN', 'FRUEH', 'BUS LANE VIOLATION']
['4006687989', 'MIQ600', 'SC', 'PAS', '11/21/2016', '5', 'VN', 'HONDA', 'BUS LANE VIOLATION']
['4006943052', '2AE3984', 'MD', 'PAS', '2/1/2017', '5', 'SW', 'LINCO', 'BUS LANE VIOLATION']
['4007306795', 'HLG4926', 'NY', 'PAS', '5/30/2017', '5', 'SUBN', 'TOYOT', 'BUS LANE
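
The `TypeError` raised earlier when iterating without `with` does not say much about what went wrong. One possible refinement, shown here only as a sketch, is to check for the missing file in `__next__` and raise a more descriptive error. The class name `GuardedDataIterator`, the error message and the small temporary CSV file are all assumptions made for illustration and are not part of the original class.

```python
import os
import tempfile

class GuardedDataIterator:
    """Hypothetical variant of DataIterator with an explicit 'not opened' check."""

    def __init__(self, fname):
        self._fname = fname
        self._f = None

    def __iter__(self):
        return self

    def __next__(self):
        if self._f is None:
            # the file is only opened in __enter__
            raise RuntimeError('file not opened - use this object in a with statement')
        row = next(self._f)
        return row.strip('\n').split(',')

    def __enter__(self):
        self._f = open(self._fname)
        return self

    def __exit__(self, exc_type, exc_value, exc_tb):
        if self._f is not None and not self._f.closed:
            self._f.close()
        return False

with tempfile.TemporaryDirectory() as tmp:
    # small throwaway file standing in for the real data file
    path = os.path.join(tmp, 'sample.csv')
    with open(path, 'w') as f:
        f.write('a,b\n1,2\n')

    with GuardedDataIterator(path) as data:
        rows = list(data)

    try:
        next(GuardedDataIterator(path))   # not entered, so no file is open
    except RuntimeError as ex:
        message = str(ex)

print(rows)
print(message)
```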

##  Nested Context Managers

In the last video we saw that we could nest context managers.

This is actually fairly common.

Suppose we need to open a number of files - using a `with` statement for each one means we would have to nest that many `with` statements as well.

For example, we want to "zip" three files. Let's look at the content of each file first:

In [1]:
with open('file1.txt') as f:
    for row in f:
        print(row, end='')
print('\n----------------')
with open('file2.txt') as f:
    for row in f:
        print(row, end='')
print('\n----------------')
with open('file3.txt') as f:
    for row in f:
        print(row, end='')

file1_line1
file1_line2
file1_line3
----------------
file2_line1
file2_line2
file2_line3
----------------
file3_line1
file3_line2
file3_line3

Now we want to combine the rows from each file, and print them out - like zipping together basically, except we need to strip out that pesky `\n`!

In [2]:
with open('file1.txt') as f1:
    with open('file2.txt') as f2:
        with open('file3.txt') as f3:
            while True:
                try:
                    f1_row = next(f1).strip('\n')
                    f2_row = next(f2).strip('\n')
                    f3_row = next(f3).strip('\n')
                except StopIteration:
                    break
                else:
                    print(f1_row + ',' + f2_row + ',' + f3_row)
            

file1_line1,file2_line1,file3_line1
file1_line2,file2_line2,file3_line2
file1_line3,file2_line3,file3_line3


As you can see we needed three levels of nested `with` statements.
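
When the number of files is fixed and known in advance, Python also accepts several context managers in a single `with` statement, separated by commas, which avoids the extra indentation. The sketch below assumes three small files created in a temporary directory (so it can run on its own) and uses `zip` in place of the manual `next` calls.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # create three throwaway files shaped like file1.txt, file2.txt, file3.txt
    paths = []
    for i in (1, 2, 3):
        path = os.path.join(tmp, f'file{i}.txt')
        with open(path, 'w') as f:
            f.write('\n'.join(f'file{i}_line{j}' for j in (1, 2, 3)))
        paths.append(path)

    combined = []
    # one with statement, three context managers, exited in reverse order
    with open(paths[0]) as f1, open(paths[1]) as f2, open(paths[2]) as f3:
        # zip stops as soon as the shortest file is exhausted
        for rows in zip(f1, f2, f3):
            combined.append(','.join(row.strip('\n') for row in rows))

for line in combined:
    print(line)
```

This only helps when the number of files is fixed at the time the code is written.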

Instead, we might try to approach the problem this way - but first let's write our own `open_file` context manager so we can easily see when the file is being opened and closed:

In [3]:
from contextlib import contextmanager

@contextmanager
def open_file(f_name):
    print(f'opening file {f_name}')
    f = open(f_name)
    try:
        yield f
    finally:
        print(f'closing file {f_name}')
        f.close()

First we are going to create (but not enter) the context managers, and store the enter and exit methods in some lists. We'll also create a list that will contain the values returned by the enter methods:

In [4]:
f_names = 'file1.txt', 'file2.txt', 'file3.txt'

enters = []
exits = []
for f_name in f_names:
    ctx = open_file(f_name)
    enters.append(ctx.__enter__)
    exits.append(ctx.__exit__)    

Now, we are going to enter the contexts by calling the `__enter__` methods, store the return values (the file objects), process the data, and then run all the `__exit__` methods (in reverse order!):

In [5]:
files = [enter() for enter in enters]

opening file file1.txt
opening file file2.txt
opening file file3.txt


In [6]:
while True:
    try:
        rows = [next(f).strip('\n') for f in files]
    except StopIteration:
        break
    else:
        row = ','.join(rows)
        print(row)

file1_line1,file2_line1,file3_line1
file1_line2,file2_line2,file3_line2
file1_line3,file2_line3,file3_line3


Now, we need to close all the files by calling the `__exit__` methods in reverse order, since we added them in the order in which the contexts were opened (i.e. we open from first file to last, but close from last opened file to first - think of how the context managers are nested).

Also, keep in mind that `__exit__` methods need those exception parameters - here we'll just use None for simplicity - we are not doing any exception handling!

In [7]:
for fn in exits[::-1]:
    fn(None, None, None)

closing file file3.txt
closing file file2.txt
closing file file1.txt


So, let's recap by putting all this code together and simplifying a few things along the way:

In [8]:
f_names = 'file1.txt', 'file2.txt', 'file3.txt'

exits = []
files = []
try:
    for f_name in f_names:
        ctx = open_file(f_name)
        files.append(ctx.__enter__())
        exits.append(ctx.__exit__)    
    
    while True:
        try:
            rows = [next(f).strip('\n') for f in files]
        except StopIteration:
            break
        else:
            row = ','.join(rows)
            print(row)
finally:
    for fn in exits[::-1]:
        fn(None, None, None)

opening file file1.txt
opening file file2.txt
opening file file3.txt
file1_line1,file2_line1,file3_line1
file1_line2,file2_line2,file3_line2
file1_line3,file2_line3,file3_line3
closing file file3.txt
closing file file2.txt
closing file file1.txt


What was simpler, this method or simply nesting the `with` blocks?

What if we were doing this with 100 files instead of just 3?

Or what if we did not know in advance how many files we had to zip together (maybe we're reading all the .txt files in a directory for example - there may be 1 file, 3 files, 10 files, we don't really know, and it can change over time)?

Maybe we can find a way to make this a little easier to use.

Let's try using a context manager to hold on to all these `__exit__` methods we want to use:

In [9]:
class NestedContexts:
    def __init__(self, *contexts):
        self._enters = []
        self._exits = []
        self._values = []
        
        for ctx in contexts:
            self._enters.append(ctx.__enter__)
            self._exits.append(ctx.__exit__)
        
    def __enter__(self):
        for enter in self._enters:
            self._values.append(enter())
        return self._values
    
    def __exit__(self, exc_type, exc_value, exc_tb):
        for exit in self._exits[::-1]:
            exit(exc_type, exc_value, exc_tb)
        return False

Now let's try to use it:

In [10]:
with NestedContexts(open_file('file1.txt'),
                   open_file('file2.txt'),
                   open_file('file3.txt')) as files:
    print('do work here')

opening file file1.txt
opening file file2.txt
opening file file3.txt
do work here
closing file file3.txt
closing file file2.txt
closing file file1.txt


As you can see, the files were opened, our work was done, and the files were closed.
Let's just add in some real work:

In [11]:
with NestedContexts(open_file('file1.txt'),
                   open_file('file2.txt'),
                   open_file('file3.txt')) as files:
    while True:
        try:
            rows = [next(f).strip('\n') for f in files]
        except StopIteration:
            break
        else:
            row = ','.join(rows)
            print(row)

opening file file1.txt
opening file file2.txt
opening file file3.txt
file1_line1,file2_line1,file3_line1
file1_line2,file2_line2,file3_line2
file1_line3,file2_line3,file3_line3
closing file file3.txt
closing file file2.txt
closing file file1.txt


This is much better, but specifying the context managers is still a little painful, having to list them all separately as the arguments to the `NestedContexts` manager.

We could simplify things somewhat by taking this approach:

In [12]:
f_names = 'file1.txt', 'file2.txt', 'file3.txt'
contexts = [open_file(f_name) for f_name in f_names]
with NestedContexts(*contexts) as files:
    while True:
        try:
            rows = [next(f).strip('\n') for f in files]
        except StopIteration:
            break
        else:
            row = ','.join(rows)
            print(row)

opening file file1.txt
opening file file2.txt
opening file file3.txt
file1_line1,file2_line1,file3_line1
file1_line2,file2_line2,file3_line2
file1_line3,file2_line3,file3_line3
closing file file3.txt
closing file file2.txt
closing file file1.txt


So, this works, and is actually quite workable, but we have to do some setup work before we can use the context manager.

We can try a slightly different approach where we create a method in our `NestedContexts` class that can be used to "register" contexts - so instead of creating the `NestedContexts` manager with an `__init__` that takes in all the contexts at once, we create the `NestedContexts` manager **first**, and then, inside the `with` block, we append the contexts we want to work with.

One main advantage of that approach is that we can add contexts to `NestedContexts` at any time in the `with` block - i.e. we can delay when and how we add contexts to the overarching context manager.

So, we'll do this by implementing a method in `NestedContexts` itself that will allow us to append a context manager, get the `__enter__` value out of it, and store the `__exit__` methods.

To do this we're going to take a slightly different approach - the `NestedContexts` manager is going to return itself in its `__enter__` method, instead of returning a list of the various context values returned from their respective `__enter__` methods:

In [13]:
class NestedContexts:
    def __init__(self):
        self._exits = []
        
    def __enter__(self):
        return self

    def enter_context(self, ctx):
        # enter first, so a failed __enter__ does not leave an __exit__ registered
        value = ctx.__enter__()
        self._exits.append(ctx.__exit__)
        return value
        
    def __exit__(self, exc_type, exc_value, exc_tb):
        for exit in self._exits[::-1]:
            exit(exc_type, exc_value, exc_tb)
        return False

Now let's try it again, but this time we'll "register" our contexts once the `NestedContexts` context has been entered:

In [14]:
f_names = 'file1.txt', 'file2.txt', 'file3.txt'

with NestedContexts() as stack:
    files = [stack.enter_context(open_file(f_name)) for f_name in f_names]

opening file file1.txt
opening file file2.txt
opening file file3.txt
closing file file3.txt
closing file file2.txt
closing file file1.txt


Nice! Now let's just do the work as well:

In [15]:
f_names = 'file1.txt', 'file2.txt', 'file3.txt'

with NestedContexts() as stack:
    files = [stack.enter_context(open_file(f_name)) for f_name in f_names]
    
    while True:
        try:
            rows = [next(f).strip('\n') for f in files]
        except StopIteration:
            break
        else:
            row = ','.join(rows)
            print(row)

opening file file1.txt
opening file file2.txt
opening file file3.txt
file1_line1,file2_line1,file3_line1
file1_line2,file2_line2,file3_line2
file1_line3,file2_line3,file3_line3
closing file file3.txt
closing file file2.txt
closing file file1.txt


Hopefully you can now see why I said we can decide to add contexts to that stack at any time inside the `with` statement - we are not restricted to adding them in the `__init__` - which means we can even use `if` statements to add contexts to the stack if we want to - this is far more flexible.

So, this is a common enough scenario that the standard library has something up its sleeve for us!

The `contextlib` module has an `ExitStack` context manager that works the same way as our `NestedContexts`, but, unlike our approach, it actually does exception handling properly too!

Let's see how we use it:

In [16]:
from contextlib import ExitStack

In [17]:
f_names = 'file1.txt', 'file2.txt', 'file3.txt'

with ExitStack() as stack:
    files = [stack.enter_context(open_file(f_name))
            for f_name in f_names]    

opening file file1.txt
opening file file2.txt
opening file file3.txt
closing file file3.txt
closing file file2.txt
closing file file1.txt


As you can see, the files were opened and automatically closed. Now all we need to do is the work itself:

In [18]:
f_names = 'file1.txt', 'file2.txt', 'file3.txt'

with ExitStack() as stack:
    files = [stack.enter_context(open_file(f_name))
            for f_name in f_names]
    while True:
        try:
            rows = [next(f).strip('\n') for f in files]
        except StopIteration:
            break
        else:
            row = ','.join(rows)
            print(row)

opening file file1.txt
opening file file2.txt
opening file file3.txt
file1_line1,file2_line1,file3_line1
file1_line2,file2_line2,file3_line2
file1_line3,file2_line3,file3_line3
closing file file3.txt
closing file file2.txt
closing file file1.txt


To finish up we can use the built-in `open` context manager:

In [19]:
f_names = 'file1.txt', 'file2.txt', 'file3.txt'

with ExitStack() as stack:
    files = [stack.enter_context(open(f_name))
            for f_name in f_names]
    while True:
        try:
            rows = [next(f).strip('\n') for f in files]
        except StopIteration:
            break
        else:
            row = ','.join(rows)
            print(row)

file1_line1,file2_line1,file3_line1
file1_line2,file2_line2,file3_line2
file1_line3,file2_line3,file3_line3
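
To illustrate the exception handling mentioned earlier, here is a small sketch in which one of the registered contexts fails while closing. The `resource` context manager, its `fail_on_close` flag and the `log` list are hypothetical names introduced only for this example. With `ExitStack`, the remaining contexts are still exited, and the exception is propagated afterwards:

```python
from contextlib import ExitStack, contextmanager

@contextmanager
def resource(name, log, fail_on_close=False):
    # hypothetical stand-in for a file: records when it is opened and closed
    log.append(f'open {name}')
    try:
        yield name
    finally:
        log.append(f'close {name}')
        if fail_on_close:
            raise RuntimeError(f'{name} failed to close')

log = []
error = None
try:
    with ExitStack() as stack:
        stack.enter_context(resource('a', log))
        stack.enter_context(resource('b', log, fail_on_close=True))
        stack.enter_context(resource('c', log))
except RuntimeError as ex:
    # the failure from 'b' still reaches the caller
    error = ex

print(log)
print(error)
```

By contrast, our simple `NestedContexts.__exit__` loop would stop at the first `__exit__` that raises, so the contexts registered before it would never be exited.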


# Section 12 - Generator Based Coroutines

##  Coroutines

What is a coroutine?

The prefix *co* actually comes from **cooperative**.

A coroutine is a generalization of subroutines (think functions), focused on **cooperation** between routines.

If you have some concepts of multi-threading, this is similar in some ways. But whereas in multi-threaded applications the **operating system** usually decides when to suspend one thread and resume another, without asking permission, so-called **preemptive** multitasking, here we have routines that voluntarily yield control to something else - hence the term **cooperative**.

We actually have all the tools we need to start looking at this.

It is the `yield` statement we studied in the last section on generator functions.

Let's dig a little further to truly understand what coroutines are and how they can be used.

We'll need to first define quickly what a queue is.

It is a collection where items are added to the back of the queue, and items are removed from the front of the queue. So, very similar to a queue in a supermarket - you join the queue at the back of the queue, and the person in the front of the queue is the first one to leave the queue and go to the checkout counter.

This is also called a First-In First-Out data structure.

(For comparison, you also have a **stack** which is like a stack of pancakes - the last cooked pancake is placed on top of the stack of pancakes (called a **push**), and it's the first one you take from the stack and eat (called a **pop**) - so that is called First-In Last-Out)

We can just use a simple list to act as a queue, but lists are not particularly efficient when adding elements to the beginning of the list - they are fine for adding elements to the end, but less so for inserting elements elsewhere, including at the front.

So, instead of using a list, let's just use a more efficient data structure for our queue.

The `queue` module has some queue implementations, including some very specialized ones. Since Python 3.7, it also has the `SimpleQueue` class that is more lightweight.

In this case though, I'm going to use the `deque` class (double-ended queue) from the `collections` module - it is very efficient adding and removing elements from both the start and the end of the queue - so, it's very general purpose and widely used. The `queue` implementations are more specialized and have several features useful for multi-tasking that we won't actually need.

In [1]:
from collections import deque

We can specify a maximum size for the queue when we create it - this allows us to limit the number of items in the queue.

We can then add and remove items by using the methods:
* `append`: appends an element to the right of the queue
* `appendleft`: appends an element to the left of the queue
* `pop`: remove and return the element at the very right of the queue
* `popleft`: remove and return the element at the very left of the queue

(Note that I'm avoiding calling it the start and end of the queue, because what you consider the start/end of the queue might depend on how you are using it)

Let's just try it out to make sure we're comfortable with it:

In [2]:
dq = deque([1, 2, 3, 4, 5])
dq

deque([1, 2, 3, 4, 5])

In [3]:
dq.append(100)
dq

deque([1, 2, 3, 4, 5, 100])

In [4]:
dq

deque([1, 2, 3, 4, 5, 100])

In [5]:
dq.appendleft(-10)
dq

deque([-10, 1, 2, 3, 4, 5, 100])

In [6]:
dq.pop()

100

In [7]:
dq

deque([-10, 1, 2, 3, 4, 5])

In [8]:
dq.popleft()

-10

In [9]:
dq

deque([1, 2, 3, 4, 5])

We can create a capped queue:

In [10]:
dq = deque([1, 2, 3, 4], maxlen=5)

In [11]:
dq.append(100)
dq

deque([1, 2, 3, 4, 100])

In [12]:
dq.append(200)
dq

deque([2, 3, 4, 100, 200])

In [13]:
dq.append(300)
dq

deque([3, 4, 100, 200, 300])

As you can see the first item (`2`) was automatically discarded from the left of the queue when we added `300` to the right.

We can also find the number of elements in the queue by using the `len()` function:

In [14]:
len(dq)

5

as well as query the `maxlen`:

In [15]:
dq.maxlen

5

There are more methods, but these will do for now.
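
To tie these methods back to the queue and stack definitions given earlier, here is a short sketch (the variable names are just for illustration). Appending on the right and popping from the left gives first-in first-out behavior, while appending and popping on the same side gives stack behavior:

```python
from collections import deque

items = ['a', 'b', 'c']

# Queue (FIFO): add on the right, remove from the left
queue = deque()
for item in items:
    queue.append(item)
fifo_order = [queue.popleft() for _ in range(len(queue))]

# Stack: add on the right, remove from the right
stack = deque()
for item in items:
    stack.append(item)
lifo_order = [stack.pop() for _ in range(len(stack))]

print(fifo_order)
print(lifo_order)
```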

Now let's create an empty queue, and write two functions - one that will add elements to the queue, and one that will consume elements from the queue:

In [16]:
def produce_elements(dq):
    for i in range(1, 36):
        dq.appendleft(i)

In [17]:
def consume_elements(dq):
    while len(dq) > 0:
        item = dq.pop()
        print('processing item', item)

Now we can use them as follows:

In [18]:
def coordinator():
    dq = deque()
    produce_elements(dq)
    consume_elements(dq)

In [19]:
coordinator()

processing item 1
processing item 2
processing item 3
processing item 4
processing item 5
processing item 6
processing item 7
processing item 8
processing item 9
processing item 10
processing item 11
processing item 12
processing item 13
processing item 14
processing item 15
processing item 16
processing item 17
processing item 18
processing item 19
processing item 20
processing item 21
processing item 22
processing item 23
processing item 24
processing item 25
processing item 26
processing item 27
processing item 28
processing item 29
processing item 30
processing item 31
processing item 32
processing item 33
processing item 34
processing item 35


But suppose now that the `produce_elements` function is reading a ton of data from somewhere (maybe an API call that returns course ratings on some Python course :-) ).

The goal is to process these after some time, and not wait until all the items have been added to the queue - maybe the incoming stream is infinite even.

In that case, we want to "pause" adding elements to the queue, process (consume) those items, then once they've all been processed we want to resume adding elements, and rinse and repeat.

We'll use a capped `deque`, and change our producer and consumers slightly, so that each one does its work, then yields control back to the caller once it's done with its work - the producer adding elements to the queue, and the consumer removing and processing elements from the queue:

In [20]:
def produce_elements(dq, n):
    for i in range(1, n):
        dq.appendleft(i)
        if len(dq) == dq.maxlen:
            print('queue full - yielding control')
            yield
        
def consume_elements(dq):
    while True:
        while len(dq) > 0:
            print('processing ', dq.pop())
        print('queue empty - yielding control')
        yield
    
def coordinator():
    dq = deque(maxlen=10)
    producer = produce_elements(dq, 36)
    consumer = consume_elements(dq)
    while True:
        try:
            print('producing...')
            next(producer)
        except StopIteration:
            # producer finished
            break
        finally:
            print('consuming...')
            next(consumer)

In [21]:
coordinator()

producing...
queue full - yielding control
consuming...
processing  1
processing  2
processing  3
processing  4
processing  5
processing  6
processing  7
processing  8
processing  9
processing  10
queue empty - yielding control
producing...
queue full - yielding control
consuming...
processing  11
processing  12
processing  13
processing  14
processing  15
processing  16
processing  17
processing  18
processing  19
processing  20
queue empty - yielding control
producing...
queue full - yielding control
consuming...
processing  21
processing  22
processing  23
processing  24
processing  25
processing  26
processing  27
processing  28
processing  29
processing  30
queue empty - yielding control
producing...
consuming...
processing  31
processing  32
processing  33
processing  34
processing  35
queue empty - yielding control


Notice a **really important** point here - the producer and consumer generator functions do not use `yield` for iteration purposes - they are simply using `yield` to suspend themselves and cooperatively hand control back to the caller - our coordinator function in this case.

The generators used `yield` to cooperatively suspend themselves and yield control back to the caller.

Similarly, we are not using `next` for iteration purposes, but more for starting and resuming the generators.

This is a fundamentally different idea than using `yield` to implement iterators, and forms the basis for the idea of using generators as coroutines.
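
The same idea extends to more than two routines. As a rough sketch (the names `task` and `run_round_robin` are made up for this illustration and do not come from any library), a coordinator can keep the suspended generators in a `deque` and resume them in turn:

```python
from collections import deque

def task(name, steps, log):
    # each task does one unit of work, then voluntarily yields control
    for i in range(1, steps + 1):
        log.append(f'{name} step {i}')
        yield

def run_round_robin(tasks):
    ready = deque(tasks)
    while ready:
        current = ready.popleft()     # take the task at the front of the queue
        try:
            next(current)             # resume it until its next yield
        except StopIteration:
            continue                  # task finished, do not reschedule it
        ready.append(current)         # still has work, send it to the back

log = []
run_round_robin([task('A', 2, log), task('B', 3, log)])
for entry in log:
    print(entry)
```

Here again `yield` and `next` are not used to iterate over values, only to suspend and resume each routine.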

### Timings using Lists and Deques for Queues

Let's see some timing differences between `lists` and `deques` when inserting and popping elements. We'll compare this with appending elements to a `list` as well.

In [1]:
from timeit import timeit

In [62]:
list_size = 10_000

def append_to_list(n=list_size):
    lst = []
    for i in range(n):
        lst.append(i)

def insert_front_of_list(n=list_size):
    lst = []
    for i in range(n):
        lst.insert(0, i)
        
lst = [i for i in range(list_size)]
def pop_from_list(lst=lst):
    for _ in range(len(lst)):
        lst.pop()
        
lst = [i for i in range(list_size)]
def pop_from_front_of_list(lst=lst):
    for _ in range(len(lst)):
        lst.pop(0)

Let's time those out:

In [63]:
timeit('append_to_list()', globals=globals(), number=1_000)

0.8679745109602663

In [64]:
timeit('insert_front_of_list()', globals=globals(), number=1_000)

20.793169873565148

In [65]:
timeit('pop_from_list()', globals=globals(), number=1_000)

0.0017591912596799375

In [66]:
timeit('pop_from_front_of_list()', globals=globals(), number=1_000)

0.012326529086294613

As you can see, inserting elements at the front of the list is not very efficient compared to the end of the list. So lists are OK to use as stacks, but not as queues.

The standard library's `deque` is efficient at adding/removing items from both the start and end of the collection:

In [49]:
from collections import deque

In [67]:
list_size = 10_000

def append_to_deque(n=list_size):
    dq = deque()
    for i in range(n):
        dq.append(i)

def insert_front_of_deque(n=list_size):
    dq = deque()
    for i in range(n):
        dq.appendleft(i)
        
dq = deque(i for i in range(list_size))
def pop_from_deque(dq=dq):
    for _ in range(len(dq)):
        dq.pop()
        
dq = deque(i for i in range(list_size))
def pop_from_front_of_deque(dq=dq):
    for _ in range(len(dq)):
        dq.popleft()

In [68]:
timeit('append_to_deque()', globals=globals(), number=1_000)

0.8704001035901001

In [69]:
timeit('insert_front_of_deque()', globals=globals(), number=1_000)

0.8407907529494878

In [70]:
timeit('pop_from_deque()', globals=globals(), number=1_000)

0.000532037516904893

In [71]:
timeit('pop_from_front_of_deque()', globals=globals(), number=1_000)

0.0005195763528718089
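
One caveat about the pop timings: the list pop functions above receive their list as a default argument, so the first timed call empties it and the remaining calls have nothing left to pop. A fairer comparison rebuilds the collection before each timed run. The following is only a sketch of that idea. The helper names (`drain_list_front`, `drain_deque_front`, `best_drain_time`) are assumptions, and no timings are shown since they vary by machine.

```python
from collections import deque
from timeit import timeit

def drain_list_front(lst):
    # pop(0) has to shift every remaining element one slot to the left
    while lst:
        lst.pop(0)

def drain_deque_front(dq):
    while dq:
        dq.popleft()

def best_drain_time(make_data, drain, repeats=5):
    # rebuild the collection before each timed run so every run pops all items
    timings = []
    for _ in range(repeats):
        data = make_data()
        timings.append(timeit(lambda: drain(data), number=1))
    return min(timings)

size = 10_000
list_time = best_drain_time(lambda: list(range(size)), drain_list_front)
deque_time = best_drain_time(lambda: deque(range(size)), drain_deque_front)

print(f'list.pop(0):     {list_time:.6f}s')
print(f'deque.popleft(): {deque_time:.6f}s')
```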

##  Generator States

Let's look at a simple generator function:

In [34]:
def gen(s):
    for c in s:
        yield c

We create a generator object by calling the generator function:

In [35]:
g = gen('abc')

At this point the generator object is **created**, but we have not actually started running it. To do so, we call `next()`, which then starts running the function body until the first `yield` is encountered:

In [36]:
next(g)

'a'

Now the generator is **suspended**, waiting for us to call next again:

In [37]:
next(g)

'b'

Every time we call `next`, the generator function runs, or is in a **running** state until the next yield is encountered, or no more results are yielded and the function actually returns:

In [38]:
next(g)

'c'

In [39]:
next(g)

StopIteration: 

Once we exhaust the generator, we get a `StopIteration` exception, and we can think of the generator as being **closed**.

As we can see, a generator can be in one of four states:

* created
* running
* suspended
* closed

We can actually request the state of a generator programmatically by using the `inspect` module's `getgeneratorstate()` function:

In [33]:
from inspect import getgeneratorstate

In [46]:
g = gen('abc')

In [47]:
getgeneratorstate(g)

'GEN_CREATED'

We can start running the generator by calling `next`:

In [48]:
next(g)

'a'

And the state is now:

In [49]:
getgeneratorstate(g)

'GEN_SUSPENDED'

Once we exhaust the generator:

In [50]:
next(g), next(g), next(g)

StopIteration: 

The generator is now in a closed state:

In [51]:
getgeneratorstate(g)

'GEN_CLOSED'

Now we haven't seen the running state - to do that we just need to print the state from inside the generator - but to do that we need to have a reference to the generator object itself. This is not that easy to do, so I'm going to cheat and assume that the generator object will be referenced by a global variable `global_gen`:

In [52]:
def gen(s):
    for c in s:
        print(getgeneratorstate(global_gen))
        yield c

In [53]:
global_gen = gen('abc')

In [54]:
next(global_gen)

GEN_RUNNING


'a'

So a generator can be in these four very distinct states.

When the generator is created, it is not in a running or suspended state - it is simply in a **created** state.

We have to kick-off, or prime, the generator by calling `next` on it.

After the generator has yielded a value, it is in a **suspended** state.

Finally, once the generator **returns** (not yields), i.e. the StopIteration is raised, the generator is **closed**.

It is also really important to understand that when a `yield` is encountered, the generator is suspended **exactly** at that point, but not before it has evaluated the expression to the right of the yield statement so it can produce that value in the return value of the `next()` function.

To see this, let's write a simple function and a generator function as follows:

In [55]:
def square(i):
    print(f'squaring {i}')
    return i ** 2

In [58]:
def squares(n):
    for i in range(n):
        yield square(i)
        print('right after yield')

In [59]:
sq = squares(5)

In [60]:
next(sq)

squaring 0


0

As you can see `square(i)` was evaluated, **then** the value was yielded, and the generator was suspended exactly at the point the `yield` statement was encountered:

In [61]:
next(sq)

right after yield
squaring 1


1

As you can see, only now does the `right after yield` string get printed from our generator.
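
To recap the four states in one place, here is a small sketch that records each of them. Instead of a global variable, it passes the generator a one-element list (`holder`, a name made up for this example) that is filled in after the generator object has been created:

```python
from inspect import getgeneratorstate

def letters(s, holder, seen):
    for c in s:
        # holder[0] is the generator object itself, so this runs "from inside"
        seen.append(getgeneratorstate(holder[0]))
        yield c

holder = []
seen = []
g = letters('ab', holder, seen)
holder.append(g)

states = [getgeneratorstate(g)]       # created, body has not started yet
next(g)
states.append(getgeneratorstate(g))   # suspended at the first yield
list(g)                               # exhaust the remaining values
states.append(getgeneratorstate(g))   # closed once the function has returned

print(states)
print(seen)
```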

##  Generator States

Let's look at a simple generator function:

In [34]:
def gen(s):
    for c in s:
        yield c

We create an generator object by calling the generator function:

In [35]:
g = gen('abc')

At this point the generator object is **created**, but we have not actually started running it. To do so, we call `next()`, which then starts running the function body until the first `yield` is encountered:

In [36]:
next(g)

'a'

Now the generator is **suspended**, waiting for us to call next again:

In [37]:
next(g)

'b'

Every time we call `next`, the generator function runs, or is in a **running** state until the next yield is encountered, or no more results are yielded and the function actually returns:

In [38]:
next(g)

'c'

In [39]:
next(g)

StopIteration: 

Once we exhaust the generator, we get a `StopIteration` exception, and we can think of the generator as being **closed**.

As we can see, a generator can be in one of four states:

* created
* running
* suspended
* closed

We can actually request the state of a generator programmatically by using the `inspect` module's `getgeneratorstate()` function:

In [33]:
from inspect import getgeneratorstate

In [46]:
g = gen('abc')

In [47]:
getgeneratorstate(g)

'GEN_CREATED'

We can start running the generator by calling `next`:

In [48]:
next(g)

'a'

And the state is now:

In [49]:
getgeneratorstate(g)

'GEN_SUSPENDED'

Once we exhaust the generator:

In [50]:
next(g), next(g), next(g)

StopIteration: 

The generator is now in a closed state:

In [51]:
getgeneratorstate(g)

'GEN_CLOSED'

Now we haven't seen the running state - to do that we just need to print the state from inside the generator - but to do that we need to have a reference to the generator object itself. This is not that easy to do, so I'm going to cheat and assume that the generator object will be referenced by a global variable `global_gen`:

In [52]:
def gen(s):
    for c in s:
        print(getgeneratorstate(global_gen))
        yield c

In [53]:
global_gen = gen('abc')

In [54]:
next(global_gen)

GEN_RUNNING


'a'

So a generator can be in these four very distinct states.

When the generator is created, it is not in a running or suspended state - it is simply in a **created** state.

We have to kick-off, or prime, the generator by calling `next` on it.

After the generator has yielded a value, it is in a **suspended** state.

Finally, once the generator **returns** (not yields), i.e. the StopIteration is raised, the generator is **closed**.
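
One way to observe all four states without relying on a global variable is to hand the generator a mutable container that will later hold a reference to the generator itself. The following is only a sketch: the names `letters`, `holder`, `seen` and `states` are made up for this illustration.

```python
from inspect import getgeneratorstate

def letters(s, holder, seen):
    for c in s:
        # holder['gen'] is filled in by the caller after the generator is created
        seen.append(getgeneratorstate(holder['gen']))
        yield c

holder = {}
seen = []      # states observed from inside the generator
states = []    # states observed by the caller

g = letters('ab', holder, seen)
holder['gen'] = g

states.append(getgeneratorstate(g))   # before the first next()
next(g)
states.append(getgeneratorstate(g))   # paused at a yield
for _ in g:                           # exhaust the generator
    pass
states.append(getgeneratorstate(g))   # after the function body has returned

print(states)
print(seen)
```

The caller only ever sees the created, suspended and closed states; the running state is only visible from code executing inside the generator body.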

Finally it is really important to understand that when a `yield` is encountered, the generator is suspended **exactly** at that point, but not before it has evaluated the expression to the right of the yield statement so it can produce that value in the return value of the `next()` function.

To see this, let's write a simple function and a generator function as follows:

In [55]:
def square(i):
    print(f'squaring {i}')
    return i ** 2

In [58]:
def squares(n):
    for i in range(n):
        yield square(i)
        print('right after yield')

In [59]:
sq = squares(5)

In [60]:
next(sq)

squaring 0


0

As you can see `square(i)` was evaluated, **then** the value was yielded, and the generator was suspended exactly at the point the `yield` statement was encountered:

In [61]:
next(sq)

right after yield
squaring 1


1

As you can see, only now does the `right after yield` string get printed from our generator.

##  Sending data to Generators

With PEP 342, generators were enhanced to allow not just sending data out (yielding), but also receiving data.

The basic idea is that when a generator is **suspended** after a yield statement, why not allow sending it some data when we resume its execution, exactly at the point where it resumes?

In other words, immediately after the `yield` statement.

And not on the next line of code, but actually in the same line as the `yield` - we should now think of the `yield` keyword, not just as a statement, but as an expression that also *receives* data - and we can assign that received value to a variable using an assignment. We can send data to the suspended generator (and resume running it) by using the `send()` **method** of the generator, instead of just using the `__next__` method (or, same thing, `next()`).

**Note:**
The **same** `yield` keyword is actually used to do both - but make no mistake, these are very different usages of the same keyword.

The key difference is that `yield` is actually an expression, not a simple statement - and of course we can assign expressions to variables.

Let's take a look at a simple example to illustrate how this works:

In [25]:
def echo():
    while True:
        received = yield
        print('You said:', received)

In [26]:
e = echo()

We now have a generator object, but what is its state?

In [27]:
from inspect import getgeneratorstate

getgeneratorstate(e)

'GEN_CREATED'

Right, it is created, but not suspended - in order for it to receive data it should be run up to the `yield` expression.

Remember that in assignments, the right hand side is evaluated **first**, and the result of the expression is assigned to the left hand variable.

So, if we call `next` on the generator it will start running.
Once it hits the line
```
received = yield
```
it will first evaluate the right hand side - at which time it will yield and therefore become suspended!

In [28]:
next(e)

In [29]:
getgeneratorstate(e)

'GEN_SUSPENDED'

Now that it is waiting to resume, we can send it data, and the generator will receive that data as if it were the right hand side of the assignment:

In [30]:
e.send('python')

You said: python


And now the generator continued running until it hit a `yield` again - which it does since we have our yield inside an infinite loop:

In [31]:
e.send('I said')

You said: I said


So, the `send` method essentially resumes the generator just as `__next__` does - but it also sends in some data that we can capture if we want to, inside the generator.

What happens if we do call `next()` or `__next__` instead of `send()`?

The same as if we had sent the `None` value:

In [32]:
next(e)

You said: None


In [33]:
e.send(None)

You said: None


See, same thing...

You might be asking whether we could have used `send` with all the generators we had written so far - sure!
The `yield` keyword is an expression, and you don't have to assign the result of an expression to a variable:

In [35]:
10 < 100

True

That was an expression, and it was perfectly fine not to assign it to a variable. We can, but we don't have to.

So, in fact the following also works just fine:

In [36]:
def squares(n):
    for i in range(n):
        yield i**2

In [40]:
sq = squares(5)

In [41]:
next(sq)

0

In [42]:
sq.send(100)

1

In [43]:
sq.send(100)

4

Now, the only thing is that we cannot change a generator from `created` to `suspended` by sending it a value with `send()` - we **have** to call `next` (or, equivalently, `send(None)`) first.

In other words this will not work:

In [44]:
e = echo()

In [45]:
e.send('hello')

TypeError: can't send non-None value to a just-started generator

We need to **start** or **prime** the generator first using `next`, which will run the code until the `yield` expression is encountered.

In [46]:
next(e)

In [47]:
e.send('hello')

You said: hello
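
To make the priming rule concrete, here is a small sketch (the `collector` coroutine and the `log` list are hypothetical names used only for this illustration). It shows that a real value cannot be sent to a just-started generator, that `send(None)` primes it just like `next` does, and that `next` on a suspended generator delivers `None` to the `yield` expression:

```python
def collector(log):
    while True:
        received = yield
        log.append(received)

log = []
c = collector(log)

error = None
try:
    c.send('too early')        # the generator has not reached a yield yet
except TypeError as ex:
    error = str(ex)

c.send(None)                   # primes the generator, same effect as next(c)
c.send('hello')                # received by the yield expression
next(c)                        # same as c.send(None)

print(error)
print(log)                     # ['hello', None]
```

Note that the failed `send` does not damage the generator: it stays in the created state and can still be primed afterwards.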


At this point we can see that generators can be used to both send and receive data.

You might be asking yourself whether it is possible to do both **at the same time** - i.e. use `yield` to both yield data and receive data (upon resumption). The answer is yes, but it's kind of mind bending, and unless you actually need to do so, resist the temptation to do it - it can be extremely confusing:

In [56]:
def squares(n):
    for i in range(n):
        received = yield i ** 2
        print('received:', received)

In [57]:
sq = squares(5)

In [58]:
next(sq)

0

In [59]:
yielded = sq.send('hello')
print('yielded:', yielded)

received: hello
yielded: 1


In [60]:
yielded = sq.send('hello')
print('yielded:', yielded)

received: hello
yielded: 4
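
What makes this confusing is the pairing: the value we send is received by the `yield` that produced the **previous** result, while the value returned by `send` comes from the **next** `yield`. The sketch below records that pairing in a list instead of printing it (the `log` parameter is an addition made for this illustration, it is not part of the original `squares`):

```python
def squares(n, log):
    for i in range(n):
        received = yield i ** 2
        # record which i was "waiting" when the value arrived
        log.append((i, received))

log = []
sq = squares(3, log)

out = [next(sq)]             # runs to the first yield, produces 0
out.append(sq.send('a'))     # 'a' is received by the yield that produced 0
out.append(sq.send('b'))     # 'b' is received by the yield that produced 1

print(out)   # [0, 1, 4]
print(log)   # [(0, 'a'), (1, 'b')]
```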


Of course, once the generator no longer `yields`, but `returns` we'll get the same `StopIteration` exception:

In [69]:
def echo(max_times):
    for i in range(max_times):
        received = yield
        print('You said:', received)
    print("that's all, folks!")

In [70]:
e = echo(3)
next(e)

In [71]:
e.send('python')
e.send('is')

You said: python
You said: is


The next `send` is going to resume the generator, it will print what we send it, and continue running - but this time the loop is done, so it will print our final `that's all, folks`, and the function will return (`None`) and hence cause a `StopIteration` exception to be raised:

In [72]:
e.send('awesome')

You said: awesome
that's all, folks!


StopIteration: 

So, when we deal with generators and the `yield` expression, we need to distinguish between two different uses:
* one way produces data for iteration
* the other way consumes data

To avoid confusion, try not to mix the two concepts together unless you have to. Try to keep them separate, i.e., either use:
```
yield <expression>
```
or 
```
variable = yield
```
but not both at the same time such as:
```
variable = yield <expression>
```

There are cases where using the combination is definitely useful.

Consider this example where we want a generator/coroutine that maintains (and yields) a running average of values we send it.

Let's first see how we would do it without using a coroutine - instead we'll use a closure so we can maintain the state (`total` and `count`):

In [1]:
def averager():
    total = 0
    count = 0
    def inner(value):
        nonlocal total
        nonlocal count
        total += value
        count += 1
        return total / count
    return inner

def running_averages(iterable):
    avg = averager()
    for value in iterable:
        running_average = avg(value)
        print(running_average)

In [2]:
running_averages([1, 2, 3, 4])

1.0
1.5
2.0
2.5


And now the same, but using a coroutine:

In [5]:
def running_averager():
    total = 0
    count = 0
    running_average = None
    while True:
        value = yield running_average
        total += value
        count += 1
        running_average = total / count

In [6]:
def running_averages(iterable):
    averager = running_averager()
    next(averager)  # prime generator
    for value in iterable:
        running_average = averager.send(value)
        print(running_average)

In [7]:
running_averages([1, 2, 3, 4])

1.0
1.5
2.0
2.5


As you can see, it was useful to have a single `yield` expression both receive the new value and, on the next pass through the loop, yield the updated average.

##  Sending Exceptions to Generators

So far we have seen how to send values to a generator using the `send()` method.

We have also seen how we can close a generator using the `close()` method and how that, in essence, raises a `GeneratorExit` exception inside the generator.

In fact we can also raise any exception inside a generator by using the `throw()` method.

Let's first see a simple example:

In [1]:
def gen():
    try:
        while True:
            received = yield
            print(received)
    finally:
        print('exception must have happened...')

In [2]:
g = gen()

In [3]:
next(g)

In [4]:
g.send('hello')

hello


In [5]:
g.throw(ValueError, 'custom message')

exception must have happened...


ValueError: custom message

As you can see, the exception occurred **inside** the generator, and then propagated up to the caller (we did not intercept and silence the exception). Of course we can do that if we want to:

In [6]:
def gen():
    try:
        while True:
            received = yield
            print(received)
    except ValueError:
        print('received the value error...')
    finally:
        print('generator exiting and closing')

In [7]:
g = gen()

In [8]:
next(g)
g.send('hello')

hello


In [9]:
g.throw(ValueError, 'stop it!')

received the value error...
generator exiting and closing


StopIteration: 

We caught the `ValueError` exception, so why did we get a `StopIteration` exception?

Because the generator returned - this raises a `StopIteration` exception.

The behavior of the `throw` is as follows:

* if the generator catches the exception and yields a value, that is the return value of the `throw()` method
* if the generator does not catch the exception, the exception is propagated back to the caller
* if the generator catches the exception, and exits (returns), the `StopIteration` exception is propagated to the caller
* if the generator catches the exception, and raises another exception, that exception is propagated to the caller

Let's see an example of each of those:

##### if the generator catches the exception and yields a value, that is the return value of the throw() method

In [11]:
from inspect import getgeneratorstate

In [12]:
def gen():
    while True:
        try:
            received = yield
            print(received)
        except ValueError as ex:
            print('ValueError received...', ex)

In [13]:
g = gen()
next(g)

In [14]:
g.send('hello')

hello


In [15]:
g.throw(ValueError, 'custom message')

ValueError received... custom message


In [16]:
g.send('hello')

hello


And the generator is now in a suspended state, waiting for our next call:

In [17]:
getgeneratorstate(g)

'GEN_SUSPENDED'

##### if the generator does not catch the exception, the exception is propagated back to the caller

In [18]:
def gen():
    while True:
        received = yield
        print(received)

In [19]:
g = gen()
next(g)
g.send('hello')

hello


In [20]:
g.throw(ValueError, 'custom message')

ValueError: custom message

And the generator is now in a closed state:

In [21]:
getgeneratorstate(g)

'GEN_CLOSED'

##### if the generator catches the exception, and exits (returns), the StopIteration exception is propagated to the caller

In [22]:
def gen():
    try:
        while True:
            received = yield
            print(received)
    except ValueError as ex:
        print('ValueError received', ex)
        return None

In [23]:
g = gen()
next(g)
g.send('hello')

hello


In [24]:
g.throw(ValueError, 'custom message')

ValueError received custom message


StopIteration: 

And, once again, the generator is in a closed state:

In [25]:
getgeneratorstate(g)

'GEN_CLOSED'

##### if the generator catches the exception, and raises another exception, that exception is propagated to the caller

In [26]:
def gen():
    try:
        while True:
            received = yield
            print(received)
    except ValueError as ex:
        print('ValueError received...', ex)
        raise ZeroDivisionError('not really...')

In [27]:
g = gen()
next(g)
g.send('hello')

hello


In [28]:
g.throw(ValueError, 'custom message')

ValueError received... custom message


ZeroDivisionError: not really...

And our generator is, once again, in a closed state:

In [29]:
getgeneratorstate(g)

'GEN_CLOSED'

As you can see our traceback includes both the `ZeroDivisionError` and the `ValueError` that caused the `ZeroDivisionError` to happen in the first place. If you don't want the chained `ValueError` in the traceback, you can suppress it with `raise ... from None` and only display the `ZeroDivisionError` (I will cover this and exceptions in detail in a later part of this series):

In [30]:
def gen():
    try:
        while True:
            received = yield
            print(received)
    except ValueError as ex:
        print('ValueError received...', ex)
        raise ZeroDivisionError('not really...') from None

In [31]:
g = gen()
next(g)
g.send('hello')

hello


In [32]:
g.throw(ValueError, 'custom message')

ValueError received... custom message


ZeroDivisionError: not really...
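
The four behaviors can also be checked side by side. The following is just a sketch: the helper `outcome` and the four tiny generator functions are hypothetical names written for this summary, and the exception is passed to `throw` as an instance.

```python
def outcome(gen_fn):
    """Prime gen_fn(), throw a ValueError into it, and describe what the caller sees."""
    g = gen_fn()
    next(g)
    try:
        value = g.throw(ValueError('boom'))
    except StopIteration:            # must come before Exception (it is a subclass)
        return 'StopIteration'
    except Exception as ex:
        return type(ex).__name__
    return ('yielded', value)

def catches_and_yields():
    while True:
        try:
            yield 'ready'
        except ValueError:
            pass                     # swallow it and loop back to the yield

def does_not_catch():
    while True:
        yield 'ready'

def catches_and_returns():
    try:
        while True:
            yield 'ready'
    except ValueError:
        return

def catches_and_raises():
    try:
        while True:
            yield 'ready'
    except ValueError:
        raise ZeroDivisionError('not really...') from None

results = {
    fn.__name__: outcome(fn)
    for fn in (catches_and_yields, does_not_catch,
               catches_and_returns, catches_and_raises)
}
for name, result in results.items():
    print(name, '->', result)
```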

#### Example of where this can be useful

Suppose we have a coroutine that handles writing data to a database.
We have seen in previous examples how we could use a coroutine to start and either commit or abort a transaction - based on closing the generator or forcing an exception to happen in the body of the generator.

Let's revisit this example, but now we'll want to use exceptions to indicate to our generator whether to commit or abort a transaction, without necessarily exiting the generator:

In [33]:
class CommitException(Exception):
    pass

class RollbackException(Exception):
    pass

def write_to_db():
    print('opening database connection...')
    print('start transaction...')
    try:
        while True:
            try:
                data = yield
                print('writing data to database...', data)
            except CommitException:
                print('committing transaction...')
                print('opening next transaction...')
            except RollbackException:
                print('aborting transaction...')
                print('opening next transaction...')
    finally:
        print('generator closing...')
        print('aborting transaction...')
        print('closing database connection...')

In [34]:
sql = write_to_db()

In [35]:
next(sql)

opening database connection...
start transaction...


In [36]:
sql.send(100)

writing data to database... 100


In [37]:
sql.throw(CommitException)

committing transaction...
opening next transaction...


In [38]:
sql.send(200)

writing data to database... 200


In [39]:
sql.throw(RollbackException)

aborting transaction...
opening next transaction...


In [40]:
sql.send(200)
sql.throw(CommitException)
sql.close()

writing data to database... 200
committing transaction...
opening next transaction...
generator closing...
aborting transaction...
closing database connection...


As you can see, we can use exceptions to control the **flow** of our code. Exceptions are not necessarily **errors**, as we have seen with the `StopIteration` and `GeneratorExit` exceptions.
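
To see that this kind of flow control can do real work, here is a sketch of the same idea where the "database" is just a Python list. Everything here (`Commit`, `Rollback`, `buffered_writer`, `committed`) is hypothetical and only stands in for a real connection. Note that the exceptions are passed to `throw` as instances; the `throw(type, value)` form has been deprecated since Python 3.12.

```python
class Commit(Exception):
    pass

class Rollback(Exception):
    pass

def buffered_writer(committed):
    pending = []                         # rows written in the current transaction
    while True:
        try:
            row = yield
            pending.append(row)
        except Commit:
            committed.extend(pending)    # make the pending rows permanent
            pending.clear()
        except Rollback:
            pending.clear()              # discard the pending rows

committed = []
w = buffered_writer(committed)
next(w)                 # prime the coroutine

w.send(100)
w.throw(Commit())       # 100 is committed
w.send(200)
w.throw(Rollback())     # 200 is discarded
w.send(300)
w.close()               # closed without a commit, so 300 is discarded too

print(committed)        # [100]
```

`GeneratorExit` is not caught by either `except` clause, so `close()` ends the coroutine without committing whatever is still pending.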

#### throw() and close()

The `close()` method does essentially the same thing as `throw(GeneratorExit)` except that when that exception is thrown using `throw()`, Python does not silence the exception for the caller:

In [41]:
def gen():
    try:
        while True:
            received = yield
            print(received)
    finally:
        print('closing down...')

In [42]:
g = gen()
next(g)
g.send('hello')
g.close()

hello
closing down...


In [45]:
g = gen()
next(g)
g.send('hello')
g.throw(GeneratorExit)

hello
closing down...


GeneratorExit: 

Even if we catch the exception, we are still exiting the generator, so using `throw` will result in the caller receiving a `StopIteration` exception.

In [46]:
def gen():
    try:
        while True:
            received = yield
            print(received)
    except GeneratorExit:
        print('received generator exit...')
    finally:
        print('closing down...')

In [47]:
g = gen()
next(g)
g.close()

received generator exit...
closing down...


In [48]:
g = gen()
next(g)
g.throw(GeneratorExit)

received generator exit...
closing down...


StopIteration: 

So, we can use `throw` to close the generator, but as the caller we now have to handle the exception that propagates up to us:

In [49]:
g = gen()
next(g)
try:
    g.throw(GeneratorExit)
except StopIteration:
    print('silencing GeneratorExit...')

received generator exit...
closing down...
silencing GeneratorExit...


Basically this is the exact same scenario as the catch and exit (return) we saw a couple of examples back.
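
Putting these observations together, we can write a rough imitation of `close()` on top of `throw()`. This is only a sketch of the documented behavior, not how Python actually implements it, and `close_like`, `polite` and `stubborn` are hypothetical names:

```python
def close_like(gen):
    """Rough imitation of gen.close() built on throw(); illustration only."""
    try:
        gen.throw(GeneratorExit())
    except (GeneratorExit, StopIteration):
        # either the generator did not catch the exception, or it caught it and returned
        return 'closed'
    # throw() returned normally, so the generator yielded again instead of exiting
    raise RuntimeError('generator ignored GeneratorExit')

def polite():
    try:
        while True:
            yield
    except GeneratorExit:
        pass            # clean up, then fall off the end (return)

def stubborn():
    try:
        yield
    except GeneratorExit:
        pass
    yield               # yields again instead of exiting

p = polite()
next(p)
result = close_like(p)

s = stubborn()
next(s)
try:
    close_like(s)
except RuntimeError as ex:
    message = str(ex)

print(result)
print(message)
```

The real `close()` behaves the same way for these two cases: it silences `GeneratorExit` and `StopIteration`, and raises a `RuntimeError` if the generator yields a value after being asked to exit.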

##  Using decorators to prime coroutines

We saw how we always have to 'prime' a coroutine (i.e. get the generator in a suspended state) before we can start sending values to it.

This is something that **must** always be done - and this is an excellent use case for decorators.

We're going to create a decorator that will create and prime the coroutine for us.

Essentially we want to be able to:
1. create the coroutine (`gen()`)
2. prime the coroutine (`next(g)`)

in one step - so that's what the decorator is going to do - it will wrap our original coroutine and return a new function that will perform those steps for us, and return the newly created and primed coroutine:

In [31]:
def coroutine(gen_fn):
    def inner():
        gen = gen_fn()
        next(gen)
        return gen
    return inner    

In [32]:
@coroutine
def echo():
    while True:
        received = yield
        print(received)

In [33]:
ec = echo()

In [34]:
import inspect
inspect.getgeneratorstate(ec)

'GEN_SUSPENDED'

As you can see our generator was automatically advanced from CREATED to SUSPENDED - and we can now use it straight away:

In [36]:
ec.send('hello')

hello


Now, we still need to expand this slightly to accommodate passing arguments to our generator function (coroutine):

In [65]:
def coroutine(gen_fn):
    def inner(*args, **kwargs):
        gen = gen_fn(*args, **kwargs)
        next(gen)
        return gen
    return inner  

In [66]:
import math

@coroutine
def power_up(p):
    result = None
    while True:
        received = yield result
        result = math.pow(received, p)       

In [67]:
squares = power_up(2)
cubes = power_up(3)

In [68]:
squares.send(2)

4.0

In [69]:
cubes.send(2)

8.0

What happens if we send the wrong type in?

In [70]:
squares.send('abc')

TypeError: must be real number, not str

And now our generator stops functioning - it is in a closed state:

In [71]:
inspect.getgeneratorstate(squares)

'GEN_CLOSED'

In this particular case, we don't want our generator to close down - it should simply yield None and ignore the exception, so it can continue working:

In [72]:
@coroutine
def power_up(p):
    result = None
    while True:
        received = yield result
        try:
            result = math.pow(received, p)    
        except TypeError:
            result = None

In [73]:
squares = power_up(2)

In [74]:
squares.send(2)

4.0

In [75]:
squares.send('abc')

In [76]:
squares.send(3)

9.0

Of course, we can close the generator ourselves still:

In [77]:
squares.close()

In [78]:
inspect.getgeneratorstate(squares)

'GEN_CLOSED'
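
One detail the decorator above does not handle is function metadata: the decorated name now refers to `inner`, so `__name__` and the docstring of the original generator function are lost. A possible refinement, sketched here as an optional variation, is to use `functools.wraps`. This version of `power_up` uses the `**` operator instead of `math.pow` purely to keep the example short.

```python
from functools import wraps
from inspect import getgeneratorstate

def coroutine(gen_fn):
    @wraps(gen_fn)                      # copy __name__, __doc__, etc. onto inner
    def inner(*args, **kwargs):
        gen = gen_fn(*args, **kwargs)
        next(gen)                       # prime: run up to the first yield
        return gen
    return inner

@coroutine
def power_up(p):
    """Yield received ** p for each value sent in."""
    result = None
    while True:
        received = yield result
        result = received ** p

squares = power_up(2)

print(power_up.__name__)            # power_up (not inner)
print(getgeneratorstate(squares))   # GEN_SUSPENDED
print(squares.send(3))              # 9
```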

##  Yield From - Two-Way Communications

In the last section on generators, we started looking at `yield from` and how we could delegate iteration to another iterator.

Let's see a simple example again:

In [1]:
def squares(n):
    for i in range(n):
        yield i ** 2

In [2]:
def delegator(n):
    for value in squares(n):
        yield value

In [3]:
gen = delegator(5)
for _ in range(5):
    print(next(gen))

0
1
4
9
16


Alternatively we could write the same thing this way:

In [4]:
def delegator(n):
    yield from squares(n)

In [5]:
gen = delegator(5)
for _ in range(5):
    print(next(gen))

0
1
4
9
16


**Terminology:** 
When we use `yield from subgen` we are **delegating** to `subgen`.

The generator that delegates to the other generator is called the **delegator** and the generator that it delegates to is called the **subgenerator**.

So in our example `squares(n)` was the subgenerator, and `delegator()` was the delegator.

The context that contains the code making `next` calls to the delegator is called the **caller's context**, or simply the **caller**.

What is actually happening when we call
```
next(gen)
```
is that `gen` (the delegator) is passing along the `next` request to the `squares(n)` (the subgenerator).

In return, the subgenerator is yielding values back to the delegator, which in turn yields them back to us (the caller).

There is in fact a **two-way communication channel** established between the caller and the subgenerator - all because of `yield from`.

* caller: next --> delegator --> subgenerator
* caller <-- delegator (yield) <-- subgenerator (yield)

So, if `yield from` establishes this 2-way communication channel, and we can send `next` to the subgenerator via the delegator, can we send data using `send` as well?

The answer is yes. We'll take a look at this in some detail over the next few videos.

Let's start by looking at how the delegator works when a subgenerator closes by itself:

We'll want to inspect the delegator and the subgenerator, so let's import what we'll need from the `inspect` module:

In [6]:
from inspect import getgeneratorstate, getgeneratorlocals

In [7]:
def song():
    yield "I'm a lumberjack and I'm OK"
    yield "I sleep all night and I work all day"

In [8]:
def play_song():
    count = 0
    s = song()
    yield from s
    yield 'song finished'
    print('player is exiting...')

Here `play_song` is the delegator, and `song` is the subgenerator. We, the Jupyter notebook, are the caller.

In [9]:
player = play_song()

In [10]:
print(getgeneratorstate(player))
print(getgeneratorlocals(player))

GEN_CREATED
{}


As you can see, no local variables have been created in `player` yet - that's because it is created, not actually started.

Let's start it:

In [11]:
next(player)

"I'm a lumberjack and I'm OK"

Now let's look at the state of things:

In [12]:
print(getgeneratorstate(player))
print(getgeneratorlocals(player))

GEN_SUSPENDED
{'s': <generator object song at 0x000002658B7E3F10>, 'count': 0}


We can now get a handle to the subgenerator `s`:

In [13]:
s = getgeneratorlocals(player)['s']

And we can check the state of `s`:

In [14]:
print(getgeneratorstate(s))

GEN_SUSPENDED


As we can see the subgenerator is suspended.

Let's iterate a few more times:

In [15]:
print(next(player))
print(getgeneratorstate(player))
print(getgeneratorstate(s))

I sleep all night and I work all day
GEN_SUSPENDED
GEN_SUSPENDED


In [16]:
print(next(player))
print(getgeneratorstate(player))
print(getgeneratorstate(s))

song finished
GEN_SUSPENDED
GEN_CLOSED


At this point the subgenerator exited, so its state is `GEN_CLOSED`, but the delegator (`player`) is just suspended, and in fact yielded `song finished`.

We can advance one more time:

In [17]:
print(next(player))

player is exiting...


StopIteration: 

We get the `StopIteration` exception because `player` returned, and now both the delegator and the subgenerator are in a closed state:

In [18]:
print(getgeneratorstate(player))
print(getgeneratorstate(s))

GEN_CLOSED
GEN_CLOSED


Important to note here is that when the subgenerator returned, the delegator **continued running normally**.

Let's make a tweak to our `player` generator to make this even more evident:

In [19]:
def player():
    count = 1
    while True:
        print('Run count:', count)
        yield from song()
        count += 1

In [20]:
p = player()

In [21]:
next(p), next(p)

Run count: 1


("I'm a lumberjack and I'm OK", 'I sleep all night and I work all day')

In [22]:
next(p), next(p)

Run count: 2


("I'm a lumberjack and I'm OK", 'I sleep all night and I work all day')

In [23]:
next(p), next(p)

Run count: 3


("I'm a lumberjack and I'm OK", 'I sleep all night and I work all day')

and so on...

##  Yield From - Sending Data

We have seen how we can send data to a generator by using the generator's `send` method.

When we use `yield from` to delegate to a subgenerator, the established communication conduit also carries any data sent to the delegator generator.

Let's write a simple coroutine that will receive string data and print the reversed string to the console:

In [1]:
def echo():
    while True:
        received = yield
        print(received[::-1])

We can use this coroutine this way:

In [2]:
e = echo()
next(e)  # prime the coroutine

In [3]:
e.send('stressed')

desserts


In [5]:
e.send('tons')

snot


And we can close the generator:

In [6]:
e.close()

Now let's write a simple delegator generator:

In [17]:
def delegator():
    e = echo()
    yield from e

We can create the delegator generator and prime the delegator:

In [18]:
d = delegator()
next(d)

Now, calling `next` on the delegator will establish the connection to the subgenerator and automatically prime it as well.

We can easily see this by doing some inspection:

In [19]:
from inspect import getgeneratorstate, getgeneratorlocals

In [20]:
getgeneratorlocals(d)

{'e': <generator object echo at 0x000001910CE0DDB0>}

In [21]:
e = getgeneratorlocals(d)['e']

In [22]:
print(getgeneratorstate(d))
print(getgeneratorstate(e))

GEN_SUSPENDED
GEN_SUSPENDED


We can now send data to the delegator, and it will pass that along to the subgenerator:

In [23]:
d.send('stressed')

desserts


Let's modify our `echo` coroutine to both receive and yield a result, instead of just printing to the console:

In [25]:
def echo():
    output = None
    while True:
        received = yield output
        output = received[::-1]

We can use it directly this way:

In [26]:
e = echo()
next(e)

In [27]:
e.send('stressed')

'desserts'

And we can use delegation as follows:

In [28]:
def delegator():
    yield from echo()

In [29]:
d = delegator()
next(d)

In [30]:
d.send('stressed')

'desserts'
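
To get a feel for what `yield from` is doing for us here, we can write the forwarding by hand. The sketch below is deliberately simplified: `manual_delegator` is a hypothetical name, and it only forwards `send`. It ignores `throw`, `close`, and the subgenerator's return value, all of which the real `yield from` also handles, and it assumes the subgenerator never returns.

```python
def echo():
    output = None
    while True:
        received = yield output
        output = received[::-1]

def manual_delegator():
    sub = echo()
    value = next(sub)                 # prime the subgenerator
    while True:
        received = yield value        # hand the subgenerator's value to the caller
        value = sub.send(received)    # forward what the caller sent

def delegator():
    yield from echo()

m = manual_delegator()
d = delegator()
next(m)
next(d)

results = (m.send('stressed'), d.send('stressed'))
print(results)   # ('desserts', 'desserts')
```

Both versions behave the same for this simple case, but the `yield from` version is a single line and covers the cases the manual loop leaves out.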

#### Example

Let's take a look at a more interesting example of `yield from`.

Our goal is to flatten a list containing nested lists to any level.

In [1]:
l = [1, 2, [3, 4, [5, 6]], [7, [8, 9, 10]]]

How could we approach this?

Let's try a more traditional approach using a recursive function that will build up the flattened list as we work our way through the original nested list.

Let's start simple, by just printing out the elements as we iterate:

In [32]:
def flatten(curr_item):
    if isinstance(curr_item, list):
        for item in curr_item:
            flatten(item)
    else:
        print(curr_item)

In [33]:
flatten(l)

1
2
3
4
5
6
7
8
9
10


Now let's create a flattened list instead of just printing the results:

In [36]:
def flatten(curr_item, output):
    if isinstance(curr_item, list):
        for item in curr_item:
            flatten(item, output)
    else:
        output.append(curr_item)

In [37]:
output = []
flatten(l, output)
print(output)

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]


This isn't too bad to understand, but let's try it using generators and `yield from`:

In [65]:
def flatten_gen(curr_item):
    if isinstance(curr_item, list):
        for item in curr_item:
            yield from flatten_gen(item)
    else:
        yield curr_item        

In [66]:
for item in flatten_gen(l):
    print(item)

1
2
3
4
5
6
7
8
9
10


And of course we can, if we prefer, make a list out of it:

In [67]:
list(flatten_gen(l))

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

I much prefer this approach - first of all we can iterate through the flattened list without making a list out of it - so much better memory wise, and secondly we don't need to lug around that `output` list at every iteration.

Notice by the way, how we nested subgenerators recursively.

Technically we can expand this to cover any iterable types - not just lists:

Let's first create a utility function to see if something is iterable:

In [68]:
def is_iterable(item):
    try:
        iter(item)
    except TypeError:
        return False
    else:
        return True

In [69]:
def flatten_gen(curr_item):
    if is_iterable(curr_item):
        for item in curr_item:
            yield from flatten_gen(item)
    else:
        yield curr_item

In [70]:
l = [1, 2, (3, 4, {5, 6}), (7, 8, [9, 10])]

In [71]:
list(flatten_gen(l))

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

But there's potentially a slight wrinkle - strings:

In [72]:
l = ['abc', [1, 2, (3, 4)]]

In [73]:
list(flatten_gen(l))

RecursionError: maximum recursion depth exceeded

Why are we getting this recursion error?

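That's because strings are iterables too - even a single character string! Iterating over `'a'` yields `'a'` again, which is itself an iterable string, so the recursion never reaches the `else` branch.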

So, two issues: we may not want to treat strings as iterables, and if we do, then we need to be careful with single character strings.
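
We can verify the single character problem directly. This small check uses only built-ins:

```python
s = 'a'
first = next(iter(s))

print(first == s)               # True: iterating 'a' produces 'a' again
print(isinstance(first, str))   # True: so the item is itself an iterable string

# Descending repeatedly into the first element never gets "below" a string:
item = 'abc'
for _ in range(5):
    item = next(iter(item))
print(item)                     # a
```

A recursive flatten that descends into every iterable would therefore go from `'a'` into `'a'` into `'a'` and so on, until the recursion limit is hit.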

We're going to tweak our `is_iterable` function, and our `flatten_gen` generator to handle these two issues:

In [112]:
def is_iterable(item, *, str_is_iterable=True):
    try:
        iter(item)
    except TypeError:
        return False
    else:
        if isinstance(item, str):
            if str_is_iterable and len(item) > 1:
                return True
            else:
                return False
        else:
            return True

Let's just make sure our function works as expected:

In [113]:
print(is_iterable([1, 2, 3]))
print(is_iterable('abc'))
print(is_iterable('a'))

True
True
False


In [114]:
print(is_iterable([1, 2, 3], str_is_iterable=False))
print(is_iterable('abc', str_is_iterable=False))
print(is_iterable('a', str_is_iterable=False))

True
False
False


Good, now we can tweak our `flatten_gen` generator so we can tell it whether to handle strings as iterables or not:

In [120]:
def flatten_gen(curr_item, *, str_is_iterable=True):
    if is_iterable(curr_item, str_is_iterable=str_is_iterable):
        for item in curr_item:
            yield from flatten_gen(item, str_is_iterable=str_is_iterable)
    else:
        yield curr_item

In [121]:
l

['abc', [1, 2, (3, 4)]]

In [122]:
list(flatten_gen(l))

['a', 'b', 'c', 1, 2, 3, 4]

In [123]:
list(flatten_gen(l, str_is_iterable=False))

['abc', 1, 2, 3, 4]

Here we saw we could use `yield from` recursively.
In fact a generator can be both a delegator and a subgenerator.
Here's a simple example of this:

In [6]:
def coro():
    while True:
        received = yield
        print(received)

In [7]:
def gen1():
    yield from gen2()
    
def gen2():
    yield from gen3()
    
def gen3():
    yield from coro()
    

In [8]:
g = gen1()
next(g)

In [9]:
g.send('hello')

hello


As you can see we were able to push data through a "pipeline":

```caller --> gen1 --> gen2 --> gen3 --> coro```

##  Yield From - Closing and Return

Just as we can send `next` and `send` through a delegator, we can also send `close`.

How does this affect the delegator and the subgenerator?

Let's take a look.

In [1]:
def subgen():
    try:
        while True:
            received = yield
            print(received)
    finally:
        print('subgen: closing...')

In [2]:
def delegator():
    s = subgen()
    yield from s
    yield 'delegator: subgen closed'
    print('delegator: closing...')

In [3]:
d = delegator()
next(d)

At this point, both the delegator and the subgenerator are primed and suspended:

In [4]:
from inspect import getgeneratorstate, getgeneratorlocals

In [5]:
getgeneratorlocals(d)

{'s': <generator object subgen at 0x0000022677AA7F10>}

In [6]:
s = getgeneratorlocals(d)['s']
print(getgeneratorstate(d))
print(getgeneratorstate(s))

GEN_SUSPENDED
GEN_SUSPENDED


We can send data to the delegator:

In [7]:
d.send('hello')

hello


We can even send data directly to the subgenerator since we now have a handle on it:

In [8]:
s.send('python')

python


In fact, we can close it too:

In [9]:
s.close()

subgen: closing...


So, what is the state of the delegator now?

In [10]:
getgeneratorstate(d)

'GEN_SUSPENDED'

But the subgenerator closed, so let's see what happens when we call `next` on `d`:

In [11]:
next(d)

'delegator: subgen closed'

As you can see, the delegator's code resumes right after the `yield from`, and we can do this one more time to close the delegator:

In [12]:
next(d)

delegator: closing...


StopIteration: 

OK, so this is what happens when the subgenerator closes (directly or indirectly) - the delegator simply resumes running right after the `yield from` when we call `next`.

But what happens if we close the delegator instead of just closing the subgenerator?

In [13]:
d = delegator()
next(d)
s = getgeneratorlocals(d)['s']
print(getgeneratorstate(d))
print(getgeneratorstate(s))

GEN_SUSPENDED
GEN_SUSPENDED


In [14]:
d.close()

subgen: closing...


As you can see the subgenerator also closed. Is the delegator closed too?

In [15]:
print(getgeneratorstate(d))
print(getgeneratorstate(s))

GEN_CLOSED
GEN_CLOSED


Yes. So closing the delegator will close not only the delegator itself, but also close the currently active subgenerator (if any).
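
The order in which things are cleaned up can be made visible with a small sketch. It assumes we record events in a list instead of printing; the `events` list and the `sub` parameter are illustrative additions, not part of the code above.

```python
from inspect import getgeneratorstate

events = []


def subgen():
    try:
        while True:
            yield
    finally:
        events.append('subgen cleanup')


def delegator(sub):
    try:
        yield from sub
    finally:
        events.append('delegator cleanup')


s = subgen()
d = delegator(s)
next(d)      # both generators are now suspended
d.close()    # closes the active subgenerator first, then the delegator

states = (getgeneratorstate(d), getgeneratorstate(s))
print(events)
print(states)
```

The subgenerator's `finally` block runs before the delegator's, which matches the output we saw: `subgen: closing...` appeared as soon as we closed the delegator.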

We should notice that when we closed the subgenerator directly no apparent exception was raised in our context.

What happens if the subgenerator returns something when it closes?

In [16]:
def subgen():
    try:
        while True:
            received = yield
            print(received)
    finally:
        print('subgen: closing...')
        return 'subgen: return value'

In [17]:
s = subgen()
next(s)
s.send('hello')
s.close()

hello
subgen: closing...


Hmmm, the `StopIteration` exception was silenced. Let's do this a different way, since we know the `StopIteration` exception should contain the return value:

In [18]:
s = subgen()
next(s)
s.send('hello')
s.throw(GeneratorExit('force exit'))

hello
subgen: closing...


StopIteration: subgen: return value

OK, so now we can see that the `StopIteration` exception contains the return value.

The `yield from` actually captures that value as its return value - in other words `yield from` is not just a statement, it is in fact, like `yield`, also an expression.

Let's see how that works:

In [19]:
def subgen():
    try:
        yield 1
        yield 2
    finally:
        print('subgen: closing...')
        return 100

In [20]:
def delegator():
    s = subgen()
    result = yield from s
    print('subgen returned:', result)
    yield 'delegator suspended'
    print('delegator closing')

In [21]:
d = delegator()

In [22]:
next(d)

1

In [23]:
next(d)

2

In [24]:
next(d)

subgen: closing...
subgen returned: 100


'delegator suspended'

As you can see the return value of the subgenerator ended up as the result of the `yield from` expression. 
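
Here is another small sketch of the same mechanism, this time with a coroutine that receives values. The names `running_sum` and `results` are hypothetical, and using `None` as the "stop" signal is just an assumption made for this example.

```python
def running_sum():
    total = 0
    while True:
        received = yield total
        if received is None:
            # Returning ends the subgenerator; the value travels
            # back to the delegator through `yield from`.
            return total
        total += received


def delegator(results):
    total = yield from running_sum()
    results.append(total)
    yield 'done'


results = []
d = delegator(results)
first = next(d)        # 0, the initial total
d.send(10)             # 10
d.send(5)              # 15
last = d.send(None)    # subgenerator returns, delegator moves on

print(results, last)
d.close()
```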

##  Pipelines - Pushing Data

We can also create pipelines where we **push** data through multiple stages of this pipeline, using `send`, so, essentially, using coroutines.

First let's create a simple decorator to auto-prime our coroutines:

In [11]:
def coroutine(coro):
    def inner(*args, **kwargs):
        gen = coro(*args, **kwargs)
        next(gen)
        return gen
    return inner

Let's start with a data consumer generator that will simply print what it receives - but it could equally well write the data to a file or a database, or process it in some other way.

In [13]:
@coroutine
def handle_data():
    while True:
        received = yield
        print(received)

Now let's write a coroutine that will receive some data, transform it, and send it along to the next generator:

In [14]:
import math

@coroutine
def power_up(n, next_gen):
    while True:
        received = yield
        output = math.pow(received, n)
        next_gen.send(output)

We are going to generate some data, send it to `power_up`, and specify the next stage as being `handle_data`:

In [15]:
print_data = handle_data()
gen = power_up(2, print_data)
# pipeline: gen --> print_data
for i in range(1, 6):
    gen.send(i)

1.0
4.0
9.0
16.0
25.0


Ok, as you can see we are now **pushing** data through this pipeline.

But why stop there? Let's add another `power_up` in the pipeline:

In [16]:
print_data = handle_data()
gen2 = power_up(3, print_data)
gen1 = power_up(2, gen2)
# pipeline: gen1 --> gen2 --> print_data
for i in range(1, 6):
    gen1.send(i)

1.0
64.0
729.0
4096.0
15625.0


Now let's add a filter to our pipeline that will only retain even values:

In [17]:
@coroutine
def filter_even(next_gen):
    while True:
        received = yield
        if received % 2 == 0:
            next_gen.send(received)

And let's insert it as the last processing stage of our pipeline, just before the data handler:

In [18]:
print_data = handle_data()
filtered = filter_even(print_data)
gen2 = power_up(3, filtered)
gen1 = power_up(2, gen2)

# pipeline: gen1 --> gen2 --> filtered --> print_data

for i in range(1, 6):
    gen1.send(i)

64.0
4096.0


So as you can see we can easily push data through our pipeline as well.
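
The final stage does not have to print. As a sketch, assuming we would rather keep the results in memory, the handler can be swapped for a coroutine that appends to a list (`collect` and `results` are hypothetical names; the other stages repeat the definitions above so that the example is self-contained):

```python
import math


def coroutine(coro):
    def inner(*args, **kwargs):
        gen = coro(*args, **kwargs)
        next(gen)  # auto-prime
        return gen
    return inner


@coroutine
def collect(results):
    # Sink: stores everything it receives.
    while True:
        received = yield
        results.append(received)


@coroutine
def power_up(n, next_gen):
    while True:
        received = yield
        next_gen.send(math.pow(received, n))


@coroutine
def filter_even(next_gen):
    while True:
        received = yield
        if received % 2 == 0:
            next_gen.send(received)


results = []
gen1 = power_up(2, power_up(3, filter_even(collect(results))))

# pipeline: gen1 --> power_up(3) --> filter_even --> collect
for i in range(1, 6):
    gen1.send(i)

print(results)
```

Because every stage only knows about the next one, replacing the sink does not require any change to the earlier stages.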

##  Closing Generators

We can actually close a generator by sending it a special message, calling its `close()` method.

When that happens, an exception is raised **inside** the generator, and we may or may not want to do something - maybe cleaning up a resource, committing a transaction to a database, etc.

Let's try it out, without any exception handling first:

In [1]:
from inspect import getgeneratorstate

In [2]:
import csv

def parse_file(f_name):
    print('opening file...')
    f = open(f_name, 'r')
    try:
        dialect = csv.Sniffer().sniff(f.read(2000))
        f.seek(0)
        reader = csv.reader(f, dialect=dialect)
        for row in reader:
            yield row
    finally:
        print('closing file...')
        f.close()        

You may notice, by the way, that this could easily be turned into a context manager by yielding `reader` instead of yielding individual rows from within a loop. As it stands, you cannot make it into a context manager - remember that for context managers there should be a **single** yield!
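
The context manager variant could look something like the following. This is only a sketch under a few assumptions: `open_csv` is a hypothetical name, the dialect sniffing is left out for brevity, and a small temporary file stands in for `cars.csv`.

```python
import csv
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def open_csv(f_name):
    f = open(f_name, newline='')
    try:
        # A single yield: hand the reader to the `with` block.
        yield csv.reader(f)
    finally:
        f.close()


# Create a throwaway CSV file so the example runs on its own.
with tempfile.NamedTemporaryFile(
        'w', suffix='.csv', newline='', delete=False) as tmp:
    tmp.write('make,year\nFord,1970\nBuick,1971\n')
    path = tmp.name

try:
    with open_csv(path) as reader:
        rows = list(reader)
finally:
    os.remove(path)

print(rows)
```

With this design the caller decides how many rows to read inside the `with` block, and the file is closed when the block exits.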

In [3]:
import itertools

parser = parse_file('cars.csv')
for row in itertools.islice(parser, 10):
    print(row)

opening file...
['Car', 'MPG', 'Cylinders', 'Displacement', 'Horsepower', 'Weight', 'Acceleration', 'Model', 'Origin']
['Chevrolet Chevelle Malibu', '18.0', '8', '307.0', '130.0', '3504.', '12.0', '70', 'US']
['Buick Skylark 320', '15.0', '8', '350.0', '165.0', '3693.', '11.5', '70', 'US']
['Plymouth Satellite', '18.0', '8', '318.0', '150.0', '3436.', '11.0', '70', 'US']
['AMC Rebel SST', '16.0', '8', '304.0', '150.0', '3433.', '12.0', '70', 'US']
['Ford Torino', '17.0', '8', '302.0', '140.0', '3449.', '10.5', '70', 'US']
['Ford Galaxie 500', '15.0', '8', '429.0', '198.0', '4341.', '10.0', '70', 'US']
['Chevrolet Impala', '14.0', '8', '454.0', '220.0', '4354.', '9.0', '70', 'US']
['Plymouth Fury iii', '14.0', '8', '440.0', '215.0', '4312.', '8.5', '70', 'US']
['Pontiac Catalina', '14.0', '8', '455.0', '225.0', '4425.', '10.0', '70', 'US']


At this point, we have read 10 rows from the file, but since we have not exhausted our generator, the file is still open.

How do we close it without iterating through the entire generator?

Easy, we call the `close()` method on it:

In [4]:
parser.close()

closing file...


And the state of the generator is now closed:

In [5]:
from inspect import getgeneratorstate

In [6]:
getgeneratorstate(parser)

'GEN_CLOSED'

Which means we can no longer call `next()` on it - we'll just get a `StopIteration` exception:

In [7]:
next(parser)

StopIteration: 

What's actually happening is that when we call `close()`, an exception is raised **inside** our generator. Notice that we don't actually catch that exception - we have a finally block, but we do not catch the exception.

So, an exception is raised while processing that loop, which means our `finally` block runs right away.

But we are not actually catching the exception, yet we do not actually see the exception appear in our console. This is because that exception is handled specially by Python. When it receives that exception it simply knows that the generator state is now closed.

This is similar to how the `StopIteration` exception that is raised when we use a `for` loop on an iterator, does not actually show up - Python handles it silently, noting that the iterator is exhausted.
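
To make the comparison concrete, here is roughly what a `for` loop does with `StopIteration`. It is a simplified sketch, and `manual_for` is a hypothetical helper written only for this illustration.

```python
def manual_for(iterable, action):
    iterator = iter(iterable)
    while True:
        try:
            item = next(iterator)
        except StopIteration:
            # The exception is consumed here; the caller never sees it.
            break
        action(item)


seen = []
manual_for((x * x for x in range(4)), seen.append)
print(seen)
```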

OK, so now, let's catch that exception inside our generator. The exception is called `GeneratorExit` (and inherits from `BaseException`, not `Exception`, if that matters to you at this point).

But we have to be careful here - when we call `close()`, Python **expects** one of three things to happen:
* the generator raises a `GeneratorExit` exception
* the generator exits cleanly
* the generator raises some other exception - in which case Python will propagate that exception to the caller.

If we trap it, silence it, and try to continue running the generator, Python **will** complain and throw a runtime exception! 

So, it's OK to catch the exception, but if we do, we need to make sure we re-raise it, terminate the function, or raise another exception (which will then bubble up to the caller).

Here's what the Python docs have to say about that:

```
generator.close()
Raises a GeneratorExit at the point where the generator function was paused. If the generator function then exits gracefully, is already closed, or raises GeneratorExit (by not catching the exception), close returns to its caller. If the generator yields a value, a RuntimeError is raised. If the generator raises any other exception, it is propagated to the caller. close() does nothing if the generator has already exited due to an exception or normal exit.
```

Let's look at an example of this:

In [8]:
def parse_file(f_name):
    print('opening file...')
    f = open(f_name, 'r')
    try:
        dialect = csv.Sniffer().sniff(f.read(2000))
        f.seek(0)
        next(f)  # skip header row
        reader = csv.reader(f, dialect=dialect)
        for row in reader:
            yield row
    except Exception as ex:
        print('some exception occurred', str(ex))
    except GeneratorExit:
        print('Generator was closed!')
    finally:
        print('cleaning up...')
        f.close()        

Now let's try that again:

In [9]:
parser = parse_file('cars.csv')
for row in itertools.islice(parser, 5):
    print(row)

opening file...
['Chevrolet Chevelle Malibu', '18.0', '8', '307.0', '130.0', '3504.', '12.0', '70', 'US']
['Buick Skylark 320', '15.0', '8', '350.0', '165.0', '3693.', '11.5', '70', 'US']
['Plymouth Satellite', '18.0', '8', '318.0', '150.0', '3436.', '11.0', '70', 'US']
['AMC Rebel SST', '16.0', '8', '304.0', '150.0', '3433.', '12.0', '70', 'US']
['Ford Torino', '17.0', '8', '302.0', '140.0', '3449.', '10.5', '70', 'US']


In [10]:
parser.close()

Generator was closed!
cleaning up...


You'll notice that the exception occurred, and then the generator ran the `finally` block and had a clean exit - in other words, the `GeneratorExit` exception was silenced, but the generator terminated (returned), so that's perfectly fine.

But what happens if we catch that exception inside a loop maybe, and simply ignore it and try to keep going?

In [11]:
def parse_file(f_name):
    print('opening file...')
    f = open(f_name, 'r')
    try:
        dialect = csv.Sniffer().sniff(f.read(2000))
        f.seek(0)
        next(f)  # skip header row
        reader = csv.reader(f, dialect=dialect)
        for row in reader:
            try:
                yield row
            except GeneratorExit:
                print('ignoring call to close generator...')
    finally:
        print('cleaning up...')
        f.close()    

In [12]:
parser = parse_file('cars.csv')
for row in itertools.islice(parser, 5):
    print(row)

opening file...
['Chevrolet Chevelle Malibu', '18.0', '8', '307.0', '130.0', '3504.', '12.0', '70', 'US']
['Buick Skylark 320', '15.0', '8', '350.0', '165.0', '3693.', '11.5', '70', 'US']
['Plymouth Satellite', '18.0', '8', '318.0', '150.0', '3436.', '11.0', '70', 'US']
['AMC Rebel SST', '16.0', '8', '304.0', '150.0', '3433.', '12.0', '70', 'US']
['Ford Torino', '17.0', '8', '302.0', '140.0', '3449.', '10.5', '70', 'US']


In [13]:
parser.close()

ignoring call to close generator...


RuntimeError: generator ignored GeneratorExit

Aha! See, one does not simply ignore the call to `close()` the generator!

Generators should be cooperative, and ignoring a request to close down is not exactly being cooperative.

If we really want to catch the exception inside our loop, we have to either re-raise it or return from the generator:

So, both of these will work just fine:

In [14]:
def parse_file(f_name):
    print('opening file...')
    f = open(f_name, 'r')
    try:
        dialect = csv.Sniffer().sniff(f.read(2000))
        f.seek(0)
        next(f)  # skip header row
        reader = csv.reader(f, dialect=dialect)
        for row in reader:
            try:
                yield row
            except GeneratorExit:
                print('got a close...')
                raise
    finally:
        print('cleaning up...')
        f.close()    

In [15]:
parser = parse_file('cars.csv')
for row in itertools.islice(parser, 5):
    print(row)

ignoring call to close generator...
opening file...
['Chevrolet Chevelle Malibu', '18.0', '8', '307.0', '130.0', '3504.', '12.0', '70', 'US']
['Buick Skylark 320', '15.0', '8', '350.0', '165.0', '3693.', '11.5', '70', 'US']
['Plymouth Satellite', '18.0', '8', '318.0', '150.0', '3436.', '11.0', '70', 'US']
['AMC Rebel SST', '16.0', '8', '304.0', '150.0', '3433.', '12.0', '70', 'US']
['Ford Torino', '17.0', '8', '302.0', '140.0', '3449.', '10.5', '70', 'US']


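Exception ignored in: <generator object parse_file at 0x000002377566FF68>
RuntimeError: generator ignored GeneratorExit

Note that the `ignoring call to close generator...` line and the `RuntimeError` in this output do not come from our new generator. They come from the previous `parser` generator (the one that ignored `GeneratorExit`), which Python tried to close when we rebound the name `parser`.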


In [16]:
parser.close()

got a close...
cleaning up...


As will this:

In [17]:
def parse_file(f_name):
    print('opening file...')
    f = open(f_name, 'r')
    try:
        dialect = csv.Sniffer().sniff(f.read(2000))
        f.seek(0)
        next(f)  # skip header row
        reader = csv.reader(f, dialect=dialect)
        for row in reader:
            try:
                yield row
            except GeneratorExit:
                print('got a close...')
                return
    finally:
        print('cleaning up...')
        f.close()   

In [18]:
parser = parse_file('cars.csv')
for row in itertools.islice(parser, 5):
    print(row)

opening file...
['Chevrolet Chevelle Malibu', '18.0', '8', '307.0', '130.0', '3504.', '12.0', '70', 'US']
['Buick Skylark 320', '15.0', '8', '350.0', '165.0', '3693.', '11.5', '70', 'US']
['Plymouth Satellite', '18.0', '8', '318.0', '150.0', '3436.', '11.0', '70', 'US']
['AMC Rebel SST', '16.0', '8', '304.0', '150.0', '3433.', '12.0', '70', 'US']
['Ford Torino', '17.0', '8', '302.0', '140.0', '3449.', '10.5', '70', 'US']


In [19]:
parser.close()

got a close...
cleaning up...


And of course, our `finally` block still ran.

If we want to we can also raise an exception, but this will then be received by the caller, who either has to handle it, or let it bubble up:

In [20]:
def parse_file(f_name):
    print('opening file...')
    f = open(f_name, 'r')
    try:
        dialect = csv.Sniffer().sniff(f.read(2000))
        f.seek(0)
        next(f)  # skip header row
        reader = csv.reader(f, dialect=dialect)
        for row in reader:
            try:
                yield row
            except GeneratorExit:
                print('got a close...')
                raise Exception('why, oh why, did you do this?') from None
    finally:
        print('cleaning up...')
        f.close()   

In [21]:
parser = parse_file('cars.csv')
for row in itertools.islice(parser, 5):
    print(row)

opening file...
['Chevrolet Chevelle Malibu', '18.0', '8', '307.0', '130.0', '3504.', '12.0', '70', 'US']
['Buick Skylark 320', '15.0', '8', '350.0', '165.0', '3693.', '11.5', '70', 'US']
['Plymouth Satellite', '18.0', '8', '318.0', '150.0', '3436.', '11.0', '70', 'US']
['AMC Rebel SST', '16.0', '8', '304.0', '150.0', '3433.', '12.0', '70', 'US']
['Ford Torino', '17.0', '8', '302.0', '140.0', '3449.', '10.5', '70', 'US']


In [22]:
parser.close()

got a close...
cleaning up...


Exception: why, oh why, did you do this?
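
The four behaviours we have just seen can be reproduced side by side without a data file. This is a sketch only: `gen`, `outcome` and the `on_close` modes are hypothetical names. In the `'ignore'` mode the generator ignores just the first close request, an assumption made so that Python can still finalize it quietly afterwards.

```python
def gen(on_close):
    ignored_once = False
    while True:
        try:
            yield
        except GeneratorExit:
            if on_close == 'reraise':
                raise
            if on_close == 'return':
                return
            if on_close == 'other':
                raise ValueError('custom error') from None
            # 'ignore' mode: swallow the first request and keep looping,
            # but cooperate the second time so cleanup stays quiet.
            if ignored_once:
                return
            ignored_once = True


def outcome(on_close):
    g = gen(on_close)
    next(g)
    try:
        g.close()
    except Exception as ex:
        return type(ex).__name__
    return 'closed quietly'


modes = ('reraise', 'return', 'other', 'ignore')
results = {mode: outcome(mode) for mode in modes}
for mode, result in results.items():
    print(mode, '->', result)
```

Re-raising and returning both let `close()` finish silently, raising a different exception hands that exception to the caller, and yielding again produces the `RuntimeError`.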

Another very important point to note is that the `GeneratorExit` exception does not inherit from `Exception` - because of that, you can still trap an `Exception`, even very broadly, without accidentally catching, and potentially silencing, a `GeneratorExit` exception.

We'll see an example of this next.
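
Before moving on, here is a minimal sketch of that point on its own. The `worker` coroutine and the `events` list are hypothetical and used only for illustration.

```python
from inspect import getgeneratorstate

events = []


def worker():
    while True:
        try:
            item = yield
            events.append(10 / item)
        except Exception as ex:
            # Broad handler: catches ZeroDivisionError and friends,
            # but not GeneratorExit.
            events.append(f'recovered from {type(ex).__name__}')


w = worker()
next(w)
w.send(5)    # normal processing
w.send(0)    # ZeroDivisionError is caught, the generator keeps running
w.close()    # GeneratorExit is not caught, so the generator closes

state = getgeneratorstate(w)
print(events)
print(state)
print(issubclass(GeneratorExit, Exception))
```

Had `GeneratorExit` inherited from `Exception`, the broad handler would have swallowed it, the loop would have yielded again, and `close()` would have raised a `RuntimeError`.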

So, what about applying the same `close()` to generators acting not as iterators, but as coroutines?

Suppose we have a generator whose job is to open a database transaction, receive data to be written to the database, and then commit the transaction to the database once the work is "over".

We can certainly do it using a context manager - but we can also do it using a coroutine.

In [23]:
def save_to_db():
    print('starting new transaction')
    while True:
        try:
            data = yield
            print('sending data to database:', data)
        except GeneratorExit:
            print('committing transaction')
            raise

In [24]:
trans = save_to_db()

In [25]:
next(trans)

starting new transaction


In [26]:
trans.send('data 1')

sending data to database: data 1


In [27]:
trans.send('data 2')

sending data to database: data 2


In [28]:
trans.close()

committing transaction


But of course, something could go wrong while writing the data to the database, in which case we would want to abort the transaction instead:

In [5]:
def save_to_db():
    print('starting new transaction')
    while True:
        try:
            data = yield
            print('sending data to database:', eval(data))
        except Exception:
            print('aborting transaction')  
        except GeneratorExit:
            print('committing transaction')
            raise

In [6]:
trans = save_to_db()
next(trans)

starting new transaction


In [7]:
trans.send('1 + 10')

sending data to database: 11


In [8]:
trans.send('1/0')

aborting transaction


But we have a slight problem:

In [34]:
trans.send('2 + 2')

sending data to database: 4


We'll circle back to this in a bit.

But we can still commit the transaction when things do not go wrong:

In [35]:
trans = save_to_db()
next(trans)
trans.send('1+10')
trans.send('2+10')
trans.close()

committing transaction
starting new transaction
sending data to database: 11
sending data to database: 12
committing transaction


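(The first `committing transaction` line in that output comes from the previous `trans` generator, which Python closed when we rebound the name `trans`.)

OK, so this works but is far from ideal: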

1. We do not know that an exception occurred and that a rollback happened (well we do from the console output, but not programmatically)
2. if an abort took place, we really need to close the generator
3. It would be safer to have a `finally` clause that either commits or rolls back the transaction - we could have an exception that is not caught by any of our exception handlers - and that would be a problem!

Let's fix those issues up:

In [36]:
def save_to_db():
    print('starting new transaction')
    is_abort = False
    try:
        while True:
            data = yield
            print('sending data to database:', eval(data))
    except Exception:
        is_abort = True
        raise
    finally:
        if is_abort:
            print('aborting transaction')
        else:
            print('committing transaction')

Notice how we're not even catching the `GeneratorExit` exception - we really don't need to. That exception will be raised, the `finally` block will run, and the `GeneratorExit` exception will bubble up to Python, which expects it after the call to `close()`.

In [37]:
trans = save_to_db()
next(trans)
trans.send('1 + 1')
trans.close()

starting new transaction
sending data to database: 2
committing transaction


In [38]:
trans = save_to_db()
next(trans)
trans.send('1 / 0')

starting new transaction
aborting transaction


ZeroDivisionError: division by zero
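
With this version the caller can react to a failure programmatically. The sketch below assumes an in-memory list standing in for the database; `save_to_store`, `store` and `log` are hypothetical names, and `int(data)` replaces `eval(data)` as the operation that may fail.

```python
from inspect import getgeneratorstate


def save_to_store(store, log):
    pending = []
    is_abort = False
    try:
        while True:
            data = yield
            pending.append(int(data))  # may raise ValueError
    except Exception:
        is_abort = True
        raise
    finally:
        if is_abort:
            log.append('abort')
        else:
            store.extend(pending)      # "commit" the pending rows
            log.append('commit')


store, log = [], []

# Successful transaction: close() triggers the commit.
trans = save_to_store(store, log)
next(trans)
trans.send('1')
trans.send('2')
trans.close()

# Failing transaction: the caller sees the exception.
trans = save_to_store(store, log)
next(trans)
trans.send('3')
failed = False
try:
    trans.send('oops')
except ValueError:
    failed = True

state = getgeneratorstate(trans)
print(store, log, failed, state)
```

After the failure the generator is closed, so nothing more can be sent to the aborted transaction, which addresses the problem we saw earlier.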

##  Yield From - Throwing Exceptions

We have seen that `yield from` allows us to establish a 2-way communication channel with a subgenerator, and that we can use `next` and `send` to send a "request" to a delegated subgenerator via the delegator generator.

In fact, we can also send exceptions by throwing an exception into the delegator, just like a `send`.

In [1]:
class CloseCoroutine(Exception):
    pass

def echo():
    try:
        while True:
            received = yield
            print(received)
    except CloseCoroutine:
        return 'coro was closed'
    except GeneratorExit:
        print('closed method was called')

In [2]:
e = echo()
next(e)

In [3]:
e.throw(CloseCoroutine('just closing'))

StopIteration: coro was closed

In [4]:
e = echo()
next(e)
e.close()

closed method was called


As we can see the difference between `throw` and `close` is that although `close` causes an exception to be raised in the generator, Python essentially silences it.

It works the same way when we delegate to the coroutine in a delegator:

In [5]:
def delegator():
    result = yield from echo()
    yield 'subgen closed and returned:', result
    print('delegator closing...')

In [6]:
d = delegator()
next(d)
d.send('hello')

hello


In [7]:
d.throw(CloseCoroutine)

('subgen closed and returned:', 'coro was closed')

Now what happens if the subgenerator does not close when we `throw` an exception into it, but instead silences the exception and yields a value?

In [8]:
class CloseCoroutine(Exception):
    pass

class IgnoreMe(Exception):
    pass

def echo():
    try:
        while True:
            try:
                received = yield
                print(received)
            except IgnoreMe:
                yield "I'm ignoring you..."
    except CloseCoroutine:
        return 'coro was closed'
    except GeneratorExit:
        print('closed method was called')

In [9]:
d = delegator()
next(d)

In [10]:
d.send('python')

python


In [11]:
result = d.throw(IgnoreMe(1000))

In [12]:
result

"I'm ignoring you..."

In [13]:
d.send('rocks!')

Why was `rocks!` not printed?

That's because the subgenerator was paused at the `yield` that yielded "I'm ignoring you...", so the value we sent was received (and discarded) by that `yield` instead of by `received = yield`.

If we want the coroutine to continue running normally after ignoring that exception, we need to tweak it slightly:

Let's first make sure we close our previous delegator!

In [14]:
d.close()

closed method was called


In [15]:
def echo():
    try:
        output = None
        while True:
            try:
                received = yield output
                print(received)
            except IgnoreMe:
                output = "I'm ignoring you..."
            else:
                output = None
    except CloseCoroutine:
        return 'coro was closed'
    except GeneratorExit:
        print('closed method was called')

In [16]:
d = delegator()
next(d)

In [17]:
d.send('hello')

hello


In [18]:
d.throw(IgnoreMe)

"I'm ignoring you..."

In [19]:
d.send('python')

python


In [20]:
d.close()

closed method was called


What happens if we do not handle the error in the subgenerator and simply let the exception propagate up?
Who gets the exception, the delegator, or the caller?

In [37]:
def echo():
    while True:
        received = yield
        print(received)

In [38]:
def delegator():
    yield from echo()

In [39]:
d = delegator()
next(d)

In [24]:
d.throw(ValueError)

ValueError: 

OK, so we, the caller, see the exception. But did the delegator see it too? In other words, can we catch the exception in the delegator?

In [25]:
def delegator():
    try:
        yield from echo()
    except ValueError:
        print('got the value error')

In [26]:
d = delegator()
next(d)

In [27]:
d.throw(ValueError)

got the value error


StopIteration: 

As you can see, we were able to catch the exception in the delegator.
Of course, the way we wrote our code, the delegator still closed, and hence we now see a `StopIteration` exception.
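
If we would rather keep the delegator alive, one option is to wrap the `yield from` in a loop and start a fresh subgenerator after handling the exception. This is only a sketch of that idea; the `collected` list is a hypothetical addition used in place of `print`.

```python
def echo(collected):
    while True:
        received = yield
        collected.append(received)


def delegator(collected):
    while True:
        try:
            yield from echo(collected)
        except ValueError:
            collected.append('ValueError handled')
            # Loop around: a new `echo` is created and primed by
            # `yield from`, so the delegator keeps running.


collected = []
d = delegator(collected)
next(d)
d.send('a')
d.throw(ValueError())   # the old echo dies, a new one takes over
d.send('b')
d.close()

print(collected)
```

Note that the original subgenerator is gone after the exception propagates out of it; only the delegator survives, which is why a new `echo` has to be created.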

#### Example

Suppose we have a coroutine that creates running averages, and we want to occasionally write the current data to a file:

In [28]:
class WriteAverage(Exception):
    pass

def averager(out_file):
    total = 0
    count = 0
    average = None
    with open(out_file, 'w') as f:
        f.write('count,average\n')
        while True:
            try:
                received = yield average
                total += received
                count += 1
                average = total / count
            except WriteAverage:
                if average is not None:
                    print('saving average to file:', average)
                    f.write(f'{count},{average}\n')

In [29]:
avg = averager('sample.csv')
next(avg)

In [30]:
avg.send(1)
avg.send(2)

1.5

In [31]:
avg.throw(WriteAverage)

saving average to file: 1.5


1.5

In [32]:
avg.send(3)

2.0

In [33]:
avg.send(2)

2.0

In [34]:
avg.throw(WriteAverage)

saving average to file: 2.0


2.0

In [35]:
avg.close()

Now we can read the data back and make sure it worked as expected:

In [36]:
with open('sample.csv') as f:
    for row in f:
        print(row.strip())

count,average
2,1.5
4,2.0


Of course we can use a delegator as well.
Maybe the delegator is charged with figuring out the output file name.
Here we'll just hardcode it inside the delegator:

In [40]:
def delegator():
    yield from averager('sample.csv')

In [41]:
d = delegator()
next(d)

In [42]:
d.send(1)

1.0

In [43]:
d.send(2)

1.5

In [44]:
d.send(3)

2.0

In [45]:
d.send(4)

2.5

In [46]:
d.throw(WriteAverage)

saving average to file: 2.5


2.5

In [47]:
d.send(5)

3.0

In [48]:
d.throw(WriteAverage)

saving average to file: 3.0


3.0

In [49]:
d.close()

In [50]:
with open('sample.csv') as f:
    for row in f:
        print(row.strip())

count,average
4,2.5
5,3.0


##  Pipelines - Pulling Data

We are going to use the `cars.csv` data file included with this notebook.

Let's start by writing a generator that will produce data from that file:

In [1]:
import csv

def parse_data(f_name):
    f = open(f_name)
    try:
        dialect = csv.Sniffer().sniff(f.read(2000))
        f.seek(0)
        next(f)  # skip header row
        yield from csv.reader(f, dialect=dialect)
    finally:
        f.close()

Notice how we are already using delegation to delegate iteration to the csv reader iterator. Here we are therefore pulling data from the csv reader and yielding that out from the `parse_data` generator.

Let's look at the data:

In [2]:
import itertools

for row in itertools.islice(parse_data('cars.csv'), 5):
    print(row)

['Chevrolet Chevelle Malibu', '18.0', '8', '307.0', '130.0', '3504.', '12.0', '70', 'US']
['Buick Skylark 320', '15.0', '8', '350.0', '165.0', '3693.', '11.5', '70', 'US']
['Plymouth Satellite', '18.0', '8', '318.0', '150.0', '3436.', '11.0', '70', 'US']
['AMC Rebel SST', '16.0', '8', '304.0', '150.0', '3433.', '12.0', '70', 'US']
['Ford Torino', '17.0', '8', '302.0', '140.0', '3449.', '10.5', '70', 'US']


Now let's filter the rows based on the car make:

In [3]:
def filter_data(rows, contains):
    for row in rows:
        if contains in row[0]:
            yield row

We can now start building a (pull) pipeline by pulling data from the data source, through the filter:
```
caller <-- filter <-- data
```

In [4]:
data = parse_data('cars.csv')
filtered_data = filter_data(data, 'Chevrolet')

# pipeline: caller <-- filtered_data <-- data

for row in itertools.islice(filtered_data, 5):
    print(row)

['Chevrolet Chevelle Malibu', '18.0', '8', '307.0', '130.0', '3504.', '12.0', '70', 'US']
['Chevrolet Impala', '14.0', '8', '454.0', '220.0', '4354.', '9.0', '70', 'US']
['Chevrolet Chevelle Concours (sw)', '0', '8', '350.0', '165.0', '4142.', '11.5', '70', 'US']
['Chevrolet Monte Carlo', '15.0', '8', '400.0', '150.0', '3761.', '9.5', '70', 'US']
['Chevrolet Vega 2300', '28.0', '4', '140.0', '90.00', '2264.', '15.5', '71', 'US']


As you can see, using iteration we are pulling data all the way from the file, through the csv reader, through the filter and back to us (the caller).

But why stop there?
Let's further restrict the results to rows that also contain the word 'Carlo':

In [5]:
data = parse_data('cars.csv')
filter_1 = filter_data(data, 'Chevrolet')
filter_2 = filter_data(filter_1, 'Carlo')

# pipeline: caller <-- filter_2 <-- filter_1 <-- data

for row in itertools.islice(filter_2, 5):
    print(row)

['Chevrolet Monte Carlo', '15.0', '8', '400.0', '150.0', '3761.', '9.5', '70', 'US']
['Chevrolet Monte Carlo S', '15.0', '8', '350.0', '145.0', '4082.', '13.0', '73', 'US']
['Chevrolet Monte Carlo Landau', '15.5', '8', '350.0', '170.0', '4165.', '11.4', '77', 'US']
['Chevrolet Monte Carlo Landau', '19.2', '8', '305.0', '145.0', '3425.', '13.2', '78', 'US']


We can package all this up into a single delegator generator:

In [6]:
def output(f_name):
    data = parse_data(f_name)
    filter_1 = filter_data(data,'Chevrolet')
    filter_2 = filter_data(filter_1, 'Carlo')
    yield from filter_2

And we can use our delegator generator this way:

In [7]:
results = output('cars.csv')
for row in results:
    print(row)

['Chevrolet Monte Carlo', '15.0', '8', '400.0', '150.0', '3761.', '9.5', '70', 'US']
['Chevrolet Monte Carlo S', '15.0', '8', '350.0', '145.0', '4082.', '13.0', '73', 'US']
['Chevrolet Monte Carlo Landau', '15.5', '8', '350.0', '170.0', '4165.', '11.4', '77', 'US']
['Chevrolet Monte Carlo Landau', '19.2', '8', '305.0', '145.0', '3425.', '13.2', '78', 'US']


We can actually make this a little more generic while we're at it:

In [8]:
def output(f_name, *filter_words):
    data = parse_data(f_name)
    for filter_word in filter_words:
        data = filter_data(data, filter_word)
    yield from data

In [9]:
results = output('cars.csv', 'Chevrolet')
for row in itertools.islice(results, 5):
    print(row)

['Chevrolet Chevelle Malibu', '18.0', '8', '307.0', '130.0', '3504.', '12.0', '70', 'US']
['Chevrolet Impala', '14.0', '8', '454.0', '220.0', '4354.', '9.0', '70', 'US']
['Chevrolet Chevelle Concours (sw)', '0', '8', '350.0', '165.0', '4142.', '11.5', '70', 'US']
['Chevrolet Monte Carlo', '15.0', '8', '400.0', '150.0', '3761.', '9.5', '70', 'US']
['Chevrolet Vega 2300', '28.0', '4', '140.0', '90.00', '2264.', '15.5', '71', 'US']


In [10]:
results = output('cars.csv', 'Chevrolet', 'Carlo', 'Landau')
for row in itertools.islice(results, 5):
    print(row)

['Chevrolet Monte Carlo Landau', '15.5', '8', '350.0', '170.0', '4165.', '11.4', '77', 'US']
['Chevrolet Monte Carlo Landau', '19.2', '8', '305.0', '145.0', '3425.', '13.2', '78', 'US']
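
The same pull pipeline can be tried without the data file. In this sketch the `source` generator is a hypothetical stand-in for `parse_data`, and the rows are a few entries abbreviated from the output shown earlier.

```python
def source(rows):
    # Stand-in for parse_data: yields rows from memory instead of a file.
    yield from rows


def filter_data(rows, contains):
    for row in rows:
        if contains in row[0]:
            yield row


def output(rows, *filter_words):
    data = source(rows)
    for filter_word in filter_words:
        # Each filter wraps the previous stage of the pipeline.
        data = filter_data(data, filter_word)
    yield from data


rows = [
    ['Chevrolet Impala', '14.0'],
    ['Ford Torino', '17.0'],
    ['Chevrolet Monte Carlo', '15.0'],
    ['Chevrolet Monte Carlo Landau', '15.5'],
]

result = list(output(rows, 'Chevrolet', 'Carlo'))
print(result)
```

Nothing is read from `source` until the caller starts iterating, so every row is pulled through all the filters one at a time.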


##  Pipelines - Broadcasting

To work along with this notebook you'll need the included data file, `car_data.csv`.

We are going to want to split the data into different files based on some criteria of our choosing.

For example, we may want to create a file that contains all pink cars, another file that contains all Mercedes brands, and another that contains only blue cars of a specific model year, etc.

Let's first write a generator to parse the data for us:

In [1]:
import csv

def data_reader(f_name):
    f = open(f_name)
    try:
        dialect = csv.Sniffer().sniff(f.read(2000))
        f.seek(0)
        reader = csv.reader(f, dialect=dialect)
        yield from reader
    finally:
        f.close()

In [2]:
for row in data_reader('car_data.csv'):
    print(row)

['car_make', 'car_model', 'model_year', 'vin', 'color']
['Mitsubishi', 'Outlander', '2011', 'WBAEV33453K542952', 'Indigo']
['Pontiac', 'Sunfire', '2001', 'SCFAD06D99G713780', 'Maroon']
['Pontiac', 'Grand Am', '1994', 'WBA6B8C59ED852919', 'Red']
['Chrysler', 'Town & Country', '2008', '1GD422CGXEF757763', 'Violet']
['Isuzu', 'Trooper', '1999', '3GTU2YEJ6CG150061', 'Red']
['Acura', 'NSX', '2002', 'WA1DGBFP6DA002021', 'Orange']
['Oldsmobile', 'Cutlass Supreme', '1997', '5TFCW5F1XBX807662', 'Red']
['Ford', 'F-Series', '1995', 'WAUDF98E55A083878', 'Red']
['Saab', '900', '1998', '1C3ADEAZXDV389424', 'Indigo']
['Land Rover', 'LR3', '2008', 'WBANE53507B229964', 'Pink']
['Audi', 'V8', '1994', 'WVWAN7ANXDE961674', 'Mauv']
['Isuzu', 'Stylus', '1993', '1VWAH7A39EC443135', 'Purple']
['Dodge', 'Ramcharger', '1993', 'WAUSH78E07A410079', 'Yellow']
['Aston Martin', 'DBS', '2008', 'WAUFMBFC7EN209268', 'Pink']
['Ford', 'Fusion', '2010', '2HNYD2H83DH124143', 'Violet']
['GMC', 'Yukon XL 2500', '2000', 'WBAS

Let's create our indices, output headers and data converters for this file - basically these are our configuration parameters for this data file.

In [3]:
input_file = 'car_data.csv'

idx_make = 0
idx_model = 1
idx_year = 2
idx_vin = 3
idx_color = 4

headers = ('make', 'model', 'year', 'vin', 'color')

converters = (str, str, int, str, str)

Now let's create a generator that will return the parsed data:

In [4]:
def data_parser():
    data = data_reader(input_file)
    next(data)  # skip header row
    for row in data:
        parsed_row = [converter(item)
                      for converter, item in zip(converters, row)]
        yield parsed_row

Let's just make sure this is working properly:

In [5]:
data = data_parser()
for _ in range(5):
    print(next(data))

['Mitsubishi', 'Outlander', 2011, 'WBAEV33453K542952', 'Indigo']
['Pontiac', 'Sunfire', 2001, 'SCFAD06D99G713780', 'Maroon']
['Pontiac', 'Grand Am', 1994, 'WBA6B8C59ED852919', 'Red']
['Chrysler', 'Town & Country', 2008, '1GD422CGXEF757763', 'Violet']
['Isuzu', 'Trooper', 1999, '3GTU2YEJ6CG150061', 'Red']


Let's also write our coroutine decorator that will auto prime coroutines:

In [6]:
def coroutine(fn):
    def inner(*args, **kwargs):
        g = fn(*args, **kwargs)
        next(g)
        return g
    return inner
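
To see what the priming buys us, here is a minimal sketch. The `echo` coroutine is a made-up example, and the use of `functools.wraps` is an optional addition (an assumption on my part, not something the code above needs) that simply preserves the decorated function's name and docstring.

```python
from functools import wraps

def coroutine(fn):
    @wraps(fn)  # optional: keep fn's name and docstring on the wrapper
    def inner(*args, **kwargs):
        g = fn(*args, **kwargs)
        next(g)  # advance to the first yield so send() works right away
        return g
    return inner

@coroutine
def echo(received):
    # hypothetical coroutine: stores whatever it is sent
    while True:
        item = yield
        received.append(item)

received = []
e = echo(received)
e.send('hello')  # no explicit next(e) needed before the first send
e.send('world')
e.close()
print(received)  # ['hello', 'world']
```

Without the decorator, the first `send` of a non-`None` value to a freshly created generator raises a `TypeError`, which is why we prime it first.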

Next we are going to write a coroutine that will create a file and write data to it. We'll need to pass the output file name to the coroutine, and the coroutine will assume that each row is passed in as a list (basically whatever is coming back from `data_parser`). To make it easier, we'll also pass it the column headers so we can include them in the output file.

In [7]:
@coroutine
def save_data(f_name, headers):
    with open(f_name, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(headers)
        while True:
            data_row = yield
            writer.writerow(data_row)

Now we're going to create a filter coroutine that will have the following parameters:
* filter function (predicate)
* next coroutine to send to in the pipeline

That filter coroutine will receive a data row and test if the predicate applied to that data row is True. If it is, it will send the row to the next stage (target) of the pipeline, otherwise it just ignores the data row.

In [8]:
@coroutine
def filter_data(filter_predicate, target):
    while True:
        data_row = yield
        if filter_predicate(data_row):
            target.send(data_row)

Next, let's write our broadcaster. It just sends each row it receives to every coroutine specified in the `targets` argument:

In [9]:
@coroutine
def broadcast(targets):
    while True:
        data_row = yield
        for target in targets:
            target.send(data_row)
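
Before wiring this up to files, it can help to try the filter and broadcast stages in memory. The sketch below is only an illustration: `collect` is a hypothetical endpoint that appends rows to a list instead of writing a file, and the three sample rows are taken from the parsed data shown earlier.

```python
def coroutine(fn):
    def inner(*args, **kwargs):
        g = fn(*args, **kwargs)
        next(g)  # prime the coroutine
        return g
    return inner

@coroutine
def collect(rows):
    # hypothetical in-memory endpoint, used here instead of save_data
    while True:
        row = yield
        rows.append(row)

@coroutine
def filter_data(filter_predicate, target):
    while True:
        data_row = yield
        if filter_predicate(data_row):
            target.send(data_row)

@coroutine
def broadcast(targets):
    while True:
        data_row = yield
        for target in targets:
            target.send(data_row)

pink, older = [], []
broadcaster = broadcast((
    filter_data(lambda d: d[4].lower() == 'pink', collect(pink)),
    filter_data(lambda d: d[2] <= 2010, collect(older)),
))

sample_rows = [
    ['Land Rover', 'LR3', 2008, 'WBANE53507B229964', 'Pink'],
    ['Mitsubishi', 'Outlander', 2011, 'WBAEV33453K542952', 'Indigo'],
    ['Isuzu', 'Trooper', 1999, '3GTU2YEJ6CG150061', 'Red'],
]
for row in sample_rows:
    broadcaster.send(row)

print([r[1] for r in pink])   # ['LR3']
print([r[1] for r in older])  # ['LR3', 'Trooper']
```

Every row reaches every filter, and each filter decides independently whether to pass it on, so the same row can end up in more than one output.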

OK, we're now ready to put all this together.

```
     data                      
      |                      |--> filter --> save
      v                      |
process_data --> broadcast --|--> filter --> save
                             |
                             |--> filter --> save
```

In [10]:
def process_data():
    data = data_parser()
    
    out_pink_cars = save_data('pink_cars.csv', headers)
    out_ford_green = save_data('ford_green.csv', headers)
    out_older = save_data('older.csv', headers)
    
    filter_pink_cars = filter_data(lambda d: d[idx_color].lower() == 'pink',
                                   out_pink_cars)
    
    def pred_ford_green(data_row):
        return (data_row[idx_make].lower() == 'ford' 
                and data_row[idx_color].lower() == 'green')
    
    filter_ford_green = filter_data(pred_ford_green, out_ford_green)
    filter_older = filter_data(lambda d: d[idx_year] <= 2010, out_older)
    filters = (filter_pink_cars, filter_ford_green, filter_older)
    broadcaster = broadcast(filters)
    
    for row in data:
        broadcaster.send(row)
    
    print('Finished processing.')

And now let's call it and see what happens!

In [11]:
process_data()

Finished processing.


Let's see what those files contain:

In [12]:
def print_file_data():
    for file_name in ('pink_cars.csv', 'ford_green.csv', 'older.csv'):
        print(f'***** {file_name} *****')
        for row in data_reader(file_name):
            print(row)
        print('\n\n\n')

print_file_data()

***** pink_cars.csv *****
['make', 'model', 'year', 'vin', 'color']
['Land Rover', 'LR3', '2008', 'WBANE53507B229964', 'Pink']
['Aston Martin', 'DBS', '2008', 'WAUFMBFC7EN209268', 'Pink']
['Ford', 'F150', '1993', 'JN8AF5MR5BT143315', 'Pink']
['Nissan', 'Murano', '2012', '1D7RB1CP2AS005941', 'Pink']
['Chevrolet', 'HHR Panel', '2006', 'WA1VMAFE4BD281230', 'Pink']
['Chrysler', 'New Yorker', '1996', '19UUA8F31DA965112', 'Pink']
['Dodge', 'Viper', '2005', 'WA1VGBFPXEA735467', 'Pink']
['Maserati', 'Karif', '1990', '3LN6L2LU5FR389539', 'Pink']
['Chrysler', 'Voyager', '2001', 'SCFHDDAJ7BA859249', 'Pink']
['Ford', 'Expedition', '2002', 'JTJHY7AX8D4370558', 'Pink']
['Suzuki', 'Grand Vitara', '2010', '1GYUCHEF9AR508723', 'Pink']
['GMC', '2500', '1995', 'JH4DC53004S939059', 'Pink']
['GMC', 'Yukon XL 1500', '2009', '3D4PG4FB1BT559916', 'Pink']
['Dodge', 'Avenger', '2009', 'SCBCR7ZA0AC733739', 'Pink']
['GMC', 'Jimmy', '2001', '2C4RRGAGXDR589104', 'Pink']
['Subaru', 'Legacy', '1998', 'WA1CM74L39D1881

There's one more bit of cleanup I want to do though.

I would prefer that the definition of my pipeline not also be the consumer of the data, just to keep the functionality more separated.

So let's rewrite `process_data` to just be another step in the pipeline.

In [13]:
@coroutine
def pipeline_coro():
    out_pink_cars = save_data('pink_cars.csv', headers)
    out_ford_green = save_data('ford_green.csv', headers)
    out_older = save_data('older.csv', headers)
    
    filter_pink_cars = filter_data(lambda d: d[idx_color].lower() == 'pink',
                                  out_pink_cars)
    
    def pred_ford_green(data_row):
        return (data_row[idx_make].lower() == 'ford'
               and data_row[idx_color].lower() == 'green')
    filter_ford_green = filter_data(pred_ford_green, out_ford_green)
    filter_older = filter_data(lambda d: d[idx_year] <= 2010, out_older)
    
    filters = (filter_pink_cars, filter_ford_green, filter_older)
    
    broadcaster = broadcast(filters)
    
    while True:
        data_row = yield
        broadcaster.send(data_row)    

And now we can use the pipeline this way:

In [14]:
pipe = pipeline_coro()
data = data_parser()
for row in data:
    pipe.send(row)

OK, so now let's make sure the correct data is in those output files:

In [15]:
print_file_data()

***** pink_cars.csv *****


Error: Could not determine delimiter

Uh-oh, we get an exception. Why did the parser fail to figure out the dialect of the file?

Let's see what's in the file:

In [16]:
with open('pink_cars.csv') as f:
    for row in f:
        print('row', row)

The file is empty!!

The issue is that our files have not been closed yet!

The pipeline coroutine is still active, so nothing got released or closed - including the endpoints of our pipeline.

Fortunately this is easy to fix - we just need to close the pipeline.

In [17]:
pipe.close()
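
A side note on why closing only the outer coroutine is enough here. My understanding is that in CPython, once `pipe` is closed its local variables are released, and the nested coroutines get closed in turn as their reference counts drop to zero. If you would rather not rely on that, one option is to propagate the close explicitly. The following is a sketch under that assumption, not the approach used in this notebook; `sink` is a hypothetical endpoint that records events in a list instead of writing a file.

```python
def coroutine(fn):
    def inner(*args, **kwargs):
        g = fn(*args, **kwargs)
        next(g)  # prime the coroutine
        return g
    return inner

@coroutine
def sink(name, log):
    # hypothetical endpoint standing in for save_data
    try:
        while True:
            row = yield
            log.append((name, row))
    finally:
        # runs when close() raises GeneratorExit at the paused yield
        log.append((name, 'closed'))

@coroutine
def broadcast(targets):
    try:
        while True:
            data_row = yield
            for target in targets:
                target.send(data_row)
    finally:
        # explicitly close every downstream coroutine
        for target in targets:
            target.close()

log = []
targets = (sink('a', log), sink('b', log))  # we keep our own references
b = broadcast(targets)
b.send(1)
b.close()
print(log)
# [('a', 1), ('b', 1), ('a', 'closed'), ('b', 'closed')]
```

Even though `targets` is still referenced after `b.close()`, both endpoints have been closed, because the `finally` block did it for us.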

And now we should be able to read those files:

In [18]:
print_file_data()

***** pink_cars.csv *****
['make', 'model', 'year', 'vin', 'color']
['Land Rover', 'LR3', '2008', 'WBANE53507B229964', 'Pink']
['Aston Martin', 'DBS', '2008', 'WAUFMBFC7EN209268', 'Pink']
['Ford', 'F150', '1993', 'JN8AF5MR5BT143315', 'Pink']
['Nissan', 'Murano', '2012', '1D7RB1CP2AS005941', 'Pink']
['Chevrolet', 'HHR Panel', '2006', 'WA1VMAFE4BD281230', 'Pink']
['Chrysler', 'New Yorker', '1996', '19UUA8F31DA965112', 'Pink']
['Dodge', 'Viper', '2005', 'WA1VGBFPXEA735467', 'Pink']
['Maserati', 'Karif', '1990', '3LN6L2LU5FR389539', 'Pink']
['Chrysler', 'Voyager', '2001', 'SCFHDDAJ7BA859249', 'Pink']
['Ford', 'Expedition', '2002', 'JTJHY7AX8D4370558', 'Pink']
['Suzuki', 'Grand Vitara', '2010', '1GYUCHEF9AR508723', 'Pink']
['GMC', '2500', '1995', 'JH4DC53004S939059', 'Pink']
['GMC', 'Yukon XL 1500', '2009', '3D4PG4FB1BT559916', 'Pink']
['Dodge', 'Avenger', '2009', 'SCBCR7ZA0AC733739', 'Pink']
['GMC', 'Jimmy', '2001', '2C4RRGAGXDR589104', 'Pink']
['Subaru', 'Legacy', '1998', 'WA1CM74L39D1881

Perfect, so just to recap, here's how we would use our pipeline:

In [19]:
pipe = pipeline_coro()
data = data_parser()
for row in data:
    pipe.send(row)
pipe.close()

Hmm... Notice how we open the pipeline, and then close it?
Does this remind you of a context manager?

Let's write a context manager for our pipeline - that way we'll never forget to close it!

In [20]:
from contextlib import contextmanager

In [21]:
@contextmanager
def pipeline():
    p = pipeline_coro()
    try:
        yield p
    finally:
        p.close()

And now we can use it this way:

In [22]:
with pipeline() as pipe:
    data = data_parser()
    for row in data:
        pipe.send(row)

And again, let's just make sure the files are OK:

In [23]:
print_file_data()

***** pink_cars.csv *****
['make', 'model', 'year', 'vin', 'color']
['Land Rover', 'LR3', '2008', 'WBANE53507B229964', 'Pink']
['Aston Martin', 'DBS', '2008', 'WAUFMBFC7EN209268', 'Pink']
['Ford', 'F150', '1993', 'JN8AF5MR5BT143315', 'Pink']
['Nissan', 'Murano', '2012', '1D7RB1CP2AS005941', 'Pink']
['Chevrolet', 'HHR Panel', '2006', 'WA1VMAFE4BD281230', 'Pink']
['Chrysler', 'New Yorker', '1996', '19UUA8F31DA965112', 'Pink']
['Dodge', 'Viper', '2005', 'WA1VGBFPXEA735467', 'Pink']
['Maserati', 'Karif', '1990', '3LN6L2LU5FR389539', 'Pink']
['Chrysler', 'Voyager', '2001', 'SCFHDDAJ7BA859249', 'Pink']
['Ford', 'Expedition', '2002', 'JTJHY7AX8D4370558', 'Pink']
['Suzuki', 'Grand Vitara', '2010', '1GYUCHEF9AR508723', 'Pink']
['GMC', '2500', '1995', 'JH4DC53004S939059', 'Pink']
['GMC', 'Yukon XL 1500', '2009', '3D4PG4FB1BT559916', 'Pink']
['Dodge', 'Avenger', '2009', 'SCBCR7ZA0AC733739', 'Pink']
['GMC', 'Jimmy', '2001', '2C4RRGAGXDR589104', 'Pink']
['Subaru', 'Legacy', '1998', 'WA1CM74L39D1881

Perfect!
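
As an aside, since generators already have a `close()` method, the standard library's `contextlib.closing` can play the same role as our hand-written context manager. The sketch below is just an illustration of that alternative; `collector` is a hypothetical in-memory coroutine used in place of `pipeline_coro` so the example runs without any files.

```python
from contextlib import closing

def coroutine(fn):
    def inner(*args, **kwargs):
        g = fn(*args, **kwargs)
        next(g)  # prime the coroutine
        return g
    return inner

@coroutine
def collector(rows, events):
    # hypothetical stand-in for pipeline_coro
    try:
        while True:
            row = yield
            rows.append(row)
    finally:
        events.append('closed')

rows, events = [], []
# closing() calls .close() on the coroutine when the block exits
with closing(collector(rows, events)) as pipe:
    for row in (['a'], ['b']):
        pipe.send(row)

print(rows)    # [['a'], ['b']]
print(events)  # ['closed']
```

Writing our own context manager, as we did above, is still useful when the setup or teardown involves more than a single `close()` call.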

# Section 13 - Project 6

##  Project 6 - Description

The goal of this project is to rewrite the pull pipeline we created in the **Application - Pipelines - Pulling** video in the **Generators as Coroutines** section.

You should look at the techniques we used in the **Application - Pipelines - Broadcasting** video and apply them here.

The goal is to write a pipeline that will read data from the source file, `cars.csv`, and push it through some filters and a save coroutine to ultimately save the results as a csv file.

Try to make your code as generic as possible, and don't worry about column headers in the output file (unless you really want to!).

When you are done with your solution you should be able to specify an arbitrary number of filters on the name field.

If you specify `Chevrolet`, `Carlo` and `Landau` for three filters, your output file should contain two lines of data only:

```
Chevrolet Monte Carlo Landau,15.5,8,350.0,170.0,4165.,11.4,77,US
Chevrolet Monte Carlo Landau,19.2,8,305.0,145.0,3425.,13.2,78,US
```

Good luck!!

##  Project 6 - Solution

The first thing I'm going to do is to simply copy the `parse_data` code from the pull pipeline example - we can reuse that code:

In [1]:
import csv

def parse_data(f_name):
    f = open(f_name)
    try:
        dialect = csv.Sniffer().sniff(f.read(2000))
        f.seek(0)
        next(f)  # skip header row
        yield from csv.reader(f, dialect=dialect)
    finally:
        f.close()

Since we're going to be creating coroutines, a decorator that automatically primes them is going to be useful:

In [2]:
def coroutine(fn):
    def inner(*args, **kwargs):
        coro = fn(*args, **kwargs)
        next(coro)
        return coro
    return inner

Let's start writing the various coroutines and functions we are going to need for our pipeline:

* coroutine to save data to a file
* coroutine to filter data based on the vehicle name - but we'll make it generic and use a filter function (predicate) as an argument
* coroutine to act as the pipeline
* a context manager that will open and close the pipeline automatically

In [7]:
@coroutine
def save_csv(f_name):
    with open(f_name, 'w', newline='') as f:
        writer = csv.writer(f)
        while True:
            row = yield
            writer.writerow(row)

In [84]:
@coroutine
def filter_data(filter_pred, target):
    while True:
        row = yield
        if filter_pred(row):
            target.send(row)

Now we need to create our custom pipeline that will create the various filter coroutines, as well as the `save_csv` coroutine, and orchestrate the data flow:

In [89]:
@coroutine
def pipeline_coro(out_file, name_filters):
    save = save_csv(out_file)
    
    target = save
    for name_filter in name_filters:
        target = filter_data(lambda d, v=name_filter: v in d[0], target)
        # warning: the default argument v=name_filter binds the current value.
        # Without it, every lambda would close over the same free variable
        # name_filter and see only its last value - we have seen this before!
    while True:
        received = yield
        target.send(received)
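
To make the warning in that comment concrete, here is a small sketch of the shared free variable problem, independent of the pipeline. The helper names `make_preds_late` and `make_preds_bound` are made up for this illustration.

```python
names = ('Chevrolet', 'Carlo', 'Landau')

def make_preds_late(names):
    preds = []
    for name in names:
        # every lambda looks up `name` when it is called,
        # by which time the loop has finished and name == 'Landau'
        preds.append(lambda d: name in d[0])
    return preds

def make_preds_bound(names):
    preds = []
    for name in names:
        # the default argument captures the value of `name` right now
        preds.append(lambda d, v=name: v in d[0])
    return preds

row = ['Chevrolet Impala', '14.0']
late = [p(row) for p in make_preds_late(names)]
bound = [p(row) for p in make_preds_bound(names)]

print(late)   # [False, False, False] - all three test for 'Landau'
print(bound)  # [True, False, False]
```

In the pipeline above, leaving out `v=name_filter` would mean every filter tests for the last name in `name_filters`, so the other filters would silently have no effect.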

Next, we are going to create a context manager to automatically close the pipeline when we are done with it:

In [86]:
from contextlib import contextmanager

@contextmanager
def pipeline(out_file, name_filters):
    p = pipeline_coro(out_file, name_filters)
    try:
        yield p
    finally:
        p.close()

And now we can start using the pipeline:

In [87]:
with pipeline('out.csv', ('Chevrolet', 'Landau', 'Carlo')) as p:
    for row in parse_data('cars.csv'):
        p.send(row)

And finally let's make sure the data was written out correctly:

In [88]:
with open('out.csv') as f:
    for row in f:
        print(row, end='')

Chevrolet Monte Carlo Landau,15.5,8,350.0,170.0,4165.,11.4,77,US
Chevrolet Monte Carlo Landau,19.2,8,305.0,145.0,3425.,13.2,78,US


Perfect!