# Generators

## for loops in Python are very flexible

Classic loop

In [63]:
for i in range(4):
    print(i)

0
1
2
3


Loop over any list

In [73]:
for word in ['Cheese', 'Sausage', 'Bread']:
    print(word)

Cheese
Sausage
Bread


Loop over a dictionary

In [74]:
italian_to_english = {
    'ciao': 'hi',
    'sottaceti': 'pickles',
    'pizza': 'pizza',
}

for italian, english in italian_to_english.items():
    print(italian, '->', english)

sottaceti -> pickles
ciao -> hi
pizza -> pizza


Loop over a text file

In [72]:
with open('consumption_201710.csv') as f:
    for line in f:
        print(line, end='')

USER_ID,TV_201710_M,TV_201710_A,TV_201710_N,VOD_201710_M,VOD_201710_A,VOD_201710_N
0,1690,2515,4285,892,953,2805
1,1243,1952,3105,1240,1259,3711
2,1203,1797,2910,200,162,538
3,312,461,787,1256,1261,3663
4,242,379,609,215,226,729
5,279,394,677,97,106,307
6,105,123,222,911,916,2703
7,468,677,1015,233,272,695
8,1702,2607,4301,238,212,660
9,547,774,1409,183,228,549
10,926,1362,2302,668,690,1972
11,316,433,721,1215,1208,3517
12,1483,2049,3428,638,680,1862
13,306,487,887,1397,1487,4256
14,462,679,1049,138,148,464
15,823,1216,2011,740,689,2201
16,1914,2761,4415,375,377,1228
17,217,373,610,1333,1250,3867
18,798,1116,1866,493,498,1589
19,2012,3080,4808,1024,1004,2922
20,1494,2299,3917,895,985,2620
21,1026,1562,2602,1375,1350,4055
22,641,931,1518,416,454,1320
23,767,1190,1853,960,947,2863
24,808,1219,1993,605,599,1826
25,782,1124,1874,232,269,759
26,2040,2971,5039,1341,1308,3921
27,845,1324,2151,1101,1133,3226
28,1668,2586,4146,479,478,1533
29,633,883,1514,1026,1071,3243
30,1272,1999,3119,415,43

Libraries often define their own "iterators"

In [6]:
import pandas as pd

world_pop = pd.DataFrame(
    columns=['Country', '2000', '2015', '2030'], 
    data=[['China', 1270, 1376, 1416],
          ['India', 1053, 1311, 1528],
          ['United States', 283, 322, 356],
          ['Indonesia', 212, 258, 295]],
)
world_pop

         Country  2000  2015  2030
0          China  1270  1376  1416
1          India  1053  1311  1528
2  United States   283   322   356
3      Indonesia   212   258   295


In [11]:
for column_name in world_pop:
    print(column_name)

Country
2000
2015
2030


In [89]:
for idx, row in world_pop.iterrows():
    print('{0:-^7}'.format(idx))
    print(row)

---0---
Country    China
2000        1270
2015        1376
2030        1416
Name: 0, dtype: object
---1---
Country    India
2000        1053
2015        1311
2030        1528
Name: 1, dtype: object
---2---
Country    United States
2000                 283
2015                 322
2030                 356
Name: 2, dtype: object
---3---
Country    Indonesia
2000             212
2015             258
2030             295
Name: 3, dtype: object
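
All of the objects above work in a `for` loop because they support Python's iterator protocol: `iter()` returns an iterator, and `next()` fetches items from it until `StopIteration` is raised. A minimal sketch of roughly what a `for` loop does behind the scenes (the variable names are only illustrative):

```python
words = ['Cheese', 'Sausage', 'Bread']

# `iter` asks the object for an iterator ...
iterator = iter(words)
seen = []
while True:
    try:
        # ... and `next` fetches one item at a time
        word = next(iterator)
    except StopIteration:
        # Raised when the iterator is exhausted: the loop ends here
        break
    seen.append(word)

print(seen)
```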


## Defining your own for-loop thingy: generators

### First contact

Iterate over the first `n` odd numbers:

In [96]:
def odd_numbers(n):
    """ Generator for the first `n` odd numbers. """
    for i in range(n):
        # Use `yield` instead of `return`: the next iteration resumes from here
        yield i * 2 + 1

for i in odd_numbers(5):
    print(i)

1
3
5
7
9
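
To see the "pause and resume" behaviour of `yield` more directly, one can drive the generator by hand with the built-in `next`. This is only a sketch that repeats the `odd_numbers` definition so that it runs on its own:

```python
def odd_numbers(n):
    """ Generator for the first `n` odd numbers. """
    for i in range(n):
        yield i * 2 + 1

gen = odd_numbers(3)   # nothing runs yet: this only creates the generator
first = next(gen)      # runs the body until the first `yield`
second = next(gen)     # resumes right after that `yield`, up to the next one
rest = list(gen)       # consumes whatever is left

print(first, second, rest)
```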


Second example: the first `n` numbers not divisible by `divisor`

In [97]:
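def not_divisibles(n, divisor):
    """ Generator for the first `n` non-negative integers not divisible by `divisor`.

    Note: with `divisor=1` nothing is ever yielded and the loop never ends.
    """
    current = 0
    while n > 0:
        if current % divisor != 0:
            yield current
            n -= 1
        current += 1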

In [101]:
for x in not_divisibles(7, 3):
    print(x)

1
2
4
5
7
8
10


Generated content does not need to be deterministic, or finite!

In [41]:
import numpy as np

def generate_n_random_numbers(n):
    for i in range(n):
        yield np.random.uniform()

for x in generate_n_random_numbers(5):
    print(x)

0.4619555775621401
0.12155224601900061
0.4372532548202044
0.38146244862334666
0.37110685251188813
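
The example above is random but still finite. As a sketch of the "not finite" case, here is a generator that never stops on its own; the caller decides how many items to take, for instance with `itertools.islice`. The name `all_odd_numbers` is made up for this illustration:

```python
from itertools import islice

def all_odd_numbers():
    """ Generator for every odd number: it never ends on its own. """
    current = 1
    while True:
        yield current
        current += 2

# Take only the first five items; without `islice` (or a `break`),
# looping over this generator would run forever
first_five = list(islice(all_odd_numbers(), 5))
print(first_five)
```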


What do you suggest to make the random number generation repeatable?

In [48]:
import numpy as np

def generate_n_random_numbers(n, random_state=np.random):
    for i in range(n):
        yield random_state.uniform()

random_state = np.random.RandomState(99393)
for x in generate_n_random_numbers(5, random_state=random_state):
    print(x)

0.8185248652607853
0.4339316753475848
0.42732389125630066
0.16523422841675972
0.1816074798396904
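
A quick way to check that a seeded generator is repeatable is to run it twice with the same seed and compare the results. The sketch below swaps NumPy's `RandomState` for the standard library's `random.Random` so that it runs without extra dependencies; the idea is the same:

```python
import random

def generate_n_random_numbers(n, rng):
    """ Yield `n` floats in [0, 1) drawn from the given random generator. """
    for _ in range(n):
        yield rng.random()

# Two independent random generators created with the same seed ...
run1 = list(generate_n_random_numbers(5, random.Random(99393)))
run2 = list(generate_n_random_numbers(5, random.Random(99393)))

# ... produce exactly the same sequence
print(run1 == run2)
```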


### Exercise

Write a generator called `all_pairs` that yields all pairs of items from a list.

E.g. `all_pairs(['A', 'B', 'C'])` will yield three sets `{'A', 'B'}`, `{'A', 'C'}`, `{'B', 'C'}` (not necessarily in this order).

In [107]:
def all_pairs(lst):
    len_lst = len(lst)
    for idx1 in range(len_lst):
        for idx2 in range(idx1 + 1, len_lst):
            yield {lst[idx1], lst[idx2]}


for pair in all_pairs(['A', 'B', 'C']):
    print(pair)

{'A', 'B'}
{'C', 'A'}
{'C', 'B'}
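
One thing to keep in mind with the solution above: a set collapses equal items, so a list with repeated values would yield a one-element "pair". A possible variant, sketched here under the assumption that tuples are an acceptable output, delegates to `itertools.combinations`; the name `all_pairs_as_tuples` is made up for this example:

```python
from itertools import combinations

def all_pairs_as_tuples(lst):
    """ Yield every pair of positions in `lst` as a tuple. """
    # `combinations` pairs items by position, so repeated values are kept
    yield from combinations(lst, 2)

pairs = list(all_pairs_as_tuples(['A', 'B', 'A']))
print(pairs)
```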


### Exercise

Write a generator called `without_punctuation` that iterates over a list of strings and removes punctuation characters at the end of each string. If a string is empty after the removal, it is skipped.

For instance, `without_punctuation(['Apple', 'Banana...', 'Carrot!!', '*$', '!Dinosaur'])` would yield `Apple`, `Banana`, `Carrot`, and `!Dinosaur`.

(see the `.rstrip` method of strings, and the constant `punctuation` in the module `string`)

In [106]:
from string import punctuation

def without_punctuation(words):
    for word in words:
        stripped = word.rstrip(punctuation)
        if stripped:
            yield stripped

words = ['Apple', 'Banana...', 'Carrot!!', '*$', '!Dinosaur']
for word in without_punctuation(words):
    print(word)

Apple
Banana
Carrot
!Dinosaur


The pattern in the exercise above is the one that I find most common in my code: factoring common filtering, clean-up, and transformation steps out of `for` loops and into generators.

It comes up all the time when processing messy data.
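
As a small sketch of the same pattern on messy numeric data: the clean-up rules live in one generator, and the loop (here hidden inside `sum`) only sees usable values. The function name and the sample data are invented for illustration:

```python
def clean_numbers(raw_values):
    """ Yield floats parsed from messy strings, skipping unusable entries. """
    for raw in raw_values:
        # Remove surrounding whitespace and accept a decimal comma
        text = raw.strip().replace(',', '.')
        if not text:
            # Empty entry: skip it
            continue
        try:
            yield float(text)
        except ValueError:
            # Not a number (e.g. 'n/a'): skip it
            continue

raw_values = [' 1.5', '2,25', '', 'n/a', '4 ']
total = sum(clean_numbers(raw_values))
print(total)
```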

## Real-life example: ETL workflow for PayTV data

Switch to the other notebook

## Generators can be chained

In [141]:
def readfiles(filenames):
    """ Generator that yields all lines from multiple files. """
    for filename in filenames:
        # Use `with` so that each file is closed once it has been read
        with open(filename) as f:
            for line in f:
                yield line

def grep(lines, pattern):
    """ Generator that yields all lines that contain a certain string. """
    for line in lines:
        if pattern in line:
            yield line


def number_lines(lines):
    """ Format each line in a pretty string. """
    for idx, line in enumerate(lines):
        yield '{} - "{}"'.format(idx, line.strip())


filenames = ['first_commented_data.csv', 'second_commented_data.csv', 'third_commented_data.csv']

for line in number_lines(grep(readfiles(filenames), pattern='REM')):
    print(line)


0 - "REM Skip me"
1 - "REM Do not bother"
2 - "REM Skip me"
3 - "REM Ignore this line"
4 - "REM Ignore this line"
5 - "REM Skip me"
6 - "REM Do not bother"
7 - "REM Skip me"
8 - "REM Ignore this line"
9 - "REM Do not bother"
10 - "REM Skip me"
11 - "REM Do not bother"
12 - "REM Do not bother"
13 - "REM Skip me"
14 - "REM Do not bother"
15 - "REM Skip me"
16 - "REM Skip me"
17 - "REM Do not bother"
18 - "REM Ignore this line"
19 - "REM Skip me"
20 - "REM Ignore this line"
21 - "REM Do not bother"
22 - "REM Skip me"
23 - "REM Ignore this line"
24 - "REM Skip me"
25 - "REM Skip me"
26 - "REM Ignore this line"
27 - "REM Skip me"
28 - "REM Do not bother"
29 - "REM Ignore this line"
30 - "REM Skip me"
31 - "REM Do not bother"
32 - "REM Do not bother"
33 - "REM Ignore this line"
34 - "REM Skip me"
35 - "REM Do not bother"
36 - "REM Do not bother"
37 - "REM Skip me"
38 - "REM Do not bother"
39 - "REM Ignore this line"
40 - "REM Ignore this line"
41 - "REM Ignore this line"
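
A useful property of chained generators is that they are lazy: each item travels through the whole chain before the next one is produced, so no intermediate list is built. The sketch below makes this visible by recording the order of operations in a list; the names `numbers`, `doubled`, and `log` are made up for this example:

```python
log = []

def numbers():
    """ Yield 1, 2, 3 and record when each value is produced. """
    for i in range(1, 4):
        log.append('produce {}'.format(i))
        yield i

def doubled(values):
    """ Yield each value multiplied by two and record when it is processed. """
    for value in values:
        log.append('double {}'.format(value))
        yield value * 2

results = list(doubled(numbers()))
print(results)

# The log alternates between the two generators: produce, double, produce, ...
print(log)
```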


### Exercise

The code below parses three CSV files containing comment lines that start with the prefix `'# '`, `'-- '`, or `'REM '`.

**Get rid of the smell!**

In [135]:
# Script that computes the sum of all the columns in 3 CSV files that contain commented lines

comment_prefixes = ['# ', '-- ', 'REM ']

filename1 = 'first_commented_data.csv'
print('Load data from', filename1)
with open(filename1, 'rt') as f:
    valid_lines = []
    for line in f:
        for prefix in comment_prefixes:
            if line.startswith(prefix):
                break
        else:
            data = [int(x) for x in line.split(',')]
            valid_lines.append(data)

data1 = pd.DataFrame(valid_lines, columns=['unci', 'dunci', 'trinci', 'quari'])


filename2 = 'second_commented_data.csv'
print('Load data from', filename2)
with open(filename2, 'rt') as f:
    valid_lines = []
    for line in f:
        for prefix in comment_prefixes:
            if line.startswith(prefix):
                break
        else:
            data = [int(x) for x in line.split(',')]
            valid_lines.append(data)

data2 = pd.DataFrame(valid_lines, columns=['unci', 'dunci', 'trinci', 'quari'])


filename3 = 'third_commented_data.csv'
print('Load data from', filename3)
with open(filename3, 'rt') as f:
    valid_lines = []
    for line in f:
        for prefix in comment_prefixes:
            if line.startswith(prefix):
                break
        else:
            data = [int(x) for x in line.split(',')]
            valid_lines.append(data)

data3 = pd.DataFrame(valid_lines, columns=['unci', 'dunci', 'trinci', 'quari'])

print(data1.sum() + data2.sum() + data3.sum())

Load data from first_commented_data.csv
Load data from second_commented_data.csv
Load data from third_commented_data.csv
unci      254425
dunci     244622
trinci    233027
quari     245013
dtype: int64


Solution 1

In [138]:
CSV_COMMENT_PREFIXES = ['# ', '-- ', 'REM ']

def lines_without_comments(filename, comment_prefixes=CSV_COMMENT_PREFIXES):
    with open(filename, 'rt') as f:
        for line in f:
            for prefix in comment_prefixes:
                if line.startswith(prefix):
                    break
            else:
                data = [int(x) for x in line.split(',')]
                yield data

filenames = ['first_commented_data.csv', 'second_commented_data.csv', 'third_commented_data.csv']
data = []
for filename in filenames:
    print('Load data from', filename)
    data_chunk = pd.DataFrame(
        lines_without_comments(filename),
        columns=['unci', 'dunci', 'trinci', 'quari'],
    )
    data.append(data_chunk)

data = pd.concat(data)
print(data.sum())

Load data from first_commented_data.csv
Load data from second_commented_data.csv
Load data from third_commented_data.csv
unci      254425
dunci     244622
trinci    233027
quari     245013
dtype: int64


Solution 2

In [137]:
CSV_COMMENT_PREFIXES = ['# ', '-- ', 'REM ']


def readfiles(filenames):
    """ Generator that yields all lines from multiple files. """
    for filename in filenames:
        # Use `with` so that each file is closed once it has been read
        with open(filename, 'rt') as f:
            for line in f:
                yield line


def filter_comments(lines, comment_prefixes=CSV_COMMENT_PREFIXES):
    """ Generator that yields all lines that do not start with comment prefixes. """
    for line in lines:
        for prefix in comment_prefixes:
            if line.startswith(prefix):
                break
        else:
            yield line


def parse_data(lines):
    """ Generator that parses each line as a list of integers. """
    for line in lines:
        yield [int(x) for x in line.split(',')]

        
filenames = ['first_commented_data.csv', 'second_commented_data.csv', 'third_commented_data.csv']
data = pd.DataFrame(
    parse_data(filter_comments(readfiles(filenames))),
    columns=['unci', 'dunci', 'trinci', 'quari'],
)
print(data.sum())

unci      254425
dunci     244622
trinci    233027
quari     245013
dtype: int64
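
As a side note, the `for`/`else` construct in `filter_comments` can be shortened, because `str.startswith` also accepts a tuple of prefixes. A sketch of that variant, tested here on a few in-memory lines invented for the example rather than on the CSV files:

```python
CSV_COMMENT_PREFIXES = ('# ', '-- ', 'REM ')

def filter_comments(lines, comment_prefixes=CSV_COMMENT_PREFIXES):
    """ Generator that yields all lines that do not start with comment prefixes. """
    # `startswith` needs a tuple, so convert in case a list was passed
    prefixes = tuple(comment_prefixes)
    for line in lines:
        if not line.startswith(prefixes):
            yield line

lines = ['# header\n', '1,2,3,4\n', 'REM Skip me\n', '5,6,7,8\n', '-- note\n']
kept = list(filter_comments(lines))
print(kept)
```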


## itertools

A tour of the content of `itertools`
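
To give a first taste before the examples below, here is a short sketch of three commonly used building blocks, `count`, `islice`, and `chain`. The variable names are only illustrative:

```python
from itertools import chain, count, islice

# `count` is an infinite counter; `islice` takes a finite slice of any iterable
evens = list(islice(count(start=0, step=2), 4))

# `chain` glues several iterables together, one after the other
glued = list(chain('ab', [1, 2], range(2)))

print(evens)
print(glued)
```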

A typical case that shows up in my code: going through combinations of experimental conditions.

In [148]:
from itertools import product

concentrations = [1, 10, 100]
times = [60, 120, 180]
applications = [1, 2, 3]
for idx, (concentration, time, application) in enumerate(product(concentrations, times, applications)):
    print('Run experiment #{}'.format(idx))
    print('Concentration', concentration)
    print('Time', time)
    print('Applications', application)
    print()

Run experiment #0
Concentration 1
Time 60
Applications 1

Run experiment #1
Concentration 1
Time 60
Applications 2

Run experiment #2
Concentration 1
Time 60
Applications 3

Run experiment #3
Concentration 1
Time 120
Applications 1

Run experiment #4
Concentration 1
Time 120
Applications 2

Run experiment #5
Concentration 1
Time 120
Applications 3

Run experiment #6
Concentration 1
Time 180
Applications 1

Run experiment #7
Concentration 1
Time 180
Applications 2

Run experiment #8
Concentration 1
Time 180
Applications 3

Run experiment #9
Concentration 10
Time 60
Applications 1

Run experiment #10
Concentration 10
Time 60
Applications 2

Run experiment #11
Concentration 10
Time 60
Applications 3

Run experiment #12
Concentration 10
Time 120
Applications 1

Run experiment #13
Concentration 10
Time 120
Applications 2

Run experiment #14
Concentration 10
Time 120
Applications 3

Run experiment #15
Concentration 10
Time 180
Applications 1

Run experiment #16
Concentration 10
Time 180
Appl

Another common case is computing a statistic on all pairs of variables.

In [150]:
df = pd.DataFrame(
    data = [[1, 0.1, 32],
            [4, 0.3, 11],
            [8, 0.9, 1],
            [12, 0.12, -4]],
    columns=['unci', 'dunci', 'trinci']
)
df

   unci  dunci  trinci
0     1   0.10      32
1     4   0.30      11
2     8   0.90       1
3    12   0.12      -4


In [151]:
# Without itertools

n_cols = df.shape[1]
for idx1 in range(n_cols):
    for idx2 in range(idx1 + 1, n_cols):
        # Sum of element-wise products (a dot product, not a normalized correlation)
        dot = (df.iloc[:, idx1] * df.iloc[:, idx2]).sum()
        print(df.columns[idx1], df.columns[idx2], dot)

unci dunci 9.94
unci trinci 36
dunci trinci 6.92


In [154]:
# With itertools
from itertools import combinations

for col1, col2 in combinations(df.columns, 2):
    # Sum of element-wise products (a dot product, not a normalized correlation)
    dot = (df.loc[:, col1] * df.loc[:, col2]).sum()
    print(col1, col2, dot)

unci dunci 9.94
unci trinci 36
dunci trinci 6.92


### Exercise

Write a generator that deals cards at random from a deck of cards.
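
One possible solution, as a sketch only: it assumes a standard 52-card deck and deals without replacement. The names `RANKS`, `SUITS`, and `deal_cards` are illustrative, and the `rng` argument follows the same idea as `random_state` earlier, so that a seeded `random.Random` makes the deal repeatable.

```python
import random
from itertools import product

RANKS = ['A', '2', '3', '4', '5', '6', '7', '8', '9', '10', 'J', 'Q', 'K']
SUITS = ['hearts', 'diamonds', 'clubs', 'spades']

def deal_cards(rng=random):
    """ Yield every card of the deck once, in random order. """
    # Build the full deck as (rank, suit) tuples, then shuffle a private copy
    deck = list(product(RANKS, SUITS))
    rng.shuffle(deck)
    yield from deck

# Deal a hand of five cards from a seeded, repeatable shuffle
dealer = deal_cards(random.Random(0))
hand = [next(dealer) for _ in range(5)]
print(hand)
```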