# Generators

In [None]:
# Before starting, we run the "Generate data" notebook to make sure we have everything we need.
%run "./01. Generate data.ipynb"

## `for` loops in Python are very flexible

Classic loop

Loop over any list

Loop over a dictionary
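The three loop styles listed so far can be sketched as follows. The variable names and data are made up for illustration.

```python
# Classic loop: iterate over a range of integers.
squares = []
for i in range(4):
    squares.append(i ** 2)
print(squares)

# Loop over any list: the items are visited in order.
fruits = ['apple', 'banana', 'cherry']  # hypothetical example data
for fruit in fruits:
    print(fruit)

# Loop over a dictionary: .items() yields (key, value) pairs.
prices = {'apple': 0.5, 'banana': 0.25}  # hypothetical example data
for name, price in prices.items():
    print(name, price)
```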

Loop over a text file

In [None]:
filename = 'consumption_201710.csv'
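A minimal sketch of looping over a text file line by line. So that it runs on its own, it writes a small throwaway file first instead of relying on the generated CSV. The file content and the name `lines_seen` are assumptions made for the example.

```python
import os
import tempfile

# Write a tiny throwaway file so the example does not depend on external data.
with tempfile.NamedTemporaryFile('wt', suffix='.csv', delete=False) as tmp:
    tmp.write('day,kwh\n1,12\n2,15\n')
    tmp_path = tmp.name

# A file object is itself iterable: each iteration reads one more line.
lines_seen = []
with open(tmp_path, 'rt') as f:
    for line in f:
        lines_seen.append(line.rstrip('\n'))
        print(line.rstrip('\n'))

os.remove(tmp_path)  # clean up the temporary file
```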

Libraries often define their own "iterators"

In [None]:
import pandas as pd

world_pop = pd.DataFrame(
    columns=['Country', '2000', '2015', '2030'], 
    data=[['China', 1270, 1376, 1416],
          ['India', 1053, 1311, 1528],
          ['United States', 283, 322, 356],
          ['Indonesia', 212, 258, 295]],
)
world_pop

**IMPORTANT CONCEPT: these iterators do not create a list in memory over which `for` iterates!**
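One way to convince yourself of this, using only built-ins: a `range` object covering a trillion numbers takes up less memory than a list of a thousand, because its items are produced one at a time during iteration.

```python
import sys

big_range = range(10 ** 12)        # a trillion items, but no list is built
small_list = list(range(1000))     # a real list holding 1000 items

print(sys.getsizeof(big_range))    # small, and independent of the length
print(sys.getsizeof(small_list))   # grows with the number of items

# Items are produced on demand, so stopping early costs nothing.
first_three = []
for i in big_range:
    if i >= 3:
        break
    first_three.append(i)
print(first_three)
```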

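## Defining your own for-loop thingy: generators

Generators are written like functions, but instead of returning a single value they `yield` a sequence of values, one for each iteration of a `for` loop.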

### First contact

Iterate over the first `n` odd numbers

In [None]:
def odd_numbers(n):
    """ Generator for the first `n` odd numbers. """
    for i in range(n):
        yield 2 * i + 1

for i in odd_numbers(5):
    print(i)
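To see what a `for` loop does with a generator behind the scenes, the sketch below drives one by hand with `next()`. The `countdown` generator is a hypothetical example, not part of the workshop material.

```python
def countdown(start):
    """ Hypothetical example generator: counts down from `start` to 1. """
    while start > 0:
        yield start      # hand one value to the caller, then pause here
        start -= 1

gen = countdown(3)       # calling the function does not run its body yet
first = next(gen)        # runs the body up to the first `yield`
rest = list(gen)         # resumes and consumes whatever is left
print(first, rest)
```

Each call to `next()` resumes the function where it last paused, which is why the local variable `start` keeps its value between iterations.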

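Second example: first `n` numbers not divisible by `divisor`

In [None]:
def not_divisibles(n, divisor):
    """ Generator for the first `n` positive integers not divisible by `divisor`. """
    count = 0
    candidate = 1
    # Note: with divisor == 1 every number is divisible, so this never finishes.
    while count < n:
        if candidate % divisor != 0:
            yield candidate
            count += 1
        candidate += 1

In [None]:
for x in not_divisibles(7, 3):
    print(x)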

Generated content does not need to be deterministic, or finite! It could even be generated on the fly.

In [None]:
import numpy as np

def generate_n_random_numbers(n):
    for i in range(n):
        yield np.random.uniform()

for x in generate_n_random_numbers(5):
    print(x)
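The "not finite" case can be sketched with a `while True` loop. This version uses the standard-library `random` module instead of NumPy so that it runs on its own, and `itertools.islice` to take only the first few items. The name `random_numbers` is made up for the example.

```python
import random
from itertools import islice

def random_numbers():
    """ Endless stream of uniform random numbers in [0, 1). """
    while True:
        yield random.random()

# Without islice (or a `break`) this loop would never end.
for x in islice(random_numbers(), 5):
    print(x)
```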

### Hands-on: Your first generator

Write a generator called `even` that yields the even numbers from 0 up to `n`.

Expected:
```
for i in even(7):
    print(i)
```

outputs

```
0
2
4
6
```
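One possible solution, assuming that `n` itself is excluded (the expected output above does not settle this, since 7 is odd):

```python
def even(n):
    """ Generator for the even numbers from 0 up to, but not including, `n`. """
    for i in range(0, n, 2):
        yield i

for i in even(7):
    print(i)
```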

### Hands-on: Recognize the smell of generators

Get rid of the smell in the code below by defining a generator.

In [None]:
for i in range(9):
    if i % 3 == 0:
        continue
    print('Square is', i ** 2)

for j in range(5):
    if j % 2 == 0:
        continue
    print('Cube is', j ** 3)

for k in range(13):
    if k % 5 == 0:
        continue
    print('A' * k)
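One possible way to remove the repetition: the "skip the multiples" logic moves into a generator, and each loop keeps only the part that is specific to it. The name `skip_multiples` is an assumption made for this sketch.

```python
def skip_multiples(n, divisor):
    """ Generator for the numbers in range(n) that are not multiples of `divisor`. """
    for i in range(n):
        if i % divisor != 0:
            yield i

for i in skip_multiples(9, 3):
    print('Square is', i ** 2)

for j in skip_multiples(5, 2):
    print('Cube is', j ** 3)

for k in skip_multiples(13, 5):
    print('A' * k)
```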

### Hands-on: All pairs (skip)

We have a list of subjects whose individual performance needs to be compared in pairs.

Write a generator called `all_pairs` that yields all pairs of items from a list.

E.g. `all_pairs(['A', 'B', 'C'])` will yield three sets, `{'A', 'B'}`, `{'A', 'C'}`, and `{'B', 'C'}` (not necessarily in this order).

Suggestion: start by writing a solution for this task without generators, then transform the `for` loops into a generator.
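A possible solution, sketched with two nested loops: the inner loop only looks at the items that come after the current one, so each pair is produced exactly once. It assumes the items are hashable and distinct, since they are put into sets.

```python
def all_pairs(items):
    """ Generator for all unordered pairs of items in a list. """
    for idx, first in enumerate(items):
        # Only pair `first` with the items that come after it.
        for second in items[idx + 1:]:
            yield {first, second}

for pair in all_pairs(['A', 'B', 'C']):
    print(pair)
```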

### Hands-on: A common generators pattern

Write a generator called `without_punctuation` that iterates over a list of strings and removes punctuation characters from the end of each string. If a string is empty after the removal, it is skipped.

For instance, `without_punctuation(['Apple', 'Banana...', 'Carrot!!', '*$', '!Dinosaur'])` would yield `Apple`, `Banana`, `Carrot`, and `!Dinosaur`.

(see the `.rstrip` method of strings, and the constant `punctuation` in the module `string`)
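One way this could look, following the hint about `.rstrip` and `string.punctuation`:

```python
from string import punctuation

def without_punctuation(strings):
    """ Generator that strips trailing punctuation and skips empty results. """
    for s in strings:
        stripped = s.rstrip(punctuation)
        if stripped:          # an empty string is falsy, so it is skipped
            yield stripped

words = ['Apple', 'Banana...', 'Carrot!!', '*$', '!Dinosaur']
for word in without_punctuation(words):
    print(word)
```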

The pattern in the exercise above is the one I find most common in my own code: factoring the filtering, clean-up, and transformation steps that recur in `for` loops out into a generator.

It comes up all the time when processing messy data.

## Generators can be chained

In [None]:
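def readfiles(filenames):
    """ Generator that yields all lines from multiple files. """
    for filename in filenames:
        # `with` closes each file once all of its lines have been consumed
        with open(filename, 'rt') as f:
            for line in f:
                yield line


def filter_pattern(lines, pattern):
    """ Generator that yields all lines that contain a certain string. """
    for line in lines:
        if pattern in line:
            yield line


def pprint_with_line_numbers(lines):
    """ Format each line in a pretty string. """
    for idx, line in enumerate(lines):
        yield '{} - "{}"'.format(idx, line.strip())


filenames = ['first_commented_data.csv', 'second_commented_data.csv', 'third_commented_data.csv']

# Chain the generators: each stage pulls lines from the previous one on demand.
# The pattern 'REM' is only an example.
lines = readfiles(filenames)
matching = filter_pattern(lines, 'REM')
for pretty_line in pprint_with_line_numbers(matching):
    print(pretty_line)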


## Real-life example: ETL workflow for PayTV data

Switch to the other notebook

### Hands-on: Get rid of the smell!

The code below parses 3 CSV files containing comment lines that start with the prefix `'# '`, `'-- '`, or `'REM '`.

**Get rid of the smell!**

In [None]:
# Script that computes the sum of all the columns in 3 CSV files that contain commented lines

comment_prefixes = ['# ', '-- ', 'REM ']

filename1 = 'first_commented_data.csv'
print('Load data from', filename1)
with open(filename1, 'rt') as f:
    valid_lines = []
    for line in f:
        for prefix in comment_prefixes:
            if line.startswith(prefix):
                break
        else:
            data = [int(x) for x in line.split(',')]
            valid_lines.append(data)

data1 = pd.DataFrame(valid_lines, columns=['unci', 'dunci', 'trinci', 'quari'])


filename2 = 'second_commented_data.csv'
print('Load data from', filename2)
with open(filename2, 'rt') as f:
    valid_lines = []
    for line in f:
        for prefix in comment_prefixes:
            if line.startswith(prefix):
                break
        else:
            data = [int(x) for x in line.split(',')]
            valid_lines.append(data)

data2 = pd.DataFrame(valid_lines, columns=['unci', 'dunci', 'trinci', 'quari'])


filename3 = 'third_commented_data.csv'
print('Load data from', filename3)
with open(filename3, 'rt') as f:
    valid_lines = []
    for line in f:
        for prefix in comment_prefixes:
            if line.startswith(prefix):
                break
        else:
            data = [int(x) for x in line.split(',')]
            valid_lines.append(data)

data3 = pd.DataFrame(valid_lines, columns=['unci', 'dunci', 'trinci', 'quari'])

print(data1.sum() + data2.sum() + data3.sum())

Solution 1

In [None]:
CSV_COMMENT_PREFIXES = ['# ', '-- ', 'REM ']

def lines_without_comments(filename, comment_prefixes=CSV_COMMENT_PREFIXES):
    with open(filename, 'rt') as f:
        for line in f:
            for prefix in comment_prefixes:
                if line.startswith(prefix):
                    break
            else:
                data = [int(x) for x in line.split(',')]
                yield data

filenames = ['first_commented_data.csv', 'second_commented_data.csv', 'third_commented_data.csv']
data = []
for filename in filenames:
    print('Load data from', filename)
    data_chunk = pd.DataFrame(
        lines_without_comments(filename),
        columns=['unci', 'dunci', 'trinci', 'quari'],
    )
    data.append(data_chunk)

data = pd.concat(data)
print(data.sum())

Solution 2

In [None]:
CSV_COMMENT_PREFIXES = ['# ', '-- ', 'REM ']


def readfiles(filenames):
    """ Generator that yields all lines from multiple files. """
    for filename in filenames:
        # `with` closes each file once all of its lines have been consumed
        with open(filename, 'rt') as f:
            for line in f:
                yield line


def filter_comments(lines, comment_prefixes=CSV_COMMENT_PREFIXES):
    """ Generator that yields all lines that do not start with comment prefixes. """
    for line in lines:
        for prefix in comment_prefixes:
            if line.startswith(prefix):
                break
        else:
            yield line


def parse_data(lines):
    """ Generator that parses each line as a list of integers. """
    for line in lines:
        yield [int(x) for x in line.split(',')]

        
filenames = ['first_commented_data.csv', 'second_commented_data.csv', 'third_commented_data.csv']
data = pd.DataFrame(
    parse_data(filter_comments(readfiles(filenames))),
    columns=['unci', 'dunci', 'trinci', 'quari'],
)
print(data.sum())

## itertools (time permitting)

A tour of the content of `itertools`.

https://docs.python.org/3.6/library/itertools.html
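As a small taste before the examples that follow, here are three of the building blocks in `itertools`. This selection is just a sample, not the full tour.

```python
from itertools import chain, count, islice

# count() is an endless counter; islice() takes a finite slice of any iterator.
first_evens = list(islice(count(0, 2), 5))
print(first_evens)

# chain() glues several iterables together without building an intermediate list.
merged = list(chain('ab', [1, 2], range(2)))
print(merged)
```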


A typical case that shows up in my code: going through combinations of experimental conditions.

In [None]:
from itertools import product

concentrations = [1, 10, 100]
times = [60, 120, 180]
applications = [1, 2, 3]

for idx, (concentration, time, application) in enumerate(product(concentrations, times, applications)):
    print('Run experiment #{}'.format(idx))
    print('Concentration', concentration)
    print('Time', time)
    print('Applications', application)
    print()

Another common case is computing a statistic on all pairs of variables:

In [None]:
df = pd.DataFrame(
    data = [[1, 0.1, 32],
            [4, 0.3, 11],
            [8, 0.9, 1],
            [12, 0.12, -4]],
    columns=['unci', 'dunci', 'trinci']
)
df

In [None]:
# Without itertools

n_cols = df.shape[1]
for idx1 in range(n_cols):
    for idx2 in range(idx1 + 1, n_cols):
        corr = (df.iloc[:, idx1] * df.iloc[:, idx2]).sum()
        print(df.columns[idx1], df.columns[idx2], corr)

In [None]:
# With itertools
from itertools import combinations

for col1, col2 in combinations(df.columns, 2):
    corr = (df.loc[:, col1] * df.loc[:, col2]).sum()
    print(col1, col2, corr)

### Hands-on

Write a generator that deals cards at random from a deck of cards.
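A possible sketch, assuming a standard 52-card deck and that each card is dealt at most once. The names `RANKS`, `SUITS`, `deal_cards` and the optional `seed` parameter are choices made for this example.

```python
import random
from itertools import product

RANKS = ['2', '3', '4', '5', '6', '7', '8', '9', '10', 'J', 'Q', 'K', 'A']
SUITS = ['clubs', 'diamonds', 'hearts', 'spades']

def deal_cards(seed=None):
    """ Generator that deals every card of a 52-card deck once, in random order. """
    rng = random.Random(seed)            # a seed makes the shuffle reproducible
    deck = list(product(RANKS, SUITS))   # all (rank, suit) combinations
    rng.shuffle(deck)
    for card in deck:
        yield card

dealer = deal_cards(seed=42)
hand = [next(dealer) for _ in range(5)]  # deal a five-card hand
print(hand)
```

Because the generator remembers its position, asking `dealer` for more cards later continues with the rest of the same shuffled deck.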