# Generators

In [None]:
# Before starting, we run the "Generate data" notebook to make sure we have everything we need.
%run "./01. Generate data.ipynb"

## `for` loops in Python are very flexible

Classic loop

In [None]:
for i in range(4):
    print(i)

Loop over any list

In [None]:
for word in ['Cheese', 'Sausage', 'Bread']:
    print(word)

Loop over a dictionary

In [None]:
italian_to_english = {
    'ciao': 'hi',
    'sottaceti': 'pickles',
    'pizza': 'pizza',
}

for italian, english in italian_to_english.items():
    print(italian, '->', english)

Loop over a text file

In [None]:
with open('consumption_201710.csv') as f:
    for line in f:
        print(line, end='')

Libraries often define their own "iterators"

In [None]:
import pandas as pd

world_pop = pd.DataFrame(
    columns=['Country', '2000', '2015', '2030'],
    data=[['China', 1270, 1376, 1416],
          ['India', 1053, 1311, 1528],
          ['United States', 283, 322, 356],
          ['Indonesia', 212, 258, 295]],
)
world_pop

In [None]:
for column_name in world_pop:
    print(column_name)

In [None]:
for idx, row in world_pop.iterrows():
    print('{0:-^7}'.format(idx))
    print(row)

**IMPORTANT CONCEPT: iterators such as `range`, file objects, and `iterrows()` produce their items one at a time; they do not first build a list in memory for `for` to walk over!**
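
One rough way to see this is to compare the size of a `range` object with the size of the equivalent list. This is only a sketch: the exact byte counts depend on the Python version and platform, so none are shown here.

```python
import sys

n = 10 ** 6

lazy = range(n)          # produces numbers on demand
eager = list(range(n))   # stores a reference to every number in memory

# getsizeof counts only the container itself (for the list: its array of
# references), not the integer objects, so the real gap is even larger.
lazy_size = sys.getsizeof(lazy)
eager_size = sys.getsizeof(eager)

print('range object:', lazy_size, 'bytes')
print('list        :', eager_size, 'bytes')
```

The `range` object stays the same size no matter how large `n` gets, while the list grows with `n`.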

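## Defining your own for-loop thingy: generators

Generators are written like functions, but instead of returning a single value they produce a sequence of values, one at a time, for a `for` loop to consume.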

### First contact

Iterate over the first `n` odd numbers

In [None]:
def odd_numbers(n):
    """ Generator for the first `n` odd numbers. """
    for i in range(n):
        # Use `yield` instead of `return`: the function pauses here and
        # resumes from this point when the next value is requested
        yield i * 2 + 1

for i in odd_numbers(5):
    print(i)
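
To see the pause-and-resume behaviour without a `for` loop, one can drive the generator by hand with the built-in `next`. This is a minimal sketch that repeats the `odd_numbers` definition so it runs on its own; the variable names `gen`, `first`, `second`, `third`, and `exhausted` are only for illustration.

```python
def odd_numbers(n):
    """ Generator for the first `n` odd numbers. """
    for i in range(n):
        yield i * 2 + 1

gen = odd_numbers(3)   # nothing runs yet: calling only creates a generator object

first = next(gen)      # runs the body until the first `yield`
second = next(gen)     # resumes right after that `yield`
third = next(gen)

print(first, second, third)   # 1 3 5

# The generator is now exhausted: one more `next` raises StopIteration,
# which is the signal a `for` loop uses to know when to stop.
try:
    next(gen)
    exhausted = False
except StopIteration:
    exhausted = True

print('Exhausted:', exhausted)   # Exhausted: True
```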

Second example: first `n` numbers not divisible by `divisor`

In [None]:
def not_divisibles(n, divisor):
    """ Generator for the first `n` numbers not divisible by `divisor`. """
    current = 0
    while n > 0:
        if current % divisor != 0:
            yield current
            n -= 1
        current += 1

In [None]:
for x in not_divisibles(7, 3):
    print(x)

Generated content does not need to be deterministic, or even finite! Each value is computed on the fly, only when the loop asks for it.

In [None]:
import numpy as np

def generate_n_random_numbers(n):
    for i in range(n):
        yield np.random.uniform()

for x in generate_n_random_numbers(5):
    print(x)
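
The example above is random but still finite. As a sketch of the "not finite" case, here is a generator that never stops on its own; `all_odd_numbers` is a made-up name for illustration, and `itertools.islice` is used to take only the first few values so the loop terminates.

```python
from itertools import islice

def all_odd_numbers():
    """ Generator for every odd number: it never ends on its own. """
    current = 1
    while True:
        yield current
        current += 2

# islice stops asking for values after 5 items, so this finishes
first_five = list(islice(all_odd_numbers(), 5))
print(first_five)   # [1, 3, 5, 7, 9]
```

Calling `list(all_odd_numbers())` directly would never return, because the generator always has one more value to give.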

### Hands-on: Your first generator

Write a generator that generates even numbers between 0 and `n`.

Expected:
```
for i in even(7):
    print(i)
```

outputs

```
0
2
4
6
```

In [None]:
# Your code here

### Hands-on: Recognize the smell of generators

Submit a PR for Issue #1 on GitHub.

### Hands-on: A common generators pattern

Submit a PR for Issue #2 on GitHub.

## Generators can be chained

In [None]:
def readfiles(filenames):
    """ Generator that yields all lines from multiple files. """
    for filename in filenames:
        # `with` makes sure each file is closed when we are done with it
        with open(filename) as f:
            for line in f:
                yield line


def filter_pattern(lines, pattern):
    """ Generator that yields all lines that contain a certain string. """
    for line in lines:
        if pattern in line:
            yield line


def pprint_with_line_numbers(lines):
    """ Generator that yields each line as a pretty string, prefixed with its index. """
    for idx, line in enumerate(lines):
        yield '{} - "{}"'.format(idx, line.strip())


filenames = ['first_commented_data.csv', 'second_commented_data.csv', 'third_commented_data.csv']

for line in pprint_with_line_numbers(filter_pattern(readfiles(filenames), pattern='REM')):
    print(line)
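
The chain does not care where its lines come from: any iterable of strings can feed it. As a sketch that runs without the CSV files, the two downstream generators are repeated here and fed from a plain list; `fake_lines` is a hypothetical stand-in for the file contents, not real data from the workshop.

```python
def filter_pattern(lines, pattern):
    """ Generator that yields all lines that contain a certain string. """
    for line in lines:
        if pattern in line:
            yield line

def pprint_with_line_numbers(lines):
    """ Format each line in a pretty string. """
    for idx, line in enumerate(lines):
        yield '{} - "{}"'.format(idx, line.strip())

# Hypothetical stand-in for the lines of the commented CSV files
fake_lines = ['REM first comment\n', '1,2,3\n', 'REM second comment\n', '4,5,6\n']

result = list(pprint_with_line_numbers(filter_pattern(fake_lines, pattern='REM')))
for line in result:
    print(line)
```

Note that the index comes from `enumerate` applied after the filter, so it counts the lines that survived the filter, not their position in the original input.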


## Real-life example: ETL workflow for PayTV data

Switch to the other notebook

### Hands-on: Sum of CSV columns, get rid of the smell!

Submit a PR for Issue #3 on GitHub.

## itertools (time permitting)

A tour of the content of `itertools`.

https://docs.python.org/3.6/library/itertools.html
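
As a possible starting point for the tour, here is a small sketch of three functions from `itertools`: `count`, `islice`, and `chain`. The variable names are only for illustration.

```python
from itertools import chain, count, islice

# count: an endless counter, here starting at 10 in steps of 5
# islice: take a slice of any iterator without building a list first
first_counts = list(islice(count(10, 5), 4))
print(first_counts)   # [10, 15, 20, 25]

# chain: loop over several iterables as if they were one
merged = list(chain(['a', 'b'], range(2)))
print(merged)         # ['a', 'b', 0, 1]
```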


A typical case that shows up in my code: going through combinations of experimental conditions.

In [None]:
from itertools import product

concentrations = [1, 10, 100]
times = [60, 120, 180]
applications = [1, 2, 3]

for idx, (concentration, time, application) in enumerate(product(concentrations, times, applications)):
    print('Run experiment #{}'.format(idx))
    print('Concentration', concentration)
    print('Time', time)
    print('Applications', application)
    print()
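
`product` visits the same tuples, in the same order, as the equivalent nested `for` loops. A minimal sketch with two shortened condition lists (the values are cut down from the example above just to keep the output small):

```python
from itertools import product

concentrations = [1, 10]
times = [60, 120]

# The nested-loop version
nested = []
for concentration in concentrations:
    for time in times:
        nested.append((concentration, time))

# The same combinations from a single call
flat = list(product(concentrations, times))
print(flat)   # [(1, 60), (1, 120), (10, 60), (10, 120)]
```

The rightmost iterable varies fastest, just like the innermost loop.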

Another common case is computing a statistic for every pair of variables.

In [None]:
df = pd.DataFrame(
    data=[[1, 0.1, 32],
          [4, 0.3, 11],
          [8, 0.9, 1],
          [12, 0.12, -4]],
    columns=['unci', 'dunci', 'trinci']
)
df

In [None]:
# Without itertools

n_cols = df.shape[1]
for idx1 in range(n_cols):
    for idx2 in range(idx1 + 1, n_cols):
        # Sum of element-wise products: a simple stand-in, not a true correlation
        corr = (df.iloc[:, idx1] * df.iloc[:, idx2]).sum()
        print(df.columns[idx1], df.columns[idx2], corr)

In [None]:
# With itertools
from itertools import combinations

for col1, col2 in combinations(df.columns, 2):
    # Same statistic as in the previous cell, computed by column name
    corr = (df.loc[:, col1] * df.loc[:, col2]).sum()
    print(col1, col2, corr)

### Hands-on

Submit a PR for Issue #5 on GitHub.
