## First day: list comprehensions and generators

> List comprehensions and generators are in my top 5 favorite Python features leading to clean, robust and Pythonic code.

In [1]:
from collections import Counter
import calendar
import itertools
import random
import re
import string

import requests

### List comprehensions

Let's dive straight into a practical example. We all know how to use the classical for loop in Python. Say I want to loop through a bunch of names, title-casing each one:

In [2]:
names = 'pybites mike bob julian tim sara guido'.split()
names

['pybites', 'mike', 'bob', 'julian', 'tim', 'sara', 'guido']

In [3]:
for name in names:
    print(name.title())

Pybites
Mike
Bob
Julian
Tim
Sara
Guido


Then I want to only keep the names that start with A-M. The `string` module makes it easier (we love Python's standard library!):

In [4]:
first_half_alphabet = list(string.ascii_lowercase)[:13]
first_half_alphabet

['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm']

In [5]:
new_names = []
for name in names:
    if name[0] in first_half_alphabet:
        new_names.append(name.title())
new_names

['Mike', 'Bob', 'Julian', 'Guido']

Feels verbose, no?

If you don't know about list comprehensions, you might start using them everywhere after seeing the next refactoring:

In [6]:
new_names2 = [name.title() for name in names if name[0] in first_half_alphabet]
new_names2

['Mike', 'Bob', 'Julian', 'Guido']

In [7]:
assert new_names == new_names2

From 4 lines of code to 1, and it reads pretty well too. That's why we love and stick with Python!
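
To make the mapping explicit, here is a minimal sketch of how the three parts of a comprehension (expression, loop, optional filter) line up with the loop version. The `languages` list is made up purely for illustration.

```python
# Hypothetical data, only for illustration
languages = ['python', 'go', 'rust', 'c', 'haskell']

# Loop version: filter, transform, append
long_names = []
for lang in languages:
    if len(lang) > 2:
        long_names.append(lang.upper())

# Comprehension: [expression  for item in iterable  if condition]
long_names2 = [lang.upper() for lang in languages if len(lang) > 2]

print(long_names2)  # ['PYTHON', 'RUST', 'HASKELL']
```

The `if` part is optional; without it every item is transformed and kept.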

Here is another example I used recently to do a most common word count on Harry Potter. I used some list comprehensions to clean up the words before counting them:

In [10]:
resp = requests.get('http://projects.bobbelderbos.com/pcc/harry.txt')
words = resp.text.lower().split()
words[:5]

['the', 'boy', 'who', 'lived', 'mr.']

Hmm, we should not count stopwords. Also:

In [11]:
'-' in words

True

Let's first clean up any non-word characters (`\W` matches anything that is not a letter, digit or underscore):

In [13]:
words = [re.sub(r'\W+', r'', word) for word in words]

In [14]:
'-' in words

False

In [15]:
'the' in words

True

Ok, let's filter those stopwords out, plus the empty strings caused by the previous list comprehension:

In [17]:
resp = requests.get('http://projects.bobbelderbos.com/pcc/stopwords.txt')
stopwords = resp.text.lower().split()
stopwords[:5]

['a', 'about', 'above', 'across', 'after']

In [18]:
words = [word for word in words if word.strip() and word not in stopwords]
words[:5]

['boy', 'lived', 'mr', 'mrs', 'dursley']

In [19]:
'the' in words

False

Now it looks way better:

In [20]:
cnt = Counter(words)
cnt.most_common(5)

[('dursley', 45),
 ('dumbledore', 35),
 ('said', 32),
 ('mr', 30),
 ('professor', 30)]

What's interesting here is that the first bit of the list comprehension can be any expression, like the call to `re.sub`. The final bit can be a compound condition: here we checked for a non-empty word (`' '` -> `strip()` -> `''`, which is `False` in a boolean context in Python) `and` we checked `word not in stopwords`.

Again, a lot is going on in one line of code, but the beauty of it is that it is totally fine, because it reads like plain English :)
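
If you want to try the same two steps without downloading anything, here is a small offline sketch. The sample text and the stopword set are made up (they are not the Harry Potter data), and I use a `set` for the stopwords because membership tests on a set are faster than on a list.

```python
import re
from collections import Counter

# Made-up sample text and stopwords (assumptions, not the real data files)
text = "The boy who lived - Mr. Dursley and Mrs. Dursley said the boy lived"
stopwords = {'the', 'who', 'and'}

words = text.lower().split()

# Expression part: any expression works, here a call to re.sub
cleaned = [re.sub(r'\W+', '', word) for word in words]

# Condition part: two checks combined with `and`;
# the empty string left over from '-' is falsy and gets dropped
kept = [word for word in cleaned if word and word not in stopwords]

print(kept)
print(Counter(kept).most_common(3))
```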

### Generators

A generator is a function that returns an iterator. It generates values using the `yield` keyword each time `next()` is called on it (a for loop does this implicitly), and it raises a `StopIteration` exception when there are no more values to generate. Let's see what this means with a very simple example.

In [26]:
def num_gen():
    for i in range(5):
        yield i
gen = num_gen()

In [27]:
next(gen)

0

In [28]:
for i in gen:
    print(i)

1
2
3
4


In [29]:
# no more values to generate
next(gen)

StopIteration: 

> The `StopIteration` error appears because there are no more yield statements in the function. Calling next on the generator after this does not cause it to loop over and start again. - [Generators are Awesome, Learning by Example](https://pybit.es/generators.html)
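
A short sketch of what that means in practice: once a generator is exhausted it stays exhausted, `next()` accepts a default to avoid the exception, and starting over requires a fresh generator object. `num_gen` here is a shortened copy of the function above.

```python
def num_gen():
    for i in range(3):
        yield i

gen = num_gen()
print(list(gen))        # consumes every value: [0, 1, 2]

# The generator is now exhausted; a default avoids the StopIteration
print(next(gen, None))  # None

# To start over, create a fresh generator object
gen = num_gen()
print(next(gen))        # 0
```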

Since learning about generators, a common pattern I use them for is building up sequences:

In [30]:
options = 'red yellow blue white black green purple'.split()
options

['red', 'yellow', 'blue', 'white', 'black', 'green', 'purple']

In [31]:
def create_select_options(options=options):
    select_list = []
    
    for option in options:
        select_list.append(f'<option value={option}>{option.title()}</option>')
        
    return select_list

In [32]:
from pprint import pprint as pp
pp(create_select_options())

['<option value=red>Red</option>',
 '<option value=yellow>Yellow</option>',
 '<option value=blue>Blue</option>',
 '<option value=white>White</option>',
 '<option value=black>Black</option>',
 '<option value=green>Green</option>',
 '<option value=purple>Purple</option>']


Using a generator you can write the function body in 2 lines of code:

In [34]:
def create_select_options_gen(options=options):
    for option in options:
        yield f'<option value={option}>{option.title()}</option>'

In [35]:
print(create_select_options_gen())

<generator object create_select_options_gen at 0x109fef138>


Note that generators are lazy, so you need to explicitly consume them by iterating over them, for example with a for loop. Another way is to pass them into the `list()` constructor:

In [36]:
list(create_select_options_gen())

['<option value=red>Red</option>',
 '<option value=yellow>Yellow</option>',
 '<option value=blue>Blue</option>',
 '<option value=white>White</option>',
 '<option value=black>Black</option>',
 '<option value=green>Green</option>',
 '<option value=purple>Purple</option>']
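
To see the laziness in action, here is a small sketch with a visible side effect. The `shout` function is a made-up helper for illustration. It also shows a generator expression, which looks like a list comprehension with parentheses but produces its values lazily.

```python
def shout(words):
    for word in words:
        print(f'processing {word}')  # side effect to show when work happens
        yield word.upper()

gen = shout(['red', 'blue'])   # nothing is printed yet
first = next(gen)              # prints "processing red"
rest = list(gen)               # prints "processing blue"

# A generator expression is the lazy cousin of a list comprehension
squares = (n * n for n in range(5))
print(sum(squares))            # 0 + 1 + 4 + 9 + 16 = 30
print(sum(squares))            # 0, the generator is already exhausted
```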

Especially when working with large data sets you definitely want to use generators. Lists can only get as big as your memory allows. Generators are lazily evaluated, meaning that they produce one value at a time instead of holding the whole sequence in memory. Just for the sake of giving Python something to do, let's calculate leap years for a million years, and compare the performance of a list vs a generator:

In [37]:
# List
def leap_years_lst(n=1000000):
    leap_years = []
    for year in range(1, n+1):
        if calendar.isleap(year):
            leap_years.append(year)
    return leap_years

# Generator
def leap_years_gen(n=1000000):
    for year in range(1, n+1):
        if calendar.isleap(year):
            yield year

In [39]:
%timeit -n1 leap_years_lst()

323 ms ± 6.77 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [40]:
%timeit -n1 leap_years_gen()

863 ns ± 560 ns per loop (mean ± std. dev. of 7 runs, 1 loop each)


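That looks pretty impressive, but note what is being measured: calling `leap_years_gen()` only creates the generator object, and no leap years are calculated until you iterate over it. The real gain is in memory, because the generator never holds the full sequence at once. This is an important concept to know about because Big Data is here to stay!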
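
For a fuller picture, here is a rough sketch that times a complete pass over the generator and compares container sizes with `sys.getsizeof`. The exact numbers depend on your machine and Python version, so none are shown; `leap_years_gen` is repeated so the snippet runs on its own.

```python
import calendar
import sys
import timeit

def leap_years_gen(n=1000000):
    for year in range(1, n + 1):
        if calendar.isleap(year):
            yield year

# Creating the generator runs none of the function body yet
gen = leap_years_gen()

# The work happens on consumption; time one full pass over a fresh generator
seconds = timeit.timeit(lambda: sum(1 for _ in leap_years_gen()), number=1)
print(f'consuming the generator took {seconds:.3f}s')

# Memory is where the generator wins
as_list = list(leap_years_gen())
print(len(as_list))            # number of leap years found
print(sys.getsizeof(as_list))  # list container only, grows with the item count
print(sys.getsizeof(gen))      # small, independent of n
```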

## Second day: practice

Look at your code and see if you can refactor it to use list comprehensions. Same for generators. Are you building up a list somewhere where you could potentially use a generator?

And/or do the exercise here. Take this list of names:

In [41]:
NAMES = [
'arnold schwarzenegger', 'alec baldwin', 'bob belderbos',
'julian sequeira', 'sandra bullock', 'keanu reeves',
'julbob pybites', 'bob belderbos', 'julian sequeira',
'al pacino', 'brad pitt', 'matt damon', 'brad pitt'
]

In [45]:
def names_title_gen():
    for name in NAMES:
        yield name.title()
        
list(names_title_gen())

['Arnold Schwarzenegger',
 'Alec Baldwin',
 'Bob Belderbos',
 'Julian Sequeira',
 'Sandra Bullock',
 'Keanu Reeves',
 'Julbob Pybites',
 'Bob Belderbos',
 'Julian Sequeira',
 'Al Pacino',
 'Brad Pitt',
 'Matt Damon',
 'Brad Pitt']

In [60]:
def names_reverse_gen():
    for name in NAMES:
        new_name = name.split()[-1] + ' ' + name.split()[0]
        yield new_name

list(names_reverse_gen())

['schwarzenegger arnold',
 'baldwin alec',
 'belderbos bob',
 'sequeira julian',
 'bullock sandra',
 'reeves keanu',
 'pybites julbob',
 'belderbos bob',
 'sequeira julian',
 'pacino al',
 'pitt brad',
 'damon matt',
 'pitt brad']

In [61]:
def names_reverse_lst():
    new_names = []
    for name in NAMES:
        new_name = name.split()[-1] + ' ' + name.split()[0]
        new_names.append(new_name)
    return new_names

newn = names_reverse_lst()
newn    

['schwarzenegger arnold',
 'baldwin alec',
 'belderbos bob',
 'sequeira julian',
 'bullock sandra',
 'reeves keanu',
 'pybites julbob',
 'belderbos bob',
 'sequeira julian',
 'pacino al',
 'pitt brad',
 'damon matt',
 'pitt brad']

Then use this same list and make a little generator, for example one that randomly returns a pair of names. Try to make this work:

~~~
pairs = gen_pairs()
for _ in range(10):
    print(next(pairs))
~~~

This should print something like the following (values will vary because they are random):

~~~
Arnold teams up with Brad
Alec teams up with Julian
~~~

Have fun!

## Third day: solution / simulate unix pipelines

I hope yesterday's exercise was reasonably doable for you. Here are the answers in case you got stuck:

In [62]:
# list comprehension to title case names

[name.title() for name in NAMES]

['Arnold Schwarzenegger',
 'Alec Baldwin',
 'Bob Belderbos',
 'Julian Sequeira',
 'Sandra Bullock',
 'Keanu Reeves',
 'Julbob Pybites',
 'Bob Belderbos',
 'Julian Sequeira',
 'Al Pacino',
 'Brad Pitt',
 'Matt Damon',
 'Brad Pitt']

In [63]:
# list comprehension to reverse first and last names
# using a helper here to show you that you can call functions inside list comprehensions!

def reverse_first_last_names(name):
    first, last = name.split()
    return f'{last} {first}'

[reverse_first_last_names(name) for name in NAMES]

['schwarzenegger arnold',
 'baldwin alec',
 'belderbos bob',
 'sequeira julian',
 'bullock sandra',
 'reeves keanu',
 'pybites julbob',
 'belderbos bob',
 'sequeira julian',
 'pacino al',
 'pitt brad',
 'damon matt',
 'pitt brad']

In [66]:
def gen_pairs():
    # again a list comprehension is great here to get the first names
    # and title case them in just 1 line of code (this comment took 2)
    first_names = [name.split()[0].title() for name in NAMES]
    while True:
        # added this when I saw Julian teaming up with Julian (always test your code!)
        first, second = None, None
        while first == second:
            first, second = random.sample(first_names, 2)
        yield f'{first} teams up with {second}'

In [67]:
pairs = gen_pairs()
for _ in range(10):
    print(next(pairs))

Bob teams up with Julian
Brad teams up with Alec
Alec teams up with Sandra
Julbob teams up with Bob
Julbob teams up with Arnold
Julian teams up with Al
Brad teams up with Julian
Brad teams up with Matt
Sandra teams up with Bob
Julian teams up with Arnold
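
The inner `while` loop is needed because `random.sample` picks two different positions, and the list of first names contains duplicates (Bob, Julian and Brad each appear twice). One possible variation, sketched below under the hypothetical name `gen_pairs_unique` and with a shortened copy of `NAMES`, is to drop the duplicates first so the retry loop is no longer needed. Note that this also changes the odds: every name becomes equally likely.

```python
import random

# Shortened copy of NAMES, duplicates included on purpose
NAMES = ['bob belderbos', 'julian sequeira', 'bob belderbos',
         'brad pitt', 'brad pitt', 'al pacino']

def gen_pairs_unique(names=NAMES):
    # dict.fromkeys drops duplicates while keeping the original order
    first_names = list(dict.fromkeys(name.split()[0].title() for name in names))
    while True:
        # two different positions are now guaranteed to be two different names
        first, second = random.sample(first_names, 2)
        yield f'{first} teams up with {second}'

pairs = gen_pairs_unique()
for _ in range(3):
    print(next(pairs))
```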


Another way to get a slice of a generator is using `itertools.islice`:

In [69]:
first_ten = itertools.islice(pairs, 10)
first_ten

<itertools.islice at 0x10b1a1048>

In [70]:
list(first_ten)

['Al teams up with Keanu',
 'Alec teams up with Al',
 'Al teams up with Brad',
 'Alec teams up with Julian',
 'Bob teams up with Arnold',
 'Brad teams up with Bob',
 'Arnold teams up with Matt',
 'Julbob teams up with Brad',
 'Matt teams up with Julbob',
 'Keanu teams up with Alec']
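
Here is a minimal sketch of two more things worth knowing about `itertools.islice`: it also accepts start, stop and step arguments, and it consumes the underlying generator, so a later slice continues where the previous one stopped. The `naturals` generator is a made-up example.

```python
import itertools

def naturals():
    # Infinite generator; regular slicing like naturals()[:5] raises a TypeError
    n = 1
    while True:
        yield n
        n += 1

nums = naturals()
print(list(itertools.islice(nums, 5)))         # [1, 2, 3, 4, 5]

# The same generator continues from 6; take every second item of the next six
print(list(itertools.islice(nums, 0, 6, 2)))   # [6, 8, 10]
```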