# CME 211: Extra Python Topics!

These notes contain some examples and pointers to various Python topics that I
find myself using quite a bit.

## Iterators

We've seen the `range` function in Python. In Python 2, `range` returned a list.
In Python 3, `range` returns a `range` object, an "iterable" that produces its
values on demand, which avoids the memory allocation for a full list.

In [1]:
r = range(10)
print(r)
print(type(r))

range(0, 10)
<class 'range'>


We often write:

In [2]:
for i in range(10):
    print(i, end=" ")
print()

0 1 2 3 4 5 6 7 8 9 


Python defines an interface
for [iterators](https://docs.python.org/3/tutorial/classes.html#iterators). The
key built-in functions are `iter()` and `next()`, which call an object's
`__iter__()` and `__next__()` methods:

In [3]:
r = iter(range(3)) # iter returns an iterator
print(next(r))
print(next(r))
print(next(r))
print(next(r))

0
1
2


StopIteration: 
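
As a rough sketch of what a `for` loop does with these two functions, the loop
below is written out by hand. It approximates the real machinery, and the names
`it` and `collected` are made up for this illustration:

```python
# Roughly what `for x in range(3): collected.append(x)` does under the hood.
collected = []
it = iter(range(3))      # ask the iterable for an iterator
while True:
    try:
        x = next(it)     # fetch the next item
    except StopIteration:
        break            # the iterator is exhausted, so the loop ends
    collected.append(x)

print(collected)
```

The `StopIteration` exception is how an iterator signals that it has no more
items; a `for` loop catches it silently.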

Let's write our own simplified implementation of `range`:

In [4]:
class my_range:
    def __init__(self,n):
        self.i = 0
        self.n = n
    def __iter__(self):
        return self
    def __next__(self):
        if self.i == self.n:
            raise StopIteration
        t = self.i
        self.i += 1
        return t

In [5]:
for i in my_range(4):
    print(i,end=" ")
print()

0 1 2 3 


An object with a `__next__()` method that behaves in the above manner (and an
`__iter__()` method that returns the object itself) is called an **iterator** in
Python.
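
One consequence is worth seeing in a small sketch: an iterator is used up after
a single pass, whereas an iterable such as `range` hands out a fresh iterator
every time it is looped over. The class below repeats the `my_range` definition
so the block runs on its own; the other variable names are illustrative.

```python
class my_range:
    """Same simplified iterator as above, repeated so this block is self-contained."""
    def __init__(self, n):
        self.i = 0
        self.n = n
    def __iter__(self):
        return self
    def __next__(self):
        if self.i == self.n:
            raise StopIteration
        t = self.i
        self.i += 1
        return t

m = my_range(3)
first_pass = list(m)    # consumes the iterator
second_pass = list(m)   # nothing left: self.i already equals self.n

r = range(3)
range_first = list(r)   # range is an iterable, not an iterator...
range_second = list(r)  # ...so each pass starts over

print(first_pass, second_pass)
print(range_first, range_second)
```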

## Generators

Defining a class for an iterator can be a bit verbose. Python has a keyword
called `yield` which allows you to easily write a **generator**.

In [6]:
def my_range2(n):
    i = 0
    while i < n:
        yield i
        i += 1

In [7]:
for i in my_range2(4):
    print(i,end=" ")
print()

0 1 2 3 


This allows you to create an iterator by just writing a function!
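
A small sketch of how a generator function executes may help here: calling it
runs none of the body, and each `next()` resumes the body until the following
`yield`. The function `countdown` and the list `log` are made up for this
illustration.

```python
log = []

def countdown(n):
    log.append("started")
    while n > 0:
        yield n
        n -= 1
    log.append("finished")

g = countdown(2)          # creates the generator; the body has not run yet
log_after_call = list(log)

first = next(g)           # runs the body up to the first yield
second = next(g)          # resumes after the yield, loops, yields again
try:
    next(g)               # body runs to the end, which raises StopIteration
    exhausted = False
except StopIteration:
    exhausted = True

print(log_after_call, first, second, exhausted, log)
```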

## Application: iterating over words in a file

Let's say we wanted to count unique words in Shakespeare's entire body of work.

First, let's download it:

In [8]:
!wget https://ocw.mit.edu/ans7870/6/6.006/s08/lecturenotes/files/t8.shakespeare.txt

--2016-10-21 11:59:12--  https://ocw.mit.edu/ans7870/6/6.006/s08/lecturenotes/files/t8.shakespeare.txt
Resolving ocw.mit.edu... 184.30.176.137, 2001:418:142c:18e::18a8, 2001:418:142c:194::18a8
Connecting to ocw.mit.edu|184.30.176.137|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5458199 (5.2M) [text/plain]
Saving to: 't8.shakespeare.txt.1'


2016-10-21 11:59:17 (1006 KB/s) - 't8.shakespeare.txt.1' saved [5458199/5458199]



And inspect:

In [9]:
!head t8.shakespeare.txt

This is the 100th Etext file presented by Project Gutenberg, and
is presented in cooperation with World Library, Inc., from their
Library of the Future and Shakespeare CDROMS.  Project Gutenberg
often releases Etexts that are NOT placed in the Public Domain!!

Shakespeare

*This Etext has certain copyright implications you should read!*

<<THIS ELECTRONIC VERSION OF THE COMPLETE WORKS OF WILLIAM


To loop over words, we could write code like:

In [10]:
with open("t8.shakespeare.txt","r") as f:
    for i, line in enumerate(f):
        # only loop over the first 11 lines (i = 0, ..., 10)
        if i > 10:
            break
        # loop over words
        for word in line.split():
            print(word)

This
is
the
100th
Etext
file
presented
by
Project
Gutenberg,
and
is
presented
in
cooperation
with
World
Library,
Inc.,
from
their
Library
of
the
Future
and
Shakespeare
CDROMS.
Project
Gutenberg
often
releases
Etexts
that
are
NOT
placed
in
the
Public
Domain!!
Shakespeare
*This
Etext
has
certain
copyright
implications
you
should
read!*
<<THIS
ELECTRONIC
VERSION
OF
THE
COMPLETE
WORKS
OF
WILLIAM
SHAKESPEARE
IS
COPYRIGHT
1990-1993
BY
WORLD
LIBRARY,
INC.,
AND
IS


Now let's say we have to loop over words of two different files. The code could
get quite messy.  Let's use a generator:

In [11]:
def words(filename):
    with open(filename,"r") as f:
        for line in f:
            for word in line.split():
                yield word

Now, the code that operates on words can be:

In [12]:
for i, word in enumerate(words("t8.shakespeare.txt")):
    if i > 10:
        break
    print(word)

This
is
the
100th
Etext
file
presented
by
Project
Gutenberg,
and
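
For the two-file case mentioned above, one option is `itertools.chain`, which
strings several iterators together. This is a sketch under assumptions: it
repeats the `words` generator so the block is self-contained, and it writes two
small temporary files (with made-up contents) in place of real texts.

```python
import os
import tempfile
from itertools import chain

def words(filename):
    with open(filename, "r") as f:
        for line in f:
            for word in line.split():
                yield word

# Two throwaway files standing in for real documents.
paths = []
for text in ["to be or\nnot to be\n", "all the world's\na stage\n"]:
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
        tmp.write(text)
        paths.append(tmp.name)

# chain() yields every word of the first file, then every word of the second.
all_words = list(chain(words(paths[0]), words(paths[1])))
print(all_words)

# Clean up the temporary files.
for p in paths:
    os.remove(p)
```

The code that consumes the words does not need to know how many files are
involved; it only sees one stream of words.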


### Counting words

Let's use a dictionary to count words.  Our first try might look like this:

In [13]:
word_count = {}
for word in words("t8.shakespeare.txt"):
    # let's clean the word up
    w = word.strip().lower()
    if w in word_count:
        # w is already in dict, so increment
        word_count[w] += 1
    else:
        # w is not in dict, so set to 1
        word_count[w] = 1

This is a common pattern. The Python `collections` module has something called
`defaultdict` that can help:

In [14]:
from collections import defaultdict
word_count = defaultdict(int)
for word in words("t8.shakespeare.txt"):
    w = word.strip().lower()
    word_count[w] += 1
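
To see why this works, here is a small sketch with a made-up word list:
`defaultdict(int)` calls `int()`, which returns `0`, whenever a missing key is
looked up, so `+= 1` always has a starting value.

```python
from collections import defaultdict

counts = defaultdict(int)
for w in ["to", "be", "or", "not", "to", "be"]:   # toy data, not the Shakespeare text
    counts[w] += 1        # a missing key is first created with int() == 0

print(dict(counts))

# Caveat: merely reading a missing key also inserts it.
missing = counts["question"]
print(missing, "question" in counts)
```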

Let's explore a little bit:

In [15]:
print(word_count['love'])
print(word_count['hate'])
print(word_count['king'])
print(word_count['queen'])

1279
119
1698
466


How might we get words with the highest count?

In [16]:
# create a list of tuples with (count, word)
word_count_list = [(c, w) for w, c in word_count.items()]
# tuples sort by their first element (the count); ties are broken by the word
word_count_list.sort(reverse=True)
word_count_list[0:10]

[(27549, 'the'),
 (26037, 'and'),
 (19540, 'i'),
 (18700, 'to'),
 (18010, 'of'),
 (14383, 'a'),
 (12455, 'my'),
 (10671, 'in'),
 (10630, 'you'),
 (10487, 'that')]

Sweet!
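
As an alternative sketch (not what the cells above use), `collections.Counter`
from the standard library bundles the counting and the ranking. The word list
here is toy data standing in for the output of `words(...)`.

```python
from collections import Counter

toy_words = ["the", "and", "The", "to", "the", "and"]   # stand-in data
counter = Counter(w.lower() for w in toy_words)

# most_common(k) returns the k highest (word, count) pairs, largest count first.
# Note the order inside each tuple is (word, count), not (count, word).
top_two = counter.most_common(2)
print(top_two)
```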

## Generator expressions

We've seen list comprehensions:

In [17]:
xs = list(range(10))
ys = [x*x for x in xs]

A list comprehension creates an entirely new list in memory and does all of the
computation before the result (elements of `ys`) is used.

A generator expression creates an iterator that yields the same elements as the
list, one at a time:

```
yg = (x*x for x in xs)
type(yg)
next(yg)
next(yg)
```


A generator expression does not create a list in memory. The computation is only
performed when `next` is called.
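
A rough way to see the difference, assuming CPython (where `sys.getsizeof`
reports the size of the container object itself, not of its elements). The
variable names are illustrative:

```python
import sys

n = 100_000
squares_list = [x * x for x in range(n)]   # all n results are computed and stored
squares_gen = (x * x for x in range(n))    # nothing has been computed yet

list_bytes = sys.getsizeof(squares_list)
gen_bytes = sys.getsizeof(squares_gen)
print(list_bytes, gen_bytes)

# The generator can still feed a reduction such as sum() one item at a time.
total = sum(squares_gen)
# It is a one-shot iterator: a second pass finds nothing left.
leftover = list(squares_gen)
print(total == sum(squares_list), leftover)
```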

You can pass a generator expression anywhere an iterable is expected:

In [18]:
xs = [1, 0, 0, 1, 0, 1]

In [19]:
# any returns True if any items of a collection (or iterator) are True
any(x == 1 for x in xs)

True

In [20]:
# all returns True only if all items in a collection (or iterator) are True
all(x == 1 for x in xs)

False
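
Because the generator expression is lazy, `any` and `all` can stop as soon as
the answer is known. The sketch below uses a hypothetical helper, `is_one`, that
records which elements were actually inspected.

```python
inspected = []

def is_one(x):
    inspected.append(x)   # record every element that gets tested
    return x == 1

xs = [0, 0, 1, 0, 1]

# Generator expression: any() stops at the first True.
found_lazy = any(is_one(x) for x in xs)
inspected_lazy = list(inspected)

# List comprehension: every element is tested before any() even starts.
inspected.clear()
found_eager = any([is_one(x) for x in xs])
inspected_eager = list(inspected)

print(found_lazy, inspected_lazy)
print(found_eager, inspected_eager)
```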

You can also use a generator expression in a `for` loop:

In [21]:
word_count = defaultdict(int)
for word in (w.strip().lower() for w in words("t8.shakespeare.txt")):
    word_count[word] += 1

## Itertools

Have a look at
the [`itertools`](https://docs.python.org/3/library/itertools.html) module. It
contains functions that build and combine iterators, which helps you write
loops.

First, let's see `zip`, which is actually a
Python [built-in function](https://docs.python.org/3/library/functions.html):

In [22]:
letters = 'abcde'
numbers = [1,2,3,4,5]

# let's iterate over letters and numbers at same time
for l, n in zip(letters, numbers):
    print("{} {}".format(l, n))

a 1
b 2
c 3
d 4
e 5


Exercise: what does `zip` return?  How can I get a list from it?

Now let's use the `product` function to iterate over all pairs:

In [23]:
from itertools import product
for l, n in product(letters, numbers):
    print("{} {}".format(l, n))

a 1
a 2
a 3
a 4
a 5
b 1
b 2
b 3
b 4
b 5
c 1
c 2
c 3
c 4
c 5
d 1
d 2
d 3
d 4
d 5
e 1
e 2
e 3
e 4
e 5
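
For two inputs, `product` yields the same pairs, in the same order, as a pair of
nested `for` loops. A small check with shortened inputs (the names `nested` and
`pairs` are illustrative):

```python
from itertools import product

letters = "ab"
numbers = [1, 2, 3]

nested = []
for l in letters:
    for n in numbers:        # the rightmost input varies fastest
        nested.append((l, n))

pairs = list(product(letters, numbers))
print(pairs == nested)
print(pairs)
```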
