# Understanding Iterators and (mostly) Generators
Seetha Krishnan
<br>
ASPP - Asia Pacific 2018

## Iterators
Iterators are everywhere. 
An iterable is any object you can loop over, for example with a `for` loop; an iterator is the object that hands out its values one at a time.
<br>In this extremely simple example, __range(4)__ is the iterable, which at each iteration provides a different value to the variable __"i"__.

In [1]:
for i in range(4):
    print(i)

0
1
2
3
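
A minimal sketch of what the `for` loop above does behind the scenes: `iter()` asks the iterable for an iterator, and `next()` pulls values from it until `StopIteration` is raised. The variable names are illustrative only.

```python
# Manual version of `for i in range(4): print(i)`
iterable = range(4)
iterator = iter(iterable)       # ask the iterable for an iterator

collected = []
while True:
    try:
        value = next(iterator)  # fetch the next value
    except StopIteration:       # raised when the iterator is exhausted
        break
    collected.append(value)
    print(value)
```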


You can iterate over strings, lists, files, dictionaries, etc.

In [3]:
import numpy as np
filename = 'Textfiles/sometext.txt'
with open(filename) as f:
    for linenumber, lines in enumerate(f):
        print(f'{linenumber} > {lines}')

0 > The skill to do math on a page

1 > Has declined to the point of outrage.

2 > Equations quadratica

3 > Are solved on Mathematica,

4 > And on birthdays we don't know our age.


### Task - Find read length of lines
The file WallabyDNAseq.txt contains 10000 lines of DNA sequence. Calculate the average and standard deviation of the length of the sequences

In [None]:
lengths = []
filename = 'WallabyDNAseq.txt'
with open(filename) as f:
    for line in f:
        # Strip the trailing newline so it is not counted as part of the sequence
        lengths.append(len(line.rstrip('\n')))

print(f'mean {np.mean(lengths):0.2f}, standard deviation {np.std(lengths):0.2f}')

## Generators
Generators are a simple yet elegant type of iterator.

__To create generators:__ 
- Define a function
- Instead of the return statement, use the __yield__ keyword.

In [4]:
def charcount(filename):
    """Generator function that reads a file and yields each line with its character count"""
    with open(filename) as fin:
        for line in fin:
            # len(line) includes the trailing newline, if there is one
            yield line, len(line)

In [5]:
c = charcount(filename='Textfiles/sometext.txt')
print(c)

<generator object charcount at 0x119c46938>


A generator does not hold all of its results in memory.
<br>It "yields" one result at a time and hasn't computed anything until you ask for a value by calling `next`.

In [16]:
c1 = next(c)

StopIteration: 

In [15]:
print(c1)

("And on birthdays we don't know our age.", 39)
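
The `StopIteration` shown above is what `next` raises once a generator has no values left (presumably the cell was run again after the last line had already been read). A small sketch with a hypothetical `countdown` generator, which is not part of the tutorial files, shows the full life cycle:

```python
def countdown(n):
    """Hypothetical toy generator: yields n, n-1, ..., 1."""
    while n > 0:
        yield n
        n -= 1


gen = countdown(2)
first = next(gen)    # runs the body up to the first yield
second = next(gen)   # resumes right after that yield

try:
    next(gen)        # nothing left, so StopIteration is raised
    exhausted = False
except StopIteration:
    exhausted = True

# Passing a default value avoids the exception on an exhausted generator
fallback = next(gen, None)
print(first, second, exhausted, fallback)
```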


Instead of calling `next` every time, you will typically loop over the generator directly, because a generator is itself an __iterator object__

In [18]:
c = charcount(filename='Textfiles/sometext.txt')
for l in c:
    print(f'> {l[0][:-1]} \t char count: {l[1]}')

> The skill to do math on a page 	 char count: 31
> Has declined to the point of outrage. 	 char count: 38
> Equations quadratica 	 char count: 21
> Are solved on Mathematica, 	 char count: 27
> And on birthdays we don't know our age 	 char count: 39


#### See basic_generator_example.py

### Task 1a
- Define a second generator that yields the number of words in each line
- Use charcount(filename) as an input to this generator
- output print statement should include (linenumber, charcount and word count)

Tip : What I like to do when writing a generator is to print statements instead of yield.
When I am satisfied with the accuracy of the print statement, I convert it to yield

In [20]:
def countwords(linearray):
    """ A generator that gets the
    number of words from each line of the text file
    Loops through the input argument linearray, which yields (line, character count) pairs,
    and yields (character count, word count)
    """
    for lines, ccount in linearray:
        yield ccount, len(lines.split())

In [21]:
for i, l in enumerate(countwords(charcount(filename='Textfiles/sometext.txt'))):
    print(f'> line number:{i} char count: {l[0]}, \
    number of words: {l[1]}')

> line number:0 char count: 31, number of words: 8
> line number:1 char count: 38, number of words: 7
> line number:2 char count: 21, number of words: 2
> line number:3 char count: 27, number of words: 4
> line number:4 char count: 39, number of words: 8


#### See memoryusage_generators.py

## Generators are great for large datasets that you want to process one line at a time
- A __generator__ is also an __iterator__ (but not vice versa)!
- Generators can iterate over data __lazily__ without loading the entire data source into memory at once.

- When functions `return`, they are done for good. Generators stay alive until their values are exhausted
- Functions always start from the first line, generators resume where they left off: right after the last __yield__
- __Limitation__ - with a generator you can only iterate forward. You can't peek ahead or look behind
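
The points above can be sketched with a list and a generator expression (a compact way to build a generator that this tutorial does not otherwise use). The names and the size `n` are made up for illustration; the exact byte counts depend on the Python build, so they are printed rather than stated.

```python
import sys

n = 100_000
squares_list = [i * i for i in range(n)]   # every value is stored at once
squares_gen = (i * i for i in range(n))    # values are produced on demand

list_size = sys.getsizeof(squares_list)    # size of the list object itself
gen_size = sys.getsizeof(squares_gen)      # small, and does not grow with n

total = sum(squares_gen)                   # consumes the generator
leftover = list(squares_gen)               # already exhausted, so this is empty
print(list_size, gen_size, total, leftover)
```

Once the generator has been consumed by `sum`, there is nothing left to iterate over, which is the "only iterate forward" limitation in practice.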

<img src="files/generator-iterator-confusion.png"  height="500" >

## Task 2 : Streaming with `yield`
Multiple CSV files stored in a directory contain the x-y position of a swimming zebrafish over time.
<br>__The task:__
1. Loop through each csv file, acquire the x and y position and find distance travelled by the fish at each time point.
2. To find distance travelled between two timepoints, you need to get the x and y position of fish at two consecutive frames.
3. Using the acquired distance travelled, print time spent by the fish at a speed below the threshold. 

  <img src="files/fish.png"  width="350" >

### Read from csv files - line by line

In [23]:
import csv
import os


def CSVfileGrabber(dirname):
    """Step 1 : Grab CSV files from a directory """
    for filename in os.listdir(dirname):
        if filename.endswith('.csv'):
            print('Working on: {}'.format(filename[:5]))  # Print name of fish
            yield os.path.join(dirname, filename)


def readxy(filename):
    """Step 2 : read the csv files line by line """
    with open(filename) as f:
        csvreader = csv.reader(f)
        for i, line in enumerate(csvreader):
            # Skip the first 10 lines
            if i < 10:
                continue
            # x and y coordinates are in the third and fourth columns
            x = int(line[2])
            y = int(line[3])
            yield (x, y)

Just to make sure things are working

In [27]:
dirname = '/Users/seetha/Desktop/ASPP2018/FishtrackingExample/'  # A small sample dataset

# for files in CSVfileGrabber(dirname):
#     print(files)
    
for files in CSVfileGrabber(dirname):
    numline = 0
    for g in readxy(files):
#         print(g)
        numline += 1
    print('Parsed lines from this csv file is {}'.format(numline))

Working on: Fish1
Parsed lines from this csv file is 17
Working on: Fish2
Parsed lines from this csv file is 17


### Get consecutive values for distance calculation

In [30]:
def consecutivexy1(linearray):
    """Step 4: get consecutive xy values"""
    # Here we want two consecutive xy values to get speed/frame
    # Make use of the built-in next() function (linearray must be an iterator, e.g. a generator)
    for i, line in enumerate(linearray):
        if i == 0:
            prevxy = line
            nextxy = next(linearray)
        else:
            prevxy = nextxy
            nextxy = line
        yield prevxy, nextxy

A nicer way to do this is with itertools, an amazing standard-library module for looping through iterators: https://docs.python.org/3/library/itertools.html
<br> `tee` : Return n independent iterators from a single iterable. `tee(seq, n)`
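
A minimal sketch of `tee` on a toy list, assuming made-up coordinates: one of the two iterators is advanced by one step, so zipping them gives consecutive pairs.

```python
from itertools import tee

points = [(0, 0), (3, 4), (6, 8)]   # made-up coordinates

first, second = tee(points, 2)      # two independent iterators
next(second, None)                  # advance the second one by one step
pairs = list(zip(first, second))    # zip stops at the shorter iterator
print(pairs)
```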

In [31]:
from itertools import tee


def consecutivexy2(linearray):
    # tee returns two independent iterators over the same iterable
    prevxy, nextxy = tee(linearray, 2)
    next(nextxy, None)  # discard one; the default avoids an error on empty input
    yield from zip(prevxy, nextxy)  # Note here I am using "yield from"
    
#     prev = next(linearray)
#     for item in linearray:
#         yield prev, item
#         prev = item
    

In [34]:
# Just to make sure things are working
for files in CSVfileGrabber(dirname):
    numline = 0
    for x, y in consecutivexy2(readxy(files)):
#         print(x, y)
        numline += 1
    print('Parsed lines from this csv file is {}'.format(numline))

Working on: Fish1
(219, 113) (219, 113)
(219, 113) (238, 110)
(238, 110) (224, 110)
(224, 110) (248, 109)
(248, 109) (266, 109)
(266, 109) (278, 111)
(278, 111) (269, 110)
(269, 110) (292, 117)
(292, 117) (310, 118)
(310, 118) (319, 120)
(319, 120) (338, 119)
(338, 119) (330, 115)
(330, 115) (339, 114)
(339, 114) (356, 119)
(356, 119) (353, 121)
(353, 121) (358, 121)
Parsed lines from this csv file is 16
Working on: Fish2
(705, 130) (666, 151)
(666, 151) (659, 151)
(659, 151) (651, 151)
(651, 151) (645, 151)
(645, 151) (633, 150)
(633, 150) (622, 147)
(622, 147) (615, 147)
(615, 147) (604, 147)
(604, 147) (609, 149)
(609, 149) (614, 147)
(614, 147) (570, 146)
(570, 146) (587, 148)
(587, 148) (578, 148)
(578, 148) (572, 149)
(572, 149) (565, 150)
(565, 150) (561, 150)
Parsed lines from this csv file is 16


### Sidenote : `yield from`
With `yield from`, we can skip an extra `for` loop

In [35]:
# A simple example to see what the yield from expression does
A = range(5)
B = range(6, 11)

# Without yield from
def temp(range1, range2):
    for a, b in zip(range1, range2):
        yield a, b
        
# Two loops!! You need two loops!!
for i in temp(A, B):
    print(i)

(0, 6)
(1, 7)
(2, 8)
(3, 9)
(4, 10)


In [36]:
# Since Python 3.3, which introduced yield from
def yieldfromexample(A, B):
    yield from zip(A, B)
for i in yieldfromexample(A, B):
    print(i)

(0, 6)
(1, 7)
(2, 8)
(3, 9)
(4, 10)


`yield from` is especially useful when you chain multiple iterators or walk recursive data structures
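
As a sketch of the recursive case, a hypothetical `flatten` helper (not part of the tutorial files) can delegate to itself with `yield from` to walk nested lists of any depth:

```python
def flatten(nested):
    """Hypothetical helper: yield the leaves of arbitrarily nested lists."""
    for item in nested:
        if isinstance(item, list):
            yield from flatten(item)   # delegate to the recursive generator
        else:
            yield item


flat = list(flatten([1, [2, [3, 4]], [5]]))
print(flat)
```

Without `yield from`, the recursive branch would need its own inner `for` loop that re-yields every value.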

## Write the next parts on your own
- Step 5 : Calculate distance between the two consecutive points
- Step 6 : Put it all together

In [None]:
# Step 5: Calculate euclidean distance
import math


def getdist(xy):
    """  
    Write a generator function that receives 
    the previous and next x-y location of the fish 
    and calculates the distance between the two points
    
    Euclidean distance between two points (x1, y1) and (x2, y2) is 
    sqrt((x1-x2)^2 + (y1-y2)^2)
   """

In [None]:
# Step 6: Put it all together
def getframes(dist, threshold, frames_per_sec):
    """
    Count frames with distance below a user-defined threshold and
    complete the print statement given below
    (Hint: use enumerate to find number of frames)
    
    Example:
    Of 16.27 seconds recording time, time spent with speed less than 10 is 12.83 seconds
    """
    
    print('Of {:0.2f} seconds recording time, time spent with speed less than {} is {:0.2f} seconds')

### Task 2: Solution

In [39]:
import math

def getdist(xy):
    # Calculate euclidean distance
    for prevxy, nextxy in xy:
        # zip allows you to iterate over two sequences in parallel
        dist = [(a - b)**2 for a, b in zip(prevxy, nextxy)]
        dist = math.sqrt(sum(dist))
        yield dist


def getframes(dist, threshold=10, frames_per_sec=30):
    dist_count = 0
    for i, d in enumerate(dist):
        if d < threshold:
            dist_count += 1
    print(f'Of {(i / frames_per_sec):0.3f} seconds recording time,\
    time spent with speed less than {threshold}\
    is {(dist_count / frames_per_sec):0.3f} seconds')
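
As a quick sanity check of the distance formula, the sketch below compares the expression used in `getdist` with `math.hypot` from the standard library, using two consecutive positions taken from the Fish1 output above. The variable names are illustrative.

```python
import math

prevxy, nextxy = (219, 113), (238, 110)   # two consecutive positions

# Same formula as in getdist: sqrt((x1 - x2)^2 + (y1 - y2)^2)
manual = math.sqrt(sum((a - b) ** 2 for a, b in zip(prevxy, nextxy)))

# math.hypot computes the same quantity from the coordinate differences
dx = prevxy[0] - nextxy[0]
dy = prevxy[1] - nextxy[1]
shortcut = math.hypot(dx, dy)
print(manual, shortcut)
```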

In [38]:
# Test your code with larger datasets
dirname = '/Users/seetha/Desktop/Microbetest/Collective/'
for files in CSVfileGrabber(dirname):
    getframes(
        getdist(
            consecutivexy1(
                readxy(files))), threshold=10, frames_per_sec=30)

Working on: Fish1
Of 16.267 seconds recording time,    time spent with speed less than 10     is 12.833 seconds
Working on: Fish2
Of 600.367 seconds recording time,    time spent with speed less than 10     is 554.267 seconds
Working on: Fish4
Of 599.567 seconds recording time,    time spent with speed less than 10     is 511.933 seconds
Working on: Fish5
Of 598.733 seconds recording time,    time spent with speed less than 10     is 379.467 seconds
Working on: Fish6
Of 16.267 seconds recording time,    time spent with speed less than 10     is 15.133 seconds


## The above statement that calls multiple generators looks ugly.
In such cases, with multiple generators lined up, yield can start to feel unintuitive and tedious

Enter Toolz
<br> Toolz by Matt Rocklin - http://toolz.readthedocs.io/en/latest/
<br> It makes streaming super easy - intuitive and concise!

For more examples and explanation, see Elegant SciPy, written by the brilliant ASPP faculty - https://github.com/elegant-scipy/notebooks/blob/master/notebooks/ch8.ipynb

(Filed under things I can't believe I hardly used before this tutorial)

#### tz.pipe - passes a value through a sequence of functions - one by one
Pipe is simply syntactic sugar to make multiple function calls easy
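
To make the "syntactic sugar" concrete, here is a rough standard-library sketch of the idea. `simple_pipe`, `double` and `add_one` are hypothetical stand-ins written for this illustration, not the actual Toolz implementation.

```python
from functools import reduce


def simple_pipe(value, *funcs):
    """Hypothetical stand-in for tz.pipe: feed value through funcs in order."""
    return reduce(lambda acc, func: func(acc), funcs, value)


def double(x):
    return 2 * x


def add_one(x):
    return x + 1


nested = add_one(double(5))              # read inside-out
piped = simple_pipe(5, double, add_one)  # read left to right
print(nested, piped)
```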

In [61]:
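import toolz as tz
# This pipe function does exactly the same as the previous nested call (without the added brackets).
# The function calls are cleaner and can be read from left to right - which is so much better


def pipeline(filename):
    # tz.pipe calls every function with a single argument, so writing
    # getframes(threshold=10, frames_per_sec=20) here would call a plain getframes
    # without its data and raise a TypeError.
    # frame_someinput (Solution1 below) wraps getframes so that it takes one argument.
    pipe = tz.pipe(filename,
                   readxy,
                   consecutivexy1,
                   getdist,
                   frame_someinput
                   )
    return pipe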

In [62]:
dirname = '/Users/seetha/Desktop/Microbetest/Collective/'
for i in CSVfileGrabber(dirname):
    pipeline(i)

Working on: Fish1
Of 16.267 seconds recording time,    time spent with speed less than 10    is 12.833 seconds
Working on: Fish2
Of 600.367 seconds recording time,    time spent with speed less than 10    is 554.267 seconds
Working on: Fish4
Of 599.567 seconds recording time,    time spent with speed less than 10    is 511.933 seconds
Working on: Fish5
Of 598.733 seconds recording time,    time spent with speed less than 10    is 379.467 seconds
Working on: Fish6
Of 16.267 seconds recording time,    time spent with speed less than 10    is 15.133 seconds


#### Solution1 - hard code the arguments

In [42]:
def frame_someinput(sequence):
    return getframes(sequence, threshold=10, frames_per_sec=30)

#### Solution2 - functools.partial
`functools.partial` "partially" applies a function: it fixes the arguments given now and returns a new callable that waits for the remaining arguments before the function is actually evaluated

In [48]:
import functools

In [60]:
f = functools.partial(getframes, threshold=10, frames_per_sec=30)
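
A self-contained sketch of the same idea on a toy function. `count_below` is a hypothetical helper made up for this example; the point is that the callable returned by `partial` takes only the data argument, which is the shape a pipeline step needs.

```python
import functools


def count_below(values, threshold=10):
    """Hypothetical helper: count the values smaller than threshold."""
    return sum(1 for v in values if v < threshold)


# Fix threshold now; the data argument is supplied later
below_five = functools.partial(count_below, threshold=5)

result = below_five([1, 4, 7, 9])
print(result)
```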

#### Solution3 - currying

In [63]:
from toolz import curry
curried_get_frames = curry(getframes)
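
A curried function can be called with only some of its arguments and then returns a new function that waits for the rest. The sketch below imitates that behaviour with the standard library only; `simple_curry` and `count_below` are hypothetical and much simpler than `toolz.curry`, so treat it as an illustration of the idea rather than of the library's internals.

```python
import functools
import inspect


def simple_curry(func):
    """Hypothetical, much-simplified stand-in for toolz.curry."""
    signature = inspect.signature(func)

    def curried(*args, **kwargs):
        try:
            signature.bind(*args, **kwargs)   # are all required arguments present?
        except TypeError:
            # Not yet: remember what we have and keep waiting for the rest
            return simple_curry(functools.partial(func, *args, **kwargs))
        return func(*args, **kwargs)

    return curried


def count_below(values, threshold=10):
    """Hypothetical helper: count the values smaller than threshold."""
    return sum(1 for v in values if v < threshold)


curried_count = simple_curry(count_below)
below_five = curried_count(threshold=5)   # data still missing: returns a function
result = below_five([1, 4, 7, 9])         # now everything is there: evaluates
print(result)
```

With the curried `getframes` above, a call such as `curried_get_frames(threshold=10, frames_per_sec=30)` should likewise return a function that still waits for the distances, so it can be used as the last step of `tz.pipe`.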