# DS-GA-3001 Advanced Python for Data Science

Before you turn this problem in, make sure you **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart). You can then either run all cells (in the menubar, select Cell$\rightarrow$Run All), or run each cell individually, **in order**, during the class.

Any textual answers that need to be provided will be marked with "YOUR ANSWER HERE". Replace this text with your answer to the question.

Any code answers that need to be provided will be marked with:

```
# YOUR CODE HERE
raise NotImplementedError()
```

Replace all this code with your answer to the question. If you do not answer the question, the `NotImplementedError` exception will be raised, which will indicate to the grader that no answer has been supplied.

In many cases, code answers will also have some associated test code. You should execute the tests after you have entered your code in order to ensure that your answer is correct. You should not proceed to the next question until your answer is correct.

Finally, insert your name and any collaborators in the cell below.

In [1]:
NAME = "Jiayi Lu (jl6583)" 
COLLABORATORS = ""

---

# The `itertools` module

The [`itertools`](https://docs.python.org/2/library/itertools.html) module implements a number of iterator building blocks that provide a set of fast, memory efficient tools. These building blocks can be used to form an “iterator algebra” making it possible to construct specialized tools succinctly and efficiently in pure Python.

Iterator-based code may be preferred over code which uses lists for several reasons. Since data is not produced from the iterator until it is needed, the data never has to be held in memory all at once. Lower memory usage can reduce swapping and other side effects of large data sets, which in turn improves performance.
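
As a rough illustration of the memory argument, the sketch below compares a fully built list with a generator over the same range. It assumes a Python 3 interpreter (the notebook cells themselves use Python 2 syntax), and the names `as_list` and `as_iter` are made up for this example. Note that `sys.getsizeof()` reports only the size of the container object itself, so the numbers are indicative rather than exact.

```python
import sys

n = 10 ** 6
as_list = list(range(n))          # every value is materialized up front
as_iter = (x for x in range(n))   # values are produced one at a time, on demand

# getsizeof() measures the container only, not the integers it refers to
list_bytes = sys.getsizeof(as_list)
iter_bytes = sys.getsizeof(as_iter)
print(list_bytes > iter_bytes)
```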

## Merging and splitting iterators

The [`chain()`](https://docs.python.org/2/library/itertools.html#itertools.chain) function takes several iterators as arguments and returns a single iterator that produces the contents of all of them as though they came from a single sequence.

In [2]:
from itertools import *

for i in chain([1, 2, 3], ['a', 'b', 'c']):
    print i

1
2
3
a
b
c
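
The cells in this notebook use Python 2 syntax. As a small sketch of the same call under Python 3 (an assumption about the reader's interpreter, not part of the original notebook), the chained iterator can be collected with `list()` to inspect its contents:

```python
from itertools import chain

# chain() is lazy: the inputs are walked one after another only when consumed
merged = list(chain([1, 2, 3], ['a', 'b', 'c']))
print(merged)  # [1, 2, 3, 'a', 'b', 'c']
```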


The [`izip()`](https://docs.python.org/2/library/itertools.html#itertools.izip) function returns an iterator that combines the elements of several iterators into tuples. It works like the built-in function `zip()`, except that it returns an iterator instead of a list.

In [3]:
from itertools import *

for i in izip([1, 2, 3], ['a', 'b', 'c']):
    print i

(1, 'a')
(2, 'b')
(3, 'c')
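
For readers on Python 3 (an assumption, since the notebook targets Python 2): `itertools.izip()` no longer exists there, and the built-in `zip()` already returns a lazy iterator. A minimal sketch:

```python
pairs = zip([1, 2, 3], ['a', 'b', 'c'])

# An iterator returns itself from iter(); a list would not
is_lazy = iter(pairs) is pairs
result = list(pairs)
print(is_lazy, result)  # True [(1, 'a'), (2, 'b'), (3, 'c')]
```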


The [`islice()`](https://docs.python.org/2/library/itertools.html#itertools.islice) function returns an iterator which returns selected items from the input iterator, by index. It takes the same arguments as the slice operator for lists: start, stop, and step. The start and step arguments are optional.

In [4]:
from itertools import *

print 'Stop at 5:'
for i in islice(count(), 5):
    print i

print 'Start at 5, Stop at 10:'
for i in islice(count(), 5, 10):
    print i

print 'By tens to 100:'
for i in islice(count(), 0, 100, 10):
    print i

Stop at 5:
0
1
2
3
4
Start at 5, Stop at 10:
5
6
7
8
9
By tens to 100:
0
10
20
30
40
50
60
70
80
90
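
The three argument forms map directly onto slice notation. The following sketch, written for Python 3 as an assumption, collects each result into a list so that it can be compared with the equivalent slice of `range(100)`:

```python
from itertools import count, islice

first_five = list(islice(count(), 5))         # like seq[:5]
five_to_ten = list(islice(count(), 5, 10))    # like seq[5:10]
by_tens = list(islice(count(), 0, 100, 10))   # like seq[0:100:10]

print(first_five, five_to_ten, by_tens)
```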


The [`tee()`](https://docs.python.org/2/library/itertools.html#itertools.tee) function returns several independent iterators (defaults to 2) based on a single original input. It has semantics similar to the Unix tee utility, which repeats the values it reads from its input and writes them to a named file and standard output.

In [5]:
from itertools import *

r = islice(count(), 5)
i1, i2 = tee(r)

for i in i1:
    print 'i1:', i
for i in i2:
    print 'i2:', i

i1: 0
i1: 1
i1: 2
i1: 3
i1: 4
i2: 0
i2: 1
i2: 2
i2: 3
i2: 4
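
A short Python 3 sketch (assumed interpreter; the variable names `first` and `second` are illustrative) makes the independence explicit: exhausting one of the returned iterators leaves the other untouched. Once `tee()` has been called, the original iterator `r` should not be advanced directly, because the copies would miss those values.

```python
from itertools import count, islice, tee

r = islice(count(), 5)
i1, i2 = tee(r)

first = list(i1)    # consuming i1 completely ...
second = list(i2)   # ... does not exhaust i2
print(first, second)
```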


## Converting inputs

The [`imap()`](https://docs.python.org/2/library/itertools.html#itertools.imap) function returns an iterator that calls a function on the values in the input iterators, and returns the results. It works like the built-in `map()`, except that it stops when any input iterator is exhausted (instead of inserting `None` values to completely consume all of the inputs).

In the first example, the lambda function multiplies the input values by 2. In the second example, the lambda function multiplies two arguments, taken from separate iterators, and returns a tuple with the original arguments and the computed value.

In [6]:
from itertools import *

print 'Doubles:'
for i in imap(lambda x:2*x, xrange(5)):
    print i

print 'Multiples:'
for i in imap(lambda x,y:(x, y, x*y), xrange(5), xrange(5,10)): #This stops when any of the iterators are exhausted
    print '%d * %d = %d' % i

Doubles:
0
2
4
6
8
Multiples:
0 * 5 = 0
1 * 6 = 6
2 * 7 = 14
3 * 8 = 24
4 * 9 = 36
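
Under Python 3 (an assumption; the cell above is Python 2), `imap()` and `xrange()` are gone and the built-in `map()` and `range()` are already lazy. Python 3's `map()` also stops at the shortest input, which the last line of this sketch checks:

```python
doubles = list(map(lambda x: 2 * x, range(5)))

multiples = ['%d * %d = %d' % t
             for t in map(lambda x, y: (x, y, x * y), range(5), range(5, 10))]

# The first input has 3 items, so only 3 results are produced
short = list(map(lambda x, y: x + y, range(3), range(100)))
print(doubles, multiples, short)
```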


The [`starmap()`](https://docs.python.org/2/library/itertools.html#itertools.starmap) function is similar to `imap()`, but instead of constructing a tuple from multiple iterators it splits up the items in a single iterator as arguments to the mapping function using the `*` syntax. Where the mapping function to `imap()` is called `f(i1, i2)`, the mapping function to `starmap()` is called `f(*i)`.

In [7]:
from itertools import *

values = [(0, 5), (1, 6), (2, 7), (3, 8), (4, 9)]
for i in starmap(lambda x,y:(x, y, x*y), values):
    print '%d * %d = %d' % i

0 * 5 = 0
1 * 6 = 6
2 * 7 = 14
3 * 8 = 24
4 * 9 = 36
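
To make the `f(*i)` calling convention concrete, this Python 3 sketch (assumed interpreter) compares `starmap()` with a hand-written comprehension that unpacks each tuple:

```python
from itertools import starmap

values = [(0, 5), (1, 6), (2, 7), (3, 8), (4, 9)]
multiply = lambda x, y: x * y

products = list(starmap(multiply, values))
# Equivalent spelling: unpack each tuple into positional arguments
unpacked = [multiply(*args) for args in values]
print(products)  # [0, 6, 14, 24, 36]
```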


## Producing new values

The [`count()`](https://docs.python.org/2/library/itertools.html#itertools.count) function returns an iterator that produces consecutive integers, indefinitely. The first number can be passed as an argument; the default is zero. There is no upper bound argument (see the built-in `xrange()` for more control over the result set). In this example, the iteration stops because the list argument is consumed.

In [8]:
from itertools import *

for i in izip(count(1), ['a', 'b', 'c']): #stops when one iterable is consumed
    print i

(1, 'a')
(2, 'b')
(3, 'c')
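
A Python 3 sketch of the same idea (assumed interpreter, with `zip()` standing in for `izip()`). It also uses the optional second argument of `count()`, the step, and bounds the otherwise infinite sequence with `islice()`:

```python
from itertools import count, islice

# zip() stops as soon as the three-item list is consumed
numbered = list(zip(count(1), ['a', 'b', 'c']))

# count(start, step) never ends on its own, so islice() limits it to 4 items
stepped = list(islice(count(10, 5), 4))
print(numbered, stepped)
```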


The [`cycle()`](https://docs.python.org/2/library/itertools.html#itertools.cycle) function returns an iterator that repeats the contents of the iterable it is given indefinitely. Since it has to remember the entire contents of the input iterator, it may consume quite a bit of memory if the iterator is long. In this example, a counter variable is used to break out of the loop after a few cycles.

In [9]:
from itertools import *

i = 0
for item in cycle(['a', 'b', 'c']):
    i += 1
    if i == 10:
        break
    print (i, item)

(1, 'a')
(2, 'b')
(3, 'c')
(4, 'a')
(5, 'b')
(6, 'c')
(7, 'a')
(8, 'b')
(9, 'c')
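
Instead of a manual counter and `break`, the infinite cycle can be bounded by pairing it with a finite iterator. This is one possible alternative, sketched for Python 3 (an assumption), and it yields the same nine pairs as the loop above:

```python
from itertools import cycle

# zip() ends when range(1, 10) runs out, so the cycle is cut off after 9 items
numbered = list(zip(range(1, 10), cycle(['a', 'b', 'c'])))
print(numbered[0], numbered[-1])  # (1, 'a') (9, 'c')
```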


The [`repeat()`](https://docs.python.org/2/library/itertools.html#itertools.repeat) function returns an iterator that produces the same value each time it is accessed. It keeps going forever, unless the optional times argument is provided to limit it.

In [10]:
from itertools import *

for i in repeat('over-and-over', 5): # yields the same value 5 times
    print i

over-and-over
over-and-over
over-and-over
over-and-over
over-and-over


It is useful to combine `repeat()` with `izip()` or `imap()` when invariant values need to be included with the values from the other iterators.

In [11]:
from itertools import *

for i in imap(lambda x,y:(x, y, x*y), repeat(2), xrange(5)):
    print '%d * %d = %d' % i

2 * 0 = 0
2 * 1 = 2
2 * 2 = 4
2 * 3 = 6
2 * 4 = 8
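
The same pairing of an invariant value with a changing one can be written with `zip()` in Python 3 (assumed interpreter). The unbounded `repeat(2)` is safe here because `zip()` stops when `range(5)` is exhausted:

```python
from itertools import repeat

rows = ['%d * %d = %d' % (x, y, x * y) for x, y in zip(repeat(2), range(5))]
print(rows)
```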


## Filtering

The [`dropwhile()`](https://docs.python.org/2/library/itertools.html#itertools.dropwhile) function returns an iterator that returns elements of the input iterator after a condition becomes false for the first time. It does not filter every item of the input; after the condition is false the first time, all of the remaining items in the input are returned.

In [12]:
from itertools import *

def should_drop(x):
    print 'Testing:', x
    return (x<1)

for i in dropwhile(should_drop, [ -1, 0, 1, 2, 3, 4, 1, -2 ]): # once the test fails, all remaining items are yielded untested
    print 'Yielding:', i

Testing: -1
Testing: 0
Testing: 1
Yielding: 1
Yielding: 2
Yielding: 3
Yielding: 4
Yielding: 1
Yielding: -2


The opposite of `dropwhile()`, [`takewhile()`](https://docs.python.org/2/library/itertools.html#itertools.takewhile) returns an iterator that returns items from the input iterator as long as the test function returns true.

In [13]:
from itertools import *

def should_take(x):
    print 'Testing:', x
    return (x<2)

for i in takewhile(should_take, [ -1, 0, 1, 2, 3, 4, 1, -2 ]): # yields items only as long as the condition is satisfied
    print 'Yielding:', i

Testing: -1
Yielding: -1
Testing: 0
Yielding: 0
Testing: 1
Yielding: 1
Testing: 2
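
When `takewhile()` and `dropwhile()` are given the same predicate, they split the input at the first item that fails the test. A Python 3 sketch (assumed interpreter; `head` and `tail` are illustrative names):

```python
from itertools import dropwhile, takewhile

data = [-1, 0, 1, 2, 3, 4, 1, -2]
below_one = lambda x: x < 1

head = list(takewhile(below_one, data))  # items before the first failure
tail = list(dropwhile(below_one, data))  # the first failure and everything after
print(head, tail)  # [-1, 0] [1, 2, 3, 4, 1, -2]
```

Note that the trailing `-2` ends up in `tail` even though it satisfies the predicate, because neither function tests items after the first failure.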


[`ifilter()`](https://docs.python.org/2/library/itertools.html#itertools.ifilter) returns an iterator that works like the built-in `filter()` does for lists, including only items for which the test function returns true. It is different from `dropwhile()` in that every item is tested before it is returned.

In [14]:
from itertools import *

def check_item(x):
    print 'Testing:', x
    return (x<1)

for i in ifilter(check_item, [ -1, 0, 1, 2, 3, 4, 1, -2 ]): #every element is tested
    print 'Yielding:', i

Testing: -1
Yielding: -1
Testing: 0
Yielding: 0
Testing: 1
Testing: 2
Testing: 3
Testing: 4
Testing: 1
Testing: -2
Yielding: -2


The opposite of `ifilter()`, [`ifilterfalse()`](https://docs.python.org/2/library/itertools.html#itertools.ifilterfalse) returns an iterator that includes only items where the test function returns false.

In [15]:
from itertools import *

def check_item(x):
    print 'Testing:', x
    return (x<1)

for i in ifilterfalse(check_item, [ -1, 0, 1, 2, 3, 4, 1, -2 ]):
    print 'Yielding:', i

Testing: -1
Testing: 0
Testing: 1
Yielding: 1
Testing: 2
Yielding: 2
Testing: 3
Yielding: 3
Testing: 4
Yielding: 4
Testing: 1
Yielding: 1
Testing: -2
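
In Python 3 (an assumption; the notebook is written for Python 2), `ifilter()` is replaced by the lazy built-in `filter()`, and `ifilterfalse()` is available as `itertools.filterfalse()`. Together they partition the input, as this sketch shows:

```python
from itertools import filterfalse

data = [-1, 0, 1, 2, 3, 4, 1, -2]
below_one = lambda x: x < 1

kept = list(filter(below_one, data))           # items where the test is true
rejected = list(filterfalse(below_one, data))  # items where the test is false
print(kept, rejected)  # [-1, 0, -2] [1, 2, 3, 4, 1]
```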


## Grouping Data

The [`groupby()`](https://docs.python.org/2/library/itertools.html#itertools.groupby) function returns an iterator that produces sets of values grouped by a common key.

This example from the standard library documentation shows how to group keys in a dictionary which have the same value:

In [16]:
from itertools import *
from operator import itemgetter

d = dict(a=1, b=2, c=1, d=2, e=1, f=2, g=3)
di = sorted(d.iteritems(), key=itemgetter(1))
for k, g in groupby(di, key=itemgetter(1)):
    print k, map(itemgetter(0), g)

1 ['a', 'c', 'e']
2 ['b', 'd', 'f']
3 ['g']
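
A possible Python 3 rendering of this example (assumed interpreter): `d.iteritems()` becomes `d.items()`, and because `map()` is lazy there, the group is collected with a list comprehension. The result dictionary `grouped` is a name introduced only for this sketch.

```python
from itertools import groupby
from operator import itemgetter

d = dict(a=1, b=2, c=1, d=2, e=1, f=2, g=3)
# Sort by value first so that equal values sit next to each other
pairs = sorted(d.items(), key=itemgetter(1))

grouped = {value: [name for name, _ in group]
           for value, group in groupby(pairs, key=itemgetter(1))}
print(grouped)  # {1: ['a', 'c', 'e'], 2: ['b', 'd', 'f'], 3: ['g']}
```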


This more complicated example illustrates grouping related values based on some attribute. Notice that the input sequence needs to be sorted on the key in order for the groupings to work out as expected.

In [17]:
from itertools import *

class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y
    def __repr__(self):
        return 'Point(%s, %s)' % (self.x, self.y)
    def __cmp__(self, other):
        return cmp((self.x, self.y), (other.x, other.y))

# Create a dataset of Point instances
data = list(imap(Point, 
                 cycle(islice(count(), 3)), 
                 islice(count(), 10),
                 )
            )
print 'Data:', data #(0,0)(1,1)(2,2)(0,3)...
print

# Try to group the unsorted data based on X values
print 'Grouped, unsorted:'
for k, g in groupby(data, lambda o:o.x):
    print k, list(g) #return key, group pairs
print

# Sort the data
data.sort()
print 'Sorted:', data
print

# Group the sorted data based on X values
print 'Grouped, sorted:'
for k, g in groupby(data, lambda o:o.x):
    print k, list(g)
print

Data: [Point(0, 0), Point(1, 1), Point(2, 2), Point(0, 3), Point(1, 4), Point(2, 5), Point(0, 6), Point(1, 7), Point(2, 8), Point(0, 9)]

Grouped, unsorted:
0 [Point(0, 0)]
1 [Point(1, 1)]
2 [Point(2, 2)]
0 [Point(0, 3)]
1 [Point(1, 4)]
2 [Point(2, 5)]
0 [Point(0, 6)]
1 [Point(1, 7)]
2 [Point(2, 8)]
0 [Point(0, 9)]

Sorted: [Point(0, 0), Point(0, 3), Point(0, 6), Point(0, 9), Point(1, 1), Point(1, 4), Point(1, 7), Point(2, 2), Point(2, 5), Point(2, 8)]

Grouped, sorted:
0 [Point(0, 0), Point(0, 3), Point(0, 6), Point(0, 9)]
1 [Point(1, 1), Point(1, 4), Point(1, 7)]
2 [Point(2, 2), Point(2, 5), Point(2, 8)]
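
The `Point` example relies on `__cmp__` and `cmp()`, which Python 3 no longer supports. One way to adapt it, sketched here under the assumption of a Python 3 interpreter, is to sort with an explicit key function. The dictionaries `unsorted_keys` and `groups` are names made up for this sketch.

```python
from itertools import count, cycle, groupby, islice

class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y
    def __repr__(self):
        return 'Point(%s, %s)' % (self.x, self.y)

# map() stops when the 10-item islice() is exhausted
data = list(map(Point, cycle(islice(count(), 3)), islice(count(), 10)))

# Unsorted input: a new group starts every time the key changes
unsorted_keys = [k for k, _ in groupby(data, key=lambda p: p.x)]

# Python 3 has no __cmp__, so sort with an explicit key instead
data.sort(key=lambda p: (p.x, p.y))
groups = {k: [p.y for p in g] for k, g in groupby(data, key=lambda p: p.x)}
print(groups)  # {0: [0, 3, 6, 9], 1: [1, 4, 7], 2: [2, 5, 8]}
```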



## Challenge

<div class="alert alert-success">
<p>This task aims to help you appreciate the usefulness of the `groupby()` function of `itertools`.
<p>
You are given a string $S$. Suppose a character '$c$' occurs consecutively $X$ times in the string. Write a function called **`compress`** that will replace these consecutive occurrences of the character '$c$' with ($X$, $c$) in the string.
<p>
For a better understanding of the problem, check the explanation.
<p>
<h3>Input Format</h3>
<p>
The function should accept a single argument containing the string $S$.
<p>

<h3>Output Format</h3>
<p>
The function should return the modified string.

<h3>Constraints</h3>
<p>
All the characters of $S$ denote integers between $0$ and $9$.
<p>
$1\le|S|\le10^4$
<p>
<h3>Sample Input</h3>
<p>
```
1222311
```
<p>
<h3>Sample Output</h3>
<p>
```
(1, 1) (3, 2) (1, 3) (2, 1)
```
<p>
<h3>Explanation</h3>
<p>
First, the character $1$ occurs only once. It is replaced by $(1, 1)$. Then the character $2$ occurs three times, and it is replaced by $(3, 2)$ and so on.
<p>
Also, note the single space within each compression and between the compressions.
</div>

In [18]:
from itertools import groupby
from operator import itemgetter

def compress(s):
    """Compress the string s so that each run of a character 'c' repeated X
    consecutive times is replaced by (X, c). Returns a new string containing
    the compressed runs separated by single spaces.
    """
    # Pair each digit with itself so itemgetter(0) can serve as the grouping key
    tuples = [(int(i), int(i)) for i in s]
    groups = []

    for k, g in groupby(tuples, itemgetter(0)):
        # len(list(g)) is the run length X; k is the repeated digit c
        groups.append(str((len(list(g)), k)))

    return " ".join(groups)

Run the following tests to ensure your code is correct.

In [19]:
from nose.tools import assert_equal
assert_equal(compress('1222311'), '(1, 1) (3, 2) (1, 3) (2, 1)')
assert_equal(compress('2344244443222'), '(1, 2) (1, 3) (2, 4) (1, 2) (4, 4) (1, 3) (3, 2)')
assert_equal(compress('9949333922222888888'), '(2, 9) (1, 4) (1, 9) (3, 3) (1, 9) (5, 2) (6, 8)')
assert_equal(compress('11111111'), '(8, 1)')
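
For comparison, here is a more compact variant that should also run on Python 3. It is only a sketch, not the graded answer: the name `compress_py3` is hypothetical, and plain `assert` statements stand in for `nose`. It relies on `groupby()` grouping consecutive equal characters when no key function is given.

```python
from itertools import groupby

def compress_py3(s):
    """Hypothetical Python 3 variant of compress(); not the graded answer."""
    # sum(1 for _ in run) counts the run length without building a list
    return ' '.join('(%d, %s)' % (sum(1 for _ in run), ch)
                    for ch, run in groupby(s))

assert compress_py3('1222311') == '(1, 1) (3, 2) (1, 3) (2, 1)'
assert compress_py3('11111111') == '(8, 1)'
```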