In [2]:
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import csv
import dateutil.parser  # the parser submodule has to be imported explicitly

In [3]:
sns.set_style("whitegrid")
sns.set_color_codes("muted")

tips = sns.load_dataset("tips")

## Lab 3

In this lab, we'll practice higher-order functions in Python. In the first couple of tasks, we will use the tips data set that comes with seaborn. In particular, we'll work directly on the <b>rows</b> (a Python list of lists) extracted from the data set.

In [4]:
tips.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


In [5]:
rows = tips.values.tolist()

In [6]:
rows[0]

[16.99, 1.01, 'Female', 'No', 'Sun', 'Dinner', 2]

### Objective 1
We would like to know how many of the tip transactions were served by a waiter and how many by a waitress. We can do this by counting the number of occurrences of "Male" and "Female" in the sex column.

In [7]:
# First, we extract the gender column from rows
gender = map(lambda x: x[2], rows)

In [8]:
# Then, for each row, we create a 2-element counter: (1, 0) for a waiter, (0, 1) for a waitress
pairs = map(lambda x: (int(x=='Male'), int(x=='Female')), gender)

In [9]:
# Here are some samples
pairs[:3]

[(0, 1), (1, 0), (1, 0)]

In [10]:
# Finally, we sum the tuples column-wise to get the two totals (waiters, waitresses)
reduce(lambda x,y: (x[0]+y[0], x[1]+y[1]), pairs)

(157, 87)
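The same map/reduce pipeline can be sketched as one small function. This is a minimal sketch, not part of the lab: `count_by_sex` and `sample_rows` are hypothetical names, and the three sample rows are the first three rows shown by `tips.head()`. It imports `reduce` from `functools` so that it also runs on Python 3, where `map()` returns an iterator and `reduce()` is no longer a builtin.

```python
from functools import reduce  # a builtin on Python 2, but this import is required on Python 3

# The first three rows of the tips data set, in the same layout as `rows`
sample_rows = [
    [16.99, 1.01, 'Female', 'No', 'Sun', 'Dinner', 2],
    [10.34, 1.66, 'Male', 'No', 'Sun', 'Dinner', 3],
    [21.01, 3.50, 'Male', 'No', 'Sun', 'Dinner', 3],
]


def count_by_sex(rows):
    """Return (number of 'Male' rows, number of 'Female' rows)."""
    # Step 1: extract the sex column
    gender = map(lambda row: row[2], rows)
    # Step 2: turn each value into a (male, female) indicator pair
    pairs = map(lambda g: (int(g == 'Male'), int(g == 'Female')), gender)
    # Step 3: add the pairs up column-wise; the initial value (0, 0)
    # keeps reduce() from raising an error on an empty input
    return reduce(lambda x, y: (x[0] + y[0], x[1] + y[1]), pairs, (0, 0))


male_female = count_by_sex(sample_rows)
print(male_female)  # (2, 1)
```

Passing an explicit initial value to `reduce()` is a design choice of this sketch; the cell above omits it, which is fine as long as the input is never empty.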

### Objective 2
Next, we'd like to count the number of transactions per day using higher order functions.

In [11]:
# Similar to the previous task, we first extract the 'day' column from the data set
day = map(lambda x: x[4], rows)

In [12]:
# Some samples
day[:3]

['Sun', 'Sun', 'Sun']

In [13]:
# We're going to use a dictionary to accumulate the counts.
# To use it with reduce(), the reducing function has to both
# update the accumulating dictionary and return it, so that
# it can be passed on to the next step. Since a lambda only
# supports expressions (no assignment statements), it is
# simpler to use a named function here.

def reducer(x, y):
    x[y] = x.get(y, 0)+1
    return x

counts = reduce(reducer, day, {})

In [14]:
# Here are the results
counts

{'Fri': 19, 'Sat': 87, 'Sun': 76, 'Thur': 62}
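To illustrate the point about lambdas, here is a minimal sketch that compares the mutating reducer with an expression-only variant that builds a new dictionary at every step, and checks both against `collections.Counter`. The names `copying_reducer` and `sample_days` are hypothetical, and the sample values are made up rather than taken from the data set.

```python
from collections import Counter
from functools import reduce  # needed on Python 3, harmless on Python 2

# Made-up sample of 'day' values
sample_days = ['Sun', 'Sun', 'Sat', 'Thur', 'Sat', 'Sun']


def reducer(counts, day):
    # Same idea as the cell above: update the accumulator in place and return it
    counts[day] = counts.get(day, 0) + 1
    return counts


# Expression-only variant: dict(counts, **{...}) returns a NEW dictionary with
# one entry updated, so it fits in a lambda. It only works for string keys and
# copies the dictionary at every step, which is slower on large inputs.
copying_reducer = lambda counts, day: dict(counts, **{day: counts.get(day, 0) + 1})

mutating = reduce(reducer, sample_days, {})
copying = reduce(copying_reducer, sample_days, {})
reference = dict(Counter(sample_days))  # standard-library cross-check

print(mutating == copying == reference)  # True
```

For real workloads `Counter` is usually the simplest choice; the reducers are here to practice `reduce()`.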

### Objective 3
Next, we would like to do a task similar to Task 2 of Homework 2 using higher-order functions. Our goal is to extract the birth year of the first subscribed rider of each day, assuming that our rides are sorted temporally. Below is our attempt in class, where we use a global variable to keep track of the previous day. The cell below covers the core step, keeping the first ride of each day. There is also another approach that doesn't require the global variable, which is a task in your Homework 3.

In [15]:
Y = None
def checkDate(x):
    global Y
    d = dateutil.parser.parse(x['starttime']).day
    if d!=Y:
        Y = d
        return True
    return False


filename = 'citibike.csv'
with open(filename,'r') as fi:
    reader = csv.DictReader(fi)    
    values = filter(checkDate, reader)


In [16]:
print values[0], len(values)

{'end_station_id': '423', 'gender': '2', 'bikeid': '17131', 'start_station_latitude': '40.75044999', 'end_station_name': 'W 54 St & 9 Ave', 'cartodb_id': '1', 'start_station_name': '8 Ave & W 31 St', 'start_station_id': '521', 'start_station_longitude': '-73.99481051', 'usertype': 'Subscriber', 'stoptime': '2015-02-01 00:14:00+00', 'end_station_longitude': '-73.98690506', 'starttime': '2015-02-01 00:00:00+00', 'end_station_latitude': '40.76584941', 'tripduration': '801', 'the_geom': '', 'birth_year': '1978'} 7
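The full goal stated for this objective (the birth year of the first subscriber of each day) can be sketched on a few in-memory records. This is a minimal sketch under stated assumptions, not the in-class solution: only the first record's fields come from the output above, the other records are made up, and `state`, `is_first_of_day`, and `records` are hypothetical names. It keeps module-level state like the in-class attempt, filters subscribers first, and compares full calendar dates instead of the day of the month.

```python
from datetime import datetime

# First record mirrors the output above; the remaining records are made up
records = [
    {'starttime': '2015-02-01 00:00:00+00', 'usertype': 'Subscriber', 'birth_year': '1978'},
    {'starttime': '2015-02-01 00:05:00+00', 'usertype': 'Subscriber', 'birth_year': '1990'},
    {'starttime': '2015-02-02 00:01:00+00', 'usertype': 'Customer', 'birth_year': ''},
    {'starttime': '2015-02-02 00:03:00+00', 'usertype': 'Subscriber', 'birth_year': '1985'},
    {'starttime': '2015-02-03 07:00:00+00', 'usertype': 'Subscriber', 'birth_year': '1969'},
]

# Module-level state shared across calls, playing the role of the global Y.
# Reset state['last_day'] to None before filtering another data set.
state = {'last_day': None}


def is_first_of_day(record):
    # Parse only the 'YYYY-MM-DD HH:MM:SS' part and keep the calendar date
    day = datetime.strptime(record['starttime'][:19], '%Y-%m-%d %H:%M:%S').date()
    if day != state['last_day']:
        state['last_day'] = day
        return True
    return False


# Keep subscribers, then the first remaining ride of each day, then the birth year
subscribers = filter(lambda r: r['usertype'] == 'Subscriber', records)
first_rides = filter(is_first_of_day, subscribers)
birth_years = list(map(lambda r: r['birth_year'], first_rides))

print(birth_years)  # ['1978', '1985', '1969']
```

Because the predicate depends on the order in which records arrive, this only works when the rides are sorted by start time, as assumed in the text.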


### Objective 4
Next, we demonstrate the use of Python's <b>multiprocessing</b> module to perform parallel computation. In particular, we will use the <b>Pool</b> object to run heavy computations in parallel. In this case, we would like to map all the citibike station locations into the <b>NAD83 / New York Long Island</b> projection and compute each station's distance to Times Square.

To use multiprocessing.Pool, we first create a Pool object, specifying the number of worker processes available to the program. Then we use <b>pool.map</b> to parallelize the application of our function to an iterable. Using <b>pool.map</b> is almost identical to using the <b>map()</b> higher-order function in Python. We could also use it to perform task parallelism by supplying a task-based function with a list of task IDs.

#### NOTE:
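On Windows, this notebook will not run as is. There, code that uses the <b>multiprocessing</b> module must be protected by a main guard: the code that creates the Pool has to sit inside an <b>if \__name\__=='\__main\__'</b> block, typically in a standalone script whose functions the worker processes can import.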
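The task-parallel use of `pool.map` mentioned above, together with the main guard, can be sketched as a standalone script. This is a minimal sketch under stated assumptions: `run_task` and `run_all` are hypothetical names, and the sum of squares is only a stand-in for a heavy job.

```python
import multiprocessing as mp


def run_task(task_id):
    """Hypothetical stand-in for a heavy job, identified by its task ID."""
    return task_id, sum(i * i for i in range(task_id * 1000))


def run_all(task_ids, processes=2):
    """Distribute the task IDs over a pool of worker processes."""
    pool = mp.Pool(processes=processes)
    try:
        # Like map(), but the calls are spread over the workers;
        # results come back in the same order as task_ids
        return pool.map(run_task, task_ids)
    finally:
        # Always release the worker processes
        pool.close()
        pool.join()


# The guard keeps child processes from re-running this part when they
# import the script, which is what Windows requires
if __name__ == '__main__':
    results = run_all([1, 2, 3, 4])
    print(results[0])
```

Keeping the worker function at module level matters: the pool sends it to the workers by name, so it has to be importable from the script.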

In [17]:
import multiprocessing as mp

# timeit's default_timer for measuring the elapsed time of each run
from timeit import default_timer

# These modules are needed for projection
import shapely.geometry as geom
import pyproj

# Project a station's coordinates and compute its distance to Times Square
proj = pyproj.Proj(init='epsg:2263', preserve_units=True)
timesSquare = geom.Point(proj(-73.9857,40.7577))
def distanceFromTimesSquare(row):
    return geom.Point(proj(float(row['start_station_longitude']),
                           float(row['start_station_latitude']))).distance(timesSquare)

filename = 'citibike.csv'

# First we run a test without parallelism
with open(filename,'r') as fi:
    reader = csv.DictReader(fi)
    start = default_timer()
    coords = map(distanceFromTimesSquare, reader)
    end = default_timer()
    print 'Time running sequentially is %.2f seconds' % (end-start)
    
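# Then we run the same computation with a pool of 2 worker processes
pool = mp.Pool(processes=2)
with open(filename,'r') as fi:
    reader = csv.DictReader(fi)
    start = default_timer()
    coords = pool.map(distanceFromTimesSquare, reader)
    end = default_timer()
    print 'Time running with 2 processes is %.2f seconds' % (end-start)

# Shut the worker processes down once we are done with them
pool.close()
pool.join()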

Time running sequentially is 3.33 seconds
Time running with 2 processes is 1.96 seconds
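As a rough sanity check on the distances, the great-circle (haversine) formula gives an independent estimate using only the standard library. This is a swapped-in technique, not the projection used above: it assumes a spherical Earth, `haversine_ft` is a hypothetical helper, and the station coordinates are those of '8 Ave & W 31 St' from the sample record shown earlier. With `preserve_units=True`, the EPSG:2263 projection works in feet, so the projected distance for the same station should be of the same order.

```python
import math

# Mean Earth radius in metres, converted to feet (spherical approximation)
EARTH_RADIUS_FT = 6371008.8 * 3.28084


def haversine_ft(lon1, lat1, lon2, lat2):
    """Great-circle distance in feet between two (longitude, latitude) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_FT * math.asin(math.sqrt(a))


TIMES_SQUARE = (-73.9857, 40.7577)     # (longitude, latitude) used in the cell above
STATION = (-73.99481051, 40.75044999)  # '8 Ave & W 31 St' from the sample record

distance_ft = haversine_ft(STATION[0], STATION[1], TIMES_SQUARE[0], TIMES_SQUARE[1])
print('%.0f feet' % distance_ft)
```

A value of a few thousand feet is expected here; a result that is off by orders of magnitude in `coords` would suggest a swapped longitude/latitude or a unit mismatch.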
