# Lab 5 - Joining Uber Pick-Ups, Stations, and Boroughs

In [Lecture 3.4](./3_4_joining_large_and_small_files.ipynb), we introduced using a Python `dict` to join a large table with a small one. In this lab, we will practice this technique on the Uber data set.
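
As a reminder of the idea, here is a minimal sketch of a `dict` join using only the standard library. The two CSV strings are made-up stand-ins for a small lookup table and a large trip table; they are assumptions for illustration, not the lab's files.

```python
import csv
import io

# Hypothetical small lookup table (stands in for a zone lookup file).
small_csv = "LocationID,Borough\n1,EWR\n2,Queens\n4,Manhattan\n"
# Hypothetical large table; in practice this would be streamed from disk.
large_csv = "trip,locationID\na,4\nb,2\nc,4\nd,99\n"

# Load the small table into a dict keyed on the join column.
lookup = {row["LocationID"]: row["Borough"]
          for row in csv.DictReader(io.StringIO(small_csv))}

# Stream the large table once, translating each key with a dict lookup.
# Keys missing from the small table come back as None.
joined = [(row["trip"], lookup.get(row["locationID"]))
          for row in csv.DictReader(io.StringIO(large_csv))]
print(joined)
```

Only the small table has to fit in memory, which is why the same pattern works when the large table is read in chunks.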

**Note:** Make sure that you download and unzip the file `uber-raw-data-janjune-15.csv` before proceeding.

In [113]:
import pandas as pd
from dfply import *
from toolz import first
from calendar import day_name
import numpy as np
from functoolz import pipeable

In [119]:
c_size = 500000
trips = pd.read_csv("./data/uber/uber-trip-data/uber-raw-data-janjune-15.csv", chunksize=c_size,parse_dates=['Pickup_date'])
first_chunk = next(trips)

In [5]:
first_chunk.head()

Unnamed: 0,Dispatching_base_num,Pickup_date,Affiliated_base_num,locationID
0,B02617,2015-05-17 09:47:00,B02617,141
1,B02617,2015-05-17 09:47:00,B02617,65
2,B02617,2015-05-17 09:47:00,B02617,100
3,B02617,2015-05-17 09:47:00,B02774,80
4,B02617,2015-05-17 09:47:00,B02617,90


## <font color="red"> Problem 1 - Creating Two Location `dict`s</font>

Read the file `taxi-zone-lookup.csv` and inspect the columns. What two pieces of information will this file allow us to add to `uber-raw-data-janjune-15.csv`? Make a `dict` for each of these variables.

This file allows us to add the Borough and Zone based on the LocationID.

In [7]:
taxi_zone_lookup = pd.read_csv("./data/uber/uber-trip-data/taxi-zone-lookup.csv")
taxi_zone_lookup.head()

Unnamed: 0,LocationID,Borough,Zone
0,1,EWR,Newark Airport
1,2,Queens,Jamaica Bay
2,3,Bronx,Allerton/Pelham Gardens
3,4,Manhattan,Alphabet City
4,5,Staten Island,Arden Heights


In [14]:
bor_dict = {id_:bor for id_,bor in zip(taxi_zone_lookup.LocationID,taxi_zone_lookup.Borough)}

In [11]:
zone_dict = {id_:zone for id_,zone in zip(taxi_zone_lookup.LocationID,taxi_zone_lookup.Zone)}
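
The comprehension above is one of several equivalent ways to build a translation `dict` from two columns. The sketch below compares three of them on a toy frame; the frame is an assumption that copies a few rows from the `head()` output above.

```python
import pandas as pd

# Toy stand-in for the zone lookup table (values copied from the head above).
lookup = pd.DataFrame({"LocationID": [1, 2, 3],
                       "Borough": ["EWR", "Queens", "Bronx"]})

# Three equivalent ways to build the translation dict.
from_comprehension = {id_: bor for id_, bor in zip(lookup.LocationID, lookup.Borough)}
from_zip = dict(zip(lookup.LocationID, lookup.Borough))
from_series = lookup.set_index("LocationID")["Borough"].to_dict()

print(from_comprehension == from_zip == from_series)
```

Which form to use is mostly a matter of taste; all three assume the key column has no duplicates, since later values would silently overwrite earlier ones.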

## <font color="red"> Problem 2 - Dispatch Translation</font>

The following table was taken from the FiveThirtyEight github page and contains the names of each Uber dispatch station.  Create a translation `dict` for these data.

Base Code | Base Name
---|---------
B02512 | Unter
B02598 | Hinter
B02617 | Weiter
B02682 | Schmecken
B02764 | Danach-NY
B02765 | Grun
B02835 | Dreist
B02836 | Drinnen

In [16]:
base_dict = {'B02512':'Unter',
            'B02598':'Hinter',
            'B02617':'Weiter',
            'B02682':'Schmecken',
            'B02764':'Danach-NY',
            'B02765':'Grun',
            'B02835':'Dreist',
            'B02836':'Drinnen',}
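
Note that the first chunk contains the code `B02774` in `Affiliated_base_num`, which is not in the table above. The sketch below shows what `Series.map` does with such a key; the `"Unknown"` label is an assumption for illustration and is not part of the lab solution.

```python
import pandas as pd

base_names = {"B02617": "Weiter", "B02682": "Schmecken"}  # subset of the table above
codes = pd.Series(["B02617", "B02774", "B02682"])

mapped = codes.map(base_names)       # B02774 is not a key, so it becomes NaN
labeled = mapped.fillna("Unknown")   # optional: make the gap explicit
print(labeled.tolist())
```

Leaving the value as `NaN` is also reasonable; it ends up as `NULL` once the data are written to a SQL table.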

## <font color="red"> Problem 3 - Prototyping a Helper Function</font>

Use the first chunk to prototype a helper function that

1. Adds four new columns, one for each key/translation.
2. Drops each of the associated keys.
3. Converts `Pickup_date` to a datetime column.
4. Adds various datepart columns.

This function should use appropriate `dfply` functions and a pipe.
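
For comparison, the same steps can be sketched in plain `pandas` with `assign` and `drop`. This is not the `dfply` solution the problem asks for; `clean_trips` and the one-row toy frame are hypothetical, and only a subset of the columns is shown.

```python
import pandas as pd

def clean_trips(df, bor_dict, base_dict):
    """Hypothetical plain-pandas variant of the helper (no dfply)."""
    pickup = pd.to_datetime(df["Pickup_date"])
    return (df
            .assign(Pickup_date=pickup,
                    borough=df["locationID"].map(bor_dict),
                    dispatching_base=df["Dispatching_base_num"].map(base_dict),
                    weekday=pickup.dt.day_name(),
                    month=pickup.dt.month_name(),
                    hour=pickup.dt.hour)
            .drop(columns=["locationID", "Dispatching_base_num"]))

# One-row toy frame using values from the first chunk.
toy = pd.DataFrame({"Dispatching_base_num": ["B02617"],
                    "Pickup_date": ["2015-05-17 09:47:00"],
                    "locationID": [141]})
out = clean_trips(toy, {141: "Manhattan"}, {"B02617": "Weiter"})
print(out.iloc[0].to_dict())
```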

In [45]:
def to_datetime(series, infer_datetime_format=True):
    return pd.to_datetime(series, infer_datetime_format=infer_datetime_format)

In [100]:
first_chunk.Pickup_date = to_datetime(first_chunk.Pickup_date)
first_chunk_clean = (first_chunk >>
                    mutate(borough = X.locationID.map(bor_dict),
                          zone = X.locationID.map(zone_dict),
                          dispatching_base = X.Dispatching_base_num.map(base_dict),
                          affiliated_base = X.Affiliated_base_num.map(base_dict),
                          weekday = X.Pickup_date.dt.day_name(),  # weekday_name was removed in pandas 1.0
                          weekofyear = X.Pickup_date.dt.weekofyear,
                          dayofyear = X.Pickup_date.dt.dayofyear,
                          year = X.Pickup_date.dt.year,
                          month = X.Pickup_date.dt.month_name(),
                          day = X.Pickup_date.dt.day,
                          hour = X.Pickup_date.dt.hour
                          ) >>
                    drop(X.locationID,X.Dispatching_base_num,X.Affiliated_base_num)
                    )
first_chunk_clean.head()

Unnamed: 0,Pickup_date,borough,zone,dispatching_base,affiliated_base,weekday,weekofyear,dayofyear,year,month,day,hour
0,2015-05-17 09:47:00,Manhattan,Lenox Hill West,Weiter,Weiter,Sunday,20,137,2015,May,17,9
1,2015-05-17 09:47:00,Brooklyn,Downtown Brooklyn/MetroTech,Weiter,Weiter,Sunday,20,137,2015,May,17,9
2,2015-05-17 09:47:00,Manhattan,Garment District,Weiter,Weiter,Sunday,20,137,2015,May,17,9
3,2015-05-17 09:47:00,Brooklyn,East Williamsburg,Weiter,,Sunday,20,137,2015,May,17,9
4,2015-05-17 09:47:00,Manhattan,Flatiron,Weiter,Weiter,Sunday,20,137,2015,May,17,9


In [120]:
uber_clean_func = pipeable(lambda df: (df >>
                                     mutate(borough = X.locationID.map(bor_dict),
                                          zone = X.locationID.map(zone_dict),
                                          dispatching_base = X.Dispatching_base_num.map(base_dict),
                                          affiliated_base = X.Affiliated_base_num.map(base_dict),
                                          weekday = X.Pickup_date.dt.day_name(),  # weekday_name was removed in pandas 1.0
                                          weekofyear = X.Pickup_date.dt.weekofyear,
                                          dayofyear = X.Pickup_date.dt.dayofyear,
                                          year = X.Pickup_date.dt.year,
                                          month = X.Pickup_date.dt.month_name(),
                                          day = X.Pickup_date.dt.day,
                                          hour = X.Pickup_date.dt.hour
                                          ) >>
                                    drop(X.locationID,X.Dispatching_base_num,X.Affiliated_base_num)))

In [121]:
first_chunk_c = uber_clean_func(first_chunk)
first_chunk_c.head()

Unnamed: 0,Pickup_date,borough,zone,dispatching_base,affiliated_base,weekday,weekofyear,dayofyear,year,month,day,hour
0,2015-05-17 09:47:00,Manhattan,Lenox Hill West,Weiter,Weiter,Sunday,20,137,2015,May,17,9
1,2015-05-17 09:47:00,Brooklyn,Downtown Brooklyn/MetroTech,Weiter,Weiter,Sunday,20,137,2015,May,17,9
2,2015-05-17 09:47:00,Manhattan,Garment District,Weiter,Weiter,Sunday,20,137,2015,May,17,9
3,2015-05-17 09:47:00,Brooklyn,East Williamsburg,Weiter,,Sunday,20,137,2015,May,17,9
4,2015-05-17 09:47:00,Manhattan,Flatiron,Weiter,Weiter,Sunday,20,137,2015,May,17,9


## <font color="red"> Problem 4 - Creating a SQL database</font>

Use `pandas` and your helper function to create a `sqlite` database for this file.
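
The overall pattern is to read the file in chunks and append each chunk to the same table. Here is a minimal sketch, assuming a tiny in-memory CSV and an in-memory `sqlite3` database; the table name `trips` and the three rows are made up for illustration.

```python
import io
import sqlite3
import pandas as pd

csv_text = ("Pickup_date,locationID\n"
            "2015-05-17 09:47:00,141\n"
            "2015-05-17 10:02:00,65\n"
            "2015-05-18 23:15:00,100\n")
con = sqlite3.connect(":memory:")   # throwaway in-memory database

# Read two rows at a time and append each chunk to the same table.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    chunk.to_sql("trips", con, if_exists="append", index=False)

n_rows = con.execute("SELECT COUNT(*) FROM trips").fetchone()[0]
print(n_rows)
```

The solution below follows the same loop, but cleans each chunk first and creates the table from an explicit schema so that it has a primary key.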

In [122]:
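# Add an integer `id` column that starts at `start` and counts up by one per row.
add_primary_key = pipeable(lambda start, df: (df
                                              >> mutate(id = np.arange(start, start + len(df))
                                              )))
# Clean chunk number `i` and offset its ids by `i*chunksize` so ids stay unique across chunks.
process_chunk = pipeable(lambda i, df, chunksize=c_size: df >> uber_clean_func >> add_primary_key(i*chunksize))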
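
To see why offsetting by the chunk number times the chunk size gives unique ids, here is a small sketch in plain Python. The helper `chunk_ids` and the chunk sizes are hypothetical and chosen only to keep the arithmetic visible.

```python
def chunk_ids(i, n_rows, chunksize):
    """Ids for chunk number i holding n_rows rows (hypothetical helper)."""
    start = i * chunksize
    return list(range(start, start + n_rows))

chunksize = 3
sizes = [3, 3, 2]   # the last chunk is usually shorter than the rest
all_ids = [id_ for i, n in enumerate(sizes) for id_ in chunk_ids(i, n, chunksize)]
print(all_ids)
```

The ids are unique and contiguous as long as every chunk except the last is full, which is how `pd.read_csv` with `chunksize` behaves.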

In [123]:
from more_sqlalchemy import get_sql_types
i=0
complete_first_chunk = first_chunk >> uber_clean_func >> add_primary_key(i)
sql_types = get_sql_types(complete_first_chunk)
sql_types

{'Pickup_date': sqlalchemy.sql.sqltypes.DateTime,
 'borough': sqlalchemy.sql.sqltypes.String,
 'zone': sqlalchemy.sql.sqltypes.String,
 'dispatching_base': sqlalchemy.sql.sqltypes.String,
 'affiliated_base': sqlalchemy.sql.sqltypes.String,
 'weekday': sqlalchemy.sql.sqltypes.String,
 'weekofyear': sqlalchemy.sql.sqltypes.Integer,
 'dayofyear': sqlalchemy.sql.sqltypes.Integer,
 'year': sqlalchemy.sql.sqltypes.Integer,
 'month': sqlalchemy.sql.sqltypes.String,
 'day': sqlalchemy.sql.sqltypes.Integer,
 'hour': sqlalchemy.sql.sqltypes.Integer,
 'id': sqlalchemy.sql.sqltypes.Integer}

In [124]:
from sqlalchemy import create_engine
engine = create_engine('sqlite:///databases/uber_lab_5.db', echo=False)

In [126]:
schema = pd.io.sql.get_schema(complete_first_chunk, # dataframe
                              'uber_lab_5', # name in SQL db
                              keys='id', # primary key
                              con=engine, # connection
                              dtype=sql_types # SQL types
)
print(schema)
engine.execute(schema)


CREATE TABLE uber_lab_5 (
	"Pickup_date" DATETIME, 
	borough VARCHAR, 
	zone VARCHAR, 
	dispatching_base VARCHAR, 
	affiliated_base VARCHAR, 
	weekday VARCHAR, 
	weekofyear INTEGER, 
	dayofyear INTEGER, 
	year INTEGER, 
	month VARCHAR, 
	day INTEGER, 
	hour INTEGER, 
	id INTEGER NOT NULL, 
	CONSTRAINT uber_lab_5_pk PRIMARY KEY (id)
)




<sqlalchemy.engine.result.ResultProxy at 0x7fa4b6b99828>

In [128]:
c_size = 500000
df_iter = enumerate(pd.read_csv("./data/uber/uber-trip-data/uber-raw-data-janjune-15.csv", 
                                header=0,
                                parse_dates=['Pickup_date'],
                                chunksize=c_size,
                                sep=',',
                                engine='python'))

In [129]:
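for i, chunk in df_iter:
    processed_chunk = chunk >> process_chunk(i)
    print('writing chunk {0}'.format(i))
    # Append to the table created from the schema above, so the primary key applies
    # and the `uber_lab_5` table queried in Problem 5 is the one that gets filled.
    processed_chunk.to_sql('uber_lab_5', 
                           con=engine, 
                           dtype=sql_types, 
                           index=False,
                           if_exists='append')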

writing chunk 0
writing chunk 1
writing chunk 2
...
writing chunk 1405
...

## <font color="red"> Problem 5 - Exploring the pickups</font>

**Question of Interest:** I am interested in the differences between pick-ups in terms of both time and Borough. Use aggregation and visualizations to construct a graph that illustrates an interesting difference between the Boroughs.
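
The aggregation step amounts to a `GROUP BY` on hour and borough. Below is a minimal sketch in raw SQL using the standard-library `sqlite3` module; the `pickups` table and its four rows are made up for illustration and are not the lab data.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE pickups (hour INTEGER, borough TEXT)")
con.executemany("INSERT INTO pickups VALUES (?, ?)",
                [(9, "Manhattan"), (9, "Manhattan"), (9, "Brooklyn"), (18, "Manhattan")])

# Count pick-ups for every (hour, borough) pair.
counts = con.execute(
    "SELECT hour, borough, COUNT(*) AS cnt FROM pickups "
    "GROUP BY hour, borough ORDER BY hour, borough").fetchall()
print(counts)
```

The solution below expresses the same query with the SQLAlchemy expression language instead of a SQL string.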

In [131]:
from sqlalchemy import create_engine
from sqlalchemy.ext.automap import automap_base
engine2 = create_engine('sqlite:///databases/uber_lab_5.db')
Base = automap_base()
Base.prepare(engine2, reflect=True)
Uber = Base.classes.uber_lab_5

In [135]:
from sqlalchemy import select, func
stmt = (select([Uber.hour, 
               func.count(Uber.hour).label('cnt'),
                Uber.borough])
        .group_by(Uber.hour,Uber.borough))
cnts = pd.read_sql_query(stmt, con=engine2)
cnts

Unnamed: 0,hour,cnt


In [None]:
import seaborn as sns
%matplotlib inline
import matplotlib.pyplot as plt
# One bar per hour, split by borough; hours sort numerically, so no `order` is needed.
ax = (sns
      .catplot(x="hour", 
               y = 'cnt',
               hue="borough",
               kind="bar", 
               palette="ch:.25", 
               data=cnts))

ax.set_xticklabels(ax.ax.get_xticklabels(), rotation=40, ha="right")
ax.set(title='Uber Pick-Ups by Hour of the Day and Borough',
       xlabel='Hour of the day', 
       ylabel='Num of Pick-ups')
plt.tight_layout()
plt.show()