## Hub Bound Dataset

NYMTC publishes Hub Bound data covering all travel into and out of Manhattan's CBD on a 'typical' fall workday ([available here](https://www.nymtc.org/Data-and-Modeling/Transportation-Data-and-Statistics/Publications/Hub-Bound-Travel)). For this exercise, we use the 2019 data.

The data is published as a polished report with easy-to-read Excel files, but the layout is hard for computers to parse. We therefore manually copied the numbers from the report's numerous Excel appendices into a single table, which is what this notebook reads in. That format is still not ideal for database work, aggregation, and modeling, so further cleaning and reshaping in Python is needed. The final table is what we submit to our database.

In [1]:
import pandas as pd

In [2]:
## Read in copied numbers from 2019 HBD report
hbd_df = pd.read_excel("hbd_data/BN_HBD_hourly_Sector_Final.xlsx",skiprows=5)\
        .rename({"Column Name":"Hour"},axis=1).reset_index(drop=True).fillna(0)
hbd_df.head(5)

Unnamed: 0,Hour,60thSt_YorkAve_Bus_In,60thSt_2ndAve_Bus_In,60thSt_LexingtonAve_Bus_In,60thSt_5thAve_Bus_In,60thSt_Broadway_Bus_In,60thSt_9thAveColumbusAve_Bus_In,60thSt_11thAveWestEndAve_Bus_In,60thSt_YorkAve_Bus_Out,60thSt_1stAve_Bus_Out,...,Brooklyn_WilliamsburgBridge_Bicycle_In,Brooklyn_BrooklynBridge_Bicycle_In,Brooklyn_ManhattanBridge_Bicycle_In,Queens_QueensboroBridge_Bicycle_In,StatenIsland_AboardFerry_Bicycle_In,Brooklyn_WilliamsburgBridge_Bicycle_Out,Brooklyn_BrooklynBridge_Bicycle_Out,Brooklyn_ManhattanBridge_Bicycle_Out,Queens_QueensboroBridge_Bicycle_Out,StatenIsland_AboardFerry_Bicycle_Out
0,0,0,14,26,6,8,0,0,0,50,...,27,5,6,8,0,105,14,40,81,12
1,1,0,35,22,1,1,0,0,0,13,...,12,0,3,3,1,48,10,22,35,9
2,2,0,7,15,2,5,0,0,0,10,...,9,1,4,4,1,19,1,14,15,0
3,3,0,15,23,6,7,0,0,0,13,...,8,1,2,11,5,7,1,6,12,5
4,4,0,34,65,15,27,0,0,0,23,...,12,5,6,30,12,10,1,2,5,1


In [3]:
## Grab Column Taxonomy from Excel - much easier to aggregate using these
hbd_column_taxonomy = pd.read_excel("hbd_data/BN_HBD_hourly_Sector_Final.xlsx",nrows=5).T
hbd_column_taxonomy.columns = hbd_column_taxonomy.loc["Appendix"]
hbd_column_taxonomy=hbd_column_taxonomy.drop("Appendix").reset_index()
hbd_column_taxonomy.rename({"index":"Appendix","TransitMode":"TransMode"\
                           ,"Point of Entry/Exit":"PointEntryExit"},axis=1,inplace=True)
hbd_column_taxonomy['Appendix']=hbd_column_taxonomy['Appendix'].apply(lambda x: x.split(".")[0])
hbd_column_taxonomy.Appendix.value_counts() #looks good
hbd_column_taxonomy.head(5)

Appendix,Appendix.1,Sector,PointEntryExit,TransMode,Direction,Column Name
0,A,60thSt,YorkAve,Bus,In,60thSt_YorkAve_Bus_In
1,A,60thSt,2ndAve,Bus,In,60thSt_2ndAve_Bus_In
2,A,60thSt,LexingtonAve,Bus,In,60thSt_LexingtonAve_Bus_In
3,A,60thSt,5thAve,Bus,In,60thSt_5thAve_Bus_In
4,A,60thSt,Broadway,Bus,In,60thSt_Broadway_Bus_In
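
Because the taxonomy was typed in by hand, one optional sanity check is to rebuild each column name from its parts and compare it with the `Column Name` field. The sketch below assumes that every column name is exactly `Sector_PointEntryExit_TransMode_Direction`, which holds for the rows displayed above but is not confirmed for the full sheet. The toy frame `taxonomy` stands in for `hbd_column_taxonomy`.

```python
import pandas as pd

# Toy stand-in for hbd_column_taxonomy, using two rows from the output above
taxonomy = pd.DataFrame({
    "Sector": ["60thSt", "60thSt"],
    "PointEntryExit": ["YorkAve", "2ndAve"],
    "TransMode": ["Bus", "Bus"],
    "Direction": ["In", "In"],
    "Column Name": ["60thSt_YorkAve_Bus_In", "60thSt_2ndAve_Bus_In"],
})

# Rebuild the name from its parts (assumed naming convention, see lead-in)
rebuilt = (
    taxonomy["Sector"] + "_" + taxonomy["PointEntryExit"] + "_"
    + taxonomy["TransMode"] + "_" + taxonomy["Direction"]
)

# Any row listed here has a taxonomy entry that disagrees with its column name
mismatches = taxonomy[rebuilt != taxonomy["Column Name"]]
print(f"Mismatched taxonomy rows: {len(mismatches)}")
```

An empty `mismatches` frame suggests the taxonomy rows and column names were typed consistently; any rows that do appear are worth checking against the Excel sheet.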


In [4]:
hbd_column_taxonomy[hbd_column_taxonomy['PointEntryExit'].str.contains("Queens")]

Appendix,Appendix.1,Sector,PointEntryExit,TransMode,Direction,Column Name
19,A,Queens,QueensboroBridge,Bus,In,Queens_QueensboroBridge_Bus_In
21,A,Queens,QueensboroBridge,Bus,Out,Queens_QueensboroBridge_Bus_Out
103,D,60thSt,QueensboroBridgeRamp,AutoOccupants,Out,60thSt_QueensboroBridgeRamp_AutoOccupants_Out
118,D,Queens,QueensboroBridge,AutoOccupants,In,Queens_QueensboroBridge_AutoOccupants_In
126,D,Queens,QueensboroBridge,AutoOccupants,Out,Queens_QueensboroBridge_AutoOccupants_Out
143,E,60thSt,QueensboroBridgeRamp,Autos,Out,60thSt_QueensboroBridgeRamp_Autos_Out
158,E,Queens,QueensboroBridge,Autos,In,Queens_QueensboroBridge_Autos_In
166,E,Queens,QueensboroBridge,Autos,Out,Queens_QueensboroBridge_Autos_Out
214,G,Queens,QueensboroBridge,Bicycle,In,Queens_QueensboroBridge_Bicycle_In
219,G,Queens,QueensboroBridge,Bicycle,Out,Queens_QueensboroBridge_Bicycle_Out


In [5]:
## Spotcheck Entry/Exit Points by Sector
for tup in hbd_column_taxonomy.groupby(by=['Sector'])["PointEntryExit"].apply(set).items():
    print(tup[0])
    print(sorted(tup[1]))
    print("**"*50)

60thSt
['10thAveAmsterdamAve', '11thAveWestEndAve', '12thAveWestSideHighway', '1stAve', '2ndAve', '2ndAveLocal', '3rdAve', '5thAve', '6thAve', '7thAve', '8thAveCPWest', '8thAveExpress', '8thAveLocal', '9thAveColumbusAve', 'AmtrakEmpire', 'Broadway', 'BroadwayExpress', 'BroadwayLocal', 'FDRDrive', 'HudsonRiverGreenway', 'LexingtonAve', 'LexingtonAveExpress', 'LexingtonAveLocal', 'MNRHarlem', 'MNRHudson', 'MNRNewHaven', 'MadisonAve', 'ParkAve', 'QueensboroBridgeRamp', 'YorkAve']
****************************************************************************************************
Brooklyn
['14thStTunnel', 'BrooklynBridge', 'ClarkStTunnel', 'CranberryStTunnel', 'Ferry', 'HughCareyTunnel', 'JoralemonStTunnel', 'ManhattanBridge', 'ManhattanBridgeExpress', 'ManhattanBridgeLocal', 'MontagueStTunnel', 'RutgersStTunnel', 'WilliamsburgBridge']
****************************************************************************************************
NewJersey
['AmtrakNECorridor', 'DowntownPath', 'Ferry',

In [6]:
## We need to "un-pivot" this table to a long-skinny version
## Pandas' melt function is very useful and easy to use
## Columns are then named for clarity
hbd_df2 = hbd_df.melt(id_vars=['Hour']).rename({"variable":"Column Name","value":"HBD_est_persons"},axis=1)
hbd_df2 = hbd_df2.merge(right=hbd_column_taxonomy,on='Column Name')
hbd_df2.head(5)

Unnamed: 0,Hour,Column Name,HBD_est_persons,Appendix,Sector,PointEntryExit,TransMode,Direction
0,0,60thSt_YorkAve_Bus_In,0,A,60thSt,YorkAve,Bus,In
1,1,60thSt_YorkAve_Bus_In,0,A,60thSt,YorkAve,Bus,In
2,2,60thSt_YorkAve_Bus_In,0,A,60thSt,YorkAve,Bus,In
3,3,60thSt_YorkAve_Bus_In,0,A,60thSt,YorkAve,Bus,In
4,4,60thSt_YorkAve_Bus_In,0,A,60thSt,YorkAve,Bus,In
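
A default `merge` is an inner join, so any data column without a matching taxonomy row would be dropped silently. The sketch below shows one way to guard against that on toy data: it melts a small wide table, then merges with `how="left"`, `validate="many_to_one"`, and `indicator=True` so unmatched rows become visible. The frames `wide` and `taxonomy` are illustrative stand-ins, not the notebook's actual tables.

```python
import pandas as pd

# Toy wide table: one row per hour, one column per entry point/mode/direction
wide = pd.DataFrame({
    "Hour": [0, 1],
    "60thSt_YorkAve_Bus_In": [0, 0],
    "60thSt_2ndAve_Bus_In": [14, 35],
})

# Toy taxonomy keyed by the same column names
taxonomy = pd.DataFrame({
    "Column Name": ["60thSt_YorkAve_Bus_In", "60thSt_2ndAve_Bus_In"],
    "Sector": ["60thSt", "60thSt"],
    "TransMode": ["Bus", "Bus"],
})

# Un-pivot: every (Hour, column) pair becomes its own row
long = wide.melt(id_vars=["Hour"], var_name="Column Name",
                 value_name="HBD_est_persons")

# Left merge keeps every data row; validate raises if taxonomy keys repeat
merged = long.merge(taxonomy, on="Column Name", how="left",
                    validate="many_to_one", indicator=True)

# Rows that found no taxonomy entry would show up here
unmatched = merged[merged["_merge"] != "both"]
print(f"Rows: {len(merged)}, unmatched: {len(unmatched)}")
```

If `unmatched` is empty and the row count equals that of the melted table, the inner merge used above loses nothing.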


In [7]:
## Then, we group by the relevant columns just to be safe (though there are no dupes)
## Grouping also makes the table easier to read by column order
## Column names are again changed for clarity, where necessary
hbd_by_hour_transmode = hbd_df2.groupby(by=["Sector","PointEntryExit","TransMode","Direction","Hour"])\
    .agg({"HBD_est_persons":'sum'}).reset_index().rename({"HBD_est_persons":"Estimated_Commuters"},axis=1)

hbd_by_hour_transmode.head(5)

Unnamed: 0,Sector,PointEntryExit,TransMode,Direction,Hour,Estimated_Commuters
0,60thSt,10thAveAmsterdamAve,AutoOccupants,Out,0,693
1,60thSt,10thAveAmsterdamAve,AutoOccupants,Out,1,432
2,60thSt,10thAveAmsterdamAve,AutoOccupants,Out,2,308
3,60thSt,10thAveAmsterdamAve,AutoOccupants,Out,3,319
4,60thSt,10thAveAmsterdamAve,AutoOccupants,Out,4,350
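
The claim that there are no duplicates can be confirmed directly rather than assumed. The following is a minimal sketch on toy data: it counts duplicated key combinations before grouping and confirms that grouping does not change the row count. `long_df` is a hypothetical stand-in for `hbd_df2`.

```python
import pandas as pd

keys = ["Sector", "PointEntryExit", "TransMode", "Direction", "Hour"]

# Toy stand-in for hbd_df2, using three rows from the output above
long_df = pd.DataFrame({
    "Sector": ["60thSt"] * 3,
    "PointEntryExit": ["10thAveAmsterdamAve"] * 3,
    "TransMode": ["AutoOccupants"] * 3,
    "Direction": ["Out"] * 3,
    "Hour": [0, 1, 2],
    "HBD_est_persons": [693, 432, 308],
})

# Count rows whose key combination already appeared earlier in the table
n_dupes = int(long_df.duplicated(subset=keys).sum())

# With no duplicates, the grouped table has exactly as many rows as the input
grouped = long_df.groupby(keys, as_index=False)["HBD_est_persons"].sum()
print(f"Duplicate keys: {n_dupes}; rows before/after: {len(long_df)}/{len(grouped)}")
```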


## Checks
This table is sufficient to send to our database. However, because the data was gathered manually, it is worth running some checks against the report's aggregate numbers. That way, we can confirm that the copying introduced no errors.
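
The checks in this section are compared with the report by eye. As an alternative, the report's totals could be typed into a small dictionary and compared programmatically. The sketch below illustrates the idea on made-up numbers: `toy` and `expected` are hypothetical and do not come from the report.

```python
import pandas as pd

# Made-up table in the same shape as hbd_by_hour_transmode
toy = pd.DataFrame({
    "Sector": ["A", "A", "B", "B"],
    "Direction": ["In", "Out", "In", "Out"],
    "TransMode": ["Bus", "Bus", "Autos", "Bus"],
    "Estimated_Commuters": [10, 20, 5, 7],
})

# Hypothetical totals as they would be typed in from a report page
expected = {("A", "In"): 10, ("A", "Out"): 20, ("B", "Out"): 7}

# Exclude vehicle counts ("Autos") so only person counts are summed
actual_totals = (
    toy[toy["TransMode"] != "Autos"]
    .groupby(["Sector", "Direction"])["Estimated_Commuters"]
    .sum()
    .to_dict()
)

# Collect every (Sector, Direction) whose total differs from the report value
diffs = {
    key: (actual_totals.get(key), value)
    for key, value in expected.items()
    if actual_totals.get(key) != value
}
print(diffs if diffs else "All totals match")
```

An empty `diffs` dictionary means every typed-in total was reproduced; otherwise each entry shows the computed value next to the expected one.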

In [8]:
## First check: Page 9 of report - totals by Sector, Direction
## We have to ignore the "Autos" category as that is just the count of vehicles
## "AutoOccupants" has actual people counts
sum_by_sector_dir = hbd_by_hour_transmode[(hbd_by_hour_transmode['TransMode']!='Autos')]\
        .groupby(by=["Sector","Direction"]).agg({"Estimated_Commuters":"sum"})
print(f"Total Sum: {sum_by_sector_dir.sum().values[0]:,}")
sum_by_sector_dir 
## This matches perfectly

Total Sum: 7,664,090


Unnamed: 0_level_0,Unnamed: 1_level_0,Estimated_Commuters
Sector,Direction,Unnamed: 2_level_1
60thSt,In,1374700
60thSt,Out,1387208
Brooklyn,In,1075958
Brooklyn,Out,1044430
NewJersey,In,590217
NewJersey,Out,576048
Queens,In,776591
Queens,Out,762328
RooseveltIsland,In,3716
RooseveltIsland,Out,4104


In [9]:
## Second check: Page 12 of report - totals + pcts entering by Sector
sum_by_sector_enter = hbd_by_hour_transmode[(hbd_by_hour_transmode['Direction']=="In")&(hbd_by_hour_transmode['TransMode']!='Autos')].groupby(by=["Sector"]).agg({"Estimated_Commuters":"sum"})
sum_by_sector_enter["Estimated_Commuters_Pct"] = sum_by_sector_enter["Estimated_Commuters"]/sum_by_sector_enter["Estimated_Commuters"].sum()
sum_by_sector_enter 
## This matches perfectly

Unnamed: 0_level_0,Estimated_Commuters,Estimated_Commuters_Pct
Sector,Unnamed: 1_level_1,Unnamed: 2_level_1
60thSt,1374700,0.356536
Brooklyn,1075958,0.279056
NewJersey,590217,0.153076
Queens,776591,0.201413
RooseveltIsland,3716,0.000964
StatenIsland,34526,0.008955


In [10]:
## Third check: Page 15 of report - totals + pcts by Transit Mode
sum_by_mode = hbd_by_hour_transmode.groupby(by=["TransMode"]).agg({"Estimated_Commuters":"sum"})
sum_by_mode.drop("Autos",axis=0,inplace=True)
sum_by_mode["Estimated_Commuters_Pct"] = sum_by_mode["Estimated_Commuters"]/sum_by_mode["Estimated_Commuters"].sum()
sum_by_mode 
## This matches perfectly

Unnamed: 0_level_0,Estimated_Commuters,Estimated_Commuters_Pct
TransMode,Unnamed: 1_level_1,Unnamed: 2_level_1
AutoOccupants,1856236,0.242199
Bicycle,65588,0.008558
Bus,532307,0.069455
CommuterRail,685330,0.089421
Ferry,118525,0.015465
Subway,4398284,0.573882
Tram,7820,0.00102


In [11]:
## Fourth Check - Bicycle Volumes Specifically
## Page 119 of PDF (III-44 in Appendix III)
bike_df = hbd_by_hour_transmode[hbd_by_hour_transmode['TransMode']=="Bicycle"].copy()
bike_df.groupby(by=["Direction","Sector"]).agg({"Estimated_Commuters":"sum"}).T
## These match perfectly

Direction,In,In,In,In,Out,Out,Out,Out
Sector,60thSt,Brooklyn,Queens,StatenIsland,60thSt,Brooklyn,Queens,StatenIsland
Estimated_Commuters,21984,7578,2606,342,22255,8040,2510,273


Because these checks match at several levels of aggregation, we are satisfied that the data was copied and transformed correctly. We will run further checks in the next stage of data work, but we are now confident that this data is accurate and much easier to use.

In [12]:
hbd_by_hour_transmode.to_csv("est_commuters_HBD.csv")
hbd_by_hour_transmode.head(2)

Unnamed: 0,Sector,PointEntryExit,TransMode,Direction,Hour,Estimated_Commuters
0,60thSt,10thAveAmsterdamAve,AutoOccupants,Out,0,693
1,60thSt,10thAveAmsterdamAve,AutoOccupants,Out,1,432
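
By default `to_csv` also writes the row index, which comes back as an extra `Unnamed: 0` column when the file is read again. If that column is not wanted in the database, passing `index=False` avoids it. The sketch below round-trips a toy frame through an in-memory buffer to show the effect; it is an illustration, not a change to the export above.

```python
import io
import pandas as pd

# Toy stand-in for the final table
df = pd.DataFrame({
    "Sector": ["60thSt", "60thSt"],
    "Hour": [0, 1],
    "Estimated_Commuters": [693, 432],
})

# Write without the row index, then read back from the same buffer
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
round_trip = pd.read_csv(buf)

# The columns survive unchanged and no "Unnamed: 0" column appears
print(list(round_trip.columns))
```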
