# Finding Station Locations

## Import the libraries 

In [1]:
import pandas as pd

## Read in the data
### Read in station location data
We pulled the station location data from the MTA website.

In [11]:
stations = pd.read_csv('http://web.mta.info/developers/data/nyct/subway/Stations.csv')
stations.head()

Unnamed: 0,Station ID,Complex ID,GTFS Stop ID,Division,Line,Stop Name,Borough,Daytime Routes,Structure,GTFS Latitude,GTFS Longitude
0,1,1,R01,BMT,Astoria,Astoria - Ditmars Blvd,Q,N W,Elevated,40.775036,-73.912034
1,2,2,R03,BMT,Astoria,Astoria Blvd,Q,N W,Elevated,40.770258,-73.917843
2,3,3,R04,BMT,Astoria,30 Av,Q,N W,Elevated,40.766779,-73.921479
3,4,4,R05,BMT,Astoria,Broadway,Q,N W,Elevated,40.76182,-73.925508
4,5,5,R06,BMT,Astoria,36 Av,Q,N W,Elevated,40.756804,-73.929575


### Read in our cleaned turnstile data

In [14]:
turns = pd.read_pickle('stiles/cleaned_turnstile_data.pkl')
turns.head()

Unnamed: 0,c/a,unit,scp,station,linename,division,date,time,desc,entries,...,group_id,station_id,entries_diff,exit_diff,first_row_group,entries_outlier,exit_outlier,all_traffic,week,hour
96348,N519,R461,00-03-01,B'WAY-LAFAYETTE,BDFQ6,IND,03/25/2017,05:00:00,REGULAR,17310033,...,0,B'WAY-LAFAYETTEBDFQ6IND,10.0,13.0,False,False,False,23.0,12,5
96349,N519,R461,00-03-01,B'WAY-LAFAYETTE,BDFQ6,IND,03/25/2017,09:00:00,REGULAR,17310048,...,0,B'WAY-LAFAYETTEBDFQ6IND,15.0,57.0,False,False,False,72.0,12,9
96350,N519,R461,00-03-01,B'WAY-LAFAYETTE,BDFQ6,IND,03/25/2017,13:00:00,REGULAR,17310132,...,0,B'WAY-LAFAYETTEBDFQ6IND,84.0,291.0,False,False,False,375.0,12,13
96351,N519,R461,00-03-01,B'WAY-LAFAYETTE,BDFQ6,IND,03/25/2017,17:00:00,REGULAR,17310413,...,0,B'WAY-LAFAYETTEBDFQ6IND,281.0,493.0,False,False,False,774.0,12,17
96352,N519,R461,00-03-01,B'WAY-LAFAYETTE,BDFQ6,IND,03/25/2017,21:00:00,REGULAR,17310743,...,0,B'WAY-LAFAYETTEBDFQ6IND,330.0,334.0,False,False,False,664.0,12,21


### Read in a join table

As you might have noticed, there is no common field to merge the stations data with our turnstile data :( But don't worry!! We found a super handy table that links the stop ID from the stations table (`GTFS Stop ID`) to a booth ID, which corresponds to the control area ID (`c/a`) in our turnstile data (Yay!!!)

Let's read in that join table.

*File downloaded from: https://docs.google.com/spreadsheets/d/10sz0xWODQ02Kemx6ovS0NLQ_gA0YV9YQtdD7uiCcyjI/edit?usp=sharing&authkey=CMTzrvwE*

In [19]:
join_table = pd.read_csv('Remote-Booth-Station - turnstiles 2013.csv')
join_table.head()

Unnamed: 0,Remote,Booth,stop_id.2013,Station.2013,Line.2013,Division.2013
0,R051,A002,R11,LEXINGTON AVE,456NQR,BMT
1,R050,A004,R11,LEXINGTON AVE,456NQR,BMT
2,R079,A006,R13,5 AVE-59 ST,NQR,BMT
3,R079,A007,R13,5 AVE-59 ST,NQR,BMT
4,R080,A010,R14,57 ST-7 AVE,NQR,BMT
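Since `Booth` is the key we later match against `c/a`, it may be worth checking that each booth appears only once in the join table; a repeated key would duplicate turnstile rows in the merge. Here is a minimal sketch on made-up rows (the `toy_join` frame is hypothetical and only mimics the real file's columns):

```python
import pandas as pd

# Hypothetical miniature of the join table; the last row repeats booth A002.
toy_join = pd.DataFrame({
    "Remote": ["R051", "R050", "R079", "R051"],
    "Booth": ["A002", "A004", "A006", "A002"],
    "stop_id.2013": ["R11", "R11", "R13", "R11"],
})

# True only if every booth id appears exactly once.
booth_is_unique = toy_join["Booth"].is_unique

# Rows whose booth id occurs more than once (keep=False flags every copy).
repeated = toy_join[toy_join["Booth"].duplicated(keep=False)]

print(booth_is_unique)
print(repeated)
```

Running the same two checks on the real `join_table` would show whether any de-duplication is needed before merging.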


## Join the turnstile data with station locations

### Join the stations data with our join file to get booth id

In [22]:
stations_booth = pd.merge(join_table, stations, how='left', left_on='stop_id.2013', right_on='GTFS Stop ID')
stations_booth.head()

Unnamed: 0,Remote,Booth,stop_id.2013,Station.2013,Line.2013,Division.2013,Station ID,Complex ID,GTFS Stop ID,Division,Line,Stop Name,Borough,Daytime Routes,Structure,GTFS Latitude,GTFS Longitude
0,R051,A002,R11,LEXINGTON AVE,456NQR,BMT,7.0,613.0,R11,BMT,Astoria,Lexington Av/59 St,M,N W R,Subway,40.76266,-73.967258
1,R050,A004,R11,LEXINGTON AVE,456NQR,BMT,7.0,613.0,R11,BMT,Astoria,Lexington Av/59 St,M,N W R,Subway,40.76266,-73.967258
2,R079,A006,R13,5 AVE-59 ST,NQR,BMT,8.0,8.0,R13,BMT,Astoria,5 Av/59 St,M,N W R,Subway,40.764811,-73.973347
3,R079,A007,R13,5 AVE-59 ST,NQR,BMT,8.0,8.0,R13,BMT,Astoria,5 Av/59 St,M,N W R,Subway,40.764811,-73.973347
4,R080,A010,R14,57 ST-7 AVE,NQR,BMT,9.0,9.0,R14,BMT,Broadway - Brighton,57 St - 7 Av,M,N Q R W,Subway,40.764664,-73.980658


In [34]:
# Each row of stations_booth is a join-table row (one booth), so this counts
# booths that found no matching station, not distinct stations.
missing = stations_booth['GTFS Latitude'].isnull().sum()

print("We couldn't match {} booths in our join table to a station".format(missing))
print("...but we have station data for {} of our join table rows!".format(round((len(stations_booth)-missing)/len(stations_booth),2)))

We couldn't match 41 booths in our join table to a station
...but we have station data for 0.95 of our join table rows!
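Counting nulls in a right-hand column works, but pandas can also label each row's match status directly through the `indicator` argument of `merge`. This is a sketch on hypothetical toy frames (`toy_join`, `toy_stations`, and the stop id `R99` are made up), not the real files:

```python
import pandas as pd

# Hypothetical miniature frames; R99 is a stop id with no station record.
toy_join = pd.DataFrame({
    "Booth": ["A002", "A004", "A006", "X999"],
    "stop_id.2013": ["R11", "R11", "R13", "R99"],
})
toy_stations = pd.DataFrame({
    "GTFS Stop ID": ["R11", "R13"],
    "GTFS Latitude": [40.76266, 40.764811],
})

# indicator=True adds a _merge column; for a left join its values are
# 'both' (matched) or 'left_only' (no station found).
merged = pd.merge(
    toy_join, toy_stations,
    how="left", left_on="stop_id.2013", right_on="GTFS Stop ID",
    indicator=True,
)

unmatched = merged[merged["_merge"] == "left_only"]
matched_share = (merged["_merge"] == "both").mean()

print(unmatched["Booth"].tolist())
print(round(matched_share, 2))
```

Listing the unmatched booths, rather than only counting them, makes it easier to see which stations would need a manual fix.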


### Join the turnstile data with our stations data

In [32]:
turns_loc = pd.merge(turns, stations_booth, how='inner',left_on='c/a',right_on='Booth')
turns_loc.head()

Unnamed: 0,c/a,unit,scp,station,linename,division,date,time,desc,entries,...,Complex ID,GTFS Stop ID,Division,Line,Stop Name,Borough,Daytime Routes,Structure,GTFS Latitude,GTFS Longitude
0,N519,R461,00-03-01,B'WAY-LAFAYETTE,BDFQ6,IND,03/25/2017,05:00:00,REGULAR,17310033,...,619.0,D21,IND,6th Av - Culver,Broadway-Lafayette St,M,B D F M,Subway,40.725297,-73.996204
1,N519,R461,00-03-01,B'WAY-LAFAYETTE,BDFQ6,IND,03/25/2017,09:00:00,REGULAR,17310048,...,619.0,D21,IND,6th Av - Culver,Broadway-Lafayette St,M,B D F M,Subway,40.725297,-73.996204
2,N519,R461,00-03-01,B'WAY-LAFAYETTE,BDFQ6,IND,03/25/2017,13:00:00,REGULAR,17310132,...,619.0,D21,IND,6th Av - Culver,Broadway-Lafayette St,M,B D F M,Subway,40.725297,-73.996204
3,N519,R461,00-03-01,B'WAY-LAFAYETTE,BDFQ6,IND,03/25/2017,17:00:00,REGULAR,17310413,...,619.0,D21,IND,6th Av - Culver,Broadway-Lafayette St,M,B D F M,Subway,40.725297,-73.996204
4,N519,R461,00-03-01,B'WAY-LAFAYETTE,BDFQ6,IND,03/25/2017,21:00:00,REGULAR,17310743,...,619.0,D21,IND,6th Av - Culver,Broadway-Lafayette St,M,B D F M,Subway,40.725297,-73.996204


In [35]:
# The inner join above already dropped turnstile rows whose c/a has no Booth
# match, so this counts merged rows whose booth has no station coordinates.
missing_t = turns_loc['GTFS Latitude'].isnull().sum()

print("We're missing station coordinates for {} merged turnstile records".format(missing_t))
print("...but we have location data for {} of our merged turnstile records!".format(round((len(turns_loc)-missing_t)/len(turns_loc),2)))

We're missing station coordinates for 362630 merged turnstile records
...but we have location data for 0.93 of our merged turnstile records!
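Because the turnstile merge uses `how='inner'`, rows whose `c/a` never appears as a `Booth` are dropped before any nulls are counted. One way to see how many rows that affects is to compare against a left join. The frames below are hypothetical stand-ins, and the comparison assumes `Booth` is unique in the right-hand frame:

```python
import pandas as pd

# Hypothetical miniature frames; Z000 has no booth record,
# and booth A006 has no coordinates.
toy_turns = pd.DataFrame({
    "c/a": ["A002", "A002", "A006", "Z000"],
    "entries_diff": [10.0, 15.0, 84.0, 7.0],
})
toy_booths = pd.DataFrame({
    "Booth": ["A002", "A006"],
    "GTFS Latitude": [40.76266, None],
})

inner = pd.merge(toy_turns, toy_booths, how="inner", left_on="c/a", right_on="Booth")
left = pd.merge(toy_turns, toy_booths, how="left", left_on="c/a", right_on="Booth")

# Rows lost to the inner join: their c/a never appears as a Booth.
dropped_by_inner = len(left) - len(inner)

# Rows that survived the join but still have no coordinates.
no_coords_after_inner = inner["GTFS Latitude"].isnull().sum()

print(dropped_by_inner, no_coords_after_inner)
```

Reporting both numbers separates turnstile records with an unknown booth from records whose booth is known but has no station location.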


## Pickle the turnstile data with lat/longs

In [36]:
turns_loc.to_pickle('turns_data_locations.pkl')
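
A quick round trip can confirm that a pickled frame reads back unchanged. This sketch uses a hypothetical one-row frame and a temporary file name (`toy_locations.pkl`) rather than the real output file:

```python
import os
import tempfile

import pandas as pd

# Hypothetical one-row stand-in for the merged turnstile data.
toy = pd.DataFrame({"c/a": ["N519"], "GTFS Latitude": [40.725297]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_locations.pkl")  # made-up file name
    toy.to_pickle(path)
    restored = pd.read_pickle(path)

# equals() compares values, index, columns, and dtypes.
same = restored.equals(toy)
print(same)
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.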