# Finding Station Locations

## Import the libraries 

In [1]:
import pandas as pd
import numpy as np
import sys

print("Python Version:", sys.version)
print("Pandas Version:", pd.__version__)
print("Numpy Version:", np.__version__)

Python Version: 3.6.3 |Anaconda, Inc.| (default, Oct  6 2017, 12:04:38) 
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]
Pandas Version: 0.20.3
Numpy Version: 1.14.0


## Read in the data
### Read in station location data
We pulled the station location data from the MTA website.

In [2]:
stations = pd.read_csv('http://web.mta.info/developers/data/nyct/subway/Stations.csv')
stations.head()

Unnamed: 0,Station ID,Complex ID,GTFS Stop ID,Division,Line,Stop Name,Borough,Daytime Routes,Structure,GTFS Latitude,GTFS Longitude
0,1,1,R01,BMT,Astoria,Astoria - Ditmars Blvd,Q,N W,Elevated,40.775036,-73.912034
1,2,2,R03,BMT,Astoria,Astoria Blvd,Q,N W,Elevated,40.770258,-73.917843
2,3,3,R04,BMT,Astoria,30 Av,Q,N W,Elevated,40.766779,-73.921479
3,4,4,R05,BMT,Astoria,Broadway,Q,N W,Elevated,40.76182,-73.925508
4,5,5,R06,BMT,Astoria,36 Av,Q,N W,Elevated,40.756804,-73.929575


### Read in our cleaned turnstile data

In [4]:
turns = pd.read_pickle('data/cleaned_turnstile_data.pkl')
turns.head()

Unnamed: 0,c/a,unit,scp,station,linename,division,date,time,desc,entries,...,entries_diff,exit_diff,first_row_group,entries_outlier,exit_outlier,week,hour,timegroup,timegroupstr,wkdaynbr
72507,N220,R155,01-00-02,KINGSBRIDGE RD,BD,IND,03/26/2016,05:00:00,REGULAR,669738,...,12.0,37.0,False,False,False,12,5,1,2 am - 6 am,6
72508,N220,R155,01-00-02,KINGSBRIDGE RD,BD,IND,03/26/2016,09:00:00,REGULAR,669845,...,107.0,21.0,False,False,False,12,9,2,6 am - 10 am,6
72509,N220,R155,01-00-02,KINGSBRIDGE RD,BD,IND,03/26/2016,13:00:00,REGULAR,669956,...,111.0,56.0,False,False,False,12,13,3,10 am - 2pm,6
72510,N220,R155,01-00-02,KINGSBRIDGE RD,BD,IND,03/26/2016,17:00:00,REGULAR,670074,...,118.0,95.0,False,False,False,12,17,4,2 pm - 6 pm,6
72511,N220,R155,01-00-02,KINGSBRIDGE RD,BD,IND,03/26/2016,21:00:00,REGULAR,670139,...,65.0,121.0,False,False,False,12,21,5,6 pm - 10 pm,6


### Read in a join table

As you might have noticed, there is no common field for merging the stations data with our turnstile data :( But don't worry! We found a super handy table that links the GTFS stop ID from the stations table to a booth ID, which corresponds to the control area ID (c/a) in our turnstile data (yay!).

Let's read in that join table.

*File downloaded from: https://docs.google.com/spreadsheets/d/10sz0xWODQ02Kemx6ovS0NLQ_gA0YV9YQtdD7uiCcyjI/edit?usp=sharing&authkey=CMTzrvwE*

In [5]:
join_table = pd.read_csv('data/Remote-Booth-Station - turnstiles 2013.csv')
join_table.head()

Unnamed: 0,Remote,Booth,stop_id.2013,Station.2013,Line.2013,Division.2013
0,R051,A002,R11,LEXINGTON AVE,456NQR,BMT
1,R050,A004,R11,LEXINGTON AVE,456NQR,BMT
2,R079,A006,R13,5 AVE-59 ST,NQR,BMT
3,R079,A007,R13,5 AVE-59 ST,NQR,BMT
4,R080,A010,R14,57 ST-7 AVE,NQR,BMT


## Join the turnstile data with station locations

### Join the stations data with our join file to get booth id

In [6]:
stations_booth = pd.merge(join_table, stations, how='left', left_on='stop_id.2013', right_on='GTFS Stop ID')
stations_booth.head()

Unnamed: 0,Remote,Booth,stop_id.2013,Station.2013,Line.2013,Division.2013,Station ID,Complex ID,GTFS Stop ID,Division,Line,Stop Name,Borough,Daytime Routes,Structure,GTFS Latitude,GTFS Longitude
0,R051,A002,R11,LEXINGTON AVE,456NQR,BMT,7.0,613.0,R11,BMT,Astoria,Lexington Av/59 St,M,N W R,Subway,40.76266,-73.967258
1,R050,A004,R11,LEXINGTON AVE,456NQR,BMT,7.0,613.0,R11,BMT,Astoria,Lexington Av/59 St,M,N W R,Subway,40.76266,-73.967258
2,R079,A006,R13,5 AVE-59 ST,NQR,BMT,8.0,8.0,R13,BMT,Astoria,5 Av/59 St,M,N W R,Subway,40.764811,-73.973347
3,R079,A007,R13,5 AVE-59 ST,NQR,BMT,8.0,8.0,R13,BMT,Astoria,5 Av/59 St,M,N W R,Subway,40.764811,-73.973347
4,R080,A010,R14,57 ST-7 AVE,NQR,BMT,9.0,9.0,R14,BMT,Broadway - Brighton,57 St - 7 Av,M,N Q R W,Subway,40.764664,-73.980658
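
A left merge like this one repeats a left-hand row once for every right-hand row that shares its key, so a stop ID that shows up more than once in the stations table would quietly inflate the row count. Here is a minimal sketch of a check for that, using small made-up frames (the repeated 'R13' row is contrived for illustration; we have not verified whether the real file contains repeats):

```python
import pandas as pd

# Toy stand-ins for the real tables (values are illustrative only).
join_table = pd.DataFrame({
    'Booth': ['A002', 'A004', 'A006'],
    'stop_id.2013': ['R11', 'R11', 'R13'],
})
stations = pd.DataFrame({
    'GTFS Stop ID': ['R11', 'R13', 'R13'],  # 'R13' deliberately repeated
    'Stop Name': ['Lexington Av/59 St', '5 Av/59 St', '5 Av/59 St'],
})

# Keys that occur more than once on the right side multiply the matching left rows.
is_dup = stations['GTFS Stop ID'].duplicated(keep=False)
dup_keys = stations.loc[is_dup, 'GTFS Stop ID'].unique().tolist()
print("Duplicated stop ids:", dup_keys)

merged = pd.merge(join_table, stations, how='left',
                  left_on='stop_id.2013', right_on='GTFS Stop ID')
print("Rows before:", len(join_table), "rows after:", len(merged))
```

If the two row counts differ, dropping or aggregating the duplicated keys before merging is one way to keep a single row per booth.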


In [7]:
missing = stations_booth['GTFS Latitude'].isnull().sum()

# Each row of stations_booth comes from the join table (one booth per row), not from a list of unique stations
print("{} join-table rows (booths) have no matching station location".format(missing))
print("...but we have location data for {} of the join-table rows!".format(round((len(stations_booth)-missing)/len(stations_booth),2)))

41 join-table rows (booths) have no matching station location
...but we have location data for 0.95 of the join-table rows!
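
If we want to know which booths failed to match, and not just how many, `pd.merge` can label every row with where it came from via `indicator=True`. A small sketch on toy frames, where the booth 'X999' is hypothetical and exists only to produce an unmatched row:

```python
import pandas as pd

# Toy stand-ins; 'X999' is a made-up booth with no stop id.
join_table = pd.DataFrame({
    'Booth': ['A002', 'A006', 'X999'],
    'stop_id.2013': ['R11', 'R13', None],
})
stations = pd.DataFrame({
    'GTFS Stop ID': ['R11', 'R13'],
    'GTFS Latitude': [40.76266, 40.764811],
})

# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'.
checked = pd.merge(join_table, stations, how='left',
                   left_on='stop_id.2013', right_on='GTFS Stop ID',
                   indicator=True)

unmatched = checked.loc[checked['_merge'] == 'left_only', 'Booth'].tolist()
print("Booths without a station match:", unmatched)
```

Listing the unmatched booths makes it easier to decide whether they can be fixed by hand or are safe to ignore.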


### Join the turnstile data with our stations data

In [8]:
turns_loc = pd.merge(turns, stations_booth, how='inner',left_on='c/a',right_on='Booth')
turns_loc.head()

Unnamed: 0,c/a,unit,scp,station,linename,division,date,time,desc,entries,...,Complex ID,GTFS Stop ID,Division,Line,Stop Name,Borough,Daytime Routes,Structure,GTFS Latitude,GTFS Longitude
0,N220,R155,01-00-02,KINGSBRIDGE RD,BD,IND,03/26/2016,05:00:00,REGULAR,669738,...,212.0,D04,IND,Concourse,Kingsbridge Rd,Bx,B D,Subway,40.866978,-73.893509
1,N220,R155,01-00-02,KINGSBRIDGE RD,BD,IND,03/26/2016,09:00:00,REGULAR,669845,...,212.0,D04,IND,Concourse,Kingsbridge Rd,Bx,B D,Subway,40.866978,-73.893509
2,N220,R155,01-00-02,KINGSBRIDGE RD,BD,IND,03/26/2016,13:00:00,REGULAR,669956,...,212.0,D04,IND,Concourse,Kingsbridge Rd,Bx,B D,Subway,40.866978,-73.893509
3,N220,R155,01-00-02,KINGSBRIDGE RD,BD,IND,03/26/2016,17:00:00,REGULAR,670074,...,212.0,D04,IND,Concourse,Kingsbridge Rd,Bx,B D,Subway,40.866978,-73.893509
4,N220,R155,01-00-02,KINGSBRIDGE RD,BD,IND,03/26/2016,21:00:00,REGULAR,670139,...,212.0,D04,IND,Concourse,Kingsbridge Rd,Bx,B D,Subway,40.866978,-73.893509


In [9]:
missing_t = turns_loc['GTFS Latitude'].isnull().sum()

# The inner join already dropped records whose c/a has no booth match;
# the nulls counted here are records whose booth has no station location
print("We're missing station location data for {} turnstile records".format(missing_t))
print("...but we have location data for {} of our joined turnstile records!".format(round((len(turns_loc)-missing_t)/len(turns_loc),2)))

We're missing station location data for 369104 turnstile records
...but we have location data for 0.93 of our joined turnstile records!
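
One thing to keep in mind: because this merge is an inner join, any turnstile record whose c/a never appears in the Booth column drops out before we count anything. Here is a minimal sketch of how one might count those records, again on toy data (the control area 'Z001' is made up):

```python
import pandas as pd

# Toy stand-ins; 'Z001' is a hypothetical control area with no booth entry.
turns = pd.DataFrame({
    'c/a': ['N220', 'N220', 'Z001'],
    'entries_diff': [12.0, 107.0, 5.0],
})
stations_booth = pd.DataFrame({
    'Booth': ['N220'],
    'GTFS Latitude': [40.866978],
})

# Rows flagged True here would be discarded by an inner join on c/a == Booth.
no_booth = ~turns['c/a'].isin(stations_booth['Booth'])
dropped = int(no_booth.sum())
share = dropped / len(turns)
print("{} of {} turnstile records have no booth match ({:.0%})".format(
    dropped, len(turns), share))
```

Comparing `len(turns)` with `len(turns_loc)` alone is not a reliable substitute for this check, since duplicated booth keys can add rows while unmatched ones remove them.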


Check out the station names from the turnstile data and the station location data to make sure they agree:

In [11]:
# Pair each turnstile station name with the matched MTA stop name
diff_stations = list(zip(turns_loc['station'], turns_loc['Stop Name']))
set(diff_stations)

{('AVENUE N', 'Avenue N'),
 ('VERNON-JACKSON', 'Vernon Blvd - Jackson Av'),
 ('TOMPKINSVILLE', 'Tompkinsville'),
 ('81 ST-MUSEUM', '81 St - Museum of Natural History'),
 ('175 ST', '175 St'),
 ('47-50 STS ROCK', '47-50 Sts - Rockefeller Ctr'),
 ('BUSHWICK AV', 'Bushwick Av - Aberdeen St'),
 ('BOROUGH HALL', 'Borough Hall'),
 ('NEWARK HM HE', nan),
 ('ASTOR PL', 'Astor Pl'),
 ('MT EDEN AV', 'Mt Eden Av'),
 ('CHAMBERS ST', 'Chambers St'),
 ('METROPOLITAN AV', 'Metropolitan Av'),
 ('BEACH 98 ST', nan),
 ('CORTELYOU RD', 'Cortelyou Rd'),
 ('NORWOOD 205 ST', 'Norwood - 205 St'),
 ('BEACH 90 ST', 'Beach 90 St'),
 ('BEDFORD-NOSTRAN', 'Bedford - Nostrand Avs'),
 ('BUHRE AV', 'Buhre Av'),
 ('NASSAU ST', 'Nassau Av'),
 ('JAMAICA 179 ST', 'Jamaica - 179 St'),
 ('40 ST LOWERY ST', '40 St'),
 ('225 ST', '225 St'),
 ('KOSCIUSZKO ST', 'Kosciuszko St'),
 ('CANARSIE-ROCKAW', 'Canarsie - Rockaway Pkwy'),
 ('WORLD TRADE CTR', 'World Trade Center'),
 ('YORK ST', 'York St'),
 ('SIMPSON ST', 'Simpson St'),


In [12]:
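# diff_stations holds one pair per turnstile record, so this counts records, not unique stations.
# pd.isnull is more robust than `np.nan in pair`, which relies on object identity.
nans = [pair for pair in diff_stations if pd.isnull(pair[1])]
print('{} turnstile records do not have location information: {} of total'.format(len(nans), len(nans)/len(diff_stations)))

369104 turnstile records do not have location information: 0.07021522034899116 of total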
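
A note on detecting the missing values: a membership test such as `np.nan in pair` only succeeds when the tuple holds the very same `np.nan` object, because NaN never compares equal to anything, itself included. The sketch below, on toy data, shows that pitfall and one way to list the unique station names that lack a stop name (the three rows are illustrative, not taken from the full dataset):

```python
import numpy as np
import pandas as pd

pairs = pd.DataFrame({
    'station': ['ASTOR PL', 'ASTOR PL', 'BEACH 98 ST'],
    'Stop Name': ['Astor Pl', 'Astor Pl', np.nan],
})

# A NaN that is a different object from np.nan is missed by `in`,
# while pd.isnull does not depend on object identity.
other_nan = float('nan')
found_by_in = np.nan in ('X', other_nan)
found_by_isnull = bool(pd.isnull(other_nan))
print(found_by_in, found_by_isnull)

# Collapse to unique (station, Stop Name) pairs, then keep the unmatched ones.
unique_pairs = pairs.drop_duplicates()
no_location = unique_pairs.loc[unique_pairs['Stop Name'].isnull(), 'station'].tolist()
print("Stations without a stop name:", no_location)
```

Working from unique pairs gives a count of stations, whereas counting over every row gives a count of turnstile records.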


## Add new column for clean Station ID for use in visualizations

In [13]:
turns_loc['station_id'] = turns_loc['Stop Name'] + turns_loc['Line']

In [None]:
turns_loc.head()

Unnamed: 0,c/a,unit,scp,station,linename,division,date,time,desc,entries,...,GTFS Stop ID,Division,Line,Stop Name,Borough,Daytime Routes,Structure,GTFS Latitude,GTFS Longitude,station_id
0,N220,R155,01-00-02,KINGSBRIDGE RD,BD,IND,03/26/2016,05:00:00,REGULAR,669738,...,D04,IND,Concourse,Kingsbridge Rd,Bx,B D,Subway,40.866978,-73.893509,Kingsbridge RdConcourse
1,N220,R155,01-00-02,KINGSBRIDGE RD,BD,IND,03/26/2016,09:00:00,REGULAR,669845,...,D04,IND,Concourse,Kingsbridge Rd,Bx,B D,Subway,40.866978,-73.893509,Kingsbridge RdConcourse
2,N220,R155,01-00-02,KINGSBRIDGE RD,BD,IND,03/26/2016,13:00:00,REGULAR,669956,...,D04,IND,Concourse,Kingsbridge Rd,Bx,B D,Subway,40.866978,-73.893509,Kingsbridge RdConcourse
3,N220,R155,01-00-02,KINGSBRIDGE RD,BD,IND,03/26/2016,17:00:00,REGULAR,670074,...,D04,IND,Concourse,Kingsbridge Rd,Bx,B D,Subway,40.866978,-73.893509,Kingsbridge RdConcourse
4,N220,R155,01-00-02,KINGSBRIDGE RD,BD,IND,03/26/2016,21:00:00,REGULAR,670139,...,D04,IND,Concourse,Kingsbridge Rd,Bx,B D,Subway,40.866978,-73.893509,Kingsbridge RdConcourse
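
As the output shows, adding the two columns glues the names together with nothing in between, and any record without a stop name ends up with a missing `station_id`. If a more readable label is wanted for plots, one option is `Series.str.cat` with a separator. This is only a sketch: the `station_label` column name, the ' | ' separator and the 'Example St' row are our own inventions, not part of the original pipeline.

```python
import numpy as np
import pandas as pd

# Toy frame; the second row is made up, the third mimics an unmatched record.
df = pd.DataFrame({
    'Stop Name': ['Kingsbridge Rd', 'Example St', np.nan],
    'Line': ['Concourse', 'Example Line', np.nan],
})

# Plain concatenation, as in the cell above: no separator, NaN stays NaN.
df['station_id'] = df['Stop Name'] + df['Line']

# Hypothetical alternative with a visible separator; NaN rows still stay NaN.
df['station_label'] = df['Stop Name'].str.cat(df['Line'], sep=' | ')

print(df[['station_id', 'station_label']])
```

Either column works as a grouping key; the separator just keeps the stop name and the line name distinguishable in chart labels.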


## Pickle the turnstile data with lat/longs

In [36]:
turns_loc.to_pickle('data/turns_data_locations.pkl')