### This notebook processes and cleans the 2019 subway turnstile dataset
- For each turnstile, compute the difference in entries between the current reading and the previous one
- Do the same for exits
- This gives the number of people who entered/exited station X in a given time period
- Do this only for stations in Manhattan
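
A minimal sketch of the differencing step described above, assuming each turnstile is identified by the combination of `C/A`, `Unit`, `SCP` and `Station`. The `sample` frame holds made-up counter values, and the names `timestamp`, `entries_diff` and `exits_diff` are introduced here for illustration only.

```python
import pandas as pd

# Made-up counter readings in the same column layout as the 2019 turnstile file.
sample = pd.DataFrame({
    "C/A": ["A033"] * 4,
    "Unit": ["R170"] * 4,
    "SCP": ["02-00-05", "02-00-05", "02-00-02", "02-00-02"],
    "Station": ["14 ST-UNION SQ"] * 4,
    "Date": ["12/27/2019"] * 4,
    "Time": ["00:00:00", "04:00:00", "00:00:00", "04:00:00"],
    "Entries": [1000, 1040, 500, 515],
    "Exits": [200, 260, 900, 905],
})

# Combine Date and Time so readings can be ordered in time.
sample["timestamp"] = pd.to_datetime(
    sample["Date"] + " " + sample["Time"], format="%m/%d/%Y %H:%M:%S"
)

# Assumption: these four columns together identify one physical turnstile.
turnstile_keys = ["C/A", "Unit", "SCP", "Station"]
sample = sample.sort_values(turnstile_keys + ["timestamp"])

# Difference between each reading and the previous reading of the same turnstile.
# The first reading of every turnstile has no predecessor, so its diff is NaN.
diffs = sample.groupby(turnstile_keys)[["Entries", "Exits"]].diff()
sample["entries_diff"] = diffs["Entries"]
sample["exits_diff"] = diffs["Exits"]

# Sum the per-turnstile differences to get station totals per time period.
per_station = sample.groupby(["Station", "timestamp"])[
    ["entries_diff", "exits_diff"]
].sum(min_count=1)
print(per_station)
```

Sorting before calling `diff()` matters here, because the raw file is not guaranteed to be in chronological order per turnstile.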

#### Manually cross-referencing which stations are in Manhattan:
- https://new.mta.info/accessibility/stations

In [56]:
import json
import pandas as pd

### Extract subway names and coordinates from the Overpass Turbo JSON file


In [57]:
#there are 161 stations in this dataset

# Open the geojson file and read it
with open('manSubTurn.geojson') as file:
    data = json.load(file)


subway=pd.DataFrame(columns=['name','short_name','lat','lon','Link to Turnstile Data'])

for i, feature in enumerate(data['features']):
    properties = feature['properties']
    name = properties['name']
    # short_name is optional; fall back to the string 'None' when it is missing
    short_n = properties.get('short_name', 'None')

    # GeoJSON stores coordinates as [longitude, latitude]
    lon, lat = feature['geometry']['coordinates'][:2]
    subway.loc[i] = [name, short_n, lat, lon, 'None']
    

In [58]:
subway.tail(20)

Unnamed: 0,name,short_name,lat,lon,Link to Turnstile Data
142,Grand Central Terminal,,40.752806,-73.977179,
143,Marble Hill,,40.874924,-73.912702,
144,Manhattan,,41.418606,-87.989013,
145,59th Street,59 St,40.762707,-73.96788,
146,Chambers Street,Chambers St,40.714916,-74.007968,
147,Fulton Street,Fulton St,40.710186,-74.007664,
148,49th Street,49 St,40.759901,-73.984139,
149,34th Street–Herald Square,34 St–Herald Sq,40.749719,-73.987823,
150,23rd Street,23 St,40.742853,-73.992834,
151,Union Square,Union Sq,40.734603,-73.99036,


In [59]:
#remove duplicates? depends on how granular the data for 2019 is

### Now import the turnstile data for 2019

In [60]:
# read_csv accepts a file path directly, so no explicit open() is needed
turn_data = pd.read_csv('Turnstile_Usage_Data__2019.csv')

turn_data.head(20)

Unnamed: 0,C/A,Unit,SCP,Station,Line Name,Division,Date,Time,Description,Entries,Exits
0,A033,R170,02-00-05,14 ST-UNION SQ,LNQR456W,BMT,12/27/2019,00:00:00,REGULAR,17538854,7031168
1,A033,R170,02-00-02,14 ST-UNION SQ,LNQR456W,BMT,12/27/2019,00:00:00,REGULAR,14983900,14554087
2,A033,R170,02-06-00,14 ST-UNION SQ,LNQR456W,BMT,12/27/2019,00:00:00,REGULAR,769115,559221
3,A033,R170,02-00-03,14 ST-UNION SQ,LNQR456W,BMT,12/27/2019,00:00:00,REGULAR,7191422,8417203
4,A033,R170,02-06-01,14 ST-UNION SQ,LNQR456W,BMT,12/27/2019,00:00:00,REGULAR,71047673,20925389
5,A033,R170,02-00-01,14 ST-UNION SQ,LNQR456W,BMT,12/27/2019,00:00:00,REGULAR,2430093,2921770
6,A033,R170,02-00-00,14 ST-UNION SQ,LNQR456W,BMT,12/27/2019,00:00:00,REGULAR,271981,828662
7,A033,R170,02-00-04,14 ST-UNION SQ,LNQR456W,BMT,12/27/2019,00:00:00,REGULAR,6483080,4945335
8,A034,R170,03-00-02,14 ST-UNION SQ,LNQR456W,BMT,12/27/2019,00:00:00,REGULAR,4632187,3269462
9,A034,R170,03-03-00,14 ST-UNION SQ,LNQR456W,BMT,12/27/2019,00:00:00,REGULAR,69926,219187


In [86]:
#find how many unique subway names there are
turn_data['Station'].nunique()

379

#### Columns of interest
- Station: the station name, which I will compare with the Overpass Turbo name data to keep only Manhattan turnstile data
- Date: needed for determining the day of the week and time of year
- Time: needed for determining the time of day
- Entries: needed for determining how many people entered the station, useful for busyness
- Exits: needed for determining how many people exited the station, useful for busyness
- Might also need to include individual turnstile data to see where people are entering/exiting the station, and to compare each turnstile with its previous value to get the difference
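
As a rough illustration of how the Date and Time columns could yield day of the week, time of year and time of day, the sketch below parses them into a single timestamp. The frame `rows` is a made-up sample, and the column names `timestamp`, `day_of_week`, `month` and `hour` are assumptions introduced for this example.

```python
import pandas as pd

# Made-up rows using the Date/Time string formats seen in the turnstile file.
rows = pd.DataFrame({
    "Station": ["14 ST-UNION SQ", "14 ST-UNION SQ"],
    "Date": ["12/27/2019", "12/28/2019"],
    "Time": ["00:00:00", "16:00:00"],
})

# Parse the two string columns into one datetime column.
rows["timestamp"] = pd.to_datetime(
    rows["Date"] + " " + rows["Time"], format="%m/%d/%Y %H:%M:%S"
)

# Derive the calendar features mentioned in the list above.
rows["day_of_week"] = rows["timestamp"].dt.day_name()  # e.g. 'Friday'
rows["month"] = rows["timestamp"].dt.month             # time of year
rows["hour"] = rows["timestamp"].dt.hour               # time of day
print(rows[["timestamp", "day_of_week", "month", "hour"]])
```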

### Drop the rows that are not in Manhattan
#### Change overpass turbo names to match turnstile names

In [61]:
#capitalize all overpass turbo names
subway['name'] = subway['name'].str.upper()
subway['short_name'] = subway['short_name'].str.upper()
subway.head(20)

Unnamed: 0,name,short_name,lat,lon,Link to Turnstile Data
0,96TH STREET,96 ST,40.794379,-73.972,
1,BOWLING GREEN,NONE,40.704509,-74.014095,
2,SOUTH FERRY,NONE,40.70171,-74.013149,
3,86TH STREET,86 ST,40.779494,-73.955529,
4,28TH STREET,28 ST,40.743314,-73.984075,
5,14TH STREET–UNION SQUARE,14 ST–UNION SQ,40.735761,-73.990649,
6,28TH STREET,28 ST,40.745494,-73.988691,
7,CANAL STREET,CANAL ST,40.718092,-73.999892,
8,BROADWAY–LAFAYETTE STREET,B’WAY–LAFAYETTE ST,40.725297,-73.996204,
9,5TH AVENUE,5 AV,40.753743,-73.9819,


In [74]:
# Now search for the short name in the turnstile data
# If it is not there, search for the long name

# Build the set of turnstile station names once so each lookup is fast
turnstile_names = set(turn_data['Station'])

stations_to_extract_from_turnstile = []
change_name_stations_to_extract_from_turnstile = []
for i in range(len(subway)):
    name = subway.iloc[i]['name']
    short_name = subway.iloc[i]['short_name']
    if short_name in turnstile_names:
        stations_to_extract_from_turnstile.append(short_name)
        subway.loc[i,'Link to Turnstile Data'] = short_name
    elif name in turnstile_names:
        stations_to_extract_from_turnstile.append(name)
        subway.loc[i,'Link to Turnstile Data'] = name
    else:
        # No match under either name: record a placeholder to fix by hand
        change_name_stations_to_extract_from_turnstile.append('None for '+name)
        

print(stations_to_extract_from_turnstile)

['96 ST', 'BOWLING GREEN', 'SOUTH FERRY', '86 ST', '28 ST', '28 ST', 'CANAL ST', 'RECTOR ST', 'CHAMBERS ST', 'WALL ST', 'EAST BROADWAY', 'GRAND ST', 'BOWERY', 'BROAD ST', '2 AV', '51 ST', '7 AV', 'PRINCE ST', 'CANAL ST', '23 ST', 'ASTOR PL', 'BLEECKER ST', 'SPRING ST', '33 ST', '1 AV', 'CANAL ST', '14 ST', '3 AV', 'SPRING ST', '23 ST', '28 ST', '50 ST', 'HOUSTON ST', '18 ST', 'FRANKLIN ST', '72 ST', '77 ST', '96 ST', 'DYCKMAN ST', '125 ST', '125 ST', '135 ST', '145 ST', '79 ST', '86 ST', '110 ST', '116 ST', '103 ST', '125 ST', '155 ST', '116 ST', '103 ST', '116 ST', '14 ST', '57 ST', 'CITY HALL', '155 ST', 'CHAMBERS ST', 'PARK PLACE', 'CORTLANDT ST', 'RECTOR ST', '23 ST', 'DYCKMAN ST', '168 ST', '181 ST', '191 ST', '207 ST', '215 ST', '145 ST', '157 ST', '103 ST', '72 ST', '86 ST', '96 ST', 'CANAL ST', '9TH STREET', '14 ST', '14 ST', '33 ST', 'FULTON ST', 'WALL ST', '125 ST', '145 ST', '181 ST', '190 ST', '23 ST', '23 ST', 'CHRISTOPHER ST', '135 ST', '86 ST', '96 ST', '72 ST', 'CANAL S

In [75]:
subway.head(20)

Unnamed: 0,name,short_name,lat,lon,Link to Turnstile Data
0,96TH STREET,96 ST,40.794379,-73.972,96 ST
1,BOWLING GREEN,NONE,40.704509,-74.014095,BOWLING GREEN
2,SOUTH FERRY,NONE,40.70171,-74.013149,SOUTH FERRY
3,86TH STREET,86 ST,40.779494,-73.955529,86 ST
4,28TH STREET,28 ST,40.743314,-73.984075,28 ST
5,14TH STREET–UNION SQUARE,14 ST–UNION SQ,40.735761,-73.990649,14 ST-UNION SQ
6,28TH STREET,28 ST,40.745494,-73.988691,28 ST
7,CANAL STREET,CANAL ST,40.718092,-73.999892,CANAL ST
8,BROADWAY–LAFAYETTE STREET,B’WAY–LAFAYETTE ST,40.725297,-73.996204,B'WAY-LAFAYETTE
9,5TH AVENUE,5 AV,40.753743,-73.9819,5 AVE


In [76]:
print(change_name_stations_to_extract_from_turnstile)
#will need to change these names manually

['None for 14TH STREET–UNION SQUARE', 'None for BROADWAY–LAFAYETTE STREET', 'None for 5TH AVENUE', 'None for 42ND STREET–GRAND CENTRAL', 'None for 5TH AVENUE–59TH STREET', 'None for HARLEM–148TH STREET', 'None for 163RD STREET–AMSTERDAM AVENUE', 'None for WTC CORTLANDT', 'None for WEST 4TH STREET–WASHINGTON SQUARE', 'None for ESSEX STREET', 'None for LEXINGTON AVENUE–63RD STREET', 'None for 42ND STREET–BRYANT PARK', 'None for LEXINGTON AVENUE–59TH STREET', 'None for 5TH AVENUE–53RD STREET', 'None for 57TH STREET–7TH AVENUE', 'None for 8TH STREET–NEW YORK UNIVERSITY', 'None for 42ND STREET–PORT AUTHORITY BUS TERMINAL', 'None for ROOSEVELT ISLAND', 'None for NEW YORK PENN STATION', 'None for 66TH STREET–LINCOLN CENTER', 'None for WORLD TRADE CENTER', 'None for CENTRAL PARK NORTH–110TH STREET', 'None for BROOKLYN BRIDGE–CITY HALL', 'None for 42ND STREET–TIMES SQUARE', 'None for MARBLE HILL–225TH STREET', 'None for 137TH STREET–CITY COLLEGE', 'None for CATHEDRAL PARKWAY–110TH STREET', 'Non

In [77]:
print(len(change_name_stations_to_extract_from_turnstile))

56


In [78]:
change_name_stations_to_extract_from_turnstile.append('14 ST-UNION SQ')
change_name_stations_to_extract_from_turnstile.append('B\'WAY-LAFAYETTE')
change_name_stations_to_extract_from_turnstile.append('5 AVE')
change_name_stations_to_extract_from_turnstile.append('GRD CNTRL-42 ST')
change_name_stations_to_extract_from_turnstile.append('5 AV/59 ST')
change_name_stations_to_extract_from_turnstile.append('HARLEM 148 ST')
change_name_stations_to_extract_from_turnstile.append('163 ST-AMSTERDM')
change_name_stations_to_extract_from_turnstile.append('WTC-CORTLANDT')
change_name_stations_to_extract_from_turnstile.append('W 4 ST-WASH SQ')
change_name_stations_to_extract_from_turnstile.append('DELANCEY/ESSEX')
change_name_stations_to_extract_from_turnstile.append('LEXINGTON AV/63')
change_name_stations_to_extract_from_turnstile.append('42 ST-BRYANT PK')
change_name_stations_to_extract_from_turnstile.append('None')
change_name_stations_to_extract_from_turnstile.append('5 AV/53 ST')
change_name_stations_to_extract_from_turnstile.append('57 ST-7 AV')
change_name_stations_to_extract_from_turnstile.append('8 ST-NYU')
change_name_stations_to_extract_from_turnstile.append('None as bus terminal')
change_name_stations_to_extract_from_turnstile.append('ROOSEVELT ISLND')
change_name_stations_to_extract_from_turnstile.append('34 ST-PENN STA')
change_name_stations_to_extract_from_turnstile.append('66 ST-LINCOLN')
change_name_stations_to_extract_from_turnstile.append('WORLD TRADE CTR')
change_name_stations_to_extract_from_turnstile.append('CENTRAL PK N110')
change_name_stations_to_extract_from_turnstile.append('BROOKLYN BRIDGE')
change_name_stations_to_extract_from_turnstile.append('TIMES SQ-42 ST')
change_name_stations_to_extract_from_turnstile.append('MARBLE HILL-225')
change_name_stations_to_extract_from_turnstile.append('137 ST CITY COL')
change_name_stations_to_extract_from_turnstile.append('CATHEDRAL PKWY')
change_name_stations_to_extract_from_turnstile.append('116 ST-COLUMBIA')
change_name_stations_to_extract_from_turnstile.append('34 ST-PENN STA')
change_name_stations_to_extract_from_turnstile.append('TIMES SQ-42 ST')
change_name_stations_to_extract_from_turnstile.append('CHRISTOPHER ST')
change_name_stations_to_extract_from_turnstile.append('CATHEDRAL PKWY')
change_name_stations_to_extract_from_turnstile.append('68ST-HUNTER CO')
change_name_stations_to_extract_from_turnstile.append('34 ST-PENN STA')
change_name_stations_to_extract_from_turnstile.append('LEXINGTON AV/53')
change_name_stations_to_extract_from_turnstile.append('47-50 STS ROCK')
change_name_stations_to_extract_from_turnstile.append('DELANCEY/ESSEX')
change_name_stations_to_extract_from_turnstile.append('34 ST-HERALD SQ')
change_name_stations_to_extract_from_turnstile.append('81 ST-MUSEUM')
change_name_stations_to_extract_from_turnstile.append('59 ST COLUMBUS')
change_name_stations_to_extract_from_turnstile.append('WHITEHALL S-FRY')
change_name_stations_to_extract_from_turnstile.append('WORLD TRADE CTR')
change_name_stations_to_extract_from_turnstile.append('14 ST-UNION SQ')
change_name_stations_to_extract_from_turnstile.append('GRD CNTRL-42 ST')
change_name_stations_to_extract_from_turnstile.append('GRD CNTRL-42 ST')
change_name_stations_to_extract_from_turnstile.append('INWOOD-207 ST')
change_name_stations_to_extract_from_turnstile.append('125 ST')
change_name_stations_to_extract_from_turnstile.append('MARBLE HILL-225')
change_name_stations_to_extract_from_turnstile.append('GRD CNTRL-42 ST')
change_name_stations_to_extract_from_turnstile.append('34 ST-HERALD SQ')
change_name_stations_to_extract_from_turnstile.append('14 ST-UNION SQ')
change_name_stations_to_extract_from_turnstile.append('34 ST-HUDSON YD')
change_name_stations_to_extract_from_turnstile.append('59 ST COLUMBUS')
change_name_stations_to_extract_from_turnstile.append('TIMES SQ-42 ST')
change_name_stations_to_extract_from_turnstile.append('TIMES SQ-42 ST')
change_name_stations_to_extract_from_turnstile.append('TIMES SQ-42 ST')

In [79]:
len(change_name_stations_to_extract_from_turnstile)

112

- No turnstile match for LEXINGTON AVENUE–59TH STREET (entered as 'None' above)

- Triple of 34 ST-PENN STA
- Duplicate of TIMES SQ-42 ST
- Duplicate of CATHEDRAL PKWY
- Duplicate of DELANCEY/ESSEX
- Duplicate of WORLD TRADE CTR
- Duplicate of 14 ST-UNION SQ
- Quadruple of GRD CNTRL-42 ST
- Duplicate of 125 ST
- Duplicate of MARBLE HILL-225
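
Repeats like the ones listed above could also be counted programmatically rather than by eye. The sketch below uses `collections.Counter` on a short excerpt of the manually entered names; `manual_names` and `repeated` are names introduced for this example, and the excerpt is not the full list.

```python
from collections import Counter

# A short excerpt of the manually entered turnstile names (not the full list).
manual_names = [
    '34 ST-PENN STA',
    'TIMES SQ-42 ST',
    'CATHEDRAL PKWY',
    '34 ST-PENN STA',
    'GRD CNTRL-42 ST',
    'CATHEDRAL PKWY',
    '34 ST-PENN STA',
    '8 ST-NYU',
]

# Count how often each name occurs and keep only those that occur more than once.
counts = Counter(manual_names)
repeated = {name: n for name, n in counts.items() if n > 1}
print(repeated)
```

Running the same count over the full manual list would show every repeated name together with how many times it occurs.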



In [80]:
stations_to_extract_from_turnstile = []
# The first 56 entries of the list are the 'None for ...' placeholders,
# so the manually entered names start at index 56
j=56
#change_name_stations_to_extract_from_turnstile = []

# Build the set of turnstile station names once so each lookup is fast
turnstile_names = set(turn_data['Station'])

for i in range(len(subway)):
    name = subway.iloc[i]['name']
    short_name = subway.iloc[i]['short_name']
    if short_name in turnstile_names:
        stations_to_extract_from_turnstile.append(short_name)
        subway.loc[i,'Link to Turnstile Data'] = short_name
    elif name in turnstile_names:
        stations_to_extract_from_turnstile.append(name)
        subway.loc[i,'Link to Turnstile Data'] = name
    else:
        # Unmatched stations take the next manual name, in the same order
        subway.loc[i,'Link to Turnstile Data'] = change_name_stations_to_extract_from_turnstile[j]
        j=j+1

subway.head(20)

Unnamed: 0,name,short_name,lat,lon,Link to Turnstile Data
0,96TH STREET,96 ST,40.794379,-73.972,96 ST
1,BOWLING GREEN,NONE,40.704509,-74.014095,BOWLING GREEN
2,SOUTH FERRY,NONE,40.70171,-74.013149,SOUTH FERRY
3,86TH STREET,86 ST,40.779494,-73.955529,86 ST
4,28TH STREET,28 ST,40.743314,-73.984075,28 ST
5,14TH STREET–UNION SQUARE,14 ST–UNION SQ,40.735761,-73.990649,14 ST-UNION SQ
6,28TH STREET,28 ST,40.745494,-73.988691,28 ST
7,CANAL STREET,CANAL ST,40.718092,-73.999892,CANAL ST
8,BROADWAY–LAFAYETTE STREET,B’WAY–LAFAYETTE ST,40.725297,-73.996204,B'WAY-LAFAYETTE
9,5TH AVENUE,5 AV,40.753743,-73.9819,5 AVE


In [82]:
#show the entire data frame
#pd.set_option('display.max_rows', None)
#subway

In [84]:
#remove any row that has a duplicate value in the link to turnstile data column
subway = subway.drop_duplicates(subset=['Link to Turnstile Data'], keep='first')
len(subway)

96

### We now have 96 subway locations in Manhattan that we can reference in the turnstile data
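
One possible way to use these linked names to drop the non-Manhattan rows is sketched below with `Series.isin`. The frames `subway_sample` and `turn_sample` are made-up stand-ins for `subway` and `turn_data`, the station name `NOT IN MANHATTAN` is a placeholder, and excluding the `'None'` markers is an assumption about how unmatched stations should be handled.

```python
import pandas as pd

# Made-up stand-ins for the `subway` and `turn_data` frames built above.
subway_sample = pd.DataFrame({
    "Link to Turnstile Data": ["14 ST-UNION SQ", "96 ST", "None"],
})
turn_sample = pd.DataFrame({
    "Station": ["14 ST-UNION SQ", "96 ST", "NOT IN MANHATTAN", "14 ST-UNION SQ"],
    "Entries": [17538854, 100, 200, 14983900],
})

# Assumption: the 'None' style markers mean "no turnstile match" and are excluded.
manhattan_names = set(subway_sample["Link to Turnstile Data"]) - {
    "None",
    "None as bus terminal",
}

# Keep only turnstile rows whose station name was linked to a Manhattan station.
manhattan_turn = turn_sample[turn_sample["Station"].isin(manhattan_names)]
print(manhattan_turn)
```

If a station name also occurs outside Manhattan in the turnstile file, a name-only filter would keep those rows too, so a further check (for example on `Line Name`) may be needed.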
