In [121]:
import scrapy
import requests
import time
import shapely
import json
import math
import re
from shapely.geometry import asShape
from collections import defaultdict
from pandas import DataFrame as df

## When last we left off...

We had produced a neighborhood data JSON file and a list of subway stops that we could combine
to make a map.

Now we want to focus in on elevator outages and how long they're expected to take to resolve.

In [2]:
# Start by reloading our core datasets:
with open('neighborhood_to_stations.json') as infile:
    neighborhood_to_stations=json.load(infile)
with open('stations_to_neighborhoods.json') as infile:
    stations_to_neighborhoods=json.load(infile)
with open('mta_stations_sorted_by_name.json') as infile:
    mta_stations_sorted_by_name=json.load(infile)
with open('nyca_stations.json') as infile:
    nyca_stations=json.load(infile)
with open('all_nyca_station_details.json') as infile:
    all_nyca_station_details=json.load(infile)
with open('neighborhood_data.json') as infile:
    neighborhood_data=json.load(infile)

Unfortunately, I left myself a bit of a mess. The geotagged station data from the MTA doesn't have any
simple id to join it against the accessibility data from NYCA. Let's work on that problem.

Let's see if I can correlate name+line across datasets and make it work.

... [1 hour later]... Not automatically, but I can do a manual alignment by staring hard at Excel.
Dump both files to CSV ... do a ton of manual analysis ... reimport the aligned file.
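
For reference, here is a minimal sketch of the kind of automatic name+line matching that could pre-fill part of that spreadsheet. It is not what I actually ran. It assumes both datasets have been reduced to `(name, lines, id)` rows where `lines` is already a collection of line names, and `suggest_matches` plus the toy rows are invented for illustration. It uses `difflib` from the standard library to find similar names, then keeps only candidates that share at least one line.

```python
import difflib

def suggest_matches(mta_rows, nyca_rows, cutoff=0.8):
    """Map each MTA id to the NYCA ids with a similar name and a shared line."""
    nyca_names = sorted({name for name, _, _ in nyca_rows})
    suggestions = {}
    for mta_name, mta_lines, mta_id in mta_rows:
        # Names whose similarity ratio to the MTA name is at least `cutoff`
        close = set(difflib.get_close_matches(mta_name, nyca_names, n=5, cutoff=cutoff))
        suggestions[mta_id] = [
            nyca_id
            for nyca_name, nyca_lines, nyca_id in nyca_rows
            if nyca_name in close and set(mta_lines) & set(nyca_lines)
        ]
    return suggestions

# Toy rows, made up for this example (not taken from the real files)
toy_mta = [
    ('103rd St', ('1',), '159'),
    ('Union Sq - 14th St', ('L',), '100'),
]
toy_nyca = [
    ('103rd St', ('1',), 1),
    ('103rd St', ('6',), 2),
    ('14th St - Union Sq', ('L', 'N'), 34),
]
suggestions = suggest_matches(toy_mta, toy_nyca)
```

In the toy data the two `103rd St` rows are told apart by their lines, but the reordered Union Square name falls below the similarity cutoff and gets no suggestion. Cases like that are why a manual pass is still needed.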

In [20]:
mta_sorted = sorted([
    (m['properties']['name'],m['properties']['line'],m['properties']['objectid'])
    for m in mta_stations_sorted_by_name
])
nyca_sorted = sorted([
    (n['name'],tuple(n['lines']),n['id'])
    for n in nyca_stations
]) 

In [31]:
df(mta_sorted).to_csv('mta_sorted.csv')

In [32]:
df(nyca_sorted).to_csv('nyca_sorted.csv')

In [97]:
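# DataFrame.from_csv was removed in pandas 1.0; read_csv with these arguments is the equivalent call.
import pandas as pd
dr1=pd.read_csv('aligned_mta_nyca.csv',index_col=0,parse_dates=True)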

In [98]:
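# Build a list of (mta_id, nyca_id) tuples from the aligned spreadsheet.
# The id columns come back as floats, with NaN for blank cells, so convert them back.
def format_mta_id(flt):
    # MTA objectids are strings in the station JSON
    if not flt or math.isnan(flt):
        return None
    else:
        return str(int(flt))

def format_nyca_id(flt):
    # NYCA station ids are ints in nyca_stations
    if not flt or math.isnan(flt):
        return None
    else:
        return int(flt)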

aligned_objectids = [
    (format_mta_id(r[3]),format_nyca_id(r[7]))
    for r in dr1.to_records()
]
mta_stations_to_name={ms['properties']['objectid']:ms['properties']['name'] for ms in mta_stations_sorted_by_name}
nyca_stations_to_name={ns['id']:ns['name'] for ns in nyca_stations}
first_rec=aligned_objectids[0]
(mta_stations_to_name[first_rec[0]],nyca_stations_to_name[first_rec[1]]) # confirming they match

('103rd St', '103rd St')

... whew! That was a pain.
Now we need to go:
* Elevator outage -> NYCA station ID -> MTA station ID -> (point) -> Neighborhood

There were some NYCA station IDs that mapped to multiple MTA stations. Let's see if any of those span neighborhoods.

In [102]:
nyca_station_to_ntacode=defaultdict(set)
#aligned_objectids[0][1]
#stations_to_neighborhoods[aligned_objectids[0][0]]
for mta_id,nyca_id in aligned_objectids:
    if mta_id is None:
        print('Missing mta_id for nyca_id {}'.format(nyca_id))
    elif nyca_id is None:
        print('Missing nyca_id for mta_id {}'.format(mta_id))
    elif mta_id in stations_to_neighborhoods:
        neighborhoods=stations_to_neighborhoods[mta_id]
        if len(neighborhoods) == 1:
            nyca_station_to_ntacode[nyca_id].add(neighborhoods[0])
        else:
            print('Multi-neighborhood data for {}/{}'.format(mta_id,nyca_id))
    else:
        print('Missing neighborhood data for {}'.format(mta_id))

ntacode_to_name={nb['properties']['NTACode']:nb['properties']['NTAName'] for nb in neighborhood_data['features']}
for nyca_id,ntacodes in nyca_station_to_ntacode.items():
    if len(ntacodes) != 1:
        print('NYCA id {} ({}) maps to neighborhoods {}'.format(
            nyca_id,
            nyca_stations_to_name[nyca_id],
            [ntacode_to_name[nb] for nb in ntacodes]))

Missing mta_id for nyca_id 372
Missing nyca_id for mta_id 470
Missing nyca_id for mta_id 442
Missing nyca_id for mta_id 641
Missing nyca_id for mta_id 642
Missing nyca_id for mta_id 443
Missing nyca_id for mta_id 643
Missing nyca_id for mta_id 153
Missing nyca_id for mta_id 371
NYCA id 49 (Atlantic Ave) maps to neighborhoods ['Fort Greene', 'Park Slope-Gowanus']
NYCA id 57 (Borough Hall) maps to neighborhoods ['DUMBO-Vinegar Hill-Downtown Brooklyn-Boerum Hill', 'Brooklyn Heights-Cobble Hill']
NYCA id 274 (Broadway Junction) maps to neighborhoods ['Bushwick South', 'Ocean Hill']
NYCA id 471 (Grand Central - 42nd St) maps to neighborhoods ['Turtle Bay-East Midtown', 'Murray Hill-Kips Bay', 'Midtown-Midtown South']
NYCA id 34 (Union Sq - 14th St) maps to neighborhoods ['Hudson Yards-Chelsea-Flatiron-Union Square', 'West Village']


In [140]:
nyca_station_to_ntacode[106]

{'MN12'}

... OK. Loaded. Mapped. I'm satisfied that the mappings are good enough. Now let's do what we came here
to do and map outages to neighborhoods. 

## Let's bring it all together...

In [117]:
nyca_elevator_outages=[
    (station_nyca_id,m['out_since'],m['return_date_display'])
    for station_nyca_id,station_info in all_nyca_station_details.items()
    for m in station_info['machines']
    if m['out_since'] and m['human_eq_type']=='Elevator'
]
nyca_elevator_outages

[('445', 'about 3 hours', '3 days (04/04 12:00 AM)'),
 ('291', 'about 9 hours', '1 day (04/03  2:05 AM)'),
 ('397', 'about 11 hours', '1 day (04/03 12:40 AM)'),
 ('438', 'about 1 hour', 'about 2 hours (04/01  1:00 PM)'),
 ('3', '5 days', '7 months (10/31  9:49 AM)'),
 ('3', '5 days', '7 months (10/31 10:32 AM)'),
 ('315', '26 minutes', '2 days (04/03 11:54 AM)'),
 ('25', 'about 8 hours', '1 day (04/03  3:55 AM)'),
 ('57', '1 day', '1 day (04/03 12:00 AM)'),
 ('434', 'about 8 hours', '1 day (04/03  3:55 AM)'),
 ('5', '8 months', 'about 1 month (05/11 10:00 AM)'),
 ('5', '8 months', 'about 1 month (05/11 12:00 AM)'),
 ('5', '8 months', 'about 1 month (05/11 12:00 AM)'),
 ('223', 'about 2 hours', 'about 5 hours (04/01  4:00 PM)'),
 ('471', 'about 12 hours', '1 day (04/03  2:00 AM)'),
 ('241', '1 day', 'about 13 hours (04/02 12:00 AM)'),
 ('46', '1 day', 'about 17 hours (04/02  4:31 AM)')]

On a human scale, it's not clear how to evaluate these outages. A 1-day outage sucks. A 12-month outage sucks a lot
more, but not 365 times as much. 

Originally I was going to discount when an elevator was one of several at a station - but that's not fair; we
don't know whether the elevator is the only one *on its platform*.

So instead I'm going to badness-score it with simple regexes:
* 'day' - 1 point
* 'days' - 2 points
* 'month' - 5 points
* anything else (second, minute, hour) - 0.5 points

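Each outage gets the max of its two scores (time already out, expected time until return).
Multiple elevator outages at the same station don't count multiple times; the station gets the score of its worst outage.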

However, multiple elevator outages in the **same neighborhood** at different stations do count multiple times.

In [131]:
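def score_text(text):
    """Bucket a duration string such as '7 months (10/31  9:49 AM)' into a badness score."""
    if re.search('month',text):
        return 5
    # 'days' has to be tested before 'day': every string containing 'days' also contains 'day'
    elif re.search('days',text):
        return 2
    elif re.search('day',text):
        return 1
    else:
        return 0.5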

def score_outage(outage):
    (nyca_id,out_since,return_date)=outage
    return max(
        score_text(out_since),
        score_text(return_date))

def flatten_scores(outages):
    station_scores=defaultdict(lambda:0)
    for outage in outages:
        station_scores[outage[0]] = max(
            station_scores[outage[0]],
            score_outage(outage)
        )
    return dict(station_scores)

station_scores=flatten_scores(nyca_elevator_outages)
station_scores

{'223': 0.5,
 '241': 1,
 '25': 1,
 '291': 1,
 '3': 5,
 '315': 2,
 '397': 1,
 '434': 1,
 '438': 0.5,
 '445': 2,
 '46': 1,
 '471': 1,
 '5': 5,
 '57': 1}
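
The buckets above are one way to make a long outage count more than a short one without counting 365 times as much. As an alternative sketch (not the scoring used in the rest of this notebook), the duration text could be parsed into an approximate number of days and then log-scaled. The `UNIT_DAYS` table, `approx_days`, and `log_score` are names I'm making up here, and treating a month as 30 days is an assumption.

```python
import math
import re

# Approximate length of each unit in days (assumption: a month is 30 days).
UNIT_DAYS = {'minute': 1 / 1440, 'hour': 1 / 24, 'day': 1, 'month': 30, 'year': 365}

def approx_days(text):
    """Parse strings like 'about 3 hours' or '7 months (10/31  9:49 AM)' into days."""
    match = re.search(r'(\d+)\s+(minute|hour|day|month|year)s?', text)
    if match is None:
        return None
    return int(match.group(1)) * UNIT_DAYS[match.group(2)]

def log_score(text):
    """Log-scaled badness: grows with the outage length, but far slower than linearly."""
    days = approx_days(text)
    return 0.0 if days is None else math.log1p(days)
```

With this scaling a 12-month outage scores less than ten times a 1-day outage, which is closer to the intuition than a linear count of days.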

Last step! Take the scored station data and turn it into scored neighborhood data.

If an elevator outage is in one of my 5 monster stations
(Atlantic Ave, Borough Hall, Broadway Junction, Grand Central, and Union Sq) that span neighborhoods, I'm
going to count it as a broken elevator in every neighborhood that station spans.

TODO: get fancy and divide by total number of stations in that neighborhood - but not right now.
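
A rough sketch of what that TODO could look like, for whenever I get to it. It assumes a dict mapping each NTA code to the list of stations in that neighborhood; `neighborhood_to_stations` may already have that shape, but I haven't checked, so the function takes the mapping as a parameter. `normalize_by_station_count` and the toy inputs are invented for this example.

```python
def normalize_by_station_count(neighborhood_scores, stations_per_neighborhood):
    """Divide each neighborhood's outage score by its number of stations."""
    normalized = {}
    for ntacode, score in neighborhood_scores.items():
        station_count = len(stations_per_neighborhood.get(ntacode, []))
        # Skip neighborhoods with no known stations rather than divide by zero
        if station_count:
            normalized[ntacode] = score / station_count
    return normalized

# Toy inputs, made up for illustration
toy_scores = {'MN19': 6, 'QN17': 0.5}
toy_stations = {'MN19': ['a', 'b', 'c'], 'QN17': ['d']}
normalized = normalize_by_station_count(toy_scores, toy_stations)
```

Dividing by station count turns the score into something like "average badness per station", so a neighborhood with many stations is not penalized just for having more elevators that can break.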

In [142]:
def station_scores_to_neighborhood_data(station_scores):
    neighborhood_scores=defaultdict(lambda:0)
    for station_id,station_score in station_scores.items():
        for neighborhood in nyca_station_to_ntacode[int(station_id)]:
            neighborhood_scores[neighborhood] += station_score
    return dict(neighborhood_scores)

neighborhood_scores=station_scores_to_neighborhood_data(station_scores)
neighborhood_scores

{'BK09': 1,
 'BK28': 2,
 'BK38': 1,
 'BK60': 1,
 'BX49': 1,
 'BX63': 1,
 'MN13': 2,
 'MN17': 1.5,
 'MN19': 6,
 'MN20': 1,
 'MN23': 1,
 'MN24': 1,
 'MN27': 5,
 'MN35': 1,
 'QN17': 0.5}

In [145]:
# And, once again, enrich the original neighborhood_data geojson with neighborhood_score:
for n in neighborhood_data['features']:
    n_id=n['properties']['NTACode']
    n['properties']['elevator_outage_score'] = neighborhood_scores.get(n_id,0)

#... and write it back out:
with open('neighborhood_data_current_outages.json','w') as outfile:
    json.dump(neighborhood_data,outfile)