## About

Let's test the efficacy of our trip outlining algorithm by doing some preliminary data visualization and data analysis on specific well-known cases for which a [mytransit archive](http://data.mytransit.nyc/subway_time/) exists.

The well-known cases in particular are:

    2017:
    May 24 on BDFM
    May 19 (small probs on 1, 2, 3, 4, 5, F, N, Q, R & W)
    May 12 (4 5 6 http://gothamist.com/2017/05/12/pithier_headline_tk.php)
    May 9 (loss of power at dekalb)
    May 7 (loss of power at dekalb)
    April 24 (lettered lines but also possible 4 and 6 delays?)
    April 21 (BDE rerouted, mostly lettered lines)
    March 6 (A C F http://gothamist.com/2017/03/06/a_c_f_subway_monday.php)
    
    2016:
    Dec 26 http://gothamist.com/2016/12/26/subway_service_issues_on_b_d_and_e.php A, D, E, F, N, Q & R
    Oct 13 stabbing A, B, C, D & F http://gothamist.com/2016/10/13/subway_stabbing_42nd_st.php
    The gothamist archive ended at May 2016.
    
I don't have an archive of my own set up yet (in part because the Transit Center said that they are working on creating such a thing and making it a public resource), so I am relying on the mytransit archive for data. Keep in mind the limitations on which feeds were available and when, and on which of those made it into the mytransit archive: the B and D were only added recently, and the archive itself recently stopped publishing.

In [1]:
import sys; sys.path.append("../src/")
from processing import parse_feeds_into_trip_logbook, mta_archival_time_to_unix_timestamp

## May 12 2016 GTFS (1...6)

This day saw significant system-wide delays due to a confluence of incidents, with a peak on the A, C, E line. According to the [Gothamist article](http://gothamist.com/2017/05/12/pithier_headline_tk.php), however, there was also spillover onto service on the 1...6 lines.

### Localizing Data

Note: using the `tarfile` or `lzma` builtins to read this file raises a persistent `embedded NUL character` `TypeError`. The Linux archive manager reads these files just fine and I don't have the patience to debug, so we'll just use bash here instead.

In [1]:
import requests
r = requests.get("http://data.mytransit.nyc.s3.amazonaws.com/subway_time/2016/2016-05/subway_time_20160512.tar.xz")
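The URL above appears to encode the date three times (year folder, year-month folder, and file name). A minimal sketch of building it for other days follows, assuming that layout holds across the archive; only the single URL above is confirmed here, and `archive_url` is a hypothetical helper name.

```python
from datetime import date

# Base taken from the download cell above. The year/month/day layout is an
# assumption generalized from that single URL.
BASE = "http://data.mytransit.nyc.s3.amazonaws.com/subway_time"


def archive_url(day):
    """Build the (assumed) archive URL for one calendar day."""
    return "{base}/{d:%Y}/{d:%Y-%m}/subway_time_{d:%Y%m%d}.tar.xz".format(
        base=BASE, d=day
    )


print(archive_url(date(2016, 5, 12)))
```

For May 12 2016 this reproduces the URL used in the cell above; whether an archive actually exists for any other date still has to be checked by hand.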

In [12]:
!mkdir data
!mkdir data/subway_time_20160512

In [13]:
with open("./data/subway_time_20160512/arch.tar.xz", "wb") as f:
    f.write(r.content)

In [15]:
!cd ./data/subway_time_20160512; tar xvfJ arch.tar.xz

gtfs-20160512T0400Z
gtfs-20160512T0401Z
gtfs-20160512T0402Z
gtfs-20160512T0404Z
gtfs-20160512T0405Z
gtfs-20160512T0406Z
gtfs-20160512T0407Z
gtfs-20160512T0408Z
gtfs-20160512T0409Z
gtfs-20160512T0410Z
gtfs-20160512T0411Z
gtfs-20160512T0412Z
gtfs-20160512T0413Z
gtfs-20160512T0414Z
gtfs-20160512T0415Z
gtfs-20160512T0416Z
gtfs-20160512T0417Z
gtfs-20160512T0418Z
gtfs-20160512T0419Z
gtfs-20160512T0420Z
gtfs-20160512T0421Z
gtfs-20160512T0422Z
gtfs-20160512T0423Z
gtfs-20160512T0424Z
gtfs-20160512T0425Z
gtfs-20160512T0426Z
gtfs-20160512T0427Z
gtfs-20160512T0428Z
gtfs-20160512T0429Z
gtfs-20160512T0430Z
gtfs-20160512T0431Z
gtfs-20160512T0432Z
gtfs-20160512T0433Z
gtfs-20160512T0434Z
gtfs-20160512T0435Z
gtfs-20160512T0436Z
gtfs-20160512T0437Z
gtfs-20160512T0438Z
gtfs-20160512T0439Z
gtfs-20160512T0440Z
gtfs-20160512T0441Z
gtfs-20160512T0442Z
gtfs-20160512T0443Z
gtfs-20160512T0444Z
gtfs-20160512T0445Z
gtfs-20160512T0446Z
gtfs-20160512T0447Z
gtfs-20160512T0448Z
gtfs-20160512T0449Z
gtfs-20160512T0450Z


### Parsing Data

In [1]:
import os

logs = [f for f in os.listdir("./data/subway_time_20160512") if f != 'arch.tar.xz' 
        and 'si' not in f and 'l' not in f]

In [2]:
logs[:5]

['gtfs-20160512T0415Z',
 'gtfs-20160512T1759Z',
 'gtfs-20160512T2155Z',
 'gtfs-20160512T0610Z',
 'gtfs-20160513T0153Z']
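The substring filter above works, but it would also drop any file that merely happens to contain an `l` or `si` somewhere in its name. A stricter variant could match the whole file name instead. This is only a sketch: the pattern is inferred from the names printed above, and the tagged names in `sample` (`gtfs-l-...`, `gtfs-si-...`) are hypothetical examples of what the other feeds might look like.

```python
import re

# Assumed naming scheme for the main feed: "gtfs-<YYYYMMDD>T<HHMM>Z".
MAIN_FEED = re.compile(r"^gtfs-\d{8}T\d{4}Z$")


def select_main_feed(filenames):
    """Return main-feed snapshot names, sorted so they are chronological."""
    # Zero-padded, fixed-width stamps sort lexicographically in time order.
    return sorted(f for f in filenames if MAIN_FEED.match(f))


# Hypothetical directory listing, in arbitrary order like os.listdir returns.
sample = [
    "gtfs-20160512T0415Z",
    "arch.tar.xz",
    "gtfs-l-20160512T0415Z",
    "gtfs-si-20160512T0415Z",
    "gtfs-20160512T0400Z",
]
print(select_main_feed(sample))
```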

In [3]:
from google.transit import gtfs_realtime_pb2

A fun minimal complaint:

    pip install requests gtfs-realtime-bindings
    python -c "import requests; r = requests.get('http://data.mytransit.nyc.s3.amazonaws.com/subway_time/2016/2016-05/subway_time_20160512.tar.xz'); open('arch.tar.xz', 'wb').write(r.content)"
    tar xvfJ arch.tar.xz
    python -c "from google.transit import gtfs_realtime_pb2; test_example = gtfs_realtime_pb2.FeedMessage().ParseFromString(open('gtfs-20160512T0400Z', 'rb').read()); print(type(test_example))"

Basic example:

In [4]:
with open("./data/subway_time_20160512/gtfs-20160512T0400Z", "rb") as f:
    fm = gtfs_realtime_pb2.FeedMessage()
    fm.ParseFromString(f.read())

In [5]:
len(fm.entity)

163
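Those 163 entities are a mix of message kinds. In the GTFS-Realtime schema each entity carries a `trip_update`, a `vehicle`, or an `alert`, so a quick breakdown can be sketched as below. `count_entity_kinds` is a hypothetical helper, not part of `processing`; it only relies on the protobuf `HasField` method.

```python
from collections import Counter


def count_entity_kinds(feed_message):
    """Count how many entities carry each kind of GTFS-Realtime payload."""
    counts = Counter()
    for entity in feed_message.entity:
        for kind in ("trip_update", "vehicle", "alert"):
            # HasField is True only if the sub-message was actually set.
            if entity.HasField(kind):
                counts[kind] += 1
    return counts
```

Calling `count_entity_kinds(fm)` on the message parsed above should give the split between trip updates and vehicle positions for that snapshot.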

In [6]:
from tqdm import tqdm

In [7]:
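def parse_feed(filepath):
    """Parse one archived snapshot; return None if it cannot be decoded."""
    with open(filepath, "rb") as f:
        fm = gtfs_realtime_pb2.FeedMessage()
        try:
            fm.ParseFromString(f.read())
        except Exception:
            # Corrupt or truncated snapshot. KeyboardInterrupt and SystemExit
            # are not Exception subclasses, so they still propagate.
            return None
        return fm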

In [8]:
logs = sorted(logs)

In [9]:
feeds = [parse_feed("./data/subway_time_20160512/" + l) for l in tqdm(logs[:120])]

100%|██████████| 120/120 [00:11<00:00, 10.79it/s]


In [10]:
logs[:5]

['gtfs-20160512T0400Z',
 'gtfs-20160512T0401Z',
 'gtfs-20160512T0402Z',
 'gtfs-20160512T0404Z',
 'gtfs-20160512T0405Z']

In [11]:
information_dates = [log.split("-")[-1][:-1] for log in logs]

In [12]:
information_dates[:5]

['20160512T0400',
 '20160512T0401',
 '20160512T0402',
 '20160512T0404',
 '20160512T0405']
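Note that `0403` is missing from the list above, so the snapshots are not perfectly minute-by-minute. A small sketch for turning these stamps into Unix timestamps and spotting gaps follows. It assumes the trailing `Z` in the file names means UTC, and both function names are hypothetical stand-ins rather than the helpers in `processing`.

```python
from datetime import datetime, timedelta, timezone

STAMP_FORMAT = "%Y%m%dT%H%M"


def to_unix(information_date):
    """Convert a stamp like '20160512T0400' to a Unix timestamp (UTC assumed)."""
    parsed = datetime.strptime(information_date, STAMP_FORMAT)
    return int(parsed.replace(tzinfo=timezone.utc).timestamp())


def missing_minutes(information_dates):
    """List minute stamps absent between the first and last snapshot."""
    present = set(information_dates)
    current = datetime.strptime(min(present), STAMP_FORMAT)
    end = datetime.strptime(max(present), STAMP_FORMAT)
    missing = []
    while current <= end:
        stamp = current.strftime(STAMP_FORMAT)
        if stamp not in present:
            missing.append(stamp)
        current += timedelta(minutes=1)
    return missing


stamps = ["20160512T0400", "20160512T0401", "20160512T0402",
          "20160512T0404", "20160512T0405"]
print(to_unix("20160512T0400"))   # 1463025600
print(missing_minutes(stamps))    # ['20160512T0403']
```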

In [13]:
len([feed for feed in feeds if feed is None])

0

Here are two hours of MTA train data:

In [17]:
import sys; sys.path.append("../src/")
from processing import parse_feeds_into_trip_logbook

In [23]:
bad_feed = feeds[information_dates.index('20160512T0532')]

messages = []

for message in bad_feed.entity:
    if message.trip_update.trip.trip_id == '147200_1..N02X017':
        messages.append(message)
        break
    elif message.vehicle.trip.trip_id == '147200_1..N02X017':
        messages.append(message)
        break        

In [24]:
messages

[id: "000002"
 vehicle {
   trip {
     trip_id: "147200_1..N02X017"
     start_date: "20160512"
     route_id: "1"
   }
   current_stop_sequence: 31
   current_status: STOPPED_AT
   timestamp: 1463030924
   stop_id: "107N"
 }]
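The loop above stops at the first match, so it would hide a second entity for the same trip if the snapshot had one. Here the single hit is a `vehicle` message with no accompanying `trip_update`. A sketch that collects every match instead is below; `entities_for_trip` is a hypothetical helper name. It leans on the fact that reading an unset protobuf sub-message returns default values, so `trip_id` is simply an empty string there.

```python
def entities_for_trip(feed_message, trip_id):
    """Collect every entity in one snapshot that refers to trip_id."""
    matches = []
    for entity in feed_message.entity:
        if (entity.trip_update.trip.trip_id == trip_id
                or entity.vehicle.trip.trip_id == trip_id):
            matches.append(entity)
    return matches
```

With the feed above, `entities_for_trip(bad_feed, '147200_1..N02X017')` would return all messages for that trip rather than just the first.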

In [18]:
logbook = parse_feeds_into_trip_logbook(feeds[:120], information_dates[:120])

> /home/alex/Desktop/mta-data-exploration/src/processing.py(59)_parse_message_list_into_action_log()
-> actions_list = []
(Pdb) messages
[id: "000002"
vehicle {
  trip {
    trip_id: "147200_1..N02X017"
    start_date: "20160512"
    route_id: "1"
  }
  current_stop_sequence: 31
  current_status: STOPPED_AT
  timestamp: 1463030924
  stop_id: "107N"
}
]
(Pdb) information_time
'20160512T0532'
(Pdb) q


BdbQuit: 

In [16]:
%debug

> /home/alex/miniconda3/envs/mta-data-exploration/lib/python3.4/site-packages/pandas/tools/merge.py(1484)__init__()
   1482 
   1483         if len(objs) == 0:
-> 1484             raise ValueError('No objects to concatenate')
   1485 
   1486         if keys is None:

ipdb> up
> /home/alex/miniconda3/envs/mta-data-exploration/lib/python3.4/site-packages/pandas/tools/merge.py(1451)concat()
   1449                        keys=keys, levels=levels, names=names,

D'oh!
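The `ValueError` comes from `pandas.concat` being handed an empty list. One possible guard is sketched below. It assumes the logbook code concatenates a list of per-trip frames, which the traceback suggests but does not confirm; `concat_or_empty` and its `columns` argument are hypothetical.

```python
import pandas as pd


def concat_or_empty(frames, columns):
    """Concatenate frames, or return an empty frame if there are none."""
    if len(frames) == 0:
        # pd.concat([]) raises ValueError('No objects to concatenate').
        return pd.DataFrame(columns=columns)
    return pd.concat(frames, ignore_index=True)
```

Whether an empty frame is the right answer for a trip that only ever shows up as a `vehicle` message is a separate design question; the guard just keeps the parse from dying on it.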

TODO: trek onwards...