# About

## Theory

The MTA's GTFS and GTFS-Realtime datastreams use related, but not identical, trip identifiers, and the two sets of identifiers need to be matched against one another to maximize what we know.

Unfortunately this is *very* non-trivial. The MTA's data dictionary has a long, long passage on the problem:

> The New York City subway is a 24 x 7 operations and as a result is a highly dynamic operation. The majority of repairs and maintenance are performed during live operations so the daily service plan is subject to both planned and unplanned changes. The result of this is that some trips defined in the GTFS trips.txt may change (originating times, trip running times and trip path), cancelled or new trips may be added. 
>
> Unfortunately, there is no reliable way for us to determine the relationship between the actual and the static GTFS trip, so we can’t tell if a particular trip is the original one or has been changed or added later so the ScheduleRelationship is not used. While trip_id in the GTFS-realtime feed will not directly match the trip_id in trips.txt, a partial match should be possible if the trip has been defined in trips.txt. If there is a partial match, the trip is a scheduled trip.
>
> For example, if a trip_id in trips.txt is A20111204SAT_021150_2..N08R, the GTFS-realtime trip_id will
> be 021150_2..N08R which is unique within the day type (WKD, SAT, SUN). A20111204SAT_021150_2..N08R is decoded as follows:
> 
> A – Is the Sub-Division identifier. A identifies Sub-Division A (IRT) which include the GC Shuttle and all number lines with the exception of the 7 line. B identifies Sub-Division B (BMT and IND) which includes the Franklin Ave and Rockaway Shuttles, all letter lines and the 7 line.
>
> 20111204 – Effective date of the base schedule, Dec 4, 2011
>
> SAT – Is the applicable service code. Typically it will be WKD-Weekday, SAT-Saturday or SUN-Sunday
>
> 021150 – This identifies the trips origin time. Times are coded reflecting hundredths of a minute past midnight and converts to (03:31:30 also described as 0331+ where the + equals 30 seconds). This format provides more "precision" than can be realistically attributed to a transit operation, and most applications can safely round or truncate these numbers to the nearest minute. Since Transit authority internal timetables frequently involve half-minute scheduling, systems involved in train control or monitoring will need to represent times in a more accurate manner (to at least the half minute, and perhaps to the tenth minute or one second level). It should be noted that the service associated with a single day's subway schedule is not necessarily confined to a twenty-four hour period. Negative numbers reflect times prior to the day of the schedule (-0000200 refers to 11:58 PM yesterday) and
> numbers exceeding 00144000 (a day has 1440 minutes) reflect times beyond the day of the schedule (00145000 refers to 12:10 AM tomorrow).
>
> 2..N08R – This identifies the Trip Path (stopping pattern) for a unique train trip. This can be decomposed into the Route ID (aka service, 2 train) Direction (Northbound train) and Path Identifier (08R). Internally this path provides operations planning such information as origination, destination, all stops, routing scheme (express/local) in Manhattan/Bronx/Brooklyn, operating time periods, and shape (circle = local, diamond = express).
>
> The combination of Origin Time, Route ID and Direction can be used to identify a unique trip. The Path Identifier should be considered optional data that will only be provided when known. This could result with it being there at the start of a trip but not during portions of the trip.
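
As a rough illustration of the decoding rules quoted above, here is a minimal sketch of a parser for static `trip_id` values. The function names `parse_static_trip_id` and `realtime_suffix` are hypothetical, and the handling of the dot padding in the trip path (`2..N08R`) is an assumption drawn from the examples in this notebook rather than from a formal specification.

```python
from datetime import timedelta


def parse_static_trip_id(trip_id):
    """Split a trips.txt trip_id such as 'A20111204SAT_021150_2..N08R' into its parts."""
    schedule_part, origin_code, path = trip_id.split("_")
    # Assumption: the route ID is everything before the first dot, and any
    # further dots are padding in front of the direction letter.
    route_id, _, rest = path.partition(".")
    rest = rest.lstrip(".")
    return {
        "subdivision": schedule_part[0],       # "A" (IRT) or "B" (BMT and IND)
        "effective_date": schedule_part[1:9],  # YYYYMMDD of the base schedule
        "service_code": schedule_part[9:],     # WKD, SAT or SUN
        # Origin times are hundredths of a minute past midnight.
        "origin_time": timedelta(minutes=int(origin_code) / 100),
        "route_id": route_id,
        "direction": rest[:1],
        "path_id": rest[1:] or None,           # optional, may be missing
    }


def realtime_suffix(trip_id):
    """Drop the schedule prefix, leaving the part that GTFS-Realtime reports."""
    return trip_id.split("_", 1)[1]


parsed = parse_static_trip_id("A20111204SAT_021150_2..N08R")
```

Here `parsed["origin_time"]` is a `timedelta` measured from midnight of the schedule day, so negative codes and codes past 144000 fall out naturally as times on the previous or following day.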

The TLDR is that GTFS provides us all of the *static* information about the MTA, while GTFS-Realtime provides all of the *dynamic* information. In the complicated, aging, 24/7-on MTA system, this difference between expectation and reality isn't just a gap, it's a chasm.

That means that in general, there are three types of train trips:

1. Train trips that were to occur in the GTFS plan, but did not occur in the GTFS-Realtime reality.
2. Train trips that occurred in the GTFS-Realtime reality, but not in the GTFS plan.
3. Train trips that occurred both in planning and reality.
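
The three categories above can be sketched with set operations. This is only an illustration: `trip_key` and `classify_trips` are hypothetical names, the key follows the MTA's remark that origin time, route ID and direction together identify a trip, and the planned IDs are assumed to have been filtered down to the day type in effect already.

```python
def trip_key(trip_id):
    """Reduce a static or realtime trip_id to (origin_code, route_id, direction)."""
    # Static IDs carry a schedule prefix ("A20111204SAT_"); realtime IDs do not.
    parts = trip_id.split("_")
    origin_code, path = parts[-2], parts[-1]
    # Assumption: route ID comes before the first dot, direction after the last.
    route_id, _, rest = path.partition(".")
    direction = rest.lstrip(".")[:1]
    return origin_code, route_id, direction


def classify_trips(planned_ids, realtime_ids):
    """Sort trip keys into planned-only, executed-only and matched groups."""
    planned = {trip_key(t) for t in planned_ids}
    executed = {trip_key(t) for t in realtime_ids}
    return {
        "planned_only": planned - executed,
        "executed_only": executed - planned,
        "both": planned & executed,
    }


result = classify_trips(
    planned_ids=["A20111204SAT_021150_2..N08R", "A20111204SAT_030000_2..S08R"],
    realtime_ids=["021150_2..N08R", "040000_5..N"],
)
```

The path identifier is deliberately left out of the key, since the MTA describes it as optional data that may disappear partway through a trip.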

## Tooling

To start with, we're going to need to grab a tool for reading that GTFS data in (I could build my own, but why re-invent the wheel?).

Google publishes the [`transitfeed`](https://github.com/google/transitfeed) library for this purpose. However, that library is Python 2 only. Boo.

So instead we'll use the competently done [`pygtfs`](https://github.com/jarondl/pygtfs).

In [1]:
import pygtfs

This is a database-based implementation, using `sqlite`. We don't need persistence, so let's go in-memory for simplicity.

In [2]:
sched = pygtfs.Schedule(":memory:")

Now to load the data in. Note: this process takes a couple of minutes even on a fast PC.

In [11]:
import requests

with open("../data/gtfs/temp.zip", "wb") as f:
    f.write(requests.get("http://web.mta.info/developers/data/nyct/subway/google_transit.zip").content)

In [12]:
pygtfs.append_feed(sched, "../data/gtfs/temp.zip")

Loading GTFS data for <class 'pygtfs.gtfs_entities.Agency'>:
Loading GTFS data for <class 'pygtfs.gtfs_entities.Stop'>:
Loading GTFS data for <class 'pygtfs.gtfs_entities.Route'>:
Loading GTFS data for <class 'pygtfs.gtfs_entities.Trip'>:
Loading GTFS data for <class 'pygtfs.gtfs_entities.StopTime'>:
Loading GTFS data for <class 'pygtfs.gtfs_entities.Service'>:
Loading GTFS data for <class 'pygtfs.gtfs_entities.ServiceException'>:
Loading GTFS data for <class 'pygtfs.gtfs_entities.Fare'>:
Loading GTFS data for <class 'pygtfs.gtfs_entities.FareRule'>:
Loading GTFS data for <class 'pygtfs.gtfs_entities.ShapePoint'>:
Loading GTFS data for <class 'pygtfs.gtfs_entities.Frequency'>:
Loading GTFS data for <class 'pygtfs.gtfs_entities.Transfer'>:
Loading GTFS data for <class 'pygtfs.gtfs_entities.FeedInfo'>:
Loading GTFS data for <class 'pygtfs.gtfs_entities.Translation'>:
1 record read for <class 'pygtfs.gtfs_entities.Agency'>.




1497 records read for <class 'pygtfs.gtfs_entities.Stop'>.
29 records read for <class 'pygtfs.gtfs_entities.Route'>.
....20622 records read for <class 'pygtfs.gtfs_entities.Trip'>.
..............................................................................................................553149 records read for <class 'pygtfs.gtfs_entities.StopTime'>.
10 records read for <class 'pygtfs.gtfs_entities.Service'>.
16 records read for <class 'pygtfs.gtfs_entities.ServiceException'>.
........................122382 records read for <class 'pygtfs.gtfs_entities.ShapePoint'>.
610 records read for <class 'pygtfs.gtfs_entities.Transfer'>.
Complete.


<pygtfs.schedule.Schedule at 0x7fbb105d3a90>

In [13]:
schedule = _

In [14]:
schedule

<pygtfs.schedule.Schedule at 0x7fbb105d3a90>

In [17]:
len(schedule.routes)

29

In [18]:
len(schedule.trips)

20622

## Modelling

With that in mind, we need a mental model for organizing all of this information.

Our frame of reference is going to be individual "trip sets". Each trip execution will consist of two trips: the one that occurred in reality, and the one that was planned. Sometimes these two trips will be the same. Sometimes we will only have information on one of these trips: either a trip that was planned but didn't happen (or couldn't be matched to one that happened), or a trip that was unplanned but did happen. Sometimes there will be a trip that was planned, and one that occurred, but they won't be the same. We need to account for all of these possibilities, in theory (in practice, we will see).

These trip executions will have metadata regarding what line and route they occurred along, which we will read out of the requisite GTFS files.

Finally, we will create and store information about alerts. Alerts occur in [place, place] segments of a trip. That place may be a station, or it may be the space in between stations, and it includes the spaces both before the first and after the last stop on a trip. Later on we will probably want to assign some sort of classification system to the alerts that we see, but for now, just storing the text string associated with the alert will do, so long as we have this lengthwise spatial extent worked out alright.
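
One possible way to encode the [place, place] extent described above is to number the positions along a trip so that odd numbers are stations and even numbers are the gaps around them. This is only a sketch of one encoding, not a settled design; `describe_position` and the index scheme are assumptions made for illustration.

```python
def describe_position(position, stop_names):
    """Turn a position index into a readable place along a trip.

    For n stops, positions run from 0 to 2n: odd positions are the stops
    themselves, even positions are the gaps before, between and after them.
    """
    n = len(stop_names)
    if n == 0 or not 0 <= position <= 2 * n:
        raise ValueError("position is outside the extent of this trip")
    if position % 2 == 1:
        return stop_names[position // 2]
    if position == 0:
        return "before " + stop_names[0]
    if position == 2 * n:
        return "after " + stop_names[-1]
    return "between " + stop_names[position // 2 - 1] + " and " + stop_names[position // 2]


stops = ["137N", "136N", "135N"]
alert_segment = (2, 5)  # from the gap after 137N through to the stop 135N
described = [describe_position(p, stops) for p in alert_segment]
```

With this scheme an alert's extent is just a pair of integers, which serializes to JSON without any extra work.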

Importantly, we will assume that the matching is many-to-one from executed trips to planned trips. That is, if we start with one executed trip, we will assume that the most we can do is match it to one planned trip. However, we will not assume that the reverse is true: it may be possible for a single planned trip to map to multiple executed trips. (In the strict mathematical sense, this makes `trip_executed -> trip_planned` a function that is not necessarily injective.) An easy example might be when something goes cataclysmically wrong at one station in particular, so the trains on a route run in two segments: one covering the stops before that station, and one covering the stops after it.

I expect that the reverse is also possible: a single executed trip that corresponds to more than one planned trip. However, we need to match trips from some starting point, and it only makes sense to try to do so backwards: to start at a trip that happened, and work backwards to a trip that was planned. The many-to-one assumption follows from that choice.
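
The matching assumption above (each executed trip points at no more than one planned trip, while a planned trip may collect several executed trips) can be sketched with plain dictionaries. The function name `group_by_planned` and the trip labels are hypothetical.

```python
from collections import defaultdict


def group_by_planned(matches):
    """Invert an executed -> planned mapping.

    `matches` maps each executed trip ID to a planned trip ID, or to None
    when no planned trip could be found for it.
    """
    by_planned = defaultdict(list)
    unplanned = []
    for executed_id, planned_id in matches.items():
        if planned_id is None:
            unplanned.append(executed_id)
        else:
            # Several executed trips may share the same planned trip.
            by_planned[planned_id].append(executed_id)
    return dict(by_planned), unplanned


by_planned, unplanned = group_by_planned({
    "executed-1": "planned-A",
    "executed-2": "planned-A",  # the same planned trip, run in two segments
    "executed-3": None,         # happened, but was never planned
})
```

Because the starting dictionary is keyed by executed trip, it cannot represent one executed trip matching two planned trips, which is exactly the simplification being made.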

I'm going to organize things using objects, but I'm mainly interested in serializing these things to JSON representations.

An important limitation of this data model is that everything we store is localized to a single line. For example, if a service disruption occurs that affects many lines, every affected trip on each of those lines will carry its own duplicate copy of the information on the disruption. We have to accept limitations like this one, for now, because the data is just so multimodal; in order to flatten it somewhat, we have to make compromises somewhere. Otherwise we end up with a tree that's barely different from the one that we started with!

In [74]:
class TripSet():
    def __init__(self, line=None, route=None, service=None, trip_planned=None, trip_executed=None, alerts=None):
        """
        Parameters
        ----------
        line, str
            The line ("1", "A", "7", and so on) on which this tripset occurs. Since trains occasionally get
            rerouted from one line onto another for part of their journey, it's best to think of the line as
            the decal on the train.
        route, Trip
            The route assigned to this line at the time that this tripset took place. Routes change in
            a number of circumstances: when the system gets updated, for example, or when regularly scheduled
            weekend service change kicks in.
        service, str or None
            The service for this tripset. Stored, but not yet serialized.
        trip_planned, Trip or None
            The trip that was planned. Often, this will be exactly the same as the route.
        trip_executed, Trip or None
            The trip that was executed.
        alerts, list of Alert objects
            A list of alerts tied to this line in the period in question.
        """
        self.line = line
        self.route = route
        self.service = service
        self.trip_planned = trip_planned
        self.trip_executed = trip_executed
        # Copy into a fresh list so that instances never share a mutable default.
        self.alerts = list(alerts) if alerts is not None else []

    def to_json(self):
        """
        Serialize to JSON.
        """
        return {
            'line': self.line,
            'route': self.route,
            'trip_planned': self.trip_planned.to_json() if self.trip_planned else None,
            'trip_executed': self.trip_executed.to_json() if self.trip_executed else None,
            'alerts': [alert.to_json() for alert in self.alerts]
        }
        
        
class Trip():
    def __init__(self, stops=None, alerts=None):
        """
        Parameters
        ----------
        stops, list of Stop objects
            The stops that this trip occurs along, in the order in which they occur.
        alerts, list of Alert objects
            A list of Alert objects corresponding with what alerts were active along what parts of the line.
        """
        # Copy into fresh lists so that instances never share a mutable default.
        self.stops = list(stops) if stops is not None else []
        self.alerts = list(alerts) if alerts is not None else []
        
    def to_json(self):
        json_repr = dict(stops=[], alerts=[])
        for stop in self.stops:
            json_repr['stops'].append(stop.to_json())
        for alert in self.alerts:
            json_repr['alerts'].append(alert.to_json())
        return json_repr


class Stop():
    def __init__(self, name, coordinates, time):
        """
        Parameters
        ----------
        name, str
        coordinates, (latitude, longitude) pair of floats
        time, str
        """
        self.name = name
        self.coordinates = coordinates
        self.time = time
    
    def to_json(self):
        return vars(self)
    

class Alert():
    def __init__(self, time_interval, text, sources):
        """
        Parameters
        ----------
        time_interval, list of two str objects
            Two ISO time strings corresponding with the best known start and end time for the alert being active.
        text, str
            The alert text.
        sources, list of str objects
            The sources for this alert. There are two sources that we are interested in, GTFS-Realtime and the
            MTA Alerts service. Examination shows that the MTA Alerts are reserved for widescale disruptions,
            while GTFS-Realtime also captures smaller events. Hence the need to track and distinguish. Options
            are "GTFS-Realtime" and "MTA Alerts".
        """
        self.time_interval = time_interval
        self.text = text
        self.sources = sources
        
    def to_json(self):
        return vars(self)
    

# Methods looking up data for routes.

def map_route_id_to_route(route_id):
    """FYI: What I'm calling a 'route' is a route_short_name in the GTFS lexicon (see routes.txt)."""
    import pandas as pd
    # Compare as strings, so that a numeric-looking ID such as "1" still matches.
    route_id = str(route_id)
    routes = pd.read_csv("../data/gtfs/routes.txt", dtype={'route_id': str})
    return routes.query('route_id == @route_id').iloc[0]['route_short_name']

# def get_service(route_id):
#     import pandas as pd
#     route_id = str(route_id)
#     return pd.read_csv("../data/gtfs/routes.txt").query('route_id == @route_id').iloc[0]['route_short_name']

In [77]:
import pandas as pd

pd.read_csv("../data/gtfs/stop_times.txt")

| | trip_id | arrival_time | departure_time | stop_id | stop_sequence | stop_headsign | pickup_type | drop_off_type | shape_dist_traveled |
|---|---|---|---|---|---|---|---|---|---|
| 0 | A20170625SUN_001150_7..S97R | 00:11:30 | 00:11:30 | 701S | 1 | | 0 | 0 | |
| 1 | A20170625SUN_001150_7..S97R | 00:14:00 | 00:14:00 | 702S | 2 | | 0 | 0 | |
| 2 | A20170625SUN_001150_7..S97R | 00:15:30 | 00:15:30 | 705S | 3 | | 0 | 0 | |
| 3 | A20170625SUN_001150_7..S97R | 00:16:30 | 00:16:30 | 706S | 4 | | 0 | 0 | |
| 4 | A20170625SUN_001150_7..S97R | 00:17:30 | 00:17:30 | 707S | 5 | | 0 | 0 | |
| 5 | A20170625SUN_001150_7..S97R | 00:19:00 | 00:19:00 | 708S | 6 | | 0 | 0 | |
| 6 | A20170625SUN_001150_7..S97R | 00:20:00 | 00:20:00 | 709S | 7 | | 0 | 0 | |
| 7 | A20170625SUN_001150_7..S97R | 00:21:00 | 00:21:00 | 710S | 8 | | 0 | 0 | |
| 8 | A20170625SUN_001150_7..S97R | 00:22:30 | 00:22:30 | 711S | 9 | | 0 | 0 | |
| 9 | A20170625SUN_001150_7..S97R | 00:24:30 | 00:24:30 | 712S | 10 | | 0 | 0 | |


## Experiment with Loading in GTFS-Realtime

In [32]:
# Load the data in.

import pickle

import requests
from google.transit import gtfs_realtime_pb2

feed = gtfs_realtime_pb2.FeedMessage()

# The response was pickled from an earlier pull; close the file once it has been read.
with open("../data/gtfs-realtime/response.p", "rb") as f:
    response = pickle.load(f)
feed.ParseFromString(response.content)
example_pull = feed

Ok, let's start by bisecting off the alerts. `entity` is a...`RepeatedCompositeFieldContainer`...making this more annoying than it has to be. Each message also defines all possible fields, just leaving a few empty if they don't make sense. Very wonky design.

In [52]:
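alert_breakpoint = None

# This assumes the alerts sit at the tail of the feed: walk backwards until the
# first entity that carries no alert, and split there.
for i, entity in enumerate(reversed(example_pull.entity)):
    if not entity.HasField('alert'):
        alert_breakpoint = len(example_pull.entity) - i
        break

alerts = example_pull.entity[alert_breakpoint:] if alert_breakpoint else []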

OK, let's now set up those `TripSet` objects.

In [72]:
realtime_tripsets = []

trips_breakpoint = alert_breakpoint if alert_breakpoint else len(example_pull.entity)

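for i in range(0, trips_breakpoint, 2):
    # Stepping by two assumes that the trip updates sit at the even indices of the feed.
    trip_update_message = example_pull.entity[i]
    realtime_tripset = TripSet(
        route=map_route_id_to_route(trip_update_message.trip_update.trip.route_id),
        service=None  # Not worked out yet.
    )
    realtime_tripsets.append(realtime_tripset)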

In [33]:
example_pull.entity[0]

id: "000001"
trip_update {
  trip {
    trip_id: "006550_1..N02X003"
    start_date: "20170621"
    route_id: "1"
  }
  stop_time_update {
    arrival {
      time: 1498021800
    }
    departure {
      time: 1498021800
    }
    stop_id: "137N"
  }
  stop_time_update {
    arrival {
      time: 1498021890
    }
    departure {
      time: 1498021890
    }
    stop_id: "136N"
  }
  stop_time_update {
    arrival {
      time: 1498021950
    }
    departure {
      time: 1498021950
    }
    stop_id: "135N"
  }
  stop_time_update {
    arrival {
      time: 1498022040
    }
    departure {
      time: 1498022040
    }
    stop_id: "134N"
  }
  stop_time_update {
    arrival {
      time: 1498022130
    }
    departure {
      time: 1498022130
    }
    stop_id: "133N"
  }
  stop_time_update {
    arrival {
      time: 1498022220
    }
    departure {
      time: 1498022220
    }
    stop_id: "132N"
  }
  stop_time_update {
    arrival {
      time: 1498022280
    }
    departure {
    

The code for managing this merge is of a level of complexity that I want to TDD to get right, so I'm going to cut out of development in-notebook here.

With the current object model in mind, the process is roughly:

**Load GTFS-Realtime data. Easy enough.**

**Transfer the data into TripSet form (pour the data into our object model).**

This requires a sequence of lookups into the GTFS data, which still need to be worked out for robustness.

**Merge the resultant TripSet list into the pre-existing information.**

GTFS-Realtime data is forward-looking *only*. It doesn't contain any information on the past, even the recent past. If a train has already passed a stop, then that stop is not included in the output. This is great if you're building an app obviously, but we need archival information to put together a model of the whole system's movements.
   
The idea is that every 30 seconds we will requery to get the latest data, then compare-and-merge that information against what we already know, updating the information we have as we go along. This is time-dependent and really quite complicated, and I'm a little unsure what the shape of the data we will ultimately recover will be.
   
Luckily the archival information we have access to in the earlier lines will be helpful for development purposes here.
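
To make the compare-and-merge idea a little more concrete, here is a minimal sketch that folds successive pulls into a running archive. It is not the eventual implementation: `merge_snapshot` is a hypothetical name, and each pull is assumed to have been reduced already to a plain dictionary mapping `trip_id` to `{stop_id: arrival_time}`.

```python
def merge_snapshot(archive, snapshot):
    """Fold one realtime pull into the running archive, in place."""
    for trip_id, stop_times in snapshot.items():
        known = archive.setdefault(trip_id, {})
        # Newer predictions overwrite older ones for stops still ahead of the
        # train. Stops that have dropped out of the feed keep the last time
        # that was reported for them.
        known.update(stop_times)
    return archive


archive = {}
# First pull: the train has two stops ahead of it.
merge_snapshot(archive, {"006550_1..N02X003": {"137N": 1498021800, "136N": 1498021890}})
# A later pull: 137N has been passed and is gone, 136N has been re-estimated.
merge_snapshot(archive, {"006550_1..N02X003": {"136N": 1498021900}})
```

After the second call the archive still holds the last reported time for `137N`, which is the archival information that the forward-looking feed on its own would have lost.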

**Fetch the GTFS data into the TripSet information.**

I really have no idea what to expect in terms of how often we will succeed in doing so. The GTFS-Realtime stream alone will be good enough for many purposes, but having more information in the model will be good.
 
Then once this data engineering is done, we will be able to start to play with the results at last!