# About

## Protobuf

Real-time subway data is provided in the GTFS-Realtime format. This format is built on Google's Protocol Buffers (Protobuf), a language-neutral data specification that makes it easier for Google to distribute transit API wrappers in various languages.

Google has a Protobuf CLI tool, `protoc`, that ingests a data specification and generates an object-based API for working with that spec in various languages, in a form that can be distributed as a package. Python is among them; the output is a `.py` file, and the generated Python code relies on metaprogramming.

I could use the Protobuf tool directly, but there's no need to, since the resulting package is distributed via PyPI.

There are three versions of Protobuf. Protobuf 1 was only ever used internally at Google, before the project was open-sourced. Protobuf 2 is Python 2 only, while Protobuf 3 is compatible with both Python 2 and Python 3.

## GTFS-Realtime

The GTFS-Realtime binding is distributed via PyPI ([GitHub link](https://github.com/google/gtfs-realtime-bindings/blob/master/python/README.md)). However, Google is still on Python 2 internally, as is the Transit team. Since the package is open source, someone has contributed a [pull request](https://github.com/google/gtfs-realtime-bindings/pull/20) with an update (`0.0.4` -> `0.0.5`) that makes the PyPI module compatible with Protobuf 3. Google has been extremely slow to get this PR distributed onto PyPI; ironically, that was starting to happen just as I got into this project (see [here](https://github.com/google/gtfs-realtime-bindings/issues/21)).

I don't want to work in Python 2 though, so I need to get the latest version.

To get the latest version right now, I have to clone the repository and do a `setup.py` install, because the git repo includes *all* of the language bindings (weird as that is) in subfolders, and hence the `git+git` trick doesn't work.

From Desktop:

```shell
git clone https://github.com/google/gtfs-realtime-bindings.git
cd gtfs-realtime-bindings/python
python setup.py install
```

Unfortunately, there's a complicated issue in the install environment (see [#21](https://github.com/google/gtfs-realtime-bindings/issues/21#issuecomment-309898505)). TL;DR: you have to run this install from, and then work in, a Python 3.4 environment.

## Parsing the Data

In [1]:
from google.transit import gtfs_realtime_pb2
import requests

feed = gtfs_realtime_pb2.FeedMessage()
response = requests.get('http://datamine.mta.info/mta_esi.php?key=224a3669a50efeb1b61d3fb3694a0a17&feed_id=1')
feed.ParseFromString(response.content)
# for entity in feed.entity:
#   if entity.HasField('trip_update'):
#     print(entity.trip_update)

In [2]:
example_pull = feed

GTFS-Realtime is a binary format, and you decode it using the Python SDK. I expected the output to be a stream; however, the example provided in the library `README` treats it as a regular REST API: you request a snapshot and parse the response.

The MTA updates its feed every 30 seconds ([source](http://datamine.mta.info/sites/all/files/pdfs/GTFS-Realtime-NYC-Subway%20version%201%20dated%207%20Sep.pdf)). So to have the latest information at the highest possible temporal resolution, one would need to download and process the data at that interval. Other systems will have some other (hopefully well-documented!) update frequency.
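
To make the 30-second cadence concrete, here is a minimal polling sketch. The names `poll_feed`, `fetch`, and `max_polls` are made up for illustration; `fetch` stands for any zero-argument callable that downloads and parses one snapshot (for example, a wrapper around the `requests.get` and `ParseFromString` calls above).

```python
import time


def poll_feed(fetch, interval_seconds=30, max_polls=3, sleep=time.sleep):
    """Call `fetch()` once per interval and collect the snapshots.

    `max_polls` bounds the loop so that this sketch terminates; a real
    collector would more likely run until interrupted.
    """
    snapshots = []
    for i in range(max_polls):
        started = time.monotonic()
        snapshots.append(fetch())
        if i < max_polls - 1:
            # Sleep only for what is left of the interval after the download.
            elapsed = time.monotonic() - started
            sleep(max(0.0, interval_seconds - elapsed))
    return snapshots
```

Subtracting the download time from the sleep keeps the polls roughly aligned with the feed's update interval instead of drifting later with each request.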

In [3]:
type(example_pull)

gtfs_realtime_pb2.FeedMessage

The result is wrapped in a Java-style `FeedMessage` class, although its `repr` is a readable, dict-like text dump. The decoded data looks a lot like JSON when printed, but it is too large to print here.

The top level of the object has two fields, `header` and `entity`. `header` contains versioning information. `entity` is a list containing the data of interest (huge; omitted).

In [7]:
example_pull.header

gtfs_realtime_version: "1.0"
timestamp: 1498061666

### Header

These fields are pretty simple. `gtfs_realtime_version` is the version of the `gtfs-realtime` specification used for transferring this data, while `timestamp` is the UNIX timestamp at which this transfer occurred.

At this juncture, we have to make a small digression. The MTA has its own spin on the GTFS-Realtime specification, adding a few fields and extensions to the data that it transmits. According to the design documentation:
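
A UNIX timestamp is hard to read at a glance, so here is a small sketch that converts the header value above into local time. The fixed UTC-4 offset is my assumption (New York on daylight time on this date), and `format_feed_time` is just an illustrative helper name.

```python
from datetime import datetime, timedelta, timezone

# Assumption: New York was on Eastern Daylight Time (UTC-4) on 2017-06-21.
EDT = timezone(timedelta(hours=-4), "EDT")


def format_feed_time(unix_seconds, tz=EDT):
    """Render a UNIX timestamp as a readable local time."""
    return datetime.fromtimestamp(unix_seconds, tz=tz).strftime("%Y-%m-%d %H:%M:%S %Z")


# Header timestamp from the pull above.
print(format_feed_time(1498061666))  # 2017-06-21 12:14:26 EDT
```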

> ...[there are] extensions added specifically for NYCT (NyctFeedHeader,
NyctTripDescriptor and NyctStopTimeUpdate). To use these extensions, you need the nyctsubway.proto
file (URL TBD).

Having access to the `nyctsubway.proto` file would allow us to run the `protobuf` tool on the binary blob directly, which would in turn allow us to inspect these fields as well. Although the URL isn't listed in the reference document, it's [easily findable via Google](http://datamine.mta.info/sites/all/files/pdfs/nyct-subway.proto.txt).

The package that we're using is a Google export that's built against the general GTFS-Realtime specification, not the MTA's homebrew version of it. Being an expert on neither `protobuf` nor the MTA signal system, I can't say for sure (yet) what the consequences are.

### Entities

My guess is that many if not most transit systems release their data through a single API endpoint. In those cases, the `entity` list contains records for all of the lines in the system (for which this information is available in the first place).

However, the MTA is huge, so doing so in New York City would be uneconomical: the files you would be reading would be too large. Additionally, different lines are being brought into the fold at different times, and just "tacking on" recently computerized lines to the API output after the fact would be poor form. So instead the MTA breaks transit information down across several API endpoints, with each endpoint responsible for a certain "slice" of the system.
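
Since each slice lives at its own endpoint, it helps to build the URLs programmatically. This sketch only reuses the URL pattern from the request earlier in this notebook; `feed_url` is an illustrative helper, `YOUR_API_KEY` is a placeholder, and any feed ID other than `1` would have to be looked up in the MTA's documentation.

```python
from urllib.parse import urlencode

# Base URL taken from the request made earlier in this notebook.
BASE_URL = "http://datamine.mta.info/mta_esi.php"


def feed_url(api_key, feed_id):
    """Build the URL for one slice of the system."""
    return f"{BASE_URL}?{urlencode({'key': api_key, 'feed_id': feed_id})}"


# Only feed 1 is confirmed by this notebook; extend the tuple with
# other IDs once they are known.
urls = [feed_url("YOUR_API_KEY", feed_id) for feed_id in (1,)]
print(urls[0])
```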

The feed that we are looking at is responsible for the 1, 2, 3, 4, 5, 6, and S lines. Its `entity` field is a list with quite a large number of items in it:

In [17]:
len(example_pull.entity)

355

Each entry looks something like this one:

In [33]:
example_pull.entity[1]

id: "000002"
vehicle {
  trip {
    trip_id: "006550_1..N02X003"
    start_date: "20170621"
    route_id: "1"
  }
  current_stop_sequence: 4
  current_status: INCOMING_AT
  timestamp: 1498022005
  stop_id: "137N"
}

According to the [MTA GTFS-realtime Reference](http://datamine.mta.info/sites/all/files/pdfs/GTFS-Realtime-NYC-Subway%20version%201%20dated%207%20Sep.pdf), this entity list contains three kinds of individual entities (called "Messages" in the Protobuf parlance): *trip updates*, *vehicle position*, and *alerts*. Above we have an example of a vehicle update.
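
To see how the three kinds are distributed in a pull, each entity can be classified by which of its sub-messages is populated, using the same `HasField` check as the commented-out loop in the first cell. In this sketch `entity_kind` is an illustrative helper and `FakeEntity` is a hypothetical stand-in so the example runs without a live feed; with real data you would pass `example_pull.entity` instead.

```python
from collections import Counter

# Field names as they appear in the printed entities in this notebook.
ENTITY_KINDS = ("trip_update", "vehicle", "alert")


def entity_kind(entity):
    """Return the name of the populated sub-message, or None."""
    for name in ENTITY_KINDS:
        if entity.HasField(name):
            return name
    return None


class FakeEntity:
    """Hypothetical stand-in that mimics FeedEntity.HasField."""

    def __init__(self, populated):
        self._populated = populated

    def HasField(self, name):
        return name == self._populated


sample = [FakeEntity("trip_update"), FakeEntity("vehicle"), FakeEntity("alert")]
print(Counter(entity_kind(e) for e in sample))
```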

Although entities have a dict-like `repr`, they themselves are actually `FeedEntity` objects:

In [72]:
type(example_pull.entity[0])

gtfs_realtime_pb2.FeedEntity

This is not very Pythonic; at this level of detail I'd expect to be working with raw `dict` objects. But then, `protobuf` is a tool for outputting code across many different programming languages, resulting in programmatically generated APIs that probably feel more awkward than average in *all* of those languages.
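
If plain dictionaries are preferable, the `protobuf` package ships a converter, `google.protobuf.json_format.MessageToDict`. The wrapper below is a minimal sketch rather than a recommendation; `entity_to_dict` is an illustrative name, and I have not checked how the converter treats the MTA's extension fields.

```python
def entity_to_dict(entity):
    """Convert a protobuf message into plain dicts and lists."""
    # Imported lazily so the helper can be defined even where protobuf
    # is not installed; calling it does require the package.
    from google.protobuf.json_format import MessageToDict

    # preserving_proto_field_name keeps snake_case names such as
    # `trip_id` instead of converting them to camelCase.
    return MessageToDict(entity, preserving_proto_field_name=True)
```

With a parsed feed, `entity_to_dict(example_pull.entity[1])["vehicle"]["trip"]["route_id"]` should then return the route as an ordinary string. As far as I know, the conversion follows protobuf's JSON mapping, so enum values come back as their names and 64-bit integers such as `timestamp` come back as strings.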

### Vehicle Updates

Since vehicle updates are a little simpler to understand, let's start by looking at those.

In [92]:
example_vehicle_update = example_pull.entity[1]

In [93]:
example_vehicle_update

id: "000002"
vehicle {
  trip {
    trip_id: "006550_1..N02X003"
    start_date: "20170621"
    route_id: "1"
  }
  current_stop_sequence: 4
  current_status: INCOMING_AT
  timestamp: 1498022005
  stop_id: "137N"
}

As a sentence: 

> A train performing trip number `006550_1..N02X003` is currently `INCOMING_AT` station number `4` on route `1` (otherwise known as stop `137N`), and is expected to arrive at `1498022005` in UNIX time.

Or, by translating these computer-readable IDs into human-readable ones (a process still TBD):

> A northbound `1` is currently en route to Chambers Street (otherwise known as stop `137N`), and is expected to arrive at 1:13:25.
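
The translation step could look roughly like the sketch below. The lookup tables are hypothetical: only the `137` to Chambers Street mapping and the `N` suffix meaning northbound come from the example above, while the `S` entry and the fixed UTC-4 offset are my assumptions. `describe_vehicle` is an illustrative name.

```python
from datetime import datetime, timedelta, timezone

# Assumption: New York was on Eastern Daylight Time (UTC-4) on this date.
EDT = timezone(timedelta(hours=-4), "EDT")

# Hypothetical lookup tables; a real one would be built from the static
# GTFS stop list. The "S" entry is an assumption.
STOP_NAMES = {"137": "Chambers Street"}
DIRECTIONS = {"N": "northbound", "S": "southbound"}


def describe_vehicle(route_id, stop_id, timestamp, tz=EDT):
    """Turn the fields of a vehicle update into a sentence."""
    # The last character of the stop ID is the direction suffix.
    station, direction = stop_id[:-1], stop_id[-1]
    name = STOP_NAMES.get(station, f"station {station}")
    heading = DIRECTIONS.get(direction, direction)
    when = datetime.fromtimestamp(timestamp, tz=tz)
    clock = f"{when.hour % 12 or 12}:{when:%M:%S}"  # 12-hour clock, no padding
    return (
        f"A {heading} {route_id} is currently en route to {name} "
        f"(stop {stop_id}), and is expected to arrive at {clock}."
    )


print(describe_vehicle("1", "137N", 1498022005))
```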

### Trip Updates

Let's repeat this process for trip updates. These contain information on all of the stops a train is taking, and so are much longer.

In [94]:
example_pull.entity[0]

id: "000001"
trip_update {
  trip {
    trip_id: "006550_1..N02X003"
    start_date: "20170621"
    route_id: "1"
  }
  stop_time_update {
    arrival {
      time: 1498021800
    }
    departure {
      time: 1498021800
    }
    stop_id: "137N"
  }
  stop_time_update {
    arrival {
      time: 1498021890
    }
    departure {
      time: 1498021890
    }
    stop_id: "136N"
  }
  stop_time_update {
    arrival {
      time: 1498021950
    }
    departure {
      time: 1498021950
    }
    stop_id: "135N"
  }
  stop_time_update {
    arrival {
      time: 1498022040
    }
    departure {
      time: 1498022040
    }
    stop_id: "134N"
  }
  stop_time_update {
    arrival {
      time: 1498022130
    }
    departure {
      time: 1498022130
    }
    stop_id: "133N"
  }
  stop_time_update {
    arrival {
      time: 1498022220
    }
    departure {
      time: 1498022220
    }
    stop_id: "132N"
  }
  stop_time_update {
    arrival {
      time: 1498022280
    }
    departure {
    

The MTA's documentation states that:

> The feed contains all revenue trips that are either:
>
> * already underway (assigned trips), or
> * scheduled to start in the next 30 minutes (unassigned trips)
> 
> Trips are usually assigned to a physical train a few minutes before the scheduled start time, sometimes
just a few seconds before.
> 
> If a trip is included in the GTFS-realtime feed, there is a high probability that it will depart from its
originating terminal as planned. It is more likely that a train that is never assigned a trip identifier to be
changed or cancelled than an assigned one.

This block of information contains all of the *projected* arrival and departure times for our northbound 1 train from before. Chambers Street is the first stop on the list, because it's the first station that this train has not yet stopped at (two earlier stops, Rector Street and South Ferry, are excluded).
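
To read the projected times as a timetable, the `(stop_id, arrival time)` pairs can be converted to clock times. The values below are copied from the output above; the fixed UTC-4 offset is my assumption, and `timetable` is an illustrative helper.

```python
from datetime import datetime, timedelta, timezone

# Assumption: New York was on Eastern Daylight Time (UTC-4) on this date.
EDT = timezone(timedelta(hours=-4), "EDT")

# (stop_id, arrival time) pairs copied from the trip update printed above.
STOP_TIMES = [
    ("137N", 1498021800),
    ("136N", 1498021890),
    ("135N", 1498021950),
    ("134N", 1498022040),
    ("133N", 1498022130),
    ("132N", 1498022220),
]


def timetable(stop_times, tz=EDT):
    """Convert UNIX arrival times into local clock times."""
    return [
        (stop_id, datetime.fromtimestamp(ts, tz=tz).strftime("%H:%M:%S"))
        for stop_id, ts in stop_times
    ]


for stop_id, clock in timetable(STOP_TIMES):
    print(stop_id, clock)
```

With a parsed feed, the same pairs could presumably be collected as `[(u.stop_id, u.arrival.time) for u in entity.trip_update.stop_time_update]`, using the field names visible in the output above.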

By comparing this timetable with our earlier one, we learn that this train is *already late*. It was expected to arrive at 1:10:00, but is currently projected to arrive at 1:13:25, roughly three and a half minutes later!
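
The arithmetic behind that comparison, as a short sketch: subtract the arrival time in the trip update (`1498021800`) from the timestamp in the vehicle update (`1498022005`). `format_delay` is an illustrative helper name.

```python
def format_delay(scheduled, projected):
    """Return the difference in seconds and a readable description."""
    delta = projected - scheduled
    minutes, seconds = divmod(abs(delta), 60)
    label = "late" if delta > 0 else "early" if delta < 0 else "on time"
    return delta, f"{minutes} min {seconds} s {label}"


# Arrival at stop 137N per the trip update vs. the vehicle update's timestamp.
print(format_delay(1498021800, 1498022005))  # (205, '3 min 25 s late')
```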

Trip updates *always* precede vehicle updates, in a 1-2 pattern. In other words, leaving aside alerts, every *even* (or 0) entry is a trip update, and every *odd* entry is a vehicle update for the immediately preceding trip.

This is an awfully convenient system: to get the status and schedule of any particular train, we just have to read two contiguous entries.
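
Assuming the 1-2 pattern holds, the pairing can be sketched as below. `pair_updates` and its `is_alert` parameter are illustrative names; the example uses plain strings as stand-ins for entities, and with a real feed the predicate would be something like `lambda e: e.HasField("alert")`.

```python
def pair_updates(entities, is_alert=lambda entity: False):
    """Pair each trip update with the vehicle update that follows it.

    Alerts are filtered out first, since they sit outside the pattern.
    """
    items = [e for e in entities if not is_alert(e)]
    # Even positions hold trip updates, odd positions hold vehicle updates.
    # zip stops at the shorter slice, so an unpaired trailing entry is dropped.
    return list(zip(items[0::2], items[1::2]))


# Stand-in entities; real ones would be FeedEntity objects.
sample = ["trip-1", "vehicle-1", "trip-2", "vehicle-2", "alert"]
print(pair_updates(sample, is_alert=lambda e: e == "alert"))
```

This relies entirely on the ordering observed in this one pull; matching entries on `trip_id` instead would be more robust if the ordering ever turns out not to hold.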

### Alerts

Finally, there are alerts. Alerts may be used for a fairly wide range of informative content, but in the MTA system they are confined to service delay declarations ("in general, when a train is shown as ‘delayed’ on the station countdown clocks, an Alert is generated for that trip in the feed"). Conveniently, they also always appear at the very end of our list of messages.

Here's an example output:

In [121]:
example_pull.entity[-1]

id: "000355"
alert {
  informed_entity {
    trip {
      trip_id: "006550_1..N02X003"
      route_id: "1"
    }
  }
  informed_entity {
    trip {
      trip_id: "051350_4..N06R"
      route_id: "4"
    }
  }
  header_text {
    translation {
      text: "Train delayed"
    }
  }
}

The data dictionary has the following to say about alerts:

> The only alerts included in the NYCT Subway GTFS-realtime feed are notifications about delayed trains
therefore the entity is always a trip. In general, when a train is shown as ‘delayed’ on the station
countdown clocks, an Alert is generated for that trip in the feed.

Parsing this new information as a sentence, we find out that:

> Two northbound trains, a 1 and a 4, are delayed.

If we compare the first trip ID with the one from our example earlier, we see that it's that same train again!
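
Checking whether a given train is delayed then comes down to collecting the trip IDs named in the alert. In this sketch `delayed_trip_ids` is an illustrative helper, and the `SimpleNamespace` objects are stand-ins that mirror the structure printed above; with a real feed you would pass `example_pull.entity[-1].alert`.

```python
from types import SimpleNamespace


def delayed_trip_ids(alert):
    """Collect the trip IDs that an alert refers to."""
    return [informed.trip.trip_id for informed in alert.informed_entity]


# Stand-in for the alert printed above.
example_alert = SimpleNamespace(
    informed_entity=[
        SimpleNamespace(trip=SimpleNamespace(trip_id="006550_1..N02X003", route_id="1")),
        SimpleNamespace(trip=SimpleNamespace(trip_id="051350_4..N06R", route_id="4")),
    ]
)

delayed = delayed_trip_ids(example_alert)
print("006550_1..N02X003" in delayed)  # True: our example train is delayed
```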

In [118]:
len(trips_updates)

355