# About

So far, I have a model for how loading trips into TripSets will look, and I have done a little parsing with the examples I currently have (this was done in a development notebook). However, to better understand how to build a backwards-facing history of train movements from the GTFS-Realtime data, I need to examine what the process of merging backwards will look like.

For that I need access to archival information. Luckily, we have that: see [here](http://web.mta.info/developers/MTA-Subway-Time-historical-data.html).

However, these archival files are generated at 5-minute intervals, while the system as a whole updates at 30-second intervals. For testing purposes, I want to be able to build backwards from this historical data. But is that going to be enough?

I need to better understand the GTFS-Realtime format to be sure.

In [34]:
import sys; sys.path.append("../src/")
from processing import fetch_archival_gtfs_realtime_data

In [6]:
ex1 = fetch_archival_gtfs_realtime_data(kind='gtfs', timestamp='2014-09-17-09-31')

In [9]:
len(ex1.entity)

470

In [7]:
ex2 = fetch_archival_gtfs_realtime_data(kind='gtfs', timestamp='2014-09-17-09-36')

In [10]:
len(ex2.entity)

454

In [11]:
ex3 = fetch_archival_gtfs_realtime_data(kind='gtfs', timestamp='2014-09-17-09-41')

In [12]:
len(ex3.entity)

450

In [14]:
ex1.entity[0].trip_update.trip.trip_id

'050400_1..S02R'

In [15]:
ex2.entity[0].trip_update.trip.trip_id

'051600_1..S02R'

In [16]:
ex3.entity[0].trip_update.trip.trip_id

'051600_1..S02R'

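OK, so the first entry in each of our three GTFS-Realtime structures appears to describe the same trip (although the trip ID in `ex1`, `050400_1..S02R`, differs from the `051600_1..S02R` seen in `ex2` and `ex3`). What can we learn by examining the raw info?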

In [17]:
ex1.entity[0]

id: "000001"
trip_update {
  trip {
    trip_id: "050400_1..S02R"
    start_date: "20140917"
    route_id: "1"
  }
  stop_time_update {
    arrival {
      time: 1410960713
    }
    stop_id: "140S"
  }
}

In [18]:
ex1.entity[1]

id: "000002"
vehicle {
  trip {
    trip_id: "050400_1..S02R"
    start_date: "20140917"
    route_id: "1"
  }
  current_stop_sequence: 38
  current_status: IN_TRANSIT_TO
  timestamp: 1410960574
  stop_id: "140S"
}

At the time of `ex1` (timestamp 1410960574), this train is en route to station `140S`, where it is expected to arrive at 1410960713. That stop is supposed to be the last one on this train's route.
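
These values are POSIX timestamps (seconds since the Unix epoch), so the gap between the vehicle timestamp and the predicted arrival is plain subtraction. A minimal sketch; the helper names `to_utc` and `seconds_until` are my own, not part of the project:

```python
from datetime import datetime, timezone

def to_utc(ts: int) -> datetime:
    """Convert a POSIX timestamp from the feed to an aware UTC datetime."""
    return datetime.fromtimestamp(ts, tz=timezone.utc)

def seconds_until(vehicle_ts: int, predicted_ts: int) -> int:
    """Seconds between the last vehicle report and a predicted event."""
    return predicted_ts - vehicle_ts

vehicle_ts = 1410960574   # vehicle timestamp from ex1.entity[1]
arrival_ts = 1410960713   # predicted arrival at 140S from ex1.entity[0]

print(to_utc(vehicle_ts).isoformat())         # 2014-09-17T13:29:34+00:00
print(seconds_until(vehicle_ts, arrival_ts))  # 139
```

13:29 UTC is 09:29 in New York, which lines up with the `09-31` archive file.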

In [19]:
ex2.entity[0]

id: "000001"
trip_update {
  trip {
    trip_id: "051600_1..S02R"
    start_date: "20140917"
    route_id: "1"
  }
  stop_time_update {
    arrival {
      time: 1410960957
    }
    departure {
      time: 1410961017
    }
    stop_id: "139S"
  }
  stop_time_update {
    arrival {
      time: 1410961167
    }
    stop_id: "140S"
  }
}

In [38]:
ex2.entity[0].trip_update.stop_time_update[1].departure



In [20]:
ex2.entity[1]

id: "000002"
vehicle {
  trip {
    trip_id: "051600_1..S02R"
    start_date: "20140917"
    route_id: "1"
  }
  current_stop_sequence: 37
  current_status: INCOMING_AT
  timestamp: 1410960909
  stop_id: "139S"
}

Five minutes later, at time 2 (1410960909), we learn that the train was re-scheduled: probably it was changed from an express to a local. Now we have one more stop, `139S`. We expect the train to arrive there at 1410960957 and depart from there at 1410961017. We expect it to *then* stop at `140S`, at time 1410961167.

In [22]:
ex3.entity[0]

id: "000001"
trip_update {
  trip {
    trip_id: "051600_1..S02R"
    start_date: "20140917"
    route_id: "1"
  }
  stop_time_update {
    arrival {
      time: 1410961362
    }
    stop_id: "140S"
  }
}

In [23]:
ex3.entity[1]

id: "000002"
vehicle {
  trip {
    trip_id: "051600_1..S02R"
    start_date: "20140917"
    route_id: "1"
  }
  current_stop_sequence: 38
  current_status: IN_TRANSIT_TO
  timestamp: 1410961213
  stop_id: "140S"
}

Five minutes later, at time 3 (1410961213), we have already stopped at station `139S`, and are currently in transit to station `140S`. We do not have any information about when the train actually stopped at `139S`. We expect to arrive at station `140S` at time 1410961362.

NB: From the documentation:

> This includes all future Stop Times for the trip but StopTimes from the past
are omitted. The first StopTime in the sequence is the stop the train is
currently approaching, stopped at or about to leave. A stop is dropped from
the sequence when the train departs the station.

This is what I was worried about. It is impossible to tell when a stop occurred any more precisely than to within a five-minute interval, at least with the data that we have exported here.

With the data that we can stream in at 30-second intervals, we will have acceptable resolution, since we will be able to isolate these stops inside of a half-minute. That's good enough.

...and, also from the documentation, with reference to the timestamp in the vehicle position message:

> The motivation to include VehiclePosition is to provide the timestamp field. This is the time of the last
detected movement of the train. This allows feed consumers to detect the situation when a train stops
moving (aka stalled). The platform countdown clocks only count down when trains are moving
otherwise they persist the last published arrival time for that train. If one wants to mimic this
behavior you must first determine the absence of movement (stalled train condition) ), then the
countdown must be stopped.
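
A rough sketch of how that stalled-train check might look. Everything here is an assumption on my part: the function name, the 120-second threshold, the choice to never count a train that is `STOPPED_AT` a platform as stalled, and the `observed_at` values, which are made up for illustration.

```python
def looks_stalled(vehicle_ts, observed_at, current_status, threshold_s=120):
    """Guess whether a train has stopped moving between stations.

    vehicle_ts:  timestamp of the last detected movement (from the feed).
    observed_at: when we are looking at the feed (hypothetical input).
    threshold_s: arbitrary cutoff, not taken from the documentation.
    """
    if current_status == "STOPPED_AT":
        return False  # assumption: standing at a platform is not a stall
    return observed_at - vehicle_ts > threshold_s

# The vehicle timestamp is from ex1; the observation times are invented.
print(looks_stalled(1410960574, 1410960600, "IN_TRANSIT_TO"))  # False (26 s)
print(looks_stalled(1410960574, 1410960900, "IN_TRANSIT_TO"))  # True (326 s)
```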

In [24]:
ex4 = fetch_archival_gtfs_realtime_data(kind='gtfs', timestamp='2014-09-17-09-46')

In [25]:
ex4.entity[0]

id: "000001"
trip_update {
  trip {
    trip_id: "052250_1..S02R"
    start_date: "20140917"
    route_id: "1"
  }
  stop_time_update {
    arrival {
      time: 1410961501
    }
    stop_id: "140S"
  }
}

In [26]:
ex4.entity[1]

id: "000002"
vehicle {
  trip {
    trip_id: "052250_1..S02R"
    start_date: "20140917"
    route_id: "1"
  }
  current_stop_sequence: 38
  current_status: STOPPED_AT
  timestamp: 1410961501
  stop_id: "140S"
}

Five minutes later we learn that we stopped at station `140S`, the last station on the line, at 1410961501. Per the note from above, the timestamp in vehicle movement reflects this fact, as it has the exact same timestamp. The train had been stopped at that station from that time until the time at which this data was generated.

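The raw information associated with each of these updates can be distilled into a CSV file. Note that the rows below all carry the trip ID `051600_1..S02R` from `ex2` and `ex3`, although the raw dumps of `ex1` and `ex4` show `050400_1..S02R` and `052250_1..S02R`. `ex1` becomes: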

```csv
trip_id,start_date,route_id,action,stop_id,timestamp,
051600_1..S02R,20140917,1,IN_TRANSIT_TO,140S,1410960574,
051600_1..S02R,20140917,1,EXPECTED_TO_ARRIVE,140S,1410960713,
```

`ex2` becomes:

```csv
trip_id,start_date,route_id,action,stop_id,timestamp,
051600_1..S02R,20140917,1,EXPECTED_TO_ARRIVE,139S,1410960957,
051600_1..S02R,20140917,1,EXPECTED_TO_LEAVE,139S,1410961017,
051600_1..S02R,20140917,1,EXPECTED_TO_ARRIVE,140S,1410961167,
```

`ex3` becomes:

```csv
trip_id,start_date,route_id,action,stop_id,timestamp,
051600_1..S02R,20140917,1,EXPECTED_TO_ARRIVE,140S,1410961362,
```

`ex4` becomes:

```csv
trip_id,start_date,route_id,action,stop_id,timestamp,
051600_1..S02R,20140917,1,STOPPED_AT,140S,1410961501,
```
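
Here is a minimal sketch of how such rows could be generated. To keep it runnable without the protobuf bindings, it works on plain dicts shaped like the printed entities; with real messages the presence checks would probably use `HasField` rather than `in`. The function names are hypothetical. Unlike the hand-written listings, it also emits a row for the vehicle entity in every snapshot.

```python
import csv
import io

FIELDS = ["trip_id", "start_date", "route_id", "action", "stop_id", "timestamp"]

def entity_to_rows(entity):
    """Turn one feed entity (a plain dict here) into action-log rows."""
    rows = []
    if "vehicle" in entity:
        v = entity["vehicle"]
        rows.append({**v["trip"], "action": v["current_status"],
                     "stop_id": v["stop_id"], "timestamp": v["timestamp"]})
    if "trip_update" in entity:
        tu = entity["trip_update"]
        for stu in tu["stop_time_update"]:
            for key, action in (("arrival", "EXPECTED_TO_ARRIVE"),
                                ("departure", "EXPECTED_TO_LEAVE")):
                if key in stu:  # the last stop has no departure
                    rows.append({**tu["trip"], "action": action,
                                 "stop_id": stu["stop_id"],
                                 "timestamp": stu[key]["time"]})
    return rows

def rows_to_csv(rows):
    """Serialize action-log rows to CSV text."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

# The first two entities of ex2, transcribed by hand from the dumps above.
trip = {"trip_id": "051600_1..S02R", "start_date": "20140917", "route_id": "1"}
ex2_entities = [
    {"trip_update": {"trip": trip, "stop_time_update": [
        {"stop_id": "139S", "arrival": {"time": 1410960957},
         "departure": {"time": 1410961017}},
        {"stop_id": "140S", "arrival": {"time": 1410961167}},
    ]}},
    {"vehicle": {"trip": trip, "current_status": "INCOMING_AT",
                 "stop_id": "139S", "timestamp": 1410960909}},
]

log = [row for entity in ex2_entities for row in entity_to_rows(entity)]
print(rows_to_csv(log))
```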

From this sequence of "action logs", as it were, we can reconstruct the story of our train trip. But there are still things missing:

* We don't know what the relationship between this trip and the one that was originally scheduled to happen is.
* We don't know what happened inside of that `139S -> 140S` "gap". This is a consequence of our 5-minute sampling window. Plenty of stops can be passed over completely inside of five minutes; far fewer in 30 seconds, however. I expect that having the 30-second interval data will help a lot here in resolving ambiguity about timing. Still, this means that we will need to be able to provide an arrival/departure "time band".
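
One way to think about such a time band: a stop was departed at some point between the last snapshot that still lists it and the first snapshot that no longer does. A sketch under stated assumptions: the `TimeBand` class and `departure_band` function are hypothetical, and I use the vehicle timestamp as a stand-in for the snapshot time.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class TimeBand:
    earliest: int  # last time the stop was still listed
    latest: int    # first time the stop was gone

    @property
    def width(self) -> int:
        return self.latest - self.earliest

def departure_band(snapshots, stop_id) -> Optional[TimeBand]:
    """Bound the departure from stop_id.

    snapshots: chronological list of (observed_at, remaining_stop_ids).
    Returns None if the stop never disappears from the sequence.
    """
    last_seen = None
    for observed_at, remaining in snapshots:
        if stop_id in remaining:
            last_seen = observed_at
        elif last_seen is not None:
            return TimeBand(last_seen, observed_at)
    return None

# ex2 still lists 139S; ex3 no longer does.
snapshots = [
    (1410960909, ["139S", "140S"]),  # ex2
    (1410961213, ["140S"]),          # ex3
]
band = departure_band(snapshots, "139S")
print(band, band.width)  # width is 304 seconds
```

With 30-second snapshots the same function should yield bands roughly a tenth as wide.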

I still don't know what happens when a train ends its trip. Let's see by crunching forward some more time.

In [30]:
ex_later1 = fetch_archival_gtfs_realtime_data(kind='gtfs', timestamp='2014-09-17-10-01')

In [31]:
ex_later1.entity[1]

id: "000002"
vehicle {
  trip {
    trip_id: "053700_1..S02R"
    start_date: "20140917"
    route_id: "1"
  }
  current_stop_sequence: 37
  current_status: STOPPED_AT
  timestamp: 1410962327
  stop_id: "139S"
}

In [32]:
ex_later2 = fetch_archival_gtfs_realtime_data(kind='gtfs', timestamp='2014-09-17-10-21')

In [33]:
ex_later2.entity[1]

id: "000002"
vehicle {
  trip {
    trip_id: "056350_1..N"
    start_date: "20140917"
    route_id: "1"
  }
  current_stop_sequence: 30
  current_status: INCOMING_AT
  timestamp: 1410963618
  stop_id: "106N"
}

It looks like the entry is simply struck from the record. That makes sense. It means that we can infer that a trip has ended (or been cancelled) once its entry no longer appears inside of the GTFS-Realtime messages.
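
Detecting this amounts to a set difference between the trip IDs of two consecutive snapshots. A minimal sketch with a hypothetical helper; the trip IDs are ones seen above, but which snapshot each belongs to is illustrative only, and collecting the IDs from the feed entities is left out.

```python
def diff_trips(previous_ids, current_ids):
    """Return (ended, started) trip IDs between two snapshots."""
    previous_ids, current_ids = set(previous_ids), set(current_ids)
    ended = sorted(previous_ids - current_ids)    # gone from the feed
    started = sorted(current_ids - previous_ids)  # newly present
    return ended, started

# Illustrative membership, not the actual contents of the archive files.
previous = ["051600_1..S02R", "053700_1..S02R"]
current = ["053700_1..S02R", "056350_1..N"]

ended, started = diff_trips(previous, current)
print(ended)    # ['051600_1..S02R']
print(started)  # ['056350_1..N']
```

A disappearance alone does not say whether the trip finished normally or was cancelled; it only says the trip is no longer being reported.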

I suspect that this northbound train is just our southbound train going the opposite direction now, as well.

I see this as a three-step process. First, grab the GTFS-Realtime data and parse it into our little CSV format. Then, go through the CSVs that we have generated and rearrange the information from an update-oriented to a trip-oriented organization. Finally, build TripSets out of our data.
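
The second step, going from update-oriented to trip-oriented, could be sketched as a simple regrouping. This is only an illustration: `by_trip` is a hypothetical name, the rows are the ones from the CSV listings above (minus `route_id`, for brevity), and a real version would probably also keep the time at which each row was observed.

```python
from collections import defaultdict

# (trip_id, start_date, action, stop_id, timestamp), in arbitrary order.
update_log = [
    ("051600_1..S02R", "20140917", "EXPECTED_TO_ARRIVE", "140S", 1410961362),
    ("051600_1..S02R", "20140917", "IN_TRANSIT_TO", "140S", 1410960574),
    ("051600_1..S02R", "20140917", "EXPECTED_TO_ARRIVE", "140S", 1410960713),
    ("051600_1..S02R", "20140917", "EXPECTED_TO_ARRIVE", "139S", 1410960957),
    ("051600_1..S02R", "20140917", "EXPECTED_TO_LEAVE", "139S", 1410961017),
    ("051600_1..S02R", "20140917", "EXPECTED_TO_ARRIVE", "140S", 1410961167),
    ("051600_1..S02R", "20140917", "STOPPED_AT", "140S", 1410961501),
]

def by_trip(rows):
    """Group rows by (trip_id, start_date), sorted by timestamp."""
    trips = defaultdict(list)
    for trip_id, start_date, action, stop_id, ts in rows:
        trips[(trip_id, start_date)].append((ts, action, stop_id))
    return {key: sorted(events) for key, events in trips.items()}

trips = by_trip(update_log)
for ts, action, stop_id in trips[("051600_1..S02R", "20140917")]:
    print(ts, action, stop_id)
```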

We still need to test using this archival data, so we'll also want to program in some understanding of the time bands we are working with.

The limitations of the time band are akin to computing the position function of a moving particle over time, knowing only its velocity and heading. This is easy enough to do using integral calculus, but then, what if you only know the position of that particle at certain times? In between you have a band of possible actions, and you can't say for sure where or what it was doing anywhere in that band.

In our case, this is going to make it difficult to pick up when a train skips a local stop, I suspect.<!-- This is a fascinating limitation of the format that I think we're actually familiar with from the electronic boards; longtime commuters will know that the arrival boards will display "Arriving Now" for trains that are actually skipping a local stop, and not update to show the next arrival time until the train skipping the stop has passed it by. -->

Trains that make local stops, as above, should be easier to pick up. But then, you can't be too sure.