Now let's test our joins for correctness and consistency.

In [1]:
import os
from google.transit import gtfs_realtime_pb2
from tqdm import tqdm

import sys; sys.path.append("../src/")
from processing import parse_feeds_into_trip_logbook, merge_trip_logbooks

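def parse_feed(filepath):
    """Parse a GTFS-Realtime feed file; return None if it cannot be parsed."""
    with open(filepath, "rb") as f:
        try:
            fm = gtfs_realtime_pb2.FeedMessage()
            fm.ParseFromString(f.read())
            return fm
        except Exception:
            # Corrupt or truncated feed: skip it rather than abort the run.
            # KeyboardInterrupt and SystemExit are not caught by `Exception`.
            return None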

data_dir = "./data/subway_time_20160512"
logs = [f for f in os.listdir(data_dir) if f != 'arch.tar.xz'
        and 'si' not in f and 'l' not in f]
logs = sorted(logs)
feeds = [parse_feed(os.path.join(data_dir, name)) for name in tqdm(logs[:6])]
logbooks = [
    parse_feeds_into_trip_logbook(feeds[0:3], [0, 1, 2]), 
    parse_feeds_into_trip_logbook(feeds[3:6], [3, 4, 5])
]
logbook = merge_trip_logbooks(logbooks)

100%|██████████| 6/6 [00:00<00:00, 10.00it/s]


In [2]:
llog = parse_feeds_into_trip_logbook(feeds[0:3], [0, 1, 2])
rlog = parse_feeds_into_trip_logbook(feeds[3:6], [3, 4, 5])

In [3]:
logbook = merge_trip_logbooks([llog, rlog])

In [4]:
trip_ids = list(logbook.keys())

## Interesting cases

Let's go through the results and study some of the interesting cases that come up. These will teach us a lot about what we've wrought.

In the log below, `stop_id`s `228N` and `137N` are assigned the same minimum time, even though `137N` is a `STOPPED_AT`. This occurs because at time `1` the train was two stations away from where it was found at time `0`. The intervening stop must therefore have occurred between times `0` and `1`, which is indeed what we find.

In [19]:
logbook[trip_ids[12]]

Unnamed: 0,trip_id,route_id,action,minimum_time,maximum_time,stop_id,latest_information_time
0,137600_6..N01R,6,STOPPED_AT,,1.0,603N,0
1,137600_6..N01R,6,STOPPED_OR_SKIPPED,1.0,2.0,602N,2
2,137600_6..N01R,6,STOPPED_AT,2.0,5.0,601N,3
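
The bounding argument above can be sketched in code. This is a minimal illustration, not the logic inside `parse_feeds_into_trip_logbook`: the function name `bound_intervening_stops`, its arguments, and the sample observations are all hypothetical. It assumes the trip's stop sequence is known, that stop IDs are unique within it, and that each observation is a `(time, stop_id)` pair.

```python
def bound_intervening_stops(stop_sequence, prev_obs, next_obs):
    """Bound the stops passed between two consecutive observations.

    Hypothetical helper: `prev_obs` and `next_obs` are (time, stop_id)
    pairs, and `stop_sequence` lists the trip's stops in travel order.
    """
    t0, s0 = prev_obs
    t1, s1 = next_obs
    i0, i1 = stop_sequence.index(s0), stop_sequence.index(s1)
    # Any stop strictly between the two observed stops must have been
    # passed (stopped at or skipped) at some point between t0 and t1.
    return [
        {
            "stop_id": stop_id,
            "action": "STOPPED_OR_SKIPPED",
            "minimum_time": t0,
            "maximum_time": t1,
        }
        for stop_id in stop_sequence[i0 + 1:i1]
    ]


# Illustrative observations only: seen at 603N at time 1, then 601N at time 2.
rows = bound_intervening_stops(["603N", "602N", "601N"], (1, "603N"), (2, "601N"))
print(rows)
```

Because the feed only reports where the train is at each snapshot, the bounds for an unobserved stop can be no tighter than the two snapshot times on either side of it.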


Again, a similar construction occurs between stops `217N`, `216N`, and `218N`. In theory we know that the train couldn't have "warped" from stop `217N` to stop `216N` instantaneously, so the minimum times shouldn't be aligned as they are, but such are the limitations of this streaming data format.

TODO: Below, the `234S` `maximum_time` should be 5.0, not 4.0.

Let's examine the raw log.

In [15]:
messages = []

for i, feed in enumerate(feeds):
    for message in feed.entity:
        if message.trip_update.trip.trip_id == '142100_2..S08R':
            messages.append((i, message))
            break
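
Printed feed entities are verbose, so a small helper can make them easier to scan. The sketch below flattens one entity into `(stop_id, arrival)` pairs, with the POSIX timestamps converted to UTC datetimes. The helper name `summarize_stop_times` is hypothetical; it relies only on the `trip_update.stop_time_update`, `stop_id`, and `arrival.time` fields of a GTFS-Realtime entity.

```python
from datetime import datetime, timezone


def summarize_stop_times(entity):
    """Return [(stop_id, arrival_datetime_utc), ...] for one feed entity.

    Hypothetical helper, not part of `processing`.
    """
    rows = []
    for update in entity.trip_update.stop_time_update:
        # Feed times are POSIX seconds; an unset arrival reads as 0,
        # which would show up here as the Unix epoch.
        arrival = datetime.fromtimestamp(update.arrival.time, tz=timezone.utc)
        rows.append((update.stop_id, arrival))
    return rows
```

With the `messages` list built above, `summarize_stop_times(messages[3][1])` would summarize the entity stored at `messages[3]`.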

In [19]:
messages[3]

(3, id: "000061"
 trip_update {
   trip {
     trip_id: "142100_2..S08R"
     start_date: "20160511"
     route_id: "2"
   }
   stop_time_update {
     arrival {
       time: 1463025758
     }
     departure {
       time: 1463025758
     }
     stop_id: "217S"
   }
   stop_time_update {
     arrival {
       time: 1463025818
     }
     departure {
       time: 1463025818
     }
     stop_id: "218S"
   }
   stop_time_update {
     arrival {
       time: 1463025908
     }
     departure {
       time: 1463025908
     }
     stop_id: "219S"
   }
   stop_time_update {
     arrival {
       time: 1463025998
     }
     departure {
       time: 1463025998
     }
     stop_id: "220S"
   }
   stop_time_update {
     arrival {
       time: 1463026118
     }
     departure {
       time: 1463026118
     }
     stop_id: "221S"
   }
   stop_time_update {
     arrival {
       time: 1463026238
     }
     departure {
       time: 1463026238
     }
     stop_id: "222S"
   }
   stop_time_update {
 