Trip log joins are an operational necessity, but right now they run too slowly. Let's speed them up.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from google.transit import gtfs_realtime_pb2
import sys; sys.path.append("../src/")
from processing import parse_feeds_into_trip_logbook, merge_trip_logbooks

with open("../src/tests/data/gtfs_realtime_pull_1.dat", "rb") as f:
    gtfs_r0 = gtfs_realtime_pb2.FeedMessage()
    gtfs_r0.ParseFromString(f.read())
with open("../src/tests/data/gtfs_realtime_pull_2.dat", "rb") as f:
    gtfs_r1 = gtfs_realtime_pb2.FeedMessage()
    gtfs_r1.ParseFromString(f.read())

left_logbook = parse_feeds_into_trip_logbook([gtfs_r0], [0])
right_logbook = parse_feeds_into_trip_logbook([gtfs_r1], [1])

In [None]:
# Slow!
result = merge_trip_logbooks([left_logbook, right_logbook])

> /home/alex/Desktop/mta-data-exploration/src/processing.py(558)merge_trip_logbooks()
-> left = dict()
(Pdb) c


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  entries['action'] = 'STOPPED_OR_SKIPPED'
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  entries['maximum_time'] = next_entries['maximum_time']


The above makes the problem obvious: we are running `_join_trip_logs`, a multi-second operation, 162 times here. So we need to make that function faster.
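
A quick way to confirm a "called N times, slow each time" diagnosis like the one above is to wrap the suspect function in a counter and timer. The sketch below is only an illustration: `join_stub` and `merge_stub` are hypothetical stand-ins for `_join_trip_logs` and `merge_trip_logbooks`, not the real functions in `processing.py`.

```python
import functools
import time


def count_and_time(func):
    """Wrap func so that it records its call count and cumulative run time."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            wrapper.calls += 1
            wrapper.seconds += time.perf_counter() - start

    wrapper.calls = 0
    wrapper.seconds = 0.0
    return wrapper


@count_and_time
def join_stub(left, right):
    # Hypothetical stand-in for _join_trip_logs.
    return left + right


def merge_stub(pairs):
    # Hypothetical stand-in for merge_trip_logbooks: one join per trip.
    return [join_stub(left, right) for left, right in pairs]


merge_stub([(i, i) for i in range(162)])
print(join_stub.calls, "calls,", round(join_stub.seconds, 6), "seconds in total")
```

Multiplying the call count by the per-call time gives a rough estimate of how much of the merge the join accounts for.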

In [3]:
from processing import _join_trip_logs

Old code:

In [5]:
from pyinstrument import Profiler

profiler = Profiler()
profiler.start()

# code you want to profile
_join_trip_logs(left_logbook['055450_4..N06R'], right_logbook['055450_4..N06R'])

profiler.stop()

print(profiler.output_text(unicode=True, color=True))

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  #     entry['action'] = 'STOPPED_OR_SKIPPED'
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  #     entry['maximum_time'] = next_entry['maximum_time']


3.096 _join_trip_logs  processing.py:591
└─ 3.059 __setitem__  pandas/core/series.py:716
   └─ 3.059 _check_is_chained_assignment_possible  pandas/core/generic.py:1512
      └─ 3.059 _check_setitem_copy  pandas/core/generic.py:1533



New code (using `numpy` indexed assignment):

In [9]:
from pyinstrument import Profiler

profiler = Profiler()
profiler.start()

# code you want to profile
_join_trip_logs(left_logbook['055450_4..N06R'], right_logbook['055450_4..N06R'])

profiler.stop()

print(profiler.output_text(unicode=True, color=True))

0.152 _join_trip_logs  processing.py:591
├─ 0.113 __setitem__  pandas/core/indexing.py:135
│  └─ 0.112 _setitem_with_indexer  pandas/core/indexing.py:233
│     └─ 0.112 setter  pandas/core/indexing.py:455
│        └─ 0.112 __setitem__  pandas/core/frame.py:2405
│           └─ 0.112 _set_item  pandas/core/frame.py:2473
│              └─ 0.110 _check_setitem_copy  pandas/core/generic.py:1533
├─ 0.014 <listcomp>  processing.py:622
│  ├─ 0.009 __init__  pandas/core/frame.py:252
│  │  └─ 0.008 _init_dict  pandas/core/frame.py:349
│  │     └─ 0.006 _arrays_to_mgr  pandas/core/frame.py:5391
│  │        └─ 0.005 create_block_manager_from_arrays  pandas/core/internals.py:4259
│  │           └─ 0.004 form_blocks  pandas/core/internals.py:4270
│  │              └─

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s


This is a 20x speedup. However, the full merge still takes 20 seconds (!), because there are still `__setitem__` calls in `_join_trip_logs`. It's not clear to me where these are coming from, so I spent some time digging into it.
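
The actual change to `_join_trip_logs` is not shown in this notebook, so the following is only a minimal sketch of what "`numpy` indexed assignment" can look like. The idea is to copy the column into a plain array, edit the array, and write the column back once, so that pandas runs its copy-versus-view check once rather than on every write. The toy frame and the `"EN_ROUTE_TO"` value are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy trip log; the column names mirror the notebook, the values are made up.
log = pd.DataFrame({
    "stop_id": ["250N", "239N", "235N"],
    "action": ["EN_ROUTE_TO", "EN_ROUTE_TO", "EN_ROUTE_TO"],
})

# np.array copies by default, so this array is independent of the frame.
actions = np.array(log["action"], dtype=object)

# Indexed assignment on the array involves no pandas bookkeeping.
actions[:2] = "STOPPED_OR_SKIPPED"

# Write the whole column back in a single assignment.
log["action"] = actions

print(log)
```

Whether this removes the remaining `_check_setitem_copy` time depends on whether the frame being written to is itself a slice of another frame.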

In [15]:
from pyinstrument import Profiler

profiler = Profiler()
profiler.start()

# code you want to profile
result = merge_trip_logbooks([left_logbook, right_logbook])

profiler.stop()

print(profiler.output_text(unicode=True, color=True))

6.024 _join_trip_logs  processing.py:590
├─ 2.242 <listcomp>  processing.py:621
│  ├─ 1.421 __init__  pandas/core/frame.py:252
│  │  └─ 1.350 _init_dict  pandas/core/frame.py:349
│  │     ├─ 0.969 _arrays_to_mgr  pandas/core/frame.py:5391
│  │     │  └─ 0.906 create_block_manager_from_arrays  pandas/core/internals.py:4259
│  │     │     ├─ 0.701 form_blocks  pandas/core/internals.py:4270
│  │     │     │  ├─ 0.235 equals  pandas/indexes/base.py:1643
│  │     │     │  │  └─ 0.193 array_equivalent  pandas/types/missing.py:245
│  │     │     │  │     └─ 0.065 array_equal  numpy/core/numeric.py:2476
│  │     │     │  ├─ 0.189 _simple_blockify  pandas/core/internals.py:4385
│  │     │     │  │  └─ 0.134 make_block  pandas/core/internals.py:2649
│  │     │     │  │     └─

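Avoiding `DataFrame` transforms bought another speedup: time spent in `_join_trip_logs` during the full merge drops from 6.024 s to 3.782 s, a reduction of roughly 37%.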
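
The profile before this change spends most of its time in `DataFrame.__init__` called from a list comprehension. One common way to avoid that cost (not necessarily the change made in `processing.py`) is to collect plain dicts and construct a single `DataFrame` at the end. The rows below are made-up sample data.

```python
import pandas as pd

columns = ["trip_id", "stop_id", "action"]

# Accumulate cheap Python dicts instead of one small DataFrame per row.
rows = []
for stop_id in ["250N", "239N", "235N"]:
    rows.append({
        "trip_id": "055450_4..N06R",
        "stop_id": stop_id,
        "action": "STOPPED_OR_SKIPPED",
    })

# Pay the DataFrame construction cost exactly once.
log = pd.DataFrame(rows, columns=columns)

print(log)
```

Passing `columns=` fixes the column order and keeps the frame well-formed even when `rows` is empty.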

In [22]:
from pyinstrument import Profiler

profiler = Profiler()
profiler.start()

# code you want to profile
result = merge_trip_logbooks([left_logbook, right_logbook])

profiler.stop()

print(profiler.output_text(unicode=True, color=True))

3.782 _join_trip_logs  processing.py:590
├─ 1.591 __getitem__  pandas/core/frame.py:2035
│  └─ 1.463 _getitem_array  pandas/core/frame.py:2078
│     ├─ 1.308 take  pandas/core/generic.py:1650
│     │  ├─ 1.110 take  pandas/core/internals.py:3943
│     │  │  ├─ 0.792 reindex_indexer  pandas/core/internals.py:3813
│     │  │  │  ├─ 0.493 <listcomp>  pandas/core/internals.py:3848
│     │  │  │  │  ├─ 0.255 take_nd  pandas/core/internals.py:2126
│     │  │  │  │  │  └─ 0.210 take_nd  pandas/core/categorical.py:1481
│     │  │  │  │  │     └─ 0.168 take_nd  pandas/core/algorithms.py:1010
│     │  │  │  │  │        └─ 0.079 _maybe_promote  pandas/types/cast.py:227
│     │  │  │  │  └─ 0.225 take_nd  pandas/core/internals.py:1001
│     │  │  │  │     └─ 0.162 take_nd

WIP:

In [29]:
from pyinstrument import Profiler

profiler = Profiler()
profiler.start()

# code you want to profile
result = merge_trip_logbooks([left_logbook, right_logbook])

profiler.stop()

print(profiler.output_text(unicode=True, color=True))

1.664 _join_trip_logs  processing.py:590
├─ 0.745 __getitem__  pandas/core/indexing.py:1302
│  └─ 0.731 _getitem_axis  pandas/core/indexing.py:1599
│     ├─ 0.664 _get_loc  pandas/core/indexing.py:104
│     │  └─ 0.656 _ixs  pandas/core/frame.py:1953
│     │     ├─ 0.293 __init__  pandas/core/series.py:135
│     │     │  ├─ 0.146 _sanitize_array  pandas/core/series.py:2817
│     │     │  │  └─ 0.094 _try_cast  pandas/core/series.py:2834
│     │     │  │     ├─ 0.047 _possibly_cast_to_datetime  pandas/types/cast.py:765
│     │     │  │     └─ 0.033 is_extension_type  pandas/types/common.py:301
│     │     │  ├─ 0.057 __init__  pandas/core/internals.py:4031
│     │     │  │  └─ 0.048 make_block  pandas/core/internals.py:2649
│     │     │  │     └─ 0.018 __in

In [23]:
_join_trip_logs(left_logbook['055450_4..N06R'], right_logbook['055450_4..N06R'])

> /home/alex/Desktop/mta-data-exploration/src/processing.py(619)_join_trip_logs()
-> left.loc[:, 'stop_id'] = pd.Categorical(left['stop_id'], stations)
(Pdb) stations
['250N', '239N', '235N', '234N', '423N', '420N', '419N', '418N', '640N', '635N', '631N', '629N', '626N', '621N', '416N', '415N', '414N', '413N', '412N', '411N', '410N', '409N', '408N', '407N', '406N', '405N', '402N', '401N']
(Pdb) len(stations)
28
(Pdb) len(set(stations))
28
(Pdb) c


Unnamed: 0,index,trip_id,route_id,action,minimum_time,maximum_time,stop_id,latest_information_time
0,0,055450_4..N06R,4,STOPPED_OR_SKIPPED,0,,250N,0
1,1,055450_4..N06R,4,STOPPED_OR_SKIPPED,0,,239N,0
2,2,055450_4..N06R,4,STOPPED_OR_SKIPPED,0,,235N,0
3,3,055450_4..N06R,4,STOPPED_OR_SKIPPED,0,,234N,0
4,4,055450_4..N06R,4,STOPPED_OR_SKIPPED,0,,423N,0
5,5,055450_4..N06R,4,STOPPED_OR_SKIPPED,0,,420N,0
6,6,055450_4..N06R,4,STOPPED_OR_SKIPPED,0,,419N,0
7,7,055450_4..N06R,4,STOPPED_OR_SKIPPED,0,,418N,0
8,8,055450_4..N06R,4,STOPPED_OR_SKIPPED,0,,640N,0
9,9,055450_4..N06R,4,STOPPED_OR_SKIPPED,0,,635N,0


Begin TODO: debug this.

In [1]:
import os

logs = [f for f in os.listdir("./data/subway_time_20160512") if f != 'arch.tar.xz' 
        and 'si' not in f and 'l' not in f]

In [2]:
logs[:5]

['gtfs-20160512T0415Z',
 'gtfs-20160512T1759Z',
 'gtfs-20160512T2155Z',
 'gtfs-20160512T0610Z',
 'gtfs-20160513T0153Z']

In [3]:
from google.transit import gtfs_realtime_pb2

In [4]:
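def parse_feed(filepath):
    """Parse a GTFS-Realtime dump; return None if it cannot be decoded."""
    with open(filepath, "rb") as f:
        fm = gtfs_realtime_pb2.FeedMessage()
        try:
            fm.ParseFromString(f.read())
            return fm
        # Exception excludes KeyboardInterrupt and SystemExit, so those still propagate.
        except Exception:
            return None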

Note: the data above comes from the next notebook. You can get it yourself by using the following magical incantation:

    pip install requests gtfs-realtime-bindings
    python -c "import requests; r = requests.get('http://data.mytransit.nyc.s3.amazonaws.com/subway_time/2016/2016-05/subway_time_20160512.tar.xz'); open('arch.tar.xz', 'wb').write(r.content)"
    tar xvfJ arch.tar.xz
    python -c "from google.transit import gtfs_realtime_pb2; test_example = gtfs_realtime_pb2.FeedMessage(); test_example.ParseFromString(open('gtfs-20160512T0400Z', 'rb').read()); print(type(test_example))"
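
If you would rather stay in Python than shell out to `tar`, the standard-library `tarfile` module can read `.tar.xz` archives directly. The helper below is a hypothetical convenience, not part of `processing.py`. It yields the raw bytes of each file in the archive, which could then be handed to `FeedMessage.ParseFromString`.

```python
import tarfile


def iter_feed_bytes(archive_path):
    """Yield (member name, raw bytes) for each regular file in a .tar.xz archive."""
    with tarfile.open(archive_path, mode="r:xz") as archive:
        for member in archive:
            # Skip directories, links, and other non-file entries.
            if member.isfile():
                with archive.extractfile(member) as handle:
                    yield member.name, handle.read()
```

Reading members this way avoids writing the extracted files to disk, at the cost of decompressing the archive again on every pass.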

In [7]:
from tqdm import tqdm

In [12]:
feeds = [parse_feed("./data/subway_time_20160512/" + l) for l in tqdm(logs[:6])]


100%|██████████| 6/6 [00:01<00:00,  5.24it/s]

In [16]:
import sys; sys.path.append("../src/")
from processing import parse_feeds_into_trip_logbook

In [19]:
# logbooks = [
#     parse_feeds_into_trip_logbook(feeds[0:3], [0, 1, 2]), 
#     parse_feeds_into_trip_logbook(feeds[3:6], [3, 4, 5])
# ]

AssertionError: 

End TODO: debug this.