TritonDataCenter/manta-munistats

Muni Statistics

This repository contains tools for downloading, manipulating, and analyzing the arrival times of San Francisco Muni vehicles. The collection agent downloads each vehicle's position every 20 seconds. Those positions are mapped to specific stops, and the resulting transcripts of vehicle arrivals are aggregated into statistics about the average trip length between different stops.

Data Transformations

There are three distinct transformations used to go from a raw XML dump of a vehicle's GPS coordinates to an aggregated report of a given line's statistics. The NextBus API provides an XML blob with a list of vehicles for a given line:

<?xml version="1.0" encoding="UTF-8"?>
<body copyright="All data copyright San Francisco Muni 2013.">
  <vehicle id="5461" routeTag="14" dirTag="14_IB1" lat="37.78429" lon="-122.40442" secsSinceReport="2" predictable="true" heading="45" speedKmHr="0" />
  <vehicle id="7112" routeTag="14" dirTag="14_IB1" lat="37.75132" lon="-122.41838" secsSinceReport="38" predictable="true" heading="356" speedKmHr="27" />
  <vehicle id="7104" routeTag="14" dirTag="14_OB2" lat="37.72193" lon="-122.437" secsSinceReport="20" predictable="true" heading="211" speedKmHr="35" />
  <vehicle id="7126" routeTag="14" dirTag="14_IB1" lat="37.71793" lon="-122.43979" secsSinceReport="20" predictable="true" heading="31" speedKmHr="36" />
  <vehicle id="7032" routeTag="14" dirTag="14_IB1" lat="37.74655" lon="-122.41923" secsSinceReport="87" predictable="true" heading="25" speedKmHr="20" />
  <vehicle id="7001" routeTag="14" dirTag="14_OB2" lat="37.7528099" lon="-122.41854" secsSinceReport="2" predictable="true" heading="175" speedKmHr="5" />
  <vehicle id="7117" routeTag="14" dirTag="14_IB1" lat="37.73587" lon="-122.42445" secsSinceReport="7" predictable="true" heading="23" speedKmHr="0" />
  <vehicle id="5450" routeTag="14" dirTag="14_OB2" lat="37.76923" lon="-122.42005" secsSinceReport="83" predictable="true" heading="175" speedKmHr="33" />
  <vehicle id="7031" routeTag="14" dirTag="14_OB2" lat="37.7069" lon="-122.46051" secsSinceReport="51" predictable="true" heading="213" speedKmHr="25" />
  <vehicle id="7049" routeTag="14" dirTag="14_IB1" lat="37.78636" lon="-122.40191" secsSinceReport="99" predictable="true" heading="31" speedKmHr="11" />
  <vehicle id="7118" routeTag="14" dirTag="14_IB1" lat="37.77416" lon="-122.4173" secsSinceReport="73" predictable="true" heading="39" speedKmHr="16" />
  <vehicle id="7124" routeTag="14" dirTag="14_OB4" lat="37.77606" lon="-122.41486" secsSinceReport="72" predictable="true" heading="225" speedKmHr="0" />
  <vehicle id="5412" routeTag="14" dirTag="14_IB1" lat="37.7069199" lon="-122.45808" secsSinceReport="32" predictable="true" heading="66" speedKmHr="29" />
  <vehicle id="7012" routeTag="14" lat="37.76363" lon="-122.40926" secsSinceReport="65" predictable="false" heading="355" speedKmHr="7" />
  <vehicle id="7043" routeTag="14" dirTag="14_IB1" lat="37.78795" lon="-122.39988" secsSinceReport="23" predictable="true" heading="45" speedKmHr="0" />
  <vehicle id="7107" routeTag="14" dirTag="14_OB2" lat="37.7421099" lon="-122.42227" secsSinceReport="44" predictable="true" heading="206" speedKmHr="23" />
  <vehicle id="7122" routeTag="14" dirTag="14_OB2" lat="37.70986" lon="-122.44962" secsSinceReport="26" predictable="true" heading="244" speedKmHr="27" />
  <vehicle id="7115" routeTag="14" dirTag="14_IB1" lat="37.78035" lon="-122.4094" secsSinceReport="17" predictable="true" heading="45" speedKmHr="22" />
  <lastTime time="1370986564897" />
</body>

There are approximately 4000 of these files generated per line per day. Since they're such small files, I created a tarball for each day so as not to have too many keys. Those files are stored in /bpijewski/stor/munistats/sf-muni/data.
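
As a rough illustration of working with one of those daily tarballs, the sketch below walks every file in a tarball and flattens each vehicle element into a tab-separated row. It is not the code used in this repository: the function names and the choice of output columns are assumptions made for the example, and only the attribute names come from the XML shown above.

```python
import io
import tarfile
import xml.etree.ElementTree as ET


def vehicle_rows(xml_bytes):
    """Yield one tab-separated row per <vehicle> element in a single XML blob."""
    root = ET.fromstring(xml_bytes)
    last = root.find("lastTime")
    report_ms = last.get("time") if last is not None else "-"
    for vehicle in root.iter("vehicle"):
        # Some vehicles have no dirTag attribute, so fall back to "-".
        yield "\t".join([
            report_ms,
            vehicle.get("id", "-"),
            vehicle.get("dirTag", "-"),
            vehicle.get("lat", "-"),
            vehicle.get("lon", "-"),
            vehicle.get("secsSinceReport", "-"),
        ])


def rows_from_tarball(fileobj):
    """Flatten every regular file in a (possibly compressed) tarball."""
    with tarfile.open(fileobj=fileobj, mode="r:*") as tar:
        for member in tar:
            if not member.isfile():
                continue
            yield from vehicle_rows(tar.extractfile(member).read())
```

Calling `rows_from_tarball(open(path, "rb"))` on a day's tarball would produce one row per vehicle per report, which is easier to feed to line-oriented tools than the raw XML.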

The first phase of this transformation takes the tarball and generates a transcript of which train arrives at which station:

01/04/2013   07:29:45	1357313385102	7017	14_OB2	15604	Mission St & Mt Vernon Ave
01/04/2013   07:30:12	1357313412137	7011	14_IB1	15586	Mission St & Excelsior Ave
01/04/2013   07:30:12	1357313412137	7113	14_IB1	15567	Mission St & 26th St
01/04/2013   07:30:12	1357313412137	7106	14_IB1	17129	Mission St & 7th St
01/04/2013   07:30:12	1357313412137	7120	14_OB2	15578	Mission St & Appleton Ave
01/04/2013   07:30:12	1357313412137	7030	14_IB1	15574	Mission St & Acton St
01/04/2013   07:30:12	1357313412137	7104	14_OB2	15550	Mission St & 15th St
01/04/2013   07:30:32	1357313432503	7045	14_OB2	15528	Mission St & 1st St
01/04/2013   07:30:32	1357313432503	7017	14_OB2	15589	Mission St & Foote Ave
01/04/2013   07:30:52	1357313452529	7017	14_OB2	15602	Mission St & Lowell St
01/04/2013   07:31:12	1357313472236	7106	14_IB1	15537	Mission St & 6th St
01/04/2013   07:31:12	1357313472236	7045	14_OB2	15529	Mission St & 2nd St
01/04/2013   07:31:12	1357313472236	7104	14_OB2	15552	Mission St & 16th St
01/04/2013   07:31:33	1357313493796	7011	14_IB1	15620	Mission St & Silver Ave
01/04/2013   07:31:33	1357313493796	7113	14_IB1	15565	Mission St & 24th St
01/04/2013   07:31:33	1357313493796	7120	14_OB2	15597	Mission St & Highland Ave
01/04/2013   07:31:33	1357313493796	7030	14_IB1	15608	Mission St & Oliver St
01/04/2013   07:31:33	1357313493796	7017	14_OB2	15627	Mission St & Whittier St
01/04/2013   07:31:52	1357313512961	7011	14_IB1	15624	Mission St & Trumbull St
01/04/2013   07:32:14	1357313534015	7113	14_IB1	15563	Mission St & 23rd St

The job which generates that report is driven by jobs/arrivals.json and specifically the jobs/generate_arrivals.sh script. The output of that job is a single key (per line, per day) which is the transcript of train arrivals. Those files are stored in /bpijewski/stor/munistats/sf-muni/arrivals.
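
The simplest possible version of the position-to-stop mapping is a nearest-stop lookup by great-circle distance, sketched below. This is only the naive baseline, not what jobs/generate_arrivals.sh does, and the real problem is harder than this (see Lessons Learned). The `stops` table, the stop ids, and the 50 m threshold are all hypothetical.

```python
import math

EARTH_RADIUS_M = 6371000.0


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/lon points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def nearest_stop(lat, lon, stops, max_distance_m=50.0):
    """Return the id of the closest stop within max_distance_m, or None.

    stops maps a stop id to a (lat, lon) tuple; the threshold is an assumed value.
    """
    best_id, best_distance = None, max_distance_m
    for stop_id, (stop_lat, stop_lon) in stops.items():
        distance = haversine_m(lat, lon, stop_lat, stop_lon)
        if distance <= best_distance:
            best_id, best_distance = stop_id, distance
    return best_id
```

A vehicle that is not within the threshold of any stop yields `None`, which a caller could treat as "between stops".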

Next, all the transcripts for a given line are fed into the next phase, which calculates the total duration for each trip between two stations. The output format is:

M	M__OB1	16994	16259	06/11/2013   13:59:47	31 min	1419
M	M__OB1	16994	16259	06/11/2013   14:18:23	29 min	1420
M	M__OB1	16994	16259	06/11/2013   14:21:06	28 min	1440
M	M__OB1	16994	16259	06/11/2013   14:21:06	28 min	1458
M	M__OB1	16994	16259	06/11/2013   14:33:24	37 min	1438
M	M__OB1	16994	16259	06/11/2013   14:43:26	31 min	1468
M	M__OB1	16994	16259	06/11/2013   14:53:28	21 min	1516
M	M__OB1	16994	16259	06/11/2013   15:06:48	36 min	1479
M	M__OB1	16994	16259	06/11/2013   15:26:44	29 min	1410

That shows a random sample of trips from the Joyent office back to my apartment. Those files are stored in /bpijewski/stor/munistats/sf-muni/trips.
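
A minimal sketch of this phase, assuming the transcript is tab-separated in the column order shown earlier (date and time, epoch milliseconds, vehicle id, direction tag, stop id, stop name). The function name and the returned tuple shape are assumptions for illustration, and the sketch ignores the multi-car and dirty-data problems described under Lessons Learned.

```python
def trips_between(lines, src, dst):
    """Yield (vehicle, direction, depart_ms, minutes) for each src -> dst trip.

    lines is an iterable of transcript lines in time order; src and dst are
    stop ids given as strings.
    """
    departed = {}  # (vehicle, direction) -> epoch ms of the latest arrival at src
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        epoch_ms = int(fields[1])
        vehicle, direction, stop = fields[2], fields[3], fields[4]
        key = (vehicle, direction)
        if stop == src:
            departed[key] = epoch_ms
        elif stop == dst and key in departed:
            start = departed.pop(key)
            yield vehicle, direction, start, round((epoch_ms - start) / 60000)
```

Keying on both vehicle and direction keeps an outbound arrival at the source from being paired with a later inbound arrival at the destination.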

Finally, the last phase takes the list of trips between two stops and generates a report of average duration and standard deviation by hour:

Weekday 000-100 25 28.80 3.06
Weekday 100-200 0 0 0.00
Weekday 200-300 0 0 0.00
Weekday 300-400 0 0 0.00
Weekday 400-500 11 29.09 0.30
Weekday 500-600 26 29.69 1.05
Weekday 600-700 62 28.98 4.55
Weekday 700-800 80 30.11 5.98
Weekday 800-900 91 29.99 5.64
Weekday 900-1000 103 31.87 4.98
Weekday 1000-1100 104 30.20 5.79
Weekday 1100-1200 92 30.17 5.17
Weekday 1200-1300 89 30.69 3.67
Weekday 1300-1400 86 30.71 3.49
Weekday 1400-1500 83 29.20 4.92
Weekday 1500-1600 99 30.20 5.38
Weekday 1600-1700 107 29.93 5.73
Weekday 1700-1800 102 32.63 6.36
Weekday 1800-1900 87 30.10 5.74
Weekday 1900-2000 91 29.85 3.87
Weekday 2000-2100 58 29.76 2.99
Weekday 2100-2200 30 29.80 1.88
Weekday 2200-2300 34 29.06 2.68
Weekday 2300-2400 19 29.84 0.96
Weekend 000-100 7 29.43 3.31
Weekend 100-200 0 0 0.00
Weekend 200-300 0 0 0.00
Weekend 300-400 0 0 0.00
Weekend 400-500 0 0 0.00
Weekend 500-600 0 0 0.00
Weekend 600-700 3 29.00 0.00
Weekend 700-800 6 27.50 3.89
Weekend 800-900 12 30.75 3.02
Weekend 900-1000 14 31.14 4.20
Weekend 1000-1100 11 27.09 5.63
Weekend 1100-1200 17 33.12 4.36
Weekend 1200-1300 16 31.94 4.14
Weekend 1300-1400 17 29.24 4.37
Weekend 1400-1500 17 30.47 4.87
Weekend 1500-1600 16 30.81 5.19
Weekend 1600-1700 20 31.65 3.84
Weekend 1700-1800 20 29.20 4.69
Weekend 1800-1900 19 30.89 4.07
Weekend 1900-2000 10 31.40 2.84
Weekend 2000-2100 10 33.30 5.36
Weekend 2100-2200 14 28.21 4.68
Weekend 2200-2300 7 29.86 0.38
Weekend 2300-2400 5 31.60 3.21

The data show roughly what I would expect: there are many more trains during the rush hours, and both the average trip length and standard deviation are higher during that time as well.
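
The hourly report can be approximated with a sketch like the one below, which buckets trips by day type and hour and prints the count, mean, and standard deviation in the same layout as above. The function name is hypothetical, and the use of the population standard deviation is an assumption; the report above does not say which variant was used.

```python
import statistics
from collections import defaultdict
from datetime import datetime


def hourly_report(trips):
    """Build report lines from an iterable of (departure datetime, minutes)."""
    buckets = defaultdict(list)
    for when, minutes in trips:
        day_type = "Weekend" if when.weekday() >= 5 else "Weekday"
        buckets[(day_type, when.hour)].append(minutes)

    lines = []
    for day_type in ("Weekday", "Weekend"):
        for hour in range(24):
            durations = buckets.get((day_type, hour), [])
            label = f"{hour * 100:03d}-{(hour + 1) * 100}"
            if durations:
                mean = statistics.fmean(durations)
                # Population standard deviation: an assumption, see the note above.
                stdev = statistics.pstdev(durations)
                lines.append(f"{day_type} {label} {len(durations)} {mean:.2f} {stdev:.2f}")
            else:
                lines.append(f"{day_type} {label} 0 0 0.00")
    return lines
```

Empty hours are still emitted so that every report has the same 48 rows and can be compared line by line.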

Lessons Learned

  • It served me well to build a tarball of small files rather than uploading each XML blob as a single key. It was easy enough to untar the tarball inside the job, it made running jobs on those keys faster, and it was easier to debug because I didn't have hundreds of thousands of keys in a single directory.

  • Manipulating XML sucks. I thought it was a good idea to keep the original data intact as received from the API, but it would have been better to go straight to a text, column-based format.

  • For the second transformation, I ran a job for each (source, destination) pair. For example, different jobs for (Powell, Civic Center), (Powell, Church St.), (Powell, Castro), (Civic Center, Church St.) and so on. It would have been easier and faster to do that in the job itself, and just have my program emit a key for each (source, destination) pair. For some of the longer lines, there are 2000-2500 pairs, so running that many jobs became untenable.

  • By far the hardest part was writing the program to take the GPS coordinates and list the station arrivals. It wasn't as simple as finding which was the closest stop. The vehicles report in every so often, and that period is anywhere from a few seconds to a few minutes. Between reports, the vehicle may have remained at the current stop or advanced several stops down the line. In addition, several lines have multiple cars per train, and each car has its own transponder. Finally, there are lines which have U-turns, so even the closest stop on the line may not be the next one.

  • The data from the API was dirtier than expected. Trains would appear in multiple places at once, and there were situations where trains would pass each other, yet there was no siding which would allow such passing. If that data is accurate (questionable), my theory is that technicians were carrying transponders with them as they did track inspections or whatever.
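
The single-pass alternative to one job per (source, destination) pair could look roughly like the sketch below: given one vehicle's arrivals along a run, emit a duration for every earlier/later stop pair, then group the results by pair. The function names and the input shape (a time-ordered list of (epoch ms, stop id) tuples per run) are assumptions, not code from this repository.

```python
from collections import defaultdict
from itertools import combinations


def pair_durations(run):
    """Yield ((src, dst), minutes) for every earlier/later stop pair in one run.

    run is a time-ordered list of (epoch_ms, stop_id) for a single vehicle
    travelling in a single direction.
    """
    for (t1, s1), (t2, s2) in combinations(run, 2):
        if s1 != s2:  # ignore repeated reports at the same stop
            yield (s1, s2), round((t2 - t1) / 60000)


def group_by_pair(runs):
    """Collect durations for all runs, keyed by (src, dst)."""
    grouped = defaultdict(list)
    for run in runs:
        for key, minutes in pair_durations(run):
            grouped[key].append(minutes)
    return grouped
```

Each key of the resulting mapping corresponds to what would otherwise have been a separate job, so one pass over the transcripts covers every pair on the line.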

TODO

  • Scripts to automate/simplify running of jobs. Not all the reports have been generated - I need to change the trip generation to emit all the station pairs as explained above.

  • Small frontend/portal to interact with data

  • Move data to /public?
