# BART GTFS Changes

From [Bay Area Rapid Transit GTFS Schedules](https://www.bart.gov/schedules/developers/gtfs)

> For the February 11, 2019 schedule change, we’ve created two GTFS file versions. The first version was generated using our legacy process, and it's available on the permalink. We also created a second, "preview" version to show you what BART GTFS will look like once we transition to our new scheduling system.

> The preview GTFS includes more stop detail, like additional entrances and exits in the stops.txt file. It also includes more trip-level transfer details in transfer.txt. In addition, routes have been split up in routes.txt, and are unidirectional instead of bidirectional. We are also planning to refactor shapes.txt to improve accuracy, but that work is still on the backlog.

This suggests that stops.txt, transfers.txt, and routes.txt warrant inspection.

In [1]:
import glob
import errno
import os, sys
import numpy as np
import pandas as pd

In [2]:
# assign relevant directories to variables
original_bart_gtfs_dir = "google_transit_20190211_v06/"
preview_bart_gtfs_dir = "PREVIEW-google_transit_20190211_v2/"

# run a quick check to see if both contain the same files (or at least file names);
# os.listdir returns names in arbitrary order, so sort before comparing
print(sorted(os.listdir(original_bart_gtfs_dir)) == sorted(os.listdir(preview_bart_gtfs_dir)))

# prints out all the files in the preview directory
print(os.listdir(preview_bart_gtfs_dir))

True
['fare_attributes.txt', 'transfers.txt', 'agency.txt', 'fare_rules.txt', 'calendar_dates.txt', 'stop_times.txt', 'frequencies.txt', 'shapes.txt', 'trips.txt', 'feed_info.txt', 'stops.txt', 'calendar.txt', 'routes.txt']


## Read in all files from both directories

In [3]:
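'''
Read every .txt file in the original BART GTFS directory into a
dataframe named original_<file name without .txt>
'''
path = os.path.join(original_bart_gtfs_dir, '*.txt')
files = glob.glob(path)
for name in files:
    # take the table name from the file name instead of slicing by position
    table = os.path.splitext(os.path.basename(name))[0]
    try:
        vars()['original_' + table] = pd.read_csv(name, sep=",", header=0)
    except OSError as exc:
        # skip directories that happen to match *.txt, re-raise anything else
        if exc.errno != errno.EISDIR:
            raise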

In [4]:
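'''
Read every .txt file in the preview BART GTFS directory into a
dataframe named preview_<file name without .txt>
'''
path = os.path.join(preview_bart_gtfs_dir, '*.txt')
files = glob.glob(path)
for name in files:
    # take the table name from the file name instead of slicing by position
    table = os.path.splitext(os.path.basename(name))[0]
    try:
        vars()['preview_' + table] = pd.read_csv(name, sep=",", header=0)
    except OSError as exc:
        # skip directories that happen to match *.txt, re-raise anything else
        if exc.errno != errno.EISDIR:
            raise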
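
As an aside, the two loading cells assign dataframes to dynamically named variables through vars(). A minimal alternative sketch, assuming a hypothetical helper called `load_gtfs_dir` (not part of the original notebook), keeps each feed in a dictionary keyed by table name instead:

```python
import glob
import os

import pandas as pd


def load_gtfs_dir(directory):
    """Return {table name: dataframe} for every .txt file in directory.

    Hypothetical helper for illustration; the notebook itself uses vars().
    """
    tables = {}
    for path in sorted(glob.glob(os.path.join(directory, "*.txt"))):
        # "some_dir/stops.txt" -> "stops"
        table = os.path.splitext(os.path.basename(path))[0]
        tables[table] = pd.read_csv(path)
    return tables
```

With this approach, `load_gtfs_dir(original_bart_gtfs_dir)["stops"]` would play the role of `original_stops`, and a missing table shows up as a KeyError on the dictionary.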

In [5]:
original_agency

,agency_id,agency_name,agency_url,agency_timezone,agency_lang
0,BART,Bay Area Rapid Transit,http://www.bart.gov,America/Los_Angeles,en


In [6]:
preview_agency

,agency_id,agency_name,agency_url,agency_timezone,agency_lang
0,BART,Bay Area Rapid Transit,http://www.bart.gov,America/Los_Angeles,en


In [7]:
'''
Quick check to see if each variable name is assigned.
A KeyError would show up if the variable did not exist.
'''
for file in os.listdir(preview_bart_gtfs_dir):
    print(file[:-4])
    vars()['original_'+file[:-4]]
    vars()['preview_'+file[:-4]]

fare_attributes
transfers
agency
fare_rules
calendar_dates
stop_times
frequencies
shapes
trips
feed_info
stops
calendar
routes


## Dataframe Comparison and Styling Functions

In [8]:
'''
Function to create a single dataframe that allows for side-by-side comparison
of the original and preview data from BART GTFS.
'''
def combine_original_preview(original, preview, index):
    # combines the two dataframes into a single multilevel dataframe
    combined_GTFS = pd.concat([original.set_index(index),
                               preview.set_index(index)],
                              axis='columns',
                              keys=['original', 'preview'],
                              sort=True)

    # swaps the levels of the dataframe to get the same columns from both
    # dataframes (original and preview) side-by-side for easy comparison;
    # original.columns[1:] assumes the index column is the first column
    combined_GTFS = combined_GTFS.swaplevel(axis='columns')[original.columns[1:]]
    
    # replaces all np.nan values with a string 'N/A' because later when
    # columns are compared, np.nan==np.nan by definition returns False
    combined_GTFS = combined_GTFS.replace({np.nan: 'N/A'})
    
    return combined_GTFS
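
The comment about np.nan in the function above can be checked in isolation. The sketch below uses a plain Python float NaN (np.nan is an ordinary float NaN) and a hypothetical `fill` helper to show why missing values are swapped for a sentinel string before comparing cells:

```python
import math

nan = float("nan")  # same kind of value as np.nan

# NaN never compares equal, not even to itself
print(nan == nan)  # False
print(nan != nan)  # True

# so a naive cell-by-cell comparison flags two missing values as different
original_cell, preview_cell = nan, nan
naive_differs = original_cell != preview_cell


def fill(value, sentinel="N/A"):
    """Replace a float NaN with a sentinel string (illustrative helper)."""
    return sentinel if isinstance(value, float) and math.isnan(value) else value


# after filling, two missing values compare as equal
filled_differs = fill(original_cell) != fill(preview_cell)
print(naive_differs, filled_differs)  # True False
```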

In [9]:
'''
Two dataframe styling functions that change the font color of the
values in the dataframe according to certain rules. If there is a
difference between the values in the original and preview data columns
the original values are coloured red and the preview values green. 
'''
def color_red(data):
    # styling attribute is defined as red
    attr = 'color: %s' % 'red'
    # cross section of preview columns
    preview = data.xs('preview', axis='columns', level=-1)
    # compares the data and colours any differences red
    df =  pd.DataFrame(np.where(data.ne(preview, level=0), attr, ''),
                        index=data.index, columns=data.columns)
    return df

def color_green(data):
    # styling attribute is defined as green
    attr = 'color: %s' % 'green'
    # cross section of original columns
    original = data.xs('original', axis='columns', level=-1)
    # compares the data and colours any differences green
    df =  pd.DataFrame(np.where(data.ne(original, level=0), attr, ''),
                        index=data.index, columns=data.columns)
    return df

## Inspecting the three files that have been changed

### Stops
I removed the columns zone_id, stop_url, stop_timezone, and wheelchair_boarding because they showed no differences between the original and preview GTFS data.

The main difference in the stops data is that BART now includes station entry and exit points. The location_type column reflects this with a value of 2, which identifies a "station Entrance/Exit. A location where passengers can enter or exit a station from the street. The stop entry must also specify a parent_station value referencing the stop ID of the parent station for the entrance."

Previously this data was not available, as shown by the red N/As. The location_type column is optional, so entry/exit points are not required under GTFS.

In [10]:
combined_stops = combine_original_preview(original_stops, preview_stops, 'stop_id')
combined_stops = combined_stops.drop(columns=['zone_id', 'stop_url', 'stop_timezone', 'wheelchair_boarding'])
combined_stops.style.apply(color_red, axis=None).apply(color_green, axis=None)

stop_id,stop_name,stop_name,stop_desc,stop_desc,stop_lat,stop_lat,stop_lon,stop_lon,location_type,location_type,parent_station,parent_station
,original,preview,original,preview,original,preview,original,preview,original,preview,original,preview
12TH,12th St. Oakland City Center,12th St. Oakland City Center,,,37.8038,37.8038,-122.271,-122.271,0.0,0,,
12TH_1,,Enter/Exit : Broadway @ 13th Street (SW),,Broadway @ 13th Street (SW),,37.8035,,-122.272,,2,,12TH
12TH_2,,Enter/Exit : Broadway @ 13th Street (NE),,Broadway @ 13th Street (NE),,37.8038,,-122.271,,2,,12TH
12TH_3,,Enter/Exit : Broadway @ 13th Street (NW),,Broadway @ 13th Street (NW),,37.8039,,-122.272,,2,,12TH
12TH_4,,Enter/Exit : Broadway @ 12th Street (NW),,Broadway @ 12th Street (NW),,37.8025,,-122.272,,2,,12TH
12TH_5,,Enter/Exit : Broadway @ 12th Street (NE),,Broadway @ 12th Street (NE),,37.8024,,-122.272,,2,,12TH
12TH_6,,Enter/Exit : Broadway @ 14th Street (NW),,Broadway @ 14th Street (NW),,37.8047,,-122.271,,2,,12TH
12TH_7,,Enter/Exit : 14th Street (NE),,14th Street (NE),,37.8043,,-122.271,,2,,12TH
16TH,16th St. Mission,16th St. Mission,,,37.7651,37.7651,-122.42,-122.42,0.0,0,,
16TH_1,,Enter/Exit : 16th Street @ Mission Street (NE),,16th Street @ Mission Street (NE),,37.7653,,-122.419,,2,,16TH
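
To make the relationship between location_type and parent_station concrete, the sketch below groups entrances by their parent station. It uses only the preview rows visible in the table above, reduced to three columns, and the standard library so that it runs without the GTFS files; the counts therefore describe this sample, not the full stops.txt.

```python
import csv
import io
from collections import defaultdict

# preview values copied from the rows shown above
sample = """stop_id,location_type,parent_station
12TH,0,
12TH_1,2,12TH
12TH_2,2,12TH
12TH_3,2,12TH
12TH_4,2,12TH
12TH_5,2,12TH
12TH_6,2,12TH
12TH_7,2,12TH
16TH,0,
16TH_1,2,16TH
"""

entrances = defaultdict(list)
for row in csv.DictReader(io.StringIO(sample)):
    # location_type 2 marks an entrance/exit; parent_station names its station
    if row["location_type"] == "2":
        entrances[row["parent_station"]].append(row["stop_id"])

for station, stop_ids in entrances.items():
    print(station, len(stop_ids), stop_ids)
```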


### Routes
Again, I removed the columns agency_id, route_desc, route_color, and route_text_color as there were no noteworthy differences between the original and preview GTFS data. agency_id was identical and route_desc was empty in both, and the only differences in route_color and route_text_color came from the added unidirectional routes, which carry the same colours as their counterparts.

The main difference in the routes data is that BART now publishes routes as unidirectional instead of bidirectional. For instance, the first two rows of the preview data are _Antioch - SFIA/Millbrae_ and _Millbrae/SFIA - Antioch_ respectively.

The other change is that route_url now points to a route-specific page where it previously pointed to the general BART schedules page. For instance, the route_url for route 3 is now http://www.bart.gov/schedules/bylineresults?route=3 instead of http://www.bart.gov/schedules/. In the rows shown below, routes 9 and 10 still use the general URL in the preview data.

Similar to location_type identified above, route_url is an optional field under GTFS.

In [11]:
combined_routes = combine_original_preview(original_routes, preview_routes, 'route_id')
combined_routes = combined_routes.drop(columns=['agency_id', 'route_desc', 'route_color', 'route_text_color'])
combined_routes.style.apply(color_red, axis=None).apply(color_green, axis=None)

route_id,route_short_name,route_short_name,route_long_name,route_long_name,route_type,route_type,route_url,route_url
,original,preview,original,preview,original,preview,original,preview
1,Yellow,Yellow,Antioch - SFO/Millbrae,Antioch - SFIA/Millbrae,1.0,1,http://www.bart.gov/schedules/bylineresults?route=1,http://www.bart.gov/schedules/bylineresults?route=1
2,,Yellow,,Millbrae/SFIA - Antioch,,1,,http://www.bart.gov/schedules/bylineresults?route=2
3,Orange,Orange,Warm Springs/South Fremont - Richmond,Warm Springs/South Fremont - Richmond,1.0,1,http://www.bart.gov/schedules/,http://www.bart.gov/schedules/bylineresults?route=3
4,,Orange,,Richmond - Warm Springs/South Fremont,,1,,http://www.bart.gov/schedules/bylineresults?route=4
5,Green,Green,Warm Springs/South Fremont - Daly City,Warm Springs/South Fremont - Daly City,1.0,1,http://www.bart.gov/schedules/,http://www.bart.gov/schedules/bylineresults?route=5
6,,Green,,Daly City - Warm Springs/South Fremont,,1,,http://www.bart.gov/schedules/bylineresults?route=6
7,Red,Red,Richmond - Daly City/Millbrae,Richmond - Daly City/Millbrae,1.0,1,http://www.bart.gov/schedules/,http://www.bart.gov/schedules/bylineresults?route=7
8,,Red,,Millbrae/Daly City - Richmond,,1,,http://www.bart.gov/schedules/bylineresults?route=8
9,Blue-Sun,Blue,Dublin/Pleasanton - MacArthur,Dublin/Pleasanton - MacArthur,1.0,1,http://www.bart.gov/schedules/,http://www.bart.gov/schedules/
10,,Blue,,MacArthur - Dublin/Pleasanton,,1,,http://www.bart.gov/schedules/
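
One way to see the bidirectional-to-unidirectional split is to group the preview route_ids by route_short_name. The sketch below does this for the ten preview rows shown above (the list name `preview_routes_sample` is made up for this example); each line colour ends up with two route_ids, one per direction.

```python
from collections import defaultdict

# (route_id, route_short_name, route_long_name) copied from the preview columns above
preview_routes_sample = [
    (1, "Yellow", "Antioch - SFIA/Millbrae"),
    (2, "Yellow", "Millbrae/SFIA - Antioch"),
    (3, "Orange", "Warm Springs/South Fremont - Richmond"),
    (4, "Orange", "Richmond - Warm Springs/South Fremont"),
    (5, "Green", "Warm Springs/South Fremont - Daly City"),
    (6, "Green", "Daly City - Warm Springs/South Fremont"),
    (7, "Red", "Richmond - Daly City/Millbrae"),
    (8, "Red", "Millbrae/Daly City - Richmond"),
    (9, "Blue", "Dublin/Pleasanton - MacArthur"),
    (10, "Blue", "MacArthur - Dublin/Pleasanton"),
]

# collect the route_ids that share a short name
directions = defaultdict(list)
for route_id, short_name, long_name in preview_routes_sample:
    directions[short_name].append(route_id)

for short_name, route_ids in directions.items():
    print(short_name, route_ids)
```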


### Transfers
The transfers data changed more than routes and stops, so the breakdown here is structured differently.

A recap of what the fields are meant to contain:
* **from_stop_id** - identifies a stop or station where a connection between routes begins.
* **to_stop_id** - identifies a stop or station where a connection between routes ends.
* **transfer_type** - specifies the type of connection for the specified (from_stop_id, to_stop_id) pair.
    * 0 or (empty) - This is a recommended transfer point between routes.
    * 1 - This is a timed transfer point between two routes.
    * 2 - This transfer requires a minimum amount of time between arrival and departure to ensure a connection.
    * 3 - Transfers are not possible between routes at this location.
* **min_transfer_time** - defines the amount of time that must be available in an itinerary to permit a transfer between routes at these stops.
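
The transfer_type values listed above can be captured in a small enumeration. This is only an illustrative sketch: the member names and the `parse_transfer_type` helper are made up for this example, and it assumes the raw value arrives as a string (or None) straight from transfers.txt.

```python
from enum import IntEnum


class TransferType(IntEnum):
    RECOMMENDED = 0        # recommended transfer point between routes
    TIMED = 1              # timed transfer point between two routes
    MIN_TIME_REQUIRED = 2  # needs min_transfer_time between arrival and departure
    NOT_POSSIBLE = 3       # transfers are not possible at this location


def parse_transfer_type(raw):
    """Map a raw transfers.txt value to a TransferType; empty means 0."""
    raw = (raw or "").strip()
    return TransferType(int(raw)) if raw else TransferType.RECOMMENDED


print(parse_transfer_type(""))   # empty value falls back to RECOMMENDED
print(parse_transfer_type("2"))  # MIN_TIME_REQUIRED
```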

The first thing to notice is the number of non-GTFS columns in the new data published by BART. This was perplexing at first glance, but it reflects the changes made to routes and stops. Stops now contains entry/exit points that are not relevant to transfers between routes, and routes are now unidirectional, so the additional columns (from_route_id, to_route_id, from_trip_id, to_trip_id) help to specify which transfer is being referred to.

In [12]:
original_transfers.columns

Index(['from_stop_id', 'to_stop_id', 'transfer_type', 'min_transfer_time'], dtype='object')

In [13]:
preview_transfers.columns

Index(['from_stop_id', 'to_stop_id', 'transfer_type', 'min_transfer_time',
       'from_route_id', 'to_route_id', 'from_trip_id', 'to_trip_id'],
      dtype='object')
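
The added columns can also be listed programmatically rather than read off the two outputs. A minimal sketch, with the column names copied from the two Index outputs above:

```python
original_columns = ["from_stop_id", "to_stop_id", "transfer_type", "min_transfer_time"]
preview_columns = [
    "from_stop_id", "to_stop_id", "transfer_type", "min_transfer_time",
    "from_route_id", "to_route_id", "from_trip_id", "to_trip_id",
]

# columns that only exist in the preview feed, in file order
added = [c for c in preview_columns if c not in original_columns]
# columns that were dropped from the original feed
removed = [c for c in original_columns if c not in preview_columns]

print(added)    # ['from_route_id', 'to_route_id', 'from_trip_id', 'to_trip_id']
print(removed)  # []
```

On the dataframes themselves, `preview_transfers.columns.difference(original_transfers.columns)` should return the same four names, sorted alphabetically.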

In [14]:
preview_transfers.head()

,from_stop_id,to_stop_id,transfer_type,min_transfer_time,from_route_id,to_route_id,from_trip_id,to_trip_id
0,MCAR,MCAR,0,,1.0,3.0,,
1,MCAR,MCAR,0,,1.0,4.0,,
2,MCAR,MCAR,0,,1.0,7.0,,
3,MCAR,MCAR,0,,1.0,8.0,,
4,MCAR,MCAR,0,,4.0,1.0,,


Examining from_stop_id, it was interesting to see that the entry/exit points are also listed.

For example, CIVC_6 is described as _Enter/Exit : Market Street @ 8th Street (SE)_, yet it has a transfer_type of 2 (this transfer requires a minimum amount of time between arrival and departure to ensure a connection) to CIVC, which is the actual station, with a min_transfer_time of 269. GTFS expresses min_transfer_time in seconds, so this suggests that the walk between the entry/exit point and the platform takes approximately 4.5 minutes.

This is a new piece of information that lets journey planners account for the time it takes a user to reach the platform from the entry/exit point they were directed to, which could help get users to their boarding location on time.
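
As a rough illustration of how a journey planner might use this value, the sketch below subtracts min_transfer_time (in seconds, per GTFS) from a train departure to get the latest moment a rider should be at the entrance. The function name and the 08:00 departure are hypothetical; only the 269-second figure for CIVC_6 to CIVC comes from the data.

```python
from datetime import datetime, timedelta


def latest_entrance_time(departure, min_transfer_time):
    """Latest time to be at the entrance; min_transfer_time is in seconds."""
    return departure - timedelta(seconds=min_transfer_time)


departure = datetime(2019, 2, 11, 8, 0, 0)  # hypothetical train departure
walk_seconds = 269                          # CIVC_6 -> CIVC in the preview transfers

print(round(walk_seconds / 60, 1))                    # 4.5 (minutes)
print(latest_entrance_time(departure, walk_seconds))  # 2019-02-11 07:55:31
```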

In [15]:
preview_transfers['from_stop_id'].unique()

array(['MCAR', '19TH', 'BALB', 'SBRN', 'BAYF', 'COLS', 'FRMT', 'WOAK',
       'LAKE', '12TH', '12TH_1', '12TH_2', '12TH_3', '12TH_4', '12TH_5',
       '12TH_6', '12TH_7', '19TH_1', '19TH_2', '19TH_3', '19TH_4',
       '19TH_5', 'ASHB_1', 'CIVC_1', 'CIVC_2', 'CIVC_3', 'CIVC_4',
       'CIVC_5', 'CIVC_6', 'CIVC_7', 'CIVC_8', 'DBRK_1', 'DBRK_2',
       'DBRK_3', 'DBRK_4', 'EMBR_1', 'EMBR_2', 'EMBR_3', 'EMBR_4',
       'EMBR_5', 'EMBR_6', 'LAKE_1', 'LAKE_2', 'LAKE_3', 'MONT_1',
       'MONT_2', 'MONT_3', 'MONT_4', 'MONT_5', 'MONT_6', 'MONT_7',
       'POWL_1', 'POWL_2', 'POWL_3', 'POWL_4', 'POWL_5', 'POWL_6',
       'POWL_7', 'POWL_8', 'ASHB', 'CIVC', 'DBRK', 'EMBR', 'MONT', 'POWL',
       'SFIA'], dtype=object)

In [16]:
preview_transfers[preview_transfers['from_stop_id'] == 'CIVC_6']

,from_stop_id,to_stop_id,transfer_type,min_transfer_time,from_route_id,to_route_id,from_trip_id,to_trip_id
95,CIVC_6,CIVC,2,269.0,,,,
