# GTFS Data Loading and Route Analysis

This notebook loads GTFS data from the ./data folder and analyzes routes to identify different scenarios for testing the schedule-for-route functionality in OneBusAway.

## Data Loading and Analysis

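We'll load the following GTFS files:
- routes.txt - Route information
- trips.txt - Trip information
- calendar.txt - Regular weekly service patterns
- calendar_dates.txt - Service dates and exceptions

Later cells also read stop_times.txt, stops.txt, and shapes.txt.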

We'll analyze routes to find:
- Routes with multiple directions
- Routes with a single direction
- Stop order in topological order (for complex trips)
- Routes with different service patterns
- Routes with varying frequencies
- Routes with exceptions/holidays
- Routes with multiple trip headsigns
- Routes with different fare types

This will help ensure comprehensive test coverage of the schedule-for-route functionality.
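
The topological stop-ordering item in the list above can be approached in several ways. One possible sketch, under assumptions, treats each trip's stop sequence as a chain of "comes before" edges and runs Kahn's algorithm over the merged graph. The function name `topo_order_stops` and the toy stop IDs are hypothetical, and this is not necessarily how OneBusAway orders stops. A trip that revisits a stop creates a cycle, which rules out a strict topological order, so the sketch appends any leftover stops at the end instead of failing.

```python
from collections import defaultdict, deque


def topo_order_stops(trip_stop_sequences):
    """Merge per-trip stop sequences into one ordering (Kahn's algorithm).

    trip_stop_sequences: iterable of lists of stop IDs, each in travel order.
    Returns a list that contains every stop exactly once.
    """
    successors = defaultdict(set)  # stop -> stops that directly follow it
    indegree = {}                  # stop -> number of distinct predecessors
    position = {}                  # stop -> first-seen rank, for stable output

    for seq in trip_stop_sequences:
        for stop in seq:
            if stop not in position:
                position[stop] = len(position)
                indegree[stop] = 0
        for a, b in zip(seq, seq[1:]):
            if a != b and b not in successors[a]:
                successors[a].add(b)
                indegree[b] += 1

    # Start from stops that nothing precedes, in first-seen order.
    queue = deque(s for s in position if indegree[s] == 0)
    ordered = []
    while queue:
        stop = queue.popleft()
        ordered.append(stop)
        for nxt in sorted(successors[stop], key=position.get):
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                queue.append(nxt)

    # Stops still unplaced sit on a cycle; append them in first-seen order.
    placed = set(ordered)
    ordered.extend(s for s in position if s not in placed)
    return ordered


# Toy example: one trip skips stop C, another starts at B.
example = topo_order_stops([["A", "B", "D"], ["B", "C", "D"]])
print(example)
```

With real data, the input could be built by sorting stop_times.txt rows by `stop_sequence` within each `trip_id` and collecting the `stop_id` values per trip for one route and direction.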


In [2]:
import pandas as pd
import numpy as np
from datetime import datetime

# Load the GTFS data files
routes_df = pd.read_csv('data/routes.txt')
trips_df = pd.read_csv('data/trips.txt') 
calendar_df = pd.read_csv('data/calendar.txt')
calendar_dates_df = pd.read_csv('data/calendar_dates.txt')

# Find routes with multiple directions
direction_counts = trips_df.groupby('route_id')['direction_id'].nunique()
multi_direction_routes = direction_counts[direction_counts > 1].index.tolist()
print("\nRoutes with multiple directions:")
print(routes_df[routes_df['route_id'].isin(multi_direction_routes)][['route_id', 'route_short_name', 'route_long_name']].head())

# Find routes with multiple headsigns
headsign_counts = trips_df.groupby('route_id')['trip_headsign'].nunique()
multi_headsign_routes = headsign_counts[headsign_counts > 1].index.tolist()
print("\nRoutes with multiple headsigns:")
print(routes_df[routes_df['route_id'].isin(multi_headsign_routes)][['route_id', 'route_short_name', 'route_long_name']].head())

# Find routes with different fare types
fare_counts = trips_df.groupby('route_id')['fare_id'].nunique()
multi_fare_routes = fare_counts[fare_counts > 1].index.tolist()
print("\nRoutes with multiple fare types:")
print(routes_df[routes_df['route_id'].isin(multi_fare_routes)][['route_id', 'route_short_name', 'route_long_name']].head())

# Find routes with service exceptions
routes_with_exceptions = trips_df[trips_df['service_id'].isin(calendar_dates_df['service_id'])]['route_id'].unique()
print("\nRoutes with service exceptions:")
print(routes_df[routes_df['route_id'].isin(routes_with_exceptions)][['route_id', 'route_short_name', 'route_long_name']].head())

# Find routes with varying frequencies (using trip counts as proxy)
trip_counts = trips_df.groupby('route_id').size()
high_frequency_routes = trip_counts[trip_counts > trip_counts.median()].index.tolist()
print("\nHigh frequency routes:")
print(routes_df[routes_df['route_id'].isin(high_frequency_routes)][['route_id', 'route_short_name', 'route_long_name']].head())



Routes with multiple directions:
   route_id route_short_name  route_long_name
0    100001                1              NaN
1    100002               10              NaN
2    100003              101              NaN
3    100004              105              NaN
4    100005              106              NaN

Routes with multiple headsigns:
   route_id route_short_name  route_long_name
0    100001                1              NaN
1    100002               10              NaN
2    100003              101              NaN
3    100004              105              NaN
4    100005              106              NaN

Routes with multiple fare types:
     route_id route_short_name  route_long_name
114    102615           E Line              NaN

Routes with service exceptions:
   route_id route_short_name  route_long_name
0    100001                1              NaN
1    100002               10              NaN
2    100003              101              NaN
3    100004              105      

In [3]:
# Load stop_times.txt to get stop counts per trip
stop_times_df = pd.read_csv('data/stop_times.txt')

# Count number of stops per trip
stops_per_trip = stop_times_df.groupby('trip_id').size().reset_index(name='stop_count')

# Join with trips and routes to get route information
trip_details = stops_per_trip.merge(trips_df[['trip_id', 'route_id', 'trip_headsign']], on='trip_id')
trip_details = trip_details.merge(routes_df[['route_id', 'route_short_name', 'route_long_name']], on='route_id')

# Sort by stop count ascending and show top 5 trips with fewest stops
print("\nTrips with fewest stops:")
print(trip_details.sort_values('stop_count')[['trip_id', 'route_short_name', 'trip_headsign', 'stop_count']].head())



Trips with fewest stops:
         trip_id route_short_name                trip_headsign  stop_count
29077  761529468              973  Downtown Seattle Water Taxi           2
29097  761529668              973  Downtown Seattle Water Taxi           2
29098  761529678              973      West Seattle Water Taxi           2
29099  761529688              973  Downtown Seattle Water Taxi           2
29100  761529698              973      West Seattle Water Taxi           2


# PLOT ROUTES

In [4]:
import pandas as pd
import folium
import random
import json
from branca.element import MacroElement, Element
from jinja2 import Template

# === Load GTFS data ===
try:
    shapes = pd.read_csv("data/shapes.txt")
    stops = pd.read_csv("data/stops.txt")
    trips = pd.read_csv("data/trips.txt")
    routes = pd.read_csv("data/routes.txt")
    stop_times = pd.read_csv("data/stop_times.txt")
except FileNotFoundError as e:
    print(f"Error loading data: {e}. Make sure the GTFS files are in a 'data/' directory.")
    raise  # re-raise to stop this cell; exit() would shut down the notebook kernel

# === Preprocess and create route_data ===
trip_route_map = trips[['trip_id', 'route_id', 'shape_id']].drop_duplicates()
route_names = routes[['route_id', 'route_short_name', 'route_long_name']]
shape_route_map = pd.merge(trip_route_map, route_names, on='route_id', how='left').drop_duplicates(subset=['shape_id'])
trip_to_stops = stop_times.groupby('trip_id')['stop_id'].apply(list).to_dict()
stops_dict = stops.set_index('stop_id').to_dict('index')
route_data = {}
for _, row in shape_route_map.iterrows():
    shape_id = row['shape_id']
    route_name = row['route_short_name'] if pd.notna(row['route_short_name']) else row['route_long_name']
    route_id = row['route_id']
    color = f'#{random.randint(0, 0xFFFFFF):06x}'
    trip_ids_for_shape = trip_route_map[trip_route_map['shape_id'] == shape_id]['trip_id']
    if trip_ids_for_shape.empty: continue
    trip_id = trip_ids_for_shape.values[0]
    stop_ids = trip_to_stops.get(trip_id, [])
    shape_points = shapes[shapes['shape_id'] == shape_id].sort_values('shape_pt_sequence')
    path = shape_points[['shape_pt_lat', 'shape_pt_lon']].values.tolist()
    stops_on_route = []
    for sid in stop_ids:
        stop = stops_dict.get(sid)
        if stop: stops_on_route.append({'lat': stop['stop_lat'], 'lon': stop['stop_lon'], 'name': stop['stop_name'], 'id': sid})
    # Keyed by route_id, so a route with several shapes keeps only the last shape processed
    route_data[route_id] = {'route_name': route_name, 'color': color, 'shape': path, 'stops': stops_on_route}

# === Map setup ===
center = [stops['stop_lat'].iloc[0], stops['stop_lon'].iloc[0]]
m = folium.Map(location=center, zoom_start=13)
all_stops_group = folium.FeatureGroup(name="All Stops", show=True)
for _, stop in stops.iterrows():
    folium.CircleMarker(
        location=(stop['stop_lat'], stop['stop_lon']), radius=3, color='gray', fill=True, 
        fill_opacity=0.8, popup=f"{stop['stop_name']} ({stop['stop_id']})"
    ).add_to(all_stops_group)
all_stops_group.add_to(m)


# === Sidebar and interactive route selection ===

# 1. First, add the static HTML for the sidebar
dropdown_options = "".join(
    f'<option value="{rid}">{val["route_name"]}</option>' for rid, val in route_data.items()
)
sidebar_html = f"""
<div id="sidebar" style="position: fixed; top: 50px; left: 10px; z-index:9999; background: white; padding: 10px; border-radius: 6px; box-shadow: 2px 2px 5px rgba(0,0,0,0.3); max-height: 400px; overflow-y: auto;">
    <label><strong>Select Route:</strong></label><br>
    <select id="routeDropdown">
        <option value="">-- Select --</option>
        {dropdown_options}
    </select>
</div>
"""
m.get_root().html.add_child(Element(sidebar_html))


# 2. Define the interactive logic using the robust MacroElement pattern
class SidebarInteraction(MacroElement):
    _template = Template(u"""
        {% macro script(this, kwargs) %}
            // Reference the Leaflet map object generated by folium.
            var map = {{this._parent.get_name()}};
            
            // Get the route data passed from Python
            var routeData = {{this.route_data_json}};

            var routeLayer = null;
            var stopMarkers = [];

            function clearLayers() {
                if (routeLayer) {
                    map.removeLayer(routeLayer);
                    routeLayer = null;
                }
                stopMarkers.forEach(function(marker) {
                    map.removeLayer(marker);
                });
                stopMarkers = [];
            }

            var dropdown = document.getElementById('routeDropdown');
            dropdown.addEventListener('change', function() {
                var rid = this.value;
                clearLayers();

                if (!rid) return;

                var data = routeData[rid];
                var color = data.color;

                routeLayer = L.polyline(data.shape, {
                    color: color,
                    weight: 5,
                    opacity: 0.9
                }).addTo(map);

                data.stops.forEach(function(stop) {
                    var marker = L.circleMarker([stop.lat, stop.lon], {
                        radius: 5,
                        color: color,
                        fillColor: color,
                        fillOpacity: 0.8
                    }).bindPopup(stop.name + " (" + stop.id + ")");
                    
                    marker.addTo(map);
                    stopMarkers.push(marker);
                });
            });
        {% endmacro %}
    """)

    def __init__(self, route_data):
        super(SidebarInteraction, self).__init__()
        self._name = 'SidebarInteraction'
        # Convert the Python dict to a JSON string for embedding in the template
        self.route_data_json = json.dumps(route_data)

# 3. Add an instance of our new class to the map
m.add_child(SidebarInteraction(route_data))

# 4. Save the map
m.save("gtfs_sidebar_map_final.html")
print("✅ Map saved as gtfs_sidebar_map_final.html")


✅ Map saved as gtfs_sidebar_map_final.html


# 1 Get routes with just one direction

In [5]:
trip = pd.read_csv('data/trips.txt')
stops = pd.read_csv('data/stops.txt')
stop_times = pd.read_csv('data/stop_times.txt')
routes = pd.read_csv('data/routes.txt')

trip.columns

Index(['route_id', 'service_id', 'trip_id', 'trip_headsign',
       'tts_trip_headsign', 'trip_short_name', 'direction_id', 'block_id',
       'shape_id', 'peak_flag', 'fare_id', 'wheelchair_accessible',
       'bikes_allowed'],
      dtype='object')

In [6]:
routes_trips = routes.merge(trip, on='route_id', how='left')
routes_trips.columns

routes_trips_stop_times = routes_trips.merge(stop_times, on='trip_id', how='left')
routes_trips_stop_times.columns

routes_trips_stop_times_stops = routes_trips_stop_times.merge(stops, on='stop_id', how='left')
routes_trips_stop_times_stops.columns

routes_trips_stop_times_stops.head()

Unnamed: 0,route_id,agency_id,route_short_name,route_long_name,route_desc,route_type,route_url,route_color,route_text_color,service_id,...,tts_stop_name,stop_desc,stop_lat,stop_lon,zone_id,stop_url,location_type,parent_station,stop_timezone,wheelchair_boarding
0,100001,1,1,,Kinnear - Downtown Seattle,3,https://kingcounty.gov/en/dept/metro/routes-an...,FDB71A,0,86832,...,10th Avenue West and West Armour St,,47.645111,-122.370277,1,,0,,America/Los_Angeles,1
1,100001,1,1,,Kinnear - Downtown Seattle,3,https://kingcounty.gov/en/dept/metro/routes-an...,FDB71A,0,86832,...,10th Avenue West and West Bothwell St,,47.643486,-122.370308,1,,0,,America/Los_Angeles,1
2,100001,1,1,,Kinnear - Downtown Seattle,3,https://kingcounty.gov/en/dept/metro/routes-an...,FDB71A,0,86832,...,10th Avenue West and West Halladay St,,47.641792,-122.370171,1,,0,,America/Los_Angeles,1
3,100001,1,1,,Kinnear - Downtown Seattle,3,https://kingcounty.gov/en/dept/metro/routes-an...,FDB71A,0,86832,...,10th Avenue West and West Mcgraw St,,47.639366,-122.370415,1,,0,,America/Los_Angeles,1
4,100001,1,1,,Kinnear - Downtown Seattle,3,https://kingcounty.gov/en/dept/metro/routes-an...,FDB71A,0,86832,...,10th Avenue West and West Crockett St,,47.637585,-122.370438,1,,0,,America/Los_Angeles,1


In [7]:
print("Column names in routes_trips dataframe:")
for col in routes_trips.columns:
    print(f"- {col}")

print("\nShape of routes_trips dataframe:", end=" ")
print(routes_trips.shape)

print("\nDirection ID value counts:")
print(routes_trips.direction_id.value_counts())

# Get routes whose trips all share a single direction_id
single_direction_routes = routes_trips.groupby('route_id')['direction_id'].nunique()
single_direction_routes = single_direction_routes[single_direction_routes == 1]

print("\nRoutes with single direction:")
print("\nRoute ID    Count   Direction")
print("-" * 50)
for route_id, direction_count in single_direction_routes.items():
    direction = routes_trips[routes_trips['route_id'] == route_id]['direction_id'].iloc[0]
    print(f"{route_id:<10} {direction_count:>5} unique direction(s)    {direction}")
print("-" * 50)

# Collect the service IDs used by the single-direction routes
service_ids = routes_trips[routes_trips['route_id'].isin(single_direction_routes.index)]['service_id'].unique()
print("\nService IDs:")
print(service_ids)

# The trips for these routes are listed in the next cell

Column names in routes_trips dataframe:
- route_id
- agency_id
- route_short_name
- route_long_name
- route_desc
- route_type
- route_url
- route_color
- route_text_color
- service_id
- trip_id
- trip_headsign
- tts_trip_headsign
- trip_short_name
- direction_id
- block_id
- shape_id
- peak_flag
- fare_id
- wheelchair_accessible
- bikes_allowed

Shape of routes_trips dataframe: (29812, 21)

Direction ID value counts:
direction_id
1    15054
0    14758
Name: count, dtype: int64

Routes with single direction:

Route ID    Count   Direction
--------------------------------------------------
100297         1 unique direction(s)    0
100473         1 unique direction(s)    0
102698         1 unique direction(s)    0
102745         1 unique direction(s)    1
--------------------------------------------------

Service IDs:
[23519 86832 30003 17361]


In [8]:
trip[trip['route_id'].isin(single_direction_routes.index)].sort_values('route_id')

Unnamed: 0,route_id,service_id,trip_id,trip_headsign,tts_trip_headsign,trip_short_name,direction_id,block_id,shape_id,peak_flag,fare_id,wheelchair_accessible,bikes_allowed
1117,100297,23519,491983998,Kent East Hill,kent east hill,,0,7632568,41914001,0,900,1,1
501,100297,23519,491984008,Kent East Hill,kent east hill,,0,7632567,41914001,0,900,1,1
1132,100297,86832,518238938,Kent East Hill,kent east hill,,0,7632504,41914001,0,900,1,1
1131,100297,86832,518238898,Kent East Hill,kent east hill,,0,7632503,41914001,0,900,1,1
1167,100297,23519,491983948,Kent East Hill,kent east hill,,0,7632567,41914001,0,900,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
22501,102745,17361,736364318,Downtown Seattle Via E Madison St,downtown seattle Via east madison street,,1,7631811,30677004,0,101,1,1
22500,102745,17361,736364088,Downtown Seattle Via E Madison St,downtown seattle Via east madison street,,1,7631808,30677004,0,101,1,1
22499,102745,17361,736364058,Downtown Seattle Via E Madison St,downtown seattle Via east madison street,,1,7631809,30677004,0,101,1,1
22498,102745,17361,736363798,Downtown Seattle Via E Madison St,downtown seattle Via east madison street,,1,7631808,30677004,0,101,1,1


In [9]:
tripsForRoute100297 = trip[trip['route_id'] == 100297]
tripsForRoute100297.service_id.value_counts()
tripsForRoute100297[tripsForRoute100297["service_id"] == 23519].shape

(9, 13)

In [10]:
# Look up the calendar.txt entries for the services used by route 100297

calendar = pd.read_csv('data/calendar.txt')
calendar.columns

calendar[calendar["service_id"].isin(tripsForRoute100297.service_id.values)]


Unnamed: 0,service_id,monday,tuesday,wednesday,thursday,friday,saturday,sunday,start_date,end_date
1,86832,0,1,1,1,0,0,0,20250501,20250829
15,23519,0,0,0,0,0,1,0,20250503,20250823


In [11]:
# Load service exceptions from calendar_dates.txt
caldates = pd.read_csv('data/calendar_dates.txt')
caldates.columns
# Convert date column to datetime
caldates['date'] = pd.to_datetime(caldates['date'], format='%Y%m%d')

# Add a new column with the day name
caldates['day'] = caldates['date'].dt.day_name()
caldates[caldates['service_id'].isin(tripsForRoute100297.service_id.values)]

Unnamed: 0,service_id,date,exception_type,day
17,86832,2025-05-26,2,Monday
18,86832,2025-08-04,1,Monday
19,86832,2025-08-11,1,Monday
20,86832,2025-06-02,1,Monday
21,86832,2025-07-28,1,Monday
22,86832,2025-05-19,1,Monday
23,86832,2025-07-21,1,Monday
24,86832,2025-05-12,1,Monday
25,86832,2025-07-07,1,Monday
26,86832,2025-07-14,1,Monday


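Findings: Route 100297 runs in a single direction and has two active services. Service 23519 is a Saturday-only service (per calendar.txt) with no exceptions. Service 86832 runs Tuesday through Thursday and has ten calendar_dates.txt exceptions in the output above, all on Mondays: nine added dates (exception_type 1) and one removed date (exception_type 2, on 2025-05-26).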
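
Deciding whether a service runs on a given date means combining the weekly pattern in calendar.txt with the overrides in calendar_dates.txt, where the GTFS reference defines `exception_type` 1 as service added and 2 as service removed. The following is a minimal sketch of that check, not the OneBusAway implementation. The helper name `service_active_on` is hypothetical, and the two small frames are copied by hand from the outputs above for service 86832 instead of being read from the files.

```python
import pandas as pd

WEEKDAY_COLUMNS = ["monday", "tuesday", "wednesday", "thursday",
                   "friday", "saturday", "sunday"]


def service_active_on(service_id, date, calendar, calendar_dates):
    """Return True if service_id runs on the given date.

    Assumes calendar has integer YYYYMMDD start_date/end_date columns (as
    pd.read_csv produces) and calendar_dates['date'] is already datetime.
    """
    date = pd.Timestamp(date).normalize()

    # An explicit exception for this date overrides the weekly pattern.
    exc = calendar_dates[(calendar_dates["service_id"] == service_id)
                         & (calendar_dates["date"] == date)]
    if not exc.empty:
        return int(exc["exception_type"].iloc[0]) == 1  # 1 = added, 2 = removed

    rows = calendar[calendar["service_id"] == service_id]
    if rows.empty:
        return False
    row = rows.iloc[0]
    start = pd.to_datetime(str(row["start_date"]), format="%Y%m%d")
    end = pd.to_datetime(str(row["end_date"]), format="%Y%m%d")
    if not (start <= date <= end):
        return False
    return bool(row[WEEKDAY_COLUMNS[date.weekday()]])


# Sample rows for service 86832, taken from the outputs above.
sample_calendar = pd.DataFrame([{
    "service_id": 86832, "monday": 0, "tuesday": 1, "wednesday": 1,
    "thursday": 1, "friday": 0, "saturday": 0, "sunday": 0,
    "start_date": 20250501, "end_date": 20250829,
}])
sample_calendar_dates = pd.DataFrame({
    "service_id": [86832, 86832],
    "date": pd.to_datetime(["20250526", "20250602"], format="%Y%m%d"),
    "exception_type": [2, 1],
})

for day in ["2025-05-26", "2025-06-02", "2025-06-03", "2025-06-09"]:
    print(day, service_active_on(86832, day, sample_calendar, sample_calendar_dates))
```

In this sample, Monday 2025-06-02 counts as active only because of its added exception, while Monday 2025-06-09 has no exception row in the sample and falls back to the weekly pattern.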

In [12]:
# Count occurrences of each trip_id in trips.txt
tv = trips['trip_id'].value_counts()

In [13]:
# 2 Validate trip schedule for route 100297

# Count how often each stop appears in one trip of route 100297
filtered = routes_trips_stop_times[
    (routes_trips_stop_times['route_id'] == 100297) &
    (routes_trips_stop_times['trip_id'] == 491983928)
].value_counts("stop_id")
filtered.head()

stop_id
57151    2
50740    1
50772    1
50741    1
57000    1
Name: count, dtype: int64
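
Stop 57151 appears twice for this trip in the output above, which means the trip revisits that stop. A minimal sketch for listing every trip that revisits a stop follows; it uses a toy frame in place of `stop_times`, and the helper name `trips_with_repeated_stops` is hypothetical.

```python
import pandas as pd


def trips_with_repeated_stops(stop_times):
    """Return a Series: trip_id -> number of distinct stops visited more than once."""
    # Count visits per (trip, stop) pair, then keep pairs seen at least twice.
    visits = stop_times.groupby(["trip_id", "stop_id"]).size()
    repeated = visits[visits > 1]
    return repeated.groupby(level="trip_id").size()


# Toy data: trip 1 returns to stop 10, trip 2 never repeats a stop.
toy_stop_times = pd.DataFrame({
    "trip_id":       [1, 1, 1, 1, 2, 2, 2],
    "stop_id":       [10, 11, 12, 10, 10, 11, 12],
    "stop_sequence": [1, 2, 3, 4, 1, 2, 3],
})
loops = trips_with_repeated_stops(toy_stop_times)
print(loops)
```

Trips flagged this way are useful test cases because a stop that occurs twice in one trip cannot be placed at a single position in a strictly ordered stop list.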

In [14]:
# Repeat the checks for a route with two directions (route 973, the Water Taxi)
tripsForRoute973 = routes_trips[routes_trips["route_short_name"] == "973"]
tripsForRoute973.service_id.value_counts()
tripsForRoute973[tripsForRoute973["service_id"] == 14796]["trip_id"].shape

(22,)

In [15]:
# Print service info for relevant service_ids
waterTaxiServices = tripsForRoute973.service_id.unique()
print(waterTaxiServices)
print("service Info")
print(calendar[calendar['service_id'].isin(waterTaxiServices)])

# Print exception dates for trips 
print("exceptions Info")
print(caldates[caldates['service_id'].isin(waterTaxiServices)])


[14796 86832 64769 22063 23519  1776   789]
service Info
    service_id  monday  tuesday  wednesday  thursday  friday  saturday  \
1        86832       0        1          1         1       0         0   
5        64769       0        1          1         1       0         0   
15       23519       0        0          0         0       0         1   
16       22063       0        0          0         0       0         0   
27       14796       0        0          0         0       0         0   
40        1776       1        0          0         0       0         0   
42         789       0        0          0         0       1         0   

    sunday  start_date  end_date  
1        0    20250501  20250829  
5        0    20250501  20250828  
15       0    20250503  20250823  
16       0    20250502  20250829  
27       1    20250504  20250824  
40       0    20250526  20250526  
42       0    20250704  20250704  
exceptions Info
     service_id       date  exception_type     day
17 

In [None]:
# Inspect the services used by route_id 100099
tripsForRoute100099 = routes_trips[routes_trips["route_id"] == 100099]
tripsForRoute100099.service_id.value_counts()
# tripsForRoute100099[tripsForRoute100099["service_id"] == 14796]["trip_id"].shape

service_id
86832    10
Name: count, dtype: int64