1. What is GTFS and GTFS-RT?

GTFS (General Transit Feed Specification – Static) → ZIP file (CSV text files) with scheduled transit data.
    Key files:
        trips.txt → trip IDs & route info
        stop_times.txt → scheduled arrival/departure per stop
        routes.txt → route names
        stops.txt → stop locations

GTFS-Realtime (GTFS-RT) → live updates in Protocol Buffers (.pb) format), usually served via an API endpoint.
    Feeds:
        TripUpdates → actual arrival/departure & delay
        VehiclePositions → GPS positions of buses/trains
        ServiceAlerts → disruptions

Together: GTFS = schedule, GTFS-RT = real-time events → combine them to compute delay.

2. Workflow for Dataset Creation

Think of the pipeline like this:
GTFS (static schedules)  +  GTFS-RT (real-time updates)  →  Merge  →  Calculate Delay  →  Dataset for ML

3. Step-by-Step Dataset Building

Step 1: Get the feeds
    Many cities publish GTFS & GTFS-RT links (check your city’s transport website or MobilityData).
    Example (Toronto):
        Static GTFS: https://open.toronto.ca/dataset/ttc-routes-and-schedules/
        GTFS-RT (TripUpdates): http://gtfsrt.api.transitagency.com/tripupdates.pb

Step 2: Load GTFS Static

In [1]:
pip install requests

Note: you may need to restart the kernel to use updated packages.


In [2]:
import requests, os

download_url = "https://cdn.mbta.com/MBTA_GTFS.zip"
print("Resolved download URL:", download_url)

r = requests.get(download_url, headers={"User-Agent": "Mozilla/5.0"}, stream=True)
with open("mbta_gtfs_static.zip", "wb") as f:
    for chunk in r.iter_content(chunk_size=8192):
        f.write(chunk)

print("Downloaded file size:", os.path.getsize("mbta_gtfs_static.zip") / (1024 * 1024), "MB")

Resolved download URL: https://cdn.mbta.com/MBTA_GTFS.zip


Downloaded file size: 17.123085975646973 MB


In [3]:
import zipfile, os

zip_path = "mbta_gtfs_static.zip"
with zipfile.ZipFile(zip_path, 'r') as z:
    z.extractall("mbta_gtfs_static")

print("Extracted files:", os.listdir("mbta_gtfs_static"))

Extracted files: ['agency.txt', 'areas.txt', 'calendar.txt', 'calendar_attributes.txt', 'calendar_dates.txt', 'checkpoints.txt', 'directions.txt', 'facilities.txt', 'facilities_properties.txt', 'facilities_properties_definitions.txt', 'fare_leg_join_rules.txt', 'fare_leg_rules.txt', 'fare_media.txt', 'fare_products.txt', 'fare_transfer_rules.txt', 'feed_info.txt', 'levels.txt', 'lines.txt', 'linked_datasets.txt', 'multi_route_trips.txt', 'pathways.txt', 'routes.txt', 'route_patterns.txt', 'shapes.txt', 'stops.txt', 'stop_areas.txt', 'stop_times.txt', 'timeframes.txt', 'transfers.txt', 'trips.txt', 'trips_properties.txt', 'trips_properties_definitions.txt']


In [4]:
import pandas as pd

trips = pd.read_csv("mbta_gtfs_static/trips.txt")
stop_times = pd.read_csv("mbta_gtfs_static/stop_times.txt")
stops = pd.read_csv("mbta_gtfs_static/stops.txt")
routes = pd.read_csv("mbta_gtfs_static/routes.txt")

print("Trips:", trips.shape)
print("Stop Times:", stop_times.shape)
print("Stops:", stops.shape)
print("Routes:", routes.shape)

stop_times.head()

  stop_times = pd.read_csv("mbta_gtfs_static/stop_times.txt")


Trips: (88199, 12)
Stop Times: (2130402, 12)
Stops: (10292, 19)
Routes: (401, 14)


Unnamed: 0,trip_id,arrival_time,departure_time,stop_id,stop_sequence,stop_headsign,pickup_type,drop_off_type,timepoint,checkpoint_id,continuous_pickup,continuous_drop_off
0,70505896,05:15:00,05:15:00,70036,1,,0,1,0,ogmnl,,
1,70505896,05:16:00,05:16:00,70034,10,,0,0,0,mlmnl,,
2,70505896,05:20:00,05:20:00,70032,20,,0,0,0,welln,,
3,70505896,05:22:00,05:22:00,70278,30,,0,0,0,astao,,
4,70505896,05:24:00,05:24:00,70030,40,,0,0,0,sull,,


Step 3: Parse GTFS-RT Realtime Feed

1. Are those actual_arrival values correct?
    Example from your file: 2025-08-26 10:27:08
    That value comes from your system converting the GTFS-RT UNIX timestamp into a datetime.
    By default, Python’s datetime.fromtimestamp() (and also Excel) will convert it into your local timezone (in your case, India Standard Time).
    So yes,the times you’re seeing are correct in absolute terms, but they’re displayed in your local timezone, not the transit system’s timezone.

2. Which city’s data is this?
    The file (mbta_delays_dataset.csv) has route_id = Mattapan.
    That belongs to the MBTA (Massachusetts Bay Transportation Authority, Boston, USA).
    So the feed is Boston time (Eastern Time, UTC-5/UTC-4 DST).

3. How MBTA real-time bus delay works
    What data MBTA gives:
        The MBTA API provides predicted arrival/departure times at stops, as well as vehicle position data (latitude/longitude, speed, etc.).
    Where delay comes from:
        Delay is not calculated after the trip ends. Instead, MBTA continuously updates predictions during the trip.

In [5]:
pip install gtfs-realtime-bindings

Note: you may need to restart the kernel to use updated packages.


In [6]:
pip install --upgrade pip setuptools wheel

Note: you may need to restart the kernel to use updated packages.


In [7]:
pip install "timezonefinder==6.2.0"

Note: you may need to restart the kernel to use updated packages.


In [8]:
from google.transit import gtfs_realtime_pb2
import requests
import pandas as pd
from datetime import datetime, timezone
from timezonefinder import TimezoneFinder
import pytz 

tf = TimezoneFinder()

trip_url = "https://cdn.mbta.com/realtime/TripUpdates.pb"
trip_feed = gtfs_realtime_pb2.FeedMessage()
trip_feed.ParseFromString(requests.get(trip_url).content)

vehicle_url = "https://cdn.mbta.com/realtime/VehiclePositions.pb"
vehicle_feed = gtfs_realtime_pb2.FeedMessage()
vehicle_feed.ParseFromString(requests.get(vehicle_url).content)

trip_vehicle_to_coords = {}
for entity in vehicle_feed.entity:
    if entity.vehicle and entity.vehicle.position:
        trip_id = entity.vehicle.trip.trip_id
        vehicle_id = entity.vehicle.vehicle.id
        lat = entity.vehicle.position.latitude
        lon = entity.vehicle.position.longitude
        trip_vehicle_to_coords[(trip_id, vehicle_id)] = (lat, lon)

rows = []
for entity in trip_feed.entity:
    if entity.trip_update:
        trip_id = entity.trip_update.trip.trip_id
        route_id = entity.trip_update.trip.route_id
        vehicle_id = entity.trip_update.vehicle.id
        lat, lon = trip_vehicle_to_coords.get((trip_id, vehicle_id), (None, None))
        
        tz_name = None
        if lat is not None and lon is not None:
            try:
                tz_name = tf.timezone_at(lng=lon, lat=lat)
            except:
                tz_name = None
        tz = pytz.timezone(tz_name) if tz_name else timezone.utc

        for stu in entity.trip_update.stop_time_update:
            stop_id = stu.stop_id  
            arrival_time = stu.arrival.time if stu.HasField("arrival") else None
            delay = stu.arrival.delay if stu.HasField("arrival") else None

            actual_arrival = None
            if arrival_time:
                actual_arrival = datetime.fromtimestamp(arrival_time, tz=timezone.utc)
                actual_arrival = actual_arrival.astimezone(tz)

            rows.append({
                "trip_id": trip_id,
                "vehicle_id": vehicle_id,
                "route_id": route_id,
                "stop_id": stop_id,
                "actual_arrival": actual_arrival,
                "delay_seconds": delay,
                "latitude": lat,
                "longitude": lon,
                "timezone": tz_name
            })

realtime_df = pd.DataFrame(rows)
print("Realtime rows:", realtime_df.shape)
print(realtime_df.head())

Realtime rows: (22226, 9)
    trip_id vehicle_id route_id stop_id             actual_arrival  \
0  72082199      y1701       21     875                       None   
1  72082199      y1701       21     520  2025-11-08 19:22:12+00:00   
2  72082199      y1701       21   11521  2025-11-08 19:22:49+00:00   
3  72082199      y1701       21    5232  2025-11-08 19:24:28+00:00   
4  72082199      y1701       21     523  2025-11-08 19:25:15+00:00   

   delay_seconds  latitude  longitude timezone  
0            NaN       NaN        NaN     None  
1            0.0       NaN        NaN     None  
2            0.0       NaN        NaN     None  
3            0.0       NaN        NaN     None  
4            0.0       NaN        NaN     None  


In [9]:
dataset = pd.merge(
    stop_times[['trip_id', 'stop_id', 'arrival_time']], 
    realtime_df[['trip_id', 'stop_id', 'vehicle_id', 'route_id', 'actual_arrival', 'delay_seconds', 'latitude', 'longitude']],
    on=['trip_id', 'stop_id'],
    how='inner'
)

def hms_to_minutes(hms):
    try:
        h, m, s = map(int, hms.split(':'))
        return h*60 + m + s/60
    except:
        return None

dataset['scheduled_minutes'] = dataset['arrival_time'].apply(hms_to_minutes)

dataset['actual_arrival'] = pd.to_datetime(dataset['actual_arrival'], errors='coerce')

dataset['actual_minutes'] = (
    dataset['actual_arrival'].dt.hour * 60
    + dataset['actual_arrival'].dt.minute
    + dataset['actual_arrival'].dt.second / 60
)

dataset['delay_minutes'] = dataset['actual_minutes'] - dataset['scheduled_minutes']

dataset = dataset[[
    'trip_id',
    'stop_id',
    'vehicle_id',
    'route_id',
    'arrival_time',
    'actual_arrival',
    'delay_seconds',
    'scheduled_minutes',
    'actual_minutes',
    'delay_minutes',
    'latitude',
    'longitude'
]]

dataset.head()

Unnamed: 0,trip_id,stop_id,vehicle_id,route_id,arrival_time,actual_arrival,delay_seconds,scheduled_minutes,actual_minutes,delay_minutes,latitude,longitude
0,Base-772227-5242,BNT-0000,1718,CR-Haverhill,14:00:00,2025-11-08 13:56:57-05:00,0.0,840.0,836.95,-3.05,42.375034,-71.074013
1,Base-772228-5243,WR-0120-S,1702,CR-Haverhill,13:48:00,2025-11-08 13:54:45-05:00,0.0,828.0,834.75,6.75,42.519058,-71.100914
2,Base-772228-5243,WR-0163-S,1702,CR-Haverhill,13:54:00,2025-11-08 13:59:54-05:00,0.0,834.0,839.9,5.9,42.519058,-71.100914
3,Base-772228-5243,WR-0205-02,1702,CR-Haverhill,14:01:00,2025-11-08 14:05:30-05:00,0.0,841.0,845.5,4.5,42.519058,-71.100914
4,Base-772228-5243,WR-0228-02,1702,CR-Haverhill,14:06:00,2025-11-08 14:08:29-05:00,0.0,846.0,848.483333,2.483333,42.519058,-71.100914


In [10]:
dataset = pd.merge(dataset, trips[['trip_id','route_id']], on='trip_id', how='left')

dataset['hour_of_day'] = dataset['actual_arrival'].dt.hour
dataset['day_of_week'] = dataset['actual_arrival'].dt.dayofweek
dataset['is_peak'] = dataset['hour_of_day'].apply(lambda h: 1 if (7<=h<=10 or 16<=h<=19) else 0)

In [11]:
dataset.to_csv("mbta_delays_dataset.csv", index=False)

print("Saved CSV file. Rows:", len(dataset))

Saved CSV file. Rows: 146


In [12]:
from datetime import datetime
current_datetime = datetime.now()
print(current_datetime)

2025-11-09 00:23:58.527756
