# Time schedule tables from GTFS dataset

### Overview
This notebook contains code for processing a standard GTFS dataset and creating approximate time schedule tables for the transit lines in the network. Each time schedule table provides the start times and stop sequences of the transit services on each route. The tables are generated for a typical weekday, Saturday, and Sunday of every month.

### Import required libraries

In [2]:
import sys
from pathlib import Path
import json
import os

import pandas as pd
import gtfs_kit as gk
import itertools
import numpy as np

import warnings
warnings.filterwarnings("ignore")

from datetime import datetime

### Mandatory user inputs
The mandatory inputs are the working directory, the region name, and the release date of the GTFS dataset. The typical days are selected manually and must be provided as user input in the code. Select days that neither immediately follow nor immediately precede a public holiday. Dates shortly after FIL_DATE are used because later dates may be missing some of the actual trips owing to delays in scheduling (this process will be automated in the near future).
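
Until the selection is automated, the rule above can be sketched as a small helper. This is only an illustration under stated assumptions: the function name `pick_typical_days`, its parameters, and the choice of Thursday as the preferred weekday are hypothetical (20210121, the weekday used in this notebook, is a Thursday), and the list of public holidays must be supplied by the user.

```python
from datetime import datetime, timedelta

def pick_typical_days(fil_date, holidays=(), preferred_weekday=3, max_days=60):
    """Pick the first suitable weekday, Saturday and Sunday after fil_date.

    fil_date and holidays are 'YYYYMMDD' strings. A day is suitable when it is
    not a holiday and neither the previous nor the next day is a holiday.
    preferred_weekday follows date.weekday() (0 = Monday, 3 = Thursday).
    """
    holiday_set = {datetime.strptime(h, '%Y%m%d').date() for h in holidays}
    start = datetime.strptime(fil_date, '%Y%m%d').date()
    targets = {'weekday': preferred_weekday, 'saturday': 5, 'sunday': 6}
    picked = {}
    for offset in range(1, max_days + 1):
        day = start + timedelta(days=offset)
        # Skip holidays and days directly before or after a holiday
        if any(day + timedelta(days=d) in holiday_set for d in (-1, 0, 1)):
            continue
        for key, weekday in targets.items():
            if key not in picked and day.weekday() == weekday:
                picked[key] = day.strftime('%Y%m%d')
        if len(picked) == len(targets):
            break
    return picked

# Hypothetical usage with the FIL_DATE of this notebook and no holidays supplied
print(pick_typical_days('20210104'))
```

The returned dictionary has the same keys as `day_list`, so it could be reviewed by hand and then used in its place.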

In [3]:
#setting GTFS data directory path for loading
DIR = Path('..')
sys.path.append(str(DIR))
DATA_DIR = DIR/'gtfs_data/' #GTFS datasets to be processed must be downloaded or stored here.

#Setting GTFS data details. REGION_NAME attribute is the region covered by the available GTFS dataset. For the openov dataset, it is the Netherlands.
REGION_NAME = 'netherlands'

#Set the release date of the GTFS file. Please rename the GTFS folder to this FIL_DATE.
FIL_DATE = '20210104'

#Typical weekday, Saturday and Sunday. Enter the dates (YYYYMMDD format) used for generating the weekday,
#Saturday, and Sunday time schedule tables.
day_list = {
    "weekday": "20210121",
    "saturday": "20210123",
    "sunday": "20210124" 
}

### Settings for saving the time schedule tables

In [4]:
#dictionary to convert a month in MM format to its abbreviated name
months = {
    "01": "Jan", "02": "Feb", "03": "Mar", "04": "Apr", "05": "May", "06": "Jun",
    "07": "Jul", "08": "Aug", "09": "Sept", "10": "Oct", "11": "Nov", "12": "Dec"
}

#Extract year and month information
year = FIL_DATE[0:4]
month = FIL_DATE[4:6]

month_year_text = '{0}{1}_'.format(months[month],year)
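
As a cross-check, the same prefix can be derived with the standard library, which also rejects a malformed FIL_DATE. This is a minimal sketch and the helper name `month_year_prefix` is hypothetical. Note that `%b` depends on the active locale and yields 'Sep' rather than the 'Sept' used in the dictionary above, so file names for September would differ.

```python
from datetime import datetime

def month_year_prefix(fil_date):
    """Return a prefix such as 'Jan2021_' for a 'YYYYMMDD' string."""
    parsed = datetime.strptime(fil_date, '%Y%m%d')  # raises ValueError on bad input
    return parsed.strftime('%b%Y_')

print(month_year_prefix('20210104'))
```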

In [84]:
#Create the 'time_schedules' folder within the unzipped GTFS folder (no error if it already exists)
os.makedirs('{0}/{1}/{2}/time_schedules/'.format(DATA_DIR, REGION_NAME, FIL_DATE), exist_ok=True)

### Create the time schedule tables
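
The cell below narrows the feed in three steps: the services active on the chosen date (from `calendar_dates.txt`), then the trips running those services, then the stop times of those trips. The toy example here illustrates that chain on hand-made data; all identifiers and values are invented. It additionally keeps only rows with `exception_type == 1` (service added), which the GTFS specification distinguishes from `2` (service removed); the notebook itself does not apply this filter.

```python
import pandas as pd

# Invented miniature feed (not taken from the real dataset)
calendar_dates = pd.DataFrame({
    'service_id': [1, 2, 3],
    'date': [20210121, 20210121, 20210123],
    'exception_type': [1, 2, 1],
})
trips = pd.DataFrame({
    'trip_id': [10, 11, 12],
    'service_id': [1, 2, 3],
    'route_id': [100, 100, 200],
})
stop_times = pd.DataFrame({
    'trip_id': [10, 10, 11, 12],
    'stop_sequence': [1, 2, 1, 1],
    'stop_id': ['A', 'B', 'A', 'C'],
})

curr_date = 20210121

# 1. Services added on the date of interest
active = calendar_dates[(calendar_dates['date'] == curr_date)
                        & (calendar_dates['exception_type'] == 1)]
services_of_interest = active['service_id'].unique()

# 2. Trips that run those services
trips_of_interest = trips.loc[trips['service_id'].isin(services_of_interest), 'trip_id']

# 3. Stop times of those trips
selected = stop_times[stop_times['trip_id'].isin(trips_of_interest)]
print(selected)
```

Only the two stop times of trip 10 remain, because service 2 is removed on that date and service 3 runs on another day.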

In [85]:
#Create the time schedule tables for weekday, saturday and sunday.
for key, curr_date in day_list.items():
    routes = pd.read_csv(DIR/'gtfs_data/{0}/{1}/routes.txt'.format(REGION_NAME, FIL_DATE), low_memory=False)
    routes = routes[['route_id', 'route_type']]

    calendar_dates = pd.read_csv(DIR/'gtfs_data/{0}/{1}/calendar_dates.txt'.format(REGION_NAME, FIL_DATE), low_memory=False)

    #Subsetting services on the calendar date
    services_of_interest = calendar_dates.query('date == {}'.format(curr_date)).service_id.values

    #Subsetting the trips for the date of interest
    trips = pd.read_csv(DIR/'gtfs_data/{0}/{1}/trips.txt'.format(REGION_NAME, FIL_DATE), low_memory=False)
    trips = trips[['route_id', 'service_id', 'trip_id', 'shape_id', 'direction_id']]
    trips = trips.loc[trips['service_id'].isin(services_of_interest)]
    #print(trips.head())

    trips_of_interest = trips.trip_id.values
    #print(trips_of_interest[0:5])
    
    #Select only those stop_times corresponding to the trips of our interest
    stop_times = pd.read_csv(DIR/'gtfs_data/{0}/{1}/stop_times.txt'.format(REGION_NAME, FIL_DATE), low_memory=False)
    stop_times = stop_times.loc[stop_times['trip_id'].isin(trips_of_interest)]
    stop_times['stop_id'] = stop_times['stop_id'].astype(str)
    
    #Drop stop times at or after 24:00:00. GTFS allows such times for trips running past midnight,
    #but they cannot be parsed with the '%H:%M:%S' format used below
    #(widen the range if any timestamp starts with a number >= 50)
    stop_times = stop_times[['trip_id', 'stop_sequence', 'stop_id', 'arrival_time', 'departure_time']]
    for hour in range(24, 50):
        stop_times = stop_times[~stop_times.arrival_time.str.startswith("{}:".format(hour))]

    #merge stop times dataframe with trips dataframe
    stop_times = pd.merge(stop_times, trips, on="trip_id")
    stop_times['arrival_time'] = pd.to_datetime(stop_times['arrival_time'], format='%H:%M:%S').dt.time
    
    #Create an empty dataframe with every combination of unique values of route_id and direction_id
    route_list = stop_times.route_id.unique()
    direction_list = stop_times.direction_id.unique()

    cat = {
        'route_id': route_list,
        'direction_id' : direction_list
    }

    order = cat.keys()
    time_schedule = pd.DataFrame(itertools.product(*[cat[k] for k in order]), columns=order)
    #Use object-typed columns so that each cell can hold a list
    time_schedule['start_time'] = None
    time_schedule['stop_sequence'] = None
    time_schedule['inter_stop_tt'] = None

    #time_schedule.head()

    #generate the time schedule details for every route_id and direction_id pair
    for i in range(time_schedule.shape[0]):
        #first stop of every trip on this route and direction (assumes that stop_sequence starts at 1)
        curr_route_stop_times = stop_times.query('route_id == {0} and direction_id == {1} and stop_sequence == 1'.format(time_schedule.route_id[i], time_schedule.direction_id[i]))
        curr_route_stop_times = curr_route_stop_times.reset_index()
        #print(curr_route_stop_times.head())

        if curr_route_stop_times.shape[0] > 0:
            start_times = curr_route_stop_times.arrival_time.sort_values().astype(str)
            start_array = [','.join(ele.split()) for ele in start_times]
            #.at writes the list into a single cell and avoids chained assignment
            time_schedule.at[i, 'start_time'] = start_array
            #print(i)

            #adding time intervals between stops, taken from the first trip found on the route
            first_trip_in_curr_route = stop_times.query('trip_id == {}'.format(curr_route_stop_times.trip_id[0]))
            first_trip_in_curr_route = first_trip_in_curr_route.sort_values('stop_sequence')
            #print(first_trip_in_curr_route.head())

            #diff(-1) gives (current - next), hence the absolute value; the last entry is NaN and is dropped
            travel_times = pd.to_timedelta(first_trip_in_curr_route['arrival_time'].astype(str)).diff(-1).dt.total_seconds()
            #print(travel_times)
            tt_array = [','.join(ele.split()) for ele in travel_times.iloc[:-1].abs().astype(int).astype(str)]
            time_schedule.at[i, 'inter_stop_tt'] = tt_array

            stop_sequence = first_trip_in_curr_route.stop_id.astype(str)
            stop_array = [','.join(ele.split()) for ele in stop_sequence]
            time_schedule.at[i, 'stop_sequence'] = stop_array
        print('route {} in direction {} computed.'.format(time_schedule.route_id[i],time_schedule.direction_id[i]))

    time_schedule = pd.merge(time_schedule, routes, on="route_id")

    #route_type codes in the merged table (route_type is defined in routes.txt):
    #0 - Tram, Streetcar, Light rail. Any light rail or street level system within a metropolitan area.
    #1 - Subway, Metro. Any underground rail system within a metropolitan area.
    #2 - Rail. Used for intercity or long-distance travel.
    #3 - Bus. Used for short- and long-distance bus routes.
    #4 - Ferry. Used for short- and long-distance boat service.

    # More details at https://developers.google.com/transit/gtfs/reference#tripstxt
    
    #save the time schedule table
    folder = '../gtfs_data/{0}/{1}/time_schedules/'.format(REGION_NAME, FIL_DATE)
    time_schedule.to_csv('{0}{1}{2}.csv'.format(folder,month_year_text,key), header=True, index=False) 
    print("Successfully extracted the time schedule table for {0}-{1}".format(key, curr_date))

    route_id  service_id    trip_id  shape_id  direction_id
19     19413        1337  121558792  938938.0             0
20     19412        1337  121558814  938942.0             0
21     19412        1337  121558813  938941.0             1
22     19411        1337  121558801  938939.0             1
23     19413        1337  121558797  938937.0             1
[121558792 121558814 121558813 121558801 121558797]
route 19413 in direction 0 computed.
route 19413 in direction 1 computed.
route 61959 in direction 0 computed.
route 61959 in direction 1 computed.
route 73441 in direction 0 computed.
route 73441 in direction 1 computed.
route 73442 in direction 0 computed.
route 73442 in direction 1 computed.
route 73443 in direction 0 computed.
route 73443 in direction 1 computed.
route 73444 in direction 0 computed.
route 73444 in direction 1 computed.
route 76001 in direction 0 computed.
route 76001 in direction 1 computed.
route 76003 in direction 0 computed.
route 76003 in direction 1 computed.
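
The inter-stop travel times are the differences between consecutive arrival times of one trip. The sketch below reproduces that step on an invented trip. It uses `pd.to_timedelta`, which also accepts GTFS times at or beyond 24:00:00, so trips that continue past midnight would not need to be dropped. The helper name `inter_stop_seconds` and the sample times are assumptions made for illustration, not part of the notebook.

```python
import pandas as pd

def inter_stop_seconds(arrival_times):
    """Return travel times in seconds between consecutive stops.

    arrival_times: 'HH:MM:SS' strings ordered by stop_sequence; hours may be
    24 or more for trips that continue past midnight.
    """
    elapsed = pd.to_timedelta(pd.Series(arrival_times))
    # diff() gives (current - previous); the first entry is NaT and is dropped
    return elapsed.diff().dropna().dt.total_seconds().astype(int).tolist()

# Invented trip that crosses midnight
print(inter_stop_seconds(['23:50:00', '23:58:30', '24:05:00', '25:10:00']))
```

As in the notebook, the result has one entry fewer than the list of stops.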

### Generate the README.txt accompanying the schedule tables

In [86]:
#Create the README.txt file
readme_text = "The files in this folder correspond to the transit time schedules. Each file is \n\
based on a typical day of the month (weekday, Saturday and Sunday). The following \n\
are the fields in each of the files. \n\
\n\
route_id	: The field provides the identifier corresponding to the route. \n\
The routes.txt file can be used to obtain more info.\n\
\n\
direction_id	: route_id is the same for all trips in a route irrespective of their direction. \n\
direction_id is used to differentiate trips in opposite directions.\n\
\n\
start_time	: This field provides arrays of start times of trips in the same route in a day. \n\
This can be used to initiate/schedule the trips on all routes.\n\
\n\
stop_sequence	: This field provides arrays of the stops on a route (direction sensitive). \n\
\n\
inter_stop_tt	: This field provides the array of inter-stop travel times (in seconds). \n\
The number of entries in an array would be one less than that in the stop_sequence array.\n\
\n\
The following dates were used to create the time-schedule tables.\n\
\n\
{0}saturday.csv	: {1}\n\
{0}sunday.csv	: {2}\n\
{0}weekday.csv	: {3}\n\
\n\
Notes:\n\
\n\
1. The time tables are generated using the GTFS dataset for {4} available at \n\
https://transitfeeds.com/p/ov/814/{4}/download\n\
2. Return trip details are not available for some routes, mostly long-distance ones.".format(
    month_year_text, day_list['saturday'], day_list['sunday'], day_list['weekday'], FIL_DATE)
#print(readme_text)
#the with-block closes the file even if writing fails
with open('{0}README.txt'.format(folder), 'w') as textfile:
    textfile.write(readme_text)

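
Because each cell of `start_time`, `stop_sequence` and `inter_stop_tt` holds a Python list, `to_csv` stores it as the text representation of that list, and `read_csv` returns it as a plain string. A minimal sketch of turning those strings back into lists with `ast.literal_eval` follows. The sample row is invented and the helper name `parse_list_cell` is hypothetical.

```python
import ast
import io

import pandas as pd

def parse_list_cell(cell):
    """Convert "['a', 'b']" back into a list; return None for empty (NaN) cells."""
    if isinstance(cell, str):
        return ast.literal_eval(cell)
    return None

# Invented schedule row, written and read back through an in-memory CSV
table = pd.DataFrame({
    'route_id': [100],
    'direction_id': [0],
    'start_time': [['06:30:00', '07:00:00']],
    'stop_sequence': [['A', 'B', 'C']],
    'inter_stop_tt': [['120', '300']],
})
buffer = io.StringIO()
table.to_csv(buffer, header=True, index=False)
buffer.seek(0)

restored = pd.read_csv(buffer)
for column in ['start_time', 'stop_sequence', 'inter_stop_tt']:
    restored[column] = restored[column].apply(parse_list_cell)

print(restored.loc[0, 'start_time'])
```

The same loop could be applied to a file such as the weekday table after loading it with `pd.read_csv`.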