# Overview

This Jupyter Notebook reads GTFS data, then combines and adjusts it to produce these files:

* `lines.json` - This file is a list of line numbers to be used in the MyBus tool (to select lines).
* `[line-number].json` - A file for each line number that lists unique stops across all trips for that line.


In [2]:
import pandas as pd
DATA_INPUT_PATH = 'data-input/'
DATA_OUTPUT_PATH = 'data-output/'

# Load GTFS Data

## __`routes.txt`__

GTFS files are pulled from: https://gitlab.com/LACMTA/gtfs_bus

The data used here is from version hash `9a71d665`.

* `route_id`
* `route_short_name`

## __`stops.txt`__

* `stop_id`
* `stop_name`
* `stop_lat`
* `stop_lon`

## __`stop_times.txt`__

* `trip_id`
* `stop_id`
* `stop_sequence`
* `stop_headsign`

In [45]:
lines = pd.read_csv(DATA_INPUT_PATH + 'routes.txt',
    usecols=['route_id', 'route_short_name'],
    dtype={'route_id':'string', 'route_short_name':'string'})

lines.tail(5)

Unnamed: 0,route_id,route_short_name
119,854-13139,
120,901-13139,
121,910-13139,
122,DSE-HG,South Bay Dodger Stadium Express
123,DSE-US,Dodger Stadium Express


In [4]:
stops = pd.read_csv(DATA_INPUT_PATH + 'stops.txt',
    usecols=['stop_id','stop_name','stop_lat','stop_lon'],
    dtype={'stop_id':'string','stop_name':'string','stop_lat':'float64','stop_lon':'float64'})
stops.head(5)

Unnamed: 0,stop_id,stop_name,stop_lat,stop_lon
0,1,Paramount / Slauson,33.973248,-118.113113
1,3,Jefferson / 10th,34.025471,-118.328402
2,6,120th / Augustus F Hawkins,33.924696,-118.242222
3,7,120th / Martin Luther King Hospital,33.924505,-118.240369
4,12,15054 Sherman Way,34.201075,-118.461953


In [5]:
stop_times = pd.read_csv(DATA_INPUT_PATH + 'stop_times.txt',
    usecols=['trip_id','stop_id','stop_sequence','stop_headsign'],
    dtype={'trip_id':'string','stop_id':'string','stop_sequence':'int64','stop_headsign':'string'})
stop_times.head(5)

Unnamed: 0,trip_id,stop_id,stop_sequence,stop_headsign
0,52088401-DEC20-D02CAR-1_Weekday,10246,1,611 - Vernon Station
1,52088401-DEC20-D02CAR-1_Weekday,10248,2,611 - Vernon Station
2,52088401-DEC20-D02CAR-1_Weekday,9371,3,611 - Vernon Station
3,52088401-DEC20-D02CAR-1_Weekday,9350,4,611 - Vernon Station
4,52088401-DEC20-D02CAR-1_Weekday,9351,5,611 - Vernon Station


# Output Lines

Output file as: `lines.json`.

`.to_json()` method

* Available on a DataFrame (or Series), but not on the array returned by `.unique()`
* Outputs data by column - use `orient='records'` to output by record
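
As a small illustration of the difference between the two layouts, the sketch below serializes a two-row frame both ways. The `demo_lines` frame is made-up sample data standing in for the real `lines` dataframe.

```python
import json
import pandas as pd

# Hypothetical two-row sample standing in for the real `lines` dataframe
demo_lines = pd.DataFrame({
    'route_id': ['2-13139', '4-13139'],
    'route_short_name': ['2', '4'],
})

# Default orient for a DataFrame is 'columns': {column -> {index -> value}}
by_column = json.loads(demo_lines.to_json())

# orient='records' gives one object per row
by_record = json.loads(demo_lines.to_json(orient='records'))

print(by_column)
print(by_record)
```

The record layout is a plain list of objects, which is likely the easier shape to loop over when building a dropdown.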

Fields in output:

* `route_id` - route number plus HASTUS version. For lines with sister routes, will only list first line.
  * Ex: `10-13139`
* `route_short_name` - route number, includes sister routes.
  * Ex: `10/48`

Modifications:

* The Silver Line and Orange Line do not have `route_short_name` values so those have to be manually added.
* The L Line Shuttle and the two Dodger Stadium Express Shuttles will be removed since they're temporary services.
* Lines with sister routes may need to be split so that each is treated as a separate line, unless a rider can stay on a single vehicle and end up on the other line.  In that case we would want to combine the stops for both lines so they are selectable from the landing page.
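
One possible way to split sister routes without a row-by-row loop is `str.split` followed by `explode`. This is only a sketch on made-up sample rows (`sample_lines` is hypothetical), not necessarily how the notebook should do it.

```python
import pandas as pd

# Hypothetical sample rows; '10/48' is a sister-route pair sharing one route_id
sample_lines = pd.DataFrame({
    'route_id': ['10-13139', '2-13139'],
    'route_short_name': ['10/48', '2'],
})

# Split on '/' into lists, then give each list element its own row
split_lines = (
    sample_lines
    .assign(route_short_name=sample_lines['route_short_name'].str.split('/'))
    .explode('route_short_name')
    .reset_index(drop=True)
)

# Cast to int so the sort is numeric ('10' sorts before '2' as a string)
split_lines = (
    split_lines
    .astype({'route_short_name': 'int32'})
    .sort_values('route_short_name')
    .reset_index(drop=True)
)

print(split_lines)
```

Both sister lines keep the shared `route_id`, matching the description of the `route_id` field above.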

## MyBus Usage

Landing Page - Line Select Dropdown

* `route_id` - use as button value, pass this to the results page as a URL parameter
* `route_short_name` - user-friendly text for the dropdown (just a number)

Results Page - Header

* `route_short_name` - use as H1



In [6]:
# Add route_short_name values for the 901 (Orange Line) and 910/950 (Silver Line)
lines.loc[lines["route_id"] == '910-13139', 'route_short_name'] = '910/950'
lines.loc[lines["route_id"] == '901-13139', 'route_short_name'] = '901'

# Remove the entries for the Dodger Stadium Express shuttles and the L Line (Gold) Shuttle
lines = lines.loc[~lines["route_id"].isin(['DSE-HG', 'DSE-US', '854-13139'])]

# Separate out the sister lines.
lines_array = lines.loc[lines['route_short_name'].str.contains('/'), 'route_short_name'].values

for short_name in lines_array:
    # route_id shared by both sister lines (named route_id to avoid shadowing the built-in id)
    route_id = lines.loc[lines['route_short_name'] == short_name, 'route_id'].values[0]
    slash = short_name.find('/')
    line1 = short_name[:slash]
    line2 = short_name[slash+1:]

    # Replace the combined row with one row per sister line
    lines = lines.loc[~lines["route_id"].isin([route_id])]
    newlines = pd.DataFrame([[route_id, line1], [route_id, line2]], columns=['route_id', 'route_short_name'])
    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
    lines = pd.concat([lines, newlines], ignore_index=True)

# cast route_short_name to int32 so that we can sort by their integer value
lines = lines.astype({'route_short_name': 'int32'}).sort_values('route_short_name')
lines.tail(5)

Unnamed: 0,route_id,route_short_name
103,780-13139,780
104,794-13139,794
105,901-13139,901
134,910-13139,910
135,910-13139,950


In [74]:
# Output the LINES dataframe to a JSON file
lines.to_json(DATA_OUTPUT_PATH + "lines.json",orient='records')
lines.head(5)

Unnamed: 0,route_id,route_short_name
0,2-13139,2
1,4-13139,4
106,10-13139,10
108,14-13139,14
2,16-13139,16


# Combine Stops Data

Merge `stop_times` and `stops` using a LEFT JOIN on `stop_id`.  For each stop time on a trip, this adds that stop's name and lat/lon.

Use the `lines_and_stops` dataframe to generate a file for each line that lists all unique stops for that line.
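
To show what the LEFT JOIN does with a `stop_id` that has no match, here is a minimal sketch on made-up data. The `stop_times_demo` and `stops_demo` frames and their values are hypothetical; `indicator=True` is an optional `pd.merge` argument that is not used in the notebook itself.

```python
import pandas as pd

# Hypothetical stop times; stop_id '99' has no entry in stops_demo
stop_times_demo = pd.DataFrame({
    'trip_id': ['t1', 't1', 't2'],
    'stop_id': ['10', '11', '99'],
    'stop_sequence': [1, 2, 1],
})
stops_demo = pd.DataFrame({
    'stop_id': ['10', '11'],
    'stop_name': ['Main / 1st', 'Main / 2nd'],
})

# A left join keeps every stop_times row; unmatched rows get missing stop fields
merged = pd.merge(stop_times_demo, stops_demo, how='left', on='stop_id', indicator=True)

# The _merge column added by indicator=True flags rows with no matching stop
unmatched = merged[merged['_merge'] == 'left_only']

print(merged)
print(unmatched['stop_id'].tolist())
```

Running a check like this on the real data would reveal any stop times that reference a stop missing from `stops.txt`.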

In [7]:
lines_and_stops = pd.merge(stop_times, stops, how="left", on="stop_id")
lines_and_stops.head(5)

#lines_and_stops[lines_and_stops['stop_headsign'].str.contains('^48\s', regex=True)]


Unnamed: 0,trip_id,stop_id,stop_sequence,stop_headsign,stop_name,stop_lat,stop_lon
0,52088401-DEC20-D02CAR-1_Weekday,10246,1,611 - Vernon Station,Vernon / Long Beach,34.00405,-118.242837
1,52088401-DEC20-D02CAR-1_Weekday,10248,2,611 - Vernon Station,Vernon / Morgan,34.00404,-118.244926
2,52088401-DEC20-D02CAR-1_Weekday,9371,3,611 - Vernon Station,Compton / Vernon,34.00363,-118.247907
3,52088401-DEC20-D02CAR-1_Weekday,9350,4,611 - Vernon Station,Compton / 46th,34.001767,-118.247915
4,52088401-DEC20-D02CAR-1_Weekday,9351,5,611 - Vernon Station,Compton / 48th,33.999487,-118.247918


# Looking at Stops Data

Questions:

* How many records are there in `lines_and_stops` for a particular line?
  * Use REGEX matching on `stop_headsign`. Will need to use an OR operator for sister lines because each has distinct headsign values.
* Of those records, how many unique `trip_id`s are there?
* Of those records, how many unique `stop_name`s are there?
* What is the highest `stop_sequence` value for that line?
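
The questions above could be answered in one pass with a small helper like the one below. This is a sketch: `summarize_line` and the sample rows are hypothetical, and the pattern uses a non-capturing group `(?:...)` for the OR because pandas warns when a `str.contains` pattern has capture groups.

```python
import pandas as pd

def summarize_line(df, pattern):
    """Return row, trip, stop-name counts and the highest stop_sequence for one line."""
    # na=False treats rows with a missing headsign as non-matching
    rows = df[df['stop_headsign'].str.contains(pattern, regex=True, na=False)]
    return {
        'rows': len(rows),
        'unique_trips': rows['trip_id'].nunique(),
        'unique_stop_names': rows['stop_name'].nunique(),
        # Assumes at least one row matched; max() is NaN for an empty selection
        'max_stop_sequence': int(rows['stop_sequence'].max()),
    }

# Hypothetical rows shaped like lines_and_stops
demo = pd.DataFrame({
    'trip_id': ['a', 'a', 'b', 'c'],
    'stop_headsign': ['10 - Downtown', '10 - Downtown', '48 - Avalon', '100 - Elsewhere'],
    'stop_name': ['Main / 1st', 'Main / 2nd', 'Main / 1st', 'Main / 9th'],
    'stop_sequence': [1, 2, 1, 1],
})

# Sister lines 10 and 48 combined with an OR
summary = summarize_line(demo, r'^(?:10|48)\s')
print(summary)
```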

## Stops Data Findings

Line 2

* 31,094 stop times along Line 2
* 377 unique trips
* 92 stops MAX within a single trip
* 377 x 92 = 34,684, which is more than the 31,094 stop times recorded, so some trips have fewer than 92 stops
* 123 unique stop names - this means trips do not all contain the same set of stops
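
The arithmetic behind the Line 2 findings can be checked directly. The figures are the ones listed above; the variable names are only illustrative.

```python
# Figures taken from the Line 2 findings
stop_times_line_2 = 31_094
unique_trips = 377
max_stops_per_trip = 92

# Row count we would see if every trip served the maximum number of stops
upper_bound = unique_trips * max_stops_per_trip

# How far the actual row count falls short of that upper bound
shortfall = upper_bound - stop_times_line_2

# Average number of stops per trip
average_stops_per_trip = stop_times_line_2 / unique_trips

print(upper_bound, shortfall, round(average_stops_per_trip, 1))
```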

Line 10/48

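* 16,524 stop times along Lines 10/48 (12,534 for Line 10 alone)
* 316 unique trips (231 for Line 10 alone)
* 102 stops MAX within a single trip
* 132 unique stop names (117 for Line 10 alone)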

Highest `stop_sequence`

* Line 90
* 136 is the highest value
* Line 90 has 158 total unique stops.

In [47]:
####################
#  Line 2          #
####################

# All values for Line 2 = 31,094
#lines_and_stops[lines_and_stops['stop_headsign'].str.contains('^2\s', regex=True)]
 
# Unique stop names for Line 2 = 123
#lines_and_stops[lines_and_stops['stop_headsign'].str.contains('^2\s', regex=True)].stop_name.unique()

# Unique trip_ids for Line 2 = 377
#lines_and_stops[lines_and_stops['stop_headsign'].str.contains('^2\s', regex=True)].trip_id.unique()

# Line 2 stops sorted by highest stop_sequence value = 92
#lines_and_stops[lines_and_stops['stop_headsign'].str.contains('^2\s', regex=True)].sort_values('stop_sequence', ascending=False).head(10)

In [48]:
####################
#  Lines 10/48     #
####################

# REGEX: ^(10\s|48\s)

# All values for Line 10 = 12,534 rows
# All values for Line 10/48 = 16,524 rows
#lines_and_stops[lines_and_stops['stop_headsign'].str.contains('^10\s', regex=True)]
#lines_and_stops[lines_and_stops['stop_headsign'].str.contains('^(10\s|48\s)', regex=True)]

# Unique trip_ids for Line 10 = 231
# Unique trip_ids for Line 10/48 = 316
#lines_and_stops[lines_and_stops['stop_headsign'].str.contains('^10\s', regex=True)].trip_id.unique()
#lines_and_stops[lines_and_stops['stop_headsign'].str.contains('^(10\s|48\s)', regex=True)].trip_id.unique()

# Unique stop names for Line 10 = 117
# Unique stop names for Line 10/48 = 132
#lines_and_stops[lines_and_stops['stop_headsign'].str.contains('^10\s', regex=True)].stop_name.unique()
#lines_and_stops[lines_and_stops['stop_headsign'].str.contains('^(10\s|48\s)', regex=True)].stop_name.unique()

# Line 10/48 stops sorted by highest stop_sequence value = 102
#lines_and_stops[lines_and_stops['stop_headsign'].str.contains('^(10\s|48\s)', regex=True)].sort_values('stop_sequence', ascending=False).head(10)

In [52]:
# Line 90 has highest stop_sequence value of 136
#lines_and_stops.sort_values('stop_sequence', ascending=False).head(10)

# Line 90 has 158 unique stop names
#lines_and_stops[lines_and_stops['stop_headsign'].str.contains('^90\s', regex=True)].stop_name.unique()

# Output All Unique Stops for Line 2

File output as `2-13139.json`

Use `.drop_duplicates()` to generate a new dataframe with only unique values.  Using `.[column name].unique()` outputs a StringArray and we need a DataFrame in order to call `.to_json()`.

Sort results by `stop_name` because we're combining multiple trips that may each have their own set of stops along their routes.
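
The difference between `.unique()` and `.drop_duplicates()` can be seen on a small made-up frame. `demo_stops` and its values are hypothetical.

```python
import json
import pandas as pd

# Hypothetical stops; two different stop_ids share the name 'Main / 1st'
demo_stops = pd.DataFrame({
    'stop_id': ['1', '2', '3'],
    'stop_name': ['Main / 1st', 'Main / 2nd', 'Main / 1st'],
})

# .unique() returns an array of values, which has no .to_json() method
unique_names = demo_stops['stop_name'].unique()

# .drop_duplicates() returns a DataFrame, keeping the first row seen for each name
deduped = demo_stops.drop_duplicates(subset='stop_name')

records = json.loads(
    deduped[['stop_id', 'stop_name']].sort_values('stop_name').to_json(orient='records')
)

print(type(unique_names).__name__, hasattr(unique_names, 'to_json'))
print(records)
```

Note that de-duplicating on `stop_name` keeps only the first `stop_id` for each name, so a second stop with the same name (here `stop_id` 3) does not appear in the output.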

Fields in output:

* `stop_id`
* `stop_name`

## Usage

Landing Page - Stop Select Dropdowns

* `stop_id` - use as button value, pass this to the results page as a URL parameter
* `stop_name` - user-friendly text for the dropdown

## TODO

* Check `stop_name`s for any abbreviations that should be corrected.

In [138]:
# ONLY FOR LINE 2

# Create a dataframe dropping duplicate stop names
#dedupped_stops = lines_and_stops[lines_and_stops['stop_headsign'].str.contains('^2\s', regex=True)].drop_duplicates(subset='stop_name')

#dedupped_stops[['stop_id','stop_name']].sort_values('stop_name').to_json(DATA_OUTPUT_PATH + '2-13139.json', orient='records')

#dedupped_stops.head()

Unnamed: 0,trip_id,stop_id,stop_sequence,stop_headsign,stop_name,stop_lat,stop_lon
846104,52248752-DEC20-D18CAR-1_Weekday,141012,1,950 - Silver Line - El Monte Sta Via Downtown,Pacific / 21st Layover,33.724933,-118.288128
846105,52248752-DEC20-D18CAR-1_Weekday,5397,2,950 - Silver Line - El Monte Sta Via Downtown,Pacific / 17th,33.728654,-118.287802
846106,52248752-DEC20-D18CAR-1_Weekday,5396,3,950 - Silver Line - El Monte Sta Via Downtown,Pacific / 15th,33.730465,-118.287791
846107,52248752-DEC20-D18CAR-1_Weekday,5395,4,950 - Silver Line - El Monte Sta Via Downtown,Pacific / 11th,33.734092,-118.287774
846108,52248752-DEC20-D18CAR-1_Weekday,5410,5,950 - Silver Line - El Monte Sta Via Downtown,Pacific / 7th,33.737728,-118.287757


# Output All Stops For Each Line

Loop through data to generate a separate file for each line.  Each file will contain all unique `stop_name`s for that line.  Lines are matched using the `stop_headsign` field.  Each `stop_headsign` value contains only one line number, so sister lines are matched separately.

## Method

* Create an array of all the `route_short_name` values which should match the line numbers within `stop_headsign`.
* For each of those line numbers, find the rows in `lines_and_stops` that contain that line number within `stop_headsign`.
* From those values, drop duplicate `stop_name`s to create a list of all unique stops for that line.
* Sort the values so the `stop_name`s are in alphabetical order and output the results to JSON files.
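
The `^` anchor and the trailing `\s` in the pattern are what keep Line 2 from also matching Lines 20 or 720. The sketch below shows this with the standard `re` module; the headsign strings are made up for illustration.

```python
import re

# Hypothetical headsign values; only the leading line number matters here
headsigns = [
    '2 - Downtown LA',
    '20 - Santa Monica',
    '720 - Commerce',
    '2 - Westwood',
]

# ^ pins the match to the start; \s requires whitespace right after the number
anchored = re.compile(r'^2\s')
anchored_matches = [h for h in headsigns if anchored.search(h)]

# Without the anchors, every headsign containing a '2' matches
loose = re.compile(r'2')
loose_matches = [h for h in headsigns if loose.search(h)]

print(anchored_matches)
print(loose_matches)
```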



In [77]:
lines_array = lines.loc[:, 'route_short_name'].values

for line in lines_array:
    # Raw string so the regex engine receives \s without an invalid-escape warning
    line_regex = '^' + str(line) + r'\s'
    line_filename = DATA_OUTPUT_PATH + str(line) + '.json'

    dedupped_stops = lines_and_stops[lines_and_stops['stop_headsign'].str.contains(line_regex, regex=True)].drop_duplicates(subset='stop_name')
    dedupped_stops[['stop_id','stop_name']].sort_values('stop_name').to_json(line_filename, orient='records')

    print('Line ' + line_filename + ' created')

Line data-output/2.json created
Line data-output/4.json created
Line data-output/10.json created
Line data-output/14.json created
Line data-output/16.json created
Line data-output/18.json created
Line data-output/20.json created
Line data-output/28.json created
Line data-output/30.json created
Line data-output/33.json created
Line data-output/35.json created
Line data-output/37.json created
Line data-output/38.json created
Line data-output/40.json created
Line data-output/45.json created
Line data-output/48.json created
Line data-output/51.json created
Line data-output/52.json created
Line data-output/53.json created
Line data-output/55.json created
Line data-output/60.json created
Line data-output/62.json created
Line data-output/66.json created
Line data-output/68.json created
Line data-output/70.json created
Line data-output/71.json created
Line data-output/76.json created
Line data-output/78.json created
Line data-output/79.json created
Line data-output/81.json created
Line data-ou

# Random Scratch Code Below

In [68]:
# route_sn_column = lines.loc[:, 'route_short_name']
# route_sn_array = route_sn_column.values
# lines_adjusted = []

# for i, line in enumerate(route_sn_array):
#     slash = line.find('/')

#     if slash > 0:
#         lines_adjusted.append(line[:slash])
#         lines_adjusted.append(line[slash+1:])
#         continue
#     else:
#         lines_adjusted.append(line)

# # 139 lines
# print(lines_adjusted)
# print('\nTotal number of lines after split: ', len(lines_adjusted))

In [None]:
# route_id_array = route_id_column.values
# route_num_array = []

# for i, line in enumerate(route_id_array):
#     route_num_array.append('^' + route_id_array[i].replace('-13139','') + '\s')

# print(route_id_array)
# print(route_num_array)