# Overview

This Jupyter Notebook generates one `n.json` file per line `n`, containing that line's list of stops. The MyBus tool uses these files to populate the stops dropdowns on the landing page.

This version is made specifically for the September 2021 shakeup.

In [110]:
import pandas as pd
import numpy as np
from zipfile import ZipFile
import re

In [111]:
LINES_PATH = '../data/lines.json'
OUTPUT_PATH = '../data/'
SCRATCH_PATH = 'scratch/'

In [112]:
lines = pd.read_json(LINES_PATH)
lines_array = lines.loc[:, 'route_short_name'].values

In [113]:
with ZipFile('../data/input/stops.zip', 'r') as zf:
    zf.extractall(SCRATCH_PATH)

with ZipFile('../data/input/stop_times.zip', 'r') as zf:
    zf.extractall(SCRATCH_PATH)

STOPS_PATH = 'scratch/stops.txt'
STOP_TIMES_PATH = 'scratch/stop_times.txt'

In [114]:
stops = pd.read_csv(STOPS_PATH,
    usecols=['stop_id','stop_name'],
    dtype={'stop_id':'string','stop_name':'string'})
#stops.head(5)

In [115]:
#scratch code to look up stop ids and stop names
#stops[stops.stop_id.str.contains('1177')]

In [116]:
stop_times = pd.read_csv(STOP_TIMES_PATH,
    usecols=['stop_id','stop_headsign','pickup_type','drop_off_type'],
    dtype={'stop_id':'string','stop_headsign':'string'})
#stop_times.head(5)

## Remove Stops from Non-Revenue Trips

Non-revenue trips are indicated by `pickup_type` and `drop_off_type`.

Rows where both values are `0` are definitely revenue stops. It is less clear whether to include stops where only one of the values is `1` (pick-up only or drop-off only). This notebook keeps them and drops only the rows where both values are `1`.

This fixes an issue in the June 2021 iteration of MyBus, where terminals and layovers showed up in the stop lists.

In [117]:
# Remove stops from non-revenue trips, as indicated by
# pickup_type == 1 and drop_off_type == 1
stop_times = stop_times[(stop_times.pickup_type == 0) | (stop_times.drop_off_type == 0)]
#stop_times
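
As a minimal sketch of how the OR filter above behaves, the toy example below applies the same condition to four made-up rows covering every combination of `pickup_type` and `drop_off_type`. The stop IDs `A` through `D` are hypothetical and not taken from the GTFS data.

```python
import pandas as pd

# Toy stop_times rows; the stop IDs and values are made up for illustration.
toy_stop_times = pd.DataFrame({
    'stop_id': ['A', 'B', 'C', 'D'],
    'pickup_type': [0, 0, 1, 1],
    'drop_off_type': [0, 1, 0, 1],
})

# Keep a row when riders can board OR alight there.
keep = (toy_stop_times.pickup_type == 0) | (toy_stop_times.drop_off_type == 0)
kept = toy_stop_times[keep]

# Only stop D (no pickup and no drop-off) is removed.
print(kept.stop_id.tolist())
```

Pick-up-only and drop-off-only stops (`B` and `C`) survive the filter, which matches the decision to keep them in the dropdowns.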

## Modify Stop Name

Service planners indicated `stop_id` 12422 should have a stop name of `State / Cesar E Chavez`.  Reference GitHub [issue #44](https://github.com/LACMTA/mybus/issues/44).

In [118]:
stops.loc[stops.stop_id == '12422', 'stop_name'] = 'State / Cesar E Chavez'

#stops[stops.stop_id == '12422']

## Combine Lines & Stops Data

Merge `stop_times` and `stops` using a LEFT JOIN on `stop_id`.  For each stop on a line, this will show that stop's name.

Use the `lines_and_stops` dataframe to generate a file for each line that lists all unique stops for that line.

In [119]:
lines_and_stops = pd.merge(stop_times, stops, how="left", on="stop_id")
lines_and_stops.head(5)

Unnamed: 0,stop_id,stop_headsign,pickup_type,drop_off_type,stop_name
0,20500001,577 - El Monte Station,0,0,7th / Channel
1,20500006,577 - El Monte Station,0,0,Cal State Long Beach
2,30016,577 - El Monte Station,0,0,Norwalk Station
3,17035,577 - El Monte Station,0,0,Workman Mill / College
4,30020,577 - El Monte Station,0,0,El Monte Station - Lower Level
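
A left join keeps every `stop_times` row even when its `stop_id` has no match in `stops`, in which case `stop_name` comes back empty. The sketch below is one possible way to surface such rows, using the `validate` and `indicator` options of `pd.merge`. The toy frames and the stop ID `99999` are assumptions for illustration only.

```python
import pandas as pd

# Toy inputs; '99999' is a made-up stop_id with no entry in toy_stops.
toy_stop_times = pd.DataFrame({
    'stop_id': ['30016', '30016', '99999'],
    'stop_headsign': ['577 - El Monte Station'] * 3,
})
toy_stops = pd.DataFrame({
    'stop_id': ['30016'],
    'stop_name': ['Norwalk Station'],
})

# validate='many_to_one' raises if stop_id is duplicated in toy_stops;
# indicator=True adds a '_merge' column describing where each row came from.
merged = pd.merge(toy_stop_times, toy_stops, how='left', on='stop_id',
                  validate='many_to_one', indicator=True)

# Rows present only on the left side have no stop name.
unmatched = merged[merged['_merge'] == 'left_only']
print(unmatched.stop_id.tolist())
```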


In [120]:
# No stop names contain "Terminal" after filtering for pickup_type AND drop_off_type equal to 0.
#lines_and_stops[lines_and_stops.stop_name.str.contains('Terminal')]

# With pickup_type OR drop_off_type equal to 0, there are cases where a stop may have 1 for one of the values
# (this indicates drop-off or pick-up only). None of these stops indicate Terminals or Divisions.
# So they should be good to include in the dropdowns.
#lines_and_stops[lines_and_stops.drop_off_type == 1].stop_name.unique()
#lines_and_stops[lines_and_stops.pickup_type == 1].stop_name.unique()

# Clean up lines_and_stops to remove the extra columns
# that aren't needed from this point forward.
lines_and_stops = lines_and_stops[['stop_id','stop_headsign','stop_name']].copy()
lines_and_stops

Unnamed: 0,stop_id,stop_headsign,stop_name
0,20500001,577 - El Monte Station,7th / Channel
1,20500006,577 - El Monte Station,Cal State Long Beach
2,30016,577 - El Monte Station,Norwalk Station
3,17035,577 - El Monte Station,Workman Mill / College
4,30020,577 - El Monte Station,El Monte Station - Lower Level
...,...,...,...
2030269,2319,Hawthorne / Lennox Station,Hawthorne / Lennox Station
2030270,30022,Hawthorne / Lennox Station,Sofi Stadium Transit Center
2030271,2319,Hawthorne / Lennox Station,Hawthorne / Lennox Station
2030272,30022,Hawthorne / Lennox Station,Sofi Stadium Transit Center


## Extract the Line/Route Number Into a Separate Column

In most cases, we can use the number that appears at the start of the headsign.

For lines that have routes (e.g. Line 10 has routes 10 & 48), the stop at which they switch over will only contain one route number but should be included as a stop for both routes.

Example:
`Change to Route 215 - Redondo Beach Station` is a `stop_headsign` value and it should appear for both the 211 and the 215.


```
[          'Change to Route 215 - Redondo Beach Station',
              'Change to Route 211 - South Bay Galleria',
                   'Change to Route 10 - West Hollywood',
                   'Change to Route 10 - Melrose - Vine',
                   'Change to Route 48 - Avalon Station',
           'Change to Route 48 - San Pedro - Manchester',
 'Change to Route 35 - Washington / Fairfax Transit Hub',
 'Change to Route 38 - Washington / Fairfax Transit Hub',
                   'Change to Route 14 - Beverly Center',
 'Change to Route 37 - Washington / Fairfax Transit Hub',
                'Change to Route 14 - Beverly - Western',
        '854 - L - Gold Line Shtl Pico / Aliso Little T',
        '854 - L - Gold Line Shtl Union Via Little Toky',
                                     '489 - Temple City',
                     '489 - Downtown LA - 5th - Beaudry',
        'Change to Route 242 - Porter Ranch Town Center',
        'Change to Route 243 - Porter Ranch Town Center',
                                          '235 - Encino',
                                  '235 - Sylmar Station',
                                'Dodger Stadium Express',
                                'Harbor Gateway Express',
                                 'Union Station Express',
                                          'SoFi Stadium',
                            'Hawthorne / Lennox Station']
```
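
The three headsign shapes in the list above (number first, "Change to Route" with the number later, and no number at all) can be told apart with two regular expressions. The sketch below is an alternative illustration and not the notebook's method: it captures the number directly instead of looping over known lines, and `classify_headsign` is a hypothetical helper name.

```python
import re

# Number at the very start, followed by ' - '.
LINE_FIRST = re.compile(r'^(\d+)\s-\s')
# At least one non-digit character, whitespace, then the number and ' - '.
CHANGE_TO = re.compile(r'^\D+\s(\d+)\s-\s')

def classify_headsign(headsign):
    """Return ('direct', n), ('change', n), or ('none', None)."""
    m = LINE_FIRST.match(headsign)
    if m:
        return ('direct', int(m.group(1)))
    m = CHANGE_TO.match(headsign)
    if m:
        return ('change', int(m.group(1)))
    return ('none', None)

for h in ['235 - Encino',
          'Change to Route 215 - Redondo Beach Station',
          'SoFi Stadium']:
    print(h, '->', classify_headsign(h))
```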

In [121]:
sister_routes = {
	211:215,
	215:211,
	10:48,
	48:10,
	35:38,
	38:35,
	14:37,
	37:14,
	487:489,
	489:487,
	242:243,
	243:242,
	235:236,
	236:235
}
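
Because every pair appears twice in the dictionary above (once in each direction), it is easy to add one direction and forget the other. As a sketch, the mapping could instead be built from a list of pairs and checked for symmetry. `build_sister_routes` is a hypothetical helper, not part of the notebook.

```python
# The same seven pairs as the dictionary above, each listed once.
pairs = [(211, 215), (10, 48), (35, 38), (14, 37),
         (487, 489), (242, 243), (235, 236)]

def build_sister_routes(pairs):
    """Build a two-way lookup so each route maps to its sister route."""
    mapping = {}
    for a, b in pairs:
        mapping[a] = b
        mapping[b] = a
    return mapping

sister = build_sister_routes(pairs)

# Symmetry check: following the mapping twice returns the starting route.
assert all(sister[sister[route]] == route for route in sister)
print(len(sister), 'entries')
```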

In [122]:
lines_and_stops['line'] = np.nan
counter = 1

# line values in lines_array can be more than just numbers,
# e.g. the 901, 910, and 950.
for line in lines_array:

    # Account for line being:
    # 901 / G Line (Orange)
    # 910 / J Line (Silver)
    # 950 / J Line (Silver)
    search_obj = re.match(r'\d+', line)
    line_num_only = ''

    if search_obj:
        line_num_only = search_obj.group()
    else:
        line_num_only = line

    print('Line ' + line_num_only)
    
    # Create regexes using the line number.
    # If the line number is at the beginning of the headsign, that is the line number for that stop.
    # Else if the line number appears later in the string (at least one non-digit character before it),
    #     the sister route is the line number for that stop.
    # Else, it is either not included in lines_array or it is a temporary express bus with no number associated.
    # Raw strings avoid invalid-escape warnings; \b keeps e.g. 10 from matching inside a longer number.
    line_first_regex = r'^' + str(line_num_only) + r'\s-\s'
    line_later_regex = r'^\D+\s' + str(line_num_only) + r'\b'

    line_first_list = lines_and_stops.stop_headsign.str.contains(line_first_regex)
    line_later_list = lines_and_stops.stop_headsign.str.contains(line_later_regex)

    lines_and_stops.loc[line_first_list, 'line'] = line_num_only
    for h in lines_and_stops.loc[line_first_list, 'stop_headsign'].unique():
        print('    ' + h)

    if int(line_num_only) in sister_routes:
        # Reuse the mask computed above instead of re-running the regex.
        lines_and_stops.loc[line_later_list, 'line'] = sister_routes[int(line_num_only)]
        
        print('    ' + '********** Extras: **********')
        for h in lines_and_stops.loc[line_later_list, 'stop_headsign'].unique():
            print('    ' + h)

    if counter % 10 == 0:
        print('**********' + str(counter) + ' lines processed **********')

    counter += 1

lines_and_stops.head()

Line 2
    2 - Sunset - Alvarado
    2 - Westwood / UCLA
    2 - Downtown LA - 7th - Broadway
    2 - West Hollywood - Sunset - San Vicente
    2 - Downtown LA - Broadway - Venice
    2 - Vermont / Sunset Sta
Line 4
    4 - Downtown LA - Broadway - Venice
    4 - Santa Monica
    4 - West LA - Sepulveda Bl
Line 10
    10 - Avalon Station
    10 - Downtown LA - Main - Venice
    10 - West Hollywood
    10 - San Pedro - Manchester
    10 - Downtown LA - 7th - Hill
    ********** Extras: **********
    Change to Route 10 - West Hollywood
    Change to Route 10 - Melrose - Vine
Line 14
    14 - Washington / Fairfax Transit Hub
    ********** Extras: **********
    Change to Route 14 - Beverly Center
    Change to Route 14 - Beverly - Western
Line 16
    16 - West Hollywood
    16 - Downtown LA - 6th - Los Angeles
Line 18
    18 - Montebello Station
    18 - Wilshire / Western Station
    18 - Downtown LA
    18 - Whittier - Garfield
    18 - Wilshire / Vermont Sta
    18 - Commerce Center


Unnamed: 0,stop_id,stop_headsign,stop_name,line
0,20500001,577 - El Monte Station,7th / Channel,577
1,20500006,577 - El Monte Station,Cal State Long Beach,577
2,30016,577 - El Monte Station,Norwalk Station,577
3,17035,577 - El Monte Station,Workman Mill / College,577
4,30020,577 - El Monte Station,El Monte Station - Lower Level,577


In [123]:
lines_and_stops[lines_and_stops.line.isnull()].stop_headsign.unique()

#lines_and_stops[lines_and_stops.stop_headsign.str.contains('235')].head()

<StringArray>
[    'Dodger Stadium Express',     'Harbor Gateway Express',
      'Union Station Express',               'SoFi Stadium',
 'Hawthorne / Lennox Station']
Length: 5, dtype: string

In [124]:
# Drop the rows that have no Line value.
# These should already be validated to be the temporary Express lines (Dodger Stadium, SoFi Stadium)
lines_and_stops.replace('', float("NaN"), inplace=True)
lines_and_stops.dropna(subset = ['line'], inplace=True)
lines_and_stops.line = lines_and_stops.line.astype('int')

# correctly prints nothing:
#lines_and_stops[lines_and_stops.line.isnull()].stop_headsign.unique()
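
Dropping rows without a line value is only safe if the unmatched headsigns are the expected unnumbered services. One way to make that check explicit is sketched below. The expected set is copied from the output of the previous cell, while `unexpected_unmatched` and the headsign `999 - Made Up` are hypothetical.

```python
import pandas as pd

# Headsigns that are known to have no line number (from the previous cell's output).
EXPECTED_UNNUMBERED = {
    'Dodger Stadium Express',
    'Harbor Gateway Express',
    'Union Station Express',
    'SoFi Stadium',
    'Hawthorne / Lennox Station',
}

def unexpected_unmatched(df, expected=EXPECTED_UNNUMBERED):
    """Return headsigns with no line value that are not in the expected set."""
    unmatched = set(df.loc[df['line'].isnull(), 'stop_headsign'].dropna().unique())
    return sorted(unmatched - set(expected))

# Toy data: '999 - Made Up' is a made-up headsign that failed to get a line.
toy = pd.DataFrame({
    'stop_headsign': ['577 - El Monte Station', 'SoFi Stadium', '999 - Made Up'],
    'line': [577, None, None],
})
leftovers = unexpected_unmatched(toy)
print(leftovers)
```

Running a check like this before `dropna` would flag a numbered line that the regex step missed, instead of silently dropping its stops.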

## Add Stops for Restored Service

Lines 110 and 550 have restored service.  The stops do not exist in the June 2021 GTFS data.  Reference GitHub [issue #62](https://github.com/LACMTA/mybus/issues/62).

In [133]:
restored_stops = pd.read_csv('../data/input/restored_stops.csv',
    usecols=['line','stop_id','stop_name'],
    dtype={'line':'int','stop_id':'string','stop_name':'string'})

lines_and_stops = pd.concat([lines_and_stops, restored_stops])

lines_and_stops

Unnamed: 0,stop_id,stop_headsign,stop_name,line
0,20500001,577 - El Monte Station,7th / Channel,577
1,20500006,577 - El Monte Station,Cal State Long Beach,577
2,30016,577 - El Monte Station,Norwalk Station,577
3,17035,577 - El Monte Station,Workman Mill / College,577
4,30020,577 - El Monte Station,El Monte Station - Lower Level,577
...,...,...,...,...
41,15616,,39th St / Figueroa,550
42,10994,,Harbor Transit Way / Slauson,550
43,10853,,Harbor Transit Way / Manchester,550
44,2324,,Harbor Transit Way / I105 C Line (Green) Station,550


In [134]:
grouped_stops = lines_and_stops.groupby(['line', 'stop_name'])

unique_grouped_stops = grouped_stops['stop_id'].unique()

unique_grouped_stops = unique_grouped_stops.reset_index()

# Join each group's unique stop_ids into one pipe-delimited string, e.g. '5656|14039'.
def aggregate_stop_id(row):
    return '|'.join(row.stop_id)

unique_grouped_stops['stop_id_agg'] = unique_grouped_stops.apply(aggregate_stop_id, axis=1)

aggregated_grouped_stops = unique_grouped_stops[['line','stop_name','stop_id_agg']].copy()

aggregated_grouped_stops

Unnamed: 0,line,stop_name,stop_id_agg
0,2,Alvarado / Montana,3360
1,2,Alvarado / Sunset,3362
2,2,Broadway / 12th,15598
3,2,Broadway / 1st,4767
4,2,Broadway / 3rd,13227
...,...,...,...
9697,550,Jefferson / Vermont,7766
9698,550,McClintock / Jefferson,2457
9699,550,Vermont / 36th Pl,5656|14039
9700,550,Vermont / Exposition,5635
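
To show what the aggregation produces and how it ends up in the JSON files, the sketch below runs a one-step `groupby(...).agg(...)` variant on toy rows and converts the result to the same `orient='records'` shape. The stop names and IDs are borrowed from the output above, but the row order and duplication are assumptions.

```python
import json
import pandas as pd

# Toy rows: one stop name served by two stop_ids, plus a repeated row.
toy = pd.DataFrame({
    'line': [550, 550, 550, 550],
    'stop_name': ['Vermont / 36th Pl', 'Vermont / 36th Pl',
                  'Vermont / 36th Pl', 'Vermont / Exposition'],
    'stop_id': ['5656', '14039', '5656', '5635'],
})

# Group by line and stop name, then join the unique stop_ids with '|'.
aggregated = (
    toy.groupby(['line', 'stop_name'])['stop_id']
       .agg(lambda ids: '|'.join(ids.unique()))
       .reset_index()
)

# to_json without a path returns a string; parse it to inspect the records.
records = json.loads(aggregated[['stop_id', 'stop_name']].to_json(orient='records'))
print(records)
```

Each record pairs a stop name with one or more pipe-delimited stop IDs, which is the structure the dropdowns consume.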


In [136]:
count = 0

for line in lines_array:
    search_obj = re.match(r'\d+', line)
    line_num_only = ''

    if search_obj:
        line_num_only = search_obj.group()
    else:
        line_num_only = line
    
    line_filename = OUTPUT_PATH + 'line-stops/' + str(line_num_only) + '.json'

    # no de-duping necessary because the data was already grouped by line + stop_name
    stop_by_line = aggregated_grouped_stops[aggregated_grouped_stops.line == int(line_num_only)]
    stop_by_line = stop_by_line.rename(columns={'stop_id_agg':'stop_id'})
    stop_by_line[['stop_id','stop_name']].sort_values('stop_name').to_json(line_filename, orient='records')

    print('Line ' + line_filename + ' created' + ' (' + str(len(stop_by_line)) + ')')
    count += 1

print(str(count) + ' files created.')

Line ../data/line-stops/2.json created (122)
Line ../data/line-stops/4.json created (128)
Line ../data/line-stops/10.json created (121)
Line ../data/line-stops/14.json created (78)
Line ../data/line-stops/16.json created (62)
Line ../data/line-stops/18.json created (96)
Line ../data/line-stops/20.json created (99)
Line ../data/line-stops/28.json created (68)
Line ../data/line-stops/30.json created (60)
Line ../data/line-stops/33.json created (113)
Line ../data/line-stops/35.json created (81)
Line ../data/line-stops/37.json created (75)
Line ../data/line-stops/38.json created (77)
Line ../data/line-stops/40.json created (121)
Line ../data/line-stops/45.json created (72)
Line ../data/line-stops/48.json created (108)
Line ../data/line-stops/51.json created (101)
Line ../data/line-stops/53.json created (96)
Line ../data/line-stops/55.json created (80)
Line ../data/line-stops/60.json created (117)
Line ../data/line-stops/62.json created (102)
Line ../data/line-stops/66.json created (85)
Lin