# Overview

This Jupyter notebook reads stop change details and their associated high-level categories from a Google Sheet (exported as CSV) and outputs a JSON file for each line to be used in the MyBus tool.

The output file is used by the MyBus tool's results page and contains the stop-level changes that are displayed there.


In [6]:
import pandas as pd
DATA_INPUT_PATH = 'data-input/'
DATA_OUTPUT_PATH = 'data-output/changes/'

In [7]:
stop_changes = pd.read_csv(DATA_INPUT_PATH + 'stop_changes - ALL.csv')
#stop_changes.count()
stop_changes.head()

Unnamed: 0,source,line,direction,stop_id,on_street,at_street,stop_name,service_canceled,service_changed,service_replaced,stop_canceled,stop_relocated,route_changed,owl_service_canceled
0,SGV June 2021 Bus Stop Signage Print Tracking,70,E,678.0,,,Cesar E Chavez / Progress,False,False,False,True,False,False,False
1,SGV June 2021 Bus Stop Signage Print Tracking,70,E,672.0,,,Cesar E Chavez / Ditman,False,False,False,True,False,False,False
2,SGV June 2021 Bus Stop Signage Print Tracking,70,E,7666.0,,,Garvey / Chandler,False,False,False,True,False,False,False
3,SGV June 2021 Bus Stop Signage Print Tracking,70,E,1463.0,,,Garvey / Nicholson,False,False,False,True,False,False,False
4,SGV June 2021 Bus Stop Signage Print Tracking,70,E,2177.0,,,Garvey / River,False,False,False,True,False,False,False


# Stop Changes

Data was compiled from spreadsheets and slides provided by Service Planners.

As of 5/25/2021, there are 4752 rows.

## No `stop_id`

40 entries have no `stop_id`; the other 4712 rows have one.  The missing values need to be set to 0 so that the column can be cast to type `int`.

Of these entries:

* 8 are for lines 111, 207 (Scott)
* 3 are placeholders for line 757 (Scott)
* 1 is for Terminal 28/East Lot
* 28 are for SFV lines 92, 94, 164, 165, 224, 234, 240 (Israel)
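The counts above can be checked before the empty values are filled. The following is a minimal sketch on a toy frame, not the notebook's actual data; `sample`, `missing_mask`, and `missing_by_line` are illustrative names.

```python
import pandas as pd

# Toy stand-in for stop_changes (hypothetical rows, not real data).
sample = pd.DataFrame({
    'line': [70, 111, 207, 757],
    'stop_id': [678.0, None, None, 7666.0],
})

# True where stop_id is missing (NaN after the CSV is read).
missing_mask = sample['stop_id'].isna()

n_missing = int(missing_mask.sum())
n_present = int((~missing_mask).sum())

# Break the missing entries down by line, as in the list above.
missing_by_line = sample.loc[missing_mask].groupby('line').size()

print(n_missing, n_present)
print(missing_by_line)
```

Running the same three expressions against `stop_changes` before the `fillna(0)` step should reproduce the 40 / 4712 split, assuming the input file has not changed.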



In [8]:
# Set empty stop_ids to 0
stop_changes['stop_id'] = stop_changes['stop_id'].fillna(0)
stop_changes['stop_id'] = stop_changes['stop_id'].astype(int)

# Show rows where stop_id == 0 (stop_id doesn't exist)
#stop_changes.loc[stop_changes['stop_id'] == 0]

# Analysis

## Duplicate `stop_id`

Not counting the 40 entries with no `stop_id`, there are 505 `stop_id`s that occur at least 2 times.

There are 55 `stop_id`s that occur at least 3 times.  `30015` (SYL/SF Metrolink Station) occurs 14 times.

## Duplicate `line` & `stop_id` combo

After removing rows with non-existent `stop_id`s, there are 4506 unique `line` + `stop_id` combinations.

* 4317 combos occur 1 time, 189 combos occur multiple times
* 179 combos occur 2 times
* 10 combos occur at least 3 times

Among the 10 combos that have at least 3 entries with the same `line` and `stop_id`, the line 237, `stop_id` 25001 combo has entries with `service_changed` and `service_replaced` selected.
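The breakdown of how often each combo occurs can be computed from the grouped counts. This is a minimal sketch on a toy frame (the rows are made up for illustration); it assumes, as in this notebook, that a `stop_id` of 0 marks a non-existent stop.

```python
import pandas as pd

# Toy stand-in for stop_changes (hypothetical rows, not real data).
sample = pd.DataFrame({
    'line':    [70, 70, 70, 237, 237, 237, 94, 94],
    'stop_id': [678, 678, 672, 25001, 25001, 25001, 0, 0],
})

# Exclude rows whose stop_id was filled with 0.
existing = sample.loc[sample['stop_id'] != 0]

# One row per line + stop_id combo with its number of occurrences.
combo_counts = existing.groupby(['line', 'stop_id']).size().reset_index(name='count')

n_unique_combos = len(combo_counts)
n_single = int((combo_counts['count'] == 1).sum())
n_twice = int((combo_counts['count'] == 2).sum())
n_three_plus = int((combo_counts['count'] >= 3).sum())

print(n_unique_combos, n_single, n_twice, n_three_plus)
```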

## Merge

Join the datasets to show the stop change categories for the line-stop combos with duplicates.  The 189 line-stop combos matched 395 rows in the overall dataset.

From this, identify where a line-stop combo has different change categories applied across its rows.

This dataframe was exported to CSV and loaded into Google Sheets for easier analysis.
Within those 395 rows, 178 line-stop combos have different categories listed.
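The comparison done in Google Sheets could also be approximated in pandas. The sketch below is one possible approach, not the method actually used for the analysis: it counts the distinct category patterns per combo and keeps the combos with more than one. The rows are made up, and only three of the category columns are included to keep the example short.

```python
import pandas as pd

# Subset of the category columns from the input file, for brevity.
category_cols = ['service_changed', 'service_replaced', 'stop_canceled']

# Toy stand-in for the merged duplicates frame (hypothetical rows).
dupes = pd.DataFrame(
    [
        (237, 25001, True, False, False),
        (237, 25001, False, True, False),
        (70, 678, False, False, True),
        (70, 678, False, False, True),
    ],
    columns=['line', 'stop_id'] + category_cols,
)

# Drop rows that repeat the same categories for the same combo,
# then count how many distinct patterns remain per combo.
distinct_patterns = (
    dupes.drop_duplicates(subset=['line', 'stop_id'] + category_cols)
         .groupby(['line', 'stop_id'])
         .size()
)

# Combos whose duplicate rows disagree on at least one category.
conflicting = distinct_patterns[distinct_patterns > 1].reset_index(name='n_patterns')

print(conflicting)
```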

In [34]:
# Find duplicate stop_ids
#stop_changes['stop_id'].value_counts()
#stop_changes['stop_id'].value_counts().loc[lambda x : x>2]

# stop_id == 30015 (14 occurrences)
# SYL/SF Metrolink Station, different lines, directions, and locations
#stop_changes.loc[stop_changes['stop_id'] == 30015]

In [99]:
# Find duplicate line + stop_id combos

# exclude non-existent stop_ids
stop_changes_existing_stopids = stop_changes.loc[stop_changes['stop_id'] != 0]
# 4712 rows

# Count the rows for each line + stop_id combo
stop_changes_existing_stopids = stop_changes_existing_stopids.groupby(['line', 'stop_id']).size().reset_index(name="count")
# Keep only the combos that occur more than once
stop_changes_existing_stopids = stop_changes_existing_stopids.loc[stop_changes_existing_stopids['count'] > 1]

#combo_lines = stop_changes_existing_stopids.loc[stop_changes_existing_stopids['count'] > 1].line.unique()
#combo_stops = stop_changes_existing_stopids.loc[stop_changes_existing_stopids['count'] > 1].stop_id.unique()

#stop_changes[stop_changes.line.isin(combo_lines)].sort_values(by=['line', 'stop_id'])

filtered_combos = pd.merge(stop_changes_existing_stopids, stop_changes, how='inner', on=['line', 'stop_id']).sort_values(by=['line', 'stop_id'])

# Output file with the stops that have duplicate rows.
filtered_combos.to_csv(DATA_OUTPUT_PATH + 'stop_changes_duplicates.csv')

# Output

Output a file for each line.

Combine duplicate rows and make sure the appropriate categories are selected for each in the case that there are multiple rows.
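The two steps described above could be sketched as follows. This is a hedged illustration rather than the notebook's implementation: it assumes that a combined row should have a category selected if any of its duplicate rows does, and the per-line file naming (`<line>.json`) and the temporary output directory are hypothetical. The rows are made up.

```python
import os
import tempfile

import pandas as pd

# Subset of the category columns from the input file, for brevity.
category_cols = ['service_changed', 'service_replaced', 'stop_canceled']

# Toy stand-in for stop_changes (hypothetical rows, not real data).
sample = pd.DataFrame(
    [
        (237, 25001, 'Stop A', True, False, False),
        (237, 25001, 'Stop A', False, True, False),
        (70, 678, 'Cesar E Chavez / Progress', False, False, True),
    ],
    columns=['line', 'stop_id', 'stop_name'] + category_cols,
)

# Assumption: a category applies to the combined row if ANY duplicate has it.
aggregations = {'stop_name': 'first', **{col: 'any' for col in category_cols}}
combined = sample.groupby(['line', 'stop_id'], as_index=False).agg(aggregations)

# Write one JSON file per line (hypothetical naming scheme and location).
output_dir = tempfile.mkdtemp()
written = []
for line, line_df in combined.groupby('line'):
    path = os.path.join(output_dir, f'{line}.json')
    line_df.to_json(path, orient='records')
    written.append(path)

print(combined)
print(written)
```

Rows whose `stop_id` was set to 0 would need separate handling, since grouping on `line` and `stop_id` would otherwise merge unrelated stops on the same line.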



In [8]:
stop_changes.to_json(DATA_OUTPUT_PATH + 'stop_changes.json', orient='records')

