## 1.0 Setting up the environment

### 1.1 Importing libraries

We are only using the following libraries for this project:
- **pandas** for data manipulation
- **requests** for downloading the data
- **zipfile** for unzipping the data
- **io** for reading the data
- **os** for checking the existence of files

In [1]:
import pandas as pd
import requests
import zipfile
import io
import os


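### 1.2 Check for the GTFS files and download if missing

We are using the `gtfs_bus.zip` file from the [LACMTA GTFS Data Repository]()
- If any of the expected GTFS files is missing from the `input` folder, download the zip file and extract it there
- If all of the files are already present, skip the download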

In [2]:

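def download_gtfs_and_extract_zip(url, zip_file_name, output_folder):
    # Download the archive and stop on HTTP errors instead of trying to unzip an error page
    r = requests.get(url)
    r.raise_for_status()
    # Unzip straight from memory without writing the zip file to disk
    z = zipfile.ZipFile(io.BytesIO(r.content))
    z.extractall(output_folder)
    print(f'Extracted {zip_file_name} to {output_folder}')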

def check_if_gtfs_files_exist(target_folder):
    gtfs_files = ['agency.txt', 'calendar.txt', 'calendar_dates.txt', 'routes.txt', 'shapes.txt', 'stop_times.txt', 'stops.txt', 'trips.txt']
    for file in gtfs_files:
        if not os.path.isfile(f'{target_folder}/{file}'):
            print(f'{file} is missing from {target_folder}')
            return False
    return True

gtfs_url = 'https://github.com/LACMTA/los-angeles-regional-gtfs/raw/main/lacmta/current-base/gtfs_bus.zip'

if check_if_gtfs_files_exist('input'):
    print('GTFS files already exist')
else:
    download_gtfs_and_extract_zip(gtfs_url, 'gtfs_bus.zip', 'input')


GTFS files already exist
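
The in-memory extraction used by `download_gtfs_and_extract_zip` can be tried without any network access. The sketch below is only an illustration: the archive contents are made up, and a temporary folder stands in for `input`.

```python
import io
import os
import tempfile
import zipfile

# Build a tiny zip archive in memory to stand in for the downloaded bytes.
# The file name and contents are made up for illustration.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("agency.txt", "agency_id,agency_name\nDEMO,Demo Transit\n")
zip_bytes = buffer.getvalue()

# Same pattern as in the notebook: wrap the bytes in BytesIO,
# open them as a zip archive, and extract into a folder.
output_folder = tempfile.mkdtemp()
with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
    archive.extractall(output_folder)

extracted = sorted(os.listdir(output_folder))
print(extracted)
```

Because the archive never touches the disk as a `.zip` file, only the extracted text files end up in the target folder.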


### 1.3 Loading the data

We are using the following GTFS Static datasets:
- **trips.txt** for the trips information:
  - route_id, service_id, trip_id, trip_headsign, direction_id, shape_id
- **stop_times.txt** for the stop times information:
  - trip_id, stop_id, stop_sequence, pickup_type, drop_off_type
- **stops.txt** for the stop names:
  - stop_id, stop_name

In [3]:
gtfs_trips = pd.read_csv('input/trips.txt')
gtfs_stop_times = pd.read_csv('input/stop_times.txt')
gtfs_stops = pd.read_csv('input/stops.txt')
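
`stop_times.txt` is by far the largest of these files, so it can help to read only the columns the analysis needs and to state the type of ID columns explicitly. The following is a minimal sketch on made-up sample rows, not the notebook's actual loading code; the choice of columns and the `str` type for `trip_id` are assumptions.

```python
import io

import pandas as pd

# Made-up rows standing in for input/stop_times.txt.
sample_stop_times = io.StringIO(
    "trip_id,arrival_time,stop_id,stop_sequence,pickup_type,drop_off_type\n"
    "T1,08:00:00,3202,1,0,0\n"
    "T1,08:05:00,3217,2,0,1\n"
)

# usecols skips the columns that are not needed; dtype avoids type guessing for IDs.
stop_times_subset = pd.read_csv(
    sample_stop_times,
    usecols=["trip_id", "stop_id", "stop_sequence", "pickup_type", "drop_off_type"],
    dtype={"trip_id": str},
)
print(stop_times_subset)
```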



## 2.0 Joining and grouping the data

We will perform the following tasks:
1. Join the `trips` and `stop_times` data to get each `stop_id` and `stop_sequence` for each trip
2. Group the data by `route_id` and `direction_id` to list the distinct `shape_id`s for each route and direction
3. Count the number of `shape_id`s for each `route_id` and `direction_id`
4. Save that count as an overview of all routes
5. Filter the data to only include route and direction combinations with more than 1 `shape_id`


### 2.1 Joining the trips and stop times data

We begin by joining the two datasets on the `trip_id` column to get a list of all the `stop_id`s for each `route_id`.

This will give us a dataframe called `simplified_trips_join_stop_times` with the following columns:
- `route_id`
- `direction_id`
- `shape_id`
- `stop_id`
- `stop_sequence`
- `route_code`

In [4]:
trips_joined_stop_times = pd.merge(gtfs_trips, gtfs_stop_times, on='trip_id', how='inner')
group_join_by_stop_data = trips_joined_stop_times.groupby(['route_id','direction_id','shape_id','stop_id','stop_sequence','route_code']).size().reset_index(name='count')
order_by_shape_id_stop_sequence = group_join_by_stop_data.sort_values(by=['shape_id','stop_sequence'], ascending=True)
simplified_trips_join_stop_times = order_by_shape_id_stop_sequence[['route_id','direction_id','shape_id','stop_id', 'stop_sequence','route_code']]
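
The join and grouping above can be followed on a small example. This is a sketch with made-up trips and stop times (three trips on one route, two shapes), not LACMTA data, and it leaves out `route_code` to stay short.

```python
import pandas as pd

# Made-up data: trips T1 and T2 share shape S1, trip T3 uses shape S2.
trips = pd.DataFrame({
    "trip_id": ["T1", "T2", "T3"],
    "route_id": ["10", "10", "10"],
    "direction_id": [0, 0, 0],
    "shape_id": ["S1", "S1", "S2"],
})
stop_times = pd.DataFrame({
    "trip_id": ["T1", "T1", "T2", "T2", "T3"],
    "stop_id": [1, 2, 1, 2, 1],
    "stop_sequence": [1, 2, 1, 2, 1],
})

# One row per stop event, with the trip attributes attached.
joined = pd.merge(trips, stop_times, on="trip_id", how="inner")

# Collapse repeated trips: one row per shape and stop, with the number of trips in "count".
stops_per_shape = (
    joined.groupby(["route_id", "direction_id", "shape_id", "stop_id", "stop_sequence"])
    .size()
    .reset_index(name="count")
    .sort_values(by=["shape_id", "stop_sequence"])
)
print(stops_per_shape)
```

Here the two trips on shape `S1` collapse into two rows (one per stop) with a count of 2, while shape `S2` contributes a single row.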

### 2.2 Grouping the `stop_id`s by `route_id` and `direction_id`

We then group the rows by `route_id`, `direction_id`, and `route_code` to get a unique list of `shape_id`s, and the number of distinct `shape_id`s, for each combination.


In [5]:
group_route_id_by_distinct_direction_id_shape_ids = simplified_trips_join_stop_times.groupby(['route_id','direction_id','route_code'])['shape_id'].unique().reset_index(name='shape_ids')
group_by_direction_id_and_shape_ids = simplified_trips_join_stop_times.groupby(['route_id','direction_id','route_code']).agg({'shape_id': pd.Series.nunique}).reset_index()
group_by_direction_id_and_shape_ids

Unnamed: 0,route_id,direction_id,route_code,shape_id
0,10-13168,0,10,14
1,10-13168,0,48,7
2,10-13168,1,10,9
3,10-13168,1,48,6
4,102-13168,0,102,4
...,...,...,...,...
247,96-13168,1,96,1
248,DSE-HG,0,,1
249,DSE-HG,1,,1
250,DSE-US,0,,1
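
The two groupings in this cell differ only in the aggregation: `unique()` keeps the distinct values, `nunique()` counts them. A minimal sketch on made-up rows, without `route_code`:

```python
import pandas as pd

# Made-up rows: route 10 has two shapes in direction 0, route 20 has one in direction 1.
stops_per_shape = pd.DataFrame({
    "route_id": ["10", "10", "10", "20"],
    "direction_id": [0, 0, 0, 1],
    "shape_id": ["S1", "S1", "S2", "S9"],
})
keys = ["route_id", "direction_id"]

# unique() returns the distinct shape_ids of each group as an array.
shape_lists = stops_per_shape.groupby(keys)["shape_id"].unique().reset_index(name="shape_ids")

# nunique() returns how many distinct shape_ids each group has.
shape_counts = stops_per_shape.groupby(keys)["shape_id"].nunique().reset_index(name="number_of_shape_ids")

print(shape_lists)
print(shape_counts)
```

Passing `name=` to `reset_index` names the count column directly, which would make a separate renaming step unnecessary.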


### 2.3 Count up the number of shapes for each `route_id` and `direction_id`

As an intermediate step, we rename the count column of `group_by_direction_id_and_shape_ids` so that the number of shapes for each `route_id` and `direction_id` combination has a descriptive name. The dataframe then has the following columns:
- `route_id`
- `direction_id`
- `route_code`
- `number_of_shape_ids`


In [6]:
group_by_direction_id_and_shape_ids.columns = ['route_id','direction_id','route_code','number_of_shape_ids']
group_by_direction_id_and_shape_ids

Unnamed: 0,route_id,direction_id,route_code,number_of_shape_ids
0,10-13168,0,10,14
1,10-13168,0,48,7
2,10-13168,1,10,9
3,10-13168,1,48,6
4,102-13168,0,102,4
...,...,...,...,...
247,96-13168,1,96,1
248,DSE-HG,0,,1
249,DSE-HG,1,,1
250,DSE-US,0,,1


### 2.4 Group by `route_id` and `direction_id`

We then reduce the counts to one row for each `route_id`, `direction_id`, and `route_code` combination.

This will give us a column called `number_of_shapes_per_direction`, which is the number of shapes for each of these combinations.

In [7]:
group_by_route_id_and_direction_id = group_by_direction_id_and_shape_ids.groupby(['route_id','direction_id','route_code'])['number_of_shape_ids'].unique().reset_index(name='shape_ids')
group_by_route_id_and_direction_id.rename(columns={'shape_ids': 'number_of_shapes_per_direction'}, inplace=True)
group_by_route_id_and_direction_id['number_of_shapes_per_direction'] = group_by_route_id_and_direction_id['number_of_shapes_per_direction'].apply(lambda x: x[0])

#### 2.4.1 Output to CSV as overview

This is meant to help keep track of the number of shapes for each `route_id`, `direction_id`, and `route_code` combination for ALL the data (not just the filtered rows that have more than 1 shape).

In [8]:
group_by_route_id_and_direction_id.to_csv('output/route_analysis_overview.csv', index=False)

### 2.5 Filter out the `route_id`s with only one shape

We then drop the `route_id` and `direction_id` combinations that have only one shape. The remaining rows, which have more than one shape, go into a dataframe called `route_ids_number_of_shapes_per_direction_greater_than_1`.

In [9]:
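# Build the mask from the same dataframe that is being filtered, so the result does not depend on two dataframes sharing an index
route_ids_number_of_shapes_per_direction_greater_than_1 = group_by_direction_id_and_shape_ids[group_by_direction_id_and_shape_ids['number_of_shape_ids'] > 1]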
route_ids_number_of_shapes_per_direction_greater_than_1

Unnamed: 0,route_id,direction_id,route_code,number_of_shape_ids
0,10-13168,0,10,14
1,10-13168,0,48,7
2,10-13168,1,10,9
3,10-13168,1,48,6
4,102-13168,0,102,4
...,...,...,...,...
237,901-13168,1,901,2
242,92-13168,0,92,6
243,92-13168,1,92,7
244,94-13168,0,94,2
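
The filter is a boolean mask over the count column. The sketch below uses three made-up rows to show the pattern; the intermediate name `has_multiple_shapes` is hypothetical.

```python
import pandas as pd

# Made-up counts: two combinations with several shapes, one with a single shape.
shape_counts = pd.DataFrame({
    "route_id": ["10", "10", "96"],
    "direction_id": [0, 1, 1],
    "number_of_shape_ids": [14, 9, 1],
})

# True for rows to keep, False for rows to drop.
has_multiple_shapes = shape_counts["number_of_shape_ids"] > 1
multi_shape_routes = shape_counts[has_multiple_shapes]
print(multi_shape_routes)
```

The filtered dataframe keeps the index labels of the rows that survive, which is why the index in the output above has gaps.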


## 3.0 Merging the joined data with the number of shapes data


### 3.1 Get a list of all the `stop_id`s for each `route_id`

We need to get a list of all the `stop_id`s for each `route_id` to get the differences between each `route_id` and `stop_id` combination.

In [10]:
get_stop_ids_for_each_route_ids_number_of_shapes_per_direction_greater_than_1 = pd.merge(route_ids_number_of_shapes_per_direction_greater_than_1, simplified_trips_join_stop_times, on=['route_id','direction_id','route_code'], how='inner')
group_stop_ids_for_each_route_ids_number_of_shapes_per_direction_greater_than_1 = get_stop_ids_for_each_route_ids_number_of_shapes_per_direction_greater_than_1.groupby(['route_id','direction_id','route_code','shape_id'])['stop_id'].unique().reset_index(name='stop_ids')

### 3.2 Get the `stop_ids` that differ between shapes

For each `route_id`, `direction_id`, and `route_code` combination, we take the `stop_id`s that are served by the first shape but not by the second shape. Only the first two shapes of each combination are compared. This will give us a dataframe called `route_id_with_stop_id_differences` with the following columns:
- `route_id`
- `direction_id`
- `route_code`
- `stop_ids`



In [11]:
route_id_with_stop_id_differences = group_stop_ids_for_each_route_ids_number_of_shapes_per_direction_greater_than_1.groupby(['route_id','direction_id','route_code'])['stop_ids'].apply(lambda x: x.iloc[0] if len(x) == 1 else list(set(x.iloc[0]) - set(x.iloc[1]))).reset_index(name='stop_ids')
route_id_with_stop_id_differences

Unnamed: 0,route_id,direction_id,route_code,stop_ids
0,10-13168,0,10,[]
1,10-13168,0,48,[]
2,10-13168,1,10,[3203]
3,10-13168,1,48,[]
4,102-13168,0,102,[]
...,...,...,...,...
161,901-13168,1,901,"[15568, 15601, 15444, 15607, 15608]"
162,92-13168,0,92,"[3328, 3585, 3586, 3584, 1544, 10761, 10766, 1..."
163,92-13168,1,92,[]
164,94-13168,0,94,[]
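
The cell above compares only the first two shapes of each group, so a stop that appears only in a third shape would not be reported. One possible alternative, shown here as a sketch on made-up data and not as the notebook's method, is to report every stop that is served by some but not all shapes of a group. The helper name `stops_not_shared_by_all` is hypothetical.

```python
import pandas as pd

# Made-up data: route 10 has three shapes that differ, route 20 has two identical shapes.
stops_by_shape = pd.DataFrame({
    "route_id": ["10", "10", "10", "20", "20"],
    "direction_id": [1, 1, 1, 0, 0],
    "shape_id": ["S1", "S2", "S3", "S4", "S5"],
    "stop_ids": [[1, 2, 3], [1, 2], [1, 2, 4], [7, 8], [7, 8]],
})


def stops_not_shared_by_all(stop_lists):
    """Return the stops that appear in at least one list but not in every list."""
    sets = [set(stops) for stops in stop_lists]
    in_any = set.union(*sets)
    in_all = set.intersection(*sets)
    return sorted(in_any - in_all)


differences = (
    stops_by_shape.groupby(["route_id", "direction_id"])["stop_ids"]
    .apply(stops_not_shared_by_all)
    .reset_index(name="stop_ids")
)
print(differences)
```

For route 10, a comparison of the first two shapes alone would find only stop 3; the union-minus-intersection approach also finds stop 4 from the third shape.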


### 3.3 Merge the differing `stop_ids` with the joined data

In [12]:
combined_route_id_with_stop_differences_to_joined_data = pd.merge(group_route_id_by_distinct_direction_id_shape_ids, route_id_with_stop_id_differences, on=['route_id','direction_id','route_code'], how='inner')
combined_route_id_with_stop_differences_to_joined_data['shape_id_counts'] = combined_route_id_with_stop_differences_to_joined_data['shape_ids'].apply(lambda x: len(x))
combined_route_id_with_stop_differences_to_joined_data['shape_ids_list'] = combined_route_id_with_stop_differences_to_joined_data['shape_ids'].apply(lambda x: ','.join(x))
combined_route_id_with_stop_differences_to_joined_data.drop(columns=['shape_ids'], inplace=True)
combined_route_id_with_stop_differences_to_joined_data['count_of_different_stop_ids'] = combined_route_id_with_stop_differences_to_joined_data['stop_ids'].apply(lambda x: len(x))

# Reorder and rename columns
# rename() without inplace returns a new dataframe, which avoids modifying a slice of the original
cleaned_up_combined_data = combined_route_id_with_stop_differences_to_joined_data[['route_id','direction_id','route_code','shape_id_counts','shape_ids_list','count_of_different_stop_ids','stop_ids']].rename(columns={'stop_ids': 'different_stop_ids'})
cleaned_up_combined_data


Unnamed: 0,route_id,direction_id,route_code,shape_id_counts,shape_ids_list,count_of_different_stop_ids,different_stop_ids
0,10-13168,0,10,14,"100750_JUNE23,100753_JUNE23,100767_JUNE23,1007...",0,[]
1,10-13168,0,48,7,"100750_JUNE23,100753_JUNE23,100813_JUNE23,1008...",0,[]
2,10-13168,1,10,9,"100751_JUNE23,100756_JUNE23,100794_JUNE23,1007...",1,[3203]
3,10-13168,1,48,6,"100751_JUNE23,100756_JUNE23,100796_JUNE23,1008...",0,[]
4,102-13168,0,102,4,"1020037_JUNE23,1020038_JUNE23,1020067_JUNE23,1...",0,[]
...,...,...,...,...,...,...,...
161,901-13168,1,901,2,"9010054_JUNE23,9010057_JUNE23",5,"[15568, 15601, 15444, 15607, 15608]"
162,92-13168,0,92,6,"920224_JUNE23,920274_JUNE23,920275_JUNE23,9202...",50,"[3328, 3585, 3586, 3584, 1544, 10761, 10766, 1..."
163,92-13168,1,92,7,"920225_JUNE23,920299_JUNE23,920300_JUNE23,9203...",0,[]
164,94-13168,0,94,2,"940258_JUNE23,940259_JUNE23",0,[]


### 3.4 Add stop_id's stop_name

In [13]:
cleaned_up_combined_data['stop_names_list'] = cleaned_up_combined_data['different_stop_ids'].apply(lambda x: ','.join(gtfs_stops[gtfs_stops['stop_id'].isin(x)]['stop_name'].tolist())).apply(lambda x: x.replace(',', ', '))
cleaned_up_combined_data

Unnamed: 0,route_id,direction_id,route_code,shape_id_counts,shape_ids_list,count_of_different_stop_ids,different_stop_ids,stop_names_list
0,10-13168,0,10,14,"100750_JUNE23,100753_JUNE23,100767_JUNE23,1007...",0,[],
1,10-13168,0,48,7,"100750_JUNE23,100753_JUNE23,100813_JUNE23,1008...",0,[],
2,10-13168,1,10,9,"100751_JUNE23,100756_JUNE23,100794_JUNE23,1007...",1,[3203],Arden Layover
3,10-13168,1,48,6,"100751_JUNE23,100756_JUNE23,100796_JUNE23,1008...",0,[],
4,102-13168,0,102,4,"1020037_JUNE23,1020038_JUNE23,1020067_JUNE23,1...",0,[],
...,...,...,...,...,...,...,...,...
161,901-13168,1,901,2,"9010054_JUNE23,9010057_JUNE23",5,"[15568, 15601, 15444, 15607, 15608]","Canoga Station, Chatsworth Station, Sherman Wa..."
162,92-13168,0,92,6,"920224_JUNE23,920274_JUNE23,920275_JUNE23,9202...",50,"[3328, 3585, 3586, 3584, 1544, 10761, 10766, 1...","Magnolia / San Fernando, Glenoaks / Gain, Glen..."
163,92-13168,1,92,7,"920225_JUNE23,920299_JUNE23,920300_JUNE23,9203...",0,[],
164,94-13168,0,94,2,"940258_JUNE23,940259_JUNE23",0,[],
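
Filtering `gtfs_stops` with `isin` returns the names in the row order of `stops.txt`, not in the order of the `stop_id` list. If the names should line up with the IDs, a dictionary lookup is one option. The sketch below uses made-up stops, and the helper `names_for` is hypothetical.

```python
import pandas as pd

# Made-up stops standing in for stops.txt.
stops = pd.DataFrame({
    "stop_id": [1, 2, 3],
    "stop_name": ["Alpha / 1st", "Bravo / 2nd", "Charlie / 3rd"],
})

# Build the lookup once instead of filtering the dataframe for every row.
stop_name_by_id = stops.set_index("stop_id")["stop_name"].to_dict()


def names_for(stop_ids):
    """Join the stop names in the same order as stop_ids, skipping unknown IDs."""
    return ", ".join(stop_name_by_id[s] for s in stop_ids if s in stop_name_by_id)


print(names_for([3, 1]))
```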


### 3.5 Output the data to a csv file

We then output the data to a csv file called `route_analysis.csv`.

In [14]:
cleaned_up_combined_data.to_csv('output/route_analysis.csv', index=False)

## 4.0 Detailed Route Analysis

### 4.1 Route Analysis by `shape_id`

We then build a dataframe with one row per `shape_id` that lists the stops, the stop names, and the number of stops for each shape.

In [15]:
get_list_differing_stop_ids_per_route_id_with_distinct_shape_ids = simplified_trips_join_stop_times.groupby(['route_id','route_code','direction_id','shape_id'])['stop_id'].unique().reset_index(name='stop_ids')
get_list_differing_stop_ids_per_route_id_with_distinct_shape_ids['stop_ids_list'] = get_list_differing_stop_ids_per_route_id_with_distinct_shape_ids['stop_ids'].apply(lambda x: ','.join(map(str, x)))
# Join the stop names, then add a space after each comma (applied once, so the separator stays ', ')
get_list_differing_stop_ids_per_route_id_with_distinct_shape_ids['stop_names_list'] = get_list_differing_stop_ids_per_route_id_with_distinct_shape_ids['stop_ids'].apply(lambda x: ','.join(gtfs_stops[gtfs_stops['stop_id'].isin(x)]['stop_name'].tolist())).apply(lambda x: x.replace(',', ', '))
get_list_differing_stop_ids_per_route_id_with_distinct_shape_ids['number_of_stops_per_shape'] = get_list_differing_stop_ids_per_route_id_with_distinct_shape_ids['stop_ids'].apply(lambda x: len(x))
get_list_differing_stop_ids_per_route_id_with_distinct_shape_ids.drop(columns=['stop_ids'], inplace=True)
routes_and_stop_ids = get_list_differing_stop_ids_per_route_id_with_distinct_shape_ids
routes_and_stop_ids

Unnamed: 0,route_id,route_code,direction_id,shape_id,stop_ids_list,stop_names_list,number_of_stops_per_shape
0,10-13168,10,0,100750_JUNE23,"3202,3217,3232,3227,3231,4716,3212,3221,11689,...","Clinton / Hoover, Hoover / Plata, Melrose / ...",31
1,10-13168,10,0,100753_JUNE23,"3203,3202,3217,3232,3227,3231,4716,3212,3221,1...","Clinton / Hoover, Hoover / Plata, Melrose / ...",32
2,10-13168,10,0,100767_JUNE23,"17013,3222,3226,3210,3220,3225,3215,3213,3214,...","Clinton / Hoover, Hill / 7th, Hoover / Plata...",56
3,10-13168,10,0,100769_JUNE23,"3203,3202,3217,3232,3227,3231,4716,3212,3221,1...","Clinton / Hoover, Hill / 7th, Hoover / Plata...",48
4,10-13168,10,0,100771_JUNE23,"3202,3217,3232,3227,3231,4716,3212,3221,11689,...","Clinton / Hoover, Hill / 7th, Hoover / Plata...",47
...,...,...,...,...,...,...,...
706,DSE-HG,,0,DSE-HG-DS,300052321108552320232263500003,"Harbor Transitway / Manchester, Harbor Transi...",6
707,DSE-HG,,1,DSE-HG-HG,63500003109941085323241084630005,"Harbor Transitway / Harbor Fwy Station, Harbo...",6
708,DSE-US,,0,DSE-US-DS,215592216350000163500004,"Cesar E Chavez / Broadway, Dodger Stadium Exp...",4
709,DSE-US,,1,DSE-US-US-CF,635000011420432155,"Alameda / College, Dodger Stadium Express - U...",3


### 4.2 Output the data to a csv file

We then output the data to a csv file called `route_analysis_per_shape_id.csv` that has one row per `shape_id`.

In [16]:
routes_and_stop_ids.to_csv('output/route_analysis_per_shape_id.csv', index=False, sep=',')

## 5.0 `drop_off_type` and `pickup_type` analysis

Based on the following table:

| rider usage | rider usage code | `drop_off_type` | `pickup_type` | meaning |
| - | - | - | - | - |
| yes | 1 | 0 or empty | 0 or empty | riders can be picked up and dropped off at this stop |
| yes, but indicate pickup only | 2 | 1 | 0 or empty | riders can be picked up but NOT dropped off at this stop |
| yes, but indicate drop-off only | 3 | 0 or empty | 1 | riders can be dropped off but NOT picked up at this stop |
| no | 0 | 1 | 1 | riders can NOT be picked up or dropped off at this stop |

We want to analyze which stops on which trips fall into each category.
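
The table can be written as a small lookup keyed by the two columns. This is a sketch under stated assumptions: the names `USAGE_CODES`, `normalize`, and `rider_usage_code` are hypothetical, the codes are integers here, empty values are treated as 0, and any other combination gets -1.

```python
import pandas as pd

# Keys are (pickup_type, drop_off_type); values are the rider usage codes from the table.
USAGE_CODES = {
    (0, 0): 1,  # pickup and drop-off
    (0, 1): 2,  # pickup only
    (1, 0): 3,  # drop-off only
    (1, 1): 0,  # no rider usage
}


def normalize(value):
    """Treat an empty (missing) value as 0."""
    return 0 if pd.isna(value) else int(value)


def rider_usage_code(pickup_type, drop_off_type):
    """Look up the usage code; combinations not in the table map to -1."""
    key = (normalize(pickup_type), normalize(drop_off_type))
    return USAGE_CODES.get(key, -1)


# Made-up stop events, including one with an empty pickup_type.
stop_events = pd.DataFrame({
    "pickup_type": [0, 0, 1, 1, None],
    "drop_off_type": [0, 1, 0, 1, 0],
})
stop_events["rider_usage_code"] = [
    rider_usage_code(p, d)
    for p, d in zip(stop_events["pickup_type"], stop_events["drop_off_type"])
]
print(stop_events)
```

Keying the lookup by a tuple keeps the order of the two columns explicit, which is easy to get wrong when the values are concatenated into a string.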



In [43]:
join_trips_to_stop_times = pd.merge(gtfs_trips, gtfs_stop_times, on='trip_id', how='inner')
simplified_trips_join_stop_times = join_trips_to_stop_times[['trip_id','route_id','direction_id','shape_id','stop_id', 'stop_sequence','route_code', 'pickup_type', 'drop_off_type']]
simplified_trips_join_stop_times_join_to_stops = pd.merge(simplified_trips_join_stop_times, gtfs_stops, on='stop_id', how='inner')
simplified_trips_join_stop_times_join_to_stops = simplified_trips_join_stop_times_join_to_stops[['trip_id','route_id','direction_id','shape_id','stop_id', 'stop_sequence','route_code', 'pickup_type', 'drop_off_type', 'stop_name']]
simplified_trips_join_stop_times_join_to_stops

Unnamed: 0,trip_id,route_id,direction_id,shape_id,stop_id,stop_sequence,route_code,pickup_type,drop_off_type,stop_name
0,10169001480606-JUNE23,169-13168,0,1690148_JUNE23,5505,1,169,0,0,Rocketdyne Layover
1,10169001431438-JUNE23,169-13168,1,1690143_JUNE23,5505,134,169,0,0,Rocketdyne Layover
2,10169001430828-JUNE23,169-13168,1,1690143_JUNE23,5505,134,169,0,0,Rocketdyne Layover
3,10169001431032-JUNE23,169-13168,1,1690143_JUNE23,5505,134,169,0,0,Rocketdyne Layover
4,10169001431237-JUNE23,169-13168,1,1690143_JUNE23,5505,134,169,0,0,Rocketdyne Layover
...,...,...,...,...,...,...,...,...,...,...
2131475,10699000012433-06-JUNE23,699-13168,1,6990001_JUNE23,2319,2,699,0,0,Hawthorne / Lennox Station
2131476,10699000012436-06-JUNE23,699-13168,1,6990001_JUNE23,2319,2,699,0,0,Hawthorne / Lennox Station
2131477,10699000012439-06-JUNE23,699-13168,1,6990001_JUNE23,2319,2,699,0,0,Hawthorne / Lennox Station
2131478,10699000012442-06-JUNE23,699-13168,1,6990001_JUNE23,2319,2,699,0,0,Hawthorne / Lennox Station


In [45]:
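# Concatenate as drop_off_type + pickup_type (the column order of the table above) so that '10' means pickup only and '01' means drop-off only; empty values count as 0
simplified_trips_join_stop_times_join_to_stops['rider_usage_code_before_coding'] = simplified_trips_join_stop_times_join_to_stops['drop_off_type'].fillna(0).astype(int).astype(str) + simplified_trips_join_stop_times_join_to_stops['pickup_type'].fillna(0).astype(int).astype(str)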
simplified_trips_join_stop_times_join_to_stops['rider_usage_code'] = simplified_trips_join_stop_times_join_to_stops['rider_usage_code_before_coding'].apply(lambda x: '1' if x == '00' else '2' if x == '10' else '3' if x == '01' else '0' if x == '11' else '-1')
simplified_trips_join_stop_times_join_to_stops.drop(columns=['rider_usage_code_before_coding'], inplace=True)
simplified_trips_join_stop_times_no_rider_usage_code_equals_1 = simplified_trips_join_stop_times_join_to_stops[simplified_trips_join_stop_times_join_to_stops['rider_usage_code'] != '1']
simplified_trips_join_stop_times_no_rider_usage_code_equals_1.to_csv('output/rider_usage_code_not_1.csv', index=False, sep=',')

In [23]:
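# routes_and_stop_ids has one row per shape and no stop_id column, so it is joined to the per-stop table on the keys both tables share
stop_level_columns = simplified_trips_join_stop_times_join_to_stops[['route_id','direction_id','shape_id','stop_id','stop_name','pickup_type','drop_off_type']].drop_duplicates()
routes_and_stop_ids_to_add_drop_off_type_and_pick_up_type = pd.merge(routes_and_stop_ids, stop_level_columns, on=['route_id','direction_id','shape_id'], how='inner')
routes_and_stop_ids_to_add_drop_off_type_and_pick_up_type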
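
`pd.merge` needs every column named in `on` to exist in both dataframes and raises a `KeyError` otherwise. The per-shape table built in section 4.1 has no `stop_id` column, since its stops are stored as a joined string. A small check before merging can make this visible; the helper `missing_merge_keys` and the sample frames below are hypothetical.

```python
import pandas as pd


def missing_merge_keys(left, right, keys):
    """Report which of the requested merge keys are absent from each dataframe."""
    return {
        "left": [key for key in keys if key not in left.columns],
        "right": [key for key in keys if key not in right.columns],
    }


# Made-up frames: the left one is per shape, the right one is per stop.
per_shape = pd.DataFrame({"route_code": ["10"], "shape_id": ["S1"], "stop_ids_list": ["1,2"]})
per_stop = pd.DataFrame({"route_code": ["10"], "stop_id": [1], "pickup_type": [0]})

report = missing_merge_keys(per_shape, per_stop, ["route_code", "stop_id"])
print(report)
```

An empty list on both sides means the merge keys are safe to use.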