# CTA "L" accessibility analysis


## Datasets


1.  [List of 'L' stops](https://data.cityofchicago.org/Transportation/CTA-System-Information-List-of-L-Stops/8pix-ypme/about_data)
2.  [CTA - Ridership - 'L' Station Entries - Monthly Day-Type Averages & Totals](https://data.cityofchicago.org/Transportation/CTA-Ridership-L-Station-Entries-Monthly-Day-Type-A/t2rn-p8d7/about_data)


### Findings up front

1.  The columns that tie the datasets together are "MAP_ID" in the list of "L" stops and "station_id" in the ridership data.
2.  There are no stations with both accessible and non-accessible stops.

In [531]:
import math
import itertools
from pathlib import Path
import numpy as np
import pandas as pd
from matplotlib import pyplot

### List of 'L' stops


Source: [List of 'L' stops](https://data.cityofchicago.org/Transportation/CTA-System-Information-List-of-L-Stops/8pix-ypme/about_data)


#### Overview

This dataset has one row per stop, not per station.  A stop records details such as direction of travel, so each station has at least two stops: for example, one for northbound trains and one for southbound trains.  Stop IDs and station IDs are not directly related.  Instead, the "MAP_ID" column identifies each stop's parent station.

ADA accessibility information is stored in a boolean column named "ADA".  Because there is a row for each stop and multiple stops per station, the structure of the data technically allows for stations that are only partially accessible.  If such mixed-accessibility stations actually existed in the data, choices would need to be made about how to handle their ridership data.  Thankfully, they do not occur in the data.
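One way to check for mixed-accessibility stations is to count the distinct "ADA" values among the stops of each "MAP_ID".  The sketch below is illustrative only: it uses a small invented table rather than the real CSV, and it deliberately includes one mixed station (ID 3) so that the check has something to find.

```python
import pandas as pd

# Toy stand-in for the stop list; these rows are invented for illustration.
stops = pd.DataFrame({
    "MAP_ID": [1, 1, 2, 2, 3, 3],
    "ADA": [True, True, False, False, True, False],
})

# Count the distinct ADA values among each station's stops.
ada_values_per_station = stops.groupby("MAP_ID")["ADA"].nunique()

# A station is "mixed" when its stops disagree about accessibility.
mixed_station_ids = ada_values_per_station[ada_values_per_station > 1].index.tolist()
print(mixed_station_ids)
```

An empty list from the same check on the real stop list would be consistent with the finding stated above.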

The various name columns, especially "STATION_NAME", are not unique.  There are multiple stations with the same name, even on the same line.  The Blue Line has two stations named Western and two named Harlem, for example.
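A quick way to see which names are ambiguous is to count how many distinct "MAP_ID" values share each "STATION_NAME".  This is a minimal sketch on invented rows; the IDs 101 to 103 are placeholders, not real CTA identifiers.

```python
import pandas as pd

# Toy stop list: two different stations share the name "Western".
# The MAP_ID values here are made up for illustration.
stops = pd.DataFrame({
    "STATION_NAME": ["Western", "Western", "Western", "Western", "Racine", "Racine"],
    "MAP_ID": [101, 101, 102, 102, 103, 103],
})

# Number of distinct stations behind each name.
ids_per_name = stops.groupby("STATION_NAME")["MAP_ID"].nunique()

# Names that refer to more than one station cannot serve as a key.
ambiguous_names = ids_per_name[ids_per_name > 1].index.tolist()
print(ambiguous_names)
```

Any name that shows up in this list needs the station ID to disambiguate it.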

There are also boolean columns indicating which lines a stop serves, and a location column with physical location data.  Since these fields are not needed to uniquely identify a station, we don't need them for our analysis.

In [532]:
STATION_DATA_PATH = Path(
    "data/CTA_-_System_Information_-_List_of__L__Stops_20240921.csv")

In [533]:
stations_df = pd.read_csv(
    STATION_DATA_PATH,
    usecols=["MAP_ID", "STATION_NAME", "STATION_DESCRIPTIVE_NAME", "ADA"],
    # Only the columns listed in usecols are loaded, so only they need dtypes.
    dtype={
        "STATION_NAME": pd.StringDtype(),
        "STATION_DESCRIPTIVE_NAME": pd.StringDtype(),
    },
)

In [534]:
stations_df.head()

|  | STATION_NAME | STATION_DESCRIPTIVE_NAME | MAP_ID | ADA |
|---|---|---|---|---|
| 0 | Cicero | Cicero (Pink Line) | 40420 | True |
| 1 | Central Park | Central Park (Pink Line) | 40780 | True |
| 2 | Halsted | Halsted (Green Line) | 40940 | True |
| 3 | Cumberland | Cumberland (Blue Line) | 40230 | True |
| 4 | Racine | Racine (Blue Line) | 40470 | False |


### Ridership data

Source: [CTA - Ridership - 'L' Station Entries - Monthly Day-Type Averages & Totals](https://data.cityofchicago.org/Transportation/CTA-Ridership-L-Station-Entries-Monthly-Day-Type-A/t2rn-p8d7/about_data)

#### Overview

This dataset is about stations.  Specifically, it tracks turnstile counts.  These numbers tell us nothing about which direction a passenger intends to travel, and therefore nothing about the stop that they intend to use.  This means that there is a mismatch in the level of detail between the dataset describing how many people ride and the dataset describing where they can go.

The "station_id" column is the shared key between the datasets.  As in the list of stops, the "stationame" column is not unique.
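Because ridership is recorded per station while accessibility is recorded per stop, one plausible approach is to collapse the stop list to one row per station before joining.  The sketch below uses invented numbers and assumes that no station mixes "ADA" values, which is what makes `drop_duplicates` safe here.  The names `stations`, `merged`, and `entries_by_ada` are illustrative and are not used elsewhere in this notebook.

```python
import pandas as pd

# Toy data; all values are invented for illustration.
stops = pd.DataFrame({
    "MAP_ID": [1, 1, 2, 2],
    "ADA": [True, True, False, False],
})
ridership = pd.DataFrame({
    "station_id": [1, 2, 1, 2],
    "monthtotal": [100, 50, 120, 60],
})

# One row per station.  Assumes every stop of a station has the same ADA value.
stations = stops[["MAP_ID", "ADA"]].drop_duplicates()

# validate="many_to_one" raises if a station ID appears twice on the right,
# which would happen if a station had mixed ADA values.
merged = ridership.merge(
    stations,
    left_on="station_id",
    right_on="MAP_ID",
    how="left",
    validate="many_to_one",
)

# Total entries split by accessibility.
entries_by_ada = merged.groupby("ADA")["monthtotal"].sum()
print(entries_by_ada)
```

The `validate` argument acts as a guard: if the no-mixed-stations assumption ever stopped holding, the merge would fail loudly instead of silently double-counting ridership.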

In [535]:

RIDERSHIP_DATA_PATH = Path(
    "data/CTA_-_Ridership_-__L__Station_Entries_-_Monthly_Day-Type_Averages___Totals_20240921.csv")

In [536]:
ridership_df = pd.read_csv(
    RIDERSHIP_DATA_PATH,
    usecols=["station_id", "stationame", "month_beginning", "monthtotal"],
    dtype={
        # The column really is spelled "stationame" in the source data.
        "stationame": pd.StringDtype(),
    },
    parse_dates=["month_beginning"],
)

In [537]:
ridership_df.head()

|  | station_id | stationame | month_beginning | monthtotal |
|---|---|---|---|---|
| 0 | 40900 | Howard | 2001-01-01 | 164447 |
| 1 | 41190 | Jarvis | 2001-01-01 | 40567 |
| 2 | 40100 | Morse | 2001-01-01 | 119772 |
| 3 | 41300 | Loyola | 2001-01-01 | 125008 |
| 4 | 40760 | Granville | 2001-01-01 | 84189 |


### Demonstrating the shared key between datasets

The datasets are not described in great detail, so some of the relationships must be demonstrated through examination.  There's no obvious reason why a column named "MAP_ID" would identify a station, for example.  Let us check some examples.

First we need to find some station identifiers.  I know that there are some stops that contain the word "Lake" so let's get those.

In [538]:
stations_df[stations_df["STATION_DESCRIPTIVE_NAME"].str.contains("Lake")]

|  | STATION_NAME | STATION_DESCRIPTIVE_NAME | MAP_ID | ADA |
|---|---|---|---|---|
| 7 | Clark/Lake | Clark/Lake (Blue, Brown, Green, Orange, Purple... | 40380 | True |
| 36 | Lake | Lake (Red Line) | 41660 | True |
| 56 | Harlem/Lake | Harlem/Lake (Green Line) | 40020 | True |
| 62 | Clark/Lake | Clark/Lake (Blue, Brown, Green, Orange, Purple... | 40380 | True |
| 103 | Lake | Lake (Red Line) | 41660 | True |
| 180 | Clark/Lake | Clark/Lake (Blue, Brown, Green, Orange, Purple... | 40380 | True |
| 253 | State/Lake | State/Lake (Brown, Green, Orange, Pink & Purpl... | 40260 | False |
| 261 | Harlem/Lake | Harlem/Lake (Green Line) | 40020 | True |
| 267 | State/Lake | State/Lake (Brown, Green, Orange, Pink & Purpl... | 40260 | False |
| 290 | Clark/Lake | Clark/Lake (Blue, Brown, Green, Orange, Purple... | 40380 | True |


Let's try searching the ridership data for a few of the station identifiers we just found and see if they match.

In [539]:
ridership_df[ridership_df["station_id"] == 40380].value_counts("stationame")

stationame
Clark/Lake    281
Name: count, dtype: int64

In [540]:
ridership_df[ridership_df["station_id"] == 41660].value_counts("stationame")

stationame
Lake/State    281
Name: count, dtype: int64

In [541]:
ridership_df[ridership_df["station_id"] == 40020].value_counts("stationame")

stationame
Harlem-Lake    281
Name: count, dtype: int64

As the examples above show, the short station names match, at least to a human reader, when the rows are linked by the ID columns that we found.

The datasets are clearly not carefully groomed: even within the ridership dataset, intersection names are sometimes separated by a slash (Lake/State) and sometimes by a dash (Harlem-Lake).  But we already knew this, because the key columns "MAP_ID" and "station_id" are not even named consistently.  That's just the nature of real-world datasets.

### Finding orphans

Ideally the mapping between stations and ridership data would be complete.  Every station would have ridership numbers and all the ridership info would be linked to a station we knew about. Unfortunately that is not the case.  The code below finds any station identifiers that only occur in one dataset or the other.  There are five of them.  They are explained below.

The ridership data with no associated station information is for stations that have been closed.  The station with no ridership information is too new to appear in the data.  A major source of this information was the following link: https://www.chicago-l.org/stations/index.html



1.  Madison/Wabash was closed in 2015.
2.  Washington/State has been temporarily closed since 2006.
3.  Randolph/Wabash was closed in 2017.
4.  Homan has been closed since 1994 and was permanently removed from the line in 1996.
5.  The Damen Green Line station opened in August 2024, so ridership data was not yet available in September 2024.

Stations with no ridership data:

1.  41710: Damen (Green Line)

Ridership data with no station information:

1.  40500: Washington/State
2.  40640: Madison/Wabash
3.  40200: Randolph/Wabash
4.  41580: Homan

In [542]:
def find_orphan_station_ids(stations=stations_df, ridership=ridership_df):
    """Return the station IDs that appear in only one of the two datasets."""
    map_id_set = set(stations["MAP_ID"])
    station_id_set = set(ridership["station_id"])
    # IDs present in exactly one of the two columns.
    orphan_ids = map_id_set.symmetric_difference(station_id_set)
    print(f"There are {len(map_id_set)} unique entries in the MAP_ID column")
    print(f"There are {len(station_id_set)} unique entries in the station_id column")
    print(f"There are {len(orphan_ids)} entries that only appear in one column or the other")
    print(f"They are: {orphan_ids}")
    return orphan_ids

In [543]:
orphans = find_orphan_station_ids()

There are 144 unique entries in the MAP_ID column
There are 147 unique entries in the station_id column
There are 5 entries that only appear in one column or the other
They are: {40500, 41580, 40640, 41710, 40200}


In [544]:
orphans_in_stop_list = stations_df[stations_df['MAP_ID'].isin(orphans)]
print(orphans_in_stop_list)

    STATION_NAME STATION_DESCRIPTIVE_NAME  MAP_ID   ADA
300        Damen       Damen (Green Line)   41710  True
301        Damen       Damen (Green Line)   41710  True


In [545]:
orphans_in_ridership_data = ridership_df.loc[
    ridership_df["station_id"].isin(orphans), ["station_id", "stationame"]
].drop_duplicates()
print(orphans_in_ridership_data)

     station_id        stationame
20        40500  Washington/State
138       40640    Madison/Wabash
139       40200   Randolph/Wabash
945       41580             Homan
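As an alternative to set arithmetic, an outer merge with `indicator=True` reports which side each orphan comes from in a single step.  This is a sketch on invented IDs, not a replacement for the function above; the frame and variable names are illustrative.

```python
import pandas as pd

# Toy ID lists; the values are invented for illustration.
station_ids = pd.DataFrame({"MAP_ID": [1, 2, 3]})
ridership_ids = pd.DataFrame({"station_id": [2, 3, 4]})

# indicator=True adds a "_merge" column with the values
# "left_only", "right_only", or "both" for every row.
outer = station_ids.merge(
    ridership_ids,
    left_on="MAP_ID",
    right_on="station_id",
    how="outer",
    indicator=True,
)

# The key columns become floats because unmatched rows hold NaN on one side,
# so cast back to int after filtering out the missing values.
only_in_stations = (
    outer.loc[outer["_merge"] == "left_only", "MAP_ID"].astype(int).tolist()
)
only_in_ridership = (
    outer.loc[outer["_merge"] == "right_only", "station_id"].astype(int).tolist()
)
print(only_in_stations, only_in_ridership)
```

With the real data, the ridership side would first need to be reduced to its unique "station_id" values, since it holds one row per station per month.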
