# COGS 108 - Data Checkpoint

## Authors

This is a modified [CRediT taxonomy of contributions](https://credit.niso.org). For each group member please list how they contributed to this project using these terms:
> Analysis, Background research, Conceptualization, Data curation, Experimental investigation, Methodology, Project administration, Software, Visualization, Writing – original draft, Writing – review & editing

Team list and credits:
- Assilem Martinez: Data curation, Visualization, Experimental investigation.
- Mia Yanez: Writing, Background research, Analysis
- Brandon Khuu: Project administration, Software, Writing, Data curation 
- Ivy Tucker: Experimental investigation, Analysis, Background research, Writing
- Karina Goyal: Visualization, Background research, Writing - review and editing

## Research Question

How do changes in average U.S. gasoline prices relate to changes in monthly public transit ridership over time?


## Background and Prior Work

The United States has long been criticized for weak public transit infrastructure and for urban planning centered on private vehicle travel. Proponents of U.S. public transit have argued that investment in automobile infrastructure in the 20th century pushed the United States away from effective transit systems.[1] As a consequence of this early investment, U.S. public transit systems lag significantly behind those of other industrialized nations. Dependence on private vehicles for urban transportation suggests that U.S. consumers respond relatively inelastically to changes in gas prices. In light of this, our analysis asks whether gas prices can be used as a predictor of public transit ridership in the United States and, if so, to what extent ridership in each state can be predicted from year-over-year changes in gas prices.

The relationship between gas prices and public transit ridership is well documented, with many studies over the past two decades reporting a positive association between the two variables worldwide. Examples include the studies by Chao et al.[2] and Alves and Costa[3], which examined ridership responses to gas prices in Taiwan and in the European Union (EU), respectively. Chao et al. find that gasoline prices have a “significantly positive” relationship with bus and transit use in Taiwan, and that the response is asymmetric: use rises more when gasoline prices increase than it falls when prices decrease. These findings are not unexpected and build on existing literature; from a 1978 study by Agthe & Billings to a 2014 study by Fujisaki, many other studies have reported a similar relationship. In a more recent study, Alves and Costa examine the change in public transportation use in the EU after the 2022 Russian invasion of Ukraine, which led to spikes in oil prices as a result of trade restrictions that emerged from the conflict. Their findings reveal a strong week-by-week correlation between price and ridership, particularly in EU countries where motorization was already prevalent. Together, these studies point to a consistent positive correlation between fuel prices and public transit ridership internationally.

Similar studies have been conducted in the United States as well. In a 2009 study, Blanchard analyzed how much public transit demand increased in response to rising gas prices at the time.[4] The study finds significant variability across cities: results vary somewhat with population, but more significantly with gas prices as they rose. Blanchard’s review of the existing literature was inconclusive, with “the size and significance of gasoline prices’ effect on ridership” being indeterminate. While this is somewhat consistent with the international correlation between gasoline prices and public transit ridership, it suggests that ridership changes may be strongly multivariate and difficult to predict from gas prices alone. This perspective is mirrored in a 2010 study by Lane, who points out that contemporary research had looked at “the role of gas prices as one of many characteristics in predicting ridership.”[5] Changes to the energy and transportation sectors over time may also mean that Blanchard’s work is not entirely representative of modern-day transportation preferences and developments.

As gas prices rise, it’s important to understand how demand for transportation will change in response. This data analysis aims to build on the existing literature to determine the extent to which public transportation demand correlates with changes to gas prices. The applications of the findings of this analysis are valuable for policy and urban planning, as managing transit load is vital for city operations in urban landscapes.

1. ^ (14 Jan 2026) Transit in the U.S. Today. Transportation for America. https://t4america.org/resource/world-class-transit/state-of-u-s-transit/
2. ^ Chao, M., Huang, W., Jou, R. (Dec 2015) The asymmetric effects of gasoline prices on public transportation use in Taiwan. Transportation Research Part D: Transport and Environment. https://doi.org/10.1016/j.trd.2015.09.021
3. ^ Alves, A., Costa, N. (15 Oct 2024) Changes in Fuel Prices and the Use of Public Transport: Insights from the European Union following the Invasion of Ukraine. European Journal of Geography. https://doi.org/10.48088/ejg.a.alv.15.4.232.243
4. ^ Blanchard, C. (15 Apr 2009) The Impact of Rising Gasoline Prices on US Public Transit Ridership. Duke University. https://dukespace.lib.duke.edu/server/api/core/bitstreams/7064748d-6dd5-4100-a354-e60597c00e1a/content
5. ^ Lane, B. (March 2010) The relationship between recent gasoline price fluctuations and transit ridership in major US cities. Journal of Transport Geography. https://doi.org/10.1016/j.jtrangeo.2009.04.002

## Hypothesis


We predict a positive correlation between average monthly U.S. gasoline prices and monthly public transit ridership in the United States. As gasoline prices increase, individuals may be more likely to use public transportation as a cost-saving alternative to driving. This is consistent with prior transportation research on fuel costs and travel behavior.
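
One way to quantify this hypothesis is a Pearson correlation between month-over-month percentage changes in the two series. The sketch below uses made-up numbers purely to show the mechanics; the column names `gas_price` and `upt` are assumptions for illustration, not the project's actual schema.

```python
import pandas as pd

# Hypothetical monthly values, invented only to show the mechanics.
monthly = pd.DataFrame(
    {
        "gas_price": [3.10, 3.25, 3.60, 3.55, 3.90, 4.20],  # dollars per gallon
        "upt": [100.0, 102.0, 108.0, 107.0, 112.0, 118.0],  # trips, in thousands
    },
    index=pd.period_range("2022-01", periods=6, freq="M"),
)

# Month-over-month percentage changes reduce the influence of shared
# long-run trends, so the correlation reflects short-run co-movement.
changes = monthly.pct_change().dropna()

# Pearson correlation between the two change series (ranges from -1 to 1).
r = changes["gas_price"].corr(changes["upt"])
print(f"Pearson r of monthly changes: {r:.3f}")
```

A positive `r` on real data would be consistent with the hypothesis, but it would not by itself show that price changes cause ridership changes.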

## Data

### Data overview

- Dataset #1
  - Dataset Name: California Weekly Retail Gasoline Prices (All Grades, All Formulations)
  - Link to the dataset: https://www.eia.gov/dnav/pet/pet_pri_gnd_dcus_sca_w.htm
  - Number of observations:
  - Number of variables: 2
  - Description of the variables:
      - Date/Week Ending Date: Indicates when the price data were collected. This time index allows for trend analysis and linking to corresponding transit ridership data over the same periods.
      - Price (Dollars per Gallon): Reflects the average retail price consumers pay at gas stations for all grades and formulations of gasoline (including taxes). This is the main independent variable used in the project because changes in gasoline prices can influence individual travel behavior and transit ridership patterns.

  - Descriptions of any shortcomings:
      - Regional Aggregation: The price data represent a statewide average for California rather than localized values for specific metropolitan areas. Gasoline prices can vary significantly by city or region due to taxes, regulations, and supply differences, so the statewide average may not capture local price signals that influence ridership.
      - Frequency and Timing: The original data are reported weekly, and converting to monthly averages may obscure short-term spikes or declines that could correlate with weekly ridership changes.

- Dataset #2
  - Dataset Name: Monthly Public Transit Ridership
  - Link to the dataset: https://www.transit.dot.gov/ntd/data-product/monthly-module-adjusted-data-release
  - Number of observations: 1927
  - Number of variables: 14
  - Most relevant variables:
      - Agency – Transit agency name
      - HQ City / HQ State – Agency location
      - UZA Name – Urbanized area served
      - Last Closed FY End Year – Fiscal year of data
      - Unlinked Passenger Trips FY – Total ridership
      - Passenger Miles FY – Total passenger miles traveled
      - Fares FY – Total fare revenue
      - Operating Expenses FY – Total operating costs
      - Avg Cost Per Trip FY – Average cost per trip
      - Avg Fares Per Trip FY – Average fare revenue per trip

    These variables allow analysis of ridership levels and financial performance across agencies and years.

  - Descriptions of any shortcomings:
      - Data is aggregated at the fiscal-year level (limited time detail).
      - Does not include external factors like gas prices or economic conditions.
      - Some minor missing values.
      - Large variation in agency size may require normalization for comparisons.
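
The normalization mentioned in the last shortcoming could be done by dividing ridership by the population of the urbanized area. The sketch below is one possible approach, not a settled method for this project; the two agency rows are invented, and only the column names are taken from the dataset.

```python
import pandas as pd

# Hypothetical agency rows; names and numbers are invented for illustration.
agencies = pd.DataFrame(
    {
        "Agency": ["Large Agency", "Small Agency"],
        "UZA Population": [3_500_000, 150_000],
        "Unlinked Passenger Trips FY": [70_000_000, 1_500_000],
    }
)

# Trips per resident of the urbanized area puts agencies of very
# different sizes on a comparable scale.
agencies["Trips Per Capita"] = (
    agencies["Unlinked Passenger Trips FY"] / agencies["UZA Population"]
)
print(agencies[["Agency", "Trips Per Capita"]])
```

Because several agencies can serve the same urbanized area, a per-agency figure like this understates area-wide usage; summing trips within each `UZA Name` before dividing would be an alternative.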

In [1]:
# Run this code every time when you're actively developing modules in .py files.  It's not needed if you aren't making modules
#
## this code is necessary for making sure that any modules we load are updated here 
## when their source code .py files are modified

%load_ext autoreload
%autoreload 2

In [3]:
# Setup code -- this only needs to be run once after cloning the repo!
# this code downloads the data from its source to the `data/00-raw/` directory
# if the data hasn't updated you don't need to do this again!

# if you don't already have these packages (you should!) uncomment this line
# %pip install requests tqdm

import sys
sys.path.append('./modules') # this tells python where to look for modules to import

import get_data # this is where we get the function we need to download data

# replace the urls and filenames in this list with your actual datafiles
# yes you can use Google drive share links or whatever
# format is a list of dictionaries; 
# each dict has keys of 
#   'url' where the resource is located
#   'filename' for the local filename where it will be stored 
datafiles = [
    { 'url': 'https://drive.google.com/file/d/1y7KKlVfPrJ5_6CstnUzzw5s59spbFMqt/view?usp=drive_link', 'filename':'gas_prices.csv'},
    { 'url': 'https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/bad-drivers/bad-drivers.csv', 'filename':'bad-drivers.csv'}
]

get_data.get_raw(datafiles,destination_directory='data/00-raw/')

Overall Download Progress:  50%|█████     | 1/2 [00:00<00:00,  3.72it/s]
Error downloading gas_prices.csv: 401 Client Error: Unauthorized for url: https://drive.google.com/file/d/1y7KKlVfPrJ5_6CstnUzzw5s59spbFMqt/view?usp=drive_link
Downloading bad-drivers.csv:   0%|          | 0.00/1.37k [00:00<?, ?B/s]
Overall Download Progress: 100%|██████████| 2/2 [00:00<00:00,  6.30it/s]
Successfully downloaded: bad-drivers.csv
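
The 401 error above suggests that the Google Drive link is either not shared publicly or points at the Drive viewer page rather than at the file itself. A minimal sketch of one workaround follows, assuming the file's sharing setting is "Anyone with the link". The helper `drive_direct_url` is hypothetical and not part of the `get_data` module, and the `uc?export=download` form is a commonly used pattern rather than a documented guarantee.

```python
import re


def drive_direct_url(share_url: str) -> str:
    """Turn a Google Drive 'view' share link into a direct-download URL.

    Hypothetical helper for illustration; not part of the get_data module.
    """
    # The file id is the path segment that follows /file/d/ in a share link.
    match = re.search(r"/file/d/([A-Za-z0-9_-]+)", share_url)
    if match is None:
        raise ValueError(f"Could not find a Drive file id in: {share_url}")
    return f"https://drive.google.com/uc?export=download&id={match.group(1)}"


share = "https://drive.google.com/file/d/1y7KKlVfPrJ5_6CstnUzzw5s59spbFMqt/view?usp=drive_link"
print(drive_direct_url(share))
```

If the file stays private, the rewritten URL will still be refused, so the sharing setting needs to be checked first.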





### Dataset #1: California Weekly Retail Gasoline Prices (All Grades, All Formulations)
This dataset contains weekly average retail gasoline prices in California, measured in dollars per gallon. The data is published by the U.S. Energy Information Administration, a federal agency that collects and standardizes energy statistics across the country. The reported values represent the average price consumers pay at gas stations throughout California for all grades and all formulations of gasoline, including applicable taxes.

The primary metric in this dataset is retail gasoline price, expressed in dollars per gallon. For example, a value of 2.50 indicates that gasoline costs \$2.50 per gallon during that week. Historically, prices in this dataset range from approximately \$1.12 per gallon at the lowest to over \$6.00 per gallon at the highest. Lower values generally reflect periods of reduced oil demand, economic downturns, or temporary supply surpluses. Higher values typically occur during supply disruptions, refinery constraints, inflationary pressures, or geopolitical conflicts that affect crude oil markets. Since gasoline is a major transportation expense for households, changes in price can significantly influence driving behavior. When gasoline prices increase substantially, individuals may reduce discretionary driving, carpool, or consider alternatives such as public transit. Therefore, gasoline price serves as a relevant independent variable for analyzing changes in public transit ridership over time.

#### Major Concerns and Limitations

One concern with this dataset is that it reflects an average price across the entire state of California. Gasoline prices can vary widely between regions within the state due to local taxes, environmental regulations, and supply chain differences. As a result, the statewide average may not perfectly capture the price signals that influence transit ridership in specific metropolitan areas.

Additionally, gasoline price is only one factor influencing public transit ridership. Transit usage is also affected by service frequency, route availability, population density, employment patterns, and weather conditions. Because these additional factors are not included in the dataset, gasoline price alone cannot fully explain changes in ridership.

Overall, the dataset is reliable, standardized, and appropriate for analyzing fuel cost trends over time. However, its statewide aggregation and lack of contextual behavioral variables should be considered when interpreting results in relation to public transit usage.

In [3]:
import pandas as pd

# Read the raw CSV into a DataFrame so the data can be handled as a table.
gas_df = pd.read_csv('gas_prices.csv')

# Rename the long EIA column header to a short name that is easier to work with.
gas_df = gas_df.rename(columns={'Weekly California All Grades All Formulations Retail Gasoline Prices  (Dollars per Gallon)': 'Gas_Price'})

# Keep only the two columns needed for the analysis; .copy() avoids
# pandas' SettingWithCopyWarning on the assignments below.
gas_df = gas_df[['Date', 'Gas_Price']].copy()

# Parse dates so rows can be sorted and grouped chronologically; unparseable values become NaT.
gas_df['Date'] = pd.to_datetime(gas_df['Date'], errors='coerce')

# Convert prices to numbers for calculations; non-numeric entries become NaN.
gas_df['Gas_Price'] = pd.to_numeric(gas_df['Gas_Price'], errors='coerce')

# Drop weeks with no price, since missing values would distort later analysis.
gas_df = gas_df.dropna(subset=['Gas_Price'])

print(gas_df.head())

          Date  Gas_Price
281 2000-05-22      1.679
282 2000-05-29      1.673
283 2000-06-05      1.661
284 2000-06-12      1.662
285 2000-06-19      1.664
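
The data overview mentions converting the weekly prices to monthly averages. A minimal sketch of that step is shown below, using the five rows printed above rebuilt as a small stand-in frame so the example is self-contained; the output name `Avg_Gas_Price` is an assumption.

```python
import pandas as pd

# Stand-in for the cleaned weekly table, rebuilt from the rows printed above.
weekly = pd.DataFrame(
    {
        "Date": pd.to_datetime(
            ["2000-05-22", "2000-05-29", "2000-06-05", "2000-06-12", "2000-06-19"]
        ),
        "Gas_Price": [1.679, 1.673, 1.661, 1.662, 1.664],
    }
)

# Group weekly observations by calendar month and average them.
monthly = (
    weekly.set_index("Date")["Gas_Price"]
    .resample("MS")  # "MS" = month start; each bin covers one calendar month
    .mean()
    .rename("Avg_Gas_Price")
)
print(monthly)
```

Averaging treats every weekly reading equally, so a month with five reporting weeks is not weighted differently from one with four; that seems acceptable for a first pass.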


### Dataset #2: December 2025 Complete Monthly Ridership (with Adjustments and Estimates)

This dataset contains monthly public transit ridership and operational data for transit agencies across the United States. The data is reported to the Federal Transit Administration through the National Transit Database, which collects and standardizes transit performance statistics nationwide. 

The primary metric in this dataset is Unlinked Passenger Trips, which measures the number of individual boardings on public transit vehicles. Each boarding counts as one trip, even if a rider transfers between services. For example, if a commuter boards a bus and later transfers to a train, this counts as two unlinked passenger trips. Higher UPT values indicate greater transit usage, while lower values reflect reduced ridership. Ridership levels fluctuate in response to economic conditions, fuel prices, employment trends, seasonal patterns, and service availability.
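
To make the counting rule concrete, the short sketch below contrasts unlinked trips (every boarding) with linked trips (whole journeys). The journeys are invented for illustration.

```python
# Each inner list is one rider's journey; each element is one boarding.
journeys = [
    ["bus", "train"],         # one transfer: two boardings, one journey
    ["bus"],                  # single boarding
    ["train", "bus", "bus"],  # two transfers: three boardings, one journey
]

unlinked_trips = sum(len(legs) for legs in journeys)  # counts every boarding
linked_trips = len(journeys)                          # counts whole journeys

print(unlinked_trips, linked_trips)
```

One consequence is that UPT can change when transfer patterns change, even if the number of riders does not.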

#### Major Concerns and Limitations

One concern with this dataset is that it aggregates data at the agency and mode level. While this provides a comprehensive national overview, it does not capture variation within specific routes or subregions. Localized ridership changes may therefore be masked by totals.

Overall, this dataset is reliable, standardized, and appropriate for analyzing public transit activity and operational performance. However, its aggregated structure and limited contextual variables should be considered when interpreting results in relation to broader transportation trends.


In [7]:
import pandas as pd

# Load the raw ridership export.
ride_df = pd.read_csv('data/00-raw/ride_share.csv')

# Strip stray whitespace from the column headers.
ride_df.columns = ride_df.columns.str.strip()

# Columns relevant to the analysis.
keep_cols = [
    'Agency',
    'Mode/Type of Service Status',
    'HQ City',
    'HQ State',
    'UZA Name',
    'Last Closed FY End Year',
    'UZA Population',
    'Unlinked Passenger Trips FY',
    'Passenger Miles FY',
    'Avg Trip Length FY',
    'Fares FY',
    'Operating Expenses FY',
    'Avg Cost Per Trip FY',
    'Avg Fares Per Trip FY'
]

# Keep only the columns that actually exist in the file; .copy() avoids
# pandas' SettingWithCopyWarning on the assignments below.
keep_cols = [c for c in keep_cols if c in ride_df.columns]
ride_df = ride_df[keep_cols].copy()

# Convert the fiscal year to a number; unparseable values become NaN.
if 'Last Closed FY End Year' in ride_df.columns:
    ride_df['Last Closed FY End Year'] = pd.to_numeric(
        ride_df['Last Closed FY End Year'], errors='coerce'
    )

# Columns that may be stored as formatted text (e.g. "1,234" or "$5.67").
numeric_cols = [
    'UZA Population',
    'Unlinked Passenger Trips FY',
    'Passenger Miles FY',
    'Avg Trip Length FY',
    'Fares FY',
    'Operating Expenses FY',
    'Avg Cost Per Trip FY',
    'Avg Fares Per Trip FY'
]

# Remove thousands separators and dollar signs, then convert to numbers.
for col in numeric_cols:
    if col in ride_df.columns:
        ride_df[col] = (
            ride_df[col]
            .astype(str)
            .str.replace(',', '', regex=False)
            .str.replace('$', '', regex=False)
            .str.strip()
        )
        ride_df[col] = pd.to_numeric(ride_df[col], errors='coerce')

# Drop rows missing any field the analysis depends on.
required_cols = [
    'Last Closed FY End Year',
    'Unlinked Passenger Trips FY',
    'HQ City',
    'UZA Name',
    'UZA Population',
    'Passenger Miles FY',
    'Fares FY'
]
required = [c for c in required_cols if c in ride_df.columns]
ride_df = ride_df.dropna(subset=required)

print(ride_df.head())
print("\nRows, Cols:", ride_df.shape)
print("\nMissing values per column:\n", ride_df.isna().sum())


        Agency Mode/Type of Service Status  HQ City HQ State  \
0  King County                      Active  SEATTLE       WA   
1  King County                      Active  SEATTLE       WA   
2  King County                      Active  SEATTLE       WA   
3  King County                    Inactive  SEATTLE       WA   
4  King County                      Active  SEATTLE       WA   

              UZA Name  Last Closed FY End Year  UZA Population  \
0  Seattle--Tacoma, WA                   2024.0       3544011.0   
1  Seattle--Tacoma, WA                   2024.0       3544011.0   
2  Seattle--Tacoma, WA                   2024.0       3544011.0   
3  Seattle--Tacoma, WA                   2011.0       3544011.0   
4  Seattle--Tacoma, WA                   2024.0       3544011.0   

   Unlinked Passenger Trips FY  Passenger Miles FY  Avg Trip Length FY  \
0                    1045365.0           7519226.0                 7.0   
1                     111550.0           1410322.0              
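
To relate the two cleaned tables, one option is to aggregate both to a yearly level and join them on year. The sketch below is only an illustration under stated assumptions: it assumes `Last Closed FY End Year` identifies the year each row's totals refer to, it treats fiscal and calendar years as interchangeable (they may not align), and all values are invented.

```python
import pandas as pd

# Small stand-ins for the cleaned tables; values are invented for illustration.
gas = pd.DataFrame(
    {
        "Date": pd.to_datetime(
            ["2023-03-06", "2023-09-04", "2024-03-04", "2024-09-02"]
        ),
        "Gas_Price": [4.80, 5.60, 4.90, 4.70],
    }
)
rides = pd.DataFrame(
    {
        "Last Closed FY End Year": [2023, 2023, 2024],
        "Unlinked Passenger Trips FY": [1_000_000, 250_000, 1_100_000],
    }
)

# Average gas price per calendar year.
gas_yearly = (
    gas.assign(Year=gas["Date"].dt.year)
    .groupby("Year", as_index=False)["Gas_Price"]
    .mean()
)

# Total ridership per fiscal year across all agency rows.
rides_yearly = (
    rides.groupby("Last Closed FY End Year", as_index=False)[
        "Unlinked Passenger Trips FY"
    ]
    .sum()
    .rename(columns={"Last Closed FY End Year": "Year"})
)

# An inner join keeps only the years present in both tables.
combined = gas_yearly.merge(rides_yearly, on="Year", how="inner")
print(combined)
```

A monthly join would follow the same pattern with a year-month key, provided the ridership table carries monthly values.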

## Ethics

A. Data Collection
A.1 Informed consent: If there are human subjects, have they given informed consent, where subjects affirmatively opt-in and have a clear understanding of the data uses to which they consent?
This project uses publicly available, aggregated data on average gas prices and public transportation usage in the United States. No data is collected directly from individuals, and there are no human subjects involved in the data collection process. As a result, informed consent is not directly applicable to this project.

A.2 Collection bias: Have we considered sources of bias that could be introduced during data collection and survey design and taken steps to mitigate those?
We considered potential sources of collection bias in the datasets used. Public transportation usage may be overrepresented in urban areas and underrepresented in rural regions where transit options are limited. Additionally, gas prices vary significantly by region, which could influence the observed relationship. These factors are acknowledged when interpreting the results.

A.3 Limit PII exposure: Have we considered ways to minimize exposure of personally identifiable information (PII) for example through anonymization or not collecting information that isn't relevant for analysis?
No personally identifiable information (PII) is collected or used in this project. All data is aggregated at a regional or national level, minimizing any risk of individual identification.

A.4 Downstream bias mitigation: Have we considered ways to enable testing downstream results for biased outcomes (e.g., collecting data on protected group status like race or gender)?
We acknowledge that the impacts of gas prices and public transportation usage may differ across demographic groups such as income level or geographic location. However, the datasets used do not include detailed demographic information. We note this limitation and avoid making claims about specific protected groups.

B. Data Storage
B.1 Data security: Do we have a plan to protect and secure data (e.g., encryption at rest and in transit, access controls on internal users and third parties, access logs, and up-to-date software)?
B.2 Right to be forgotten: Do we have a mechanism through which an individual can request their personal information be removed?
B.3 Data retention plan: Is there a schedule or plan to delete the data after it is no longer needed?

C. Analysis
C.1 Missing perspectives: Have we sought to address blindspots in the analysis through engagement with relevant stakeholders (e.g., checking assumptions and discussing implications with affected communities and subject matter experts)?
We considered potential blind spots in the analysis, particularly the differences between urban and rural communities. Public transportation availability and usage can vary widely depending on infrastructure, population density, and regional policy, which may not be fully captured in the datasets.

C.2 Dataset bias: Have we examined the data for possible sources of bias and taken steps to mitigate or address these biases (e.g., stereotype perpetuation, confirmation bias, imbalanced classes, or omitted confounding variables)?
The datasets may contain biases due to uneven reporting across regions or time periods. Additionally, public transportation usage data may not capture informal or alternative transportation methods. These limitations are acknowledged in the analysis.

C.3 Honest representation: Are our visualizations, summary statistics, and reports designed to honestly represent the underlying data?
All visualizations and summary statistics are designed to accurately reflect the underlying data. Care is taken to avoid misleading representations, and correlation results are not interpreted as evidence of causation.

C.4 Privacy in analysis: Have we ensured that data with PII are not used or displayed unless necessary for the analysis?
No data containing PII is used or displayed in the analysis. All results are reported in aggregated form.

C.5 Auditability: Is the process of generating the analysis well documented and reproducible if we discover issues in the future?
The data sources, preprocessing steps, and analysis methods are documented to ensure that the analysis is reproducible and can be reviewed in the future if issues are identified.

D. Modeling
D.1 Proxy discrimination: Have we ensured that the model does not rely on variables or proxies for variables that are unfairly discriminatory?
D.2 Fairness across groups: Have we tested model results for fairness with respect to different affected groups (e.g., tested for disparate error rates)?
D.3 Metric selection: Have we considered the effects of optimizing for our defined metrics and considered additional metrics?
D.4 Explainability: Can we explain in understandable terms a decision the model made in cases where a justification is needed?
D.5 Communicate limitations: Have we communicated the shortcomings, limitations, and biases of the model to relevant stakeholders in ways that can be generally understood?

E. Deployment
E.1 Monitoring and evaluation: Do we have a clear plan to monitor the model and its impacts after it is deployed (e.g., performance monitoring, regular audit of sample predictions, human review of high-stakes decisions, reviewing downstream impacts of errors or low-confidence decisions, testing for concept drift)?
E.2 Redress: Have we discussed with our organization a plan for response if users are harmed by the results (e.g., how does the data science team evaluate these cases and update analysis and models to prevent future harm)?
E.3 Roll back: Is there a way to turn off or roll back the model in production if necessary?
E.4 Unintended use: Have we taken steps to identify and prevent unintended uses and abuse of the model and do we have a plan to monitor these once the model is deployed?

## Team Expectations 

* All members are expected to respond to project-related messages in our group chat within 12 hours, allowing for steady progress and check-ins.
* We will meet on Zoom or in person at least twice a week to stay on top of our goals, deadlines, and responsibilities.
* We have agreed to actively share ideas while always leaving the floor open for thoughts, comments, and concerns, which we believe will create a collaborative environment.
* When making decisions, all members will communicate how they feel and explain their reasoning. This allows us to identify different perspectives early and make sure everyone is actually on the same page. Although each member will take on a specific role, roles remain flexible and open to revision so that everyone can contribute.
* If any member is struggling to keep up, they have agreed to inform the group in good time. In response, the team will dedicate at least 30 minutes to provide support and adjust plans as needed.
* If conflicts arise that cannot be resolved within the group, we believe it is best to ask a TA for help so we can move forward respectfully and collaboratively.
* All members agree to come to meetings prepared, having completed assigned tasks or reviewed relevant materials in advance, to make meeting times more helpful and respectful of others' time.
* The team agrees to check in regularly to make sure that work is distributed fairly and to adjust it if not.


## Project Timeline Proposal

| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 1/29  | 5 PM | Look over the project proposal assignment; prepare any questions about the assignment | Share our ideas and interests for project/research questions; reorganize the repo and help each other if any confusion arises |
| 2/01  | 3 PM | Reflect on proposed ideas and brainstorm how each could be worked on | Begin drafting the project proposal and discuss group dynamics as described in the proposal document; talk about which ideas stand out and would be interesting for this project |
| 2/03  | 8:30 PM | Add the contributions discussed in the previous meeting to the document; edit our finalized idea to fit the guidelines of the proposal rubric; begin looking into which datasets would match our research question | Share what we find on datasets and practice loading them into a notebook (experimentally, before we finalize any datasets) |
| 2/11  | 5 PM | Each person looks at the datasets we have and comes prepared to share how we can use them in our research and which features are unnecessary to keep | Import and wrangle data as a group so changes can be made on the spot; give feedback on code and learn from each other; have at least two people share their analysis of the imported and wrangled data and how it reflects the research question |
| 2/17  | 8:30 PM | Continue adding code through individual branches to support our visuals (such as graphs) and the wrangling done so far | Give time for questions and feedback; make sure our variables make sense and contribute to the research question |
| 2/23  | 3 PM | Brush up any parts of the data checkpoint that are still unclear | Discuss the direction of the project and review what future code can add or contribute and how that could change things |
| 2/26  | 6 PM | Work on assigned portions of details and analysis; contribute to final drafting work | Look at code together to make sure we're getting what we intend to target, and review overall project progress |
| 3/4  | 5 PM | Review the feedback we received from the first checkpoint | Plan what we can fix if any points were deducted and take a look at checkpoint 2 |
| 3/20  | Before 11:59 PM  | NA | Turn in Final Project & Group Project Surveys |