Data cleaning and basic analysis of Citi's bike program data to support visualizations and mini-report

Gendo90/citi-bike-project


NYC Citi Bike Program Analysis

Overview

This report details the differences found in the Citi Bike program data (available here) between the winter and summer seasons of 2018-2019, as well as the daily and hourly bike usage patterns for summer 2019. The key findings are that summer ridership is approximately twice winter ridership over the period examined, and that ridership consists mainly of subscribers to the Citi Bike program. Additionally, each day of the week has a relatively similar number of rides, with the possible exception of Sundays, which have somewhat fewer. There also appear to be two "peak" times for bike usage in summer, at around 8 am and 5-6 pm, which correspond to the start and end of the usual work day. The bikes are likely used to commute, exercise, or run errands at these times, and there is also moderate bike usage during the work day. The fewest rides were taken in the early morning, between about midnight and 4 am, when there is no daylight.

Initial Data Cleaning & Scope

The initial scope of the project was to look at three years of data from August 2017 to August 2020, since the last month of data available at the time was August 2020. The data was cleaned to add approximate speed and rider age columns, and to remove unrealistic outlier ages (over about 80 years old, since some riders reported ages over 120). The data was downloaded using the requests library for Python, and the monthly data files were then combined into one file per year (12 months).
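
The cleaning step described above could look roughly like the following sketch. It is not the project's actual code: the column names (`tripduration`, `birth year`, and the station latitude/longitude fields) are assumed to follow the public Citi Bike trip CSV headers, and the haversine straight-line distance is an assumed stand-in for however speed was really approximated.

```python
import math

MAX_AGE = 80  # cutoff taken from the text: ages over about 80 are dropped


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def clean_trip(row, ride_year):
    """Return a copy of row with 'age' and 'speed_kmh' added, or None for unrealistic ages.

    Column names are assumptions about the CSV layout, not confirmed by this report.
    """
    age = ride_year - int(row["birth year"])
    if age < 0 or age > MAX_AGE:
        return None
    distance_km = haversine_km(
        float(row["start station latitude"]), float(row["start station longitude"]),
        float(row["end station latitude"]), float(row["end station longitude"]),
    )
    hours = float(row["tripduration"]) / 3600  # tripduration assumed to be in seconds
    speed_kmh = distance_km / hours if hours > 0 else 0.0
    return {**row, "age": age, "speed_kmh": speed_kmh}


# Toy row for illustration only; not taken from the real data set.
example = {
    "tripduration": "600", "birth year": "1990",
    "start station latitude": "40.75", "start station longitude": "-73.98",
    "end station latitude": "40.76", "end station longitude": "-73.98",
}
cleaned = clean_trip(example, ride_year=2019)
```

Because the distance is measured in a straight line between stations, the resulting speed is a lower bound on the rider's real speed, and round trips that end at the starting station come out as zero.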

However, Tableau Public refused to load the full-year data file for the past year because it contained too much data: the free program only supports 15M rows, and the file had at least 18M rows. The scope of the project was therefore narrowed from seasonal patterns over three years to only winter and summer for one year, since summer and winter were expected to show the largest differences in ridership numbers. Likewise, the scope was changed from the aggregate of peak usage days/times over three years to only summer 2019. The primary impact of this change in scope is that the results may not be as generalizable as hoped, since the analysis covers only a small portion of the overall data for the Citi Bike program.

Finally, 2019 was chosen as the year to use for this analysis because it was not affected by the coronavirus pandemic or the related shutdown of NYC, which could have caused major changes in usage patterns or ridership numbers that are specific to the 2020 data.

Findings

The analysis section of this file largely reproduces the information found in the Tableau workbook story accessible here on my Tableau Public account. The analysis is taken directly from that workbook, in case the workbook is not available to the reader and the general analysis text is desired.

Seasonal Impact on Ridership

This first visualization of the NYC Citi Bike program data uses a bar chart to show that the number of rides in each month of summer 2019 was much higher than the number of rides in each of the winter months (see here).

The total number of rides taken during the summer months of 2019 was 9,095,558, which is approximately 114% more than the 4,255,496 rides taken during the winter months (December 2018 through March 2019).
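
As a quick check of the comparison above, the relative difference can be computed directly from the two totals quoted in the text. This is only an arithmetic sketch; the variable names are illustrative.

```python
# Totals quoted in the report.
summer_rides = 9_095_558
winter_rides = 4_255_496

ratio = summer_rides / winter_rides   # how many times larger summer is
pct_more = (ratio - 1) * 100          # percent increase over winter

print(f"summer is {ratio:.2f}x winter, i.e. {pct_more:.1f}% more rides")
```

The ratio comes out slightly above 2, which is where the "approximately twice" summary in the overview comes from.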

The number of rides taken by subscribers dwarfs the number of rides taken by single-use customers for both the summer and winter of 2019. The total number of rides taken by subscribers in winter was approximately 15.7 times larger than the total number of rides taken by customers in winter, and the total number of rides taken by subscribers in summer was approximately 4.6 times larger than the total number of rides taken by customers in summer.
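
A ratio like the ones above can be obtained by counting rides per rider type. The sketch below assumes each ride is a dict with a `usertype` field holding `"Subscriber"` or `"Customer"` (an assumption about the CSV layout), and the sample rows are toy data, not figures from the report.

```python
from collections import Counter


def subscriber_ratio(rows):
    """Rides by subscribers divided by rides by single-use customers.

    Returns None when there are no customer rides, to avoid dividing by zero.
    """
    counts = Counter(row["usertype"] for row in rows)
    if counts["Customer"] == 0:
        return None
    return counts["Subscriber"] / counts["Customer"]


# Toy rows for illustration only.
toy_rows = [{"usertype": "Subscriber"}] * 9 + [{"usertype": "Customer"}] * 2
toy_ratio = subscriber_ratio(toy_rows)
```

Running this separately on the winter rows and the summer rows would give one ratio per season.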

The number of rides taken did not change much from month to month within a season: the most active and least active winter months differ by about 400,000 rides, and the most active and least active summer months differ by about 300,000 rides. Ridership per month differs far more between seasons, however, with the median number of rides per month rising by about 1,300,000, from approximately 990,000 in winter to approximately 2,260,000 in summer. Since the ranges of monthly values for summer and winter do not overlap, the data suggests that ridership is substantially different between the two seasons, with summer months having more rides.
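
The within-season range, the median, and the no-overlap check described above can be sketched as follows. The monthly counts here are deliberately tiny placeholder numbers, since the report gives only the summary statistics and not the per-month values.

```python
from statistics import median


def summarize(monthly_counts):
    """Min, max, median, and range (max - min) of a season's monthly ride counts."""
    low, high = min(monthly_counts), max(monthly_counts)
    return {"min": low, "max": high, "median": median(monthly_counts), "range": high - low}


def ranges_overlap(a, b):
    """True if the [min, max] intervals of two season summaries share any value."""
    return a["min"] <= b["max"] and b["min"] <= a["max"]


# Placeholder monthly counts for illustration only.
winter = summarize([10, 12, 11, 14])
summer = summarize([20, 22, 21, 23])
seasons_overlap = ranges_overlap(winter, summer)
```

Non-overlapping ranges are a simple descriptive signal rather than a statistical test, which fits the small number of months available here.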

The first data dashboard (see here) visually summarizes the differences between the summer and winter months in Citi Bike ridership. Winter and summer months see different numbers of rides, with summer having over twice as many rides as winter. The rides, especially in winter months, are dominated by riders who subscribe to the Citi Bike program rather than single-use customers, and the number of rides does not differ much from month to month within the same season. This suggests that there is a core ridership that subscribes to the program and rides in winter at a rate of approximately 1 million rides per month, which when divided by 30 days per month gives ~33,000 rides per day. Assuming this population commutes by bike at a rate of two rides per day, the core ridership that subscribes to the Citi Bike program and commutes daily using Citi Bikes can be estimated at roughly 16,000 to 17,000 riders. Additionally, more data should be examined for summer and winter months to check that the trends detailed here are consistent across other years of the program.
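
The back-of-the-envelope estimate of the core ridership can be written out step by step. All three inputs are the rounded assumptions stated in the text, not measured values.

```python
# Assumptions taken from the text above.
winter_rides_per_month = 1_000_000   # approximate winter rate
days_per_month = 30
rides_per_commuter_per_day = 2       # one ride to work, one ride back

rides_per_day = winter_rides_per_month / days_per_month
core_riders = rides_per_day / rides_per_commuter_per_day

print(f"{rides_per_day:,.0f} rides per day, about {core_riders:,.0f} daily commuters")
```

The result is sensitive to the two-rides-per-day assumption: riders who commute by bike only some days would push the estimated number of distinct riders higher.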

Peak Usage Days/Times during Summer

Grouping the data by the day of the week on which each ride was taken, we can see that the number of rides per weekday throughout summer 2019 was relatively stable. The fewest rides were taken on Sundays, and this day may show different ridership numbers over longer time periods if examined in more detail: the range of rides taken per day of the week is about 150,000 when Sunday is included in the data set, and drops to approximately 70,000 when Sunday is excluded.
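
Grouping rides by day of the week and comparing the range with and without Sunday could be sketched like this. The `starttime` string format (`YYYY-MM-DD HH:MM:SS` followed by fractional seconds) is an assumption about the CSV files, and the timestamps are toy values.

```python
from collections import Counter
from datetime import datetime

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]


def rides_by_weekday(start_times):
    """Count rides per weekday name from timestamp strings."""
    counts = Counter()
    for ts in start_times:
        # Parse only the first 19 characters so any fractional seconds are ignored.
        started = datetime.strptime(ts[:19], "%Y-%m-%d %H:%M:%S")
        counts[DAY_NAMES[started.weekday()]] += 1
    return counts


def count_range(counts, exclude=()):
    """Difference between the busiest and quietest day, optionally skipping some days."""
    values = [n for day, n in counts.items() if day not in exclude]
    return max(values) - min(values)


# Toy timestamps for illustration only.
toy_counts = rides_by_weekday([
    "2019-06-01 00:00:01.9530",  # Saturday
    "2019-06-02 10:00:00.0000",  # Sunday
    "2019-06-03 08:15:00.1000",  # Monday
    "2019-06-03 17:30:00.0000",  # Monday
    "2019-06-08 09:00:00.0000",  # Saturday
])
range_all_days = count_range(toy_counts)
range_without_sunday = count_range(toy_counts, exclude=("Sunday",))
```

Comparing the two ranges is a simple way to see how much of the day-to-day spread comes from a single day.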

This line chart shows the number of rides taken over summer 2019 by time of day. There are two "peaks" in the data set, at 8 am and 5-6 pm, suggesting that many rides were taken just before or after normal working hours of 9 am to 5 pm. This suggests that the bikes were ridden to commute, exercise, or run errands before or after work. The bikes also show relatively high usage between these two peaks, during the workday of 9 am to 5 pm. The fewest rides were taken at 4 am, and the early morning hours in general show the lowest numbers of rides, as expected when few people are traveling.
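
One way to find such peaks programmatically is to bucket rides by starting hour and look for hours that are busier than both neighbors. This is a sketch under the same assumed timestamp format (`YYYY-MM-DD HH:MM:SS...`), and the hourly profile below is made-up toy data shaped like the pattern described, not the real counts.

```python
def rides_by_hour(start_times):
    """Return a list of 24 ride counts indexed by starting hour."""
    counts = [0] * 24
    for ts in start_times:
        counts[int(ts[11:13])] += 1  # characters 11-12 hold the hour
    return counts


def local_peaks(counts):
    """Hours whose count is strictly greater than both neighboring hours (wrapping at midnight)."""
    n = len(counts)
    return [h for h in range(n)
            if counts[h] > counts[(h - 1) % n] and counts[h] > counts[(h + 1) % n]]


# Toy hourly profile for illustration only.
toy_profile = [2, 1, 1, 1, 0, 1, 2, 4, 6, 4, 3, 3, 3, 3, 3, 4, 5, 7, 6, 5, 4, 3, 3, 2]
peak_hours = local_peaks(toy_profile)
quietest_hour = toy_profile.index(min(toy_profile))
```

The strict comparison means a flat plateau, such as equal counts at 5 pm and 6 pm, would not be reported as a peak, so real data may need smoothing or a looser comparison.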

This dashboard (see here) visually summarizes the time-of-day and day-of-week ridership data for the NYC Citi Bike program for summer 2019. The bar chart shows that most days of the week have similar numbers of rides, but Sunday may have fewer rides than the others; further analysis is needed to confirm this. The time-of-day data shows two clear peaks before and after usual working hours, moderate activity during the work day, and the fewest rides in the early morning. These peaks are consistent with the findings from the previous dashboard: they likely correspond to commuters, or at least to the subscribers who form the core ridership that keeps riding Citi Bikes even in winter.

This map shows the number of bike rides starting from each station for winter starting December 2018 and summer starting June 2019. The data can be filtered to show only certain months and years using the options to the right of the map, and is initially set to show all data. The stations where the most rides started were mainly located in Midtown, with counts decreasing towards the Uptown and Downtown districts. The number of rides starting at each station decreased significantly across the East River from Manhattan proper, and decreased further as the distance from Manhattan increased. There also appear to be fewer rides from stations located in less-populated areas or on the borders between highly populated and lightly populated areas, since such stations likely serve fewer people and/or are farther from their housing. One particularly interesting finding is that the most popular station overall (the largest and darkest circle) serves Grand Central Terminal, suggesting that riders take mass transit to Grand Central and then bike somewhere nearby. More data is needed to confirm these trends, since this dataset only includes eight months of data within a 1-year period; additional data could also be used to determine how bike ridership has changed over time.

Future Work

The findings in this report are based on less than one year's worth of Citi Bike program ridership data. As a result, the remainder of the data should be examined to determine whether these trends hold over all, most, or only some of the years that the program has been available in NYC. These preliminary results are a good starting point for analyzing the rest of the data, which would require additional city resources such as more time, funding, or a larger data analysis team. Similarly, additional analysis of ride time, distance, and similar variables could be performed, and might have more of an impact on where to place new stations or how far apart to space them.
