# **Milestone** | Capital Bikeshare Ride Analysis

<div style="text-align: center;">
<img src="https://cdn.lyft.com/static/bikesharefe/logo/CapitalBikeshare-main.svg" alt="Capital Bikeshare Logo" width="320"/>
</div>


## Introduction
In this Milestone, you'll take on the role of a junior data analyst at Capital Bikeshare, the public bicycle-sharing system in Washington, D.C. Your job is to help city planners understand how people are using the system in 2024. The city wants to make data-driven decisions to improve bike availability, reduce maintenance downtime, and better serve high-demand areas.

Your manager has asked you to analyze ride data to identify patterns in usage volume, trip distances, and which stations are most frequently used. This information will inform where to allocate bikes, prioritize maintenance resources, and promote underused locations.

You will use the `2024_capitol_bikeshare.csv` dataset to complete your analysis. Each row represents one completed bike ride.

To start, import the pandas library, so that you can load the data into a DataFrame.

In [1]:
# import the pandas library
import pandas as pd

# load the data into a dataframe called bike_rides
bike_rides = pd.read_csv("datasets/2024_capitol_bikeshare.csv")

# Preview the dataset
bike_rides.head()

Unnamed: 0,ride_id,rideable_type,started_at,ended_at,trip_duration_min,start_station_name,start_station_id,end_station_name,end_station_id,member_type
0,4E026D43FD09E59C,classic_bike,2024-03-11 17:46:29,2024-03-11 18:03:07,17.0,22nd & H St NW,31127.0,15th & W St NW,31125.0,member
1,AB210B914033D41B,classic_bike,2024-03-17 19:31:24,2024-03-17 19:39:34,8.0,Crystal Dr & 15th St S,31003.0,Pentagon City Metro / 12th St & S Hayes St,31005.0,member
2,3B328C72BC05FDAB,classic_bike,2024-03-07 14:32:34,2024-03-07 15:19:45,47.0,Crystal Dr & 15th St S,31003.0,Crystal Dr & 15th St S,31003.0,casual
3,A2FD150593E11106,classic_bike,2024-03-29 18:44:08,2024-03-29 18:49:59,6.0,Columbia Rd & Belmont St NW,31113.0,Massachusetts Ave & Dupont Circle NW,31200.0,member
4,4E18243CAADD3542,classic_bike,2024-03-24 11:18:00,2024-03-24 11:24:28,6.0,Columbia Rd & Belmont St NW,31113.0,Massachusetts Ave & Dupont Circle NW,31200.0,member


## Task 1: Exploring The Data

This task will help you build a foundational understanding of the dataset: What kind of data are you working with? How much of it is there? Are there any potential issues, like missing values? Before any analysis, it’s important to get familiar with the structure so you can make informed decisions later.

In [2]:
# preview a random sample of 10 rows from the dataset
bike_rides.sample(10)

Unnamed: 0,ride_id,rideable_type,started_at,ended_at,trip_duration_min,start_station_name,start_station_id,end_station_name,end_station_id,member_type
148035,7FE784E150543D84,classic_bike,2024-03-16 20:58:46,2024-03-16 21:06:43,8.0,Lincoln Park / 13th & East Capitol St NE,31619.0,Kingman Island/The Fields at RFK,31716.0,casual
713089,466EDDBF9B815CD0,electric_bike,2024-03-18 08:14:56,2024-03-18 08:33:12,18.0,Virginia Hospital Center,31976.0,21st St & G st NW,31328.0,member
546448,BBAECDDCFF863801,electric_bike,2024-03-06 15:36:31,2024-03-06 15:40:54,4.0,11th & Girard St NW,31126.0,14th & Newton St NW,31649.0,member
468830,C327B8A5BC262B26,electric_bike,2024-03-12 12:58:45,2024-03-12 13:05:28,7.0,,,,,casual
245832,5BC200D1843FC0DD,classic_bike,2024-03-18 18:35:41,2024-03-18 19:06:55,31.0,Jefferson Memorial,31249.0,1st & K St NE,31662.0,casual
302106,8A006A2D2D39CA00,classic_bike,2024-03-13 09:17:46,2024-03-13 09:47:18,30.0,3rd & H St NE,31616.0,4th & M St SW,31108.0,casual
549581,9C2DE2291B228AEB,classic_bike,2024-03-21 17:38:08,2024-03-21 17:55:18,17.0,17th St & Rhode Island Ave NW,31210.0,3rd & Elm St NW,31118.0,member
588872,2DA6D138F16A0570,electric_bike,2024-03-29 22:35:03,2024-03-29 22:47:50,13.0,,,11th & Clifton St NW,31136.0,member
162053,88DA5F1E8DD8AD29,electric_bike,2024-03-24 22:38:33,2024-03-24 22:42:55,4.0,8th & V St NW,31134.0,7th & K St NW,31653.0,member
33293,C9F56F707D7BDB01,electric_bike,2024-03-07 13:39:35,2024-03-07 13:43:54,4.0,Virginia Square Metro / Monroe St & 9th St N,31024.0,Fairfax Dr & N Taylor St,31049.0,member


In [10]:
# How many rows and columns are in the data?
rows, columns = bike_rides.shape

print(f"Number of rows:\t\t{rows:,.0f}")
print(f"Number of columns:\t{columns:,.0f}")

Number of rows:		862,444
Number of columns:	10


In [None]:
# What kinds of data are in each column? Are there any missing values?
bike_rides.info()
bike_rides.isna().sum()
# Finding: there are null values in the station name and station ID columns.

There are a couple of columns that do have missing values. What might be some reasons for that? How could those missing values affect your analysis?
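One concrete way missing values can affect later steps is that pandas counting methods skip nulls by default. The sketch below uses a small made-up DataFrame (the `toy` name and its rows are illustrative assumptions, not taken from the real dataset) to show how `isna().sum()` counts nulls per column, and how `value_counts()` leaves them out unless `dropna=False` is passed.

```python
import pandas as pd

# Hypothetical toy data standing in for a few rides; not from the real CSV
toy = pd.DataFrame({
    "start_station_name": ["A St", None, "B St", "A St", None],
    "trip_duration_min": [5.0, 7.0, 12.0, 3.0, 9.0],
})

# Count missing values in each column
null_counts = toy.isna().sum()
print(null_counts)

# value_counts() drops nulls by default, so these counts cover only 3 of the 5 rides
counts_default = toy["start_station_name"].value_counts()
print(counts_default.sum())

# Passing dropna=False keeps the missing group visible in the counts
counts_with_nulls = toy["start_station_name"].value_counts(dropna=False)
print(counts_with_nulls.sum())
```

If station-level counts are later compared against the total number of rows, this difference is worth keeping in mind.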


<div style="border: 3px solid #30EE99; background-color: #f0fff4; padding: 15px; border-radius: 8px; color: #222; display: flex; align-items: center;">
  <span style="font-size: 10pt;">
    <strong>Try This AI Prompt:</strong> I noticed the following columns in my dataset on bikeshare rides contain null values: <i>[list the columns]</i> Why might those values be missing? How should I decide whether to ignore them, fill them in, or drop those rows? What factors should I consider before taking action?
  </span>
</div>


In [41]:
# What is the average trip duration in minutes?
average_duration = bike_rides["trip_duration_min"].mean()
print(f"Average Trip Duration:\t{average_duration:.1f} min(s).")

Average Trip Duration:	18.3 min(s).


## Task 2: Station Usage Analysis

This task explores how riders interact with the bike network — where trips are starting and ending, and which stations are most or least popular. Understanding station usage helps identify hotspots, gaps in service, and opportunities to optimize bike and dock placement.

How many unique starting stations are there in the data? Print the answer to the screen.

In [17]:
# Number of unique starting stations
num_unique_stations = bike_rides["start_station_name"].nunique()
print(f"There are {num_unique_stations} unique starting stations.")

There are 771 unique starting stations.


What are the five most common stations where rides begin?

<div style="border: 3px solid #b67ae5; background-color: #f9f1ff; padding: 15px; border-radius: 8px; color: #222; display: flex; align-items: center;">
<span style="font-size: 10pt;">
<strong>HINT: </strong> After using <span style="font-family: monospace; color: #222;">.value_counts()</span> to rank the stations, you can add <span style="font-family: monospace; color: #222;">.head(5)</span> at the end to show only the top five most common ones!
</span>
</div>


In [24]:
# Top 5 most common starting stations
top5_start_stations = bike_rides["start_station_name"].value_counts().head(5)
print(f"The top five starting stations are:\n\n{top5_start_stations}")

The top five starting stations are:

Columbus Circle / Union Station    8996
New Hampshire Ave & T St NW        7792
Lincoln Memorial                   6490
Jefferson Memorial                 6476
15th & P St NW                     6468
Name: start_station_name, dtype: int64


What are the five most common ride destinations?

In [25]:
# Top 5 most common ending stations
top5_end_stations = bike_rides["end_station_name"].value_counts().head(5)
print(f"The top five ending stations are:\n\n{top5_end_stations}")

The top five ending stations are:

Columbus Circle / Union Station    8960
New Hampshire Ave & T St NW        7542
15th & P St NW                     6494
Jefferson Memorial                 6444
Jefferson Dr & 14th St SW          6426
Name: end_station_name, dtype: int64


## Task 3: Member Type Analysis

The `member_type` column indicates whether the rider was a registered "member" (Annual Member, 30-Day Member, or Day Key Member) or a "casual" rider (Single Trip, 24-Hour Pass, 3-Day Pass, or 5-Day Pass). Knowing how much each group contributes to overall ridership can inform service improvements, membership incentives, and marketing strategies.

How many rides were taken by "members" versus "casual" riders?
<div style="border: 3px solid #b67ae5; background-color: #f9f1ff; padding: 15px; border-radius: 8px; color: #222; display: flex; align-items: center;">
<span style="font-size: 10pt;">
    <strong>HINT: </strong> Check the <strong>count</strong> of each <strong>value</strong> in the <span style="font-family: monospace; color: #222;">member_type</span> column!
</span>
</div>

In [29]:
# Count of values in member_type
member_counts = bike_rides["member_type"].value_counts()
print(f"Number of rides taken by \"members\" vs. \"casual\" riders:\n\n{member_counts}")

Number of rides taken by "members" vs. "casual" riders:

member    549216
casual    313228
Name: member_type, dtype: int64


Are there more members or casual riders in March and April in Washington, D.C.?

Members (registered riders) took more rides: 549,216, versus 313,228 for casual riders. That is about 64% of all rides in the dataset.

Find the longest and shortest rides in the entire dataset. Print both to the screen. Think about what these extreme values might tell you about overall trip behavior for both members and casual riders.


In [36]:
# longest (max) and shortest (min) trip_duration
longest_ride = bike_rides["trip_duration_min"].max()
shortest_ride = bike_rides["trip_duration_min"].min()
print(f"Longest ride:\t\t{longest_ride} min(s).")
print(f"Shortest ride:\t\t{shortest_ride} min(s).")

Longest ride:		1560.0 min(s).
Shortest ride:		1.0 min(s).
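
The extremes above cover all rides combined. Here is a minimal sketch of how the same statistics could be split by rider group, using a small made-up DataFrame (the `toy` name and its values are illustrative assumptions, not real results). On the real data, the same `groupby` call would be applied to `bike_rides`.

```python
import pandas as pd

# Hypothetical toy data; the durations are invented for illustration
toy = pd.DataFrame({
    "member_type": ["member", "casual", "member", "casual", "member"],
    "trip_duration_min": [6.0, 47.0, 8.0, 31.0, 17.0],
})

# Shortest, longest, and median ride for each rider group
duration_by_type = (
    toy.groupby("member_type")["trip_duration_min"]
    .agg(["min", "max", "median"])
)
print(duration_by_type)
```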


What is the median trip duration in minutes, across all users? Print to the screen.

In [40]:
# median trip_duration
median_trip_duration = bike_rides["trip_duration_min"].median()
print(f"Median trip duration:\t{median_trip_duration} min(s).")

Median trip duration:	10.0 min(s).


In Task 1, you found the mean trip duration. Why might the median be more useful than the mean in this case?

<div style="border: 3px solid #30EE99; background-color: #f0fff4; padding: 15px; border-radius: 8px; color: #222; display: flex; align-items: center;">
  <span style="font-size: 10pt;">
    <strong>Try This AI Prompt:</strong> I found that the median trip duration was XX minutes, but the mean was YY minutes. Why might the median be a more reliable metric in this context?
  </span>
</div>
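
A small worked example may help with this question. The sketch below assumes four invented short rides plus one ride as long as the 1,560-minute maximum found earlier. The single extreme value pulls the mean far above what a typical ride looks like, while the median barely moves.

```python
import pandas as pd

# Four invented short rides plus one extreme ride (1560 min, the dataset's maximum)
durations = pd.Series([6.0, 8.0, 10.0, 12.0, 1560.0])

mean_duration = durations.mean()      # (6 + 8 + 10 + 12 + 1560) / 5
median_duration = durations.median()  # middle value once sorted

print(f"Mean:\t{mean_duration:.1f} min(s).")
print(f"Median:\t{median_duration:.1f} min(s).")
```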


## Task 4: Identifying Underused Resources

Finally, you're looking for stations with low engagement to uncover inefficiencies in the system. Spotting underused stations helps inform marketing strategies, relocation plans, or targeted service improvements to boost ridership in overlooked areas.



Identify the **top 25** stations with the *fewest* ride departures.

In [42]:
# Top 25 most unpopular starting stations
bottom25_start_stations = bike_rides["start_station_name"].value_counts().tail(25)
print(f"The \"top\" twenty-five most unpopular starting stations are:\n\n{bottom25_start_stations}")

The "top" twenty-five most unpopular starting stations are:

Monroe St & Monroe Pl                        10
Division Ave & Foote St NE                    8
Fair Woods Pkwy & Fairfax Blvd                8
GMU/Rappahannock River Ln                     8
Lake Newport Rd and Autumn Ridge Cir          8
Ridge Rd Community Center                     8
The Shoppes @ Burnt Mills                     8
Fairfax Village                               8
Key West Ave & Diamondback Dr                 6
37th & Ely Pl SE                              6
Ridge Rd & Southern Ave SE                    6
United Medical Center                         6
White House                                   6
Green Range Dr and Glade Dr                   6
Becontree Ln & Goldenrain Ct                  6
Medical Center Dr & Key West Ave              4
GMU/Patriot Cir & York Dr                     4
Key West Ave & Great Seneca Hwy               4
GMU/Horizon Hall & Harris Theater             4
New Hampshire & Lockwood   

What proportion of all rides started from these 25 least-used stations?

<div style="border: 3px solid #b67ae5; background-color: #f9f1ff; padding: 15px; border-radius: 8px; color: #222; display: flex; align-items: center;">
<span style="font-size: 10pt;">
<strong>HINT: </strong> There are several ways to go about this one! Here's one way: first find the total number of rides in the dataset using <span style="font-family: monospace; color: #222;">bike_rides.shape</span> from Task 1. Then, sum the number of rides that started from the 25 least-used stations and divide that by the total number of rides. You could also use ChatGPT for help here, too!
    </span>
</div>



In [53]:
# Share of all rides that started from the 25 least-used stations
least_used_rides = bike_rides["start_station_name"].value_counts().tail(25).sum()
proportion = least_used_rides / bike_rides.shape[0]
print(f"Share of all rides that started from the 25 least-used stations: {proportion:.2%}")

Share of all rides that started from the 25 least-used stations: 0.02%
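
One design choice in the calculation above is the denominator. `value_counts()` skips rides with a missing start station, while `bike_rides.shape[0]` counts every row. The sketch below uses a made-up DataFrame (the `toy` name and its rows are assumptions for illustration) to show how the two denominators differ. Which one is appropriate depends on whether rides without a recorded start station should count toward "all rides".

```python
import pandas as pd

# Hypothetical toy data: 6 rides with a known start station, 2 without
toy = pd.DataFrame({
    "start_station_name": ["A St", "A St", "A St", "B St", "B St", "C St", None, None],
})

# value_counts() sorts from most to least common and ignores missing values
station_counts = toy["start_station_name"].value_counts()

# Rides from the single least-used station (tail(1) here; tail(25) on the real data)
least_used_rides = station_counts.tail(1).sum()

# Denominator 1: every row, including rides with no recorded start station
share_of_all_rides = least_used_rides / len(toy)

# Denominator 2: only rides with a known start station
share_of_known_starts = least_used_rides / station_counts.sum()

print(f"Share of all rides:\t\t{share_of_all_rides:.2%}")
print(f"Share of known-start rides:\t{share_of_known_starts:.2%}")
```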


Based on your findings, do you think low-usage stations are underperforming due to location or lack of awareness? What would you recommend Capital Bikeshare do to increase usage in those areas? Don't forget you can use ChatGPT as a teammate here in crafting your recommendation!

Low usage is most likely driven by station location. Bike-friendly roads and paths cost money and require infrastructure space, zoning changes, and other investments, so they tend to be concentrated in busy neighborhoods that already favor cycling. Which road changes are undertaken is often decided by a local vote. Capital Bikeshare could increase the incentive to use these stations by offering discounts for rides that start at low-usage stations, as many rideshare and public transit systems already do in other cities. This is basic supply and demand, and I would expect property values in the neighborhoods around low-usage stations to be lower than in the more popular areas as well.