# 03 - DC Bikeshare Statistical Analysis

This notebook computes descriptive statistics on the cleaned DC bikeshare trip data to identify usage patterns and peak demand periods, and to compare rider and bike types.

## Objectives
1. Analyze peak usage patterns (hours, days, seasons)
2. Identify top stations and routes
3. Compare member vs casual user behavior
4. Analyze trip duration patterns
5. Generate summary statistics and insights

---


## 1. Import Libraries


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import warnings
warnings.filterwarnings('ignore')

plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

print("Libraries imported successfully!")


Libraries imported successfully!


## 2. Load Cleaned Data


In [2]:
bikeshare_df = pd.read_parquet('../data/processed/bikeshare_cleaned.parquet')

print(f"Loaded {len(bikeshare_df):,} records")
print(f"Date range: {bikeshare_df['started_at'].min()} to {bikeshare_df['started_at'].max()}")
print(f"Columns: {len(bikeshare_df.columns)}")


Loaded 434,489 records
Date range: 2025-06-30 16:47:53.810000 to 2025-07-31 23:55:37.416000
Columns: 29
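
The cells that follow group by derived columns such as `hour`, `day_name`, `season`, and `route`, which are presumably created by the earlier cleaning notebook. A minimal sketch of a fail-fast check, assuming only the column names that appear in this notebook; the helper name `missing_columns` is hypothetical:

```python
# Columns referenced by the analysis cells in this notebook.
REQUIRED_COLUMNS = {
    "ride_id", "started_at", "hour", "day_name", "season",
    "start_station_name", "end_station_name", "route",
    "is_rush_hour", "is_weekend", "time_category",
    "member_casual", "duration_min", "rideable_type",
}


def missing_columns(columns, required=REQUIRED_COLUMNS):
    """Return the required column names absent from `columns`, sorted."""
    return sorted(set(required) - set(columns))


# Hypothetical example: a frame that lacks the derived `route` and `season` columns.
example_columns = REQUIRED_COLUMNS - {"route", "season"}
print(missing_columns(example_columns))
```

In the notebook this could be called as `missing_columns(bikeshare_df.columns)` right after loading, so a stale parquet file fails early rather than partway through the analysis.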


## 3. Peak Hour Analysis


In [3]:
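# Count trips for each hour of the day (the `hour` column holds the start hour).
hourly_trips = bikeshare_df.groupby('hour').size().reset_index(name='trip_count')

# Locate the hour with the most trips and its trip count.
peak_hour = hourly_trips.loc[hourly_trips['trip_count'].idxmax(), 'hour']
peak_trips = hourly_trips['trip_count'].max()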

print("=" * 60)
print("PEAK HOUR ANALYSIS")
print("=" * 60)
print(f"\nPeak Hour: {int(peak_hour)}:00")
print(f"Trips during peak hour: {peak_trips:,}")

print("\nTop 5 Busiest Hours:")
top_hours = hourly_trips.nlargest(5, 'trip_count')
for idx, row in top_hours.iterrows():
    print(f"  {int(row['hour']):02d}:00 - {row['trip_count']:,} trips")

print("\nLeast Busy Hours:")
bottom_hours = hourly_trips.nsmallest(3, 'trip_count')
for idx, row in bottom_hours.iterrows():
    print(f"  {int(row['hour']):02d}:00 - {row['trip_count']:,} trips")

hourly_trips


PEAK HOUR ANALYSIS

Peak Hour: 17:00
Trips during peak hour: 43,883

Top 5 Busiest Hours:
  17:00 - 43,883 trips
  18:00 - 36,900 trips
  16:00 - 32,666 trips
  19:00 - 29,801 trips
  08:00 - 29,760 trips

Least Busy Hours:
  03:00 - 880 trips
  04:00 - 979 trips
  02:00 - 1,686 trips


| | hour | trip_count |
|---|---|---|
| 0 | 0 | 5056 |
| 1 | 1 | 2778 |
| 2 | 2 | 1686 |
| 3 | 3 | 880 |
| 4 | 4 | 979 |
| 5 | 5 | 3058 |
| 6 | 6 | 9126 |
| 7 | 7 | 18871 |
| 8 | 8 | 29760 |
| 9 | 9 | 21953 |
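
`groupby('hour').size()` only returns hours that have at least one trip, so an hour with no trips would silently drop out of the table. The sketch below, on a small hypothetical sample, reindexes to all 24 hours and also computes each hour's share of trips:

```python
import pandas as pd

# Hypothetical sample: start hours for a handful of trips (most hours have none).
trips = pd.DataFrame({"hour": [8, 8, 17, 17, 17, 18]})

hourly = (
    trips.groupby("hour").size()
    .reindex(range(24), fill_value=0)  # make hours with zero trips explicit
    .rename("trip_count")
)

# Share of all trips that start in each hour, in percent.
hourly_share = hourly / hourly.sum() * 100
peak_hour = int(hourly.idxmax())

print(f"Peak hour: {peak_hour:02d}:00 ({hourly_share.loc[peak_hour]:.1f}% of trips)")
```

On the full dataset every hour has trips, so the reindex changes nothing there; it matters mainly for smaller slices such as a single station or a single day.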


## 4. Day of Week Analysis


In [4]:
day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
daily_trips = bikeshare_df.groupby('day_name').size().reindex(day_order)

peak_day = daily_trips.idxmax()
peak_day_trips = daily_trips.max()

print("=" * 60)
print("DAY OF WEEK ANALYSIS")
print("=" * 60)
print(f"\nBusiest Day: {peak_day}")
print(f"Trips on {peak_day}: {peak_day_trips:,}")

print("\nTrips by Day of Week:")
for day, count in daily_trips.items():
    pct = (count / daily_trips.sum()) * 100
    bar = '█' * int(pct)
    print(f"  {day:9s}: {count:>7,} ({pct:>5.2f}%) {bar}")

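# Note: these average the sample-wide totals per day name, not trips per calendar date.
# Day names that occur more often in the date range accumulate more trips.
weekday_avg = daily_trips[['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday']].mean()
weekend_avg = daily_trips[['Saturday', 'Sunday']].mean()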

print(f"\nWeekday Average: {weekday_avg:,.0f} trips/day")
print(f"Weekend Average: {weekend_avg:,.0f} trips/day")
print(f"Weekday vs Weekend Difference: {((weekday_avg - weekend_avg) / weekend_avg * 100):+.1f}%")


DAY OF WEEK ANALYSIS

Busiest Day: Thursday
Trips on Thursday: 73,749

Trips by Day of Week:
  Monday   :  50,538 (11.63%) ███████████
  Tuesday  :  67,747 (15.59%) ███████████████
  Wednesday:  70,956 (16.33%) ████████████████
  Thursday :  73,749 (16.97%) ████████████████
  Friday   :  62,357 (14.35%) ██████████████
  Saturday :  59,616 (13.72%) █████████████
  Sunday   :  49,526 (11.40%) ███████████

Weekday Average: 65,069 trips/day
Weekend Average: 54,571 trips/day
Weekday vs Weekend Difference: +19.2%
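
The totals above are summed over the whole date range, and the range does not contain the same number of each day name, so the "trips/day" figures are not per calendar date. A sketch of the normalization, using the date range printed when the data was loaded and the totals printed above; note that the first date (2025-06-30) is only partially covered, which understates the Monday figure:

```python
from datetime import date, timedelta

DAY_ORDER = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

# Totals per day name, copied from the output above.
trips_by_day = {
    "Monday": 50_538, "Tuesday": 67_747, "Wednesday": 70_956, "Thursday": 73_749,
    "Friday": 62_357, "Saturday": 59_616, "Sunday": 49_526,
}

# Count how many calendar dates in the loaded range fall on each day name.
start, end = date(2025, 6, 30), date(2025, 7, 31)
dates_per_day = {name: 0 for name in DAY_ORDER}
current = start
while current <= end:
    dates_per_day[DAY_ORDER[current.weekday()]] += 1  # weekday(): Monday == 0
    current += timedelta(days=1)

# Average trips per calendar date for each day name.
trips_per_date = {name: trips_by_day[name] / dates_per_day[name] for name in DAY_ORDER}

weekdays, weekend = DAY_ORDER[:5], DAY_ORDER[5:]
weekday_per_date = sum(trips_by_day[d] for d in weekdays) / sum(dates_per_day[d] for d in weekdays)
weekend_per_date = sum(trips_by_day[d] for d in weekend) / sum(dates_per_day[d] for d in weekend)

for name in DAY_ORDER:
    print(f"{name:9s}: {dates_per_day[name]} dates, {trips_per_date[name]:>8,.0f} trips per date")
print(f"Weekday: {weekday_per_date:,.0f} trips per date")
print(f"Weekend: {weekend_per_date:,.0f} trips per date")
```

The range holds five Mondays through Thursdays but only four Fridays, Saturdays, and Sundays. Once the totals are divided by those counts, the weekday and weekend averages come out far closer than the +19.2% figure suggests, and the ranking of individual days changes.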


## 5. Seasonal Analysis


In [5]:
season_order = ['Winter', 'Spring', 'Summer', 'Fall']
seasonal_stats = bikeshare_df.groupby('season').agg({
    'ride_id': 'count',
    'duration_min': 'mean'
}).rename(columns={'ride_id': 'total_trips', 'duration_min': 'avg_duration_min'})

# reindex() adds an all-NaN row for every season that is absent from the data.
seasonal_stats = seasonal_stats.reindex(season_order)
seasonal_stats['pct_of_annual'] = (seasonal_stats['total_trips'] / seasonal_stats['total_trips'].sum() * 100)

print("=" * 60)
print("SEASONAL DEMAND ANALYSIS")
print("=" * 60)

for season in season_order:
    if season in seasonal_stats.index:
        row = seasonal_stats.loc[season]
        print(f"\n{season}:")
        print(f"  Total Trips: {row['total_trips']:,.0f}")
        print(f"  % of Annual: {row['pct_of_annual']:.1f}%")
        print(f"  Avg Duration: {row['avg_duration_min']:.1f} minutes")

peak_season = seasonal_stats['total_trips'].idxmax()
low_season = seasonal_stats['total_trips'].idxmin()

print(f"\nPeak Season: {peak_season}")
print(f"Low Season: {low_season}")
print(f"Seasonal Variation: {(seasonal_stats['pct_of_annual'].max() / seasonal_stats['pct_of_annual'].min()):.2f}x")

seasonal_stats


SEASONAL DEMAND ANALYSIS

Winter:
  Total Trips: nan
  % of Annual: nan%
  Avg Duration: nan minutes

Spring:
  Total Trips: nan
  % of Annual: nan%
  Avg Duration: nan minutes

Summer:
  Total Trips: 434,489
  % of Annual: 100.0%
  Avg Duration: 16.1 minutes

Fall:
  Total Trips: nan
  % of Annual: nan%
  Avg Duration: nan minutes

Peak Season: Summer
Low Season: Summer
Seasonal Variation: 1.00x


| season | total_trips | avg_duration_min | pct_of_annual |
|---|---|---|---|
| Winter | NaN | NaN | NaN |
| Spring | NaN | NaN | NaN |
| Summer | 434489.0 | 16.078489 | 100.0 |
| Fall | NaN | NaN | NaN |
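
Because the loaded data spans roughly one month, only Summer has trips. "Peak Season", "Low Season", and the 1.00x variation therefore describe the same single season and carry no seasonal information. A sketch of a coverage check that could gate the seasonal comparison follows. The month-to-season mapping is an assumption (meteorological seasons), since the notebook does not show how the `season` column was derived, and both helper names are hypothetical:

```python
from datetime import date


def season_of(month):
    """Map a month number to a season (assumed meteorological convention)."""
    if month in (12, 1, 2):
        return "Winter"
    if month in (3, 4, 5):
        return "Spring"
    if month in (6, 7, 8):
        return "Summer"
    return "Fall"


def seasons_covered(start, end):
    """Return the set of seasons touched by the months from `start` to `end`."""
    covered = set()
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        covered.add(season_of(month))
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return covered


covered = seasons_covered(date(2025, 6, 30), date(2025, 7, 31))
if len(covered) < 4:
    print(f"Only {sorted(covered)} covered; skipping seasonal comparison.")
```

The same check would also keep the summary report from printing a `nan%` share for seasons that are absent from the data.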


## 6. Top 20 Busiest Stations


In [6]:
top_start_stations = bikeshare_df['start_station_name'].value_counts().head(20)
top_end_stations = bikeshare_df['end_station_name'].value_counts().head(20)

print("=" * 60)
print("TOP 20 BUSIEST START STATIONS")
print("=" * 60)
for i, (station, count) in enumerate(top_start_stations.items(), 1):
    print(f"{i:2d}. {station:45s} {count:>7,} trips")

print("\n" + "=" * 60)
print("TOP 20 BUSIEST END STATIONS")
print("=" * 60)
for i, (station, count) in enumerate(top_end_stations.items(), 1):
    print(f"{i:2d}. {station:45s} {count:>7,} trips")

common_stations = set(top_start_stations.index) & set(top_end_stations.index)
print(f"\nStations in both top 20 lists: {len(common_stations)}")


TOP 20 BUSIEST START STATIONS
 1. Columbus Circle / Union Station                 5,230 trips
 2. New Hampshire Ave & T St NW                     4,575 trips
 3. 15th & P St NW                                  3,955 trips
 4. 5th & K St NW                                   3,917 trips
 5. Eastern Market Metro                            3,710 trips
 6. 1st & M St NE                                   3,489 trips
 7. 14th & V St NW                                  3,351 trips
 8. 14th & R St NW                                  3,011 trips
 9. M St & Delaware Ave NE                          2,943 trips
10. Lincoln Memorial                                2,722 trips
11. 17th & Corcoran St NW                           2,670 trips
12. 14th St & Rhode Island Ave NW                   2,599 trips
13. 18th & New Hampshire Ave NW                     2,586 trips
14. 4th St & Madison Dr NW                          2,582 trips
15. 4th & Florida Ave NE                            2,548 trips
16. 1st & 

## 7. Top 20 Most Popular Routes


In [7]:
top_routes = bikeshare_df['route'].value_counts().head(20)

print("=" * 60)
print("TOP 20 MOST POPULAR ROUTES")
print("=" * 60)
for i, (route, count) in enumerate(top_routes.items(), 1):
    print(f"{i:2d}. {route}")
    print(f"    {count:,} trips\n")

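# A round trip starts and ends at the same station, so both halves of the route label match.
def is_round_trip(route):
    start, sep, end = route.partition(' → ')
    return bool(sep) and start == end

round_trips_in_top = sum(is_round_trip(route) for route in top_routes.index)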
print(f"Round trips in top 20: {round_trips_in_top}")


TOP 20 MOST POPULAR ROUTES
 1. Gravelly Point → Gravelly Point
    387 trips

 2. Columbus Circle / Union Station → 8th & F St NE
    373 trips

 3. Smithsonian-National Mall / Jefferson Dr & 12th St SW → Smithsonian-National Mall / Jefferson Dr & 12th St SW
    358 trips

 4. 8th & F St NE → Columbus Circle / Union Station
    330 trips

 5. Columbus Circle / Union Station → 6th & H St NE
    322 trips

 6. Eastern Market Metro → Lincoln Park / 13th & East Capitol St NE 
    284 trips

 7. Lincoln Park / 13th & East Capitol St NE  → Eastern Market Metro
    268 trips

 8. 4th & M St SW → 2nd & V St SW / James Creek Marina
    254 trips

 9. 15th St & Constitution Ave NW → 15th St & Constitution Ave NW
    243 trips

10. 6th & H St NE → Columbus Circle / Union Station
    239 trips

11. 4th St & Madison Dr NW → 4th St & Madison Dr NW
    222 trips

12. Jefferson Dr & 14th St SW → Jefferson Dr & 14th St SW
    221 trips

13. New Hampshire Ave & T St NW → 15th & P St NW
    212 trips

14
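
Several entries in this list are the same station pair ridden in opposite directions, for example Columbus Circle / Union Station and 8th & F St NE. A sketch of folding directed routes into direction-independent pairs, using counts copied from the output above; the helper name `station_pair` is hypothetical:

```python
from collections import Counter

# Directed route counts copied from the output above (a subset of the top routes).
directed_counts = {
    "Columbus Circle / Union Station → 8th & F St NE": 373,
    "8th & F St NE → Columbus Circle / Union Station": 330,
    "Columbus Circle / Union Station → 6th & H St NE": 322,
    "6th & H St NE → Columbus Circle / Union Station": 239,
    "Eastern Market Metro → Lincoln Park / 13th & East Capitol St NE ": 284,
    "Lincoln Park / 13th & East Capitol St NE  → Eastern Market Metro": 268,
    "Gravelly Point → Gravelly Point": 387,
}


def station_pair(route):
    """Return a direction-independent key for a 'start → end' route label."""
    start, _, end = route.partition("→")
    # Strip whitespace: some station names in the data carry a trailing space.
    return tuple(sorted((start.strip(), end.strip())))


pair_counts = Counter()
for route, count in directed_counts.items():
    pair_counts[station_pair(route)] += count

for (a, b), count in pair_counts.most_common(3):
    print(f"{a} <-> {b}: {count:,} trips")
```

Counted this way, the commuter pairs around Union Station outrank the round trips that lead the directed list.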

## 8. Rush Hour Analysis


In [8]:
rush_hour_stats = bikeshare_df.groupby('is_rush_hour').size()
time_category_stats = bikeshare_df['time_category'].value_counts()

print("=" * 60)
print("RUSH HOUR ANALYSIS")
print("=" * 60)
print(f"\nRush Hour Trips: {rush_hour_stats.get(True, 0):,} ({rush_hour_stats.get(True, 0)/len(bikeshare_df)*100:.1f}%)")
print(f"Non-Rush Hour Trips: {rush_hour_stats.get(False, 0):,} ({rush_hour_stats.get(False, 0)/len(bikeshare_df)*100:.1f}%)")

print("\nTrips by Time Category:")
for category, count in time_category_stats.items():
    pct = (count / len(bikeshare_df)) * 100
    print(f"  {category:25s}: {count:>7,} ({pct:>5.1f}%)")


RUSH HOUR ANALYSIS

Rush Hour Trips: 181,168 (41.7%)
Non-Rush Hour Trips: 253,321 (58.3%)

Trips by Time Category:
  Evening Rush             : 143,250 ( 33.0%)
  Midday                   : 132,105 ( 30.4%)
  Morning Rush             :  79,710 ( 18.3%)
  Night                    :  64,987 ( 15.0%)
  Late Night/Early Morning :  14,437 (  3.3%)
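
The two rush-hour measures above do not agree: the Morning Rush and Evening Rush categories together hold more trips than the `is_rush_hour` flag marks. The arithmetic below uses only the printed counts to make the gap explicit. The notebook does not show how either column is defined. One possible explanation, stated purely as an assumption, is that `is_rush_hour` applies only on weekdays while `time_category` depends on the hour alone.

```python
total_trips = 434_489
flagged_rush = 181_168                        # trips with is_rush_hour == True
morning_rush, evening_rush = 79_710, 143_250  # time_category counts

category_rush = morning_rush + evening_rush
gap = category_rush - flagged_rush

print(f"Rush trips by time_category: {category_rush:,} ({category_rush / total_trips:.1%})")
print(f"Rush trips by is_rush_hour:  {flagged_rush:,} ({flagged_rush / total_trips:.1%})")
print(f"Difference: {gap:,} trips")
```

Whichever definition is intended, later cells that report a "rush hour" share should state which of the two columns they use.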


## 9. Member vs Casual User Analysis


In [9]:
user_stats = bikeshare_df.groupby('member_casual').agg({
    'ride_id': 'count',
    'duration_min': 'mean',
    'is_weekend': 'mean',
    'is_rush_hour': 'mean'
}).rename(columns={
    'ride_id': 'total_trips',
    'duration_min': 'avg_duration_min',
    'is_weekend': 'weekend_trip_pct',
    'is_rush_hour': 'rush_hour_pct'
})

user_stats['weekend_trip_pct'] *= 100
user_stats['rush_hour_pct'] *= 100

print("=" * 60)
print("MEMBER VS CASUAL USER COMPARISON")
print("=" * 60)

for user_type in user_stats.index:
    row = user_stats.loc[user_type]
    print(f"\n{user_type.upper()} USERS:")
    print(f"  Total Trips: {row['total_trips']:,.0f}")
    print(f"  % of Total: {row['total_trips']/len(bikeshare_df)*100:.1f}%")
    print(f"  Avg Duration: {row['avg_duration_min']:.1f} minutes")
    print(f"  Weekend Trips: {row['weekend_trip_pct']:.1f}%")
    print(f"  Rush Hour Trips: {row['rush_hour_pct']:.1f}%")

member_duration = user_stats.loc['member', 'avg_duration_min']
casual_duration = user_stats.loc['casual', 'avg_duration_min']

print(f"\nKey Insights:")
print(f"  Casual users ride {(casual_duration/member_duration):.2f}x longer than members")
print(f"  Casual users take {user_stats.loc['casual', 'weekend_trip_pct']:.1f}% of trips on weekends")
print(f"  Members take {user_stats.loc['member', 'rush_hour_pct']:.1f}% of trips during rush hour")

user_stats


MEMBER VS CASUAL USER COMPARISON

CASUAL USERS:
  Total Trips: 159,989
  % of Total: 36.8%
  Avg Duration: 23.3 minutes
  Weekend Trips: 31.4%
  Rush Hour Trips: 36.6%

MEMBER USERS:
  Total Trips: 274,500
  % of Total: 63.2%
  Avg Duration: 11.9 minutes
  Weekend Trips: 21.5%
  Rush Hour Trips: 44.7%

Key Insights:
  Casual users ride 1.96x longer than members
  Casual users take 31.4% of trips on weekends
  Members take 44.7% of trips during rush hour


| member_casual | total_trips | avg_duration_min | weekend_trip_pct | rush_hour_pct |
|---|---|---|---|---|
| casual | 159989 | 23.297444 | 31.367782 | 36.583765 |
| member | 274500 | 11.871008 | 21.47796 | 44.676867 |
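
The comparison above reports differences but not whether they exceed sampling noise. A sketch of a two-proportion z-test on the weekend share, using only the standard library and the figures in the table; the function name `two_proportion_z` is hypothetical. The test assumes independent trips, which is not strictly true here because the same riders take many trips, so the result is best read as a rough indication:

```python
import math


def two_proportion_z(successes_a, n_a, successes_b, n_b):
    """Return (z, two-sided p-value) for H0: both groups share one proportion."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return z, p_value


# Weekend trip counts reconstructed from the table above (share x total, rounded).
casual_n, member_n = 159_989, 274_500
casual_weekend = round(casual_n * 31.367782 / 100)
member_weekend = round(member_n * 21.47796 / 100)

z, p_value = two_proportion_z(casual_weekend, casual_n, member_weekend, member_n)
print(f"z = {z:.1f}, p = {p_value:.3g}")
```

With several hundred thousand trips almost any difference is statistically significant, so the effect size (about 10 percentage points more weekend riding among casual users) is the more informative number.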


## 10. Trip Duration Analysis


In [10]:
duration_stats = bikeshare_df['duration_min'].describe()

print("=" * 60)
print("TRIP DURATION STATISTICS")
print("=" * 60)
print(f"\nMean Duration: {duration_stats['mean']:.2f} minutes")
print(f"Median Duration: {duration_stats['50%']:.2f} minutes")
print(f"Std Deviation: {duration_stats['std']:.2f} minutes")
print(f"Min Duration: {duration_stats['min']:.2f} minutes")
print(f"Max Duration: {duration_stats['max']:.2f} minutes")

print("\nPercentiles:")
for pct in [25, 50, 75, 90, 95, 99]:
    val = bikeshare_df['duration_min'].quantile(pct/100)
    print(f"  {pct}th percentile: {val:.2f} minutes")

print("\nDuration Distribution:")
duration_bins = [(0, 5), (5, 10), (10, 15), (15, 30), (30, 60), (60, 1440)]
for start, end in duration_bins:
    count = ((bikeshare_df['duration_min'] >= start) & (bikeshare_df['duration_min'] < end)).sum()
    pct = count / len(bikeshare_df) * 100
    print(f"  {start:3d}-{end:4d} min: {count:>7,} ({pct:>5.1f}%)")


TRIP DURATION STATISTICS

Mean Duration: 16.08 minutes
Median Duration: 9.87 minutes
Std Deviation: 32.74 minutes
Min Duration: 1.00 minutes
Max Duration: 1429.32 minutes

Percentiles:
  25th percentile: 5.86 minutes
  50th percentile: 9.87 minutes
  75th percentile: 16.96 minutes
  90th percentile: 30.42 minutes
  95th percentile: 45.56 minutes
  99th percentile: 105.13 minutes

Duration Distribution:
    0-   5 min:  82,811 ( 19.1%)
    5-  10 min: 137,159 ( 31.6%)
   10-  15 min:  84,793 ( 19.5%)
   15-  30 min:  85,369 ( 19.6%)
   30-  60 min:  30,779 (  7.1%)
   60-1440 min:  13,578 (  3.1%)
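
The distribution above is built with an explicit loop over bin edges. As an alternative sketch, `pd.cut` can assign every trip to a bin in one call. The durations below are hypothetical values chosen to land on bin edges:

```python
import pandas as pd

# Hypothetical durations in minutes, chosen to land on bin edges.
durations = pd.Series([1.0, 4.9, 5.0, 12.0, 59.9, 60.0, 1429.32])

edges = [0, 5, 10, 15, 30, 60, 1440]
labels = [f"{lo}-{hi} min" for lo, hi in zip(edges[:-1], edges[1:])]

# right=False gives half-open bins [lo, hi), matching `>= start` and `< end` above.
binned = pd.cut(durations, bins=edges, labels=labels, right=False)

# Count trips per bin, keeping the bins in ascending order (empty bins show 0).
distribution = binned.value_counts().sort_index()
print(distribution)
```

Values outside the outermost edges become `NaN` and are left out of the counts, so comparing `distribution.sum()` with the number of rows is a quick way to spot durations that fall outside every bin.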


## 11. Bike Type Analysis


In [11]:
bike_stats = bikeshare_df.groupby('rideable_type').agg({
    'ride_id': 'count',
    'duration_min': 'mean'
}).rename(columns={'ride_id': 'total_trips', 'duration_min': 'avg_duration_min'})

bike_stats['pct_of_total'] = (bike_stats['total_trips'] / bike_stats['total_trips'].sum() * 100)

print("=" * 60)
print("BIKE TYPE ANALYSIS")
print("=" * 60)

for bike_type in bike_stats.index:
    row = bike_stats.loc[bike_type]
    print(f"\n{bike_type.upper().replace('_', ' ')}:")
    print(f"  Total Trips: {row['total_trips']:,.0f}")
    print(f"  % of Total: {row['pct_of_total']:.1f}%")
    print(f"  Avg Duration: {row['avg_duration_min']:.1f} minutes")

bike_stats


BIKE TYPE ANALYSIS

CLASSIC BIKE:
  Total Trips: 265,710
  % of Total: 61.2%
  Avg Duration: 17.9 minutes

ELECTRIC BIKE:
  Total Trips: 168,779
  % of Total: 38.8%
  Avg Duration: 13.2 minutes


| rideable_type | total_trips | avg_duration_min | pct_of_total |
|---|---|---|---|
| classic_bike | 265710 | 17.914112 | 61.154598 |
| electric_bike | 168779 | 13.188655 | 38.845402 |


## 12. Comprehensive Summary Report


In [13]:
print("╔" + "═" * 78 + "╗")
print("║" + " " * 20 + "DC BIKESHARE ANALYSIS - KEY FINDINGS" + " " * 22 + "║")
print("╠" + "═" * 78 + "╣")

print("║ DATASET OVERVIEW" + " " * 61 + "║")
print("║   • Total trips analyzed: {:>50s}   ║".format(f"{len(bikeshare_df):,}"))
print("║   • Date range: {:>60s}   ║".format(f"{bikeshare_df['started_at'].min().date()} to {bikeshare_df['started_at'].max().date()}"))
print("║   • Unique stations: {:>55s}   ║".format(f"{bikeshare_df['start_station_name'].nunique():,}"))
print("║   • Unique routes: {:>57s}   ║".format(f"{bikeshare_df['route'].nunique():,}"))

print("╠" + "═" * 78 + "╣")
print("║ PEAK USAGE PATTERNS" + " " * 58 + "║")
print("║   • Peak hour: {:>61s}   ║".format(f"{int(peak_hour):02d}:00 ({peak_trips:,} trips)"))
print("║   • Busiest day: {:>59s}   ║".format(f"{peak_day} ({peak_day_trips:,} trips)"))
print("║   • Peak season: {:>59s}   ║".format(f"{peak_season}"))
print("║   • Rush hour trips: {:>55s}   ║".format(f"{rush_hour_stats.get(True, 0):,} ({rush_hour_stats.get(True, 0)/len(bikeshare_df)*100:.1f}%)"))

print("╠" + "═" * 78 + "╣")
print("║ TOP PERFORMERS" + " " * 63 + "║")
print("║   • Busiest start station: {:>49s}   ║".format(top_start_stations.index[0][:45]))
print("║   • Busiest end station: {:>51s}   ║".format(top_end_stations.index[0][:45]))
print("║   • Most popular route: {:>52s}   ║".format(top_routes.index[0][:45]))

print("╠" + "═" * 78 + "╣")
print("║ USER BEHAVIOR" + " " * 64 + "║")
member_pct = user_stats.loc['member', 'total_trips'] / len(bikeshare_df) * 100
casual_pct = user_stats.loc['casual', 'total_trips'] / len(bikeshare_df) * 100
print("║   • Member trips: {:>58s}   ║".format(f"{user_stats.loc['member', 'total_trips']:,.0f} ({member_pct:.1f}%)"))
print("║   • Casual trips: {:>58s}   ║".format(f"{user_stats.loc['casual', 'total_trips']:,.0f} ({casual_pct:.1f}%)"))
print("║   • Avg trip duration (member): {:>44s}   ║".format(f"{user_stats.loc['member', 'avg_duration_min']:.1f} min"))
print("║   • Avg trip duration (casual): {:>44s}   ║".format(f"{user_stats.loc['casual', 'avg_duration_min']:.1f} min"))

print("╠" + "═" * 78 + "╣")
print("║ SEASONAL INSIGHTS" + " " * 60 + "║")
print("║   • Summer trips: {:>58s}   ║".format(f"{seasonal_stats.loc['Summer', 'pct_of_annual']:.1f}% of annual"))
print("║   • Winter trips: {:>58s}   ║".format(f"{seasonal_stats.loc['Winter', 'pct_of_annual']:.1f}% of annual"))
print("║   • Seasonal variation: {:>52s}   ║".format(f"{(seasonal_stats['pct_of_annual'].max() / seasonal_stats['pct_of_annual'].min()):.2f}x"))

print("╚" + "═" * 78 + "╝")

print("\n✓ Analysis complete! Proceed to notebook 04 for visualizations.")
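
In the cell above, each bulleted row adds up to 82 characters (fixed text plus field width plus the closing `   ║`), while the borders and section headers are 80 characters wide, so the right edge of the box does not line up. A sketch of a helper that derives the padding from the box width instead of hard-coding it per row; the name `box_row` is hypothetical:

```python
BOX_WIDTH = 80  # total line width, including both border characters


def box_row(label, value, width=BOX_WIDTH):
    """Format '║   • label: value   ║', right-aligning value to fill `width`."""
    left = f"║   • {label}: "
    right = "   ║"
    room = width - len(left) - len(right)
    value = str(value)[:room]  # truncate values that would overflow the box
    return f"{left}{value:>{room}}{right}"


top = "╔" + "═" * (BOX_WIDTH - 2) + "╗"
row = box_row("Total trips analyzed", f"{434_489:,}")
print(top)
print(row)
```

Each `print("║   • ...".format(...))` line could then become `print(box_row(label, value))`, which also removes the need for the manual `[:45]` truncation of station and route names.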


╔══════════════════════════════════════════════════════════════════════════════╗
║                    DC BIKESHARE ANALYSIS - KEY FINDINGS                      ║
╠══════════════════════════════════════════════════════════════════════════════╣
║ DATASET OVERVIEW                                                             ║
║   • Total trips analyzed:                                            434,489   ║
║   • Date range:                                     2025-06-30 to 2025-07-31   ║
║   • Unique stations:                                                     804   ║
║   • Unique routes:                                                    76,420   ║
╠══════════════════════════════════════════════════════════════════════════════╣
║ PEAK USAGE PATTERNS                                                          ║
║   • Peak hour:                                          17:00 (43,883 trips)   ║
║   • Busiest day:                                     Thursday (73,749 trips)   ║
║   • Peak seaso