# 01 - DC Bikeshare Data Collection

This notebook handles the initial data loading and inspection of Capital Bikeshare trip data.

## Objectives
1. Load raw CSV files from the data directory
2. Inspect data structure and quality
3. Combine multiple files if needed
4. Perform initial data validation
5. Save combined dataset for processing

---


## 1. Import Libraries


In [1]:
import pandas as pd
import numpy as np
import glob
import os
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

print("Libraries imported successfully!")
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")


Libraries imported successfully!
Pandas version: 2.1.4
NumPy version: 1.26.4


## 2. Locate Data Files


In [2]:
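data_path = '../data/raw/'
# glob returns files in arbitrary order; sort so csv_files[0] is deterministic
csv_files = sorted(glob.glob(os.path.join(data_path, '*.csv')))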

print(f"Found {len(csv_files)} CSV file(s):")
for i, file in enumerate(csv_files, 1):
    file_size = os.path.getsize(file) / (1024 * 1024)
    print(f"  {i}. {os.path.basename(file)} ({file_size:.2f} MB)")


Found 1 CSV file(s):
  1. 202507-capitalbikeshare-tripdata.csv (125.86 MB)


## 3. Preview First File Structure


In [3]:
sample_df = pd.read_csv(csv_files[0], nrows=5)

print("Column Names:")
print("-" * 50)
for i, col in enumerate(sample_df.columns, 1):
    print(f"  {i}. {col}")

print("\n" + "=" * 50)
print("First 5 Rows:")
print("=" * 50)
sample_df


Column Names:
--------------------------------------------------
  1. ride_id
  2. rideable_type
  3. started_at
  4. ended_at
  5. start_station_name
  6. start_station_id
  7. end_station_name
  8. end_station_id
  9. start_lat
  10. start_lng
  11. end_lat
  12. end_lng
  13. member_casual

First 5 Rows:


,ride_id,rideable_type,started_at,ended_at,start_station_name,start_station_id,end_station_name,end_station_id,start_lat,start_lng,end_lat,end_lng,member_casual
0,7E0AE91FE33A22BB,classic_bike,2025-07-04 21:56:18.589,2025-07-04 22:11:51.395,18th & C St NW,31284,Wisconsin Ave & K St NW,31225,38.893511,-77.041544,38.902801,-77.062819,member
1,271012A4A29FF6E9,classic_bike,2025-07-20 19:50:08.020,2025-07-20 19:53:43.449,Potomac Ave & 8th St SE,31635,M St & New Jersey Ave SE,31208,38.876737,-76.994468,38.876219,-77.004169,member
2,9B368F8D9B50458D,classic_bike,2025-07-15 00:08:36.191,2025-07-15 00:12:36.635,9th & G St NW,30201,6th St & Indiana Ave NW,31264,38.898097,-77.023924,38.894573,-77.01994,member
3,1BB184FC7D908479,classic_bike,2025-07-31 14:55:06.771,2025-07-31 15:04:25.499,9th & G St NW,30201,3rd & H St NW,31604,38.898097,-77.023924,38.899408,-77.015289,member
4,D9224B14FFBBFFCE,classic_bike,2025-07-07 17:20:29.587,2025-07-07 17:35:37.726,Tysons Metro South,32203,Madron Ln & Bermudez Ct,32271,38.919475,-77.221179,38.904415,-77.221807,member


## 4. Load All Data Files


In [4]:
dfs = []

print("Loading CSV files...")
print("=" * 50)

for file in csv_files:
    filename = os.path.basename(file)
    print(f"Loading: {filename}...", end=" ")
    
    df = pd.read_csv(file, low_memory=False)
    df['source_file'] = filename
    dfs.append(df)
    
    print(f"✓ ({len(df):,} rows)")

bikeshare_df = pd.concat(dfs, ignore_index=True)

print("=" * 50)
print(f"Total records loaded: {len(bikeshare_df):,}")
print(f"Memory usage: {bikeshare_df.memory_usage(deep=True).sum() / (1024**2):.2f} MB")


Loading CSV files...
Loading: 202507-capitalbikeshare-tripdata.csv... ✓ (705,343 rows)
Total records loaded: 705,343
Memory usage: 429.23 MB


## 5. Data Overview


In [5]:
print("Dataset Information:")
print("=" * 50)
bikeshare_df.info()


Dataset Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 705343 entries, 0 to 705342
Data columns (total 14 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   ride_id             705343 non-null  object 
 1   rideable_type       705343 non-null  object 
 2   started_at          705343 non-null  object 
 3   ended_at            705343 non-null  object 
 4   start_station_name  503135 non-null  object 
 5   start_station_id    503135 non-null  float64
 6   end_station_name    497393 non-null  object 
 7   end_station_id      497288 non-null  float64
 8   start_lat           705343 non-null  float64
 9   start_lng           705343 non-null  float64
 10  end_lat             704629 non-null  float64
 11  end_lng             704629 non-null  float64
 12  member_casual       705343 non-null  object 
 13  source_file         705343 non-null  object 
dtypes: float64(6), object(8)
memory usage: 75.3+ MB
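
The `info()` output shows every text column stored as `object` and both station ID columns as `float64`, which is what pandas falls back to when an integer column contains missing values. A minimal sketch of how those dtypes might be tightened, run on a small made-up frame that reuses the real column names; the choice of `category` and nullable `Int64` is an assumption about what later notebooks need, not something this notebook does:

```python
import numpy as np
import pandas as pd

# Illustrative rows only; the real frame has 705,343 of them.
demo = pd.DataFrame({
    "rideable_type": ["classic_bike", "electric_bike"] * 500,
    "member_casual": ["member", "casual"] * 500,
    "started_at": ["2025-07-04 21:56:18.589", "2025-07-20 19:50:08.020"] * 500,
    "start_station_id": [31284.0, np.nan] * 500,
})

before = demo.memory_usage(deep=True).sum()

tight = demo.assign(
    # Low-cardinality text columns store compactly as categoricals
    rideable_type=demo["rideable_type"].astype("category"),
    member_casual=demo["member_casual"].astype("category"),
    # Parse timestamps once instead of keeping them as strings
    started_at=pd.to_datetime(demo["started_at"]),
    # Nullable integer keeps missing IDs as <NA> without the float upcast
    start_station_id=demo["start_station_id"].astype("Int64"),
)

after = tight.memory_usage(deep=True).sum()
print(tight.dtypes)
print(f"{before:,} bytes -> {after:,} bytes")
```

On the full dataset the savings would depend on how many distinct values each column holds, so the numbers from this toy frame should not be read as a forecast.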


## 6. Check for Missing Values


In [6]:
missing_data = pd.DataFrame({
    'Column': bikeshare_df.columns,
    'Missing_Count': bikeshare_df.isnull().sum(),
    'Missing_Percent': (bikeshare_df.isnull().sum() / len(bikeshare_df) * 100).round(2)
})

missing_data = missing_data[missing_data['Missing_Count'] > 0].sort_values('Missing_Count', ascending=False)

if len(missing_data) > 0:
    print("Columns with Missing Values:")
    print("=" * 50)
    print(missing_data.to_string(index=False))
else:
    print("✓ No missing values found!")


Columns with Missing Values:
            Column  Missing_Count  Missing_Percent
    end_station_id         208055            29.50
  end_station_name         207950            29.48
start_station_name         202208            28.67
  start_station_id         202208            28.67
           end_lat            714             0.10
           end_lng            714             0.10


## 7. Date Range Analysis


In [7]:
# Parse into a local Series so the DataFrame itself is left unchanged
started_at = pd.to_datetime(bikeshare_df['started_at'])

print("Date Range:")
print("=" * 50)
print(f"Start Date: {started_at.min()}")
print(f"End Date: {started_at.max()}")
print(f"Duration: {(started_at.max() - started_at.min()).days} days")


Date Range:
Start Date: 2025-06-30 00:00:39.895000
End Date: 2025-07-31 23:58:54.664000
Duration: 31 days
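
The earliest trip starts on 2025-06-30, although the file name carries the `202507` prefix. A small sketch of how such out-of-month rows could be counted, assuming the first six characters of the file name encode year and month (true for the one file seen here, unverified for others); the middle timestamp is illustrative, and the other two are the boundary values from the output above:

```python
import pandas as pd

filename = "202507-capitalbikeshare-tripdata.csv"
# Assumption: the file name starts with YYYYMM
expected_month = pd.to_datetime(filename[:6], format="%Y%m").to_period("M")

started = pd.to_datetime(pd.Series([
    "2025-06-30 00:00:39.895",   # earliest start reported above
    "2025-07-15 00:08:36.191",   # illustrative mid-month trip
    "2025-07-31 23:58:54.664",   # latest start reported above
]))

# True for trips whose start month differs from the month in the file name
outside = started.dt.to_period("M") != expected_month
print(f"Trips starting outside {expected_month}: {outside.sum()} of {len(started)}")
```

Whether such rows should be kept or dropped is a decision for the cleaning step; this only counts them.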


## 8. Basic Statistics


In [8]:
print("Quick Statistics:")
print("=" * 50)
print(f"Total Trips: {len(bikeshare_df):,}")
print(f"Unique Start Stations: {bikeshare_df['start_station_name'].nunique()}")
print(f"Unique End Stations: {bikeshare_df['end_station_name'].nunique()}")
print(f"Bike Types: {bikeshare_df['rideable_type'].unique()}")
print(f"User Types: {bikeshare_df['member_casual'].unique()}")

print("\nUser Type Distribution:")
print(bikeshare_df['member_casual'].value_counts())

print("\nBike Type Distribution:")
print(bikeshare_df['rideable_type'].value_counts())


Quick Statistics:
Total Trips: 705,343
Unique Start Stations: 804
Unique End Stations: 805
Bike Types: ['classic_bike' 'electric_bike']
User Types: ['member' 'casual']

User Type Distribution:
member_casual
member    488137
casual    217206
Name: count, dtype: int64

Bike Type Distribution:
rideable_type
electric_bike    438893
classic_bike     266450
Name: count, dtype: int64
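
The statistics above do not yet cover the validation step listed in the objectives. A hedged sketch of two cheap checks, duplicate `ride_id` values and trips that end before they start, on invented rows; whether such rows exist in the real file is not established here:

```python
import pandas as pd

# Invented rows: one repeated ride_id and one trip that ends before it starts
demo = pd.DataFrame({
    "ride_id": ["A1", "B2", "B2", "C3"],
    "started_at": ["2025-07-01 08:00:00", "2025-07-01 09:00:00",
                   "2025-07-01 09:00:00", "2025-07-01 10:30:00"],
    "ended_at": ["2025-07-01 08:15:00", "2025-07-01 09:20:00",
                 "2025-07-01 09:20:00", "2025-07-01 10:00:00"],
})

# keep=False marks every copy of a repeated ID, not just the later ones
duplicate_rows = demo[demo["ride_id"].duplicated(keep=False)]

# Trip duration as a Timedelta; negative values indicate inconsistent timestamps
duration = pd.to_datetime(demo["ended_at"]) - pd.to_datetime(demo["started_at"])
negative_rows = demo[duration < pd.Timedelta(0)]

print(f"Rows sharing a ride_id: {len(duplicate_rows)}")
print(f"Trips with negative duration: {len(negative_rows)}")
```

The same two expressions should work on `bikeshare_df` unchanged, since they rely only on the `ride_id`, `started_at`, and `ended_at` columns.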


## 9. Sample Data Display


In [9]:
print("First 10 Rows:")
bikeshare_df.head(10)


First 10 Rows:


,ride_id,rideable_type,started_at,ended_at,start_station_name,start_station_id,end_station_name,end_station_id,start_lat,start_lng,end_lat,end_lng,member_casual,source_file
0,7E0AE91FE33A22BB,classic_bike,2025-07-04 21:56:18.589,2025-07-04 22:11:51.395,18th & C St NW,31284.0,Wisconsin Ave & K St NW,31225.0,38.893511,-77.041544,38.902801,-77.062819,member,202507-capitalbikeshare-tripdata.csv
1,271012A4A29FF6E9,classic_bike,2025-07-20 19:50:08.020,2025-07-20 19:53:43.449,Potomac Ave & 8th St SE,31635.0,M St & New Jersey Ave SE,31208.0,38.876737,-76.994468,38.876219,-77.004169,member,202507-capitalbikeshare-tripdata.csv
2,9B368F8D9B50458D,classic_bike,2025-07-15 00:08:36.191,2025-07-15 00:12:36.635,9th & G St NW,30201.0,6th St & Indiana Ave NW,31264.0,38.898097,-77.023924,38.894573,-77.01994,member,202507-capitalbikeshare-tripdata.csv
3,1BB184FC7D908479,classic_bike,2025-07-31 14:55:06.771,2025-07-31 15:04:25.499,9th & G St NW,30201.0,3rd & H St NW,31604.0,38.898097,-77.023924,38.899408,-77.015289,member,202507-capitalbikeshare-tripdata.csv
4,D9224B14FFBBFFCE,classic_bike,2025-07-07 17:20:29.587,2025-07-07 17:35:37.726,Tysons Metro South,32203.0,Madron Ln & Bermudez Ct,32271.0,38.919475,-77.221179,38.904415,-77.221807,member,202507-capitalbikeshare-tripdata.csv
5,73B9CA4A5AD90C97,electric_bike,2025-07-13 14:51:59.595,2025-07-13 15:03:40.700,9th & N St NW,31336.0,16th & Harvard St NW,31135.0,38.906622,-77.023885,38.926102,-77.03665,member,202507-capitalbikeshare-tripdata.csv
6,0F5612AC84156EF8,electric_bike,2025-07-25 13:03:29.271,2025-07-25 13:46:30.349,9th & G St NW,30201.0,44th St & New Mexico Ave NW,31391.0,38.898097,-77.023924,38.933899,-77.086263,member,202507-capitalbikeshare-tripdata.csv
7,680C8E6986192751,classic_bike,2025-07-12 00:03:19.465,2025-07-12 00:22:10.817,9th & G St NW,30201.0,19th St & Pennsylvania Ave NW,31100.0,38.898097,-77.023924,38.9003,-77.0429,casual,202507-capitalbikeshare-tripdata.csv
8,7C791699B4C495E3,classic_bike,2025-07-24 17:09:03.853,2025-07-24 17:16:44.566,18th St & Pennsylvania Ave NW,31242.0,18th & New Hampshire Ave NW,31324.0,38.89968,-77.041539,38.911268,-77.041829,member,202507-capitalbikeshare-tripdata.csv
9,F0027CC790F48AA8,classic_bike,2025-07-30 16:23:21.310,2025-07-30 16:31:01.281,18th St & Pennsylvania Ave NW,31242.0,18th & New Hampshire Ave NW,31324.0,38.89968,-77.041539,38.911268,-77.041829,member,202507-capitalbikeshare-tripdata.csv


## 10. Save Raw Combined Data


In [10]:
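output_path = '../data/processed/bikeshare_raw_combined.parquet'
# Create the target directory if missing; to_parquet also needs pyarrow or fastparquet installed
os.makedirs(os.path.dirname(output_path), exist_ok=True)
bikeshare_df.to_parquet(output_path, index=False)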

print(f"✓ Raw combined data saved to: {output_path}")
print(f"  Records: {len(bikeshare_df):,}")
print(f"  Columns: {len(bikeshare_df.columns)}")
print("\nReady for data cleaning in notebook 02!")


✓ Raw combined data saved to: ../data/processed/bikeshare_raw_combined.parquet
  Records: 705,343
  Columns: 14

Ready for data cleaning in notebook 02!
