# Cyclistic Case Study: Data Cleaning & Standardization

This notebook applies the reusable cleaning functions from `cyclistic_cleaning.py`
to the raw Divvy trip data (Dec 2024 – Nov 2025).  

Goals:
- Apply the standardized cleaning pipeline
- Validate the schema and category values
- Save the cleaned dataset for downstream analysis


In [16]:
import importlib
import pandas as pd
import cyclistic_cleaning as cc  # utility module

# Reload the module so edits to cyclistic_cleaning.py take effect without restarting the kernel
importlib.reload(cc)


<module 'cyclistic_cleaning' from 'C:\\Users\\Andrew\\Desktop\\GoogleCyclistic15_12_25\\cyclistic_cleaning.py'>

In [18]:
# List of raw monthly CSVs (update paths if needed)
files = [
    "divvy_data/202412-divvy-tripdata.csv",
    "divvy_data/202501-divvy-tripdata.csv",
    "divvy_data/202502-divvy-tripdata.csv",
    "divvy_data/202503-divvy-tripdata.csv",
    "divvy_data/202504-divvy-tripdata.csv",
    "divvy_data/202505-divvy-tripdata.csv",
    "divvy_data/202506-divvy-tripdata.csv",
    "divvy_data/202507-divvy-tripdata.csv",
    "divvy_data/202508-divvy-tripdata.csv",
    "divvy_data/202509-divvy-tripdata.csv",
    "divvy_data/202510-divvy-tripdata.csv",
    "divvy_data/202511-divvy-tripdata.csv"
]

# Concatenate into one DataFrame
df_raw = cc.load_and_concat(files)

print("Raw dataset shape:", df_raw.shape)
df_raw.head()


Raw dataset shape: (5590832, 13)


| | ride_id | rideable_type | started_at | ended_at | start_station_name | start_station_id | end_station_name | end_station_id | start_lat | start_lng | end_lat | end_lng | member_casual |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6C960DEB4F78854E | electric_bike | 2024-12-31 01:38:35.018 | 2024-12-31 01:48:45.775 | Halsted St & Roscoe St | TA1309000025 | Clark St & Winnemac Ave | TA1309000035 | 41.943632 | -87.649083 | 41.973348 | -87.667855 | member |
| 1 | C0913EEB2834E7A2 | classic_bike | 2024-12-21 18:41:26.478 | 2024-12-21 18:47:33.871 | Clark St & Wellington Ave | TA1307000136 | Halsted St & Roscoe St | TA1309000025 | 41.936497 | -87.647539 | 41.943632 | -87.649083 | member |
| 2 | 848A37DD4723078A | classic_bike | 2024-12-21 11:41:01.664 | 2024-12-21 11:52:45.094 | Sheridan Rd & Montrose Ave | TA1307000107 | Broadway & Barry Ave | 13137 | 41.96167 | -87.65464 | 41.937582 | -87.644098 | member |
| 3 | 3FA09C762ECB48BD | electric_bike | 2024-12-26 13:07:27.526 | 2024-12-26 13:10:54.130 | Aberdeen St & Jackson Blvd | 13157 | Green St & Randolph St* | chargingstx3 | 41.877726 | -87.654787 | 41.883602 | -87.648627 | member |
| 4 | E60317ADD1A87488 | electric_bike | 2024-12-13 15:17:55.063 | 2024-12-13 15:27:32.583 | Paulina St & Flournoy St | KA1504000104 | Fairfield Ave & Roosevelt Rd | KA1504000102 | 41.873061 | -87.669135 | 41.866624 | -87.694521 | member |
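
The source of `cc.load_and_concat` is not shown in this notebook. As a rough sketch of what such a helper might do, assuming it simply reads each monthly CSV and stacks the rows, something like the following would work. The function name `load_and_concat_sketch` and the two in-memory "files" are hypothetical and only there so the example runs without the Divvy data.

```python
import io
import pandas as pd

def load_and_concat_sketch(sources):
    """Hypothetical stand-in for cc.load_and_concat: read each CSV and stack the rows."""
    frames = [pd.read_csv(src) for src in sources]
    # ignore_index=True gives one continuous 0..n-1 index across all months
    return pd.concat(frames, ignore_index=True)

# Two tiny in-memory "monthly files" (made-up rows, not real trips)
dec = io.StringIO("ride_id,member_casual\nA1,member\nA2,casual\n")
jan = io.StringIO("ride_id,member_casual\nB1,member\n")

df_demo = load_and_concat_sketch([dec, jan])
print("Demo shape:", df_demo.shape)
```

Passing `ignore_index=True` is a design choice in this sketch: without it, each month would keep its own row labels starting at 0, and label-based lookups on the combined frame would be ambiguous.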


In [20]:
# 1) Exact full-row duplicates
full_dupe_rows = df_raw.duplicated().sum()
print("Full-row duplicates:", full_dupe_rows)

# 2) Duplicate ride_id (regardless of content in other columns)
ride_id_dupes = df_raw.duplicated(subset=['ride_id']).sum()
print("Duplicate ride_id entries:", ride_id_dupes)

# 3) Null or malformed ride_id that can disrupt deduping
print("Null ride_id count:", df_raw['ride_id'].isna().sum())
print("Non-str ride_id types (sample):", df_raw['ride_id'].map(type).value_counts().head())


Full-row duplicates: 0
Duplicate ride_id entries: 0
Null ride_id count: 0
Non-str ride_id types (sample): ride_id
<class 'str'>    5590832
Name: count, dtype: int64


In [5]:
# Run the full cleaning pipeline
df_clean = cc.clean_cyclistic(df_raw)

print("Cleaned dataset shape:", df_clean.shape)
df_clean.info()


Initial dataset: 5590832 rows
Convert datetimes: 0 rows removed (0.00%), 5590832 rows remain.
Remove duplicates: 0 rows removed (0.00%), 5590832 rows remain.
Add trip duration (1–1440 min): 152387 rows removed (2.73%), 5438445 rows remain.
Validate categories: 0 rows removed (0.00%), 5438445 rows remain.
Final dataset: 5438445 rows

=== Cleaning Summary ===
Convert datetimes              | Removed: 0        |   0.00% | Remaining: 5590832
Remove duplicates              | Removed: 0        |   0.00% | Remaining: 5590832
Add trip duration (1–1440 min) | Removed: 152387   |   2.73% | Remaining: 5438445
Validate categories            | Removed: 0        |   0.00% | Remaining: 5438445

Cleaned dataset shape: (5438445, 15)
<class 'pandas.core.frame.DataFrame'>
Index: 5438445 entries, 0 to 5590831
Data columns (total 15 columns):
 #   Column              Dtype         
---  ------              -----         
 0   ride_id             object        
 1   rideable_type       object        
 2   s
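
The log above shows that the only step removing rows is the duration filter (1 to 1440 minutes). Its implementation lives in `cyclistic_cleaning.py` and is not reproduced here, so the following is only a minimal sketch of how such a step could be written. The function name `add_trip_duration_sketch`, its parameters, and the inclusive treatment of both bounds are assumptions.

```python
import pandas as pd

def add_trip_duration_sketch(df, min_minutes=1, max_minutes=1440):
    """Hypothetical duration step: compute minutes and keep trips inside the bounds."""
    out = df.copy()
    out["started_at"] = pd.to_datetime(out["started_at"])
    out["ended_at"] = pd.to_datetime(out["ended_at"])
    out["trip_duration_min"] = (
        (out["ended_at"] - out["started_at"]).dt.total_seconds() / 60
    )
    # Series.between is inclusive on both ends by default (an assumption about the real pipeline)
    keep = out["trip_duration_min"].between(min_minutes, max_minutes)
    removed = int((~keep).sum())
    print(
        f"Add trip duration: {removed} rows removed "
        f"({removed / len(out):.2%}), {int(keep.sum())} rows remain."
    )
    return out.loc[keep]

# Made-up rows: a 10-minute trip, a 20-second trip, and a 36-hour trip
demo = pd.DataFrame({
    "started_at": ["2024-12-31 01:38:35", "2024-12-31 02:00:00", "2024-12-30 00:00:00"],
    "ended_at":   ["2024-12-31 01:48:45", "2024-12-31 02:00:20", "2024-12-31 12:00:00"],
})
demo_clean = add_trip_duration_sketch(demo)
```

Only the first demo row survives: the 20-second trip falls below the lower bound and the 36-hour trip exceeds the upper one.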

In [6]:
# Quick checks
print("Unique member_casual values:", df_clean['member_casual'].unique())
print("Unique rideable_type values:", df_clean['rideable_type'].unique())

# Trip duration summary
df_clean['trip_duration_min'].describe()


Unique member_casual values: ['member' 'casual']
Unique rideable_type values: ['electric_bike' 'classic_bike']


count    5.438445e+06
mean     1.488011e+01
std      2.870600e+01
min      1.000000e+00
25%      5.666900e+00
50%      9.660167e+00
75%      1.679743e+01
max      1.439976e+03
Name: trip_duration_min, dtype: float64

# Validation Notes
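- Categories validated: `member_casual` contains only `member` and `casual`; `rideable_type` contains only `electric_bike` and `classic_bike`.
- Trip durations fall between 1 and 1,440 minutes (observed minimum 1.00, maximum 1,439.98); the duration filter removed 152,387 rows (2.73%).
- Station names cleaned and title-cased (this step does not appear in the cleaning log above and is not separately checked in this notebook).
- No duplicate `ride_id` values remain; the raw data contained none.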
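
The notes above could also be enforced in code, so that a future data refresh fails loudly instead of silently drifting. A minimal sketch follows, assuming the column names shown earlier in this notebook; `validate_cleaned` is a hypothetical helper, not part of `cyclistic_cleaning.py`. The station-name check is left out because the exact normalization rule is not shown here.

```python
import pandas as pd

def validate_cleaned(df):
    """Raise AssertionError if any of the checks from the validation notes fails."""
    assert set(df["member_casual"].unique()) <= {"member", "casual"}, "unexpected rider type"
    assert set(df["rideable_type"].unique()) <= {"electric_bike", "classic_bike"}, "unexpected bike type"
    assert df["trip_duration_min"].between(1, 1440).all(), "duration outside 1 to 1440 minutes"
    assert not df["ride_id"].duplicated().any(), "duplicate ride_id found"
    return True

# Made-up rows that satisfy every check
good_demo = pd.DataFrame({
    "ride_id": ["A1", "A2"],
    "member_casual": ["member", "casual"],
    "rideable_type": ["electric_bike", "classic_bike"],
    "trip_duration_min": [10.2, 6.1],
})
print("Validation passed:", validate_cleaned(good_demo))
```

On the real data the call would be `validate_cleaned(df_clean)`, placed right before the save step.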


In [8]:
# Save full cleaned dataset
df_clean.to_csv("cyclistic_cleaned.csv", index=False)

# Save rolling-year subset: trips starting on or after 2024-12-01 and before 2025-12-01
mask_rolling = (df_clean['started_at'] >= "2024-12-01") & (df_clean['started_at'] < "2025-12-01")
df_rolling = df_clean.loc[mask_rolling].copy()
df_rolling.to_csv("cyclistic_cleaned_rolling_dec2024_nov2025.csv", index=False)

print("Saved cleaned datasets.")


Saved cleaned datasets.
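
One thing to keep in mind when the next notebook loads these files: CSV does not store dtypes, so `started_at` and `ended_at` come back as plain text unless they are parsed on read. The sketch below illustrates this with a tiny made-up frame and an in-memory buffer in place of `cyclistic_cleaned.csv`.

```python
import io
import pandas as pd

# Tiny stand-in for df_clean (made-up rows; the real frame has 15 columns)
demo = pd.DataFrame({
    "ride_id": ["A1", "A2"],
    "started_at": pd.to_datetime(["2024-12-31 01:38:35.018", "2025-11-30 23:59:59.500"]),
    "ended_at": pd.to_datetime(["2024-12-31 01:48:45.775", "2025-12-01 00:10:00.250"]),
})

# Write to an in-memory buffer instead of a file on disk
buf = io.StringIO()
demo.to_csv(buf, index=False)

# Reading without parse_dates leaves the timestamps as text
buf.seek(0)
plain = pd.read_csv(buf)

# Reading with parse_dates restores datetime columns
buf.seek(0)
parsed = pd.read_csv(buf, parse_dates=["started_at", "ended_at"])

print("Plain read is datetime:", pd.api.types.is_datetime64_any_dtype(plain["started_at"]))
print("Parsed read is datetime:", pd.api.types.is_datetime64_any_dtype(parsed["started_at"]))
```

For the saved files, the equivalent call would be `pd.read_csv("cyclistic_cleaned.csv", parse_dates=["started_at", "ended_at"])`.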


# Next Steps
- Use `03_eda_and_visualization.ipynb` to explore seasonality, trip counts, and member vs casual behavior.
- Build visualizations (monthly counts, duration distributions, geospatial maps).
- Prepare features for classification modeling in later notebooks.
