# Cyclistic Bike-Share Analysis  
## Data Cleaning & Feature Engineering

This notebook cleans the raw trip data and derives the features needed to analyse rider behaviour.

In [1]:
import pandas as pd
import numpy as np
from pathlib import Path

In [2]:
DATA_PATH = Path("../data/raw")

files = sorted(DATA_PATH.glob("*.csv"))

df = pd.concat(
    [pd.read_csv(file) for file in files],
    ignore_index=True
)

df.shape

(5552994, 13)

In [3]:
# Parse timestamps; values that cannot be parsed become NaT instead of raising
df["started_at"] = pd.to_datetime(df["started_at"], errors="coerce")
df["ended_at"] = pd.to_datetime(df["ended_at"], errors="coerce")

df[["started_at", "ended_at"]].dtypes

started_at    datetime64[ns]
ended_at      datetime64[ns]
dtype: object

In [4]:
df["ride_duration_minutes"] = (
    (df["ended_at"] - df["started_at"])
    .dt.total_seconds()
    / 60
)

df["ride_duration_minutes"].describe()

count    5.552994e+06
mean     1.602772e+01
std      5.511650e+01
min     -5.479480e+01
25%      5.394933e+00
50%      9.426058e+00
75%      1.656325e+01
max      1.574900e+03
Name: ride_duration_minutes, dtype: float64

## Ride Duration Feature

- Ride duration was calculated as the difference between end and start timestamps.
- Duration is expressed in minutes for easier interpretation.
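- Negative or zero durations indicate invalid or erroneous records and will be removed.
- The summary above shows a minimum of about -54.8 minutes and a maximum of about 1,574.9 minutes (roughly 26 hours), so both ends of the distribution need validation.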
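
As a small illustration of the calculation described above, the sketch below applies the same subtraction to three made-up trips. The `sample` frame and its timestamps are hypothetical and are not rows from the Cyclistic data. The second trip ends before it starts, so its duration comes out negative, and the third has identical start and end timestamps, giving zero.

```python
import pandas as pd

# Hypothetical trips for illustration only; not rows from the real dataset.
sample = pd.DataFrame({
    "started_at": pd.to_datetime([
        "2021-06-01 08:00:00",
        "2021-06-01 09:30:00",
        "2021-06-01 10:00:00",
    ]),
    "ended_at": pd.to_datetime([
        "2021-06-01 08:12:30",   # normal ride
        "2021-06-01 09:25:00",   # ends before it starts
        "2021-06-01 10:00:00",   # same start and end
    ]),
})

# Same formula as the notebook: timedelta -> seconds -> minutes
sample["ride_duration_minutes"] = (
    (sample["ended_at"] - sample["started_at"]).dt.total_seconds() / 60
)

durations = sample["ride_duration_minutes"].tolist()
print(durations)
```

Only the first of these three trips would pass a "duration greater than zero" check.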

In [5]:
initial_rows = df.shape[0]

# Keep rides with a positive duration of at most 24 hours (1,440 minutes)
df_clean = df[
    (df["ride_duration_minutes"] > 0) &
    (df["ride_duration_minutes"] <= 1440)
].copy()

final_rows = df_clean.shape[0]

initial_rows, final_rows

(5552994, 5547380)

In [6]:
removed_pct = (initial_rows - final_rows) / initial_rows * 100
round(removed_pct, 2)

0.1

- Approximately 0.1% of rides (5,614 of 5,552,994) were removed during duration-based validation.

## Data Cleaning: Ride Duration Validation

To ensure meaningful analysis, the following records were removed:

- Rides with zero or negative duration
- Rides longer than 24 hours (1,440 minutes)

These records likely represent data errors or operational anomalies rather than typical customer behaviour. Because they account for only about 0.1% of rides, removing them improves data quality with little effect on sample size.
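
To see how much each of the two rules above contributes, a per-rule count can help. The following is a minimal sketch, not part of the original notebook: the helper `duration_removal_breakdown` and the `toy` series are hypothetical names introduced here. It also counts missing durations, since a `NaN` duration fails both comparisons in the filter and is therefore dropped implicitly.

```python
import pandas as pd

MAX_MINUTES = 24 * 60  # 1,440 minutes, the upper bound used in this notebook


def duration_removal_breakdown(durations: pd.Series,
                               max_minutes: float = MAX_MINUTES) -> dict:
    """Count how many rows each duration rule would remove (hypothetical helper)."""
    missing = durations.isna()            # NaN fails every comparison below
    non_positive = durations <= 0         # zero or negative duration
    too_long = durations > max_minutes    # longer than 24 hours
    kept = (durations > 0) & (durations <= max_minutes)
    return {
        "missing": int(missing.sum()),
        "zero_or_negative": int(non_positive.sum()),
        "over_24_hours": int(too_long.sum()),
        "kept": int(kept.sum()),
    }


# Toy values for illustration only; not taken from the real data.
toy = pd.Series([12.5, -5.0, 0.0, 1500.0, float("nan"), 1440.0])
breakdown = duration_removal_breakdown(toy)
print(breakdown)
```

On the real data, the same helper could be called as `duration_removal_breakdown(df["ride_duration_minutes"])`, assuming `df` is the frame built earlier in the notebook.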

In [7]:
df_clean["ride_duration_minutes"].describe()

count    5.547380e+06
mean     1.453468e+01
std      2.869838e+01
min      7.666667e-04
25%      5.391483e+00
50%      9.416533e+00
75%      1.652962e+01
max      1.439976e+03
Name: ride_duration_minutes, dtype: float64