# ✅ Data Integrity Check

Before we dive into behavioral modeling, we need to make sure our foundation is solid. This notebook validates the processed trip data, focusing on the casual-rider segment. We check for missing start-station names, count the distinct stations, and confirm that the start hours cover the range we expect.

### 1. Preparation
We start by loading our essential libraries and defining the path to our processed data.

In [1]:
import pandas as pd
from pathlib import Path

In [5]:
DATA_DIR = Path("../data/processed")

### 2. Loading the Dataset
We load the `fact_trips.csv` file, reading only the columns we need: start-station name, start timestamp, and rider type. To keep it accurate, we compute the `hour` feature on the fly from the `started_at` timestamps, then keep only the casual rides.

In [6]:
file_path = DATA_DIR / "fact_trips.csv"

if not file_path.exists():
    raise FileNotFoundError("\u274c fact_trips.csv not found. Run pipeline first.")

print("Running Data Integrity Check (Casual-Only Analysis Readiness)...")

df = pd.read_csv(file_path, usecols=['start_station_name', 'started_at', 'member_casual'])
df['hour'] = pd.to_datetime(df['started_at']).dt.hour
casual_df = df[df['member_casual'] == 'casual']

Running Data Integrity Check (Casual-Only Analysis Readiness)...
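
One thing the load step does not surface is whether any `started_at` values fail to parse. The snippet below is a minimal sketch of how that could be counted, assuming the same column names as above. The helper name `add_start_hour` and the toy `sample` frame are hypothetical, and `errors="coerce"` is a swapped-in choice, not what the cell above does.

```python
import pandas as pd


def add_start_hour(trips: pd.DataFrame) -> tuple:
    """Return (copy of trips with an 'hour' column, count of unusable timestamps).

    Hypothetical helper for illustration; not part of the original notebook.
    """
    out = trips.copy()
    # errors="coerce" turns unparseable strings into NaT instead of raising
    parsed = pd.to_datetime(out["started_at"], errors="coerce")
    # NaT covers both values that were missing and values that failed to parse
    n_bad = int(parsed.isna().sum())
    out["hour"] = parsed.dt.hour
    return out, n_bad


# Toy data standing in for fact_trips.csv (made up for this example)
sample = pd.DataFrame({
    "start_station_name": ["A", "B", "C"],
    "started_at": ["2024-01-01 08:15:00", "not a date", "2024-01-01 23:59:59"],
    "member_casual": ["casual", "member", "casual"],
})

with_hour, n_bad = add_start_hour(sample)
print(f"Unusable timestamps: {n_bad}")
```

Rows with an unusable timestamp end up with a missing `hour`, so they can be inspected or dropped before the validation step.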


### 3. Running Validation Checks
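Now we run our health checks. We report the total volume of casual rides, the number (and share) of missing station names, the count of distinct stations, and the earliest and latest start hour, which should span 0h to 23h. The warning fires only if more than 25% of station names are missing.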

In [7]:
missing_stations = casual_df['start_station_name'].isna().sum()
unique_stations = casual_df['start_station_name'].nunique()

print("-" * 50)
print(f"Total Casual Rides: {len(casual_df):,}")
print(f"Missing Station Names: {missing_stations:,} ({(missing_stations/len(casual_df))*100:.2f}%)")
print(f"Unique Stations: {unique_stations}")
print(f"Time Range Validated: {casual_df['hour'].min()}h to {casual_df['hour'].max()}h")

if missing_stations > (0.25 * len(casual_df)):
    print("\u26a0\ufe0f WARNING: High null count in station names. Clean the source data.")
else:
    print("\u2705 SUCCESS: Data integrity verified for behavioral modeling.")

--------------------------------------------------
Total Casual Rides: 1,568,655
Missing Station Names: 0 (0.00%)
Unique Stations: 1697
Time Range Validated: 0h to 23h
✅ SUCCESS: Data integrity verified for behavioral modeling.
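
A minimum of 0h and a maximum of 23h confirm the extremes, but not that every hour in between has at least one ride. As a hedged sketch of a stricter check, the function below collects the same metrics into a dict, lists any hours with no rides, and guards against an empty casual segment. The name `summarize_casual`, the `max_missing_ratio` parameter, and the `demo` frame are assumptions made for illustration; the 25% default mirrors the threshold used above.

```python
import pandas as pd


def summarize_casual(casual: pd.DataFrame, max_missing_ratio: float = 0.25) -> dict:
    """Summarize integrity metrics for a casual-rider frame.

    Hypothetical helper; expects 'start_station_name' and 'hour' columns.
    """
    total = len(casual)
    missing = int(casual["start_station_name"].isna().sum())
    # Avoid dividing by zero when the segment is empty
    missing_ratio = missing / total if total else 0.0
    # Hours 0..23 that never appear in the data
    hours_seen = set(casual["hour"].dropna().astype(int))
    absent_hours = sorted(set(range(24)) - hours_seen)
    return {
        "total_rides": total,
        "missing_stations": missing,
        "missing_ratio": missing_ratio,
        "unique_stations": int(casual["start_station_name"].nunique()),
        "absent_hours": absent_hours,
        "passed": total > 0 and missing_ratio <= max_missing_ratio and not absent_hours,
    }


# Toy frame (made up): one missing station name, only three distinct hours
demo = pd.DataFrame({
    "start_station_name": ["A", None, "B", "A"],
    "hour": [0, 8, 8, 23],
})
report = summarize_casual(demo)
print(f"Hours with no rides: {len(report['absent_hours'])}")
print(f"Passed: {report['passed']}")
```

Returning a dict instead of printing makes the result easy to assert on in a pipeline or log alongside other runs.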
