# Data Integrity Check

This notebook checks that the processed trip data is ready for casual-only behavioral analysis. It reports the number of casual rides, the count of missing and unique start station names, and the range of trip start hours.

## 1. Setup & Configuration

Import necessary libraries and define the data directory path.

In [1]:
import pandas as pd
from pathlib import Path

DATA_DIR = Path("../data/processed")

## 2. Integrity Check Logic

The function below loads three columns from `fact_trips.csv`, keeps only casual riders, derives the start hour of each trip, and reports the missing and unique start station names. It prints a warning when more than 25% of casual rides lack a station name.

In [2]:
def run_integrity_check():
    """Print integrity metrics for casual rides in fact_trips.csv."""
    file_path = DATA_DIR / "fact_trips.csv"

    if not file_path.exists():
        print("❌ Error: fact_trips.csv not found.")
        return

    print("Running Data Integrity Check (Casual-Only Analysis Readiness)...")

    # Load only the columns the checks need.
    df = pd.read_csv(file_path, usecols=['start_station_name', 'started_at', 'member_casual'])

    # Keep casual riders only; .copy() lets us add a column without a pandas warning.
    casual_df = df[df['member_casual'] == 'casual'].copy()
    if casual_df.empty:
        # Guard against division by zero in the percentage below.
        print("⚠️ WARNING: No casual rides found. Nothing to validate.")
        return

    # Hour of day (0-23) at which each trip started.
    casual_df['hour'] = pd.to_datetime(casual_df['started_at']).dt.hour

    missing_stations = casual_df['start_station_name'].isna().sum()
    unique_stations = casual_df['start_station_name'].nunique()
    missing_pct = missing_stations / len(casual_df) * 100

    print("-" * 50)
    print(f"Total Casual Rides: {len(casual_df):,}")
    print(f"Missing Station Names: {missing_stations:,} ({missing_pct:.2f}%)")
    print(f"Unique Stations: {unique_stations}")
    print(f"Time Range Validated: {casual_df['hour'].min()}h to {casual_df['hour'].max()}h")

    # Warn when more than 25% of casual rides have no start station name.
    if missing_stations > (0.25 * len(casual_df)):
        print("⚠️ WARNING: High null count in station names. Clean the source data.")
    else:
        print("✅ SUCCESS: Data integrity verified for behavioral modeling.")
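
The 25% decision rule in the function above can also be isolated in a small pure function, which makes the threshold easy to test without loading the CSV. The sketch below is illustrative only: `station_null_status` and its `threshold` parameter are hypothetical names that do not appear in the notebook, and reporting an empty casual subset as its own status is an assumption.

```python
def station_null_status(total_rides: int, missing: int, threshold: float = 0.25) -> str:
    """Classify the share of rides that lack a start station name.

    Hypothetical helper: mirrors the notebook's rule of warning when
    the missing count exceeds `threshold` times the total.
    """
    if total_rides <= 0:
        # Assumption: with no casual rides there is nothing to verify.
        return "empty"
    if missing > threshold * total_rides:
        return "warning"
    return "ok"


# Counts reported by the run in section 3, followed by two made-up edge cases.
print(station_null_status(1_568_655, 0))
print(station_null_status(100, 26))
print(station_null_status(0, 0))
```

Returning a status string rather than printing keeps the rule separate from the reporting, so the same check could be reused with a different threshold.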

## 3. Execution

Execute the integrity check.

In [3]:
if __name__ == "__main__":
    run_integrity_check()

Running Data Integrity Check (Casual-Only Analysis Readiness)...
--------------------------------------------------
Total Casual Rides: 1,568,655
Missing Station Names: 0 (0.00%)
Unique Stations: 1697
Time Range Validated: 0h to 23h
✅ SUCCESS: Data integrity verified for behavioral modeling.
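
A minimum of 0h and a maximum of 23h shows the extremes of the range, not that every hour in between has rides. If full hourly coverage matters for the analysis, a check along the following lines could list any hours with no trips. This is a standard-library sketch rather than the notebook's method: `missing_hours` is a hypothetical helper, the sample timestamps are made up, and the code assumes `started_at` values are ISO-style strings such as `2024-01-01 08:15:00`.

```python
from datetime import datetime


def missing_hours(timestamps):
    """Return the hours of day (0-23) that no timestamp falls in, ascending."""
    # Assumption: every value parses with datetime.fromisoformat.
    seen = {datetime.fromisoformat(ts).hour for ts in timestamps}
    return [h for h in range(24) if h not in seen]


# Made-up sample values for illustration only.
sample = ["2024-01-01 00:05:00", "2024-01-01 08:15:00", "2024-01-01 23:59:59"]
gaps = missing_hours(sample)
print(f"{len(gaps)} hours without trips, starting with {gaps[:3]}")
```

An empty result would mean all 24 hours are represented, which is a stronger statement than the min/max line in the output above.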
