# Statistical Validation: Hourly Ride Distributions

This notebook builds the dataset used for statistical validation. It aggregates casual rides by start hour for two station segments (Anchors vs. Noise) and computes each hour's percentage share of its segment's total rides. The resulting hourly profiles are used to visualize the 'Behavioral Peak' and to validate the commuter signal at selected stations.

## 1. Setup & Configuration

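Import dependencies and set the processed-data directory. The paths for the fact table, station segments, and validation output are resolved from this directory inside the validation function.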

In [None]:
import pandas as pd
from pathlib import Path


DATA_DIR = Path("../data/processed")

## 2. Statistical Ingestion & Transformation

The validation process follows these steps:
1. **Casual Filtering**: Isolates casual rides from the master fact table.
2. **Segment Merging**: Integrates station classification labels (Anchors/Noise).
3. **Hourly Aggregation**: Calculates the total rides per hour per segment.
4. **Share Calculation**: Computes each hour's percentage share of its segment's total rides, so that segments of different sizes can be compared directly on 'peakedness'.
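
Steps 3 and 4 can be illustrated on a small in-memory table before running them on the full fact table. This is a minimal sketch: the `toy` rides are made-up values for illustration only, and the column names mirror the ones used in this notebook.

```python
import pandas as pd

# Hypothetical rides, already labelled with a segment and a start hour.
toy = pd.DataFrame({
    "final_status": ["Confirmed Behavioral Anchor"] * 4 + ["Inconsistent / Noise"] * 4,
    "hour": [8, 8, 8, 17, 8, 12, 17, 20],
})

# Step 3: count rides per segment and hour.
toy_dist = toy.groupby(["final_status", "hour"]).size().reset_index(name="rides")

# Step 4: express each hour as a percentage of its segment's total rides.
segment_totals = toy_dist.groupby("final_status")["rides"].transform("sum")
toy_dist["pct_of_daily_rides"] = toy_dist["rides"] / segment_totals * 100

print(toy_dist)
```

In this toy data the Anchor segment concentrates 75% of its rides in hour 8, while the Noise segment is spread evenly at 25% per hour. That contrast is the kind of 'peakedness' difference the share calculation is meant to expose.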

In [None]:
def run_statistical_validation():
    master_path = DATA_DIR / "fact_trips.csv"
    segment_path = DATA_DIR / "station_behavior_segments.csv"
    output_path = DATA_DIR / "hourly_validation_metrics.csv"

    if not master_path.exists() or not segment_path.exists():
        print("❌ Error: Required datasets missing. Run pipeline and segmentation first.")
        return

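    print("Building Hourly Statistical Validation dataset...")

    # Load only the columns needed and keep casual rides.
    df = pd.read_csv(master_path, usecols=['started_at', 'member_casual', 'start_station_name'])
    df = df[df['member_casual'] == 'casual'].copy()

    # Station-level segment labels. The merge below assumes one row per station;
    # duplicate station names in this file would inflate the ride counts.
    segments = pd.read_csv(segment_path, usecols=['start_station_name', 'final_status'])

    # Derive the start hour (0-23) and attach each ride's segment label.
    df['hour'] = pd.to_datetime(df['started_at']).dt.hour
    merged = df.merge(segments, on="start_station_name", how="inner")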

   
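    # Keep only the two segments being compared.
    filtered = merged[merged["final_status"].isin(["Confirmed Behavioral Anchor", "Inconsistent / Noise"])]

    # Rides per segment and hour; an hour with no rides produces no row.
    hourly_dist = filtered.groupby(["final_status", "hour"]).size().reset_index(name="rides")

    # Each hour's share (%) of its segment's total rides.
    hourly_dist["pct_of_daily_rides"] = (
        hourly_dist.groupby("final_status")["rides"]
        .transform(lambda x: (x / x.sum()) * 100)
    )

    # Persist the metrics for downstream charting.
    hourly_dist.to_csv(output_path, index=False)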
    
    print("-" * 50)
    print(f"✅ SUCCESS: Validation metrics saved to {output_path}")
    print("This file will power your 'Behavioral Peak' Line Chart in Power BI.")
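
A quick sanity check on the result is that each segment's hourly shares add up to 100. The sketch below is one way to test that. `shares_sum_to_100` is a hypothetical helper that is not part of the pipeline, and the `good` frame is made-up data shaped like the output file.

```python
import pandas as pd


def shares_sum_to_100(hourly_dist: pd.DataFrame, tol: float = 1e-6) -> bool:
    """Return True if every segment's hourly shares total 100 (within tol)."""
    totals = hourly_dist.groupby("final_status")["pct_of_daily_rides"].sum()
    return bool(((totals - 100).abs() <= tol).all())


# Hypothetical frame with the same columns as hourly_validation_metrics.csv.
good = pd.DataFrame({
    "final_status": ["Confirmed Behavioral Anchor", "Confirmed Behavioral Anchor",
                     "Inconsistent / Noise", "Inconsistent / Noise"],
    "hour": [8, 17, 12, 20],
    "rides": [6, 4, 5, 5],
    "pct_of_daily_rides": [60.0, 40.0, 50.0, 50.0],
})

print(shares_sum_to_100(good))
```

The tolerance allows for floating-point rounding. A failure would suggest that rows were dropped or altered after the shares were computed.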

## 3. Execution

Execute the statistical validation pipeline.

In [None]:
if __name__ == "__main__":
    run_statistical_validation()

Building Hourly Statistical Validation dataset...
--------------------------------------------------
✅ SUCCESS: Validation metrics saved to ..\data\processed\hourly_validation_metrics.csv
This file will power your 'Behavioral Peak' Line Chart in Power BI.
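
Before charting, the peak hour of each segment can be read directly from the metrics table. The following is a minimal sketch: `peak_hour_by_segment` is a hypothetical helper, and `sample` contains made-up values rather than real results.

```python
import pandas as pd


def peak_hour_by_segment(hourly_dist: pd.DataFrame) -> pd.DataFrame:
    """Return the row with the highest hourly share for each segment."""
    # idxmax gives the index label of the largest share within each segment.
    idx = hourly_dist.groupby("final_status")["pct_of_daily_rides"].idxmax()
    cols = ["final_status", "hour", "pct_of_daily_rides"]
    return hourly_dist.loc[idx, cols].reset_index(drop=True)


# Hypothetical values shaped like hourly_validation_metrics.csv.
sample = pd.DataFrame({
    "final_status": ["Confirmed Behavioral Anchor", "Confirmed Behavioral Anchor",
                     "Inconsistent / Noise", "Inconsistent / Noise"],
    "hour": [8, 17, 12, 20],
    "rides": [6, 4, 11, 9],
    "pct_of_daily_rides": [60.0, 40.0, 55.0, 45.0],
})

peaks = peak_hour_by_segment(sample)
print(peaks)
```

With the real output, the same helper could be applied to `pd.read_csv(DATA_DIR / "hourly_validation_metrics.csv")`. When two hours tie for the maximum, `idxmax` returns the first one it encounters.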
