# NYC Rideshare Forecasting Pipeline - Part 2: Data Validation

**Author:** K Flowers  
**GitHub:** [github.com/KRFlowers](https://github.com/KRFlowers)  
**Date:** December 2025

This notebook validates the 684 million NYC For-Hire Vehicle (Uber, Lyft) trip records acquired during the download phase. Thresholds are applied to the duration, distance, and fare fields. Invalid records are flagged, and a clean dataset containing only valid records is produced for downstream analysis. **Note:** Outlier detection is performed in the next notebook (02_exploratory_analysis.ipynb).

**Pipeline Position:** Notebook 2 of 4 — Data Validation

- 00_data_download.ipynb
- 01_data_validation.ipynb ← **this notebook**
- 02_exploratory_analysis.ipynb
- 03_demand_forecasting.ipynb

**Objective:** Validate raw trip records against thresholds for duration, distance, and fare. Create a clean dataset containing only valid data. 

**Technical Approach:**
- Use DuckDB for memory-efficient processing of the 18GB dataset
- Set validation thresholds for duration, distance, and fare fields
- Flag invalid records and save to a new dataset to preserve the original full dataset

**Inputs:**
- `data/raw/combined_fhvhv_tripdata.parquet` — Combined trip data (18GB)

**Outputs:**
- `data/validated/fhvhv_all_data_flagged.parquet` — All 684M records with validation flags
- `data/validated/fhvhv_valid_data_for_eda.parquet` — Valid trips only (683.8M rows)
- `data/quality_reports/validation_report.csv` — Validation metrics by rule

**Runtime:** ~30 minutes (flagging ~20 min, validation counts ~10 min)

## 1. Configure Environment

### 1.1 Import Libraries

In [None]:
# Standard library
from datetime import datetime
from pathlib import Path
import warnings

# Core data libraries
import duckdb
import pandas as pd

print("Libraries imported successfully")

Libraries imported successfully


### 1.2 Set Display Options

In [None]:
# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', '{:.4f}'.format)

### 1.3 Set Paths and Constants

In [None]:
# Project Constants
PROJECT_YEARS = [2022, 2023, 2024]
TLC_DATASET = 'fhvhv'

# Paths
PROJECT_ROOT = Path("..").resolve()
RAW_DIR = PROJECT_ROOT / "data" / "raw"
VALIDATED_DIR = PROJECT_ROOT / "data" / "validated"
REPORTS_DIR = PROJECT_ROOT / "data" / "quality_reports"

# Input/Output Files
INPUT_FILE = RAW_DIR / f"combined_{TLC_DATASET}_tripdata.parquet"
FLAGGED_FILE = VALIDATED_DIR / f"{TLC_DATASET}_all_data_flagged.parquet"
EDA_FILE = VALIDATED_DIR / f"{TLC_DATASET}_valid_data_for_eda.parquet"

# Create Directories
VALIDATED_DIR.mkdir(parents=True, exist_ok=True)
REPORTS_DIR.mkdir(parents=True, exist_ok=True)

print(f"✓ Config loaded: {INPUT_FILE.name}")

✓ Config loaded: combined_fhvhv_tripdata.parquet


### 1.4 Create Database Connection
Initialize DuckDB connection and configure performance settings. Database connection is required to validate the combined trip data file.

In [None]:
# Initialize DuckDB connection with optimized settings
con = duckdb.connect()
con.execute("SET threads=4")
con.execute("SET preserve_insertion_order=false")
#con.execute("SET enable_progress_bar = true")
#con.execute("SET progress_bar_time = 2000")
print("✓ DuckDB connection established")

✓ DuckDB connection established


## 2. Review Data
Reviews the dataset structure, date range coverage, and missing values.

### 2.1 Review Columns

In [None]:
# Display total record count
total_records = con.execute(f"SELECT COUNT(*) FROM '{INPUT_FILE}'").fetchone()[0]
print(f"Total records: {total_records:,}\n")

# Review column names and data types
con.execute(f"DESCRIBE SELECT * FROM '{INPUT_FILE}'").df()[['column_name', 'column_type']]

Total records: 684,376,551



| column_name | column_type |
|---|---|
| hvfhs_license_num | VARCHAR |
| dispatching_base_num | VARCHAR |
| originating_base_num | VARCHAR |
| request_datetime | TIMESTAMP |
| on_scene_datetime | TIMESTAMP |
| pickup_datetime | TIMESTAMP |
| dropoff_datetime | TIMESTAMP |
| PULocationID | BIGINT |
| DOLocationID | BIGINT |
| trip_miles | DOUBLE |


### 2.2 Check Date Range

In [5]:
# Check date range of pickup_datetime to verify coverage period
date_range = con.execute(f"""
    SELECT 
        MIN(pickup_datetime) as earliest,
        MAX(pickup_datetime) as latest
    FROM '{INPUT_FILE}'
""").df()

print(f"Date range: {date_range['earliest'].iloc[0]} to {date_range['latest'].iloc[0]}")

Date range: 2022-01-01 00:00:00 to 2024-12-31 23:59:59


### 2.3 Check Missing Values
Identify null values by column. High-null columns will be excluded during aggregation in EDA, not removed here.

In [6]:
# Get list of column names
columns = con.execute(f"DESCRIBE SELECT * FROM '{INPUT_FILE}'").df()['column_name'].tolist()

# Build SQL that counts NULLs for every column in a single scan using CASE WHEN
null_count_sql = f"""
    SELECT 
        {', '.join([f"SUM(CASE WHEN {col} IS NULL THEN 1 ELSE 0 END) AS {col}_null_count" for col in columns])}
    FROM '{INPUT_FILE}'
"""

# Execute query and get results as tuple
null_counts = con.execute(null_count_sql).fetchone()

# Calculate null percentages
null_pct = [(count / total_records) * 100 for count in null_counts]

# Display results 
print(f"{'Column':<25} {'Null Count':<12} {'Null Percentage'}")
print("-" * 50)
for col, count, pct in zip(columns, null_counts, null_pct):
    print(f"{col:<25} {count:<12} {pct:>15.2f}%")

Column                    Null Count   Null Percentage
--------------------------------------------------
hvfhs_license_num         0                       0.00%
dispatching_base_num      0                       0.00%
originating_base_num      183954837              26.88%
request_datetime          0                       0.00%
on_scene_datetime         183891654              26.87%
pickup_datetime           0                       0.00%
dropoff_datetime          0                       0.00%
PULocationID              0                       0.00%
DOLocationID              0                       0.00%
trip_miles                0                       0.00%
trip_time                 0                       0.00%
base_passenger_fare       0                       0.00%
tolls                     0                       0.00%
bcf                       0                       0.00%
sales_tax                 0                       0.00%
congestion_surcharge      0                       0.00%

**Note:** Core fields that will be used for analysis have zero nulls:
- `pickup_datetime`
- `PULocationID`
- `trip_time`
- `trip_miles`
- `base_passenger_fare`

The two high-null columns (`originating_base_num`, `on_scene_datetime`) are not used in this analysis.
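
The null check above builds one aggregate expression per column so that a single scan of the file covers every column. As a rough illustration of what the generated SQL looks like, the sketch below assembles the same kind of query for three of the columns listed above. The helper name `build_null_count_sql` and the placeholder file name `trips.parquet` are assumptions for illustration and are not part of the notebook.

```python
def build_null_count_sql(columns, source):
    """Return one SELECT that counts NULLs for every column in a single scan.

    Hypothetical helper: mirrors the list-comprehension approach used in 2.3.
    """
    expressions = [
        f"SUM(CASE WHEN {col} IS NULL THEN 1 ELSE 0 END) AS {col}_null_count"
        for col in columns
    ]
    select_list = ",\n    ".join(expressions)
    return f"SELECT\n    {select_list}\nFROM '{source}'"


# Three columns taken from the schema reviewed earlier; the path is a placeholder.
sample_columns = ["pickup_datetime", "originating_base_num", "on_scene_datetime"]
null_sql = build_null_count_sql(sample_columns, "trips.parquet")
print(null_sql)
```

Because every column becomes one `SUM(CASE WHEN ...)` term in the same `SELECT`, the query reads the file once regardless of how many columns are checked.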

## 3. Validate Data
Flags all records based on duration, distance, and fare thresholds. Creates two output datasets: one with all 684M records and quality flags, and one with only valid records for analysis.

### 3.1 Set Validation Thresholds

In [9]:
# Set validation thresholds for key fields that impact analysis

DURATION_MIN = 60          # 1 min - filters GPS errors, keeps short trips
DURATION_MAX = 43200       # 12 hrs - covers NYC to Philadelphia
DURATION_EXTREME = 604800  # 7 days - obvious corruption
DISTANCE_MIN = 0.1         # Filters GPS noise
DISTANCE_MAX = 200         # NYC-Philadelphia service area
FARE_MIN = 0               # No negative fares ($0 allowed for promos)
FARE_MAX = 500             # 99.9th percentile ~$150, allows surge

### 3.2 Flag Records Against Thresholds 

In [None]:
print("Creating flagged dataset...")
print("This will take approximately 10 minutes for 684M records\n")

# Validate key fields, create flags and save dataset with flags
con.execute(f"""
    COPY (
        SELECT 
            *,
            
            -- DURATION FLAGS - Checking null, zero/negative, min, max, and extreme
            (trip_time IS NULL) AS flag_duration_null,
            (trip_time <= 0) AS flag_duration_zero_negative,
            (trip_time < {DURATION_MIN}) AS flag_duration_too_short,
            (trip_time > {DURATION_MAX}) AS flag_duration_exceeds_max,
            (trip_time > {DURATION_EXTREME}) AS flag_duration_extreme,
            
            -- DISTANCE FLAGS - Checking null, negative, min, and max 
            (trip_miles IS NULL) AS flag_distance_null,
            (trip_miles < 0) AS flag_distance_negative,
            (trip_miles < {DISTANCE_MIN}) AS flag_distance_too_short,
            (trip_miles > {DISTANCE_MAX}) AS flag_distance_exceeds_max,
            
            -- FARE FLAGS - Checking null, negative, zero, and extreme high 
            (base_passenger_fare IS NULL) AS flag_fare_null,
            (base_passenger_fare < {FARE_MIN}) AS flag_fare_negative,
            (base_passenger_fare = 0) AS flag_fare_zero,
            (base_passenger_fare > {FARE_MAX}) AS flag_fare_extreme_high,
            
            -- MASTER VALIDITY FLAG - Record is valid if all checks pass (zero fare is allowed)
            (
                trip_time IS NOT NULL AND
                trip_time >= {DURATION_MIN} AND 
                trip_time <= {DURATION_MAX} AND
                trip_miles IS NOT NULL AND
                trip_miles >= {DISTANCE_MIN} AND
                trip_miles <= {DISTANCE_MAX} AND
                base_passenger_fare IS NOT NULL AND
                base_passenger_fare >= {FARE_MIN} AND
                base_passenger_fare <= {FARE_MAX}
            ) AS is_valid
            
        FROM '{INPUT_FILE}'
    ) TO '{FLAGGED_FILE}' (FORMAT PARQUET)
""")

# Verify the flagged dataset was created successfully
flagged_count = con.execute(f"SELECT COUNT(*) FROM '{FLAGGED_FILE}'").fetchone()[0]
print(f"Flagged dataset created: {flagged_count:,} records")

In [None]:
# Review a sample of flagged records to verify validation logic
con.execute(f"""
    SELECT 
        trip_time,
        trip_miles,
        base_passenger_fare,
        flag_duration_too_short,
        flag_distance_too_short,
        flag_fare_negative,
        is_valid
    FROM '{FLAGGED_FILE}'
    WHERE is_valid = FALSE
    LIMIT 20
""").df()

| trip_time | trip_miles | base_passenger_fare | flag_duration_too_short | flag_distance_too_short | flag_fare_negative | is_valid |
|---|---|---|---|---|---|---|
| 743 | 4.22 | -2.07 | False | False | True | False |
| 514 | 1.22 | -0.94 | False | False | True | False |
| 424 | 1.67 | -0.74 | False | False | True | False |
| 2500 | 27.64 | -29.72 | False | False | True | False |
| 2341 | 13.82 | -5.58 | False | False | True | False |
| 1590 | 7.34 | -6.15 | False | False | True | False |
| 3950 | 21.5 | -29.77 | False | False | True | False |
| 977 | 4.98 | -2.75 | False | False | True | False |
| 2705 | 20.29 | -23.72 | False | False | True | False |
| 1524 | 13.09 | -11.35 | False | False | True | False |
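
Every row in this sample fails only the fare check. To make the combined rule easier to reason about, here is a minimal pure-Python sketch of the same logic as the SQL `is_valid` expression, using the thresholds from section 3.1. The function name `is_valid_trip` is hypothetical, and the last two test rows are made up for illustration. The DuckDB query remains the actual implementation.

```python
# Thresholds copied from section 3.1
DURATION_MIN, DURATION_MAX = 60, 43200      # seconds
DISTANCE_MIN, DISTANCE_MAX = 0.1, 200       # miles
FARE_MIN, FARE_MAX = 0, 500                 # dollars


def is_valid_trip(trip_time, trip_miles, base_passenger_fare):
    """Hypothetical pure-Python mirror of the SQL is_valid expression."""
    if trip_time is None or trip_miles is None or base_passenger_fare is None:
        return False
    return (
        DURATION_MIN <= trip_time <= DURATION_MAX
        and DISTANCE_MIN <= trip_miles <= DISTANCE_MAX
        and FARE_MIN <= base_passenger_fare <= FARE_MAX
    )


# First two rows from the sample above: both fail only on the fare check
print(is_valid_trip(743, 4.22, -2.07))   # False
print(is_valid_trip(514, 1.22, -0.94))   # False

# Illustrative (made-up) rows, not taken from the dataset
print(is_valid_trip(743, 4.22, 0.0))     # True: zero fare is allowed
print(is_valid_trip(45, 4.22, 12.50))    # False: under the 60-second minimum
```

Note that the bounds are inclusive, so a trip of exactly 60 seconds or exactly 200 miles counts as valid, matching the `>=` and `<=` comparisons in the query.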


### 3.3 Count Flagged Records
Counts invalid records overall and then by pickup zone, and verifies that no zone exceeds the 1% invalid-rate threshold.

In [None]:
# Count all validation flags in one query
validation_stats = con.execute(f"""
    SELECT 
        COUNT(*) as total,
        
        -- Duration issues
        SUM(CAST(flag_duration_null AS INTEGER)) as dur_null,
        SUM(CAST(flag_duration_zero_negative AS INTEGER)) as dur_zero_neg,
        SUM(CAST(flag_duration_too_short AS INTEGER)) as dur_too_short,
        SUM(CAST(flag_duration_exceeds_max AS INTEGER)) as dur_exceeds_max,
        SUM(CAST(flag_duration_extreme AS INTEGER)) as dur_extreme,
        
        -- Distance issues
        SUM(CAST(flag_distance_null AS INTEGER)) as dist_null,
        SUM(CAST(flag_distance_negative AS INTEGER)) as dist_negative,
        SUM(CAST(flag_distance_too_short AS INTEGER)) as dist_too_short,
        SUM(CAST(flag_distance_exceeds_max AS INTEGER)) as dist_exceeds_max,
        
        -- Fare issues
        SUM(CAST(flag_fare_null AS INTEGER)) as fare_null,
        SUM(CAST(flag_fare_negative AS INTEGER)) as fare_negative,
        SUM(CAST(flag_fare_zero AS INTEGER)) as fare_zero,
        SUM(CAST(flag_fare_extreme_high AS INTEGER)) as fare_extreme_high,
        
        -- Overall validity
        SUM(CAST(is_valid AS INTEGER)) as valid,
        SUM(CAST(NOT is_valid AS INTEGER)) as invalid
        
    FROM '{FLAGGED_FILE}'
""").fetchone()

# Unpack results
(total, 
 dur_null, dur_zero_neg, dur_too_short, dur_exceeds_max, dur_extreme,
 dist_null, dist_negative, dist_too_short, dist_exceeds_max,
 fare_null, fare_negative, fare_zero, fare_extreme_high,
 valid, invalid) = validation_stats

print(f"✓ Validation counts complete: {total:,} records")

✓ Validation counts complete: 684,376,551 records


In [17]:
# Create a list of dictionaries holding the validation results
validation_results = [
    {'Field': 'Duration', 'Rule': 'Null', 'Count': dur_null, 'Pct': dur_null/total*100},
    {'Field': 'Duration', 'Rule': 'Zero/Negative', 'Count': dur_zero_neg, 'Pct': dur_zero_neg/total*100},
    {'Field': 'Duration', 'Rule': f'Too Short (<{DURATION_MIN}s)', 'Count': dur_too_short, 'Pct': dur_too_short/total*100},
    {'Field': 'Duration', 'Rule': f'Exceeds Max (>{DURATION_MAX/3600:.0f}hr)', 'Count': dur_exceeds_max, 'Pct': dur_exceeds_max/total*100},
    {'Field': 'Duration', 'Rule': f'Extreme (>{DURATION_EXTREME/86400:.0f}d)', 'Count': dur_extreme, 'Pct': dur_extreme/total*100},
    {'Field': 'Distance', 'Rule': 'Null', 'Count': dist_null, 'Pct': dist_null/total*100},
    {'Field': 'Distance', 'Rule': 'Negative', 'Count': dist_negative, 'Pct': dist_negative/total*100},
    {'Field': 'Distance', 'Rule': f'Too Short (<{DISTANCE_MIN}mi)', 'Count': dist_too_short, 'Pct': dist_too_short/total*100},
    {'Field': 'Distance', 'Rule': f'Exceeds Max (>{DISTANCE_MAX}mi)', 'Count': dist_exceeds_max, 'Pct': dist_exceeds_max/total*100},
    {'Field': 'Fare', 'Rule': 'Null', 'Count': fare_null, 'Pct': fare_null/total*100},
    {'Field': 'Fare', 'Rule': 'Negative', 'Count': fare_negative, 'Pct': fare_negative/total*100},
    {'Field': 'Fare', 'Rule': 'Zero', 'Count': fare_zero, 'Pct': fare_zero/total*100},
    {'Field': 'Fare', 'Rule': f'Extreme High (>${FARE_MAX})', 'Count': fare_extreme_high, 'Pct': fare_extreme_high/total*100},
]

# Convert the list of dictionaries to a DataFrame
validation_summary = pd.DataFrame(validation_results)

# Format columns for display
validation_summary['Count'] = validation_summary['Count'].apply(lambda x: f"{x:,}")
validation_summary['Pct'] = validation_summary['Pct'].apply(lambda x: f"{x:.3f}%")

# Display results
print(f"Total records: {total:,}")
print(f"Valid: {valid:,} ({valid/total*100:.2f}%) | Invalid: {invalid:,} ({invalid/total*100:.2f}%)\n")
validation_summary

Total records: 684,376,551
Valid: 683,780,462 (99.91%) | Invalid: 596,089 (0.09%)



| Field | Rule | Count | Pct |
|---|---|---|---|
| Duration | Null | 0 | 0.000% |
| Duration | Zero/Negative | 65 | 0.000% |
| Duration | Too Short (<60s) | 60,163 | 0.009% |
| Duration | Exceeds Max (>12hr) | 57 | 0.000% |
| Duration | Extreme (>7d) | 0 | 0.000% |
| Distance | Null | 0 | 0.000% |
| Distance | Negative | 0 | 0.000% |
| Distance | Too Short (<0.1mi) | 251,660 | 0.037% |
| Distance | Exceeds Max (>200mi) | 4,411 | 0.001% |
| Fare | Null | 0 | 0.000% |


**Note:** 99.91% of records passed validation (683.8M of 684.4M). Only 596K records (0.09%) were excluded, so the thresholds removed clearly erroneous trips without discarding a meaningful share of the data.
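
The summary figures can be cross-checked with a few lines of arithmetic. The sketch below uses only counts printed in the output above and confirms that the valid and invalid counts partition the full dataset; the variable names are chosen for illustration.

```python
total = 684_376_551
valid = 683_780_462
invalid = 596_089

# Valid and invalid counts should partition the full dataset
assert valid + invalid == total

print(f"Valid:   {valid / total * 100:.2f}%")
print(f"Invalid: {invalid / total * 100:.2f}%")

# Per-rule percentages, using counts from the summary table
rule_counts = {
    "Duration too short (<60s)": 60_163,
    "Distance too short (<0.1mi)": 251_660,
    "Distance exceeds max (>200mi)": 4_411,
}
for rule, count in rule_counts.items():
    print(f"{rule}: {count / total * 100:.3f}%")
```

A record can trip more than one flag, so the per-rule counts are not expected to sum to the invalid total.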

In [19]:
# Check invalid rate by zone to verify no bias by location
zone_validity = con.execute(f"""
    SELECT 
        PULocationID as zone_id,
        COUNT(*) as total,
        SUM(CASE WHEN is_valid = FALSE THEN 1 ELSE 0 END) as invalid,
        ROUND(SUM(CASE WHEN is_valid = FALSE THEN 1 ELSE 0 END) * 100.0 / COUNT(*), 2) as invalid_pct
    FROM '{FLAGGED_FILE}'
    GROUP BY PULocationID
    HAVING invalid_pct > 1.0
    ORDER BY invalid_pct DESC
""").df()

if len(zone_validity) == 0:
    print("✓ No zones exceed 1% invalid rate")
else:
    print(f"⚠ {len(zone_validity)} zones exceed 1% invalid rate:")
    print(zone_validity)

⚠ 1 zones exceed 1% invalid rate:
   zone_id  total  invalid  invalid_pct
0        1     67     15.0        22.39


**Note:** Zone 1 exceeds 1% invalid rate (22%) but has only 67 total trips and is not in the top 100 zones used for analysis. All high-volume zones are within threshold.
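
Low-volume zones can produce a high invalid rate from a handful of trips, as Zone 1 shows. One possible way to separate those cases from high-volume zones is a minimum-trip guard. The sketch below is pure Python; the `flag_zones` function, the `min_trips` parameter, and zone 999 are all hypothetical and are not part of the notebook's query.

```python
def flag_zones(zone_counts, max_invalid_pct=1.0, min_trips=0):
    """Return (zone_id, invalid_pct) for zones above the invalid-rate threshold.

    zone_counts maps zone_id -> (total_trips, invalid_trips).
    min_trips is a hypothetical volume guard, not used in the notebook query.
    """
    flagged = []
    for zone_id, (total, invalid) in zone_counts.items():
        if total < min_trips:
            continue  # skip zones too small to give a stable rate
        pct = round(invalid * 100.0 / total, 2)
        if pct > max_invalid_pct:
            flagged.append((zone_id, pct))
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Zone 1 uses the counts reported above; zone 999 is a made-up high-volume zone.
zone_counts = {1: (67, 15), 999: (1_000_000, 900)}

print(flag_zones(zone_counts))                   # [(1, 22.39)]
print(flag_zones(zone_counts, min_trips=1_000))  # []
```

With no guard, Zone 1 is reported at 22.39%, matching the output above. With a guard of 1,000 trips, it drops out and nothing is flagged.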

### 3.4 Save Valid Records Dataset
Writes the valid records to a separate Parquet file without the flag columns (~10 min).

In [18]:
# Create EDA dataset with only valid records, excluding flag columns
con.execute(f"""
    COPY (
        SELECT * EXCLUDE (
            flag_duration_null, 
            flag_duration_zero_negative, 
            flag_duration_too_short, 
            flag_duration_exceeds_max, 
            flag_duration_extreme,
            flag_distance_null, 
            flag_distance_negative, 
            flag_distance_too_short, 
            flag_distance_exceeds_max,
            flag_fare_null, 
            flag_fare_negative, 
            flag_fare_zero, 
            flag_fare_extreme_high,
            is_valid
        )
        FROM '{FLAGGED_FILE}'
        WHERE is_valid = true
    ) TO '{EDA_FILE}' (FORMAT PARQUET)
""")
print(f"EDA dataset: {EDA_FILE.name} ({valid:,} records)")

EDA dataset: fhvhv_valid_data_for_eda.parquet (683,780,462 records)


### 3.5 Save Validation Report
Saves validation summary dataframe to CSV for documentation.

In [None]:
# Save validation report
validation_summary.to_csv(REPORTS_DIR / "validation_report.csv", index=False)
print("✓ Saved: validation_report.csv")

✓ Saved: validation_report.csv


## 4. Close Connection
Close database connections and release resources to complete the validation pipeline.

In [21]:
# Close DuckDB connection
con.close()
print("Pipeline complete")

Pipeline complete


## Conclusion

This notebook validated 684 million trip records against thresholds for duration, distance, and fare fields. 99.91% of records passed data quality tests (683.8M valid records).

**Key Findings:**
- High data quality: only 0.09% of records failed validation (596K invalid out of 684M total)
- Duration issues were rare: 60K trips under 60 seconds (0.009%)
- Distance issues were rare: 252K trips under 0.1 miles (0.037%), 4K over 200 miles (0.001%)
- Fare anomalies were rare: 320K negative fares (0.047%), possibly refunds or cancellations
- Zero-fare trips were retained (they may be valid for forecasting)
- Zone check: no high-volume zone exceeds a 1% invalid rate (Zone 1 is at 22% but has only 67 trips)

**Technical Decisions:**
- Flagged invalid records rather than deleting them, which preserves the full dataset
- Retained zero-fare trips (possibly promos or incentives, but they may still represent demand)
- Excluded negative fares (unlikely to represent valid trips)
- DuckDB settings (`threads=4`, `preserve_insertion_order=false`) reduced flagging time to ~10 minutes

**Outputs:**
- `data/validated/fhvhv_all_data_flagged.parquet` — All 684M records with 13 validation flags
- `data/validated/fhvhv_valid_data_for_eda.parquet` — 683.8M clean records (99.91% of total)
- `data/quality_reports/validation_report.csv` — Detailed validation metrics by rule

**Next Steps:**
Proceed to **02_exploratory_analysis.ipynb** to aggregate valid records to zone-daily level and analyze demand patterns.
