### FHVHV Data Pipeline - Stage 1: Validation

**Pipeline Position:** Stage 1 of 4
- Stage 0: Data Download 
- Stage 1: Data Validation ← THIS SCRIPT
- Stage 2: Exploratory Analysis
- Stage 3: Modeling

**Overview**

This notebook validates NYC TLC FHVHV trip data (Uber/Lyft) for 2022-2024, implementing a comprehensive validation framework with granular quality checks across trip duration, distance, and fare fields. The pipeline uses DuckDB for memory-efficient processing of 684M records and creates detailed quality flags for monitoring and debugging.

**Fields Validated**

This validation focuses on three quantitative outcome fields critical for demand forecasting: trip duration (trip_time), trip distance (trip_miles), and base fare (base_passenger_fare). These fields receive comprehensive checks including null detection, bounds validation (60s-12hr for duration, 0.1-200mi for distance, $0-$500 for fare), and extreme value flagging.

**Validation Strategy**

The framework implements 13 granular flags to identify specific quality issues (null values, negative numbers, out-of-bounds data, extreme outliers) along with a master validity indicator for quick filtering. This flag-based approach preserves all records for audit trails and debugging rather than deleting problematic data. Configurable thresholds at the top of the notebook make it easy to adapt the framework for future projects, like the planned water quality forecasting analysis.

**Code Reusability**

This script is designed for reuse across different projects:
- Update the three groups of threshold variables (duration, distance, fare) for new datasets
- Generic validation pattern (null, negative, bounds, extremes)
- Modular architecture easily adapted to different datasets
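
The generic pattern (null, negative, bounds) can be sketched as a small helper that builds SQL flag expressions for any numeric column. This is only an illustration under assumptions, not code from the notebook: the function name `build_flag_sql`, its parameters, the `ph_level` column, and the `readings` table are all hypothetical.

```python
def build_flag_sql(column, prefix, min_value, max_value):
    """Return SQL flag expressions for one numeric column.

    Hypothetical helper: it mirrors the null / negative / bounds pattern
    used for the trip fields but is not part of the notebook itself.
    """
    return [
        f"({column} IS NULL) AS flag_{prefix}_null",
        f"({column} < 0) AS flag_{prefix}_negative",
        f"({column} < {min_value}) AS flag_{prefix}_too_low",
        f"({column} > {max_value}) AS flag_{prefix}_exceeds_max",
    ]


# Example with a made-up column from a different dataset
flags = build_flag_sql("ph_level", "ph", min_value=0, max_value=14)
select_list = ",\n    ".join(flags)
print(f"SELECT *,\n    {select_list}\nFROM readings")
```

Generating the expressions from a single function keeps the flag names consistent across fields, which may make the pattern easier to carry to another dataset.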

**Inputs:**
- `data/final/combined_fhvhv_tripdata.parquet`

**Outputs:**
- `data/validated/fhvhv_all_data_flagged.parquet` - All records with 13 validation flags
- `data/validated/fhvhv_valid_data_for_eda.parquet` - Valid records only (~99.9%)
- `data/quality_reports/validation_report.csv` - Detailed validation metrics

**Runtime Note**

Processing 684M records takes approximately 70 minutes in total, including ~20 minutes for flagging (Section 3.2) and ~25 minutes for creating the clean EDA dataset (Section 3.5). DuckDB's streaming approach keeps memory usage manageable throughout.

**Next Step:** Run `02_exploratory_analysis.ipynb`

#### 1. Setup

##### 1.1 Import Libraries

In [1]:
import duckdb
import pandas as pd
from pathlib import Path
from datetime import datetime

##### 1.2 Configuration

In [None]:
# Define file paths using relative references so the notebook works from any machine with the same folder structure
INPUT_FILE = Path("../data/final/combined_fhvhv_tripdata.parquet")
OUTPUT_DIR = Path("../data/validated")
REPORTS_DIR = Path("../data/quality_reports")

# Create directories if they don't already exist
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
REPORTS_DIR.mkdir(parents=True, exist_ok=True)

# Define output filenames for flagged and clean datasets
FLAGGED_FILE = OUTPUT_DIR / "fhvhv_all_data_flagged.parquet"
EDA_FILE = OUTPUT_DIR / "fhvhv_valid_data_for_eda.parquet"

# Initialize DuckDB connection and set progress bar options
con = duckdb.connect()
con.execute("SET enable_progress_bar = true")
con.execute("SET progress_bar_time = 2000")

print(f"Input: {INPUT_FILE}")
print(f"File exists: {INPUT_FILE.exists()}")

Input: ..\data\final\combined_fhvhv_tripdata.parquet
File exists: True


#### 2. Data Exploration
This section reviews the schema, checks the date range, and identifies missing values.

##### 2.1 Column Overview

In [3]:
# Get total record count
total_records = con.execute(f"SELECT COUNT(*) FROM '{INPUT_FILE}'").fetchone()[0]
print(f"Total records: {total_records:,}\n")

# Review column names and data types using DESCRIBE
schema = con.execute(f"""
    DESCRIBE SELECT * FROM '{INPUT_FILE}'
""").df()

print(f"Dataset has {len(schema)} columns:\n")
print(schema.to_string(index=False))

Total records: 684,376,551

Dataset has 24 columns:

         column_name column_type null  key default extra
   hvfhs_license_num     VARCHAR  YES None    None  None
dispatching_base_num     VARCHAR  YES None    None  None
originating_base_num     VARCHAR  YES None    None  None
    request_datetime   TIMESTAMP  YES None    None  None
   on_scene_datetime   TIMESTAMP  YES None    None  None
     pickup_datetime   TIMESTAMP  YES None    None  None
    dropoff_datetime   TIMESTAMP  YES None    None  None
        PULocationID      BIGINT  YES None    None  None
        DOLocationID      BIGINT  YES None    None  None
          trip_miles      DOUBLE  YES None    None  None
           trip_time      BIGINT  YES None    None  None
 base_passenger_fare      DOUBLE  YES None    None  None
               tolls      DOUBLE  YES None    None  None
                 bcf      DOUBLE  YES None    None  None
           sales_tax      DOUBLE  YES None    None  None
congestion_surcharge      DOUBLE  Y

##### 2.2 Date Range Check

In [4]:
# Check date range of pickup_datetime to verify coverage period
date_range = con.execute(f"""
    SELECT 
        MIN(pickup_datetime) as earliest,
        MAX(pickup_datetime) as latest
    FROM '{INPUT_FILE}'
""").df()

print(f"Date range: {date_range['earliest'].iloc[0]} to {date_range['latest'].iloc[0]}")

Date range: 2022-01-01 00:00:00 to 2024-12-31 23:59:59


##### 2.3 Missing Values Check
Identify null values by column. High-null columns will be excluded during aggregation in EDA, not removed here.

In [5]:
# Get list of column names from schema
columns = schema['column_name'].tolist()

# Build SQL that counts NULLs per column with CASE WHEN; a single scan covers all columns at once
null_count_sql = f"""
    SELECT 
        {', '.join([f"SUM(CASE WHEN {col} IS NULL THEN 1 ELSE 0 END) AS {col}_null_count" for col in columns])}
    FROM '{INPUT_FILE}'
"""

# Execute query and get results as tuple
null_counts = con.execute(null_count_sql).fetchone()

# Calculate null percentages
null_pct = [(count / total_records) * 100 for count in null_counts]

# Display results
print(f"{'Column':<25} {'Null Count':<12} {'Null Percentage'}")
print("-" * 50)

for col, count, pct in zip(columns, null_counts, null_pct):
    print(f"{col:<25} {count:<12} {pct:>15.2f}%")

Column                    Null Count   Null Percentage
--------------------------------------------------
hvfhs_license_num         0                       0.00%
dispatching_base_num      0                       0.00%
originating_base_num      183954837              26.88%
request_datetime          0                       0.00%
on_scene_datetime         183891654              26.87%
pickup_datetime           0                       0.00%
dropoff_datetime          0                       0.00%
PULocationID              0                       0.00%
DOLocationID              0                       0.00%
trip_miles                0                       0.00%
trip_time                 0                       0.00%
base_passenger_fare       0                       0.00%
tolls                     0                       0.00%
bcf                       0                       0.00%
sales_tax                 0                       0.00%
congestion_surcharge      0                       0.00


**High-null columns (>10%):**
- `originating_base_num` (27%) - Lyft doesn't report this field
- `on_scene_datetime` (27%) - Optional tracking field
- `airport_fee` (18%) - Only applies to airport trips

**Impact on forecasting:**
These columns are most likely not needed for trip count forecasting and will be excluded during aggregation in EDA.
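
The 10% cutoff used for the list above can be applied programmatically. The sketch below is a minimal example, assuming a hypothetical helper `high_null_columns`; the counts are copied from the output above for a subset of columns.

```python
def high_null_columns(null_counts, total_records, threshold_pct=10.0):
    """Return {column: null_pct} for columns whose null share exceeds the threshold."""
    return {
        col: count / total_records * 100
        for col, count in null_counts.items()
        if count / total_records * 100 > threshold_pct
    }


# Counts taken from the null-count output (subset of columns only)
total_records = 684_376_551
null_counts = {
    "originating_base_num": 183_954_837,
    "on_scene_datetime": 183_891_654,
    "trip_miles": 0,
    "trip_time": 0,
}

high_null = high_null_columns(null_counts, total_records)
for col, pct in high_null.items():
    print(f"{col}: {pct:.2f}% null")
```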

#### 3. Data Validation & Dataset Creation
This section defines validation thresholds for duration, distance, and fare, flags every record as valid or invalid, and creates two output datasets: a flagged dataset with quality indicators for all 684M records, and a clean dataset containing only valid records for downstream analysis.

##### 3.1 Define Validation Thresholds

In [None]:
# Duration thresholds (seconds)
DURATION_MIN = 60          # 1 minute - shorter likely GPS/timing errors
DURATION_MAX = 43200       # 12 hours - reasonable max for rideshare
DURATION_EXTREME = 604800  # 7 days - clear data corruption

# Distance thresholds (miles)
DISTANCE_MIN = 0.1         # 0.1 miles - filters GPS noise
DISTANCE_MAX = 200         # 200 miles - covers NYC to Philadelphia

# Fare thresholds (dollars)
FARE_MIN = 0               # Negative fares are errors (but $0 might be ok)
FARE_MAX = 500             # Fares above $500 are treated as extreme outliers

print("Validation Thresholds")
print("=" * 50)
print(f"Duration: {DURATION_MIN}s - {DURATION_MAX}s (max: {DURATION_MAX/3600:.1f} hrs)")
print(f"Distance: {DISTANCE_MIN} - {DISTANCE_MAX} miles")
print(f"Fare: ${FARE_MIN} - ${FARE_MAX}")

Validation Thresholds
Duration: 60s - 43200s (max: 12.0 hrs)
Distance: 0.1 - 200 miles
Fare: $0 - $500


**Threshold Selection Notes**

The thresholds were chosen to catch data errors without being overly aggressive:

- **Duration (60s - 12hr):** 60s minimum filters GPS timing errors while keeping legitimate short trips. 12-hour maximum accommodates long-distance rides (NYC to Philadelphia) based on sampling trips in the 8-12 hour range.

- **Distance (0.1 - 200mi):** 0.1 mile minimum filters GPS noise while preserving short local trips. 200-mile maximum covers NYC to Philadelphia service area.
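
When thresholds are edited for a new dataset, a quick ordering check can catch configuration mistakes before a long run. The sketch below assumes a hypothetical helper `check_thresholds`; the constant values are the ones from the configuration cell.

```python
def check_thresholds(name, minimum, maximum, extreme=None):
    """Raise ValueError unless minimum < maximum (and maximum < extreme, if given)."""
    if minimum >= maximum:
        raise ValueError(f"{name}: min {minimum} must be below max {maximum}")
    if extreme is not None and extreme <= maximum:
        raise ValueError(f"{name}: extreme {extreme} must exceed max {maximum}")
    return True


# Values from the configuration cell
DURATION_MIN, DURATION_MAX, DURATION_EXTREME = 60, 43200, 604800
DISTANCE_MIN, DISTANCE_MAX = 0.1, 200
FARE_MIN, FARE_MAX = 0, 500

check_thresholds("duration", DURATION_MIN, DURATION_MAX, DURATION_EXTREME)
check_thresholds("distance", DISTANCE_MIN, DISTANCE_MAX)
check_thresholds("fare", FARE_MIN, FARE_MAX)

# Unit conversions make the second-based limits easier to review
duration_max_hours = DURATION_MAX / 3600
duration_extreme_days = DURATION_EXTREME / 86400
print(f"Max duration: {duration_max_hours:.0f} hours")
print(f"Extreme duration: {duration_extreme_days:.0f} days")
```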

##### 3.2 Flag Dataset (~20 min)
Add validation flags for all quality rules. Each field gets granular flags 
for specific issues (null, negative, out-of-range, etc.) plus a master 
validity flag for quick filtering.

In [7]:
print("Creating flagged dataset...")
print("This will take approximately 20 minutes for 684M records\n")

# Create flagged dataset with comprehensive validation
con.execute(f"""
    COPY (
        SELECT 
            *,
            
            -- ============================================
            -- DURATION FLAGS (using trip_time in seconds)
            -- ============================================
            (trip_time IS NULL) AS flag_duration_null,
            (trip_time <= 0) AS flag_duration_zero_negative,
            (trip_time < {DURATION_MIN}) AS flag_duration_too_short,
            (trip_time > {DURATION_MAX}) AS flag_duration_exceeds_max,
            (trip_time > {DURATION_EXTREME}) AS flag_duration_extreme,
            
            -- ============================================
            -- DISTANCE FLAGS (using trip_miles)
            -- ============================================
            (trip_miles IS NULL) AS flag_distance_null,
            (trip_miles < 0) AS flag_distance_negative,
            (trip_miles < {DISTANCE_MIN}) AS flag_distance_too_short,
            (trip_miles > {DISTANCE_MAX}) AS flag_distance_exceeds_max,
            
            -- ============================================
            -- FARE FLAGS (using base_passenger_fare)
            -- ============================================
            (base_passenger_fare IS NULL) AS flag_fare_null,
            (base_passenger_fare < {FARE_MIN}) AS flag_fare_negative,
            (base_passenger_fare = 0) AS flag_fare_zero,
            (base_passenger_fare > {FARE_MAX}) AS flag_fare_extreme_high,
            
            -- ============================================
            -- MASTER VALIDITY FLAG
            -- Record is valid if ALL critical checks pass
            -- ============================================
            (
                trip_time IS NOT NULL AND
                trip_time >= {DURATION_MIN} AND 
                trip_time <= {DURATION_MAX} AND
                trip_miles IS NOT NULL AND
                trip_miles >= {DISTANCE_MIN} AND
                trip_miles <= {DISTANCE_MAX} AND
                base_passenger_fare IS NOT NULL AND
                base_passenger_fare >= {FARE_MIN} AND
                base_passenger_fare <= {FARE_MAX}
            ) AS is_valid
            
        FROM '{INPUT_FILE}'
    ) TO '{FLAGGED_FILE}' (FORMAT PARQUET)
""")

# Get record count to confirm success
flagged_count = con.execute(f"SELECT COUNT(*) FROM '{FLAGGED_FILE}'").fetchone()[0]
print(f"Flagged dataset created: {flagged_count:,} records")

Creating flagged dataset...
This will take approximately 20 minutes for 684M records



Flagged dataset created: 684,376,551 records


##### 3.3 Spot Check Flagged Records
Visual verification of flag calculations. Review sample records to confirm flags are applied correctly.
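
One way to carry out such a spot check is to recompute the master validity flag in plain Python and compare it with the stored value. The sketch below is only illustrative: the helper `expected_is_valid` is hypothetical and the sample rows are made up. In practice the rows would be sampled from the flagged Parquet file.

```python
DURATION_MIN, DURATION_MAX = 60, 43200
DISTANCE_MIN, DISTANCE_MAX = 0.1, 200
FARE_MIN, FARE_MAX = 0, 500


def expected_is_valid(trip_time, trip_miles, fare):
    """Recompute the master validity flag for one record in plain Python."""
    if trip_time is None or trip_miles is None or fare is None:
        return False
    return (
        DURATION_MIN <= trip_time <= DURATION_MAX
        and DISTANCE_MIN <= trip_miles <= DISTANCE_MAX
        and FARE_MIN <= fare <= FARE_MAX
    )


# Made-up rows: (trip_time, trip_miles, base_passenger_fare, stored is_valid)
sample_rows = [
    (900, 3.2, 18.50, True),     # ordinary trip
    (45, 0.3, 7.00, False),      # shorter than 60 seconds
    (1200, 0.05, 9.25, False),   # shorter than 0.1 miles
    (1500, 6.1, 0.00, True),     # zero fare is retained
    (1500, 6.1, -12.40, False),  # negative fare is rejected
]

# Collect rows where the recomputed flag disagrees with the stored one
mismatches = [
    row for row in sample_rows
    if expected_is_valid(row[0], row[1], row[2]) != row[3]
]
print(f"Checked {len(sample_rows)} rows, {len(mismatches)} mismatches")
```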

##### 3.4 Validation Counts 
Count invalid records from the flagged dataset. Results are reported per field (duration, distance, fare), followed by overall validity.

In [8]:
# Count all validation flags in one query
validation_stats = con.execute(f"""
    SELECT 
        COUNT(*) as total,
        
        -- Duration issues
        SUM(CAST(flag_duration_null AS INTEGER)) as dur_null,
        SUM(CAST(flag_duration_zero_negative AS INTEGER)) as dur_zero_neg,
        SUM(CAST(flag_duration_too_short AS INTEGER)) as dur_too_short,
        SUM(CAST(flag_duration_exceeds_max AS INTEGER)) as dur_exceeds_max,
        SUM(CAST(flag_duration_extreme AS INTEGER)) as dur_extreme,
        
        -- Distance issues
        SUM(CAST(flag_distance_null AS INTEGER)) as dist_null,
        SUM(CAST(flag_distance_negative AS INTEGER)) as dist_negative,
        SUM(CAST(flag_distance_too_short AS INTEGER)) as dist_too_short,
        SUM(CAST(flag_distance_exceeds_max AS INTEGER)) as dist_exceeds_max,
        
        -- Fare issues
        SUM(CAST(flag_fare_null AS INTEGER)) as fare_null,
        SUM(CAST(flag_fare_negative AS INTEGER)) as fare_negative,
        SUM(CAST(flag_fare_zero AS INTEGER)) as fare_zero,
        SUM(CAST(flag_fare_extreme_high AS INTEGER)) as fare_extreme_high,
        
        -- Overall validity
        SUM(CAST(is_valid AS INTEGER)) as valid,
        SUM(CAST(NOT is_valid AS INTEGER)) as invalid
        
    FROM '{FLAGGED_FILE}'
""").fetchone()

# Unpack results
(total, 
 dur_null, dur_zero_neg, dur_too_short, dur_exceeds_max, dur_extreme,
 dist_null, dist_negative, dist_too_short, dist_exceeds_max,
 fare_null, fare_negative, fare_zero, fare_extreme_high,
 valid, invalid) = validation_stats

# Display comprehensive validation report
print("Field Validation Summary")
print("=" * 80)
print()

# Duration section
print("DURATION (trip_time field)")
print(f"  Total records:           {total:>15,}")
# flag_duration_too_short (< DURATION_MIN) already includes zero/negative durations, so dur_zero_neg is not subtracted again
print(f"  Valid records:           {total - (dur_null + dur_too_short + dur_exceeds_max):>15,}")
print("  " + "─" * 76)
print(f"  Null values:             {dur_null:>15,} ({dur_null/total*100:>6.3f}%)")
print(f"  Zero/negative:           {dur_zero_neg:>15,} ({dur_zero_neg/total*100:>6.3f}%)")
print(f"  Too short (<{DURATION_MIN}s):        {dur_too_short:>15,} ({dur_too_short/total*100:>6.3f}%)")
print(f"  Exceeds {DURATION_MAX/3600:.0f}hr:            {dur_exceeds_max:>15,} ({dur_exceeds_max/total*100:>6.3f}%)")
print(f"  Extreme (>{DURATION_EXTREME/86400:.0f}d):            {dur_extreme:>15,} ({dur_extreme/total*100:>6.3f}%)")
print()

# Distance section
print("DISTANCE (trip_miles field)")
print(f"  Total records:           {total:>15,}")
# flag_distance_too_short (< DISTANCE_MIN) already includes negative distances
print(f"  Valid records:           {total - (dist_null + dist_too_short + dist_exceeds_max):>15,}")
print("  " + "─" * 76)
print(f"  Null values:             {dist_null:>15,} ({dist_null/total*100:>6.3f}%)")
print(f"  Negative:                {dist_negative:>15,} ({dist_negative/total*100:>6.3f}%)")
print(f"  Too short (<{DISTANCE_MIN}mi):      {dist_too_short:>15,} ({dist_too_short/total*100:>6.3f}%)")
print(f"  Exceeds {DISTANCE_MAX}mi:          {dist_exceeds_max:>15,} ({dist_exceeds_max/total*100:>6.3f}%)")
print()

# Fare section
print("FARE (base_passenger_fare field)")
print(f"  Total records:           {total:>15,}")
print(f"  Valid records:           {total - (fare_null + fare_negative + fare_extreme_high):>15,}")
print("  " + "─" * 76)
print(f"  Null values:             {fare_null:>15,} ({fare_null/total*100:>6.3f}%)")
print(f"  Negative:                {fare_negative:>15,} ({fare_negative/total*100:>6.3f}%)")
print(f"  Zero fare:               {fare_zero:>15,} ({fare_zero/total*100:>6.3f}%) [Note: might be legit]")
print(f"  Extreme high (>${FARE_MAX}):     {fare_extreme_high:>15,} ({fare_extreme_high/total*100:>6.3f}%)")
print()

# Overall section
print("OVERALL VALIDITY")
print(f"  Valid (all checks):      {valid:>15,} ({valid/total*100:>6.2f}%)")
print(f"  Invalid (any check):     {invalid:>15,} ({invalid/total*100:>6.2f}%)")

Field Validation Summary

DURATION (trip_time field)
  Total records:               684,376,551
  Valid records:               684,316,331
  ────────────────────────────────────────────────────────────────────────────
  Null values:                           0 ( 0.000%)
  Zero/negative:                        65 ( 0.000%)
  Too short (<60s):                 60,163 ( 0.009%)
  Exceeds 12hr:                         57 ( 0.000%)
  Extreme (>7d):                          0 ( 0.000%)

DISTANCE (trip_miles field)
  Total records:               684,376,551
  Valid records:               684,120,480
  ────────────────────────────────────────────────────────────────────────────
  Null values:                           0 ( 0.000%)
  Negative:                              0 ( 0.000%)
  Too short (<0.1mi):              251,660 ( 0.037%)
  Exceeds 200mi:                    4,411 ( 0.001%)

FARE (base_passenger_fare field)
  Total records:               684,376,551
  Valid records:               684

##### 3.5 Create EDA Dataset (~25 min)
Save only the valid records to a Parquet file and drop the flag columns. This creates a clean dataset for exploratory analysis without validation overhead.

In [9]:
# Create EDA dataset with only valid records, excluding flag columns
con.execute(f"""
    COPY (
        SELECT * EXCLUDE (
            flag_duration_null, 
            flag_duration_zero_negative, 
            flag_duration_too_short, 
            flag_duration_exceeds_max, 
            flag_duration_extreme,
            flag_distance_null, 
            flag_distance_negative, 
            flag_distance_too_short, 
            flag_distance_exceeds_max,
            flag_fare_null, 
            flag_fare_negative, 
            flag_fare_zero, 
            flag_fare_extreme_high,
            is_valid
        )
        FROM '{FLAGGED_FILE}'
        WHERE is_valid = true
    ) TO '{EDA_FILE}' (FORMAT PARQUET)
""")
print(f"EDA dataset: {EDA_FILE.name} ({valid:,} records)")

EDA dataset: fhvhv_valid_data_for_eda.parquet (683,780,462 records)


##### 3.6 Generate Validation Report
Create detailed CSV report of all validation checks for documentation.

In [None]:
# Build comprehensive report: one (field, rule, count, threshold) entry per check
report_rows = [
    # Duration rows
    ('duration', 'null', dur_null, 'IS NULL'),
    ('duration', 'zero_negative', dur_zero_neg, '<= 0'),
    ('duration', 'too_short', dur_too_short, f'< {DURATION_MIN}s'),
    ('duration', 'exceeds_max', dur_exceeds_max, f'> {DURATION_MAX}s'),
    ('duration', 'extreme', dur_extreme, f'> {DURATION_EXTREME}s'),
    # Distance rows
    ('distance', 'null', dist_null, 'IS NULL'),
    ('distance', 'negative', dist_negative, '< 0'),
    ('distance', 'too_short', dist_too_short, f'< {DISTANCE_MIN}mi'),
    ('distance', 'exceeds_max', dist_exceeds_max, f'> {DISTANCE_MAX}mi'),
    # Fare rows
    ('fare', 'null', fare_null, 'IS NULL'),
    ('fare', 'negative', fare_negative, '< 0'),
    ('fare', 'zero', fare_zero, '= 0'),
    ('fare', 'extreme_high', fare_extreme_high, f'> ${FARE_MAX}'),
    # Overall totals (for the VALID row, the count columns hold the valid count)
    ('OVERALL', 'VALID', valid, 'All checks pass'),
    ('OVERALL', 'INVALID', invalid, 'Any check fails'),
]

# Expand each entry into a report record with its percentage of all records
report_data = [
    {
        'field': field,
        'rule': rule,
        'invalid_count': count,
        'invalid_pct': count / total * 100,
        'threshold': threshold,
    }
    for field, rule, count, threshold in report_rows
]

# Create DataFrame and add metadata
report = pd.DataFrame(report_data)
report['total_records'] = total
report['processing_date'] = datetime.now().strftime('%Y-%m-%d %H:%M:%S')

# Reorder columns for better readability
report = report[['field', 'rule', 'threshold', 'invalid_count', 'invalid_pct', 'total_records', 'processing_date']]

# Save report
report.to_csv(REPORTS_DIR / "validation_report.csv", index=False)
print(f"Validation report saved: {REPORTS_DIR / 'validation_report.csv'}")

Validation report saved: ..\data\quality_reports\validation_report.csv


#### 4. Cleanup
Close database connections and release resources to complete the validation pipeline.

In [None]:
# Close DuckDB connection
con.close()
print("Pipeline complete")

Pipeline complete


### Conclusion

**Execution Results**

This validation pipeline processed 684,376,551 NYC rideshare trip records from 2022-2024, implementing 13 quality checks across duration, distance, and fare fields. The analysis showed excellent data quality with 683,780,462 valid records (99.91%) and only 596,089 invalid records (0.09%). Processing completed in approximately 70 minutes.

**Dataset Overview**

The validation run in summary:
- 684,376,551 total records processed
- 683,780,462 valid records (99.91% pass rate)
- 596,089 invalid records (0.09% with quality issues)
- 13 granular quality flags implemented
- 3 output files created

**Data Quality Observations**

The validation results reveal an exceptionally clean dataset. All three validated fields (duration, distance, fare) show pass rates of roughly 99.95% or higher.

Field-specific findings:
- **Duration:** 99.99% pass rate - only 60,163 trips under 60 seconds and 57 exceeding 12 hours
- **Distance:** 99.96% pass rate - 251,660 trips under 0.1 miles and 4,411 trips over 200 miles
- **Fare:** 99.95% pass rate - 320K negative fares may need investigation, 110K zero-fare trips retained
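
The duration, distance, and overall pass rates can be reproduced from the counts reported in Section 3.4. The sketch below is a worked calculation with those counts hard-coded; the variable names are illustrative. The fare rate is left out because the exact fare counts are not listed in this notebook's visible output.

```python
total = 684_376_551

# Duration: the 65 zero/negative trips are already counted as "too short",
# so only the non-overlapping flags are subtracted.
dur_too_short, dur_exceeds_max = 60_163, 57
duration_pass_pct = (total - dur_too_short - dur_exceeds_max) / total * 100

# Distance: no null or negative values were found
dist_too_short, dist_exceeds_max = 251_660, 4_411
distance_pass_pct = (total - dist_too_short - dist_exceeds_max) / total * 100

# Overall: records passing every check
valid = 683_780_462
invalid = total - valid
overall_pass_pct = valid / total * 100

print(f"Duration pass rate: {duration_pass_pct:.2f}%")
print(f"Distance pass rate: {distance_pass_pct:.2f}%")
print(f"Overall pass rate:  {overall_pass_pct:.2f}% ({invalid:,} invalid)")
```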

Three columns show significant nulls but don't impact demand forecasting:
- `originating_base_num` (27% null)
- `on_scene_datetime` (27% null)
- `airport_fee` (18% null)

**Technical Decisions**

Boolean validation flags were implemented to preserve all records while marking quality issues. This strategy retains all data in the original dataset while creating a clean dataset for analysis. Zero-fare trips were intentionally retained as they may represent legitimate demand (promotional rides, driver incentives) despite no payment. Negative fares were excluded as they likely represent refunds or cancellations rather than completed trips.


**Output Files**
- `fhvhv_all_data_flagged.parquet` - All 684M records with 14 validation columns for quality monitoring
- `fhvhv_valid_data_for_eda.parquet` - 683.8M valid records (24 original columns) for analysis
- `validation_report.csv` - Detailed metrics for all 13 validation rules

**Next Steps**

Proceed to **02_exploratory_analysis.ipynb** to:
- Aggregate trips by borough and time period
- Analyze demand patterns (daily, weekly, seasonal)
- Engineer features for forecasting models