### FHVHV Data Pipeline - Stage 0: Download

**Pipeline Position:** Stage 0 of 4
- Stage 0: Data Download ← THIS SCRIPT
- Stage 1: Data Validation
- Stage 2: Exploratory Analysis
- Stage 3: Modeling

**Overview**

This script downloads NYC TLC FHVHV trip data (Uber/Lyft/Via) for 2022-2024 from the NYC Open Data website and combines it. It retrieves 36 monthly Parquet files and consolidates them into a single dataset of approximately 684 million rideshare trip records, which the later stages use to analyze and forecast rideshare demand.

**Data Source**
- **Provider:** NYC Taxi & Limousine Commission (TLC) - the city agency that regulates rideshare services
- **Dataset:** For-Hire Vehicle High Volume (FHVHV) trip records
- **Website:** NYC Open Data Portal (data.cityofnewyork.us)
- **Format:** Monthly Parquet files hosted on CloudFront CDN

**Time Period Rationale**

The 2022-2024 period was chosen for several practical reasons. Three years provides enough history to identify seasonal patterns (multiple summers, winters, and holidays) while staying focused on recent rideshare trends relevant for forecasting. This timeframe also yields a manageable ~18 GB dataset that can be processed on standard hardware within the project timeline.

**Technical Approach**

- **Download:** Uses Python urllib for HTTP retrieval
- **Processing:** Uses DuckDB for memory-efficient consolidation
- **Storage:** Data is stored in separate raw/final directories for data lineage
- **Output:** Data is written to a single Parquet file to simplify downstream data validation

**Output:** `data/final/combined_fhvhv_tripdata.parquet` (~18GB, 684M records)

**Runtime Note:**
- Section 2.2 (downloads): ~10-30 minutes (network-dependent)
- Section 3 (combine/export): ~50-70 minutes (CPU-bound)
- Total: ~1-2 hours

**Inputs:**
- None (downloads from NYC TLC)

**Outputs:**
- `data/raw/fhvhv_tripdata_2022-01.parquet` ... (36 monthly files)
- `data/final/combined_fhvhv_tripdata.parquet` - Combined dataset
- `data/raw/zone_metadata.csv` - Zone names and boroughs

**Next Step:** Run `01_data_validation.ipynb`

#### 1. Setup

##### 1.1 Import Libraries

In [1]:
import urllib.request
from pathlib import Path
import duckdb

##### 1.2 Configuration
The configuration uses a modular approach for key parameters. This makes it easy to adapt the pipeline for different date ranges or NYC TLC datasets if needed.

In [2]:
# Define project root (parent of the working directory), resolved to an absolute path
DATA_ROOT = Path("..").resolve()

# Define years to download
YEARS = [2022, 2023, 2024]

# Set directory structure for downloads and final data
raw_folder = DATA_ROOT / "data" / "raw"
final_folder = DATA_ROOT / "data" / "final"

# Define external data source
BASE_URL = "https://d37ci6vzurychx.cloudfront.net/trip-data"

# Define dataset type and output filename
DATASET_TYPE = "fhvhv"
OUTPUT_FILENAME = f"combined_{DATASET_TYPE}_tripdata"

# Display configuration
print("Configuration:")
print(f"  Dataset: {DATASET_TYPE.upper()} Trip Data")
print(f"  Years: {YEARS[0]}-{YEARS[-1]} ({len(YEARS)} years)")
print(f"  Total files: {len(YEARS) * 12}")
print(f"  Output: {OUTPUT_FILENAME}.parquet")

Configuration:
  Dataset: FHVHV Trip Data
  Years: 2022-2024 (3 years)
  Total files: 36
  Output: combined_fhvhv_tripdata.parquet


##### 1.3 Create Directory Structure

In [3]:
# Create directories if they don't exist
raw_folder.mkdir(parents=True, exist_ok=True)
final_folder.mkdir(parents=True, exist_ok=True)

print(f" Directories ready")

 Directories ready


#### 2. Download Data
This section first builds the list of monthly filenames for the selected years, then downloads each FHVHV trip file from the CloudFront-hosted TLC file store (`BASE_URL`). Files already on disk are skipped, and failed downloads (for example, months that are not yet published) are caught and reported instead of stopping the run.

##### 2.1 Generate Download Tasks

In [5]:
# Create list of filenames to download
download_tasks = [
    f"{DATASET_TYPE}_tripdata_{year}-{month:02d}.parquet"
    for year in YEARS
    for month in range(1, 13)
]

print(f"Generated {len(download_tasks)} download tasks")

Generated 36 download tasks
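
The list above always covers full calendar years. If a partial range is needed, for example because the latest months are not yet published, the same naming scheme can be driven by a start and end month. A minimal sketch, where `monthly_filenames` is a hypothetical helper that is not part of this notebook:

```python
def monthly_filenames(dataset_type, start, end):
    """Return monthly TLC filenames from start to end, inclusive.

    start and end are (year, month) tuples, e.g. (2022, 1).
    """
    names = []
    year, month = start
    while (year, month) <= end:
        names.append(f"{dataset_type}_tripdata_{year}-{month:02d}.parquet")
        # Advance one month, rolling over to January of the next year
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return names


tasks = monthly_filenames("fhvhv", (2022, 1), (2024, 12))
print(len(tasks), tasks[0], tasks[-1])
```

Called with `(2022, 1)` and `(2024, 12)` it yields the same 36 names as the list comprehension, so it could replace `download_tasks` without changing the rest of the pipeline.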


##### 2.2 Execute Downloads (~10-30 min)
Download each monthly file from `BASE_URL`. Files already present in the raw folder are skipped, so re-running the cell fetches only what is missing.

In [6]:
downloaded_files = []
failed_files = []

for i, filename in enumerate(download_tasks, 1):
    url = f"{BASE_URL}/{filename}"
    save_path = raw_folder / filename
    
    # Skip files that already exist
    if save_path.exists():
        downloaded_files.append(save_path)
        continue
    
    # Download with error handling
    try:
        print(f"[{i}/{len(download_tasks)}] {filename}...", end=" ")
        urllib.request.urlretrieve(url, save_path)
        downloaded_files.append(save_path)
        print("✓")
    except Exception as e:
        print(f"✗ {str(e)[:50]}")
        failed_files.append(filename)

# Summary
print(f"\nDownload complete: {len(downloaded_files)} files available")
if failed_files:
    print(f"Failed: {len(failed_files)} files (may not exist yet)")
    for fname in failed_files[:5]:
        print(f"  - {fname}")


Download complete: 36 files available
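
One caveat with the loop above: `urlretrieve` writes straight to the final path, so an interrupted transfer can leave a partial file that the "already exists" check will skip on the next run. A minimal sketch of one way to guard against that, writing to a temporary `.part` file and renaming it only on success. `download_atomic` is a hypothetical helper, not part of the notebook:

```python
import urllib.request
from pathlib import Path


def download_atomic(url: str, save_path: Path) -> bool:
    """Download url to save_path; return True if a download happened.

    Data is written to a sibling ".part" file first and renamed only after
    the transfer finishes, so save_path never holds a partial file.
    """
    if save_path.exists():
        return False  # already downloaded on an earlier run

    tmp_path = save_path.with_name(save_path.name + ".part")
    try:
        urllib.request.urlretrieve(url, tmp_path)
        tmp_path.replace(save_path)  # rename within the same folder
    except Exception:
        tmp_path.unlink(missing_ok=True)  # discard the incomplete file
        raise
    return True
```

Because the temporary name ends in `.part`, it does not match the `fhvhv_tripdata_*.parquet` pattern used later when combining files.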


##### 2.3 Download Zone Metadata

In [4]:
# What: Download NYC TLC zone metadata
# Why: Provides zone names and boroughs for every LocationID (265 rows in the lookup file)

zone_metadata_url = "https://d37ci6vzurychx.cloudfront.net/misc/taxi_zone_lookup.csv"
zone_metadata_file = raw_folder / "zone_metadata.csv"

print("Downloading zone metadata...")
urllib.request.urlretrieve(zone_metadata_url, zone_metadata_file)

# Load the file to verify the download, then rename LocationID to zone_id and re-save
import pandas as pd
zone_metadata = pd.read_csv(zone_metadata_file)
zone_metadata = zone_metadata.rename(columns={'LocationID': 'zone_id'})
zone_metadata.to_csv(zone_metadata_file, index=False)

print(f"Zone metadata saved: {zone_metadata_file}")
print(f"Total zones: {len(zone_metadata)}")
zone_metadata.head()

Downloading zone metadata...
Zone metadata saved: C:\Users\kristi\OneDrive\GitHub Repositories\DataScienceProjects\nyc-fhv-rideshare-forecasting\data\raw\zone_metadata.csv
Total zones: 265


|   | zone_id | Borough | Zone | service_zone |
|---|---------|---------|------|--------------|
| 0 | 1 | EWR | Newark Airport | EWR |
| 1 | 2 | Queens | Jamaica Bay | Boro Zone |
| 2 | 3 | Bronx | Allerton/Pelham Gardens | Boro Zone |
| 3 | 4 | Manhattan | Alphabet City | Yellow Zone |
| 4 | 5 | Staten Island | Arden Heights | Boro Zone |


#### 3. Combine and Export (~50-70 minutes)
This section combines all monthly files into a single dataset using DuckDB, which streams data rather than loading it all into memory. This is critical for handling the large file size (~18GB) without running out of RAM.
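
Because the glob pattern combines whatever files happen to match it, a quick pre-flight check can confirm that every expected month is on disk before the long export starts. A minimal sketch, where `find_missing_files` is a hypothetical helper that is not part of this notebook:

```python
from pathlib import Path


def find_missing_files(raw_folder: Path, dataset_type: str, years: list[int]) -> list[str]:
    """Return the expected monthly filenames that are absent from raw_folder."""
    expected = [
        f"{dataset_type}_tripdata_{year}-{month:02d}.parquet"
        for year in years
        for month in range(1, 13)
    ]
    # Names of the files that the combine step's glob pattern would pick up
    present = {p.name for p in raw_folder.glob(f"{dataset_type}_tripdata_*.parquet")}
    return [name for name in expected if name not in present]
```

With the notebook's configuration this would be called as `find_missing_files(raw_folder, DATASET_TYPE, YEARS)`; an empty list means all 36 files are available.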

In [7]:
# Initialize DuckDB connection
con = duckdb.connect()
print(" DuckDB connection established")

 DuckDB connection established


In [8]:
# Combine files and export

# Define pattern to match all downloaded files
parquet_pattern = str(raw_folder / f"{DATASET_TYPE}_tripdata_*.parquet")

# Define output path
parquet_path = final_folder / f"{OUTPUT_FILENAME}.parquet"

# Combine and export in single step
print(f"Combining files and saving to: {parquet_path.name}...")
con.execute(f"""
    COPY (
        SELECT * FROM read_parquet('{parquet_pattern}')
    )
    TO '{str(parquet_path)}'
    (FORMAT PARQUET, COMPRESSION SNAPPY)
""")

parquet_size_mb = parquet_path.stat().st_size / 1024**2
print(f" Saved: {parquet_size_mb:.1f} MB ({parquet_size_mb/1024:.2f} GB)")

Combining files and saving to: combined_fhvhv_tripdata.parquet...


 Saved: 18793.2 MB (18.35 GB)


#### 4. Data Overview
This section verifies the combined dataset is complete and properly structured by checking record counts, date range coverage, column presence, and sample records.

##### 4.1 Dataset Summary

In [9]:
# Query combined file for dataset metrics
summary = con.execute(f"""
    SELECT 
        COUNT(*) as row_count,
        MIN(pickup_datetime) as start_date,
        MAX(pickup_datetime) as end_date
    FROM '{parquet_path}'
""").fetchone()

row_count, start_date, end_date = summary

# Display column information
columns = con.execute(f"DESCRIBE SELECT * FROM '{parquet_path}'").df()
column_names = columns['column_name'].tolist()

# Check for expected columns
expected_columns = ['hvfhs_license_num', 'pickup_datetime', 'dropoff_datetime', 
                   'trip_miles', 'base_passenger_fare']
missing_cols = [col for col in expected_columns if col not in column_names]

print("="*60)
print("DATASET SUMMARY")
print("="*60)
print(f"\nRows:        {row_count:,}")
print(f"Columns:     {len(column_names)}")
print(f"Date range:  {start_date} to {end_date}")
print(f"File size:   {parquet_size_mb:.1f} MB ({parquet_size_mb/1024:.2f} GB)")

if missing_cols:
    print(f"\n  Missing expected columns: {missing_cols}")
else:
    print(f"\n All expected columns present")

DATASET SUMMARY

Rows:        684,376,551
Columns:     24
Date range:  2022-01-01 00:00:00 to 2024-12-31 23:59:59
File size:   18793.2 MB (18.35 GB)

 All expected columns present


##### 4.2 Column Summary

In [10]:
# list columns in dataset with data types
print(f"\nColumn Data Types:")
columns_types = con.execute(f"DESCRIBE SELECT * FROM '{parquet_path}'").df()
for i, row in columns_types.iterrows():
    print(f"  {i+1:2d}. {row['column_name']}: {row['column_type']}")    


Column Data Types:
   1. hvfhs_license_num: VARCHAR
   2. dispatching_base_num: VARCHAR
   3. originating_base_num: VARCHAR
   4. request_datetime: TIMESTAMP
   5. on_scene_datetime: TIMESTAMP
   6. pickup_datetime: TIMESTAMP
   7. dropoff_datetime: TIMESTAMP
   8. PULocationID: BIGINT
   9. DOLocationID: BIGINT
  10. trip_miles: DOUBLE
  11. trip_time: BIGINT
  12. base_passenger_fare: DOUBLE
  13. tolls: DOUBLE
  14. bcf: DOUBLE
  15. sales_tax: DOUBLE
  16. congestion_surcharge: DOUBLE
  17. airport_fee: DOUBLE
  18. tips: DOUBLE
  19. driver_pay: DOUBLE
  20. shared_request_flag: VARCHAR
  21. shared_match_flag: VARCHAR
  22. access_a_ride_flag: VARCHAR
  23. wav_request_flag: VARCHAR
  24. wav_match_flag: VARCHAR


##### 4.3 Sample Records

In [11]:
# Display sample records in transposed format
sample_df = con.execute(f"""
    SELECT * FROM '{parquet_path}' LIMIT 4
""").df()

print(f"Sample Records ({len(sample_df)} rows, {len(sample_df.columns)} columns):\n")
print(sample_df.T.to_string())

Sample Records (4 rows, 24 columns):

                                        0                    1                    2                    3
hvfhs_license_num                  HV0003               HV0003               HV0003               HV0003
dispatching_base_num               B03404               B03404               B03404               B03404
originating_base_num               B03404               B03404               B03404               B03404
request_datetime      2022-01-01 00:05:31  2022-01-01 00:19:27  2022-01-01 00:43:53  2022-01-01 00:15:36
on_scene_datetime     2022-01-01 00:05:40  2022-01-01 00:22:08  2022-01-01 00:57:37  2022-01-01 00:17:08
pickup_datetime       2022-01-01 00:07:24  2022-01-01 00:22:32  2022-01-01 00:57:37  2022-01-01 00:18:02
dropoff_datetime      2022-01-01 00:18:28  2022-01-01 00:30:12  2022-01-01 01:07:32  2022-01-01 00:23:05
PULocationID                          170                  237                  237                  262
DOLocationID     

In [12]:
# Close the DuckDB connection to release resources and file locks
con.close()
print("\n Complete")


 Complete


### Conclusion

**Execution Results**

This pipeline successfully downloaded and consolidated NYC TLC FHVHV trip data for the 2022-2024 analysis period. The combined dataset contains 684,376,551 rideshare trip records spanning exactly 3 years, from January 1, 2022 to December 31, 2024.

**Dataset Overview**

This pipeline produced a single consolidated file with:
- 684,376,551 total trip records
- 3-year time span (Jan 2022 - Dec 2024)
- 18.35 GB file size (compressed Parquet)
- 36 monthly source files combined
- 24 data fields per record

**Data Quality Observations**

The initial review looks good: all 24 columns are present with the expected data types, the date range is fully covered, and the key fields (`trip_time`, `trip_miles`, `base_passenger_fare`) are fully populated. File size is also consistent with expectations (~28 KB per 1,000 records).
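
The per-record storage cost can be checked directly from the figures reported in section 4.1. A small worked calculation, using only the row count and file size printed there:

```python
size_mib = 18793.2         # file size printed by the export cell (bytes / 1024**2)
row_count = 684_376_551    # row count printed in the dataset summary

bytes_per_record = size_mib * 1024**2 / row_count
kib_per_thousand = bytes_per_record * 1000 / 1024

print(f"{bytes_per_record:.1f} bytes per record")
print(f"{kib_per_thousand:.1f} KiB per 1,000 records")
```

This works out to a little under 29 bytes per record, or roughly 28 KiB per 1,000 records.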

Two fields show significant nulls, but this is expected and won't impact demand forecasting:
- `originating_base_num` (27% null) - Lyft doesn't report this field
- `on_scene_datetime` (27% null) - optional driver tracking field
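
The notebook cells above do not show how the null percentages were measured. One way to reproduce them is a single aggregate query, sketched below. `null_rate_sql` is a hypothetical helper that only builds the SQL text; it relies on the standard behavior that `COUNT(column)` skips NULL values while `COUNT(*)` counts every row.

```python
def null_rate_sql(parquet_path: str, columns: list[str]) -> str:
    """Build a query returning the percentage of NULL values per column."""
    safe_path = parquet_path.replace("'", "''")  # escape quotes for the SQL literal
    selects = ",\n    ".join(
        f'100.0 * (COUNT(*) - COUNT("{col}")) / COUNT(*) AS "{col}_null_pct"'
        for col in columns
    )
    return f"SELECT\n    {selects}\nFROM read_parquet('{safe_path}')"


sql = null_rate_sql(
    "data/final/combined_fhvhv_tripdata.parquet",
    ["originating_base_num", "on_scene_datetime"],
)
print(sql)
```

The resulting string can be passed to `con.execute(sql).fetchone()` while the DuckDB connection from section 3 is still open.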


**Technical Decisions**

Pandas was the initial choice for data consolidation. However, it could not hold all 36 files in memory at once and failed with out-of-memory errors.

DuckDB proved to be the right solution. It reads multiple Parquet files through a glob pattern and exports them in a single operation, which avoids the memory limits I hit with pandas. The key advantage is that DuckDB streams data rather than loading everything into RAM, so it can process datasets larger than available memory. This makes the pipeline reproducible on standard hardware: I ran it on my 32 GB desktop in about 2 hours.
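
A rough back-of-the-envelope estimate shows why loading everything at once does not fit on a 32 GB machine. The sketch assumes pandas would hold the 16 TIMESTAMP, BIGINT, and DOUBLE columns listed in section 4.2 at 8 bytes per value, and it ignores the 8 VARCHAR columns entirely, so it is a lower bound rather than a measurement.

```python
row_count = 684_376_551     # from the dataset summary in section 4.1
fixed_width_columns = 16    # TIMESTAMP, BIGINT, and DOUBLE columns in section 4.2
bytes_per_value = 8         # assumption: datetime64, int64, and float64 storage

in_memory_gib = row_count * fixed_width_columns * bytes_per_value / 1024**3
print(f"~{in_memory_gib:.0f} GiB for the fixed-width columns alone")
```

Even this lower bound is more than twice the available RAM, which is consistent with the out-of-memory errors described above.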

**Output Files**

- `data/raw/fhvhv_tripdata_*.parquet` - Individual monthly files (36 files)
- `data/raw/zone_metadata.csv` - Zone names and boroughs
- `data/final/combined_fhvhv_tripdata.parquet` - Combined dataset (18.35 GB)

**Next Steps**

Proceed to **01_data_validation.ipynb** to:
- Implement quality checks on duration, distance, and fare fields
- Flag invalid records while preserving full dataset for audit
- Create clean dataset for exploratory analysis