# NYC Rideshare Forecasting Pipeline - Part 1: Data Download

**Author:** K Flowers  
**GitHub:** [github.com/KRFlowers](https://github.com/KRFlowers)  
**Date:** December 2025

This notebook downloads 36 months (2022-2024) of NYC Taxi & Limousine Commission (TLC) High Volume For-Hire Vehicle (FHVHV) trip records, which cover Uber and Lyft. It is the first notebook in a pipeline that validates the data, analyzes demand patterns, and builds zone-level forecasting models.

**Pipeline Position:** Notebook 1 of 4: Data Download

- 00_data_download.ipynb ← **this notebook**
- 01_data_validation.ipynb
- 02_exploratory_analysis.ipynb
- 03_demand_forecasting.ipynb

**Objective:** Acquire and consolidate the raw dataset used by the validation, exploration, and modeling stages.

**Technical Approach:**
- Build list of monthly file URLs for 2022–2024
- Download each Parquet file (skip existing)
- Consolidate files using DuckDB
- Download zone metadata for later analysis

**Inputs:**
- 2022–2024 NYC TLC FHVHV trip data (publicly available from NYC Open Data)

**Outputs:**
- `data/raw/combined_fhvhv_tripdata.parquet`: combined trip data (~18 GB)
- `data/raw/zone_metadata.csv`: zone reference data

**Runtime:** ~1 hour first run; skips existing files on rerun

## 1. Configure Environment

### 1.1 Import Libraries

In [None]:
import urllib.request
from pathlib import Path
import duckdb
import pandas as pd

Libraries imported successfully


### 1.2 Set Display Options

In [None]:
# Pandas display options
pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', '{:.4f}'.format)

### 1.3 Set Paths and Constants

In [None]:
# Project Constants
PROJECT_YEARS = [2022, 2023, 2024]
TLC_DATASET = 'fhvhv'

# Paths
PROJECT_ROOT = Path("..").resolve()
RAW_DIR = PROJECT_ROOT / "data" / "raw"  # Source files + combined dataset

# Notebook Configuration
BASE_URL = "https://d37ci6vzurychx.cloudfront.net/trip-data"
OUTPUT_FILE = RAW_DIR / f"combined_{TLC_DATASET}_tripdata.parquet"

# Create Directory
RAW_DIR.mkdir(parents=True, exist_ok=True)


✓ Config loaded: FHVHV 2022-2024


## 2. Download Data
Creates file list, downloads monthly trip files, and retrieves zone metadata.

### 2.1 Create File List

In [None]:
# Generate list of files to download using year and month combinations in PROJECT_YEARS
files_to_download = [
    f"{TLC_DATASET}_tripdata_{year}-{month:02d}.parquet"
    for year in PROJECT_YEARS
    for month in range(1, 13)
]

print(f"Generated list of {len(files_to_download)} files to download")

✓ Generated list of 36 files to download
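Because the download step skips files that already exist, it can help to see up front which expected files are absent from the raw directory. The sketch below is one hedged way to do that. `find_missing_files` is a hypothetical helper, not part of the notebook, and the `"data/raw"` path is illustrative. It checks only for presence, not for file integrity.

```python
from pathlib import Path


def find_missing_files(expected_names, directory):
    """Return the expected filenames that are not present in `directory`."""
    directory = Path(directory)
    return [name for name in expected_names if not (directory / name).exists()]


# Same naming pattern as the notebook's file list
expected = [
    f"fhvhv_tripdata_{year}-{month:02d}.parquet"
    for year in (2022, 2023, 2024)
    for month in range(1, 13)
]

# "data/raw" is illustrative; in the notebook this would be RAW_DIR
missing = find_missing_files(expected, "data/raw")
print(f"{len(missing)} of {len(expected)} files still to download")
```

On a first run every file is reported as missing. On a rerun the list should shrink to whatever failed previously.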


### 2.2 Download Trip Data Files

In [None]:
# Download files in list of files_to_download created above 
downloaded_files = []
failed_files = []

for i, filename in enumerate(files_to_download, 1):
    url = f"{BASE_URL}/{filename}"
    save_path = RAW_DIR / filename
    
    # Skip files that already exist
    if save_path.exists():
        downloaded_files.append(save_path)
        continue
    
    # Download with error handling
    try:
        print(f"[{i}/{len(files_to_download)}] {filename}...", end=" ")
        urllib.request.urlretrieve(url, save_path)
        downloaded_files.append(save_path)
        print("")
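    except Exception as e:
        # Remove any partial file so a rerun does not treat it as complete
        save_path.unlink(missing_ok=True)
        print(f"✗ {str(e)[:50]}")
        failed_files.append(filename)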

# Summary
print(f"\nDownload complete: {len(downloaded_files)} files available")
if failed_files:
    print(f"Failed: {len(failed_files)} files (may not exist yet)")
    for fname in failed_files[:5]:
        print(f"  - {fname}")

[1/36] fhvhv_tripdata_2022-01.parquet... 
[2/36] fhvhv_tripdata_2022-02.parquet... 
[3/36] fhvhv_tripdata_2022-03.parquet... 
[4/36] fhvhv_tripdata_2022-04.parquet... 
[5/36] fhvhv_tripdata_2022-05.parquet... 
[6/36] fhvhv_tripdata_2022-06.parquet... 
[7/36] fhvhv_tripdata_2022-07.parquet... 
[8/36] fhvhv_tripdata_2022-08.parquet... 
[9/36] fhvhv_tripdata_2022-09.parquet... 
[10/36] fhvhv_tripdata_2022-10.parquet... 
[11/36] fhvhv_tripdata_2022-11.parquet... 
[12/36] fhvhv_tripdata_2022-12.parquet... 
[13/36] fhvhv_tripdata_2023-01.parquet... 
[14/36] fhvhv_tripdata_2023-02.parquet... 
[15/36] fhvhv_tripdata_2023-03.parquet... 
[16/36] fhvhv_tripdata_2023-04.parquet... 
[17/36] fhvhv_tripdata_2023-05.parquet... 
[18/36] fhvhv_tripdata_2023-06.parquet... 
[19/36] fhvhv_tripdata_2023-07.parquet... 
[20/36] fhvhv_tripdata_2023-08.parquet... 
[21/36] fhvhv_tripdata_2023-09.parquet... 
[22/36] fhvhv_tripdata_2023-10.parquet... 
[23/36] fhvhv_tripdata_2023-11.parquet... 
[24/36] fhvhv_tripda
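Large monthly files can fail partway through a transfer. The following is a minimal sketch, not the notebook's method, of a download helper that retries and writes to a temporary path before renaming. `download_with_retry`, the `.part` suffix, and the retry/backoff parameters are all assumptions made for illustration. The `fetch` argument defaults to `urllib.request.urlretrieve`, the same call the notebook uses.

```python
import time
import urllib.request
from pathlib import Path


def download_with_retry(url, dest, fetch=urllib.request.urlretrieve,
                        retries=3, backoff_seconds=2.0):
    """Download `url` to `dest`, retrying on failure.

    The file is written to a temporary ".part" path and renamed only after
    the transfer succeeds, so an interrupted download is not mistaken for
    a complete file on the next run.
    """
    dest = Path(dest)
    tmp_path = dest.with_name(dest.name + ".part")
    for attempt in range(1, retries + 1):
        try:
            fetch(url, tmp_path)
            tmp_path.replace(dest)  # rename into place once complete
            return True
        except Exception as exc:
            tmp_path.unlink(missing_ok=True)  # discard any partial file
            print(f"attempt {attempt}/{retries} failed: {exc}")
            if attempt < retries:
                time.sleep(backoff_seconds * attempt)  # linear backoff
    return False
```

In the download loop, a call such as `download_with_retry(url, save_path)` could stand in for the direct `urlretrieve` call, with a `False` return value appended to `failed_files`.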

### 2.3 Download Zone Metadata

In [8]:
# Download NYC TLC zone metadata for zone names and boroughs 

zone_metadata_url = "https://d37ci6vzurychx.cloudfront.net/misc/taxi_zone_lookup.csv"
zone_metadata_file = RAW_DIR / "zone_metadata.csv"

if not zone_metadata_file.exists():
    urllib.request.urlretrieve(zone_metadata_url, zone_metadata_file)
    
    # Rename LocationID column to zone_id for consistency
    zone_metadata = pd.read_csv(zone_metadata_file)
    zone_metadata = zone_metadata.rename(columns={'LocationID': 'zone_id'})
    zone_metadata.to_csv(zone_metadata_file, index=False)
    print(f"Zone metadata saved: {zone_metadata_file}")
else:
    zone_metadata = pd.read_csv(zone_metadata_file)
    print("Zone metadata exists, skipping download")

print(f"Total zones: {len(zone_metadata)}")
zone_metadata.head()

Zone metadata saved: C:\Users\kristi\OneDrive\GitHub Repositories\DataScienceProjects\nyc-fhv-rideshare-forecasting\data\raw\zone_metadata.csv
Total zones: 265


Unnamed: 0,zone_id,Borough,Zone,service_zone
0,1,EWR,Newark Airport,EWR
1,2,Queens,Jamaica Bay,Boro Zone
2,3,Bronx,Allerton/Pelham Gardens,Boro Zone
3,4,Manhattan,Alphabet City,Yellow Zone
4,5,Staten Island,Arden Heights,Boro Zone
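The zone metadata exists so that the numeric `PULocationID` and `DOLocationID` values in the trip data can be translated into zone and borough names later. As a hedged illustration of that lookup, the sketch below uses only the standard library `csv` module and three rows copied from the preview above. `load_zone_lookup` and `describe_trip` are hypothetical helpers, and the fallback label for unmatched IDs is an assumption.

```python
import csv
import io

# Illustrative rows in the same layout as zone_metadata.csv after the rename
sample_csv = """zone_id,Borough,Zone,service_zone
1,EWR,Newark Airport,EWR
2,Queens,Jamaica Bay,Boro Zone
4,Manhattan,Alphabet City,Yellow Zone
"""


def load_zone_lookup(csv_file):
    """Map each integer zone_id to a (Borough, Zone) tuple."""
    reader = csv.DictReader(csv_file)
    return {int(row["zone_id"]): (row["Borough"], row["Zone"]) for row in reader}


def describe_trip(pu_id, do_id, lookup):
    """Render a pickup/dropoff pair; unmatched IDs fall back to a placeholder."""
    unknown = ("Unknown", "Unknown")
    pu_borough, pu_zone = lookup.get(pu_id, unknown)
    do_borough, do_zone = lookup.get(do_id, unknown)
    return f"{pu_zone} ({pu_borough}) -> {do_zone} ({do_borough})"


zone_lookup = load_zone_lookup(io.StringIO(sample_csv))
print(describe_trip(4, 2, zone_lookup))
# prints: Alphabet City (Manhattan) -> Jamaica Bay (Queens)
```

With the full dataset, the same mapping would more likely be expressed as a join on `zone_id`. The dictionary form is used here only to keep the example small.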


## 3. Combine Data 
Uses DuckDB to combine all monthly files into a single dataset (~5-10 minutes).

### 3.1 Initialize DuckDB Connection 

In [None]:
# Initialize DuckDB connection with memory-efficient settings
con = duckdb.connect()
con.execute("SET threads=4")
con.execute("SET preserve_insertion_order=false")

✓ DuckDB connection established


### 3.2 Combine Data Files
Prepare file list for combining all downloaded monthly trip files.

In [None]:
# Create SQL-formatted file list from downloaded files
file_list_sql = ", ".join([f"'{str(f)}'" for f in downloaded_files])

print(f"Prepared {len(downloaded_files)} files for combination:")
print(f"  Output: {OUTPUT_FILE.name}")

Combining 36 files and saving to: combined_fhvhv_tripdata.parquet...
Saved: 18830.1 MB (18.39 GB)
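Interpolating paths into SQL text works as long as no path contains a single quote. A slightly more defensive variant, sketched here under that concern, doubles any single quotes (the standard SQL escape) and emits forward slashes via `as_posix()`. `to_sql_string_list` and both example paths are hypothetical.

```python
from pathlib import Path


def to_sql_string_list(paths):
    """Render paths as a comma-separated list of SQL string literals.

    Single quotes are doubled, which is the standard SQL escape, and
    as_posix() emits forward slashes instead of backslashes.
    """
    literals = []
    for path in paths:
        text = Path(path).as_posix().replace("'", "''")
        literals.append(f"'{text}'")
    return ", ".join(literals)


# Hypothetical paths for illustration only
example_paths = [
    Path("data/raw/fhvhv_tripdata_2022-01.parquet"),
    Path("data/o'brien/fhvhv_tripdata_2022-02.parquet"),
]
print(to_sql_string_list(example_paths))
```

The result can be placed inside the square brackets of the `read_parquet([...])` call in the same way as `file_list_sql`.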


In [None]:
# Combine all monthly files into single parquet file
print(f"Combining {len(downloaded_files)} files...")
con.execute(f"""
    COPY (
        SELECT * FROM read_parquet([{file_list_sql}])
    )
    TO '{str(OUTPUT_FILE)}'
    (FORMAT PARQUET, COMPRESSION SNAPPY)
""")

# Verify export
parquet_file_size_mb = OUTPUT_FILE.stat().st_size / 1024**2
print(f"Export complete: {parquet_file_size_mb:.1f} MB ({parquet_file_size_mb/1024:.2f} GB)")
print(f"  Location: {OUTPUT_FILE}")

## 4. Verify Dataset Structure
Confirms that the combined dataset has the expected schema, row count, and date range, and that the fields required for downstream analysis are present.

### 4.1 Validate Schema and Critical Fields

In [None]:
# Query dataset metrics
summary = con.execute(f"""
    SELECT 
        COUNT(*) as row_count,
        MIN(pickup_datetime) as start_date,
        MAX(pickup_datetime) as end_date
    FROM '{OUTPUT_FILE}'
""").fetchone()

row_count, start_date, end_date = summary

# Get schema information
columns = con.execute(f"DESCRIBE SELECT * FROM '{OUTPUT_FILE}'").df()
column_names = columns['column_name'].tolist()

# Verify critical columns needed for analysis are present
expected_columns = [
    'hvfhs_license_num',
    'pickup_datetime',        
    'dropoff_datetime',       
    'PULocationID',           
    'DOLocationID',           
    'trip_miles',             
    'trip_time',              
    'base_passenger_fare'     
]

# Assert all present
missing = [col for col in expected_columns if col not in column_names]
assert len(missing) == 0, f"Schema validation failed - missing columns: {missing}"

print(f"Schema validation complete: {row_count:,} records")

In [None]:
# Display dataset summary and schema validation results
print("DATASET STRUCTURE VERIFICATION")
print("_"*60)
print(f"\nRows:        {row_count:,}")
print(f"Columns:     {len(column_names)}")
print(f"Date range:  {start_date} to {end_date}")
print(f"File size:   {parquet_file_size_mb:.1f} MB ({parquet_file_size_mb/1024:.2f} GB)")

print("\nCritical pipeline fields validated:")
for col in expected_columns:
    status = "✓" if col in column_names else "✗ MISSING"
    print(f"  {status} {col}")
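The summary above prints the date range but does not assert anything about it. One hedged way to turn the date check into an explicit test is sketched below. `check_date_range` is a hypothetical helper, the timestamps passed to it are illustrative values rather than results from the dataset, and the sketch assumes `start_date` and `end_date` arrive as Python `datetime` objects.

```python
from datetime import datetime


def check_date_range(start, end, first_year, last_year):
    """Return a list of problems found in the observed pickup range."""
    lower = datetime(first_year, 1, 1)
    upper = datetime(last_year + 1, 1, 1)  # exclusive upper bound
    problems = []
    if start < lower:
        problems.append(f"earliest pickup {start} is before {lower}")
    if end >= upper:
        problems.append(f"latest pickup {end} is on or after {upper}")
    return problems


# Illustrative values, not results from the dataset
issues = check_date_range(
    datetime(2022, 1, 1, 0, 0, 5),
    datetime(2024, 12, 31, 23, 59, 58),
    first_year=2022,
    last_year=2024,
)
print(issues or "date range is within the expected window")
```

In the notebook, the bounds could be taken from `PROJECT_YEARS[0]` and `PROJECT_YEARS[-1]` so the check follows the configured years.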

### 4.2 Review Full Schema

In [None]:
# List all columns with data types
con.execute(f"DESCRIBE SELECT * FROM '{OUTPUT_FILE}'").df()[['column_name', 'column_type']]

Unnamed: 0,column_name,column_type
0,hvfhs_license_num,VARCHAR
1,dispatching_base_num,VARCHAR
2,originating_base_num,VARCHAR
3,request_datetime,TIMESTAMP
4,on_scene_datetime,TIMESTAMP
5,pickup_datetime,TIMESTAMP
6,dropoff_datetime,TIMESTAMP
7,PULocationID,BIGINT
8,DOLocationID,BIGINT
9,trip_miles,DOUBLE


### 4.3 Preview Sample Records

In [None]:
# Preview the first five records of the combined dataset
con.execute(f"SELECT * FROM '{OUTPUT_FILE}' LIMIT 5").df()

Unnamed: 0,hvfhs_license_num,dispatching_base_num,originating_base_num,request_datetime,on_scene_datetime,pickup_datetime,dropoff_datetime,PULocationID,DOLocationID,trip_miles,trip_time,base_passenger_fare,tolls,bcf,sales_tax,congestion_surcharge,airport_fee,tips,driver_pay,shared_request_flag,shared_match_flag,access_a_ride_flag,wav_request_flag,wav_match_flag
0,HV0003,B03404,B03404,2022-01-01 00:05:31,2022-01-01 00:05:40,2022-01-01 00:07:24,2022-01-01 00:18:28,170,161,1.18,664,24.9,0.0,0.75,2.21,2.75,0.0,0.0,23.03,N,N,,N,N
1,HV0003,B03404,B03404,2022-01-01 00:19:27,2022-01-01 00:22:08,2022-01-01 00:22:32,2022-01-01 00:30:12,237,161,0.82,460,11.97,0.0,0.36,1.06,2.75,0.0,0.0,12.32,N,N,,N,N
2,HV0003,B03404,B03404,2022-01-01 00:43:53,2022-01-01 00:57:37,2022-01-01 00:57:37,2022-01-01 01:07:32,237,161,1.18,595,29.82,0.0,0.89,2.65,2.75,0.0,0.0,23.3,N,N,,N,N
3,HV0003,B03404,B03404,2022-01-01 00:15:36,2022-01-01 00:17:08,2022-01-01 00:18:02,2022-01-01 00:23:05,262,229,1.65,303,7.91,0.0,0.24,0.7,2.75,0.0,0.0,6.3,N,N,,N,N
4,HV0003,B03404,B03404,2022-01-01 00:25:45,2022-01-01 00:26:01,2022-01-01 00:28:01,2022-01-01 00:35:42,229,141,1.65,461,9.44,0.0,0.28,0.84,2.75,0.0,0.0,7.44,N,N,,N,N


In [32]:
# Close DuckDB connection and release resources and file locks
con.close()
print("\n Complete")


 Complete


**Dataset Summary:**
- 684 million records
- 36 files downloaded (Jan 2022 - Dec 2024)
- All expected columns present

Data is ready for validation.

## Conclusion

This notebook downloaded and consolidated 36 monthly files of NYC TLC FHVHV trip data, creating a single 18GB dataset with 684 million records spanning 2022-2024.

**Key Findings:**
- Successfully downloaded all 36 monthly files (Jan 2022 - Dec 2024)
- Date range verified: 2022-01-01 to 2024-12-31
- All expected critical columns present in combined dataset

**Technical Decisions:**
- Used DuckDB instead of Pandas to process the 18 GB dataset without running into memory errors
- Tuning DuckDB settings (`threads=4`, `preserve_insertion_order=false`) reduced combination time from over 30 minutes to about 6 minutes
- Renamed `LocationID` to `zone_id` in metadata for downstream consistency
- Parquet format with Snappy compression for efficient storage

**Outputs:**
- `data/raw/combined_fhvhv_tripdata.parquet`: combined dataset (18.4 GB, 684M records)
- `data/raw/zone_metadata.csv`: zone reference data (265 rows)

**Next Steps:**
Proceed to **01_data_validation.ipynb** to validate data quality and flag records for analysis.