# NYC Rideshare Forecasting Pipeline - Part 1: Data Download

This notebook downloads 36 months of NYC high-volume for-hire vehicle (FHVHV; Uber and Lyft) trip records for demand forecasting. It uses DuckDB for memory-efficient processing to consolidate the monthly Parquet files into a single ~18 GB dataset, and it separately downloads zone metadata. The data forms the foundation for validation, EDA, and forecasting in notebooks 01–03.

**Pipeline Position:** Notebook 1 of 4 -- Data Download

- 00_data_download.ipynb ← **this notebook**
- 01_data_validation.ipynb
- 02_exploratory_analysis.ipynb
- 03_demand_forecasting.ipynb

**Objective:** Acquire and consolidate the raw dataset used in the validation, exploration, and modeling stages.

**Technical Approach:**
- Build list of monthly file URLs for 2022–2024
- Download each Parquet file (skip existing)
- Consolidate files using DuckDB
- Download zone metadata for later analysis

**Inputs:**
- 36 monthly files -- 2022–2024 NYC TLC FHVHV trip data (from the NYC TLC Trip Record Data page; see References)

**Outputs:**
- `data/raw/combined_fhvhv_tripdata.parquet` - Combined trip data (~18 GB)
- `data/raw/zone_metadata.csv` - Zone reference data


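**Limitations:** At ~18 GB, the combined dataset may not fit in memory on a typical machine, so downstream notebooks should query it with DuckDB or a similar out-of-core engine rather than loading it whole.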

**Runtime:** ~1 hour first run; skips existing files on rerun

## 1. Configure Environment

### 1.1 Import Libraries

In [1]:
import urllib.request
from pathlib import Path
import duckdb
import pandas as pd

### 1.2 Set Display Options

In [2]:
# Pandas display options
pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', '{:.4f}'.format)

### 1.3 Set Paths and Constants

In [3]:
# Project Constants
PROJECT_YEARS = [2022, 2023, 2024]
TLC_DATASET = 'fhvhv'

# Paths
PROJECT_ROOT = Path("..").resolve()
RAW_DIR = PROJECT_ROOT / "data" / "raw"  # Source files + combined dataset

# Notebook Configuration
BASE_URL = "https://d37ci6vzurychx.cloudfront.net/trip-data"
OUTPUT_FILE = RAW_DIR / f"combined_{TLC_DATASET}_tripdata.parquet"

# Create Directory
RAW_DIR.mkdir(parents=True, exist_ok=True)


## 2. Download Data
Creates the list of monthly files to download, downloads files, and retrieves zone metadata.

### 2.1 Create File List

In [4]:
# Generate list of files to download using year and month combinations in PROJECT_YEARS
files_to_download = [
    f"{TLC_DATASET}_tripdata_{year}-{month:02d}.parquet"
    for year in PROJECT_YEARS
    for month in range(1, 13)
]

print(f"File list created: {len(files_to_download)} files to download")

File list created: 36 files to download
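The download cell later notes that some files "may not exist yet", which matters when `PROJECT_YEARS` includes a year that is still in progress. A minimal sketch of one way to handle that, assuming a hypothetical helper `build_file_list` with a hypothetical `last_available` cutoff (neither is part of the notebook):

```python
def build_file_list(dataset, years, last_available=None):
    """Return monthly Parquet filenames, optionally stopping at a (year, month) cutoff.

    Hypothetical helper: `last_available` is an assumed parameter, not part of the notebook.
    """
    names = []
    for year in years:
        for month in range(1, 13):
            # Tuples compare element by element, so (2024, 7) > (2024, 6) is True
            if last_available is not None and (year, month) > last_available:
                continue
            names.append(f"{dataset}_tripdata_{year}-{month:02d}.parquet")
    return names


full = build_file_list("fhvhv", [2022, 2023, 2024])
partial = build_file_list("fhvhv", [2024], last_available=(2024, 6))
print(len(full), full[0], full[-1])
print(len(partial), partial[-1])
```

With no cutoff the helper reproduces the 36-file list built above; with a cutoff it simply omits months after the given one.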


### 2.2 Download Trip Data Files

In [5]:
# Download files in files_to_download list
downloaded_files = []
failed_files = []

for i, filename in enumerate(files_to_download, 1):
    url = f"{BASE_URL}/{filename}"
    save_path = RAW_DIR / filename
    
    # Skip files that already exist
    if save_path.exists():
        downloaded_files.append(save_path)
        continue
    
    # Download with error handling
    try:
        print(f"[{i}/{len(files_to_download)}] {filename}...", end=" ")
        urllib.request.urlretrieve(url, save_path)
        downloaded_files.append(save_path)
        print("OK")
    except Exception as e:
        print(f"FAILED: {str(e)[:50]}")
        failed_files.append(filename)
        # Remove any partial file so a rerun does not skip it as already downloaded
        save_path.unlink(missing_ok=True)


print(f"\nDownload complete: {len(downloaded_files)} files downloaded successfully.")
if failed_files:
    print(f"Failed: {len(failed_files)} files (may not exist yet)")
    for fname in failed_files[:5]:
        print(f"  - {fname}")


Download complete: 36 files downloaded successfully.
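Because the loop treats any existing file as already downloaded, an interrupted transfer could leave a truncated Parquet file that later runs would skip. One possible safeguard, sketched here under assumptions, is to write to a temporary `.part` file and rename it only after the transfer finishes. The helper name `download_atomic` and the `.part` suffix are hypothetical; the example uses a local `file://` URL so it runs without network access.

```python
import shutil
import tempfile
import urllib.request
from pathlib import Path


def download_atomic(url, save_path, timeout=60):
    """Stream `url` to `save_path` via a temporary '.part' file (hypothetical helper)."""
    save_path = Path(save_path)
    part_path = save_path.with_name(save_path.name + ".part")
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response, open(part_path, "wb") as out:
            shutil.copyfileobj(response, out)
        # Rename only after the full body has been written
        part_path.replace(save_path)
    except Exception:
        # Discard the incomplete file so a rerun starts clean
        part_path.unlink(missing_ok=True)
        raise
    return save_path


# Usage with a local file:// URL standing in for a monthly trip file
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "source.bin"
    source.write_bytes(b"example payload")
    target = download_atomic(source.as_uri(), Path(tmp) / "copy.bin")
    copied = target.read_bytes()
    leftover_parts = list(Path(tmp).glob("*.part"))

print(copied, leftover_parts)
```

With this pattern, the `save_path.exists()` check only ever sees complete files, since the final name appears only after a successful rename.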


### 2.3 Download Zone Metadata

In [6]:
# Download NYC TLC zone metadata for zone names and boroughs 
zone_metadata_url = "https://d37ci6vzurychx.cloudfront.net/misc/taxi_zone_lookup.csv"
zone_metadata_file = RAW_DIR / "zone_metadata.csv"

if not zone_metadata_file.exists():
    urllib.request.urlretrieve(zone_metadata_url, zone_metadata_file)
    
    # Rename LocationID column to zone_id for consistency
    zone_metadata_df = pd.read_csv(zone_metadata_file)
    zone_metadata_df = zone_metadata_df.rename(columns={'LocationID': 'zone_id'})
    zone_metadata_df.to_csv(zone_metadata_file, index=False)
    print(f"Zone metadata saved: {zone_metadata_file}")
else:
    zone_metadata_df = pd.read_csv(zone_metadata_file)
    print("Zone metadata exists, skipping download")

print(f"Total zones in metadata: {len(zone_metadata_df)}")

print("\nSample of zone metadata:")
zone_metadata_df.head()

Zone metadata exists, skipping download
Total zones in metadata: 265

Sample of zone metadata:


| | zone_id | Borough | Zone | service_zone |
|---|---|---|---|---|
| 0 | 1 | EWR | Newark Airport | EWR |
| 1 | 2 | Queens | Jamaica Bay | Boro Zone |
| 2 | 3 | Bronx | Allerton/Pelham Gardens | Boro Zone |
| 3 | 4 | Manhattan | Alphabet City | Yellow Zone |
| 4 | 5 | Staten Island | Arden Heights | Boro Zone |
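The `zone_id` rename is intended to make later joins against the trip data's location columns straightforward. A minimal sketch of such a join, assuming a small hypothetical `trips` frame (the zone rows are copied from the sample above; `PULocationID` 999 is a made-up unmatched value):

```python
import pandas as pd

# Three zone rows copied from the sample output above
zones = pd.DataFrame({
    "zone_id": [1, 2, 4],
    "Borough": ["EWR", "Queens", "Manhattan"],
    "Zone": ["Newark Airport", "Jamaica Bay", "Alphabet City"],
})

# Hypothetical pickups; 999 has no matching zone on purpose
trips = pd.DataFrame({"PULocationID": [4, 2, 4, 999]})

# A left join keeps every trip and leaves zone fields empty where no zone matches
labeled = trips.merge(zones, how="left", left_on="PULocationID", right_on="zone_id")
unmatched = int(labeled["zone_id"].isna().sum())

print(labeled[["PULocationID", "Borough", "Zone"]])
print(f"Unmatched pickups: {unmatched}")
```

Counting unmatched rows after the join is a cheap way to spot location IDs that are missing from the lookup table.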


## 3. Combine Data 
Combine all monthly files into a single dataset using DuckDB (~5-10 minutes).

### 3.1 Initialize DuckDB Connection 

In [7]:
# Initialize DuckDB connection, use memory-efficient settings
con = duckdb.connect()
con.execute("SET threads=4")
con.execute("SET preserve_insertion_order=false")

print("DuckDB connection initialized")

DuckDB connection initialized


### 3.2 Combine Data Files

In [8]:
# Create SQL-formatted file list from downloaded files
file_list_sql = ", ".join([f"'{str(f)}'" for f in downloaded_files])

print(f"\nCreated SQL file list for DuckDB: {len(downloaded_files)} files")


Created SQL file list for DuckDB: 36 files


In [9]:
# Combine all monthly files into single parquet file
print(f"Combining {len(downloaded_files)} files...")
con.execute(f"""
    COPY (
        SELECT * FROM read_parquet([{file_list_sql}])
    )
    TO '{str(OUTPUT_FILE)}'
    (FORMAT PARQUET, COMPRESSION SNAPPY)
""")

# Verify export
parquet_file_size_mb = OUTPUT_FILE.stat().st_size / 1024**2
print(f"Exported {OUTPUT_FILE.name} ({parquet_file_size_mb:,.0f} MB)")

Combining 36 files...



Exported combined_fhvhv_tripdata.parquet (18,825 MB)


## 4. Verify Dataset Structure
Confirms that the combined dataset has the expected schema, row count, date range, and fields required for downstream analysis.

### 4.1 Validate Schema and Critical Fields

In [10]:
# Query dataset metrics
summary = con.execute(f"""
    SELECT 
        COUNT(*) as row_count,
        MIN(pickup_datetime) as start_date,
        MAX(pickup_datetime) as end_date
    FROM '{OUTPUT_FILE}'
""").fetchone()

row_count, start_date, end_date = summary

# Get schema information
columns = con.execute(f"DESCRIBE SELECT * FROM '{OUTPUT_FILE}'").df()
column_names = columns['column_name'].tolist()

# Verify fields required for downstream analysis are present
expected_columns = [
    'hvfhs_license_num',
    'pickup_datetime',        
    'dropoff_datetime',       
    'PULocationID',           
    'DOLocationID',           
    'trip_miles',             
    'trip_time',              
    'base_passenger_fare'     
]

# Assert all present
missing = [col for col in expected_columns if col not in column_names]
assert len(missing) == 0, f"Schema validation failed - missing columns: {missing}"


In [11]:
# Display dataset summary and schema validation results
print("DATASET STRUCTURE VERIFICATION")
print("_"*60)
print(f"\nRows:        {row_count:,}")
print(f"Columns:     {len(column_names)}")
print(f"Date range:  {start_date} to {end_date}")
print(f"File size:   {parquet_file_size_mb:.1f} MB ({parquet_file_size_mb/1024:.2f} GB)")

print(f"\nRequired pipeline fields present:")
for col in expected_columns:
    status = "OK" if col in column_names else "MISSING"
    print(f"  [{status}] {col}")

DATASET STRUCTURE VERIFICATION
____________________________________________________________

Rows:        684,376,551
Columns:     24
Date range:  2022-01-01 00:00:00 to 2024-12-31 23:59:59
File size:   18824.7 MB (18.38 GB)

Required pipeline fields present:
  [OK] hvfhs_license_num
  [OK] pickup_datetime
  [OK] dropoff_datetime
  [OK] PULocationID
  [OK] DOLocationID
  [OK] trip_miles
  [OK] trip_time
  [OK] base_passenger_fare


### 4.2 Review Full Schema

In [12]:
# List all columns with data types
con.execute(f"DESCRIBE SELECT * FROM '{OUTPUT_FILE}'").df()[['column_name', 'column_type']]

| | column_name | column_type |
|---|---|---|
| 0 | hvfhs_license_num | VARCHAR |
| 1 | dispatching_base_num | VARCHAR |
| 2 | originating_base_num | VARCHAR |
| 3 | request_datetime | TIMESTAMP |
| 4 | on_scene_datetime | TIMESTAMP |
| 5 | pickup_datetime | TIMESTAMP |
| 6 | dropoff_datetime | TIMESTAMP |
| 7 | PULocationID | BIGINT |
| 8 | DOLocationID | BIGINT |
| 9 | trip_miles | DOUBLE |


### 4.3 Preview Sample Records

In [13]:
# Preview sample records
con.execute(f"SELECT * FROM '{OUTPUT_FILE}' LIMIT 5").df()

| | hvfhs_license_num | dispatching_base_num | originating_base_num | request_datetime | on_scene_datetime | pickup_datetime | dropoff_datetime | PULocationID | DOLocationID | trip_miles | trip_time | base_passenger_fare | tolls | bcf | sales_tax | congestion_surcharge | airport_fee | tips | driver_pay | shared_request_flag | shared_match_flag | access_a_ride_flag | wav_request_flag | wav_match_flag |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | HV0003 | B03404 | B03404 | 2022-01-01 00:05:31 | 2022-01-01 00:05:40 | 2022-01-01 00:07:24 | 2022-01-01 00:18:28 | 170 | 161 | 1.18 | 664 | 24.9 | 0.0 | 0.75 | 2.21 | 2.75 | 0.0 | 0.0 | 23.03 | N | N | | N | N |
| 1 | HV0003 | B03404 | B03404 | 2022-01-01 00:19:27 | 2022-01-01 00:22:08 | 2022-01-01 00:22:32 | 2022-01-01 00:30:12 | 237 | 161 | 0.82 | 460 | 11.97 | 0.0 | 0.36 | 1.06 | 2.75 | 0.0 | 0.0 | 12.32 | N | N | | N | N |
| 2 | HV0003 | B03404 | B03404 | 2022-01-01 00:43:53 | 2022-01-01 00:57:37 | 2022-01-01 00:57:37 | 2022-01-01 01:07:32 | 237 | 161 | 1.18 | 595 | 29.82 | 0.0 | 0.89 | 2.65 | 2.75 | 0.0 | 0.0 | 23.3 | N | N | | N | N |
| 3 | HV0003 | B03404 | B03404 | 2022-01-01 00:15:36 | 2022-01-01 00:17:08 | 2022-01-01 00:18:02 | 2022-01-01 00:23:05 | 262 | 229 | 1.65 | 303 | 7.91 | 0.0 | 0.24 | 0.7 | 2.75 | 0.0 | 0.0 | 6.3 | N | N | | N | N |
| 4 | HV0003 | B03404 | B03404 | 2022-01-01 00:25:45 | 2022-01-01 00:26:01 | 2022-01-01 00:28:01 | 2022-01-01 00:35:42 | 229 | 141 | 1.65 | 461 | 9.44 | 0.0 | 0.28 | 0.84 | 2.75 | 0.0 | 0.0 | 7.44 | N | N | | N | N |


In [14]:
# Close DuckDB connection and release resources and file locks
con.close()
print("\nDuckDB connection closed")


DuckDB connection closed


## Conclusion

**Download Result:**

The raw dataset is consolidated and ready for validation. 36 months of NYC
high-volume FHV (FHVHV) trip data (684M records, 18.4 GB) are combined into a
single Parquet file, with all critical fields verified as present.

**Key Findings:**
- **Coverage:** All 36 monthly files downloaded (Jan 2022 to Dec 2024)
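- **Date range:** Pickup timestamps span 2022-01-01 to 2024-12-31 (minimum and maximum only; continuity within the range is not checked in this notebook)
- **Schema:** All 8 critical pipeline fields present (presence checked; column types are listed in Section 4.2)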
- **Zone metadata:** 265 zones available for downstream analysis

**Download Approach:**
- Built monthly file URL list for 2022-2024
- Downloaded Parquet files with skip-existing logic
- Consolidated 36 files into one using DuckDB
- Downloaded and standardized zone metadata

**Outputs:**
- `data/raw/combined_fhvhv_tripdata.parquet` -- Combined dataset (18.4 GB, 684M records)
- `data/raw/zone_metadata.csv` -- Zone reference data (265 zones)

**Next Steps:**
Proceed to **01_data_validation.ipynb** to validate data quality and flag
records for analysis.


**Author:** K Flowers  
**GitHub:** [github.com/KRFlowers](https://github.com/KRFlowers)  
**Date:** December 2025


## References

- NYC Taxi and Limousine Commission. (2025). TLC Trip Record Data. https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page