# COMP 3610: Big Data Analytics
## Assignment 1

Build an end-to-end data pipeline that ingests, transforms, and
analyzes the NYC Yellow Taxi Trip dataset, culminating in an interactive visualization
dashboard. This assignment integrates the skills covered in weeks 1-3 of the course: Python
data engineering, SQL querying, and data visualization.

### Part 1: Data Ingestion

#### Step 1: Download Data Files
Download `taxi_zone_lookup.csv` and `yellow_tripdata_2024-01.parquet` using the Python `requests` library.

In [4]:
import os
import requests

DATA_DIR: str = './data/raw'

# Source URLs and local filenames for the two downloads
YELLOW_TRIP_DATA_URL: str = 'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-01.parquet'
YELLOW_TRIP_FILENAME: str = 'yellow_tripdata_2024-01.parquet'

TAXI_ZONE_LOOKUP_URL: str = 'https://d37ci6vzurychx.cloudfront.net/misc/taxi_zone_lookup.csv'
TAXI_ZONE_FILENAME: str   = 'taxi_zone_lookup.csv'

# Ensure the data directory exists
os.makedirs(DATA_DIR, exist_ok=True)

# Fetch the taxi zone lookup table (small file, so read it in a single response)
try:
    # A timeout keeps the cell from hanging forever if the server stops responding
    r = requests.get(TAXI_ZONE_LOOKUP_URL, timeout=30)
    r.raise_for_status()

    with open(os.path.join(DATA_DIR, TAXI_ZONE_FILENAME), 'wb') as f:
        f.write(r.content)

    print(f'Successfully downloaded {TAXI_ZONE_FILENAME}')
except requests.RequestException as e:
    print(f'Failed to fetch {TAXI_ZONE_FILENAME}: {e}')

# Fetch the trip data, streaming it to disk in chunks so the whole file is never held in memory
try:
    # The context manager releases the connection once the stream has been consumed
    with requests.get(YELLOW_TRIP_DATA_URL, stream=True, timeout=30) as r:
        r.raise_for_status()

        with open(os.path.join(DATA_DIR, YELLOW_TRIP_FILENAME), 'wb') as f:
            for chunk in r.iter_content(chunk_size=8192):
                if chunk:  # skip empty keep-alive chunks
                    f.write(chunk)

    print(f'Successfully downloaded {YELLOW_TRIP_FILENAME}')
except requests.RequestException as e:
    print(f'Failed to fetch {YELLOW_TRIP_FILENAME}: {e}')


Successfully downloaded taxi_zone_lookup.csv
Successfully downloaded yellow_tripdata_2024-01.parquet
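
If the streamed download above is interrupted partway through, a truncated `.parquet` file is left at the final path and later steps may try to read it. One way to guard against that, sketched below, is to write to a temporary file and rename it into place only after every chunk has been written. The helper name `save_stream` and the `.part` suffix are illustrative assumptions, not part of the assignment; the function accepts any iterable of byte chunks, such as `r.iter_content(chunk_size=8192)`.

```python
import os
from typing import Iterable


def save_stream(chunks: Iterable[bytes], dest_path: str) -> int:
    """Write byte chunks to dest_path atomically; return the number of bytes written.

    Hypothetical helper: data goes to '<dest_path>.part' first and is renamed
    into place only after every chunk has been written.
    """
    tmp_path = dest_path + '.part'
    written = 0
    try:
        with open(tmp_path, 'wb') as f:
            for chunk in chunks:
                if chunk:  # skip empty keep-alive chunks
                    written += f.write(chunk)
        # Both paths are in the same directory, so the rename replaces the
        # destination in a single step
        os.replace(tmp_path, dest_path)
    except BaseException:
        # Remove the partial file so a failed download leaves nothing behind
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return written
```

With the streamed response `r` from the cell above, the call would look like `save_stream(r.iter_content(chunk_size=8192), os.path.join(DATA_DIR, YELLOW_TRIP_FILENAME))`.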


#### Step 2: Data Validation

Validate the downloaded dataset against the expected schema below.

##### Schema

| Column | Description |
|--------|-------------|
| `tpep_pickup_datetime` | Timestamp when the meter was engaged |
| `tpep_dropoff_datetime` | Timestamp when the meter was disengaged |
| `PULocationID` | TLC Taxi Zone ID for pickup location (join with lookup table) |
| `DOLocationID` | TLC Taxi Zone ID for dropoff location (join with lookup table) |
| `passenger_count` | Number of passengers (driver-entered) |
| `trip_distance` | Trip distance in miles (from taximeter) |
| `fare_amount` | Time-and-distance fare calculated by the meter |
| `tip_amount` | Tip amount (auto-populated for credit card payments only) |
| `total_amount` | Total amount charged to passengers (excludes cash tips) |
| `payment_type` | 1=Credit card, 2=Cash, 3=No charge, 4=Dispute, 5=Unknown |
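
The `payment_type` codes in the table are easier to read in later analysis once they are decoded to labels. A minimal sketch, using only the five codes listed above: the names `PAYMENT_TYPE_LABELS` and `payment_label` are illustrative, and treating any unlisted or missing code as 'Unknown' is an assumption rather than something the assignment specifies.

```python
from typing import Optional

# Code-to-label mapping copied from the schema table above
PAYMENT_TYPE_LABELS = {
    1: 'Credit card',
    2: 'Cash',
    3: 'No charge',
    4: 'Dispute',
    5: 'Unknown',
}


def payment_label(code: Optional[int]) -> str:
    """Return the label for a payment_type code (hypothetical helper).

    Assumption: codes missing from the table, including None, map to 'Unknown'.
    """
    return PAYMENT_TYPE_LABELS.get(code, 'Unknown')


print(payment_label(1))   # Credit card
print(payment_label(99))  # Unknown
```

With the trips loaded into a pandas DataFrame `df` (not shown here), `df['payment_type'].map(PAYMENT_TYPE_LABELS)` would apply the same mapping to the whole column, leaving unlisted codes as NaN.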

##### Requirements

- Verify all expected columns exist in the dataset
- Check that date columns are valid datetime types
- Report total row count and print a summary to the console
- Raise an exception or exit with an error message if validation fails

In [14]:
# Use pyarrow.parquet to read only the file's metadata (schema and row count) without loading the rows
import pyarrow.parquet as pq
import pyarrow as pa

# Read yellow trip parquet file
yt_file = pq.ParquetFile(os.path.join(DATA_DIR, YELLOW_TRIP_FILENAME))

# Report row and column counts
print(f'Rows: {yt_file.metadata.num_rows} row/s\nColumns: {yt_file.metadata.num_columns} column/s')

# Verify yellow trip data schema
try:
    required_columns = [
        'tpep_pickup_datetime', 
        'tpep_dropoff_datetime', 
        'PULocationID', 
        'DOLocationID', 
        'passenger_count', 
        'trip_distance',
        'fare_amount', 
        'tip_amount', 
        'total_amount', 
        'payment_type'
        ]
    
    schema = yt_file.schema_arrow

    # Verify that the schema contains every required column
    for col in required_columns:
        if col not in schema.names:
            raise IndexError(f'column "{col}" could not be found in the schema')
        
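    # Verify that the datetime columns are actually datetime types
    # (named explicitly so the check does not depend on the order of required_columns)
    for col in ('tpep_pickup_datetime', 'tpep_dropoff_datetime'):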
        field = schema.field(col)
        if not pa.types.is_timestamp(field.type):
            raise TypeError(f'"{col}" is not a datetime type, got {field.type}')
    
    print(f'Validated {YELLOW_TRIP_FILENAME} successfully!')

except Exception as e:
    print(f'Invalid schema for yellow trip data: {e}')
    raise  # re-raise so a failed validation stops the pipeline, as the requirements ask


Rows: 2964624 row/s
Columns: 19 column/s
Validated yellow_tripdata_2024-01.parquet successfully!
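
The check above stops at the first failure, so a file with several schema problems takes several runs to diagnose. A possible variant, sketched below, collects every problem and raises once. To stay independent of any particular Parquet library, it assumes the schema has already been reduced to a plain mapping of column name to type name (with pyarrow, something like `dict(zip(schema.names, map(str, schema.types)))`), and it assumes timestamp type names start with `timestamp`, as in pyarrow's `timestamp[us]`. The names `find_schema_problems` and `require_valid_schema` are illustrative.

```python
from typing import List, Mapping, Sequence

REQUIRED_COLUMNS = [
    'tpep_pickup_datetime', 'tpep_dropoff_datetime',
    'PULocationID', 'DOLocationID',
    'passenger_count', 'trip_distance',
    'fare_amount', 'tip_amount', 'total_amount', 'payment_type',
]
DATETIME_COLUMNS = ('tpep_pickup_datetime', 'tpep_dropoff_datetime')


def find_schema_problems(
    column_types: Mapping[str, str],
    required: Sequence[str] = REQUIRED_COLUMNS,
    datetime_columns: Sequence[str] = DATETIME_COLUMNS,
) -> List[str]:
    """Return a description of every schema problem found (empty list if none)."""
    problems = []
    for col in required:
        if col not in column_types:
            problems.append(f'missing column "{col}"')
    for col in datetime_columns:
        type_name = column_types.get(col)
        # A missing datetime column was already reported above, so skip it here.
        # Assumption: timestamp type names begin with 'timestamp'.
        if type_name is not None and not type_name.startswith('timestamp'):
            problems.append(f'"{col}" is not a datetime type, got {type_name}')
    return problems


def require_valid_schema(column_types: Mapping[str, str]) -> None:
    """Raise a single ValueError that lists all problems at once."""
    problems = find_schema_problems(column_types)
    if problems:
        raise ValueError('Invalid schema: ' + '; '.join(problems))


# Deliberately broken schema with made-up values, to show the report
example = {
    'tpep_pickup_datetime': 'string',
    'tpep_dropoff_datetime': 'timestamp[us]',
    'PULocationID': 'int32',
}
for problem in find_schema_problems(example):
    print(problem)
```

Calling `require_valid_schema(...)` instead of printing would halt the notebook on an invalid file, which matches the requirement to raise an exception when validation fails.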
