# 🧭 NYC Yellow Taxi Data Analysis with Trino and Python

This notebook demonstrates how to connect to a **Trino cluster** from Python, query the **NYC Yellow Taxi dataset**, and perform exploratory data analysis (EDA) using **Pandas** and **Altair**.

The workflow includes:
1. Setting up a Trino SQLAlchemy connection  
2. Inspecting data schema and available tables  
3. Sampling and querying data efficiently  
4. Performing time-based feature engineering  
5. Visualizing relationships and trends

## 🛠️ 1. Setup and Imports

We import the required libraries:
- `sqlalchemy` for Trino SQL connections (the `trino://` dialect comes from the `trino` Python client, installed as `trino[sqlalchemy]`)  
- `pandas` for data manipulation
- `altair` for visualization

In [None]:
from sqlalchemy import create_engine
import pandas as pd
import altair as alt
# Use Altair's default renderer; note that it loads the Vega JS libraries from a CDN rather than embedding them
alt.renderers.enable('default')

## 🔌 2. Connect to Trino

We connect to the Trino cluster using SQLAlchemy.

Connection details:
- **User:** `trino`  
- **Host:** `trino-default.okdp.sandbox`  
- **Catalog:** `lakehouse`  
- **Schema:** `nyc_tripdata`  
- **Protocol:** HTTPS with `verify=False` (disabled cert verification)

In [None]:
engine = create_engine(
    "trino://trino@trino-default.okdp.sandbox/lakehouse/nyc_tripdata",
    connect_args={"http_scheme": "https", "verify": False}
)
engine
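
The connection URL packs the user, host, catalog, and schema listed above into a single string. As a minimal sketch of that layout, the helper below assembles the URL from its parts; `build_trino_url` and its optional `port` argument are hypothetical names introduced here for illustration, not part of SQLAlchemy or the Trino client.

```python
from urllib.parse import quote


def build_trino_url(user, host, catalog, schema, port=None):
    """Assemble a trino:// URL from its parts (hypothetical helper)."""
    # Percent-encode the user name in case it contains reserved characters.
    netloc = f"{quote(user, safe='')}@{host}"
    if port is not None:
        netloc += f":{port}"
    # Layout: trino://<user>@<host>[:<port>]/<catalog>/<schema>
    return f"trino://{netloc}/{catalog}/{schema}"


url = build_trino_url(
    "trino", "trino-default.okdp.sandbox", "lakehouse", "nyc_tripdata"
)
print(url)
```

The resulting string can be passed to `create_engine` together with the same `connect_args` used in the cell above.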

## 🧱 3. Inspect Table Schema

We use `DESCRIBE` to explore the `yellow` taxi table and verify available columns and data types.

In [None]:
pd.read_sql("DESCRIBE lakehouse.nyc_tripdata.yellow", engine)

## 📋 4. List Available Tables

We can list all tables in the schema using the `SHOW TABLES` command.

In [None]:
tables = pd.read_sql("SHOW TABLES FROM lakehouse.nyc_tripdata", engine)
print(f"Loaded {len(tables)} tables from schema nyc_tripdata")
tables.head()

## 🔍 5. Query a Random Sample

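We use a **daily random sampling** strategy: a SQL window function numbers the trips within each pickup date in random order, and we keep the first rows of each day.  
Compared with `TABLESAMPLE`, this gives more even coverage across days, although `rand()` means each run returns a different sample.

We extract up to **100 random trips per day** across the first three months of 2025 (roughly 9,000 candidate rows), then cap the result at 3,000 rows to keep the transfer to pandas small. Because `LIMIT` has no `ORDER BY`, the rows it keeps are arbitrary, so the final sample is not guaranteed to include every day.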


In [None]:
query = """
    WITH daily_sample AS (
      SELECT
        *,
        row_number() OVER (
          PARTITION BY date(tpep_pickup_datetime)
          ORDER BY rand()
        ) AS rn
      FROM lakehouse.nyc_tripdata.yellow
      WHERE month IN ('2025-01', '2025-02', '2025-03')
    )
    SELECT *
    FROM daily_sample
    WHERE rn <= 100
    LIMIT 3000
"""

df = pd.read_sql(query, engine)
df.head()

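✅ **Notes:**
- Uses `row_number()` and `rand()` to pick a **random subset of each day's trips**; the sample changes on every run.  
- The `month` filter restricts data to **Q1 2025** for seasonal exploration.  
- The `LIMIT` cap bounds the number of rows returned to pandas; Trino still evaluates the window function over all Q1 rows, and the rows kept by `LIMIT` are arbitrary.  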

💡 **Tip:**  
- You can adjust `rn <= 100` to control how many random records per day are included (e.g., `rn <= 10` for faster, smaller sampling).
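
A `LIMIT` without `ORDER BY` returns an arbitrary subset, so capping the query above at 3,000 rows can drop whole days. One way to keep every day represented is to derive the per-day cap from the overall row budget and skip the `LIMIT`. The sketch below illustrates the idea with Python's built-in `sqlite3` standing in for Trino (so `random()` replaces `rand()`); the `trips` table, its columns, and the budget of 30 rows are synthetic assumptions made for this example.

```python
import sqlite3


def sample_per_day(conn, per_day):
    """Return up to `per_day` randomly chosen trips for each pickup date."""
    query = """
        WITH daily_sample AS (
          SELECT
            trip_id,
            date(pickup) AS pickup_date,
            row_number() OVER (
              PARTITION BY date(pickup)
              ORDER BY random()
            ) AS rn
          FROM trips
        )
        SELECT trip_id, pickup_date
        FROM daily_sample
        WHERE rn <= ?
    """
    return conn.execute(query, (per_day,)).fetchall()


# Synthetic stand-in table: 3 days with 50 trips each.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (trip_id INTEGER, pickup TEXT)")
rows = [
    (day * 100 + i, f"2025-01-{day:02d} {i % 24:02d}:00:00")
    for day in (1, 2, 3)
    for i in range(50)
]
conn.executemany("INSERT INTO trips VALUES (?, ?)", rows)

total_budget = 30                  # hypothetical overall row budget
n_days = 3                         # number of distinct pickup dates
per_day = total_budget // n_days   # equal share of the budget per day

sample = sample_per_day(conn, per_day)

# Tally the sampled rows per day to confirm the sample is balanced.
per_day_counts = {}
for _, pickup_date in sample:
    per_day_counts[pickup_date] = per_day_counts.get(pickup_date, 0) + 1
print(per_day_counts)
```

Applied to the Trino query, the same idea would mean replacing `rn <= 100` with a cap of about 33 (3,000 rows over roughly 90 days) and dropping the `LIMIT`.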


## 🕒 6. Time-Based Feature Engineering

We extract the **hour of day** and **day of week** for temporal analysis.

In [None]:
df['tpep_pickup_datetime'] = pd.to_datetime(df['tpep_pickup_datetime'])
df['hour'] = df['tpep_pickup_datetime'].dt.hour
df['day'] = df['tpep_pickup_datetime'].dt.day_name()

df[['tpep_pickup_datetime', 'hour', 'day']].head()

In [None]:
df['day'].value_counts()
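
By default, `value_counts()` orders the days by frequency rather than by calendar position. As a sketch of one alternative, the example below stores the day names as an ordered categorical so that counts come back Monday through Sunday. The timestamps are synthetic and the `WEEKDAYS` constant is a name introduced here for illustration.

```python
import pandas as pd

WEEKDAYS = [
    "Monday", "Tuesday", "Wednesday", "Thursday",
    "Friday", "Saturday", "Sunday",
]

# Synthetic pickup times for illustration only.
pickups = pd.to_datetime(pd.Series([
    "2025-01-06 08:15:00",   # a Monday
    "2025-01-11 23:40:00",   # a Saturday
    "2025-01-12 00:05:00",   # a Sunday
]))

features = pd.DataFrame({
    "hour": pickups.dt.hour,
    # An ordered categorical keeps the calendar order of the weekdays.
    "day": pd.Categorical(pickups.dt.day_name(), categories=WEEKDAYS, ordered=True),
})

# sort_index() follows the category order instead of the alphabet.
counts = features["day"].value_counts().sort_index()
print(counts)
```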

## 🧹 7. Data Quality Check

We look for anomalous passenger counts: trips recorded with zero passengers, including those that still charged a fare. We then keep only trips with at least one passenger.

In [None]:
df.query("passenger_count == 0")[['trip_distance', 'fare_amount', 'total_amount']].describe()

In [None]:
df.query("passenger_count == 0 and fare_amount > 0")[['tpep_pickup_datetime', 'trip_distance', 'fare_amount']]

In [None]:
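# Keep trips with at least one passenger; this also drops rows where passenger_count is missing (NaN)
df = df[df['passenger_count'] > 0]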
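
The filter `passenger_count > 0` removes two different kinds of rows: those with a count of zero and those where the count is missing, because a comparison against `NaN` evaluates to `False`. The sketch below separates the two cases on a small synthetic frame; the values are made up for illustration and are not taken from the taxi table.

```python
import pandas as pd

# Synthetic rows for illustration only.
trips = pd.DataFrame({
    "passenger_count": [1.0, 0.0, None, 2.0],
    "fare_amount": [12.5, 8.0, 20.0, 15.0],
})

# Count each category separately before filtering.
quality = pd.Series({
    "missing": int(trips["passenger_count"].isna().sum()),
    "zero": int((trips["passenger_count"] == 0).sum()),
    "valid": int((trips["passenger_count"] > 0).sum()),
})
print(quality)

# The same filter as in the notebook: zero and missing counts both drop out.
kept = trips[trips["passenger_count"] > 0]
print(len(kept))
```

Reporting the two counts separately before filtering makes it easier to tell how many rows were removed for each reason.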

## 📊 8. Visualize Fare vs. Distance

We create a scatterplot using **Altair**, mapping:
- X-axis → trip distance  
- Y-axis → fare amount  
- Color → passenger count  

Each point represents one taxi trip.

In [None]:
highlight = alt.selection_point(fields=['passenger_count'], bind='legend')

chart = (
    alt.Chart(df)
    .mark_circle(size=40)
    .encode(
        x='trip_distance:Q',
        y='fare_amount:Q',
        color=alt.condition(
            highlight,
            alt.Color('passenger_count:O', scale=alt.Scale(scheme='tableau10')),
            alt.value('lightgray')
        ),
        tooltip=['tpep_pickup_datetime', 'trip_distance', 'fare_amount', 'passenger_count']
    )
    .add_params(highlight)
    .properties(title='NYC Yellow Taxi: Interactive Highlight by Passenger Count')
)
chart

💡 **Interaction:**
- Click a passenger count in the legend to highlight its trips; the other points turn light gray.

🎨 **Interpretation:**
- Fares increase roughly linearly with distance.
- Outliers may indicate fixed-fare routes (e.g., airport trips).
- Dense regions near zero could reflect short-distance rides.
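
To put a number on "roughly linear", one option is an ordinary least-squares line through the points and a look at how far individual trips fall from it. The following is a minimal sketch in plain Python; `fit_line` and `residuals` are hypothetical helpers, and the distances and fares are synthetic values chosen to lie on an exact line, not figures from the dataset.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def residuals(xs, ys, slope, intercept):
    """Observed value minus the value predicted by the fitted line."""
    return [y - (slope * x + intercept) for x, y in zip(xs, ys)]


# Synthetic values for illustration only: fare = 3.00 + 2.50 * distance.
distances = [1.0, 2.0, 3.0, 4.0]
fares = [5.5, 8.0, 10.5, 13.0]
slope, intercept = fit_line(distances, fares)
print(slope, intercept)

# A hypothetical flat-fare trip sits far from the fitted line.
flat_fare_residual = residuals([17.0], [70.0], slope, intercept)[0]
print(flat_fare_residual)
```

On the sampled trips, the same calculation could use `df['trip_distance']` and `df['fare_amount']` converted to lists, and trips with large residuals would be candidates for fixed-fare routes.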

## ⏰ 9. Trips by Hour of Day

In [None]:
# Count trips per pickup hour; size() counts rows even when trip_distance is missing
hourly = (
    df.groupby('hour')
    .size()
    .reset_index(name='trip_count')
)

alt.Chart(hourly).mark_bar(color="#4C78A8").encode(
    x=alt.X('hour:O', title='Hour of Day'),
    y=alt.Y('trip_count:Q', title='Number of Trips'),
    tooltip=[
        alt.Tooltip('hour:O', title='Hour of Day'),
        alt.Tooltip('trip_count:Q', title='Trips')
    ]
).properties(
    title='NYC Trips by Hour of Day'
)


## 📅 10. Average Fare by Day of Week

In [None]:
daily = (
    df.groupby('day')['fare_amount']
    .mean()
    .reset_index()
    .sort_values(by='fare_amount', ascending=False)
)

alt.Chart(daily).mark_bar(color="#F58518").encode(
    x=alt.X('day:N', sort=['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']),
    y=alt.Y('fare_amount:Q', title='Average Fare ($)'),
    tooltip=[
        alt.Tooltip('day:N', title='Day of Week'),
        alt.Tooltip('fare_amount:Q', title='Average Fare ($)', format='.2f')
    ]
).properties(
    title='Average NYC Taxi Fare by Day of Week'
)

## ✅ 11. Summary
- Connected to **Trino** over HTTPS using SQLAlchemy (certificate verification disabled)
- Queried and sampled NYC Taxi trip data efficiently
- Performed data quality checks and removed anomalies
- Derived time-based features and visual insights