
1. Unity Catalog setup:
  * Create a metastore if one does not exist yet.
  * Attach the metastore to the workspace.
  * Create:
    * Catalog <student first name last name>_nyc_catalog
    * Schema trips_schema
    * Table raw_trips

✅ Task 2: Import, unification, and merging

1. Data import:
  * Read yellow_tripdata_*.csv and green_tripdata_*.csv into Spark DataFrames.

2. Schema unification:
  * Unify column names and data types.
  * Add a taxi_type column with the value yellow or green.

3. Anomaly filtering:
  * Remove records with:
    * Distance < 0.1 km
    * Fare < $2
    * Duration < 1 min

4. Column enrichment:
  * Add the columns:
    * pickup_hour
    * pickup_day_of_week
    * duration_min

5. JOIN with taxi_zone_lookup:
  * Add pickup_zone and dropoff_zone via a join.

6. Saving to Delta Lake:
  * Write the result to Unity Catalog.
  * Format: Delta Lake

✅ Task 3: Summary analytics (zone_summary)

1. zone_summary aggregation:
  * Create a DataFrame with the columns:
    * pickup_zone
    * total_trips
    * avg_trip_distance
    * avg_total_amount
    * avg_tip_amount
    * yellow_share / green_share
    * max_trip_distance
    * min_tip_amount
    * total_trip_amount

2. Saving the results:
  * Format: Delta
  * Table: zone_summary
  * Location: Unity Catalog + S3 path (as an external location or managed)

✅ Task 4: Additional analytical calculations

1. Aggregation by day of week:
  * Use raw_trips or zone_summary
  * Calculate:
    * Total_trips_per_day
    * Avg_duration_per_zone
    * high_fare_share (fare > $30)

2. Saving the result:
  * Table: zone_days_summary
  * Format: Delta
  * Location: Unity Catalog + S3 path (as an external location or managed)

In [0]:
spark.version

'3.5.2'

In [0]:
sc = spark.sparkContext

In [0]:
from enum import Enum


class TaxiColors(str, Enum):
    GREEN = "green"
    YELLOW = "yellow"

In [0]:
import os

TAXI_DIR_PATH = "s3a://robot-dreams-source-data/home-work-1-unified/nyc_taxi/"
YELLOW_TAXI_DIR_PATH = os.path.join(TAXI_DIR_PATH, TaxiColors.YELLOW)
GREEN_TAXI_DIR_PATH = os.path.join(TAXI_DIR_PATH, TaxiColors.GREEN)
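
The cell above builds S3 paths with os.path.join, which uses the host OS separator. A minimal sketch of a more portable alternative, assuming the same base path and enum: posixpath.join always uses "/". The helper name s3_join is hypothetical and not part of the notebook.

```python
import posixpath
from enum import Enum


class TaxiColors(str, Enum):
    GREEN = "green"
    YELLOW = "yellow"


def s3_join(base: str, *parts: str) -> str:
    """Join S3 path segments with '/' regardless of the host OS."""
    # Unwrap Enum members to their plain string values before joining.
    clean = [p.value if isinstance(p, Enum) else p for p in parts]
    return posixpath.join(base, *clean)


TAXI_DIR_PATH = "s3a://robot-dreams-source-data/home-work-1-unified/nyc_taxi/"
yellow_path = s3_join(TAXI_DIR_PATH, TaxiColors.YELLOW)
green_path = s3_join(TAXI_DIR_PATH, TaxiColors.GREEN)
print(yellow_path)
print(green_path)
```

On a Linux driver both approaches give the same result; the difference only matters if the code runs on a host with a different path separator.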

In [0]:
dbutils.fs.unmount(
  "/mnt/lavreniuk/yellow_taxi",
)

dbutils.fs.unmount(
  "/mnt/lavreniuk/green_taxi",
)

/mnt/lavreniuk/yellow_taxi has been unmounted.
/mnt/lavreniuk/green_taxi has been unmounted.


True

In [0]:
dbutils.fs.mount(
  YELLOW_TAXI_DIR_PATH,
  "/mnt/lavreniuk/yellow_taxi",
)

True

In [0]:
dbutils.fs.mount(
  GREEN_TAXI_DIR_PATH,
  "/mnt/lavreniuk/green_taxi",
)

True

In [0]:
display(dbutils.fs.ls("/mnt/lavreniuk/"))

path,name,size,modificationTime
dbfs:/mnt/lavreniuk/green_taxi/,green_taxi/,0,0
dbfs:/mnt/lavreniuk/taxi_zone_lookup.csv/,taxi_zone_lookup.csv/,0,0
dbfs:/mnt/lavreniuk/yellow_taxi/,yellow_taxi/,0,0


In [0]:
# Read all yellow and green trip files into DataFrames

from pyspark.sql.functions import col

df_yellow = (
    spark.read.options(
        mergeSchema="true",
        recursiveFileLookup="true",
    )
    .parquet("/mnt/lavreniuk/yellow_taxi")
    .withColumnRenamed("tpep_pickup_datetime", "pickup_datetime")
    .withColumnRenamed("tpep_dropoff_datetime", "dropoff_datetime")
    .withColumnRenamed("airport_fee", "fee")
)

df_green = (
    spark.read.options(
        mergeSchema="true",
        recursiveFileLookup="true",
    )
    .parquet("/mnt/lavreniuk/green_taxi")
    .withColumnRenamed("lpep_pickup_datetime", "pickup_datetime")
    .withColumnRenamed("lpep_dropoff_datetime","dropoff_datetime")
    .withColumnRenamed("Airport_fee", "fee")
    .drop("tpep_pickup_datetime", "tpep_dropoff_datetime") 
)

In [0]:
from pyspark.sql.functions import lit


df_green = df_green.withColumn("taxi_type", lit(TaxiColors.GREEN.value))
df_yellow = df_yellow.withColumn("taxi_type", lit(TaxiColors.YELLOW.value))

In [0]:
# Union by name, since the column order is not guaranteed to match

raw_trips_df = df_yellow.unionByName(df_green)
print(df_green.count() + df_yellow.count() == raw_trips_df.count(), raw_trips_df.count())

True 823006231


In [0]:
for _col in raw_trips_df.columns:
    print(_col)
    try:
        raw_trips_df.filter(col(_col).isNotNull()).select(_col).count()
    except Exception as e:
        print(e)

VendorID
passenger_count
trip_distance
RatecodeID
store_and_fwd_flag
PULocationID
DOLocationID
payment_type
fare_amount
extra
mta_tax
tip_amount
tolls_amount
improvement_surcharge
total_amount
congestion_surcharge
pickup_datetime
dropoff_datetime
fee
taxi_type


In [0]:
RAW_TRIPS_TABLE_PATH = "s3a://lavreniuk-hw2/data/raw-trips/"

raw_trips_df.write \
  .format("parquet") \
  .mode("overwrite") \
  .option("path", RAW_TRIPS_TABLE_PATH) \
  .saveAsTable("dima_lavreniuk_nyc_catalog.trips_schema.raw_trips")

In [0]:
raw_trips_table_df = spark.table("dima_lavreniuk_nyc_catalog.trips_schema.raw_trips")

In [0]:
# Drop trips with distance < 0.1 km, fare < $2, or duration < 1 min.

from pyspark.sql.functions import unix_timestamp, dayofweek, hour
from pyspark.sql.functions import round as sp_round

raw_trips_df_filtered = raw_trips_table_df.withColumns({
    "pickup_hour": hour("pickup_datetime"),
    "pickup_day_of_week": dayofweek(col("pickup_datetime")),
    # Casting to long does not work for timestamps without a time zone,
    # so the difference is taken via unix_timestamp (seconds) instead.
    "duration_min": sp_round(
        (unix_timestamp(col("dropoff_datetime")) - unix_timestamp(col("pickup_datetime"))) / 60,
        4
    ),
}).filter(
    (col("trip_distance") >= 0.1) &
    (col("total_amount") >= 2) &
    (col("duration_min") >= 1)
)

raw_trips_df_filtered.count()

750062658
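
The same filter rules can be sketched in plain Python for a single record, which is handy for checking the thresholds before running the full Spark job. This is an approximation under stated assumptions: the function names are hypothetical, timestamps are whole seconds, and Python's round may differ from Spark's round in tie cases.

```python
from datetime import datetime


def duration_min(pickup: datetime, dropoff: datetime) -> float:
    """Trip duration in minutes, rounded to 4 decimals like the Spark column."""
    return round((dropoff - pickup).total_seconds() / 60, 4)


def keep_trip(trip_distance: float, total_amount: float,
              pickup: datetime, dropoff: datetime) -> bool:
    """Mirror of the notebook filter: keep only plausible trips."""
    return (
        trip_distance >= 0.1
        and total_amount >= 2
        and duration_min(pickup, dropoff) >= 1
    )


start = datetime(2020, 1, 1, 10, 0, 0)
print(keep_trip(1.2, 10.0, start, datetime(2020, 1, 1, 10, 5, 30)))   # normal trip
print(keep_trip(0.05, 10.0, start, datetime(2020, 1, 1, 10, 5, 30)))  # too short a distance
print(keep_trip(1.2, 10.0, start, datetime(2020, 1, 1, 10, 0, 30)))   # under one minute
```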

In [0]:
TAXI_ZONE_LOOKUP_URL = os.path.join(TAXI_DIR_PATH, "taxi_zone_lookup.csv")
dbutils.fs.unmount("/mnt/lavreniuk/taxi_zone_lookup.csv")
dbutils.fs.mount(
  TAXI_ZONE_LOOKUP_URL,
  "/mnt/lavreniuk/taxi_zone_lookup.csv",
)

/mnt/lavreniuk/taxi_zone_lookup.csv has been unmounted.


True

In [0]:
taxi_zone_df = spark.read.csv(TAXI_ZONE_LOOKUP_URL, header=True)
taxi_zone_df.show(5)

+----------+-------------+--------------------+------------+
|LocationID|      Borough|                Zone|service_zone|
+----------+-------------+--------------------+------------+
|         1|          EWR|      Newark Airport|         EWR|
|         2|       Queens|         Jamaica Bay|   Boro Zone|
|         3|        Bronx|Allerton/Pelham G...|   Boro Zone|
|         4|    Manhattan|       Alphabet City| Yellow Zone|
|         5|Staten Island|       Arden Heights|   Boro Zone|
+----------+-------------+--------------------+------------+
only showing top 5 rows


In [0]:
# Join with taxi_zone_lookup.csv to add the pickup_zone and dropoff_zone fields.
from pyspark.sql.functions import broadcast

trips_df_joined = raw_trips_df_filtered.join(
        broadcast(
            taxi_zone_df.select(
                col("Zone").alias("pickup_zone"),
                col("LocationID").alias("PULocationID"),
            )
        ),
        how="left",
        on="PULocationID"
    ).join(
        broadcast(
            taxi_zone_df.select(
                col("Zone").alias("dropoff_zone"),
                col("LocationID").alias("DOLocationID"),
            )
        ),
        how="left",
        on="DOLocationID" 
    )

trips_df_joined.show(5)

+------------+------------+--------+---------------+-------------+----------+------------------+------------+-----------+-----+-------+----------+------------+---------------------+------------+--------------------+-------------------+-------------------+----+---------+-----------+------------------+------------+--------------------+----------------+
|DOLocationID|PULocationID|VendorID|passenger_count|trip_distance|RatecodeID|store_and_fwd_flag|payment_type|fare_amount|extra|mta_tax|tip_amount|tolls_amount|improvement_surcharge|total_amount|congestion_surcharge|    pickup_datetime|   dropoff_datetime| fee|taxi_type|pickup_hour|pickup_day_of_week|duration_min|         pickup_zone|    dropoff_zone|
+------------+------------+--------+---------------+-------------+----------+------------------+------------+-----------+-----+-------+----------+------------+---------------------+------------+--------------------+-------------------+-------------------+----+---------+-----------+------------

In [0]:
EXTENDED_TRIPS_TABLE_PATH = "s3a://lavreniuk-hw2/data/extended-trips/"

trips_df_joined.write \
  .format("parquet") \
  .mode("overwrite") \
  .option("path", EXTENDED_TRIPS_TABLE_PATH) \
  .saveAsTable("dima_lavreniuk_nyc_catalog.trips_schema.extended_trips")

In [0]:
extended_trips_table = spark.table("dima_lavreniuk_nyc_catalog.trips_schema.extended_trips")

In [0]:
from pyspark.sql.functions import count, avg, col
from pyspark.sql.functions import max as sp_max, min as sp_min, sum as sp_sum

zone_summary = (
    extended_trips_table.groupBy("pickup_zone").agg(
        count("*").alias("total_trips"),
        avg("trip_distance").alias("avg_trip_distance"),
        avg("total_amount").alias("avg_total_amount"),
        avg("tip_amount").alias("avg_tip_amount"),
        (avg((col("taxi_type") == "yellow").cast("int")) * 100).alias("yellow_share"),
        (avg((col("taxi_type") == "green").cast("int")) * 100).alias("green_share"),
        sp_max("trip_distance").alias("max_trip_distance"),
        sp_min("tip_amount").alias("min_tip_amount"),
    )
)

zone_summary.show(5)

+--------------------+-----------+------------------+------------------+------------------+------------------+------------------+-----------------+--------------+
|         pickup_zone|total_trips| avg_trip_distance|  avg_total_amount|    avg_tip_amount|      yellow_share|       green_share|max_trip_distance|min_tip_amount|
+--------------------+-----------+------------------+------------------+------------------+------------------+------------------+-----------------+--------------+
|Governor's Island...|       1867| 3.901039100160685| 18.74173004820567|1.9904392072844128|             100.0|               0.0|             29.1|           0.0|
|           Homecrest|      34190| 30.87262386662754| 25.87658847616139|0.7470394852295992| 36.75636150921322| 63.24363849078678|        235036.33|           0.0|
|              Corona|      64329|24.972252327876866|28.682097343342683|1.7341660837258468|49.284148673226696| 50.71585132677331|        250984.47|           0.0|
|    Bensonhurst West|
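
The yellow_share and green_share columns rely on the fact that the mean of a 0/1 indicator equals the fraction of matching rows. A minimal plain-Python sketch of that idea follows; the sample rows are made up for illustration and are not taken from the dataset. It also computes total_trip_amount, which the task lists but the aggregation above does not produce; in Spark this would presumably be sp_sum("total_amount"), using the sum alias already imported in the cell.

```python
from collections import defaultdict

# Illustrative rows only; these values are made up, not taken from the dataset.
sample_trips = [
    {"pickup_zone": "Corona", "taxi_type": "yellow", "total_amount": 10.0},
    {"pickup_zone": "Corona", "taxi_type": "yellow", "total_amount": 20.0},
    {"pickup_zone": "Corona", "taxi_type": "yellow", "total_amount": 30.0},
    {"pickup_zone": "Corona", "taxi_type": "green", "total_amount": 40.0},
    {"pickup_zone": "Homecrest", "taxi_type": "green", "total_amount": 5.0},
]


def share(rows, taxi_type):
    """Mean of a 0/1 indicator, scaled to percent."""
    flags = [int(r["taxi_type"] == taxi_type) for r in rows]
    return sum(flags) / len(flags) * 100


def summarize(trips):
    """Group trips by pickup zone and compute a few summary columns."""
    by_zone = defaultdict(list)
    for trip in trips:
        by_zone[trip["pickup_zone"]].append(trip)
    return {
        zone: {
            "total_trips": len(rows),
            "yellow_share": share(rows, "yellow"),
            "green_share": share(rows, "green"),
            "total_trip_amount": sum(r["total_amount"] for r in rows),
        }
        for zone, rows in by_zone.items()
    }


summary = summarize(sample_trips)
print(summary["Corona"])
```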

In [0]:
%sql
-- Accidentally created the tables in the wrong format
DROP TABLE `dima_lavreniuk_nyc_catalog`.`trips_schema`.`zone_summary`;
DROP TABLE `dima_lavreniuk_nyc_catalog`.`trips_schema`.`zone_days_summary`;

In [0]:
ZONE_SUMMARY_TABLE_PATH = "s3a://lavreniuk-hw2/data/zone-summary/"

zone_summary.write \
  .format("delta") \
  .mode("overwrite") \
  .option("path", ZONE_SUMMARY_TABLE_PATH) \
  .saveAsTable("dima_lavreniuk_nyc_catalog.trips_schema.zone_summary")

In [0]:
from pyspark.sql.functions import when, col

zone_days_statstic_df = (
    extended_trips_table.groupBy("pickup_day_of_week", "pickup_zone").agg(
        count("*").alias("total_trips"),
        (avg((col("fare_amount") > 30).cast("int")) * 100).alias("high_fare_share"),
    ).withColumn(
        "pickup_day_of_week", 
        (when(col("pickup_day_of_week") == 1, "Sunday")
            .when(col("pickup_day_of_week") == 2, "Monday")
            .when(col("pickup_day_of_week") == 3, "Tuesday")
            .when(col("pickup_day_of_week") == 4, "Wednesday")
            .when(col("pickup_day_of_week") == 5, "Thursday")
            .when(col("pickup_day_of_week") == 6, "Friday")
            .when(col("pickup_day_of_week") == 7, "Saturday")
        )
    )
)

zone_days_statstic_df.show(5)

+------------------+-------------------+-----------+------------------+
|pickup_day_of_week|        pickup_zone|total_trips|   high_fare_share|
+------------------+-------------------+-----------+------------------+
|           Tuesday|   Prospect Heights|      30788| 7.571131609718072|
|           Tuesday|      Dyker Heights|       2402|27.352206494587843|
|           Tuesday|  Crotona Park East|       2641|22.832260507383566|
|         Wednesday|Lincoln Square East|    3208239| 2.859387969537182|
|         Wednesday|  Crotona Park East|       2633|21.800227876946447|
+------------------+-------------------+-----------+------------------+
only showing top 5 rows
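
The when chain above depends on Spark's dayofweek numbering, where 1 is Sunday and 7 is Saturday. A small plain-Python sketch of that convention, with a lookup table in place of the chained conditions, is shown here; the names DAY_NAMES and spark_dayofweek are hypothetical helpers for illustration.

```python
from datetime import date

# Spark's dayofweek convention: 1 = Sunday, ..., 7 = Saturday.
DAY_NAMES = {
    1: "Sunday",
    2: "Monday",
    3: "Tuesday",
    4: "Wednesday",
    5: "Thursday",
    6: "Friday",
    7: "Saturday",
}


def spark_dayofweek(d: date) -> int:
    """Convert Python's isoweekday (Mon=1..Sun=7) to Spark's numbering."""
    return d.isoweekday() % 7 + 1


for d in (date(2024, 1, 1), date(2024, 1, 6), date(2024, 1, 7)):
    print(d.isoformat(), spark_dayofweek(d), DAY_NAMES[spark_dayofweek(d)])
```

A lookup table like this keeps the number-to-name mapping in one place, which makes an off-by-one in the day numbering easier to spot than in a chain of conditions.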


In [0]:
ZONE_DAYS_SUMMARY_TABLE_PATH = "s3a://lavreniuk-hw2/data/zone-days-summary/"

zone_days_statstic_df.write \
  .format("delta") \
  .mode("overwrite") \
  .option("path", ZONE_DAYS_SUMMARY_TABLE_PATH) \
  .saveAsTable("dima_lavreniuk_nyc_catalog.trips_schema.zone_days_summary")

In [0]:
%sql

select * from dima_lavreniuk_nyc_catalog.trips_schema.zone_summary limit 5;

pickup_zone,total_trips,avg_trip_distance,avg_total_amount,avg_tip_amount,yellow_share,green_share,max_trip_distance,min_tip_amount
Governor's Island/Ellis Island/Liberty Island,1867,3.901039100160685,18.74173004820567,1.9904392072844128,100.0,0.0,29.1,0.0
Homecrest,34190,30.87262386662754,25.87658847616139,0.7470394852295992,36.75636150921322,63.24363849078678,235036.33,0.0
Corona,64329,24.972252327876863,28.682097343342683,1.7341660837258468,49.284148673226696,50.71585132677331,250984.47,0.0
Bensonhurst West,34719,55.11959993087384,29.322031740544546,0.8070805610760681,39.74768858550073,60.25231141449928,350696.98,0.0
Westerleigh,801,9.987215980024969,44.46600499375781,2.9903245942571783,76.02996254681648,23.97003745318352,124.4,0.0


In [0]:
%sql

GRANT ALL PRIVILEGES ON CATALOG `dima_lavreniuk_nyc_catalog` TO `deniskulemza1@gmail.com`;