Before reading any files, I listed the contents of the raw data volume to confirm that the last-mile delivery parquet files are present and accessible. This verifies the ingestion path before any data is loaded.

In [0]:
dbutils.fs.ls("/Volumes/capstone_project/logistics/last_mile_raw")
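
The same check can be made programmatic rather than visual. The following is a minimal sketch, assuming each entry returned by dbutils.fs.ls exposes a name attribute; the helper find_missing_files and the EXPECTED_PREFIXES list are hypothetical names introduced here for illustration and are not part of the original notebook.

```python
# Hypothetical helper: report which expected city files are absent from a listing.
# The prefixes are taken from the file names used later in this notebook.
EXPECTED_PREFIXES = ["pickup_sh", "pickup_hz", "pickup_cq", "pickup_jl", "pickup_yt"]


def find_missing_files(file_names, expected_prefixes=EXPECTED_PREFIXES):
    """Return the prefixes that match no parquet file in file_names."""
    parquet_names = [name for name in file_names if name.endswith(".parquet")]
    return [
        prefix
        for prefix in expected_prefixes
        if not any(name.startswith(prefix + "-") for name in parquet_names)
    ]


# In the notebook (requires Databricks' dbutils, so it is left commented out here):
# names = [f.name for f in dbutils.fs.ls("/Volumes/capstone_project/logistics/last_mile_raw")]
# assert not find_missing_files(names), "some city files are missing"
```

An empty result means every expected city has at least one parquet file in the volume.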

In the cell below, I loaded the pickup data for five cities from parquet files into separate Spark DataFrames. Each city is read separately so that the data sources remain clear and traceable.

In [0]:
df_shanghai = spark.read.parquet(
    "/Volumes/capstone_project/logistics/last_mile_raw/pickup_sh-00000-of-00001-79fabe8088e723a2.parquet")

df_hangzhou = spark.read.parquet(
    "/Volumes/capstone_project/logistics/last_mile_raw/pickup_hz-00000-of-00001-2641abebfe50648a.parquet")

df_chongqing = spark.read.parquet(
    "/Volumes/capstone_project/logistics/last_mile_raw/pickup_cq-00000-of-00001-a172031e5392f9d3.parquet")

df_jilin = spark.read.parquet(
    "/Volumes/capstone_project/logistics/last_mile_raw/pickup_jl-00000-of-00001-9b430a56a935f284.parquet")

df_yantai = spark.read.parquet(
    "/Volumes/capstone_project/logistics/last_mile_raw/pickup_yt-00000-of-00001-6d21a4dccd28ee03.parquet")

Since the raw datasets do not contain a city column, I added one manually to each DataFrame. This preserves source context once the datasets are combined and enables city-level analysis later.

In [0]:
from pyspark.sql.functions import lit

df_shanghai = df_shanghai.withColumn("city", lit("Shanghai"))
df_hangzhou = df_hangzhou.withColumn("city", lit("Hangzhou"))
df_chongqing = df_chongqing.withColumn("city", lit("Chongqing"))
df_jilin = df_jilin.withColumn("city", lit("Jilin"))
df_yantai = df_yantai.withColumn("city", lit("Yantai"))
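
The reads and the city tagging follow the same pattern for every city, so they could also be driven by a single mapping. This is only a sketch of that alternative, not the approach used in this notebook: RAW_DIR, CITY_FILES, city_paths, and load_city_frames are hypothetical names, and load_city_frames assumes an active SparkSession is passed in.

```python
RAW_DIR = "/Volumes/capstone_project/logistics/last_mile_raw"

# City label -> parquet file name, copied from the read cell in this notebook.
CITY_FILES = {
    "Shanghai": "pickup_sh-00000-of-00001-79fabe8088e723a2.parquet",
    "Hangzhou": "pickup_hz-00000-of-00001-2641abebfe50648a.parquet",
    "Chongqing": "pickup_cq-00000-of-00001-a172031e5392f9d3.parquet",
    "Jilin": "pickup_jl-00000-of-00001-9b430a56a935f284.parquet",
    "Yantai": "pickup_yt-00000-of-00001-6d21a4dccd28ee03.parquet",
}


def city_paths(raw_dir=RAW_DIR, city_files=CITY_FILES):
    """Build the full path of each city's parquet file."""
    return {city: f"{raw_dir}/{name}" for city, name in city_files.items()}


def load_city_frames(spark):
    """Read each city's file and tag its rows with a literal city column."""
    # Imported lazily so that city_paths stays usable without Spark installed.
    from pyspark.sql.functions import lit

    return {
        city: spark.read.parquet(path).withColumn("city", lit(city))
        for city, path in city_paths().items()
    }
```

Keeping the mapping in one place means adding a sixth city is a one-line change, at the cost of losing the explicitly named per-city variables.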

After adding the city column, I merged all city-level DataFrames into a single unified DataFrame using unionByName to ensure correct column alignment.

In [0]:
df_all = (
    df_shanghai
    .unionByName(df_hangzhou)
    .unionByName(df_chongqing)
    .unionByName(df_jilin)
    .unionByName(df_yantai)
)
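
unionByName matches columns by name rather than by position, and by default it raises an error if the two DataFrames do not have the same set of columns. The chained calls can also be written as a fold over a list, which is easier to extend as more cities are added. A minimal sketch, where union_by_name is a hypothetical helper name:

```python
from functools import reduce


def union_by_name(frames):
    """Union a non-empty list of DataFrames pairwise with unionByName."""
    if not frames:
        raise ValueError("at least one DataFrame is required")
    # reduce applies unionByName left to right: ((a + b) + c) + ...
    return reduce(lambda left, right: left.unionByName(right), frames)


# In the notebook (requires the DataFrames defined in the cells above):
# df_all = union_by_name([df_shanghai, df_hangzhou, df_chongqing, df_jilin, df_yantai])
```

The result is equivalent to the chained form in the cell above; only the way the calls are expressed differs.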

Before writing the data, I explicitly set the catalog and schema to ensure the table is created in the correct project namespace.

In [0]:
%sql
USE CATALOG capstone_project;
USE SCHEMA logistics;

In the cell below, I added an ingestion timestamp to record when the data entered the system. This metadata is useful for auditing and pipeline monitoring.

In [0]:
from pyspark.sql.functions import current_timestamp

df_bronze = (
    df_all
    .withColumn("ingestion_timestamp", current_timestamp())
)

Finally, I wrote the combined dataset to a Delta table in overwrite mode, so each run replaces the table's existing contents instead of appending to them. This creates the Bronze layer, which stores raw ingested data with minimal transformation.

In [0]:
(
    df_bronze.write
    .format("delta")
    .mode("overwrite")
    .saveAsTable("bronze_last_mile_deliveries")
)