### Analyze the Query Plan

In [0]:
# Analyze how Spark plans to execute the query on your specific table
print("Analyzing Query Plan for Original Table:")
spark.sql("SELECT * FROM workspace.ecommerce_analysis.silver_events WHERE event_type='purchase'").explain(True)

# View the physical storage details
display(spark.sql("DESCRIBE DETAIL workspace.ecommerce_analysis.silver_events").select("location", "numFiles", "sizeInBytes"))

Analyzing Query Plan for Original Table:
== Parsed Logical Plan ==
'Project [*]
+- 'Filter ('event_type = purchase)
   +- 'UnresolvedRelation [workspace, ecommerce_analysis, silver_events], [], false

== Analyzed Logical Plan ==
event_timestamp: timestamp, event_type: string, product_id: int, product_name: string, category_code: string, brand: string, price: decimal(10,2), user_id: int, user_session: string, ingestion_ts: timestamp
Project [event_timestamp#13197, event_type#13198, product_id#13199, product_name#13200, category_code#13201, brand#13202, price#13203, user_id#13204, user_session#13205, ingestion_ts#13206]
+- Filter (event_type#13198 = purchase)
   +- SubqueryAlias workspace.ecommerce_analysis.silver_events
      +- Relation workspace.ecommerce_analysis.silver_events[event_timestamp#13197,event_type#13198,product_id#13199,product_name#13200,category_code#13201,brand#13202,price#13203,user_id#13204,user_session#13205,ingestion_ts#13206] parquet

== Optimized Logical Plan ==


location,numFiles,sizeInBytes
,8,1185263493
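
The `explain(True)` output above stops at an empty `== Optimized Logical Plan ==` section, so the optimized and physical plans are not visible here. As a minimal sketch, assuming you have the plan text available as a plain string, the hypothetical helper `split_plan_sections` below splits it on the `== ... ==` headers and reports which sections came back empty. The sample string is abbreviated from the output above.

```python
import re

def split_plan_sections(plan_text: str) -> dict:
    """Split explain-style text into {section name: section body}."""
    bodies = {}
    current = None
    for line in plan_text.splitlines():
        header = re.fullmatch(r"== (.+) ==", line.strip())
        if header:
            current = header.group(1)
            bodies[current] = []
        elif current is not None:
            bodies[current].append(line)
    return {name: "\n".join(lines).strip() for name, lines in bodies.items()}

# Abbreviated from the cell output above; not a full plan.
sample = """== Parsed Logical Plan ==
'Project [*]
+- 'Filter ('event_type = purchase)

== Optimized Logical Plan ==
"""

sections = split_plan_sections(sample)
empty_sections = [name for name, body in sections.items() if not body]
print("Empty sections:", empty_sections)
```

An empty section usually means the output was cut off during export rather than that Spark produced no plan, so rerunning the cell is the first thing to try.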


### Physical Data Partitioning

In [0]:
# Create a date-partitioned copy of the table, deriving event_date from event_timestamp
spark.sql("""
  CREATE TABLE IF NOT EXISTS workspace.ecommerce_analysis.silver_events_optimized
  USING DELTA
  PARTITIONED BY (event_date)
  AS SELECT 
    *, 
    to_date(event_timestamp) AS event_date 
  FROM workspace.ecommerce_analysis.silver_events
""")

print("Table created with new event_date column and partitioned.")

Table created with new event_date column and partitioned.
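
Before settling on daily partitions, it is worth estimating how much data each partition will hold. The sketch below divides the `sizeInBytes` value reported by `DESCRIBE DETAIL` above by an assumed number of distinct dates. The value `ASSUMED_DISTINCT_DATES = 30` is purely hypothetical (the real count depends on your data), the byte count is for the original table rather than the new one, and the 1 GiB threshold reflects general Databricks guidance on minimum partition size as I understand it, not something measured in this notebook.

```python
TABLE_BYTES = 1_185_263_493      # sizeInBytes from DESCRIBE DETAIL on the original table
ASSUMED_DISTINCT_DATES = 30      # hypothetical; replace with the real distinct event_date count
MIN_PARTITION_BYTES = 1024 ** 3  # assumed guideline: roughly 1 GiB per partition

def avg_partition_bytes(total_bytes: int, num_partitions: int) -> float:
    """Average bytes per partition, assuming data is spread evenly."""
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    return total_bytes / num_partitions

avg = avg_partition_bytes(TABLE_BYTES, ASSUMED_DISTINCT_DATES)
too_small = avg < MIN_PARTITION_BYTES
print(f"Average partition size: {avg / 1024 ** 2:.1f} MiB (below guideline: {too_small})")
```

If the estimate comes out far below the guideline, partitioning may produce many small files, and Z-ordering alone on an unpartitioned table is worth benchmarking as an alternative.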


### Multi-Dimensional Clustering (Z-ORDER)

In [0]:
# Compact files and co-locate rows with similar user_id and product_id values within each partition
spark.sql("OPTIMIZE workspace.ecommerce_analysis.silver_events_optimized ZORDER BY (user_id, product_id)")

print("ZORDER optimization applied.")

ZORDER optimization applied.
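
To build intuition for what Z-ordering does, the sketch below shows the underlying idea in plain Python: interleave the bits of two integers into a single sort key so that rows close in both dimensions end up close together in sort order. This is a conceptual illustration only. The helper `interleave_bits` is hypothetical, and Delta's actual `OPTIMIZE ... ZORDER BY` implementation is not shown in this notebook and may differ in its details.

```python
def interleave_bits(x: int, y: int, bits: int = 32) -> int:
    """Interleave the low `bits` bits of x and y into one Morton (Z-order) key."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)      # bits of x land on even positions
        key |= ((y >> i) & 1) << (2 * i + 1)  # bits of y land on odd positions
    return key

# Toy (user_id, product_id) pairs on a 4 x 4 grid.
rows = [(user_id, product_id) for user_id in range(4) for product_id in range(4)]
rows.sort(key=lambda r: interleave_bits(r[0], r[1]))

# The first four rows form the 2 x 2 block where both ids are 0 or 1.
print(rows[:4])
```

Because neighbouring keys share a small range of both ids, each data file ends up with narrow min/max statistics on both columns, which is what lets a filter on either column skip files.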


### Benchmark the Performance

In [0]:
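import time

# Time the same point lookup on both tables. user_id is a ZORDER column, not the
# partition column, so any speed-up should come from file skipping, not partition pruning.
# Each query runs only once, so the timings include planning overhead and run-to-run noise.

# Benchmark the original table
start = time.perf_counter()
spark.sql("SELECT * FROM workspace.ecommerce_analysis.silver_events WHERE user_id=12345").count()
print(f"Original Table Time: {time.perf_counter()-start:.2f}s")

# Benchmark the optimized table (the one you just created)
start = time.perf_counter()
spark.sql("SELECT * FROM workspace.ecommerce_analysis.silver_events_optimized WHERE user_id=12345").count()
print(f"Optimized Table Time: {time.perf_counter()-start:.2f}s")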

Original Table Time: 0.63s
Optimized Table Time: 0.54s
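
A 0.09s difference from a single run each is small enough to be noise. As a minimal sketch, the hypothetical helper `time_runs` below repeats an action several times after a warm-up and reports the median, minimum, and maximum. It uses a stand-in workload so that it runs without a Spark session.

```python
import statistics
import time
from typing import Callable

def time_runs(action: Callable[[], object], repeats: int = 5, warmups: int = 1) -> dict:
    """Run `action` repeatedly and summarize wall-clock time in seconds."""
    for _ in range(warmups):
        action()  # discard warm-up runs so one-off startup costs are excluded
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        action()
        samples.append(time.perf_counter() - start)
    return {
        "median_s": statistics.median(samples),
        "min_s": min(samples),
        "max_s": max(samples),
        "runs": len(samples),
    }

# Stand-in workload; in the notebook this would be a Spark query instead.
result = time_runs(lambda: sum(range(100_000)), repeats=5)
print(result)
```

In the notebook, the action would be something like `lambda: spark.sql("SELECT ... WHERE user_id=12345").count()` for each table. Note that repeated runs of the same query can also be served from caches, so comparing medians across several different `user_id` values gives a fairer picture.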


### Serverless Auto-Caching

In [0]:
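# Serverless compute does not support .cache() or .persist() on DataFrames.
# Instead, run a warm-up query so that later reads can benefit from automatic caching.
# Note: a bare count() may be answered from table metadata without reading the data files;
# if so, a query that reads the columns you actually need is a more reliable warm-up.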
spark.table("workspace.ecommerce_analysis.silver_events_optimized").count()

print("Optimized table is ready. Serverless compute will handle caching automatically.")

Optimized table is ready. Serverless compute will handle caching automatically.
