# Optimizing Data File Layout in Delta Lake


## Introduction
Delta Lake query performance depends heavily on **how data files are organized and stored**: a good layout lets the engine skip files that cannot match a query's filters.
Three techniques are covered here:
- **Partitioning** – physically splits data into directories by a low-cardinality column
- **Z-Ordering (Multi-dimensional Clustering)** – co-locates rows with similar values in the same files
- **Liquid Clustering** – modern replacement for Z-Ordering, incremental and flexible

## Partitioning
Best for **low-cardinality columns** (like `year`, `month`, `region`) that queries frequently filter on.
Avoid partitioning on high-cardinality columns (like `user_id`): every distinct value gets its own directory, which leads to too many small files.

Issue: **Changing the partition columns** requires a **full rewrite** of the table (costly for large tables).
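
The trade-off above can be illustrated with a small pure-Python sketch. It is only a toy model of directory-style partitioning, not how Delta Lake is implemented; the helper names `partition_rows` and `prune` and the sample data are hypothetical.

```python
from collections import defaultdict


def partition_rows(rows, column):
    """Group rows into one bucket per distinct value of `column`,
    mimicking one directory per partition value."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[f"{column}={row[column]}"].append(row)
    return dict(partitions)


def prune(partitions, column, value):
    """Keep only the partitions that a filter `column = value` has to read."""
    wanted = f"{column}={value}"
    return {name: rows for name, rows in partitions.items() if name == wanted}


# Hypothetical sample data: 1,000 rows, 2 distinct years, 1,000 distinct users
rows = [{"id": i, "year": 2023 + i % 2, "user_id": f"U{i}"} for i in range(1000)]

by_year = partition_rows(rows, "year")
by_user = partition_rows(rows, "user_id")

print(len(by_year))                        # 2 partitions with 500 rows each
print(len(by_user))                        # 1000 partitions with 1 row each
print(list(prune(by_year, "year", 2024)))  # ['year=2024']
```

Partitioning by `year` yields a few large buckets and lets a filter on `year` ignore the rest, while partitioning by `user_id` produces one tiny bucket per user, which is the small-files problem in miniature.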

In [0]:
%sql
-- Create a Delta table partitioned by year
CREATE OR REPLACE TABLE sales_partitioned (
  id INT,
  product STRING,
  amount DOUBLE,
  year INT,
  month INT
)
USING DELTA
PARTITIONED BY (year); -- low-cardinality column


INSERT INTO sales_partitioned VALUES
(1, 'Laptop', 1200.50, 2023, 1),
(2, 'Tablet', 450.00, 2024, 2),
(3, 'Phone', 800.25, 2024, 5);


-- Partition pruning: this query only reads data from the partition year = 2024
SELECT * FROM sales_partitioned WHERE year = 2024;

In [0]:
%sql
-- Show the query execution plan; the scan should list a partition filter on year
EXPLAIN SELECT * FROM sales_partitioned WHERE year = 2024;

## Z-Ordering (Indexing-like)
- Co-locates related values within data files instead of creating directories
- Useful for **high-cardinality columns** (like `customer_id`, `zipcode`) that queries filter on
- Not incremental → newly written data is not Z-ordered, so `OPTIMIZE ... ZORDER BY` must be rerun regularly
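
To see why multi-dimensional clustering helps, the sketch below interleaves the bits of two integer columns into a Z-order (Morton) key, then compares how many "files" a min/max check would have to read. It illustrates the general space-filling-curve idea only; Delta Lake's actual Z-Order implementation and file statistics are more involved, and the function names and the 16 x 16 grid are assumptions made for the example.

```python
def interleave_bits(x, y, bits=8):
    """Morton (Z-order) key: interleave the low `bits` bits of x and y."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)        # x bits go to even positions
        z |= ((y >> i) & 1) << (2 * i + 1)    # y bits go to odd positions
    return z


def split_into_files(rows, rows_per_file):
    """Chunk rows into 'files' and record per-file min/max for both columns."""
    files = []
    for start in range(0, len(rows), rows_per_file):
        chunk = rows[start:start + rows_per_file]
        files.append({
            "min": (min(r[0] for r in chunk), min(r[1] for r in chunk)),
            "max": (max(r[0] for r in chunk), max(r[1] for r in chunk)),
        })
    return files


def files_to_scan(files, col, value):
    """Count files whose [min, max] range on column `col` could contain `value`."""
    return sum(1 for f in files if f["min"][col] <= value <= f["max"][col])


# Hypothetical data: every (a, b) pair on a 16 x 16 grid, 16 rows per file
rows = [(a, b) for a in range(16) for b in range(16)]

sorted_by_a = split_into_files(sorted(rows), 16)
z_ordered = split_into_files(sorted(rows, key=lambda r: interleave_bits(*r)), 16)

for name, files in [("sorted by a", sorted_by_a), ("z-ordered", z_ordered)]:
    # files read for the filter a = 5, then for the filter b = 5
    print(name, files_to_scan(files, 0, 5), files_to_scan(files, 1, 5))
# sorted by a 1 16
# z-ordered 4 4
```

A plain sort is ideal for the first column but useless for the second, whereas the Z-ordered layout reads 4 of 16 files for a filter on either column.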

In [0]:
%sql
-- Create table without partitions
CREATE OR REPLACE TABLE sales_zorder (
  id INT,
  product STRING,
  amount DOUBLE,
  customer_id STRING
) USING DELTA;


-- Insert some sample rows
INSERT INTO sales_zorder VALUES
(1, 'Laptop', 1200.50, 'CUST1001'),
(2, 'Tablet', 450.00, 'CUST1002'),
(3, 'Phone', 800.25, 'CUST2001'),
(4, 'TV', 1500.00, 'CUST2002');


-- Compact the table and Z-Order by customer_id so rows with similar ids share files
OPTIMIZE sales_zorder
ZORDER BY (customer_id);
-- On a large table, queries like `WHERE customer_id = 'CUST2001'` can now skip more files.

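In [0]:
%sql
-- Inspect the plan for a filter on the Z-Ordered column.
-- To compare before/after Z-Ordering, check the files-read metrics of the executed query.
EXPLAIN SELECT * FROM sales_zorder WHERE customer_id = 'CUST2001';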

## Liquid Clustering
- Next-gen replacement for Z-Ordering
- Clustering keys are declared at the **table level** with `CLUSTER BY`
- Supports **incremental clustering** with `OPTIMIZE`
- You can **change clustering keys** without rewriting existing data
- Cannot be combined with Partitioning or Z-Ordering on the same table
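
The incremental behaviour and the cheap key change can be pictured with a deliberately simplified model. `ToyTable` and its methods are hypothetical and exist only for this illustration; the real engine decides which files to rewrite using its own heuristics.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    """Toy model: each file remembers which keys it was clustered by (or None)."""
    cluster_keys: tuple
    files: list = field(default_factory=list)

    def insert(self, n_rows):
        # New data lands in a file that has not been clustered yet
        self.files.append({"rows": n_rows, "clustered_by": None})

    def optimize(self):
        """Cluster only files that are not clustered yet; return rows rewritten."""
        rewritten = 0
        for f in self.files:
            if f["clustered_by"] is None:
                f["clustered_by"] = self.cluster_keys
                rewritten += f["rows"]
        return rewritten

    def alter_cluster_by(self, *keys):
        # Changing keys only updates metadata; existing files are left untouched
        self.cluster_keys = keys


table = ToyTable(cluster_keys=("region", "customer_id"))
table.insert(1000)
first = table.optimize()               # clusters the initial 1000 rows
table.insert(50)
second = table.optimize()              # only the 50 new rows are rewritten
table.alter_cluster_by("customer_id")  # no data rewritten here
table.insert(20)
third = table.optimize()               # only the 20 new rows, using the new key
print(first, second, third)            # 1000 50 20
```

In this model each `optimize` call touches only data that arrived since the previous one, and changing the keys affects how later data is clustered without rewriting what is already there.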

In [0]:
%sql
-- Create a table with liquid clustering
CREATE OR REPLACE TABLE sales_liquid (
  id INT,
  product STRING,
  amount DOUBLE,
  region STRING,
  customer_id STRING
)
USING DELTA
CLUSTER BY (region, customer_id);


-- Insert rows
INSERT INTO sales_liquid VALUES
(1, 'Laptop', 1200.50, 'US', 'CUST1001'),
(2, 'Tablet', 450.00, 'EU', 'CUST1002'),
(3, 'Phone', 800.25, 'APAC', 'CUST2001'),
(4, 'TV', 1500.00, 'US', 'CUST2002');


-- Incremental clustering: rewrites only data that still needs to be clustered
OPTIMIZE sales_liquid;

In [0]:
%sql
-- Inspect the plan for a query that filters on both clustering keys
EXPLAIN SELECT * FROM sales_liquid WHERE region = 'US' AND customer_id = 'CUST2002';

### Auto Liquid Clustering
Databricks can choose clustering keys automatically, based on how the table is queried (Unity Catalog managed tables only).

In [0]:
%sql
SELECT current_catalog();

-- Switch to a UC catalog
USE CATALOG azure_databricks_ws;
USE SCHEMA default;

In [0]:
%sql
CREATE OR REPLACE TABLE azure_databricks_ws.default.sales_auto (
  id INT,
  product STRING,
  amount DOUBLE,
  customer_id STRING,
  region STRING
)
USING DELTA
CLUSTER BY AUTO;

-- Or enable it on an existing table
ALTER TABLE azure_databricks_ws.default.sales_auto CLUSTER BY AUTO;

-- Inspect the table metadata
DESCRIBE EXTENDED azure_databricks_ws.default.sales_auto;

In [0]:
%sql
-- Inspect the plan for a filtered query on the auto-clustered table
EXPLAIN SELECT * FROM azure_databricks_ws.default.sales_auto WHERE customer_id = 'CUST1001';

## Summary
- **Partitioning** – Best for low-cardinality columns, but costly to change
- **Z-Ordering** – Good for high-cardinality columns, but not incremental
- **Liquid Clustering** – Flexible, incremental, modern replacement for Z-Ordering