# Z-Ordering

Z-Ordering is a technique for colocating related information in the same set of files. Delta Lake automatically uses this co-locality in its data-skipping algorithms, which can dramatically reduce the amount of data that Delta Lake on Apache Spark needs to read. To Z-Order data, specify the columns to order on in the `ZORDER BY` clause.
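To make the idea of "ordering on several columns at once" concrete, here is a minimal pure-Python sketch of a Z-order (Morton) key built by interleaving the bits of two integers. This illustrates the general space-filling-curve idea only; it is not taken from Delta Lake's implementation, and the helper name `interleave_bits` and the 4x4 grid of points are hypothetical.

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y into a single Z-order key."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)      # bits of x go to even positions
        key |= ((y >> i) & 1) << (2 * i + 1)  # bits of y go to odd positions
    return key


# Hypothetical (order_id, customer_id) pairs on a small 4x4 grid.
points = [(x, y) for x in range(4) for y in range(4)]

# Sorting by the interleaved key keeps points that are close in BOTH
# dimensions next to each other in the sorted sequence.
z_sorted = sorted(points, key=lambda p: interleave_bits(p[0], p[1]))

print(z_sorted[:4])  # [(0, 0), (1, 0), (0, 1), (1, 1)]
```

The first four points in Z-order form a 2x2 block, so a file holding them has a narrow min/max range on both columns rather than on just one.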

You can specify multiple columns for `ZORDER BY` as a comma-separated list. However, the effectiveness of the locality drops with each extra column. Z-Ordering on columns that do not have statistics collected on them is ineffective and a waste of resources, because data skipping requires column-local statistics such as min, max, and count. You can configure statistics collection on certain columns by reordering columns in the schema, or you can increase the number of columns to collect statistics on (in Delta Lake, through the `delta.dataSkippingNumIndexedCols` table property).
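The role of min/max statistics in data skipping can be sketched in plain Python. This is a simplified model, not Delta Lake's actual code: the `FileStats` class, the `files_to_read` helper, and the file names are hypothetical, and the value ranges simply mirror the three writes used later in this demo.

```python
from dataclasses import dataclass


@dataclass
class FileStats:
    path: str         # hypothetical file name
    min_value: int    # smallest order_id in the file
    max_value: int    # largest order_id in the file
    num_records: int  # row count


def files_to_read(files, value):
    """Keep only the files whose [min, max] range could contain `value`."""
    return [f.path for f in files if f.min_value <= value <= f.max_value]


files = [
    FileStats("part-0.parquet", 5000, 10000, 5001),
    FileStats("part-1.parquet", 7500, 12499, 5000),
    FileStats("part-2.parquet", 10000, 14999, 5000),
]

# 10000 falls inside every range, so no file can be skipped.
print(files_to_read(files, 10000))
# ['part-0.parquet', 'part-1.parquet', 'part-2.parquet']

# 6000 falls inside only the first range, so two files are skipped.
print(files_to_read(files, 6000))  # ['part-0.parquet']
```

Without statistics for a column there is no range to compare against, so every file has to be read regardless of how the data is laid out.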


This demo creates one table with three files to simulate data colocation.

In [None]:
from pyspark.sql.functions import *
from pyspark.sql.types import *

delta_table_name = 'demo.zorder_demo'
spark.sql(f"DROP TABLE IF EXISTS {delta_table_name}")

# File 1: ids 5000..10000 (the end of range() is exclusive).
# numPartitions=1 so that each write produces a single file.
spark.range(5000, 10001, 1, 1) \
    .withColumn("order_id", col("id")) \
    .withColumn("customer_id", col("id")) \
    .withColumn("info", lit("here is some data")) \
.write.format("delta").saveAsTable(delta_table_name)

# File 2: ids 7500..12499, overlapping with file 1.
spark.range(7500, 12500, 1, 1) \
    .withColumn("order_id", col("id")) \
    .withColumn("customer_id", col("id")) \
    .withColumn("info", lit("here is some data")) \
.write.mode("append").format("delta").saveAsTable(delta_table_name)

# File 3: ids 10000..14999, overlapping with files 1 and 2.
spark.range(10000, 15000, 1, 1) \
    .withColumn("order_id", col("id")) \
    .withColumn("customer_id", col("id")) \
    .withColumn("info", lit("here is some data")) \
.write.mode("append").format("delta").saveAsTable(delta_table_name)

The value 10000 appears in all three files. The **input_file_name()** function returns the path of the file in which each row is stored.

In [None]:
%%sql
SELECT * , input_file_name()  FROM zorder_demo WHERE order_id = 10000

## Checking statistics

In [None]:
schema = StructType([StructField("numRecords", IntegerType(), False),
                StructField("minValues", StringType(), False),
                StructField("maxValues", StringType(), False), 
                StructField("nullCount", StringType(), False)])

logFile = spark.read.json("Tables/zorder_demo/_delta_log/*.json")
stats_logFile = logFile.withColumn("parsed_stats", from_json(logFile["add.stats"], schema))

display(stats_logFile.select("add.path", "parsed_stats.numRecords","parsed_stats.minValues","parsed_stats.maxValues","parsed_stats.nullCount").where("add is not null"))
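The cell above calls `from_json` on `add.stats`, which reflects that the statistics of each `add` action are stored as a JSON-encoded string. The following pure-Python sketch shows the shape of that string. The `add_action` dictionary is a hypothetical, hand-written example (the path and values are made up to match the first write of this demo), not output copied from a real log.

```python
import json

# Hypothetical `add` action, shaped like an entry in a _delta_log JSON file.
# Note that `stats` is itself a JSON string nested inside the action.
add_action = {
    "path": "part-00000-hypothetical.snappy.parquet",
    "stats": json.dumps({
        "numRecords": 5001,
        "minValues": {"id": 5000, "order_id": 5000, "customer_id": 5000},
        "maxValues": {"id": 10000, "order_id": 10000, "customer_id": 10000},
        "nullCount": {"id": 0, "order_id": 0, "customer_id": 0},
    }),
}

# Decode the nested string to reach the per-column statistics.
stats = json.loads(add_action["stats"])

print(stats["numRecords"])                                          # 5001
print(stats["minValues"]["order_id"], stats["maxValues"]["order_id"])  # 5000 10000
```

These per-file minimum and maximum values are what a data-skipping check compares against a query predicate.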

By default maxFileSize is **1GB**

In [None]:
spark.conf.get("spark.microsoft.delta.optimize.maxFileSize")

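For demo purposes only, I will reduce maxFileSize to 1024*50 (about 50 KB, assuming the value is expressed in bytes) so that the optimized table is split into several small files and the effect of Z-Ordering is visible.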

In [None]:
spark.conf.set("spark.microsoft.delta.optimize.maxFileSize", 1024*50)

## Run Z-Order

In [None]:
import delta
deltaTable = delta.DeltaTable.forName(spark, "zorder_demo")
deltaTable.optimize().executeZOrderBy("order_id","customer_id")

In [None]:
%%sql
-- SQL equivalent of the Python call above
OPTIMIZE zorder_demo ZORDER BY (order_id, customer_id)

## Checking Delta log

In [None]:
display(deltaTable.history())

> Get the latest JSON file to inspect it

In [None]:
mssparkutils.fs.ls("Tables/zorder_demo/_delta_log/")
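Picking the latest commit out of a directory listing can be sketched as follows. Commit files in `_delta_log` are named with a zero-padded, 20-digit version number, so the highest number is the most recent commit. The `latest_commit` helper and the `listing` below are hypothetical; in the notebook, the names would come from the listing call above instead of a hard-coded list.

```python
import re

# A commit file is a 20-digit, zero-padded version followed by ".json".
COMMIT_RE = re.compile(r"^(\d{20})\.json$")


def latest_commit(names):
    """Return the name of the commit file with the highest version number."""
    versions = [(int(m.group(1)), n) for n in names if (m := COMMIT_RE.match(n))]
    if not versions:
        raise ValueError("no commit files found")
    return max(versions)[1]


# Hypothetical listing; the non-JSON entry is there to show the filtering.
listing = [
    "00000000000000000000.json",
    "00000000000000000001.json",
    "00000000000000000002.json",
    "00000000000000000003.json",
    "00000000000000000003.crc",
]

print(latest_commit(listing))  # 00000000000000000003.json
```

Parsing the version as an integer rather than sorting the raw names keeps the helper independent of any other files that may sit in the same directory.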

In [None]:
logFile_ = spark.read.json("Tables/zorder_demo/_delta_log/00000000000000000003*.json")

schema = StructType([StructField("numRecords", IntegerType(), False),
                StructField("minValues", StringType(), False),
                StructField("maxValues", StringType(), False), 
                StructField("nullCount", StringType(), False)])

stats_logFile_ = logFile_.withColumn("parsed_stats", from_json(logFile_["add.stats"], schema))

display(stats_logFile_.select("add.path", "parsed_stats.numRecords","parsed_stats.minValues","parsed_stats.maxValues","parsed_stats.nullCount").where("add is not null"))

Note that the values of _**order_id**_ and _**customer_id**_ are now colocated much closer to each other.

Now, rather than reading three files, the query reads only one!
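The drop from three files to one can be reproduced with a small pure-Python simulation. It is a deliberate simplification: because `order_id` and `customer_id` hold the same value in this demo, a plain sort stands in for the Z-order clustering, and `rows_per_file` is a hypothetical file size chosen for illustration rather than anything derived from maxFileSize.

```python
# The three original writes, one list of order_id values per file.
original_files = [
    list(range(5000, 10001)),
    list(range(7500, 12500)),
    list(range(10000, 15000)),
]


def min_max(files):
    """Per-file (min, max) statistics, as used for data skipping."""
    return [(min(f), max(f)) for f in files]


def files_hit(file_stats, value):
    """Count the files whose [min, max] range could contain `value`."""
    return sum(1 for lo, hi in file_stats if lo <= value <= hi)


# Simulate the rewrite: sort all rows, then cut them into new files.
all_rows = sorted(v for f in original_files for v in f)
rows_per_file = 5001  # hypothetical size, for illustration only
clustered_files = [
    all_rows[i:i + rows_per_file] for i in range(0, len(all_rows), rows_per_file)
]

print(files_hit(min_max(original_files), 10000))   # 3
print(files_hit(min_max(clustered_files), 10000))  # 1
```

After clustering, neighbouring files overlap only at their boundary values, so a point lookup usually matches a single file.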

In [None]:
%%sql
SELECT *, input_file_name() 
FROM demo.zorder_demo
WHERE order_id = 10000

> Reading from previous version to compare

In [None]:
%%sql
SELECT * , input_file_name() 
FROM zorder_demo VERSION AS OF 2 
WHERE order_id = 10000

## Clean up

In [None]:
spark.sql(f"DROP TABLE IF EXISTS {delta_table_name}")