# Delta Lake Lab 
## Unit 7: ZORDER & DATA SKIPPING

In the previous unit, we:
1. Learned how to time travel

In this unit, we will:
1. Learn about Z-Ordering and how it further optimizes data skipping

Z-Ordering is a multi-dimensional clustering technique that colocates related information in the same set of files. Delta Lake's data-skipping algorithms use this co-locality automatically, which can dramatically reduce the amount of data that Delta Lake on Apache Spark needs to read. To Z-Order data, specify the columns to order on in the `ZORDER BY` clause of the `OPTIMIZE` command.
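To build intuition for how a Z-order curve keeps nearby values together, here is a minimal sketch of classic bit interleaving (a Morton code) for two small integer columns. This is an illustration only: the helper name `interleave_bits` and the 4x4 grid are assumptions made for the example, and Delta Lake's internal implementation may differ in how it maps column values to curve positions.

```python
def interleave_bits(x: int, y: int, bits: int = 8) -> int:
    """Interleave the low `bits` bits of x and y into one Z-order value."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)      # bits of x go to even positions
        z |= ((y >> i) & 1) << (2 * i + 1)  # bits of y go to odd positions
    return z


# Sort a small 4x4 grid of (x, y) points along the Z-order curve.
points = [(x, y) for x in range(4) for y in range(4)]
z_sorted = sorted(points, key=lambda p: interleave_bits(p[0], p[1]))

# The first four points form one 2x2 block: close in both x and y.
print(z_sorted[:4])  # [(0, 0), (1, 0), (0, 1), (1, 1)]
```

If rows are written to files in this order, each file tends to cover a narrow range of both columns at once, which is what makes per-file minimum and maximum values useful for skipping.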

Data skipping information is collected automatically when you write data into a Delta Lake table. At query time, Delta Lake uses this information (minimum and maximum values for each column) to skip files that cannot contain matching rows, which makes queries faster. You do not need to configure data skipping; the feature is activated whenever applicable. However, its effectiveness depends on the layout of your data. For best results, apply Z-Ordering.
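The sketch below shows the idea behind min/max data skipping in plain Python, without Spark. The `FileStats` class, the `files_to_scan` function, the file names, and the clustered value ranges are all hypothetical and exist only for this illustration; it is not how Delta Lake implements pruning internally.

```python
from dataclasses import dataclass


@dataclass
class FileStats:
    path: str        # hypothetical file name
    min_state: str   # minimum addr_state value in the file
    max_state: str   # maximum addr_state value in the file


def files_to_scan(files, wanted):
    """Keep only the files whose [min, max] range could contain `wanted`."""
    return [f.path for f in files if f.min_state <= wanted <= f.max_state]


# Unclustered layout: every file spans the full range of states.
unclustered = [
    FileStats("a.parquet", "AK", "WY"),
    FileStats("b.parquet", "AK", "WY"),
]

# Clustered layout: each file covers a narrow, disjoint range.
clustered = [
    FileStats("c.parquet", "AK", "MD"),
    FileStats("d.parquet", "ME", "WY"),
]

print(files_to_scan(unclustered, "CA"))  # ['a.parquet', 'b.parquet']
print(files_to_scan(clustered, "CA"))    # ['c.parquet']
```

With the unclustered layout no file can be ruled out, so the statistics do not help. Once the data is clustered on `addr_state`, the same filter reads only one of the two files.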

### 1. Imports

In [1]:
import pandas as pd

from pyspark.sql.functions import month, date_format
from pyspark.sql.types import IntegerType
from pyspark.sql import SparkSession

from delta.tables import *

import warnings
warnings.filterwarnings('ignore')

### 2. Create a Spark session powered by Cloud Dataproc 

In [2]:
spark = SparkSession.builder.appName('Loan Analysis').getOrCreate()
spark

22/11/01 22:12:07 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.


### 3. Declare variables

In [3]:
project_id_output = !gcloud config list --format "value(core.project)" 2>/dev/null
PROJECT_ID = project_id_output[0]
print("PROJECT_ID: ", PROJECT_ID)

PROJECT_ID:  delta-lake-lab


In [4]:
project_name_output = !gcloud projects describe $PROJECT_ID | grep name | cut -d':' -f2 | xargs
PROJECT_NAME = project_name_output[0]
print("PROJECT_NAME: ", PROJECT_NAME)

PROJECT_NAME:  delta-lake-lab


In [5]:
project_number_output = !gcloud projects describe $PROJECT_ID | grep projectNumber | cut -d':' -f2 | xargs
PROJECT_NUMBER = project_number_output[0]
print("PROJECT_NUMBER: ", PROJECT_NUMBER)

PROJECT_NUMBER:  885979867746


In [None]:
ACCOUNT_NAME = "akhjain"

In [6]:
DATA_LAKE_ROOT_PATH= f"gs://dll-data-bucket-{PROJECT_NUMBER}-{ACCOUNT_NAME}"
DELTA_LAKE_DIR_ROOT = f"{DATA_LAKE_ROOT_PATH}/delta-consumable"
print(DELTA_LAKE_DIR_ROOT)

gs://dll-data-bucket-885979867746/delta-consumable


In [7]:
# Let's take a look at the data lake before Z-Ordering
!gsutil ls -r $DELTA_LAKE_DIR_ROOT/part*

gs://dll-data-bucket-885979867746/delta-consumable/part-00000-0bd1b0e2-4659-438c-be86-355bb31b4ac9-c000.snappy.parquet
gs://dll-data-bucket-885979867746/delta-consumable/part-00000-10887990-f51d-46ad-a768-2f8135d5fd95-c000.snappy.parquet
gs://dll-data-bucket-885979867746/delta-consumable/part-00000-2517e153-6923-4599-8b2e-5746fcf10973-c000.snappy.parquet
gs://dll-data-bucket-885979867746/delta-consumable/part-00000-363ee39f-ff75-4e4d-8938-d866d43955ca-c000.snappy.parquet
gs://dll-data-bucket-885979867746/delta-consumable/part-00000-72b5547b-66aa-4431-8822-5dff110e3c01-c000.snappy.parquet
gs://dll-data-bucket-885979867746/delta-consumable/part-00000-99635b87-214b-4afa-895e-8c124bf54009-c000.snappy.parquet
gs://dll-data-bucket-885979867746/delta-consumable/part-00000-9a73bff5-5ed8-4ce2-9383-b6f2865cbc6a-c000.snappy.parquet
gs://dll-data-bucket-885979867746/delta-consumable/part-00000-b28e7cf4-80c3-4054-8cef-3cff4fe805ea-c000.snappy.parquet
gs://dll-data-bucket-885979867746/delta-consumab

The author's output was:
```
gs://dll-data-bucket-885979867746/delta-consumable/:
gs://dll-data-bucket-885979867746/delta-consumable/part-00000-228eb829-1144-4a6d-a0e2-4fd39d9e9f57-c000.snappy.parquet
gs://dll-data-bucket-885979867746/delta-consumable/part-00000-29d7051f-ea28-4bd9-a8fd-8d9f8e38f163-c000.snappy.parquet
gs://dll-data-bucket-885979867746/delta-consumable/part-00000-3eb98e04-4353-4e5f-a8a2-17a570111981-c000.snappy.parquet
gs://dll-data-bucket-885979867746/delta-consumable/part-00000-47c79787-64f2-453c-8474-52cbcaeef3c2-c000.snappy.parquet
gs://dll-data-bucket-885979867746/delta-consumable/part-00000-48902d8f-0572-4735-b0d7-95b1927cb294-c000.snappy.parquet
gs://dll-data-bucket-885979867746/delta-consumable/part-00000-598882f5-6a6b-489f-a551-6e782e67702f-c000.snappy.parquet
gs://dll-data-bucket-885979867746/delta-consumable/part-00000-61eebddf-adb1-4088-9520-6a953e6d3fff-c000.snappy.parquet
gs://dll-data-bucket-885979867746/delta-consumable/part-00000-7b773887-eac1-481b-aa71-e57bd469f977-c000.snappy.parquet
gs://dll-data-bucket-885979867746/delta-consumable/part-00000-89dfe8e5-2d40-49f6-b0a0-4b320db62d14-c000.snappy.parquet
gs://dll-data-bucket-885979867746/delta-consumable/part-00000-b5620589-f197-4417-8fcb-ce013d82deb9-c000.snappy.parquet
gs://dll-data-bucket-885979867746/delta-consumable/part-00000-c36f0527-e995-41d2-a8ea-776cc865f816-c000.snappy.parquet
gs://dll-data-bucket-885979867746/delta-consumable/part-00001-487147d6-23f9-4a07-ae73-7cddbe8dfd06-c000.snappy.parquet
gs://dll-data-bucket-885979867746/delta-consumable/part-00001-8d9a7ec0-3a4d-43e4-8e33-379ffb2e4a3a-c000.snappy.parquet
gs://dll-data-bucket-885979867746/delta-consumable/part-00001-91311981-1505-4747-a8c9-6b3616efb4c0-c000.snappy.parquet
gs://dll-data-bucket-885979867746/delta-consumable/part-00001-ad3ddbfe-cec8-4877-8b05-3d40da1079ba-c000.snappy.parquet
gs://dll-data-bucket-885979867746/delta-consumable/part-00002-15ea05a6-24a6-41e3-b9d1-7073429eda5f-c000.snappy.parquet
gs://dll-data-bucket-885979867746/delta-consumable/part-00002-1717bbe6-6038-404c-aea4-e83d93eb0fa8-c000.snappy.parquet
gs://dll-data-bucket-885979867746/delta-consumable/part-00002-d43dcdd6-5fb8-4b31-baa2-5458bf321285-c000.snappy.parquet
gs://dll-data-bucket-885979867746/delta-consumable/part-00003-4c2e181f-c1eb-458f-aa08-9be435e56bb5-c000.snappy.parquet
gs://dll-data-bucket-885979867746/delta-consumable/part-00003-c15a6560-6d08-41f9-9ff8-546a02d4fca7-c000.snappy.parquet
gs://dll-data-bucket-885979867746/delta-consumable/part-00003-e6f811e0-d076-4619-bb1f-4e1ffd7caeb0-c000.snappy.parquet
gs://dll-data-bucket-885979867746/delta-consumable/part-00004-7663ab3e-1fbb-4ef9-bb08-80d4e6e3d9f4-c000.snappy.parquet
gs://dll-data-bucket-885979867746/delta-consumable/part-00005-c9b6ff00-72cb-4264-8f2a-d01f6c27b759-c000.snappy.parquet
```

In [8]:
!gsutil ls -r $DELTA_LAKE_DIR_ROOT/part* | wc -l

29


### 4. ZORDER

In [9]:
spark.sql("OPTIMIZE loan_db.loans_by_state_delta ZORDER BY (addr_state)").show(truncate=False)

ANTLR Tool version 4.8 used for code generation does not match the current runtime version 4.9.3
ANTLR Runtime version 4.8 used for parser compilation does not match the current runtime version 4.9.3
ANTLR Tool version 4.8 used for code generation does not match the current runtime version 4.9.3
ANTLR Runtime version 4.8 used for parser compilation does not match the current runtime version 4.9.3
ivysettings.xml file not found in HIVE_HOME or HIVE_CONF_DIR,/etc/spark/conf/ivysettings.xml will be used
                                                                                

+--------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|path                                              |metrics                                                                                                                                                                             |
+--------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|gs://dll-data-bucket-885979867746/delta-consumable|{1, 25, {7959, 7959, 7959.0, 1, 7959}, {397, 2069, 1585.08, 25, 39627}, 1, {all, {0, 0}, {25, 39627}, 0, {25, 39627}, 1, null}, 1, 25, 0, false, 0, 0, 1667340764026, 0, 8, 0, null}|
+--------------------------------------------------+------------

In [18]:
# Let's take a look at the data lake after Z-Ordering. There is one extra file, which appears to hold all the data.
# The old files stay on disk; OPTIMIZE only marks them as removed in the transaction log.
!gsutil ls -lh $DELTA_LAKE_DIR_ROOT/part* | sort

     397 B  2022-11-01T21:47:59Z  gs://dll-data-bucket-885979867746/delta-consumable/part-00000-10887990-f51d-46ad-a768-2f8135d5fd95-c000.snappy.parquet
     725 B  2022-11-01T21:47:16Z  gs://dll-data-bucket-885979867746/delta-consumable/part-00000-363ee39f-ff75-4e4d-8938-d866d43955ca-c000.snappy.parquet
     725 B  2022-11-01T21:47:35Z  gs://dll-data-bucket-885979867746/delta-consumable/part-00000-9a73bff5-5ed8-4ce2-9383-b6f2865cbc6a-c000.snappy.parquet
     973 B  2022-11-01T21:46:55Z  gs://dll-data-bucket-885979867746/delta-consumable/part-00000-99635b87-214b-4afa-895e-8c124bf54009-c000.snappy.parquet
     978 B  2022-11-01T21:30:30Z  gs://dll-data-bucket-885979867746/delta-consumable/part-00000-bd1648aa-3db7-43e9-beb7-db0593eefbb5-c000.snappy.parquet
     993 B  2022-11-01T21:48:06Z  gs://dll-data-bucket-885979867746/delta-consumable/part-00000-2517e153-6923-4599-8b2e-5746fcf10973-c000.snappy.parquet
   1.7 KiB  2022-11-01T22:06:40Z  gs://dll-data-bucket-885979867746/delta-consumab

22/11/01 22:17:10 WARN JavaUtils: Attempt to delete using native Unix OS command failed for path = /var/tmp/spark/local-dir/blockmgr-53e14031-71b8-46b9-a00f-e3a41ef8e152. Falling back to Java IO way
java.io.IOException: Failed to delete: /var/tmp/spark/local-dir/blockmgr-53e14031-71b8-46b9-a00f-e3a41ef8e152
	at org.apache.spark.network.util.JavaUtils.deleteRecursivelyUsingUnixNative(JavaUtils.java:171)
	at org.apache.spark.network.util.JavaUtils.deleteRecursively(JavaUtils.java:110)
	at org.apache.spark.network.util.JavaUtils.deleteRecursively(JavaUtils.java:91)
	at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:1206)
	at org.apache.spark.storage.DiskBlockManager.$anonfun$doStop$1(DiskBlockManager.scala:374)
	at org.apache.spark.storage.DiskBlockManager.$anonfun$doStop$1$adapted(DiskBlockManager.scala:370)
	at scala.collection.ArrayOps$.foreach$extension(ArrayOps.scala:1328)
	at org.apache.spark.storage.DiskBlockManager.doStop(DiskBlockManager.scala:370)
	at org.apache.spar

In [11]:
!gsutil ls -r $DELTA_LAKE_DIR_ROOT/part* | wc -l

30


In [12]:
# Let's take a look at the transaction log after Z-Ordering
!gsutil ls -r $DELTA_LAKE_DIR_ROOT/_delta_log

gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/:
gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/
gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/00000000000000000000.json
gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/00000000000000000001.json
gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/00000000000000000002.json
gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/00000000000000000003.json
gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/00000000000000000004.json
gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/00000000000000000005.json
gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/00000000000000000006.json
gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/00000000000000000007.json
gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/00000000000000000008.json
gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/0000000000000000000

In [13]:
# And review what is in the delta log
!gsutil cat $DELTA_LAKE_DIR_ROOT/_delta_log/00000000000000000011.json

{"add":{"path":"part-00000-bccf5ec4-c5ff-496d-bc26-572ec880d175-c000.snappy.parquet","partitionValues":{},"size":1981,"modificationTime":1667340427125,"dataChange":true,"stats":"{\"numRecords\":408,\"minValues\":{\"addr_state\":\"AK\",\"count\":0,\"collateral_value\":0.0},\"maxValues\":{\"addr_state\":\"WY\",\"count\":10134,\"collateral_value\":1.0134172359090886E8},\"nullCount\":{\"addr_state\":0,\"count\":0,\"collateral_value\":0}}"}}
{"add":{"path":"part-00001-c0cd2879-5233-4b39-a601-0d09972de653-c000.snappy.parquet","partitionValues":{},"size":1645,"modificationTime":1667340427032,"dataChange":true,"stats":"{\"numRecords\":153,\"minValues\":{\"addr_state\":\"AK\",\"count\":0,\"collateral_value\":0.0},\"maxValues\":{\"addr_state\":\"WY\",\"count\":3168,\"collateral_value\":3.1680389069416154E7},\"nullCount\":{\"addr_state\":0,\"count\":0,\"collateral_value\":0}}"}}
{"add":{"path":"part-00002-befeded6-328d-4445-a2b7-b3885ea6abf5-c000.snappy.parquet","partitionValues":{},"size":1953,"
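Each `add` action in the log carries the per-file statistics that data skipping consults. As a minimal sketch, the snippet below parses the first `add` line from the output above with the standard `json` module. Note that `stats` is itself a JSON string nested inside the action, so it needs a second parse. The variable names are assumptions made for this example.

```python
import json

# First "add" action from the log output above, copied verbatim.
log_line = r'{"add":{"path":"part-00000-bccf5ec4-c5ff-496d-bc26-572ec880d175-c000.snappy.parquet","partitionValues":{},"size":1981,"modificationTime":1667340427125,"dataChange":true,"stats":"{\"numRecords\":408,\"minValues\":{\"addr_state\":\"AK\",\"count\":0,\"collateral_value\":0.0},\"maxValues\":{\"addr_state\":\"WY\",\"count\":10134,\"collateral_value\":1.0134172359090886E8},\"nullCount\":{\"addr_state\":0,\"count\":0,\"collateral_value\":0}}"}}'

add_action = json.loads(log_line)["add"]
stats = json.loads(add_action["stats"])  # stats is a JSON string, parse it again

min_state = stats["minValues"]["addr_state"]
max_state = stats["maxValues"]["addr_state"]

print(add_action["path"])
print(stats["numRecords"], min_state, max_state)  # 408 AK WY
```

This file spans `AK` through `WY`, so a filter on any single `addr_state` value cannot rule it out. Narrower per-file ranges are what Z-Ordering on `addr_state` aims to produce.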

### THIS CONCLUDES THIS LAB. PROCEED TO THE NEXT NOTEBOOK.