# Delta Lake Lab 
## Unit 9: Table Restore + Storage optimization with OPTIMIZE and VACUUM

In the previous unit, we:
1. Learned about table clones

In this unit, we will learn about:
1. Table storage optimization with the OPTIMIZE command
2. Restoring a table to a prior version
3. Table storage optimization with the VACUUM command


### 1. Imports

In [1]:
import pandas as pd

from pyspark.sql.functions import month, date_format
from pyspark.sql.types import IntegerType
from pyspark.sql import SparkSession

from delta.tables import *

import warnings
warnings.filterwarnings('ignore')

### 2. Create a Spark session powered by Cloud Dataproc 

In [None]:
spark = SparkSession.builder.appName('Loan Analysis').getOrCreate()
spark

### 3. Declare variables

In [2]:
project_id_output = !gcloud config list --format "value(core.project)" 2>/dev/null
PROJECT_ID = project_id_output[0]
print("PROJECT_ID: ", PROJECT_ID)

PROJECT_ID:  delta-lake-lab


In [3]:
project_name_output = !gcloud projects describe $PROJECT_ID | grep name | cut -d':' -f2 | xargs
PROJECT_NAME = project_name_output[0]
print("PROJECT_NAME: ", PROJECT_NAME)

PROJECT_NAME:  delta-lake-lab


In [4]:
project_number_output = !gcloud projects describe $PROJECT_ID | grep projectNumber | cut -d':' -f2 | xargs
PROJECT_NUMBER = project_number_output[0]
print("PROJECT_NUMBER: ", PROJECT_NUMBER)

PROJECT_NUMBER:  885979867746


In [None]:
ACCOUNT_NAME = "akhjain"

In [5]:
DATA_LAKE_ROOT_PATH = f"gs://dll-data-bucket-{PROJECT_NUMBER}-{ACCOUNT_NAME}"
DELTA_LAKE_DIR_ROOT = f"{DATA_LAKE_ROOT_PATH}/delta-consumable"
print(DELTA_LAKE_DIR_ROOT)

gs://dll-data-bucket-885979867746/delta-consumable


### 4. File listing

In [6]:
!gsutil ls -r $DELTA_LAKE_DIR_ROOT

gs://dll-data-bucket-885979867746/delta-consumable/:
gs://dll-data-bucket-885979867746/delta-consumable/part-00000-0d0c8ea0-982f-4f67-ab9d-62f94e57db11-c000.snappy.parquet
gs://dll-data-bucket-885979867746/delta-consumable/part-00000-27993337-bc5b-4c93-9ab0-b77f48ac9160-c000.snappy.parquet
gs://dll-data-bucket-885979867746/delta-consumable/part-00000-293e9d10-a628-4cf0-b86c-f9f289913756-c000.snappy.parquet
gs://dll-data-bucket-885979867746/delta-consumable/part-00000-33f34593-184c-40b8-adfe-73facf9f043f-c000.snappy.parquet
gs://dll-data-bucket-885979867746/delta-consumable/part-00000-595b5ba1-408f-404d-91ee-7bc396235870-c000.snappy.parquet
gs://dll-data-bucket-885979867746/delta-consumable/part-00000-5c57d030-c1f7-4f7b-b5d8-2100c4426482-c000.snappy.parquet
gs://dll-data-bucket-885979867746/delta-consumable/part-00000-670d672a-c41d-4391-a47e-90d45e589ac2-c000.snappy.parquet
gs://dll-data-bucket-885979867746/delta-consumable/part-00000-8572a416-efe4-47bf-ab8b-755973ae5a7a-c000.snappy.par

In [7]:
!gsutil ls -r $DELTA_LAKE_DIR_ROOT/part* | wc -l

30


### 5. Optimize 
Delta Lake can speed up read queries on a table by coalescing small files into larger ones with the OPTIMIZE command.
In a busy environment that accumulates many small files, schedule this command to run periodically to keep read performance up.

Note: OPTIMIZE does not delete the small files from storage. It writes new, larger files and records the small files as removed in the table's transaction log. The small files stay on disk until VACUUM deletes them.

- Bin-packing optimization is idempotent, meaning that if it is run twice on the same dataset, the second run has no effect.

- Bin-packing aims to produce evenly-balanced data files with respect to their size on disk, but not necessarily number of tuples per file. However, the two measures are most often correlated.

- Python and Scala APIs for executing the OPTIMIZE operation are available in Delta Lake 2.0 and above.

- Set the Spark session configuration spark.databricks.delta.optimize.repartition.enabled=true to use repartition(1) instead of coalesce(1) for better performance when compacting many small files.
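The bin-packing idea in the bullets above can be illustrated with a small pure-Python model. This is only a sketch under simplifying assumptions: it uses a plain next-fit pass over files sorted by size, the file sizes and the 64 MB target are made up for illustration, and it is not Delta Lake's actual implementation. It shows files being grouped by size on disk and a second run leaving the result unchanged.

```python
def plan_compaction(file_sizes_mb, target_mb):
    """Group files into bins of at most target_mb using a simple next-fit pass."""
    bins, current, current_size = [], [], 0
    for size in sorted(file_sizes_mb):
        # Close the current bin when the next file would overflow the target.
        if current and current_size + size > target_mb:
            bins.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        bins.append(current)
    return bins


def compact(file_sizes_mb, target_mb):
    """Return the file sizes after compaction: each bin becomes one file."""
    return sorted(sum(b) for b in plan_compaction(file_sizes_mb, target_mb))


# Hypothetical file sizes in MB, not taken from the lab table.
small_files = [5, 5, 10, 20, 30, 60]
once = compact(small_files, target_mb=64)
twice = compact(once, target_mb=64)
print(once)   # sizes after the first pass
print(twice)  # sizes after the second pass
```

In this model the second pass finds no two files that fit in one bin together, so it has nothing left to merge.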

In [8]:
spark.sql("OPTIMIZE loan_db.loans_by_state_delta").show(truncate=False)

ivysettings.xml file not found in HIVE_HOME or HIVE_CONF_DIR,/etc/spark/conf/ivysettings.xml will be used
                                                                                

+--------------------------------------------------+---------------------------------------------------------------------------------------------------------------------+
|path                                              |metrics                                                                                                              |
+--------------------------------------------------+---------------------------------------------------------------------------------------------------------------------+
|gs://dll-data-bucket-885979867746/delta-consumable|{0, 0, {null, null, 0.0, 0, 0}, {null, null, 0.0, 0, 0}, 0, null, 0, 1, 1, false, 0, 0, 1666482433123, 0, 8, 0, null}|
+--------------------------------------------------+---------------------------------------------------------------------------------------------------------------------+
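The same compaction can also be requested through the Python API available in Delta Lake 2.0 and above. The helper below is a minimal sketch: compact_table is a hypothetical name, and it assumes it is handed a DeltaTable handle obtained from the notebook's Spark session.

```python
def compact_table(delta_table):
    """Run bin-packing compaction on a DeltaTable handle and return the metrics.

    optimize() returns a builder object; executeCompaction() runs the
    compaction and returns a DataFrame of metrics.
    """
    return delta_table.optimize().executeCompaction()


# Possible usage in this notebook (assumes delta-spark 2.0+ and the session above):
# from delta.tables import DeltaTable
# metrics = compact_table(DeltaTable.forName(spark, "loan_db.loans_by_state_delta"))
# metrics.show(truncate=False)
```

Taking the table handle as a parameter keeps the helper independent of how the table is looked up (by name or by path).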



In [9]:
!gsutil ls -r $DELTA_LAKE_DIR_ROOT

gs://dll-data-bucket-885979867746/delta-consumable/:
gs://dll-data-bucket-885979867746/delta-consumable/part-00000-0d0c8ea0-982f-4f67-ab9d-62f94e57db11-c000.snappy.parquet
gs://dll-data-bucket-885979867746/delta-consumable/part-00000-27993337-bc5b-4c93-9ab0-b77f48ac9160-c000.snappy.parquet
gs://dll-data-bucket-885979867746/delta-consumable/part-00000-293e9d10-a628-4cf0-b86c-f9f289913756-c000.snappy.parquet
gs://dll-data-bucket-885979867746/delta-consumable/part-00000-33f34593-184c-40b8-adfe-73facf9f043f-c000.snappy.parquet
gs://dll-data-bucket-885979867746/delta-consumable/part-00000-595b5ba1-408f-404d-91ee-7bc396235870-c000.snappy.parquet
gs://dll-data-bucket-885979867746/delta-consumable/part-00000-5c57d030-c1f7-4f7b-b5d8-2100c4426482-c000.snappy.parquet
gs://dll-data-bucket-885979867746/delta-consumable/part-00000-670d672a-c41d-4391-a47e-90d45e589ac2-c000.snappy.parquet
gs://dll-data-bucket-885979867746/delta-consumable/part-00000-8572a416-efe4-47bf-ab8b-755973ae5a7a-c000.snappy.par

In [10]:
!gsutil ls -r $DELTA_LAKE_DIR_ROOT/part* | wc -l

30


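Note that the Parquet file count is unchanged. OPTIMIZE never deletes files from storage: any files it compacts are only marked as removed in the transaction log and remain on disk until VACUUM runs.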

### 6. Restore to a prior version of the table

In [11]:
spark.sql("RESTORE TABLE loan_db.loans_by_state_delta TO VERSION AS OF 5").show(truncate=False)

                                                                                

DataFrame[table_size_after_restore: bigint, num_of_files_after_restore: bigint, num_removed_files: bigint, num_restored_files: bigint, removed_files_size: bigint, restored_files_size: bigint]

In [12]:
!gsutil ls -r $DELTA_LAKE_DIR_ROOT/part* | wc -l

30
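The file count in storage is still 30 because RESTORE does not delete or rewrite data files. It commits a new table version whose set of live files matches the target version. The toy model below sketches that bookkeeping. The commit structure and file names are hypothetical and greatly simplified compared with the real _delta_log entries.

```python
def live_files(log, version):
    """Replay add/remove actions up to and including `version`."""
    files = set()
    for commit in log[: version + 1]:
        files -= set(commit.get("remove", []))
        files |= set(commit.get("add", []))
    return files


def restore(log, target_version):
    """Append a commit that makes the live file set equal to that of target_version."""
    current = live_files(log, len(log) - 1)
    target = live_files(log, target_version)
    log.append({
        "add": sorted(target - current),      # files to bring back
        "remove": sorted(current - target),   # files to drop from the snapshot
    })
    return log


log = [
    {"add": ["a.parquet"]},                           # version 0
    {"add": ["b.parquet"]},                           # version 1
    {"add": ["c.parquet"], "remove": ["a.parquet"]},  # version 2
]
restore(log, target_version=1)                        # writes version 3
print(sorted(live_files(log, 3)))
print(log[3])
```

All three files still exist in storage after the restore. Only the log decides which of them belong to the current version of the table.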


In [13]:
!gsutil ls -r $DELTA_LAKE_DIR_ROOT/part* 

gs://dll-data-bucket-885979867746/delta-consumable/part-00000-0d0c8ea0-982f-4f67-ab9d-62f94e57db11-c000.snappy.parquet
gs://dll-data-bucket-885979867746/delta-consumable/part-00000-27993337-bc5b-4c93-9ab0-b77f48ac9160-c000.snappy.parquet
gs://dll-data-bucket-885979867746/delta-consumable/part-00000-293e9d10-a628-4cf0-b86c-f9f289913756-c000.snappy.parquet
gs://dll-data-bucket-885979867746/delta-consumable/part-00000-33f34593-184c-40b8-adfe-73facf9f043f-c000.snappy.parquet
gs://dll-data-bucket-885979867746/delta-consumable/part-00000-595b5ba1-408f-404d-91ee-7bc396235870-c000.snappy.parquet
gs://dll-data-bucket-885979867746/delta-consumable/part-00000-5c57d030-c1f7-4f7b-b5d8-2100c4426482-c000.snappy.parquet
gs://dll-data-bucket-885979867746/delta-consumable/part-00000-670d672a-c41d-4391-a47e-90d45e589ac2-c000.snappy.parquet
gs://dll-data-bucket-885979867746/delta-consumable/part-00000-8572a416-efe4-47bf-ab8b-755973ae5a7a-c000.snappy.parquet
gs://dll-data-bucket-885979867746/delta-consumab

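### 7. Vacuum - for deleting files no longer referenced by the table
VACUUM deletes data files that are no longer referenced by the current version of the table and are older than the retention threshold (168 hours, that is 7 days, by default).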

In [14]:
spark.sql("VACUUM loan_db.loans_by_state_delta").show(truncate=False)

                                                                                

Deleted 0 files and directories in a total of 1 directories.
+--------------------------------------------------+
|path                                              |
+--------------------------------------------------+
|gs://dll-data-bucket-885979867746/delta-consumable|
+--------------------------------------------------+





In [15]:
!gsutil ls -r $DELTA_LAKE_DIR_ROOT/part* | wc -l



30


In [16]:
spark.sql("VACUUM loan_db.loans_by_state_delta RETAIN 0 HOURS")

IllegalArgumentException: requirement failed: Are you sure you would like to vacuum files with such a low retention period? If you have
writers that are currently writing to this table, there is a risk that you may corrupt the
state of your Delta table.

If you are certain that there are no operations being performed on this table, such as
insert/upsert/delete/optimize, then you may turn off this check by setting:
spark.databricks.delta.retentionDurationCheck.enabled = false

If you are not sure, please use a value not less than "168 hours".
       

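Delta refuses to VACUUM with a retention period this low unless the safety check is turned off by setting the property below. Be aware that once files are vacuumed with 0 hours of retention, older table versions that depend on them can no longer be read or restored.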

In [17]:
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", False)

Now VACUUM should work:

In [18]:
spark.sql("VACUUM loan_db.loans_by_state_delta RETAIN 0 HOURS")

                                                                                

Deleted 28 files and directories in a total of 1 directories.


DataFrame[path: string]

In [19]:
!gsutil ls -r $DELTA_LAKE_DIR_ROOT/part* | wc -l

2
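The two VACUUM runs above differ only in the retention period: the first deleted nothing, the second deleted 28 files and left 2. A simplified model of that rule is sketched below. It assumes a single last-modified time per file on a hypothetical hour clock and made-up file names. The real command works from the transaction log and storage timestamps.

```python
def vacuum(storage, live, now_hours, retain_hours):
    """Split storage into (kept, deleted) file-name lists.

    storage maps file name -> last-modified time in hours (hypothetical clock).
    A file is deleted only if it is not live AND was modified before the cutoff.
    """
    cutoff = now_hours - retain_hours
    kept, deleted = [], []
    for name, modified in sorted(storage.items()):
        if name not in live and modified < cutoff:
            deleted.append(name)
        else:
            kept.append(name)
    return kept, deleted


storage = {"old-1.parquet": 1, "old-2.parquet": 2, "live-1.parquet": 3, "live-2.parquet": 4}
live = {"live-1.parquet", "live-2.parquet"}

# 168-hour retention: the stale files are too recent to delete.
print(vacuum(storage, live, now_hours=10, retain_hours=168)[1])
# Zero retention: every file outside the live set is deleted.
print(vacuum(storage, live, now_hours=10, retain_hours=0)[1])
```

Files in the live set are never deleted, whatever the retention period. The retention period only protects files that have already dropped out of the current version.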


In [20]:
!gsutil ls -r $DELTA_LAKE_DIR_ROOT

gs://dll-data-bucket-885979867746/delta-consumable/:
gs://dll-data-bucket-885979867746/delta-consumable/
gs://dll-data-bucket-885979867746/delta-consumable/part-00000-9e3ca1ea-93a4-455a-9988-5e1250281ac0-c000.snappy.parquet
gs://dll-data-bucket-885979867746/delta-consumable/part-00000-c2a74fad-c560-4efd-8b2e-5a8a7ddcc62e-c000.snappy.parquet

gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/:
gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/
gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/00000000000000000000.json
gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/00000000000000000001.json
gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/00000000000000000002.json
gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/00000000000000000003.json
gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/00000000000000000004.json
gs://dll-data-bucket-885979867746/delta-consumable/_delta_log/00000000000000000005.json
g

### THIS CONCLUDES THIS LAB. DON'T FORGET TO SHUT DOWN THE LAB RESOURCES.