# OPTIMIZE Command

Delta Lake can improve the speed of read queries on a table by coalescing small files into larger ones.

## Compaction (bin-packing)

Bin-packing optimization is **_idempotent_**, meaning that if it is run twice on the same dataset, the second run has no effect.

Bin-packing aims to produce **evenly-balanced data files** with respect to their size on disk, but not necessarily with respect to the number of tuples per file. However, the two measures are most often correlated.
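
To make the two properties above concrete, here is a minimal sketch of bin-packing in plain Python. It is not Delta Lake's actual implementation: the first-fit-decreasing strategy, the names `bin_pack` and `compact`, and the file sizes (in MB) are assumptions chosen for illustration.

```python
from typing import List


def bin_pack(file_sizes: List[int], target_size: int) -> List[List[int]]:
    """Group file sizes into bins whose total stays at or below target_size.

    Uses a greedy first-fit-decreasing strategy (an assumption of this sketch).
    """
    bins: List[List[int]] = []
    for size in sorted(file_sizes, reverse=True):
        for current in bins:
            if sum(current) + size <= target_size:
                current.append(size)
                break
        else:
            # No existing bin has room, so open a new one
            bins.append([size])
    return bins


def compact(file_sizes: List[int], target_size: int) -> List[int]:
    """Simulate one compaction pass: every bin becomes a single output file."""
    return sorted(sum(current) for current in bin_pack(file_sizes, target_size))


# Nine tiny files, like the nine single-row inserts in this notebook
print(compact([1] * 9, 1024))  # [9]

# Mixed sizes: the result is balanced by size, not by file count
once = compact([300, 200, 700, 500, 100], 1024)
print(once)                 # [800, 1000]
print(compact(once, 1024))  # [800, 1000], the second pass changes nothing
```

After one pass, no two output files fit together in a single bin, so a second pass has nothing left to merge. That is the idempotence described above.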


In [None]:
%%sql
DROP TABLE IF EXISTS demo.optimize_demo;
CREATE TABLE demo.optimize_demo (id int);

In [None]:
%%sql
INSERT INTO demo.optimize_demo VALUES(1);
INSERT INTO demo.optimize_demo VALUES(2);
INSERT INTO demo.optimize_demo VALUES(3);
INSERT INTO demo.optimize_demo VALUES(4);
INSERT INTO demo.optimize_demo VALUES(5);
INSERT INTO demo.optimize_demo VALUES(6);
INSERT INTO demo.optimize_demo VALUES(7);
INSERT INTO demo.optimize_demo VALUES(8);
INSERT INTO demo.optimize_demo VALUES(9);

> Look at the table at the storage level

In [None]:
mssparkutils.fs.ls("Tables/optimize_demo")

> Count how many files there are

In [None]:
sc.binaryFiles("Tables/optimize_demo/").count()

## Run the command

Run the compaction with **either** the Python API **or** the SQL command below.

In [None]:
from delta.tables import DeltaTable

# Bind to the existing table by name, then compact its small files
delta_table = DeltaTable.forName(spark, 'demo.optimize_demo')
delta_table.optimize().executeCompaction()

> OR

In [None]:
%%sql
OPTIMIZE demo.optimize_demo;

## Delta Log

The OPTIMIZE operation is added to the Delta log, so you can track it in the table history.

In [None]:
display(delta_table.history())

In [None]:
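# Version 10 assumes a fresh table: CREATE TABLE (0), nine INSERTs (1-9), then OPTIMIZE (10).
# Check the history output above and adjust the file name if your versions differ.
deltalog = spark.read.json("Tables/optimize_demo/_delta_log/00000000000000000010.json")
# Files removed by the OPTIMIZE commit (the small files)
display(deltalog.select("remove.path").where("remove is not null"))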

In [None]:
# Add information
display(deltalog.select("add.path").where("add is not null"))
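
A commit file in `_delta_log` is named after the table version, zero-padded to 20 digits, so the path can be derived from the history instead of being hard-coded. The sketch below uses a hypothetical helper, `delta_log_commit_path`, which is not part of the Delta Lake API. The commented usage assumes the `spark` session and `delta_table` from the cells above.

```python
def delta_log_commit_path(table_path: str, version: int) -> str:
    """Build the path of the JSON commit file for a given table version."""
    # Commit files are named <version zero-padded to 20 digits>.json
    return f"{table_path.rstrip('/')}/_delta_log/{version:020d}.json"


print(delta_log_commit_path("Tables/optimize_demo", 10))

# Possible usage in this notebook (fails if the history has no OPTIMIZE entry):
# latest_optimize = (
#     delta_table.history()
#     .where("operation = 'OPTIMIZE'")
#     .orderBy("version", ascending=False)
#     .first()["version"]
# )
# deltalog = spark.read.json(
#     delta_log_commit_path("Tables/optimize_demo", latest_optimize)
# )
```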

In [None]:
display(spark.read.table("demo.optimize_demo"))

In [None]:
%%sql
SELECT * FROM demo.optimize_demo

## Vacuum

In [None]:
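# Demo only: disabling the retention check and vacuuming with 0 hours deletes every file
# not referenced by the current table version, so time travel to older versions stops working.
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")
delta_table.vacuum(0)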

In [None]:
mssparkutils.fs.ls("Tables/optimize_demo")

In [None]:
sc.binaryFiles("Tables/optimize_demo/").count()

In [None]:
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "true")

## Subset of data

To run the OPTIMIZE command on a subset of the data, the table must be partitioned, and the `WHERE` predicate must filter on the partition columns.
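
One way to picture why a subset OPTIMIZE needs a partitioned table: a partition predicate selects whole groups of files, and only those groups are compacted. The plain-Python sketch below illustrates that selection. The helper `files_in_partition` and the file paths are hypothetical, and the sketch assumes a Hive-style `column=value` directory layout. Delta Lake itself reads partition values from the transaction log rather than from the paths.

```python
from typing import List


def files_in_partition(paths: List[str], column: str, value: str) -> List[str]:
    """Return the data files stored under the partition directory column=value."""
    marker = f"{column}={value}"
    return [path for path in paths if marker in path.split("/")]


# Hypothetical relative paths of data files in a table partitioned by `date`
paths = [
    "date=2023-06-01/part-0000.parquet",
    "date=2023-06-01/part-0001.parquet",
    "date=2023-06-02/part-0000.parquet",
]

# Only these files would be candidates for compaction with date='2023-06-01'
print(files_in_partition(paths, "date", "2023-06-01"))
```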

In [None]:
from pyspark.sql.functions import expr


spark.sql("DROP TABLE IF EXISTS optimize_demo")

# 10,000 rows, each assigned a pseudo-random (seeded) date between 2023-06-01 and 2023-06-30
df = spark.range(0, 10000) \
    .withColumn("date", expr("cast(concat('2023-06-', cast(rand(5) * 30 as int) + 1) as date)"))

df.write.partitionBy("date").format("delta").saveAsTable("optimize_demo")

In [None]:
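# Re-bind the handle: the table was dropped and recreated above
delta_table = DeltaTable.forName(spark, "optimize_demo")
delta_table.optimize().where("date='2023-06-01'").executeCompaction()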

In [None]:
%%sql
OPTIMIZE optimize_demo WHERE date='2023-06-01'

## Clean up

In [None]:
%%sql
DROP TABLE IF EXISTS demo.optimize_demo;