# Lecture 13. Apply Advanced Delta Features (Hands On)


In this notebook, we will look at some advanced Delta Lake features, starting with time travel.
In addition, we will see the `OPTIMIZE` and `VACUUM` commands.



In [0]:
%sql
USE CATALOG hive_metastore

## Delta Time Travel

Let us review our table history again.

In [0]:
%sql
DESCRIBE HISTORY employees

version,timestamp,userId,userName,operation,operationParameters,job,notebook,clusterId,readVersion,isolationLevel,isBlindAppend,operationMetrics,userMetadata,engineInfo
5,2024-10-12T04:55:00Z,2895352578531874,suryapulika38@gmail.com,UPDATE,"Map(predicate -> [""StartsWith(name#10523, A)""])",,List(4341422527294408),1011-150700-u18wk0fi,4.0,WriteSerializable,False,"Map(numRemovedFiles -> 2, numRemovedBytes -> 2142, numCopiedRows -> 1, numDeletionVectorsAdded -> 0, numDeletionVectorsRemoved -> 0, numAddedChangeFiles -> 0, executionTimeMs -> 1375, scanTimeMs -> 899, numAddedFiles -> 2, numUpdatedRows -> 2, numAddedBytes -> 2142, rewriteTimeMs -> 466)",,Databricks-Runtime/13.3.x-scala2.12
4,2024-10-12T04:53:58Z,2895352578531874,suryapulika38@gmail.com,WRITE,"Map(mode -> Append, statsOnLoad -> false, partitionBy -> [])",,List(4341422527294408),1011-150700-u18wk0fi,3.0,WriteSerializable,True,"Map(numFiles -> 1, numOutputRows -> 1, numOutputBytes -> 1059)",,Databricks-Runtime/13.3.x-scala2.12
3,2024-10-12T04:53:57Z,2895352578531874,suryapulika38@gmail.com,WRITE,"Map(mode -> Append, statsOnLoad -> false, partitionBy -> [])",,List(4341422527294408),1011-150700-u18wk0fi,2.0,WriteSerializable,True,"Map(numFiles -> 1, numOutputRows -> 1, numOutputBytes -> 1066)",,Databricks-Runtime/13.3.x-scala2.12
2,2024-10-12T04:53:56Z,2895352578531874,suryapulika38@gmail.com,WRITE,"Map(mode -> Append, statsOnLoad -> false, partitionBy -> [])",,List(4341422527294408),1011-150700-u18wk0fi,1.0,WriteSerializable,True,"Map(numFiles -> 1, numOutputRows -> 2, numOutputBytes -> 1080)",,Databricks-Runtime/13.3.x-scala2.12
1,2024-10-12T04:53:55Z,2895352578531874,suryapulika38@gmail.com,WRITE,"Map(mode -> Append, statsOnLoad -> false, partitionBy -> [])",,List(4341422527294408),1011-150700-u18wk0fi,0.0,WriteSerializable,True,"Map(numFiles -> 1, numOutputRows -> 2, numOutputBytes -> 1076)",,Databricks-Runtime/13.3.x-scala2.12
0,2024-10-12T04:52:39Z,2895352578531874,suryapulika38@gmail.com,CREATE TABLE,"Map(partitionBy -> [], description -> null, isManaged -> true, properties -> {}, statsOnLoad -> false)",,List(4341422527294408),1011-150700-u18wk0fi,,WriteSerializable,True,Map(),,Databricks-Runtime/13.3.x-scala2.12


So, here we can see the 6 versions of our table, numbered 0 through 5.

In Delta Lake, we can easily query previous versions of our table.
This time travel feature is possible thanks to the extra data files
that have been marked as removed in our transaction log but still exist in the table directory.
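To make the idea concrete, here is a minimal pure-Python sketch of how a table version can be reconstructed by replaying the log. It is a toy model, not Delta Lake's implementation: the labels `f1` through `f6` are hypothetical short names for the Parquet files, the sizes come from the history output above, and which two files the `UPDATE` rewrote is inferred from its byte counts (1076 + 1066 = 2142).

```python
# Toy transaction log: each commit records the files it adds and removes.
# File labels are hypothetical; sizes are taken from DESCRIBE HISTORY above.
LOG = {
    0: {"add": {}, "remove": []},                                   # CREATE TABLE
    1: {"add": {"f1": 1076}, "remove": []},                         # WRITE
    2: {"add": {"f2": 1080}, "remove": []},                         # WRITE
    3: {"add": {"f3": 1066}, "remove": []},                         # WRITE
    4: {"add": {"f4": 1059}, "remove": []},                         # WRITE
    5: {"add": {"f5": 1076, "f6": 1066}, "remove": ["f1", "f3"]},   # UPDATE (inferred)
}


def snapshot(log, version):
    """Replay commits 0..version and return {file: size} for the files live at that version."""
    live = {}
    for v in range(version + 1):
        commit = log[v]
        for name in commit["remove"]:
            live.pop(name, None)   # removed files leave the snapshot, not the disk
        live.update(commit["add"])
    return live


print(sorted(snapshot(LOG, 4)))  # files that make up version 4
print(sorted(snapshot(LOG, 5)))  # files that make up version 5
```

Because `f1` and `f3` are only dropped from the snapshot and not from storage, version 4 can still be read after the update.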

So, let us say we want to access our data as it was before the update operation, that is, at version `4`.

We can simply use a `SELECT` query with the `VERSION AS OF` keyword and specify the version number, in our case `4`.

(Alternatively, we can use a timestamp with `TIMESTAMP AS OF` to query the data as of a point in time.)
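As a rough illustration of what a timestamp-based query has to do, the sketch below maps a timestamp to the latest version committed at or before it. The commit times are copied from the history output above; the helper names `parse` and `version_as_of` are hypothetical, and a real engine may treat out-of-range timestamps differently (for example by raising an error).

```python
from datetime import datetime, timezone

# Commit timestamps from DESCRIBE HISTORY above (version -> UTC time).
COMMITS = {
    0: "2024-10-12T04:52:39Z",
    1: "2024-10-12T04:53:55Z",
    2: "2024-10-12T04:53:56Z",
    3: "2024-10-12T04:53:57Z",
    4: "2024-10-12T04:53:58Z",
    5: "2024-10-12T04:55:00Z",
}


def parse(ts):
    """Parse an ISO-like UTC timestamp such as 2024-10-12T04:55:00Z."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def version_as_of(commits, ts):
    """Return the latest version committed at or before ts, or None if ts predates the table."""
    target = parse(ts)
    eligible = [v for v, t in commits.items() if parse(t) <= target]
    return max(eligible) if eligible else None


# A time between the last append and the update resolves to version 4.
print(version_as_of(COMMITS, "2024-10-12T04:54:30Z"))  # 4
```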

In [0]:
%sql
SELECT * FROM employees VERSION AS OF 4

id,name,salary
3,John,2999.3
4,Thomas,4000.3
1,Adam,3500.0
2,Sarah,4020.5
5,Anna,2500.0
6,Kim,6200.3


So, here we see our data before the update operation.

An alternative syntax is to append `@v` followed by the version number to the table name.

In [0]:
%sql
SELECT * FROM employees@v4

id,name,salary
3,John,2999.3
4,Thomas,4000.3
1,Adam,3500.0
2,Sarah,4020.5
5,Anna,2500.0
6,Kim,6200.3


Now, imagine this scenario: we deleted our data and now need to restore it.



In [0]:
%sql
DELETE FROM employees

num_affected_rows
6


Let us confirm this.

In [0]:
%sql
SELECT * FROM employees

id,name,salary


So our table data has indeed been removed.

We can simply roll back to the version before the deletion, version `5`, using the `RESTORE TABLE` command.

In [0]:
%sql
RESTORE TABLE employees TO VERSION AS OF 5

table_size_after_restore,num_of_files_after_restore,num_removed_files,num_restored_files,removed_files_size,restored_files_size
4281,4,0,4,0,4281


Great! The data has been restored.

Let us confirm this.

In [0]:
%sql
SELECT * FROM employees

id,name,salary
3,John,2999.3
4,Thomas,4000.3
1,Adam,3600.0
2,Sarah,4020.5
5,Anna,2600.0
6,Kim,6200.3


Let us explore what really happened in our table.

In [0]:
%sql
DESCRIBE HISTORY employees

version,timestamp,userId,userName,operation,operationParameters,job,notebook,clusterId,readVersion,isolationLevel,isBlindAppend,operationMetrics,userMetadata,engineInfo
7,2024-10-12T07:17:52Z,2895352578531874,suryapulika38@gmail.com,RESTORE,"Map(version -> 5, timestamp -> null)",,List(4341422527294472),1011-150700-u18wk0fi,6.0,Serializable,False,"Map(numRestoredFiles -> 4, removedFilesSize -> 0, numRemovedFiles -> 0, restoredFilesSize -> 4281, numOfFilesAfterRestore -> 4, tableSizeAfterRestore -> 4281)",,Databricks-Runtime/13.3.x-scala2.12
6,2024-10-12T07:16:01Z,2895352578531874,suryapulika38@gmail.com,DELETE,"Map(predicate -> [""true""])",,List(4341422527294472),1011-150700-u18wk0fi,5.0,WriteSerializable,False,"Map(numRemovedFiles -> 4, numRemovedBytes -> 4281, numCopiedRows -> 0, numDeletionVectorsAdded -> 0, numDeletionVectorsRemoved -> 0, numAddedChangeFiles -> 0, executionTimeMs -> 300, numDeletedRows -> 6, scanTimeMs -> 280, numAddedFiles -> 0, numAddedBytes -> 0, rewriteTimeMs -> 0)",,Databricks-Runtime/13.3.x-scala2.12
5,2024-10-12T04:55:00Z,2895352578531874,suryapulika38@gmail.com,UPDATE,"Map(predicate -> [""StartsWith(name#10523, A)""])",,List(4341422527294408),1011-150700-u18wk0fi,4.0,WriteSerializable,False,"Map(numRemovedFiles -> 2, numRemovedBytes -> 2142, numCopiedRows -> 1, numDeletionVectorsAdded -> 0, numDeletionVectorsRemoved -> 0, numAddedChangeFiles -> 0, executionTimeMs -> 1375, scanTimeMs -> 899, numAddedFiles -> 2, numUpdatedRows -> 2, numAddedBytes -> 2142, rewriteTimeMs -> 466)",,Databricks-Runtime/13.3.x-scala2.12
4,2024-10-12T04:53:58Z,2895352578531874,suryapulika38@gmail.com,WRITE,"Map(mode -> Append, statsOnLoad -> false, partitionBy -> [])",,List(4341422527294408),1011-150700-u18wk0fi,3.0,WriteSerializable,True,"Map(numFiles -> 1, numOutputRows -> 1, numOutputBytes -> 1059)",,Databricks-Runtime/13.3.x-scala2.12
3,2024-10-12T04:53:57Z,2895352578531874,suryapulika38@gmail.com,WRITE,"Map(mode -> Append, statsOnLoad -> false, partitionBy -> [])",,List(4341422527294408),1011-150700-u18wk0fi,2.0,WriteSerializable,True,"Map(numFiles -> 1, numOutputRows -> 1, numOutputBytes -> 1066)",,Databricks-Runtime/13.3.x-scala2.12
2,2024-10-12T04:53:56Z,2895352578531874,suryapulika38@gmail.com,WRITE,"Map(mode -> Append, statsOnLoad -> false, partitionBy -> [])",,List(4341422527294408),1011-150700-u18wk0fi,1.0,WriteSerializable,True,"Map(numFiles -> 1, numOutputRows -> 2, numOutputBytes -> 1080)",,Databricks-Runtime/13.3.x-scala2.12
1,2024-10-12T04:53:55Z,2895352578531874,suryapulika38@gmail.com,WRITE,"Map(mode -> Append, statsOnLoad -> false, partitionBy -> [])",,List(4341422527294408),1011-150700-u18wk0fi,0.0,WriteSerializable,True,"Map(numFiles -> 1, numOutputRows -> 2, numOutputBytes -> 1076)",,Databricks-Runtime/13.3.x-scala2.12
0,2024-10-12T04:52:39Z,2895352578531874,suryapulika38@gmail.com,CREATE TABLE,"Map(partitionBy -> [], description -> null, isManaged -> true, properties -> {}, statsOnLoad -> false)",,List(4341422527294408),1011-150700-u18wk0fi,,WriteSerializable,True,Map(),,Databricks-Runtime/13.3.x-scala2.12


As you can see, the `RESTORE` command has been recorded as a new transaction (version 7), right after the `DELETE` at version 6.

So, Delta time travel is a really powerful feature.
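The restore metrics in the history can be reproduced with a small toy model. This is only a sketch under stated assumptions: `restore_commit` is a hypothetical helper, the labels `f2`, `f4`, `f5`, `f6` stand in for the four Parquet files live at version 5, and the sizes are taken from the outputs above.

```python
def restore_commit(current, target):
    """Build the add/remove actions that turn snapshot `current` into snapshot `target`."""
    return {
        "add": {f: size for f, size in target.items() if f not in current},
        "remove": [f for f in current if f not in target],
    }


# Hypothetical labels for the four files live at version 5 (sizes in bytes).
v5 = {"f2": 1080, "f4": 1059, "f5": 1076, "f6": 1066}
v6 = {}  # after DELETE FROM employees, no file is live

commit7 = restore_commit(v6, v5)
# Restored files, removed files, restored bytes
print(len(commit7["add"]), len(commit7["remove"]), sum(commit7["add"].values()))  # 4 0 4281
```

The numbers line up with `numRestoredFiles -> 4`, `numRemovedFiles -> 0`, and `restoredFilesSize -> 4281` in version 7: restoring appends a new commit that re-adds existing files, and earlier history is left in place.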

## OPTIMIZE Command

Let us now talk about the `OPTIMIZE` command, which compacts small files and can apply Z-order indexing.

Since Spark works in parallel, you usually end up writing too many small files.


In [0]:
%sql
DESCRIBE DETAIL employees

format,id,name,description,location,createdAt,lastModified,partitionColumns,clusteringColumns,numFiles,sizeInBytes,properties,minReaderVersion,minWriterVersion,tableFeatures,statistics
delta,f417c9dd-30ac-49c7-b6ae-14e820c51596,hive_metastore.default.employees,,dbfs:/user/hive/warehouse/employees,2024-10-12T04:52:39.544Z,2024-10-12T07:17:52Z,List(),List(),4,4281,Map(),1,2,"List(appendOnly, invariants)",Map()


Having many small data files negatively affects the performance of the Delta table.
In our case, we have 4 small data files.

To resolve this issue,
we can use the `OPTIMIZE` command, which combines files toward an optimal size.

`OPTIMIZE` replaces existing data files by combining their records and rewriting the results.

In [0]:
%sql
OPTIMIZE employees ZORDER BY id

path,metrics
dbfs:/user/hive/warehouse/employees,"List(1, 4, List(1155, 1155, 1155.0, 1, 1155), List(1059, 1080, 1070.25, 4, 4281), 0, List(minCubeSize(107374182400), List(0, 0), List(4, 4281), 0, List(4, 4281), 1, null), 1, 4, 0, false, 0, 0, 1728717602888, 1728717608223, 4, 1, null, List(0, 0), 3, 3, 219, 0, null)"


Here, we can see that our 4 data files have been soft deleted, 
and a new file has been added that compacts those 4 files.
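The compaction step can be pictured as grouping small files into bins up to a target size, then rewriting each bin as one file. The sketch below is a simplified greedy version, not the actual `OPTIMIZE` algorithm: `plan_compaction` and the file labels are hypothetical, and the 1 GiB target is an assumption used for illustration.

```python
def plan_compaction(file_sizes, target_bytes):
    """Greedily group files into bins of at most target_bytes.

    Only bins holding two or more files are returned, since a single
    file gains nothing from being rewritten on its own.
    """
    bins, current, current_size = [], [], 0
    for name, size in sorted(file_sizes.items(), key=lambda kv: kv[1]):
        if current and current_size + size > target_bytes:
            bins.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        bins.append(current)
    return [b for b in bins if len(b) > 1]


# Hypothetical labels for the four live files, sizes from the outputs above.
SIZES = {"f2": 1080, "f4": 1059, "f5": 1076, "f6": 1066}

# With an assumed 1 GiB target, all four tiny files fall into a single bin.
print(plan_compaction(SIZES, 1024**3))
```

With a single bin of four files, the plan corresponds to the 4 removed files and 1 added file reported in the output.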

In addition, you may have noticed that we added Z-order indexing to our `OPTIMIZE` command.

**Z order indexing speeds up data retrieval when filtering on provided fields, by grouping data with similar values within the same data files.**

In our case, we Z-order by the `id` column.
However, on such a small data set, it does not provide any benefit.
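On a larger table the benefit comes from file skipping: if each file covers a narrow range of values, a filter can rule out most files from their min/max statistics alone. The sketch below uses made-up ids and a hypothetical `files_to_scan` helper. With a single column, Z-ordering behaves roughly like sorting, which is what the toy does; real Z-ordering interleaves several columns.

```python
def files_to_scan(files, wanted_id):
    """Count files whose [min, max] id range could contain wanted_id."""
    return sum(1 for ids in files if min(ids) <= wanted_id <= max(ids))


# Hypothetical layout: 12 ids spread over 3 files in arrival order.
unclustered = [[1, 7, 12, 4], [9, 2, 6, 11], [3, 10, 5, 8]]

# Same ids, sorted and re-split so that similar values share a file.
all_ids = sorted(i for f in unclustered for i in f)
clustered = [all_ids[i:i + 4] for i in range(0, len(all_ids), 4)]

print(files_to_scan(unclustered, 5))  # 3: every file's range overlaps id 5
print(files_to_scan(clustered, 5))    # 1: only the file covering ids 5 to 8
```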

Let us confirm the output of the `OPTIMIZE` command by running `DESCRIBE DETAIL` on our table.

In [0]:
%sql
DESCRIBE DETAIL employees

format,id,name,description,location,createdAt,lastModified,partitionColumns,clusteringColumns,numFiles,sizeInBytes,properties,minReaderVersion,minWriterVersion,tableFeatures,statistics
delta,f417c9dd-30ac-49c7-b6ae-14e820c51596,hive_metastore.default.employees,,dbfs:/user/hive/warehouse/employees,2024-10-12T04:52:39.544Z,2024-10-12T07:20:06Z,List(),List(),1,1155,Map(),1,2,"List(appendOnly, invariants)",Map()



Here, we can see that the number of files in the current version is only one.

Let us see how the `OPTIMIZE` operation has been recorded in our table history.


In [0]:
%sql
DESCRIBE HISTORY employees

version,timestamp,userId,userName,operation,operationParameters,job,notebook,clusterId,readVersion,isolationLevel,isBlindAppend,operationMetrics,userMetadata,engineInfo
8,2024-10-12T07:20:06Z,2895352578531874,suryapulika38@gmail.com,OPTIMIZE,"Map(predicate -> [], zOrderBy -> [""id""], batchId -> 0, auto -> false)",,List(4341422527294472),1011-150700-u18wk0fi,7.0,SnapshotIsolation,False,"Map(numRemovedFiles -> 4, numRemovedBytes -> 4281, p25FileSize -> 1155, numDeletionVectorsRemoved -> 0, minFileSize -> 1155, numAddedFiles -> 1, maxFileSize -> 1155, p75FileSize -> 1155, p50FileSize -> 1155, numAddedBytes -> 1155)",,Databricks-Runtime/13.3.x-scala2.12
7,2024-10-12T07:17:52Z,2895352578531874,suryapulika38@gmail.com,RESTORE,"Map(version -> 5, timestamp -> null)",,List(4341422527294472),1011-150700-u18wk0fi,6.0,Serializable,False,"Map(numRestoredFiles -> 4, removedFilesSize -> 0, numRemovedFiles -> 0, restoredFilesSize -> 4281, numOfFilesAfterRestore -> 4, tableSizeAfterRestore -> 4281)",,Databricks-Runtime/13.3.x-scala2.12
6,2024-10-12T07:16:01Z,2895352578531874,suryapulika38@gmail.com,DELETE,"Map(predicate -> [""true""])",,List(4341422527294472),1011-150700-u18wk0fi,5.0,WriteSerializable,False,"Map(numRemovedFiles -> 4, numRemovedBytes -> 4281, numCopiedRows -> 0, numDeletionVectorsAdded -> 0, numDeletionVectorsRemoved -> 0, numAddedChangeFiles -> 0, executionTimeMs -> 300, numDeletedRows -> 6, scanTimeMs -> 280, numAddedFiles -> 0, numAddedBytes -> 0, rewriteTimeMs -> 0)",,Databricks-Runtime/13.3.x-scala2.12
5,2024-10-12T04:55:00Z,2895352578531874,suryapulika38@gmail.com,UPDATE,"Map(predicate -> [""StartsWith(name#10523, A)""])",,List(4341422527294408),1011-150700-u18wk0fi,4.0,WriteSerializable,False,"Map(numRemovedFiles -> 2, numRemovedBytes -> 2142, numCopiedRows -> 1, numDeletionVectorsAdded -> 0, numDeletionVectorsRemoved -> 0, numAddedChangeFiles -> 0, executionTimeMs -> 1375, scanTimeMs -> 899, numAddedFiles -> 2, numUpdatedRows -> 2, numAddedBytes -> 2142, rewriteTimeMs -> 466)",,Databricks-Runtime/13.3.x-scala2.12
4,2024-10-12T04:53:58Z,2895352578531874,suryapulika38@gmail.com,WRITE,"Map(mode -> Append, statsOnLoad -> false, partitionBy -> [])",,List(4341422527294408),1011-150700-u18wk0fi,3.0,WriteSerializable,True,"Map(numFiles -> 1, numOutputRows -> 1, numOutputBytes -> 1059)",,Databricks-Runtime/13.3.x-scala2.12
3,2024-10-12T04:53:57Z,2895352578531874,suryapulika38@gmail.com,WRITE,"Map(mode -> Append, statsOnLoad -> false, partitionBy -> [])",,List(4341422527294408),1011-150700-u18wk0fi,2.0,WriteSerializable,True,"Map(numFiles -> 1, numOutputRows -> 1, numOutputBytes -> 1066)",,Databricks-Runtime/13.3.x-scala2.12
2,2024-10-12T04:53:56Z,2895352578531874,suryapulika38@gmail.com,WRITE,"Map(mode -> Append, statsOnLoad -> false, partitionBy -> [])",,List(4341422527294408),1011-150700-u18wk0fi,1.0,WriteSerializable,True,"Map(numFiles -> 1, numOutputRows -> 2, numOutputBytes -> 1080)",,Databricks-Runtime/13.3.x-scala2.12
1,2024-10-12T04:53:55Z,2895352578531874,suryapulika38@gmail.com,WRITE,"Map(mode -> Append, statsOnLoad -> false, partitionBy -> [])",,List(4341422527294408),1011-150700-u18wk0fi,0.0,WriteSerializable,True,"Map(numFiles -> 1, numOutputRows -> 2, numOutputBytes -> 1076)",,Databricks-Runtime/13.3.x-scala2.12
0,2024-10-12T04:52:39Z,2895352578531874,suryapulika38@gmail.com,CREATE TABLE,"Map(partitionBy -> [], description -> null, isManaged -> true, properties -> {}, statsOnLoad -> false)",,List(4341422527294408),1011-150700-u18wk0fi,,WriteSerializable,True,Map(),,Databricks-Runtime/13.3.x-scala2.12



As expected, the `OPTIMIZE` command created another version of our table, meaning that version 8 is now the most recent version.

Let us now explore the data files in our table directory.


In [0]:
%fs ls 'dbfs:/user/hive/warehouse/employees'

path,name,size,modificationTime
dbfs:/user/hive/warehouse/employees/_delta_log/,_delta_log/,0,1728708759000
dbfs:/user/hive/warehouse/employees/part-00000-1ce54e7c-9ef3-4343-b8ed-7787e775fe0b-c000.snappy.parquet,part-00000-1ce54e7c-9ef3-4343-b8ed-7787e775fe0b-c000.snappy.parquet,1076,1728708835000
dbfs:/user/hive/warehouse/employees/part-00000-2d623fad-d89c-4565-9b90-bc1e6f5a0e80-c000.snappy.parquet,part-00000-2d623fad-d89c-4565-9b90-bc1e6f5a0e80-c000.snappy.parquet,1155,1728717605000
dbfs:/user/hive/warehouse/employees/part-00000-5bb80f0a-fe61-493a-892c-7c5f4945e249-c000.snappy.parquet,part-00000-5bb80f0a-fe61-493a-892c-7c5f4945e249-c000.snappy.parquet,1080,1728708836000
dbfs:/user/hive/warehouse/employees/part-00000-65976bba-d807-4f83-9b35-bfa88b852228-c000.snappy.parquet,part-00000-65976bba-d807-4f83-9b35-bfa88b852228-c000.snappy.parquet,1066,1728708837000
dbfs:/user/hive/warehouse/employees/part-00000-7fec7995-32fc-4591-b2ce-c38399443529-c000.snappy.parquet,part-00000-7fec7995-32fc-4591-b2ce-c38399443529-c000.snappy.parquet,1076,1728708900000
dbfs:/user/hive/warehouse/employees/part-00000-b0d96a56-9d8f-4eba-93a0-481be949baac-c000.snappy.parquet,part-00000-b0d96a56-9d8f-4eba-93a0-481be949baac-c000.snappy.parquet,1059,1728708838000
dbfs:/user/hive/warehouse/employees/part-00001-f3a4dc82-ba54-4812-ab87-dfca50f7ba8f-c000.snappy.parquet,part-00001-f3a4dc82-ba54-4812-ab87-dfca50f7ba8f-c000.snappy.parquet,1066,1728708900000



Here, we can see that there are 7 data files.

But we know that our current table version references only one file (after the `OPTIMIZE` operation).

It means that the other data files are no longer used by the current version, and we can simply clean them up.



## VACUUM Command

We can manually remove old data files using the `VACUUM` command.

Let us run this command and see what will happen.

In [0]:
%sql
VACUUM employees

path
dbfs:/user/hive/warehouse/employees


If we check the table directory again, we see that nothing happened!

In [0]:
%fs ls 'dbfs:/user/hive/warehouse/employees'

path,name,size,modificationTime
dbfs:/user/hive/warehouse/employees/_delta_log/,_delta_log/,0,1728708759000
dbfs:/user/hive/warehouse/employees/part-00000-1ce54e7c-9ef3-4343-b8ed-7787e775fe0b-c000.snappy.parquet,part-00000-1ce54e7c-9ef3-4343-b8ed-7787e775fe0b-c000.snappy.parquet,1076,1728708835000
dbfs:/user/hive/warehouse/employees/part-00000-2d623fad-d89c-4565-9b90-bc1e6f5a0e80-c000.snappy.parquet,part-00000-2d623fad-d89c-4565-9b90-bc1e6f5a0e80-c000.snappy.parquet,1155,1728717605000
dbfs:/user/hive/warehouse/employees/part-00000-5bb80f0a-fe61-493a-892c-7c5f4945e249-c000.snappy.parquet,part-00000-5bb80f0a-fe61-493a-892c-7c5f4945e249-c000.snappy.parquet,1080,1728708836000
dbfs:/user/hive/warehouse/employees/part-00000-65976bba-d807-4f83-9b35-bfa88b852228-c000.snappy.parquet,part-00000-65976bba-d807-4f83-9b35-bfa88b852228-c000.snappy.parquet,1066,1728708837000
dbfs:/user/hive/warehouse/employees/part-00000-7fec7995-32fc-4591-b2ce-c38399443529-c000.snappy.parquet,part-00000-7fec7995-32fc-4591-b2ce-c38399443529-c000.snappy.parquet,1076,1728708900000
dbfs:/user/hive/warehouse/employees/part-00000-b0d96a56-9d8f-4eba-93a0-481be949baac-c000.snappy.parquet,part-00000-b0d96a56-9d8f-4eba-93a0-481be949baac-c000.snappy.parquet,1059,1728708838000
dbfs:/user/hive/warehouse/employees/part-00001-f3a4dc82-ba54-4812-ab87-dfca50f7ba8f-c000.snappy.parquet,part-00001-f3a4dc82-ba54-4812-ab87-dfca50f7ba8f-c000.snappy.parquet,1066,1728708900000


The data files are still there.

This is because `VACUUM` applies a retention period, which defaults to 7 days.
That means `VACUUM` will not delete files less than 7 days old,
to ensure that no long-running operations are still referencing any of the files to be deleted.
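The retention rule can be sketched as a filter over the files in the table directory. This is a simplification, not the real `VACUUM` logic: it assumes a file's age is judged by its modification time, `vacuum_candidates` and the labels `f1` through `f7` are hypothetical, and `NOW_S` is an assumed run time a few minutes after the `OPTIMIZE`. The modification times are the ones listed above, converted to seconds.

```python
def vacuum_candidates(on_disk, live, now_s, retain_hours=7 * 24):
    """Return files that are not live and are older than the retention window."""
    cutoff = now_s - retain_hours * 3600
    return sorted(
        name for name, mtime_s in on_disk.items()
        if name not in live and mtime_s < cutoff
    )


# Modification times (epoch seconds) from the directory listing; labels are hypothetical.
ON_DISK = {
    "f1": 1728708835, "f2": 1728708836, "f3": 1728708837, "f4": 1728708838,
    "f5": 1728708900, "f6": 1728708900,
    "f7": 1728717605,  # the compacted file written by OPTIMIZE
}
LIVE = {"f7"}           # the only file referenced by the current version
NOW_S = 1728717900      # assumed time of the VACUUM run

print(vacuum_candidates(ON_DISK, LIVE, NOW_S))                  # default 7 days: nothing qualifies
print(vacuum_candidates(ON_DISK, LIVE, NOW_S, retain_hours=0))  # zero retention: all six old files
```

Under the default window every unused file is only a few hours old, so nothing is eligible, which matches the unchanged directory listing.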

If we try to execute the `VACUUM` command with a retention of 0 hours, to keep only the current version,
this will not work, because the requested retention is shorter than the default threshold of 7 days.


In [0]:
%sql
VACUUM employees RETAIN 0 HOURS


In this demo, we will use a workaround for demonstration purposes only.
Of course, you should not do this in production.

The idea is to turn off the retention duration check.



In [0]:
%sql
SET spark.databricks.delta.retentionDurationCheck.enabled = false;

key,value
spark.databricks.delta.retentionDurationCheck.enabled,False


Now we can run our VACUUM command.


In [0]:
%sql
VACUUM employees RETAIN 0 HOURS

path
dbfs:/user/hive/warehouse/employees


Let us explore the table directory again.

In [0]:
%fs ls 'dbfs:/user/hive/warehouse/employees'

path,name,size,modificationTime
dbfs:/user/hive/warehouse/employees/_delta_log/,_delta_log/,0,1728708759000
dbfs:/user/hive/warehouse/employees/part-00000-2d623fad-d89c-4565-9b90-bc1e6f5a0e80-c000.snappy.parquet,part-00000-2d623fad-d89c-4565-9b90-bc1e6f5a0e80-c000.snappy.parquet,1155,1728717605000


The files have been successfully deleted.

6 data files have been removed.

Those files were what made Delta Lake time travel possible.

So, now we are no longer able to access old data versions.

We can easily confirm this by querying an old table version.


In [0]:
%sql
SELECT * FROM employees@v1

And indeed, we get a file-not-found exception here, because the data files for this version no longer exist.


## Dropping Tables

Finally, let us permanently delete the table, along with its data, from the Lakehouse.

For this, we use the `DROP TABLE` command, just as in standard SQL.

In [0]:
%sql
DROP TABLE employees


The table has been successfully deleted.

Let us confirm this by trying to query this table again.

In [0]:
%sql
SELECT * FROM employees

Indeed, the table is not found.

And the table directory has been completely deleted as well.

In [0]:
%fs ls 'dbfs:/user/hive/warehouse/employees'