
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

<i18n value="d2b35611-0c56-4262-b664-3a89a1d62662"/>


# Advanced Delta Lake Features

Now that you feel comfortable performing basic data tasks with Delta Lake, we can discuss a few features unique to Delta Lake.

Note that while some of the keywords used here aren't part of standard ANSI SQL, all Delta Lake operations can be run on Databricks using SQL.

## Learning Objectives
By the end of this lesson, you should be able to:
* Use **`OPTIMIZE`** to compact small files
* Use **`ZORDER`** to index tables
* Describe the directory structure of Delta Lake files
* Review a history of table transactions
* Query and roll back to previous table versions
* Clean up stale data files with **`VACUUM`**

**Resources**
* <a href="https://docs.databricks.com/spark/latest/spark-sql/language-manual/delta-optimize.html" target="_blank">Delta Optimize - Databricks Docs</a>
* <a href="https://docs.databricks.com/spark/latest/spark-sql/language-manual/delta-vacuum.html" target="_blank">Delta Vacuum - Databricks Docs</a>

<i18n value="75224cfc-51b5-4c3d-8eb3-4db08469c99f"/>


## Run Setup
The first thing we're going to do is run a setup script. It will define a username, userhome, and database that are scoped to each user.

In [0]:
%run ../Includes/Classroom-Setup-02.3

Resetting the learning environment:
| dropping the schema "munirsheikhcloudseekho_0lj9_da_dewd"...(1 seconds)
| removing the working directory "dbfs:/mnt/dbacademy-users/munirsheikhcloudseekho@gmail.com/data-engineering-with-databricks"...(0 seconds)

Skipping install of existing datasets to "dbfs:/mnt/dbacademy-datasets/data-engineering-with-databricks/v02"

Validating the locally installed datasets:
| listing local files...(9 seconds)
| completed (9 seconds total)

Creating & using the schema "munirsheikhcloudseekho_0lj9_da_dewd"...(0 seconds)
Predefined tables in "munirsheikhcloudseekho_0lj9_da_dewd":
| -none-

Predefined paths variables:
| DA.paths.working_dir: dbfs:/mnt/dbacademy-users/munirsheikhcloudseekho@gmail.com/data-engineering-with-databricks
| DA.paths.user_db:     dbfs:/mnt/dbacademy-users/munirsheikhcloudseekho@gmail.com/data-engineering-with-databricks/database.db
| DA.paths.datasets:    dbfs:/mnt/dbacademy-datasets/data-engineering-with-databricks/v02
| DA.paths.check

<i18n value="7e85feea-be41-41f7-9cd7-df2c140d6286"/>


## Creating a Delta Table with History

The cell below condenses all the transactions from the previous lesson into a single cell. (Except for the **`DROP TABLE`**!)

As you're waiting for this query to run, see if you can identify the total number of transactions being executed.

In [0]:
%sql
CREATE TABLE students
  (id INT, name STRING, value DOUBLE);
  
INSERT INTO students VALUES (1, "Yve", 1.0);
INSERT INTO students VALUES (2, "Omar", 2.5);
INSERT INTO students VALUES (3, "Elia", 3.3);

INSERT INTO students
VALUES 
  (4, "Ted", 4.7),
  (5, "Tiffany", 5.5),
  (6, "Vini", 6.3);
  
UPDATE students 
SET value = value + 1
WHERE name LIKE "T%";

DELETE FROM students 
WHERE value > 6;

CREATE OR REPLACE TEMP VIEW updates(id, name, value, type) AS VALUES
  (2, "Omar", 15.2, "update"),
  (3, "", null, "delete"),
  (7, "Blue", 7.7, "insert"),
  (11, "Diya", 8.8, "update");
  
MERGE INTO students b
USING updates u
ON b.id=u.id
WHEN MATCHED AND u.type = "update"
  THEN UPDATE SET *
WHEN MATCHED AND u.type = "delete"
  THEN DELETE
WHEN NOT MATCHED AND u.type = "insert"
  THEN INSERT *;

num_affected_rows,num_updated_rows,num_deleted_rows,num_inserted_rows
3,1,1,1


<i18n value="5f6b0330-42f2-4307-9ff2-0b534947b286"/>


## Examine Table Details

Databricks uses a Hive metastore by default to register databases, tables, and views.

Using **`DESCRIBE EXTENDED`** allows us to see important metadata about our table.

In [0]:
%sql
DESCRIBE EXTENDED students

col_name,data_type,comment
id,int,
name,string,
value,double,
,,
# Detailed Table Information,,
Catalog,spark_catalog,
Database,munirsheikhcloudseekho_0lj9_da_dewd,
Table,students,
Type,MANAGED,
Location,dbfs:/mnt/dbacademy-users/munirsheikhcloudseekho@gmail.com/data-engineering-with-databricks/database.db/students,


<i18n value="5495f382-2841-4cf5-b872-db4dd3828ee5"/>


**`DESCRIBE DETAIL`** is another command that allows us to explore table metadata.

In [0]:
%sql
DESCRIBE DETAIL students

format,id,name,description,location,createdAt,lastModified,partitionColumns,numFiles,sizeInBytes,properties,minReaderVersion,minWriterVersion
delta,8c6ca3ab-6193-4c1d-94b8-75dbf479dba2,spark_catalog.munirsheikhcloudseekho_0lj9_da_dewd.students,,dbfs:/mnt/dbacademy-users/munirsheikhcloudseekho@gmail.com/data-engineering-with-databricks/database.db/students,2022-11-13T04:19:00.005+0000,2022-11-13T04:19:30.000+0000,List(),4,4236,Map(),1,2


<i18n value="4ab0fa4f-72cb-4f3b-8ea3-228b13be1baf"/>


Note the **`Location`** field.

While we've so far been thinking about our table as just a relational entity within a database, a Delta Lake table is actually backed by a collection of files stored in cloud object storage.

<i18n value="10e37764-bbfd-4669-a967-addd58041d47"/>


## Explore Delta Lake Files

We can see the files backing our Delta Lake table by using a Databricks Utilities function.

**NOTE**: It's not important right now to know everything about these files to work with Delta Lake, but it will help you gain a greater appreciation for how the technology is implemented.

In [0]:
%python
display(dbutils.fs.ls(f"{DA.paths.user_db}/students"))

path,name,size,modificationTime
dbfs:/mnt/dbacademy-users/munirsheikhcloudseekho@gmail.com/data-engineering-with-databricks/database.db/students/_delta_log/,_delta_log/,0,0
dbfs:/mnt/dbacademy-users/munirsheikhcloudseekho@gmail.com/data-engineering-with-databricks/database.db/students/part-00000-32e36ca5-b31a-45cd-a300-4df7d379044d-c000.snappy.parquet,part-00000-32e36ca5-b31a-45cd-a300-4df7d379044d-c000.snappy.parquet,1063,1668313170000
dbfs:/mnt/dbacademy-users/munirsheikhcloudseekho@gmail.com/data-engineering-with-databricks/database.db/students/part-00000-38f04125-ec62-4798-9c6f-20bb79048f62-c000.snappy.parquet,part-00000-38f04125-ec62-4798-9c6f-20bb79048f62-c000.snappy.parquet,1089,1668313153000
dbfs:/mnt/dbacademy-users/munirsheikhcloudseekho@gmail.com/data-engineering-with-databricks/database.db/students/part-00000-72b1d335-df13-422b-b373-bfb0101d7f52-c000.snappy.parquet,part-00000-72b1d335-df13-422b-b373-bfb0101d7f52-c000.snappy.parquet,1055,1668313163000
dbfs:/mnt/dbacademy-users/munirsheikhcloudseekho@gmail.com/data-engineering-with-databricks/database.db/students/part-00000-763bd822-436b-4c70-9a0e-19aea9b0d69f-c000.snappy.parquet,part-00000-763bd822-436b-4c70-9a0e-19aea9b0d69f-c000.snappy.parquet,1063,1668313147000
dbfs:/mnt/dbacademy-users/munirsheikhcloudseekho@gmail.com/data-engineering-with-databricks/database.db/students/part-00000-7e8f0164-c83b-4006-820c-e9ad49c52cf5-c000.snappy.parquet,part-00000-7e8f0164-c83b-4006-820c-e9ad49c52cf5-c000.snappy.parquet,1089,1668313158000
dbfs:/mnt/dbacademy-users/munirsheikhcloudseekho@gmail.com/data-engineering-with-databricks/database.db/students/part-00000-98cb9d63-d994-425e-8714-70b68d42c385-c000.snappy.parquet,part-00000-98cb9d63-d994-425e-8714-70b68d42c385-c000.snappy.parquet,1063,1668313150000
dbfs:/mnt/dbacademy-users/munirsheikhcloudseekho@gmail.com/data-engineering-with-databricks/database.db/students/part-00000-ccdb2791-8233-492d-a8d7-678bf38d58cd-c000.snappy.parquet,part-00000-ccdb2791-8233-492d-a8d7-678bf38d58cd-c000.snappy.parquet,1055,1668313144000
dbfs:/mnt/dbacademy-users/munirsheikhcloudseekho@gmail.com/data-engineering-with-databricks/database.db/students/part-00002-43bb9c30-9485-44de-9dc9-9221416ace27-c000.snappy.parquet,part-00002-43bb9c30-9485-44de-9dc9-9221416ace27-c000.snappy.parquet,1063,1668313170000


<i18n value="075483eb-7ddd-46ef-bbb1-33ee7005923b"/>


Note that our directory contains a number of Parquet data files and a directory named **`_delta_log`**.

Records in Delta Lake tables are stored as data in Parquet files.

Transactions to Delta Lake tables are recorded in the **`_delta_log`**.

We can peek inside the **`_delta_log`** to see more.

In [0]:
%python
display(dbutils.fs.ls(f"{DA.paths.user_db}/students/_delta_log"))

path,name,size,modificationTime
dbfs:/mnt/dbacademy-users/munirsheikhcloudseekho@gmail.com/data-engineering-with-databricks/database.db/students/_delta_log/00000000000000000000.crc,00000000000000000000.crc,2006,1668313143000
dbfs:/mnt/dbacademy-users/munirsheikhcloudseekho@gmail.com/data-engineering-with-databricks/database.db/students/_delta_log/00000000000000000000.json,00000000000000000000.json,1005,1668313141000
dbfs:/mnt/dbacademy-users/munirsheikhcloudseekho@gmail.com/data-engineering-with-databricks/database.db/students/_delta_log/00000000000000000001.crc,00000000000000000001.crc,2524,1668313146000
dbfs:/mnt/dbacademy-users/munirsheikhcloudseekho@gmail.com/data-engineering-with-databricks/database.db/students/_delta_log/00000000000000000001.json,00000000000000000001.json,1048,1668313145000
dbfs:/mnt/dbacademy-users/munirsheikhcloudseekho@gmail.com/data-engineering-with-databricks/database.db/students/_delta_log/00000000000000000002.crc,00000000000000000002.crc,3039,1668313149000
dbfs:/mnt/dbacademy-users/munirsheikhcloudseekho@gmail.com/data-engineering-with-databricks/database.db/students/_delta_log/00000000000000000002.json,00000000000000000002.json,1050,1668313148000
dbfs:/mnt/dbacademy-users/munirsheikhcloudseekho@gmail.com/data-engineering-with-databricks/database.db/students/_delta_log/00000000000000000003.crc,00000000000000000003.crc,3554,1668313152000
dbfs:/mnt/dbacademy-users/munirsheikhcloudseekho@gmail.com/data-engineering-with-databricks/database.db/students/_delta_log/00000000000000000003.json,00000000000000000003.json,1050,1668313150000
dbfs:/mnt/dbacademy-users/munirsheikhcloudseekho@gmail.com/data-engineering-with-databricks/database.db/students/_delta_log/00000000000000000004.crc,00000000000000000004.crc,4068,1668313156000
dbfs:/mnt/dbacademy-users/munirsheikhcloudseekho@gmail.com/data-engineering-with-databricks/database.db/students/_delta_log/00000000000000000004.json,00000000000000000004.json,1049,1668313154000


<i18n value="1bcbb8d1-f871-451a-ad16-762dfa91c0a3"/>


Each transaction results in a new JSON file being written to the Delta Lake transaction log. There are 8 total transactions against this table, numbered 0 through 7, because Delta Lake versions are zero-indexed.
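
The relationship between log file names and version numbers can be sketched in plain Python. This is only an illustration, not part of the lesson's setup: the listing is rebuilt here for versions 0 through 7 (an assumption matching the eight transactions in the setup cell), and **`commit_versions`** is a hypothetical helper name.

```python
import re

# Hypothetical listing: one .json commit file and one .crc checksum file per version 0..7.
log_listing = []
for v in range(8):
    log_listing.append(f"{v:020d}.crc")
    log_listing.append(f"{v:020d}.json")

# Commit files are named with a 20-digit, zero-padded version number.
COMMIT_PATTERN = re.compile(r"^(\d{20})\.json$")


def commit_versions(file_names):
    """Return the sorted commit versions found among the log file names."""
    found = []
    for name in file_names:
        match = COMMIT_PATTERN.match(name)
        if match:
            found.append(int(match.group(1)))
    return sorted(found)


versions = commit_versions(log_listing)
print("versions:", versions)
print("transactions:", len(versions))
```

The **`.crc`** files are skipped because only the JSON files represent commits.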

<i18n value="c2fbd6d7-ea8e-4000-9702-e21408f3ef78"/>


## Reasoning about Data Files

We just saw a lot of data files for what is obviously a very small table.

**`DESCRIBE DETAIL`** allows us to see some other details about our Delta table, including the number of files.

In [0]:
%sql
DESCRIBE DETAIL students

format,id,name,description,location,createdAt,lastModified,partitionColumns,numFiles,sizeInBytes,properties,minReaderVersion,minWriterVersion
delta,8c6ca3ab-6193-4c1d-94b8-75dbf479dba2,spark_catalog.munirsheikhcloudseekho_0lj9_da_dewd.students,,dbfs:/mnt/dbacademy-users/munirsheikhcloudseekho@gmail.com/data-engineering-with-databricks/database.db/students,2022-11-13T04:19:00.005+0000,2022-11-13T04:19:30.000+0000,List(),4,4236,Map(),1,2


<i18n value="adf1dc55-37a4-4376-86df-78895bfcf6b8"/>


Here we see that our table currently contains 4 data files in its present version. So what are all those other Parquet files doing in our table directory? 

Rather than overwriting or immediately deleting files containing changed data, Delta Lake uses the transaction log to indicate whether or not files are valid in a current version of the table.

Here, we'll look at the transaction log corresponding to the **`MERGE`** statement above, where records were inserted, updated, and deleted.

In [0]:
%python
display(spark.sql(f"SELECT * FROM json.`{DA.paths.user_db}/students/_delta_log/00000000000000000007.json`"))

add,commitInfo,remove
,"List(1113-035301-4efipd3u, Databricks-Runtime/11.3.x-scala2.12, false, WriteSerializable, List(4094000743660164), MERGE, List(4399, 2, 4, 0, 2, 2, 0, 1, 1, 1, 1890, 2283), List([{""predicate"":""(u.type = 'update')"",""actionType"":""update""},{""predicate"":""(u.type = 'delete')"",""actionType"":""delete""}], [{""predicate"":""(u.type = 'insert')"",""actionType"":""insert""}], (b.id = u.id)), 6, 1668313169658, 9ceb6fe3-934a-4dd5-9695-9c4f66ab2a0c, 2682279945671776, munirsheikhcloudseekho@gmail.com)",
,,"List(true, 1668313169564, true, part-00000-763bd822-436b-4c70-9a0e-19aea9b0d69f-c000.snappy.parquet, 1063, List(1668313147000000, 1668313147000000, 1668313147000000, 268435456))"
,,"List(true, 1668313169564, true, part-00000-98cb9d63-d994-425e-8714-70b68d42c385-c000.snappy.parquet, 1063, List(1668313150000000, 1668313150000000, 1668313150000000, 268435456))"
"List(true, 1668313170000, part-00000-32e36ca5-b31a-45cd-a300-4df7d379044d-c000.snappy.parquet, 1063, {""numRecords"":1,""minValues"":{""id"":2,""name"":""Omar"",""value"":15.2},""maxValues"":{""id"":2,""name"":""Omar"",""value"":15.2},""nullCount"":{""id"":0,""name"":0,""value"":0}}, List(1668313170000000, 1668313170000000, 1668313147000000, 268435456))",,
"List(true, 1668313170000, part-00002-43bb9c30-9485-44de-9dc9-9221416ace27-c000.snappy.parquet, 1063, {""numRecords"":1,""minValues"":{""id"":7,""name"":""Blue"",""value"":7.7},""maxValues"":{""id"":7,""name"":""Blue"",""value"":7.7},""nullCount"":{""id"":0,""name"":0,""value"":0}}, List(1668313170000001, 1668313170000001, 1668313147000000, 268435456))",,


<i18n value="85e8bce8-c168-4ac6-9835-f694cab5b43c"/>


The **`add`** column contains a list of all the new files written to our table; the **`remove`** column indicates those files that no longer should be included in our table.

When we query a Delta Lake table, the query engine uses the transaction logs to resolve all the files that are valid in the current version, and ignores all other data files.
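
As a rough mental model of how the valid files are resolved, the sketch below replays a list of commits in order, applying each commit's **`remove`** and **`add`** entries to a running set. The commit dictionaries and file names are hypothetical and heavily simplified; the real log entries carry much more metadata, and this is not the engine's actual implementation.

```python
def active_files(commits):
    """Replay commits in order and return the set of data files still valid."""
    valid = set()
    for commit in commits:
        # Files listed under "remove" stop being part of the table...
        valid -= set(commit.get("remove", []))
        # ...and files listed under "add" become part of it.
        valid |= set(commit.get("add", []))
    return valid


# Hypothetical, simplified commits (file names are made up for illustration).
commits = [
    {"add": ["file-a.parquet"]},                                 # an insert
    {"add": ["file-b.parquet"]},                                 # another insert
    {"add": ["file-c.parquet"], "remove": ["file-a.parquet"]},   # a rewrite of file-a
]

print(sorted(active_files(commits)))
```

Note that **`file-a.parquet`** would still exist in the table directory after the third commit; it is simply no longer part of the current version.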

<i18n value="c69bbf45-e75e-419f-a149-fd18f76daab6"/>


## Compacting Small Files and Indexing

Small files can occur for a variety of reasons; in our case, we performed a number of operations where only one or several records were inserted.

Files will be combined toward an optimal size (scaled based on the size of the table) by using the **`OPTIMIZE`** command.

**`OPTIMIZE`** will replace existing data files by combining records and rewriting the results.
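
One simple way to picture compaction is as grouping small files into bins that stay under a target size. The greedy sketch below is an illustration only and makes no claim about how **`OPTIMIZE`** actually plans its work; **`plan_compaction`** and the target sizes are hypothetical, and the four file sizes are assumed values chosen to be consistent with the 4 files and 4236 bytes reported by **`DESCRIBE DETAIL`**.

```python
def plan_compaction(file_sizes, target_size):
    """Greedily group file sizes into bins whose totals stay within target_size."""
    bins = []
    current, current_total = [], 0
    for size in sorted(file_sizes):
        # Start a new bin when adding this file would exceed the target.
        if current and current_total + size > target_size:
            bins.append(current)
            current, current_total = [], 0
        current.append(size)
        current_total += size
    if current:
        bins.append(current)
    return bins


# Assumed sizes in bytes for the four active files (they sum to 4236).
small_files = [1055, 1063, 1055, 1063]

# With a large (illustrative) target, everything fits into a single output file.
print(plan_compaction(small_files, target_size=1024 * 1024 * 1024))

# With a tiny target, the files would be split across several outputs.
print(plan_compaction(small_files, target_size=2200))
```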

When executing **`OPTIMIZE`**, users can optionally specify one or several fields for **`ZORDER`** indexing. While the specific math of Z-order is unimportant, it speeds up data retrieval when filtering on provided fields by colocating data with similar values within data files.

In [0]:
%sql
OPTIMIZE students
ZORDER BY id

path,metrics
dbfs:/mnt/dbacademy-users/munirsheikhcloudseekho@gmail.com/data-engineering-with-databricks/database.db/students,"List(1, 4, List(1098, 1098, 1098.0, 1, 1098), List(1055, 1063, 1059.0, 4, 4236), 0, List(minCubeSize(107374182400), List(0, 0), List(4, 4236), 0, List(4, 4236), 1, null), 1, 4, 0, false, 0, 0, 1668313455561, 1668313462455, 8, 1, null)"


<i18n value="15475907-e307-491c-9bab-4d8afc363ec5"/>


Given how small our data is, **`ZORDER`** does not provide any benefit, but we can see all of the metrics that result from this operation.
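
For the curious, the idea behind Z-ordering can be illustrated with a generic Morton code, which interleaves the bits of two column values so that rows close in both dimensions tend to sort near each other. This is a textbook sketch, not the Databricks implementation; **`interleave_bits`** is a hypothetical helper. With a single column such as **`id`**, the ordering conceptually reduces to a plain sort on that column.

```python
def interleave_bits(x, y, bits=16):
    """Interleave the low `bits` bits of x and y into one Morton (Z-order) code."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)        # bits of x go to even positions
        code |= ((y >> i) & 1) << (2 * i + 1)    # bits of y go to odd positions
    return code


# Hypothetical two-column values; sorting by the interleaved code keeps
# points that are near each other in both dimensions close together.
points = [(0, 0), (3, 3), (1, 0), (0, 1), (2, 2), (1, 1)]
ordered = sorted(points, key=lambda p: interleave_bits(p[0], p[1]))
print(ordered)
```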

<i18n value="5684dfb4-0b33-49f1-a4f8-cb2f8d88bf09"/>


## Reviewing Delta Lake Transactions

Because all changes to the Delta Lake table are stored in the transaction log, we can easily review the <a href="https://docs.databricks.com/spark/2.x/spark-sql/language-manual/describe-history.html" target="_blank">table history</a>.

In [0]:
%sql
DESCRIBE HISTORY students

version,timestamp,userId,userName,operation,operationParameters,job,notebook,clusterId,readVersion,isolationLevel,isBlindAppend,operationMetrics,userMetadata,engineInfo
8,2022-11-13T04:24:20.000+0000,2682279945671776,munirsheikhcloudseekho@gmail.com,OPTIMIZE,"Map(predicate -> [], zOrderBy -> [""id""], batchId -> 0, auto -> false)",,List(4094000743660164),1113-035301-4efipd3u,7.0,SnapshotIsolation,False,"Map(numRemovedFiles -> 4, numRemovedBytes -> 4236, p25FileSize -> 1098, minFileSize -> 1098, numAddedFiles -> 1, maxFileSize -> 1098, p75FileSize -> 1098, p50FileSize -> 1098, numAddedBytes -> 1098)",,Databricks-Runtime/11.3.x-scala2.12
7,2022-11-13T04:19:30.000+0000,2682279945671776,munirsheikhcloudseekho@gmail.com,MERGE,"Map(predicate -> (b.id = u.id), matchedPredicates -> [{""predicate"":""(u.type = 'update')"",""actionType"":""update""},{""predicate"":""(u.type = 'delete')"",""actionType"":""delete""}], notMatchedPredicates -> [{""predicate"":""(u.type = 'insert')"",""actionType"":""insert""}])",,List(4094000743660164),1113-035301-4efipd3u,6.0,WriteSerializable,False,"Map(numTargetRowsCopied -> 0, numTargetRowsDeleted -> 1, numTargetFilesAdded -> 2, executionTimeMs -> 4399, numTargetRowsInserted -> 1, scanTimeMs -> 2283, numTargetRowsUpdated -> 1, numOutputRows -> 2, numTargetChangeFilesAdded -> 0, numSourceRows -> 4, numTargetFilesRemoved -> 2, rewriteTimeMs -> 1890)",,Databricks-Runtime/11.3.x-scala2.12
6,2022-11-13T04:19:24.000+0000,2682279945671776,munirsheikhcloudseekho@gmail.com,DELETE,"Map(predicate -> [""(spark_catalog.munirsheikhcloudseekho_0lj9_da_dewd.students.value > 6.0D)""])",,List(4094000743660164),1113-035301-4efipd3u,5.0,WriteSerializable,False,"Map(numRemovedFiles -> 1, numCopiedRows -> 1, numAddedChangeFiles -> 0, executionTimeMs -> 1817, numDeletedRows -> 2, scanTimeMs -> 817, numAddedFiles -> 1, rewriteTimeMs -> 1000)",,Databricks-Runtime/11.3.x-scala2.12
5,2022-11-13T04:19:19.000+0000,2682279945671776,munirsheikhcloudseekho@gmail.com,UPDATE,"Map(predicate -> StartsWith(name#18517, T))",,List(4094000743660164),1113-035301-4efipd3u,4.0,WriteSerializable,False,"Map(numRemovedFiles -> 1, numCopiedRows -> 1, numAddedChangeFiles -> 0, executionTimeMs -> 2152, scanTimeMs -> 1114, numAddedFiles -> 1, numUpdatedRows -> 2, rewriteTimeMs -> 1037)",,Databricks-Runtime/11.3.x-scala2.12
4,2022-11-13T04:19:14.000+0000,2682279945671776,munirsheikhcloudseekho@gmail.com,WRITE,"Map(mode -> Append, partitionBy -> [])",,List(4094000743660164),1113-035301-4efipd3u,3.0,WriteSerializable,True,"Map(numFiles -> 1, numOutputRows -> 3, numOutputBytes -> 1089)",,Databricks-Runtime/11.3.x-scala2.12
3,2022-11-13T04:19:10.000+0000,2682279945671776,munirsheikhcloudseekho@gmail.com,WRITE,"Map(mode -> Append, partitionBy -> [])",,List(4094000743660164),1113-035301-4efipd3u,2.0,WriteSerializable,True,"Map(numFiles -> 1, numOutputRows -> 1, numOutputBytes -> 1063)",,Databricks-Runtime/11.3.x-scala2.12
2,2022-11-13T04:19:08.000+0000,2682279945671776,munirsheikhcloudseekho@gmail.com,WRITE,"Map(mode -> Append, partitionBy -> [])",,List(4094000743660164),1113-035301-4efipd3u,1.0,WriteSerializable,True,"Map(numFiles -> 1, numOutputRows -> 1, numOutputBytes -> 1063)",,Databricks-Runtime/11.3.x-scala2.12
1,2022-11-13T04:19:05.000+0000,2682279945671776,munirsheikhcloudseekho@gmail.com,WRITE,"Map(mode -> Append, partitionBy -> [])",,List(4094000743660164),1113-035301-4efipd3u,0.0,WriteSerializable,True,"Map(numFiles -> 1, numOutputRows -> 1, numOutputBytes -> 1055)",,Databricks-Runtime/11.3.x-scala2.12
0,2022-11-13T04:19:01.000+0000,2682279945671776,munirsheikhcloudseekho@gmail.com,CREATE TABLE,"Map(isManaged -> true, description -> null, partitionBy -> [], properties -> {})",,List(4094000743660164),1113-035301-4efipd3u,,WriteSerializable,True,Map(),,Databricks-Runtime/11.3.x-scala2.12


<i18n value="56de8919-b5d0-4d1f-81d8-ccf22fdf6da0"/>


As expected, **`OPTIMIZE`** created another version of our table, meaning that version 8 is our most current version.

Remember all of those extra data files that had been marked as removed in our transaction log? These provide us with the ability to query previous versions of our table.

These time travel queries can be performed by specifying either the integer version or a timestamp.

**NOTE**: In most cases, you'll use a timestamp to recreate data at a time of interest. For our demo we'll use version, as this is deterministic (whereas you may be running this demo at any time in the future).

In [0]:
%sql
SELECT * 
FROM students VERSION AS OF 3

id,name,value
2,Omar,2.5
3,Elia,3.3
1,Yve,1.0


<i18n value="0499f01b-7700-4381-80cc-9b4fb093017a"/>


What's important to note about time travel is that we're not recreating a previous state of the table by undoing transactions against our current version; rather, we're just querying all those data files that were indicated as valid as of the specified version.
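
Timestamp-based time travel can be pictured as first resolving the timestamp to a version, and then reading the files valid for that version. The sketch below assumes a timestamp resolves to the latest version committed at or before it; the commit times are copied from the **`DESCRIBE HISTORY`** output earlier in this lesson, and **`version_as_of`** is a hypothetical helper, not a Delta Lake API.

```python
from bisect import bisect_right
from datetime import datetime

# (version, commit time in UTC) pairs taken from the DESCRIBE HISTORY output.
history = [
    (0, "2022-11-13T04:19:01"),
    (1, "2022-11-13T04:19:05"),
    (2, "2022-11-13T04:19:08"),
    (3, "2022-11-13T04:19:10"),
    (4, "2022-11-13T04:19:14"),
    (5, "2022-11-13T04:19:19"),
    (6, "2022-11-13T04:19:24"),
    (7, "2022-11-13T04:19:30"),
    (8, "2022-11-13T04:24:20"),
]


def version_as_of(history, timestamp):
    """Return the latest version committed at or before `timestamp` (assumed rule)."""
    commit_times = [datetime.fromisoformat(ts) for _, ts in history]
    idx = bisect_right(commit_times, datetime.fromisoformat(timestamp))
    if idx == 0:
        raise ValueError("timestamp is earlier than the first commit")
    return history[idx - 1][0]


print(version_as_of(history, "2022-11-13T04:19:12"))
```

A timestamp falling between the commits for versions 3 and 4 resolves to version 3 under this rule.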

<i18n value="f569a57f-24cc-403a-88ab-709b4f1a7548"/>


## Rollback Versions

Suppose you're typing up a query to manually delete some records from a table and you accidentally execute it in the following, incomplete state.

In [0]:
%sql
DELETE FROM students

num_affected_rows
-1


<i18n value="b7d46e40-1c41-4e8a-8f25-25325da065cb"/>


Note that when we see a **`-1`** for number of rows affected by a delete, this means an entire directory of data has been removed.

Let's confirm this below.

In [0]:
%sql
SELECT * FROM students

id,name,value


<i18n value="0477fb25-7248-4552-98a1-ffee4cd7b5b0"/>


Deleting all the records in your table is probably not a desired outcome. Luckily, we can simply roll back this commit.

In [0]:
%sql
RESTORE TABLE students TO VERSION AS OF 8 

table_size_after_restore,num_of_files_after_restore,num_removed_files,num_restored_files,removed_files_size,restored_files_size
1098,1,0,1,0,1098


In [0]:
%sql
SELECT * FROM students

id,name,value
2,Omar,15.2
7,Blue,7.7
4,Ted,5.7
1,Yve,1.0


<i18n value="4fbc3b91-8b73-4644-95cb-f9ca2f1ac6a3"/>


Note that a **`RESTORE`** <a href="https://docs.databricks.com/spark/latest/spark-sql/language-manual/delta-restore.html" target="_blank">command</a> is recorded as a transaction; you won't be able to completely hide the fact that you accidentally deleted all the records in the table, but you will be able to undo the operation and bring your table back to a desired state.
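
The point that a restore adds to the history rather than rewriting it can be sketched with a toy model. Everything here is hypothetical and simplified: the history is a plain list, each entry stores the full set of files for that version, and **`restore`** is an illustrative function rather than the real command.

```python
def restore(history, version):
    """Append a new entry whose files match `version`; earlier entries stay untouched."""
    snapshot = history[version]["files"]
    history.append({"operation": "RESTORE", "files": set(snapshot)})
    return len(history) - 1  # version number of the new commit


# Toy history: version 0 writes one file, version 1 accidentally deletes everything.
history = [
    {"operation": "WRITE", "files": {"a.parquet"}},
    {"operation": "DELETE", "files": set()},
]

new_version = restore(history, 0)
print(new_version, [entry["operation"] for entry in history])
```

The accidental **`DELETE`** remains visible in the history; the restore simply becomes the newest version.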

<i18n value="789ca5cf-5eb1-4a81-a595-624994a512f1"/>


## Cleaning Up Stale Files

Databricks will automatically clean up stale files in Delta Lake tables.

While Delta Lake versioning and time travel are great for querying recent versions and rolling back queries, keeping the data files for all versions of large production tables around indefinitely is very expensive (and can lead to compliance issues if PII is present).

If you wish to manually purge old data files, this can be performed with the **`VACUUM`** operation.

Uncomment the following cell and execute it with a retention of **`0 HOURS`** to keep only the current version:

In [0]:
%sql
-- VACUUM students RETAIN 0 HOURS

<i18n value="6a3b0b37-1387-4b41-86bf-3f181ddc1562"/>


By default, **`VACUUM`** will prevent you from deleting files less than 7 days old, to ensure that no long-running operations are still referencing any of the files to be deleted. If you run **`VACUUM`** on a Delta table, you lose the ability to time travel back to a version older than the specified data retention period. In our demos, you may see Databricks executing code that specifies a retention of **`0 HOURS`**. This is simply to demonstrate the feature and is not typically done in production.
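
The effect of the retention period can be sketched as a filter: a file is a deletion candidate only if it is not part of the current version and is older than the retention window. This is a simplified model with hypothetical names (**`vacuum_candidates`**, the file names, and the timestamps); the real command decides file age from its own metadata.

```python
from datetime import datetime, timedelta


def vacuum_candidates(all_files, active_files, now, retention):
    """Return files not in the current version and older than the retention window."""
    cutoff = now - retention
    return sorted(
        name
        for name, modified in all_files.items()
        if name not in active_files and modified < cutoff
    )


# Hypothetical directory contents: file name -> last modification time.
now = datetime(2022, 11, 13, 5, 0, 0)
all_files = {
    "old-1.parquet": now - timedelta(days=10),
    "old-2.parquet": now - timedelta(hours=1),
    "current.parquet": now - timedelta(minutes=30),
}
active_files = {"current.parquet"}

# Default-style retention of 7 days: only the 10-day-old stale file qualifies.
print(vacuum_candidates(all_files, active_files, now, timedelta(days=7)))

# Retention of 0 hours: every stale file qualifies, so time travel is lost.
print(vacuum_candidates(all_files, active_files, now, timedelta(hours=0)))
```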

In the following cell, we:
1. Turn off a check to prevent premature deletion of data files
1. Make sure that logging of **`VACUUM`** commands is enabled
1. Use the **`DRY RUN`** version of vacuum to print out all files to be deleted

In [0]:
%sql
SET spark.databricks.delta.retentionDurationCheck.enabled = false;
SET spark.databricks.delta.vacuum.logging.enabled = true;

VACUUM students RETAIN 0 HOURS DRY RUN

<i18n value="be50e096-ba08-43be-8056-d56ad5ae7914"/>


By running **`VACUUM`** and deleting the 10 files above, we will permanently remove access to versions of the table that require these files to materialize.

In [0]:
%sql
VACUUM students RETAIN 0 HOURS

<i18n value="a847e55a-0ecf-4b10-85ab-5aa8566ff4e1"/>


Check the table directory to show that files have been successfully deleted.

In [0]:
%python
display(dbutils.fs.ls(f"{DA.paths.user_db}/students"))

<i18n value="b854a50f-635b-4cdc-8f18-38c5ab595648"/>

 
Run the following cell to delete the tables and files associated with this lesson.

In [0]:
%python
DA.cleanup()

&copy; 2022 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="https://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="https://help.databricks.com/">Support</a>