# Lecture 12. Advanced Delta Lake Features


In this video, we will talk about advanced features of Delta Lake.

We will cover 
   - the time travel capabilities
   - how to optimize Delta tables by compacting small files and applying indexing
   - how to clean up unused data files in the table directory.



### Reference
- [Documentation > Develop on Databricks > SQL language reference > OPTIMIZE](https://docs.databricks.com/sql/language-manual/delta-optimize.html)
- [Documentation > Develop on Databricks > SQL language reference > VACUUM](https://docs.databricks.com/sql/language-manual/delta-vacuum.html)

## Time travel feature with Delta

- Every operation on the table is automatically versioned, which provides the full audit trail of all the changes that have happened on the table.

  You can look at the history of the table in SQL using the `DESCRIBE HISTORY` command.

- In addition, we can query older versions of the table.

  * This can be done in two ways. The first is to use a timestamp: in a `SELECT` statement, add the keyword `TIMESTAMP AS OF` followed by a timestamp or date string.
  
    ```SQL
    SELECT * FROM my_table TIMESTAMP AS OF "2019-01-01"
    ```

  * The second way is to use a version number.

    Since every operation on the table has a version number, 
    you can use this version number to travel back in time as well.

    Here we use the keyword `VERSION AS OF`, or simply `@v`, which is the short syntax.

    ```SQL
    SELECT * FROM my_table VERSION AS OF 36;
    SELECT * FROM my_table@v36;
    ```

- Time travel also makes it easy to do rollbacks in case of bad writes.

  For example, if your pipeline job had a bug that accidentally deleted user information, 
  you can easily fix this using the `RESTORE TABLE` command, 
  either to restore the table to a specific timestamp or to a specific version number.

  ```SQL
  RESTORE TABLE my_table TO TIMESTAMP AS OF "2019-01-01";
  RESTORE TABLE my_table TO VERSION AS OF 36;
  ```
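
The time travel commands above can be combined into a simple recovery workflow. The following is a minimal sketch, assuming a Delta table named `my_table` as in the examples above. The version number 36 is illustrative only: it stands for the last good version before a hypothetical bad write.

```sql
-- List the table's versions with their timestamps and operations
DESCRIBE HISTORY my_table;

-- Inspect the table as it was before the (hypothetical) bad write
SELECT COUNT(*) FROM my_table VERSION AS OF 36;

-- Roll the table back to that version
RESTORE TABLE my_table TO VERSION AS OF 36;
```

Note that the restore is itself recorded as a new version, so it also shows up in the output of `DESCRIBE HISTORY`.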



## `OPTIMIZE`

### Compaction

The second important feature is compacting small files.

Delta Lake can improve the speed of read queries on a table. 
One way to do this is to compact small files into larger ones. 
You trigger compaction by running the `OPTIMIZE` command.

```sql
OPTIMIZE my_table
```

For example, if a table has many small files, running the `OPTIMIZE` command 
compacts them into one or more larger files, which improves the table's read performance.

<div style="text-align: center;">
<img src="../../assets/images/Presentation-Images/Delta Lake table compaction.jpg" style="width:360px" >
</div> 
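
One way to see the effect of compaction is to compare the table's file count before and after. This is a minimal sketch, assuming the same `my_table` as above; it relies on `DESCRIBE DETAIL`, whose output includes a `numFiles` column.

```sql
-- Check how many data files the table currently has (numFiles column)
DESCRIBE DETAIL my_table;

-- Compact small files into larger ones
OPTIMIZE my_table;

-- Check again: numFiles should be lower if there were small files to compact
DESCRIBE DETAIL my_table;
```

The `OPTIMIZE` command also returns its own metrics, such as the number of files added and removed.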




### Indexing

With `OPTIMIZE`, we can also do Z-order indexing. 

*Z-order indexing* in Delta Lake is about co-locating related column information in the same set of files.

This can be done by adding the `ZORDER BY` keyword to the `OPTIMIZE` command, followed by one or more column names.

```sql
OPTIMIZE my_table ZORDER BY column_name
```

For example, if the data files have a numerical column such as `ID`, applying Z-order on this column could leave the first compacted file with the values 1 to 50 and the second with the values 51 to 100.

<div style="text-align: center;">
<img src="../../assets/images/Presentation-Images/Delta Lake table indexing.jpg" style="width:360px" >
</div> 

Z-order indexing is used by the data skipping algorithm to dramatically reduce the amount of data that needs to be read.

In our example, if you query for an ID, say 30, 
Delta knows that ID 30 is in file number one, 
so it can skip scanning file number two, 
which can save a huge amount of time.
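
The example above can be sketched in SQL as follows. This assumes that `my_table` has a numerical column named `id`, which is a hypothetical name used only for illustration.

```sql
-- Compact the table and co-locate rows with similar id values
OPTIMIZE my_table ZORDER BY (id);

-- A selective filter on the Z-ordered column lets Delta skip files
-- whose min/max statistics for id cannot contain the value 30
SELECT * FROM my_table WHERE id = 30;
```

Z-ordering tends to pay off most on columns that are frequently used in query filters.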



## Vacuum a Delta table

The third feature is cleaning up unused data files, such as 

  * uncommitted files and 
  * files that are no longer in the latest state of the transaction log for the table.

Delta Lake allows you to do this garbage collection 
by using the `VACUUM` command. 

With the `VACUUM` command, 
you specify a retention period threshold for the files, 
and all unused files older than this threshold are deleted.

```SQL
VACUUM table_name [RETAIN num HOURS]
```

By default, the threshold is 7 days. 
This means that the vacuum operation will prevent you from deleting files less than 7 days old, 
to ensure that no long-running operations are still referencing any of the files to be deleted.
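
As a concrete illustration, here is a minimal sketch of typical `VACUUM` calls, assuming a Delta table named `my_table`. The retention period is given in hours, and `DRY RUN` lets you preview the effect first.

```sql
-- Preview the files that would be deleted, without deleting anything
VACUUM my_table DRY RUN;

-- Delete unused files older than the default 7-day threshold
VACUUM my_table;

-- Same, with the retention period given explicitly (168 hours = 7 days)
VACUUM my_table RETAIN 168 HOURS;
```

Running the dry run first is a cheap safeguard, since deleted data files cannot be recovered through time travel.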

But remember, once you run a vacuum on a Delta table, you lose the ability to time travel back to a version older than the specified retention period, simply because the data files no longer exist.

**Note: Vacuum = no time travel**