# Delta Management and Optimizations

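Databricks Delta provides several optimizations that speed up queries: small-file compaction, data skipping, and ZORDER. It also provides `VACUUM` for cleaning up files that are no longer needed.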

In [2]:
%run ./Reference/Setup

In [3]:
deltaIotPath = userhome + "/delta/iot-pipeline/"
deltaDataPath = userhome + "/delta/customer-data/"

## SMALL FILE PROBLEM

Historical and new data is often written in very small files and directories. 

This data may be spread across a data center or even across the world (that is, not co-located).

As a result, a query on this data may be very slow due to
* network latency
* the volume of file metadata

The solution is to compact many small files into fewer, larger files.
Databricks Delta does this with the `OPTIMIZE` command, which is run later in this notebook.
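
As a rough illustration of what compaction buys, the sketch below groups file sizes into bins that stay under a target size. The sizes are the `size` values from the `date=2018-06-01` listing in this notebook. `plan_compaction` and the 32 KB target are hypothetical, chosen only for the example; this is not how Delta itself plans compaction.

```python
from typing import List

def plan_compaction(sizes: List[int], target_bytes: int) -> List[List[int]]:
    """Greedily group file sizes into bins whose total stays within target_bytes.

    A file larger than target_bytes ends up alone in its own bin.
    """
    bins: List[List[int]] = []
    current: List[int] = []
    current_total = 0
    for size in sorted(sizes, reverse=True):
        # Close the current bin if adding this file would exceed the target.
        if current and current_total + size > target_bytes:
            bins.append(current)
            current, current_total = [], 0
        current.append(size)
        current_total += size
    if current:
        bins.append(current)
    return bins

# Sizes in bytes, copied from the partition listing in this notebook.
sizes = [985, 960, 14261, 953, 985, 960, 966, 979, 960, 960]
bins = plan_compaction(sizes, target_bytes=32 * 1024)  # hypothetical target
print(len(sizes), "files,", sum(sizes), "bytes ->", len(bins), "output file(s)")
```

With these numbers, all ten files fit in a single output file, so a reader would open one file instead of ten.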

In [5]:
display(dbutils.fs.ls(deltaIotPath + "/date=2018-06-01/"))

path,name,size
dbfs:/user/nagaraj.sengodan@hotmail.com/delta/iot-pipeline/date=2018-06-01/part-00000-28841414-d103-46f3-9f79-dce8405fff7c.c000.snappy.parquet,part-00000-28841414-d103-46f3-9f79-dce8405fff7c.c000.snappy.parquet,985
dbfs:/user/nagaraj.sengodan@hotmail.com/delta/iot-pipeline/date=2018-06-01/part-00001-31266dd3-424b-456f-a5d1-f25918634b44.c000.snappy.parquet,part-00001-31266dd3-424b-456f-a5d1-f25918634b44.c000.snappy.parquet,960
dbfs:/user/nagaraj.sengodan@hotmail.com/delta/iot-pipeline/date=2018-06-01/part-00001-c39cffad-597d-4f63-a0b9-5265e53bf57c.c000.snappy.parquet,part-00001-c39cffad-597d-4f63-a0b9-5265e53bf57c.c000.snappy.parquet,14261
dbfs:/user/nagaraj.sengodan@hotmail.com/delta/iot-pipeline/date=2018-06-01/part-00002-001574d8-afc7-4d07-80cc-5d281c01218a.c000.snappy.parquet,part-00002-001574d8-afc7-4d07-80cc-5d281c01218a.c000.snappy.parquet,953
dbfs:/user/nagaraj.sengodan@hotmail.com/delta/iot-pipeline/date=2018-06-01/part-00003-44cdba01-9686-4957-9a70-ba42e2740e27.c000.snappy.parquet,part-00003-44cdba01-9686-4957-9a70-ba42e2740e27.c000.snappy.parquet,985
dbfs:/user/nagaraj.sengodan@hotmail.com/delta/iot-pipeline/date=2018-06-01/part-00004-8f3a12fa-fb71-43d7-a824-8da0bf40337b.c000.snappy.parquet,part-00004-8f3a12fa-fb71-43d7-a824-8da0bf40337b.c000.snappy.parquet,960
dbfs:/user/nagaraj.sengodan@hotmail.com/delta/iot-pipeline/date=2018-06-01/part-00005-bbe2a22e-8c5f-4d53-8a19-c0d39397d13d.c000.snappy.parquet,part-00005-bbe2a22e-8c5f-4d53-8a19-c0d39397d13d.c000.snappy.parquet,966
dbfs:/user/nagaraj.sengodan@hotmail.com/delta/iot-pipeline/date=2018-06-01/part-00006-3afb2563-8147-4585-a293-2bd7854027af.c000.snappy.parquet,part-00006-3afb2563-8147-4585-a293-2bd7854027af.c000.snappy.parquet,979
dbfs:/user/nagaraj.sengodan@hotmail.com/delta/iot-pipeline/date=2018-06-01/part-00007-60c3ff57-2fe0-40cc-a1ec-aa44e553413f.c000.snappy.parquet,part-00007-60c3ff57-2fe0-40cc-a1ec-aa44e553413f.c000.snappy.parquet,960
dbfs:/user/nagaraj.sengodan@hotmail.com/delta/iot-pipeline/date=2018-06-01/part-00008-3af9b6c0-7ce7-4916-9824-637f8d204c0e.c000.snappy.parquet,part-00008-3af9b6c0-7ce7-4916-9824-637f8d204c0e.c000.snappy.parquet,960


In [6]:
%sql
SELECT * FROM demo_iot_data_delta WHERE deviceId = 379

action,time,date,deviceId
Open,1529091520,2018-06-01,379
Open,1529091520,2018-06-01,379
Open,1529091520,2018-06-01,379
Open,1529091520,2018-06-01,379
Open,1529091520,2018-06-01,379
Open,1529091520,2018-06-01,379
Open,1529091520,2018-06-01,379
Open,1529091520,2018-06-01,379
Open,1529091520,2018-06-01,379
Open,1529091520,2018-06-01,379


### Data Skipping and ZORDER

Databricks Delta uses two mechanisms to speed up queries.

<b>Data Skipping</b> is a performance optimization that aims at speeding up queries that contain filters (WHERE clauses). 

For example, we have a data set that is partitioned by `date`. 

A query using `WHERE date > '2018-06-01'` would not access data that resides in partitions for `2018-06-01` or any earlier date.
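
To make the idea concrete, here is a minimal sketch of skipping files based on per-file minimum and maximum values of `date`. `FileStats`, `files_to_scan`, and the file paths are hypothetical names for illustration. It models the principle only and makes no claim about how Delta stores or evaluates its statistics.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class FileStats:
    path: str
    min_date: str  # ISO-8601 dates compare correctly as plain strings
    max_date: str

def files_to_scan(files: List[FileStats], after: str) -> List[str]:
    """Return paths of files that could contain rows with date > after."""
    # A file whose largest date is not greater than `after` cannot match.
    return [f.path for f in files if f.max_date > after]

files = [
    FileStats("date=2018-05-31/part-0", "2018-05-31", "2018-05-31"),
    FileStats("date=2018-06-01/part-0", "2018-06-01", "2018-06-01"),
    FileStats("date=2018-06-02/part-0", "2018-06-02", "2018-06-02"),
]
print(files_to_scan(files, after="2018-06-01"))
```

Only the `date=2018-06-02` file is kept; the other two are skipped without being opened.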

<b>ZOrdering</b> is a technique to colocate related information in the same set of files. 

ZOrdering maps multidimensional data to one dimension while preserving locality of the data points. 

Given a column that you want to perform ZORDER on, say `OrderColumn`, Delta
* takes existing parquet files within a partition
* maps the rows within the parquet files according to `OrderColumn` using the algorithm described <a href="https://en.wikipedia.org/wiki/Z-order_curve" target="_blank">here</a> (with only one column, this mapping becomes a linear sort)
* rewrites the sorted data into new parquet files
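
The mapping from several dimensions to one can be sketched with the bit interleaving used by the Z-order curve. The example below assumes two non-negative integer columns, and `z_value` is a hypothetical helper. It illustrates the curve from the linked article, not Delta's internal implementation.

```python
def z_value(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y into a single integer."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)      # bits of x go to even positions
        z |= ((y >> i) & 1) << (2 * i + 1)  # bits of y go to odd positions
    return z

# Sort a 4x4 grid of (x, y) points along the curve.
points = [(x, y) for y in range(4) for x in range(4)]
ordered = sorted(points, key=lambda p: z_value(*p))
print(ordered[:8])
```

Points that are close in both `x` and `y` tend to land next to each other in the sorted order, which is what lets related rows end up in the same files.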

In [8]:
%sql
OPTIMIZE demo_iot_data_delta
ZORDER BY (deviceId)

path
""


In [9]:
%sql
SELECT * FROM demo_iot_data_delta WHERE deviceId = 379

action,time,date,deviceId
Open,1529091520,2018-06-01,379
Open,1529091520,2018-06-01,379
Open,1529091520,2018-06-01,379
Open,1529091520,2018-06-01,379
Open,1529091520,2018-06-01,379
Open,1529091520,2018-06-01,379
Open,1529091520,2018-06-01,379
Open,1529091520,2018-06-01,379
Open,1529091520,2018-06-01,379
Open,1529091520,2018-06-01,379


## VACUUM

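To save on storage costs, you should occasionally clean up invalid files using the `VACUUM` command.

Invalid files include the small files that the `OPTIMIZE` command has already compacted into a larger file: the originals remain in storage even though the current version of the table no longer references them.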

The syntax of the `VACUUM` command is:
>`VACUUM name-of-table RETAIN number-of HOURS;`

The `number-of` parameter is the <b>retention interval</b>, specified in hours.

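Databricks recommends against setting a retention interval shorter than seven days, because old snapshots and uncommitted files can still be in use by concurrent readers or writers to the table. Note that the cells below use `RETAIN 0 HOURS`, which goes against this recommendation and is only safe when nothing else is reading or writing the table.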

The scenario to avoid is:
1. User A starts a query that reads the uncompacted files.
2. User B invokes a `VACUUM` command, which deletes the uncompacted files.
3. User A's query fails because the underlying files have disappeared.
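
The retention interval can be pictured as a cutoff time: only files that became invalid before the cutoff are eligible for deletion. The sketch below models that rule in plain Python. `vacuum_candidates`, the file names, and the timestamps are hypothetical, and the real command's bookkeeping is not shown in this notebook.

```python
from datetime import datetime, timedelta
from typing import Dict, List

def vacuum_candidates(
    invalidated_at: Dict[str, datetime], now: datetime, retain_hours: float
) -> List[str]:
    """Return files that became invalid earlier than the retention cutoff."""
    cutoff = now - timedelta(hours=retain_hours)
    return sorted(path for path, ts in invalidated_at.items() if ts < cutoff)

now = datetime(2018, 6, 15, 12, 0)
invalidated_at = {
    "part-old.parquet": datetime(2018, 6, 1, 12, 0),      # invalid for 14 days
    "part-recent.parquet": datetime(2018, 6, 15, 11, 0),  # invalid for 1 hour
}

print(vacuum_candidates(invalidated_at, now, retain_hours=7 * 24))
print(vacuum_candidates(invalidated_at, now, retain_hours=0))
```

With a seven-day interval, only `part-old.parquet` is eligible, so a query that started an hour ago can still read `part-recent.parquet`. With an interval of zero hours both files are eligible, which is exactly the situation that breaks User A's query.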

Invalid files can also result from updates/upserts/deletions.

More details are provided here: <a href="https://docs.databricks.com/delta/optimizations.html#garbage-collection" target="_blank"> Garbage Collection</a>.

In [11]:
len(dbutils.fs.ls(deltaIotPath + "/date=2018-06-01"))

In [12]:
%sql

VACUUM demo_iot_data_delta RETAIN 0 HOURS;

path
dbfs:/user/nagaraj.sengodan@hotmail.com/delta/iot-pipeline


In [13]:
len(dbutils.fs.ls(deltaIotPath + "/date=2018-06-01"))

Count the number of files in the `Country=Sweden` partition before running `VACUUM`.

In [15]:
# Number of files in the Country=Sweden partition before VACUUM
preNumFiles = len(dbutils.fs.ls(deltaDataPath + "/Country=Sweden"))
print(preNumFiles)

In [16]:
%sql
VACUUM customer_data_delta RETAIN 0 HOURS;

path
dbfs:/user/nagaraj.sengodan@hotmail.com/delta/customer-data
