# Hitchhiker's Guide to Delta Lake (Scala)

This tutorial is adapted, for clarity, from the original [Delta Lake quick start](https://docs.delta.io/latest/quick-start.html). This notebook helps you quickly explore the main features of [Delta Lake](https://github.com/delta-io/delta). It provides code snippets that show how to read from and write to Delta Lake tables from interactive, batch, and streaming queries.

Here's what we will cover:
* Create a table
* Understanding meta-data
* Read data
* Update (overwrite) table data
* Save as catalog tables
* Conditional update without overwrite
* Inspect table history
* Read older versions of data using Time Travel
* Write a stream of data to a table
* Read a stream of changes from a table
* Compaction
* Convert Parquet to Delta
* SQL support

## Configuration
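Modify the table path below as appropriate for your environment. A random session ID is appended to the path so that each run writes to a fresh location.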

In [27]:
val sessionId = scala.util.Random.nextInt(1000000)
val deltaTablePath = s"/delta/delta-table-$sessionId"

sessionId: Int = 259512
deltaTablePath: String = /delta/delta-table-259512

## Create a table
To create a Delta Lake table, write a DataFrame out in the **delta** format. You can use existing Spark SQL code and change the format from parquet, csv, json, and so on, to delta.

These operations create a new Delta Lake table using the schema that was inferred from your DataFrame. For the full set of options available when you create a new Delta Lake table, see Create a table and Write to a table (subsequent cells in this notebook).

In [28]:
val data = spark.range(0, 5)
data.show
data.write.format("delta").save(deltaTablePath)

data: org.apache.spark.sql.Dataset[Long] = [id: bigint]
+---+
| id|
+---+
|  0|
|  1|
|  2|
|  3|
|  4|
+---+
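
To illustrate the point about swapping formats, here is a minimal sketch that writes the same DataFrame twice, once as Parquet and once as Delta. It assumes the `spark` session and `sessionId` from this notebook; the scratch location `formatDemoPath` is a hypothetical path introduced only for this illustration.

```scala
// Hypothetical scratch location, used only for this illustration.
val formatDemoPath = s"/tmp/format-demo-$sessionId"
val sample = spark.range(0, 3)

// Existing code that writes Parquet ...
sample.write.format("parquet").mode("overwrite").save(s"$formatDemoPath/parquet")

// ... becomes a Delta write by changing only the format string.
sample.write.format("delta").mode("overwrite").save(s"$formatDemoPath/delta")

// Reading follows the same pattern.
val fromDelta = spark.read.format("delta").load(s"$formatDemoPath/delta")
fromDelta.show()
```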

## Understanding Meta-data

In Delta Lake, meta-data is stored right next to the data, in the `_delta_log` directory under the table path. A useful side effect is that you can peek into the meta-data using regular Spark APIs.

In [29]:
spark.read.text(s"$deltaTablePath/_delta_log/").collect.foreach(println)

[{"commitInfo":{"timestamp":1604881272963,"operation":"WRITE","operationParameters":{"mode":"ErrorIfExists","partitionBy":"[]"},"isBlindAppend":true,"operationMetrics":{"numFiles":"6","numOutputBytes":"2407","numOutputRows":"5"}}}]
[{"protocol":{"minReaderVersion":1,"minWriterVersion":2}}]
[{"metaData":{"id":"761ad554-f686-4858-b929-959c5d310e1e","format":{"provider":"parquet","options":{}},"schemaString":"{\"type\":\"struct\",\"fields\":[{\"name\":\"id\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}}]}","partitionColumns":[],"configuration":{},"createdTime":1604881271656}}]
[{"add":{"path":"part-00000-16cb00b0-82f2-4060-8477-6adf6f4fade8-c000.snappy.parquet","partitionValues":{},"size":262,"modificationTime":1604881272000,"dataChange":true}}]
[{"add":{"path":"part-00003-30b7f9c6-53d8-4e13-aa87-68eef0b118f0-c000.snappy.parquet","partitionValues":{},"size":429,"modificationTime":1604881272000,"dataChange":true}}]
[{"add":{"path":"part-00006-a0f3564c-5cf7-4244-8458-16d53fcfefb9-c00
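
Because each log line is a JSON object, the same files can also be read as structured data. The following is a sketch, assuming the commit files match `*.json` inside `_delta_log` and that at least one `add` action has been written (true after the write above); `logEntries` and `addedFiles` are names introduced here.

```scala
import org.apache.spark.sql.functions.col

// Each line of a commit file holds one action
// (commitInfo, protocol, metaData, add, ...), so Spark can parse it as JSON.
val logEntries = spark.read.json(s"$deltaTablePath/_delta_log/*.json")

// Keep only the "add" actions and list the data files they reference.
val addedFiles = logEntries.where(col("add").isNotNull).select("add.path", "add.size")
addedFiles.show(truncate = false)
```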

## Read data

You read data from your Delta Lake table by specifying the path to its files, which in this notebook is the `deltaTablePath` defined in the Configuration section.

In [30]:
val df = spark.read.format("delta").load(deltaTablePath)
df.show()

df: org.apache.spark.sql.DataFrame = [id: bigint]
+---+
| id|
+---+
|  2|
|  1|
|  3|
|  0|
|  4|
+---+

## Update table data

Delta Lake supports several operations to modify tables using standard DataFrame APIs. This example runs a batch job to overwrite the data in the table.


In [31]:
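val data = spark.range(5, 10)
// Overwrite the existing table contents with the new values.
data.write.format("delta").mode("overwrite").save(deltaTablePath)
// Re-evaluate the DataFrame loaded earlier; as the output below shows, it now returns the new data.
df.show()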

data: org.apache.spark.sql.Dataset[Long] = [id: bigint]
+---+
| id|
+---+
|  6|
|  9|
|  8|
|  5|
|  7|
+---+

If you inspect the meta-data now, you will notice that the original data appears to have been overwritten. The original files are not actually deleted: Delta adds entries to its transaction log that mark them as removed, which gives readers the "illusion" that the original data is gone. We can verify this by re-inspecting the meta-data, where you will see several entries that remove the references to the original data.

In [32]:
spark.read.text(s"$deltaTablePath/_delta_log/").collect.foreach(println)

[{"commitInfo":{"timestamp":1604881286043,"operation":"WRITE","operationParameters":{"mode":"Overwrite","partitionBy":"[]"},"readVersion":0,"isBlindAppend":false,"operationMetrics":{"numFiles":"6","numOutputBytes":"2407","numOutputRows":"5"}}}]
[{"add":{"path":"part-00000-9e64203e-551d-40ea-bd7f-51fc4c00cada-c000.snappy.parquet","partitionValues":{},"size":262,"modificationTime":1604881285000,"dataChange":true}}]
[{"add":{"path":"part-00003-3e4bb459-dcdc-4d7a-add5-df8fa134fea4-c000.snappy.parquet","partitionValues":{},"size":429,"modificationTime":1604881284000,"dataChange":true}}]
[{"add":{"path":"part-00006-7da1993b-7ef0-494b-87f9-534b83af67c2-c000.snappy.parquet","partitionValues":{},"size":429,"modificationTime":1604881284000,"dataChange":true}}]
[{"add":{"path":"part-00009-d9b518fa-c7b6-4db5-885b-c417a66c6848-c000.snappy.parquet","partitionValues":{},"size":429,"modificationTime":1604881284000,"dataChange":true}}]
[{"add":{"path":"part-00012-2a93dd5f-4304-461e-99fd-5421633a893c-c0
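
The bookkeeping behind this can be illustrated with a toy model in plain Scala. This is only a sketch of the idea (replaying add and remove actions to find the live files), not Delta's implementation; the types `AddFile` and `RemoveFile`, the function `liveFiles`, and the file names are all hypothetical.

```scala
// Toy model of a transaction log: each commit is a list of file actions.
sealed trait Action
final case class AddFile(path: String) extends Action
final case class RemoveFile(path: String) extends Action

// Replay commits 0..version (inclusive) and return the files that are still live.
def liveFiles(commits: Seq[Seq[Action]], version: Int): Set[String] =
  commits.take(version + 1).flatten.foldLeft(Set.empty[String]) {
    case (files, AddFile(p))    => files + p
    case (files, RemoveFile(p)) => files - p
  }

// Hypothetical file names, for illustration only.
val commits: Seq[Seq[Action]] = Seq(
  // Version 0: the initial write adds two files.
  Seq(AddFile("a.parquet"), AddFile("b.parquet")),
  // Version 1: the overwrite removes the old files and adds new ones.
  Seq(RemoveFile("a.parquet"), RemoveFile("b.parquet"),
      AddFile("c.parquet"), AddFile("d.parquet"))
)

println(liveFiles(commits, 1)) // files visible after the overwrite
println(liveFiles(commits, 0)) // files visible at version 0
```

In this model the old files are never dropped from the log itself; they only stop being part of the latest snapshot, which is why an earlier version can still be reconstructed.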

## Save as catalog tables

Delta Lake can write to managed or external catalog tables.

In [33]:
// Write data to a new managed catalog table.
data.write.format("delta").saveAsTable("ManagedDeltaTable")

In [34]:
// Define an external catalog table that points to the existing Delta Lake data in storage.
spark.sql(s"CREATE TABLE ExternalDeltaTable USING DELTA LOCATION '$deltaTablePath'")

res68: org.apache.spark.sql.DataFrame = []

In [35]:
// List the 2 new tables.
spark.sql("SHOW TABLES").show

// Explore their properties.
spark.sql("DESCRIBE EXTENDED ManagedDeltaTable").show(truncate=false)
spark.sql("DESCRIBE EXTENDED ExternalDeltaTable").show(truncate=false)

+--------+------------------+-----------+
|database|         tableName|isTemporary|
+--------+------------------+-----------+
| default|externaldeltatable|      false|
| default| manageddeltatable|      false|
+--------+------------------+-----------+

+----------------------------+-------------------------------------------------------------------------------------------------------+-------+
|col_name                    |data_type                                                                                              |comment|
+----------------------------+-------------------------------------------------------------------------------------------------------+-------+
|id                          |bigint                                                                                                 |null   |
|                            |                                                                                                       |       |
|# Detailed Table Information|  
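
Once registered, both tables can be queried by name instead of by path. A minimal sketch, assuming the two tables created above exist in the current database; `managed` and `external` are names introduced here.

```scala
// Query the managed table through the catalog API ...
val managed = spark.table("ManagedDeltaTable")

// ... and the external table through SQL. Both resolve to Delta data.
val external = spark.sql("SELECT * FROM ExternalDeltaTable")

managed.show()
external.show()
```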

## Conditional update without overwrite

Delta Lake provides programmatic APIs to conditionally update, delete, and merge (upsert) data into tables. For more information on these operations, see [Table Deletes, Updates, and Merges](https://docs.delta.io/latest/delta-update.html).

In [36]:
import io.delta.tables._
import org.apache.spark.sql.functions._

val deltaTable = DeltaTable.forPath(deltaTablePath)

import io.delta.tables._
import org.apache.spark.sql.functions._
deltaTable: io.delta.tables.DeltaTable = io.delta.tables.DeltaTable@236a0d01

In [37]:
// Update every even value by adding 100 to it
deltaTable.update(
  condition = expr("id % 2 == 0"),
  set = Map("id" -> expr("id + 100")))
deltaTable.toDF.show

+---+
| id|
+---+
|  9|
|106|
|108|
|  5|
|  7|
+---+

In [38]:
// Delete every even value
deltaTable.delete(condition = expr("id % 2 == 0"))
deltaTable.toDF.show

+---+
| id|
+---+
|  9|
|  5|
|  7|
+---+

In [39]:
// Upsert (merge) new data: ids already in the table are set to -1, all other ids are inserted
val newData = spark.range(0, 20).toDF

deltaTable.as("oldData").
  merge(
    newData.as("newData"),
    "oldData.id = newData.id").
  whenMatched.
  update(Map("id" -> lit(-1))).
  whenNotMatched.
  insert(Map("id" -> col("newData.id"))).
  execute()

deltaTable.toDF.show()

newData: org.apache.spark.sql.DataFrame = [id: bigint]
+---+
| id|
+---+
|  1|
| 13|
| -1|
|  4|
|  8|
| -1|
|  2|
|  6|
|  3|
|  0|
| 14|
| 10|
| 17|
| 12|
| 19|
| -1|
| 16|
| 11|
| 18|
| 15|
+---+
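
The merge result above can be reasoned about with plain Scala collections. The following is only a model of the matched and not-matched rules used in this cell, not the Delta API; it assumes the table held the ids 5, 7, and 9 before the merge, as the output of the delete cell shows.

```scala
// Table contents before the merge (from the previous cell's output).
val oldIds: Seq[Long] = Seq(5L, 7L, 9L)
// Source data, equivalent to spark.range(0, 20).
val newIds: Seq[Long] = (0 until 20).map(_.toLong)

// whenMatched: target rows with a matching source id are set to -1.
// Target rows without a match would be kept unchanged (none exist here).
val updated = oldIds.map(id => if (newIds.contains(id)) -1L else id)

// whenNotMatched: source ids missing from the target are inserted.
val inserted = newIds.filterNot(id => oldIds.contains(id))

val merged = updated ++ inserted
println(s"rows = ${merged.size}, rows set to -1 = ${merged.count(_ == -1L)}")
```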

## History
One of Delta's most powerful features is the ability to look into a table's history, i.e., the changes that were made to the underlying Delta table. The cell below shows how simple it is to inspect the history.

In [40]:
deltaTable.history.show(false)

+-------+-------------------+------+--------+---------+-------------------------------------------------------------------+----+--------+---------+-----------+--------------+-------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|version|timestamp          |userId|userName|operation|operationParameters                                                |job |notebook|clusterId|readVersion|isolationLevel|isBlindAppend|operationMetrics                                                                                                                                                                                              |
+-------+-------------------+------+--------+---------+-------------------------------------------------------------------+----+--------+---------+-----------+--------------+-------------+--------------------
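
The full history output is wide, so it can help to limit the number of commits and keep only a few columns. A small sketch, assuming the `deltaTable` handle defined earlier; `recentHistory` is a name introduced here.

```scala
// Show only the two most recent commits and a few readable columns.
val recentHistory = deltaTable.history(2).select("version", "timestamp", "operation")
recentHistory.show(truncate = false)
```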

## Read older versions of data using Time Travel

You can query previous snapshots of your Delta Lake table by using a feature called Time Travel. If you want to access the data that you overwrote, you can query a snapshot of the table from before the overwrite by using the `versionAsOf` option.

Once you run the cell below, you should see the first set of data, from before you overwrote it. Time Travel is an extremely powerful feature that uses the Delta Lake transaction log to access data that is no longer in the current version of the table. Removing the `versionAsOf` option returns the latest version of the table, and specifying version 1 returns the data as it was right after the overwrite. For more information, see [Query an older snapshot of a table (time travel)](https://docs.delta.io/latest/delta-batch.html#deltatimetravel).

In [41]:
val df = spark.read.format("delta").option("versionAsOf", 0).load(deltaTablePath)
df.show()

df: org.apache.spark.sql.DataFrame = [id: bigint]
+---+
| id|
+---+
|  2|
|  1|
|  3|
|  0|
|  4|
+---+
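
Time Travel also makes it easy to compare two versions. The sketch below assumes versions 0 and 1 are the initial write and the overwrite from earlier in this notebook; `v0`, `v1`, and `onlyInV1` are names introduced here.

```scala
// Load two snapshots of the same table by version number.
val v0 = spark.read.format("delta").option("versionAsOf", 0).load(deltaTablePath)
val v1 = spark.read.format("delta").option("versionAsOf", 1).load(deltaTablePath)

// Rows present after the overwrite (version 1) but not in the original write (version 0).
val onlyInV1 = v1.except(v0)
onlyInV1.show()
```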

## Write a stream of data to a table

You can also write to a Delta Lake table using Spark's Structured Streaming. The Delta Lake transaction log guarantees exactly-once processing, even when there are other streams or batch queries running concurrently against the table. By default, streams run in append mode, which adds new records to the table.

For more information about Delta Lake integration with Structured Streaming, see [Table Streaming Reads and Writes](https://docs.delta.io/latest/delta-streaming.html).

In the cells below, here's what we are doing:

1. *Cell 42* Set up a simple Spark Structured Streaming job that generates a sequence and writes it into our Delta table
2. *Cell 43* Show the newly appended data
3. *Cell 44* Inspect history
4. *Cell 45* Stop the Structured Streaming job
5. *Cell 46* Inspect history <-- You'll notice appends have stopped

In [42]:
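// The built-in "rate" source generates rows with a timestamp and an increasing value.
val streamingDf = spark.readStream.format("rate").load()
// Rename value to id and append the rows to the Delta table.
val stream = streamingDf.
  select($"value" as "id").
  writeStream.
  format("delta").
  option("checkpointLocation", s"/tmp/checkpoint-$sessionId").
  start(deltaTablePath)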

streamingDf: org.apache.spark.sql.DataFrame = [timestamp: timestamp, value: bigint]
stream: org.apache.spark.sql.streaming.StreamingQuery = org.apache.spark.sql.execution.streaming.StreamingQueryWrapper@6f846098

## Read a stream of changes from a table

While the stream is writing to the Delta Lake table, you can also read from that table as a streaming source. For example, you can start another streaming query that prints all the changes made to the Delta Lake table.
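
A second streaming query of that kind might look like the sketch below. It assumes the writing stream from the previous cell is still running; `changes` and `changeStream` are names introduced here, and the 10-second wait is an arbitrary choice. Where the console sink's output appears (cell output or driver logs) depends on the notebook environment.

```scala
// Treat the Delta table as a streaming source.
val changes = spark.readStream.format("delta").load(deltaTablePath)

// Print each micro-batch of new rows using the console sink.
val changeStream = changes.writeStream.format("console").start()

// Let the query run for up to 10 seconds, then stop it.
changeStream.awaitTermination(10000)
changeStream.stop()
```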

In [43]:
deltaTable.toDF.sort($"id".desc).show

+---+
| id|
+---+
| 19|
| 18|
| 17|
| 16|
| 15|
| 14|
| 13|
| 12|
| 11|
| 10|
|  8|
|  6|
|  4|
|  3|
|  2|
|  1|
|  0|
| -1|
| -1|
| -1|
+---+

In [44]:
deltaTable.history.show

+-------+-------------------+------+--------+----------------+--------------------+----+--------+---------+-----------+--------------+-------------+--------------------+
|version|          timestamp|userId|userName|       operation| operationParameters| job|notebook|clusterId|readVersion|isolationLevel|isBlindAppend|    operationMetrics|
+-------+-------------------+------+--------+----------------+--------------------+----+--------+---------+-----------+--------------+-------------+--------------------+
|      5|2020-11-09 00:22:43|  null|    null|STREAMING UPDATE|[outputMode -> Ap...|null|    null|     null|          4|          null|         true|[numRemovedFiles ...|
|      4|2020-11-09 00:22:13|  null|    null|           MERGE|[predicate -> (ol...|null|    null|     null|          3|          null|        false|[numTargetRowsCop...|
|      3|2020-11-09 00:22:03|  null|    null|          DELETE|[predicate -> ["(...|null|    null|     null|          2|          null|        false|[n

In [45]:
stream.stop

In [46]:
deltaTable.history.show

+-------+-------------------+------+--------+----------------+--------------------+----+--------+---------+-----------+--------------+-------------+--------------------+
|version|          timestamp|userId|userName|       operation| operationParameters| job|notebook|clusterId|readVersion|isolationLevel|isBlindAppend|    operationMetrics|
+-------+-------------------+------+--------+----------------+--------------------+----+--------+---------+-----------+--------------+-------------+--------------------+
|      5|2020-11-09 00:22:43|  null|    null|STREAMING UPDATE|[outputMode -> Ap...|null|    null|     null|          4|          null|         true|[numRemovedFiles ...|
|      4|2020-11-09 00:22:13|  null|    null|           MERGE|[predicate -> (ol...|null|    null|     null|          3|          null|        false|[numTargetRowsCop...|
|      3|2020-11-09 00:22:03|  null|    null|          DELETE|[predicate -> ["(...|null|    null|     null|          2|          null|        false|[n

## Compaction

If a Delta table has accumulated too many small files, you can compact it by repartitioning it into a smaller number of files.

The option `dataChange = false` is an optimization that tells Delta Lake to do the repartition without marking the underlying data as "modified". This ensures that any other concurrent operations (such as streaming reads/writes) aren't negatively impacted.


In [47]:
val partitionCount = 2

spark.
    read.
    format("delta").
    load(deltaTablePath).
    repartition(partitionCount).
    write.
    option("dataChange", "false").
    format("delta").
    mode("overwrite").
    save(deltaTablePath)    

partitionCount: Int = 2
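
To check the effect of the compaction, you can count the data files behind the current snapshot. This is a sketch using `Dataset.inputFiles`; `fileCount` is a name introduced here, and the exact number depends on how Spark distributes the rows, so it is not guaranteed to equal `partitionCount`.

```scala
// inputFiles lists the data files that back the current table snapshot.
val fileCount = spark.read.format("delta").load(deltaTablePath).inputFiles.length
println(s"The current snapshot is backed by $fileCount file(s)")
```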

## Convert Parquet to Delta
You can do an in-place conversion from the Parquet format to Delta.


In [48]:
val parquetPath = s"/parquet/parquet-table-$sessionId"

val data = spark.range(0,5)
data.write.parquet(parquetPath)

// Confirm that the data isn't in the Delta format
DeltaTable.isDeltaTable(parquetPath)

parquetPath: String = /parquet/parquet-table-259512
data: org.apache.spark.sql.Dataset[Long] = [id: bigint]
res99: Boolean = false

In [49]:
DeltaTable.convertToDelta(spark, s"parquet.`$parquetPath`")

// Confirm that the converted data is now in the Delta format
DeltaTable.isDeltaTable(parquetPath)

res100: io.delta.tables.DeltaTable = io.delta.tables.DeltaTable@5900f632
res103: Boolean = true

## SQL Support
Delta supports table utility commands through SQL. You can use SQL to:
* Get a Delta table's history
* Vacuum a Delta table
* Convert a Parquet table to Delta


In [50]:
spark.sql(s"DESCRIBE HISTORY delta.`$deltaTablePath`").show()

+-------+-------------------+------+--------+----------------+--------------------+----+--------+---------+-----------+--------------+-------------+--------------------+
|version|          timestamp|userId|userName|       operation| operationParameters| job|notebook|clusterId|readVersion|isolationLevel|isBlindAppend|    operationMetrics|
+-------+-------------------+------+--------+----------------+--------------------+----+--------+---------+-----------+--------------+-------------+--------------------+
|      6|2020-11-09 00:22:59|  null|    null|           WRITE|[mode -> Overwrit...|null|    null|     null|          5|          null|        false|[numFiles -> 2, n...|
|      5|2020-11-09 00:22:43|  null|    null|STREAMING UPDATE|[outputMode -> Ap...|null|    null|     null|          4|          null|         true|[numRemovedFiles ...|
|      4|2020-11-09 00:22:13|  null|    null|           MERGE|[predicate -> (ol...|null|    null|     null|          3|          null|        false|[n

In [51]:
spark.sql(s"VACUUM delta.`$deltaTablePath`").show()

Deleted 0 files and directories in a total of 1 directories.
+--------------------+
|                path|
+--------------------+
|abfss://data@...|
+--------------------+

In [52]:
val parquetId = scala.util.Random.nextInt(1000)
val parquetPath = s"/parquet/parquet-table-$sessionId-$parquetId"

val data = spark.range(0,5)
data.write.parquet(parquetPath)

// Confirm that the data isn't in the Delta format
DeltaTable.isDeltaTable(parquetPath)

// Use SQL to convert the parquet table to Delta
spark.sql(s"CONVERT TO DELTA parquet.`$parquetPath`")

DeltaTable.isDeltaTable(parquetPath)

parquetId: Int = 633
parquetPath: String = /parquet/parquet-table-259512-633
data: org.apache.spark.sql.Dataset[Long] = [id: bigint]
res110: Boolean = false
res113: org.apache.spark.sql.DataFrame = []
res115: Boolean = true