# Using Linux Foundation Delta Lake in Azure Synapse Analytics Spark
Azure Synapse is compatible with Linux Foundation Delta Lake. Delta Lake is an open-source storage layer that brings ACID (atomicity, consistency, isolation, and durability) transactions to Apache Spark and big data workloads.

This notebook provides examples of how to write, overwrite, merge, update, and delete data in Delta Lake tables in Synapse, and how to read a table's operation history.

## Prerequisites
In this notebook you will save your tables in Delta Lake format to your workspace's primary storage account. You must have the **Storage Blob Data Contributor** role on the ADLS Gen2 account (or folder) you will access.


## Load sample data

First you will load the [public holidays](https://azure.microsoft.com/en-us/services/open-datasets/catalog/public-holidays/) data from Azure Open Datasets, then keep only the last 6 months.


In [72]:
// Load sample data from Azure Open Datasets
val hol_blob_account_name = "azureopendatastorage"
val hol_blob_container_name = "holidaydatacontainer"
val hol_blob_relative_path = "Processed"
val hol_blob_sas_token = ""

val hol_wasbs_path = f"wasbs://$hol_blob_container_name@$hol_blob_account_name.blob.core.windows.net/$hol_blob_relative_path"
spark.conf.set(f"fs.azure.sas.$hol_blob_container_name.$hol_blob_account_name.blob.core.windows.net",hol_blob_sas_token)

val hol_df_raw = spark.read.parquet(hol_wasbs_path)

hol_blob_account_name: String = azureopendatastorage
hol_blob_container_name: String = holidaydatacontainer
hol_blob_relative_path: String = Processed
hol_blob_sas_token: String = ""
hol_wasbs_path: String = wasbs://holidaydatacontainer@azureopendatastorage.blob.core.windows.net/Processed
hol_df_raw: org.apache.spark.sql.DataFrame = [countryOrRegion: string, holidayName: string ... 4 more fields]

In [73]:
display(hol_df_raw)

In [74]:
// Keep only the rows from the latest 6 months
import java.sql.Timestamp
import org.joda.time.DateTime

val endtime = new Timestamp(DateTime.now.getMillis)
val starttime = new Timestamp(DateTime.now.minusMonths(6).getMillis)

val hol_df = hol_df_raw.filter((hol_df_raw("date") >= starttime) && (hol_df_raw("date") <= endtime)) 
hol_df.show(truncate = false)

import java.sql.Timestamp
import org.joda.time.DateTime
endtime: java.sql.Timestamp = 2020-04-26 08:50:52.293
starttime: java.sql.Timestamp = 2019-10-26 08:50:53.06
hol_df: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [countryOrRegion: string, holidayName: string ... 4 more fields]
+---------------+----------------------------------------------+----------------------------------------------+-------------+-----------------+-------------------+
|countryOrRegion|holidayName                                   |normalizeHolidayName                          |isPaidTimeOff|countryRegionCode|date               |
+---------------+----------------------------------------------+----------------------------------------------+-------------+-----------------+-------------------+
|Norway         |Søndag                                        |Søndag                                        |null         |NO               |2019-10-27 00:00:00|
|Sweden         |Söndag                          
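The filter cell above calls `DateTime.now` twice, which is why `endtime` and `starttime` in the output differ by a fraction of a second. A minimal sketch of deriving both endpoints from one captured instant follows. It uses `java.time` instead of Joda-Time, and `monthsWindow` is a hypothetical helper name that does not appear in the notebook.

```scala
import java.sql.Timestamp
import java.time.LocalDateTime

// Hypothetical helper: returns (start, end) for a window of `months`
// months that ends at `now`. Both values derive from the same instant.
def monthsWindow(now: LocalDateTime, months: Long): (Timestamp, Timestamp) = {
  val start = now.minusMonths(months)
  (Timestamp.valueOf(start), Timestamp.valueOf(now))
}

// A fixed instant keeps the example reproducible; in the notebook you
// would pass LocalDateTime.now() instead.
val fixedNow = LocalDateTime.of(2020, 4, 26, 8, 50, 52)
val (windowStart, windowEnd) = monthsWindow(fixedNow, 6)
println(s"window: $windowStart to $windowEnd")
```

The resulting `Timestamp` values can be used in the same `filter` expression as `starttime` and `endtime`.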

## Write data to the Delta Lake table


In [75]:
// Set primary storage info
val account_name = "ltianwestus2gen2" // fill in your primary storage account name
val container_name = "mydefault" // fill in your container name
val relative_path = "samplenb/" // fill in your relative folder path

// Set the storage path info
val adls_path = f"abfss://$container_name@$account_name.dfs.core.windows.net/$relative_path" 

// Full path of the Delta Lake table (despite the variable name, this is not a relative path)
val delta_relative_path = adls_path + "delta/holiday/"

account_name: String = ltianwestus2gen2
container_name: String = mydefault
relative_path: String = samplenb/
adls_path: String = abfss://mydefault@ltianwestus2gen2.dfs.core.windows.net/samplenb/
delta_relative_path: String = abfss://mydefault@ltianwestus2gen2.dfs.core.windows.net/samplenb/delta/holiday/

In [76]:
// Keep only the Indian holidays
val hol_df_IN = hol_df.filter(hol_df("countryRegionCode") === "IN")
hol_df_IN.show(truncate = false)

hol_df_IN: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [countryOrRegion: string, holidayName: string ... 4 more fields]
+---------------+------------------------+------------------------+-------------+-----------------+-------------------+
|countryOrRegion|holidayName             |normalizeHolidayName    |isPaidTimeOff|countryRegionCode|date               |
+---------------+------------------------+------------------------+-------------+-----------------+-------------------+
|India          |Christmas               |Christmas               |false        |IN               |2019-12-25 00:00:00|
|India          |Makar Sankranti / Pongal|Makar Sankranti / Pongal|false        |IN               |2020-01-14 00:00:00|
|India          |Republic Day            |Republic Day            |true         |IN               |2020-01-26 00:00:00|
+---------------+------------------------+------------------------+-------------+-----------------+-------------------+

In [77]:
// Write the data to the Delta Lake table, partitioned by holiday name.
import org.apache.spark.sql.SaveMode

hol_df_IN.write.mode(SaveMode.Overwrite).format("delta").partitionBy("holidayName").save(delta_relative_path)

import org.apache.spark.sql.SaveMode

In [78]:
val delta_data = spark.read.format("delta").load(delta_relative_path)
delta_data.show(truncate = false)

delta_data: org.apache.spark.sql.DataFrame = [countryOrRegion: string, holidayName: string ... 4 more fields]
+---------------+------------------------+------------------------+-------------+-----------------+-------------------+
|countryOrRegion|holidayName             |normalizeHolidayName    |isPaidTimeOff|countryRegionCode|date               |
+---------------+------------------------+------------------------+-------------+-----------------+-------------------+
|India          |Makar Sankranti / Pongal|Makar Sankranti / Pongal|false        |IN               |2020-01-14 00:00:00|
|India          |Republic Day            |Republic Day            |true         |IN               |2020-01-26 00:00:00|
|India          |Christmas               |Christmas               |false        |IN               |2019-12-25 00:00:00|
+---------------+------------------------+------------------------+-------------+-----------------+-------------------+

## Overwrite the entire Delta Lake table


In [79]:
// Overwrite the entire Delta table with Japan's holidays
val hol_df_JP = hol_df.filter(hol_df("countryRegionCode") === "JP")
hol_df_JP.write.format("delta").mode("overwrite").save(delta_relative_path)

hol_df_JP: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [countryOrRegion: string, holidayName: string ... 4 more fields]

In [80]:
val delta_data = spark.read.format("delta").load(delta_relative_path)
delta_data.show(truncate = false)

delta_data: org.apache.spark.sql.DataFrame = [countryOrRegion: string, holidayName: string ... 4 more fields]
+---------------+------------+--------------------+-------------+-----------------+-------------------+
|countryOrRegion|holidayName |normalizeHolidayName|isPaidTimeOff|countryRegionCode|date               |
+---------------+------------+--------------------+-------------+-----------------+-------------------+
|Japan          |勤労感謝の日|勤労感謝の日        |null         |JP               |2019-11-23 00:00:00|
|Japan          |建国記念の日|建国記念の日        |null         |JP               |2020-02-11 00:00:00|
|Japan          |天皇誕生日  |天皇誕生日          |null         |JP               |2020-02-23 00:00:00|
|Japan          |文化の日    |文化の日            |null         |JP               |2019-11-03 00:00:00|
|Japan          |成人の日    |成人の日            |null         |JP               |2020-01-13 00:00:00|
|Japan          |春分の日    |春分の日            |null         |JP               |2020-03-20 00:00:00|
|Japan      
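An overwrite in Delta Lake is recorded as a new table version, so the data written before it can usually still be read by version number. The sketch below assumes `spark` and `delta_relative_path` from the cells above, looks up the version that precedes the latest commit, and reads it with the `versionAsOf` reader option. The names `recentVersions`, `previousVersion`, and `beforeOverwrite` are illustrative only.

```scala
import io.delta.tables.DeltaTable

// Assumes `spark` and `delta_relative_path` from the cells above.
// Fetch the two most recent commit versions, oldest first.
val recentVersions = DeltaTable.forPath(spark, delta_relative_path)
  .history(2)
  .select("version")
  .collect()
  .map(_.getLong(0))
  .sorted

// With only one commit there is no earlier version to read.
if (recentVersions.length == 2) {
  val previousVersion = recentVersions.head
  val beforeOverwrite = spark.read
    .format("delta")
    .option("versionAsOf", previousVersion)
    .load(delta_relative_path)
  beforeOverwrite.show(truncate = false)
}
```

Reading an old version only works while its data files are still present in storage.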

## Merge new data based on given merge condition 

In [81]:
// Merge the United States' holiday data into the table that holds Japan's.
// Source rows with no match on countryRegionCode are inserted.
import io.delta.tables._

val deltaTable = DeltaTable.forPath(spark, delta_relative_path)
val hol_df_US = hol_df.filter(hol_df("countryRegionCode") === "US")

deltaTable.as("hol_df_JP")
  .merge(hol_df_US.as("hol_df_US"),
    "hol_df_JP.countryRegionCode = hol_df_US.countryRegionCode")
  .whenNotMatched.insertExpr(Map(
    "countryOrRegion" -> "hol_df_US.countryOrRegion",
    "holidayName" -> "hol_df_US.holidayName",
    "normalizeHolidayName" -> "hol_df_US.normalizeHolidayName",
    "isPaidTimeOff" -> "hol_df_US.isPaidTimeOff",
    "countryRegionCode" -> "hol_df_US.countryRegionCode",
    "date" -> "hol_df_US.date"))
  .execute()

deltaTable.toDF.show(truncate = false)

import io.delta.tables._
deltaTable: io.delta.tables.DeltaTable = io.delta.tables.DeltaTable@3ca38f11
hol_df_US: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [countryOrRegion: string, holidayName: string ... 4 more fields]
+---------------+--------------------------+--------------------------+-------------+-----------------+-------------------+
|countryOrRegion|holidayName               |normalizeHolidayName      |isPaidTimeOff|countryRegionCode|date               |
+---------------+--------------------------+--------------------------+-------------+-----------------+-------------------+
|United States  |Martin Luther King Jr. Day|Martin Luther King Jr. Day|null         |US               |2020-01-20 00:00:00|
|United States  |Washington's Birthday     |Washington's Birthday     |true         |US               |2020-02-17 00:00:00|
|United States  |New Year's Day            |New Year's Day            |true         |US               |2020-01-01 00:00:00|
|United States  |Chri
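The merge above only inserts, because it defines a `whenNotMatched` clause and nothing else. A fuller upsert would also update rows that already exist. The following is a sketch under stated assumptions: it reuses `spark`, `delta_relative_path`, and `hol_df` from earlier cells, assumes the source and target share the same columns so that `updateAll` and `insertAll` apply, and assumes that country code, date, and holiday name together identify a row. The names `target` and `updates` are illustrative.

```scala
import io.delta.tables.DeltaTable

// Assumes `spark`, `delta_relative_path`, and `hol_df` from the cells above.
val target = DeltaTable.forPath(spark, delta_relative_path)
val updates = hol_df.filter(hol_df("countryRegionCode") === "US")

target.as("t")
  .merge(
    updates.as("s"),
    // Assumed row key: country code + date + holiday name.
    "t.countryRegionCode = s.countryRegionCode AND " +
      "t.date = s.date AND t.holidayName = s.holidayName")
  .whenMatched.updateAll()     // refresh rows that already exist
  .whenNotMatched.insertAll()  // add rows that are new
  .execute()
```

Because matching rows are updated rather than inserted again, running this cell right after the merge above should leave the row count unchanged.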

## Update the rows that match the given condition


In [82]:
// Replace null values in the 'isPaidTimeOff' column with false
import io.delta.tables._ 
import org.apache.spark.sql.functions._
import spark.implicits._

deltaTable.update(col("isPaidTimeOff").isNull, Map("isPaidTimeOff" -> lit(false)))

deltaTable.toDF.show(truncate = false)

import io.delta.tables._
import org.apache.spark.sql.functions._
import spark.implicits._
+---------------+--------------------------+--------------------------+-------------+-----------------+-------------------+
|countryOrRegion|holidayName               |normalizeHolidayName      |isPaidTimeOff|countryRegionCode|date               |
+---------------+--------------------------+--------------------------+-------------+-----------------+-------------------+
|United States  |Martin Luther King Jr. Day|Martin Luther King Jr. Day|false        |US               |2020-01-20 00:00:00|
|United States  |Washington's Birthday     |Washington's Birthday     |true         |US               |2020-02-17 00:00:00|
|United States  |New Year's Day            |New Year's Day            |true         |US               |2020-01-01 00:00:00|
|United States  |Christmas Day             |Christmas Day             |true         |US               |2019-12-25 00:00:00|
|United States  |Veterans Day             
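The same update can also be written with SQL expression strings through `updateExpr`, which avoids building `Column` objects. This sketch assumes `deltaTable` from the merge cell above; `remainingNulls` is an illustrative name.

```scala
// Assumes `deltaTable` from the merge cell above.
// Both the condition and the new value are SQL expression strings.
deltaTable.updateExpr(
  "isPaidTimeOff IS NULL",
  Map("isPaidTimeOff" -> "false"))

// Check that no row is left with a null isPaidTimeOff.
val remainingNulls = deltaTable.toDF.filter("isPaidTimeOff IS NULL").count()
println(s"Rows with null isPaidTimeOff: $remainingNulls")
```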

## Delete the rows that match the given condition


In [83]:
println("Row count before delete: ")
println(deltaTable.toDF.count())


// Delete rows dated on or after 2020-01-01
// (in this run the predicate is compared as a string, so 2020-01-01 00:00:00 also matches)
deltaTable.delete("date > '2020-01-01'")


println("Row count after delete:  ")
println(deltaTable.toDF.count())
deltaTable.toDF.show()

Row count before delete: 
15
Row count after delete:  
6
+---------------+-------------+--------------------+-------------+-----------------+-------------------+
|countryOrRegion|  holidayName|normalizeHolidayName|isPaidTimeOff|countryRegionCode|               date|
+---------------+-------------+--------------------+-------------+-----------------+-------------------+
|  United States|Christmas Day|       Christmas Day|         true|               US|2019-12-25 00:00:00|
|  United States| Veterans Day|        Veterans Day|        false|               US|2019-11-11 00:00:00|
|  United States| Thanksgiving|        Thanksgiving|         true|               US|2019-11-28 00:00:00|
|          Japan| 勤労感謝の日|        勤労感謝の日|        false|               JP|2019-11-23 00:00:00|
|          Japan|     振替休日|            振替休日|        false|               JP|2019-11-04 00:00:00|
|          Japan|     文化の日|            文化の日|        false|               JP|2019-11-03 00:00:00|
+---------------+---------
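The operation history recorded for this run shows the delete predicate as `CAST(date AS STRING) > '2020-01-01'`, which means the comparison was done on strings. The plain Scala sketch below shows why a row at midnight on 2020-01-01 satisfies the string comparison but not a timestamp comparison. It does not touch the table, and all names in it are illustrative.

```scala
import java.sql.Timestamp

// How a midnight timestamp renders as a string in the outputs above.
val renderedDate = "2020-01-01 00:00:00"
val cutoff = "2020-01-01"

// Lexicographic comparison: the longer string with the same prefix is greater.
val matchesAsString = renderedDate > cutoff // true

// Timestamp comparison: midnight on the cutoff day is not after the cutoff.
val matchesAsTimestamp =
  Timestamp.valueOf(renderedDate).after(Timestamp.valueOf("2020-01-01 00:00:00")) // false

println(s"string comparison: $matchesAsString, timestamp comparison: $matchesAsTimestamp")
```

If strictly-later semantics are wanted, one option is to pass a typed `Column` predicate to `delete`, for example `col("date") > lit("2020-01-01").cast("timestamp")`.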

## Get the operation history of the Delta table


In [84]:
val fullHistoryDF = deltaTable.history()
val lastOperationDF = deltaTable.history(1)

println("Full history DF: ")
fullHistoryDF.show(truncate = false)

println("Last operation DF: ")
lastOperationDF.show(truncate = false)

fullHistoryDF: org.apache.spark.sql.DataFrame = [version: bigint, timestamp: timestamp ... 10 more fields]
lastOperationDF: org.apache.spark.sql.DataFrame = [version: bigint, timestamp: timestamp ... 10 more fields]
Full history DF: 
+-------+-------------------+------+--------+---------+------------------------------------------------------------------------------+----+--------+---------+-----------+--------------+-------------+
|version|timestamp          |userId|userName|operation|operationParameters                                                           |job |notebook|clusterId|readVersion|isolationLevel|isBlindAppend|
+-------+-------------------+------+--------+---------+------------------------------------------------------------------------------+----+--------+---------+-----------+--------------+-------------+
|70     |2020-04-26 08:53:34|null  |null    |DELETE   |[predicate -> ["(CAST(`date` AS STRING) > '2020-01-01')"]]                    |null|null    |null     |69      
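The full history table is wide, so it can help to select only a few columns when scanning for recent operations. A minimal sketch, assuming `deltaTable` from the cells above; `historySummary` is an illustrative name, and the columns used are ones that appear in the output above.

```scala
import org.apache.spark.sql.functions.col

// Assumes `deltaTable` from the cells above.
// Keep three columns from the history and list the newest commits first.
val historySummary = deltaTable.history()
  .select("version", "timestamp", "operation")
  .orderBy(col("version").desc)

historySummary.show(5, truncate = false)
```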