# Using Delta Lake in Synapse Spark
Synapse is compatible with Linux Foundation Delta Lake. Delta Lake is an open-source storage layer that brings ACID (atomicity, consistency, isolation, and durability) transactions to Apache Spark and big data workloads.

This notebook shows how to write, overwrite, merge, update, and delete data in Delta Lake tables in Synapse, and how to view a table's operation history.

## Prerequisites
In this notebook, you will save a Delta table to the workspace's primary storage account. You need the **Storage Blob Data Contributor** role on the ADLS Gen2 account (or folder) you will access.


## Load sample data

First, load the [public holidays](https://azure.microsoft.com/en-us/services/open-datasets/catalog/public-holidays/) for the last 6 months from Azure Open Datasets as a sample.


In [3]:
from azureml.opendatasets import PublicHolidays

from datetime import datetime
from dateutil import parser
from dateutil.relativedelta import relativedelta


end_date = datetime.today()
start_date = datetime.today() - relativedelta(months=6)
hol = PublicHolidays(start_date=start_date, end_date=end_date)
hol_df = hol.to_spark_dataframe()

In [26]:
# Display 10 rows
hol_df.show(10, truncate = False)

+---------------+-----------------------------------------------------------------------------+-----------------------------------------------------------------------------+-------------+-----------------+-------------------+
|countryOrRegion|holidayName                                                                  |normalizeHolidayName                                                         |isPaidTimeOff|countryRegionCode|date               |
+---------------+-----------------------------------------------------------------------------+-----------------------------------------------------------------------------+-------------+-----------------+-------------------+
|India          |Gandhi Jayanti                                                               |Gandhi Jayanti                                                               |true         |IN               |2019-10-02 00:00:00|
|Germany        |Tag der Deutschen Einheit                                                    |T

## Write data to the Delta Lake table


In [27]:
# Set the storage path info
# Primary storage info
account_name = 'Your storage account name' # fill in your primary storage account name
container_name = 'Your container name' # fill in your container name
relative_path = 'Your relative path' # fill in your relative folder path

adls_path = 'abfss://%s@%s.dfs.core.windows.net/%s' % (container_name, account_name, relative_path)
print('Primary storage account path: ' + adls_path)

# Full path of the Delta Lake table (assumes relative_path ends with '/')
delta_relative_path = adls_path + 'delta/holiday/'
print('Delta lake path: ' + delta_relative_path)

Primary storage account path: abfss://mydefault@ltianwestus2gen2.dfs.core.windows.net/samplenb/
Delta lake path: abfss://mydefault@ltianwestus2gen2.dfs.core.windows.net/samplenb/delta/holiday/
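
The path above is built by plain string concatenation, so it only comes out right when `relative_path` ends with a slash (as `samplenb/` does in the output shown). As a minimal sketch, a small helper can normalize the slashes; `build_abfss_path` and the placeholder names below are hypothetical and not part of the original notebook.

```python
def build_abfss_path(container_name, account_name, *parts):
    """Join path segments into an abfss:// URI with exactly one '/' between them."""
    # Strip stray slashes and drop empty segments, so 'samplenb' and 'samplenb/' behave the same.
    cleaned = [p.strip('/') for p in parts if p and p.strip('/')]
    root = 'abfss://%s@%s.dfs.core.windows.net/' % (container_name, account_name)
    # Each remaining segment is followed by a single trailing slash.
    return root + ''.join(segment + '/' for segment in cleaned)


# Hypothetical placeholder values, not a real storage account.
example_path = build_abfss_path('mycontainer', 'myaccount', 'samplenb', 'delta/holiday/')
print(example_path)
```

With this helper, `relative_path` values with or without a trailing slash produce the same Delta table path.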

In [28]:
# Keep only the Indian holidays
hol_df_IN = hol_df[(hol_df.countryRegionCode == "IN")]
hol_df_IN.show(5, truncate = False)

+---------------+------------------------+------------------------+-------------+-----------------+-------------------+
|countryOrRegion|holidayName             |normalizeHolidayName    |isPaidTimeOff|countryRegionCode|date               |
+---------------+------------------------+------------------------+-------------+-----------------+-------------------+
|India          |Gandhi Jayanti          |Gandhi Jayanti          |true         |IN               |2019-10-02 00:00:00|
|India          |Christmas               |Christmas               |false        |IN               |2019-12-25 00:00:00|
|India          |Makar Sankranti / Pongal|Makar Sankranti / Pongal|false        |IN               |2020-01-14 00:00:00|
|India          |Republic Day            |Republic Day            |true         |IN               |2020-01-26 00:00:00|
+---------------+------------------------+------------------------+-------------+-----------------+-------------------+

In [29]:
# Write the data to the Delta table, partitioned by holiday name
hol_df_IN.write.mode("overwrite").format("delta").partitionBy("holidayName").save(delta_relative_path)

In [12]:
delta_data = spark.read.format("delta").load(delta_relative_path)
delta_data.show()

+---------------+--------------------+--------------------+-------------+-----------------+-------------------+
|countryOrRegion|         holidayName|normalizeHolidayName|isPaidTimeOff|countryRegionCode|               date|
+---------------+--------------------+--------------------+-------------+-----------------+-------------------+
|          India|Makar Sankranti /...|Makar Sankranti /...|        false|               IN|2020-01-14 00:00:00|
|          India|      Gandhi Jayanti|      Gandhi Jayanti|         true|               IN|2019-10-02 00:00:00|
|          India|        Republic Day|        Republic Day|         true|               IN|2020-01-26 00:00:00|
|          India|           Christmas|           Christmas|        false|               IN|2019-12-25 00:00:00|
+---------------+--------------------+--------------------+-------------+-----------------+-------------------+

## Overwrite the entire delta table


In [69]:
# Overwrite the entire Delta table with Japan's holidays

hol_df_JP = hol_df[(hol_df.countryRegionCode == "JP")]
hol_df_JP.write.format("delta").mode("overwrite").save(delta_relative_path)

In [64]:
delta_data = spark.read.format("delta").load(delta_relative_path)
delta_data.show()

+---------------+--------------------------+--------------------------+-------------+-----------------+-------------------+
|countryOrRegion|               holidayName|      normalizeHolidayName|isPaidTimeOff|countryRegionCode|               date|
+---------------+--------------------------+--------------------------+-------------+-----------------+-------------------+
|          Japan|即位礼正殿の儀が行われる日|即位礼正殿の儀が行われる日|         null|               JP|2019-10-22 00:00:00|
|          Japan|              勤労感謝の日|              勤労感謝の日|         null|               JP|2019-11-23 00:00:00|
|          Japan|              建国記念の日|              建国記念の日|         null|               JP|2020-02-11 00:00:00|
|          Japan|                天皇誕生日|                天皇誕生日|         null|               JP|2020-02-23 00:00:00|
|          Japan|                  文化の日|                  文化の日|         null|               JP|2019-11-03 00:00:00|
|          Japan|                  成人の日|                  成人の日|         null

## Merge new data based on a given merge condition

In [70]:
# Upsert (merge) the United States' holiday data into the table that holds Japan's.
# The table currently contains only JP rows, so the condition below never matches
# and every US row is inserted by the whenNotMatchedInsert clause.

from delta.tables import DeltaTable

deltaTable = DeltaTable.forPath(spark, delta_relative_path)

hol_df_US = hol_df[(hol_df.countryRegionCode == "US")]

deltaTable.alias("hol_df_JP").merge(
    source = hol_df_US.alias("hol_df_US"),
    condition = "hol_df_JP.countryRegionCode = hol_df_US.countryRegionCode"
).whenMatchedUpdate(
    set = {}  # no column assignments for matched rows
).whenNotMatchedInsert(
    values = {
        "countryOrRegion": "hol_df_US.countryOrRegion",
        "holidayName": "hol_df_US.holidayName",
        "normalizeHolidayName": "hol_df_US.normalizeHolidayName",
        "isPaidTimeOff": "hol_df_US.isPaidTimeOff",
        "countryRegionCode": "hol_df_US.countryRegionCode",
        "date": "hol_df_US.date"
    }
).execute()


deltaTable.toDF().show()

+---------------+--------------------------+--------------------------+-------------+-----------------+-------------------+
|countryOrRegion|               holidayName|      normalizeHolidayName|isPaidTimeOff|countryRegionCode|               date|
+---------------+--------------------------+--------------------------+-------------+-----------------+-------------------+
|          Japan|即位礼正殿の儀が行われる日|即位礼正殿の儀が行われる日|         null|               JP|2019-10-22 00:00:00|
|  United States|      Martin Luther Kin...|      Martin Luther Kin...|         true|               US|2020-01-20 00:00:00|
|  United States|      Washington's Birt...|      Washington's Birt...|         true|               US|2020-02-17 00:00:00|
|  United States|            New Year's Day|            New Year's Day|         true|               US|2020-01-01 00:00:00|
|  United States|             Christmas Day|             Christmas Day|         true|               US|2019-12-25 00:00:00|
|  United States|              Vet
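
To see why every US row ends up inserted, here is a toy model of the merge in plain Python. It is only an illustration of the matched / not-matched split on lists of dictionaries, not how Delta Lake implements `merge`; `merge_insert_only` and the sample rows are hypothetical.

```python
def merge_insert_only(target, source, key):
    """Toy upsert: keep target rows as they are and insert source rows whose key is new."""
    target_keys = {row[key] for row in target}
    # Source rows whose key does not appear in the target are "not matched" and get inserted.
    inserted = [dict(row) for row in source if row[key] not in target_keys]
    # Target rows are copied unchanged, mirroring an update clause that sets no columns.
    return [dict(row) for row in target] + inserted


target_rows = [{"countryRegionCode": "JP", "holidayName": "文化の日"}]
source_rows = [
    {"countryRegionCode": "US", "holidayName": "Christmas Day"},
    {"countryRegionCode": "US", "holidayName": "Veterans Day"},
]

merged = merge_insert_only(target_rows, source_rows, "countryRegionCode")
print(len(merged))  # 3: the JP row plus both US rows
```

Because the condition compares only `countryRegionCode`, no US row can match a JP row. For a real upsert, a condition on a row-level key (for example country code plus date) is probably a more typical choice.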

## Update table on the rows that match the given condition


In [71]:
# Replace the null values in 'isPaidTimeOff' with false

from pyspark.sql.functions import col
deltaTable.update(
    condition = (col("isPaidTimeOff").isNull()),
    set = {"isPaidTimeOff": "false"})

deltaTable.toDF().show()

+---------------+--------------------------+--------------------------+-------------+-----------------+-------------------+
|countryOrRegion|               holidayName|      normalizeHolidayName|isPaidTimeOff|countryRegionCode|               date|
+---------------+--------------------------+--------------------------+-------------+-----------------+-------------------+
|          Japan|即位礼正殿の儀が行われる日|即位礼正殿の儀が行われる日|        false|               JP|2019-10-22 00:00:00|
|  United States|      Martin Luther Kin...|      Martin Luther Kin...|         true|               US|2020-01-20 00:00:00|
|  United States|      Washington's Birt...|      Washington's Birt...|         true|               US|2020-02-17 00:00:00|
|  United States|            New Year's Day|            New Year's Day|         true|               US|2020-01-01 00:00:00|
|  United States|             Christmas Day|             Christmas Day|         true|               US|2019-12-25 00:00:00|
|  United States|              Vet
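
In the update call, the values in the `set` dictionary are SQL expression strings, which is why `"false"` becomes a boolean rather than text. The sketch below wraps the same call in a hypothetical helper, `fill_nulls`, and passes the condition as a SQL string instead of a `Column`; it assumes `delta_table` is a `DeltaTable` such as the `deltaTable` object used in this notebook.

```python
def fill_nulls(delta_table, column, sql_expr):
    """Set `column` to the SQL expression `sql_expr` on every row where it is NULL."""
    delta_table.update(
        condition="%s IS NULL" % column,  # SQL predicate given as a string
        set={column: sql_expr},           # value is parsed as a SQL expression, not a literal
    )


# Usage in the notebook (requires a live Spark session, so it is left as a comment):
# fill_nulls(deltaTable, "isPaidTimeOff", "false")
```

Since the value is parsed as SQL, a string literal needs its own quotes, for example `"'unknown'"`.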

## Delete data from the table that matches the given condition


In [72]:
print("Row count before delete: ")
print(deltaTable.toDF().count())


# Delete data with date later than 2020-01-01
deltaTable.delete("date > '2020-01-01'")


print("Row count after delete:  ")
print(deltaTable.toDF().count())
deltaTable.toDF().show()

Row count before delete: 
18
Row count after delete:  
9
+---------------+--------------------------+--------------------------+-------------+-----------------+-------------------+
|countryOrRegion|               holidayName|      normalizeHolidayName|isPaidTimeOff|countryRegionCode|               date|
+---------------+--------------------------+--------------------------+-------------+-----------------+-------------------+
|          Japan|即位礼正殿の儀が行われる日|即位礼正殿の儀が行われる日|        false|               JP|2019-10-22 00:00:00|
|  United States|             Christmas Day|             Christmas Day|         true|               US|2019-12-25 00:00:00|
|  United States|              Veterans Day|              Veterans Day|        false|               US|2019-11-11 00:00:00|
|  United States|              Thanksgiving|              Thanksgiving|         true|               US|2019-11-28 00:00:00|
|  United States|              Columbus Day|              Columbus Day|        false|               U
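
The operation history shown at the end of this notebook records this delete with the predicate `CAST(date AS STRING) > '2020-01-01'`, which suggests the comparison ran on text rather than on timestamps. The plain-Python sketch below shows the consequence: as text, midnight on 2020-01-01 sorts after `'2020-01-01'`, so a row dated exactly 2020-01-01 (such as New Year's Day) would also satisfy the predicate. It models only the comparison, not Spark itself.

```python
from datetime import datetime

cutoff_text = "2020-01-01"
new_years_day = datetime(2020, 1, 1)

# Compared as text: "2020-01-01 00:00:00" sorts after its own prefix "2020-01-01".
later_as_text = new_years_day.strftime("%Y-%m-%d %H:%M:%S") > cutoff_text

# Compared as timestamps: midnight on 2020-01-01 is not later than itself.
later_as_timestamp = new_years_day > datetime.strptime(cutoff_text, "%Y-%m-%d")

print(later_as_text)       # True
print(later_as_timestamp)  # False
```

If rows dated exactly 2020-01-01 should be kept, one option (an assumption, not verified against this Spark version) is to compare against an explicit timestamp, for example `deltaTable.delete("date > CAST('2020-01-01' AS TIMESTAMP)")`.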

## Get the operation history of the delta table


In [73]:
fullHistoryDF = deltaTable.history()
lastOperationDF = deltaTable.history(1)

print('Full history DF: ')
fullHistoryDF.show(truncate = False)

print('lastOperationDF: ')
lastOperationDF.show(truncate = False)

Full history DF: 
+-------+-------------------+------+--------+---------+------------------------------------------------------------------------------+----+--------+---------+-----------+--------------+-------------+
|version|timestamp          |userId|userName|operation|operationParameters                                                           |job |notebook|clusterId|readVersion|isolationLevel|isBlindAppend|
+-------+-------------------+------+--------+---------+------------------------------------------------------------------------------+----+--------+---------+-----------+--------------+-------------+
|19     |2020-03-30 08:34:57|null  |null    |DELETE   |[predicate -> ["(CAST(`date` AS STRING) > '2020-01-01')"]]                    |null|null    |null     |18         |null          |false        |
|18     |2020-03-30 08:33:13|null  |null    |UPDATE   |[predicate -> isnull(isPaidTimeOff#5236)]                                     |null|null    |null     |17         |null        
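
The version numbers in the history can also be used to read the table as it was at an earlier commit, through Delta Lake's `versionAsOf` reader option. This is a minimal sketch that assumes the Delta Lake version in your Spark pool supports time travel; `read_delta_version` is a hypothetical helper, and `spark` is the session object the notebook already provides.

```python
def read_delta_version(spark, path, version):
    """Load the Delta table at `path` as it was at the given commit version."""
    # `versionAsOf` selects a snapshot by the version number shown in the history output.
    return spark.read.format("delta").option("versionAsOf", version).load(path)


# Usage in the notebook (requires a live Spark session, so it is left as a comment):
# read_delta_version(spark, delta_relative_path, 18).show()
```

Reading version 18 this way would show the table after the update but before the delete recorded as version 19.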