Prerequisite: the parquet file from https://pandemicdatalake.blob.core.windows.net/public/curated/covid-19/ecdc_cases/latest/ecdc_cases.parquet has already been copied to the storage account.  

# Set up the path where the parquet file is stored

In [1]:
storagePath = "abfss://churning@churninge2edemo.dfs.core.windows.net/synapse/tables/covid"

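The path above follows the Azure Data Lake Storage Gen2 layout `abfss://<container>@<account>.dfs.core.windows.net/<path>`. As a minimal sketch, the same string could be assembled from its parts so that the container and account names are not buried in one literal. The helper name `abfss_path` is hypothetical and not part of any library.

```python
def abfss_path(container: str, account: str, path: str = "") -> str:
    """Build a URI of the form abfss://<container>@<account>.dfs.core.windows.net/<path>."""
    base = f"abfss://{container}@{account}.dfs.core.windows.net"
    # Strip slashes at both ends so the result has no doubled or trailing "/"
    path = path.strip("/")
    return f"{base}/{path}" if path else base


# Produces the same value as the storagePath used in this notebook
storagePath = abfss_path("churning", "churninge2edemo", "synapse/tables/covid")
```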



In [2]:
dfcovidParquet = spark.read.load(f"{storagePath}/ecdc_cases.parquet", format="parquet")


# Convert the parquet file to delta

In [3]:
dfcovidParquet.write.format("delta").mode("overwrite").save(f"{storagePath}/delta")


In [4]:
dfCovidDelta =  spark.read.load(f"{storagePath}/delta", format="delta")


In [5]:
display(dfCovidDelta)

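The read, write and re-read steps above can be wrapped in one function that also returns the row counts on both sides, which gives a quick sanity check that the conversion kept every record. This is only a sketch: the function name `convert_parquet_to_delta` is hypothetical, and it takes the Spark session as a parameter instead of relying on the notebook's global `spark`.

```python
def convert_parquet_to_delta(spark, parquet_path: str, delta_path: str):
    """Rewrite a Parquet dataset as a Delta table and return (source_rows, delta_rows)."""
    source_df = spark.read.load(parquet_path, format="parquet")
    # mode("overwrite") replaces any table already stored at delta_path
    source_df.write.format("delta").mode("overwrite").save(delta_path)
    delta_df = spark.read.load(delta_path, format="delta")
    return source_df.count(), delta_df.count()


# In the notebook session, where `spark` and `storagePath` already exist:
# source_rows, delta_rows = convert_parquet_to_delta(
#     spark, f"{storagePath}/ecdc_cases.parquet", f"{storagePath}/delta")
# assert source_rows == delta_rows
```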

# We will modify data to simulate an update/merge scenario
The data was initially loaded on the 22nd of June (column load_date).
France had 0 Covid cases on the 15th of July, but we will change that value to simulate an update.

In [6]:
from pyspark.sql.functions import lit, to_timestamp

# Keep only the France record for 2020-07-15, then overwrite cases and load_date
dfCovidDeltaFrance = (dfCovidDelta.where("iso_country = 'FR' and date_rep = '2020-07-15'")
    .withColumn("cases", lit(12))
    .withColumn("load_date", to_timestamp(lit("2021-06-23 00:00:00.000"), "yyyy-MM-dd HH:mm:ss.SSS")))

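To make the transformation concrete, here is a plain-Python sketch (no Spark needed) of what the cell above does: keep the rows that match the filter, then overwrite `cases` and `load_date` on them. The sample rows are illustrative; only the France values mirror what the notebook describes, and the year of the original load date is assumed to be 2021.

```python
from datetime import datetime

# Illustrative rows; the DE row is made up to show that other countries are filtered out
rows = [
    {"iso_country": "FR", "date_rep": "2020-07-15", "cases": 0, "load_date": datetime(2021, 6, 22)},
    {"iso_country": "DE", "date_rep": "2020-07-15", "cases": 5, "load_date": datetime(2021, 6, 22)},
]

# %f parses the fractional seconds, like the trailing S characters in the Spark pattern
new_load_date = datetime.strptime("2021-06-23 00:00:00.000", "%Y-%m-%d %H:%M:%S.%f")

# Filter first, then build modified copies so the original rows stay untouched
update_rows = [
    {**row, "cases": 12, "load_date": new_load_date}
    for row in rows
    if row["iso_country"] == "FR" and row["date_rep"] == "2020-07-15"
]
```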



In [7]:
dfCovidDeltaFrance.write.format("delta").mode("overwrite").save(f"{storagePath}/deltaMerge")


# Data that will be merged into our Delta Lake
For France, cases has been set to 12 and load_date to the 23rd of June

In [8]:
display(dfCovidDeltaFrance)


Show the records that differ between the current Delta Lake data and the incoming update

In [9]:
dfDiff = dfCovidDelta.join(dfCovidDeltaFrance, (dfCovidDelta.date_rep ==  dfCovidDeltaFrance.date_rep) & (dfCovidDelta.iso_country == dfCovidDeltaFrance.iso_country) & (dfCovidDelta.cases !=  dfCovidDeltaFrance.cases)).select("*")


In [10]:
display(dfDiff)

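The join above pairs each incoming record with the stored record that has the same `date_rep` and `iso_country` but a different `cases` value. The same logic can be sketched in plain Python on lists of dictionaries. The function name `diff_records` and the sample rows are hypothetical, and the sketch assumes the key columns identify at most one stored row.

```python
def diff_records(current, incoming, keys=("date_rep", "iso_country"), field="cases"):
    """Return (current_row, incoming_row) pairs that share the key columns but differ on `field`."""
    # Index the stored rows by key for constant-time lookup
    index = {tuple(row[k] for k in keys): row for row in current}
    pairs = []
    for new_row in incoming:
        old_row = index.get(tuple(new_row[k] for k in keys))
        if old_row is not None and old_row[field] != new_row[field]:
            pairs.append((old_row, new_row))
    return pairs


# Illustrative data: the DE row is made up, the FR values follow the notebook
current = [
    {"date_rep": "2020-07-15", "iso_country": "FR", "cases": 0},
    {"date_rep": "2020-07-15", "iso_country": "DE", "cases": 5},
]
incoming = [{"date_rep": "2020-07-15", "iso_country": "FR", "cases": 12}]
pairs = diff_records(current, incoming)
```

An empty result would mean the incoming data changes nothing for the keys it shares with the stored table.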

In [11]:
from delta.tables import DeltaTable

deltaCovidTable = DeltaTable.forPath(spark,f"{storagePath}/delta")


# Delta table created from the former storage location
Before the merge, France has 0 cases on the 15th of July

In [12]:
display(deltaCovidTable.toDF().where("iso_country = 'FR' and date_rep = '2020-07-15'"))


# Merge operation
We merge on the columns date_rep and iso_country, and for matching rows we overwrite cases and load_date with the values from the updated record

In [13]:



(deltaCovidTable.alias("deltaLake")
    .merge(dfCovidDeltaFrance.alias("dataUpdate"), condition="deltaLake.date_rep = dataUpdate.date_rep and deltaLake.iso_country = dataUpdate.iso_country")
    .whenMatchedUpdate(set={"cases": "dataUpdate.cases", "load_date": "dataUpdate.load_date"})
    .execute())


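The effect of a merge that has only a `whenMatchedUpdate` clause can be sketched in plain Python as follows. This illustrates the semantics and is not how Delta Lake implements them. The function name and the sample rows (apart from the France values) are hypothetical. The sketch also assumes each key appears once in the source, whereas Delta Lake raises an error when several source rows match the same target row.

```python
def merge_when_matched_update(target, source, keys, set_columns):
    """Update `set_columns` on target rows whose key columns match a source row."""
    updates = {tuple(row[k] for k in keys): row for row in source}
    merged = []
    for row in target:
        match = updates.get(tuple(row[k] for k in keys))
        if match is None:
            merged.append(dict(row))  # no match: the row is kept unchanged
        else:
            merged.append({**row, **{c: match[c] for c in set_columns}})
    # Source rows without a matching target row are ignored: there is no insert clause
    return merged


target = [
    {"date_rep": "2020-07-15", "iso_country": "FR", "cases": 0, "load_date": "2021-06-22"},
    {"date_rep": "2020-07-15", "iso_country": "DE", "cases": 5, "load_date": "2021-06-22"},  # made up
]
source = [
    {"date_rep": "2020-07-15", "iso_country": "FR", "cases": 12, "load_date": "2021-06-23"},
    {"date_rep": "2020-07-15", "iso_country": "XX", "cases": 1, "load_date": "2021-06-23"},  # made up, unmatched
]
result = merge_when_matched_update(target, source, ("date_rep", "iso_country"), ("cases", "load_date"))
```

Because only matched rows are touched, the table keeps the same number of rows after the merge. Inserting new keys would need an additional not-matched clause on the real `merge` builder.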



After the merge, our Delta Lake shows 12 cases for France on the 15th of July, and load_date has been updated to the 23rd of June

In [14]:
display(deltaCovidTable.toDF().where("date_rep = '2020-07-15' and iso_country='FR'"))


Only the record for France on the 15th of July has been modified

In [15]:
display(deltaCovidTable.toDF().where("date_rep = '2020-07-15'"))
