# Updates and GDPR using Delta Lake - .NET for Apache Spark

In this notebook, we will review Delta Lake's end-to-end capabilities using [.NET for Apache Spark](https://github.com/dotnet/spark) (C#). If you are not familiar with [Delta Lake](https://github.com/delta-io/delta), you can start with the original [Quick Start guide](https://docs.delta.io/latest/quick-start.html), which provides code snippets that show how to read from and write to Delta Lake tables from interactive, batch, and streaming queries.

In this notebook, we will cover the following:

- Creating sample mock data containing customer orders
- Writing this data into storage in Delta Lake table format (or in short, Delta table)
- Querying the Delta table using the functional (DataFrame) API and SQL
- The Curious Case of Forgotten Discount - Making corrections to data
- Enforcing GDPR on your data
- Oops, enforced it on the wrong customer! - Looking at the audit log to find mistakes in operations
- Rollback all the way!
- Closing the loop - 'defrag' your data

# Creating sample mock data containing customer orders

For this tutorial, we will set up a sample view containing customer orders with a simple schema: (order_id, order_date, customer_name, price).

In [3]:
spark.Sql("DROP TABLE IF EXISTS input");
spark.Sql(@"
          CREATE TEMPORARY VIEW input 
          AS SELECT 1 order_id, '2019-11-01' order_date, 'Saveen' customer_name, 100 price
          UNION ALL SELECT 2, '2019-11-01', 'Terry', 50
          UNION ALL SELECT 3, '2019-11-01', 'Priyanka', 100
          UNION ALL SELECT 4, '2019-11-02', 'Steve', 10
          UNION ALL SELECT 5, '2019-11-03', 'Rahul', 10
          UNION ALL SELECT 6, '2019-11-03', 'Niharika', 75
          UNION ALL SELECT 7, '2019-11-03', 'Elva', 90
          UNION ALL SELECT 8, '2019-11-04', 'Andrew', 70
          UNION ALL SELECT 9, '2019-11-05', 'Michael', 20
          UNION ALL SELECT 10, '2019-11-05', 'Brigit', 25
");
var orders = spark.Sql("SELECT * FROM input");
orders.Show();
orders.PrintSchema();

+--------+----------+-------------+-----+
|order_id|order_date|customer_name|price|
+--------+----------+-------------+-----+
|       1|2019-11-01|       Saveen|  100|
|       2|2019-11-01|        Terry|   50|
|       3|2019-11-01|     Priyanka|  100|
|       4|2019-11-02|        Steve|   10|
|       5|2019-11-03|        Rahul|   10|
|       6|2019-11-03|     Niharika|   75|
|       7|2019-11-03|         Elva|   90|
|       8|2019-11-04|       Andrew|   70|
|       9|2019-11-05|      Michael|   20|
|      10|2019-11-05|       Brigit|   25|
+--------+----------+-------------+-----+

root
 |-- order_id: integer (nullable = false)
 |-- order_date: string (nullable = false)
 |-- customer_name: string (nullable = false)
 |-- price: integer (nullable = false)

# Writing this data into storage in Delta Lake table format (or in short, Delta table)

To create a Delta Lake table, you can write a DataFrame out in the **delta** format. You can use existing Spark SQL code and change the format from parquet, csv, json, and so on, to delta. These operations create a new Delta Lake table using the schema that was inferred from your DataFrame. 

If you already have existing data in Parquet format, you can do an "in-place" conversion to Delta Lake format. The code would look like the following:

DeltaTable.ConvertToDelta(spark, $"parquet.`{parquetPath}`");

// Confirm that the converted data is now in the Delta format
DeltaTable.IsDeltaTable(parquetPath);

In [4]:
var sessionId = (new Random()).Next(1000);
var path = $"/delta/delta-table-{sessionId}";
path

/delta/delta-table-555

In [5]:
// Here's how you'd do this in Parquet: 
// orders.Repartition(1).Write().Format("parquet").Save(path);

orders.Repartition(1).Write().Format("delta").Save(path);

# Querying the Delta table using the functional (DataFrame) API and SQL


In [7]:
var ordersDataFrame = spark.Read().Format("delta").Load(path);
ordersDataFrame.Show();

+--------+----------+-------------+-----+
|order_id|order_date|customer_name|price|
+--------+----------+-------------+-----+
|       2|2019-11-01|        Terry|   50|
|       8|2019-11-04|       Andrew|   70|
|       3|2019-11-01|     Priyanka|  100|
|       9|2019-11-05|      Michael|   20|
|       5|2019-11-03|        Rahul|   10|
|       1|2019-11-01|       Saveen|  100|
|       7|2019-11-03|         Elva|   90|
|       6|2019-11-03|     Niharika|   75|
|       4|2019-11-02|        Steve|   10|
|      10|2019-11-05|       Brigit|   25|
+--------+----------+-------------+-----+

In [8]:
ordersDataFrame.CreateOrReplaceTempView("ordersDeltaTable");
spark.Sql("SELECT * FROM ordersDeltaTable").Show();

+--------+----------+-------------+-----+
|order_id|order_date|customer_name|price|
+--------+----------+-------------+-----+
|       2|2019-11-01|        Terry|   50|
|       8|2019-11-04|       Andrew|   70|
|       3|2019-11-01|     Priyanka|  100|
|       9|2019-11-05|      Michael|   20|
|       5|2019-11-03|        Rahul|   10|
|       1|2019-11-01|       Saveen|  100|
|       7|2019-11-03|         Elva|   90|
|       6|2019-11-03|     Niharika|   75|
|       4|2019-11-02|        Steve|   10|
|      10|2019-11-05|       Brigit|   25|
+--------+----------+-------------+-----+

# Understanding Metadata

In Delta Lake, metadata is treated no differently from data, i.e., it is stored next to the data (in the `_delta_log` directory under the table path). An interesting side effect is that you can peek into the metadata using regular Spark APIs.

In [9]:
using System.Linq;
spark.Read().Text($"{path}/_delta_log/").Collect().ToList()
    .ForEach(x => 
            Console.WriteLine(x.GetAs<string>("value")));

{"commitInfo":{"timestamp":1573093872531,"operation":"WRITE","operationParameters":{"mode":"ErrorIfExists","partitionBy":"[]"},"isBlindAppend":true}}
{"protocol":{"minReaderVersion":1,"minWriterVersion":2}}
{"metaData":{"id":"4f975246-53a2-4b42-a085-a3c4ec57fb6b","format":{"provider":"parquet","options":{}},"schemaString":"{\"type\":\"struct\",\"fields\":[{\"name\":\"order_id\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}},{\"name\":\"order_date\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"customer_name\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"price\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}}]}","partitionColumns":[],"configuration":{},"createdTime":1573093865554}}
{"add":{"path":"part-00000-12e6adeb-b1d8-4f69-a05d-acbc48fc0464-c000.snappy.parquet","partitionValues":{},"size":1291,"modificationTime":1573093872000,"dataChange":true}}
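Each line of the log is a JSON object whose single top-level key names the kind of action it records. As a minimal sketch (the helper `ActionName` is hypothetical and not part of any Delta Lake API), plain `System.Text.Json` is enough to pull that key out of a line:

```csharp
using System;
using System.Text.Json;

// Hypothetical helper: returns the name of the first top-level property
// of a log line, e.g. "commitInfo", "protocol", "metaData", or "add".
string ActionName(string line)
{
    using JsonDocument doc = JsonDocument.Parse(line);
    foreach (JsonProperty property in doc.RootElement.EnumerateObject())
    {
        return property.Name;
    }
    return "(empty)";
}

// Two lines copied from the log output above.
string[] logLines =
{
    "{\"commitInfo\":{\"timestamp\":1573093872531,\"operation\":\"WRITE\",\"operationParameters\":{\"mode\":\"ErrorIfExists\",\"partitionBy\":\"[]\"},\"isBlindAppend\":true}}",
    "{\"protocol\":{\"minReaderVersion\":1,\"minWriterVersion\":2}}"
};

foreach (string line in logLines)
{
    Console.WriteLine(ActionName(line));
}
```

For the two sample lines this prints `commitInfo` and then `protocol`. The same helper could be applied to the strings collected from `_delta_log` in the cell above.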

# The Curious Case of Forgotten Discount - Making corrections to data

Now that you are able to look at the orders table, you realize that you forgot to discount the orders that came in on November 1, 2019. Worry not! You can quickly make that correction.

In [10]:
using Microsoft.Spark.Extensions.Delta;
using Microsoft.Spark.Extensions.Delta.Tables;
using Microsoft.Spark.Sql;
using static Microsoft.Spark.Sql.Functions;

var table = DeltaTable.ForPath(path);

// Update every transaction that took place on November 1, 2019 and apply a discount of 10%
table.Update(
  condition: Expr("order_date == '2019-11-01'"),
  set: new Dictionary<string, Column>(){{ "price", Expr("price - price*0.1") }});

In [11]:
table.ToDF().Show();

When you inspect the metadata now, it looks as if the original data was overwritten. It was not physically overwritten, though: Delta Lake wrote new data files and added entries to its transaction log that mark the original files as removed, which gives the "illusion" that the original data was deleted. We can verify this by re-inspecting the metadata, where you will see entries that remove the references to the original data files.

In [12]:
spark.Read().Text($"{path}/_delta_log/").Collect().ToList()
    .ForEach(x => 
            Console.WriteLine(x.GetAs<string>("value")));

{"commitInfo":{"timestamp":1573093872531,"operation":"WRITE","operationParameters":{"mode":"ErrorIfExists","partitionBy":"[]"},"isBlindAppend":true}}
{"protocol":{"minReaderVersion":1,"minWriterVersion":2}}
{"metaData":{"id":"4f975246-53a2-4b42-a085-a3c4ec57fb6b","format":{"provider":"parquet","options":{}},"schemaString":"{\"type\":\"struct\",\"fields\":[{\"name\":\"order_id\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}},{\"name\":\"order_date\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"customer_name\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"price\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}}]}","partitionColumns":[],"configuration":{},"createdTime":1573093865554}}
{"add":{"path":"part-00000-12e6adeb-b1d8-4f69-a05d-acbc48fc0464-c000.snappy.parquet","partitionValues":{},"size":1291,"modificationTime":1573093872000,"dataChange":true}}
{"commitInfo":{"timestamp":1573093971854,"operation":"UPDATE","operationParamet

# Enforcing GDPR on your data

One of your customers wanted their data to be deleted. But wait, you are working with data stored on an immutable file system (e.g., HDFS, ADLS, WASB). How would you delete it? Using Delta Lake's Delete API.

Delta Lake provides programmatic APIs to conditionally update, delete, and merge (upsert) data into tables. For more information on these operations, see [Table Deletes, Updates, and Merges](https://docs.delta.io/latest/delta-update.html).

In [14]:
// Delete the appropriate customer
table.Delete(condition: Expr("customer_name == 'Saveen'"));
table.ToDF().Show();

+--------+----------+-------------+-----+
|order_id|order_date|customer_name|price|
+--------+----------+-------------+-----+
|       2|2019-11-01|        Terry|   45|
|       8|2019-11-04|       Andrew|   70|
|       3|2019-11-01|     Priyanka|   90|
|       9|2019-11-05|      Michael|   20|
|       5|2019-11-03|        Rahul|   10|
|       7|2019-11-03|         Elva|   90|
|       6|2019-11-03|     Niharika|   75|
|       4|2019-11-02|        Steve|   10|
|      10|2019-11-05|       Brigit|   25|
+--------+----------+-------------+-----+

# Oops, enforced it on the wrong customer! - Looking at the audit/history log to find mistakes in operations

One of Delta Lake's most powerful features is the ability to inspect a table's history, i.e., the changes that were made to the underlying Delta table. The cell below shows how simple it is to inspect the history.


In [15]:
table.History().Drop("userId", "userName", "job", "notebook", "clusterId", "isolationLevel", "isBlindAppend").Show(20, 1000, false);

+-------+-------------------+---------+-----------------------------------------------+-----------+
|version|          timestamp|operation|                            operationParameters|readVersion|
+-------+-------------------+---------+-----------------------------------------------+-----------+
|      2|2019-11-07 02:33:31|   DELETE|[predicate -> ["(`customer_name` = 'Saveen')"]]|          1|
|      1|2019-11-07 02:32:52|   UPDATE|   [predicate -> (order_date#438 = 2019-11-01)]|          0|
|      0|2019-11-07 02:31:13|    WRITE|     [mode -> ErrorIfExists, partitionBy -> []]|       null|
+-------+-------------------+---------+-----------------------------------------------+-----------+
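As the history grows, scanning it by eye gets harder. A minimal sketch of narrowing the log down to delete operations, assuming the `table` variable from the earlier cells is still in scope; `Filter`, `Select`, and `Show` are the regular DataFrame calls, applied here to the DataFrame that `History()` returns:

```csharp
// Assumes `table` (DeltaTable) from the cells above.
// Keep only DELETE operations and the columns needed to spot the mistake.
table.History()
    .Filter("operation = 'DELETE'")
    .Select("version", "timestamp", "operationParameters")
    .Show(20, 1000);
```

The `version` column of the offending row tells you which snapshot to go back past when rolling back.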

# Rollback all the way using Time Travel!

You can query previous snapshots of your Delta Lake table by using a feature called Time Travel. If you want to access data that has since been changed or deleted, you can query a snapshot of the table from before that change using the versionAsOf option.

The cell below reads version 1 of the table (the state after the discount was applied but before the customer was deleted) and writes it back over the current table, which rolls back the mistaken delete. Time Travel is an extremely powerful feature that takes advantage of the Delta Lake transaction log to access data that is no longer in the current version of the table. Specifying version 0 instead would return the original data from before the discount. For more information, see [Query an older snapshot of a table (time travel)](https://docs.delta.io/latest/delta-batch.html#deltatimetravel).
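Before overwriting the table, it can help to preview the snapshot you are about to restore. A minimal sketch, assuming the `spark` session and the `path` variable from the earlier cells are still in scope:

```csharp
// Assumes `spark` (SparkSession) and `path` (string) from the cells above.
// Read a specific snapshot into a DataFrame without modifying the table.
var snapshot = spark.Read().Format("delta").Option("versionAsOf", "1").Load(path);
snapshot.Show();
```

Reading a snapshot this way leaves the table untouched. A new version is committed only when the result is written back.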

In [16]:
spark.Read().Format("delta").Option("versionAsOf", "1").Load(path)
    .Write().Mode("overwrite").Format("delta").Save(path);

In [17]:
// Delete the correct customer
table.Delete(condition: Expr("customer_name == 'Rahul'"));
table.ToDF().Show();

+--------+----------+-------------+-----+
|order_id|order_date|customer_name|price|
+--------+----------+-------------+-----+
|       2|2019-11-01|        Terry|   45|
|       8|2019-11-04|       Andrew|   70|
|       3|2019-11-01|     Priyanka|   90|
|       9|2019-11-05|      Michael|   20|
|       1|2019-11-01|       Saveen|   90|
|       7|2019-11-03|         Elva|   90|
|       6|2019-11-03|     Niharika|   75|
|       4|2019-11-02|        Steve|   10|
|      10|2019-11-05|       Brigit|   25|
+--------+----------+-------------+-----+

In [18]:
table.History().Drop("userId", "userName", "job", "notebook", "clusterId", "isolationLevel", "isBlindAppend").Show(20, 1000, false);

+-------+-------------------+---------+-----------------------------------------------+-----------+
|version|          timestamp|operation|                            operationParameters|readVersion|
+-------+-------------------+---------+-----------------------------------------------+-----------+
|      4|2019-11-07 02:36:33|   DELETE| [predicate -> ["(`customer_name` = 'Rahul')"]]|          3|
|      3|2019-11-07 02:35:52|    WRITE|         [mode -> Overwrite, partitionBy -> []]|          2|
|      2|2019-11-07 02:33:31|   DELETE|[predicate -> ["(`customer_name` = 'Saveen')"]]|          1|
|      1|2019-11-07 02:32:52|   UPDATE|   [predicate -> (order_date#438 = 2019-11-01)]|          0|
|      0|2019-11-07 02:31:13|    WRITE|     [mode -> ErrorIfExists, partitionBy -> []]|       null|
+-------+-------------------+---------+-----------------------------------------------+-----------+

# Closing the loop - 'defrag' your data


In [19]:
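// VACUUM refuses retention periods shorter than the default (7 days) unless this safety check is disabled.
spark.Conf().Set("spark.databricks.delta.retentionDurationCheck.enabled", "false");

// Remove data files that are no longer referenced by the table and are older than 0.01 hours.
// Note: versions that depended on the removed files can no longer be queried with Time Travel.
table.Vacuum(0.01);

// Alternate Syntax: spark.Sql($"VACUUM delta.`{path}`").Show();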

