<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

# Storing PII Securely

Adding a pseudonymized key to incremental workloads is as simple as adding a transformation.

In this notebook, we'll examine design patterns for ensuring PII is stored securely and updated accurately. We'll also demonstrate an approach for processing delete requests to make sure these are captured appropriately.

<img src="https://files.training.databricks.com/images/ade/ADE_arch_users.png" width="60%" />

## Learning Objectives
By the end of this notebook, students will be able to:
- Apply incremental transformations to store data with pseudonymized keys
- Use windowed ranking to identify the most-recent records in a CDC feed

Begin by running the following cell to set up relevant databases and paths.

In [0]:
%run "../Includes/ade-setup"

Execute the following cell to reset your `users` table.

In [0]:
spark.sql("DROP TABLE IF EXISTS users")
dbutils.fs.rm(Paths.users, True)
dbutils.fs.rm(Paths.usersCheckpointPath, True)

spark.sql("DROP TABLE IF EXISTS delete_requests")
dbutils.fs.rm(Paths.deleteRequests, True)

spark.sql(f"""
  CREATE TABLE users
  (alt_id STRING, dob DATE, sex STRING, gender STRING, first_name STRING, last_name STRING, street_address STRING, city STRING, state STRING, zip INT, updated TIMESTAMP)
  USING DELTA
  LOCATION '{Paths.users}'
""")

## ELT with Pseudonymization
The data in the `user_info` topic contains complete row outputs from a Change Data Capture feed.

There are three values for `update_type` present in the data: `new`, `update`, and `delete`.

The `users` table will be implemented as a Type 1 table, so only the most recent value for each user matters.

Run the cell below to visually confirm that both `new` and `update` records contain all the fields we need for our `users` table.

In [0]:
schema = """
    user_id LONG, 
    update_type STRING, 
    timestamp FLOAT, 
    dob STRING, 
    sex STRING, 
    gender STRING, 
    first_name STRING, 
    last_name STRING, 
    address STRUCT<
        street_address: STRING, 
        city: STRING, 
        state: STRING, 
        zip: INT
    >"""

usersDF = (spark.table("bronze")
    .filter("topic = 'user_info'")
    .select(F.from_json(F.col("value").cast("string"), schema).alias("v")).select("v.*")
    .filter(F.col("update_type").isin(["new", "update"]))
          )

display(usersDF)

## Processing Right to Be Forgotten Requests

While it is possible to process deletes at the same time as appends and updates, the fines around right to be forgotten requests may warrant a separate process.

The logic below sets up a simple table for processing delete requests against the users data. Each request is given a deadline of 30 days after it was made, so internal automated audits can use this table to verify compliance.

In [0]:
display(spark.table("bronze")
    .filter("topic = 'user_info'")
    .select(F.from_json(F.col("value").cast("string"), schema).alias("v")).select("v.*", F.col('v.timestamp').cast("timestamp").alias("requested"))
    .filter("update_type = 'delete'")
    .select("user_id",
        "requested",
        F.date_add("requested", 30).alias("deadline"), 
        F.lit("requested").alias("status")
           )
   )
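
The deadline rule, and the kind of audit it makes possible, can be sketched in plain Python without Spark. This is only an illustration under stated assumptions: the `overdue_requests` helper, the sample rows, and the `"deleted"` status value are hypothetical and not part of the course code, which only writes the status `"requested"`.

```python
from datetime import date, datetime, timedelta

DEADLINE_DAYS = 30  # same offset passed to F.date_add above


def deadline_for(requested: datetime) -> date:
    """Return the calendar date 30 days after the request timestamp."""
    return requested.date() + timedelta(days=DEADLINE_DAYS)


def overdue_requests(rows, today: date):
    """Return requests that are still pending after their deadline has passed."""
    return [r for r in rows if r["status"] == "requested" and r["deadline"] < today]


# Hypothetical sample rows shaped like the delete-request output above.
rows = [
    {"user_id": 1, "requested": datetime(2021, 1, 1, 8, 30), "status": "requested"},
    {"user_id": 2, "requested": datetime(2021, 1, 20, 12, 0), "status": "requested"},
    {"user_id": 3, "requested": datetime(2021, 1, 1, 9, 0), "status": "deleted"},
]
for r in rows:
    r["deadline"] = deadline_for(r["requested"])

overdue = overdue_requests(rows, today=date(2021, 2, 5))
print([r["user_id"] for r in overdue])
```

An audit job along these lines would flag only the first request: the second is still within its 30 days, and the third has already been handled.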

## Deduplication with Windowed Ranking

We've previously explored some ways to remove duplicate records:
- Using Delta Lake's `MERGE` syntax, we can update or insert records based on keys, matching new records with previously loaded data
- `dropDuplicates` will remove exact duplicates within a table or incremental microbatch

We now have multiple records for a given primary key, but these records are not identical. `dropDuplicates` will not remove them, and our `MERGE` statement will raise an error if the same key appears more than once in the source.

A third approach for removing duplicates is shown below, using the [PySpark Window class](http://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.Window.html?highlight=window#pyspark.sql.Window).

In [0]:
from pyspark.sql.window import Window

window = Window.partitionBy("user_id").orderBy(F.col("timestamp").desc())
rankedDF = usersDF.dropDuplicates().withColumn("rank", F.rank().over(window)).filter("rank == 1").drop("rank")

display(rankedDF)

As desired, we get only the newest (`rank == 1`) entry for each unique `user_id`.
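
The same selection can be sketched in plain Python to make one subtlety visible. `rank()` assigns the same rank to rows that tie on the ordering column, so two different records with the same `user_id` and the same `timestamp` would both survive the `rank == 1` filter, whereas `row_number()` always keeps exactly one row per key. The helper name `newest_per_key` and the sample rows below are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def newest_per_key(rows, key="user_id", order="timestamp", keep_ties=True):
    """Keep the newest row(s) per key.

    keep_ties=True mimics rank() == 1 (all rows tied for newest survive);
    keep_ties=False mimics row_number() == 1 (exactly one row per key).
    """
    result = []
    for _, group in groupby(sorted(rows, key=itemgetter(key)), key=itemgetter(key)):
        group = list(group)
        newest = max(r[order] for r in group)
        winners = [r for r in group if r[order] == newest]
        result.extend(winners if keep_ties else winners[:1])
    return result


# Hypothetical CDC rows: user 1 has an older and a newer record,
# user 2 has two different records sharing the same timestamp.
rows = [
    {"user_id": 1, "timestamp": 100.0, "city": "Austin"},
    {"user_id": 1, "timestamp": 200.0, "city": "Boston"},
    {"user_id": 2, "timestamp": 150.0, "city": "Denver"},
    {"user_id": 2, "timestamp": 150.0, "city": "Dallas"},
]

ranked = newest_per_key(rows, keep_ties=True)     # rank-style: ties are kept
numbered = newest_per_key(rows, keep_ties=False)  # row_number-style: one per key
print(len(ranked), len(numbered))
```

If ties like this can occur in the feed, `row_number()` (or an extra tie-breaking column in the `orderBy`) is one way to guarantee a single row per key. Which tied row wins is then arbitrary unless the ordering is made fully deterministic.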

Unfortunately, if we try to apply this to a streaming read of our data, we'll learn that
> Non-time-based windows are not supported on streaming DataFrames

In [0]:
streamingRankedDF = (spark.readStream.table("bronze")
    .filter("topic = 'user_info'")
    .dropDuplicates()
    .select(F.from_json(F.col("value").cast("string"), schema).alias("v")).select("v.*")
    .filter(F.col("update_type").isin(["new", "update"]))
    .withColumn("rank", F.rank().over(window)).filter("rank == 1").drop("rank")
                    )
  
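from pyspark.sql.utils import AnalysisException

try:
    display(streamingRankedDF)
    # Reaching this line means the unsupported window was unexpectedly accepted.
    raise Exception("Expected an AnalysisException, but the query started.")

except AnalysisException as e:
    print("Failed as expected...")
    print(e)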


Luckily, there is a workaround for this restriction.

## Implementing Streaming Ranked De-duplication

As we saw previously, when applying `MERGE` logic with a Structured Streaming job, we need to use `foreachBatch` logic.

Recall that while we're inside a streaming microbatch, we interact with our data using batch syntax.

This means that if we apply our ranked `Window` logic inside our `foreachBatch` function, each microbatch is treated as a static DataFrame and the restriction on non-time-based windows no longer applies.

The code below sets up all the incremental logic needed to load the data from the bronze table in the correct schema. This includes:
- Filtering for the `user_info` topic
- Dropping identical records within the batch
- Unpacking all of the JSON fields from the `value` column into the correct schema
- Updating field names and types to match the `users` table schema
- Using a salted hash function to convert the `user_id` to a pseudonymized `alt_id`

In [0]:
salt = "BEANS"

unpackedDF = (spark.readStream
    .table("bronze")
    .filter("topic = 'user_info'")
    .dropDuplicates()
    .select(F.from_json(F.col("value").cast("string"), schema).alias("v")).select("v.*")
    .select(F.sha2(F.concat(F.col("user_id"), F.lit(salt)), 256).alias("alt_id"),
        F.col('timestamp').cast("timestamp").alias("updated"),
        F.to_date('dob','MM/dd/yyyy').alias('dob'),
        'sex', 'gender','first_name','last_name',
        'address.*', "update_type"))
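
To make the `alt_id` expression above concrete, here is a minimal sketch of the same salted hash in plain Python using `hashlib`. It assumes that Spark's `concat` renders the `LONG` `user_id` as its decimal string and that `sha2(..., 256)` returns the lowercase hex digest of the UTF-8 bytes. The helper name `pseudonymize` and the sample ID are hypothetical.

```python
import hashlib

SALT = "BEANS"  # demo value from the cell above


def pseudonymize(user_id: int, salt: str = SALT) -> str:
    """Hash the decimal string of user_id concatenated with the salt (SHA-256, hex)."""
    return hashlib.sha256(f"{user_id}{salt}".encode("utf-8")).hexdigest()


alt_id = pseudonymize(12345)
print(len(alt_id))                             # length of the hex digest
print(alt_id == pseudonymize(12345))           # same input and salt give the same key
print(alt_id == pseudonymize(12345, "OTHER"))  # a different salt gives a different key
```

Because the hash is deterministic for a fixed salt, `alt_id` can stand in for `user_id` as the merge key. It also means that anyone who knows both the salt and a `user_id` can recompute the key.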

The updated Window logic is provided below. Note that this is being applied to each `microBatchDF` to result in a local `rankedDF` that will be used for merging.
 
For our `MERGE` statement, we need to:
- Match entries on our `alt_id`
- Update all when matched **if** the new record is newer than the previous entry
- When not matched, insert all
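
The three rules above, combined with per-batch ranking, can be simulated in plain Python to show why the "newer than" condition matters. This is a sketch rather than the Delta Lake implementation: the `users` table is modelled as a dict keyed by `alt_id`, and the helper names and sample micro-batches are hypothetical.

```python
def rank_batch(batch):
    """Keep only the newest row per alt_id within one micro-batch (first row wins ties)."""
    newest = {}
    for row in batch:
        current = newest.get(row["alt_id"])
        if current is None or row["updated"] > current["updated"]:
            newest[row["alt_id"]] = row
    return list(newest.values())


def merge_into(target, ranked_updates):
    """Apply the three MERGE rules to a dict keyed by alt_id."""
    for row in ranked_updates:
        existing = target.get(row["alt_id"])
        if existing is None:
            target[row["alt_id"]] = row  # not matched: insert
        elif existing["updated"] < row["updated"]:
            target[row["alt_id"]] = row  # matched and incoming row is newer: update
        # otherwise the stored row is at least as new, so it is left untouched


# Hypothetical micro-batches; the second one carries a stale record for "a".
batches = [
    [{"alt_id": "a", "updated": 1, "city": "Austin"},
     {"alt_id": "a", "updated": 3, "city": "Boston"}],
    [{"alt_id": "a", "updated": 2, "city": "Chicago"},
     {"alt_id": "b", "updated": 5, "city": "Denver"}],
]

users = {}
for batch in batches:
    merge_into(users, rank_batch(batch))

print(users["a"]["city"], users["b"]["city"])
```

Ranking only sees one micro-batch at a time, so a late-arriving older record can still be the "newest" row in its own batch. The timestamp condition on the matched branch is what stops it from overwriting newer data already in the table.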

Because `foreachBatch` allows for arbitrary writes to multiple tables from the same stream, processing the delete requests to a second table can happen in this same logic.

In [0]:
from pyspark.sql.window import Window

window = Window.partitionBy("alt_id").orderBy(F.col("updated").desc())

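def batch_rank_upsert(microBatchDF, batchId):
    # appId and batchId together identify this write, so Delta can skip the
    # append below if the same micro-batch is retried.
    appId = "batch_rank_upsert"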
    
    (microBatchDF
        .filter(F.col("update_type").isin(["new", "update"]))
        .withColumn("rank", F.rank().over(window)).filter("rank == 1").drop("rank")
        .createOrReplaceTempView("ranked_updates"))
    
    microBatchDF._jdf.sparkSession().sql("""
        MERGE INTO users u
        USING ranked_updates r
        ON u.alt_id=r.alt_id
        WHEN MATCHED AND u.updated < r.updated
          THEN UPDATE SET *
        WHEN NOT MATCHED
          THEN INSERT *
    """)

    (microBatchDF
         .filter("update_type = 'delete'")
         .select(
            "alt_id", 
            F.col("updated").alias("requested"), 
            F.date_add("updated", 30).alias("deadline"), 
            F.lit("requested").alias("status"))
        .write
        .format("delta")
        .mode("append")
        .option("txnVersion", batchId)
        .option("txnAppId", appId)
        .option("path", Paths.deleteRequests)
        .saveAsTable("delete_requests"))

Now we can apply this function to our data. Here, we'll run a trigger-once batch to process all available records.

In [0]:
(unpackedDF.writeStream
    .foreachBatch(batch_rank_upsert)
    .outputMode("update")
    .option("checkpointLocation", Paths.usersCheckpointPath)
    .trigger(once=True)
    .start()
    .awaitTermination())

The `users` table should only have 1 record for each unique ID.

In [0]:
assert spark.table("users").count() == spark.table("users").select("alt_id").distinct().count()

Assuming some requests to be forgotten have been made, there should be records in the `delete_requests` table.

In [0]:
%sql
SELECT * FROM delete_requests

&copy; 2021 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>