<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

# Type 2 Slowly Changing Data

In this notebook, we'll create a silver table that records the start and end time of each workout session, so that workouts can later be linked to heart rate recordings.

We'll record this data in a Type 2 table, whose start and end times let us match heart rate recordings to users and their workout sessions.

<img src="https://files.training.databricks.com/images/ade/ADE_arch_completed_workouts.png" width="60%" />

## Learning Objectives
By the end of this lesson, students will be able to:
- Describe how Slowly Changing Dimension tables can be implemented in the Lakehouse
- Use custom logic to implement an SCD Type 2 table with batch overwrite logic

Set up path and checkpoint variables (these will be used later).

In [0]:
%run ../Includes/workouts-setup

A helper function was defined to land and propagate a batch of data to the `workouts_silver` table.

In [0]:
process_silver_workouts()

Load and preview the `workouts_silver` data.

In [0]:
workoutDF = spark.table("workouts_silver")
display(workoutDF)

For this data, `user_id` and `session_id` form a composite key. Each pair should eventually have 2 records: one marking the "start" action and one marking the "stop" action for that workout.

In [0]:
display(workoutDF.groupby("user_id", "session_id").count())
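The same check can be sketched in plain Python to make the expectation explicit. The sample events below are made up for illustration and are not taken from `workouts_silver`; a key with a single record is a session whose "stop" has not arrived yet.

```python
from collections import Counter

# Hypothetical sample events: (user_id, session_id, action)
events = [
    (1, 10, "start"),
    (1, 10, "stop"),
    (2, 11, "start"),  # no matching "stop" yet
]

# Count records per composite key (user_id, session_id)
counts = Counter((user_id, session_id) for user_id, session_id, _ in events)

# Keys with fewer than 2 records are still waiting on an event
incomplete = sorted(key for key, n in counts.items() if n < 2)
print(incomplete)
```

On the real table, the equivalent would be filtering the grouped counts for values below 2.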

Because we'll be triggering a shuffle in this notebook, we'll be explicit about how many partitions we want at the end of our shuffle.

In [0]:
# Set the number of partitions produced by each shuffle (the join below triggers one)
spark.conf.set("spark.sql.shuffle.partitions", "4")

## Create Completed Workouts Table

The query below matches our start and stop actions, capturing the time for each action. The `in_progress` field indicates whether or not a given workout session is ongoing.

In [0]:
%sql

CREATE OR REPLACE TEMP VIEW TEMP_completed_workouts AS (
  SELECT a.user_id, a.workout_id, a.session_id, a.start_time, b.end_time,
         a.in_progress AND (b.in_progress IS NULL) AS in_progress
  FROM (
    SELECT user_id, workout_id, session_id, time AS start_time, null AS end_time, true AS in_progress
    FROM workouts_silver
    WHERE action = 'start') a
  LEFT JOIN (
    SELECT user_id, workout_id, session_id, null AS start_time, time AS end_time, false AS in_progress
    FROM workouts_silver
    WHERE action = 'stop') b
  ON a.user_id = b.user_id AND a.session_id = b.session_id
)
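To see what the join produces, here is a minimal plain-Python sketch of the same matching logic. The sample rows and the `match_workouts` helper are hypothetical, and the sketch assumes at most one "stop" per `(user_id, session_id)` pair. A "start" with no matching "stop" keeps `end_time` empty and is marked `in_progress`.

```python
from datetime import datetime

# Hypothetical sample rows standing in for workouts_silver
rows = [
    {"user_id": 1, "workout_id": 5, "session_id": 10,
     "time": datetime(2021, 1, 1, 9, 0), "action": "start"},
    {"user_id": 1, "workout_id": 5, "session_id": 10,
     "time": datetime(2021, 1, 1, 9, 30), "action": "stop"},
    {"user_id": 2, "workout_id": 7, "session_id": 11,
     "time": datetime(2021, 1, 1, 10, 0), "action": "start"},
]

def match_workouts(rows):
    """Pair each "start" with its "stop" on (user_id, session_id), like the LEFT JOIN."""
    # Index stop times by composite key (assumes at most one stop per key)
    stops = {(r["user_id"], r["session_id"]): r["time"]
             for r in rows if r["action"] == "stop"}
    result = []
    for r in rows:
        if r["action"] != "start":
            continue
        # None when no stop has arrived, mirroring the NULL from the LEFT JOIN
        end_time = stops.get((r["user_id"], r["session_id"]))
        result.append({
            "user_id": r["user_id"],
            "workout_id": r["workout_id"],
            "session_id": r["session_id"],
            "start_time": r["time"],
            "end_time": end_time,
            "in_progress": end_time is None,
        })
    return result

completed = match_workouts(rows)
for row in completed:
    print(row["session_id"], row["end_time"], row["in_progress"])
```

Because the starts drive the join, a "stop" that arrives without a "start" does not appear in the result at all.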

## Register Target Table

The following cell makes it easy to reset this demo. In production, you will _not_ want to drop your target table each time. Once you have this notebook working, comment out the following cell.

**HINT**: To comment out an entire block of code, select all text and then hit "**CMD** + **/**" (Mac)

In [0]:
spark.sql("DROP TABLE IF EXISTS completed_workouts")

dbutils.fs.rm(Paths.completedWorkouts, True)

spark.sql(f"""
  CREATE TABLE IF NOT EXISTS completed_workouts
  (user_id INT, workout_id INT, session_id INT, start_time TIMESTAMP, end_time TIMESTAMP, in_progress BOOLEAN)
  USING DELTA
  LOCATION '{Paths.completedWorkouts}'
""")

## Write Results as a Batch Overwrite

Our present implementation will replace the `completed_workouts` table entirely with each triggered batch. What are some potential benefits and drawbacks of this approach?

In [0]:
def process_completed_workouts():
    (spark.table("TEMP_completed_workouts").write
        .mode("overwrite")
        .saveAsTable("completed_workouts"))
process_completed_workouts()
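One way to think about the trade-off is a small plain-Python model of overwrite semantics. This is only a sketch: `compute_completed` and `overwrite` are hypothetical helpers, the target is a plain list rather than a Delta table, and timestamps are made-up integers.

```python
def compute_completed(events):
    """Recompute the full result from every event seen so far (hypothetical helper)."""
    stops = {(u, s): t for u, s, action, t in events if action == "stop"}
    return [
        {"user_id": u, "session_id": s, "start_time": t,
         "end_time": stops.get((u, s)), "in_progress": (u, s) not in stops}
        for u, s, action, t in events if action == "start"
    ]

def overwrite(target, new_rows):
    """Replace the target's contents entirely and report how many rows were written."""
    target[:] = new_rows
    return len(new_rows)

target = []
batch_1 = [(1, 10, "start", 900), (2, 11, "start", 1000)]
batch_2 = [(1, 10, "stop", 930)]

written_1 = overwrite(target, compute_completed(batch_1))
written_2 = overwrite(target, compute_completed(batch_1 + batch_2))

# Possible benefit: session 10 is updated without any merge logic,
# and rerunning the same overwrite leaves the target unchanged.
# Possible drawback: both rows are rewritten although only one changed.
print(written_2, target[0]["in_progress"], target[1]["in_progress"])
```

In this model the cost of each run grows with the full history, not with the size of the new batch.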

You can now query your `completed_workouts` table directly to check your results. Uncomment one `WHERE` clause at a time below to confirm that the logic above behaves as expected.

In [0]:
%sql

SELECT COUNT(*)
FROM completed_workouts
-- WHERE in_progress=true                            -- records still awaiting an end time
-- WHERE end_time IS NOT NULL                        -- records whose end time has been recorded
-- WHERE start_time IS NULL                          -- records whose end time arrived before the start time
-- WHERE in_progress=true AND end_time IS NOT NULL   -- should return 0: no in-progress record has an end time
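The checks above can also be read as invariants on each row. The sketch below expresses them in plain Python over made-up rows; `check_invariants` is a hypothetical helper, not part of the course setup.

```python
def check_invariants(rows):
    """Return a list of human-readable violations (empty when every check passes)."""
    problems = []
    for i, r in enumerate(rows):
        if r["start_time"] is None:
            problems.append(f"row {i}: missing start_time")
        if r["in_progress"] and r["end_time"] is not None:
            problems.append(f"row {i}: in_progress but end_time is set")
        if not r["in_progress"] and r["end_time"] is None:
            problems.append(f"row {i}: finished but end_time is missing")
    return problems

# Hypothetical rows: integer timestamps are for illustration only
good = [
    {"start_time": 900, "end_time": 930, "in_progress": False},
    {"start_time": 1000, "end_time": None, "in_progress": True},
]
bad = [
    {"start_time": 900, "end_time": 930, "in_progress": True},
]

print(check_invariants(good))
print(check_invariants(bad))
```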

Use the functions below to propagate another batch of records through the pipeline to this point.

In [0]:
process_silver_workouts()
process_completed_workouts()

In [0]:
%sql

SELECT COUNT(*)
FROM completed_workouts

&copy; 2021 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>