## Demo 4: PySpark Medallion Architecture - Persisted in Delta Tables & Queried with the SQL API
### Overview
Structured Streaming is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine. You can express your streaming computation the same way you would express a batch computation on static data. The Spark SQL engine will take care of running it incrementally and continuously and updating the final result as streaming data continues to arrive. You can use the Dataset/DataFrame API in Scala, Java, Python or R to express streaming aggregations, event-time windows, stream-to-batch joins, etc. The computation is executed on the same optimized Spark SQL engine. Finally, the system ensures end-to-end exactly-once fault-tolerance guarantees through checkpointing and Write-Ahead Logs. In short, Structured Streaming provides fast, scalable, fault-tolerant, end-to-end exactly-once stream processing without the user having to reason about streaming.

Internally, by default, Structured Streaming queries are processed using a micro-batch processing engine, which processes data streams as a series of small batch jobs, thereby achieving end-to-end latencies as low as 100 milliseconds with exactly-once fault-tolerance guarantees. Since Spark 2.3, a low-latency processing mode called Continuous Processing is also available, which can achieve end-to-end latencies as low as 1 millisecond with at-least-once guarantees. You can choose the mode based on your application requirements without changing the Dataset/DataFrame operations in your queries.
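
The micro-batch idea can be illustrated without Spark. The following is a minimal plain-Python sketch, not Spark's implementation: it keeps a running sum and count per key, updates that state one small batch at a time, and checks that the result matches a single pass over the full dataset. The function names and the sample readings are hypothetical.

```python
from collections import defaultdict

def run_micro_batches(batches):
    """Update a running (sum, count) per key, one micro-batch at a time."""
    state = defaultdict(lambda: [0.0, 0])  # key -> [running sum, running count]
    for batch in batches:                  # each batch is a small list of (key, value) rows
        for key, value in batch:
            state[key][0] += value
            state[key][1] += 1
    return {key: total / count for key, (total, count) in state.items()}

def run_batch(rows):
    """Compute the same averages in one pass over the full, static dataset."""
    return run_micro_batches([rows])

# Hypothetical readings: (device, heart rate)
rows = [("a", 60.0), ("b", 80.0), ("a", 70.0), ("b", 90.0), ("a", 80.0)]
micro_batches = [rows[0:2], rows[2:4], rows[4:5]]

incremental_result = run_micro_batches(micro_batches)
batch_result = run_batch(rows)
print(incremental_result == batch_result)
```

Only the small per-key state has to survive between batches, which is what allows the same query to run incrementally as new data arrives.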

Having gained a better understanding of how to perform **incremental data processing** in a previous demonstration, where we combined Structured Streaming APIs and Spark SQL, we can now explore the tight integration between Structured Streaming and Delta Lake.

#### Objectives
By the end of this lesson, you should be able to:
* Describe Bronze, Silver, and Gold tables
* Create a Delta Lake multi-hop pipeline

#### Incremental Updates in the Lakehouse
Delta Lake allows users to easily combine streaming and batch workloads in a unified multi-stage pipeline, wherein each stage of the pipeline represents a state of our data valuable to driving core use cases within the business. With all data and metadata resident in object storage in the cloud, multiple users and applications can access data in near-real time, allowing analysts to access the freshest data as it's being processed.

![](https://files.training.databricks.com/images/sslh/multi-hop-simple.png)

- **Bronze** tables contain raw data ingested from various sources (JSON files, RDBMS data, IoT data, to name a few examples).
- **Silver** tables provide a more refined view of our data. We can join fields from various bronze tables to enrich streaming records, or update account statuses based on recent activity.
- **Gold** tables provide business-level aggregates often used for reporting and dashboarding. This would include aggregations such as daily active website users, weekly sales per store, or gross revenue per quarter by department.

The end outputs are actionable insights, dashboards and reports of business metrics. By considering our business logic at all steps of the ETL pipeline, we can optimize storage and compute costs by reducing unnecessary duplication of data and limiting ad hoc querying against full historic data. Each stage can be configured as a batch or streaming job, and ACID transactions ensure that each write either succeeds or fails completely.
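
To make the three layers concrete, here is a small plain-Python sketch of the same flow. It is an illustration under assumptions, not the Delta Lake pipeline built later in this notebook: the function names, the sample records, and the patient names are all hypothetical.

```python
import json
import math
from collections import defaultdict

# Bronze: raw records kept exactly as ingested (hypothetical sample values).
bronze = [
    '{"device_id": "1", "mrn": "100", "heartrate": "71.5"}',
    '{"device_id": "2", "mrn": "200", "heartrate": "88.0"}',
    '{"device_id": "1", "mrn": "100", "heartrate": "73.0"}',
    '{"device_id": "3", "mrn": "300", "heartrate": "65.0"}',
]

# Static lookup table used to enrich the records (hypothetical names).
patients = {100: "Patient A", 200: "Patient B"}

def to_silver(raw_rows, lookup):
    """Parse, type, and enrich raw rows."""
    silver = []
    for raw in raw_rows:
        rec = json.loads(raw)
        mrn = int(rec["mrn"])
        if mrn in lookup:  # inner join: rows without a matching patient are dropped
            silver.append({"mrn": mrn,
                           "patient_name": lookup[mrn],
                           "heartrate": float(rec["heartrate"])})
    return silver

def to_gold(silver_rows):
    """Aggregate to one row per patient: rounded-up mean heart rate and reading count."""
    readings = defaultdict(list)
    for row in silver_rows:
        readings[row["patient_name"]].append(row["heartrate"])
    gold = [{"patient_name": name,
             "avg_heartrate": math.ceil(sum(values) / len(values)),
             "count": len(values)}
            for name, values in readings.items()]
    return sorted(gold, key=lambda r: r["avg_heartrate"], reverse=True)

silver = to_silver(bronze, patients)
gold = to_gold(silver)
print(gold)
```

Each layer reads only from the layer before it, so a reporting query against the gold result never has to touch the raw records.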

### Lab Details:
This lab demonstrates ingesting artificially generated medical data in JSON format. The data simulates heart rate monitor signals captured from numerous devices, so it resembles what a streaming data source would deliver.

#### Datasets Used:
The schemas of our two datasets are represented below. Note that we will be manipulating these schemas during various steps.

##### Recordings:
The main dataset uses heart rate recordings from medical devices delivered in the JSON format.

| Field | Type |
| --- | --- |
| device_id | int |
| mrn | long |
| time | double |
| heartrate | double |
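
As a quick way to reason about this schema, the sketch below checks a Python dictionary against it. This is a hypothetical helper, not part of the lab code. Python's `int` stands in for both `int` and `long`, `float` stands in for `double`, and the sample values are invented for illustration.

```python
# Python types standing in for the schema table (int covers both int and long).
RECORDING_SCHEMA = {"device_id": int, "mrn": int, "time": float, "heartrate": float}

def conforms(record, schema=RECORDING_SCHEMA):
    """Return True if `record` has exactly the schema's fields with matching types."""
    if set(record) != set(schema):
        return False
    # bool is a subclass of int in Python, so reject it explicitly.
    return all(
        isinstance(record[name], expected) and not isinstance(record[name], bool)
        for name, expected in schema.items()
    )

# Hypothetical records: one typed as in the table, one with every field as a string.
good = {"device_id": 23, "mrn": 40580129, "time": 1577836800.0, "heartrate": 54.01}
bad = {"device_id": "23", "mrn": "40580129", "time": "1577836800.0", "heartrate": "54.01"}
print(conforms(good), conforms(bad))
```

The all-string record matters because the streaming read later in this notebook reports every field as `string`, which is why the Silver step casts each column explicitly.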

##### Personally Identifiable Information (PII):
The recordings will later be joined with a static table of patient information, stored in an external system, to identify patients by name.

| Field | Type |
| --- | --- |
| mrn | long |
| name | string |

### 1.0. Prerequisites:

In [1]:
import findspark
findspark.init()
findspark.find()

'C:\\spark-3.5.4-bin-hadoop3'

#### 1.1. Import Required Libraries

In [2]:
import os
import sys
import json
import time
import shutil

import pyspark
from delta import *
from pyspark.sql.functions import *
from pyspark.sql.types import *

#### 1.2. Instantiate Global Variables

In [3]:
# --------------------------------------------------------------------------------
# Specify Directory Structure for Source Data
# --------------------------------------------------------------------------------
base_dir = os.path.join(os.getcwd(), 'lab_data')
data_dir = os.path.join(base_dir, 'healthcare')
batch_dir = os.path.join(data_dir, 'batch')
stream_dir = os.path.join(data_dir, 'streaming')
tracker_stream_dir = os.path.join(stream_dir, 'tracker')

# --------------------------------------------------------------------------------
# Create Directory Structure for Data Lakehouse Files
# --------------------------------------------------------------------------------
dest_database = "healthcare_dlh"
sql_warehouse_dir = os.path.abspath('spark-warehouse')
database_dir = os.path.join(sql_warehouse_dir, dest_database)

patient_output_bronze = os.path.join(database_dir, 'dim_patient')
heartbeat_output_bronze = os.path.join(database_dir, 'fact_heartbeat_bronze_checkpoint')
heartbeat_output_silver = os.path.join(database_dir, 'fact_heartbeat_silver_checkpoint')
heartbeat_output_gold = os.path.join(database_dir, 'fact_heartbeat_gold_checkpoint')

#### 1.3. Define Global Functions

In [4]:
def remove_directory_tree(path: str):
    '''If it exists, remove the entire contents of a directory structure at a given 'path' parameter's location.'''
    try:
        if os.path.exists(path):
            shutil.rmtree(path)
            return f"Directory '{path}' has been removed successfully."
        else:
            return f"Directory '{path}' does not exist."
            
    except Exception as e:
        return f"An error occurred: {e}"


def wait_until_stream_is_ready(query, min_batches=1):
    '''Block until the streaming 'query' has reported at least 'min_batches' progress updates.'''
    while len(query.recentProgress) < min_batches:
        time.sleep(5)
        
    print(f"The stream has processed {len(query.recentProgress)} batches")

#### 1.4. Create a New Spark Session

In [5]:
cpu_count = os.cpu_count() or 2   # os.cpu_count() can return None
worker_threads = f"local[{max(1, cpu_count // 2)}]"   # always use at least one worker thread
shuffle_partitions = cpu_count

builder = pyspark.sql.SparkSession.builder \
    .appName('PySpark Delta Data Lakehouse Heartrate Monitor')\
    .master(worker_threads)\
    .config('spark.driver.memory', '4g') \
    .config('spark.executor.memory', '2g')\
    .config('spark.sql.adaptive.enabled', 'false') \
    .config('spark.sql.debug.maxToStringFields', 50) \
    .config('spark.sql.catalog.spark_catalog', 'org.apache.spark.sql.delta.catalog.DeltaCatalog') \
    .config('spark.sql.extensions', 'io.delta.sql.DeltaSparkSessionExtension') \
    .config('spark.sql.shuffle.partitions', shuffle_partitions) \
    .config('spark.sql.streaming.forceDeleteTempCheckpointLocation', 'true') \
    .config('spark.sql.streaming.schemaInference', 'true') \
    .config('spark.sql.warehouse.dir', database_dir) \
    .config('spark.streaming.stopGracefullyOnShutdown', 'true')

spark = configure_spark_with_delta_pip(builder).getOrCreate()
spark.sparkContext.setLogLevel("OFF")
spark

#### 1.5. Initialize Data Lakehouse Directory Structure
Remove the data lakehouse database directory so that the notebook can be rerun from a clean state (idempotency).

In [6]:
remove_directory_tree(database_dir)

"Directory 'C:\\Users\\jtupi\\Documents\\UVA\\DS-2002-Teacher\\04-PySpark\\spark-warehouse\\healthcare_dlh' has been removed successfully."

#### 1.6. Create a New Metadata Database

In [7]:
spark.sql(f"DROP DATABASE IF EXISTS {dest_database} CASCADE;")

sql_create_db = f"""
    CREATE DATABASE IF NOT EXISTS {dest_database}
    COMMENT 'DS-2002 Demo 04 Database'
    WITH DBPROPERTIES (contains_pii = true, purpose = 'DS-2002 Demo 4');
"""
spark.sql(sql_create_db)

DataFrame[]

### 2.0. Create Bronze Layer
#### 2.1. Read a Batch of Patient dimension data from a CSV file

In [8]:
patient_csv = os.path.join(batch_dir, 'patient_info.csv')
print(patient_csv)

C:\Users\jtupi\Documents\UVA\DS-2002-Teacher\04-PySpark\lab_data\healthcare\batch\patient_info.csv


In [9]:
df_patient = spark.read.format('csv').options(header='true', inferSchema=True).load(patient_csv)

# Unit Test ---------------------------------------------------------
df_patient.printSchema()
print(f"The 'df_patients' table contains {df_patient.count()} rows.")
df_patient.toPandas().head(5)

root
 |-- mrn: integer (nullable = true)
 |-- name: string (nullable = true)

The 'df_patients' table contains 34 rows.


| | mrn | name |
| --- | --- | --- |
| 0 | 23940128 | Caitlin Garcia |
| 1 | 18064290 | Anthony Perez |
| 2 | 95384990 | Tanya Diaz |
| 3 | 53057176 | Autumn Calderon |
| 4 | 96005424 | Ronald Smith |


#### 2.2. Persist the Patient dimension data to a Delta table

In [10]:
df_patient.write.mode("overwrite").format("delta").saveAsTable(f"{dest_database}.dim_patients")

#### 2.3. Read Patient Data from the Delta table

In [11]:
df_patient = spark.read.format("delta").table(f"{dest_database}.dim_patients")
df_patient.toPandas().head()

| | mrn | name |
| --- | --- | --- |
| 0 | 23940128 | Caitlin Garcia |
| 1 | 18064290 | Anthony Perez |
| 2 | 95384990 | Tanya Diaz |
| 3 | 53057176 | Autumn Calderon |
| 4 | 96005424 | Ronald Smith |


#### 2.4. Use Structured Streaming to Read Heartrate Monitor data
##### Read data from a series of JSON source files into a streaming DataFrame

In [12]:
df_tracker = (spark.readStream \
              .option("schemaLocation", heartbeat_output_bronze) \
              .option("maxFilesPerTrigger", 1) \
              .option("multiLine", "true") \
              .json(tracker_stream_dir)
             )

df_tracker.isStreaming

True

In [13]:
# Unit Test -------------
print(type(df_tracker))
df_tracker.printSchema()

<class 'pyspark.sql.dataframe.DataFrame'>
root
 |-- device_id: string (nullable = true)
 |-- heartrate: string (nullable = true)
 |-- mrn: string (nullable = true)
 |-- time: string (nullable = true)



#### 2.5. Write Data from Streaming DataFrame into a 'Bronze' Delta table

In [14]:
bronze_query = (df_tracker.writeStream
                .format("delta")
                .outputMode("append")
                .queryName("heartbeat_tracker_bronze")
                .trigger(availableNow = True)
                .option("checkpointLocation", heartbeat_output_bronze)
                .option("compression", "snappy")
                .toTable(f"{dest_database}.fact_heartbeat_bronze")
                )
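
The `checkpointLocation` option is what lets a restarted query resume where it left off instead of re-ingesting every source file. The plain-Python sketch below illustrates that idea only. It is not how Spark or Delta Lake store checkpoints, it makes no atomicity guarantees, and the function name, file names, and sample rows are hypothetical.

```python
import json
import os
import tempfile

def process_available(source_dir, checkpoint_path, sink):
    """Append rows from files not yet recorded in the checkpoint, then update it."""
    done = set()
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as fh:
            done = set(json.load(fh))
    new_files = sorted(name for name in os.listdir(source_dir)
                       if name.endswith(".json") and name not in done)
    for name in new_files:
        with open(os.path.join(source_dir, name)) as fh:
            sink.extend(json.load(fh))   # each file holds a JSON array of rows
        done.add(name)
    with open(checkpoint_path, "w") as fh:
        json.dump(sorted(done), fh)
    return len(new_files)

with tempfile.TemporaryDirectory() as root:
    source_dir = os.path.join(root, "source")
    os.makedirs(source_dir)
    checkpoint_path = os.path.join(root, "checkpoint.json")
    sink = []

    def write_file(name, rows):
        with open(os.path.join(source_dir, name), "w") as fh:
            json.dump(rows, fh)

    write_file("part-0.json", [{"mrn": 1, "heartrate": 70.0}])
    first_run = process_available(source_dir, checkpoint_path, sink)
    second_run = process_available(source_dir, checkpoint_path, sink)  # nothing new
    write_file("part-1.json", [{"mrn": 2, "heartrate": 80.0}])
    third_run = process_available(source_dir, checkpoint_path, sink)   # only the new file

print(first_run, second_run, third_run, len(sink))
```

Each call behaves roughly like one run with `trigger(availableNow = True)`: it processes whatever is new since the last recorded position and then stops.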

In [15]:
# Unit Test ----------------------------------
print(f"Query ID: {bronze_query.id}")
print(f"Query Name: {bronze_query.name}")
print(f"Query Status: {bronze_query.status}")
print(f"Last Progress: {bronze_query.lastProgress}")

Query ID: 98037071-110a-4395-8d07-1ec644d46b46
Query Name: heartbeat_tracker_bronze
Query Status: {'message': 'Initializing sources', 'isDataAvailable': False, 'isTriggerActive': False}
Last Progress: None


#### 2.6. Verify Bronze Tables

In [16]:
spark.sql(f"USE {dest_database};")
spark.sql("SHOW TABLES").toPandas()

| | namespace | tableName | isTemporary |
| --- | --- | --- | --- |
| 0 | healthcare_dlh | dim_patients | False |
| 1 | healthcare_dlh | fact_heartbeat_bronze | False |


In [17]:
bronze_query.awaitTermination()

### 3.0. Create Silver Layer
#### 3.1. Define Silver Query to Join Streaming with Batch Data

In [18]:
df_silver = spark.readStream \
    .format("delta") \
    .table(f"{dest_database}.fact_heartbeat_bronze") \
    .join(df_patient, "mrn") \
    .select(col("device_id").cast(IntegerType()), \
            col("mrn").cast(LongType()), \
            (col("time")/1e6).cast(TimestampType()).alias("datetime"), \
            col("heartrate").cast(DoubleType()), \
            col("name").alias("patient_name")
           )
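
The casts in the Silver query can be mirrored on a single record in plain Python. The sketch below assumes that `time` holds microseconds since the Unix epoch, which is how the cell above treats it when it divides by `1e6` before casting to a timestamp. Whether the source files really use that unit is an assumption, and the sample record is hypothetical.

```python
from datetime import datetime, timezone

def cast_recording(raw):
    """Cast one all-string bronze record to typed silver fields."""
    return {
        "device_id": int(raw["device_id"]),
        "mrn": int(raw["mrn"]),
        # Assumption: `time` is microseconds since the Unix epoch.
        "datetime": datetime.fromtimestamp(float(raw["time"]) / 1e6, tz=timezone.utc),
        "heartrate": float(raw["heartrate"]),
    }

# Hypothetical bronze record with every field stored as a string.
sample = {"device_id": "23", "mrn": "40580129",
          "time": "1577836800000000", "heartrate": "54.01"}
typed = cast_recording(sample)
print(typed["datetime"].isoformat())
```

Doing the casts in the Silver step keeps the Bronze table a faithful copy of the source while giving downstream queries properly typed columns.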

In [19]:
# Unit Test -----------
df_silver.printSchema()

root
 |-- device_id: integer (nullable = true)
 |-- mrn: long (nullable = true)
 |-- datetime: timestamp (nullable = true)
 |-- heartrate: double (nullable = true)
 |-- patient_name: string (nullable = true)



#### 3.2. Persist Data to a 'Silver' Delta table

In [20]:
silver_query = (df_silver.writeStream
                .format("delta")
                .outputMode("append")
                .queryName("heartbeat_tracker_silver")
                .trigger(availableNow = True)
                .option("checkpointLocation", heartbeat_output_silver)
                .option("compression", "snappy")
                .toTable(f"{dest_database}.fact_heartbeat_silver")
                )

#### 3.3. Verify Silver Tables

In [21]:
spark.sql(f"USE {dest_database};")
spark.sql("SHOW TABLES").toPandas()

| | namespace | tableName | isTemporary |
| --- | --- | --- | --- |
| 0 | healthcare_dlh | dim_patients | False |
| 1 | healthcare_dlh | fact_heartbeat_bronze | False |
| 2 | healthcare_dlh | fact_heartbeat_silver | False |


In [22]:
silver_query.awaitTermination()

### 4.0. Create Gold Layer
#### 4.1. Define Gold Query to Perform an Aggregation

In [23]:
df_heartrate_by_patient_gold = spark.readStream \
    .format("delta") \
    .table(f"{dest_database}.fact_heartbeat_silver") \
    .groupBy('patient_name') \
    .agg((ceiling(avg("heartrate")).alias("avg_heartrate")), \
        (count("device_id").alias("count"))) \
    .orderBy(desc("avg_heartrate"))

In [24]:
# Unit Test -----
df_heartrate_by_patient_gold.printSchema()

root
 |-- patient_name: string (nullable = true)
 |-- avg_heartrate: long (nullable = true)
 |-- count: long (nullable = false)



#### 4.2. Persist Gold Data to a 'Gold' Delta table

In [25]:
gold_query = (df_heartrate_by_patient_gold.writeStream \
              .outputMode("complete") \
              .queryName("fact_heartbeat_by_patient_gold") \
              .format("delta") \
              .trigger(availableNow = True) \
              .option("checkpointLocation", heartbeat_output_gold) \
              .option("compression", "snappy") \
              .toTable(f"{dest_database}.fact_heartbeat_by_patient_gold")
             )
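
The Gold query writes with `outputMode("complete")`, whereas the Bronze and Silver queries use `append`. The plain-Python sketch below contrasts the two ideas. It is a simplified illustration with hypothetical function names and sample rows, and it does not reproduce how Spark maintains aggregation state.

```python
def apply_append(table, batch_rows):
    """Append mode: only the new rows are added; existing rows are never rewritten."""
    return table + batch_rows

def apply_complete(_previous_table, all_rows_so_far):
    """Complete mode: the whole aggregate is recomputed and replaces the table."""
    counts = {}
    for row in all_rows_so_far:
        counts[row["patient_name"]] = counts.get(row["patient_name"], 0) + 1
    return [{"patient_name": name, "count": n} for name, n in sorted(counts.items())]

# Hypothetical micro-batches arriving one after the other.
batch_1 = [{"patient_name": "Patient A"}, {"patient_name": "Patient B"}]
batch_2 = [{"patient_name": "Patient A"}]

silver_table = apply_append(apply_append([], batch_1), batch_2)
gold_after_1 = apply_complete([], batch_1)
gold_after_2 = apply_complete(gold_after_1, batch_1 + batch_2)
print(len(silver_table), gold_after_2)
```

The appended table grows with every batch, while the aggregated table keeps one row per patient whose values change, which is why an aggregate is written as a full replacement.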

#### 4.3. Read Gold Streaming Data

In [26]:
df_fact_heartbeat_by_patient_gold = spark.sql(f"SELECT * FROM {dest_database}.fact_heartbeat_by_patient_gold")
df_fact_heartbeat_by_patient_gold.printSchema()

root
 |-- patient_name: string (nullable = true)
 |-- avg_heartrate: long (nullable = true)
 |-- count: long (nullable = true)



#### 4.4. Display the Gold table

In [27]:
wait_until_stream_is_ready(gold_query, 1)

df_fact_heartbeat_by_patient_gold.select("patient_name", "avg_heartrate", "count").toPandas()

The stream has processed 1 batches


| | patient_name | avg_heartrate | count |
| --- | --- | --- | --- |
| 0 | Anthony Perez | 75 | 2850 |
| 1 | Dr. Amanda Baxter | 75 | 4640 |
| 2 | Melissa Martinez | 75 | 2487 |
| 3 | Crystal Ho | 75 | 2973 |
| 4 | Autumn Calderon | 75 | 1819 |
| 5 | Sharon Brewer | 86 | 1819 |
| 6 | Rachel Contreras | 86 | 4423 |
| 7 | Valerie Reese | 86 | 4326 |
| 8 | Joshua Perkins | 83 | 3484 |
| 9 | Valerie Garcia | 83 | 2726 |


In [28]:
spark.stop()