## Demo 3: PySpark Medallion Architecture - Persisted using Parquet and Queried with Spark-SQL
### Overview
Structured Streaming is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine. You can express your streaming computation the same way you would express a batch computation on static data. The Spark SQL engine will take care of running it incrementally and continuously and updating the final result as streaming data continues to arrive. You can use the Dataset/DataFrame API in Scala, Java, Python or R to express streaming aggregations, event-time windows, stream-to-batch joins, etc. The computation is executed on the same optimized Spark SQL engine. Finally, the system ensures end-to-end exactly-once fault-tolerance guarantees through checkpointing and Write-Ahead Logs. In short, Structured Streaming provides fast, scalable, fault-tolerant, end-to-end exactly-once stream processing without the user having to reason about streaming.

Internally, by default, Structured Streaming queries are processed using a micro-batch processing engine, which processes data streams as a series of small batch jobs, thereby achieving end-to-end latencies as low as 100 milliseconds and exactly-once fault-tolerance guarantees. Since Spark 2.3, a low-latency processing mode called Continuous Processing has also been available, which can achieve end-to-end latencies as low as 1 millisecond with at-least-once guarantees. Without changing the Dataset/DataFrame operations in your queries, you can choose the mode based on your application requirements.
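
The micro-batch idea can be illustrated without Spark. The following is a minimal plain-Python sketch, not Spark's implementation: the function name `process_micro_batches` and the sample readings are hypothetical. It keeps a small running state (a count and a sum) that is updated one batch at a time, and the final incremental result matches what a single batch computation over all of the data would give.

```python
from statistics import mean

def process_micro_batches(batches):
    """Update a running count and sum one micro-batch at a time,
    yielding the running average after each non-empty batch."""
    count, total = 0, 0.0
    for batch in batches:
        if not batch:
            continue  # nothing arrived in this interval
        count += len(batch)
        total += sum(batch)
        yield total / count

# Hypothetical heart rate readings arriving as three small batches.
batches = [[60.0, 62.0], [58.0], [64.0, 66.0, 50.0]]

running_averages = list(process_micro_batches(batches))
all_readings = [x for batch in batches for x in batch]

print(running_averages)    # one result per micro-batch
print(mean(all_readings))  # batch result over the full data
```

The last running average equals the batch mean, which is the sense in which a streaming computation can be expressed the same way as a batch computation.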

Having seen how to perform **incremental data processing** in a previous demonstration, where we combined Structured Streaming APIs and Spark SQL, we can now explore how Structured Streaming fits into a multi-hop lakehouse pipeline. The pattern is usually described in terms of Delta Lake; this demo applies it with plain Parquet files.

#### Objectives
By the end of this lesson, you should be able to:
* Describe Bronze, Silver, and Gold tables
* Create a multi-hop (medallion) pipeline persisted as Parquet files

#### Incremental Updates in the Lakehouse
Delta Lake allows users to easily combine streaming and batch workloads in a unified multi-stage pipeline, wherein each stage of the pipeline represents a state of our data valuable to driving core use cases within the business. With all data and metadata resident in object storage in the cloud, multiple users and applications can access data in near-real time, allowing analysts to access the freshest data as it's being processed.

![](https://files.training.databricks.com/images/sslh/multi-hop-simple.png)

- **Bronze** tables contain raw data ingested from various sources (JSON files, RDBMS data, IoT data, to name a few examples).
- **Silver** tables provide a more refined view of our data. We can join fields from various bronze tables to enrich streaming records, or update account statuses based on recent activity.
- **Gold** tables provide business-level aggregates often used for reporting and dashboarding. This would include aggregations such as daily active website users, weekly sales per store, or gross revenue per quarter by department.

The end outputs are actionable insights, dashboards, and reports of business metrics. By considering our business logic at every step of the ETL pipeline, we can keep storage and compute costs down by reducing unnecessary duplication of data and limiting ad hoc querying against full historic data. Each stage can be configured as a batch or streaming job, and with Delta Lake, ACID transactions ensure that each write either succeeds or fails completely.
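
The three hops can be sketched in a few lines of plain Python before turning to Spark. This is only an illustration under stated assumptions: the records, the `patients` lookup, and the patient names are made up, and lists and dictionaries stand in for tables.

```python
import math
from collections import defaultdict

# Bronze: raw records kept as ingested (every value is still a string).
bronze = [
    {"device_id": "1", "mrn": "100", "time": "1601510459.7", "heartrate": "60.5"},
    {"device_id": "2", "mrn": "200", "time": "1601510873.6", "heartrate": "72.0"},
    {"device_id": "1", "mrn": "100", "time": "1601510999.0", "heartrate": "61.0"},
]

# Static lookup table of hypothetical patients, keyed by mrn.
patients = {100: "Patient A", 200: "Patient B"}

# Silver: cast to proper types and enrich each record with the patient name.
silver = [
    {
        "device_id": int(r["device_id"]),
        "mrn": int(r["mrn"]),
        "time": float(r["time"]),
        "heartrate": float(r["heartrate"]),
        "name": patients[int(r["mrn"])],
    }
    for r in bronze
    if int(r["mrn"]) in patients  # inner-join semantics: drop unmatched rows
]

# Gold: business-level aggregate per patient.
rates = defaultdict(list)
for r in silver:
    rates[r["name"]].append(r["heartrate"])

gold = {
    name: {"total_recordings": len(v), "avg_heartrate": math.ceil(sum(v) / len(v))}
    for name, v in rates.items()
}
print(gold)
```

Each layer is derived only from the one before it, so any layer can be rebuilt from its predecessor without going back to the original sources.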

### Lab Details:
This lab demonstrates ingesting artificially generated medical data in JSON format. The data simulates heart rate monitor signals captured from numerous devices, so it represents what a streaming data source would be expected to deliver.

#### Datasets Used:
The schemas of our two datasets are shown below. Note that we will be manipulating these schemas during various steps.

##### Recordings:
The main dataset uses heart rate recordings from medical devices delivered in the JSON format.

| Field | Type |
| --- | --- |
| device_id | int |
| mrn | long |
| time | double |
| heartrate | double |

##### Personally Identifiable Information (PII):
These data will later be joined with a static table of patient information stored in an external system to identify patients.

| Field | Type |
| --- | --- |
| mrn | long |
| name | string |

### 1.0. Prerequisites:

In [1]:
import findspark
findspark.init()
findspark.find()

'C:\\spark-3.5.4-bin-hadoop3'

#### 1.1. Import Required Libraries

In [2]:
import os
import sys
import json
import shutil

from pyspark import SparkConf
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *

#### 1.2. Instantiate Global Variables

In [3]:
# --------------------------------------------------------------------------------
# Specify Directory Structure for Source Data
# --------------------------------------------------------------------------------
base_dir = os.path.join(os.getcwd(), 'lab_data')
data_dir = os.path.join(base_dir, 'healthcare')
batch_dir = os.path.join(data_dir, 'batch')
stream_dir = os.path.join(data_dir, 'streaming')
tracker_stream_dir = os.path.join(stream_dir, 'tracker')

# --------------------------------------------------------------------------------
# Create Directory Structure for Data Lakehouse Files
# --------------------------------------------------------------------------------
dest_database = "healthcare_dlh"
sql_warehouse_dir = os.path.abspath('spark-warehouse')
database_dir = os.path.join(sql_warehouse_dir, dest_database)

output_bronze = os.path.join(database_dir, 'bronze')
output_silver = os.path.join(database_dir, 'silver')
output_gold = os.path.join(database_dir, 'gold')

patient_output_bronze = os.path.join(output_bronze, 'dim_patient')
heartbeat_output_bronze = os.path.join(output_bronze, 'fact_heartbeat')
heartbeat_output_silver = os.path.join(output_silver, 'fact_heartbeat')
heartbeat_output_gold = os.path.join(output_gold, 'fact_heartbeat')

#### 1.3. Define Global Functions

In [4]:
def remove_directory_tree(path: str):
    '''If it exists, remove the entire contents of a directory structure at a given 'path' parameter's location.'''
    try:
        if os.path.exists(path):
            shutil.rmtree(path)
            return f"Directory '{path}' has been removed successfully."
        else:
            return f"Directory '{path}' does not exist."
            
    except Exception as e:
        return f"An error occurred: {e}"
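
A short usage sketch of this helper follows. It repeats the function so that the example runs on its own, and it works in a temporary directory with a hypothetical folder name rather than the lab's real paths. Calling the helper twice shows why it is safe to re-run: the second call finds nothing to delete and simply reports that.

```python
import os
import shutil
import tempfile

def remove_directory_tree(path: str):
    """Same logic as the helper above, repeated so the example is self-contained."""
    try:
        if os.path.exists(path):
            shutil.rmtree(path)
            return f"Directory '{path}' has been removed successfully."
        else:
            return f"Directory '{path}' does not exist."
    except Exception as e:
        return f"An error occurred: {e}"

# Build a throwaway directory tree (hypothetical names) to remove.
scratch = tempfile.mkdtemp()
target = os.path.join(scratch, "healthcare_dlh_demo")
os.makedirs(os.path.join(target, "bronze"))

first = remove_directory_tree(target)   # directory exists, so it is removed
second = remove_directory_tree(target)  # already gone, so nothing happens

print(first)
print(second)

shutil.rmtree(scratch)  # clean up the scratch location
```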

#### 1.4. Create a New Spark Session

In [5]:
worker_threads = f"local[{max(1, os.cpu_count() // 2)}]"  # half the cores, but at least one thread
shuffle_partitions = int(os.cpu_count())

sparkConf = SparkConf().setAppName('PySpark Parquet Data Lakehouse Heartrate Monitor')\
    .setMaster(worker_threads)\
    .set('spark.driver.memory', '4g') \
    .set('spark.executor.memory', '2g')\
    .set('spark.sql.adaptive.enabled', 'false') \
    .set('spark.sql.shuffle.partitions', shuffle_partitions) \
    .set('spark.sql.streaming.forceDeleteTempCheckpointLocation', 'true') \
    .set("spark.sql.streaming.schemaInference", "true") \
    .set('spark.sql.warehouse.dir', database_dir) \
    .set('spark.streaming.stopGracefullyOnShutdown', 'true')

spark = SparkSession.builder.config(conf=sparkConf).getOrCreate()
spark.sparkContext.setLogLevel("OFF")
spark
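
The master string in this configuration is derived from `os.cpu_count()`. Below is a minimal sketch of that derivation, written as a hypothetical helper named `local_master` (not part of the lab code) that also covers two edge cases: a single-core machine, and `os.cpu_count()` returning `None`.

```python
import os

def local_master(cpu_count):
    """Build a Spark 'local[N]' master string that uses half the cores,
    but never fewer than one thread."""
    cores = cpu_count or 1  # os.cpu_count() may return None
    return f"local[{max(1, cores // 2)}]"

print(local_master(os.cpu_count()))  # depends on the machine
print(local_master(8))
print(local_master(1))
print(local_master(None))
```

Using half of the available cores leaves headroom for the notebook kernel and the operating system while Spark runs locally.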

#### 1.5. Initialize Data Lakehouse Directory Structure
Remove the data lakehouse database directory so that the notebook starts from a clean state each time it is run (idempotency).

In [6]:
remove_directory_tree(database_dir)

"Directory 'C:\\Users\\jtupi\\Documents\\UVA\\DS-2002-Teacher\\04-PySpark\\spark-warehouse\\healthcare_dlh' has been removed successfully."

### 2.0. Create Bronze Layer
#### 2.1. Read a Batch of Patient dimension data from a CSV file

In [7]:
patient_csv = os.path.join(batch_dir, 'patient_info.csv')
print(patient_csv)

C:\Users\jtupi\Documents\UVA\DS-2002-Teacher\04-PySpark\lab_data\healthcare\batch\patient_info.csv


In [8]:
df_patient = spark.read.format('csv').options(header='true', inferSchema='true').load(patient_csv)

In [9]:
# Unit Test ---------------------------------------------------------
df_patient.printSchema()
print(f"The 'df_patients' table contains {df_patient.count()} rows.")
df_patient.limit(5).show()

root
 |-- mrn: integer (nullable = true)
 |-- name: string (nullable = true)

The 'df_patients' table contains 34 rows.
+--------+---------------+
|     mrn|           name|
+--------+---------------+
|23940128| Caitlin Garcia|
|18064290|  Anthony Perez|
|95384990|     Tanya Diaz|
|53057176|Autumn Calderon|
|96005424|   Ronald Smith|
+--------+---------------+



#### 2.2. Persist the Patient dimension data to a Parquet file 

In [10]:
df_patient.write.mode("overwrite").parquet(patient_output_bronze)

#### 2.3. Read Bronze Patient data into a Temporary View

In [11]:
spark.read.parquet(patient_output_bronze).createOrReplaceTempView("dim_patients_tempvw")

In [12]:
# Unit Test -------------------------------------------
spark.sql("SELECT * FROM dim_patients_tempvw LIMIT 5").show()

+--------+---------------+
|     mrn|           name|
+--------+---------------+
|23940128| Caitlin Garcia|
|18064290|  Anthony Perez|
|95384990|     Tanya Diaz|
|53057176|Autumn Calderon|
|96005424|   Ronald Smith|
+--------+---------------+



#### 2.4. Use Structured Streaming to Read Heartrate Monitor data
##### 2.4.1. Read data from a series of JSON source files into an In-Memory Streaming DataFrame

In [13]:
df_tracker = (
    spark.readStream
    .option("schemaLocation", heartbeat_output_bronze)
    .option("maxFilesPerTrigger", 1)   # read one source file per micro-batch
    .option("multiLine", "true")       # a JSON record may span multiple lines
    .json(tracker_stream_dir)
)

df_tracker.isStreaming

True

##### 2.4.2. Unit Test

In [14]:
df_tracker.printSchema()

root
 |-- device_id: string (nullable = true)
 |-- heartrate: string (nullable = true)
 |-- mrn: string (nullable = true)
 |-- time: string (nullable = true)



#### 2.5. Write Data from Streaming DataFrame into a Parquet output

In [15]:
tracker_checkpoint_bronze = os.path.join(heartbeat_output_bronze, '_checkpoint')

bronze_query = (
    df_tracker.writeStream
    .format("parquet")
    .outputMode("append")
    .queryName("heartbeat_tracker")
    .trigger(availableNow=True)
    .option("checkpointLocation", tracker_checkpoint_bronze)
    .option("compression", "snappy")
    .start(heartbeat_output_bronze)
)
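
The `checkpointLocation` is what lets a restarted query resume where it left off instead of reprocessing files. The plain-Python sketch below imitates that idea only. It is not Spark's checkpoint format, and `run_available_now`, the JSON checkpoint file, and the file names are all hypothetical. It processes at most `max_files_per_trigger` new files per micro-batch, records what it has handled, and stops once nothing new is available, much like `trigger(availableNow=True)`.

```python
import json
import os
import tempfile

def run_available_now(source_files, checkpoint_path, max_files_per_trigger=1):
    """Process files not yet recorded in the checkpoint, a few per batch,
    and return the list of batches processed in this run."""
    # Load the names of files handled by earlier runs, if any.
    seen = []
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            seen = json.load(f)

    pending = [name for name in sorted(source_files) if name not in seen]
    batches = []
    while pending:
        batch = pending[:max_files_per_trigger]
        pending = pending[max_files_per_trigger:]
        batches.append(batch)
        seen.extend(batch)
        # Persist progress after every batch so a restart can resume here.
        with open(checkpoint_path, "w") as f:
            json.dump(seen, f)
    return batches

checkpoint = os.path.join(tempfile.mkdtemp(), "checkpoint.json")

first_run = run_available_now(["a.json", "b.json"], checkpoint)
second_run = run_available_now(["a.json", "b.json", "c.json"], checkpoint)

print(first_run)   # two batches of one file each
print(second_run)  # only the file that arrived after the first run
```

Deleting the checkpoint in this sketch would cause every file to be processed again, which is why the lab removes the whole database directory, checkpoint included, when it resets.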

##### 2.5.1. Monitor Query Execution

In [16]:
print(f"Query ID: {bronze_query.id}")
print(f"Query Name: {bronze_query.name}")
print(f"Query Status: {bronze_query.status}")
print(f"Last Progress: {bronze_query.lastProgress}")

Query ID: fcc99b9d-0f85-466a-9caf-d036b5fbc588
Query Name: heartbeat_tracker
Query Status: {'message': 'Processing new data', 'isDataAvailable': True, 'isTriggerActive': True}
Last Progress: {'id': 'fcc99b9d-0f85-466a-9caf-d036b5fbc588', 'runId': '4512b489-ba8f-4396-93ee-7d8e7133131a', 'name': 'heartbeat_tracker', 'timestamp': '2025-03-17T16:10:50.242Z', 'batchId': 1, 'numInputRows': 8294, 'inputRowsPerSecond': 5771.746694502435, 'processedRowsPerSecond': 8228.174603174602, 'durationMs': {'addBatch': 424, 'commitOffsets': 174, 'getBatch': 20, 'latestOffset': 185, 'queryPlanning': 12, 'triggerExecution': 1008, 'walCommit': 188}, 'stateOperators': [], 'sources': [{'description': 'FileStreamSource[file:/C:/Users/jtupi/Documents/UVA/DS-2002-Teacher/04-PySpark/lab_data/healthcare/streaming/tracker]', 'startOffset': {'logOffset': 0}, 'endOffset': {'logOffset': 1}, 'latestOffset': None, 'numInputRows': 8294, 'inputRowsPerSecond': 5771.746694502435, 'processedRowsPerSecond': 8228.174603174602}
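
In the output above, `lastProgress` is printed as a plain dictionary, so individual metrics can be read by key. The sketch below copies three values from that output and checks how they relate. In this run the reported `processedRowsPerSecond` appears to equal the input row count divided by the trigger execution time; that is an observation from the numbers shown, not a statement about Spark internals.

```python
# Values copied from the 'Last Progress' output above.
progress = {
    "numInputRows": 8294,
    "processedRowsPerSecond": 8228.174603174602,
    "durationMs": {"triggerExecution": 1008},
}

trigger_seconds = progress["durationMs"]["triggerExecution"] / 1000
rows_per_second = progress["numInputRows"] / trigger_seconds

print(f"Batch took {trigger_seconds:.3f} s")
print(f"Derived rate:  {rows_per_second:.2f} rows/s")
print(f"Reported rate: {progress['processedRowsPerSecond']:.2f} rows/s")
```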

In [17]:
bronze_query.awaitTermination()

### 3.0. Create Silver Layer
#### 3.1. Read Bronze Streaming Data into a Temporary View

In [18]:
spark.read.parquet(heartbeat_output_bronze).createOrReplaceTempView("fact_heartbeat_bronze_tempvw")

In [19]:
# Unit Test ---------------------------------------------------
spark.sql("SELECT * FROM fact_heartbeat_bronze_tempvw LIMIT 5").show()

+---------+-------------+--------+------------------+
|device_id|    heartrate|     mrn|              time|
+---------+-------------+--------+------------------+
|       24|44.4530588466|53057176|1601510459.7677453|
|       31|68.7408647233|70379340|1601510873.6363325|
|       13|61.5566117017|55527081|1601510999.0770276|
|        9| 47.628456003|15902097| 1601511002.711981|
|       15|53.2287944137|97376381|1601511277.3706536|
+---------+-------------+--------+------------------+



#### 3.2. Persist the Silver table to the Lakehouse
##### 3.2.1. Define Silver Query to Join Streaming with Batch Data

In [20]:
silver_query = """
    SELECT CAST(h.device_id AS int) AS device_id
        , CAST(h.heartrate AS double) AS heartrate
        , CAST(h.mrn AS long) AS mrn
        , CAST((h.time/1000000) AS timestamp) AS datetime
        , p.name
    FROM fact_heartbeat_bronze_tempvw AS h
    INNER JOIN dim_patients_tempvw AS p
    ON CAST(p.mrn AS int) = h.mrn
    """
spark.sql(silver_query).write.mode("overwrite").parquet(heartbeat_output_silver)

### 4.0. Create Gold Layer
#### 4.1. Read Silver Streaming Data into a Temporary View

In [21]:
spark.read.parquet(heartbeat_output_silver).createOrReplaceTempView("fact_heartbeat_silver_tempvw")

In [22]:
# Unit Test ----------------------------------------------------
spark.sql("SELECT * FROM fact_heartbeat_silver_tempvw").show(5)

+---------+-------------+--------+--------------------+--------------+
|device_id|    heartrate|     mrn|            datetime|          name|
+---------+-------------+--------+--------------------+--------------+
|       35|54.0083778414|27831169|1969-12-31 19:26:...|Ashley Schmidt|
|       14|59.7276487851|84682617|1969-12-31 19:26:...|     Kyle Cruz|
|        7|55.5989222294|41675882|1969-12-31 19:26:...|    Crystal Ho|
|        6|53.1187476559|88104185|1969-12-31 19:26:...| George Wagner|
|       10|49.0628332598|18064290|1969-12-31 19:26:...| Anthony Perez|
+---------+-------------+--------+--------------------+--------------+
only showing top 5 rows
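
The `datetime` values above all fall on 1969-12-31 at about 19:26. The bronze `time` values (roughly 1.6 billion) look like seconds since the Unix epoch, and the silver query divides them by 1,000,000 before casting, which leaves only about 1,600 seconds. The plain-Python check below uses one bronze value, interpreted in UTC, to show the difference. The 19:26 in the output is consistent with the same instant rendered in a UTC-5 session time zone, which is an assumption about the machine that produced it.

```python
from datetime import datetime, timezone

# A 'time' value taken from the bronze output shown earlier.
epoch_seconds = 1601510459.7677453

# What the silver query computes: the value is divided by 1,000,000 first.
divided = datetime.fromtimestamp(epoch_seconds / 1_000_000, tz=timezone.utc)

# Interpreting the value directly as seconds since the Unix epoch.
direct = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)

print(divided)  # about 26 minutes after 1970-01-01 00:00 UTC
print(direct)   # a moment in 2020
```

If 2020 timestamps are what is intended, one option is to cast `h.time` to `double` and then to `timestamp` without the division.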



#### 4.2. Persist the Gold table to the Lakehouse
##### 4.2.1. Define Gold Query to Perform an Aggregation

In [23]:
gold_query = """
    SELECT name AS patient_name
        , COUNT(device_id) AS total_recordings
        , CEILING(AVG(heartrate)) AS avg_heartrate
    FROM fact_heartbeat_silver_tempvw
    GROUP BY patient_name
    ORDER BY avg_heartrate DESC
    """
spark.sql(gold_query).write.mode("overwrite").parquet(heartbeat_output_gold)

##### 4.2.2. Read the Gold Table into a Temporary View

In [24]:
spark.read.parquet(heartbeat_output_gold).createOrReplaceTempView("fact_heartbeat_gold_tempvw")

#### 4.3. Display the Gold table

In [25]:
report_query = """
    SELECT patient_name
        , total_recordings
        , avg_heartrate
    FROM fact_heartbeat_gold_tempvw
    """
spark.sql(report_query).toPandas().head()

| | patient_name | total_recordings | avg_heartrate |
| --- | --- | --- | --- |
| 0 | Anthony Perez | 2850 | 75 |
| 1 | Dr. Amanda Baxter | 4640 | 75 |
| 2 | Melissa Martinez | 2487 | 75 |
| 3 | Crystal Ho | 2973 | 75 |
| 4 | Autumn Calderon | 1819 | 75 |


In [26]:
spark.stop()