## Demo: Medallion Architecture Using Tables and Queried with the PySpark SQL API
### Overview
Structured Streaming is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine. You can express your streaming computation the same way you would express a batch computation on static data. The Spark SQL engine will take care of running it incrementally and continuously and updating the final result as streaming data continues to arrive. You can use the Dataset/DataFrame API in Scala, Java, Python or R to express streaming aggregations, event-time windows, stream-to-batch joins, etc. The computation is executed on the same optimized Spark SQL engine. Finally, the system ensures end-to-end exactly-once fault-tolerance guarantees through checkpointing and Write-Ahead Logs. In short, Structured Streaming provides fast, scalable, fault-tolerant, end-to-end exactly-once stream processing without the user having to reason about streaming.

Internally, by default, Structured Streaming queries are processed using a micro-batch processing engine, which processes data streams as a series of small batch jobs thereby achieving end-to-end latencies as low as 100 milliseconds and exactly-once fault-tolerance guarantees. However, since Spark 2.3, we have introduced a new low-latency processing mode called Continuous Processing, which can achieve end-to-end latencies as low as 1 millisecond with at-least-once guarantees. Without changing the Dataset/DataFrame operations in your queries, you will be able to choose the mode based on your application requirements.

### Lab Details:

This lab will demonstrate ingesting artificially generated medical data, in JSON format, that simulates heart rate monitor signals captured from numerous devices; therefore, this data represents what would be expected from a Streaming data source.

#### Datasets Used:
The schema of our two datasets is represented below. Note that we will be manipulating these schema during various steps.

##### Recordings:
The main dataset uses heart rate recordings from medical devices delivered in the JSON format.

| Field | Type |
| --- | --- |
| device_id | int |
| mrn | long |
| time | double |
| heartrate | double |

##### Personally Identifiable Information (PII):
These data will later be joined with a static table of patient information stored in an external system to identify patients.

| Field | Type |
| --- | --- |
| mrn | long |
| name | string |

### Prerequisites:

In [1]:
import findspark
findspark.init()
findspark.find()

'C:\\spark-3.5.4-bin-hadoop3'

#### Import Required Libraries

In [2]:
import os
import sys
import json
import shutil

from pyspark import SparkConf
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *

#### Instantiate Global Variables

In [3]:
# --------------------------------------------------------------------------------
# Specify Directory Structure for Source Data
# --------------------------------------------------------------------------------
base_dir = os.path.join(os.getcwd(), 'lab_data')
data_dir = os.path.join(base_dir, 'healthcare')
batch_dir = os.path.join(data_dir, 'batch')
stream_dir = os.path.join(data_dir, 'streaming')
tracker_stream_dir = os.path.join(stream_dir, 'tracker')

# --------------------------------------------------------------------------------
# Create Directory Structure for Data Lakehouse Files
# --------------------------------------------------------------------------------
dest_database = "healthcare_dlh"
sql_warehouse_dir = os.path.abspath('spark-warehouse')
database_dir = os.path.join(sql_warehouse_dir, dest_database)

patient_output_bronze = os.path.join(database_dir, 'dim_patient')
heartbeat_output_bronze = os.path.join(database_dir, 'fact_heartbeat_bronze')
heartbeat_output_silver = os.path.join(database_dir, 'fact_heartbeat_silver')
heartbeat_output_gold = os.path.join(database_dir, 'fact_heartbeat_gold')

#### Define Global Functions

In [4]:
def remove_directory_tree(path: str):
    '''If it exists, remove the entire contents of a directory structure at a given 'path' parameter's location.'''
    try:
        if os.path.exists(path):
            shutil.rmtree(path)
            return f"Directory '{path}' has been removed successfully."
        else:
            return f"Directory '{path}' does not exist."
            
    except Exception as e:
        return f"An error occurred: {e}"

In [5]:
remove_directory_tree(f"{database_dir}.db")

"Directory 'C:\\Users\\jtupi\\Documents\\UVA\\DS-2002-Teacher\\04-PySpark\\spark-warehouse\\healthcare_dlh.db' does not exist."

#### Create a New Spark Session

In [6]:
worker_threads = f"local[{int(os.cpu_count()/2)}]"
shuffle_partitions = int(os.cpu_count())

sparkConf = SparkConf().setAppName('PySpark Heartrate Monitor in Juptyer')\
    .setMaster(worker_threads)\
    .set('spark.driver.memory', '2g') \
    .set('spark.executor.memory', '3g')\
    .set('spark.sql.adaptive.enabled', 'false') \
    .set('spark.sql.shuffle.partitions', shuffle_partitions) \
    .set('spark.sql.streaming.forceDeleteTempCheckpointLocation', 'true') \
    .set('spark.sql.streaming.schemaInference', 'true') \
    .set('spark.sql.warehouse.dir', sql_warehouse_dir) \
    .set('spark.streaming.stopGracefullyOnShutdown', 'true')

spark = SparkSession.builder.config(conf=sparkConf).getOrCreate()
spark

### Create Bronze Layer
#### Read a Batch of Patient dimension data from a CSV file

In [7]:
patient_csv = os.path.join(batch_dir, 'patient_info.csv')

# Unit Test -------
print(patient_csv)

C:\Users\jtupi\Documents\UVA\DS-2002-Teacher\04-PySpark\lab_data\healthcare\batch\patient_info.csv


In [8]:
df_patient = spark.read.format('csv').options(header='true', inferSchema='true').load(patient_csv)

# Unit Test ----------------------------------------------------------
df_patient.printSchema()
print(f"The 'df_patients' table contains {df_patient.count()} rows.")
df_patient.show(5)

root
 |-- mrn: integer (nullable = true)
 |-- name: string (nullable = true)

The 'df_patients' table contains 34 rows.
+--------+---------------+
|     mrn|           name|
+--------+---------------+
|23940128| Caitlin Garcia|
|18064290|  Anthony Perez|
|95384990|     Tanya Diaz|
|53057176|Autumn Calderon|
|96005424|   Ronald Smith|
+--------+---------------+
only showing top 5 rows



#### Persist the Patient dimension data into a Lakehouse table 
##### Create a New Data Lakehouse Database

In [9]:
spark.sql(f"DROP DATABASE IF EXISTS {dest_database} CASCADE;")
spark.sql(f"CREATE DATABASE {dest_database};")

DataFrame[]

##### Create the 'dim_patients' Dimension table

In [10]:
df_patient.write.saveAsTable(f"{dest_database}.dim_patients", mode="overwrite")

In [11]:
# Unit Test ------------------------------------------------------------
spark.sql(f"DESCRIBE EXTENDED {dest_database}.dim_patients;").show()
spark.sql(f"SELECT * FROM {dest_database}.dim_patients LIMIT 5").show()

+--------------------+--------------------+-------+
|            col_name|           data_type|comment|
+--------------------+--------------------+-------+
|                 mrn|                 int|   NULL|
|                name|              string|   NULL|
|                    |                    |       |
|# Detailed Table ...|                    |       |
|             Catalog|       spark_catalog|       |
|            Database|      healthcare_dlh|       |
|               Table|        dim_patients|       |
|        Created Time|Mon Mar 17 13:30:...|       |
|         Last Access|             UNKNOWN|       |
|          Created By|         Spark 3.5.4|       |
|                Type|             MANAGED|       |
|            Provider|             parquet|       |
|            Location|file:/C:/Users/jt...|       |
+--------------------+--------------------+-------+

+--------+---------------+
|     mrn|           name|
+--------+---------------+
|23940128| Caitlin Garcia|
|180642

#### Use Structured Streaming to Read Heartrate Monitor data
##### Read data from a series of JSON source files into a streaming DataFrame

In [12]:
df_tracker = (spark.readStream \
             .option("schemaLocation", heartbeat_output_bronze) \
             .option("maxFilesPerTrigger", 1) \
             .option("multiLine", "true") \
             .json(tracker_stream_dir)
            )

df_tracker.isStreaming

True

In [13]:
# Unit Test -------------
print(type(df_tracker))
df_tracker.printSchema()

<class 'pyspark.sql.dataframe.DataFrame'>
root
 |-- device_id: string (nullable = true)
 |-- heartrate: string (nullable = true)
 |-- mrn: string (nullable = true)
 |-- time: string (nullable = true)



#### Write data from streaming DataFrame into a Streaming Table

In [14]:
tracker_checkpoint_bronze = os.path.join(heartbeat_output_bronze, '_checkpoint')

bronze_query = (df_tracker.writeStream \
                .outputMode("append") \
                .queryName("heartbeat_tracker_bronze")
                .trigger(availableNow = True) \
                .option("checkpointLocation", tracker_checkpoint_bronze) \
                .option("compression", "snappy") \
                .toTable(f"{dest_database}.fact_heartrate_bronze")
               )

In [15]:
# Unit Test ----------------------------------
print(f"Query ID: {bronze_query.id}")
print(f"Query Name: {bronze_query.name}")
print(f"Query Status: {bronze_query.status}")
print(f"Last Progress: {bronze_query.lastProgress}")

Query ID: 3cbe32b3-04c0-45d8-be8b-e2ba2633556b
Query Name: heartbeat_tracker_bronze
Query Status: {'message': 'Stopped', 'isDataAvailable': False, 'isTriggerActive': False}
Last Progress: {'id': '3cbe32b3-04c0-45d8-be8b-e2ba2633556b', 'runId': '50b44a22-380a-472c-ac60-f8db1fb378d9', 'name': 'heartbeat_tracker_bronze', 'timestamp': '2025-03-17T17:31:49.055Z', 'batchId': 11, 'numInputRows': 5725, 'inputRowsPerSecond': 6483.578708946772, 'processedRowsPerSecond': 6243.184296619411, 'durationMs': {'addBatch': 346, 'commitOffsets': 199, 'getBatch': 11, 'latestOffset': 175, 'queryPlanning': 6, 'triggerExecution': 917, 'walCommit': 176}, 'stateOperators': [], 'sources': [{'description': 'FileStreamSource[file:/C:/Users/jtupi/Documents/UVA/DS-2002-Teacher/04-PySpark/lab_data/healthcare/streaming/tracker]', 'startOffset': {'logOffset': 10}, 'endOffset': {'logOffset': 11}, 'latestOffset': None, 'numInputRows': 5725, 'inputRowsPerSecond': 6483.578708946772, 'processedRowsPerSecond': 6243.18429661

In [16]:
bronze_query.awaitTermination()

In [17]:
# Unit Test ------------------------------------------------------------------ 
spark.sql(f"DESCRIBE EXTENDED {dest_database}.fact_heartrate_bronze;").show()
spark.table(f"{dest_database}.fact_heartrate_bronze").show(5)

+--------------------+--------------------+-------+
|            col_name|           data_type|comment|
+--------------------+--------------------+-------+
|           device_id|              string|   NULL|
|           heartrate|              string|   NULL|
|                 mrn|              string|   NULL|
|                time|              string|   NULL|
|                    |                    |       |
|# Detailed Table ...|                    |       |
|             Catalog|       spark_catalog|       |
|            Database|      healthcare_dlh|       |
|               Table|fact_heartrate_br...|       |
|        Created Time|Mon Mar 17 13:31:...|       |
|         Last Access|             UNKNOWN|       |
|          Created By|         Spark 3.5.4|       |
|                Type|             MANAGED|       |
|            Provider|             parquet|       |
|            Location|file:/C:/Users/jt...|       |
+--------------------+--------------------+-------+

+---------+

### Create Silver Layer
#### Define Silver Query to Join Streaming with Batch Data

In [18]:
df_silver = spark.table(f"{dest_database}.fact_heartrate_bronze") \
    .join(df_patient, "mrn") \
    .select(col("device_id").cast(IntegerType()), \
            col("mrn").cast(LongType()), \
            (col("time")/1e6).cast(TimestampType()).alias("datetime"), \
            from_unixtime("time", "MM/dd/yyyy").alias("date"), \
            from_unixtime("time", "hh:mm:ss a z").alias("time"), \
            col("heartrate").cast(DoubleType()), \
            col("name").alias("patient_name")
           )

In [19]:
# Unit Test ------------
df_silver.printSchema()
df_silver.show(5)

root
 |-- device_id: integer (nullable = true)
 |-- mrn: long (nullable = true)
 |-- datetime: timestamp (nullable = true)
 |-- date: string (nullable = true)
 |-- time: string (nullable = true)
 |-- heartrate: double (nullable = true)
 |-- patient_name: string (nullable = true)

+---------+--------+--------------------+----------+---------------+-------------+---------------+
|device_id|     mrn|            datetime|      date|           time|    heartrate|   patient_name|
+---------+--------+--------------------+----------+---------------+-------------+---------------+
|       24|53057176|1969-12-31 19:26:...|09/30/2020|08:00:59 PM EDT|44.4530588466|Autumn Calderon|
|       31|70379340|1969-12-31 19:26:...|09/30/2020|08:07:53 PM EDT|68.7408647233| Robert Vincent|
|       13|55527081|1969-12-31 19:26:...|09/30/2020|08:09:59 PM EDT|61.5566117017|    George King|
|        9|15902097|1969-12-31 19:26:...|09/30/2020|08:10:02 PM EDT| 47.628456003|     John Smith|
|       15|97376381|1969-1

#### Persist Silver Data to a Table in the Lakehouse

In [20]:
silver_table = f"{dest_database}.fact_heartrate_silver"
df_silver.write.saveAsTable(silver_table, mode="overwrite")

In [21]:
# Unit Test -------------------------------------------
spark.sql(f"DESCRIBE EXTENDED {silver_table};").show()
spark.table(silver_table).show(5)

+--------------------+--------------------+-------+
|            col_name|           data_type|comment|
+--------------------+--------------------+-------+
|           device_id|                 int|   NULL|
|                 mrn|              bigint|   NULL|
|            datetime|           timestamp|   NULL|
|                date|              string|   NULL|
|                time|              string|   NULL|
|           heartrate|              double|   NULL|
|        patient_name|              string|   NULL|
|                    |                    |       |
|# Detailed Table ...|                    |       |
|             Catalog|       spark_catalog|       |
|            Database|      healthcare_dlh|       |
|               Table|fact_heartrate_si...|       |
|        Created Time|Mon Mar 17 13:32:...|       |
|         Last Access|             UNKNOWN|       |
|          Created By|         Spark 3.5.4|       |
|                Type|             MANAGED|       |
|           

### Create Gold Layer
#### Define Gold Query to Perform an Aggregation

In [22]:
df_gold = df_silver.groupBy("patient_name") \
    .agg((ceiling(avg("heartrate")).alias("avg_heartrate")), \
        (count("device_id").alias("count"))) \
    .orderBy(desc("avg_heartrate"))

In [23]:
# Unit Test -----
df_gold.toPandas().head()

Unnamed: 0,patient_name,avg_heartrate,count
0,William Gomez Jr.,104,1069
1,Robert Vincent,100,3171
2,Sean Brown,93,3354
3,Mary Adams,93,3801
4,Tanya Diaz,92,2153


#### Persist Gold Data to a Table in the Lakehouse

In [24]:
gold_table = f"{dest_database}.fact_heartrate_gold"
df_gold.write.saveAsTable(gold_table, mode="overwrite")

In [25]:
# Unit Test ------------------------------------------------------------
spark.table(gold_table).toPandas().head()

Unnamed: 0,patient_name,avg_heartrate,count
0,Anthony Perez,75,2850
1,Dr. Amanda Baxter,75,4640
2,Melissa Martinez,75,2487
3,Crystal Ho,75,2973
4,Autumn Calderon,75,1819


#### Display the Gold table

In [26]:
spark.table(gold_table) \
    .select("patient_name", "avg_heartrate", "count") \
    .orderBy(asc("avg_heartrate")).toPandas()

Unnamed: 0,patient_name,avg_heartrate,count
0,John Smith,69,2962
1,Troy Davis,70,3463
2,Anthony Perez,75,2850
3,Dr. Amanda Baxter,75,4640
4,Melissa Martinez,75,2487
5,Crystal Ho,75,2973
6,Autumn Calderon,75,1819
7,Cynthia Figueroa,76,1409
8,Joshua Harris,78,3381
9,Ashley Schmidt,79,3234


In [27]:
spark.stop()