## Demo: Medallion Architecture Using Parquet Files, Queried with the PySpark SQL API
### Overview
Structured Streaming is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine. You can express your streaming computation the same way you would express a batch computation on static data. The Spark SQL engine will take care of running it incrementally and continuously and updating the final result as streaming data continues to arrive. You can use the Dataset/DataFrame API in Scala, Java, Python or R to express streaming aggregations, event-time windows, stream-to-batch joins, etc. The computation is executed on the same optimized Spark SQL engine. Finally, the system ensures end-to-end exactly-once fault-tolerance guarantees through checkpointing and Write-Ahead Logs. In short, Structured Streaming provides fast, scalable, fault-tolerant, end-to-end exactly-once stream processing without the user having to reason about streaming.

Internally, by default, Structured Streaming queries are processed using a micro-batch processing engine, which processes data streams as a series of small batch jobs, thereby achieving end-to-end latencies as low as 100 milliseconds and exactly-once fault-tolerance guarantees. Since Spark 2.3, a low-latency processing mode called Continuous Processing has also been available, which can achieve end-to-end latencies as low as 1 millisecond with at-least-once guarantees. Without changing the Dataset/DataFrame operations in your queries, you can choose the mode based on your application requirements.
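
To make the incremental execution model concrete, here is a conceptual sketch in plain Python. It is not Spark code and does not claim to reflect Spark's internals. It keeps a running sum and count per key, folds in one micro-batch at a time, and arrives at the same averages that a single batch computation produces. The sample readings and the helper names `update_state` and `averages` are hypothetical.

```python
from collections import defaultdict


def update_state(state, micro_batch):
    """Fold one micro-batch of (key, value) pairs into the running state."""
    for key, value in micro_batch:
        total, count = state[key]
        state[key] = (total + value, count + 1)
    return state


def averages(state):
    """Derive the current result table from the running state."""
    return {key: total / count for key, (total, count) in state.items()}


# Hypothetical heart rate readings arriving as three micro-batches.
micro_batches = [
    [("patient_a", 60.0), ("patient_b", 80.0)],
    [("patient_a", 70.0)],
    [("patient_b", 90.0), ("patient_a", 80.0)],
]

# Streaming style: update the state as each micro-batch arrives.
state = defaultdict(lambda: (0.0, 0))
for batch in micro_batches:
    update_state(state, batch)
streaming_result = averages(state)

# Batch style: process all of the data in one pass.
all_rows = [row for batch in micro_batches for row in batch]
batch_result = averages(update_state(defaultdict(lambda: (0.0, 0)), all_rows))

print(streaming_result)
print(streaming_result == batch_result)
```

The point of the sketch is only that the result table can be maintained incrementally; Spark additionally handles fault tolerance, late data, and state storage.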

### Lab Details:

This lab demonstrates ingesting artificially generated medical data in JSON format. The data simulates heart rate monitor signals captured from numerous devices, so it represents what would be expected from a streaming data source.

#### Datasets Used:
The schemas of our two datasets are shown below. Note that we will be manipulating these schemas during various steps.

##### Recordings:
The main dataset uses heart rate recordings from medical devices delivered in the JSON format.

| Field | Type |
| --- | --- |
| device_id | int |
| mrn | long |
| time | double |
| heartrate | double |
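
As a rough illustration of this schema, the sketch below parses one JSON record with the standard library and coerces each field to the type listed in the table. The sample values are made up, the `Recording` class and `parse_recording` helper are hypothetical, and delivering every value as a string is an assumption chosen to show the coercion.

```python
import json
from dataclasses import dataclass


@dataclass
class Recording:
    device_id: int
    mrn: int        # a Python int covers Spark's long
    time: float     # a Python float corresponds to Spark's double
    heartrate: float


def parse_recording(line):
    """Parse one JSON document and coerce each field to its declared type."""
    raw = json.loads(line)
    return Recording(
        device_id=int(raw["device_id"]),
        mrn=int(raw["mrn"]),
        time=float(raw["time"]),
        heartrate=float(raw["heartrate"]),
    )


# Hypothetical record; values are strings here to demonstrate the coercion.
sample = '{"device_id": "23", "mrn": "40580129", "time": "1601510400.5", "heartrate": "64.8"}'
record = parse_recording(sample)
print(record)
```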

##### Personally Identifiable Information (PII):
These data will later be joined with a static table of patient information stored in an external system to identify patients.

| Field | Type |
| --- | --- |
| mrn | long |
| name | string |

### Prerequisites:

In [1]:
import findspark
findspark.init()
findspark.find()

'C:\\spark-3.5.4-bin-hadoop3'

#### Import Required Libraries

In [2]:
import os
import sys
import json
import time

from pyspark import SparkConf
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *

#### Instantiate Global Variables

In [3]:
# --------------------------------------------------------------------------------
# Specify Directory Structure for Source Data
# --------------------------------------------------------------------------------
base_dir = os.path.join(os.getcwd(), 'lab_data')
data_dir = os.path.join(base_dir, 'healthcare')
batch_dir = os.path.join(data_dir, 'batch')
stream_dir = os.path.join(data_dir, 'streaming')
tracker_stream_dir = os.path.join(stream_dir, 'tracker')

# --------------------------------------------------------------------------------
# Define Directory Structure for Data Lakehouse Files
# --------------------------------------------------------------------------------
dest_database = "healthcare_dlh"
sql_warehouse_dir = os.path.abspath('spark-warehouse')
database_dir = os.path.join(sql_warehouse_dir, dest_database)

patient_output_bronze = os.path.join(database_dir, 'dim_patient')
heartbeat_output_bronze = os.path.join(database_dir, 'fact_heartbeat_bronze')
heartbeat_output_silver = os.path.join(database_dir, 'fact_heartbeat_silver')
heartbeat_output_gold = os.path.join(database_dir, 'fact_heartbeat_gold')
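
The cell above only builds path strings. As a minimal sketch of the resulting layout, the snippet below creates the same folder names under a temporary root and lists them. The temporary root is an assumption for illustration; the notebook itself uses `os.path.abspath('spark-warehouse')`.

```python
import os
import tempfile

# Hypothetical root directory, used here so the sketch leaves no files behind
# in the working directory.
root = tempfile.mkdtemp()
database_dir = os.path.join(root, "spark-warehouse", "healthcare_dlh")

layer_dirs = {
    name: os.path.join(database_dir, name)
    for name in (
        "dim_patient",
        "fact_heartbeat_bronze",
        "fact_heartbeat_silver",
        "fact_heartbeat_gold",
    )
}

for path in layer_dirs.values():
    os.makedirs(path, exist_ok=True)  # no error if the directory already exists

for name in sorted(os.listdir(database_dir)):
    print(name)
```

Creating the folders up front is optional in this lab, because Spark creates each output directory when it writes to it.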

#### Create a New Spark Session

In [4]:
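# Use half of the logical cores as local worker threads, but never fewer than one.
cpu_count = os.cpu_count() or 2
worker_threads = f"local[{cpu_count // 2 or 1}]"
shuffle_partitions = cpu_count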

sparkConf = SparkConf().setAppName('PySpark Heartrate Monitor in Jupyter')\
    .setMaster(worker_threads)\
    .set('spark.driver.memory', '2g') \
    .set('spark.executor.memory', '3g')\
    .set("spark.sql.adaptive.enabled", 'false') \
    .set('spark.sql.shuffle.partitions', shuffle_partitions) \
    .set('spark.sql.streaming.forceDeleteTempCheckpointLocation', 'true') \
    .set('spark.sql.streaming.schemaInference', 'true') \
    .set('spark.sql.warehouse.dir', database_dir) \
    .set('spark.streaming.stopGracefullyOnShutdown', 'true')

spark = SparkSession.builder.config(conf=sparkConf).getOrCreate()
spark

### Create Bronze Layer
#### Read a Batch of Patient dimension data from a CSV file

In [5]:
patient_csv = os.path.join(batch_dir, 'patient_info.csv')

# Unit Test
print(patient_csv)

C:\Users\jtupi\Documents\UVA\DS-2002-Teacher\04-PySpark\lab_data\healthcare\batch\patient_info.csv


In [6]:
df_patient = spark.read.format('csv').options(header='true', inferSchema=True).load(patient_csv)

# Unit Test -------------
df_patient.printSchema()
print(f"The 'df_patient' table contains {df_patient.count()} rows.")
df_patient.show(5)

root
 |-- mrn: integer (nullable = true)
 |-- name: string (nullable = true)

The 'df_patient' table contains 34 rows.
+--------+---------------+
|     mrn|           name|
+--------+---------------+
|23940128| Caitlin Garcia|
|18064290|  Anthony Perez|
|95384990|     Tanya Diaz|
|53057176|Autumn Calderon|
|96005424|   Ronald Smith|
+--------+---------------+
only showing top 5 rows



#### Persist the Patient dimension data to a Parquet file

In [7]:
df_patient.write.mode("overwrite").parquet(patient_output_bronze)

In [8]:
# Unit Test -------------------------------------------
df_patient = spark.read.parquet(patient_output_bronze)
df_patient.show(5)

+--------+---------------+
|     mrn|           name|
+--------+---------------+
|23940128| Caitlin Garcia|
|18064290|  Anthony Perez|
|95384990|     Tanya Diaz|
|53057176|Autumn Calderon|
|96005424|   Ronald Smith|
+--------+---------------+
only showing top 5 rows



#### Use Structured Streaming to Read Heartrate Monitor data
##### Read data from a series of JSON source files into a streaming DataFrame

In [9]:
df_tracker = (spark.readStream
              .option("schemaLocation", heartbeat_output_bronze)
              .option("maxFilesPerTrigger", 1)   # at most one new source file per micro-batch
              .option("multiLine", "true")       # a JSON record may span multiple lines
              .json(tracker_stream_dir)
             )

df_tracker.isStreaming

True

In [10]:
# Unit Test -------------
print(type(df_tracker))
df_tracker.printSchema()

<class 'pyspark.sql.dataframe.DataFrame'>
root
 |-- device_id: string (nullable = true)
 |-- heartrate: string (nullable = true)
 |-- mrn: string (nullable = true)
 |-- time: string (nullable = true)



#### Write data from streaming DataFrame into a Parquet output

In [11]:
tracker_checkpoint_bronze = os.path.join(heartbeat_output_bronze, '_checkpoint')

bronze_query = (df_tracker.writeStream
                .format("parquet")
                .outputMode("append")
                .queryName("heartbeat_tracker_bronze")
                .trigger(availableNow=True)
                .option("checkpointLocation", tracker_checkpoint_bronze)
                .option("compression", "snappy")
                .start(heartbeat_output_bronze)
                )

In [12]:
# Unit Test ----------------------------------
print(f"Query ID: {bronze_query.id}")
print(f"Query Name: {bronze_query.name}")
print(f"Query Status: {bronze_query.status}")
print(f"Last Progress: {bronze_query.lastProgress}")

Query ID: e47a47f1-6d08-445c-af50-84e0aa0b772d
Query Name: heartbeat_tracker_bronze
Query Status: {'message': 'Initializing sources', 'isDataAvailable': False, 'isTriggerActive': False}
Last Progress: None


In [13]:
bronze_query.awaitTermination()
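
The checkpoint location is what allows a restarted query to pick up only data it has not yet processed. The following plain-Python sketch illustrates that idea with a small "processed files" log. It is a conceptual stand-in, not Spark's actual checkpoint format, and the file names and the `process_new_files` helper are hypothetical.

```python
import json
import os
import tempfile


def process_new_files(source_dir, checkpoint_path):
    """Process only files not yet recorded in the checkpoint; return their names."""
    seen = set()
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as fh:
            seen = set(json.load(fh))

    new_files = sorted(f for f in os.listdir(source_dir) if f not in seen)
    # A real pipeline would read and write the data for new_files here.

    with open(checkpoint_path, "w") as fh:
        json.dump(sorted(seen | set(new_files)), fh)
    return new_files


# Hypothetical source directory holding two JSON files.
source_dir = tempfile.mkdtemp()
checkpoint_path = os.path.join(tempfile.mkdtemp(), "processed.json")
for name in ("file-0.json", "file-1.json"):
    open(os.path.join(source_dir, name), "w").close()

first_run = process_new_files(source_dir, checkpoint_path)

# A third file arrives before the next run.
open(os.path.join(source_dir, "file-2.json"), "w").close()
second_run = process_new_files(source_dir, checkpoint_path)

print(first_run)
print(second_run)
```

A practical consequence for this lab: rerunning the bronze cell against an existing checkpoint processes only source files that arrived since the previous run.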

### Create Silver Layer

#### Define Silver Query to Join Streaming with Batch Data

In [14]:
df_silver = spark.readStream.format("parquet").load(heartbeat_output_bronze) \
    .join(df_patient, "mrn") \
    .select(col("device_id").cast(IntegerType()), \
            col("mrn").cast(LongType()), \
            (col("time")/1e6).cast(TimestampType()).alias("datetime"), \
            col("heartrate").cast(DoubleType()), \
            col("name").alias("patient_name")
           )

In [15]:
# Unit Test -----------
df_silver.printSchema()

root
 |-- device_id: integer (nullable = true)
 |-- mrn: long (nullable = true)
 |-- datetime: timestamp (nullable = true)
 |-- heartrate: double (nullable = true)
 |-- patient_name: string (nullable = true)
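
To show what the silver query does to each row, here is a plain-Python sketch of the same casts and the inner join on `mrn`. The bronze rows are hypothetical, and the second `mrn` is deliberately absent from the patient lookup. The timestamp conversion mirrors the query's division by 1e6, which assumes `time` holds microseconds since the epoch, since Spark interprets a numeric value cast to a timestamp as seconds. UTC is used here for a deterministic result, whereas Spark displays timestamps in the session time zone.

```python
from datetime import datetime, timezone

# Hypothetical bronze rows: every field is a string, as in the bronze schema.
bronze_rows = [
    {"device_id": "7", "mrn": "23940128", "time": "1600000000000000", "heartrate": "61.5"},
    {"device_id": "9", "mrn": "11111111", "time": "1600000060000000", "heartrate": "72.0"},
]

# mrn -> name, using one row from the patient dimension sample.
patients = {23940128: "Caitlin Garcia"}


def to_silver(row, patients):
    """Cast one bronze row and attach the patient name; None if the mrn is unknown."""
    mrn = int(row["mrn"])
    if mrn not in patients:  # inner join: unmatched rows are dropped
        return None
    return {
        "device_id": int(row["device_id"]),
        "mrn": mrn,
        "datetime": datetime.fromtimestamp(float(row["time"]) / 1e6, tz=timezone.utc),
        "heartrate": float(row["heartrate"]),
        "patient_name": patients[mrn],
    }


silver_rows = [s for s in (to_silver(r, patients) for r in bronze_rows) if s is not None]
for row in silver_rows:
    print(row)
```

Because the join is an inner join, readings whose `mrn` has no match in the patient table never reach the silver layer.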



#### Persist Data to a Parquet file

In [16]:
tracker_checkpoint_silver = os.path.join(heartbeat_output_silver, '_checkpoint')

silver_query = (df_silver.writeStream
                .format("parquet")
                .outputMode("append")
                .queryName("heartbeat_tracker_silver")
                .trigger(availableNow=True)
                .option("checkpointLocation", tracker_checkpoint_silver)
                .option("compression", "snappy")
                .start(heartbeat_output_silver)
                )

In [17]:
# Unit Test -------------------------------------------
silver_query.awaitTermination()

### Create Gold Layer
#### Define Gold Query to Perform an Aggregation

In [18]:
df_heartrate_by_patient_gold = spark.readStream.format("parquet").load(heartbeat_output_silver) \
    .groupBy('patient_name') \
    .agg((ceiling(avg("heartrate")).alias("avg_heartrate")), \
        (count("device_id").alias("count"))) \
    .orderBy(desc("avg_heartrate"))

In [19]:
# Unit Test -----
df_heartrate_by_patient_gold.printSchema()

root
 |-- patient_name: string (nullable = true)
 |-- avg_heartrate: long (nullable = true)
 |-- count: long (nullable = false)
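
The gold aggregation can be mirrored in plain Python to see what it computes: group by patient, take the ceiling of the mean heart rate, count the readings, and sort by the average in descending order. The input rows are hypothetical. One simplification: the sketch counts all rows per patient, whereas `count("device_id")` in the query counts only rows with a non-null `device_id`.

```python
import math
from collections import defaultdict

# Hypothetical silver rows as (patient_name, heartrate) pairs.
silver_rows = [
    ("Caitlin Garcia", 61.5),
    ("Anthony Perez", 80.2),
    ("Caitlin Garcia", 64.0),
    ("Anthony Perez", 79.0),
    ("Anthony Perez", 83.1),
]

# Group the readings by patient.
readings = defaultdict(list)
for patient_name, heartrate in silver_rows:
    readings[patient_name].append(heartrate)

# Aggregate each group, then order by the average, highest first.
gold = sorted(
    (
        {
            "patient_name": name,
            "avg_heartrate": math.ceil(sum(values) / len(values)),
            "count": len(values),
        }
        for name, values in readings.items()
    ),
    key=lambda row: row["avg_heartrate"],
    reverse=True,
)

for row in gold:
    print(row)
```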



#### Write Gold Data to an In-Memory Table

In [20]:
gold_query = (df_heartrate_by_patient_gold.writeStream
              .outputMode("complete")
              .queryName("fact_heartbeat_by_patient_gold")
              .format("memory")
              .start()
             )

##### Read Gold Streaming Data

In [21]:
df_fact_heartbeat_by_patient_gold = spark.sql("SELECT * FROM fact_heartbeat_by_patient_gold")

In [22]:
time.sleep(60)
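
A fixed 60-second sleep either waits longer than necessary or not long enough. One alternative, sketched below in plain Python, is to poll a condition until it holds or a timeout expires. The `wait_until` helper and the stand-in predicate are hypothetical and do not touch Spark.

```python
import time


def wait_until(predicate, timeout=60.0, interval=0.5):
    """Poll predicate() until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return predicate()  # one final check once the deadline has passed


# Stand-in for a streaming query: reports "ready" on the third poll.
polls = {"n": 0}


def results_ready():
    polls["n"] += 1
    return polls["n"] >= 3


ready = wait_until(results_ready, timeout=5.0, interval=0.01)
print(ready, polls["n"])
```

In this notebook the predicate could be, for example, `lambda: spark.table("fact_heartbeat_by_patient_gold").count() > 0`. PySpark also provides `StreamingQuery.processAllAvailable()`, which blocks until all currently available data has been processed and is documented as intended for testing.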

##### Display the Gold table

In [23]:
df_fact_heartbeat_by_patient_gold \
    .select("patient_name", "avg_heartrate", "count").toPandas()

| patient_name | avg_heartrate | count |
| --- | --- | --- |


In [25]:
#spark.stop()