## Demo: Medallion Architecture Using Tables Queried with the PySpark SQL API
### Overview
Structured Streaming is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine. You can express your streaming computation the same way you would express a batch computation on static data. The Spark SQL engine will take care of running it incrementally and continuously and updating the final result as streaming data continues to arrive. You can use the Dataset/DataFrame API in Scala, Java, Python or R to express streaming aggregations, event-time windows, stream-to-batch joins, etc. The computation is executed on the same optimized Spark SQL engine. Finally, the system ensures end-to-end exactly-once fault-tolerance guarantees through checkpointing and Write-Ahead Logs. In short, Structured Streaming provides fast, scalable, fault-tolerant, end-to-end exactly-once stream processing without the user having to reason about streaming.

Internally, by default, Structured Streaming queries are processed using a micro-batch processing engine, which processes data streams as a series of small batch jobs, thereby achieving end-to-end latencies as low as 100 milliseconds and exactly-once fault-tolerance guarantees. Spark 2.3 introduced a low-latency processing mode called Continuous Processing, which can achieve end-to-end latencies as low as 1 millisecond with at-least-once guarantees. You can choose the mode based on your application requirements without changing the Dataset/DataFrame operations in your queries.
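
To make the micro-batch idea concrete, here is a minimal pure-Python sketch (not Spark code). It treats a stream as a sequence of small batches and updates a running per-key average as each batch arrives. The names `RunningAverage` and `process_batch`, and the sample records, are hypothetical and chosen for illustration only.

```python
from collections import defaultdict


class RunningAverage:
    """Keeps a per-key sum and count so averages can be updated incrementally."""

    def __init__(self):
        self.totals = defaultdict(float)
        self.counts = defaultdict(int)

    def process_batch(self, batch):
        """Fold one micro-batch of (key, value) pairs into the running state."""
        for key, value in batch:
            self.totals[key] += value
            self.counts[key] += 1

    def result(self):
        """Return the current average for every key seen so far."""
        return {key: self.totals[key] / self.counts[key] for key in self.totals}


# Hypothetical stream: each inner list plays the role of one micro-batch.
micro_batches = [
    [("device_1", 60.0), ("device_2", 80.0)],
    [("device_1", 70.0)],
    [("device_2", 90.0), ("device_1", 80.0)],
]

state = RunningAverage()
for batch in micro_batches:
    state.process_batch(batch)
    print(state.result())  # the result is refreshed after every batch
```

Only the small state (sums and counts) is carried between batches, which is the same principle that lets a streaming engine update a result without reprocessing earlier data.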

### Lab Details:

This lab demonstrates ingesting artificially generated medical data in JSON format. The data simulates heart rate monitor signals captured from numerous devices, so it resembles what a streaming data source would deliver.

#### Datasets Used:
The schemas of our two datasets are shown below. Note that we will manipulate these schemas during various steps.

##### Recordings:
The main dataset uses heart rate recordings from medical devices delivered in the JSON format.

| Field | Type |
| --- | --- |
| device_id | int |
| mrn | long |
| time | double |
| heartrate | double |
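
For illustration, a single record matching the schema above could be generated and serialized as follows. This is a sketch only: the function name `make_recording`, the sample identifiers, the value range, and the choice of epoch seconds for `time` are assumptions, not taken from the lab's data generator.

```python
import json
import random
import time


def make_recording(device_id, mrn, rng):
    """Build one synthetic heart rate record with the four fields of the schema."""
    return {
        "device_id": int(device_id),
        "mrn": int(mrn),
        "time": float(time.time()),                       # assumed: epoch seconds as a double
        "heartrate": round(rng.uniform(40.0, 180.0), 4),  # assumed plausible range
    }


rng = random.Random(42)  # seeded so repeated runs produce the same heart rates
record = make_recording(device_id=23, mrn=40580129, rng=rng)  # hypothetical IDs
print(json.dumps(record))
```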

##### Personally Identifiable Information (PII):
These data will later be joined with a static table of patient information stored in an external system to identify patients.

| Field | Type |
| --- | --- |
| mrn | long |
| name | string |

### Prerequisites:

In [None]:
import findspark
findspark.init()
findspark.find()

#### Import Required Libraries

In [None]:
import os
import sys
import json

from pyspark import SparkConf
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *

#### Instantiate Global Variables

In [None]:
# --------------------------------------------------------------------------------
# Specify Directory Structure for Source Data
# --------------------------------------------------------------------------------
base_dir = os.path.join(os.getcwd(), 'lab_data')
data_dir = os.path.join(base_dir, 'healthcare')
batch_dir = os.path.join(data_dir, 'batch')
stream_dir = os.path.join(data_dir, 'streaming')
tracker_stream_dir = os.path.join(stream_dir, 'tracker')

# --------------------------------------------------------------------------------
# Create Directory Structure for Data Lakehouse Files
# --------------------------------------------------------------------------------
dest_database = "healthcare_dlh"
sql_warehouse_dir = os.path.abspath('spark-warehouse')
database_dir = os.path.join(sql_warehouse_dir, dest_database)

patient_output_bronze = os.path.join(database_dir, 'dim_patient')
heartbeat_output_bronze = os.path.join(database_dir, 'fact_heartbeat_bronze')
heartbeat_output_silver = os.path.join(database_dir, 'fact_heartbeat_silver')
heartbeat_output_gold = os.path.join(database_dir, 'fact_heartbeat_gold')

#### Create a New Spark Session

In [None]:
# Use half of the available cores, but never fewer than one worker thread
cpu_count = os.cpu_count() or 1
worker_threads = f"local[{max(1, cpu_count // 2)}]"
shuffle_partitions = cpu_count

sparkConf = SparkConf().setAppName('PySpark Heartrate Monitor in Jupyter')\
    .setMaster(worker_threads)\
    .set('spark.driver.memory', '2g') \
    .set('spark.executor.memory', '3g')\
    .set('spark.sql.adaptive.enabled', 'false') \
    .set('spark.sql.shuffle.partitions', shuffle_partitions) \
    .set('spark.sql.streaming.forceDeleteTempCheckpointLocation', 'true') \
    .set('spark.sql.streaming.schemaInference', 'true') \
    .set('spark.sql.warehouse.dir', sql_warehouse_dir) \
    .set('spark.streaming.stopGracefullyOnShutdown', 'true')

spark = SparkSession.builder.config(conf=sparkConf).getOrCreate()
spark

### Create Bronze Layer
#### Read a Batch of Patient dimension data from a CSV file

In [None]:
patient_csv = os.path.join(batch_dir, 'patient_info.csv')

# Unit Test -------
print(patient_csv)

In [None]:
df_patient = spark.read.format('csv').options(header='true', inferSchema='true').load(patient_csv)

# Unit Test ----------------------------------------------------------
df_patient.printSchema()
print(f"The 'df_patient' DataFrame contains {df_patient.count()} rows.")
df_patient.show(5)

#### Persist the Patient dimension data into a Lakehouse table 
##### Create a New Data Lakehouse Database

In [None]:
spark.sql(f"CREATE DATABASE IF NOT EXISTS {dest_database};")

##### Create the 'dim_patients' Dimension table

In [None]:
df_patient.write.saveAsTable(f"{dest_database}.dim_patients", mode="overwrite")

In [None]:
# Unit Test ------------------------------------------------------------
spark.sql(f"DESCRIBE EXTENDED {dest_database}.dim_patients;").show()
spark.sql(f"SELECT * FROM {dest_database}.dim_patients LIMIT 5").show()

#### Use Structured Streaming to Read Heartrate Monitor data
##### Read data from a series of JSON source files into a streaming DataFrame

In [None]:
df_tracker = (
    spark.readStream
    .option("schemaLocation", heartbeat_output_bronze)
    .option("maxFilesPerTrigger", 1)
    .option("multiLine", "true")
    .json(tracker_stream_dir)
)

df_tracker.isStreaming

In [None]:
# Unit Test -------------
print(type(df_tracker))
df_tracker.printSchema()

#### Write data from streaming DataFrame into a Streaming Table

In [None]:
tracker_checkpoint_bronze = os.path.join(heartbeat_output_bronze, '_checkpoint')

bronze_query = (
    df_tracker.writeStream
    .outputMode("append")
    .queryName("heartbeat_tracker_bronze")
    .trigger(availableNow=True)
    .option("checkpointLocation", tracker_checkpoint_bronze)
    .option("compression", "snappy")
    .toTable(f"{dest_database}.fact_heartrate_bronze")
)

In [None]:
# Unit Test ----------------------------------
print(f"Query ID: {bronze_query.id}")
print(f"Query Name: {bronze_query.name}")
print(f"Query Status: {bronze_query.status}")
print(f"Last Progress: {bronze_query.lastProgress}")

In [None]:
bronze_query.awaitTermination()

In [None]:
# Unit Test ------------------------------------------------------------------ 
spark.sql(f"DESCRIBE EXTENDED {dest_database}.fact_heartrate_bronze;").show()
spark.table(f"{dest_database}.fact_heartrate_bronze").show(5)

### Create Silver Layer
#### Define Silver Query to Join Streaming with Batch Data

In [None]:
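# Note: "datetime" divides time by 1e6 (treating it as microseconds), while from_unixtime()
# reads the raw value as epoch seconds; confirm which unit the source data uses.
df_silver = spark.table(f"{dest_database}.fact_heartrate_bronze") \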
    .join(df_patient, "mrn") \
    .select(col("device_id").cast(IntegerType()), \
            col("mrn").cast(LongType()), \
            (col("time")/1e6).cast(TimestampType()).alias("datetime"), \
            from_unixtime("time", "MM/dd/yyyy").alias("date"), \
            from_unixtime("time", "hh:mm:ss a z").alias("time"), \
            col("heartrate").cast(DoubleType()), \
            col("name").alias("patient_name")
           )
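
The select above converts `time` in two ways: the `datetime` column divides by 1e6 before casting, while `from_unixtime` interprets the raw value as seconds. The pure-Python sketch below shows how far apart the two interpretations land for the same number. The sample value and the helper name `to_utc` are hypothetical.

```python
from datetime import datetime, timezone

epoch_value = 1577836800.0  # hypothetical sample; as seconds this is 2020-01-01 00:00:00 UTC


def to_utc(seconds):
    """Convert epoch seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


as_seconds = to_utc(epoch_value)             # value interpreted as seconds
as_microseconds = to_utc(epoch_value / 1e6)  # same value interpreted as microseconds

print(as_seconds.isoformat())
print(as_microseconds.isoformat())
```

If the two derived columns disagree on the year in the silver table, the unit assumption in one of the expressions does not match the data.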

In [None]:
# Unit Test ------------
df_silver.printSchema()
df_silver.show(5)

#### Persist Silver Data to a Table in the Lakehouse

In [None]:
silver_table = f"{dest_database}.fact_heartrate_silver"
df_silver.write.saveAsTable(silver_table, mode="overwrite")

In [None]:
# Unit Test -------------------------------------------
spark.sql(f"DESCRIBE EXTENDED {silver_table};").show()
spark.table(silver_table).show(5)

### Create Gold Layer
#### Define Gold Query to Perform an Aggregation

In [None]:
df_gold = df_silver.groupBy("patient_name") \
    .agg((ceiling(avg("heartrate")).alias("avg_heartrate")), \
        (count("device_id").alias("count"))) \
    .orderBy(desc("avg_heartrate"))
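
As a rough illustration of what this aggregation computes, the same logic can be written in plain Python over a handful of rows. The sample rows and patient names are hypothetical, and the sketch assumes no null `device_id` values, so counting rows matches `count("device_id")`.

```python
import math
from collections import defaultdict

# Hypothetical silver rows: (patient_name, device_id, heartrate)
rows = [
    ("Alice", 1, 61.2),
    ("Alice", 1, 64.1),
    ("Bob", 2, 88.5),
    ("Bob", 2, 90.0),
    ("Bob", 2, 91.3),
]

# Group heart rates by patient, mirroring groupBy("patient_name")
grouped = defaultdict(list)
for name, _device_id, heartrate in rows:
    grouped[name].append(heartrate)

# ceiling(avg(heartrate)) and a row count per patient, highest average first
gold = sorted(
    ((name, math.ceil(sum(values) / len(values)), len(values))
     for name, values in grouped.items()),
    key=lambda row: row[1],
    reverse=True,
)
print(gold)
```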

In [None]:
# Unit Test -----
df_gold.show(5)

#### Persist Gold Data to a Table in the Lakehouse

In [None]:
gold_table = f"{dest_database}.fact_heartrate_gold"
df_gold.write.saveAsTable(gold_table, mode="overwrite")

In [None]:
# Unit Test ------------------------------------------------------------
spark.table(gold_table).show(5)

#### Display the Gold table

In [None]:
spark.table(gold_table) \
    .select("patient_name", "avg_heartrate", "count") \
    .orderBy(asc("avg_heartrate")).toPandas()

In [None]:
spark.stop()