### Multi-Stage Incremental Processing Architecture in the Lakehouse
Now that we understand how to perform incremental data processing by combining Structured Streaming APIs and Spark SQL, we can explore the tight integration between Structured Streaming and Delta Lake.

#### Objectives
By the end of this lesson, you should be able to:
* Describe Bronze, Silver, and Gold tables
* Create a Delta Lake multi-hop pipeline

#### Incremental Updates in the Lakehouse
Delta Lake allows users to easily combine streaming and batch workloads in a unified multi-stage pipeline, wherein each stage of the pipeline represents a state of our data valuable to driving core use cases within the business. With all data and metadata resident in object storage in the cloud, multiple users and applications can access data in near-real time, allowing analysts to access the freshest data as it's being processed.

![](https://files.training.databricks.com/images/sslh/multi-hop-simple.png)

- **Bronze** tables contain raw data ingested from various sources (JSON files, RDBMS data, IoT data, to name a few examples).
- **Silver** tables provide a more refined view of our data. We can join fields from various bronze tables to enrich streaming records, or update account statuses based on recent activity.
- **Gold** tables provide business level aggregates often used for reporting and dashboarding. This would include aggregations such as daily active website users, weekly sales per store, or gross revenue per quarter by department. 

The end outputs are actionable insights, dashboards, and reports of business metrics. By considering our business logic at all steps of the ETL pipeline, we can ensure that storage and compute costs are optimized by reducing unnecessary duplication of data and limiting ad hoc querying against full historic data. Each stage can be configured as a batch or streaming job, and ACID transactions ensure that each write succeeds or fails completely.

#### Datasets Used:
This demo uses simplified, artificially generated medical data. The schemas of our two datasets are represented below. Note that we will be manipulating these schemas during various steps.

##### Recordings
The main dataset uses heart rate recordings from medical devices delivered in JSON format.

| Field | Type |
| --- | --- |
| device_id | int |
| mrn | long |
| time | double |
| heartrate | double |

##### PII
These data will later be joined with a static table of patient information stored in an external system to identify patients by name.

| Field | Type |
| --- | --- |
| mrn | long |
| name | string |

#### 1.0. Import Shared Utilities and Data Files

Run the following cell to configure the lab environment.

In [0]:
%run ./Includes/5.0-setup

#### 1.1. Data Simulator
Databricks Auto Loader can automatically process files as they land in your cloud object stores. To simulate this process, you will run the following operation several times.

In [0]:
DA.data_factory.load()

#### 2.0. Bronze Table: Ingesting Raw JSON Recordings
Below, we configure a read on a raw JSON source using Auto Loader with schema inference. Note that while you need to use the Spark DataFrame API to set up an incremental read, once configured you can immediately register a temp view to leverage Spark SQL for streaming transformations on your data.

**NOTE**: For a JSON data source, Auto Loader infers every column as a string by default. The cell below sets **`cloudFiles.inferColumnTypes`** to **`true`** so that column types are inferred from the data, and also demonstrates specifying the data type for the **`time`** column using the **`cloudFiles.schemaHints`** option. Note that specifying an improper type for a field will result in null values.
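
The effect of an improper type can be illustrated outside Spark. The following is a minimal plain-Python sketch, not Auto Loader itself: the helper `apply_hint` and the sample record are hypothetical, and the only point is that a value that cannot be parsed as the requested type surfaces as a null.

```python
# Plain-Python illustration only; Auto Loader performs the real parsing inside Spark.
def apply_hint(record, field, caster):
    """Return a copy of `record` with `field` cast by `caster`, or None if the cast fails."""
    out = dict(record)
    try:
        out[field] = caster(record[field])
    except (TypeError, ValueError):
        out[field] = None  # a mismatched type shows up as a null value
    return out

# Hypothetical raw record, with every value still a string.
raw = {"device_id": "17", "mrn": "40580129", "time": "1577836800.5", "heartrate": "71.2"}

good = apply_hint(raw, "time", float)  # analogous to the hint "time DOUBLE"
bad = apply_hint(raw, "time", int)     # analogous to a mismatched hint such as "time INT"
```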

In [0]:
(spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaHints", "time DOUBLE")
    .option("cloudFiles.schemaLocation", f"{DA.paths.checkpoints}/bronze")
    .option("cloudFiles.inferColumnTypes", "true")
    .option("multiLine", "true")
    .load(DA.paths.data_landing_location)
    .createOrReplaceTempView("recordings_raw_temp"))

Here, we'll enrich our raw data with additional metadata describing the source file and the time it was ingested. This additional metadata can be ignored during downstream processing while providing useful information for troubleshooting errors if corrupt data is encountered.

In [0]:
%sql
CREATE OR REPLACE TEMPORARY VIEW recordings_bronze_temp AS (
  SELECT *, current_timestamp() receipt_time, input_file_name() source_file
  FROM recordings_raw_temp
)
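
To show why this metadata helps with troubleshooting, here is a small plain-Python sketch under stated assumptions: the helper `add_ingest_metadata`, the file names, and the records are all hypothetical stand-ins for what the view above does with **`current_timestamp()`** and **`input_file_name()`**.

```python
from datetime import datetime, timezone

def add_ingest_metadata(records, source_file, receipt_time=None):
    """Attach the source file name and a receipt timestamp to every record."""
    stamp = receipt_time or datetime.now(timezone.utc)
    return [{**r, "receipt_time": stamp, "source_file": source_file} for r in records]

# Hypothetical files and records; the second file delivers a corrupt (null) reading.
batch_a = add_ingest_metadata([{"mrn": 100, "heartrate": 71.0}], "landing/01.json")
batch_b = add_ingest_metadata([{"mrn": 200, "heartrate": None}], "landing/02.json")

# Trace null readings back to the file that delivered them.
suspect_files = sorted(
    {r["source_file"] for r in batch_a + batch_b if r["heartrate"] is None}
)
```

Downstream logic can ignore both extra fields, while an investigation of bad rows can start from `suspect_files`.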

The code below passes our enriched raw data back to the PySpark API to process an incremental write to a Delta Lake table.

In [0]:
(spark.table("recordings_bronze_temp")
      .writeStream
      .format("delta")
      .option("checkpointLocation", f"{DA.paths.checkpoints}/bronze")
      .outputMode("append")
      .table("bronze"))

Trigger another file arrival with the following cell and you'll see the changes immediately detected by the streaming query you've written.

In [0]:
DA.data_factory.load()

##### 2.1. Load Static Lookup Table
The ACID guarantees that Delta Lake brings to your data are managed at the table level, ensuring that only fully successful commits are reflected in your tables. If you choose to merge these data with other data sources, be aware of how those sources version data and what sort of consistency guarantees they have.

In this simplified demo, we are loading a static CSV file to add patient data to our recordings. In production, we could use Databricks' <a href="https://docs.databricks.com/spark/latest/structured-streaming/auto-loader.html" target="_blank">Auto Loader</a> feature to keep an up-to-date view of these data in our Delta Lake.

In [0]:
(spark.read
      .format("csv")
      .schema("mrn LONG, name STRING")
      .option("header", True)
      .load(f"{DA.paths.data_source}/patient/patient_info.csv")
      .createOrReplaceTempView("pii"))

In [0]:
%sql
SELECT * FROM pii

#### 3.0. Silver Table: Enriched Recording Data
For the second hop, our silver table, we will apply the following enrichments and checks:
- Our recordings data will be joined with the PII to add patient names
- The time for our recordings will be parsed to the format **`'yyyy-MM-dd HH:mm:ss'`** to be human-readable
- We will exclude heart rates that are <= 0, as we know that these either represent the absence of the patient or an error in transmission

In [0]:
(spark.readStream
  .table("bronze")
  .createOrReplaceTempView("bronze_tmp"))

In [0]:
%sql
CREATE OR REPLACE TEMPORARY VIEW recordings_w_pii AS (
  SELECT device_id, a.mrn, b.name, cast(from_unixtime(time, 'yyyy-MM-dd HH:mm:ss') AS timestamp) time, heartrate
  FROM bronze_tmp a
  INNER JOIN pii b
  ON a.mrn = b.mrn
  WHERE heartrate > 0)
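
The three silver-level rules (inner join with PII, human-readable time, and dropping non-positive heart rates) can be sketched in plain Python. This is only an illustration of the logic, not the streaming query: the sample rows, the names, and the function `enrich` are hypothetical, and UTC is assumed for the time conversion, whereas Spark's **`from_unixtime`** uses the session time zone.

```python
from datetime import datetime, timezone

# Hypothetical rows standing in for the bronze table and the PII lookup.
bronze_rows = [
    {"device_id": 1, "mrn": 100, "time": 1577836800.0, "heartrate": 72.5},
    {"device_id": 2, "mrn": 200, "time": 1577836860.0, "heartrate": -3.0},  # heartrate <= 0
    {"device_id": 3, "mrn": 999, "time": 1577836920.0, "heartrate": 80.0},  # no PII match
]
pii = {100: "Alice Example", 200: "Bob Example"}

def enrich(rows, names):
    """Apply the silver rules: filter, inner join on mrn, and format the time."""
    out = []
    for r in rows:
        if r["heartrate"] <= 0:
            continue  # absent patient or transmission error
        if r["mrn"] not in names:
            continue  # inner join: rows without a PII match are dropped
        ts = datetime.fromtimestamp(r["time"], tz=timezone.utc)
        out.append({
            "device_id": r["device_id"],
            "mrn": r["mrn"],
            "name": names[r["mrn"]],
            "time": ts.strftime("%Y-%m-%d %H:%M:%S"),
            "heartrate": r["heartrate"],
        })
    return out

silver_rows = enrich(bronze_rows, pii)
```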

In [0]:
(spark.table("recordings_w_pii")
      .writeStream
      .format("delta")
      .option("checkpointLocation", f"{DA.paths.checkpoints}/recordings_enriched")
      .outputMode("append")
      .table("recordings_enriched"))

Trigger another new file and wait for it to propagate through both previous queries.

In [0]:
%sql
SELECT COUNT(*) FROM recordings_enriched

In [0]:
DA.data_factory.load()

#### 4.0. Gold Table: Daily Averages

Here we read a stream of data from **`recordings_enriched`** and write another stream to create an aggregate gold table of daily averages for each patient.

In [0]:
(spark.readStream
  .table("recordings_enriched")
  .createOrReplaceTempView("recordings_enriched_temp"))

In [0]:
%sql
CREATE OR REPLACE TEMP VIEW patient_avg AS (
  SELECT mrn, name, mean(heartrate) avg_heartrate, date_trunc("DD", time) date
  FROM recordings_enriched_temp
  GROUP BY mrn, name, date_trunc("DD", time))
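
As a rough plain-Python equivalent of the aggregation in this view, the sketch below groups readings by patient and calendar day and averages the heart rate. The sample rows and the function `daily_averages` are hypothetical; Spark performs the real aggregation incrementally over the stream.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

# Hypothetical enriched rows for one patient across two days.
rows = [
    {"mrn": 100, "name": "Alice Example", "time": datetime(2020, 1, 1, 8, 0), "heartrate": 70.0},
    {"mrn": 100, "name": "Alice Example", "time": datetime(2020, 1, 1, 20, 0), "heartrate": 80.0},
    {"mrn": 100, "name": "Alice Example", "time": datetime(2020, 1, 2, 8, 0), "heartrate": 90.0},
]

def daily_averages(rows):
    """Group by (mrn, name, day) and average heartrate, like the GROUP BY above."""
    groups = defaultdict(list)
    for r in rows:
        key = (r["mrn"], r["name"], r["time"].date())  # truncate the timestamp to the day
        groups[key].append(r["heartrate"])
    return {key: mean(values) for key, values in groups.items()}

averages = daily_averages(rows)
```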

Note that we're using **`.trigger(once=True)`** below. This lets us keep the strengths of Structured Streaming while triggering this job as a single batch. To recap, these strengths include:
- exactly once end-to-end fault tolerant processing
- automatic detection of changes in upstream data sources

If we know the approximate rate at which our data grows, we can appropriately size the cluster we schedule for this job to ensure fast, cost-effective processing. The customer will be able to evaluate how much updating this final aggregate view of their data costs and make informed decisions about how frequently this operation needs to be run.

Downstream processes subscribing to this table do not need to re-run any expensive aggregations. Rather, files just need to be de-serialized and then queries based on included fields can quickly be pushed down against this already-aggregated source.
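
The combination of a checkpoint and a single-batch trigger can be pictured with a small plain-Python sketch. It is a deliberate simplification: the function `run_once`, the file names, and the use of a set as a checkpoint are assumptions made for illustration, whereas Structured Streaming tracks progress with its own offset and commit logs.

```python
def run_once(landing_files, checkpoint, sink):
    """Process only files not yet recorded in the checkpoint, then stop."""
    new_files = [name for name in sorted(landing_files) if name not in checkpoint]
    for name in new_files:
        sink.extend(landing_files[name])  # write the new records
        checkpoint.add(name)              # remember that this file is done
    return len(new_files)

# Hypothetical landing directory: file name -> heart rate readings.
landing = {"01.json": [71.0, 72.0]}
checkpoint, sink = set(), []

first = run_once(landing, checkpoint, sink)   # picks up 01.json
second = run_once(landing, checkpoint, sink)  # nothing new, so nothing is reprocessed
landing["02.json"] = [80.0]                   # a new file lands between scheduled runs
third = run_once(landing, checkpoint, sink)   # picks up only 02.json
```

Each scheduled run does work proportional to the data that arrived since the previous run, which is what makes the cost of refreshing the aggregate predictable.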

In [0]:
(spark.table("patient_avg")
      .writeStream
      .format("delta")
      .outputMode("complete")
      .option("checkpointLocation", f"{DA.paths.checkpoints}/daily_avg")
      .trigger(once=True)
      .table("daily_patient_avg"))

##### 4.1. Important Considerations for Complete Output with Delta
When using **`complete`** output mode, we rewrite the entire state of our table each time our logic runs. While this is ideal for calculating aggregates, we **cannot** read a stream from this directory, as Structured Streaming assumes data is only being appended in the upstream logic.

**NOTE**: Certain options can be set to change this behavior, but have other limitations attached. For more details, refer to <a href="https://docs.databricks.com/delta/delta-streaming.html#ignoring-updates-and-deletes" target="_blank">Delta Streaming: Ignoring Updates and Deletes</a>.

The gold Delta table we have just registered will perform a static read of the current state of the data each time we run the following query.

In [0]:
%sql
SELECT * FROM daily_patient_avg

Note the above table includes all days for all users. If the predicates for our ad hoc queries match the data encoded here, we can push down our predicates to files at the source and very quickly generate more limited aggregate views.

In [0]:
%sql
SELECT * 
FROM daily_patient_avg
WHERE date BETWEEN "2020-01-17" AND "2020-01-31"
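
Returning to the point in section 4.1, the plain-Python sketch below illustrates why a table written in **`complete`** mode is not append-only. The function `complete_mode_write` and the sample readings are hypothetical; the sketch recomputes the whole aggregate on every run, so a row that already existed can change value.

```python
from statistics import mean

def complete_mode_write(all_rows):
    """Recompute every daily average from all rows seen so far and return the full table."""
    by_day = {}
    for day, heartrate in all_rows:
        by_day.setdefault(day, []).append(heartrate)
    return {day: mean(values) for day, values in by_day.items()}

seen = [("2020-01-01", 70.0)]
table_v1 = complete_mode_write(seen)

seen.append(("2020-01-01", 90.0))  # more data for a day that is already in the table
seen.append(("2020-01-02", 60.0))  # data for a new day
table_v2 = complete_mode_write(seen)

# Days whose previously written value was overwritten rather than appended to.
changed = {day for day in table_v1 if table_v2[day] != table_v1[day]}
```

A streaming reader that assumes appended rows only would have no way to account for the entry in `changed`, which is why the gold table is read statically here.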

#### 5.0. Process Remaining Records
The following cell will land additional files for the rest of 2020 in your source directory. You'll be able to see these process through the first two tables in your Delta Lake (**`bronze`** and **`recordings_enriched`**), but you will need to re-run your final query to update your **`daily_patient_avg`** table, since this query uses the trigger once syntax.

In [0]:
DA.data_factory.load(continuous=True)

#### 6.0. Cleaning Up
Finally, make sure all streams are stopped.

In [0]:
DA.cleanup()

#### 7.0. Summary, Additional Topics & Resources
Delta Lake and Structured Streaming combine to provide near real-time analytic access to data in the lakehouse. To learn more, check out the following resources:
* <a href="https://docs.databricks.com/delta/delta-streaming.html" target="_blank">Table Streaming Reads and Writes</a>
* <a href="https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html" target="_blank">Structured Streaming Programming Guide</a>
* <a href="https://www.youtube.com/watch?v=rl8dIzTpxrI" target="_blank">A Deep Dive into Structured Streaming</a> by Tathagata Das. This is an excellent video describing how Structured Streaming works.
* <a href="https://databricks.com/glossary/lambda-architecture" target="_blank">Lambda Architecture</a>
* <a href="https://bennyaustin.wordpress.com/2010/05/02/kimball-and-inmon-dw-models/#" target="_blank">Data Warehouse Models</a>
* <a href="http://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html" target="_blank">Create a Kafka Source Stream</a>