
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

# Streaming from Multiplex Bronze

In this notebook, you will configure a query to consume and parse raw data from a single topic as it lands in the multiplex bronze table configured in the last lesson. We'll continue refining this query in the following notebooks.

<img src="https://files.training.databricks.com/images/ade/ADE_arch_heartrate_silver.png" width="60%" />

## Learning Objectives
By the end of this lesson, you should be able to:
- Describe how filters are applied to streaming jobs
- Use built-in functions to flatten nested JSON data
- Parse and save binary-encoded strings to native types

Declare database and set all path variables.

In [0]:
%run ../Includes/silver-setup

Use the following cell to reset the target directories, if necessary.

In [0]:
spark.sql("DROP TABLE IF EXISTS heart_rate_silver")
dbutils.fs.rm(Paths.silverRecordingsTable, True)
dbutils.fs.rm(Paths.silverRecordingsCheckpoint, True)

## Define a Batch Read

Before building our streams, we'll start with a static view of our data. Working with static data can be easier during interactive development as no streams will be triggered. Because we're working with Delta Lake as our source, we'll still get the most up-to-date version of our table each time we execute a query.

If you're working with SQL, you can directly query the `bronze` table registered in the previous lesson. Python and Scala users can easily create a DataFrame from a registered table.

In [0]:
batchDF = spark.table("bronze")

Delta Lake stores our schema information. Let's print it out, just to make sure we remember.

In [0]:
%sql

DESCRIBE bronze

Preview your data.

In [0]:
%sql

SELECT *
FROM bronze
LIMIT 20

Multiple topics are being ingested into this table, so we'll need to define logic for each topic separately.

In [0]:
%sql

SELECT DISTINCT(topic)
FROM bronze
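
Because one bronze table carries every topic, each downstream parser first has to pick out its own rows. As a rough plain-Python analogy (not Spark code), the sketch below groups a few hypothetical records by topic. The sample rows and the `other_topic` name are invented for illustration; only `bpm` comes from this lesson.

```python
from collections import defaultdict

# Hypothetical multiplexed records; only the "bpm" topic name comes from this lesson.
records = [
    {"topic": "bpm", "value": b'{"device_id": 1, "heartrate": 61.5}'},
    {"topic": "other_topic", "value": b'{"note": "placeholder"}'},
    {"topic": "bpm", "value": b'{"device_id": 2, "heartrate": 72.0}'},
]

def split_by_topic(rows):
    """Group rows by their topic field, similar to running one filter per topic."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["topic"]].append(row)
    return dict(grouped)

by_topic = split_by_topic(records)
distinct_topics = sorted(by_topic)  # loosely analogous to SELECT DISTINCT(topic)
print(distinct_topics)
```

Each group can then be handed to logic written for that topic's payload layout.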

We'll cast our binary fields as strings, as this will allow us to manually review their contents.

In [0]:
%sql

SELECT cast(key AS STRING), cast(value AS STRING)
FROM bronze
LIMIT 20
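
To see why the cast helps, here is a minimal plain-Python sketch of what turning a binary column into a string amounts to conceptually. The sample rows are hypothetical, and UTF-8 is assumed to be the encoding the producer used for its payloads.

```python
# Hypothetical bronze rows: key and value are stored as raw bytes (BINARY).
sample_rows = [
    {"key": b"1", "value": b'{"device_id": 1, "time": 1577836800, "heartrate": 61.5}'},
    {"key": None, "value": b'{"device_id": 2, "time": 1577836860, "heartrate": 72.0}'},
]

def binary_to_string(raw):
    """Decode bytes as UTF-8 text; None (standing in for SQL NULL) stays None."""
    return None if raw is None else raw.decode("utf-8")

readable_rows = [
    {"key": binary_to_string(row["key"]), "value": binary_to_string(row["value"])}
    for row in sample_rows
]

for row in readable_rows:
    print(row)
```

Once decoded, the values are readable JSON text, which is what makes manual review possible.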

## Parse Heart Rate Recordings

Let's start by defining logic to parse our heart rate recordings. We'll write this logic against our static data. Note that there are some [unsupported operations](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#unsupported-operations) in Structured Streaming, so we may need to refactor some of our logic if we don't build our current queries with these limitations in mind.

Together, we'll iteratively develop a single query that parses our `bpm` topic to the following schema.

| field | type |
| --- | --- |
| device_id | LONG | 
| time | TIMESTAMP | 
| heartrate | DOUBLE |

We'll be creating the `heart_rate_silver` table shown in our architectural diagram.

<img src="https://files.training.databricks.com/images/ade/ADE_arch_heartrate_silver.png" width="60%" />

In [0]:
%sql
-- TODO

-- Use this cell to explore and build your query

## Convert Logic for Streaming Read

We can define a streaming read directly against our Delta table. Note that most configuration for streaming queries is done on write rather than read, so here we see little change to our above logic.

The cell below shows how to convert a static table into a streaming temp view (if you wish to write streaming queries with Spark SQL).

In [0]:
(spark.readStream
  .table("bronze")
  .createOrReplaceTempView("TEMP_bronze")
)

Updating our above query to refer to this temp view gives us a streaming result.

In [0]:
%sql

SELECT
 v.*
FROM
 (
   SELECT
     from_json(
       cast(value AS STRING),
       "device_id LONG, time TIMESTAMP, heartrate DOUBLE"
     ) v
   FROM
     TEMP_bronze
   WHERE
     topic = "bpm"
 )

The cell below has this logic refactored to Python.

In [0]:
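from pyspark.sql import functions as F  # may already be imported by the setup notebook

# Stream from bronze, keep only the bpm topic, then parse and flatten the JSON payload
bpmDF = (spark.readStream
  .table("bronze")
  .filter("topic = 'bpm'")
  .select(F.from_json(F.col("value").cast("string"), "device_id LONG, time TIMESTAMP, heartrate DOUBLE").alias("v"))
  .select("v.*")
)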
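
The filter, parse, and flatten steps in the query above can be mimicked in plain Python, which may help when reasoning about what each step contributes. This is only a sketch outside Spark: the sample rows are hypothetical, `time` is assumed to be epoch seconds, and returning all-`None` fields for bad payloads is a choice made for this sketch rather than a statement about how `from_json` behaves.

```python
import json
from datetime import datetime, timezone

# Hypothetical bronze rows; the payload layout is an assumption for illustration.
bronze_rows = [
    {"topic": "bpm", "value": b'{"device_id": 1, "time": 1577836800, "heartrate": 61.5}'},
    {"topic": "other_topic", "value": b'{"note": "ignored by this parser"}'},
    {"topic": "bpm", "value": b"not valid json"},
]

def parse_bpm(row):
    """Decode one bpm payload and coerce it to device_id, time, and heartrate."""
    try:
        payload = json.loads(row["value"].decode("utf-8"))
        return {
            "device_id": int(payload["device_id"]),
            # Assumes epoch seconds; the real encoding is not shown in this lesson.
            "time": datetime.fromtimestamp(payload["time"], tz=timezone.utc),
            "heartrate": float(payload["heartrate"]),
        }
    except (ValueError, KeyError, TypeError):
        # Sketch-only choice: unparseable records become all-None rows.
        return {"device_id": None, "time": None, "heartrate": None}

# Filter first, then parse and flatten, in the same order as the query above.
parsed = [parse_bpm(row) for row in bronze_rows if row["topic"] == "bpm"]

for record in parsed:
    print(record)
```

Filtering on `topic` before parsing means the schema only ever has to describe one payload layout.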

Note that any time a streaming read is displayed in a notebook, a streaming job will begin. To persist results to disk, a streaming write must be performed.

Using the `trigger(once=True)` option processes all available records as a single batch and then stops the stream.

In [0]:
(bpmDF.writeStream
    .option("checkpointLocation", Paths.silverRecordingsCheckpoint)
    .option("path", Paths.silverRecordingsTable)
    .trigger(once=True)
    .table("heart_rate_silver"))
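
The interplay between a run-once trigger and a checkpoint can be pictured with a toy model in plain Python. This is not how Structured Streaming stores its checkpoints; `run_once`, the JSON offset file, and the list-based source and sink are all hypothetical stand-ins chosen for illustration.

```python
import json
import os
import tempfile

def run_once(source, sink, checkpoint_path):
    """Process every record past the checkpointed offset as one batch, then stop."""
    start = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            start = json.load(f)["offset"]
    batch = source[start:]
    sink.extend(batch)
    # Record how far we have read so the next run resumes from here.
    with open(checkpoint_path, "w") as f:
        json.dump({"offset": len(source)}, f)
    return len(batch)

source = ["r1", "r2", "r3"]  # stands in for the bronze table
sink = []                    # stands in for the silver table
checkpoint = os.path.join(tempfile.mkdtemp(), "offset.json")

first = run_once(source, sink, checkpoint)   # picks up all three records
source.append("r4")                          # new data lands later
second = run_once(source, sink, checkpoint)  # picks up only the new record
third = run_once(source, sink, checkpoint)   # nothing new to process
print(first, second, third)
```

In this model, rerunning the write with the same checkpoint only picks up records that arrived since the previous run. Deleting the checkpoint would make the next run start again from the beginning of the source.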

<img src="https://files.training.databricks.com/images/icon_warn_24.png"/> Before continuing, make sure you cancel any streams. The `Run All` button at the top of the screen will say `Stop Execution` if you have a stream still running.

## Silver Table Motivations

In addition to parsing records, flattening them, and changing our schema, we should also check the quality of our data before writing to our silver tables.

In the following notebooks, we'll review various quality checks.

&copy; 2021 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>