### Understanding Incremental Data

Spark Structured Streaming extends the functionality of Apache Spark to simplify configuration and bookkeeping when processing incremental datasets. In the past, much of the emphasis for streaming with big data has been on reducing latency to provide near-real-time analytic insights. While Structured Streaming performs well against those goals, this lesson focuses on the applications of incremental data processing.

While incremental processing is not absolutely necessary to work successfully in the data lakehouse, experience has shown that many workloads can benefit substantially from an incremental processing approach. To that end, many of Databricks' core features have been optimized specifically for handling these ever-growing datasets.

Consider the following datasets and use cases:
* Data scientists need secure, de-identified, versioned access to frequently updated records in an operational database
* Credit card transactions need to be compared to past customer behavior to identify and flag fraud
* A multi-national retailer seeks to serve custom product recommendations using purchase history
* Log files from distributed systems need to be analyzed to detect and respond to instabilities
* Clickstream data from millions of online shoppers needs to be leveraged for A/B testing of UX

These are just a small sample of datasets that grow incrementally and infinitely over time.  Here, we demonstrate the basics of using Spark Structured Streaming for incremental data processing.

#### Objectives
* Describe the programming model used by Spark Structured Streaming
* Configure required options to perform a streaming read on a source
* Describe the requirements for end-to-end fault tolerance
* Configure required options to perform a streaming write to a sink

First, run the following cell to import the data and make various utilities available for our experimentation.

In [0]:
%run ./Includes/4.1-setup

#### 1.0. Treating Infinite Data as a Table
The main benefit of Spark Structured Streaming is that it enables users to interact with ever-growing data sources as if they were simply a static table of records.

<img src="http://spark.apache.org/docs/latest/img/structured-streaming-stream-as-a-table.png" width="800"/>

In the graphic above, a **data stream** describes any data source that grows over time. New data in a data stream might correspond to:
* A new JSON log file landing in cloud storage
* Updates to a database captured in a CDC feed
* Events queued in a pub/sub messaging feed
* A CSV file of sales closed the previous day

Historically, to update the results of a continuous stream of real-time data, either the entire source dataset had to be completely reprocessed, or custom logic had to be implemented to identify and process only those files or records that had been added since the previous update was executed.  Structured Streaming enables defining a query against the data source to automatically detect new records and propagate them through previously defined logic. **Spark Structured Streaming is optimized on Databricks to integrate closely with Delta Lake and Auto Loader.**
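
To make the "custom logic" alternative concrete, here is a minimal sketch of the bookkeeping a pipeline would otherwise maintain by hand. It uses plain Python with no Spark involved; the file names, the `find_new_files` and `run_update` helpers, and the dictionary standing in for cloud storage are all hypothetical.

```python
# Hypothetical sketch: hand-rolled incremental bookkeeping (no Spark involved).
def find_new_files(all_files, processed):
    """Return the files that have not been processed yet, in a stable order."""
    return sorted(set(all_files) - processed)


def run_update(all_files, processed, totals, read_count):
    """Process only unseen files and fold their record counts into the totals."""
    for path in find_new_files(all_files, processed):
        totals["records"] += read_count(path)
        processed.add(path)  # remember the file so it is never read twice
    return totals


# Stand-in for a landing zone: path -> number of records in that file.
landing_zone = {"logs/01.json": 3, "logs/02.json": 5}
processed, totals = set(), {"records": 0}

run_update(landing_zone.keys(), processed, totals, landing_zone.get)
landing_zone["logs/03.json"] = 2  # a new file lands after the first update
run_update(landing_zone.keys(), processed, totals, landing_zone.get)
print(totals["records"])  # 10
```

The `processed` set is exactly the kind of state that must survive restarts and stay consistent with the output, which is the bookkeeping Structured Streaming takes over.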

#### 2.0. Basic Concepts

- The developer defines an **input table** by configuring a streaming read against a **source**. The syntax for doing this is similar to working with static data.
- A **query** is defined against the input table. Both the DataFrames API and Spark SQL can be used to easily define transformations and actions against the input table.
- This logical query on the input table generates the **results table**. The results table contains the incremental state information of the stream.
- The **output** of a streaming pipeline will persist updates to the results table by writing to an external **sink**. Generally, a sink will be a durable system such as files or a pub/sub messaging bus.
- New rows are appended to the input table for each **trigger interval**. These new rows are essentially analogous to micro-batch transactions and will be automatically propagated through the results table to the sink.

<img src="http://spark.apache.org/docs/latest/img/structured-streaming-model.png" width="800"/>


For more information, see the section in the <a href="http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#basic-concepts" target="_blank">Structured Streaming Programming Guide</a>.
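
The concepts above can be pictured with a toy model in plain Python. This is only an illustration of the input table, results table, sink, and trigger vocabulary, not how Spark is implemented; the `MicroBatchModel` class and its fields are made up for this sketch.

```python
from collections import Counter


class MicroBatchModel:
    """Toy model of an append-only input table feeding a results table and a sink."""

    def __init__(self):
        self.input_table = []           # grows with every trigger
        self.results_table = Counter()  # running count per device_id
        self.sink = {}                  # external copy of the results table

    def trigger(self, new_rows):
        """Simulate one trigger interval that delivers a micro-batch of rows."""
        self.input_table.extend(new_rows)
        # Only the new rows are folded into the existing results.
        self.results_table.update(row["device_id"] for row in new_rows)
        # The output step persists the updated results to the sink.
        self.sink = dict(self.results_table)


model = MicroBatchModel()
model.trigger([{"device_id": "a"}, {"device_id": "b"}])
model.trigger([{"device_id": "a"}])
print(model.sink)  # {'a': 2, 'b': 1}
```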

#### 3.0. End-to-End Fault Tolerance
Structured Streaming ensures end-to-end exactly-once fault-tolerance guarantees through _checkpointing_ (discussed below) and <a href="https://en.wikipedia.org/wiki/Write-ahead_logging" target="_blank">write-ahead logs</a>. Structured Streaming sources, sinks, and the underlying execution engine work together to track the progress of stream processing. If a failure occurs, the streaming engine attempts to restart and/or reprocess the data. For best practices on recovering from a failed streaming query, see the <a href="https://docs.databricks.com/spark/latest/structured-streaming/production.html#recover-from-query-failures" target="_blank">documentation</a>. Note: This approach _only_ works if the streaming source is replayable; replayable sources include cloud-based object storage and pub/sub messaging services.

At a high level, the underlying streaming mechanism relies on two complementary approaches:
* First, Structured Streaming uses checkpointing and write-ahead logs to record the offset range of data being processed during each trigger interval.
* Next, the streaming sinks are designed to be _idempotent_—that is, multiple writes of the same data (as identified by the offset) do _not_ result in duplicates being written to the sink.

Taken together, replayable data sources and idempotent sinks allow Structured Streaming to ensure **end-to-end, exactly-once semantics** under any failure condition.
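
The interplay of recorded offsets, a replayable source, and an idempotent sink can be sketched in plain Python. This is a simplified simulation under stated assumptions, not Spark's implementation: the `ReplayableSource`, `IdempotentSink`, and `run_stream` names, and the dictionary used as a checkpoint, are hypothetical.

```python
class ReplayableSource:
    """A source whose records can be re-read by offset range."""

    def __init__(self, records):
        self.records = list(records)

    def read(self, start, end):
        return self.records[start:end]


class IdempotentSink:
    """A sink keyed by batch id, so rewriting a batch replaces it instead of duplicating it."""

    def __init__(self):
        self.batches = {}

    def write(self, batch_id, rows):
        self.batches[batch_id] = list(rows)

    def rows(self):
        return [r for b in sorted(self.batches) for r in self.batches[b]]


def run_stream(source, sink, checkpoint, batch_size, crash_on_batch=None):
    """Process micro-batches until no new data remains, resuming from the checkpoint."""
    while True:
        batch_id = len(checkpoint["commits"])  # first batch without a commit
        if batch_id in checkpoint["offsets"]:
            # A previous run planned this batch but never committed it: replay it.
            start, end = checkpoint["offsets"][batch_id]
        else:
            start = checkpoint["offsets"][batch_id - 1][1] if batch_id else 0
            end = min(start + batch_size, len(source.records))
            if start == end:
                return  # nothing new to process
            # Write-ahead: record the offset range before processing it.
            checkpoint["offsets"][batch_id] = (start, end)
        sink.write(batch_id, source.read(start, end))
        if batch_id == crash_on_batch:
            raise RuntimeError("simulated crash after write, before commit")
        checkpoint["commits"].add(batch_id)


source = ReplayableSource(range(5))
sink = IdempotentSink()
checkpoint = {"offsets": {}, "commits": set()}

try:
    run_stream(source, sink, checkpoint, batch_size=2, crash_on_batch=1)
except RuntimeError:
    pass  # batch 1 was written to the sink but never committed

run_stream(source, sink, checkpoint, batch_size=2)  # restart from the checkpoint
print(sink.rows())  # [0, 1, 2, 3, 4]
```

Batch 1 is written twice, once before the crash and once after the restart, yet every record appears in the sink exactly once because the second write replaces the first.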


#### 4.0. Reading a Stream
The **`spark.readStream`** property returns a **`DataStreamReader`** used to configure and query the stream. In the previous lesson, we saw code configured for incrementally reading with Auto Loader. Here, we'll show how easy it is to incrementally read a Delta Lake table. The code below uses the PySpark API to incrementally read a Delta Lake table named **`bronze`** and register a streaming temp view named **`streaming_tmp_vw`**.

**NOTE**: A number of optional configurations (not shown here) can be set when configuring incremental reads, the most important of which allows you to <a href="https://docs.databricks.com/delta/delta-streaming.html#limit-input-rate" target="_blank">limit the input rate</a>.

In [0]:
(spark.readStream
    .table("bronze")
    .createOrReplaceTempView("streaming_tmp_vw"))
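
As a sketch of the input-rate limit mentioned in the note above, the Delta Lake streaming source accepts a `maxFilesPerTrigger` option that caps how many new files each micro-batch considers. The helper name `read_bronze_rate_limited` and the default value of 1 are assumptions chosen for illustration; the function expects the notebook's existing `spark` session to be passed in.

```python
def read_bronze_rate_limited(spark, max_files_per_trigger=1):
    """Return a streaming DataFrame over `bronze`, capped per micro-batch.

    `spark` is an existing SparkSession; the cap of 1 file is an arbitrary
    illustrative default, not a recommended setting.
    """
    return (spark.readStream
            .option("maxFilesPerTrigger", max_files_per_trigger)
            .table("bronze"))
```

In a notebook this might be used as `read_bronze_rate_limited(spark, 10).createOrReplaceTempView("streaming_tmp_vw")`. A small cap is convenient in demos because it spreads a backlog across several visible micro-batches.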

When we execute a query on a streaming temporary view, we'll continue to update the results of the query as new data arrives in the source.  Think of a query executed against a streaming temp view as an **always-on incremental query**.

**NOTE**: Generally speaking, unless a human is actively monitoring the output of a query during development or live dashboarding, we won't return streaming results to a notebook.

In [0]:
%sql
SELECT * FROM streaming_tmp_vw

You will recognize the data as being the same as the Delta table written out in our previous lesson.  Before continuing, click **`Stop Execution`** at the top of the notebook, **`Cancel`** immediately under the cell, or run the following cell to stop all active streaming queries.

In [0]:
for s in spark.streams.active:
    print("Stopping " + s.id)
    s.stop()
    s.awaitTermination()

#### 5.0. Working with Streaming Data
We can execute most transformations against streaming temp views the same way we would with static data. Here, we'll run a simple aggregation to get counts of records for each **`device_id`**. Because we are querying a streaming temp view, this becomes a streaming query that executes indefinitely, rather than completing after retrieving a single set of results. For streaming queries like this, Databricks Notebooks include interactive dashboards that allow users to monitor streaming performance. Explore this below. One important note regarding this example: this is merely displaying an aggregation of input as seen by the stream. **None of these records are being persisted anywhere at this point.**

In [0]:
%sql
SELECT device_id, count(device_id) AS total_recordings
FROM streaming_tmp_vw
GROUP BY device_id

Before continuing, click **`Stop Execution`** at the top of the notebook, **`Cancel`** immediately under the cell, or run the following cell to stop all active streaming queries.

In [0]:
for s in spark.streams.active:
    print("Stopping " + s.id)
    s.stop()
    s.awaitTermination()

#### 6.0. Unsupported Operations

Most operations on a streaming DataFrame are identical to those on a static DataFrame. There are <a href="https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#unsupported-operations" target="_blank">some exceptions to this</a>. Consider the model of the data as a constantly appending table. Sorting is one of a handful of operations that are either too complex or logically impossible to perform on streaming data. A full discussion of these exceptions is out of scope for this course. Note that advanced streaming methods like windowing and watermarking can be used to add functionality to incremental workloads.

Uncomment and run the following cell to see how this failure may appear:

In [0]:
# %sql
# SELECT * 
# FROM streaming_tmp_vw
# ORDER BY time
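
One way to see why a global sort does not fit an append-only stream is to simulate it in plain Python. In this sketch (the function name and sample values are made up), each micro-batch is sorted on arrival and appended to output that can no longer be rearranged.

```python
def incremental_sorted_append(batches):
    """Sort each batch on arrival and append it; earlier output is never moved."""
    emitted = []
    for batch in batches:
        emitted.extend(sorted(batch))
    return emitted


batches = [[5, 9], [1, 7]]  # values standing in for the `time` column
emitted = incremental_sorted_append(batches)
globally_sorted = sorted(value for batch in batches for value in batch)

print(emitted)          # [5, 9, 1, 7]
print(globally_sorted)  # [1, 5, 7, 9]
```

The row with value 1 arrives after 5 and 9 have already been emitted, so a correct ordering would require rewriting earlier output. This is why Spark only permits sorting a stream after an aggregation, in complete output mode, where the whole result is rewritten on every trigger.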

#### 7.0. Persisting Streaming Results
In order to persist incremental results, we need to pass our logic back to the PySpark Structured Streaming DataFrames API.  Above, we created a temp view from a PySpark streaming DataFrame. If we create another temp view from the results of a query against a streaming temp view, we'll again have a streaming temp view.

In [0]:
%sql
CREATE OR REPLACE TEMP VIEW device_counts_tmp_vw AS (
  SELECT device_id, COUNT(device_id) AS total_recordings
  FROM streaming_tmp_vw
  GROUP BY device_id
)

#### 8.0. Writing a Stream
To persist the results of a streaming query, we must write them out to durable storage. The **`DataFrame.writeStream`** property returns a **`DataStreamWriter`** used to configure the output. When writing to Delta Lake tables, we typically only need to worry about three settings, discussed here.


##### 8.1. Checkpointing
Databricks creates checkpoints by storing the current state of your streaming job to cloud storage. Checkpointing combines with write ahead logs to allow a terminated stream to be restarted and continue from where it left off. Checkpoints cannot be shared between separate streams. A checkpoint is required for every streaming write to ensure processing guarantees.
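
Because a checkpoint belongs to exactly one stream, it can help to guard against accidental reuse when many streams are defined in one project. The following is a hypothetical convenience in plain Python, not a Spark or Databricks API; the `CheckpointRegistry` class and the example paths are assumptions.

```python
class CheckpointRegistry:
    """Track which stream owns each checkpoint path and reject sharing."""

    def __init__(self):
        self._owners = {}  # checkpoint path -> stream name

    def claim(self, stream_name, path):
        """Return `path` if it is free or already owned by `stream_name`."""
        owner = self._owners.setdefault(path, stream_name)
        if owner != stream_name:
            raise ValueError(
                f"checkpoint {path!r} already belongs to stream {owner!r}"
            )
        return path


registry = CheckpointRegistry()
silver_checkpoint = registry.claim("device_counts", "/checkpoints/silver")
```

Restarting the same stream with the same path is allowed, since that is how a stream resumes from where it left off. A second, different stream claiming `/checkpoints/silver` raises a `ValueError`.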


##### 8.2. Output Modes
Streaming jobs have output modes similar to static/batch workloads. <a href="https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-modes" target="_blank">More details here</a>.

| Mode   | Example | Notes |
| ------------- | ----------- | --- |
| **Append** | **`.outputMode("append")`**     | **This is the default.** Only newly appended rows are incrementally appended to the target table with each batch |
| **Complete** | **`.outputMode("complete")`** | The Results Table is recalculated each time a write is triggered; the target table is overwritten with each batch |
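
The difference between the two modes can be illustrated with a small plain-Python simulation. This is a conceptual sketch only: the `apply_output_mode` function is hypothetical, and lists stand in for the target tables.

```python
from collections import Counter


def apply_output_mode(target, mode, new_rows, results_table):
    """Return the target table contents after one micro-batch write."""
    if mode == "append":
        # Only the newly arrived rows are added to the target.
        return target + list(new_rows)
    if mode == "complete":
        # The target is overwritten with the full, recalculated results table.
        return sorted(results_table.items())
    raise ValueError(f"unsupported mode: {mode}")


batches = [["a", "b"], ["a"]]  # device_id values arriving in two micro-batches
append_target, complete_target, counts = [], [], Counter()

for batch in batches:
    counts.update(batch)  # the aggregated results table
    append_target = apply_output_mode(append_target, "append", batch, counts)
    complete_target = apply_output_mode(complete_target, "complete", batch, counts)

print(append_target)    # ['a', 'b', 'a']
print(complete_target)  # [('a', 2), ('b', 1)]
```

An aggregate such as a per-device count changes for existing keys as new data arrives, so appending rows cannot represent it. That is why the aggregation written later in this lesson uses complete mode.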


##### 8.3. Trigger Intervals
When defining a streaming write, the **`trigger`** method specifies when the system should process the next set of data.

| Trigger Type                           | Example | Notes |
|----------------------------------------|-----------|-------------|
| Unspecified                            |  | **This is the default.** This is equivalent to using **`processingTime="500ms"`** |
| Fixed interval micro-batches           | **`.trigger(processingTime="2 minutes")`** | The query will be executed in micro-batches and kicked off at the user-specified intervals |
| One-time micro-batch                   | **`.trigger(once=True)`** | The query will execute a single micro-batch to process all the available data and then stop on its own |

*Note that triggers are specified when defining how data will be written to a sink and control the frequency of micro-batches. By default, Spark will automatically detect and process all data in the source that has been added since the last trigger.*
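
To build intuition for how a trigger groups data, the sketch below assigns record arrival times (in seconds) to the trigger that would pick them up. It is a plain-Python simplification that assumes every micro-batch finishes within its interval; the function names and arrival times are made up.

```python
def fixed_interval_batches(arrival_times, interval):
    """Group arrival times by the first trigger at or after each arrival."""
    batches = {}
    for t in sorted(arrival_times):
        # Round up to the next multiple of `interval` (integer ceiling division).
        trigger_time = -(-t // interval) * interval
        batches.setdefault(trigger_time, []).append(t)
    return batches


def one_time_batch(arrival_times):
    """A one-time trigger processes everything available in a single batch."""
    return [sorted(arrival_times)]


arrivals = [1, 3, 5, 9]
print(fixed_interval_batches(arrivals, interval=4))  # {4: [1, 3], 8: [5], 12: [9]}
print(one_time_batch(arrivals))                      # [[1, 3, 5, 9]]
```

A longer interval produces fewer, larger micro-batches at the cost of higher latency, while a one-time trigger behaves like a scheduled batch job that still tracks what it has already processed.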


#### 9.0. Pulling It All Together
The code below demonstrates using **`spark.table()`** to load data from a streaming temp view back to a DataFrame. Note that Spark will always load streaming views as a streaming DataFrame and static views as static DataFrames (meaning that incremental processing must be defined with read logic to support incremental writing). 

In this first query, we'll demonstrate using **`trigger(once=True)`** to perform incremental batch processing.

In [0]:
(spark.table("device_counts_tmp_vw")                               
    .writeStream                                                
    .option("checkpointLocation", f"{DA.paths.checkpoints}/silver")
    .outputMode("complete")
    .trigger(once=True)
    .table("device_counts")
    .awaitTermination() # This optional method blocks execution of the next cell until the incremental batch write has succeeded
)

Below, we change the trigger to convert this query from a triggered incremental batch to an always-on query that runs every 4 seconds.

**NOTE**: As we start this query, no new records exist in our source table. We'll add new data shortly.

In [0]:
query = (spark.table("device_counts_tmp_vw")                               
              .writeStream                                                
              .option("checkpointLocation", f"{DA.paths.checkpoints}/silver")
              .outputMode("complete")
              .trigger(processingTime='4 seconds')
              .table("device_counts"))

# Like before, wait until our stream has processed some data
DA.block_until_stream_is_ready(query)

#### 10.0. Querying the Output
Now let's query the output we've written from SQL. Because the result is a table, we only need to deserialize the data to return the results.  Because we are now querying a table (not a streaming DataFrame), the following will **not** be a streaming query.

In [0]:
%sql
SELECT *
FROM device_counts

#### 11.0. Land New Data

As in our previous lesson, we have configured a helper function to process new records into our source table. Run the cell below to land another batch of data.

In [0]:
DA.data_factory.load()

Query the target table again to see the updated counts for each **`device_id`**.

In [0]:
%sql
SELECT *
FROM device_counts

#### 12.0. Clean Up
Feel free to continue landing new data and exploring the table results with the cells above.  When you're finished, run the following cell to stop all active streams and remove created resources before continuing.

In [0]:
DA.cleanup()