# Lecture 24. Structured Streaming

## You will learn

- what a data stream is and how to process streaming data using Spark Structured Streaming,
- how to use a data stream reader to perform a streaming read from a source, and
- how to use and configure a data stream writer to perform a streaming write to a sink.

## Data Stream

A ***data stream*** is any data source that grows over time.
For example, new data in a data stream might correspond to:
  - a new JSON log file landing in cloud storage,
  - updates to a database captured in a Change Data Capture (CDC) feed,
  - events queued in a pub/sub messaging feed like Kafka.

### Processing a Data Stream

To process a data stream, you usually have two approaches:
  - the traditional approach, where you reprocess the entire dataset each time you receive a new update to your data, or
  - an incremental approach, where you write custom logic to capture only those files or records that have been added since the last time an update was run.

This second approach is where Spark Structured Streaming comes in.
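
The difference between the two approaches can be sketched in plain Python. This is a toy model, not Spark code: the `reprocess_all` function, the `IncrementalSum` class, and the sample numbers are all made up for illustration. Both approaches compute the same running total, but the incremental one only reads the records that arrived since its last run.

```python
from typing import List, Tuple


def reprocess_all(records: List[int]) -> Tuple[int, int]:
    """Traditional approach: recompute the result from every record seen so far."""
    return sum(records), len(records)  # (result, number of records read)


class IncrementalSum:
    """Incremental approach: remember the last position and read only newer records."""

    def __init__(self) -> None:
        self.total = 0
        self.next_offset = 0  # index of the first record not yet processed

    def update(self, records: List[int]) -> Tuple[int, int]:
        new_records = records[self.next_offset:]
        self.total += sum(new_records)
        self.next_offset = len(records)
        return self.total, len(new_records)  # (result, number of records read)


source = [3, 5]                      # the data source as it looks at the first run
incremental = IncrementalSum()
first_full = reprocess_all(source)
first_inc = incremental.update(source)

source.append(10)                    # one new record arrives
second_full = reprocess_all(source)
second_inc = incremental.update(source)

print("full reprocessing:", second_full)
print("incremental:      ", second_inc)
```

On the second run both approaches report the same total, but the full reprocessing reads all three records while the incremental version reads only the new one.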

## Spark Structured Streaming

**Spark Structured Streaming** is a scalable stream processing engine.
It allows you to query an infinite data source: it automatically detects new data and persists the results incrementally into a data sink.

<div style="text-align: center;">
<img src="../../assets/images/Spark Structured Streaming.jpg" style="width:640px" >
</div> 

A ***data sink*** is just a durable storage system, such as files or tables.

## Treating Infinite Data as a Table

But how do we interact with and query an infinite data source?

Simply by treating it as a table.

In fact, the magic behind Spark Structured Streaming is that it allows users to interact with an ever-growing data source as if it were just a static table of records.

<div style="text-align: center;">
<img src="../../assets/images/Treating Infinite Data as a Table.jpg" style="width:640px" >
</div> 

So new data in the input data stream is simply treated as new rows appended to a table.

Such a table, representing an infinite data source, is called an "unbounded" table.

## Input Streaming Table

As we said, an input data stream could be a directory of files, a messaging system like Kafka, or simply a Delta table.

Delta Lake is well integrated with Spark Structured Streaming.

<div style="text-align: center;">
<img src="../../assets/images/Input Streaming Table.jpg" style="width:640px" >
</div> 

We can simply use `spark.readStream` to query the Delta table as a stream source, which allows us to process all of the data present in the table as well as any new data that arrives later.

```python
# Parentheses let the method chain span several lines; readStream is a property, not a call.
streamDF = (spark.readStream
                 .table("Input_Table"))
```
This creates a streaming DataFrame on which we can apply almost any transformation as if it were just a static DataFrame.
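
To see why a transformation can behave the same on streaming and static data, here is a toy model in plain Python (it does not use the Spark API). The `transform` function and the sample rows are hypothetical. The sketch applies the same stateless query once to the whole table and once to each micro-batch of newly arrived rows, and the combined results match.

```python
from typing import Dict, List, Tuple

Row = Dict[str, object]


def transform(rows: List[Row]) -> List[Tuple[object, str]]:
    """A stateless query: keep only error events and upper-case the message."""
    return [(r["id"], str(r["msg"]).upper()) for r in rows if r["level"] == "ERROR"]


table: List[Row] = [
    {"id": 1, "level": "INFO", "msg": "started"},
    {"id": 2, "level": "ERROR", "msg": "disk full"},
    {"id": 3, "level": "INFO", "msg": "heartbeat"},
    {"id": 4, "level": "ERROR", "msg": "timeout"},
]

# Static view: run the query once over the whole table.
static_result = transform(table)

# Streaming view: the same rows arrive in three micro-batches,
# and the query runs on each batch of new rows only.
micro_batches = [table[:1], table[1:3], table[3:]]
stream_result: List[Tuple[object, str]] = []
for batch in micro_batches:
    stream_result.extend(transform(batch))

print(static_result == stream_result)
```

This equivalence holds here because the query looks at each row independently. Queries that need to see all rows at once are a different matter.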

Then, to persist the result of a streaming query, we need to write it out to durable storage using the `DataFrame.writeStream` property.
With the data stream writer it returns, we can configure our output.

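```python
# Parentheses let the method chain span several lines.
(streamDF.writeStream
         .trigger(processingTime="2 minutes")     # check for new data every 2 minutes
         .outputMode("append")                    # add only new rows to the target
         .option("checkpointLocation", "/path")   # where stream progress is tracked
         .toTable("Output_Table"))                # toTable() starts the streaming query (Spark 3.1+)
```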
Here, for example, we trigger the stream processing every 2 minutes to check whether new records have arrived, and we choose to append them to the target table.

After another 2 minutes, we check again for newly arrived data and append it to the target table.

All of this works thanks to the checkpoints created by Spark to track the progress of your stream processing.

### Trigger Intervals

Let us take a closer look at these configurations.

When defining a streaming write, the `trigger` method specifies when the system should process the next set of data.
This is called the ***trigger interval***.

<div style="text-align: center;">
<img src="../../assets/images/Trigger Intervals.jpg" style="width:640px" >
</div> 

By default, if you don't provide any trigger interval, the data will be processed every half second.

Alternatively, you can specify a fixed interval, as we did previously.
The data will then be processed in micro-batches at your specified interval, for example, every 5 minutes.

In addition, you can run your stream in batch mode to process all available data at once, using either the `once` trigger option or the `availableNow` option.

In both cases, the stream will stop on its own once it has finished processing the available data.

The only difference is that with `once`, all available data is processed in a single batch, whereas with `availableNow` it is processed in multiple micro-batches.
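
The difference between the two batch-style triggers can be illustrated with a small plain-Python model. This is only a sketch, not how Spark plans its batches: the `plan_batches` function, the `max_per_batch` parameter (a stand-in for a source rate limit), and the file names are all hypothetical.

```python
from typing import List


def plan_batches(backlog: List[str], trigger: str, max_per_batch: int = 2) -> List[List[str]]:
    """Return the batches a one-shot run would execute for the given backlog."""
    if trigger == "once":
        # Everything that is currently available goes into one single batch.
        return [list(backlog)] if backlog else []
    if trigger == "availableNow":
        # The same data, but split into several smaller micro-batches.
        return [backlog[i:i + max_per_batch] for i in range(0, len(backlog), max_per_batch)]
    raise ValueError(f"unknown trigger: {trigger}")


backlog = ["f1.json", "f2.json", "f3.json", "f4.json", "f5.json"]

once_batches = plan_batches(backlog, "once")
available_now_batches = plan_batches(backlog, "availableNow")

print("once:        ", len(once_batches), "batch(es)")
print("availableNow:", len(available_now_batches), "batch(es)")
```

Both plans cover exactly the same files and then stop; only the number of batches differs. In PySpark, these modes are selected with `.trigger(once=True)` and `.trigger(availableNow=True)`.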

### Output Modes

<div style="text-align: center;">
<img src="../../assets/images/writeStream Output Modes.jpg" style="width:640px" >
</div> 

Like static workloads, streaming jobs also have output modes, either:
- `append` mode, which is the default mode.

  In this mode, only newly arrived rows are incrementally appended to the target table with each batch.

- `complete` mode.

  In this mode, the result table is recalculated each time a write is triggered, so the target table is overwritten with each batch.
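
The two modes can be contrasted with a toy model in plain Python (not the Spark API). The batches and variable names below are made up for illustration. In the append case, each batch only adds its new rows to the target. In the complete case, a running aggregation is recalculated and the target is replaced with the full result on every batch.

```python
from collections import Counter
from typing import Dict, List

batches = [["a", "b", "a"], ["b", "c"]]  # two micro-batches of incoming keys

# append mode: each batch contributes only its new rows to the target.
append_target: List[str] = []
for batch in batches:
    append_target.extend(batch)

# complete mode: the aggregated result is recomputed and the target overwritten.
running_counts: Counter = Counter()
complete_target: Dict[str, int] = {}
for batch in batches:
    running_counts.update(batch)
    complete_target = dict(running_counts)  # replaces the previous contents

print(append_target)
print(complete_target)
```

After the second batch, the append target holds all five incoming rows, while the complete target holds one up-to-date count per key.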

### Checkpointing

Databricks creates checkpoints by storing the current state of your streaming job to cloud storage.

<div style="text-align: center;">
<img src="../../assets/images/writeStream Checkpointing.jpg" style="width:640px" >
</div> 

Checkpoints allow the streaming engine to track the progress of your stream processing.

An important note here is that checkpoints cannot be shared between several streams.

A separate checkpoint location is required for every streaming write to ensure processing guarantees.
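
A minimal sketch of why each stream needs its own checkpoint, written in plain Python rather than Spark. Everything here is hypothetical: the `run_once` function, the `offset.json` file layout, and the directory names are illustrative and do not reflect the actual checkpoint format. Each run reads the stored offset, processes only newer records, and saves the new offset.

```python
import json
import os
import tempfile
from typing import List


def run_once(source: List[str], checkpoint_dir: str, sink: List[str]) -> None:
    """Process the records added since the offset stored in checkpoint_dir."""
    os.makedirs(checkpoint_dir, exist_ok=True)
    path = os.path.join(checkpoint_dir, "offset.json")
    start = 0
    if os.path.exists(path):
        with open(path) as f:
            start = json.load(f)["next_offset"]
    sink.extend(source[start:])
    with open(path, "w") as f:
        json.dump({"next_offset": len(source)}, f)


source = ["r1", "r2"]
sink_a: List[str] = []
sink_b: List[str] = []
sink_shared: List[str] = []

with tempfile.TemporaryDirectory() as root:
    ckpt_a = os.path.join(root, "stream_a")
    ckpt_b = os.path.join(root, "stream_b")

    run_once(source, ckpt_a, sink_a)        # stream A processes r1 and r2
    source.append("r3")
    run_once(source, ckpt_a, sink_a)        # A resumes and processes only r3

    run_once(source, ckpt_b, sink_b)        # stream B, own checkpoint: sees every record
    run_once(source, ckpt_a, sink_shared)   # a stream reusing A's checkpoint sees nothing

print(sink_a, sink_b, sink_shared)
```

The stream that reuses another stream's checkpoint inherits that stream's progress and silently skips data it has never processed, which is why sharing a checkpoint location breaks the processing guarantees.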

## Guarantees

Structured Streaming provides two guarantees.

1. Fault tolerance

    First, in case of failure, the streaming engine can resume from where it left off.

    This is thanks to both checkpointing and a mechanism called *write-ahead logs*, which record the offset range of the data being processed during each trigger interval in order to track your stream's progress.

2. Exactly-once guarantee

    Structured Streaming also ensures exactly-once data processing, because the streaming sinks are designed to be idempotent.

    That is, multiple writes of the same data, identified by its offset, do not result in duplicates being written to the sink.

These two guarantees only hold if the streaming source is repeatable, like cloud-based object storage or a pub/sub messaging service.

Taken together, repeatable data sources and idempotent sinks allow Spark Structured Streaming to ensure end-to-end exactly-once semantics under any failure condition.
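
How these pieces fit together can be sketched with a toy model in plain Python. This is an illustration under simplifying assumptions, not Spark's implementation: the `IdempotentSink` class, the `offset_log` and `commit_log` structures, and the `run_batch` function are all hypothetical. The offset range of a batch is logged before processing, a simulated failure interrupts the batch after the sink write, and the replay of the same range on restart produces no duplicates.

```python
from typing import Dict, List, Set, Tuple


class IdempotentSink:
    """A sink that ignores a batch it has already written."""

    def __init__(self) -> None:
        self.batches: Dict[int, List[str]] = {}

    def write(self, batch_id: int, rows: List[str]) -> bool:
        if batch_id in self.batches:
            return False  # replay of a known batch: nothing is written twice
        self.batches[batch_id] = list(rows)
        return True

    def rows(self) -> List[str]:
        return [r for batch_id in sorted(self.batches) for r in self.batches[batch_id]]


source = ["e0", "e1", "e2", "e3"]            # repeatable: any offset range can be re-read
offset_log: Dict[int, Tuple[int, int]] = {}  # write-ahead log: batch id -> offset range
commit_log: Set[int] = set()                 # ids of batches known to be finished
sink = IdempotentSink()


def run_batch(batch_id: int, start: int, end: int, fail_before_commit: bool = False) -> None:
    # Record the planned offset range before touching the data (write-ahead).
    start, end = offset_log.setdefault(batch_id, (start, end))
    sink.write(batch_id, source[start:end])
    if fail_before_commit:
        raise RuntimeError("simulated failure after the sink write")
    commit_log.add(batch_id)


try:
    run_batch(0, 0, 2, fail_before_commit=True)
except RuntimeError:
    pass  # the job dies with batch 0 written but not yet committed

# Restart: replay every logged batch that was never committed, using the same offsets.
for batch_id in sorted(set(offset_log) - commit_log):
    run_batch(batch_id, *offset_log[batch_id])

run_batch(1, 2, 4)  # normal processing continues with the next batch
print(sink.rows())
```

Batch 0 is sent to the sink twice, once before the failure and once during recovery, yet each event appears in the output exactly once.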

## Unsupported Operations

Lastly, we need to understand that some operations are not supported by streaming DataFrames.

It is true that most operations on a streaming DataFrame are identical to those on a static DataFrame, but there are some exceptions.

Operations such as **sorting** and **deduplication** are either too complex or logically not possible when working with streaming data.
A full discussion of these exceptions is out of the scope of this course.

However, there are advanced streaming methods like **windowing** and **watermarking** that can help with such operations.