## Structured Streaming

[ Ref _Learning Spark v2_ book, _Chapter 8_.]

Traditionally, distributed stream processing has been implemented with a record-at-a-time processing model.

<img src="https://github.com/Marco-Santoni/databricks-from-scratch/blob/main/training-spark/img/chapter_8_traditional.png?raw=true" width="500">

Ref [O'Reilly](https://www.oreilly.com/library/view/learning-spark-2nd/9781492050032/ch08.html)

This processing model can achieve very low latencies: an input record can be processed by the pipeline and the resulting output generated within milliseconds. However, it is not very efficient at recovering from node failures or at coping with straggler nodes (i.e., nodes that are slower than others).

### Micro-batch stream processing

Spark Streaming introduced the idea of micro-batch stream processing, where the streaming computation is modeled as a continuous series of small, map/reduce-style batch processing jobs (hence, “micro-batches”) on small chunks of the stream data.

<img src="https://github.com/Marco-Santoni/databricks-from-scratch/blob/main/training-spark/img/chapter_8_microbatch.png?raw=true" width="500">

[ Ref _Learning Spark v2_ book, _page 208_]

As shown here, Spark Streaming divides the data from the input stream into, say, 1-second micro-batches. Each batch is processed in the Spark cluster in a distributed manner with small deterministic tasks that generate the output in micro-batches.
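As a rough illustration of the idea (plain Python, not Spark's implementation), the sketch below groups timestamped records into 1-second micro-batches and runs the same small deterministic task on each batch. The names `split_into_micro_batches` and `process_batch` and the sample records are hypothetical, chosen only for this example.

```python
from collections import defaultdict


def split_into_micro_batches(records, batch_seconds=1.0):
    """Group (timestamp, value) records into consecutive fixed-width batches."""
    batches = defaultdict(list)
    for ts, value in records:
        # Records whose timestamps fall in the same window share a batch id.
        batch_id = int(ts // batch_seconds)
        batches[batch_id].append(value)
    return [(batch_id, batches[batch_id]) for batch_id in sorted(batches)]


def process_batch(values):
    """A small deterministic task: count the words in one micro-batch."""
    counts = {}
    for line in values:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts


# Hypothetical input stream: (arrival time in seconds, text line).
records = [(0.1, "a b"), (0.7, "b"), (1.2, "a"), (2.5, "c c")]

micro_batches = split_into_micro_batches(records)
outputs = [(batch_id, process_batch(values)) for batch_id, values in micro_batches]
```

Each entry in `outputs` is the result of one micro-batch, so the output is itself produced in micro-batches, as described above.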

Main advantages:

1. **Fault tolerance**. Recovering from failures and stragglers is quick and easy thanks to Spark's task scheduling.
2. **Determinism**. The deterministic nature of the tasks ensures that the output data is the same no matter how many times a task is re-executed. This crucial characteristic enables Spark Streaming to provide end-to-end exactly-once processing guarantees, that is, the generated output results will be such that every input record was processed exactly once.

Main disadvantage: **latency**. Micro-batching typically yields latencies of a few seconds rather than a few milliseconds.
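The link between determinism and exactly-once output can be sketched with a toy model. It assumes a sink that keys its writes by batch id, so that re-running a batch overwrites the earlier write with an identical value. That sink, the class name `IdempotentSink`, and the sample data are assumptions of this sketch and not a description of Spark's internals.

```python
def process_batch(values):
    """Deterministic task: the same input always yields the same output."""
    return sum(values)


class IdempotentSink:
    """Toy sink that stores one result per batch id (hypothetical)."""

    def __init__(self):
        self.committed = {}

    def commit(self, batch_id, result):
        # Writing the same batch twice replaces the old value with an equal one.
        self.committed[batch_id] = result


batches = {0: [1, 2, 3], 1: [4, 5]}
sink = IdempotentSink()

for batch_id, values in batches.items():
    sink.commit(batch_id, process_batch(values))

# Simulate a failure right after batch 1 was written: the batch is re-executed.
sink.commit(1, process_batch(batches[1]))

total = sum(sink.committed.values())
```

Even though batch 1 ran twice, every input record contributes to `total` exactly once, because the retried task produced the same result and the sink stored it under the same key.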

### Programming model

Structured Streaming extends the concept of a table to streaming applications by treating a stream as an unbounded, continuously appended table.

<img src="https://spark.apache.org/docs/latest/img/structured-streaming-stream-as-a-table.png" width="500">

Ref [Spark docs](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#basic-concepts)

Every new record received in the data stream is like a new row being appended to the unbounded input table. Structured Streaming will not actually retain all the input, but the output produced by Structured Streaming until time T will be equivalent to having all of the input until T in a static, bounded table and running a batch job on the table.

### Incrementalization

The developer then defines a query on this conceptual input table, as if it were a static table, to compute the result table that will be written to an output sink. Structured Streaming will automatically convert this batch-like query to a streaming execution plan.


<img src="https://spark.apache.org/docs/latest/img/structured-streaming-model.png" width="500">

Ref [Spark docs](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#basic-concepts)
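To make incrementalization concrete, here is a minimal plain-Python sketch (not how Spark builds its execution plan). A word count is written once as a batch function and once as an incremental version that keeps state between triggers. At every trigger, the incremental result is checked against the batch result over all input seen so far. The names `batch_word_count` and `IncrementalWordCount` and the sample arrivals are hypothetical.

```python
from collections import Counter


def batch_word_count(lines):
    """Batch query: recompute the result from the whole input table."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


class IncrementalWordCount:
    """Streaming version: keeps state and only reads the new rows."""

    def __init__(self):
        self.state = Counter()

    def on_trigger(self, new_lines):
        for line in new_lines:
            self.state.update(line.split())
        return Counter(self.state)  # snapshot of the result table


# Rows arriving between consecutive triggers (the third trigger sees no data).
arrivals = [["spark streaming"], ["spark"], [], ["streaming table spark"]]

query = IncrementalWordCount()
seen = []  # the conceptual unbounded input table
for new_lines in arrivals:
    seen.extend(new_lines)
    result = query.on_trigger(new_lines)
    # The incremental result matches a batch job over all input so far.
    assert result == batch_word_count(seen)
```

The incremental version never rereads old rows; it only needs the state carried over from the previous trigger, which is why the full input does not have to be retained.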

Finally, developers specify **triggering** policies to control when to update the results. Each time a trigger fires, Structured Streaming checks for new data (i.e., new rows in the input table) and incrementally updates the result.
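In PySpark, a trigger policy is set with `DataStreamWriter.trigger(...)`. The helper below is only a sketch: the policy labels (`"default"`, `"processing_time"`, `"available_now"`, `"continuous"`) and the function names are made up for this example, while the keyword arguments `processingTime`, `availableNow` and `continuous` are the ones accepted by `trigger`. It assumes `writer` is the `writeStream` of a streaming DataFrame.

```python
def trigger_kwargs(policy, interval="1 second"):
    """Map a hypothetical policy label to DataStreamWriter.trigger keywords."""
    if policy == "default":
        # No explicit trigger: the next micro-batch starts as soon as
        # the previous one has finished.
        return {}
    if policy == "processing_time":
        # Fire at a fixed interval.
        return {"processingTime": interval}
    if policy == "available_now":
        # Process all data available at start, then stop (Spark 3.3+).
        return {"availableNow": True}
    if policy == "continuous":
        # Experimental continuous mode; the interval is the checkpoint interval.
        return {"continuous": interval}
    raise ValueError(f"unknown trigger policy: {policy!r}")


def apply_trigger(writer, policy, interval="1 second"):
    """Return the writer with the chosen trigger policy applied."""
    kwargs = trigger_kwargs(policy, interval)
    return writer.trigger(**kwargs) if kwargs else writer
```

For example, `apply_trigger(df.writeStream.format("console"), "processing_time", "5 seconds").start()` would check for new data every five seconds, where `df` stands for some streaming DataFrame.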

The last part of the model is the **output** mode. The _output_ is defined as what gets written out to the external storage. It can be written in one of the following modes:

- **Complete** - The entire updated Result Table will be written to the external storage. It is up to the storage connector to decide how to handle writing of the entire table.
- **Append** - Only the new rows appended to the Result Table since the last trigger will be written to the external storage. This is applicable only to queries where existing rows in the Result Table are not expected to change.
- **Update** - Only the rows that were updated in the Result Table since the last trigger will be written to the external storage. Note that this is different from the Complete Mode in that this mode only outputs the rows that have changed since the last trigger. If the query doesn’t contain aggregations, it will be equivalent to Append mode.
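The difference between the three modes can be sketched in plain Python by modelling the Result Table as a dictionary and comparing its state at the previous and the current trigger. This is a toy model only. The function `rows_to_write` and the sample tables are hypothetical, and raising an error for Append when an existing row changed is a simplification of this sketch.

```python
def rows_to_write(previous, current, mode):
    """Return the rows of `current` that a given output mode would write."""
    if mode == "complete":
        # The entire updated Result Table.
        return dict(current)
    if mode == "update":
        # Rows that are new or whose value changed since the last trigger.
        return {
            key: value
            for key, value in current.items()
            if key not in previous or previous[key] != value
        }
    if mode == "append":
        # Only valid when existing rows never change.
        changed = [k for k in previous if current.get(k) != previous[k]]
        if changed:
            raise ValueError(f"append mode: existing rows changed: {changed}")
        return {k: v for k, v in current.items() if k not in previous}
    raise ValueError(f"unknown output mode: {mode!r}")


# Aggregation: the count for "spark" is updated between two triggers.
prev_counts = {"spark": 2, "table": 1}
curr_counts = {"spark": 3, "table": 1, "stream": 1}
complete_out = rows_to_write(prev_counts, curr_counts, "complete")
update_out = rows_to_write(prev_counts, curr_counts, "update")

# No aggregation: rows are only ever appended, so Update equals Append.
prev_rows = {1: "a"}
curr_rows = {1: "a", 2: "b"}
append_out = rows_to_write(prev_rows, curr_rows, "append")
update_no_agg = rows_to_write(prev_rows, curr_rows, "update")
```

With the aggregated table, Complete writes all three rows while Update writes only `spark` and `stream`. With the append-only table, Update and Append write the same single new row.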

### From batch to streaming

Thinking of the data streams as tables not only makes it easier to conceptualize the logical computations on the data, but also makes it easier to express them in code. Since Spark’s DataFrame is a programmatic representation of a table, you can use the **DataFrame** API to express your computations on streaming data. All you need to do is define an input DataFrame (i.e., the input table) from a streaming data source, and then you apply operations on the DataFrame in the same way as you would on a DataFrame defined on a batch source.
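A minimal PySpark sketch of this idea follows. It assumes an existing `SparkSession` is passed in as `spark` (as is available in a Databricks notebook) and that both sources yield a single string column named `value`, which is what the `text` reader and the `socket` source produce. The function names, the file path, and the host and port are hypothetical. The point is that `word_counts` is written once and applied unchanged to a batch and to a streaming DataFrame.

```python
def word_counts(lines):
    """Query logic shared by batch and streaming DataFrames."""
    return (
        lines.selectExpr("explode(split(value, ' ')) AS word")
        .groupBy("word")
        .count()
    )


def batch_lines(spark, path):
    """Input table from a batch source: one row per line of a text file."""
    return spark.read.text(path)


def streaming_lines(spark, host, port):
    """Input table from a streaming source: one row per line read from a socket."""
    return (
        spark.readStream.format("socket")
        .option("host", host)
        .option("port", port)
        .load()
    )


def run_batch(spark, path):
    """Run the query once over a static table and print the result."""
    word_counts(batch_lines(spark, path)).show()


def run_streaming(spark, host, port):
    """Start the same query as a stream that prints the full result table."""
    return (
        word_counts(streaming_lines(spark, host, port))
        .writeStream.format("console")
        .outputMode("complete")
        .start()
    )
```

In a notebook, `query = run_streaming(spark, "localhost", 9999)` followed by `query.awaitTermination()` would keep the streaming query running, while `run_batch(spark, "/tmp/lines.txt")` would compute the same counts once over a file. Only the way the input DataFrame is defined and the way the result is written differ between the two.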