## Structured Streaming

[ Ref _Learning Spark v2_ book, _Chapter 8_.]

Traditionally, distributed stream processing has been implemented with a record-at-a-time processing model.

<img src="https://github.com/Marco-Santoni/databricks-from-scratch/blob/main/training-spark/img/chapter_8_traditional.png?raw=true" width="500">

Ref [O'Reilly](https://www.oreilly.com/library/view/learning-spark-2nd/9781492050032/ch08.html)

This processing model can achieve very low latencies—that is, an input record can be processed by the pipeline and the resulting output can be generated within milliseconds. However, this model is not very efficient at recovering from node failures and straggler nodes (i.e., nodes that are slower than others).

### Micro-batch stream processing

Spark Streaming introduced the idea of micro-batch stream processing, where the streaming computation is modeled as a continuous series of small, map/reduce-style batch processing jobs (hence, “micro-batches”) on small chunks of the stream data.

<img src="https://github.com/Marco-Santoni/databricks-from-scratch/blob/main/training-spark/img/chapter_8_microbatch.png?raw=true" width="500">

[ Ref _Learning Spark v2_ book, _page 208_]

As shown here, Spark Streaming divides the data from the input stream into, say, 1-second micro-batches. Each batch is processed in the Spark cluster in a distributed manner with small deterministic tasks that generate the output in micro-batches.

Main advantages:

1. **Fault tolerance.** Recovering from failures is quick and easy thanks to Spark's task scheduling.
2. **Determinism.** The deterministic nature of the tasks ensures that the output data is the same no matter how many times the task is reexecuted. This crucial characteristic enables Spark Streaming to provide end-to-end exactly-once processing guarantees, that is, the generated output results will be such that every input record was processed exactly once.

Main disadvantage: **latency**. Micro-batching yields latencies of a few seconds rather than a few milliseconds.
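
The micro-batch idea can be sketched in plain Python, without Spark. This is only an illustration under simplifying assumptions: the `stream` records, the function names, and the 1-second interval are hypothetical, and a real engine would distribute each batch across a cluster.

```python
from collections import defaultdict


def split_into_micro_batches(records, interval_seconds=1.0):
    """Group (arrival_time_seconds, value) records into consecutive micro-batches."""
    batches = defaultdict(list)
    for ts, value in records:
        # Records that arrive in the same interval land in the same batch.
        batch_index = int(ts // interval_seconds)
        batches[batch_index].append(value)
    return [batches[i] for i in sorted(batches)]


def process_batch(values):
    """A deterministic task: the same input always produces the same output."""
    return sum(values)


# Hypothetical input stream: (arrival time in seconds, value)
stream = [(0.1, 3), (0.7, 4), (1.2, 5), (2.5, 1), (2.9, 2)]

micro_batches = split_into_micro_batches(stream)
outputs = [process_batch(batch) for batch in micro_batches]

print(micro_batches)
print(outputs)
```

Because `process_batch` is deterministic, a failed batch can simply be rerun and will produce the same output, which is the property the exactly-once guarantee relies on.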

### Programming model

Structured Streaming extends the concept of a table to streaming applications by treating a stream as an unbounded, continuously appended table.

<img src="https://spark.apache.org/docs/latest/img/structured-streaming-stream-as-a-table.png" width="500">

Ref [Spark docs](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#basic-concepts)

Every new record received in the data stream is like a new row being appended to the unbounded input table. Structured Streaming will not actually retain all the input, but the output produced by Structured Streaming until time T will be equivalent to having all of the input until T in a static, bounded table and running a batch job on the table.

### Incrementalization

The developer then defines a query on this conceptual input table, as if it were a static table, to compute the result table that will be written to an output sink. Structured Streaming will automatically convert this batch-like query to a streaming execution plan.


<img src="https://spark.apache.org/docs/latest/img/structured-streaming-model.png" width="500">

Ref [Spark docs](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#basic-concepts)

Finally, developers specify **triggering** policies to control when to update the results. Each time a trigger fires, Structured Streaming checks for new data (i.e., a new row in the input table) and incrementally updates the result.
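
The equivalence between the incremental result and a batch job over all input seen so far can be illustrated with a small plain-Python sketch. The class and variable names are hypothetical, and the sketch assumes a simple per-action count as the query; it is not how Spark implements incrementalization internally.

```python
from collections import Counter


def batch_counts(rows):
    """Batch query: count rows per action over a bounded table."""
    return Counter(rows)


class IncrementalCounts:
    """Keeps only the running result, not the input rows themselves."""

    def __init__(self):
        self.result = Counter()

    def on_trigger(self, new_rows):
        # Fold only the newly arrived rows into the existing result.
        self.result.update(new_rows)
        return dict(self.result)


# Hypothetical arrivals: one list of new rows per trigger
arrivals = [["Open", "Open"], ["Close"], ["Open", "Close", "Close"]]

running = IncrementalCounts()
seen_so_far = []
for new_rows in arrivals:
    seen_so_far.extend(new_rows)
    incremental = running.on_trigger(new_rows)
    # At every trigger, the incremental result matches a batch job
    # run on all the input received up to that point.
    assert incremental == dict(batch_counts(seen_so_far))

print(dict(running.result))
```

Note that `seen_so_far` exists here only to check the equivalence; the incremental version never needs to retain the full input.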

The last part of the model is the **output** mode. The _output_ is defined as what gets written out to the external storage. The output can be written in one of three modes:

- **Complete** - The entire updated Result Table will be written to the external storage. It is up to the storage connector to decide how to handle writing of the entire table.
- **Append** - Only the new rows appended in the Result Table since the last trigger will be written to the external storage. This is applicable only on the queries where existing rows in the Result Table are not expected to change.
- **Update** - Only the rows that were updated in the Result Table since the last trigger will be written to the external storage. Note that this is different from the Complete Mode in that this mode only outputs the rows that have changed since the last trigger. If the query doesn’t contain aggregations, it will be equivalent to Append mode.
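
As a rough illustration of the three modes, the sketch below compares the result table before and after a trigger and returns the rows a sink would receive. It is a simplification in plain Python: `rows_to_write` and the sample tables are hypothetical, result tables are modeled as dictionaries, and the append branch assumes that existing rows never change.

```python
def rows_to_write(previous, current, mode):
    """Return the rows a sink would receive for one trigger.

    `previous` and `current` are result tables modeled as {key: value} dicts.
    """
    if mode == "complete":
        # The entire updated result table.
        return dict(current)
    if mode == "update":
        # Only rows that are new or whose value changed since the last trigger.
        return {k: v for k, v in current.items() if previous.get(k) != v}
    if mode == "append":
        # Only rows that did not exist at the last trigger.
        return {k: v for k, v in current.items() if k not in previous}
    raise ValueError(f"unknown output mode: {mode}")


# Hypothetical result tables before and after a trigger
previous = {"Open": 2, "Close": 1}
current = {"Open": 2, "Close": 3, "Pause": 1}

for mode in ("complete", "update", "append"):
    print(mode, rows_to_write(previous, current, mode))
```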

### From batch to streaming

Thinking of the data streams as tables not only makes it easier to conceptualize the logical computations on the data, but also makes it easier to express them in code. Since Spark’s DataFrame is a programmatic representation of a table, you can use the **DataFrame** API to express your computations on streaming data. All you need to do is define an input DataFrame (i.e., the input table) from a streaming data source, and then you apply operations on the DataFrame in the same way as you would on a DataFrame defined on a batch source.

## Time to code: first exercise

This exercise is inspired by Databricks's [streaming tutorial](https://docs.databricks.com/getting-started/spark/streaming.html).

We have some sample action data as files in `/databricks-datasets/structured-streaming/events/`, which we are going to use to build this application. Let's take a look at the contents of this directory.

In [None]:
%fs ls /databricks-datasets/structured-streaming/events/

path,name,size
dbfs:/databricks-datasets/structured-streaming/events/file-0.json,file-0.json,72530
dbfs:/databricks-datasets/structured-streaming/events/file-1.json,file-1.json,72961
dbfs:/databricks-datasets/structured-streaming/events/file-10.json,file-10.json,73025
dbfs:/databricks-datasets/structured-streaming/events/file-11.json,file-11.json,72999
dbfs:/databricks-datasets/structured-streaming/events/file-12.json,file-12.json,72987
dbfs:/databricks-datasets/structured-streaming/events/file-13.json,file-13.json,73006
dbfs:/databricks-datasets/structured-streaming/events/file-14.json,file-14.json,73003
dbfs:/databricks-datasets/structured-streaming/events/file-15.json,file-15.json,73007
dbfs:/databricks-datasets/structured-streaming/events/file-16.json,file-16.json,72978
dbfs:/databricks-datasets/structured-streaming/events/file-17.json,file-17.json,73008


There are about 50 JSON files in the directory. Let's see what each JSON file contains.

In [None]:
%fs head /databricks-datasets/structured-streaming/events/file-0.json

Each line in the file contains a JSON record with two fields: `time` and `action`. Let's try to analyze these files interactively.

## Batch/Interactive Processing
The usual first step in attempting to process the data is to interactively query the data. Let's define a static DataFrame on the files, and give it a table name.

In [None]:
from pyspark.sql.types import StructType, StructField, TimestampType, StringType

inputPath = "/databricks-datasets/structured-streaming/events/"

# Since we know the data format already, let's define the schema to speed up processing (no need for Spark to infer schema)
jsonSchema = StructType(
    [
        StructField("time", TimestampType()),
        StructField("action", StringType())
    ]
)

In [None]:
# Static DataFrame representing data in the JSON files
staticInputDF = (
  spark
    .read
    .schema(jsonSchema)
    .json(inputPath)
)

display(staticInputDF)

time,action
2016-07-28T04:19:28.000+0000,Close
2016-07-28T04:19:28.000+0000,Close
2016-07-28T04:19:29.000+0000,Open
2016-07-28T04:19:31.000+0000,Close
2016-07-28T04:19:31.000+0000,Open
2016-07-28T04:19:31.000+0000,Open
2016-07-28T04:19:32.000+0000,Close
2016-07-28T04:19:33.000+0000,Close
2016-07-28T04:19:35.000+0000,Close
2016-07-28T04:19:36.000+0000,Open


Now we can compute the number of "open" and "close" actions in one-hour windows. To do this, we group by the `action` column and by 1-hour windows over the `time` column. We use the [window](https://spark.apache.org/docs/3.1.3/api/python/reference/api/pyspark.sql.functions.window.html) function to bucket rows into time windows of 1 hour.

In [None]:
from pyspark.sql.functions import window

staticCountsDF = (
  staticInputDF
    .groupBy(
       staticInputDF.action, 
       window(staticInputDF.time, "1 hour"))    
    .count()
)

In [None]:
display(staticCountsDF.orderBy(staticCountsDF.window.start.asc()))

action,window,count
Close,"List(2016-07-26T02:00:00.000+0000, 2016-07-26T03:00:00.000+0000)",11
Open,"List(2016-07-26T02:00:00.000+0000, 2016-07-26T03:00:00.000+0000)",179
Close,"List(2016-07-26T03:00:00.000+0000, 2016-07-26T04:00:00.000+0000)",344
Open,"List(2016-07-26T03:00:00.000+0000, 2016-07-26T04:00:00.000+0000)",1001
Open,"List(2016-07-26T04:00:00.000+0000, 2016-07-26T05:00:00.000+0000)",999
Close,"List(2016-07-26T04:00:00.000+0000, 2016-07-26T05:00:00.000+0000)",815
Close,"List(2016-07-26T05:00:00.000+0000, 2016-07-26T06:00:00.000+0000)",1003
Open,"List(2016-07-26T05:00:00.000+0000, 2016-07-26T06:00:00.000+0000)",1000
Close,"List(2016-07-26T06:00:00.000+0000, 2016-07-26T07:00:00.000+0000)",1011
Open,"List(2016-07-26T06:00:00.000+0000, 2016-07-26T07:00:00.000+0000)",993


Note the two ends of the graph. The close actions are generated such that they come after the corresponding open actions, so there are more "opens" at the beginning and more "closes" at the end.
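
To make the 1-hour bucketing concrete, here is a plain-Python sketch of how a tumbling window assigns a timestamp to a bucket. The `hour_window` helper and the three sample events are hypothetical; the sketch assumes windows aligned to the hour, with an inclusive start and an exclusive end.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone


def hour_window(ts):
    """Return the (start, end) of the 1-hour tumbling window containing `ts`."""
    start = ts.replace(minute=0, second=0, microsecond=0)
    return (start, start + timedelta(hours=1))


# Hypothetical events: (timestamp, action)
events = [
    (datetime(2016, 7, 26, 2, 45, 7, tzinfo=timezone.utc), "Open"),
    (datetime(2016, 7, 26, 2, 59, 59, tzinfo=timezone.utc), "Open"),
    (datetime(2016, 7, 26, 3, 0, 0, tzinfo=timezone.utc), "Close"),
]

# Group by (action, window) and count, mirroring groupBy(action, window).count()
counts = Counter((action, hour_window(ts)) for ts, action in events)

for (action, (start, end)), n in sorted(counts.items()):
    print(action, start.isoformat(), end.isoformat(), n)
```

An event stamped exactly 03:00:00 belongs to the 03:00 to 04:00 window, not to the one that ends at 03:00.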

## Stream Processing 
Now that we have analyzed the data interactively, let's convert this to a streaming query that continuously updates as data comes. Since we just have a static set of files, we are going to emulate a stream from them by reading one file at a time, in the chronological order they were created. The query we have to write is pretty much the same as the interactive query above.

In [None]:
from pyspark.sql.functions import window

# Similar to definition of staticInputDF above, just using `readStream` instead of `read`
streamingInputDF = (
  spark
    .readStream                       
    .schema(jsonSchema)               # Set the schema of the JSON data
    .option("maxFilesPerTrigger", 1)  # Treat a sequence of files as a stream by picking one file at a time
    .json(inputPath)
)

# Same query as staticInputDF
streamingCountsDF = (                 
  streamingInputDF
    .groupBy(
      streamingInputDF.action, 
      window(streamingInputDF.time, "1 hour"))
    .count()
)

# Is this DF actually a streaming DF?
streamingCountsDF.isStreaming

Out[2]: True

As you can see, `streamingCountsDF` is a streaming DataFrame (`streamingCountsDF.isStreaming` returned `True`). You can start the streaming computation by defining the sink and starting the query.
In our case, we want to interactively query the counts (same queries as above), so we will write the complete set of 1-hour counts to an in-memory table.

In [None]:
#spark.conf.set("spark.sql.shuffle.partitions", "2")  # keep the size of shuffles small

query = (
  streamingCountsDF
    .writeStream
    .format("memory")        # memory = store in-memory table 
    .queryName("counts")     # counts = name of the in-memory table
    .outputMode("complete")  # complete = all the counts should be in the table
    .start()
)

`query` is a handle to the streaming query that is running in the background. This query is continuously picking up files and updating the windowed counts. 

Note the status of the query in the cell above. The progress bar shows that the query is active.
Furthermore, if you expand the `> counts` above, you will find the number of files it has already processed.

Let's wait a bit for a few files to be processed and then interactively query the in-memory `counts` table.

In [None]:
from time import sleep
sleep(5)  # wait a bit for computation to start

In [None]:
%sql select action, date_format(window.end, "MMM-dd HH:mm") as time, count from counts order by time, action

action,time,count
Close,Jul-26 03:00,11
Open,Jul-26 03:00,179
Close,Jul-26 04:00,344
Open,Jul-26 04:00,1001
Close,Jul-26 05:00,815
Open,Jul-26 05:00,999
Close,Jul-26 06:00,1003
Open,Jul-26 06:00,1000
Close,Jul-26 07:00,328
Open,Jul-26 07:00,320


We see the timeline of windowed counts (similar to the static one earlier) building up. If we keep running this interactive query repeatedly, we will see the latest updated counts which the streaming query is updating in the background.

In [None]:
sleep(5)  # wait a bit more for more data to be computed

In [None]:
%sql select action, date_format(window.end, "MMM-dd HH:mm") as time, count from counts order by time, action

action,time,count
Close,Jul-26 03:00,11
Open,Jul-26 03:00,179
Close,Jul-26 04:00,344
Open,Jul-26 04:00,1001
Close,Jul-26 05:00,815
Open,Jul-26 05:00,999
Close,Jul-26 06:00,1003
Open,Jul-26 06:00,1000
Close,Jul-26 07:00,1011
Open,Jul-26 07:00,993


Also, let's see the total number of "opens" and "closes".

In [None]:
%sql select action, sum(count) as total_count from counts group by action order by action

action,total_count
Close,7511
Open,8489


If you keep running the above query repeatedly, you will always find that the number of "opens" is more than the number of "closes", as expected in a data stream where a "close" always appears after the corresponding "open". This shows that Structured Streaming ensures **prefix integrity**.

Note that there are only a few files, so once all of them have been consumed there will be no further updates to the counts. Rerun the query if you want to interact with the streaming query again.

Finally, you can stop the query running in the background, either by clicking on the 'Cancel' link in the cell of the query, or by executing `query.stop()`. Either way, when the query is stopped, the status of the corresponding cell above will automatically update to `TERMINATED`.

## Five steps to define a streaming query

A streaming query has a similar interface to a batch query, and it requires you to go through these 5 steps:

1. Define input sources
2. Transform data
3. Define output sink and output mode
4. Specify processing details
5. Start the query

### 1. Input sources

When reading batch data sources, we need `spark.read` to create a `DataFrameReader`, whereas with streaming sources we need `spark.readStream` to create a `DataStreamReader`. Apache Spark natively supports reading data streams from

- Apache Kafka
- various file-based formats that `DataFrameReader` supports (Parquet, ORC, JSON, etc.)
- TCP sockets (intended for testing only)

In [None]:
# from the previous example
streamingInputDF = (
  spark
    .readStream                       
    .schema(jsonSchema)               # Set the schema of the JSON data
    .option("maxFilesPerTrigger", 1)  # Treat a sequence of files as a stream by picking one file at a time
    .json(inputPath)
)

### 2. Transform data

Now we can apply the usual `DataFrame` operations, such as grouping by and counting.

In [None]:
from pyspark.sql.functions import window

streamingCountsDF = (          
  streamingInputDF
    .groupBy(
      streamingInputDF.action, 
      window(streamingInputDF.time, "1 hour"))
    .count()
)

`streamingCountsDF` is a streaming `DataFrame` (that is, a DataFrame on unbounded, streaming data) that represents the running counts that will be computed once the streaming query is started and the streaming input data is being continuously processed.

To understand which operations are supported in Structured Streaming, you have to recognize the two broad classes of data transformations:

- **stateless transformations**. These do not require any information from previous rows to process the next row; each row can be processed by itself. The lack of previous “state” in these operations makes them stateless. Stateless operations can be applied to both batch and streaming DataFrames. Examples: `select` and `filter`.
- **stateful transformations**. Any DataFrame operations involving grouping, joining, or aggregating are stateful transformations. While many of these operations are supported in Structured Streaming, a few combinations of them are **not** supported because it is either computationally hard or infeasible to compute them in an incremental manner. Example: `count`.

❓ Do you think the _mean_ is supported as stateful transformation? What about the _median_?
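
One way to reason about the question above is to look at how much state each aggregate needs in order to be updated incrementally. The plain-Python sketch below is only an illustration of that reasoning, with hypothetical class names and data; it makes no claim about which specific functions Spark exposes.

```python
import statistics


class RunningMean:
    """Constant-size state: a running sum and a count are enough."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def update(self, batch):
        self.total += sum(batch)
        self.count += len(batch)
        return self.total / self.count


class RunningMedian:
    """State grows with the input: an exact median needs every value seen."""

    def __init__(self):
        self.values = []

    def update(self, batch):
        self.values.extend(batch)
        return statistics.median(self.values)


# Hypothetical micro-batches of numeric values
batches = [[1, 9], [2], [100, 3]]

mean_state, median_state = RunningMean(), RunningMedian()
for batch in batches:
    print(mean_state.update(batch), median_state.update(batch))
```

The mean can be maintained with two numbers regardless of how long the stream runs, whereas the exact median forces the state to grow without bound on an unbounded input.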

### 3. Define output sink and output mode

After transforming the data, we can define how to write the processed output data with `DataFrame.writeStream` by specifying

- output writing details (where and how to write the output)
- processing details (how to process data and how to recover from failures)

In [None]:
# as in the first exercise, we write to an in-memory table
streamingCountsDF\
    .writeStream\
    .format("memory")\
    .queryName("counts")\
    .outputMode("complete")

Out[9]: <pyspark.sql.streaming.DataStreamWriter at 0x7f6f56b298e0>

The output mode of a streaming query specifies what part of the updated output to write out after processing new input data. It can be either _complete_, _append_, or _update_ (see above).

Besides the in-memory table used here and the console, Structured Streaming natively supports streaming writes to

- files
- Apache Kafka

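In addition, you can write to arbitrary locations using the `foreachBatch()` and `foreach()` API methods. In particular, `foreachBatch()` lets you write streaming output using existing batch data sources, but by default it provides only at-least-once write guarantees, so you lose the exactly-once guarantee unless you deduplicate the output yourself (for example, using the batch ID passed to your function).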
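
The function passed to `foreachBatch()` receives the micro-batch and a batch identifier. The plain-Python sketch below simulates, without Spark, how a sink could use that identifier to skip a replayed batch. `IdempotentSink` and the `deliveries` list are hypothetical, and the sketch ignores the fact that a real sink would need to write the rows and record the batch ID atomically.

```python
class IdempotentSink:
    """Hypothetical sink that remembers which batch IDs it has already committed."""

    def __init__(self):
        self.rows = []
        self.committed_batch_ids = set()

    def write_batch(self, batch_rows, batch_id):
        # Same argument order as a foreachBatch callback: (batch, batch ID).
        if batch_id in self.committed_batch_ids:
            return False  # replayed batch: skip it to avoid duplicates
        self.rows.extend(batch_rows)
        self.committed_batch_ids.add(batch_id)
        return True


sink = IdempotentSink()

# Simulated at-least-once delivery: batch 1 is delivered twice after a retry.
deliveries = [(["a", "b"], 0), (["c"], 1), (["c"], 1), (["d"], 2)]
for rows, batch_id in deliveries:
    sink.write_batch(rows, batch_id)

print(sink.rows)
```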

### 4. Specify processing details

The final step before starting the query is to specify details of how to process the data.

In [None]:
checkpointDir = '/checkpoints/streaming-exercise-1/'

df = (
    streamingCountsDF
    .writeStream
    .format("memory")
    .queryName("counts")
    .outputMode("complete")
    .trigger(processingTime="20 seconds")
    .option("checkpointLocation", checkpointDir)
)
df

Out[16]: <pyspark.sql.streaming.DataStreamWriter at 0x7f6f559fbfd0>

Here, we specified 2 details

- **triggering details**. This indicates when to trigger the discovery and processing of newly available streaming data.
  - _default_. The streaming query executes data in micro-batches where the next micro-batch is triggered as soon as the previous micro-batch has completed.
  - _processing time with trigger interval_. The query will trigger micro-batches at that fixed interval.
  - _once_. The streaming query will execute exactly one micro-batch. It processes all the new data available in a single batch and then stops itself. This is useful when you want to control the triggering and processing from an external scheduler that will restart the query using any custom schedule.
  - _continuous_. The streaming query will process data continuously instead of in micro-batches. While only a small subset of DataFrame operations allow this mode to be used, it can provide much lower latency (as low as milliseconds) than the micro-batch trigger modes.
- **checkpoint location**. This is a directory in any HDFS-compatible filesystem where a streaming query saves its progress information, i.e., what data has been successfully processed. Upon failure, this metadata is used to restart the failed query exactly where it left off.
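
The role of the checkpoint can be illustrated with a plain-Python sketch that saves its progress to a file and resumes from it after an interruption. Everything here is hypothetical (the `run_query` function, the `progress.json` file name, the per-file values); the actual checkpoint layout used by Structured Streaming is different and more elaborate.

```python
import json
import os
import tempfile


def run_query(files, checkpoint_dir, max_files=None):
    """Process not-yet-processed files in order, saving progress after each one."""
    progress_path = os.path.join(checkpoint_dir, "progress.json")
    state = {"next_index": 0, "total": 0}
    if os.path.exists(progress_path):
        # Resume exactly where the previous run left off.
        with open(progress_path) as f:
            state = json.load(f)

    processed = 0
    while state["next_index"] < len(files):
        if max_files is not None and processed >= max_files:
            break  # simulate the query being interrupted
        state["total"] += files[state["next_index"]]
        state["next_index"] += 1
        with open(progress_path, "w") as f:
            json.dump(state, f)
        processed += 1
    return state


# Hypothetical per-file contributions to a running sum
files = [10, 20, 30, 40]

with tempfile.TemporaryDirectory() as checkpoint_dir:
    first = run_query(files, checkpoint_dir, max_files=2)  # interrupted run
    second = run_query(files, checkpoint_dir)              # restarted run

print(first)
print(second)
```

The restarted run does not reprocess the first two files: it reads the saved progress and continues with the third.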

### 5. Start the query
Once everything has been specified, the final step is to start the query.

In [None]:
query = df.start()
type(query)

Out[17]: pyspark.sql.streaming.StreamingQuery

In [None]:
%sql select action, date_format(window.end, "MMM-dd HH:mm") as time, count from counts order by time, action

action,time,count
Close,Jul-26 03:00,11
Open,Jul-26 03:00,179
Close,Jul-26 04:00,344
Open,Jul-26 04:00,1001
Close,Jul-26 05:00,815
Open,Jul-26 05:00,999
Close,Jul-26 06:00,1003
Open,Jul-26 06:00,1000
Close,Jul-26 07:00,1011
Open,Jul-26 07:00,993


In [None]:
%fs ls /checkpoints/streaming-exercise-1/

path,name,size,modificationTime
dbfs:/checkpoints/streaming-exercise-1/commits/,commits/,0,0
dbfs:/checkpoints/streaming-exercise-1/metadata,metadata,45,1665033422000
dbfs:/checkpoints/streaming-exercise-1/offsets/,offsets/,0,0
dbfs:/checkpoints/streaming-exercise-1/sources/,sources/,0,0
dbfs:/checkpoints/streaming-exercise-1/state/,state/,0,0


The returned object of type `StreamingQuery` represents an active query and can be used to manage the query. Note that `start()` is a nonblocking method, so it will return as soon as the query has started in the background. If you want the main thread to block until the streaming query has terminated, you can use `query.awaitTermination()`. You can wait up to a timeout duration using `query.awaitTermination(timeout)` (in PySpark, the timeout is given in seconds), and you can explicitly stop the query with `query.stop()`.

[ Ref _Learning Spark v2_ book, _page 218_]

In [None]:
query.stop()

## Exercise

Consider the JSON files under `dbfs:/databricks-datasets/iot-stream/data-device/`. Define a streaming job that

- computes the sum of `num_steps`
- writes the result to an in-memory table, overwriting its content at every micro-batch
- is triggered every 20 seconds

Then, query the memory table while the streaming is running. Is the number increasing? If not, why do you think so?

**Bonus.** Can you edit the `readStream` options so that the total sum increases gradually, as if the streaming job processed one file at a time?

## Solution

In [None]:
%fs head /databricks-datasets/iot-stream/README.md

In [None]:
from pyspark.sql.types import StructType,StructField, StringType, IntegerType, LongType, FloatType

filename = 'dbfs:/databricks-datasets/iot-stream/'

device_schema = StructType([
    StructField("id",LongType(),False),  
    StructField("user_id",LongType(),True),  
    StructField("device_id",LongType(),True),  
    StructField("num_steps",LongType(),True),  
    StructField("miles_walked",FloatType(),True),  
    StructField("calories_burnt",FloatType(),True),  
    StructField("timestamp",StringType(),True),  
    StructField("value",StringType(),True)
])

In [None]:
display(
    spark.read
    .schema(device_schema)
    .json(filename + 'data-device/')
    .agg({'num_steps': 'sum'})
)

sum(num_steps)
6493324948


In [None]:
dbutils.fs.rm("/checkpoints/streaming-exercise-2/", recurse=True)

Out[56]: False

In [None]:
dbutils.fs.rm("/output/streaming/sum-of-steps", recurse=True)

Out[57]: False

In [None]:
device_df = spark.readStream\
                .schema(device_schema)\
                .option("maxFilesPerTrigger", 1)\
                .json(filename + 'data-device/')\
                .agg({'num_steps': 'sum'})

In [None]:
device_df.writeStream\
    .format("memory")\
    .queryName("total_steps")\
    .outputMode("complete")\
    .option("checkpointLocation", "/checkpoints/streaming-exercise-2/")\
    .trigger(processingTime="20 seconds")\
    .start()

Out[59]: <pyspark.sql.streaming.StreamingQuery at 0x7f035079f0d0>

In [None]:
display(spark.read.table('total_steps'))

sum(num_steps)
974165189
