# Chapter 8. Structured Streaming

In [None]:
from uuid import uuid1

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = (SparkSession.builder
  # Add the Kafka source library. "_2.12" is the Scala version; the version after the last ":" must match the Spark version that you use
  .config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.12:3.4.0")
  .master("local[4]")
  .appName("StructuredStreaming")
  .getOrCreate())
spark

## The Fundamentals of a Structured Streaming Query

For the following streaming query to work, we need a TCP server that listens on `127.0.0.1:61080` and sends lines of text.

We can use `netcat-openbsd` for this. In a terminal, run `nc -lk -s 127.0.0.1 -p 61080` and start typing lines of text. Observe the output in this notebook. It should look something like this:

```
-------------------------------------------
Batch: 1
-------------------------------------------
+----+-----+
|word|count|
+----+-----+
| foo|    1|
+----+-----+
```
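
To make the table above easier to interpret, the following plain-Python sketch mimics what the streaming query in this section computes: it splits each line on single whitespace characters and, as in the `complete` output mode, reports the whole running count table after every micro-batch. It is only an illustration and does not use Spark; the function name `complete_mode_batches` is hypothetical.

```python
import re
from collections import Counter


def complete_mode_batches(batches):
    """Yield the full word-count table after each micro-batch of lines."""
    totals = Counter()
    for lines in batches:
        for line in lines:
            # Split on every single whitespace character, like the "\\s" pattern in the query.
            totals.update(re.split(r"\s", line))
        # "complete" mode re-emits the whole result table, not only the changed rows.
        yield dict(totals)


tables = list(complete_mode_batches([["foo"], ["foo bar"]]))
print(tables)  # [{'foo': 1}, {'foo': 2, 'bar': 1}]
```

Note that in this sketch two consecutive spaces produce an empty-string "word", because the pattern matches one whitespace character at a time.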

To terminate the query, interrupt the Jupyter kernel (menu Kernel -> Interrupt Kernel).
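
If netcat is not available, a small Python TCP server can play the same role. The sketch below is an assumption-laden stand-in, not part of the original setup: the helper names `open_listener` and `send_lines` are hypothetical, and unlike `nc -lk` it serves a single client and then returns. It relies on the socket source connecting as a client and reading newline-terminated UTF-8 text.

```python
import socket


def open_listener(host="127.0.0.1", port=61080):
    """Create a TCP socket that listens on host:port."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((host, port))
    server.listen(1)
    return server


def send_lines(server, lines):
    """Wait for one client, send it every line newline-terminated, then disconnect."""
    conn, _ = server.accept()
    with conn:
        for line in lines:
            # Normalize the line ending so each item arrives as exactly one line.
            conn.sendall((line.rstrip("\n") + "\n").encode("utf-8"))
```

Run it in a separate Python process before starting the query, for example `send_lines(open_listener(), ["foo bar", "foo"])`. Passing `sys.stdin` as `lines` forwards whatever you type, which is closer to the netcat workflow.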

In [None]:
# Random checkpoint dirname. Use it if you want every query to start anew.
checkpoint_dir = f"/tmp/spark-streaming-checkpoints-{uuid1()}"

# Static checkpoint dirname. Use it if you want to restart a stopped query.
# checkpoint_dir = f"./spark-streaming-checkpoints"

# Step 1: Define input sources 
lines = (spark
         .readStream
         .format("socket")
         .option("host", "127.0.0.1")
         .option("port", "61080")
         .load())
# Step 2: Transform data
words = lines.select(F.explode(F.split(F.col("value"), "\\s")).alias("word"))
counts = words.groupBy("word").count()
# Step 3: Define output sink and output mode
writer = (counts
         .writeStream
         .format("console")
         .outputMode("complete"))
# Step 4: Specify processing details
writer2 = (writer
           .trigger(processingTime="1 second")
           .option("checkpointLocation", checkpoint_dir))
# Step 5: Start the query
streaming_query = writer2.start()
# The following line will block for 60 seconds and the console output will be echoed in this notebook
# in the cell output. You can unblock earlier by interrupting the Jupyter kernel (menu Kernel -> Interrupt Kernel)
streaming_query.awaitTermination(60)

In [None]:
# The streaming query is still running. You can still observe the console output
# in the terminal in which you started Jupyter.
streaming_query.status

In [None]:
streaming_query.stop()
streaming_query.status

Now the query is stopped.

If you used a static checkpoint dirname, you can restart the query from the point where it left off. To restart the query, reexecute the cell that creates and starts the streaming query (with steps 1 to 5). You may get "ERROR MicroBatchExecution" with IndexOutOfBoundsException. In this case rerun the cell one more time.

**NOTE:** If you use a static checkpoint dirname and you stopped and restarted netcat in between, your restarted query may stop accepting input from netcat. In this case you may need a complete reset: stop the query, remove the checkpoint directory manually, restart netcat, and finally restart the query.
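
The "remove the checkpoint directory" step of such a reset can also be done from Python. This is a minimal sketch; the helper name `reset_checkpoint` is hypothetical, and it should only be called after `streaming_query.stop()`, because it deletes the directory unconditionally.

```python
import shutil
from pathlib import Path


def reset_checkpoint(checkpoint_dir):
    """Delete a checkpoint directory. Return True if it existed beforehand."""
    path = Path(checkpoint_dir)
    existed = path.exists()
    # ignore_errors=True makes the call a no-op when the directory is already gone.
    shutil.rmtree(path, ignore_errors=True)
    return existed
```

For the static dirname used in this notebook, the call would be `reset_checkpoint("./spark-streaming-checkpoints")`.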

In [None]:
streaming_query.lastProgress

## Streaming Data Sources and Sinks

### Reading from Files

In [None]:
input_directory_of_json_files = "../data/streaming_json"
file_schema_read_json = "`key` integer, `value` string"

df_read_json = (spark
           .readStream
           .format("json")
           .schema(file_schema_read_json)
           .load(input_directory_of_json_files))

After starting the query in the next cell you will see the data from the file `00.json` in the cell output. Create a new file by copying `00.json` to `1.json`:
```
cp data/streaming_json/00.json data/streaming_json/1.json
```
and you will see the same data output again.
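
To feed the query data of your own rather than a copy, you can generate new files that match the `key`/`value` schema. The sketch below writes JSON Lines (one object per line) into a staging directory first and then moves the finished file into the watched directory, since the file source expects files to appear there atomically. The helper name `write_json_lines` and the staging directory are assumptions made for illustration.

```python
import json
import os
import tempfile


def write_json_lines(rows, target_dir, file_name, staging_dir):
    """Write rows as JSON Lines in staging_dir, then move the file into target_dir."""
    os.makedirs(target_dir, exist_ok=True)
    os.makedirs(staging_dir, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=staging_dir, suffix=".json")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")
    final_path = os.path.join(target_dir, file_name)
    # os.replace is atomic when both directories are on the same filesystem.
    os.replace(tmp_path, final_path)
    return final_path
```

From the notebook's working directory, a call could look like `write_json_lines([{"key": 1, "value": "one"}], "../data/streaming_json", "2.json", "../data/staging")`, where `../data/staging` is a hypothetical location.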

In [None]:
checkpoint_dir_read_json = "./spark-streaming-checkpoints-read-json"

streaming_query_read_json = (df_read_json
                        .writeStream
                        .format("console")
                        .outputMode("append")
                        .trigger(processingTime="1 second")
                        .option("checkpointLocation", checkpoint_dir_read_json)
                        .start())
# The following line will block for 60 seconds and the console output will be echoed in this notebook
# in the cell output. You can unblock earlier by interrupting the Jupyter kernel (menu Kernel -> Interrupt Kernel)
streaming_query_read_json.awaitTermination(60)

In [None]:
streaming_query_read_json.stop()
streaming_query_read_json.status

If you want to restart the streaming query with the same JSON files all over again, remove the checkpoint directory `checkpoint_dir_read_json`. Otherwise the query will skip the files that it has already read.
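
The effect of the checkpoint on file processing can be pictured with a toy model. The sketch below is not how Spark implements checkpointing; it only imitates the observable behavior described above with a hypothetical `new_files` helper that records processed file names in a small JSON file.

```python
import json
from pathlib import Path


def new_files(input_dir, checkpoint_file):
    """Return names of files not yet recorded in checkpoint_file, then record them."""
    checkpoint = Path(checkpoint_file)
    seen = set(json.loads(checkpoint.read_text())) if checkpoint.exists() else set()
    fresh = sorted(
        p.name for p in Path(input_dir).iterdir() if p.is_file() and p.name not in seen
    )
    # Persist the updated list so that a later call skips these files.
    checkpoint.write_text(json.dumps(sorted(seen | set(fresh))))
    return fresh
```

Calling `new_files` twice on the same directory returns an empty list the second time; deleting the checkpoint file makes every file count as new again, which is the toy equivalent of removing `checkpoint_dir_read_json`.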

### Writing to Files

The following streaming query takes the data that `df_read_json` reads from the JSON files in the `input_directory_of_json_files` directory and writes it to files in the `output_directory_for_json_files` directory.
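
Once the query has written some output, you can inspect it without Spark. The sketch below assumes that the file sink stores the data as `part-*.json` files in JSON Lines format next to a `_spark_metadata` directory; the helper name `read_output_rows` is hypothetical.

```python
import glob
import json
import os


def read_output_rows(output_dir):
    """Collect the rows from all part-*.json data files in output_dir."""
    rows = []
    # The glob pattern skips the _spark_metadata directory and other non-data files.
    for path in sorted(glob.glob(os.path.join(output_dir, "part-*.json"))):
        with open(path, encoding="utf-8") as f:
            rows.extend(json.loads(line) for line in f if line.strip())
    return rows
```

With the paths used in this notebook, the call would be `read_output_rows("../data_output/streaming_json")`.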

In [None]:
output_directory_for_json_files = "../data_output/streaming_json"
checkpoint_dir_write_json = "./spark-streaming-checkpoints-write-json"

streaming_query_write_json = (df_read_json
                              .writeStream
                              .format("json")
                              .option("checkpointLocation", checkpoint_dir_write_json)
                              .start(output_directory_for_json_files))

In [None]:
streaming_query_write_json.status

In [None]:
streaming_query_write_json.stop()
streaming_query_write_json.status

### Reading from Apache Kafka

Before we can read anything from Kafka, we need to write some data into a topic. We will use `kafka-time-producer.py` to write a stream of timestamps to the `timestamps` Kafka topic. Then we will read this stream and write it out to the console using a Spark streaming query.

To start producing timestamps into the Kafka topic, run the following command from the project root:
```
poetry run python3 bin/kafka-time-producer.py
```

FYI: `kafka-time-producer.py` generates the key-value pairs and writes them to Kafka using Spark, too.
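
The Kafka source used in the next cell comes from the `spark-sql-kafka-0-10` package that was configured when the Spark session was created at the top of this notebook. Its Maven coordinate encodes the Scala and Spark versions, so it has to change whenever they do. The hypothetical helper below builds the coordinate; the default Scala version `2.12` is an assumption taken from the configuration in this notebook.

```python
def kafka_package(spark_version, scala_version="2.12"):
    """Build the Maven coordinate of the Kafka connector for the given versions."""
    return f"org.apache.spark:spark-sql-kafka-0-10_{scala_version}:{spark_version}"


print(kafka_package("3.4.0"))  # org.apache.spark:spark-sql-kafka-0-10_2.12:3.4.0
```

If the installed PySpark matches the Spark runtime, `kafka_package(pyspark.__version__)` should yield a suitable value for `spark.jars.packages`.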

In [None]:
df_read_kafka = (spark
                 .readStream
                 .format("kafka")
                 .option("kafka.bootstrap.servers", "localhost:9093,localhost:9094,localhost:9095")
                 .option("subscribe", "timestamps")
                 .option("startingOffsets", "earliest")  # the default for streaming queries is "latest"
                 .load())
# df_read_kafka_transformed = df_read_kafka.withColumns({"key_string": F.expr("cast(key as string)"),
#                                                        "value_string": F.expr("cast(value as string)")})
df_read_kafka_transformed = df_read_kafka.withColumns({"key_string": F.col("key").cast("string"),
                                                       "value_string": F.col("value").cast("string")})

In [None]:
checkpoint_dir_read_kafka = "./spark-streaming-checkpoints-read-kafka"

streaming_query_read_kafka = (df_read_kafka_transformed
                        .writeStream
                        .format("console")
                        .outputMode("append")
                        .trigger(processingTime="1 second")
                        .option("checkpointLocation", checkpoint_dir_read_kafka)
                        .start())
# The following line will block for 60 seconds and the console output will be echoed in this notebook
# in the cell output. You can unblock earlier by interrupting the Jupyter kernel (menu Kernel -> Interrupt Kernel)
streaming_query_read_kafka.awaitTermination(60)

In [None]:
streaming_query_read_kafka.stop()
streaming_query_read_kafka.status

### Writing to Apache Kafka

The following streaming query reads key-value pairs from CSV files in a directory and writes those key-value pairs to a Kafka topic.

In [None]:
file_schema_write_kafka = "`word` string, `count` long"

df_write_kafka = spark.readStream.format("csv").schema(file_schema_write_kafka).option("header", "true").load("../data/counts")

In [None]:
checkpoint_dir_write_kafka = f"/tmp/spark-streaming-checkpoints-write-kafka-{uuid1()}"

streaming_query_write_kafka = (df_write_kafka
  .selectExpr(
    "cast(word as string) as key",
    "cast(count as string) as value")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9093,localhost:9094,localhost:9095")
  .option("topic", "wordcounts")
  .outputMode("update")
  .option("checkpointLocation", checkpoint_dir_write_kafka)
  .start())


Check the messages written to the topic in [AKHQ](http://localhost:8086/ui/docker-kafka-server/topic/wordcounts/data?sort=Oldest&partition=All). If the counts are not written to the Kafka topic, check the terminal where you started the notebook for error logs.

In [None]:
streaming_query_write_kafka.stop()
streaming_query_write_kafka.status

### Custom Streaming Sources and Sinks