## Spark Structured Streaming
Structured Streaming is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine. You can express your streaming computation the same way you would express a batch computation on static data. The Spark SQL engine will take care of running it incrementally and continuously and updating the final result as streaming data continues to arrive. You can use the Dataset/DataFrame API in Scala, Java, Python or R to express streaming aggregations, event-time windows, stream-to-batch joins, etc. The computation is executed on the same optimized Spark SQL engine. Finally, the system ensures end-to-end exactly-once fault-tolerance guarantees through checkpointing and Write Ahead Logs. In short, Structured Streaming provides fast, scalable, fault-tolerant, end-to-end exactly-once stream processing without the user having to reason about streaming.

Internally, by default, Structured Streaming queries are processed using a micro-batch processing engine, which processes data streams as a series of small batch jobs, thereby achieving end-to-end latencies as low as 100 milliseconds and exactly-once fault-tolerance guarantees. Spark 2.3, however, introduced a new low-latency processing mode called Continuous Processing, which can achieve end-to-end latencies as low as 1 millisecond with at-least-once guarantees. You can choose the mode based on your application requirements without changing the Dataset/DataFrame operations in your queries.
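
To make the micro-batch idea concrete, here is a minimal plain-Python sketch; it is not Spark code. It assumes three made-up batches of actions and keeps a running count that is refreshed after each batch, which is roughly how a streaming aggregation updates its result as new data arrives. The names micro_batches, running_counts, and snapshots are illustrative only.

```python
from collections import Counter

# Hypothetical micro-batches: each inner list stands in for the records
# that arrive between two triggers.
micro_batches = [
    ["open", "open", "close"],
    ["close", "open"],
    ["close"],
]

running_counts = Counter()  # plays the role of the continuously updated result
snapshots = []              # result as seen after each micro-batch

for batch_id, batch in enumerate(micro_batches):
    running_counts.update(batch)            # incremental update, no full recompute
    snapshots.append(dict(running_counts))  # copy the current state of the result
    print(batch_id, snapshots[-1])
```

Each printed line is the result after one more batch. Structured Streaming does the equivalent bookkeeping for you, so the query itself can be written like a batch query.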

In this guide, we are going to walk you through the programming model and the APIs. We are going to explain the concepts mostly using the default micro-batch processing model, and then later discuss the Continuous Processing model. First, let’s start with a simple example of a Structured Streaming query: a streaming count of user events.

In [2]:
from pyspark.sql.types import *
from pyspark.sql.functions import window
from pyspark.sql.functions import avg, count

#### Using Sample Data Provided by Databricks
To keep things simple and free for everyone, we will use a sample event log dataset provided by Databricks as files in /databricks-datasets/structured-streaming/events/. This is the data we will use throughout the lab.

We can list the files in the following way. We will simulate a streaming situation by using a "trigger" mechanism that reads one file at a time!

In [5]:
%fs ls /databricks-datasets/structured-streaming/events/

It's a good idea to clear out old files first: a streaming application produces a lot of temp files, and there is a limit to how many temp files you can have on a free Databricks account.

In [7]:
# The second argument (True) makes the delete recursive
dbutils.fs.rm('dbfs:/SOME_CHECKPOINT_DIRECTORY/', True)
dbutils.fs.rm('dbfs:/tmp/', True)

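We will specify the schema explicitly. This is more efficient than inference, and for file-based streaming sources Spark requires a schema by default; schema inference only works if you enable it with the spark.sql.streaming.schemaInference setting.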

In [9]:
# Since we know the data format already, let's define the schema to speed up processing (no need for Spark to infer schema)
schema = StructType([ StructField("time", TimestampType(), True), StructField("action", StringType(), True) ])

Now we will read from the stream into our streaming dataframe!

In [11]:
# Just as .read creates a static DataFrame, .readStream creates a streaming one
# .option("maxFilesPerTrigger", 1) picks up one file per trigger, which simulates a stream from a fixed set of files stored in our FS
streaming_input_df = spark.readStream.schema(schema).option("maxFilesPerTrigger", 1).json("/databricks-datasets/structured-streaming/events/")

In [12]:
# you can check if a dataframe is streaming by using .isStreaming
streaming_input_df.isStreaming

In [13]:
# Let's take a quick look at what is streaming using the display function provided by databricks
display(streaming_input_df)
# As you can see, each entry of our data has a timestamp and an action (open/close)

### Querying!
Now you can use everything you learned previously about Spark SQL! For example, you can group by an attribute and aggregate, filter, or select as you wish! <br/>
We have not yet introduced the concept of windowing, so we will look at it briefly now.

A window function can also be applied to bucketize rows into one or more time windows, given a column that holds a timestamp. For that we will use the window function (pyspark.sql.functions.window) inside groupBy. <br/>
You can call it in the following way: __window(timeColumn, windowDuration, slideDuration=None, startTime=None)__. The window duration and slide duration are defined as follows:
* Window Duration: the length of time that each window covers
* Slide Duration: how often a new window starts; if it is omitted, it equals the window duration, which gives non-overlapping (tumbling) windows
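
As a rough illustration of how these two durations interact, the plain-Python sketch below (not Spark code) lists the windows that a single timestamp falls into. It assumes windows are aligned to the Unix epoch and are closed at the start and open at the end, which matches the default behaviour described for pyspark.sql.functions.window. The helper sliding_windows and the sample event time are made up for this example.

```python
from datetime import datetime, timedelta

def sliding_windows(ts, window, slide):
    """Return the [start, end) windows that contain ts, earliest first."""
    epoch = datetime(1970, 1, 1)
    # Latest window start at or before ts, aligned to a multiple of the slide.
    start = ts - (ts - epoch) % slide
    windows = []
    # Step back one slide at a time while the window still covers ts.
    while start > ts - window:
        windows.append((start, start + window))
        start -= slide
    return sorted(windows)

# A 10-minute window sliding every 5 minutes, as used in this lab.
event_time = datetime(2016, 7, 28, 4, 23)
for start, end in sliding_windows(event_time, timedelta(minutes=10), timedelta(minutes=5)):
    print(start.time(), "->", end.time())
```

With a 10-minute window and a 5-minute slide, every event lands in two overlapping windows, so it is counted once in each of them. With equal window and slide durations, each event lands in exactly one window.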

We use a window function to count user activity in 10-minute windows that slide every 5 minutes.

In [17]:
windowed_counts_df = streaming_input_df.groupBy(window(streaming_input_df.time, "10 minutes", "5 minutes").alias('time_window')).agg(count(streaming_input_df.action).alias('user_activity'))
display(windowed_counts_df)

We use a window function to count how many times users open or close within each 10-minute window, again sliding every 5 minutes.

In [19]:
windowed_grouped_counts_df = streaming_input_df.groupBy(streaming_input_df.action, window(streaming_input_df.time, "10 minutes", "5 minutes").alias('time_window')).agg(count(streaming_input_df.action))
display(windowed_grouped_counts_df)

#### Writing The Output
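You can write the output of a stream to a sink in the following way. The cell below is left commented out; uncommented, it would write the full result table to an in-memory table named counts on every trigger, and awaitTermination() would block the notebook until the query stops.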

In [21]:
#query = windowed_grouped_counts_df.writeStream.format("memory").queryName("counts").outputMode("complete").start()
#query.awaitTermination()
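
The outputMode setting in the cell above controls what the sink receives on each trigger. As a rough illustration in plain Python (not Spark code), the sketch below contrasts "complete" mode, which emits the whole result table every time, with "update" mode, which emits only the rows that changed. Spark's third mode, "append", is left out here. The function emitted_rows and the sample batches are hypothetical.

```python
from collections import Counter

def emitted_rows(micro_batches, mode):
    """Yield what a sink would receive after each micro-batch of actions."""
    totals = Counter()
    for batch in micro_batches:
        before = dict(totals)   # result table before this batch
        totals.update(batch)    # result table after this batch
        if mode == "complete":
            yield dict(totals)  # the entire result table
        elif mode == "update":
            # only the rows whose value changed in this batch
            yield {k: v for k, v in totals.items() if before.get(k) != v}
        else:
            raise ValueError("unsupported mode in this sketch: " + mode)

# Made-up batches of actions, one list per trigger.
batches = [["open", "close"], ["open"]]
print(list(emitted_rows(batches, "complete")))
print(list(emitted_rows(batches, "update")))
```

In the second batch only the open count changes, so "update" emits a single row while "complete" repeats the unchanged close count as well.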

You can also use all the other SQL functions you have learned in previous labs to process streaming data! Give them a try below: