## SparkML on Streaming Data

Let's load the pipeline model we saved earlier and apply it to some streaming data!

In [None]:
%run "../includes/mnt_blob"

In [None]:
%run "../includes/setup_env"

In [None]:
from pyspark.ml.pipeline import PipelineModel

fileName = userhome + "/tmp/DT_Pipeline"
pipelineModel = PipelineModel.load(fileName)

We can simulate streaming data by reading the files of a static data set one file per trigger.

**Note**: You must specify a schema when creating a streaming source DataFrame. Why?

In [None]:
from pyspark.sql.types import DoubleType, StringType, StructField, StructType

schema = StructType([
  StructField("rating", DoubleType()),
  StructField("review", StringType())])

streamingData = (spark
                 .readStream
                 .schema(schema)
                 .option("maxFilesPerTrigger", 1)
                 .parquet("/mnt/data/imdb/imdb_ratings_50k.parquet"))

Why is this stream taking so long? What configuration should we set?

In [None]:
stream = (pipelineModel
          .transform(streamingData)
          .groupBy("label", "prediction")
          .count()
          .sort("label", "prediction"))

display(stream)

In [None]:
spark.conf.get("spark.sql.shuffle.partitions")

In [None]:
spark.conf.set("spark.sql.shuffle.partitions", "8")

Let's try this again with fewer shuffle partitions.

In [None]:
stream = (pipelineModel
          .transform(streamingData)
          .groupBy("label", "prediction")
          .count()
          .sort("label", "prediction"))

display(stream)

Let's write our results to an in-memory table that we can query with SQL.

In [None]:
import re

streamingView = re.sub(r'\W', '', username)  # Keep only word characters so the name is a valid table name
checkpointFile = userhome + "/tmp/checkPoint"
dbutils.fs.rm(checkpointFile, True) # Clear out the checkpointing directory

(stream
 .writeStream
 .format("memory")
 .option("checkpointLocation", checkpointFile)
 .outputMode("complete")
 .queryName(streamingView)
 .start())
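
The query name doubles as the name of the in-memory table, so it is built by stripping every non-word character from `username`. The sketch below shows that substitution in isolation. The helper name `to_view_name` and the sample inputs are hypothetical and used only for illustration.

```python
import re


def to_view_name(raw: str) -> str:
    """Drop every character that is not a word character (letter, digit, underscore)."""
    return re.sub(r"\W", "", raw)


# Hypothetical inputs: punctuation such as '.', '@' and '-' is removed.
print(to_view_name("jane.doe@example.com"))
print(to_view_name("user-name_01"))
```

The raw string `r"\W"` avoids the invalid-escape warning that newer Python versions emit for `'\W'`.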

In [None]:
display(sql("select * from " + streamingView))

Read more about streaming [here](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html).

&copy; 2018 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>