# Spark streaming basics project

_____
### Note on Streaming
Streaming is advancing rapidly: new libraries and services appear every year, so what is in this notebook may or may not apply to you. Maybe you're looking for something specific to Kafka, or maybe you want to stream data from Twitter, in which case Spark might be overkill for what you really need. Realistically, each situation requires a customized solution, and this course can never supply a one-size-fits-all answer. Because of this, I wanted to point out some great resources for Python and Spark Streaming:

* [The Official Documentation is great. This should be your first go to.](http://spark.apache.org/docs/latest/streaming-programming-guide.html#spark-streaming-programming-guide)

* [Fantastic Guide to Spark Streaming with Kafka](https://www.rittmanmead.com/blog/2017/01/getting-started-with-spark-streaming-with-python-and-kafka/)

* [Another Spark Streaming Example with Geo Plotting](http://nbviewer.jupyter.org/github/ibm-cds-labs/spark.samples/blob/master/notebook/DashDB%20Twitter%20Car%202015%20Python%20Notebook.ipynb)
____

Spark has well-known streaming capabilities. If streaming is something you've found yourself needing at work, then you are probably familiar with some of these concepts already, in which case you may find it more useful to jump straight to the official documentation here:

http://spark.apache.org/docs/latest/streaming-programming-guide.html#spark-streaming-programming-guide

It is really a great guide, but keep in mind that some of the features are restricted to Scala at this time (Spark 2.1). Hopefully they will be expanded to the Python API in the future!

For those of you new to Spark Streaming, let's get started with a classic example: streaming Twitter! Twitter is a great source for streaming because it's something most people already understand intuitively, you can visit the site yourself, and a lot of streaming technology has come out of Twitter as a company. You don't get access to the entire "firehose" of Twitter without paying for it, but that would be too much for us to handle anyway, so we'll be more than fine with the freely available API access.

_____

Let's discuss Spark Streaming!

Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. Data can be ingested from many sources like Kafka, Flume, Kinesis, or TCP sockets, and can be processed using complex algorithms expressed with high-level functions like map, reduce, join and window. Finally, processed data can be pushed out to filesystems, databases, and live dashboards. In fact, you can apply Spark’s machine learning and graph processing algorithms on data streams.

<img src='http://spark.apache.org/docs/latest/img/streaming-arch.png'/>

Keep in mind that a few of these streaming capabilities are limited when it comes to Python, so you'll need to reference the documentation for the most up-to-date information. Also, the streaming contexts tend to follow the older RDD syntax, so a few things might look different from what we are used to seeing. You'll definitely want a good understanding of lambda expressions before continuing!
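
As a quick refresher on that lambda-heavy style, here is a minimal sketch in plain Python (no Spark required). It only mimics the shape of the RDD-style `flatMap`, `map`, and `reduceByKey` steps using built-ins; the sample `lines` and the variable names are assumptions made up for illustration, not part of the Spark API.

```python
from functools import reduce

# Hypothetical sample input standing in for lines arriving on a stream
lines = ["hello world", "hello spark"]

# flatMap analog: split every line and flatten the results into one list of words
words = [word for line in lines for word in line.split(" ")]

# map analog: pair each word with a count of 1, using a lambda
pairs = list(map(lambda word: (word, 1), words))

# reduceByKey analog: fold the pairs into a dict of per-word totals
counts = reduce(
    lambda acc, pair: {**acc, pair[0]: acc.get(pair[0], 0) + pair[1]},
    pairs,
    {},
)

print(counts)
```

If each of these lambdas reads naturally to you, the RDD-style streaming code should be comfortable to follow.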

There are SparkSQL modules for streaming: 

http://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=streaming#module-pyspark.sql.streaming

But they are all still listed as experimental, so instead of showing you something that might break in the future, we'll stick to the RDD methods (which is what the documentation also currently shows for streaming).

Internally, it works as follows. Spark Streaming receives live input data streams and divides the data into batches, which are then processed by the Spark engine to generate the final stream of results in batches.

<img src='http://spark.apache.org/docs/latest/img/streaming-flow.png'/>
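
To make the batching idea concrete, here is a rough sketch in plain Python (no Spark required). One simplifying assumption: it groups records by count (`batch_size`), whereas Spark Streaming groups whatever arrives within a time interval. The helper names `micro_batches` and `process_batch` and the sample input are hypothetical.

```python
from itertools import islice


def micro_batches(stream, batch_size):
    """Yield successive lists of up to batch_size records from an iterable."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def process_batch(batch):
    """Toy stand-in for the engine's work: count the words in one batch."""
    return sum(len(line.split()) for line in batch)


# Hypothetical live input; in Spark this would come from a socket, Kafka, etc.
live_input = ["hadoop", "apache", "spark hadoop", "apache", "spark streaming"]

# One result per batch, mirroring "a stream of results in batches"
results = [process_batch(batch) for batch in micro_batches(live_input, batch_size=2)]
print(results)  # → [2, 3, 2]
```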

## Simple local example

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode,split

In [2]:
spark = SparkSession.builder.appName('spark_stream').getOrCreate()

In [3]:
# Create DataFrame representing the stream of input lines from connection to localhost:9999
lines = spark \
    .readStream \
    .format("socket") \
    .option("host", "localhost") \
    .option("port", 9999) \
    .load()

# Split the lines into words
words = lines.select(
    explode(
        split(lines.value, " ")
    ).alias('word')
)

# Generate running word count
wordCounts = words.groupBy('word').count()

Now we open up a Unix terminal and type:

     $ nc -lk 9999
     hello world any text you want
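
If `nc` is not available on your machine, a small Python socket server can play a similar role. This is only a sketch under stated assumptions: the helper name `start_line_server` is hypothetical, and unlike `nc -lk` it serves a single client, sends a fixed list of lines, and then closes the connection.

```python
import socket
import threading


def start_line_server(lines, host="localhost", port=9999):
    """Listen on host:port and send each line to the first client that connects.

    Returns the bound port and the background thread doing the sending.
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((host, port))
    server.listen(1)
    bound_port = server.getsockname()[1]

    def serve():
        # Block until one client (for example the streaming reader) connects
        conn, _ = server.accept()
        with conn:
            for line in lines:
                # The reader expects newline-terminated text
                conn.sendall((line + "\n").encode("utf-8"))
        server.close()

    thread = threading.Thread(target=serve, daemon=True)
    thread.start()
    return bound_port, thread
```

For example, calling `start_line_server(["hello world any text you want"])` before starting the query would offer that one line on port 9999.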
     
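With the listener running, run the cell below to start the query. Each line you type into the terminal is picked up by the stream. Press Ctrl+C to terminate the query when you are done.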

In [9]:
# Start running the query that prints the running counts to the console
query = wordCounts.writeStream.outputMode('complete').format('console').start()

query.awaitTermination()

In [12]:
''' will print like this based on words we enter

-------------------------------------------                                     
Batch: 0
-------------------------------------------
+----+-----+
|word|count|
+----+-----+
+----+-----+

-------------------------------------------                                     
Batch: 1
-------------------------------------------
+------+-----+
|  word|count|
+------+-----+
|hadoop|    1|
+------+-----+

-------------------------------------------                                     
Batch: 2
-------------------------------------------
+------+-----+
|  word|count|
+------+-----+
|apache|    1|
|hadoop|    1|
+------+-----+

-------------------------------------------                                     
Batch: 3
-------------------------------------------
+------+-----+
|  word|count|
+------+-----+
|apache|    1|
| spark|    1|
|hadoop|    2|
+------+-----+

-------------------------------------------                                     
Batch: 4
-------------------------------------------
+------+-----+
|  word|count|
+------+-----+
|apache|    2|
| spark|    1|
|hadoop|    2|
+------+-----+
'''
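
To see why every batch above reprints the full table, here is a plain-Python sketch of a running word count (no Spark required). It keeps cumulative totals and emits a complete snapshot after each batch, which is the behavior `outputMode('complete')` shows in the console. The function name `running_word_counts` is hypothetical, and the assignment of typed lines to batches is an assumption chosen to reproduce the sample output.

```python
from collections import Counter


def running_word_counts(batches):
    """Return a full snapshot of cumulative word counts after each batch."""
    totals = Counter()
    snapshots = []
    for lines in batches:
        for line in lines:
            # Same split rule as the query: break each line on single spaces
            totals.update(line.split(" "))
        # Complete mode: emit the whole result table, not only the changed rows
        snapshots.append(dict(totals))
    return snapshots


# Assumed input: an empty first batch, then one typed line per batch
batches = [[], ["hadoop"], ["apache"], ["spark hadoop"], ["apache"]]

for batch_id, snapshot in enumerate(running_word_counts(batches)):
    print(f"Batch: {batch_id}", snapshot)
```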
