# Spark Streaming Documentation Example

The interested reader can have a look at how **Structured Streaming** is performed at:  
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#quick-example

In this Notebook, we will work through good old-fashioned Spark Streaming, which is detailed at:  
https://spark.apache.org/docs/latest/streaming-programming-guide.html#a-quick-example

Because we will be using Spark Streaming and not Structured Streaming, we need to use some of the older "RDD" syntax.

This stems from using a SparkContext instead of a SparkSession.

We will be building a very simple application that connects to a local stream of data (an open terminal) through a socket connection.

It will then count the words for each line that we type in.

The steps for streaming will be:
- Create a SparkContext.
- Create a StreamingContext.
- Create a Socket Text Stream.
- Read in the lines as a "DStream".

The steps for working with the data will be:
- Split the input line into a list of words.
- Map each word to a tuple: ```(word, 1)```.  The second element will always be 1.
- Then group (**reduce**) the tuples by the word (**key**), and sum up the second element (the number one).  I.e., we _reduce by key_.

That will then provide us with a word count in the form **('hello', 3)** for each batch of input (here, the lines typed during each 1-second interval).
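Before involving Spark, the three data steps above can be sketched in plain Python on a single line of text. This is only an illustration: the names `line`, `pairs` and `counts` are made up for the sketch, and an ordinary dictionary stands in for the work Spark performs on a stream.

```python
# Plain-Python sketch of the word-count steps (no Spark involved).
line = "hello world hello spark hello"  # hypothetical input line

# Step 1: split the line into a list of words.
words = line.split(" ")

# Step 2: map each word to a (word, 1) tuple.
pairs = [(word, 1) for word in words]

# Step 3: reduce by key, summing the 1s for each distinct word.
counts = {}
for word, one in pairs:
    counts[word] = counts.get(word, 0) + one

print(sorted(counts.items()))  # [('hello', 3), ('spark', 1), ('world', 1)]
```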

As a quick note, the RDD syntax relies heavily on lambda expressions, which are just quick anonymous functions.

Fortunately, all the lambda expressions used here are quite simple and basic.
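As a small illustration, a lambda expression does the same job as a one-line named function. The names `to_pair` and `to_pair_lambda` below are hypothetical and exist only for this comparison.

```python
# A named function that turns a word into a (word, 1) tuple.
def to_pair(word):
    return (word, 1)

# The equivalent anonymous function, bound to a name only so we can call it here.
to_pair_lambda = lambda word: (word, 1)

print(to_pair("hello"))         # ('hello', 1)
print(to_pair_lambda("hello"))  # ('hello', 1)
```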

Let's get started with this simple example!

## Basic Example

First, we import ```StreamingContext```, which is the main entry point for all streaming functionality.  We create a local StreamingContext with 2 execution threads, and a batch interval of 1 second.

In [1]:
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

In [2]:
# Create a local SparkContext with 2 working threads, named "NetworkWordCount".
sc = SparkContext("local[2]", "NetworkWordCount")

In [3]:
ssc = StreamingContext(sc, 1)  # Wrap the SparkContext; the batch interval is 1 second.

Using this context, we can create a DStream that represents streaming data from a TCP source, specified as hostname (e.g. ```localhost```) and port (e.g. ```9999```).

In [4]:
# Create a DStream that will connect to hostname:port, like localhost:9999
lines = ssc.socketTextStream("localhost", 9999)  # Pick a port that is not currently in use.
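The socket source above needs a server that is already listening on the chosen port and writing newline-terminated text. This notebook uses NetCat for that later on; where NetCat is not available, a stand-in can be sketched with Python's standard `socket` module. The function name `serve_lines` and its parameters are hypothetical, and the sketch serves a single client with a fixed list of lines before closing, so it is a testing aid rather than a replacement for an interactive terminal.

```python
import socket
import threading

def serve_lines(lines, host="localhost", port=9999):
    """Accept one client and send each string in `lines` as a newline-terminated record.

    Hypothetical helper for local testing; returns the listening socket and the
    background thread that handles the single client.
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((host, port))
    server.listen(1)

    def handle():
        # Block until one client connects, send every line, then shut down.
        conn, _ = server.accept()
        with conn:
            for line in lines:
                conn.sendall((line + "\n").encode("utf-8"))
        server.close()

    thread = threading.Thread(target=handle, daemon=True)
    thread.start()
    return server, thread
```

For example, `serve_lines(["hello world", "hello spark"])` would be called before the streaming computation is started, so that the port is already open when the stream tries to connect.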

The ```lines``` DStream represents the stream of data that will be received from the data server.  Each record in this DStream is a line of text.  Next, we want to split the lines by space into words.

In [5]:
# Split each line into words.
words = lines.flatMap(lambda line: line.split(" "))  # Apply the split to each line and flatten the results.

```flatMap``` is a one-to-many DStream operation that creates a new DStream by generating multiple new records from each record in the source DStream.  In this case, each line will be split into multiple words and the stream of words is represented as the ```words``` DStream.  Next, we want to count these words.
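Before counting, the difference between a one-to-one ```map``` and a one-to-many ```flatMap``` can be illustrated in plain Python. This is only an analogy on ordinary lists, not Spark code; the sample `lines` list is hypothetical.

```python
from itertools import chain

lines = ["now how about", "that"]  # hypothetical batch of two lines

# map-like: one output record per input record, giving a list of lists.
mapped = [line.split(" ") for line in lines]

# flatMap-like: the per-line lists are flattened into one sequence of words.
flat_mapped = list(chain.from_iterable(line.split(" ") for line in lines))

print(mapped)       # [['now', 'how', 'about'], ['that']]
print(flat_mapped)  # ['now', 'how', 'about', 'that']
```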

In [6]:
# Count each word in each batch
pairs = words.map(lambda word: (word, 1))
wordCounts = pairs.reduceByKey(lambda num1, num2: num1 + num2)  # Think of it as a groupBy mechanism.
# The values for each word are combined pairwise until only one tuple is left per word.

# Print the first ten elements of each RDD generated in this DStream to the console.
wordCounts.pprint()  # pretty print

The ```words``` DStream is further mapped (one-to-one transformation) to a DStream of ```(word, 1)``` pairs, which is then reduced to get the frequency of words in each batch of data.  Finally ```wordCounts.pprint()``` will print a few of the counts generated every second.
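The pairwise nature of the function passed to ```reduceByKey``` can be mimicked in plain Python with `functools.reduce`. The sketch below groups some hypothetical pairs by key and then folds each key's values two at a time; it only illustrates the idea and says nothing about how Spark carries out the reduction internally.

```python
from collections import defaultdict
from functools import reduce

pairs = [("now", 1), ("how", 1), ("now", 1), ("now", 1)]  # hypothetical batch

# Group the values by key.
grouped = defaultdict(list)
for word, count in pairs:
    grouped[word].append(count)

# Fold each key's values two at a time with the same kind of lambda as in the cell above.
add = lambda num1, num2: num1 + num2
word_counts = {word: reduce(add, values) for word, values in grouped.items()}

print(word_counts)  # {'now': 3, 'how': 1}
```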

Note that when these lines are executed, Spark Streaming only sets up the computation it will perform once it is started; no real processing has happened yet.  To start the processing after all the transformations have been set up, we finally call:

In [7]:
ssc.start()  # Start the computation

# If you have already downloaded and built Spark, you can run this example as follows.  
# You will first need to run NetCat (a small utility found in most Unix-like systems) as a data server by using:  
# $ nc -lk 9999

# Use the STOP button above to stop the streaming.
# This might show up as a KeyboardInterrupt in the output of this cell.

-------------------------------------------
Time: 2019-06-06 14:54:05
-------------------------------------------

-------------------------------------------
Time: 2019-06-06 14:54:06
-------------------------------------------

-------------------------------------------
Time: 2019-06-06 14:54:07
-------------------------------------------

-------------------------------------------
Time: 2019-06-06 14:54:08
-------------------------------------------

-------------------------------------------
Time: 2019-06-06 14:54:09
-------------------------------------------

-------------------------------------------
Time: 2019-06-06 14:54:10
-------------------------------------------

-------------------------------------------
Time: 2019-06-06 14:54:11
-------------------------------------------

-------------------------------------------
Time: 2019-06-06 14:54:12
-------------------------------------------
('now', 1)
('how', 1)
('about', 1)


The complete code can be found in the Spark Streaming examples: https://github.com/apache/spark/blob/v2.4.3/examples/src/main/python/streaming/network_wordcount.py