<div style="font-size:18pt; padding-top:20px; text-align:center"><b>Introduction to </b><span style="font-weight:bold; color:green">Spark Streaming</span></div><hr>
<div style="text-align:right;">Папулин С.Ю. <span style="font-style: italic;font-weight: bold;">(papulin.study@yandex.ru)</span></div>

<a name="0"></a>
<div><span style="font-size:14pt; font-weight:bold">Contents</span>
    <ol>
        <li><a href="#1">Stateless Transformation</a></li>
        <li><a href="#2">Stateful Transformation</a></li>
        <li><a href="#3">Window Transformation</a></li>
        <li><a href="#4">Sources</a></li>
    </ol>
</div>

<a name="1"></a>
<div style="display:table; width:100%; padding-top:10px; padding-bottom:10px; border-bottom:1px solid lightgrey">
    <div style="display:table-row">
        <div style="display:table-cell; width:80%; font-size:14pt; font-weight:bold">1. Stateless Transformation</div>
    	<div style="display:table-cell; width:20%; text-align:center; background-color:whitesmoke; border:1px solid lightgrey"><a href="#0">To contents</a></div>
    </div>
</div>

In [None]:
# !!! COPY THE CONTENT OF THIS CELL AND PASTE IT INTO A SEPARATE .py FILE TO RUN IN A TERMINAL !!!


# -*- coding: utf-8 -*-

import sys

from pyspark import SparkContext
from pyspark.streaming import StreamingContext



# Create Spark Context
sc = SparkContext(appName="WordCount")

# Set log level
#sc.setLogLevel("INFO")

# Batch interval (10 seconds)
batch_interval = 10

# Create Streaming Context
ssc = StreamingContext(sc, batch_interval)

# Create a stream (DStream)
lines = ssc.socketTextStream("localhost", 9999)


# TRANSFORMATION FOR EACH BATCH


"""
    FlatMap Transformation

    Example: ["a a b", "b c"] => ["a", "a", "b", "b", "c"] 

"""
words = lines.flatMap(lambda line: line.split())


"""
    Map Transformation

    Example: ["a", "a", "b", "b", "c"] => [("a",1), ("a",1), ("b",1), ("b",1), ("c",1)]

"""
word_tuples = words.map(lambda word: (word, 1))



"""
    ReduceByKey Transformation

    Example: [("a",1), ("a",1), ("b",1), ("b",1), ("c",1)] => [("a",2), ("b",2), ("c",1)]

"""
counts = word_tuples.reduceByKey(lambda x1, x2: x1 + x2)


# Print the result (10 records)
counts.pprint()

# Save to permanent storage
#counts.transform(lambda rdd: rdd.coalesce(1)).saveAsTextFiles("/YOUR_PATH/output/wordCount")

# Start Spark Streaming
ssc.start()

# Await termination
ssc.awaitTermination()

<p>Create a test text source with the netcat tool. netcat will listen on port 9999, and text typed into its terminal will be read by Spark Streaming.</p>

In [None]:
nc -lk 9999

<p>Run the Spark Streaming application (the code above):</p>

In [None]:
spark-submit --master local[2] /YOUR_PATH/spark_streaming_wordcount.py 

<p>You should now have two terminals: one for messages and one for the Spark Streaming app. Type random text messages into the netcat terminal and watch the Spark Streaming terminal. How would you describe the behaviour of the Spark Streaming app? What features do you notice? Why is it stateless?</p>
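<p>To reason about the questions above without a running cluster, the per-batch logic can be imitated in plain Python. This is only a minimal sketch, not Spark code: the helper <code>count_batch</code> and the two sample batches are hypothetical, and each list stands in for the lines that arrive within one 10-second batch interval.</p>

```python
from collections import Counter


def count_batch(lines):
    """Count words in ONE micro-batch, mirroring flatMap -> map -> reduceByKey."""
    words = [word for line in lines for word in line.split()]  # flatMap
    pairs = [(word, 1) for word in words]                      # map
    counts = Counter()
    for word, one in pairs:                                    # reduceByKey
        counts[word] += one
    return dict(counts)


# Two hypothetical micro-batches, as if typed into netcat 10 seconds apart
batch_1 = ["a a b", "b c"]
batch_2 = ["a c"]

# Each batch is processed on its own; nothing is carried over between calls
result_1 = count_batch(batch_1)
result_2 = count_batch(batch_2)

print(result_1)
print(result_2)
```

<p>In this sketch the second result knows nothing about the first batch, which is the behaviour to look for in the Spark Streaming terminal.</p>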

<a name="2"></a>
<div style="display:table; width:100%; padding-top:10px; padding-bottom:10px; border-bottom:1px solid lightgrey">
    <div style="display:table-row">
        <div style="display:table-cell; width:80%; font-size:14pt; font-weight:bold">2. Stateful Transformation</div>
    	<div style="display:table-cell; width:20%; text-align:center; background-color:whitesmoke; border:1px solid lightgrey"><a href="#0">To contents</a></div>
    </div>
</div>

In [None]:
# !!! COPY THE CONTENT OF THIS CELL AND PASTE IT INTO A SEPARATE .py FILE TO RUN IN A TERMINAL !!!


# -*- coding: utf-8 -*-

import sys

from pyspark import SparkContext
from pyspark.streaming import StreamingContext


# Create Spark Context
sc = SparkContext(appName="WordCountStateful")

# Set log level
#sc.setLogLevel("INFO")

# Batch interval (10 seconds)
batch_interval = 10

# Create Streaming Context
ssc = StreamingContext(sc, batch_interval)

# Add checkpoint to preserve the states
ssc.checkpoint("tmp_spark_streaming") # == /user/cloudera/tmp_spark_streaming

# Create a stream
lines = ssc.socketTextStream("localhost", 9999)

# TRANSFORMATION FOR EACH BATCH
words = lines.flatMap(lambda line: line.split())
word_tuples = words.map(lambda word: (word, 1))
counts = word_tuples.reduceByKey(lambda x1, x2: x1 + x2)

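# Function for updating the state of each key
# currentCount: list of new values for the key in the current batch
# countState: previous running total for the key (None if the key is new)
def update_total_count(currentCount, countState):
    if countState is None:
        countState = 0
    # sum(iterable, start) adds the new values to the previous total
    return sum(currentCount, countState)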

# Update current values
total_counts = counts.updateStateByKey(update_total_count)

# Print the running totals (first 10 records)
total_counts.pprint()
#total_counts.transform(lambda rdd: rdd.coalesce(1)).saveAsTextFiles("/YOUR_PATH/output/wordCount")

# Start Spark Streaming
ssc.start()

# Await termination
ssc.awaitTermination()

<p>Run this Spark Streaming app in a terminal as in the previous case. How do the results of the two applications differ?</p>
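<p>The effect of <code>updateStateByKey</code> can also be imitated in plain Python. The sketch below is not Spark code and makes some assumptions: <code>update_state_by_key</code> is a hypothetical stand-in that calls the update function once per key with the list of new values and the previous state, and the two batches are made-up per-batch counts.</p>

```python
def update_total_count(new_values, running_total):
    """Same logic as the update function in the cell above."""
    if running_total is None:
        running_total = 0
    return sum(new_values, running_total)


def update_state_by_key(batch_counts, state):
    """Hypothetical stand-in for updateStateByKey applied to one batch.

    Every key seen so far is updated, including keys that received
    no new values in this batch (they get an empty list).
    """
    new_state = {}
    for key in set(batch_counts) | set(state):
        new_values = [batch_counts[key]] if key in batch_counts else []
        new_state[key] = update_total_count(new_values, state.get(key))
    return new_state


# Made-up per-batch word counts (the output of reduceByKey for two batches)
batches = [{"a": 2, "b": 2, "c": 1}, {"a": 1, "c": 1}]

state = {}
history = []
for batch_counts in batches:
    state = update_state_by_key(batch_counts, state)
    history.append(state)

for totals in history:
    print(sorted(totals.items()))
```

<p>Unlike the per-batch counts, the totals here keep growing from batch to batch, and a key that is absent from a batch still keeps its previous total.</p>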

<a name="3"></a>
<div style="display:table; width:100%; padding-top:10px; padding-bottom:10px; border-bottom:1px solid lightgrey">
    <div style="display:table-row">
        <div style="display:table-cell; width:80%; font-size:14pt; font-weight:bold">3. Window Transformation</div>
    	<div style="display:table-cell; width:20%; text-align:center; background-color:whitesmoke; border:1px solid lightgrey"><a href="#0">To contents</a></div>
    </div>
</div>

In [None]:
# !!! COPY THE CONTENT OF THIS CELL AND PASTE IT INTO A SEPARATE .py FILE TO RUN IN A TERMINAL !!!


# -*- coding: utf-8 -*-

import sys

from pyspark import SparkContext
from pyspark.streaming import StreamingContext


# Create Spark Context
sc = SparkContext(appName="WordCountStatefulWindow")

# Set log level
#sc.setLogLevel("INFO")

# Batch interval (10 seconds)
batch_interval = 10

# Create Streaming Context
ssc = StreamingContext(sc, batch_interval)

# Add checkpoint to preserve the states
ssc.checkpoint("tmp_spark_streaming") # == /user/cloudera/tmp_spark_streaming

# Create a stream
lines = ssc.socketTextStream("localhost", 9999)


# TRANSFORMATION FOR EACH BATCH
words = lines.flatMap(lambda line: line.split())
word_tuples = words.map(lambda word: (word, 1))
counts = word_tuples.reduceByKey(lambda x1, x2: x1 + x2)

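# Apply a sliding window: window length 20 seconds, slide interval 10 seconds.
# The second argument is the inverse function that subtracts the batch leaving the window
# (this incremental form requires checkpointing).
windowed_word_counts = counts.reduceByKeyAndWindow(lambda x, y: x + y, lambda x, y: x - y, 20, 10)
# Alternative without an inverse function (the whole window is re-reduced on each slide):
#windowed_word_counts = counts.reduceByKeyAndWindow(lambda x, y: x + y, None, 20, 10)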


# Print the result (10 records)
windowed_word_counts.pprint()
#windowed_word_counts.transform(lambda rdd: rdd.coalesce(1)).saveAsTextFiles("/YOUR_PATH/output/wordCount")

# Start Spark Streaming
ssc.start()

# Await termination
ssc.awaitTermination()
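
<p>The two forms of <code>reduceByKeyAndWindow</code> used above can be compared with a plain-Python sketch. It is not Spark code: <code>full_window</code> and <code>incremental_window</code> are hypothetical helpers, the batches are made-up per-batch counts, and a 20-second window over 10-second batches is assumed to span two batches.</p>

```python
def add(x, y):
    return x + y


def subtract(x, y):
    return x - y


def full_window(batches, end, size):
    """Re-reduce all batches in the window ending at index `end`."""
    total = {}
    for batch in batches[max(0, end - size + 1): end + 1]:
        for key, value in batch.items():
            total[key] = add(total[key], value) if key in total else value
    return total


def incremental_window(previous, entering, leaving):
    """Update the previous window: add the entering batch, subtract the leaving one."""
    result = dict(previous)
    for key, value in entering.items():
        result[key] = add(result[key], value) if key in result else value
    for key, value in leaving.items():
        result[key] = subtract(result[key], value)
    return result


# Made-up per-batch word counts; window = 2 batches, slide = 1 batch
batches = [{"a": 2, "b": 1}, {"a": 1}, {"c": 3}]
window_size = 2

w0 = full_window(batches, 0, window_size)
w1 = incremental_window(w0, batches[1], {})          # nothing leaves yet
w2 = incremental_window(w1, batches[2], batches[0])  # batch 0 leaves the window

print(w2)
print(full_window(batches, 2, window_size))
```

<p>The incremental result matches the recomputed one except that a key whose count drops to zero stays in the result. As far as I know, the Spark method accepts an optional <code>filterFunc</code> argument that can be used to drop such keys.</p>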

<a name="4"></a>
<div style="display:table; width:100%; padding-top:10px; padding-bottom:10px; border-bottom:1px solid lightgrey">
    <div style="display:table-row">
        <div style="display:table-cell; width:80%; font-size:14pt; font-weight:bold">4. Sources</div>
    	<div style="display:table-cell; width:20%; text-align:center; background-color:whitesmoke; border:1px solid lightgrey"><a href="#0">To contents</a></div>
    </div>
</div>

In [None]:
https://spark.apache.org/docs/latest/streaming-programming-guide.html