# Analyzing Sensor Data with Spark Streaming

First of all, Spark Streaming needs more than one core to run: the receiver that reads the stream occupies one, and at least one more is needed to process the data. In my case I am using a VM, so I had to make sure it had at least two virtual processors.

In VirtualBox this is easy to configure: select your virtual machine, go to Settings -> System -> Processor, and pick the desired number of virtual processors with the slider (VirtualBox 6.0). I am using two virtual processors, which should be enough for this quick demo.

Now we can work on getting the sensor data. We will read from a weather sensor provided by HPWREN, an interdisciplinary and multi-institutional UCSD research and education project. In this demo we want to extract the average wind direction (called Dm in the sensor output). After studying the values the sensor generates, we wrote a function that parses each line and returns the average wind direction as a one-element list, or an empty list if the line has no Dm field:

In [1]:
import re

def parse(line):
    match = re.search(r"Dm=(\d+)", line)  # raw string so \d is not treated as a string escape
    if match:
        val = match.group(1)
        return [int(val)]
    return []
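To see what the function returns, here is a minimal sketch that runs it on two sample lines. Both lines are made up for illustration and only assume that fields look like `Dm=066D`; they are not captured sensor output.

```python
import re

def parse(line):
    # Same logic as the parse() cell above.
    match = re.search(r"Dm=(\d+)", line)
    if match:
        return [int(match.group(1))]
    return []

# Hypothetical sensor lines, invented for this example.
sample = "0R1,Dn=059D,Dm=066D,Dx=080D,Sn=8.5M,Sm=9.5M,Sx=10.3M"
other = "0R2,Ta=13.9C"

with_dm = parse(sample)    # Dm present: one-element list
without_dm = parse(other)  # no Dm field: empty list
print(with_dm, without_dm)
```

Returning a list in both cases, instead of a number or None, is what lets the function be used with flatMap later on.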


To analyze the sensor data in real time we need Spark Streaming, so we import StreamingContext and create a new instance of it. Like the SparkContext, the StreamingContext provides an interface to Spark's streaming capabilities.

In [2]:
from pyspark.streaming import StreamingContext

#The argument sc is the SparkContext and 1 specifies a batch interval of one second
ssc = StreamingContext(sc, 1)

Next, we create a DStream of weather data by opening a connection to the streaming weather sensor. A DStream (short for discretized stream) represents the stream as a sequence of micro-batches, each of which is an RDD:

In [3]:
lines = ssc.socketTextStream("rtd.hpwren.ucsd.edu", 12028)

Now we read the average wind direction from each line and store it in a new DStream. We use the **flatMap transformation** to apply parse() to every input line and flatten the resulting lists into a single stream of values, so lines without a Dm field contribute nothing:

In [4]:
values = lines.flatMap(parse)
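The difference between mapping and flat-mapping can be sketched in plain Python, without Spark. The input lines below are hypothetical, and `itertools.chain` only stands in for the flattening that flatMap performs on each batch.

```python
import re
from itertools import chain

def parse(line):
    # Same logic as the parse() cell above, written compactly.
    match = re.search(r"Dm=(\d+)", line)
    return [int(match.group(1))] if match else []

# Hypothetical batch of input lines; the middle one has no Dm field.
batch = [
    "0R1,Dn=059D,Dm=066D,Dx=080D",
    "0R2,Ta=13.9C",
    "0R1,Dn=340D,Dm=347D,Dx=355D",
]

mapped = [parse(line) for line in batch]       # one list per line
flattened = list(chain.from_iterable(mapped))  # lists merged, empties vanish
print(mapped)
print(flattened)
```

With a plain map, the empty lists would remain in the stream and every later step would have to handle them.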

At this point we can create a sliding window over the measurements by calling the **window()** method. It returns a new DStream in which each RDD combines the last 'x' seconds of data and which advances every 'y' seconds; here x = 10 and y = 5. The idea is loosely similar to the sliding windows used by TCP.

In [5]:
window = values.window(10, 5)
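The windowing behaviour can be illustrated with a small pure-Python simulation. This is only a sketch of the idea, not Spark's implementation: the helper `sliding_windows` is hypothetical, and it assumes one batch per second with each batch holding a single value equal to its timestamp.

```python
def sliding_windows(batches, window_len, slide):
    """Yield (end_time, values) for each window over one-second batches."""
    for end in range(slide, len(batches) + 1, slide):
        start = max(0, end - window_len)  # early windows are only partly filled
        values = [v for batch in batches[start:end] for v in batch]
        yield end, values

# 15 one-second batches; each batch holds its own timestamp as a fake value.
batches = [[t] for t in range(1, 16)]

windows = list(sliding_windows(batches, window_len=10, slide=5))
for end, values in windows:
    print(end, values)
```

The first window holds only five seconds of data, the second is full, and the third drops the oldest five seconds while adding five new ones.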

#### Define and call analysis function

We will find the minimum and maximum values in our window. The following function does this:

In [6]:
def stats(rdd):
    print(rdd.collect())
    if rdd.count() > 0:
        print("max = {}, min = {}".format(rdd.max(), rdd.min()))

This function first prints the entire contents of the RDD by calling the **collect()** method. This is done to demonstrate the sliding window and would not be practical if the RDD contained a large amount of data. Next, we check that the RDD is not empty before printing the maximum and minimum values, since max() and min() fail on an empty RDD.

Next, we call the **stats()** function for each RDD in our sliding window:

In [7]:
window.foreachRDD(stats)

At this point we are ready to start the stream processing by calling **start()** on the StreamingContext:

In [8]:
ssc.start()

[16, 19]
max = 19, min = 16
[16, 19, 16, 13, 12, 5, 356]
max = 356, min = 5
[16, 13, 12, 5, 356, 347, 340, 334, 337, 337]
max = 356, min = 5
[347, 340, 334, 337, 337, 344, 344, 344, 347, 356]
max = 356, min = 334
[344, 344, 344, 347, 356, 2, 4, 4, 3, 6, 9]
max = 356, min = 2
[2, 4, 4, 3, 6, 9, 9, 10, 4, 7, 7]
max = 10, min = 2
[9, 10, 4, 7, 7, 10, 15, 18, 12, 3]
max = 18, min = 3
[10, 15, 18, 12, 3, 3, 1, 7, 1, 352]
max = 352, min = 1
[3, 1, 7, 1, 352, 350, 344, 344, 344, 348]
max = 352, min = 1


The sliding window contains ten seconds' worth of data and slides every five seconds. In the beginning, the number of values in the window increases as data accumulates; from the third window onward, the size stays approximately the same. Since the slide interval is half the window length, the second half of each window becomes the first half of the next window.
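The overlap can be checked directly against the printed output. The sketch below copies the third, fourth, and fifth windows from the run above and uses a hypothetical helper, `overlap`, that finds the longest tail of one window that is also the head of the next.

```python
def overlap(prev, nxt):
    """Length of the longest suffix of prev that is also a prefix of nxt."""
    for k in range(min(len(prev), len(nxt)), 0, -1):
        if prev[-k:] == nxt[:k]:
            return k
    return 0

# Windows copied from the output printed above.
w3 = [16, 13, 12, 5, 356, 347, 340, 334, 337, 337]
w4 = [347, 340, 334, 337, 337, 344, 344, 344, 347, 356]
w5 = [344, 344, 344, 347, 356, 2, 4, 4, 3, 6, 9]

shared_3_4 = overlap(w3, w4)
shared_4_5 = overlap(w4, w5)
print(shared_3_4, shared_4_5)
```

In both cases roughly half of the values carry over from one window to the next, which matches a slide interval of half the window length.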

In [10]:
ssc.stop()